In [1]:
%reload_ext nb_black


# 12.1.A Introducing Pandas and Manipulating DataFrames

`import`s are usually [put at the top](https://www.python.org/dev/peps/pep-0008/#imports) of a notebook

* load the `pandas` and `numpy` libraries using `pd` and `np` as aliases for these packages

In [2]:
# load needed libraries




In [3]:
# load needed libraries
import numpy as np
import pandas as pd


## `numpy`

The slides cover `numpy` in more detail.  Here we'll focus on the `numpy` functionality that carries over to `pandas`.  We won't spend much time on arrays themselves (they are foundational, and knowing them gives a deeper understanding of `pandas`); instead we'll stick to the common `numpy`-ish things that come up in `pandas` land.

* Given a list, how can we see its 'dimension'?
* Given a `numpy` 'array' we see its dimensions with `.shape`.
  * When working with `pandas`, the shape will typically be something like `(2, 3)`
  * This shape takes the form `(<n rows>, <n columns>)`
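
A minimal sketch of the contrast, assuming a small nested list (the names `nested` and `arr` are illustrative and not part of the lesson): a plain list only reports its outer length through `len()`, while the array version reports rows and columns together through `.shape`.

```python
import numpy as np

# Hypothetical data for illustration: 2 rows, 3 columns
nested = [[1, 2, 3], [4, 5, 6]]

# A list only knows its outer length; inner lengths must be checked by hand
n_rows = len(nested)
n_cols = len(nested[0])

# The array version exposes both at once as (<n rows>, <n columns>)
arr = np.array(nested)

print(n_rows, n_cols)
print(arr.shape)
```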

In [4]:
x = np.array([[1, 2], [3, 4], [5, 6]])
x

array([[1, 2],
       [3, 4],
       [5, 6]])


In [5]:
# What is the shape of x?



With lists we could mix and match datatypes; `numpy` arrays prefer to hold a single data type.  To check an array's datatype, we can use `.dtype` (note: we can't use `type()`, because that only tells us the object is a `np.ndarray`).

* What `type()` is `a`?
* What `type()` is the 1st item of `a`?
* What `type()` is the 2nd item of `a`?
* Write code to check if the `type()`s of the 1st and 2nd elements are the same.

In [6]:
a = ["a", 1, "steak", "sauce"]
a

['a', 1, 'steak', 'sauce']


* Define a new variable, `b`, that is the `numpy` array version of `a`
  * Similar to how we convert to `str` using the `str()` function, to convert something to an array use `np.array()`.
* What `type()` is `b`?
* What `type()` is the 1st item of `b`?
* What `type()` is the 2nd item of `b`?
* Write code to check if the `type()`s of the 1st and 2nd elements are the same.

* Use `.dtype` to show the `d`ata`type` of `b`
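
As a rough illustration of the list-versus-array difference, here is a sketch using a hypothetical list named `mixed` (not the lesson's `a`).  It assumes `numpy`'s usual behavior of converting a mix of strings and numbers into a single string dtype.

```python
import numpy as np

# Hypothetical mixed list for illustration (not the lesson's `a`)
mixed = ["x", 2, 3.5]
as_array = np.array(mixed)

# In the list, each element keeps its own Python type
list_types = [type(item) for item in mixed]

# In the array, every element has been converted to a numpy string
array_types = [type(item) for item in as_array]

print(list_types)
print(array_types)
print(as_array.dtype)  # a fixed-width Unicode string dtype
```

The exact width of the string dtype depends on the values involved, so it is safer to check that the elements share one type than to rely on a specific dtype name.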

## `pandas` 🐼

First things first, why pandas 🐼?  One way you might refer to the type of data you find in a spreadsheet is [**pan**el **da**ta](https://www.google.com/search?q=panel+data).  The `pandas` package is how we work with spreadsheet-like data in Python.

We need some data if we want to play with `pandas`.  One common way to load data into `pandas` is `pd.read_csv()`, which takes a file path (a URL works too).

* Our data is located [here](https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-analytics-bootcamp/songs.csv)
  * Click the link and download the file
  * Open the file up with Excel for us to compare
* Use `pd.read_csv()` and the below defined `file_path` to read the data into a variable called `songs`
* Display what we just read into the `songs` variable
* What `type()` is `songs`?
* What is the `.shape` of `songs`?
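
To try `pd.read_csv()` without downloading anything, here is a small sketch.  The CSV text, its column names, and the variable `demo` are all made up for illustration; `io.StringIO` simply stands in for a file, since `pd.read_csv()` also accepts file-like objects.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk or at a URL
csv_text = """title,artist,year
Song A,Artist 1,1980
Song B,Artist 1,
Song C,Artist 2,1975
"""

# pd.read_csv() accepts a path, a URL, or any file-like object
demo = pd.read_csv(io.StringIO(csv_text))

print(type(demo))
print(demo.shape)  # (<n rows>, <n columns>)
```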

In [7]:
file_path = "https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-analytics-bootcamp/songs.csv"


In [8]:
songs = pd.read_csv(file_path)
songs

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
0,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1
...,...,...,...,...,...,...,...,...
2224,She Loves My Automobile,ZZ Top,,She Loves My Automobile by ZZ Top,1,0,1,0
2225,Tube Snake Boogie,ZZ Top,1981,Tube Snake Boogie by ZZ Top,1,1,32,32
2226,Tush,ZZ Top,1975,Tush by ZZ Top,1,1,109,109
2227,TV Dinners,ZZ Top,1983,TV Dinners by ZZ Top,1,1,1,1



One useful way to get a quick run-down of your dataframe is with `.info()`.  The output of `.info()` will tell us:

* How many rows/columns
* The column names (in order)
* The number of 'non-null' rows per column
* The datatype per column
* The tallies of each datatype (aka `dtype`) in the dataframe
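
A minimal sketch of that summary on a tiny, made-up dataframe (the frame `demo` and its columns are hypothetical).  It also shows `.isna().sum()`, a complementary way to count nulls per column.

```python
import io

import pandas as pd

# Hypothetical frame with one missing value in the "year" column
demo = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C"],
        "year": [1980.0, None, 1975.0],
    }
)

# .info() prints its summary; buf= redirects it so we can keep the text
buffer = io.StringIO()
demo.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# .isna().sum() gives the complementary view: nulls per column
null_counts = demo.isna().sum()
print(null_counts)
```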

----

* Display `songs`'s `.info()`
* Which column contains nulls?
* Display `songs`'s `.head()`.  Can we see any nulls in this output?

In [9]:
songs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2229 entries, 0 to 2228
Data columns (total 8 columns):
Song Clean      2229 non-null object
ARTIST CLEAN    2229 non-null object
Release Year    1652 non-null object
COMBINED        2229 non-null object
First?          2229 non-null int64
Year?           2229 non-null int64
PlayCount       2229 non-null int64
F*G             2229 non-null int64
dtypes: int64(4), object(4)
memory usage: 139.4+ KB



## Working with columns

## Selecting columns

To subset a list we use `[]`, to subset a dictionary we use `[]`, to subset a `numpy` array we use `[]`, and to subset a `pandas` dataframe we use... `[]`.

To ask for a column, we can ask for it by name.

* Isolate the `'PlayCount'` column and assign it to the variable `plays`
* What is the max value in this column? min? mean? median?
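
A sketch of single-column selection on a made-up dataframe (`demo`, `plays_col`, and the column names are hypothetical).  Asking for one column by name returns a `pandas` Series, which has summary methods such as `.max()` and `.median()`.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame(
    {
        "artist": ["Artist 1", "Artist 1", "Artist 2"],
        "plays": [10, 2, 6],
    }
)

# One column name inside [] returns a Series
plays_col = demo["plays"]
print(type(plays_col))

# Summary statistics are available as methods on the Series
print(plays_col.max(), plays_col.min(), plays_col.mean(), plays_col.median())
```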

To request multiple columns by name, we can use a list of column names.

* Isolate the `'PlayCount'` and `'ARTIST CLEAN'` columns and assign them to the variable `artist_plays`
* What is the `.shape` of `artist_plays`?
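
A sketch of multi-column selection, again on a hypothetical dataframe.  Note the double brackets: the outer `[]` does the subsetting, and the inner `[]` is an ordinary Python list of column names.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame(
    {
        "title": ["Song A", "Song B", "Song C"],
        "artist": ["Artist 1", "Artist 1", "Artist 2"],
        "plays": [10, 2, 6],
    }
)

# A list of names inside [] returns a DataFrame with just those columns,
# in the order the names were given
subset = demo[["plays", "artist"]]

print(type(subset))
print(subset.shape)
```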

## Selecting rows

Arguably, the most common way to select rows is by filtering.

* When we filter we think: "give me all songs played by zz top"
* For `pandas` we need to think: "give me all rows where the `'ARTIST CLEAN'` column is equal to `'ZZ Top'`"

----

* In regular Python, how can we check if the variable `x` is equal to `'ZZ Top'`?
* In `pandas`, how can we isolate the `'ARTIST CLEAN'` column?
* Compare the `'ARTIST CLEAN'` column to `'ZZ Top'` just like you would for `x`.  What is the output?
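
The idea can be sketched on a small, made-up dataframe (`demo`, `mask`, and the artist names are hypothetical).  Comparing a single value gives one `True`/`False`; comparing a whole column gives one `True`/`False` per row, which can then be passed back into `[]` to keep only the matching rows.

```python
import pandas as pd

# Hypothetical data for illustration
demo = pd.DataFrame(
    {
        "artist": ["Artist 1", "Artist 1", "Artist 2"],
        "plays": [10, 2, 6],
    }
)

# Regular Python: comparing one value gives a single boolean
x = "Artist 1"
scalar_check = x == "Artist 1"

# pandas: comparing a column gives a boolean Series, one entry per row
mask = demo["artist"] == "Artist 1"
print(mask)

# Passing the boolean Series into [] keeps only the rows marked True
filtered = demo[mask]
print(filtered)
```

The comparison is case-sensitive, so the capitalization of the value has to match the data exactly.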

In [10]:
x = 'zz top'

# Check if equal to 'ZZ Top'

