# Day Two - Introduction to Pandas

* TODO: VERY VERY Quick refresh of day one?

## Diving into Pandas

* Pandas is a 3rd-party library for doing data analysis
* It is a foundational component of Python data science
* Developed by [Wes McKinney](http://wesmckinney.com/pages/about.html) while working in the finance industry, so it has some...warts
* Vanilla Python (what we did yesterday) can do many of the same things, but Pandas does them *faster* and usually more *easily*
* To do this, pandas introduces a set of data structures and analysis functions

In [1]:
# First, we need to load pandas into memory and give it the name "pd"
import pandas as pd

### Pandas Data Structures

* To understand Pandas, which is hard, it is helpful to start with the data structures it adds to Python:
    * Series - For one-dimensional data (lists)
    * DataFrame - For two-dimensional data (spreadsheets)
    * Index - For naming, selecting, and transforming data within a Pandas Series or DataFrame (column and row names)

### Series

* A one-dimensional array of indexed data
* Kind of like a blend of a Python list and dictionary
* You can create them from a Python list


In [2]:
# Create a regular Python list
my_list = [0.25, 0.5, 0.75, 1.0]

# Transform that list into a Series
data = pd.Series(my_list)

# Display the data
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

* A Series is a list-like structure, which means it is *ordered* 
* You can use indexing to grab items in a Series, just like a list
* It is best to use the `iloc` indexer to grab elements by their position in the Series.

In [3]:
# grab the first element
data.iloc[0]

0.25

In [4]:
# grab the last element
data.iloc[-1]

1.0
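Why bother with `iloc` when plain square brackets seem to work? Here is a minimal sketch, assuming a made-up Series (`odd`, not part of the workshop data) whose index labels are integers that do not start at zero. With `iloc` and `loc` it is always clear whether you mean a position or a label.

```python
import pandas as pd

# Hypothetical Series: the index labels are 10, 20, 30 instead of 0, 1, 2
odd = pd.Series([0.25, 0.5, 0.75], index=[10, 20, 30])

# iloc always counts positions, starting at zero
first_by_position = odd.iloc[0]

# loc always looks up the index label
first_by_label = odd.loc[10]

print(first_by_position, first_by_label)
```

Both lookups land on the same value here, but only because the label 10 happens to sit at position 0.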

#### Exercise

* Use index notation to grab the 2nd element of `data`

In [None]:
# hint: the 2nd element is 0.50


* Also, like lists, you can use *slicing* notation to grab sub-lists
* Again, it is best to use the `.iloc` indexer

#### Exercise

* Use slices to grab the 2nd and 3rd elements of this series

In [None]:
# hint: the 2nd & 3rd elements are 0.50 and 0.75
# your code here



* Series also act like *ordered* Python dictionaries
* This means you can grab things by name in addition to location

In [5]:
# Create a regular Python Dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# Transform that dictionary into a Series 
population = pd.Series(population_dict)

# Display the data
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

* You can use indexing and slicing like above, but now with keys instead of numbers!
* It is best to use the `.loc` indexer when looking up things by name instead of by number


In [6]:
# select the data value with the name "California"
population.loc['California']

38332521

In [7]:
# What happens if you try to use a name when it wants a number?
population.iloc['California']

TypeError: cannot do positional indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [California] of <class 'str'>

* Like a Python dictionary, a Series is a list of key/value pairs
* But these are *ordered*, which means you can do slicing

#### Exercise

* Try slicing this series, but with keys instead of numbers!
* Select a subset of the data using the Python slicing notation
* Don't forget, use `loc`!

In [None]:
# Hint: Use the same : notation, but use the state names listed above
# Your code here:



### DataFrame

* `DataFrames` are the real workhorse of Pandas and Python data science
* We will be spending a lot of time with data inside of `DataFrames`, so buckle up!
* `DataFrames` contain two-dimensional data, just like an Excel spreadsheet
* In practice, a `DataFrame` is a bunch of `Series` lined up next to each other

In [8]:
# Start with our population Series
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [9]:
# Then create an area Series
area_dict = {'Illinois': 149995, 'California': 423967, 
             'Texas': 695662, 'Florida': 170312, 
             'New York': 141297}
area = pd.Series(area_dict)
area

Illinois      149995
California    423967
Texas         695662
Florida       170312
New York      141297
dtype: int64

In [11]:
# Now mash them together into a DataFrame
states = pd.DataFrame({'population': population,
                       'area': area})
# Display the data
states

|            | population | area   |
|------------|------------|--------|
| California | 38332521   | 423967 |
| Florida    | 19552860   | 170312 |
| Illinois   | 12882135   | 149995 |
| New York   | 19651127   | 141297 |
| Texas      | 26448193   | 695662 |


* Pandas automatically lines everything up because they have shared index values
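What happens when the index values are *not* all shared? A minimal sketch, assuming two small made-up Series (`pop_small` and `area_small` are illustrative names, and the Alaska row is added only to show a mismatch): Pandas matches rows by label, not by position, and fills the gaps with `NaN`.

```python
import pandas as pd

# Two Series listed in different orders; Alaska only appears in the area data
pop_small = pd.Series({'California': 38332521, 'Texas': 26448193})
area_small = pd.Series({'Texas': 695662, 'California': 423967, 'Alaska': 1723337})

# Rows are matched on their index labels, not on their position
combined = pd.DataFrame({'population': pop_small, 'area': area_small})
print(combined)

# Alaska has an area but no population, so that cell is missing (NaN)
print(combined.loc['Alaska', 'population'])
```

Because of the missing value, the population column is stored as floats rather than integers.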

### Index

* Pandas `Series` and `DataFrames` are containers for data
* The Index (and Indexing) is the mechanism to make that data retrievable
* In a `Series` the index is the key to each value in the list
* A `DataFrame` has an index for the rows and another for the columns (the column names)
    * Two indexes for two dimensions
* Indexing allows you to merge or join disparate datasets together
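To make the two indexes concrete, here is a rough sketch. The `capitals` table and the variable names are made up for illustration. It prints the row and column labels of a small `DataFrame`, then uses the shared row labels to join in a second dataset.

```python
import pandas as pd

# A two-row version of the states table
states_small = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'])

# One index per dimension
row_labels = list(states_small.index)       # ['California', 'Texas']
column_labels = list(states_small.columns)  # ['population', 'area']
print(row_labels, column_labels)

# Hypothetical second dataset, deliberately listed in a different order
capitals = pd.DataFrame({'capital': ['Austin', 'Sacramento']},
                        index=['Texas', 'California'])

# join() lines the rows up by their index labels
joined = states_small.join(capitals)
print(joined)
```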

In [13]:
states

|            | population | area   |
|------------|------------|--------|
| California | 38332521   | 423967 |
| Florida    | 19552860   | 170312 |
| Illinois   | 12882135   | 149995 |
| New York   | 19651127   | 141297 |
| Texas      | 26448193   | 695662 |


* The `loc` indexer I talked about above allows us to select specific rows and columns *by name*.
* Use the syntax `[<row>,<column>]` with index values

In [14]:
# Get the value of the population column from Illinois
states.loc["Illinois", "population"]

12882135

* We can also be tricky and use more advanced syntax to do more advanced queries.

In [15]:
# Get the area for states from Florida to Texas
# this is two dimensional slicing
states.loc["Florida":"Texas", "area"]

Florida     170312
Illinois    149995
New York    141297
Texas       695662
Name: area, dtype: int64
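Notice that Texas made it into the result above. Slicing by label with `loc` *includes* the end label, while slicing by position with `iloc` stops *before* the end position, just like a Python list. A small sketch, assuming a fresh copy of the area data named `area_sorted`:

```python
import pandas as pd

area_sorted = pd.Series({'California': 423967, 'Florida': 170312,
                         'Illinois': 149995, 'New York': 141297,
                         'Texas': 695662})

# Label slice: both endpoints are included (Florida AND Illinois)
by_label = area_sorted.loc['Florida':'Illinois']

# Position slice: the end position is excluded (only position 1, Florida)
by_position = area_sorted.iloc[1:2]

print(len(by_label), len(by_position))
```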

In [16]:
# Get the area for Florida and Texas
# Use a list to select multiple specific values
states.loc[["Florida", "Texas"], "area"]

Florida    170312
Texas      695662
Name: area, dtype: int64

In [17]:
# Get area and population for Florida and Texas
# use a ":" to specify "all columns"
states.loc[["Florida", "Texas"], :]

|         | population | area   |
|---------|------------|--------|
| Florida | 19552860   | 170312 |
| Texas   | 26448193   | 695662 |


* What is happening here is we are passing a list of names for the rows, and using the colon ":" to say "all columns"
* OK, this is neat, but let's move on to some *real* examples

## Doing Stuff with Pandas

* Once your data is in a Pandas `DataFrame` you can easily use a ton of analytical tools
* You just have to get your data to fit into a dataframe
* Getting data to fit is a big part of the "data janitor" work...it is the craft of data carpentry
* However, as we will see, there is still a lot of carpentry work to do once your data fits into a `DataFrame`

### Open the file and load it into memory

* Pandas provides some very handy functions for reading in CSV files.
* Remember how we opened CSV files yesterday?

In [19]:
# Load up the CSV module
import csv

# Open the CSV file
with open('community-center-attendance.csv') as file_handler:
    # Load the file into the CSV module
    reader = csv.reader(file_handler)
    # Read the first line of the file into a header variable
    headers = next(reader)
    # Read all the data into a variable as a list. look ma! one line!
    center_attendance_python = [row for row in reader]

# Print out the headers list
print(headers)

# Print out the first five rows 
center_attendance_python[0:5]

['_id', 'date', 'center_name', 'attendance_count']


[['1', '2018-06-08', 'Ormsby Community Center', '10'],
 ['2', '2018-06-08', 'Paulson Community Center', '19'],
 ['3', '2018-06-08', 'Phillips Community Center', '107'],
 ['4', '2018-06-08', 'Ammon Community Center', '81'],
 ['5', '2018-06-08', 'Brookline Community Center', '33']]

* Pandas can do this much more easily

In [24]:
# Open up the csv file the Pandas way
center_attendance_pandas = pd.read_csv("community-center-attendance.csv", 
                              index_col="_id")
# Display the first five rows
center_attendance_pandas.iloc[0:5]

| _id | date       | center_name               | attendance_count |
|-----|------------|---------------------------|------------------|
| 1   | 2018-06-08 | Ormsby Community Center   | 10               |
| 2   | 2018-06-08 | Paulson Community Center  | 19               |
| 3   | 2018-06-08 | Phillips Community Center | 107              |
| 4   | 2018-06-08 | Ammon Community Center    | 81               |
| 5   | 2018-06-08 | Brookline Community Center | 33              |


* Pandas also has a special method, `head(n)`, for looking at the first *n* rows of a DataFrame

In [25]:
# Use the head function to look at the "head" 
# of the dataframe. Default is 5 rows.
center_attendance_pandas.head()

| _id | date       | center_name               | attendance_count |
|-----|------------|---------------------------|------------------|
| 1   | 2018-06-08 | Ormsby Community Center   | 10               |
| 2   | 2018-06-08 | Paulson Community Center  | 19               |
| 3   | 2018-06-08 | Phillips Community Center | 107              |
| 4   | 2018-06-08 | Ammon Community Center    | 81               |
| 5   | 2018-06-08 | Brookline Community Center | 33              |


* Notice the index starts at 1 instead of zero. That is because we told Pandas to use the "_id" column as the row index.
* This is when it is important to understand the difference between `loc` and `iloc`

In [None]:
# Select row by index name
center_attendance_pandas.loc[1]

In [None]:
# Select row by index location
center_attendance_pandas.iloc[1]
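The two cells above return *different* rows. Here is a sketch that rebuilds the first three rows of the attendance data by hand (the `sample` name is just for illustration), so you can see the difference without loading the file:

```python
import pandas as pd

# First three rows of the attendance data, with _id as the row index
sample = pd.DataFrame(
    {'center_name': ['Ormsby Community Center',
                     'Paulson Community Center',
                     'Phillips Community Center'],
     'attendance_count': [10, 19, 107]},
    index=pd.Index([1, 2, 3], name='_id'))

# loc[1]: the row whose _id label is 1 (the first row)
by_name = sample.loc[1]

# iloc[1]: the row at position 1 (the second row, because positions start at 0)
by_position = sample.iloc[1]

print(by_name['center_name'])
print(by_position['center_name'])
```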

### Helpful functions

In [26]:
# Look at the last 5 rows
center_attendance_pandas.tail()

| _id   | date       | center_name                 | attendance_count |
|-------|------------|-----------------------------|------------------|
| 18363 | 2011-03-08 | Magee Community Center      | 32               |
| 18364 | 2011-03-08 | West Penn Community Center  | 3                |
| 18365 | 2011-03-07 | Warrington Community Center | 1                |
| 18366 | 2011-03-07 | Magee Community Center      | 7                |
| 18367 | 2011-03-07 | West Penn Community Center  | 2                |


In [27]:
# How many rows and columns
center_attendance_pandas.shape

(18367, 3)

In [28]:
# Inspect the datatypes
center_attendance_pandas.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18367 entries, 1 to 18367
Data columns (total 3 columns):
date                18367 non-null object
center_name         18367 non-null object
attendance_count    18367 non-null int64
dtypes: int64(1), object(2)
memory usage: 574.0+ KB


In [None]:
# Compute summary statistics on the numerical columns
center_attendance_pandas.describe()

### Counting Numerical Data

* We can use traditional Python functions to get information about our DataFrame.

In [29]:
# use a standard python function to get the length of the sequence
len(center_attendance_pandas)

18367

* So this tells us our dataset has 18,367 rows.
* But this is just information about the dataset itself. Can we add up the total attendance?


In [37]:
# create a variable to hold the total attendance
total_attendance = 0

# loop over the data
for row in center_attendance_python:
    # add the row count to the total, convert string to int
    row_attendance = int(row[3])
    total_attendance = total_attendance + row_attendance

print(total_attendance)

1137558


In [33]:
# compute the total attendance with the pandas sum function
center_attendance_pandas['attendance_count'].sum()

1137558
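Why did the plain Python loop need `int(...)` while the Pandas version did not? The `csv` module hands back every value as a string, while `read_csv` guesses a type for each column. A minimal sketch, assuming the first three rows of the file pasted into a string (`csv_text`) so the cell runs without the file:

```python
import csv
import io

import pandas as pd

# First three rows of the attendance file, as text
csv_text = """_id,date,center_name,attendance_count
1,2018-06-08,Ormsby Community Center,10
2,2018-06-08,Paulson Community Center,19
3,2018-06-08,Phillips Community Center,107
"""

# The csv module: every cell is a string, so we convert by hand
rows = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip the header row
python_total = sum(int(row[3]) for row in rows)

# Pandas: the column is already numeric, so sum() just works
frame = pd.read_csv(io.StringIO(csv_text), index_col='_id')
pandas_total = frame['attendance_count'].sum()

print(type(rows[0][3]), frame['attendance_count'].dtype)
print(python_total, pandas_total)
```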

### Counting Categorical Data

* Just like before we can start counting the distribution of values in the column. 
* how many entries per community center (this isn't counting attendance but counting the number of rows per center).

**The "Pythonic way"**

In [59]:
# Create a dictionary to store the counts
center_counter = dict()

# loop over the data
for row in center_attendance_python:
    center = row[2]
    
    # check to see if the center is already in the dictionary
    if center not in center_counter:
        # create a new entry
        center_counter[center] = 1
    else:
        # increment the existing entry
        center_counter[center] += 1

# Display the dictionary 
center_counter

{'Ormsby Community Center': 1990,
 'Paulson Community Center': 1547,
 'Phillips Community Center': 2116,
 'Ammon Community Center': 1825,
 'Brookline Community Center': 2159,
 'Jefferson Community Center': 1701,
 'Warrington Community Center': 1714,
 'West Penn Community Center': 2130,
 'Magee Community Center': 1800,
 'Arlington Community Center': 1331,
 'Gladstone Field': 5,
 'Mellon Tennis Center': 6,
 'Phillips Park Field': 13,
 'Schenley Ice Rink': 1,
 'Warrington Field': 1,
 'West Penn Fields': 1,
 'West Penn Pool': 7,
 'Dan Marino Field (Playground)': 1,
 'Paulson Field': 3,
 'Ormsby Field (Playground)': 8,
 'Ammon Pool': 3,
 'Moore Pool': 1,
 'Frick Environmental Center': 1,
 'Arlington Field (Playground)': 1,
 'Highland Pool': 1,
 'Ammon / Josh Gibson Field': 1}

**The "Pandas way"**

In [57]:
# Do the same thing with pandas
center_attendance_pandas['center_name'].value_counts()

Brookline Community Center       2159
West Penn Community Center       2130
Phillips Community Center        2116
Ormsby Community Center          1990
Ammon Community Center           1825
Magee Community Center           1800
Warrington Community Center      1714
Jefferson Community Center       1701
Paulson Community Center         1547
Arlington Community Center       1331
Phillips Park Field                13
Ormsby Field (Playground)           8
West Penn Pool                      7
Mellon Tennis Center                6
Gladstone Field                     5
Ammon Pool                          3
Paulson Field                       3
Frick Environmental Center          1
Dan Marino Field (Playground)       1
West Penn Fields                    1
Highland Pool                       1
Arlington Field (Playground)        1
Warrington Field                    1
Ammon / Josh Gibson Field           1
Moore Pool                          1
Schenley Ice Rink                   1
Name: center

* Now count the total attendance per center, not just the number of rows

In [53]:
# Sum the attendance for each center with pandas, largest first
center_attendance_pandas.groupby('center_name')['attendance_count'].sum().sort_values(ascending=False)


center_name
Brookline Community Center       312356
Phillips Community Center        179219
West Penn Community Center       146751
Ammon Community Center           130713
Ormsby Community Center           80594
Warrington Community Center       67966
Magee Community Center            61377
Arlington Community Center        54791
Paulson Community Center          52777
Jefferson Community Center        48637
Phillips Park Field                 604
Gladstone Field                     559
Ormsby Field (Playground)           405
Paulson Field                       159
Mellon Tennis Center                130
West Penn Fields                    127
West Penn Pool                      110
Dan Marino Field (Playground)       104
Ammon Pool                           87
Schenley Ice Rink                    50
Warrington Field                     15
Highland Pool                        11
Arlington Field (Playground)          9
Frick Environmental Center            5
Moore Pool                  
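What is `groupby` doing here? Roughly the same thing as the dictionary loop we wrote for counting, except it adds up a column instead of counting rows. A sketch on three rows, where the third row is made up so that one center appears twice:

```python
import pandas as pd

# Tiny sample; the last row is invented for illustration
sample = pd.DataFrame({
    'center_name': ['Ormsby Community Center',
                    'Paulson Community Center',
                    'Ormsby Community Center'],
    'attendance_count': [10, 19, 5]})

# The dictionary way: keep a running total per center
totals_dict = {}
for name, count in zip(sample['center_name'], sample['attendance_count']):
    totals_dict[name] = totals_dict.get(name, 0) + count

# The Pandas way: split the rows by center, then sum each group
totals_pandas = sample.groupby('center_name')['attendance_count'].sum()

print(totals_dict)
print(totals_pandas)
```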

We can still use the Pandas way with string formatting to tell a data story.

## Vectorized String Operations

* There is a Pandas way of working with strings that is much more terse and compact
* Pandas has a set of string operations that do much of the painful work for you
* Especially handling bad data!

In [None]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']

for s in data:
    print(s.capitalize())

* But this plain Python loop breaks very easily with missing values

In [None]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# This loop raises an AttributeError when it reaches None
for s in data:
    print(s.capitalize())

* The Pandas library has *vectorized string operations* that handle missing data

In [None]:
names = pd.Series(data)
names

In [None]:
names.str.capitalize()


* Look ma! No errors!
* Pandas includes a bunch of methods for doing things to strings.

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |
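Each method in the table hangs off the `.str` accessor of a Series. Here is a quick sketch of three of them, assuming a Series called `people` (an illustrative name) that contains one missing value. The missing entry simply stays missing instead of raising an error.

```python
import pandas as pd

people = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# Fix the capitalization of every name at once
capitalized = people.str.capitalize()

# Length of each string; the missing entry comes back as NaN
lengths = people.str.len()

# Methods can be chained: lowercase first, then test the first letter
starts_with_p = people.str.lower().str.startswith('p')

print(capitalized)
print(lengths)
print(starts_with_p)
```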

#### Exercise

* In the cells below, try three of the string operations listed above on the Pandas Series `monte`
* Remember, you can hit tab to autocomplete and shift-tab to see documentation

In [None]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte

In [None]:
# First


In [None]:
# Second


In [None]:
# Third


## Example: Recipe Database

* Let's walk through the recipe database example from the Python Data Science Handbook
* There are a few concepts and commands I haven't yet covered, but I'll explain them as I go along
* Download the recipe file from [this link](https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz) or run the cell below if you are on JupyterHub

In [None]:
recipes = pd.read_json("https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz", 
                       compression='gzip',
                       lines=True)

We have downloaded the data and loaded it into a dataframe directly from the web.

In [None]:
recipes.head()

In [None]:
recipes.shape

We see there are nearly 200,000 recipes, and 17 columns.
Let's take a look at one row to see what we have:

In [None]:
# display the first item in the DataFrame
recipes.iloc[0]

In [None]:
# Show the first five items in the DataFrame
recipes.head()

There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.
In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in.
Let's start by taking a closer look at the ingredients:

In [None]:
# Summarize the length of the ingredients string
recipes['ingredients'].str.len().describe()

In [None]:
# which row has the longest ingredients string
recipes['ingredients'].str.len().idxmax()

In [None]:
# idxmax returns an index label, so use loc to fetch that row from the dataframe
recipes.loc[135598]

In [None]:
# look at the ingredients string
recipes.loc[135598, 'ingredients']

* WOW! That is a lot of ingredients! That might need to be cleaned by hand instead of by a machine
* What other questions can we ask of the recipe data?

In [None]:
# How many breakfasts?
recipes.description.str.contains('[Bb]reakfast').sum()

In [None]:
# How many have cinnamon as an ingredient?
recipes.ingredients.str.contains('[Cc]innamon').sum()

In [None]:
# How many misspell cinnamon as cinamon?
recipes.ingredients.str.contains('[Cc]inamon').sum()
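How do these counts work? `str.contains` returns `True` or `False` for each row, and `sum()` counts the `True` values. A minimal sketch on a made-up `ingredients` Series (not the real recipe data), showing the regular-expression pattern used above alongside the `case=False` option. Passing `na=False` treats missing values as "no match".

```python
import pandas as pd

# Hypothetical ingredient strings, including a missing value and a typo
ingredients = pd.Series(['1 tsp Cinnamon', '2 cups flour', None,
                         'pinch of cinnamon', '1 tsp cinamon'])

# Regular expression: matches "Cinnamon" or "cinnamon"
regex_count = ingredients.str.contains('[Cc]innamon', na=False).sum()

# Same question, ignoring case entirely
nocase_count = ingredients.str.contains('cinnamon', case=False, na=False).sum()

# The misspelling is a different pattern, so it is counted separately
typo_count = ingredients.str.contains('[Cc]inamon', na=False).sum()

print(regex_count, nocase_count, typo_count)
```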