# Day Two - Introduction to Pandas

## Today's agenda

* Talk about 🐼 (not that kind of panda)
* Pandas data structures
* Reading and writing CSV files to disk
* Exploring and summarizing data
* Subsetting Data
* Transforming Data

In [None]:
# First, we need to load pandas into memory and give it the name "pd"
import pandas as pd

## Diving into Pandas

* Pandas is a 3rd-party library for doing data analysis
* It is a foundational component of Python data science
* Developed by [Wes McKinney](http://wesmckinney.com/pages/about.html) while working in the finance industry, so it has some...warts
* Vanilla Python (what we did yesterday) can do many of the same things, but Pandas does them *faster* and usually *easier*
* To do this, pandas introduces a set of data structures and analysis functions

---

## Introduction to Pandas Data Structures

* To understand Pandas, which is hard, it is helpful to start with the data structures it adds to Python:
    * Series - For one dimensional data (lists) 
    * Dataframe - For two dimensional data (spreadsheets)
    * Index - For naming, selecting, and transforming data within a Pandas Series or Dataframe (column and row names)

### Series

* A one-dimensional array of indexed data
* Kind of like a blend of a Python list and dictionary
* You can create them from a Python list


In [None]:
# Create a regular Python list
my_list = [0.25, 0.5, 0.75, 1.0]

# Transform that list into a Series
data = pd.Series(my_list)

# Display the data in the series
data

* A Series is a list-like structure, which means it is *ordered* 
* You can use indexing to grab items in a Series, just like a list
* The numbers in the left-hand column of the output are the *index* of the Series
* It is best to use the `iloc` indexer to grab elements by their location in the Series.

In [None]:
# grab the first element
data[0]

In [None]:
# grab the first element using iloc
data.iloc[0]

In [None]:
# grab the 4th element
data.iloc[3]
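
Why bother with `iloc` when plain square brackets worked above? Here is a minimal sketch, assuming a Series whose labels are integers that do not start at zero (the name `scores` and the labels 10, 20, 30 are made up for illustration). With labels like these, `scores[0]` would look for a *label* named 0 and fail with a `KeyError`, while `iloc` always means position.

```python
import pandas as pd

# A Series with integer labels that do not match the positions
# (hypothetical labels, chosen only for this illustration)
scores = pd.Series([0.25, 0.5, 0.75], index=[10, 20, 30])

# iloc looks things up by position, no matter what the labels are
first_by_position = scores.iloc[0]

# loc looks things up by label
first_by_label = scores.loc[10]

print(first_by_position, first_by_label)
```

Both lookups return the same value here, but only because label 10 happens to sit at position 0.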

#### Quick Exercise
* How might we grab the *last* element if we didn't know the length of the list?

In [None]:
# hint: think small
# your code below



#### Quick Exercise

* Use index notation to grab the 2nd element of `data`

In [None]:
# hint: the 2nd element is 0.50
# your code below



* Also, like lists, you can use *slicing* notation to grab sub-lists
* Again, it is best to use the `.iloc` method

#### Quick Exercise

* Use slices to grab the 2nd and 3rd elements of this series

In [None]:
# hint: the 2nd & 3rd elements are 0.50 and 0.75
# your code below



### Index by name

* Series also act like Python dictionaries, *ordered* python dictionaries
* This means you can grab things by name in addition to location

In [None]:
# Create a regular Python Dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# Transform that dictionary into a Series 
population = pd.Series(population_dict)

# Display the data
population

* You can use indexing and slicing like above, but now with keys instead of numbers!
* It is best to use the `.loc` method when looking up things by name instead of by number


In [None]:
# select the data value with the name "California"
population.loc['California']

In [None]:
# What happens if you try to use a name when iloc wants a position? (expect a TypeError)
population.iloc['California']

* Like a Python dictionary, a Series is a list of key/value pairs
* But these are *ordered*, which means you can do slicing
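
One detail worth knowing before slicing by name: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position, just like a list. A small sketch, using a made-up `fruit` Series rather than the population data:

```python
import pandas as pd

# A toy Series with string labels (values are arbitrary)
fruit = pd.Series({'apple': 3, 'banana': 5, 'cherry': 7, 'date': 9})

# Label-based slice: the end label 'cherry' IS included
by_label = fruit.loc['banana':'cherry']

# Position-based slice: the end position 2 is NOT included
by_position = fruit.iloc[1:2]

print(len(by_label), len(by_position))
```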

#### Quick Exercise

* Try slicing this series, but with keys instead of numbers!
* Select a subset of the data using the Python slicing notation
* Don't forget, use `loc`!

In [None]:
# Hint: Use the same : notation, but use the state names listed above
# Your code here:



### DataFrame

* `DataFrames` are the real workhorse of Pandas and Python Data Science
* We will be spending a lot of time with data inside of Dataframes, so buckle up!
* `DataFrames` contain two-dimensional data, just like an Excel spreadsheet
* In practice, a `DataFrame` is a bunch of `Series` lined up next to each other

In [None]:
# Start with our population Series
population

In [None]:
# Then create another Series for the area
area_dict = {'Illinois': 149995, 'California': 423967, 
             'Texas': 695662, 'Florida': 170312, 
             'New York': 141297}
area = pd.Series(area_dict)
area

In [None]:
# Create a dictionary with a key:value for each column
state_info_dictionary = {'population': population,
                       'area': area}

# Now mash them together into a DataFrame
states = pd.DataFrame(state_info_dictionary)
# Display the data
states

* Pandas automatically lines everything up because they have shared index values
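
To see the alignment at work, here is a hedged sketch with two toy Series (the names `first` and `second` and all values are invented). The keys are in a different order and one key is missing from the second Series, so Pandas matches rows by label and fills the gap with `NaN`.

```python
import pandas as pd

# Two toy Series: different key order, and 'b' is missing from the second
first = pd.Series({'a': 1, 'b': 2, 'c': 3})
second = pd.Series({'c': 30, 'a': 10})

# Rows are matched by index label, not by position
aligned = pd.DataFrame({'first': first, 'second': second})

print(aligned)
```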

In [None]:
# create a list of dictionaries that contain our data.
# one dictionary per observation/row
dead_people = [
    {"ssn":1, "first_name": "Bob", "last_name": "Jones", "age": 200},
    {"ssn":2, "first_name": "Jane", "last_name": "Jones", "age": 199},
    {"ssn":3, "first_name": "Ethel", "last_name": "Jones", "age": 180},
    {"ssn":4, "first_name": "Hortense", "last_name": "Jones", "age": 178},
    {"ssn":5, "first_name": "Vern", "last_name": "Jones", "age": 178}
]

# create a Dataframe from a list of dictionaries
pd.DataFrame(dead_people)

In [None]:
# create a list of lists, each sub-list is an observation/row
dead_people = [
    [1,"Bob","Jones",200],
    [2,"Jane","Jones",199],
    [3,"Ethel","Jones",180],
    [4,"Hortense","Jones",178],
    [5,"Vern","Jones",178]
]

# specify the column names separately
column_names = ["ssn","first_name", "last_name", "age"]

# make a Dataframe with column names specified separately
pd.DataFrame(dead_people, columns=column_names)

## Index

* Pandas `Series` and `DataFrames` are containers for data
* The Index (and Indexing) is the mechanism to make that data retrievable
* In a `Series` the index is the key to each value in the list
* In a `DataFrame` the column names form one index, and there is also an index for the rows
* Indexing allows you to merge or join disparate datasets together
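
As a rough illustration of joining on the index, the sketch below combines two small made-up DataFrames (`left` and `right` are hypothetical names and the numbers are invented). It uses `DataFrame.join`, which matches rows on the row index and keeps every row of the left DataFrame by default.

```python
import pandas as pd

# Two toy DataFrames that share some row labels
left = pd.DataFrame({'population': [10, 20, 30]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'area': [5, 2]}, index=['c', 'a'])

# join() lines rows up by index label; 'b' has no match so its area is NaN
joined = left.join(right)

# Once the rows are lined up, columns can be combined directly
joined['density'] = joined['population'] / joined['area']

print(joined)
```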

In [None]:
states

* The `loc` method I talked about above allows us to select specific rows and columns *by name*.
* Use the syntax `[<row>,<column>]` with index values

In [None]:
# Get the value of the population column from Illinois
states.loc["Illinois", "population"]

* We can also be tricky and use more advanced syntax to do more advanced queries.

In [None]:
# Get the area for states from Florida to Texas
# this is two dimensional slicing
states.loc["Florida":"Texas", "area"]

In [None]:
# Get the area for Florida and Texas
# Use a list to select multiple specific values
states.loc[["Florida", "Texas"], "area"]

In [None]:
# Get area and population for Florida and Texas
# use a ":" to specify "all columns"
states.loc[["Florida", "Texas"], :]

* What is happening here is we are passing a list of names for the rows, and using the colon ":" to say "all columns"
* OK, this is neat, but let's move on to some *real* examples

---

## Reading Files into a Dataframe

* Once your data is in a Pandas `DataFrame` you can easily use a ton of analytical tools
* You just have to get your data to fit into a dataframe
* Getting data to fit is a big part of the "data janitor" work...it is the craft of data carpentry
* However, as we will see, there is still a lot of carpentry work to do once your data fits into a `DataFrame`

### Open the file and load it into memory

* Pandas provides some very handy functions for reading in CSV files.

In [None]:
# look at a CSV file using the unix command head
!head community-center-attendance.csv

* This is how we open a CSV file with pure Python

In [None]:
# Load up the CSV module
import csv

# Open the file and get a file handle
file_handler = open('community-center-attendance.csv')
# Load the file into the CSV module
reader = csv.reader(file_handler)

# Read all the data into a variable as a list. look ma! one line!
center_attendance_python = [row for row in reader]

# close the file handler for good hygiene 
file_handler.close()


# Print out the first five rows 
center_attendance_python[0:5]

* Pandas can do this much easier using the `read_csv()` function

In [None]:
# Open up the csv file the Pandas way
center_attendance_pandas = pd.read_csv("community-center-attendance.csv", 
                                       index_col="_id") # use the column named _id as the row index
# Display the first five rows
center_attendance_pandas.iloc[0:5]

* Notice that Pandas figured out there is a header row and it created a row index from one of the columns
* Pandas also has a special function, `head(n)` for looking at the first *n* rows in a dataframe

In [None]:
# Use the head function to look at the "head" 
# of the dataframe. Default is 5 rows.
center_attendance_pandas.head()

* Notice the index starts at 1 instead of zero, that is because we told Pandas to use the "_id" column as the row index.
* This is when it is important to understand the difference between `loc` and `iloc`

In [None]:
# Select row by index name
center_attendance_pandas.loc[1]

In [None]:
# Select row by index location
center_attendance_pandas.iloc[1]
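
If you don't have the attendance file handy, the same `loc` versus `iloc` difference can be reproduced with a tiny in-memory CSV. This is only a sketch: the rows below are invented, and `io.StringIO` stands in for a real file on disk.

```python
import io
import pandas as pd

# A tiny made-up CSV; the column names mirror the ones used above,
# but the values are invented for this example
csv_text = """_id,center_name,attendance_count
1,Center A,12
2,Center B,7
3,Center C,30
"""

# read_csv accepts a file-like object as well as a filename
demo = pd.read_csv(io.StringIO(csv_text), index_col="_id")

# loc[1] is the row whose _id label is 1 (the first row)
by_label = demo.loc[1]

# iloc[1] is the row at position 1 (the second row)
by_position = demo.iloc[1]

print(by_label['center_name'], by_position['center_name'])
```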

---

## Exploring and Summarizing data

* Once your data has been loaded as a Dataframe, you can start using Pandas' various functions to quickly explore your data

### Helpful functions for exploring the data

* Looking at parts of the Dataframe
* `<dataframe>.head(n)` - look at the first n rows of the dataframe
* `<dataframe>.tail(n)` - look at the last n rows of the dataframe
* `<dataframe>.sample(n)` - randomly select n rows from the dataframe

In [None]:
# Look at the first 10 rows
center_attendance_pandas.head(10)

In [None]:
# Look at the last 5 rows
center_attendance_pandas.tail()

In [None]:
# Grab 5 random rows
center_attendance_pandas.sample(5)

* How many rows and columns?
* `<dataframe>.shape` - return the rows and columns as a python data structure (not a function!)
* `<dataframe>.info()` - Display the datatypes of the index and columns as well as memory usage
* `<dataframe>.describe()` - Compute summary statistics for numerical columns

In [None]:
# How many rows and columns
center_attendance_pandas.shape

In [None]:
# Inspect the datatypes
center_attendance_pandas.info()

* The output above shows us a lot of implementation details about our dataframe
* Number of rows and columns, and the datatype of each column
* Also shows us memory usage, which is useful because memory is a limited resource

* We can also start doing some computations on the data

In [None]:
# Compute summary statistics on the numerical columns
center_attendance_pandas.describe()

* The `describe()` function will automatically compute summary statistics for numerical columns and, by default, skip the non-numerical ones
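
Text columns can still be summarized, just differently. A minimal sketch on a made-up DataFrame (`demo`, `center`, and `visitors` are hypothetical names): calling `describe()` on a single text column reports the count, the number of unique values, the most common value, and how often it appears.

```python
import pandas as pd

# A toy DataFrame with one text column and one numeric column
demo = pd.DataFrame({'center': ['A', 'B', 'A', 'C'],
                     'visitors': [10, 20, 30, 40]})

# On the whole DataFrame, only the numeric column is summarized
numeric_summary = demo.describe()

# On a text column, describe() gives count, unique, top, and freq
text_summary = demo['center'].describe()

print(numeric_summary)
print(text_summary)
```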

### Counting Numerical Data

* We can use traditional Python functions to get information about our Dataframe.
* The `len()` function tells us the length of the sequence 

In [None]:
# use a standard python function to get the length of the sequence
len(center_attendance_pandas)

* So this tells us our dataset has 18,367 rows.
* But this is just information about the dataset itself, it doesn't tell us how many people visited community centers
* How many people visited all the community centers for all time (in the dataset)?
* First let's answer this using pure python

In [None]:
# look at the first ten rows of the data loaded in python
center_attendance_python[0:10]

In [None]:
# create a variable to hold the total attendance
total_attendance = 0

# loop over the data that was loaded using pure python
for row in center_attendance_python[1:]: # skip the header row using a list slice
    # add the row count to the total, convert string to int
    row_attendance = int(row[3])
    total_attendance = total_attendance + row_attendance

print(total_attendance)

* Now here is how we do the exact same thing with Pandas 😄
* This code selects the `attendance_count` column and then computes the sum of all the values.

In [None]:
# compute the total attendance with the pandas sum function
center_attendance_pandas['attendance_count'].sum()

* We can also look at the summary statistics individually
* `<dataframe>[<column name>].mean()` - calculate the mean value for the column values
* `<dataframe>[<column name>].std()` - calculate the standard deviation for the column values
* `<dataframe>[<column name>].var()` - calculate the variance value for the column values
* `<dataframe>[<column name>].median()` - calculate the median value for the column values
* `<dataframe>[<column name>].min()` - calculate the minimum value for the column values

In [None]:
# mean attendance per day at all community centers
center_attendance_pandas['attendance_count'].mean()

In [None]:
# standard deviation
center_attendance_pandas['attendance_count'].std()

In [None]:
# variance
center_attendance_pandas['attendance_count'].var()

In [None]:
# median attendance per day at all community centers
center_attendance_pandas['attendance_count'].median()

In [None]:
# minimum attendance at community centers
center_attendance_pandas['attendance_count'].min()

* Pandas is not only a tool for working with numerical data
* Lots of functionality for manipulating categorical data too

### Counting Categorical Data

* Just like before we can start counting the distribution of values in the column. 
* How many entries are there per community center? (This isn't counting attendance; it counts the number of rows per center.)

* The "Pythonic way"

In [None]:
# Create a dictionary to store the counts
center_counter = dict()

# loop over the data, skipping the header row using a list slice
for row in center_attendance_python[1:]:
    center = row[2]
    
    # check to see if the center is already in the dictionary
    if center not in center_counter:
        # create a new entry
        center_counter[center] = 1
    else:
        # increment the existing entry
        center_counter[center] += 1

# Display the dictionary 
center_counter

* The Pandas way is a bit easier

In [None]:
# Do the same thing with pandas
center_attendance_pandas['center_name'].value_counts()

### Splitting Data with GroupBy


* A common pattern in data analysis is splitting data by a key and then performing some math on all of the values with that key and finally combining it all back together
* This is commonly known in data circles as *split, apply, combine*


In [None]:
# create a dataframe to illustrate GroupBy
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6),
                   'things':[45,234,6,2,1324,345]}, columns=['key', 'data', 'things'])
df

In [None]:
# Dataframes have a method, groupby(), that takes a column name to be the grouping key
df.groupby('key')

* Cool, but what is that? Well, we need to tell Pandas what to *do* with the groups
* This is where we get to the *apply* step
* We need to specify what kind of aggregation, transformation, or computation to perform on the group

In [None]:
# Tell pandas to add up all of the values for each key
df.groupby('key').sum()

In [None]:
grouped_dataframe = df.groupby('key')
grouped_dataframe.sum()

The following table summarizes some other built-in Pandas aggregations:

| Aggregation              | Description                                     |
|--------------------------|-------------------------------------------------|
| ``count()``              | Number of items, not counting NaNs              |
| ``size()``               | Number of items, including NaNs                 |
| ``first()``, ``last()``  | First and last item                             |
| ``mean()``, ``median()`` | Mean and median                                 |
| ``min()``, ``max()``     | Minimum and maximum                             |
| ``std()``, ``var()``     | Standard deviation and variance                 |
| ``mad()``                | Mean absolute deviation (removed in pandas 2.0) |
| ``prod()``               | Product of all items                            |
| ``sum()``                | Sum of all items                                |

These are all methods of ``DataFrame`` and ``Series`` objects.
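
Here is a small sketch of how a few of these aggregations behave on grouped data, assuming a made-up DataFrame with one missing value (the names `demo`, `key`, and `value` are only for illustration). It shows the difference between `size()` and `count()`, and uses `agg()` to run several aggregations at once.

```python
import pandas as pd

# A toy DataFrame; group A has one missing value
demo = pd.DataFrame({'key': ['A', 'A', 'B', 'B'],
                     'value': [1.0, None, 3.0, 5.0]})

# Split by key, then select the column to aggregate
grouped = demo.groupby('key')['value']

# size() counts every row in the group, missing or not
sizes = grouped.size()

# count() skips missing values
counts = grouped.count()

# agg() applies several aggregations in one go
summary = grouped.agg(['min', 'max', 'mean'])

print(sizes, counts, summary, sep="\n\n")
```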

#### Quick Exercise
* Using the community center dataframe, try the aggregations above

In [None]:
# Try an aggregation function to see what it does by adding it after the period.

center_attendance_pandas.groupby("center_name").


---

## Challenge 1

![](http://i0.kym-cdn.com/photos/images/newsfeed/001/296/855/bff.png)

I am the Data and I speak for the [trees](https://data.wprdc.org/dataset/city-trees).

Analyze PGH tree data with descriptive statistics. Answer the following questions using Pandas and the provided data about Pittsburgh's trees.

* What is the average height and width of trees in Pittsburgh?
* How tall is the tallest tree?
* What is the tallest type of tree in Pittsburgh on average?
* Bonus: In what neighborhood is the widest tree? 
    * Hint: try the `idxmax()` method to get the row id of the max value, then select that row by id

In [None]:
# First step, read the CSV into a dataframe
tree_data = pd.read_csv("Tree_data.csv", low_memory=False)
tree_data.head()

**First things first, let's break it down!**

![Break it down corgi](https://media0.giphy.com/media/TRff1szV5FYXu/giphy.gif)

In [None]:
# What is the average height of Pittsburgh trees?
# Your code below



In [None]:
# How tall is the tallest tree?
# Your code below


In [None]:
# What is the tallest type of tree in Pittsburgh on average?
# Your code below


In [None]:
# Bonus: In what neighborhood is the widest tree?
# Your code below


#### Challenge Answers

* What is the average height of Pittsburgh trees?

In [None]:
# first step, select just the height values of the trees
tree_heights = tree_data["height"]

# second step, compute the mean of those values using the mean function
tree_height_mean = tree_heights.mean()

print("The average height of trees in Pittsburgh is", tree_height_mean, "feet")

* What is the average width of Pittsburgh trees?


In [None]:
# first step, select just the widths of the trees
tree_widths = tree_data["width"]

# second step, compute the mean of those values using the mean function
tree_width_mean = tree_widths.mean()

print("The average width of trees in Pittsburgh is", tree_width_mean, "feet")

* How tall is the tallest tree?

In [None]:
# first step, select just the height values of the trees
tree_heights = tree_data["height"]

# second step, compute the max value of the heights 
tallest_tree_height = tree_heights.max()

print("The tallest tree is", tallest_tree_height, "feet")

* On average, what is the tallest type of tree in Pittsburgh?

In [None]:
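# First step, group by the common name
grouped_by_name = tree_data.groupby("common_name")

# Second step, select the heights within each group
heights_by_name = grouped_by_name['height']

# Third step, compute the mean height for each type of tree
average_heights_by_name = heights_by_name.mean()

# Fourth step, sort so the tallest types come first
average_heights_by_name.sort_values(ascending=False).head()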

* In what neighborhood is the widest tree?

In [None]:
# first and second step, select the widths and use idxmax to get the row index of the max value
widest_tree_id = tree_data["width"].idxmax()

# third step, select that row by its index
widest_tree = tree_data.loc[widest_tree_id]

# fourth step, select the neighborhood from that row
widest_tree_neighborhood = widest_tree['neighborhood']

print("The widest tree is in", widest_tree_neighborhood)

---

## Subsetting Data

* It is sometimes helpful to think of a Pandas Dataframe as a little database. 
* There is data and information stored in the Pandas Dataframe (or Series) and you want to *retrieve* it.
* Pandas has multiple mechanisms for getting specific bits of data and information from its data structures. 

### Masking: Filtering by Values

* The most common is to use *masking* to select just the rows you want. 
* Masking is a two-stage process: first you create a sequence of boolean values based on a conditional expression (which you can think of as a "query"), and then you index your dataframe using that boolean sequence.

In [None]:
# Let's look at the tree data again
tree_data = pd.read_csv("Tree_data.csv", low_memory=False)
tree_data.head()

In [None]:
# Let's look at all the columns
tree_data.info()

* How might we only look at trees in a particular neighborhood?
* First step is to create a *query mask*, a list of `True/False` values for rows that satisfy a particular condition.

In [None]:
# create a query mask for trees in the Greenfield neighborhood
neighborhood_query = tree_data['neighborhood'] == "Greenfield"
neighborhood_query.sample(20)

* This tells us the row id and True or False if the neighborhood equals Greenfield
* We can look up that row by index and see if it is correct

In [None]:
tree_data.loc[18872]

* Yup! So now that we know the mask works, we can create a *subset* of our data containing Greenfield trees.

In [None]:
greenfield_trees = tree_data[neighborhood_query]
greenfield_trees.head()

* Now you can do things like calculate the average height for just the Greenfield neighborhood

In [None]:
# Calculate the mean height for greenfield trees
greenfield_trees['height'].mean()

* We can also combine query masks using boolean logic
* Can we look at just the Norway Maples in Squirrel Hill South?

In [None]:
# create a query mask for squirrel hill south
neighborhood_query = tree_data['neighborhood'] == "Squirrel Hill South"
# create a query mask for norway maples
tree_query = tree_data['common_name'] == "Maple: Norway"

# apply both query masks using boolean AND
norway_in_squirrel_hill = tree_data[neighborhood_query & tree_query]
norway_in_squirrel_hill.head()
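
Boolean AND is not the only option. The sketch below, on a small made-up DataFrame (`trees` and its values are invented), also shows OR with `|` and NOT with `~`. When the conditions are written inline rather than stored in variables, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>` in Python.

```python
import pandas as pd

# A toy stand-in for the tree data
trees = pd.DataFrame({'neighborhood': ['North', 'South', 'North', 'East'],
                      'height': [12, 40, 55, 8]})

# AND: rows that satisfy both conditions
tall_north = trees[(trees['neighborhood'] == 'North') & (trees['height'] > 30)]

# OR: rows that satisfy at least one condition
tall_or_north = trees[(trees['neighborhood'] == 'North') | (trees['height'] > 30)]

# NOT: flip every True/False in the mask
not_north = trees[~(trees['neighborhood'] == 'North')]

print(len(tall_north), len(tall_or_north), len(not_north))
```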

* You can also use computations to subset the data
* For example, let's just select the tall trees over 100 feet.

In [None]:
# create a query mask that selects trees with a height greater than 100 feet
tall_trees_query = tree_data['height'] > 100

# create a subset of just tall trees and display them
tall_trees = tree_data[tall_trees_query]
tall_trees.head(10)


* It would appear there are five trees taller than 100 feet in the dataset

## Challenge 2

![Tree pandas](http://i.imgur.com/wvfuagf.gif)
 Now that we know how to query Pandas Dataframes, let's try and get some answers!

* Create a subset of the data that represents the most valuable trees in your favorite neighborhood.
    * Hint: Use `tree_data.info()` to look for columns related to value.
    * Hint: You will need to define a threshold for "valuable"
* How many trees are in that subset?
    * Hint: Use a standard Python function
* How many trees of different types are in that subset?
    * Hint: Look back to the Counting Categorical Data section 

**First things first, let's break it down!**

![Break it down corgi](https://media0.giphy.com/media/TRff1szV5FYXu/giphy.gif)

In [None]:
# Write some pseudo code before you write some real code

In [None]:
# Look at the distribution of dollar value to get a sense of tree values
tree_data['overall_benefits_dollar_value'].describe()

In [None]:
# Create a subset of data that represents the most valuable trees in your favorite neighborhood
favorite_neighborhood_query = tree_data['neighborhood'] == "Greenfield"
value_threshold = tree_data['overall_benefits_dollar_value'] > 100
greenfield_money_trees = tree_data[favorite_neighborhood_query & value_threshold]

# Look at the range of values for Greenfield trees
greenfield_money_trees.head(10)

In [None]:
# How many trees in that subset?
len(greenfield_money_trees)

In [None]:
# How many trees of different types are in that subset?
greenfield_money_trees['common_name'].value_counts()

---

## Transforming Data

* Most of the previous stuff has focused on exploring the data through summary statistics and subsetting
* However, another powerful element of Pandas is the ability to transform your data
* Pandas provides a bunch of mechanisms for manipulating your data

### Vectorized Math Operations

* When you perform a mathematical operation on a column, it will automatically apply that operation to every record/observation in the column.
* This is called a *vectorized* operation
* It is like a Python `for` loop, but much faster

In [None]:
# compute the height/width ratio in a python loop
# We use the iterrows() function to make the dataframe behave like a list
for index, tree in tree_data.iterrows():
    print(tree['height'] / tree['width'])

* Gah! Divide by zero, that's annoying. If we wanted to do this in pure python we'd have to add a bunch of error handling.
* Or we could use PANDAS!

In [None]:
tree_data['height'] / tree_data['width']

* Pandas didn't even break a sweat AND it handled the divide by zero case (it returns `inf` or `NaN` instead of raising an error)!
* Thanks Pandas! But how can we save that data?
* You can easily create a new column with a Python assignment operator.

In [None]:
# Create a new column from the height/width ratio
tree_data['height_width_ratio'] = tree_data['height'] / tree_data['width']

tree_data.head(10)
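
What does a divide by zero actually look like in a column? A quick sketch on made-up numbers (the `demo` DataFrame is hypothetical): for floating point columns, a non-zero value divided by zero becomes `inf`, and zero divided by zero becomes `NaN`.

```python
import pandas as pd

# Toy heights and widths, including zeros in the width column
demo = pd.DataFrame({'height': [10.0, 5.0, 0.0],
                     'width': [2.0, 0.0, 0.0]})

# Vectorized division: no loop and no error handling needed
ratio = demo['height'] / demo['width']

# 10/2 is an ordinary number, 5/0 is inf, 0/0 is NaN
print(ratio)
```

Those `inf` and `NaN` values stay in the column, so it is worth checking for them before computing things like a mean.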

* You can also use vector operations with scalar values
* Pandas will automatically perform the operation on every value in the column

In [None]:
# convert feet to meters 
tree_data['metric_height'] = tree_data['height'] * 0.3048
tree_data[['height', 'metric_height']].head(10)

### Vectorized String Operations

* Sometimes you need to clean string or categorical data
* Pandas has a set of String operations that do this work for you
* Especially useful for handling bad data!

In [None]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']

for s in data:
    print(s.capitalize())

* But like above, this breaks very easily with missing values

In [None]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

for s in data:
    print(s.capitalize())

* The Pandas library has *vectorized string operations* that handle missing data

In [None]:
names = pd.Series(data)
names

In [None]:
names.str.capitalize()


* Look ma! No errors!
* Pandas includes a bunch of methods for doing things to strings.

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |
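
Here is a rough sketch of chaining a few of these methods, using a made-up Series of messy names with one missing value (`messy` and its contents are invented). Each step goes through the `.str` accessor, and the missing value simply stays missing.

```python
import pandas as pd

# Messy toy data: stray spaces, mixed case, and a missing value
messy = pd.Series(['  ada lovelace', 'GRACE HOPPER ', None])

# Chain operations: trim whitespace, then lowercase
cleaned = messy.str.strip().str.lower()

# Count the characters in each cleaned string
lengths = cleaned.str.len()

# Split on whitespace, then grab the last piece of each list
last_names = cleaned.str.split().str.get(-1)

print(cleaned, lengths, last_names, sep="\n\n")
```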

#### Exercise

* In the cells below, try three of the string operations listed above on the Pandas Series `monte`
* Remember, you can hit tab to autocomplete and shift-tab to see documentation

In [None]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
monte

In [None]:
# First


In [None]:
# Second


In [None]:
# Third


### Real *Messy* Data Example: Recipe Database

* Let's walk through the recipe database example from the [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html)
* There are a few concepts and commands I haven't yet covered, but I'll explain them as I go along

In [None]:
recipes = pd.read_json("https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz", 
                       compression='gzip',
                       lines=True)

We have downloaded the data and loaded it into a dataframe directly from the web.

In [None]:
recipes.head()

In [None]:
recipes.shape

We see there are nearly 200,000 recipes, and 17 columns.
Let's take a look at one row to see what we have:

In [None]:
# display the first item in the DataFrame
recipes.iloc[0]

In [None]:
# Show the first five items in the DataFrame
recipes.head()

There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.
In particular, the ingredient list is in string format; we're going to have to carefully extract the information we're interested in.
Let's start by taking a closer look at the ingredients:

In [None]:
# Summarize the length of the ingredients string
recipes['ingredients'].str.len().describe()

* This shows us a statistical summary of the number of characters in the ingredients descriptions.

In [None]:
# which row has the longest ingredients string
recipes['ingredients'].str.len().idxmax()

In [None]:
# use loc to fetch that specific row from the dataframe (idxmax returns an index label)
recipes.loc[135598]

In [None]:
# look at the ingredients string
recipes.loc[135598]['ingredients']

* WOW! That is a lot of ingredients! That might need to be cleaned by hand instead of a machine
* What other questions can we ask of the recipe data?

In [None]:
# How many breakfasts?
recipes.description.str.contains('[Bb]reakfast').sum()

In [None]:
# How many have cinnamon as an ingredient?
recipes.ingredients.str.contains('[Cc]innamon').sum()

In [None]:
# How many misspell cinnamon as cinamon?
recipes.ingredients.str.contains('[Cc]inamon').sum()
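
The `[Cc]` pattern only covers the first letter. As a hedged alternative, `str.contains` also accepts `case=False` to ignore case everywhere and `na=False` to treat missing values as "no match". The sketch below uses a tiny made-up Series instead of the recipe data.

```python
import pandas as pd

# Toy ingredient strings with mixed case and a missing value
ingredients = pd.Series(['1 tsp Cinnamon', '2 cups flour', None, 'CINNAMON stick'])

# case=False ignores upper/lower case; na=False counts missing values as False
mask = ingredients.str.contains('cinnamon', case=False, na=False)

# True counts as 1 and False as 0, so sum() counts the matches
count = mask.sum()

print(count)
```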

## Challenge 3

![Trash pandas](https://media0.giphy.com/media/Yck5MptncAGfS/giphy.gif)

Trash pandas love food!

* Let's see if we can draw out some additional information about the ingredients
* Like how many for each recipe?

* Use vectorized string operations to do the following:
    * Split the ingredients string into a list of items
    * Count the number of items in that list
        * Hint: Try chaining multiple string operations
        * Another hint: Don't forget `str` in your chain
    * Compute summary statistics on that list
    * Bonus: Create a new column "number_of_ingredients"

In [None]:
# This is it. This is the answer.
recipes['ingredients'].str.split("\n").str.len().describe()

![Pandas will F you up!](https://media1.giphy.com/media/EPcvhM28ER9XW/giphy.gif)

* Pandas fucks up dirty data!

In [None]:
recipes['number_of_ingredients'] = recipes['ingredients'].str.split("\n").str.len()
recipes.head()

---

* OK, that's it.
* WE DID IT!

![Celebration](https://media2.giphy.com/media/1ofR3QioNy264/giphy.gif)

## Further Readings and Resources

* Python for Everybody - https://www.py4e.com/
* Python Data Science Handbook - https://jakevdp.github.io/PythonDataScienceHandbook/index.html
* Dataquest Online tutorials - https://Dataquest.io
* Python Documentation - https://docs.python.org
* Pandas Documentation - https://pandas.pydata.org/pandas-docs/stable/