# Data Processing with Python and Pandas Part Two


## Today's Topics

* Review last week and subsetting data with query masks
* Data Cleaning
* Data Wrangling
* Working with Time

## A quick review of last week

* Series
* Data Frames
* Index

In [None]:
# Import pandas so we can do stuff
import pandas as pd


### Series

* One-dimensional data structure
* Mother was a list, father was a dictionary
* Dictionary keys become the Series *index*

In [None]:
# create a Series from a list with implicit index
my_list = [0.25, 0.5, 0.75, 1.0]
data = pd.Series(my_list)
data

In [None]:
# create a Series from a list with explicit index
my_list = [0.25, 0.5, 0.75, 1.0]
data = pd.Series(my_list, index=[1,2,3,"Picksburgh"])
data

* You can give a Series explicit index labels that start at one, but positional slicing is still zero-based
* That is why we always use `loc` (labels) and `iloc` (positions)

In [None]:
# get the item at index location 1 (the second item)
data.iloc[1]

In [None]:
# get the item with the index name
data.loc[1]

In [None]:
# get the items at the 2nd and 3rd locations
data.iloc[1:3]
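
* A minimal sketch of the difference between the two kinds of slicing. It assumes a simplified version of the Series above with an all-integer index (1 to 4), so that both label and position slices are easy to compare.
* `iloc` slices leave out the stop position, `loc` slices include the stop label.

```python
import pandas as pd

# Simplified version of the Series above, with labels 1, 2, 3, 4 (assumption)
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 2, 3, 4])

# iloc works on positions (zero-based) and excludes the stop position
by_position = data.iloc[1:3]   # positions 1 and 2

# loc works on labels and includes the stop label
by_label = data.loc[1:3]       # labels 1, 2 and 3

print(by_position)
print(by_label)
```

* Same slice syntax, different rows: the position slice returns two items, the label slice returns three.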

* Series from python dictionaries

In [None]:
# create a Series from a dictionary where keys become the index
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
# you can't slice a dictionary, so this line raises an error
population_dict['California':'Illinois']

In [None]:
# but you can slice a Series
population.loc['California':'Illinois']

In [None]:
population.loc['California']

* Series has a bunch of methods for manipulating data.
* [See the documentation for a list](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html)

In [None]:
sorted_population = population.sort_values()
sorted_population

In [None]:
population

In [None]:
population.sort_index()

### Dataframes

* Two-dimensional data structure
* Made of columns, where each column is a Series
* A spreadsheet, but in Python 

In [None]:
# Quickly create two series with the same index, but different values 
population = pd.Series({'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135})
area = pd.Series({'Illinois': 149995, 'California': 423967, 
             'Texas': 695662, 'Florida': 170312, 
             'New York': 141297})

# now moosh them together into a dataframe
states = pd.DataFrame({'population': population,
                       'area': area})
states
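
* Because each column is a Series, you can do arithmetic between columns. A small sketch using the same population and area numbers; the `density` column name is just an illustration.
* Rows are matched by index label, so it does not matter that the two Series list the states in a different order.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'Illinois': 149995, 'California': 423967,
                  'Texas': 695662, 'Florida': 170312,
                  'New York': 141297})
states = pd.DataFrame({'population': population, 'area': area})

# pulling out one column gives you back a Series
pop_column = states['population']

# column arithmetic lines up rows by index label, not by position
states['density'] = states['population'] / states['area']
print(states)
```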

* Reading CSV files into Dataframes


In [None]:
# read the tab-separated data into a pandas dataframe (default row index)
order_data = pd.read_csv("../4 - data management one/chipotle.tsv", sep="\t")
# inspect the dataframe
order_data.head() 

* Writing a Dataframe to a CSV

In [None]:
# Write to a file in current working directory 
# Don't include the row index in output file
order_data.to_csv("chipotle.csv", index=False)
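
* A quick sketch of why `index=False` matters. It uses a tiny made-up `orders` dataframe (not the real chipotle data) and writes to an in-memory buffer instead of a file, so it runs anywhere.

```python
import io
import pandas as pd

# hypothetical stand-in for order_data, values are made up
orders = pd.DataFrame({'order_id': [1, 1, 2],
                       'item_name': ['Izze', 'Chicken Bowl', 'Chicken Bowl'],
                       'quantity': [1, 1, 2]})

# write without the row index, then read it back
no_index = io.StringIO()
orders.to_csv(no_index, index=False)
no_index.seek(0)
round_trip = pd.read_csv(no_index)

# write with the row index (the default), then read it back
with_index = io.StringIO()
orders.to_csv(with_index)
with_index.seek(0)
extra_column = pd.read_csv(with_index)

print(list(round_trip.columns))
print(list(extra_column.columns))
```

* Leaving the index in means you get an extra unnamed column the next time you read the file.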

## Subsetting Data

* It is sometimes helpful to think of a Pandas Dataframe as a little database. 
* There is data and information stored in the Pandas Dataframe (or Series) and you want to *retrieve* it.
* Pandas has multiple mechanisms for getting specific bits of data and information from its data structures. 

### Masking: Filtering by Values

* The most common mechanism is *masking*, which selects just the rows you want. 
* Masking is a two-stage process. First you create a sequence of boolean values from a conditional expression, which you can think of as a "query". Then you index your dataframe with that boolean sequence. 
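
* The two stages in miniature, as a sketch on a tiny made-up `orders` dataframe (the names and prices are invented for illustration):

```python
import pandas as pd

# hypothetical data, not the real chipotle orders
orders = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Izze', 'Chicken Bowl', 'Steak Burrito'],
    'item_price': [8.75, 3.39, 10.98, 11.75],
})

# stage 1: the conditional expression gives one True/False per row
mask = orders['item_name'] == 'Chicken Bowl'

# stage 2: indexing with the mask keeps only the rows where it is True
bowls = orders[mask]

# True counts as 1, so summing the mask counts the matching rows
n_matches = mask.sum()

print(mask)
print(bowls)
print(n_matches)
```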

In [None]:
# Let's look at the chipotle order data
order_data.head(10)

In [None]:
# Let's look at all the columns
order_data.info()

* How might we only look at particular orders?
* The first step is to create a *query mask*, a Series of `True/False` values marking the rows that satisfy a particular condition.

In [None]:
# create a query mask for chicken bowls
query_mask = order_data['item_name'] == "Chicken Bowl"

# look at the first 20 items to see what matches
query_mask.head(20)

* This shows each row's index along with `True` or `False`, depending on whether the item name equals "Chicken Bowl"
* We can look up one of the `True` rows by its index location and check that it is correct

In [None]:
order_data.iloc[19]

* Yup! So now that we know the mask works, we can create a *subset* of our data containing chicken bowls.

In [None]:
chicken_bowls = order_data[query_mask]
chicken_bowls.head()

* Now you can do things like calculate the average price for chicken bowl orders

In [None]:
# Calculate the mean price for chicken bowls
chicken_bowls['item_price'].mean()

In [None]:
# See how many chicken bowls people order
chicken_bowls['quantity'].value_counts()

* We can also combine query masks using boolean logic
* Can we look at just the chicken bowl orders that were less than $10?

In [None]:
# create a query mask for chicken bowls
item_query_mask = order_data['item_name'] == "Chicken Bowl"
# create a query mask for cheap orders
price_query_mask = order_data['item_price'] < 10

# apply both query masks using boolean AND
cheap_chicken_bowls = order_data[item_query_mask & price_query_mask]
cheap_chicken_bowls.head()

In [None]:
# Median price for cheap chicken bowls
cheap_chicken_bowls['item_price'].median()
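
* Boolean AND is not the only option. A sketch of OR (`|`) and NOT (`~`) on a tiny made-up `orders` dataframe (values invented for illustration).
* If you write the conditions inline instead of naming each mask, wrap each one in parentheses, because `&` and `|` bind more tightly than `==` and `<`.

```python
import pandas as pd

# hypothetical data, not the real chipotle orders
orders = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Izze', 'Chicken Bowl', 'Steak Burrito'],
    'item_price': [8.75, 3.39, 10.98, 11.75],
})

is_bowl = orders['item_name'] == 'Chicken Bowl'
is_cheap = orders['item_price'] < 10

# OR: rows that are chicken bowls, or cheap, or both
bowl_or_cheap = orders[is_bowl | is_cheap]

# NOT: everything except chicken bowls
not_bowl = orders[~is_bowl]

# AND written inline, with parentheses around each condition
cheap_bowls = orders[(orders['item_name'] == 'Chicken Bowl') &
                     (orders['item_price'] < 10)]

print(bowl_or_cheap)
print(not_bowl)
print(cheap_bowls)
```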

* Query masks can be used to filter and create subsets of data
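* Note, a subset made this way is a new dataframe, but pandas still remembers that it was derived from the original
* If you then try to add or change a column on that subset, pandas may (depending on your version) raise a `SettingWithCopyWarning`, which is a warning rather than an error, because it cannot tell whether you meant to change the original data
* Making an explicit copy with `.copy()` first tells pandas the subset is independent and avoids the warning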
* For more information, [see the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy)

In [None]:
cheap_chicken_bowls['half_price'] = cheap_chicken_bowls['item_price'] / 2

In [None]:
copy_of_cheap_chicken_bowls = cheap_chicken_bowls.copy()
copy_of_cheap_chicken_bowls['half_price'] = copy_of_cheap_chicken_bowls['item_price'] / 2
copy_of_cheap_chicken_bowls.head()
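
* A small sketch showing that the explicit copy really is independent of the original. It uses a tiny made-up `orders` dataframe rather than the chipotle data.

```python
import pandas as pd

# hypothetical data, not the real chipotle orders
orders = pd.DataFrame({
    'item_name': ['Chicken Bowl', 'Izze', 'Chicken Bowl', 'Steak Burrito'],
    'item_price': [8.75, 3.39, 10.98, 11.75],
})

# subset with a mask, then make an explicit copy before changing it
cheap = orders[orders['item_price'] < 10].copy()
cheap['half_price'] = cheap['item_price'] / 2

# the new column lives only on the copy
print(cheap.columns.tolist())
print(orders.columns.tolist())
```

* The original `orders` dataframe keeps its two columns, so you can transform the copy without worrying about side effects.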