<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Intro to `pandas` Lab

*Instructor: Aymeric Flaisler*

___

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### A. Explore _Mad Men_ Cast Data

---

#### 1. Load the _Mad Men_ cast data into a `pandas` DataFrame.

In [None]:
cast_data_csv = '../datasets/mad-men-cast-show-data.csv'

In [None]:
cast = pd.read_csv(cast_data_csv, encoding='latin-1')

#### 2. Print the head and tail of the data.

In [None]:
cast.head(2)

In [None]:
cast.tail(2)

#### 3. Print the columns of the data.

In [None]:
cast.columns

#### 4. Rename any columns with spaces or special characters to not contain any.

In [None]:
import string
# the string module provides constants containing all ASCII letters
uppercase = string.ascii_uppercase
lowercase = string.ascii_lowercase

In [None]:
# Rebuild each column name from only the characters found in
# uppercase + lowercase + ' _'. Anything else (digits, punctuation,
# other special characters) is dropped.
cast.columns = [''.join([ch for ch in col if ch in uppercase+lowercase+' _']) for col in cast.columns]
print(cast.columns)

In [None]:
# replace spaces with underscores in every column name
cast.columns = [col.replace(' ', '_') for col in cast.columns]

print(cast.columns)
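
The same two-step cleanup can also be written with the vectorized `.str` accessor on the column index. The sketch below is only an illustration: the `toy` DataFrame and its column names are invented here and are not the real _Mad Men_ columns. It also strips leftover edge spaces, which the cells above do not do.

```python
import pandas as pd

# Hypothetical column names, made up for this example only.
toy = pd.DataFrame(columns=['Show Start', 'Score (0-10)', '#LEAD', 'Years_Since'])

toy.columns = (
    toy.columns
    .str.replace(r'[^A-Za-z _]', '', regex=True)  # keep letters, spaces, underscores
    .str.strip()                                  # extra step: trim spaces left at the edges
    .str.replace(' ', '_', regex=False)           # remaining spaces become underscores
)

print(list(toy.columns))
# ['Show_Start', 'Score', 'LEAD', 'Years_Since']
```

Chaining the `.str` methods keeps the whole rename in one expression, at the cost of hiding the intermediate result that the two separate cells print.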

#### 5. Subset the data where the status of the show is neither "END" nor "End".

In [None]:
print(cast.shape)
mask = (cast['Status'] != 'END') & (cast['Status'] != 'End')
mask

In [None]:
subset = cast[mask]
subset.shape
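
When several values have to be excluded, `Series.isin` combined with `~` (logical NOT) can replace the chain of `!=` comparisons. This is a minimal sketch on a made-up frame; the `toy` rows and the extra status values are assumptions, not taken from the data set.

```python
import pandas as pd

# Invented rows for illustration; only the 'Status' column name mirrors the lab.
toy = pd.DataFrame({
    'Performer': ['A', 'B', 'C', 'D'],
    'Status': ['END', 'End', 'PRESENT', 'LEFT'],
})

ended = toy['Status'].isin(['END', 'End'])  # True where the show has ended
still_running = toy[~ended]                 # ~ flips the boolean mask

print(list(still_running['Performer']))
# ['C', 'D']
```

Adding another spelling to exclude then only means extending the list passed to `isin`.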

#### 6. Print out the performers where the show start is greater than 2005 and the score is greater than 7.

In [None]:
# double-checking our dtypes to make sure they are correct.
cast.dtypes

In [None]:
cast[(cast['Show_Start'] > 2005) & (cast['Score'] > 7)].Performer.unique()

#### 7. Select the show and performer columns for the rows LABELED 20 through 25.

In [None]:
cast.loc[20:25, ['Show','Performer']]
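
The word "LABELED" matters because `.loc` slices by index label and includes both endpoints, while `.iloc` slices by position and excludes the stop. The sketch below shows the difference on a made-up frame with a default integer index (the `toy` data is invented for illustration).

```python
import pandas as pd

# Hypothetical 30-row frame with the default 0..29 integer index.
toy = pd.DataFrame({
    'Show': [f'show_{i}' for i in range(30)],
    'Performer': [f'performer_{i}' for i in range(30)],
})

by_label = toy.loc[20:25, ['Show', 'Performer']]      # labels 20..25, end included
by_position = toy.iloc[20:25][['Show', 'Performer']]  # positions 20..24, end excluded

print(len(by_label), len(by_position))
# 6 5
```

With a default index the labels and positions coincide, so the only visible difference is the extra row; on a filtered or re-sorted frame the two selections can return entirely different rows.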

#### 8. Plot a histogram of score.

In [None]:
cast['Score'].hist()

### B. Explore San Francisco Crime Data

--- 

**9. Load the San Francisco crime data set into a DataFrame.**

In [3]:
crime_csv = '../datasets/sf_crime.csv'

In [4]:
crime = pd.read_csv(crime_csv)

**10. Look at the dimensions of the crime data.**

In [None]:
crime.shape

**11. Look at the data types of the columns and print out the column names.**

In [None]:
crime.dtypes

In [None]:
crime.columns

**12. How many distinct districts are there?**

In [None]:
print(crime.PdDistrict.unique())
print(len(crime.PdDistrict.unique()))
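
`Series.nunique()` gives the count directly, with one caveat: `len(series.unique())` counts a missing value as its own entry, whereas `nunique()` ignores missing values unless `dropna=False` is passed. A small sketch with invented values (whether the real `PdDistrict` column contains missing values is not assumed here):

```python
import numpy as np
import pandas as pd

# Made-up values, including one missing entry, to show the difference.
toy = pd.Series(['MISSION', 'SOUTHERN', 'MISSION', np.nan], name='PdDistrict')

with_nan = len(toy.unique())              # NaN counted as a distinct value
without_nan = toy.nunique()               # NaN ignored by default
explicit = toy.nunique(dropna=False)      # NaN counted again

print(with_nan, without_nan, explicit)
# 3 2 3
```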

**13. Which day of the week has the most crime?**

In [None]:
for day in crime.DayOfWeek.unique():
    # .shape[0] is the row count, i.e. the number of crimes recorded on that day
    print(day, crime[crime.DayOfWeek == day].shape[0])

In [None]:
# Friday has the most crime.
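
The per-day loop can be condensed with `value_counts()`, which tallies every distinct value in one call, and `idxmax()`, which returns the label with the largest count. The rows below are invented for illustration and say nothing about the real data.

```python
import pandas as pd

# Hypothetical records; only the 'DayOfWeek' column name mirrors the lab.
toy = pd.DataFrame({
    'DayOfWeek': ['Friday', 'Monday', 'Friday', 'Sunday', 'Monday', 'Friday'],
})

per_day = toy['DayOfWeek'].value_counts()  # counts per day, sorted descending
busiest = per_day.idxmax()                 # label of the largest count

print(busiest, per_day[busiest])
# Friday 3
```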

**14. Make a new DataFrame featuring the crime categories and the number of crimes per category.**

In [None]:
categories = np.unique(crime.Category.values)
counts = [crime[crime['Category'] == x].shape[0] for x in categories]
categories = pd.DataFrame({'crime_category':categories, 'crimes':counts})
print(categories.head())
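
An alternative sketch builds the same two-column DataFrame from `value_counts()`. The category values are invented, and the output column names `crime_category` and `crimes` are chosen to match the cell above. Note one difference: the result is sorted by count (descending) rather than alphabetically by category.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'Category': ['ASSAULT', 'THEFT', 'THEFT', 'FRAUD', 'THEFT', 'ASSAULT'],
})

category_counts = (
    toy['Category']
    .value_counts()                 # Series: category -> number of rows
    .rename_axis('crime_category')  # name the index so it becomes a named column
    .reset_index(name='crimes')     # move the index into a column, name the counts
)

print(category_counts['crime_category'].tolist())
# ['THEFT', 'ASSAULT', 'FRAUD']
print(category_counts['crimes'].tolist())
# [3, 2, 1]
```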

**15. Make a DataFrame that includes the districts and crime counts per district. Which district has the most crime?**

*Hint: You can use the `.sort_values()` function to sort your DataFrame by column.*

In [None]:
districts = np.unique(crime.PdDistrict.values)
counts = [crime[crime['PdDistrict'] == x].shape[0] for x in districts]
districts = pd.DataFrame({'district':districts, 'crimes':counts})
print(districts.head())

In [None]:
districts.sort_values('crimes', ascending=False).head()
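
The counting and sorting steps can also be combined with `groupby(...).size()`. This is a sketch under stated assumptions: the rows are invented, and the `district` and `crimes` column names are picked to match the cells above.

```python
import pandas as pd

# Hypothetical records; only the 'PdDistrict' column name mirrors the lab.
toy = pd.DataFrame({
    'PdDistrict': ['SOUTHERN', 'MISSION', 'SOUTHERN', 'RICHMOND', 'SOUTHERN', 'MISSION'],
})

per_district = (
    toy.groupby('PdDistrict')
    .size()                                     # rows per district
    .reset_index(name='crimes')                 # back to a DataFrame
    .rename(columns={'PdDistrict': 'district'})
    .sort_values('crimes', ascending=False)     # largest count first
)

top = per_district.iloc[0]                      # first row after sorting
print(top['district'], top['crimes'])
# SOUTHERN 3
```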