# Course Solutions

1. [Split Apply Combine Basics](#1.-Split-Apply-Combine-Basics)
1. [Split Apply Combine More](#2.-Split-Apply-Combine-More)
1. [Merging Data](#3.-Merging-Data)
1. [Relational Databases](#4.-Relational-Databases)
1. [Case Study - Counting Pandas](#Case-Study---Counting-Pandas-Solutions)

# 1. Split Apply Combine Basics

In [4]:
college = pd.read_csv('data/college.csv')

### Problem 1
<span  style="color:green; font-size:16px">In the **`college`** DataFrame without using a groupby, which city name appears the most frequently?</span>

In [6]:
# value_counts sorts from most to least frequent, so the top row is the answer
college['CITY'].value_counts().head()

New York       87
Chicago        78
Houston        72
Los Angeles    56
Miami          51
Name: CITY, dtype: int64
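
When only the single most frequent value is needed, `idxmax` on the counts returns its label directly. Here is a minimal sketch on made-up city names; the `cities` Series is a hypothetical stand-in, not the course data.

```python
import pandas as pd

# Hypothetical stand-in for college['CITY']
cities = pd.Series(['Houston', 'New York', 'Chicago',
                    'New York', 'Houston', 'New York'])

counts = cities.value_counts()   # frequencies, most common first
top_city = counts.idxmax()       # label with the highest count
top_count = counts.max()         # how often it appears
print(top_city, top_count)
```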

### Problem 2
<span  style="color:green; font-size:16px">Does the city **`Houston`** only appear in the state of **`Texas`**?</span>

In [11]:
# NO! There is a Houston, Missouri

# .loc filters the rows with a boolean mask and selects a single column in one step
college.loc[college['CITY'] == 'Houston', 'STABBR'].value_counts()

TX    71
MO     1
Name: STABBR, dtype: int64

### Problem 3
<span  style="color:green; font-size:16px">Find the maximum undergraduate population for each state.</span>

In [14]:
# largest undergraduate population (UGDS) within each state
college.groupby('STABBR')['UGDS'].max().head()

STABBR
AK     12865.0
AL     29851.0
AR     21405.0
AS      1276.0
AZ    151558.0
Name: UGDS, dtype: float64

### Problem 4
<span  style="color:green; font-size:16px">Among colleges that have the largest undergrad population for each state, what is the difference between the most and least populous college?</span>

In [15]:
# largest undergraduate population per state (from Problem 3)
largest_per_state = college.groupby('STABBR')['UGDS'].max()

largest_per_state.max() - largest_per_state.min()

150956.0

### Problem 5: Advanced
<span  style="color:green; font-size:16px">Find the name and population of the largest college per state.</span>

In [26]:
# there are a couple of ways to do this

# first way: trim down the DataFrame and put the institution name in the index
college_instm = college.set_index('INSTNM')[['STABBR', 'UGDS']]
college_instm.head()

Unnamed: 0_level_0,STABBR,UGDS
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1
Alabama A & M University,AL,4206.0
University of Alabama at Birmingham,AL,11383.0
Amridge University,AL,291.0
University of Alabama in Huntsville,AL,5451.0
Alabama State University,AL,4811.0


In [27]:
# group by state and use idxmax
max_colleges = college_instm.groupby('STABBR')['UGDS'].idxmax()

In [29]:
college_instm.loc[max_colleges]

Unnamed: 0_level_0,STABBR,UGDS
INSTNM,Unnamed: 1_level_1,Unnamed: 2_level_1
University of Alaska Anchorage,AK,12865.0
The University of Alabama,AL,29851.0
University of Arkansas,AR,21405.0
American Samoa Community College,AS,1276.0
University of Phoenix-Arizona,AZ,151558.0
Ashford University,CA,44744.0
University of Colorado Boulder,CO,25873.0
University of Connecticut,CT,18016.0
George Washington University,DC,10433.0
University of Delaware,DE,18222.0


In [32]:
# second way

# trim data
college_trim = college[['STABBR', 'UGDS', 'INSTNM']]

# sort by state then by population descending
college_trim = college_trim.sort_values(['STABBR', 'UGDS'], ascending=[True, False])


# group by state and take the first in the group
college_trim.groupby('STABBR').first()

Unnamed: 0_level_0,UGDS,INSTNM
STABBR,Unnamed: 1_level_1,Unnamed: 2_level_1
AK,12865.0,University of Alaska Anchorage
AL,29851.0,The University of Alabama
AR,21405.0,University of Arkansas
AS,1276.0,American Samoa Community College
AZ,151558.0,University of Phoenix-Arizona
CA,44744.0,Ashford University
CO,25873.0,University of Colorado Boulder
CT,18016.0,University of Connecticut
DC,10433.0,George Washington University
DE,18222.0,University of Delaware
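
A third variant keeps the default integer index and lets `idxmax` return row labels, which `.loc` then uses to pull the full rows. This is a sketch on a hypothetical four-row table (`toy` and its values are made up).

```python
import pandas as pd

# Hypothetical miniature of the college data
toy = pd.DataFrame({
    'INSTNM': ['A College', 'B University', 'C Institute', 'D College'],
    'STABBR': ['TX', 'TX', 'CA', 'CA'],
    'UGDS': [500.0, 1200.0, 3000.0, None],
})

# idxmax returns the row label of the largest value in each group (NaN is skipped)
largest_rows = toy.groupby('STABBR')['UGDS'].idxmax()

# use those labels to select the matching rows and the columns of interest
largest = toy.loc[largest_rows, ['STABBR', 'INSTNM', 'UGDS']]
print(largest)
```

Unlike the sort-then-`first` approach, this does not reorder the original data.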


### Problem 6
<span  style="color:green; font-size:16px">Do distance-only schools tend to have larger or smaller student populations than non-distance-only schools?</span>

In [36]:
# They have more
college.groupby('DISTANCEONLY')['UGDS'].mean()

DISTANCEONLY
0.0    2334.648135
1.0    6245.743590
Name: UGDS, dtype: float64

### Problem 7
<span  style="color:green; font-size:16px">Do distance-only schools tend to be more or less religiously affiliated than non-distance-only schools?</span>

In [38]:
# Less
college.groupby('DISTANCEONLY')['RELAFFIL'].mean()

DISTANCEONLY
0.0    0.149635
1.0    0.050000
Name: RELAFFIL, dtype: float64

### Problem 8
<span  style="color:green; font-size:16px">What state has the lowest percentage of currently operating schools of those that have religious affiliation?</span>

In [45]:
# keep only the religiously affiliated schools
cr = college[college['RELAFFIL'] == 1]

# Utah. Answer makes sense.
cr.groupby(['STABBR'])['CURROPER'].mean().sort_values().head()

STABBR
UT    0.400000
AZ    0.444444
NV    0.500000
CA    0.585366
CT    0.647059
Name: CURROPER, dtype: float64

### Problem 9
<span  style="color:green; font-size:16px">Trim the **`college`** DataFrame to only the 'race' columns - those beginning with **`UGDS_`**. Create a new column called **`UGDS_OTHER`** that is the sum of any race column that averages under 4% for the entire dataset.</span>

In [51]:
# trim dataframe
df_race = college.filter(like='UGDS_')

race_average = df_race.mean()

race_average

UGDS_WHITE    0.510207
UGDS_BLACK    0.189997
UGDS_HISP     0.161635
UGDS_ASIAN    0.033544
UGDS_AIAN     0.013813
UGDS_NHPI     0.004569
UGDS_2MOR     0.023950
UGDS_NRA      0.016086
UGDS_UNKN     0.045181
dtype: float64

In [57]:
# keep only those less than 4%
other_race = race_average[race_average < .04]

other_race

UGDS_ASIAN    0.033544
UGDS_AIAN     0.013813
UGDS_NHPI     0.004569
UGDS_2MOR     0.023950
UGDS_NRA      0.016086
dtype: float64

In [58]:
# get the column names
race_columns = other_race.index

race_columns

Index(['UGDS_ASIAN', 'UGDS_AIAN', 'UGDS_NHPI', 'UGDS_2MOR', 'UGDS_NRA'], dtype='object')

In [65]:
# grab the columns and sum across the rows
df_race['UGDS_OTHER'] = df_race[race_columns].sum(axis=1)

# can drop the low percentage columns
df_race.drop(race_columns, axis=1).head(10)

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_UNKN,UGDS_OTHER
0,0.0333,0.9353,0.0055,0.0138,0.0121
1,0.5922,0.26,0.0283,0.01,0.1094
2,0.299,0.4192,0.0069,0.2715,0.0034
3,0.6988,0.1255,0.0382,0.035,0.1025
4,0.0158,0.9208,0.0121,0.0137,0.0376
5,0.7825,0.1119,0.0348,0.0026,0.0682
6,0.7255,0.2613,0.0044,0.0019,0.0069
7,0.7823,0.12,0.0191,0.0334,0.0451
8,0.5328,0.3376,0.0074,0.0246,0.0975
9,0.8507,0.0704,0.0248,0.014,0.0401


### Problem 10
<span  style="color:green; font-size:16px">Use the column **`UG25ABV`** and the **`quantile`** Series function to get 5 evenly spaced quantiles (use 6 numbers). Use this output to create a categorical variable using the **`cut`** function and label the bins Youngest, Young, Average, Old, Oldest and assign it to the **`AGEGROUP`** column.

Then find the average SAT math scores by AGEGROUP. Any surprising result?</span>

In [84]:
# six quantile edges define five bins; cut then labels each school by its bin
quants = college.UG25ABV.quantile([0, .2, .4, .6, .8, 1])
college['AGEGROUP'] = pd.cut(college.UG25ABV, quants, labels=['Youngest', 'Young', 'Average', 'Old', 'Oldest'])
college.groupby('AGEGROUP')['SATMTMID'].mean()

AGEGROUP
Youngest    547.380645
Young       502.910072
Average     491.941176
Old         481.230769
Oldest      482.333333
Name: SATMTMID, dtype: float64
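
One detail worth knowing: `pd.cut` builds right-closed bins by default, so a value equal to the lowest edge (the minimum) falls outside the first bin unless `include_lowest=True` is passed. `pd.qcut` computes the quantile edges itself and includes the minimum. A small sketch on hypothetical values 1 through 10:

```python
import pandas as pd

s = pd.Series(range(1, 11))   # hypothetical values 1..10
labels = ['Youngest', 'Young', 'Average', 'Old', 'Oldest']
edges = s.quantile([0, .2, .4, .6, .8, 1]).to_numpy()

# default bins are (left, right], so the minimum is left unassigned (NaN)
cut_default = pd.cut(s, edges, labels=labels)

# include_lowest=True closes the first bin on the left
cut_inclusive = pd.cut(s, edges, labels=labels, include_lowest=True)

# qcut derives the same quantile edges on its own
qcut_groups = pd.qcut(s, 5, labels=labels)

print(cut_default.isna().sum(), cut_inclusive.isna().sum())
```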

### Problem 11
<span  style="color:green; font-size:16px">Which are the top 5 historically black colleges with the highest white percentage?</span>

In [90]:
# filter to HBCUs, keep two columns, then sort by white percentage, largest first
college.loc[college.HBCU == 1, ['INSTNM', 'UGDS_WHITE']].sort_values('UGDS_WHITE', ascending=False).head()

Unnamed: 0,INSTNM,UGDS_WHITE
4021,Bluefield State College,0.8437
17,Gadsden State Community College,0.6921
4050,West Virginia State University,0.5816
48,Shelton State Community College,0.5613
55,H Councill Trenholm State Community College,0.3951


### Problem 12: Advanced
<span  style="color:green; font-size:16px">Again make a DataFrame of all the race percentage columns. Read the documentation on the **`mul`** DataFrame method and use it to multiply the race percentage DataFrame by the undergraduate population (**`UGDS`**) to get the actual population of each race.</span>

In [91]:
# select the race percentage columns
df_race = college.filter(like='UGDS_')

In [103]:
df_race.mul(college['UGDS'], axis=0).round(0).head(15)

Unnamed: 0,UGDS_WHITE,UGDS_BLACK,UGDS_HISP,UGDS_ASIAN,UGDS_AIAN,UGDS_NHPI,UGDS_2MOR,UGDS_NRA,UGDS_UNKN
0,140.0,3934.0,23.0,8.0,10.0,8.0,0.0,25.0,58.0
1,6741.0,2960.0,322.0,590.0,25.0,8.0,419.0,204.0,114.0
2,87.0,122.0,2.0,1.0,0.0,0.0,0.0,0.0,79.0
3,3809.0,684.0,208.0,205.0,78.0,1.0,94.0,181.0,191.0
4,76.0,4430.0,58.0,9.0,5.0,3.0,47.0,117.0,66.0
5,23358.0,3340.0,1039.0,316.0,113.0,27.0,779.0,800.0,78.0
6,1155.0,416.0,7.0,4.0,7.0,0.0,0.0,0.0,3.0
7,2340.0,359.0,57.0,16.0,47.0,3.0,52.0,17.0,100.0
8,2293.0,1453.0,32.0,95.0,19.0,7.0,128.0,171.0,106.0
9,17451.0,1444.0,509.0,466.0,152.0,0.0,0.0,205.0,287.0
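
The `axis=0` argument is what makes this work: it aligns the Series with the row index, so each row is scaled by its own total. A minimal sketch with hypothetical shares and totals for three schools:

```python
import pandas as pd

# Hypothetical shares and totals for three schools
shares = pd.DataFrame({'UGDS_WHITE': [0.5, 0.2, 0.9],
                       'UGDS_BLACK': [0.5, 0.8, 0.1]})
totals = pd.Series([1000, 200, 50])

# axis=0 matches the Series index against the row labels
counts = shares.mul(totals, axis=0)

# the default (axis='columns') would instead try to match the
# Series index against the column names
print(counts.round(0))
```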


# 2. Split Apply Combine More

In [1]:
import pandas as pd
import numpy as np

pd.options.display.max_columns = 40
hou = pd.read_csv('data/coh_employee.csv')

### Problem 1
<span  style="color:green; font-size:16px">What are the 5 least common departments?</span>

In [2]:
hou.DEPARTMENT.value_counts().tail()

Houston Information Tech Svcs    9
Planning & Development           7
Mayor's Office                   5
City Controller's Office         5
Convention and Entertainment     1
Name: DEPARTMENT, dtype: int64

### Problem 2
<span  style="color:green; font-size:16px">Filter out departments with fewer than 50 occurrences and save the result to **`hou_filter`**. Then test your code by outputting the frequencies of all the remaining departments.</span>

In [6]:
hou_filter = hou.groupby('DEPARTMENT').filter(lambda df: len(df) >= 50)

hou_filter.DEPARTMENT.value_counts()

Houston Police Department-HPD     638
Houston Fire Department (HFD)     384
Public Works & Engineering-PWE    343
Health & Human Services           110
Houston Airport System (HAS)      106
Parks & Recreation                 74
Name: DEPARTMENT, dtype: int64
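
The same row selection can be done without a lambda by broadcasting each group's size back to its rows with `transform('size')`. This is a sketch on a hypothetical six-row table, with the threshold lowered to 2 to suit the toy data.

```python
import pandas as pd

# Hypothetical employee table
emp = pd.DataFrame({
    'DEPARTMENT': ['Police', 'Police', 'Police', 'Library', 'Fire', 'Fire'],
    'BASE_SALARY': [50, 60, 70, 30, 55, 65],
})

# for every row, the number of rows in that row's department
group_size = emp.groupby('DEPARTMENT')['DEPARTMENT'].transform('size')

# keep rows whose department has at least 2 employees (arbitrary toy threshold)
emp_filter = emp[group_size >= 2]
print(emp_filter['DEPARTMENT'].value_counts())
```

On large tables this boolean-mask form tends to be faster than `filter` with a Python function, since it avoids calling the function once per group.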

### Problem 3
<span  style="color:green; font-size:16px">Filter out departments from the original **`hou`** DataFrame with average salaries less than $70,000 and save the result to **`hou_filter_salary`**. Then test your code by outputting the average salaries for the remaining departments.</span>

In [7]:
hou_filter_salary = hou.groupby('DEPARTMENT').filter(lambda df: df['BASE_SALARY'].mean() >= 70000)

In [12]:
# added astype(int) to remove decimals
hou_filter_salary.groupby('DEPARTMENT')['BASE_SALARY'].mean().astype(int)

DEPARTMENT
Finance                           79650
Houston Information Tech Svcs     76112
Legal Department                 104959
Mayor's Office                    86489
Name: BASE_SALARY, dtype: int64

### Problem 4
<span  style="color:green; font-size:16px">Filter *`for`* those departments from the original **`hou`** DataFrame with average salaries of at least 65,000 or with at least 25 unique position titles. Save the result to **`hou_more`**.</span>

In [58]:
def more(df):
    # keep a department if its mean salary is at least 65,000 or it has 25+ unique titles
    return (df['BASE_SALARY'].mean() >= 65000) or (df['POSITION_TITLE'].nunique() >= 25)

hou_more = hou.groupby('DEPARTMENT').filter(more)

In [59]:
hou_more.shape

(1696, 31)

### Problem 5: Advanced
<span  style="color:green; font-size:16px">Find a way to do problem 4 without using the **`filter`** method. Make clever use of groupby aggregation and boolean logic.</span>

In [42]:
# do aggregations for each boolean piece separately
salary_grp = hou.groupby('DEPARTMENT')['BASE_SALARY'].mean()
uniq_grp = hou.groupby('DEPARTMENT')['POSITION_TITLE'].nunique()

salary_grp.head()

DEPARTMENT
Admn. & Regulatory Affairs      50890.551724
City Controller's Office        55711.600000
City Council                    59089.222222
Convention and Entertainment    38397.000000
Dept of Neighborhoods (DON)     47092.882353
Name: BASE_SALARY, dtype: float64

In [44]:
uniq_grp.head()

DEPARTMENT
Admn. & Regulatory Affairs      19
City Controller's Office         3
City Council                     7
Convention and Entertainment     1
Dept of Neighborhoods (DON)     14
Name: POSITION_TITLE, dtype: int64

In [60]:
# create boolean criteria

deps = (salary_grp >= 65000) | (uniq_grp >= 25)

deps

DEPARTMENT
Admn. & Regulatory Affairs        False
City Controller's Office          False
City Council                      False
Convention and Entertainment      False
Dept of Neighborhoods (DON)       False
Finance                            True
Fleet Management Department       False
General Services Department       False
Health & Human Services            True
Housing and Community Devp.       False
Houston Airport System (HAS)       True
Houston Emergency Center (HEC)    False
Houston Fire Department (HFD)      True
Houston Information Tech Svcs      True
Houston Police Department-HPD      True
Human Resources Dept.             False
Legal Department                   True
Library                           False
Mayor's Office                     True
Municipal Courts Department       False
Parks & Recreation                 True
Planning & Development            False
Public Works & Engineering-PWE     True
Solid Waste Management            False
dtype: bool

In [71]:
# filter Series with itself and grab index values
deps_true = deps[deps].index.values

deps_true

array(['Finance', 'Health & Human Services',
       'Houston Airport System (HAS)', 'Houston Fire Department (HFD)',
       'Houston Information Tech Svcs', 'Houston Police Department-HPD',
       'Legal Department', "Mayor's Office", 'Parks & Recreation',
       'Public Works & Engineering-PWE'], dtype=object)

In [62]:
hou_more_check = hou[hou.DEPARTMENT.isin(deps_true)]

In [70]:
# can check equality of dataframes with equals method
hou_more.equals(hou_more_check)

True
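
Another route skips the intermediate list of department names: `transform` broadcasts each group statistic back to every row, giving a row-level boolean mask. A sketch on hypothetical data, with the title threshold scaled down to 3 to fit the toy table:

```python
import pandas as pd

emp = pd.DataFrame({
    'DEPARTMENT': ['Legal', 'Legal', 'Parks', 'Parks', 'Parks', 'Library'],
    'BASE_SALARY': [90000, 70000, 30000, 40000, 35000, 45000],
    'POSITION_TITLE': ['ATTORNEY', 'CLERK', 'RANGER', 'GARDENER', 'GUIDE', 'LIBRARIAN'],
})

grouped = emp.groupby('DEPARTMENT')
dept_mean = grouped['BASE_SALARY'].transform('mean')         # one value per row
dept_titles = grouped['POSITION_TITLE'].transform('nunique')

# hypothetical thresholds chosen for the toy data
mask = (dept_mean >= 65000) | (dept_titles >= 3)
via_transform = emp[mask]

# same selection with filter, for comparison
via_filter = grouped.filter(lambda g: g['BASE_SALARY'].mean() >= 65000
                            or g['POSITION_TITLE'].nunique() >= 3)
print(via_transform.equals(via_filter))
```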

### Problem 6: Advanced
<span  style="color:green; font-size:16px">Group by department, gender and race and get the mean, min and max base salary for each group. Also get the number of unique position titles and the most frequent position title for each group. Rename each aggregation to something that makes sense. Then remove the top level of the column index. Hint: This [stackoverflow answer](http://stackoverflow.com/questions/15222754/group-by-pandas-dataframe-and-select-most-common-string-factor) will be useful </span>

In [72]:
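# Renaming with a nested dict inside agg was deprecated in pandas 0.20 and removed in 1.0,
# so aggregate with lists here and rename after dropping the top column level.
def most_frequent_position(s):
    # value_counts sorts by frequency, so the first label is the most common title
    return s.value_counts().index[0]

df = hou.groupby(['DEPARTMENT', 'GENDER', 'RACE']).agg({
    'BASE_SALARY': ['min', 'mean', 'max'],
    'POSITION_TITLE': ['nunique', most_frequent_position]})

In [73]:
df.columns = df.columns.droplevel(0)
df = df.rename(columns={'min': 'salary_min', 'mean': 'salary_mean',
                        'max': 'salary_max', 'nunique': 'unique_positions'})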

In [74]:
df.head(12)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,salary_min,salary_mean,salary_max,unique_positions,most_frequent_position
DEPARTMENT,GENDER,RACE,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Admn. & Regulatory Affairs,Female,Asian/Pacific Islander,37710.0,72293.666667,130416.0,3,3-1-1 TELECOMMUNICATOR SUPERVISOR
Admn. & Regulatory Affairs,Female,Black or African American,33550.0,49727.5,72741.0,8,STAFF ANALYST
Admn. & Regulatory Affairs,Female,Hispanic/Latino,28205.0,36616.4,47341.0,4,ANIMAL CARE TECHNICIAN
Admn. & Regulatory Affairs,Female,White,33280.0,47664.666667,62129.0,3,ANIMAL ENFORCEMENT OFFICER TRAINEE
Admn. & Regulatory Affairs,Male,Black or African American,29557.0,29827.5,30098.0,2,ANIMAL CARE TECHNICIAN
Admn. & Regulatory Affairs,Male,Hispanic/Latino,35318.0,35318.0,35318.0,1,CUSTOMER SERVICE REPRESENTATIVE I
Admn. & Regulatory Affairs,Male,White,103776.0,122096.0,140416.0,2,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV
City Controller's Office,Female,Asian/Pacific Islander,59077.0,59077.0,59077.0,1,ASSISTANT CITY CONTROLLER III
City Controller's Office,Female,Black or African American,55536.0,56295.0,57054.0,1,ADMINISTRATIVE ASSISTANT
City Controller's Office,Female,Hispanic/Latino,64251.0,64251.0,64251.0,1,ADMINISTRATIVE ASSISTANT
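
Named aggregation (pandas 0.25 and later) is another way to rename while aggregating. It yields flat column names directly, so there is no column level to drop. A sketch on a hypothetical five-row table:

```python
import pandas as pd

emp = pd.DataFrame({
    'DEPARTMENT': ['Legal', 'Legal', 'Legal', 'Parks', 'Parks'],
    'BASE_SALARY': [90000.0, 70000.0, 80000.0, 30000.0, 40000.0],
    'POSITION_TITLE': ['ATTORNEY', 'ATTORNEY', 'CLERK', 'RANGER', 'RANGER'],
})

# each keyword is an output column: name=(input column, aggregation)
summary = emp.groupby('DEPARTMENT').agg(
    salary_min=('BASE_SALARY', 'min'),
    salary_mean=('BASE_SALARY', 'mean'),
    salary_max=('BASE_SALARY', 'max'),
    unique_positions=('POSITION_TITLE', 'nunique'),
    # value_counts sorts by frequency, so index[0] is the most common title
    most_frequent_position=('POSITION_TITLE', lambda s: s.value_counts().index[0]),
)
print(summary)
```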


### Problem 7
<span  style="color:green; font-size:16px"> Create a column **`is_max`** that is equal to 1 if the base salary is currently the max base salary (out of all previous rows) for that department and 0 otherwise. See sample data below.</span>

In [78]:
hou_1 = hou[['DEPARTMENT', 'BASE_SALARY']].copy()
hou_1['is_max'] = hou_1.groupby('DEPARTMENT')['BASE_SALARY'].transform(lambda x: x == x.cummax())

In [80]:
hou_1.head(20)

Unnamed: 0,DEPARTMENT,BASE_SALARY,is_max
0,Municipal Courts Department,121862.0,1.0
1,Library,26125.0,1.0
2,Houston Police Department-HPD,45279.0,1.0
3,Houston Fire Department (HFD),63166.0,1.0
4,General Services Department,56347.0,1.0
5,Houston Police Department-HPD,66614.0,1.0
6,Public Works & Engineering-PWE,71680.0,1.0
7,Houston Airport System (HAS),42390.0,1.0
8,Public Works & Engineering-PWE,107962.0,1.0
9,Houston Airport System (HAS),44616.0,1.0
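
The groupby object also exposes `cummax` directly, which avoids the lambda and makes it easy to cast the result to the 1/0 integers the problem asks for. A sketch on hypothetical salaries:

```python
import pandas as pd

pay = pd.DataFrame({'DEPARTMENT': ['A', 'B', 'A', 'A', 'B', 'A'],
                    'BASE_SALARY': [50.0, 30.0, 40.0, 60.0, 30.0, 55.0]})

# running maximum within each department, in row order
running_max = pay.groupby('DEPARTMENT')['BASE_SALARY'].cummax()

# 1 where the row equals the running maximum so far, 0 otherwise
# (a salary that ties the current maximum also counts as 1)
pay['is_max'] = (pay['BASE_SALARY'] == running_max).astype(int)
print(pay)
```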


### Problem 8: Advanced
<span  style="color:green; font-size:16px"> Programmatically find the 10th occurrence of 0 for **`is_max`** and return a DataFrame that ends after the tenth occurrence.</span>

In [92]:
hou_1['occur'] = hou_1.groupby('is_max').cumcount()
hou_1.head(20)

Unnamed: 0,DEPARTMENT,BASE_SALARY,is_max,occur
0,Municipal Courts Department,121862.0,1.0,0
1,Library,26125.0,1.0,1
2,Houston Police Department-HPD,45279.0,1.0,2
3,Houston Fire Department (HFD),63166.0,1.0,3
4,General Services Department,56347.0,1.0,4
5,Houston Police Department-HPD,66614.0,1.0,5
6,Public Works & Engineering-PWE,71680.0,1.0,6
7,Houston Airport System (HAS),42390.0,1.0,7
8,Public Works & Engineering-PWE,107962.0,1.0,8
9,Houston Airport System (HAS),44616.0,1.0,9


In [97]:
# note: cumcount starts at 0, so occur == 10 marks the 11th zero; occur == 9 would mark the 10th
idx_10 = hou_1.index[(hou_1.occur == 10) & (hou_1.is_max == 0)][0]

In [98]:
idx_10

23

In [107]:
hou_1.loc[:idx_10]

Unnamed: 0,DEPARTMENT,BASE_SALARY,is_max,occur
0,Municipal Courts Department,121862.0,1.0,0
1,Library,26125.0,1.0,1
2,Houston Police Department-HPD,45279.0,1.0,2
3,Houston Fire Department (HFD),63166.0,1.0,3
4,General Services Department,56347.0,1.0,4
5,Houston Police Department-HPD,66614.0,1.0,5
6,Public Works & Engineering-PWE,71680.0,1.0,6
7,Houston Airport System (HAS),42390.0,1.0,7
8,Public Works & Engineering-PWE,107962.0,1.0,8
9,Houston Airport System (HAS),44616.0,1.0,9
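
A cumulative sum of the condition is an alternative to `cumcount` that counts from 1 rather than 0, which makes the "n-th occurrence" test easier to read. A sketch on a short hypothetical flag Series, stopping at the 3rd zero:

```python
import pandas as pd

flags = pd.Series([1, 0, 1, 0, 0, 1, 0, 1])   # hypothetical is_max values
n = 3                                          # which occurrence of 0 to stop at

# running count of zeros seen so far (1-based)
zeros_seen = (flags == 0).cumsum()

# first label where the count reaches n and the row itself is a zero
stop = flags.index[(zeros_seen == n) & (flags == 0)][0]

# .loc slicing includes the end label
print(flags.loc[:stop])
```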


### Problem 9
<span  style="color:green; font-size:16px"> Write a function that accepts a single argument that will filter **`hou_1`** for a specific department where **`is_max`** is 1. Test your function with departments like 'Library' and 'Public Works & Engineering-PWE'.</span>

In [102]:
def filter_dep(dep):
    criteria = (hou_1.DEPARTMENT == dep) & (hou_1.is_max == 1)
    return hou_1[criteria]

In [103]:
filter_dep('Library')

Unnamed: 0,DEPARTMENT,BASE_SALARY,is_max,occur
1,Library,26125.0,1.0,1
280,Library,31034.0,1.0,49
298,Library,38563.0,1.0,51
308,Library,49317.0,1.0,53
412,Library,79302.0,1.0,61
1165,Library,107763.0,1.0,78


In [105]:
filter_dep('Public Works & Engineering-PWE')

Unnamed: 0,DEPARTMENT,BASE_SALARY,is_max,occur
6,Public Works & Engineering-PWE,71680.0,1.0,6
8,Public Works & Engineering-PWE,107962.0,1.0,8
186,Public Works & Engineering-PWE,110881.0,1.0,39
297,Public Works & Engineering-PWE,141948.0,1.0,50
1067,Public Works & Engineering-PWE,146141.0,1.0,77
1232,Public Works & Engineering-PWE,178331.0,1.0,80


### Problem 10
<span  style="color:green; font-size:16px">A good skill to have is to ask yourself a difficult question and then answer it. Ask yourself a question that involves grouping and answer it.</span>

In [None]:
# fill in your answer here

# 3. Merging Data

In [108]:
import pandas as pd
import numpy as np

college = pd.read_csv('data/college.csv')
pd.options.display.max_columns = 40

### Problem 1
<span  style="color:green; font-size:16px">Slice the **`college`** DataFrame to get the first 5 rows and 10 columns. Then insert a column called **`SAT_AVG`**, the average of the math and verbal SAT scores, immediately before the **`SATVRMID`** column.</span>

In [122]:
# get slice (copy it so the insert below modifies an independent DataFrame)
college_slice = college.iloc[:5, :10].copy()

# get location
insert_num = college_slice.columns.get_loc('SATVRMID')

# get new column values
avg = (college_slice.SATMTMID + college_slice.SATVRMID) / 2

# insert at specific location
college_slice.insert(insert_num, 'SAT_AVG', avg)

# output
college_slice

Unnamed: 0,INSTNM,CITY,STABBR,HBCU,MENONLY,WOMENONLY,RELAFFIL,SAT_AVG,SATVRMID,SATMTMID,DISTANCEONLY
0,Alabama A & M University,Normal,AL,1.0,0.0,0.0,0,422.0,424.0,420.0,0.0
1,University of Alabama at Birmingham,Birmingham,AL,0.0,0.0,0.0,0,567.5,570.0,565.0,0.0
2,Amridge University,Montgomery,AL,0.0,0.0,0.0,1,,,,1.0
3,University of Alabama in Huntsville,Huntsville,AL,0.0,0.0,0.0,0,592.5,595.0,590.0,0.0
4,Alabama State University,Montgomery,AL,1.0,0.0,0.0,0,427.5,425.0,430.0,0.0


### Problem 2
<span  style="color:green; font-size:16px">Read in all three stock csv files and concatenate them horizontally and vertically. Create a hierarchical index that labels each year of data.</span>

In [123]:
stocks_2014 = pd.read_csv('data/stocks/stocks_2014.csv', index_col='Symbol')
stocks_2015 = pd.read_csv('data/stocks/stocks_2015.csv', index_col='Symbol')
stocks_2016 = pd.read_csv('data/stocks/stocks_2016.csv', index_col='Symbol')

In [124]:
pd.concat([stocks_2014, stocks_2015, stocks_2016], keys=['2014', '2015', '2016'])

Unnamed: 0_level_0,Unnamed: 1_level_0,High,Low,Shares,Short
Unnamed: 0_level_1,Symbol,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2014,AAPL,110,95,80,
2014,TSLA,130,80,50,
2014,WMT,70,55,40,
2015,AAPL,140,120,50,
2015,GE,40,30,100,
2015,IBM,95,75,87,
2015,SLB,85,55,20,
2015,TXN,23,15,500,
2015,TSLA,300,100,100,
2016,AAPL,120,90,30,0.2


In [125]:
pd.concat([stocks_2014, stocks_2015, stocks_2016], keys=['2014', '2015', '2016'], axis=1)

Unnamed: 0_level_0,2014,2014,2014,2015,2015,2015,2016,2016,2016,2016
Unnamed: 0_level_1,Shares,Low,High,Shares,Low,High,Shares,Low,High,Short
AAPL,80.0,95.0,110.0,50.0,120.0,140.0,30,90,120,0.2
GE,,,,100.0,30.0,40.0,120,35,50,0.1
IBM,,,,87.0,75.0,95.0,90,89,101,0.3
SLB,,,,20.0,55.0,85.0,30,60,90,0.25
TSLA,50.0,80.0,130.0,100.0,100.0,300.0,300,130,245,0.4
TXN,,,,500.0,15.0,23.0,300,17,22,0.05
WMT,40.0,55.0,70.0,,,,40,55,70,0.1


### Problem 3
<span  style="color:green; font-size:16px">Take a look at the DataFrame below. Count the total appearances of each letter.</span>

In [134]:
from string import ascii_lowercase
np.random.seed(1)
df = pd.DataFrame(np.random.choice(list(ascii_lowercase), (20,5), replace=True), 
                  columns = ['col1', 'col2', 'col3', 'col4', 'col5'])

df

Unnamed: 0,col1,col2,col3,col4,col5
0,f,l,m,i,j
1,l,f,p,a,q
2,b,m,h,n,g
3,z,s,u,f,s
4,u,l,k,o,s
5,e,x,x,j,r
6,x,a,w,n,j
7,j,h,w,z,b
8,a,r,i,y,n
9,t,p,k,z,i


In [135]:
# make one long series and then use value_counts
pd.concat([df.col1, df.col2, df.col3, df.col4, df.col5]).value_counts()

z    8
x    7
h    6
p    6
j    6
a    5
i    5
r    4
e    4
k    4
f    4
t    4
l    4
n    4
s    4
g    3
u    3
w    3
d    3
m    3
b    2
c    2
q    2
y    2
v    1
o    1
dtype: int64
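
Listing every column by hand does not scale to wide tables. Flattening the underlying array works for any number of columns. A sketch on a hypothetical 3x3 table of letters:

```python
import pandas as pd

letters = pd.DataFrame({'col1': ['a', 'b', 'a'],
                        'col2': ['b', 'b', 'c'],
                        'col3': ['a', 'c', 'a']})

# flatten all cells into one long Series, then count
flat = pd.Series(letters.to_numpy().ravel())
letter_counts = flat.value_counts()
print(letter_counts)
```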

### Problem 4
<span  style="color:green; font-size:16px">Each Series below represents the amount of TV watched for each sport. Combine all Series so that each column represents a different labeled day. Fill in the missing values with 0. Save it to **`df_sports`**</span>

In [139]:
day1 = pd.Series({'soccer':45, 'basketball':30, 'tennis':10})
day2 = pd.Series({'soccer':55, 'basketball':10, 'bowling':10, 'volleyball':30})
day3 = pd.Series({'soccer':15, 'basketball':20, 'volleyball':40})
day4 = pd.Series({'bowling':100, 'volleyball':20, 'basketball':1})

In [140]:
df_sports = pd.concat([day1, day2, day3, day4],axis=1, keys=['day1', 'day2', 'day3', 'day4'])

df_sports

Unnamed: 0,day1,day2,day3,day4
basketball,30.0,10.0,20.0,1.0
bowling,,10.0,,100.0
soccer,45.0,55.0,15.0,
tennis,10.0,,,
volleyball,,30.0,40.0,20.0
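
The problem also asks for missing values to be filled with 0. The output above still shows NaN, and the later null-counting problem depends on those NaN values, so one option is to build a filled copy under a separate name. The name `df_sports_filled` is an assumption made for this sketch.

```python
import pandas as pd

day1 = pd.Series({'soccer': 45, 'basketball': 30, 'tennis': 10})
day2 = pd.Series({'soccer': 55, 'basketball': 10, 'bowling': 10, 'volleyball': 30})
day3 = pd.Series({'soccer': 15, 'basketball': 20, 'volleyball': 40})
day4 = pd.Series({'bowling': 100, 'volleyball': 20, 'basketball': 1})

combined = pd.concat([day1, day2, day3, day4], axis=1,
                     keys=['day1', 'day2', 'day3', 'day4'])

# sports not watched on a given day come back as NaN;
# replace them with 0 and restore the integer dtype
df_sports_filled = combined.fillna(0).astype(int)
print(df_sports_filled)
```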


### Problem 5
<span  style="color:green; font-size:16px">Use **`df_sports`** to find the total TV watched per sport for all the days and also the total amount of TV watched per day. Sort both results from greatest to least.</span>

In [145]:
df_sports.sum().sort_values(ascending=False)

day4    121.0
day2    105.0
day1     85.0
day3     75.0
dtype: float64

In [146]:
df_sports.sum(axis=1).sort_values(ascending=False)

soccer        115.0
bowling       110.0
volleyball     90.0
basketball     61.0
tennis         10.0
dtype: float64

### Problem 6
<span  style="color:green; font-size:16px">Look up the method **`isnull`** and count the number of nulls per sport.</span>

In [147]:
df_sports.isnull().sum(axis=1)

basketball    0
bowling       2
soccer        1
tennis        3
volleyball    1
dtype: int64

### Problem 7
<span  style="color:green; font-size:16px">Combine all Series again, keeping only the sports that have no missing values for any days.</span>

In [148]:
pd.concat([day1, day2, day3, day4], axis=1, keys=['day1', 'day2', 'day3', 'day4'], join='inner')

Unnamed: 0,day1,day2,day3,day4
basketball,30,10,20,1


In [157]:
df_null_summary = df.isnull().mean()

In [166]:
df_1 = pd.concat([df.isnull().sum(), df.isnull().mean()], axis=1, keys=['Missing Values', 'Percentage'])

In [168]:
df.columns = pd.MultiIndex.from_arrays([range(len(df.columns)), df.columns])

In [170]:
pd.concat([df.isnull().sum(), df.isnull().mean()], axis=1, keys=['Missing Values', 'Percentage'])

Unnamed: 0,Unnamed: 1,Missing Values,Percentage
0,col1,0,0.0
1,col2,0,0.0
2,col3,0,0.0
3,col4,0,0.0
4,col5,0,0.0


In [173]:
df[df.columns[df.isnull().mean() < .4]]

Unnamed: 0_level_0,0,1,2,3,4
Unnamed: 0_level_1,col1,col2,col3,col4,col5
0,f,l,m,i,j
1,l,f,p,a,q
2,b,m,h,n,g
3,z,s,u,f,s
4,u,l,k,o,s
5,e,x,x,j,r
6,x,a,w,n,j
7,j,h,w,z,b
8,a,r,i,y,n
9,t,p,k,z,i


# 4. Relational Databases

### Problem 1
<span  style="color:green; font-size:16px">How many media types does each track have? Answer this by looking at the data diagram and then programmatically.</span>

In [177]:
# the diagram indicates each track has one and only one media type.
tracks = pd.read_csv('data/chinook/tracks.csv')
media = pd.read_csv('data/chinook/media_types.csv')

In [179]:
track_media = tracks.merge(media, on='MediaTypeId')

In [186]:
# if every TrackId appears exactly once after the merge, each track has exactly one media type
track_media['TrackId'].value_counts().head(10)

2047    1
2624    1
597     1
2644    1
593     1
2640    1
589     1
2636    1
585     1
2632    1
Name: TrackId, dtype: int64
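
`merge` can also enforce the expected relationship itself through its `validate` argument, and `is_unique` gives a one-line check instead of scanning the counts. A sketch on hypothetical miniature tables (`tracks_toy` and `media_toy` are made up):

```python
import pandas as pd

tracks_toy = pd.DataFrame({'TrackId': [1, 2, 3], 'MediaTypeId': [1, 1, 2]})
media_toy = pd.DataFrame({'MediaTypeId': [1, 2],
                          'MediaName': ['MPEG audio file', 'AAC audio file']})

# validate='many_to_one' raises a MergeError if MediaTypeId
# is not unique in the right-hand table
merged = tracks_toy.merge(media_toy, on='MediaTypeId', validate='many_to_one')

# True when no TrackId was duplicated by the merge
print(merged['TrackId'].is_unique)
```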

### Problem 2
<span  style="color:green; font-size:16px">Which track has sold the most copies?</span>

In [232]:
invoice_items = pd.read_csv('data/chinook/invoice_items.csv')

In [233]:
tracks_invoice_items = tracks.merge(invoice_items, on='TrackId')

In [234]:
# total quantity sold per track, largest first
track_quantity = tracks_invoice_items.groupby('TrackId')['Quantity'].sum().sort_values(ascending=False)

track_quantity.head()

TrackId
2031    38
184     36
1888    36
925     36
2352    36
Name: Quantity, dtype: int64

In [240]:
tracks_invoice_items[tracks_invoice_items.TrackId == 2031]

Unnamed: 0,TrackId,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice_x,InvoiceLineId,InvoiceId,UnitPrice_y,Quantity
1318,2031,Nossa Gente (Avisa Là),166,1,7,,188212,6233201,1.25,907,166,1.25,19
1319,2031,Nossa Gente (Avisa Là),166,1,7,,188212,6233201,1.25,2058,380,1.25,19


### Problem 3
<span  style="color:green; font-size:16px">Which playlist has the most tracks?</span>

In [242]:
playlists = pd.read_csv('data/chinook/playlists.csv')
playlist_tracks = pd.read_csv('data/chinook/playlist_track.csv')

In [245]:
playlist_trackid = playlists.merge(playlist_tracks, on = 'PlaylistId')
playlist_trackid.head(10)

Unnamed: 0,PlaylistId,Name,TrackId
0,1,Music,3402
1,1,Music,3389
2,1,Music,3390
3,1,Music,3391
4,1,Music,3392
5,1,Music,3393
6,1,Music,3394
7,1,Music,3395
8,1,Music,3396
9,1,Music,3397


In [246]:
playlist_trackid['Name'].value_counts()

Music                         6580
90’s Music                    1477
TV Shows                       426
Classical                       75
Brazilian Music                 39
Heavy Metal Classic             26
Classical 101 - Next Steps      25
Classical 101 - Deep Cuts       25
Classical 101 - The Basics      25
Grunge                          15
On-The-Go 1                      1
Music Videos                     1
Name: Name, dtype: int64

### Problem 4
<span  style="color:green; font-size:16px">Which playlist with at least 15 tracks has, on average, the most expensive tracks?</span>

In [251]:
playlist_track_price = playlist_trackid.merge(tracks[['TrackId', 'UnitPrice']], on='TrackId')

In [260]:
# len(x) counts rows (tracks); x.size would count rows * columns
playlist_track_price_filtered = playlist_track_price.groupby('Name').filter(lambda x: len(x) >= 15)

In [261]:
playlist_track_price_filtered.groupby('Name')['UnitPrice'].mean().sort_values(ascending=False)

Name
Brazilian Music               1.021795
TV Shows                      1.001455
Music                         0.997860
Classical 101 - Deep Cuts     0.995600
Grunge                        0.995333
90’s Music                    0.993521
Heavy Metal Classic           0.987692
Classical                     0.952933
Classical 101 - The Basics    0.937200
Classical 101 - Next Steps    0.926000
Name: UnitPrice, dtype: float64
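
When thresholding on the number of tracks, keep in mind that `DataFrame.size` counts cells (rows times columns), while `len` and `groupby(...).size()` count rows. A sketch on a hypothetical playlist table:

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Grunge'] * 4 + ['Classical'] * 2,
                    'TrackId': range(6),
                    'UnitPrice': [0.99] * 6})

grunge = toy[toy['Name'] == 'Grunge']
n_rows = len(grunge)      # number of rows (tracks)
n_cells = grunge.size     # rows * columns

# groupby(...).size() counts rows per group
tracks_per_playlist = toy.groupby('Name').size()

# keep playlists with at least 3 tracks (toy threshold)
big_enough = toy.groupby('Name').filter(lambda g: len(g) >= 3)
print(n_rows, n_cells)
```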

### Problem 5: Advanced
<span  style="color:green; font-size:16px">Find the most sold genre per country.</span>

In [263]:
customers = pd.read_csv('data/chinook/customers.csv')
invoices = pd.read_csv('data/chinook/invoices.csv')

In [273]:
inv_cust =  invoices.merge(customers, on='CustomerId')

In [276]:
inv_cust.head()

Unnamed: 0,InvoiceId,CustomerId,InvoiceDate,BillingAddress,BillingCity,BillingState,BillingCountry,BillingPostalCode,Total,FirstName,LastName,Company,Address,City,State,Country,PostalCode,Phone,Fax,Email,SupportRepId
0,1,2,2009-01-01 00:00:00,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,1.98,Leonie,Köhler,,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,+49 0711 2842222,,leonekohler@surfeu.de,5
1,12,2,2009-02-11 00:00:00,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,13.86,Leonie,Köhler,,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,+49 0711 2842222,,leonekohler@surfeu.de,5
2,67,2,2009-10-12 00:00:00,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,8.91,Leonie,Köhler,,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,+49 0711 2842222,,leonekohler@surfeu.de,5
3,196,2,2011-05-19 00:00:00,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,1.98,Leonie,Köhler,,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,+49 0711 2842222,,leonekohler@surfeu.de,5
4,219,2,2011-08-21 00:00:00,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,3.96,Leonie,Köhler,,Theodor-Heuss-Straße 34,Stuttgart,,Germany,70174,+49 0711 2842222,,leonekohler@surfeu.de,5


In [277]:
# select only the columns needed
inv_cust = inv_cust[['InvoiceId', 'Country']]

In [280]:
inv_cust_quant = inv_cust.merge(invoice_items, on='InvoiceId')
inv_cust_quant.head(10)

Unnamed: 0,InvoiceId,Country,InvoiceLineId,TrackId,UnitPrice,Quantity
0,1,Germany,1,2,1.25,15
1,1,Germany,2,4,1.25,1
2,12,Germany,60,331,0.99,19
3,12,Germany,61,340,0.75,2
4,12,Germany,62,349,0.99,11
5,12,Germany,63,358,0.99,1
6,12,Germany,64,367,0.99,8
7,12,Germany,65,376,0.99,1
8,12,Germany,66,385,0.75,18
9,12,Germany,67,394,0.99,15


In [281]:
# keep only relevant columns
inv_cust_quant = inv_cust_quant[['TrackId', 'Country', 'Quantity']]

In [286]:
inv_cust_quant_track = inv_cust_quant.merge(tracks, on='TrackId')

In [287]:
inv_cust_quant_track.head()

Unnamed: 0,TrackId,Country,Quantity,Name,AlbumId,MediaTypeId,GenreId,Composer,Milliseconds,Bytes,UnitPrice
0,2,Germany,15,Balls to the Wall,2,2,1,,342562,5510424,1.25
1,2,Canada,4,Balls to the Wall,2,2,1,,342562,5510424,1.25
2,4,Germany,1,Restless and Wild,3,2,1,"F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. D...",252051,4331779,1.25
3,331,Germany,19,Lavadeira,29,1,9,"Do Vale, Valverde/Gal Oliveira/Luciano Pinto",214256,7254147,0.99
4,340,Germany,2,Dazed and Confused,30,1,1,Jimmy Page,401920,13035765,0.75


In [288]:
# cut columns again
inv_cust_quant_track = inv_cust_quant_track[['Country', 'Quantity', 'GenreId']]

In [290]:
inv_cust_quant_track.shape

(2240, 3)

In [289]:
inv_cust_quant_track.head()

Unnamed: 0,Country,Quantity,GenreId
0,Germany,15,1
1,Canada,4,1
2,Germany,1,1
3,Germany,19,9
4,Germany,2,1


In [293]:
genre = pd.read_csv('data/chinook/genres.csv')

In [297]:
# finally, merge in the genre names
country_genre = inv_cust_quant_track.merge(genre, on='GenreId')

country_genre.head()

Unnamed: 0,Country,Quantity,GenreId,Name
0,Germany,15,1,Rock
1,Canada,4,1,Rock
2,Germany,1,1,Rock
3,Germany,2,1,Rock
4,Germany,11,1,Rock


In [304]:
final_country_genre = country_genre.groupby(['Country', 'Name'], as_index=False)['Quantity'].sum()

In [305]:
final_country_genre.head()

Unnamed: 0,Country,Name,Quantity
0,Argentina,Alternative & Punk,106
1,Argentina,Easy Listening,25
2,Argentina,Jazz,28
3,Argentina,Latin,79
4,Argentina,Metal,62


In [307]:
final_country_genre_sorted = final_country_genre.sort_values(['Country', 'Quantity'], ascending=[True, False])

In [310]:
final_country_genre_sorted.head(15)

Unnamed: 0,Country,Name,Quantity
0,Argentina,Alternative & Punk,106
3,Argentina,Latin,79
5,Argentina,Rock,69
4,Argentina,Metal,62
2,Argentina,Jazz,28
1,Argentina,Easy Listening,25
6,Argentina,Soundtrack,5
12,Australia,Rock,226
10,Australia,Metal,79
8,Australia,Heavy Metal,48


In [311]:
final_country_genre_sorted.groupby('Country').first()

Country,Name,Quantity
Argentina,Alternative & Punk,106
Australia,Rock,226
Austria,Rock,97
Belgium,Rock,194
Brazil,Rock,769
Canada,Rock,1086
Chile,Latin,98
Czech Republic,Rock,275
Denmark,Rock,214
Finland,Rock,167
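
The sort-then-`first` pattern above can be checked on a small frame. The sketch below uses made-up toy data (not the chinook totals) and compares it with an alternative based on `idxmax`. One caveat: `groupby(...).first()` takes the first non-null value per column, so the two approaches are only interchangeable when the data has no missing values.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'B'],
    'Name': ['Rock', 'Jazz', 'Latin', 'Rock', 'Metal'],
    'Quantity': [10, 30, 5, 20, 15],
})

# Approach used above: sort so the largest Quantity comes first within
# each Country, then keep the first row of each group
top_sorted = (toy.sort_values(['Country', 'Quantity'], ascending=[True, False])
                 .groupby('Country')
                 .first())

# Alternative: look up the row label of the maximum Quantity per Country
top_idx = toy.loc[toy.groupby('Country')['Quantity'].idxmax()].set_index('Country')

print(top_sorted)
print(top_idx)
```

Both results list Jazz for country A and Rock for country B in this toy frame.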


### Problem 6
<span  style="color:green; font-size:16px">Find the name and email of each employee's boss. When the left and right tables use different column names for the joining key, use the **`left_on`** and **`right_on`** arguments. Use the **`suffixes`** argument to label the merged columns clearly. Be sure to include employees that don't have bosses. A table that refers back to itself like this has a recursive relationship.</span>

In [330]:
employees = pd.read_csv('data/chinook/employees.csv')

In [335]:
employee_boss = employees.merge(employees, left_on='ReportsTo', right_on='EmployeeId', how='left', suffixes=('_Emp', '_Boss'))

In [336]:
employee_boss

Unnamed: 0,EmployeeId_Emp,LastName_Emp,FirstName_Emp,Title_Emp,ReportsTo_Emp,BirthDate_Emp,HireDate_Emp,Address_Emp,City_Emp,State_Emp,Country_Emp,PostalCode_Emp,Phone_Emp,Fax_Emp,Email_Emp,EmployeeId_Boss,LastName_Boss,FirstName_Boss,Title_Boss,ReportsTo_Boss,BirthDate_Boss,HireDate_Boss,Address_Boss,City_Boss,State_Boss,Country_Boss,PostalCode_Boss,Phone_Boss,Fax_Boss,Email_Boss
0,1,Adams,Andrew,General Manager,,1962-02-18 00:00:00,2002-08-14 00:00:00,11120 Jasper Ave NW,Edmonton,AB,Canada,T5K 2N1,+1 (780) 428-9482,+1 (780) 428-3457,andrew@chinookcorp.com,,,,,,,,,,,,,,,
1,2,Edwards,Nancy,Sales Manager,1.0,1958-12-08 00:00:00,2002-05-01 00:00:00,825 8 Ave SW,Calgary,AB,Canada,T2P 2T3,+1 (403) 262-3443,+1 (403) 262-3322,nancy@chinookcorp.com,1.0,Adams,Andrew,General Manager,,1962-02-18 00:00:00,2002-08-14 00:00:00,11120 Jasper Ave NW,Edmonton,AB,Canada,T5K 2N1,+1 (780) 428-9482,+1 (780) 428-3457,andrew@chinookcorp.com
2,3,Peacock,Jane,Sales Support Agent,2.0,1973-08-29 00:00:00,2002-04-01 00:00:00,1111 6 Ave SW,Calgary,AB,Canada,T2P 5M5,+1 (403) 262-3443,+1 (403) 262-6712,jane@chinookcorp.com,2.0,Edwards,Nancy,Sales Manager,1.0,1958-12-08 00:00:00,2002-05-01 00:00:00,825 8 Ave SW,Calgary,AB,Canada,T2P 2T3,+1 (403) 262-3443,+1 (403) 262-3322,nancy@chinookcorp.com
3,4,Park,Margaret,Sales Support Agent,2.0,1947-09-19 00:00:00,2003-05-03 00:00:00,683 10 Street SW,Calgary,AB,Canada,T2P 5G3,+1 (403) 263-4423,+1 (403) 263-4289,margaret@chinookcorp.com,2.0,Edwards,Nancy,Sales Manager,1.0,1958-12-08 00:00:00,2002-05-01 00:00:00,825 8 Ave SW,Calgary,AB,Canada,T2P 2T3,+1 (403) 262-3443,+1 (403) 262-3322,nancy@chinookcorp.com
4,5,Johnson,Steve,Sales Support Agent,2.0,1965-03-03 00:00:00,2003-10-17 00:00:00,7727B 41 Ave,Calgary,AB,Canada,T3B 1Y7,1 (780) 836-9987,1 (780) 836-9543,steve@chinookcorp.com,2.0,Edwards,Nancy,Sales Manager,1.0,1958-12-08 00:00:00,2002-05-01 00:00:00,825 8 Ave SW,Calgary,AB,Canada,T2P 2T3,+1 (403) 262-3443,+1 (403) 262-3322,nancy@chinookcorp.com
5,6,Mitchell,Michael,IT Manager,1.0,1973-07-01 00:00:00,2003-10-17 00:00:00,5827 Bowness Road NW,Calgary,AB,Canada,T3B 0C5,+1 (403) 246-9887,+1 (403) 246-9899,michael@chinookcorp.com,1.0,Adams,Andrew,General Manager,,1962-02-18 00:00:00,2002-08-14 00:00:00,11120 Jasper Ave NW,Edmonton,AB,Canada,T5K 2N1,+1 (780) 428-9482,+1 (780) 428-3457,andrew@chinookcorp.com
6,7,King,Robert,IT Staff,6.0,1970-05-29 00:00:00,2004-01-02 00:00:00,590 Columbia Boulevard West,Lethbridge,AB,Canada,T1K 5N8,+1 (403) 456-9986,+1 (403) 456-8485,robert@chinookcorp.com,6.0,Mitchell,Michael,IT Manager,1.0,1973-07-01 00:00:00,2003-10-17 00:00:00,5827 Bowness Road NW,Calgary,AB,Canada,T3B 0C5,+1 (403) 246-9887,+1 (403) 246-9899,michael@chinookcorp.com
7,8,Callahan,Laura,IT Staff,6.0,1968-01-09 00:00:00,2004-03-04 00:00:00,923 7 ST NW,Lethbridge,AB,Canada,T1H 1Y8,+1 (403) 467-3351,+1 (403) 467-8772,laura@chinookcorp.com,6.0,Mitchell,Michael,IT Manager,1.0,1973-07-01 00:00:00,2003-10-17 00:00:00,5827 Bowness Road NW,Calgary,AB,Canada,T3B 0C5,+1 (403) 246-9887,+1 (403) 246-9899,michael@chinookcorp.com


In [337]:
## select important columns for boss
employee_boss[['LastName_Emp', 'FirstName_Emp', 'Title_Emp', 'ReportsTo_Emp',
              'LastName_Boss', 'FirstName_Boss', 'Title_Boss', 'ReportsTo_Boss', 'Email_Boss']]

Unnamed: 0,LastName_Emp,FirstName_Emp,Title_Emp,ReportsTo_Emp,LastName_Boss,FirstName_Boss,Title_Boss,ReportsTo_Boss,Email_Boss
0,Adams,Andrew,General Manager,,,,,,
1,Edwards,Nancy,Sales Manager,1.0,Adams,Andrew,General Manager,,andrew@chinookcorp.com
2,Peacock,Jane,Sales Support Agent,2.0,Edwards,Nancy,Sales Manager,1.0,nancy@chinookcorp.com
3,Park,Margaret,Sales Support Agent,2.0,Edwards,Nancy,Sales Manager,1.0,nancy@chinookcorp.com
4,Johnson,Steve,Sales Support Agent,2.0,Edwards,Nancy,Sales Manager,1.0,nancy@chinookcorp.com
5,Mitchell,Michael,IT Manager,1.0,Adams,Andrew,General Manager,,andrew@chinookcorp.com
6,King,Robert,IT Staff,6.0,Mitchell,Michael,IT Manager,1.0,michael@chinookcorp.com
7,Callahan,Laura,IT Staff,6.0,Mitchell,Michael,IT Manager,1.0,michael@chinookcorp.com
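
To see why `how='left'` matters here, the following minimal sketch repeats the self-merge on a hypothetical three-person table (the names and the `staff` DataFrame are invented for illustration). The top-level employee has no `ReportsTo` value, so an inner merge would silently drop that row.

```python
import pandas as pd

# Hypothetical miniature employee table; Ann has no boss
staff = pd.DataFrame({
    'EmployeeId': [1, 2, 3],
    'FirstName': ['Ann', 'Bob', 'Cy'],
    'ReportsTo': [float('nan'), 1, 2],
})

# Self-merge: the left copy plays the employee role, the right copy the boss role
left_pairs = staff.merge(staff, left_on='ReportsTo', right_on='EmployeeId',
                         how='left', suffixes=('_Emp', '_Boss'))

# Same merge with the default inner join, for comparison
inner_pairs = staff.merge(staff, left_on='ReportsTo', right_on='EmployeeId',
                          suffixes=('_Emp', '_Boss'))

print(left_pairs[['FirstName_Emp', 'FirstName_Boss']])
print(len(left_pairs), len(inner_pairs))
```

The left merge keeps all three employees, with missing values in the boss columns for Ann, while the inner merge returns only the two employees who have a boss.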


### Problem 7
<span  style="color:green; font-size:16px">Which artists have the longest tracks on average? Return the answer in minutes.</span>

In [359]:
artists = pd.read_csv('data/chinook/artists.csv')
albums = pd.read_csv('data/chinook/albums.csv')
tracks = pd.read_csv('data/chinook/tracks.csv')

In [360]:
artist_album = pd.merge(artists, albums, on='ArtistId')
artist_album.head(10)

Unnamed: 0,ArtistId,Name,AlbumId,Title
0,1,AC/DC,1,For Those About To Rock We Salute You
1,1,AC/DC,4,Let There Be Rock
2,2,Accept,2,Balls to the Wall
3,2,Accept,3,Restless and Wild
4,3,Aerosmith,5,Big Ones
5,4,Alanis Morissette,6,Jagged Little Pill
6,5,Alice In Chains,7,Facelift
7,6,Antônio Carlos Jobim,8,Warner 25 Anos
8,6,Antônio Carlos Jobim,34,Chill: Brazil (Disc 2)
9,7,Apocalyptica,9,Plays Metallica By Four Cellos


In [361]:
# rename 'Name' to something more descriptive
artist_album = artist_album.rename(columns={'Name':'ArtistName'})

In [362]:
artist_time = artist_album.merge(tracks[['AlbumId', 'Name', 'Milliseconds']], on='AlbumId')

In [363]:
artist_time.head(10)

Unnamed: 0,ArtistId,ArtistName,AlbumId,Title,Name,Milliseconds
0,1,AC/DC,1,For Those About To Rock We Salute You,For Those About To Rock (We Salute You),343719
1,1,AC/DC,1,For Those About To Rock We Salute You,Put The Finger On You,205662
2,1,AC/DC,1,For Those About To Rock We Salute You,Let's Get It Up,233926
3,1,AC/DC,1,For Those About To Rock We Salute You,Inject The Venom,210834
4,1,AC/DC,1,For Those About To Rock We Salute You,Snowballed,203102
5,1,AC/DC,1,For Those About To Rock We Salute You,Evil Walks,263497
6,1,AC/DC,1,For Those About To Rock We Salute You,C.O.D.,199836
7,1,AC/DC,1,For Those About To Rock We Salute You,Breaking The Rules,263288
8,1,AC/DC,1,For Those About To Rock We Salute You,Night Of The Long Knives,205688
9,1,AC/DC,1,For Those About To Rock We Salute You,Spellbound,270863


In [366]:
# get average time per artist
artist_time_summary = artist_time.groupby('ArtistName')['Milliseconds'].mean() / 60000

In [368]:
artist_time_summary.sort_values(ascending=False).head(10)

ArtistName
Battlestar Galactica (Classic)               48.759567
Battlestar Galactica                         46.174400
Heroes                                       43.319033
Lost                                         43.166400
Aquaman                                      41.409450
The Office                                   23.562400
Leonard Bernstein & New York Philharmonic     9.941983
Scholars Baroque Ensemble                     9.700483
Terry Bozzio, Tony Levin & Steve Stevens      9.596500
Adrian Leaper & Doreen de Feis                9.458233
Name: Milliseconds, dtype: float64

# Case Study - Counting Pandas Solutions

In [371]:
import pandas as pd
api_tables = pd.read_html('http://pandas.pydata.org/pandas-docs/stable/api.html')

## Problem 1
<span  style="color:green; font-size:16px">Writing a new for loop every time we want to count a new word in our dataset is cumbersome. Write a function that accepts the parameter **word** and returns the number of functions/methods/attributes in the pandas API whose names contain that word. Use it to count a few words, such as DataFrame or MultiIndex.</span>

In [372]:
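def count_functionality(word):
    # table[0] is the first column of each API table; note that
    # str.contains treats word as a regular expression by default
    return sum(table[0].str.contains(word).sum() for table in api_tables)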

In [373]:
count_functionality('Series'), count_functionality('DataFrame'), count_functionality('MultiIndex')

(283, 225, 11)
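
Because `str.contains` interprets its pattern as a regular expression by default, a word containing a dot (such as `'Series.str'`) can match more than intended. The sketch below shows a literal-match variant on a hypothetical stand-in for `api_tables`. The `toy_tables` list and the `count_literal` helper are assumptions made for illustration and are not part of the original solution.

```python
import pandas as pd

# Hypothetical stand-in for api_tables: a list of DataFrames whose
# first column (label 0) holds API names; one entry is missing
toy_tables = [
    pd.DataFrame({0: ['Series.str.lower()', 'Series.strides', 'SeriesXstr']}),
    pd.DataFrame({0: ['DataFrame.merge(right)', None]}),
]

def count_literal(word, tables):
    # regex=False matches word literally; na=False counts missing entries as no match
    return sum(t[0].str.contains(word, regex=False, na=False).sum() for t in tables)

# Default regex matching: '.' matches any character, so 'SeriesXstr' is counted too
regex_count = toy_tables[0][0].str.contains('Series.str').sum()
literal_count = count_literal('Series.str', toy_tables)

print(regex_count, literal_count)
```

For plain words such as `'DataFrame'` or `'MultiIndex'` the two variants agree, so the counts shown above are unaffected.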

## Problem 2
<span  style="color:green; font-size:16px">Define a new function by modifying the above function slightly so that it returns a list of all the matching methods instead of a count.</span>

In [374]:
def list_functionality(word):
    return_list = []
    for table in api_tables:
        s = table[0] # get first column
        cur_list = s[s.str.contains(word)].tolist() # get only items with word in it and convert to list
        return_list.extend(cur_list)
    return return_list

In [375]:
# these methods should look very familiar from the builtin python str methods
str_series = list_functionality('Series.str')
str_series

['Series.strides',
 'Series.str.capitalize()',
 'Series.str.cat([others,\xa0sep,\xa0na_rep])',
 'Series.str.center(width[,\xa0fillchar])',
 'Series.str.contains(pat[,\xa0case,\xa0flags,\xa0na,\xa0...])',
 'Series.str.count(pat[,\xa0flags])',
 'Series.str.decode(encoding[,\xa0errors])',
 'Series.str.encode(encoding[,\xa0errors])',
 'Series.str.endswith(pat[,\xa0na])',
 'Series.str.extract(pat[,\xa0flags,\xa0expand])',
 'Series.str.extractall(pat[,\xa0flags])',
 'Series.str.find(sub[,\xa0start,\xa0end])',
 'Series.str.findall(pat[,\xa0flags])',
 'Series.str.get(i)',
 'Series.str.index(sub[,\xa0start,\xa0end])',
 'Series.str.join(sep)',
 'Series.str.len()',
 'Series.str.ljust(width[,\xa0fillchar])',
 'Series.str.lower()',
 'Series.str.lstrip([to_strip])',
 'Series.str.match(pat[,\xa0case,\xa0flags,\xa0na,\xa0...])',
 'Series.str.normalize(form)',
 'Series.str.pad(width[,\xa0side,\xa0fillchar])',
 'Series.str.partition([pat,\xa0expand])',
 'Series.str.repeat(repeats)',
 'Series.str.replace

## Problem 3
<span  style="color:green; font-size:16px">Explore several of the Series `.str` methods that you should now have captured in a list by applying them to one of the API reference tables.</span>

In [376]:
s = api_tables[44][0]

In [377]:
s.str.swapcase()

0      sERIES.FROM_CSV(PATH[, SEP, PARSE_DATES, ...])
1                              sERIES.TO_PICKLE(PATH)
2      sERIES.TO_CSV([PATH, INDEX, SEP, NA_REP, ...])
3                                    sERIES.TO_DICT()
4                             sERIES.TO_FRAME([NAME])
5                                  sERIES.TO_XARRAY()
6         sERIES.TO_HDF(PATH_OR_BUF, KEY, \*\*KWARGS)
7     sERIES.TO_SQL(NAME, CON[, FLAVOR, SCHEMA, ...])
8          sERIES.TO_MSGPACK([PATH_OR_BUF, ENCODING])
9          sERIES.TO_JSON([PATH_OR_BUF, ORIENT, ...])
10               sERIES.TO_SPARSE([KIND, FILL_VALUE])
11                                  sERIES.TO_DENSE()
12               sERIES.TO_STRING([BUF, NA_REP, ...])
13                  sERIES.TO_CLIPBOARD([EXCEL, SEP])
Name: 0, dtype: object

In [378]:
s.str.split()  # split each element on whitespace

0     [Series.from_csv(path[,, sep,, parse_dates,, ....
1                              [Series.to_pickle(path)]
2     [Series.to_csv([path,, index,, sep,, na_rep,, ...
3                                    [Series.to_dict()]
4                             [Series.to_frame([name])]
5                                  [Series.to_xarray()]
6       [Series.to_hdf(path_or_buf,, key,, \*\*kwargs)]
7     [Series.to_sql(name,, con[,, flavor,, schema,,...
8         [Series.to_msgpack([path_or_buf,, encoding])]
9        [Series.to_json([path_or_buf,, orient,, ...])]
10              [Series.to_sparse([kind,, fill_value])]
11                                  [Series.to_dense()]
12             [Series.to_string([buf,, na_rep,, ...])]
13                 [Series.to_clipboard([excel,, sep])]
Name: 0, dtype: object

In [379]:
# chain another str method and get the second element in the split list
s.str.split().str.get(1)

0             sep,
1              NaN
2           index,
3              NaN
4              NaN
5              NaN
6             key,
7            con[,
8       encoding])
9          orient,
10    fill_value])
11             NaN
12         na_rep,
13           sep])
Name: 0, dtype: object
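
As a further illustration of chaining `.str` methods, the sketch below pulls the bare method name out of each signature. It uses three entries copied from the output above as a small stand-in Series, so it does not depend on `api_tables` being available.

```python
import pandas as pd

# Three signatures taken from the table shown above
sigs = pd.Series(['Series.to_pickle(path)', 'Series.to_dict()', 'Series.to_frame([name])'])

# Split on the opening parenthesis, keep the part before it,
# then strip the literal 'Series.' prefix
method_names = (sigs.str.split('(')
                    .str.get(0)
                    .str.replace('Series.', '', regex=False))

print(method_names.tolist())
```

Passing `regex=False` to `str.replace` keeps the dot in `'Series.'` from being treated as a regular-expression wildcard.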

## Problem 4
<span  style="color:green; font-size:16px">Let's get some 'live' data.</span>
1. Navigate to [Real Clear Politics](http://www.realclearpolitics.com)
1. In the top left corner of the page, hover over the Polls section and click on Clinton vs Trump
1. Use pandas `read_html` to read in the full table at the bottom of the page and display it here in the notebook
1. Use the `header` parameter to find the correct header instead of the default numbers
1. Inspect the output of `info` to make sure the Clinton and Trump columns have data type float64
1. Add a column that calculates the difference between the Clinton and Trump columns
1. Sort the DataFrame by this newly created column
1. Do you see anything suspicious about the polls where Trump is leading?

In [380]:
rcp_tables = pd.read_html('http://www.realclearpolitics.com/epolls/2016/president/us/general_election_trump_vs_clinton-5491.html', header=0)

In [381]:
len(rcp_tables)

3

In [382]:
rcp_final = rcp_tables[2]

In [383]:
rcp_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 261 entries, 0 to 260
Data columns (total 7 columns):
Poll           261 non-null object
Date           261 non-null object
Sample         261 non-null object
MoE            261 non-null object
Clinton (D)    261 non-null float64
Trump (R)      261 non-null float64
Spread         261 non-null object
dtypes: float64(2), object(5)
memory usage: 14.4+ KB


In [384]:
rcp_final['diff'] = rcp_final['Clinton (D)'] - rcp_final['Trump (R)']

In [385]:
rcp_final.head(20)

Unnamed: 0,Poll,Date,Sample,MoE,Clinton (D),Trump (R),Spread,diff
0,Final Results,--,--,--,48.2,46.2,Clinton +2.0,2.0
1,RCP Average,11/1 - 11/7,--,--,46.8,43.6,Clinton +3.2,3.2
2,BloombergBloomberg,11/4 - 11/6,799 LV,3.5,46.0,43.0,Clinton +3,3.0
3,IBD/TIPP TrackingIBD/TIPP Tracking,11/4 - 11/7,1107 LV,3.1,43.0,42.0,Clinton +1,1.0
4,Economist/YouGovEconomist,11/4 - 11/7,3669 LV,--,49.0,45.0,Clinton +4,4.0
5,LA Times/USC TrackingLA Times,11/1 - 11/7,2935 LV,4.5,44.0,47.0,Trump +3,-3.0
6,ABC/Wash Post TrackingABC/WP Tracking,11/3 - 11/6,2220 LV,2.5,49.0,46.0,Clinton +3,3.0
7,FOX NewsFOX News,11/3 - 11/6,1295 LV,2.5,48.0,44.0,Clinton +4,4.0
8,MonmouthMonmouth,11/3 - 11/6,748 LV,3.6,50.0,44.0,Clinton +6,6.0
9,NBC News/Wall St. JrnlNBC/WSJ,11/3 - 11/5,1282 LV,2.7,48.0,43.0,Clinton +5,5.0
