# Lecture 5

Fall 2023

A demonstration of advanced `pandas` syntax to accompany Lecture 5.

## Pivot Tables

### `Groupby` with multiple columns

We want to build a table showing the total number of babies born of each sex in each year. One way is to `groupby` using both columns of interest:

In [2]:
# Download the SSA baby names data and load the California records
import numpy as np
import pandas as pd
import plotly.express as px
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

In [5]:
# Total baby count for each (Year, Sex) pair.
# Summing every column also concatenates the string columns State and Name.
baby_count = babynames.groupby(['Year', 'Sex']).agg('sum').head(6)
baby_count


Year,Sex,State,Name,Count
1910,F,CACACACACACACACACACACACACACACACACACACACACACACA...,MaryHelenDorothyMargaretFrancesRuthEvelynAlice...,5950
1910,M,CACACACACACACACACACACACACACACACACACACACACACACA...,JohnWilliamJamesRobertGeorgeFrankJosephCharles...,3213
1911,F,CACACACACACACACACACACACACACACACACACACACACACACA...,MaryDorothyHelenMargaretRuthFrancesAliceEvelyn...,6602
1911,M,CACACACACACACACACACACACACACACACACACACACACACACA...,JohnWilliamRobertGeorgeJamesCharlesFrankJoseph...,3381
1912,F,CACACACACACACACACACACACACACACACACACACACACACACA...,MaryDorothyHelenMargaretRuthFrancesAliceVirgin...,9804
1912,M,CACACACACACACACACACACACACACACACACACACACACACACA...,JohnWilliamRobertGeorgeJamesFrankCharlesJoseph...,8142
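In the output above, `State` and `Name` are string columns, so summing them concatenates their values instead of producing anything useful. A minimal sketch of one way to avoid this is to select `Count` before aggregating. The `toy` DataFrame and its numbers are made up for illustration and only mimic the columns of `babynames`:

```python
import pandas as pd

# Hypothetical toy rows with the same columns as `babynames` (values are made up).
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "CA"],
    "Sex":   ["F", "F", "M", "M"],
    "Year":  [1910, 1910, 1910, 1911],
    "Name":  ["Mary", "Helen", "John", "John"],
    "Count": [295, 239, 237, 214],
})

# Select the numeric column first so that only `Count` is summed.
# The double brackets keep the result a DataFrame rather than a Series.
toy_count = toy.groupby(["Year", "Sex"])[["Count"]].sum()
print(toy_count)
```

The result is still indexed by the `(Year, Sex)` pairs, but it contains a single `Count` column.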


### `pivot_table`

In [7]:
# Total baby count for each year, with one column per sex, using a pivot table
baby_pivot = babynames.pivot_table(index='Year', columns='Sex', values=['Count'], aggfunc='sum')
baby_pivot


Year,Count (F),Count (M)
1910,5950,3213
1911,6602,3381
1912,9804,8142
1913,11860,10234
1914,13815,13111
...,...,...
2018,189208,206228
2019,184228,202768
2020,173763,189119
2021,173913,188669


![pivot_picture.png](attachment:pivot_picture.png)
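A pivot table can be read as a `groupby` on the index and column keys, followed by moving one key into the column labels. The sketch below illustrates that relationship on made-up data; `toy` is hypothetical and not part of the lecture dataset. It passes `values` as a plain string, which gives flat column labels (`F`, `M`) rather than the two-level header produced by `values=['Count']`.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame({
    "Sex":   ["F", "F", "M", "M", "F"],
    "Year":  [1910, 1910, 1910, 1911, 1911],
    "Count": [295, 239, 237, 214, 300],
})

# Pivot: one row per Year, one column per Sex, cells hold the summed Count.
pivoted = toy.pivot_table(index="Year", columns="Sex", values="Count", aggfunc="sum")

# The same numbers via groupby, then moving the Sex level into the columns.
unstacked = toy.groupby(["Year", "Sex"])["Count"].sum().unstack("Sex")

print(pivoted)
print(unstacked)
```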

### `pivot_table` with Multiple values

In [23]:
# Form a pivot table as described in the lecture slides
baby_pivot2 = babynames.pivot_table(index='Year', columns='Sex', values=['Count', 'State'], aggfunc='sum')
baby_pivot2.head(6)


Year,Count (F),Count (M),State (F),State (M)
1910,5950,3213,CACACACACACACACACACACACACACACACACACACACACACACA...,CACACACACACACACACACACACACACACACACACACACACACACA...
1911,6602,3381,CACACACACACACACACACACACACACACACACACACACACACACA...,CACACACACACACACACACACACACACACACACACACACACACACA...
1912,9804,8142,CACACACACACACACACACACACACACACACACACACACACACACA...,CACACACACACACACACACACACACACACACACACACACACACACA...
1913,11860,10234,CACACACACACACACACACACACACACACACACACACACACACACA...,CACACACACACACACACACACACACACACACACACACACACACACA...
1914,13815,13111,CACACACACACACACACACACACACACACACACACACACACACACA...,CACACACACACACACACACACACACACACACACACACACACACACA...
1915,18643,17192,CACACACACACACACACACACACACACACACACACACACACACACA...,CACACACACACACACACACACACACACACACACACACACACACACA...
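Summing a string column such as `State` only concatenates its values. When several value columns are pivoted at once, `pivot_table` also accepts a dictionary for `aggfunc`, so each column can have its own aggregation. The following is a sketch on made-up data; the `toy` frame and the choice of counting distinct names are assumptions for illustration, not part of the lecture:

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame({
    "Sex":   ["F", "F", "M", "F", "M", "M"],
    "Year":  [1910, 1910, 1910, 1911, 1911, 1911],
    "Name":  ["Mary", "Helen", "John", "Mary", "John", "William"],
    "Count": [295, 239, 237, 300, 214, 150],
})

# Sum the counts, but count the distinct names, for each (Year, Sex) cell.
pivot_multi = toy.pivot_table(
    index="Year",
    columns="Sex",
    values=["Count", "Name"],
    aggfunc={"Count": "sum", "Name": "nunique"},
)
print(pivot_multi)
```

The columns form a two-level index, so a single cell is addressed with a tuple, for example `pivot_multi.loc[1910, ("Count", "F")]`.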


---

## Join Tables

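Suppose we want to know how popular presidential candidates' first names were among babies born in California in 2022. The candidate names are stored in the elections table and the baby counts in `babynames`, so we need to join the two tables.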

In [26]:
# Load the elections data from a local CSV file
election = pd.read_csv(r'C:\Users\pc\Desktop\Data Science with ML\pandas\elections.csv')
# Keep the first whitespace-separated token of each candidate's name
election['First Name'] = election['Candidate'].str.split().str[0]


In [16]:
# Collect baby names for 2022
baby2022=babynames[babynames['Year']==2022]
baby2022

,State,Sex,Year,Name,Count
235835,CA,F,2022,Olivia,2178
235836,CA,F,2022,Emma,2080
235837,CA,F,2022,Camila,2046
235838,CA,F,2022,Mia,1882
235839,CA,F,2022,Sophia,1762
...,...,...,...,...,...
407423,CA,M,2022,Zayvier,5
407424,CA,M,2022,Zia,5
407425,CA,M,2022,Zora,5
407426,CA,M,2022,Zuriel,5


In [18]:
# Split each candidate name in the elections DataFrame and keep the first token
election["First Name"]=election['Candidate'].str.split().str[0]
election["First Name"]

0       Reagan
1       Carter
2     Anderson
3       Reagan
4      Mondale
5         Bush
6      Dukakis
7      Clinton
8         Bush
9        Perot
10     Clinton
11        Dole
12       Perot
13        Gore
14        Bush
15       Kerry
16        Bush
17       Obama
18      McCain
19       Obama
20      Romney
21     Clinton
22       Trump
Name: First Name, dtype: object
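To see what `.str.split().str[0]` does on its own, here is a small sketch on a hypothetical Series. The full names below are made up for illustration and are not taken from `elections.csv`:

```python
import pandas as pd

# Hypothetical candidate strings; the last one has a single token.
candidates = pd.Series(["Ronald Reagan", "Jimmy Carter", "John B. Anderson", "Reagan"])

# .str.split() turns each string into a list of whitespace-separated tokens,
# and .str[0] picks the first element of each list.
first_token = candidates.str.split().str[0]
print(first_token)
```

A value with only one token, such as `"Reagan"`, is returned unchanged, which matches the single-word values shown in the output above.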

### `join` in pandas

In [21]:
# Merge the elections and 2022 baby names tables, then report your analysis
merged = pd.merge(left=election, right=baby2022,
                  left_on='First Name', right_on='Name')
merged

,Candidate,Party,%,Year_x,Result,First Name,State,Sex,Year_y,Name,Count
0,Reagan,Republican,50.7,1980,win,Reagan,CA,F,2022,Reagan,156
1,Reagan,Republican,50.7,1980,win,Reagan,CA,M,2022,Reagan,21
2,Carter,Democratic,41.0,1980,loss,Carter,CA,F,2022,Carter,35
3,Carter,Democratic,41.0,1980,loss,Carter,CA,M,2022,Carter,384
4,Anderson,Independent,6.6,1980,loss,Anderson,CA,M,2022,Anderson,82
5,Reagan,Republican,58.8,1984,win,Reagan,CA,F,2022,Reagan,156
6,Reagan,Republican,58.8,1984,win,Reagan,CA,M,2022,Reagan,21
7,Clinton,Democratic,43.0,1992,win,Clinton,CA,M,2022,Clinton,6
8,Clinton,Democratic,49.2,1996,win,Clinton,CA,M,2022,Clinton,6
9,Kerry,Democratic,48.3,2004,loss,Kerry,CA,M,2022,Kerry,5
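Two features of the merged table are worth spelling out. By default `pd.merge` performs an inner join, so candidates with no matching baby name disappear from the result. Both inputs also have a `Year` column, so pandas renames them `Year_x` (left) and `Year_y` (right). The sketch below reproduces both effects on small hypothetical frames; `elections_toy` and `names_toy` are made-up stand-ins, with counts borrowed from the output above:

```python
import pandas as pd

# Hypothetical stand-ins for the elections and 2022 baby names tables.
elections_toy = pd.DataFrame({
    "Candidate":  ["Reagan", "Carter", "Mondale"],
    "Year":       [1980, 1980, 1984],
    "First Name": ["Reagan", "Carter", "Mondale"],
})
names_toy = pd.DataFrame({
    "Name":  ["Reagan", "Reagan", "Carter"],
    "Sex":   ["F", "M", "M"],
    "Year":  [2022, 2022, 2022],
    "Count": [156, 21, 384],
})

# Default inner join: only keys present in both tables survive.
inner_joined = pd.merge(elections_toy, names_toy,
                        left_on="First Name", right_on="Name")

# Left join: every election row is kept; unmatched rows get NaN on the right.
left_joined = pd.merge(elections_toy, names_toy, how="left",
                       left_on="First Name", right_on="Name")

print(inner_joined)
print(left_joined)
```

Reagan appears twice in the inner join because the name matches both an `F` row and an `M` row, which is the same duplication visible in the merged table above.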


In [22]:
# Sort by Count (ascending by default)
merged.sort_values('Count')

,Candidate,Party,%,Year_x,Result,First Name,State,Sex,Year_y,Name,Count
9,Kerry,Democratic,48.3,2004,loss,Kerry,CA,M,2022,Kerry,5
7,Clinton,Democratic,43.0,1992,win,Clinton,CA,M,2022,Clinton,6
8,Clinton,Democratic,49.2,1996,win,Clinton,CA,M,2022,Clinton,6
10,Clinton,Democratic,48.2,2016,loss,Clinton,CA,M,2022,Clinton,6
1,Reagan,Republican,50.7,1980,win,Reagan,CA,M,2022,Reagan,21
6,Reagan,Republican,58.8,1984,win,Reagan,CA,M,2022,Reagan,21
2,Carter,Democratic,41.0,1980,loss,Carter,CA,F,2022,Carter,35
4,Anderson,Independent,6.6,1980,loss,Anderson,CA,M,2022,Anderson,82
0,Reagan,Republican,50.7,1980,win,Reagan,CA,F,2022,Reagan,156
5,Reagan,Republican,58.8,1984,win,Reagan,CA,F,2022,Reagan,156
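Because a candidate who ran more than once appears once per election, the same 2022 count is repeated in the merged table. One possible way to rank names by their total 2022 count is to drop those repeats first and then sum over both sexes. This is a sketch under that assumption; `merged_toy` is a hypothetical subset with values copied from the output above:

```python
import pandas as pd

# Hypothetical subset of the merged table (values copied from the output above).
merged_toy = pd.DataFrame({
    "Candidate": ["Reagan", "Reagan", "Reagan", "Reagan", "Carter", "Carter", "Kerry"],
    "Year_x":    [1980, 1980, 1984, 1984, 1980, 1980, 2004],
    "Sex":       ["F", "M", "F", "M", "F", "M", "M"],
    "Name":      ["Reagan", "Reagan", "Reagan", "Reagan", "Carter", "Carter", "Kerry"],
    "Count":     [156, 21, 156, 21, 35, 384, 5],
})

# Keep one row per (Name, Sex) so repeated candidacies are not double counted.
unique_names = merged_toy.drop_duplicates(subset=["Name", "Sex"])

# Sum over both sexes and list the most common names first.
totals = unique_names.groupby("Name")["Count"].sum().sort_values(ascending=False)
print(totals)
```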
