# Discussion 2: Pandas Group Work

Now that you've tested your knowledge on the practice worksheet, it's time to actually use that knowledge to make interesting queries on real data. We will be using the California `babynames` data from lecture, but we will be making different data queries.

The following are 5 tasks that require relatively complex queries, especially compared to what you just did on the worksheet. Form groups of 3 or 4 and choose one task to complete as a group. Discuss the best strategies for accomplishing the task (in terms of code efficiency, conciseness, etc.). Do not hesitate to Google for help, either in the pandas documentation or on Stack Exchange.
 
If time permits, we will discuss our responses as a class and consider alternatives, if any. If your group finishes their task early, challenge yourself with another one!

In [2]:
import numpy as np
import pandas as pd

In [3]:
# loading up the California baby names data

import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

## Task 1

There are over 20,000 unique names in this dataset. However, some have been used for longer periods of time than others. Create a Pandas DataFrame where the index is the name and there are two columns: one for the first year in which that name appeared in the dataset, and one for the last year in which it appeared. Add a third column for the longevity of the name (the number of years between its first and last appearance). Then sort the table by decreasing longevity and display only the first 10 names.

In [4]:
# one of many possible solutions

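# pass the aggregation names as strings; recent pandas versions warn when given the builtins min and max
minmax_names = babynames.groupby('Name')['Year'].agg(['min', 'max'])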
minmax_names['Longevity'] = minmax_names['max'] - minmax_names['min']
minmax_names.sort_values(by = 'Longevity', ascending = False).head(10)

Name,min,max,Longevity
Marcella,1910,2021,111
Catherine,1910,2021,111
Juan,1910,2021,111
Ann,1910,2021,111
Elvira,1910,2021,111
Virgil,1910,2021,111
Vera,1910,2021,111
Ina,1910,2021,111
Inez,1910,2021,111
John,1910,2021,111
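
An alternative sketch of the same idea uses named aggregation, which gives the columns descriptive names in one step. The `toy` frame and the column names `first_year`, `last_year`, and `longevity` are made up for illustration and are not part of the original solution.

```python
import pandas as pd

# Hypothetical toy data with the same columns we need from `babynames`
toy = pd.DataFrame({
    'Name':  ['Ann', 'Ann', 'Ann', 'Kai', 'Kai', 'Zoe'],
    'Year':  [1910, 1950, 2021, 1990, 2021, 2005],
    'Count': [12, 40, 7, 5, 30, 9],
})

# Named aggregation: keyword = output column name, value = aggregation to apply
span = toy.groupby('Name')['Year'].agg(first_year='min', last_year='max')

# Longevity as the gap between the last and first year of appearance
span['longevity'] = span['last_year'] - span['first_year']
span = span.sort_values('longevity', ascending=False)
print(span.head(10))
```

On this toy data, `Ann` comes first with a longevity of 111, followed by `Kai` (31) and `Zoe` (0).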


## Task 2

Some think that baby names are getting longer, on average, as time wears on. We're not sure if that's true, but let's query the data to check this out. Write code to return a Pandas Series whose index is the year (from 1910 to 2021), and whose values are the average name length **among all babies in the dataset for that year**. *(More concretely, for each year, we are asking for the sum of the name lengths of every individual baby included in the dataset, divided by the number of babies included in that year.)*

Print out the first 10 years and the last 10 years and see if you notice any significant differences.

In [5]:
# one of many possible solutions

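# total number of letters contributed by each row: name length times number of babies
babynames['lengths_name'] = babynames['Name'].str.len() * babynames['Count']
# per year: total letters divided by total babies
avg_lens = babynames.groupby('Year')['lengths_name'].sum() / babynames.groupby('Year')['Count'].sum()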
avg_lens.head(10)

Year
1910    5.800939
1911    5.784133
1912    5.813440
1913    5.803657
1914    5.805690
1915    5.793135
1916    5.785206
1917    5.802109
1918    5.804126
1919    5.790559
dtype: float64

In [51]:
avg_lens.tail(10)

Year
2012    5.926478
2013    5.908037
2014    5.896814
2015    5.881191
2016    5.862503
2017    5.845563
2018    5.821742
2019    5.800851
2020    5.792876
2021    5.762284
dtype: float64
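
To see why the task weights each name by its count, here is a small sketch on made-up data that compares the per-baby average with a plain per-name average. The `toy` frame and the variable names are hypothetical. This version also avoids adding a helper column to the DataFrame.

```python
import pandas as pd

# Hypothetical toy data: one very common short name, one rare long name
toy = pd.DataFrame({
    'Year':  [2000, 2000, 2001, 2001],
    'Name':  ['Al', 'Maximilian', 'Bo', 'Eve'],
    'Count': [90, 10, 50, 50],
})

name_len = toy['Name'].str.len()

# Unweighted: every distinct name counts once, regardless of how many babies have it
per_name = name_len.groupby(toy['Year']).mean()

# Weighted: total letters across all babies divided by the number of babies
total_letters = (name_len * toy['Count']).groupby(toy['Year']).sum()
total_babies = toy.groupby('Year')['Count'].sum()
per_baby = total_letters / total_babies

print(pd.DataFrame({'per_name': per_name, 'per_baby': per_baby}))
```

For 2000, the per-name average is 6.0 but the per-baby average is 2.8, because most babies that year have the short name. For 2001, both averages are 2.5 since the two names are equally common.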

## Task 3

Of the unique names given in each year, what proportion start with a vowel? Write Pandas code to return a Pandas Series with the year as the index and, as the value, the proportion of different names given that year that start with a vowel.

*Hint: the `str.startswith()` method for a Pandas Series may be useful to you. You may also find it useful to define a function and pass it to the `agg` method of a groupby object, as in lecture.*

In [22]:
# one solution, using a user-defined function

def prop_vowels(series):
    # the mean of a boolean Series is the proportion of True values
    return np.mean(series.str.startswith(('A', 'E', 'I', 'O', 'U')))

babynames.groupby('Year')['Name'].agg(prop_vowels)

Year
1910    0.225895
1911    0.239186
1912    0.236559
1913    0.230016
1914    0.229577
          ...   
2017    0.255640
2018    0.251298
2019    0.255907
2020    0.256907
2021    0.259578
Name: Name, Length: 112, dtype: float64
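
One caveat worth discussing: the data has one row per sex, year, and name, so a name given to both sexes in the same year appears twice within that year's group. If "different names" is read strictly, one possible variant drops those duplicates first. The sketch below uses made-up data; the `toy` frame and the variable names are hypothetical, and `str[0].isin(...)` is used here as a substitute for `str.startswith`.

```python
import pandas as pd

# Hypothetical toy data: 'Alex' is given to both sexes in 2000
toy = pd.DataFrame({
    'Sex':  ['F', 'M', 'F', 'M', 'F'],
    'Year': [2000, 2000, 2000, 2000, 2001],
    'Name': ['Alex', 'Alex', 'Mia', 'Owen', 'Una'],
})

vowels = list('AEIOU')

# Row-level proportion: 'Alex' is counted once per sex
row_level = toy['Name'].str[0].isin(vowels).groupby(toy['Year']).mean()

# Name-level proportion: keep each (Year, Name) pair only once
unique_names = toy.drop_duplicates(subset=['Year', 'Name'])
name_level = unique_names['Name'].str[0].isin(vowels).groupby(unique_names['Year']).mean()

print(pd.DataFrame({'row_level': row_level, 'name_level': name_level}))
```

For 2000, the row-level proportion is 3/4 while the name-level proportion is 2/3. Which reading is intended is a good question to raise in your group.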

## Task 4

Are names becoming more unique over time? Return a Pandas Series whose index is the Year and whose values are the number of names given that year whose count is less than 15. Print the first 10 and last 10 elements of the series.

In [7]:
## one of many possible solutions

names = babynames[babynames['Count'] < 15].groupby('Year')['Name'].count()
names.head(10)

Year
1910    205
1911    227
1912    308
1913    323
1914    384
1915    445
1916    469
1917    475
1918    513
1919    514
Name: Name, dtype: int64

In [77]:
names.tail(10)

Year
2012    3978
2013    3879
2014    3892
2015    3832
2016    3717
2017    3678
2018    3571
2019    3531
2020    3485
2021    3586
Name: Name, dtype: int64
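
A small design note: filtering before grouping drops any year that has no rows with a count below 15. If you want such years to show up with a value of 0, one option is to group a boolean mask instead. This is a sketch on made-up data; the `toy` frame and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 2001 has no name with a count below 15
toy = pd.DataFrame({
    'Year':  [2000, 2000, 2001, 2001],
    'Name':  ['Ann', 'Kai', 'Ann', 'Zoe'],
    'Count': [5, 200, 300, 150],
})

# Filter first, then count: 2001 disappears from the result
filtered = toy[toy['Count'] < 15].groupby('Year')['Name'].count()

# Sum a boolean mask per year: True counts as 1, so 2001 is kept with value 0
masked = (toy['Count'] < 15).groupby(toy['Year']).sum()

print(filtered)
print(masked)
```

Both approaches agree on every year that has at least one rare name; they differ only in whether years with none appear in the index.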

## Task 5



Among names that were very popular (let's say, have a count greater than 1000), how many different names are there across sex and year? Write Pandas code that returns a Pandas DataFrame, whose columns correspond to sex and whose row indices correspond to year. Each entry should be the number of unique, "popular" names for that year for that sex (given our definition of popular from above).

If there's a NaN value (missing value) in the table, why do you think it's there? What is a reasonable value to impute for these missing values? Fill in all missing values with the value your group finds most appropriate (look up the `fillna()` method for pandas DataFrames online for info on how to do this).

*Answer: The NaN appears because, in 1915, there are no female names whose count is more than 1000, so there is nothing to count for that cell (see bottom cell for demonstration and code). A reasonable value to impute would be 0, since a missing value means that no names are 'popular' for that year and sex.*

In [35]:
## one of many possible solutions

# accomplishing first part of task, revealing NaN
df = babynames[babynames['Count'] > 1000].pivot_table(index = 'Year', columns = 'Sex', values = 'Name', aggfunc = 'count')
# accomplishing next part of task, where we fill in 0
df.fillna(0)

Year,F,M
1915,0.0,1.0
1916,1.0,1.0
1917,1.0,3.0
1918,1.0,3.0
1919,1.0,2.0
...,...,...
2017,19.0,41.0
2018,18.0,36.0
2019,18.0,37.0
2020,17.0,35.0
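
As a possible alternative to calling `fillna(0)` afterwards, `pivot_table` accepts a `fill_value` argument that fills empty cells while the table is built. The sketch below uses made-up data that mimics the 1915 situation; the `toy` frame and the variable names are hypothetical. The final `astype(int)` is an optional extra step, added because counts are whole numbers.

```python
import pandas as pd

# Hypothetical toy data: in 1915 only a male name exceeds 1000
toy = pd.DataFrame({
    'Sex':   ['M', 'F', 'M', 'F', 'M'],
    'Year':  [1915, 1915, 1916, 1916, 1916],
    'Name':  ['John', 'Mary', 'John', 'Mary', 'William'],
    'Count': [1033, 998, 1100, 1050, 1010],
})

popular = toy[toy['Count'] > 1000]

# fill_value=0 replaces the missing (Year, Sex) combinations with 0
table = popular.pivot_table(index='Year', columns='Sex', values='Name',
                            aggfunc='count', fill_value=0)

# Counts are whole numbers, so store them as integers
table = table.astype(int)
print(table)
```

Here the 1915 row shows 0 for `F` and 1 for `M`, and the 1916 row shows 1 for `F` and 2 for `M`.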


In [33]:
# Justification for the missing value: in 1915, the only name whose count is greater than 1000 is John
# No female name was given more than 1000 times in that year

babynames[babynames['Year'] == 1915].sort_values(by = 'Count', ascending = False)

,State,Sex,Year,Name,Count
236940,CA,M,1915,John,1033
1488,CA,F,1915,Mary,998
236941,CA,M,1915,William,886
236942,CA,M,1915,Robert,840
1489,CA,F,1915,Dorothy,717
...,...,...,...,...,...
1920,CA,F,1915,Madge,5
1919,CA,F,1915,Madelyn,5
1918,CA,F,1915,Lulu,5
1917,CA,F,1915,Lucia,5
