# Pandas Exercise

In this exercise, we will be using the California `babynames` data from lecture, but we will be making different data queries.

The following are 5 tasks that require relatively complex queries. Please work in a group and discuss the best strategies for accomplishing each task (in terms of code efficiency, conciseness, etc.).
 
If time permits, we will discuss our responses as a class and consider alternatives.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# loading up the California baby names data

import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

## Task 1

There are over 20,000 unique names in this dataset. However, some have been used for longer periods of time than others. Create a Pandas DataFrame where the index is the name and there are two columns: one for the first year in which that name appeared in the dataset, and the other for the last year in which it appeared. Add a third column for the longevity of the name (the number of years in which it has been used), sort the table by decreasing longevity, and display only the first 10 names.

In [5]:
# insert solution
# first and last year in which each name appears
first_year = babynames.groupby(['Name'])['Year'].min()
last_year = babynames.groupby(['Name'])['Year'].max()
# number of distinct years in which each name appears
longevity_name = babynames.groupby(['Name'])['Year'].nunique()
# make a table including the above three columns
longevity = pd.DataFrame({'first_year': first_year, 'last_year': last_year, 'longevity_name': longevity_name})
longevity.sort_values(by='longevity_name', ascending=False).head(10)

| Name | first_year | last_year | longevity_name |
|---|---|---|---|
| Lydia | 1910 | 2021 | 112 |
| Genevieve | 1910 | 2021 | 112 |
| Cecelia | 1910 | 2021 | 112 |
| Lucy | 1910 | 2021 | 112 |
| Sam | 1910 | 2021 | 112 |
| Teresa | 1910 | 2021 | 112 |
| Sally | 1910 | 2021 | 112 |
| Catherine | 1910 | 2021 | 112 |
| Caroline | 1910 | 2021 | 112 |
| Nick | 1910 | 2021 | 112 |
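
As an alternative to the three separate `groupby` calls above, the same table can be built in a single pass with named aggregation. This is only a sketch: the `toy` frame and the `name_longevity` helper are hypothetical stand-ins introduced for illustration, not part of the exercise, and the column names simply mirror the solution cell.

```python
import pandas as pd

# Hypothetical toy data standing in for `babynames` (not the real CA records).
# "Ann" appears twice in 1910 (once per sex) to show that years are counted once.
toy = pd.DataFrame({
    "Sex":   ["F", "M", "F", "F", "M", "M", "F"],
    "Year":  [1910, 1910, 1911, 1950, 1990, 1991, 2000],
    "Name":  ["Ann", "Ann", "Ann", "Ann", "Bo", "Bo", "Cy"],
    "Count": [5, 6, 7, 9, 6, 8, 5],
})

def name_longevity(df):
    """First year, last year, and number of distinct years per name, in one groupby."""
    out = df.groupby("Name")["Year"].agg(
        first_year="min",
        last_year="max",
        longevity_name="nunique",  # distinct years, so duplicate years count once
    )
    return out.sort_values("longevity_name", ascending=False)

summary = name_longevity(toy)
print(summary.head(10))
```

Running `name_longevity(babynames)` should give the same three columns as the solution cell, although the order of tied names may differ.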


## Task 2

Some think that baby names are getting longer, on average, as time wears on. We're not sure if that's true, but let's query the data to check this out. Write code to return a Pandas Series whose index is the year (from 1910 to 2021), and whose values are the average name length **among all babies in the dataset for that year**. *(More concretely, for each year, we are asking for the sum of the name lengths of every individual baby included in the dataset, divided by the number of babies included in that year.)*

Print out the first 10 years and the last 10 years and see if you notice any significant differences.

In [10]:
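# insert solution
# Note: this averages name length over rows (one row per Sex/Year/Name), so it is
# not weighted by Count, and the output below is sorted by mean length, not by year.
babynames['Name_length'] = babynames['Name'].str.len()
Name_mean_length = babynames.groupby(['Year'])['Name_length'].agg(['mean'])
# the 10 years with the longest and the 10 years with the shortest mean name length
first10 = Name_mean_length.sort_values(by='mean', ascending=False).head(10)
last10 = Name_mean_length.sort_values(by='mean', ascending=False).tail(10)
print(first10)
print(last10)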

          mean
Year          
1991  6.097318
1997  6.092251
1990  6.089733
1993  6.083294
1994  6.082359
1989  6.081217
1995  6.080913
1992  6.079277
1996  6.069417
1988  6.067446
          mean
Year          
1916  5.801843
1940  5.798846
2021  5.797046
1941  5.794302
1915  5.790920
1910  5.768595
1913  5.761827
1914  5.760563
1912  5.754480
1911  5.720102
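
The task statement defines the average per baby, which amounts to weighting each name's length by its `Count`; the cell above instead averages over rows. A minimal sketch of the weighted version follows, assuming the same column names as `babynames`. The `toy` frame and the `mean_name_length_per_baby` helper are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for `babynames` (not the real CA records).
toy = pd.DataFrame({
    "Year":  [1910, 1910, 2021, 2021],
    "Name":  ["Al", "Mary", "Bo", "Alexander"],
    "Count": [10, 30, 5, 15],
})

def mean_name_length_per_baby(df):
    """Sum of name lengths over all babies in a year, divided by the number of babies."""
    lengths = df["Name"].str.len()
    # each row contributes its name length once per baby given that name
    total_length = (lengths * df["Count"]).groupby(df["Year"]).sum()
    total_babies = df.groupby("Year")["Count"].sum()
    return total_length / total_babies

avg_len = mean_name_length_per_baby(toy)
print(avg_len.head(10))  # first 10 years (the index stays sorted by year)
print(avg_len.tail(10))  # last 10 years
```

Because the result is indexed by year and never re-sorted by value, `head(10)` and `tail(10)` give the first and last 10 years directly.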


## Task 3

For each year, what proportion of the unique names given that year start with a vowel? Write Pandas code to return a Pandas Series with the index as the year and the value as the proportion of different names given that year that start with a vowel.

*Hint: the str.startswith() method for a Pandas series may be useful to you. You may also find it useful to define a function to plug into the agg method for groupby objects, as in lecture.*

In [12]:
# insert solution
# Note: the proportion is taken over rows, so a name recorded for both sexes
# in the same year is counted twice.
babynames['Start_with_vowel'] = babynames['Name'].str.startswith(('A', 'E', 'I', 'O', 'U'))
babynames.groupby(['Year'])['Start_with_vowel'].mean()

Year
1910    0.225895
1911    0.239186
1912    0.236559
1913    0.230016
1914    0.229577
          ...   
2017    0.255640
2018    0.251298
2019    0.255907
2020    0.256907
2021    0.259578
Name: Start_with_vowel, Length: 112, dtype: float64
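
Since the question is about different names, one possible refinement is to drop repeated (year, name) pairs before taking the proportion, so that a name recorded for both sexes counts once. The sketch below assumes the same column names as `babynames`; the `toy` frame and the `vowel_share` helper are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for `babynames` (not the real CA records).
toy = pd.DataFrame({
    "Year":  [2000, 2000, 2000, 2000, 2001, 2001],
    "Sex":   ["F", "M", "F", "M", "F", "M"],
    "Name":  ["Avery", "Avery", "Zoe", "Max", "Ida", "Tom"],
    "Count": [10, 8, 12, 9, 6, 7],
})

def vowel_share(df):
    """Per year, the fraction of distinct names that start with a vowel."""
    # keep one row per (Year, Name) so unisex names are not double counted
    unique_names = df.drop_duplicates(subset=["Year", "Name"])
    starts_with_vowel = unique_names["Name"].str.startswith(("A", "E", "I", "O", "U"))
    # the mean of a boolean Series is the proportion of True values
    return starts_with_vowel.groupby(unique_names["Year"]).mean()

share = vowel_share(toy)
print(share)
```

In the toy data, 2000 has three distinct names of which one starts with a vowel, so the share is 1/3; the row-based calculation would give 2/4 instead.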

## Task 4

Are names becoming more unique over time? Return a Pandas Series whose index is the Year and whose values are the number of names given that year whose count is less than 15. Print the first 10 and last 10 elements of the series.

In [17]:
## insert solution
babynames_count = babynames[babynames['Count'] < 15].groupby(['Year'])['Count'].count()
print(babynames_count.head(10))
print(babynames_count.tail(10))



Year
1910    205
1911    227
1912    308
1913    323
1914    384
1915    445
1916    469
1917    475
1918    513
1919    514
Name: Count, dtype: int64
Year
2012    3978
2013    3879
2014    3892
2015    3832
2016    3717
2017    3678
2018    3571
2019    3531
2020    3485
2021    3586
Name: Count, dtype: int64


## Task 5



Among names that were very popular (let's say, have a count greater than 1000), how many different names are there across sex and year? Write Pandas code that returns a Pandas DataFrame, whose columns correspond to sex and whose row indices correspond to year. Each entry should be the number of unique, "popular" names for that year for that sex (given our definition of popular from above).

If there's a NaN value (missing value) in the table, why do you think it's there? What do you think is a reasonable value to impute for these missing values? Fill in all missing values with the value your group finds most appropriate (see the documentation for the pandas `DataFrame.fillna()` method for how to do this).

In [24]:
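## insert solution
# For each (Year, Sex) cell, count the rows whose Count exceeds 1000.
# Counting this way yields 0 rather than NaN where no name is popular.
babynames_pivot = babynames.pivot_table(
    index='Year',
    columns='Sex',
    values='Count',
    aggfunc=lambda x: (x > 1000).sum(),
)
babynames_pivot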

| Year | F | M |
|---|---|---|
| 1910 | 0 | 0 |
| 1911 | 0 | 0 |
| 1912 | 0 | 0 |
| 1913 | 0 | 0 |
| 1914 | 0 | 0 |
| ... | ... | ... |
| 2017 | 19 | 41 |
| 2018 | 18 | 36 |
| 2019 | 18 | 37 |
| 2020 | 17 | 35 |
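
The solution above never produces NaN because it counts inside the aggregation. If the popular rows are filtered first, missing values do appear, and `fillna()` comes into play. The sketch below illustrates this under the assumption that the group settles on 0 as the fill value (a missing cell means no name crossed the threshold); the `toy` frame and the `popular_name_counts` helper are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for `babynames` (not the real CA records).
toy = pd.DataFrame({
    "Year":  [1910, 1910, 1950, 1950, 1990, 1990, 1990],
    "Sex":   ["F", "M", "F", "M", "F", "M", "M"],
    "Name":  ["Mary", "John", "Linda", "Robert", "Jessica", "Michael", "Daniel"],
    "Count": [295, 237, 900, 5000, 6500, 8200, 5000],
})

def popular_name_counts(df, threshold=1000):
    """Unique names per (Year, Sex) with Count above threshold, before and after filling."""
    popular = df[df["Count"] > threshold]
    raw = popular.pivot_table(index="Year", columns="Sex", values="Name", aggfunc="nunique")
    # years with no popular names at all vanish after filtering; bring them back as NaN
    all_years = pd.Index(sorted(df["Year"].unique()), name="Year")
    raw = raw.reindex(all_years)
    # assumption: 0 is the chosen fill value, since NaN here means "no popular names"
    filled = raw.fillna(0).astype(int)
    return raw, filled

raw, filled = popular_name_counts(toy)
print(raw)
print(filled)
```

The explicit `reindex` matters because a year in which neither sex has a popular name would otherwise be dropped from the table entirely rather than shown as missing.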
