# Week 6 (3/7-3/13)

## Project 

* [Baby names](../Projects/baby_names/baby_names.ipynb)

## Resources

### 1. Sample DataFrame merging data

In [32]:
import pandas as pd
import numpy as np

rng = np.random.default_rng(10)

names = ["Ava", "Benjamin", "Charlotte", "Daniel", "Emma", "Fredric", "Gianna"]
courses = ["MTH 141", "MTH 142", "MTH 241", "MTH 306", "MTH 309", "MTH 311"]
rooms = ["NSC 216", "Capen 110", "Park 440"]

# data for concatenating rows
scores1 = rng.integers(0, 100, 12).reshape(4, 3)
scores2 = rng.integers(0, 100, 9).reshape(3, 3)
columns = ["problem_1", "problem_2", "problem_3"]
sec1 = pd.DataFrame(scores1, index=names[:4], columns=columns)
sec2 = pd.DataFrame(scores2, index=names[4:7], columns=columns)

# data for concatenating columns
scores1 = rng.integers(0, 100, 8).reshape(4, 2)
scores2 = rng.integers(0, 100, 9).reshape(3, 3)
part1 = pd.DataFrame(scores1,
                     index=names[:4],
                     columns=["problem_1", "problem_2"])
part2 = pd.DataFrame(scores2,
                     index=names[:3],
                     columns=["problem_3", "problem_4", "problem_5"])

# merging data
office_nums = rng.integers(100, 150, len(names[:-1]))
courses = pd.DataFrame({"course": courses,
                        "instructor": rng.choice(names[1:], len(courses))})
instructors = pd.DataFrame({"name": names[:-1], "office": office_nums}, dtype="object")
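
One way these sample frames might be used is sketched below. It rebuilds frames of the same shape so that it runs on its own. The two-row `courses` table and its instructors are picked by hand for illustration and are not the randomly generated table above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
names = ["Ava", "Benjamin", "Charlotte", "Daniel", "Emma", "Fredric", "Gianna"]
columns = ["problem_1", "problem_2", "problem_3"]

# Same columns, different students: stack the rows.
sec1 = pd.DataFrame(rng.integers(0, 100, (4, 3)), index=names[:4], columns=columns)
sec2 = pd.DataFrame(rng.integers(0, 100, (3, 3)), index=names[4:7], columns=columns)
all_sections = pd.concat([sec1, sec2])

# Different columns, overlapping students: place side by side, aligned on the index.
part1 = pd.DataFrame(rng.integers(0, 100, (4, 2)), index=names[:4],
                     columns=["problem_1", "problem_2"])
part2 = pd.DataFrame(rng.integers(0, 100, (3, 3)), index=names[:3],
                     columns=["problem_3", "problem_4", "problem_5"])
all_parts = pd.concat([part1, part2], axis=1)  # Daniel gets NaN in the part2 columns

# Hand-picked courses table (illustrative only); Gianna has no office on record.
courses = pd.DataFrame({"course": ["MTH 141", "MTH 142"],
                        "instructor": ["Benjamin", "Gianna"]})
instructors = pd.DataFrame({"name": names[:-1],
                            "office": rng.integers(100, 150, len(names) - 1)})

# A left merge keeps every course, even when the instructor has no match.
merged = courses.merge(instructors, left_on="instructor", right_on="name", how="left")
```

With `how="left"` the unmatched instructor stays in the result with a missing office; `how="inner"` would drop that course instead.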

### 2. Plotly installation

In [None]:
%pip install plotly

### 3. Data for choropleth maps

In [2]:
url = "https://raw.githubusercontent.com/plotly/datasets/master/2011_us_ag_exports.csv"

## Exercises

**Note.** All exercises, except for the first one, use data contained in the `names.zip` file. 

In [1]:
# "nbsphinx": "hidden"

# dataframe with baby names

import zipfile
import pandas as pd

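archive = zipfile.ZipFile('names.zip')  # avoid shadowing the built-in zip()
files = [n for n in archive.namelist() if n.endswith(".txt")]
# the last four characters of each file's base name are the year
years = [int(f.split(".")[0][-4:]) for f in files]

frames = []
for f in files:
    with archive.open(f) as fh:
        df = pd.read_csv(fh, names=["name", "sex", "count"])
    frames.append(df)

# keys=years adds the year as an outer index level, which is then moved into a column
names = (pd.concat(frames, keys=years)
         .reset_index(level=0)
         .rename({"level_0": "year"}, axis=1)
         .reset_index(drop=True)
        )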
names

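|  | year | name | sex | count |
| --- | --- | --- | --- | --- |
| 0 | 1880 | Mary | F | 7065 |
| 1 | 1880 | Anna | F | 2604 |
| 2 | 1880 | Emma | F | 2003 |
| 3 | 1880 | Elizabeth | F | 1939 |
| 4 | 1880 | Minnie | F | 1746 |
| ... | ... | ... | ... | ... |
| 2020858 | 2020 | Zykell | M | 5 |
| 2020859 | 2020 | Zylus | M | 5 |
| 2020860 | 2020 | Zymari | M | 5 |
| 2020861 | 2020 | Zyn | M | 5 |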
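
The loading pattern can be tried without the real archive. The sketch below builds a tiny in-memory zip file whose member names and contents are made up, then reads it back. It assumes, as the code above does, that the year is the last four characters of each file's base name. It uses `assign` rather than `keys=` to attach the year, which is just one alternative.

```python
import io
import zipfile

import pandas as pd

# Build a small archive in memory; file names and rows are invented for this demo.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("yob1880.txt", "Mary,F,7\nJohn,M,9\n")
    zf.writestr("yob1881.txt", "Mary,F,6\n")
    zf.writestr("readme.pdf", "not a data file")

frames = []
with zipfile.ZipFile(buffer) as zf:
    for member in zf.namelist():
        if not member.endswith(".txt"):
            continue  # skip anything that is not a data file
        year = int(member.split(".")[0][-4:])
        with zf.open(member) as fh:
            df = pd.read_csv(fh, names=["name", "sex", "count"])
        frames.append(df.assign(year=year))  # record the year as a regular column

toy_names = pd.concat(frames, ignore_index=True)[["year", "name", "sex", "count"]]
```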


### Exercise 1

Construct a DataFrame from the baby names data in the `namesbystate.zip` file. The columns should be named "state", "sex", "year", "name", and "count". You don't need to save this data to a single CSV file; if you do, the file will be about 130 MB.

**Check:** The DataFrame should have 6,215,834 rows.

In [2]:
# "nbsphinx": "hidden"

archive = zipfile.ZipFile('namesbystate.zip')  # avoid shadowing the built-in zip()
files = [n for n in archive.namelist() if n.endswith(".TXT")]
states = [f.split(".")[0] for f in files]

frames = []
for f in files:
    with archive.open(f) as fh:
        df = pd.read_csv(fh, names=["state", "sex", "year", "name", "count"])
    frames.append(df)

bystate = pd.concat(frames).reset_index(drop=True)
bystate

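|  | state | sex | year | name | count |
| --- | --- | --- | --- | --- | --- |
| 0 | AK | F | 1910 | Mary | 14 |
| 1 | AK | F | 1910 | Annie | 12 |
| 2 | AK | F | 1910 | Anna | 10 |
| 3 | AK | F | 1910 | Margaret | 8 |
| 4 | AK | F | 1910 | Helen | 7 |
| ... | ... | ... | ... | ... | ... |
| 6215829 | WY | M | 2020 | Simon | 5 |
| 6215830 | WY | M | 2020 | Sterling | 5 |
| 6215831 | WY | M | 2020 | Stetson | 5 |
| 6215832 | WY | M | 2020 | Timothy | 5 |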


### Exercise 2

Compute a DataFrame that lists the total number of babies recorded each year. 

**Check:** There were 201,484 babies recorded in 1880 and 3,305,259 in 2020.

In [3]:
# "nbsphinx": "hidden"

n_babies = names.groupby(by="year")["count"].sum()
n_babies

year
1880     201484
1881     192691
1882     221533
1883     216944
1884     243461
         ...   
2016    3662277
2017    3568294
2018    3505963
2019    3455946
2020    3305259
Name: count, Length: 141, dtype: int64
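
The result above is a Series indexed by year, while the exercise asks for a DataFrame. Two possible ways to get a DataFrame are sketched here on a made-up four-row frame; the column name `n_babies` is an arbitrary choice.

```python
import pandas as pd

# Toy data with the same columns as the names frame; the counts are invented.
toy = pd.DataFrame({
    "year": [1880, 1880, 1881, 1881],
    "name": ["Mary", "John", "Mary", "John"],
    "sex": ["F", "M", "F", "M"],
    "count": [10, 20, 30, 40],
})

totals = toy.groupby("year")["count"].sum()       # Series indexed by year

# Option 1: move the index into a column and name the values column.
totals_df = totals.reset_index(name="n_babies")

# Option 2: ask groupby to keep the key as a column from the start.
totals_df2 = toy.groupby("year", as_index=False)["count"].sum()
```

The same conversion applies to the Series produced in the later exercises.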

### Exercise 3

Compute a DataFrame that lists the number of male babies named "John" for each year. 

**Check:** There were 9,655 such babies recorded in 1880 and 8,180 in 2020.

In [5]:
# "nbsphinx": "hidden"

n_john = (names[(names["name"] == "John") &
               (names["sex"] == "M")]
               .groupby(by="year")["count"].sum())
n_john

year
1880     9655
1881     8769
1882     9557
1883     8894
1884     9388
        ...  
2016    10034
2017     9503
2018     9170
2019     8813
2020     8180
Name: count, Length: 141, dtype: int64

### Exercise 4

Compute a DataFrame that lists how many different names were used each year for males and how many for females.

**Check:** In 1880 there were 942 different names used for females and 1,058 for males. In 2020 these numbers were 17,360 for females and 13,911 for males.

In [6]:
# "nbsphinx": "hidden"

# nunique counts distinct names even if a name were listed more than once in a group
num_names = names.groupby(by=["year", "sex"])["name"].nunique()
num_names

year  sex
1880  F        942
      M       1058
1881  F        938
      M        996
1882  F       1028
             ...  
2018  M      14073
2019  F      17948
      M      14082
2020  F      17360
      M      13911
Name: name, Length: 282, dtype: int64
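
For a side-by-side comparison of the two sexes, the grouped result can be reshaped so that each year is one row. This is a minimal sketch on invented toy data, not the full dataset.

```python
import pandas as pd

# Toy data; names and years are made up for illustration.
toy = pd.DataFrame({
    "year": [1880, 1880, 1880, 1881, 1881, 1881],
    "sex":  ["F", "F", "M", "F", "M", "M"],
    "name": ["Mary", "Anna", "John", "Mary", "John", "William"],
})

per_group = toy.groupby(["year", "sex"])["name"].nunique()  # Series with a (year, sex) index
wide = per_group.unstack("sex")                             # rows: year, columns: F and M
```

Here `wide.loc[1880, "F"]` is 2 and `wide.loc[1881, "M"]` is 2.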

### Exercise 5

Compute a DataFrame that for each name shows in which year the name appeared in the records for the first time.

**Check:** Here are the first recorded years for a few names: Aaban 2007, Aabha 2011, Aabid 2003, Aabidah 2018. 

In [7]:
# "nbsphinx": "hidden"

first_use = names.groupby(by=["name"])["year"].min()
first_use

name
Aaban      2007
Aabha      2011
Aabid      2003
Aabidah    2018
Aabir      2016
           ... 
Zyvion     2009
Zyvon      2015
Zyyanna    2010
Zyyon      2014
Zzyzx      2010
Name: year, Length: 100364, dtype: int64

### Exercise 6

Compute a DataFrame that shows the most popular name for males and the most popular name for females in each year.

**Check:** The most popular names in 1880 were John and Mary, and in 2020 Liam and Olivia.

In [32]:
# "nbsphinx": "hidden"

def top_name(grp):
    return grp.sort_values(by="count", ascending=False).head(1)

most_popular = names.groupby(by=["year", "sex"]).apply(top_name)
most_popular.droplevel(level=2)[['name', 'count']]

| year | sex | name | count |
| --- | --- | --- | --- |
| 1880 | F | Mary | 7065 |
| 1880 | M | John | 9655 |
| 1881 | F | Mary | 6919 |
| 1881 | M | John | 8769 |
| 1882 | F | Mary | 8148 |
| ... | ... | ... | ... |
| 2018 | M | Liam | 19924 |
| 2019 | F | Olivia | 18508 |
| 2019 | M | Liam | 20555 |
| 2020 | F | Olivia | 17535 |
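
An alternative that avoids `groupby(...).apply` is to find the row label of the largest count in each group with `idxmax` and then select those rows. This is a sketch on made-up toy data. It is not the solution used above, and when counts tie it keeps the first row in the group.

```python
import pandas as pd

# Toy data; the counts are invented for illustration.
toy = pd.DataFrame({
    "year":  [1880, 1880, 1880, 1880, 1881, 1881],
    "sex":   ["F", "F", "M", "M", "F", "M"],
    "name":  ["Mary", "Anna", "John", "William", "Mary", "John"],
    "count": [70, 26, 96, 95, 69, 87],
})

# Row label of the maximum count within each (year, sex) group.
idx = toy.groupby(["year", "sex"])["count"].idxmax()

# Select those rows and index them by (year, sex).
top = toy.loc[idx].set_index(["year", "sex"])[["name", "count"]]
```

Because this selects existing rows instead of sorting every group, it may also be faster on the full dataset, though that is not measured here.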
