## Software Packages

We will use a wide range of Python software packages.  The following packages are already installed on [ds100.lsit.ucsb.edu](ds100.lsit.ucsb.edu), and we will use them routinely in lectures and homework:

### Linear Algebra

In [2]:
import numpy as np

### Data manipulation

In [3]:
import pandas as pd

### Visualization

In [4]:
import altair as alt 

# Starting with a Question: **Who are you (the students of DS100)?**

This is a pretty vague question, but let's start with the goal of learning something about the students in the class.


## Data Acquisition and Cleaning 

**In DS100 we will study various methods to collect data.**

To answer this question, I downloaded the course roster and extracted everyone's first name and major.  I'll also assign everybody a random number (who will have the largest?!).

In [5]:
students = pd.read_csv("roster.csv", index_col=0)

## Every student gets a random normal 
students['Random Number'] = np.random.randn(students.shape[0])

students.tail()

Name,Major,Random Number
STEPHANIE,STSDS,0.316262
FRANCIE,CMPSC,1.36936
REGINA,CMPSC,0.747675
YU,CMPSC,-0.512013
KEVIN,STSDS,0.865561


In [6]:
students.index

Index(['NICK', 'PRIYANKA', 'MEGAN', 'ANITA', 'ALEC', 'DAVID', 'TINA', 'AARON',
       'SOPHIE', 'ROBERTO', 'ROBIN', 'CALVIN', 'HAN', 'STEPHANIE', 'JASMINE',
       'DANA', 'KARSYN', 'YAQI', 'CHENKAI', 'JIAQI', 'MEIYU', 'ANNE',
       'ADHITYA', 'ZIQING', 'MELINDA', 'BRANDON', 'MISHA', 'ARI', 'LIA',
       'MITCHELL', 'CATHERINE', 'LERON', 'SARAH', 'TAYLOR', 'RYDER', 'SHUYUN',
       'MENG', 'JIANGHUA', 'JUSTIN', 'PUCHUAN', 'QUANSEN', 'JONATHAN',
       'PHILIP', 'JASON', 'LINA', 'EDDIE', 'JESSICA', 'WENXUAN', '#REF',
       'THIHA', 'TRUNG', 'JULIA', 'NATHAN', 'XINQI', 'YUEHAN', 'JINGWEN',
       'REBECCA', 'CAMERON', 'MAX', 'ERAN', 'NATHAN', 'JOSH', 'STEPHANIE',
       'FRANCIE', 'REGINA', 'YU', 'KEVIN'],
      dtype='object', name='Name')

In [7]:
students.columns

Index(['Major', 'Random Number'], dtype='object')

## How do we select individual features (columns)?

There are a few ways.

In [8]:
students['Major'].head()

Name
NICK        STSDS
PRIYANKA    CMPSC
MEGAN       ACTSC
ANITA       ENVST
ALEC        STSDS
Name: Major, dtype: object

---

## Exploratory Data Analysis

**In DS100 we will study exploratory data analysis and practice analyzing new datasets.**


### How many records do we have?
A good starting point is understanding the size of the data. 

#### Solution

In [9]:
print("There are", len(students), "students on the roster.")

There are 67 students on the roster.


### Is my data representative of the population I want to study?

**Answer:**

This is (or at least was) a complete **census** of the class containing all the official students.


### Understanding the structure of data

It is important that we understand the meaning of each field and how the data is organized.

In [10]:
students.sort_index(inplace=True)
students.head()


Name,Major,Random Number
#REF,#REF,-0.805329
AARON,STSDS,1.852837
ADHITYA,STSDS,-0.223945
ALEC,STSDS,0.151044
ANITA,ENVST,0.651284


What is the meaning of the **Major** field?

**Solution** 
We can often understand the meaning of a field by looking at the kind of data it contains, in particular the counts of its unique values.

In [11]:
students['Major'].__class__

pandas.core.series.Series

In [12]:
students['Major'].value_counts().to_frame()

Unnamed: 0,Major
STSDS,30
CMPSC,15
ACTSC,3
STSCI,3
MTHSC,3
CMPEN,2
PRMTH,2
ECON,1
COMM,1
FINMS,1


It appears that one student has an erroneous major, given as "#REF". What else can we learn about this student? Let's look at their name.  We'll find this student using the `loc` indexer from pandas.

In [13]:
students['Major'] == '#REF'

Name
#REF          True
AARON        False
ADHITYA      False
ALEC         False
ANITA        False
ANNE         False
ARI          False
BRANDON      False
CALVIN       False
CAMERON      False
CATHERINE    False
CHENKAI      False
DANA         False
DAVID        False
EDDIE        False
ERAN         False
FRANCIE      False
HAN          False
JASMINE      False
JASON        False
JESSICA      False
JIANGHUA     False
JIAQI        False
JINGWEN      False
JONATHAN     False
JOSH         False
JULIA        False
JUSTIN       False
KARSYN       False
KEVIN        False
             ...  
MENG         False
MISHA        False
MITCHELL     False
NATHAN       False
NATHAN       False
NICK         False
PHILIP       False
PRIYANKA     False
PUCHUAN      False
QUANSEN      False
REBECCA      False
REGINA       False
ROBERTO      False
ROBIN        False
RYDER        False
SARAH        False
SHUYUN       False
SOPHIE       False
STEPHANIE    False
STEPHANIE    False
TAYLOR       False
THIHA  

In [14]:
students.loc[students['Major'] == "#REF"]

Name,Major,Random Number
#REF,#REF,-0.805329


Though this single bad record won't have much of an impact on our analysis, we can clean our data by removing this record.

In [15]:
students = students.loc[students['Major'] != "#REF"]

Let's double check that our record removal only removed the single bad record.

In [16]:
students['Major'].value_counts().to_frame()

Unnamed: 0,Major
STSDS,30
CMPSC,15
ACTSC,3
STSCI,3
MTHSC,3
CMPEN,2
PRMTH,2
ECON,1
COMM,1
FINMS,1
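A check like this can also be written as an explicit assertion. The following is a minimal sketch on a hypothetical four-row roster (the names and majors in `toy` are made up for illustration): it confirms that the filter removed exactly the rows flagged as bad and nothing else.

```python
import pandas as pd

# Hypothetical roster containing one spreadsheet error marker.
toy = pd.DataFrame(
    {"Major": ["STSDS", "#REF", "CMPSC", "STSDS"]},
    index=pd.Index(["AARON", "#REF", "ARI", "ALEC"], name="Name"),
)

bad = toy["Major"] == "#REF"   # boolean mask marking the bad rows
cleaned = toy.loc[~bad]        # keep every row that is not flagged

# Exactly the flagged rows should be gone, and nothing else.
assert len(cleaned) == len(toy) - bad.sum()
assert "#REF" not in cleaned["Major"].values
print(len(toy), "->", len(cleaned))
```

Comparing row counts before and after is a cheap habit that catches filters which accidentally drop more than intended.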


## Selecting Rows
We can select specific rows (observations of a data frame) using either `loc` or `iloc`.  What's the difference?

In [17]:
## Same result

#students['Random Number']

# display(students.iloc[[0, 5, 42]])
# print() ## Empty line
# display(students.loc[["AARON", "ARI", "PHILIP"]])


students
students.iloc[3, 1]

0.651284128711575

`iloc` selects by integer position (where `0` is the first row or column).  `loc` selects by index label, and it can also select rows using a boolean array.
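To make the difference concrete, here is a small sketch on a hypothetical three-row frame (`toy` and its `Score` column are invented for this example). It reads the same cell once by position and once by label, then selects rows with a boolean mask.

```python
import pandas as pd

# Hypothetical three-row roster, used only for illustration.
toy = pd.DataFrame(
    {"Major": ["STSDS", "CMPSC", "ACTSC"], "Score": [1.5, -0.2, 0.7]},
    index=pd.Index(["AARON", "ARI", "MEGAN"], name="Name"),
)

by_position = toy.iloc[1, 0]          # second row, first column
by_label = toy.loc["ARI", "Major"]    # same cell, addressed by labels
mask = toy["Score"] > 0               # boolean Series aligned on the index
positive = toy.loc[mask]              # rows where the mask is True

print(by_position, by_label)
print(positive.index.tolist())
```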

In [18]:
## students['Major'] == "ACTSC"

students.loc[students['Major'] == "ACTSC"]

Name,Major,Random Number
JESSICA,ACTSC,1.165409
JUSTIN,ACTSC,0.0605
MEGAN,ACTSC,-1.250314


Row indices do not have to be unique:

In [19]:
students.loc["STEPHANIE"]

Name,Major,Random Number
STEPHANIE,STSDS,0.316262
STEPHANIE,STSDS,0.588751


In [20]:
students.iloc[[0, 1, 2]]

Name,Major,Random Number
AARON,STSDS,1.852837
ADHITYA,STSDS,-0.223945
ALEC,STSDS,0.151044


In [21]:
students['Random Number']
# Attribute access (students.Random Number) does not work here because the column name contains a space

Name
AARON        1.852837
ADHITYA     -0.223945
ALEC         0.151044
ANITA        0.651284
ANNE        -0.650121
ARI          0.584012
BRANDON      0.417771
CALVIN       0.907917
CAMERON      0.619657
CATHERINE   -0.012277
CHENKAI      0.590429
DANA         0.314005
DAVID       -0.811137
EDDIE        0.020701
ERAN        -0.112329
FRANCIE      1.369360
HAN         -1.365845
JASMINE     -0.766471
JASON        1.160966
JESSICA      1.165409
JIANGHUA     0.367546
JIAQI       -0.195165
JINGWEN     -0.154325
JONATHAN     1.225893
JOSH         0.151901
JULIA        0.673930
JUSTIN       0.060500
KARSYN      -0.652573
KEVIN        0.865561
LERON       -0.830123
               ...   
MENG        -0.002620
MISHA       -0.999615
MITCHELL     0.160247
NATHAN      -0.689485
NATHAN       0.157919
NICK        -0.134492
PHILIP       0.385471
PRIYANKA     0.522488
PUCHUAN     -1.153538
QUANSEN      2.323638
REBECCA     -0.797209
REGINA       0.747675
ROBERTO      0.223539
ROBIN        1.060477
RYDER

### Summarizing the Data

We will often want to summarize the data numerically or visually. The `describe` method provides a brief, high-level description of our data frame.

In [22]:
students.describe()

Unnamed: 0,Random Number
count,66.0
mean,0.166938
std,0.783988
min,-1.481164
25%,-0.272754
50%,0.159083
75%,0.61235
max,2.323638


In [23]:
students.sort_values(by="Random Number", ascending=False).iloc[0:3]

Name,Major,Random Number
QUANSEN,STSDS,2.323638
AARON,STSDS,1.852837
LINA,CMPSC,1.674827


In [24]:
sorted_df = students.sort_values(by="Random Number", ascending=False)
sorted_df.iloc[0:3]

Name,Major,Random Number
QUANSEN,STSDS,2.323638
AARON,STSDS,1.852837
LINA,CMPSC,1.674827


**In DS100 we will deal with many different kinds of data (not just numbers) and we will study techniques for analyzing diverse types of data.**

**How can we summarize the name field?** A good starting point might be to examine the lengths of the strings. 

In [25]:
students.index.str.len()

students["Name Length"] = students.index.str.len() 
students['Name Length']

Name
AARON        5
ADHITYA      7
ALEC         4
ANITA        5
ANNE         4
ARI          3
BRANDON      7
CALVIN       6
CAMERON      7
CATHERINE    9
CHENKAI      7
DANA         4
DAVID        5
EDDIE        5
ERAN         4
FRANCIE      7
HAN          3
JASMINE      7
JASON        5
JESSICA      7
JIANGHUA     8
JIAQI        5
JINGWEN      7
JONATHAN     8
JOSH         4
JULIA        5
JUSTIN       6
KARSYN       6
KEVIN        5
LERON        5
            ..
MENG         4
MISHA        5
MITCHELL     8
NATHAN       6
NATHAN       6
NICK         4
PHILIP       6
PRIYANKA     8
PUCHUAN      7
QUANSEN      7
REBECCA      7
REGINA       6
ROBERTO      7
ROBIN        5
RYDER        5
SARAH        5
SHUYUN       6
SOPHIE       6
STEPHANIE    9
STEPHANIE    9
TAYLOR       6
THIHA        5
TINA         4
TRUNG        5
WENXUAN      7
XINQI        5
YAQI         4
YU           2
YUEHAN       6
ZIQING       6
Name: Name Length, Length: 66, dtype: int64

## Visualizing name length in Altair

In [26]:
# pseudocode: alt.Chart(pandas dataframe).mark_[type of chart]().encode(...)


alt.Chart(students).mark_bar().encode(
     x = "Name Length",
     y = "count()"
)

**In DS100 we will learn a lot about how to visualize data.**

**Does the above plot seem reasonable?  Why might we want to check the lengths of strings?**

**Answer**
Yes, the above plot seems reasonable for name lengths.  We might be concerned if there were 0-letter or even 1-letter names, as these might represent abbreviations or missing entries.
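A check for very short names can also be done directly, without a plot. This is a minimal sketch on a hypothetical list of names, where the empty string and `"J"` stand in for a missing entry and an abbreviation.

```python
import pandas as pd

# Hypothetical names; the empty string and "J" stand in for bad entries.
names = pd.Index(["AARON", "", "J", "YU", "CATHERINE"], name="Name")
lengths = names.str.len()

# Flag names with zero or one characters for manual review.
suspicious = names[lengths <= 1].tolist()
print(suspicious)
```

The cutoff of one character is an assumption; a two-letter name such as `YU` is legitimate, so the threshold should stay conservative.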

In [27]:
students

Name,Major,Random Number,Name Length
AARON,STSDS,1.852837,5
ADHITYA,STSDS,-0.223945,7
ALEC,STSDS,0.151044,4
ANITA,ENVST,0.651284,5
ANNE,MTHSC,-0.650121,4
ARI,CMPSC,0.584012,3
BRANDON,CMPSC,0.417771,7
CALVIN,CMPEN,0.907917,6
CAMERON,PRPBS,0.619657,7
CATHERINE,STSDS,-0.012277,9


In [28]:
students.groupby('Major')





<pandas.core.groupby.generic.DataFrameGroupBy object at 0x119622588>

In [29]:
students.groupby('Major').groups





{'ACTSC': Index(['JESSICA', 'JUSTIN', 'MEGAN'], dtype='object', name='Name'),
 'CHEME': Index(['NATHAN'], dtype='object', name='Name'),
 'CMPEN': Index(['CALVIN', 'MISHA'], dtype='object', name='Name'),
 'CMPSC': Index(['ARI', 'BRANDON', 'EDDIE', 'ERAN', 'FRANCIE', 'JIANGHUA', 'LERON',
        'LINA', 'MAX', 'PRIYANKA', 'REGINA', 'ROBERTO', 'SOPHIE', 'TRUNG',
        'YU'],
       dtype='object', name='Name'),
 'COMM ': Index(['JASMINE'], dtype='object', name='Name'),
 'ECON ': Index(['PUCHUAN'], dtype='object', name='Name'),
 'ENVST': Index(['ANITA'], dtype='object', name='Name'),
 'FINMS': Index(['DANA'], dtype='object', name='Name'),
 'MATCS': Index(['LIA'], dtype='object', name='Name'),
 'MTHSC': Index(['ANNE', 'JOSH', 'MITCHELL'], dtype='object', name='Name'),
 'PRBIO': Index(['HAN'], dtype='object', name='Name'),
 'PRMTH': Index(['JONATHAN', 'NATHAN'], dtype='object', name='Name'),
 'PRPBS': Index(['CAMERON'], dtype='object', name='Name'),
 'STSCI': Index(['DAVID', 'JINGWEN', 'XI

In [30]:
actsc_indices = students.groupby('Major').groups['ACTSC']
actsc_indices

students.loc[actsc_indices]

Name,Major,Random Number,Name Length
JESSICA,ACTSC,1.165409,7
JUSTIN,ACTSC,0.0605,6
MEGAN,ACTSC,-1.250314,5


A shorter version of the above is:

In [31]:
students.groupby("Major").get_group("ACTSC")
students["Major"].value_counts()

STSDS    30
CMPSC    15
ACTSC     3
STSCI     3
MTHSC     3
CMPEN     2
PRMTH     2
ECON      1
COMM      1
FINMS     1
PRBIO     1
CHEME     1
PRPBS     1
ENVST     1
MATCS     1
Name: Major, dtype: int64

What if I wanted to know the average of the random numbers within each major? 

- Which majors are likely to have the largest mean random number?

- Which majors are not likely to have the largest mean random number?




In [32]:
students.groupby('Major').agg(np.mean)[['Random Number']].sort_values(by="Random Number")

Major,Random Number
PRBIO,-1.365845
ECON,-1.153538
COMM,-0.766471
CHEME,-0.689485
STSCI,-0.408556
MTHSC,-0.112657
CMPEN,-0.045849
ACTSC,-0.008135
STSDS,0.23676
FINMS,0.314005


In [33]:
students.groupby('Major').mean()# [['Random Number']]  ## .mean().sort_values(by="Random Number")

Major,Random Number,Name Length
ACTSC,-0.008135,6.0
CHEME,-0.689485,6.0
CMPEN,-0.045849,5.5
CMPSC,0.417284,5.333333
COMM,-0.766471,7.0
ECON,-1.153538,7.0
ENVST,0.651284,5.0
FINMS,0.314005,4.0
MATCS,0.342124,3.0
MTHSC,-0.112657,5.333333


<div class="inner_cell">
<div class="text_cell_render border-box-sizing rendered_html">
<p>The following table summarizes some other built-in Pandas aggregations:</p>
<table>
<thead><tr>
<th>Aggregation</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>count()</code></td>
<td>Total number of items</td>
</tr>
<tr>
<td><code>first()</code>, <code>last()</code></td>
<td>First and last item</td>
</tr>
<tr>
<td><code>mean()</code>, <code>median()</code></td>
<td>Mean and median</td>
</tr>
<tr>
<td><code>min()</code>, <code>max()</code></td>
<td>Minimum and maximum</td>
</tr>
<tr>
<td><code>std()</code>, <code>var()</code></td>
<td>Standard deviation and variance</td>
</tr>
<tr>
<td><code>mad()</code></td>
<td>Mean absolute deviation (removed in pandas 2.0)</td>
</tr>
<tr>
<td><code>prod()</code></td>
<td>Product of all items</td>
</tr>
<tr>
<td><code>sum()</code></td>
<td>Sum of all items</td>
</tr>
</tbody>
</table>
<p>These are all methods of <code>DataFrame</code> and <code>Series</code> objects.</p>

</div>
</div>
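Several of these aggregations can be requested at once by passing their names as strings to `agg`. The following sketch uses a hypothetical five-row frame shaped like the roster.

```python
import pandas as pd

# Hypothetical data in the same shape as the roster.
toy = pd.DataFrame({
    "Major": ["STSDS", "STSDS", "CMPSC", "CMPSC", "CMPSC"],
    "Name Length": [5, 7, 3, 7, 2],
})

# One row per major, one column per requested statistic.
summary = toy.groupby("Major")["Name Length"].agg(["count", "min", "max", "mean"])
print(summary)
```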

In [34]:
students.groupby('Major').max().sort_values(by="Random Number")

Major,Random Number,Name Length
PRBIO,-1.365845,3
ECON,-1.153538,7
COMM,-0.766471,7
CHEME,-0.689485,6
STSCI,-0.154325,7
MTHSC,0.160247,8
FINMS,0.314005,4
MATCS,0.342124,3
PRPBS,0.619657,7
ENVST,0.651284,5


## Manually iterating over groups

In [35]:
for (major, group_df) in students.groupby('Major'):
    print("{0} {1}".format(major, len(group_df)))


ACTSC 3
CHEME 1
CMPEN 2
CMPSC 15
COMM  1
ECON  1
ENVST 1
FINMS 1
MATCS 1
MTHSC 3
PRBIO 1
PRMTH 2
PRPBS 1
STSCI 3
STSDS 30


## Multi-column Indices

In [36]:
students_agg = students.groupby('Major').agg([np.mean, np.median, np.max] )


students_agg.columns

## students_agg["Random Number"][["min"]]

MultiIndex(levels=[['Random Number', 'Name Length'], ['mean', 'median', 'amax']],
           codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [37]:
students_agg.columns

MultiIndex(levels=[['Random Number', 'Name Length'], ['mean', 'median', 'amax']],
           codes=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [38]:
students_multi = students.groupby('Major').agg({"Random Number" : np.mean, "Name Length" : [np.min, np.max]})

students_multi["Name Length"][["amin"]]

#students.groupby('Major').aggregate({"Random Number" : "mean", "Name Length" : ["min", "max", lambda x: max(x) - min(x)]})

Major,amin
ACTSC,5
CHEME,6
CMPEN,5
CMPSC,2
COMM,7
ECON,7
ENVST,5
FINMS,4
MATCS,3
MTHSC,4


<br/><br/><br/><br/> 

---

# What does a name tell us about a person?

Most people don't pick their own names, but names can nonetheless say a lot about us.

**Question: What information might a name reveal about a person?**

Here are some examples we will explore in this lecture:
1. Gender
1. Age



---

<br/><br/><br/>

# Obtaining More Data

To study what a name tells us about a person, we will download data from the United States Social Security Administration containing the number of registered names broken down by **year**, **sex**, and **name**.  This is often called the baby names data because Social Security numbers are typically assigned at birth.

Note: In the following we download the data programmatically to ensure that the process is reproducible.

In [39]:
import urllib.request
import os.path

data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

The data is organized into separate files in the format `yobYYYY.txt` with each file containing the `name`, `sex`, and `count` of babies registered in that year.

## Loading the Data

Note: In the following we load the data directly into Python without decompressing the zip file.

**In DS100 we will think a bit more about how we can be efficient in our data analysis to support processing large datasets.**

In [40]:
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)
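The same loading pattern can be tried without downloading anything. The sketch below builds a tiny in-memory archive with made-up counts (not real SSA data) and reads it the same way: filter for `.txt` members, take the year from characters 3 to 6 of the filename, and concatenate. Unlike the cell above, it passes `ignore_index=True` so the combined frame gets a fresh 0..n-1 index; that is a choice for this example, not something the lecture code does.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory archive with made-up counts (not real SSA data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("yob1880.txt", "Mary,F,7\nJohn,M,9\n")
    zf.writestr("yob1881.txt", "Mary,F,6\n")
    zf.writestr("README.pdf", "not a data file")

frames = []
with zipfile.ZipFile(buffer, "r") as zf:
    for info in zf.infolist():
        if not info.filename.endswith(".txt"):
            continue                      # skip documentation files
        year = int(info.filename[3:7])    # "yob1880.txt" -> 1880
        with zf.open(info) as fp:
            frame = pd.read_csv(fp, names=["Name", "Sex", "Count"])
        frame["Year"] = year
        frames.append(frame)

toy_babynames = pd.concat(frames, ignore_index=True)
print(toy_babynames)
```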


---
<br/><br/><br/>

## Understanding the Setting

Reading from the [Social Security Administration's description](https://www.ssa.gov/oact/babynames/background.html):

    All names are from Social Security card applications for births that occurred in the United States after 1879. Note  that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

    All data are from a 100% sample of our records on Social Security card applications as of March 2017.

Note: the sex data we are using is based on classification of a person as male or female at birth. Infants are assigned a sex, usually based on the appearance of their external anatomy.


---
<br/><br/><br/>

## Data Cleaning 

Examining the data:

In [41]:
babynames.head()

Unnamed: 0,Name,Sex,Count,Year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880


In [42]:
babynames.tail()

Unnamed: 0,Name,Sex,Count,Year
32028,Zylas,M,5,2018
32029,Zyran,M,5,2018
32030,Zyrie,M,5,2018
32031,Zyron,M,5,2018
32032,Zzyzx,M,5,2018


To make name comparisons case-insensitive, we convert all names to lower case:

In [43]:
babynames['Name'] = babynames['Name'].str.lower()
babynames.head()

Unnamed: 0,Name,Sex,Count,Year
0,mary,F,7065,1880
1,anna,F,2604,1880
2,emma,F,2003,1880
3,elizabeth,F,1939,1880
4,minnie,F,1746,1880


---
<br/><br/><br/>

## Exploratory Data Analysis

How many people does this data represent?

In [44]:
format(babynames['Count'].sum(), ',d') 

'351,653,025'

In [45]:
len(babynames)

1957046

**Is this number low or high?**

**Answer**

It seems low. However, the Social Security website states:

    All names are from Social Security card applications for births that occurred in the United States after 1879. **Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data.** For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.

DataFrames have a `query()` method that filters rows using a condition written as a string. We can use it to find names that match our desired conditions.

In [46]:
babynames.query("Name == 'vela' & Sex == 'F'")

Unnamed: 0,Name,Sex,Count,Year
1263,vela,F,6,1888
1517,vela,F,5,1890
1640,vela,F,5,1893
1695,vela,F,5,1894
1795,vela,F,5,1895
1808,vela,F,5,1896
1249,vela,F,8,1897
1164,vela,F,10,1898
1825,vela,F,5,1899
1220,vela,F,11,1900


In [47]:
#%timeit babynames.loc[babynames['Name'].str.contains("data")]

babynames.query('Name.str.contains("data")', engine='python')

Unnamed: 0,Name,Sex,Count,Year
9760,kidata,F,5,1975
24915,datavion,M,5,1995
23609,datavious,M,7,1997
12100,datavia,F,7,2000
27502,datavion,M,6,2001
28908,datari,M,5,2001
29135,datavian,M,5,2002
29136,datavious,M,5,2002
30570,datavion,M,5,2004
17135,datavia,F,5,2005


In [48]:
babynames.query('Name.str.endswith("ue") and Count > 50', engine='python')

Unnamed: 0,Name,Sex,Count,Year
189,sue,F,65,1880
185,sue,F,67,1881
171,sue,F,84,1882
216,sue,F,68,1883
194,sue,F,92,1884
194,sue,F,94,1885
202,sue,F,96,1886
181,sue,F,123,1887
211,sue,F,117,1888
219,sue,F,112,1889


---
<br/><br/><br/>

### Temporal Patterns Conditioned on Gender

**In DS100 we will study how to visualize and analyze relationships in data.**

In this example we aggregate the number of babies registered in each year by `Sex` using `groupby`.

In [49]:
# Select "Count" before summing so that the string column "Name" is not aggregated
baby_counts = babynames.groupby(["Year", "Sex"], as_index=False)["Count"].sum()
baby_counts.head(10)


Unnamed: 0,Year,Sex,Count
0,1880,F,90994
1,1880,M,110490
2,1881,F,91953
3,1881,M,100743
4,1882,F,107847
5,1882,M,113686
6,1883,F,112319
7,1883,M,104625
8,1884,F,129019
9,1884,M,114442


In [50]:
baby_counts.tail(10)


Unnamed: 0,Year,Sex,Count
268,2014,F,1782350
269,2014,M,1916564
270,2015,F,1780453
271,2015,M,1911537
272,2016,F,1766212
273,2016,M,1891585
274,2017,F,1719138
275,2017,M,1842837
276,2018,F,1686961
277,2018,M,1800392


We can visualize these descriptive statistics:

In [51]:
    
alt.Chart(baby_counts).mark_line().encode(
        x = 'Year',
        y = 'Count',
        color = 'Sex'
)

There currently seem to be more male babies than female babies. Why is this?

**Some observations:**
1. Registration data seems limited in the early 1900s because many people born before 1937 never applied for a Social Security card.
1. You can see the [baby boomers](https://www.wikiwand.com/en/Baby_boomers) and the echo boom.
1. Females have greater diversity of names.



# Estimating the Sex of a Baby Given its Name with Pivot Tables

We can use the baby names dataset to compute the total number of babies with each name broken down by Sex.  


Questions to ask:
- What are the ethical issues surrounding building a "machine learning" tool for predicting sex?
- How could such a tool be used? For good? For bad?
- Who are we leaving out when fitting this model? 
- What are some possible solutions?



### Babynames is Tidy!

In [160]:
babynames.head(10)

Unnamed: 0,Name,Sex,Count,Year
0,mary,F,7065,1880
1,anna,F,2604,1880
2,emma,F,2003,1880
3,elizabeth,F,1939,1880
4,minnie,F,1746,1880
5,margaret,F,1578,1880
6,ida,F,1472,1880
7,alice,F,1414,1880
8,bertha,F,1320,1880
9,sarah,F,1288,1880


In [162]:
grouped_babynames = babynames.groupby(['Name', 'Sex']).aggregate({'Count' : sum})

grouped_babynames.head(10)
# grouped_babynames.loc[["alex", "kate"]]

Name,Sex,Count
aaban,M,114
aabha,F,35
aabid,M,16
aabidah,F,5
aabir,M,10
aabriella,F,38
aada,F,13
aadam,M,273
aadan,M,130
aadarsh,M,209


### This is not tidy!!!

In [170]:
sex_counts = pd.pivot_table(babynames, index='Name', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
sex_counts.head(10)

sex_counts.tail(10)


#sex_counts = pd.pivot_table(babynames, index='Name', columns='Year',
#                            aggfunc='sum', fill_value=0., margins=True)
## sex_counts.head()

Name,F,M,All
zytavion,0,5,5
zytavious,0,43,43
zyus,0,11,11
zyva,23,0,23
zyvion,0,5,5
zyvon,0,7,7
zyyanna,6,0,6
zyyon,0,6,6
zzyzx,0,10,10
All,174079232,177573793,351653025


In [142]:
sex_counts = pd.pivot_table(babynames, index='Name', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
sex_counts.head()

Name,F,M,All
aaban,0,114,114
aabha,35,0,35
aabid,0,16,16
aabidah,5,0,5
aabir,0,10,10


### Melt to get back to tidy form

In [171]:
sex_counts.reset_index().melt(id_vars="Name")

Unnamed: 0,Name,Sex,value
0,aaban,F,0
1,aabha,F,35
2,aabid,F,0
3,aabidah,F,5
4,aabir,F,0
5,aabriella,F,38
6,aada,F,13
7,aadam,F,0
8,aadan,F,0
9,aadarsh,F,0


For example, let's see the most common names for which more than 90% of the registered babies are female.

In [173]:
sex_counts.query("F / All > 0.9").sort_values(by="All", ascending=False).head()

Name,F,M,All
mary,4125675,15165,4140840
elizabeth,1638349,5181,1643530
patricia,1572016,4962,1576978
jennifer,1467207,4837,1472044
linda,1452668,3757,1456425


For each name we would like to estimate the **probability** that the baby is `Female`. 

$$ \Large
\hat{\textbf{P}\hspace{0pt}}(\texttt{Female} \,\,\, | \,\,\, \texttt{Name} ) = \frac{\textbf{Count}(\texttt{Female and Name})}{\textbf{Count}(\texttt{Name})}
$$

The ^ ("hat") symbol indicates that this is an estimate of the probability.  We can calculate this estimate from the data:

In [174]:
prob_female = sex_counts['F'] / sex_counts['All'] 
prob_female.head(10)

Name
aaban        0.0
aabha        1.0
aabid        0.0
aabidah      1.0
aabir        0.0
aabriella    1.0
aada         1.0
aadam        0.0
aadan        0.0
aadarsh      0.0
dtype: float64

**Checking a few names:**

In [176]:
prob_female['audi']

0.6

In [177]:
prob_female["josh"]

0.0

In [178]:
prob_female["deborah"]

0.9977652472372044

In [179]:
prob_female["alex"]

0.03177332987481834

In [60]:
prob_female["sarah"]

0.9969234241567629

We can define a function to return the most likely `Sex` for a name. If there is an exact tie, the function returns `'F'` (because the comparison is `>= 0.5`). If the name does not appear in the Social Security dataset, it returns `"Unknown"`.

In [180]:
def sex_from_name(name):
    lower_name = name.lower()
    if lower_name in prob_female.index:
        return 'F' if prob_female[lower_name] >= 0.5 else 'M'
    else:
        return "Unknown"

In [181]:
# alex is estimated to be female with probability of about 0.03, so this returns 'M'
sex_from_name("alex")

'M'
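The three possible outcomes, including the tie case, can be checked on a toy lookup. The probabilities below are hypothetical, and `toy_sex_from_name` is a stand-in that applies the same `>= 0.5` rule to the toy Series rather than to `prob_female`.

```python
import pandas as pd

# Hypothetical probabilities; "sam" is an exact tie.
toy_prob_female = pd.Series({"mary": 0.99, "josh": 0.0, "sam": 0.5})

def toy_sex_from_name(name, prob=toy_prob_female):
    """Same rule as `sex_from_name`, applied to the toy Series."""
    key = name.lower()
    if key not in prob.index:
        return "Unknown"
    return "F" if prob[key] >= 0.5 else "M"

results = {n: toy_sex_from_name(n) for n in ["Mary", "JOSH", "Sam", "Quansen"]}
print(results)
```

With `>=`, the tie for `sam` resolves to `'F'`; changing the comparison to `>` would resolve it to `'M'` instead.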

## Estimating the fraction of Females in DS100


### Can we use the Baby Names data?

What fraction of the student names are in the baby names database?

In [182]:
## Intersect class names with ssn names
common_names = students.index.str.lower().intersection(prob_female.index)
print("Fraction of names in the babynames data:" , len(common_names) / len(students))

Fraction of names in the babynames data: 0.803030303030303


In [183]:
#names of students in ds100 that are not in the social security database
missing_names = students.index.str.lower().difference(prob_female.index)
missing_names.tolist()

['chenkai',
 'jianghua',
 'jingwen',
 'meiyu',
 'puchuan',
 'quansen',
 'shuyun',
 'thiha',
 'wenxuan',
 'xinqi',
 'yaqi',
 'yuehan',
 'ziqing']

## Applying the `sex_from_name` function to all students

We can apply the `sex_from_name` function to all students in the class to estimate the number of male and female students.

In [184]:
count_by_sex = students.reset_index()['Name'].str.lower().apply(sex_from_name).value_counts().to_frame()
count_by_sex


Unnamed: 0,Name
M,28
F,25
Unknown,13


Using the above, we can estimate the fraction of female students in the class:

In [185]:
count_by_sex.loc['F'] / (count_by_sex.loc['M'] + count_by_sex.loc['F'])

Name    0.471698
dtype: float64

1. **How do we feel about this estimate?** 
1. **Do we trust it?**

## Using simulation to estimate uncertainty

Below we build a primitive estimate of the uncertainty in our model by simulating the `Sex` of each student according to the baby names dataset.

In [186]:
## From above: names = pd.Index(students["Name"]).intersection(prob_female.index)
ds100_prob_female = prob_female.loc[common_names]
ds100_prob_female.tail()

Name
stephanie    0.996449
taylor       0.743777
tina         0.996906
trung        0.000000
yu           0.491459
dtype: float64

In [189]:
one_simulation = np.random.rand(len(ds100_prob_female)) < ds100_prob_female
one_simulation.tail()

Name
stephanie     True
taylor       False
tina          True
trung        False
yu           False
dtype: bool

Given such a simulation, we can estimate the fraction of the class that is female by taking the mean of the boolean values (that is, the fraction that are True).

In [190]:
one_simulation = np.random.rand(len(ds100_prob_female)) < ds100_prob_female
np.mean(one_simulation)

0.4339622641509434

In [191]:
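# One simulation: draw a sex for each student whose name is in the baby names data
# (the `students` argument is unused; the function relies on `common_names`)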
def simulate_class(students):
    ds100_prob_female = prob_female.loc[common_names]
    is_female = np.random.rand(len(ds100_prob_female)) < ds100_prob_female
    return np.mean(is_female)

fraction_female_simulations = np.array([simulate_class(students) for n in range(1000)])

In [192]:
frac_female_df = pd.DataFrame({"frac_female" : fraction_female_simulations})

alt.Chart(frac_female_df).mark_bar().encode(
    x=alt.X("frac_female:Q", bin=True, title="Fraction classified as female"),
    y = alt.Y("count()", title="Number of Simulations")
)
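The spread of the simulated fractions can also be summarized numerically. The following sketch uses hypothetical per-student probabilities (`toy_probs`) instead of the class data, repeats the simulation 1000 times, and reports the middle 95% of the simulated fractions using percentiles. This is one simple way to describe the uncertainty, not the only one.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for repeatability

# Hypothetical per-student probabilities of being female.
toy_probs = np.array([0.99, 0.02, 0.75, 0.5, 0.0, 1.0, 0.97, 0.03])

def simulate_once(probs):
    """Draw one simulated class and return its fraction female."""
    return np.mean(rng.random(len(probs)) < probs)

sims = np.array([simulate_once(toy_probs) for _ in range(1000)])

# Middle 95% of the simulated fractions.
low, high = np.percentile(sims, [2.5, 97.5])
print(sims.mean(), low, high)
```

The interval only reflects uncertainty from names that appear in the lookup; students whose names are missing are not represented at all.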

## Distribution of a Name over Time

We want to estimate the probability of being born in a particular year given someone's name.  To construct this estimate for each name we need to compute the number of babies born each year with that name. 

In the following code block we use the `pivot_table` method to compute the total number of babies born with a given name in each year. We use `fillna` so that names that do not occur in a year are recorded as occurring 0 times instead of as NaN.

In [196]:
name_year_pivot = babynames.pivot_table( 
        index=['Year'], columns=['Name'], values='Count', aggfunc=np.sum).fillna(0.0)
name_year_pivot.tail()

Year,aaban,aabha,aabid,aabidah,aabir,aabriella,aada,aadam,aadan,aadarsh,...,zytaveon,zytavion,zytavious,zyus,zyva,zyvion,zyvon,zyyanna,zyyon,zzyzx
2014,16.0,9.0,0.0,0.0,0.0,5.0,0.0,19.0,8.0,18.0,...,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0,0.0
2015,15.0,7.0,0.0,0.0,0.0,5.0,5.0,22.0,10.0,15.0,...,0.0,0.0,0.0,5.0,0.0,0.0,7.0,0.0,0.0,0.0
2016,9.0,7.0,5.0,0.0,5.0,11.0,0.0,18.0,0.0,11.0,...,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0
2017,11.0,0.0,0.0,0.0,0.0,6.0,0.0,18.0,8.0,15.0,...,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0
2018,7.0,0.0,6.0,5.0,5.0,6.0,8.0,19.0,0.0,10.0,...,0.0,0.0,0.0,6.0,6.0,0.0,0.0,0.0,0.0,5.0


In [197]:
name_year_pivot['alex'].to_frame().tail()

Year,alex
2014,3341.0
2015,3289.0
2016,3017.0
2017,2781.0
2018,2673.0


To estimate the probability of being born in a year given the name we need to compute: 

$$ \Large
\hat{\textbf{P}\hspace{0pt}}(\texttt{Year} \,\,\, | \,\,\, \texttt{Name} ) = \frac{\textbf{Count}(\texttt{Year and Name})}{\textbf{Count}(\texttt{Name})}
$$

In [198]:
prob_year_given_name = name_year_pivot.div(name_year_pivot.sum()).fillna(0.0)
prob_year_given_name.head()

Year,aaban,aabha,aabid,aabidah,aabir,aabriella,aada,aadam,aadan,aadarsh,...,zytaveon,zytavion,zytavious,zyus,zyva,zyvion,zyvon,zyyanna,zyyon,zzyzx
1880,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1881,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1882,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1883,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1884,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [200]:
prob_year_given_name['alex'].to_frame().tail(40)

Year,alex
1979,0.005945
1980,0.00608
1981,0.006769
1982,0.007149
1983,0.00776
1984,0.011054
1985,0.014174
1986,0.018665
1987,0.022215
1988,0.023715


## Visualizing the $\hat{\textbf{P}\hspace{0pt}}(\texttt{Year} \,\,\, | \,\,\, \texttt{Name} )$

In the following we visualize the probability of names over time:

In [108]:
melted_frame = prob_year_given_name[["alexander", "kate", "franky", "brian"]].reset_index().melt(id_vars='Year')
melted_frame


Unnamed: 0,Year,Name,value
0,1880,alexander,0.000311
1,1881,alexander,0.000308
2,1882,alexander,0.000332
3,1883,alexander,0.000276
4,1884,alexander,0.000348
5,1885,alexander,0.000303
6,1886,alexander,0.000304
7,1887,alexander,0.000266
8,1888,alexander,0.000350
9,1889,alexander,0.000273


In [109]:
alt.Chart(melted_frame).mark_line().encode(
    x = 'Year:O',
    y= 'value',
    color='Name'
).properties( width=1000, height=400 ) 

Notice that some names are spread over time while others have more specific times when they were popular.

---
<br/><br/><br/>

## Trying more Contemporary Names

We can also examine some more contemporary names.

In [202]:
## melted_frame = prob_year_given_name[["britney", "shiloh"]].reset_index().melt(id_vars='Year').query('Year > 1975')

## melted_frame = prob_year_given_name[["kanye", "khaleesi"]].reset_index().melt(id_vars='Year').query('Year > 1975')


melted_frame = prob_year_given_name[["malia", "barack", "sasha"]].reset_index().melt(id_vars='Year').query('Year > 1975')


alt.Chart(melted_frame).mark_line().encode(
    x = 'Year:O',
    y= 'value',
    color='Name'
).properties(width=1000, height=500)


Plot 1 (`britney`, `shiloh`):
- Britney Spears's first album was released in 1999.
- Shiloh is the name of celebrity couple Brad Pitt and Angelina Jolie's child, born in 2006.


Plot 2 (`kanye`, `khaleesi`):
- Notice that Kanye was popular in 2004.
- The HBO version of Game of Thrones launched in 2011 (though the book series began in 1996).

Plot 3 (`malia`, `barack`, `sasha`):
- Barack Obama was elected president in 2008 and took office in January 2009.  He has two daughters, Sasha and Malia.  Note the uptick in these names in 2009.