# Lecture 1 – Data 100, Spring 2021

by Joseph E. Gonzalez

adapted from Anthony D. Joseph, Josh Hug, Suraj Rampure

## Simple Questions about the Class

1. How many students do we have?
1. What are their majors?
1. What year are they?
1. Diversity ...?

In [None]:
import pandas as pd
import numpy as np

## Plotly plotting support
import plotly.offline as py
py.init_notebook_mode()
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px

## Load and clean the roster

In [None]:
names = pd.read_csv("data/names.csv")
majors = pd.read_csv("data/majors.csv")

## Peeking at the Data

In [None]:
names.head()

In [None]:
names["Name"] = names["Name"].str.lower()

In [None]:
names.head()

In [None]:
majors.head(20)

### How many students do we have?

In [None]:
names.describe()

### What are their Majors?

In [None]:
majors.describe()

What are the top majors:

In [None]:
majors["Majors"].value_counts().sort_values().tail(20)
# majors["Majors"].str.split(",").explode().value_counts().sort_values().tail(20)
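
The commented-out line above counts each major separately for students who list more than one. Here is a minimal sketch of that idea on a made-up table. The `toy_majors` frame and its values are hypothetical, and the sketch assumes multiple majors are stored as one comma-separated string; the extra `.str.strip()` is an addition to handle the space after each comma.

```python
import pandas as pd

# Hypothetical roster: the second student has declared two majors.
toy_majors = pd.DataFrame({
    "Majors": ["Data Science BA", "Data Science BA, Economics BA", "Economics BA"]
})

# Counting the raw strings treats the double major as its own category.
raw_counts = toy_majors["Majors"].value_counts()

# Splitting on commas and exploding gives one row per (student, major) pair.
per_major_counts = (
    toy_majors["Majors"]
    .str.split(",")
    .explode()
    .str.strip()   # remove the space left after each comma
    .value_counts()
)
print(per_major_counts)
```

With the raw strings we would see three categories; after exploding, each major is counted once per student who declared it.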

### We will often use visualizations to make sense of data

In [None]:
fig = px.bar(majors["Majors"].value_counts().sort_values().tail(20),
             orientation="h")
fig.update_layout(dict(showlegend=False, xaxis_title="Count", yaxis_title="Major"))

### What Year are they?

In [None]:
fig = px.bar(majors["Terms in Attendance"].value_counts())
fig.update_layout(xaxis_title="Term", yaxis_title="Count", showlegend=False)

<br/><br/><br/><br/><br/><br/><br/><br/>

---


## Diversity and Data Science:

Unfortunately, surveys of data scientists suggest that there are far fewer women than men in data science:

<img src="images/kaggle_gender_data.png" width="400px" />

To learn more, check out the [Kaggle Executive Summary](https://www.kaggle.com/kaggle-survey-2019) or study the [Raw Data](https://www.kaggle.com/c/kaggle-survey-2019).


<br/><br/><br/><br/><br/><br/><br/><br/>


---

## What fraction of the students are female?

I get asked this question a lot as we try to improve the data science program at Berkeley.

This is a fairly complex question. What do we mean by female? Is this a question about the **sex** or **gender identity** of the students? **They are not the same thing.**

* **Sex** refers predominantly to biological characteristics. 
* **Gender** is much more complex with societal and cultural implications and refers to how people identify themselves.  

Most likely, my colleagues are interested in improving **gender diversity**, by ensuring that our program is inclusive.



<br/><br/><br/>

### How could we answer this question?

<br/><br/><br/><br/><br/>


In [None]:
print(majors.columns)
print(names.columns)

<br/><br/><br/><br/><br/><br/>

---

### We don't have the data.

Where can we get the data?

<br/><br/><br/><br/><br/>

---

### (1) We could run a survey!

<br/><br/><br/><br/><br/><br/><br/>

### (2) ... or we could try to use the data we have to estimate the _sex_ of the students as a proxy for gender.  

What I am about to do is flawed in many ways, and we will discuss these flaws in a moment and throughout the semester. However, it will illustrate some very basic inferential modeling and how we might combine multiple data sources to reason about something we haven't measured.


<br/><br/><br/><br/><br/><br/><br/>

---

### US Social Security Data

Public dataset containing baby names and their **sex**.

### Understanding the Setting

**In Data 100 you will have to learn about different data sources (and their limitations) on your own.**

Reading from the [Social Security Administration's description](https://www.ssa.gov/oact/babynames/background.html):


> All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

> To safeguard privacy, we exclude from our tabulated lists of names those that would indicate, or would allow the ability to determine, names with fewer than 5 occurrences in any geographic area. If a name has less than 5 occurrences for a year of birth in any state, the sum of the state counts for that year will be less than the national count.

> All data are from a 100% sample of our records on Social Security card applications as of March 2020.




### Get data programmatically

In [None]:
import urllib.request
import os.path

# Download data from the web directly
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

        
# Load data without unzipping the file
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)


babynames.head()

A little bit of data cleaning:

In [None]:
babynames['Name'] = babynames['Name'].str.lower()
babynames.tail()

## Exploratory Data Analysis

How many people does this data represent?

In [None]:
format(babynames['Count'].sum(), ',d')

In [None]:
format(babynames.shape[0], ',d')

Trying a simple query:

In [None]:
babynames[(babynames['Name'] == 'nora') & (babynames['Year'] == 2018)]
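
The query above combines two boolean masks with `&`. A small sketch of how that works, using a hypothetical `toy` table rather than the real data, alongside the equivalent `DataFrame.query` form:

```python
import pandas as pd

# Hypothetical miniature version of the babynames table.
toy = pd.DataFrame({
    "Name": ["nora", "nora", "liam"],
    "Sex": ["F", "F", "M"],
    "Count": [10, 12, 20],
    "Year": [2017, 2018, 2018],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than ==.
mask = (toy["Name"] == "nora") & (toy["Year"] == 2018)
by_mask = toy[mask]

# DataFrame.query expresses the same filter as a string.
by_query = toy.query("Name == 'nora' and Year == 2018")
print(by_mask)
```

Both forms select the same single row; which one to use is mostly a matter of taste.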

Let's use this data to estimate the fraction of female students in the class.

### Proportion of Male and Female Individuals Over Time

In this example we construct a **pivot table** which aggregates the number of babies registered for each year by `Sex`.

In [None]:
year_sex = pd.pivot_table(babynames, 
        index=['Year'], # the row index
        columns=['Sex'], # the column values
        values='Count', # the field(s) to be processed in each group
        aggfunc='sum', # how to combine the values within each group
    )

year_sex.head()
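
If pivot tables are new to you, it may help to see that this one is roughly a group-by on `Year` and `Sex` followed by moving `Sex` into the columns. A minimal sketch on a hypothetical `toy` table (the names and counts are made up):

```python
import pandas as pd

# Hypothetical miniature version of the babynames table.
toy = pd.DataFrame({
    "Name": ["anna", "john", "anna", "john", "mary"],
    "Sex": ["F", "M", "F", "M", "F"],
    "Count": [5, 7, 6, 8, 4],
    "Year": [1900, 1900, 1901, 1901, 1901],
})

# One row per year, one column per sex, each cell the total count.
pivoted = pd.pivot_table(toy, index="Year", columns="Sex",
                         values="Count", aggfunc="sum")

# The same table built by grouping and then unstacking the Sex level.
grouped = toy.groupby(["Year", "Sex"])["Count"].sum().unstack("Sex")
print(pivoted)
```

In 1901 the two female rows (6 and 4) are summed into a single cell, which is exactly the aggregation step.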

In [None]:
px.line(year_sex)

### How many unique names for each year?

In [None]:
year_sex_unique = pd.pivot_table(babynames, 
        index=['Year'], 
        columns=['Sex'], 
        values='Name', 
        aggfunc=lambda x: len(np.unique(x)),
    )
px.line(year_sex_unique)

**Some observations:**
1. Registration data seems limited in the early 1900s because many people born before 1937 never applied for a Social Security card.
1. You can see the [baby boomers](https://www.wikiwand.com/en/Baby_boomers) and the echo boom.
1. Female babies are given a greater diversity of names.

## Computing the Proportion of Female Babies For Each Name

In [None]:
name_sex = pd.pivot_table(babynames, index='Name', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
name_sex.head()

Compute proportion of female babies given each name.

In [None]:
prop_female = (name_sex['F'] / name_sex['All']).rename("Prop. Female")
prop_female.head(10)
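
To see what `margins=True` and `fill_value` contribute, here is a small sketch on hypothetical counts. The `toy` table and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical counts: "avery" appears under both sexes, the others under one.
toy = pd.DataFrame({
    "Name": ["avery", "avery", "nora", "andrew"],
    "Sex": ["F", "M", "F", "M"],
    "Count": [30, 70, 50, 40],
})

# margins=True adds an "All" column (and an "All" row) holding the totals;
# fill_value=0 replaces missing (name, sex) combinations with zero.
toy_name_sex = pd.pivot_table(toy, index="Name", columns="Sex", values="Count",
                              aggfunc="sum", fill_value=0, margins=True)

# Proportion of babies with each name that were recorded as female.
toy_prop_female = toy_name_sex["F"] / toy_name_sex["All"]
print(toy_prop_female)
```

Note that the margins row is itself labeled `All`, so the resulting series has one extra entry that is not a name.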

### Testing a few names

In [None]:
prop_female['joey']

In [None]:
prop_female['andrew']

In [None]:
prop_female['avery']

In [None]:
prop_female["min"]

In [None]:
prop_female["pat"]

### Build Simple Classifier (Model)

We can define a function to return the most likely `Sex` for a name. If there is an exact tie or the name does not appear in the Social Security dataset, the function returns `Unknown`.

In [None]:
def sex_from_name(name):
    lower_name = name.lower()
    if lower_name not in prop_female.index or prop_female[lower_name] == 0.5:
        return "Unknown"
    elif prop_female[lower_name] > 0.5:
        return "F"
    else:
        return "M"
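
The same decision rule can also be applied to a whole column at once. The sketch below is a vectorized alternative, not the function used in this lecture; `toy_prop_female` and `toy_names` are hypothetical stand-ins for the real tables.

```python
import numpy as np
import pandas as pd

# Hypothetical proportions and class names.
toy_prop_female = pd.Series({"nora": 0.99, "joey": 0.1, "sam": 0.5})
toy_names = pd.Series(["nora", "joey", "sam", "zzz"])

# Look up each name; names missing from the table become NaN.
p = toy_names.map(toy_prop_female)

# NaN fails both comparisons, so missing names and exact ties
# both fall through to the default label.
pred = np.select([(p > 0.5).to_numpy(), (p < 0.5).to_numpy()],
                 ["F", "M"], default="Unknown")
print(pred)
```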

In [None]:
sex_from_name("nora")

In [None]:
sex_from_name("joey")

## Estimating the fraction of female and male students

In [None]:
names['Pred. Sex'] = names['Name'].apply(sex_from_name)
px.bar(names['Pred. Sex'].value_counts()/len(names))

### What fraction of students in Data 100 this semester have names in the SSA dataset?

In [None]:
print("Fraction of names in the babynames data:", 
      names["Name"].isin(prop_female.index).mean())

### Which names are not in the dataset?

Why might these names not appear?  

In [None]:
names[~names["Name"].isin(prop_female.index)]

### Using simulation to estimate uncertainty

Previously we treated a name that is given to females 40% of the time as a "Male" name. This discards our uncertainty about the prediction. We can use simulation to provide a better distributional estimate.

In [None]:
names["Prop. Female"] = (
    names[["Name"]].merge(prop_female, how='left', left_on="Name", 
                          right_index=True)["Prop. Female"]
        .fillna(0.5)
)
names.head(10)
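
The merge above is a left join on the name, so every student is kept and names missing from the baby names data get a placeholder of 0.5. A minimal sketch with hypothetical tables (`toy_names` and `toy_prop` are made up):

```python
import pandas as pd

# Hypothetical class roster and name proportions; "zzz" has no match.
toy_names = pd.DataFrame({"Name": ["nora", "zzz", "joey"]})
toy_prop = pd.Series({"nora": 0.99, "joey": 0.1}, name="Prop. Female")

# how="left" keeps every row of toy_names; unmatched names get NaN.
merged = toy_names.merge(toy_prop, how="left", left_on="Name", right_index=True)

# Treat names we know nothing about as equally likely to be either sex.
toy_names["Prop. Female"] = merged["Prop. Female"].fillna(0.5)
print(toy_names)
```

Filling with 0.5 is itself a modeling assumption: it says an unseen name carries no information about sex.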

### Running the simulation

In [None]:
names['Sim. Female'] = names['Prop. Female'] > np.random.rand(len(names))
names.tail(20)
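
Why does comparing against `np.random.rand` work? A uniform random number on [0, 1) falls below `p` with probability `p`, so `p > u` is a coin flip that comes up true with probability `p`. A quick sketch that checks this empirically; the seed and the value `p = 0.4` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is reproducible

p = 0.4                          # e.g. a name given to females 40% of the time
draws = p > rng.random(100_000)  # each entry is True with probability p
frac = draws.mean()
print(frac)
```

The printed fraction should land close to 0.4, and it gets closer as the number of draws grows.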

In [None]:
# function that performs one simulation of the class
def simulate_class(students):
    # Draw each student as female with probability equal to their "Prop. Female"
    is_female = students['Prop. Female'] > np.random.rand(len(students))
    return np.mean(is_female)

# Repeat the simulation many times
sim_frac_female = np.array([simulate_class(names) for _ in range(10000)])

In [None]:
ff.create_distplot([sim_frac_female], ['Fraction Female'], bin_size=0.0025, show_rug=False)
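
One way to summarize a simulated distribution like this is with its mean and a percentile interval. The sketch below does this for a hypothetical class of eight students; the proportions in `toy_prop`, the seed, and the choice of a 95% interval are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "Prop. Female" values for a class of eight students.
toy_prop = np.array([0.99, 0.1, 0.5, 0.3, 0.95, 0.02, 0.8, 0.6])

def simulate_fraction(prop, rng):
    # One simulated class: each student is female with probability prop[i].
    return np.mean(prop > rng.random(len(prop)))

sims = np.array([simulate_fraction(toy_prop, rng) for _ in range(5_000)])

# Middle 95% of the simulated fractions.
lower, upper = np.percentile(sims, [2.5, 97.5])
expected = toy_prop.mean()
print(expected, lower, upper)
```

The average of the simulations should sit near the mean of the proportions, while the interval shows how much a single class could plausibly differ from it.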