# Lecture 1 – Data 100, Summer 2023

by Lisa Yan

- adapted from Joseph E. Gonzalez, Anthony D. Joseph, Josh Hug, Suraj Rampure.
- minor updates by Narges Norouzi, Fernando Pérez, and Dominic Liu.

## Software Packages 
We will use a wide range of Python software packages, installed and managed with the Conda environment manager. The following packages appear routinely in lectures and homework:

In [None]:
# linear algebra, probability
import numpy as np

# data manipulation
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# interactive visualization library
import plotly.offline as py
py.init_notebook_mode()
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px

We will learn how to use all of the technologies used in this demo.

For now, just sit back and think critically about the data and our guided analysis.

# 1. Starting with a Question: **Who are you (the students of Data 100)?**

<img src="images/ask.png" width="300px" />

This is a pretty vague question but let's start with the goal of learning something about the students in the class.

Here are some "simple" questions:
1. How many students do we have?
1. What are your majors?
1. What year are you?
1. Diversity ...?


# 2. Data Acquisition and Cleaning 

**In Data 100 we will study various methods to collect data.**

<img src="images/data_acquisition.PNG" width="300px" />

To answer this question, I downloaded the course roster and extracted everyone's names and majors.

In [None]:
# pd stands for pandas, which we will learn starting from tomorrow
# some pandas syntax shared with data8's datascience package
majors = pd.read_csv("data/majors.csv")
names = pd.read_csv("data/names.csv")

# 3. Exploratory Data Analysis

**In Data 100 we will study exploratory data analysis and practice analyzing new datasets.**

<img src="images/understand_data.PNG" width="300px" />

I didn't tell you the details of the data! Let's check out the data and infer its structure. Then we can start answering the simple questions we posed.

### Peeking at the Data

In [None]:
majors.head(20)

In [None]:
names.head()

### What is one potential issue we may need to address in this data?

**Answer:**
Some names appear capitalized. 

In the above sample we notice that some of the names are capitalized and some are not.  This will be an issue in our later analysis so let's convert all names to lower case.

In [None]:
names['Name'] = names['Name'].str.lower()

In [None]:
names.head()

### How many records do we have?

In [None]:
print(len(names))
print(len(majors))

Based on what we know of our class, each record is most likely a student.

### Understanding the structure of data

It is important that we understand the meaning of each field and how the data is organized.

In [None]:
names.head()

**Q: What is the meaning of the *Role* field?**

A: We can often understand the meaning of a field by looking at the types of data it contains (in particular the *counts of its unique values*).

We use the `value_counts()` function in pandas:

In [None]:
names['Role'].value_counts().to_frame()  # counts of unique Roles

It appears that one student has an erroneous role given as "#REF!". What else can we learn about this student? Let's see their name.

In [None]:
# boolean index to find rows where Role is #REF!
names[names['Role'] == "#REF!"]

Though this single bad record won't have much of an impact on our analysis, we can clean our data by removing this record.

In [None]:
names = names[names['Role'] != "#REF!"]

**Double check**: Let's double check that our record removal only removed the single bad record.

In [None]:
names['Role'].value_counts().to_frame()  # again, counts of unique Roles
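To see the count, filter, and re-check pattern in isolation, here is a minimal sketch on a made-up roster. The `toy` table and its values are hypothetical and only stand in for `names`; the pandas calls are the same ones used above.

```python
import pandas as pd

# Hypothetical toy roster; the real data lives in data/names.csv
toy = pd.DataFrame({
    "Name": ["ana", "ben", "chi", "dev"],
    "Role": ["Student", "Student", "#REF!", "Waitlist Student"],
})

role_counts = toy["Role"].value_counts()   # counts of each unique Role
bad_rows = toy[toy["Role"] == "#REF!"]     # boolean mask selects the matching rows
cleaned = toy[toy["Role"] != "#REF!"]      # keep everything except the bad record

print(role_counts)
print(len(toy), "->", len(cleaned))        # exactly one row should disappear
```

Comparing the lengths before and after is a cheap way to confirm that the filter removed only the record we meant to remove.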

Remember we loaded in two files. Let's explore the fields of `majors` and check for bad records:

In [None]:
majors.columns   # get column names

In [None]:
majors['Terms in Attendance'].value_counts().to_frame()

It looks like numbers represent semesters, `G` represents graduate students, and `U` might represent something else, perhaps campus visitors. But we do still have a bad record:

In [None]:
majors[majors['Terms in Attendance'] == "#REF!"]

In [None]:
majors = majors[majors['Terms in Attendance'] != "#REF!"]
majors['Terms in Attendance'].value_counts().to_frame()

Detail: The deleted `majors` record has a different row number from the bad `names` record. So although the two tables have the same number of records, their row indices don't line up, and we'll have to keep these tables separate for our analysis.
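A small sketch of why the indices stop lining up, using two hypothetical three-row tables (`toy_names` and `toy_majors` are made up for illustration). Filtering keeps the original row labels, so dropping different rows leaves different labels behind.

```python
import pandas as pd

# Hypothetical tables where the bad record sits in a different row of each
toy_names = pd.DataFrame({"Name": ["ana", "ben", "chi"],
                          "Role": ["Student", "#REF!", "Student"]})
toy_majors = pd.DataFrame({"Majors": ["Math BA", "Physics BA", "History BA"],
                           "Terms in Attendance": ["5", "7", "#REF!"]})

toy_names = toy_names[toy_names["Role"] != "#REF!"]
toy_majors = toy_majors[toy_majors["Terms in Attendance"] != "#REF!"]

# Same number of rows, but the surviving row labels differ
print(len(toy_names), list(toy_names.index))
print(len(toy_majors), list(toy_majors.index))
```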

### Summarizing the Data

We will often want to numerically or visually summarize the data. The `describe()` method provides a brief high level description of our data frame. 

In [None]:
names.describe()

In [None]:
majors.describe()

**Q: What do you think `top` and `freq` represent?**

A: `top`: most frequent entry, `freq`: the frequency of that entry
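As a quick check of that reading, here is a sketch on a hypothetical column of six text values. For non-numeric data, `describe()` reports `count`, `unique`, `top`, and `freq`.

```python
import pandas as pd

# Hypothetical values; "5" appears three times, more than any other entry
toy = pd.Series(["G", "5", "7", "5", "5", "7"], name="Terms in Attendance")

summary = toy.describe()
print(summary)
```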

---

<img src="images/understand_world.PNG" width="300px" />


### What are your majors?

What are the top majors:

In [None]:
majors_count = (       # method chaining in pandas
    majors['Majors']
    .value_counts()
    .sort_values(ascending=False) # highest first
    .to_frame()
    .head(20)          # get the top 20
)

# or, uncomment the block below to parse double majors
# majors_count = (
#     majors['Majors']
#     .str.split(", ") # get double majors
#     .explode()       # one major to every row
#     .value_counts()
#     .sort_values(ascending=True)
#     .to_frame()
#     .tail(20)
# )

majors_count
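The double-major variant relies on `str.split` followed by `explode`. A minimal sketch on three hypothetical entries (the major names are made up) shows what each step contributes:

```python
import pandas as pd

# Hypothetical Majors column; the second student is a double major
toy_majors = pd.Series(
    ["Data Science BA", "Economics BA, Data Science BA", "Computer Science BA"],
    name="Majors",
)

per_major = (
    toy_majors
    .str.split(", ")   # each entry becomes a list of majors
    .explode()         # one row per (student, major) pair
    .value_counts()    # count every major separately
)
print(per_major)
```

Note that the counts now sum to the number of declared majors, not the number of students.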

### We will often use visualizations to make sense of data
**In Data 100, we will deal with many different kinds of data (not just numbers) and we will study techniques to describe types of data.**

**How can we summarize the Majors field?** A good starting point might be to use a bar plot:

In [None]:
# interactive using plotly
fig = px.bar(majors_count, orientation='h')
fig.update_layout(showlegend=False,
                  xaxis_title='Count',
                  yaxis_title='Major')

### What year are you?

In [None]:
fig = px.histogram(majors['Terms in Attendance'].sort_values(),
                   histnorm='probability')
fig.update_layout(showlegend=False,
                  xaxis_title="Term",
                  yaxis_title="Fraction of Class")

---


## Diversity and Data Science:

Unfortunately, surveys of data scientists suggest that there are far fewer women in data science:

<img src="images/kaggle_gender_data.png" width="600px" />

To learn more check out the [Kaggle Executive Summary](https://www.kaggle.com/kaggle-survey-2022) or study the [Raw Data](https://www.kaggle.com/c/kaggle-survey-2022).



---

## What fraction of the students are female?

<img src="images/ask.png" width="300px" />

We often ask this question because we want to improve the data science program here at Berkeley, especially now that it has grown into a new college: the [College of Computing, Data Science, and Society](https://data.berkeley.edu/) (CDSS), Berkeley's first new college in 50 years.

This is actually a fairly complex question.  What do we mean by female? Is this a question about the **sex** or **gender identity** of the students?  **They are not the same thing.**  

* **Sex** refers predominantly to biological characteristics. 
* **Gender** is much more complex with societal and cultural implications and refers to how people identify themselves.  

Most likely, the college of CDSS is interested in improving **gender diversity**, by ensuring that our program is inclusive. Let's reword this question:


## Reworded: What is the gender diversity of our students?




### How could we answer this question?



In [None]:
print(majors.columns)
print(names.columns)


---

### We don't have the data.

Where can we get the data?

<img src="images/data_acquisition.PNG" width="300px" />


---

### (1) We could run a survey!

**In Data 100, we will learn different ways people collect data (e.g. sampling and census) and their limitations.**

We actually did, in the Pre-semester Survey. However, it's not due yet, so we don't have the complete data. Here's what we have so far:

<img src="images/survey.png" width="600px" />


Considering UC Berkeley's enrolled undergraduate students are [54% women](https://diversity.berkeley.edu/reports-data/diversity-data-dashboard), Data 100 is not doing great :(

Two limitations of this result:
- Not everyone filled out the form. (But is it a representative sample of our students?)
- The question is optional, so some students chose not to answer it. (What problem will this cause?)


### (2) ... or we could try to use the data we have to estimate the <u>_sex_ of the students as a proxy for gender</u>?!?!

Please do not attempt option (2) alone. What I am about to do is **flawed in so many ways** and we will discuss these flaws in a moment and throughout the semester.

However, it will illustrate some very basic inferential modeling and how we might combine **multiple data sources** to try and reason about something we haven't measured. 

The idea is to use **first name** as a proxy for **sex**, as a proxy for **gender**.

$$
\text{Name} \rightarrow \text{Sex} \rightarrow \text{Gender}
$$

What potential problems do you see with the method?


To attempt option (2), we will first look at a second data source.



---

# US Social Security Data

To study what a name tells us about a person, we will download data from the United States Social Security Administration containing the number of registered names broken down by **year**, **sex**, and **name**. This is often called the Baby Names Data as social security numbers (SSNs) are typically given at birth.

## 1. What does a name tell us about a person?

A: In this demo we'll use a person's name to estimate their sex. But a person's name tells us *many* things (more on this later).

## 2. Acquire data programmatically

Note 1: In the following we download the data programmatically to ensure that the process is reproducible.

Note 2: We also load the data directly into Python without decompressing the zip file.

**In Data 100 we will think a bit more about how we can be efficient in our data analysis to support processing large datasets.**

In [None]:
import urllib.request
import os.path

# Download data from the web directly
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

        
# Load data without unzipping the file
import zipfile
babynames = [] 
with zipfile.ZipFile(local_filename, "r") as zf:
    data_files = [f for f in zf.filelist if f.filename[-3:] == "txt"]
    def extract_year_from_filename(fn):
        return int(fn[3:7])
    for f in data_files:
        year = extract_year_from_filename(f.filename)
        with zf.open(f) as fp:
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
            df["Year"] = year
            babynames.append(df)
babynames = pd.concat(babynames)


babynames.head() # show the first few rows
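The read-without-unzipping step can be tried offline. The sketch below builds a tiny in-memory archive that mimics the `yobYYYY.txt` layout; the file contents and counts are made up, and `toy_babynames` is a hypothetical stand-in for `babynames`.

```python
import io
import zipfile
import pandas as pd

# Build a tiny in-memory archive (made-up counts, same file naming scheme)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("yob1990.txt", "Nora,F,10\nAlex,M,7\n")
    zf.writestr("yob1991.txt", "Nora,F,12\nAlex,F,5\n")
    zf.writestr("README.pdf", "not a data file")
buffer.seek(0)

frames = []
with zipfile.ZipFile(buffer, "r") as zf:
    for info in zf.infolist():
        if not info.filename.endswith(".txt"):
            continue                      # skip anything that is not a data file
        year = int(info.filename[3:7])    # "yob1990.txt" -> 1990
        with zf.open(info) as fp:         # read straight from the archive
            df = pd.read_csv(fp, names=["Name", "Sex", "Count"])
        df["Year"] = year
        frames.append(df)

toy_babynames = pd.concat(frames, ignore_index=True)
print(toy_babynames)
```

Passing `ignore_index=True` gives the combined table a fresh 0..n-1 index instead of repeating each file's row numbers; that is a choice made in this sketch, not something the cell above does.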

---
## 2 (cont). Understanding the Setting

**In Data 100 you will have to learn about different data sources (and their limitations) on your own.**

Reading from [SSN Office description](https://www.ssa.gov/oact/babynames/background.html), bolded for readability: 


> All names are from Social Security card applications for **births that occurred in the United States** after 1879. **Note that many people born before 1937 never applied** for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.

> **To safeguard privacy, we exclude** from our tabulated lists of names those that would indicate, or would allow the ability to determine, **names with fewer than 5 occurrences** in any geographic area. If a name has less than 5 occurrences for a year of birth in any state, the sum of the state counts for that year will be less than the national count.

> All data are from a **100% sample** of our records on Social Security card applications as of March 2022.




## A little bit of data cleaning

Examining the data:

In [None]:
babynames

In our earlier analysis we converted names to lower case. We will do the same again here:

In [None]:
babynames['Name'] = babynames['Name'].str.lower()
babynames.head()

## 3. Exploratory Data Analysis (and Visualization)

<img src="images/understand_data.PNG" width="300px" />

How many people does this data represent?

In [None]:
format(babynames['Count'].sum(), ',d') # sum of 'Count' column

In [None]:
format(len(babynames), ',d')       # number of rows

**Q: Is this number low or high?**

**Answer**

It seems low (the 2021 US population was 331.9 million). However, the Social Security website states:

> All names are from Social Security card applications for births that occurred in the United States after 1879. **Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data.** For others who did apply, our records may not show the place of birth, and again their names are not included in our data. All data are from a 100% sample of our records on Social Security card applications as of the end of February 2016.

Trying a simple query:

In [None]:
# how many Nora's were born in 2018?
babynames[(babynames['Name'] == 'nora') & (babynames['Year'] == 2018)]

Trying a more complex query using `query()` (to be discussed soon!): 

In [None]:
# how many baby names contain the word "data"?
babynames.query('Name.str.contains("data")', engine='python')

### Temporal Patterns Conditioned on Male/Female

**In Data 100 we will study how to visualize and analyze relationships in data.**

In this example we construct a **pivot table** which aggregates the number of babies registered for each year by `Sex`.

We'll discuss pivot tables in detail in the next few lectures.

In [None]:
# counts number of M and F babies per year
year_sex = pd.pivot_table(
        babynames, 
        index=['Year'], # the row index
        columns=['Sex'], # the column values
        values='Count', # the field(s) to be processed in each group
        aggfunc='sum',
    )[["M", "F"]]

year_sex.head()
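To make the pivot concrete before we cover it properly, here is a sketch on five hypothetical rows (`toy` is made up; the column names match `babynames`). Each cell of the result is the sum of `Count` for one (Year, Sex) pair.

```python
import pandas as pd

# Hypothetical rows in the same shape as babynames
toy = pd.DataFrame({
    "Name":  ["nora", "alex", "alex", "nora", "alex"],
    "Sex":   ["F",    "M",    "F",    "F",    "M"],
    "Count": [10,     7,      3,      12,     8],
    "Year":  [1990,   1990,   1990,   1991,   1991],
})

toy_year_sex = pd.pivot_table(
    toy,
    index="Year",      # one row per year
    columns="Sex",     # one column per sex
    values="Count",    # the field to aggregate
    aggfunc="sum",
)[["M", "F"]]
print(toy_year_sex)
```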

We can visualize these descriptive statistics:

In [None]:
# more interactive using plotly
fig = px.line(year_sex)
fig.update_layout(title="Total Babies per Year",
                  yaxis_title="Number of Babies")

### How many unique names for each year?

In [None]:
# counts number of M and F *names* per year
year_sex_unique = pd.pivot_table(babynames, 
        index=['Year'], 
        columns=['Sex'], 
        values='Name', 
        aggfunc=lambda x: len(np.unique(x)),
    )
fig = px.line(year_sex_unique)
fig.update_layout(title="Unique Names Per Year",
                  yaxis_title="Number of Baby Names")

**Some observations:**
1. Registration data seems limited in the early 1900s, because many people born before 1937 never applied for a Social Security card.
1. You can see the [baby boomers](https://en.wikipedia.org/wiki/Baby_boomers) (born 1940s-1960s) and the [Echo Boomers](https://en.wikipedia.org/wiki/Echo%20Boomers) (aka millennials, 1980s to 2000).
1. Female babies have a slightly greater diversity of names.

## 4. Understand the World: Prediction and Inference

<img src="images/understand_world.PNG" width="300px" />

Let's use the Baby Names dataset to estimate the fraction of female students in the class.

### Compute the Proportion of Female Babies For Each Name

First, we construct a pivot table to compute the total number of babies registered for each Name, broken down by Sex.

In [None]:
# counts number of M and F babies per name
name_sex = pd.pivot_table(babynames, index='Name', columns='Sex', values='Count',
                            aggfunc='sum', fill_value=0., margins=True)
name_sex.sample(5)

In [None]:
name_sex.loc["alex"]

Second, we compute the proportion of female babies for each name. This is our **estimated probability** that the baby is Female:

$$ \hat{\textbf{P}\hspace{0pt}}(\text{Female} \,\,\, | \,\,\, \text{Name} ) = \frac{\textbf{Count}(\text{Female and Name})}{\textbf{Count}(\text{Name})}
$$

In [None]:
prop_female = (name_sex['F'] / name_sex['All']).rename("Prop. Female")
prop_female.to_frame().sample(10)
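Here is the same computation worked through on hypothetical counts, so the arithmetic can be checked by hand. The names and numbers in `toy` are made up: "alex" is 30 F out of 100, "nora" is 50 F out of 50, and "pat" is 20 F out of 40.

```python
import pandas as pd

# Hypothetical counts, chosen so the proportions are easy to verify
toy = pd.DataFrame({
    "Name":  ["alex", "alex", "nora", "pat", "pat"],
    "Sex":   ["M",    "F",    "F",    "M",   "F"],
    "Count": [70,     30,     50,     20,    20],
})

toy_name_sex = pd.pivot_table(toy, index="Name", columns="Sex", values="Count",
                              aggfunc="sum", fill_value=0, margins=True)

# Count(Female and Name) / Count(Name)
toy_prop_female = (toy_name_sex["F"] / toy_name_sex["All"]).rename("Prop. Female")
print(toy_prop_female)
```

`fill_value=0` matters for "nora": without it, the missing M count would be `NaN` rather than zero.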

### Test a few names

In [None]:
prop_female["bella"]

In [None]:
prop_female["dominic"]

In [None]:
prop_female["joey"]

In [None]:
prop_female["ani"]

In [None]:
prop_female["minh"]

In [None]:
prop_female["pat"]

In [None]:
prop_female["jaspreet"]

### Next, Build a Simple Classifier (Model)

We can define a function to return the most likely `Sex` for a name. If there is an exact tie or the name does not appear in the social security dataset the function returns `Unknown`.

In [None]:
def sex_from_name(name):
    lower_name = name.lower()
    if lower_name not in prop_female.index or prop_female[lower_name] == 0.5:
        return "Unknown"
    elif prop_female[lower_name] > 0.5:
        return "F"
    else:
        return "M"

In [None]:
sex_from_name("nora")

In [None]:
sex_from_name("joey")

In [None]:
sex_from_name("pat")

## 4 (cont). Estimating the fraction of female and male students in Data 100
Let's try out our simple classifier! We'll use the `apply()` function to classify each student name:

In [None]:
# apply sex_from_name to each student name
names['Pred. Sex'] = names['Name'].apply(sex_from_name)
px.bar(names['Pred. Sex'].value_counts()/len(names))

### Interpreting the unknowns

That's a lot of `Unknown`s.

...But we can still estimate the fraction of female students in the class:

In [None]:
count_by_sex = names['Pred. Sex'].value_counts().to_frame()
count_by_sex

In [None]:
count_by_sex.loc['F']/(count_by_sex.loc['M'] + count_by_sex.loc['F'])

**Questions**:
1. **How do we feel about this estimate?**
1. **Do we trust it?**

---
<br/><br/><br/>

**Q: What fraction of students in Data 100 this semester have names in the SSN dataset?**

In [None]:
print("Fraction of names in the babynames data:", 
      names['Name'].isin(prop_female.index).mean())

**Q: Which names are *not* in the dataset?**

Why might these names not appear?  

In [None]:
# the tilde ~ negates the boolean index. More next week.
names[~names['Name'].isin(prop_female.index)].sample(10)

In [None]:
"hangxing" in prop_female.index

### Using simulation to estimate uncertainty

Previously we treated a name which is given to females 40% of the time as a "Male" name, because the probability was less than 0.5.  This doesn't capture our uncertainty.

We can use simulation to provide a better distributional estimate. We'll use 50% for names not in the Baby Names dataset.

In [None]:
# add the computed SSN F proportion to each row. 0.5 for Unknowns.
# merge() effectively "join"s two tables together. to be covered next week.
names['Prop. Female'] = (
    names[['Name']].merge(prop_female, how='left', left_on='Name', 
                          right_index=True)['Prop. Female']
        .fillna(0.5)
)
names.head(10)
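A stripped-down sketch of that lookup, with a hypothetical three-student roster and a two-name proportion table (both made up). The left merge keeps every student, and names missing from the lookup come back as `NaN` until `fillna(0.5)` replaces them.

```python
import pandas as pd

# Hypothetical roster and lookup; "hangxing" is absent from the lookup
toy_names = pd.DataFrame({"Name": ["alex", "hangxing", "nora"]})
toy_prop_female = pd.Series({"alex": 0.3, "nora": 1.0}, name="Prop. Female")

merged = toy_names.merge(toy_prop_female, how="left",
                         left_on="Name", right_index=True)
print(merged)   # the unmatched name has NaN here

toy_names["Prop. Female"] = merged["Prop. Female"].fillna(0.5)
print(toy_names)
```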

### Running the simulation

In [None]:
# if a randomly picked number from [0.0, 1.0) is under the Female proportion, then F
names['Sim. Female'] = np.random.rand(len(names)) < names['Prop. Female']
names.tail(20)

Given such a simulation, we can compute the fraction of the class that is female.

1. **How do we feel about this new estimate?** 
1. **Do we trust it?**

In [None]:
# proportion of Trues in the 'Sim. Female' column
names['Sim. Female'].mean()

Now that we're performing a simulation, the above proportion is *random*: it depends on the random numbers we picked to determine whether a student was Female.

Let's run the above simulation several times and see what the distribution of this Female proportion is. The below cell may take a few seconds to run.

In [None]:
# run one simulated class and return its fraction female
def simulate_class(students):
    is_female = students['Prop. Female'] > np.random.rand(len(students))
    return np.mean(is_female)

# repeat the simulation many times
sim_frac_female = np.array([simulate_class(names) for _ in range(10000)])
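For comparison, the same idea can be written without a Python loop by drawing all the random numbers at once. This is a sketch under assumptions, not the lecture's method: the six probabilities are hypothetical, and the seeded `np.random.default_rng` generator is swapped in for `np.random.rand` so the result is reproducible.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, an assumption of this sketch

# Hypothetical per-student probabilities (0.5 stands in for unknown names)
toy_prop_female = np.array([0.98, 0.02, 0.5, 0.4, 0.9, 0.5])

n_sims = 10_000
# one row per simulated class, one column per student
draws = rng.random((n_sims, len(toy_prop_female))) < toy_prop_female
sim_frac = draws.mean(axis=1)    # fraction female in each simulated class

print(sim_frac.mean())                        # close to toy_prop_female.mean()
print(np.percentile(sim_frac, [2.5, 97.5]))   # a rough 95% interval
```

The average of the simulated fractions lands near the average of the input probabilities; what the simulation adds is the spread around that value.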

In [None]:
fig = ff.create_distplot([sim_frac_female], ['Fraction Female'], bin_size=0.0025, show_rug=False)
fig.update_layout(xaxis_title='Prop. Female',
                  yaxis_title='Percentage',
                  title='Distribution of Simulated Proportions of Females in the Class')
ax = sns.histplot(sim_frac_female, stat='probability', kde=True, bins=20)
sns.rugplot(sim_frac_female, ax=ax)
ax.set_xlabel("Fraction Female")
ax.set_title('Distribution of Simulated Fractions Female in the Class');

**In Data 100 we will understand Kernel Density Functions, Rug Plots, and other visualization techniques.**

---
<br/><br/><br/>
## Limitations of Baby Names dataset

### UC Berkeley teaches students from around the world.

We saw with our Simple Classifier that many student names were classified as "Unknown," often because they weren't in the SSN Baby Names Dataset.

Recall the SSN dataset:

> All names are from Social Security card applications for births that occurred in the United States after 1879.

That statement is not reflective of all of our students!!

In [None]:
# students who were not in the SSN Baby Names Dataset
names[~names['Name'].isin(prop_female.index)].sample(10)

### Names change over time.

Using data from 1879 (or even 1937) does not represent the diversity and **context** of U.S. baby names today.

Here are some choice names to show you how the distribution of particular names has varied with time:

In [None]:
subset_names = ["edris", "jamie", "jordan", "leslie", "taylor", "willie"]
subset_babynames_year = (pd.pivot_table(
                    babynames[babynames['Name'].isin(subset_names)],
                    index=['Name', 'Year'], columns='Sex', values='Count',
                    aggfunc='sum', fill_value=0, margins=True)
                 .drop(labels='All', level=0, axis=0) # drop cumulative row
                 .rename_axis(None, axis=1) # remove pivot table col name
                 .reset_index() # move (name, year) back into columns
                 .assign(Propf = lambda row: row.F/(row.F + row.M))
                )
ax = sns.lineplot(data=subset_babynames_year,
                  x='Year', y='Propf', hue='Name')
ax.set_title("Ratio of Female Babies over Time for Select Names")
ax.set_ylabel("Proportion of Female Names in a Year")
ax.legend(loc="lower left");

### Bonus: How we selected which names to plot

Curious as to how we got the above names? We picked out two types of names:
* names that had a high variability in F/M naming over years
* common names that had an average F/M ratio over a set threshold

Check it out:

In [None]:
"""
get a subset of names that:
    have had propf above a threshold, as well as
    have been counted for more than a certain number of years
Note: while we could do our analysis over all names,
    it turns out many names don't matter.
    So to save computation power, we just work
    with a subset of names we know may be candidates.
"""
# these are thresholds we set as data analysts
propf_min = 0.2
propf_max = 0.8
year_thresh = 30

propf_countyear = (babynames
                   .groupby('Name').count()
                   .merge(prop_female.to_frame(), on='Name')
                   .rename(columns={'Prop. Female': 'Propf'})
                   .query("@propf_min < Propf < @propf_max & Year > @year_thresh & Name != 'All'")
                  )[['Propf', 'Year']]
propf_countyear

In [None]:
# construct a pivot table of (name, year) to count
keep_names = propf_countyear.reset_index()['Name']
name_year_sex = (pd.pivot_table(
                    babynames[babynames['Name'].isin(keep_names)],
                    index=['Name', 'Year'], columns='Sex', values='Count',
                    aggfunc='sum', fill_value=0, margins=True)
                 .drop(labels='All', level=0, axis=0) # drop cumulative row
                 .rename_axis(None, axis=1) # remove pivot table col name
                 .reset_index() # move (name, year) back into columns
                 .assign(Propf = lambda row: row.F/(row.F + row.M))
                )
name_year_sex

In [None]:
"""
Compute two statistics per name:
- Count of number of babies with name
- Variance of proportion of females
  (i.e., how much the proportion of females varied
  across different years)
"""
names_to_include = 40
group_names =  (name_year_sex
                       .groupby('Name')
                       .agg({'Propf': 'var', 'All': 'sum'})
                       .rename(columns={'Propf': 'Propf Var', 'All': 'Total'})
                       .reset_index()
                      )
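To see what the two statistics measure, here is a sketch on hypothetical yearly proportions for two names (all values in `toy` are made up). A name whose proportion swings across years gets a large variance; a name that never changes gets zero.

```python
import pandas as pd

# Hypothetical (name, year) rows: one name shifts over time, the other does not
toy = pd.DataFrame({
    "Name":  ["leslie"] * 3 + ["nora"] * 3,
    "Year":  [1900, 1950, 2000] * 2,
    "Propf": [0.1, 0.5, 0.9, 1.0, 1.0, 1.0],
    "All":   [100, 200, 300, 50, 60, 70],
})

toy_stats = (toy
             .groupby("Name")
             .agg({"Propf": "var", "All": "sum"})   # sample variance, total count
             .rename(columns={"Propf": "Propf Var", "All": "Total"})
             .reset_index()
            )
print(toy_stats)
```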


In [None]:
# pick some high variance names
high_variance_names = (group_names
                       .sort_values('Propf Var', ascending=False)
                       .head(names_to_include)
                       .sort_values('Total', ascending=False)
                      )

high_variance_names.head(5)

In [None]:
# pick some common names
common_names = (group_names
                .sort_values('Total', ascending=False)
                .head(names_to_include)
               )
common_names.head(10)