In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10, 5)

import util

# Lecture 2 – Pandas 🐼

## DSC 80, Spring 2022

### Announcements

- Lab 1 is due on **Monday, April 4th at 11:59PM**.
    - Watch [this video 🎥](https://youtu.be/FpTo4AM9B30) for setup instructions.
- Discussion 1 is today from **7-7:50PM or 8-8:50PM** (in-person in PCNYH 122 or remote via Zoom).
    - There was a typo in the discussion times stated in Lecture 1.
    - Remember that discussion assignments can be submitted for extra credit!
- Don't forget to fill out the [Welcome + Alternate Exams Form](https://docs.google.com/forms/d/e/1FAIpQLSdBKLcPs4Xi0plaIw0MVZ0DyGcvnSZyHxKVC7S7LwEiCchepQ/viewform) by Monday as well.

### Agenda

- Wrap up our case study of City of San Diego employee salaries.
- Introduction to `pandas`.
    - DataFrames, Series, and Indexes.
- Selecting rows and columns using `[]` and `loc`. 

## The data science lifecycle

<center><img src="imgs/DSLC.png" width="40%"></center>

### Recap: City of San Diego salary data

Our dataset is downloaded from [Transparent California](https://transparentcalifornia.com/salaries/san-diego/).

In [None]:
salary_path = util.safe_download('https://transcal.s3.amazonaws.com/public/export/san-diego-2020.csv')

In [None]:
salaries = pd.read_csv(salary_path)
util.anonymize_names(salaries)
salaries

### Question: Does gender influence pay?

- Do employees of different genders have similar pay?
- The salary dataset we downloaded does not contain employee gender, so we can't answer this question using just the data we have.

In [None]:
salaries.head()

- We **do**, however, have the first name of each employee.

### Social Security Administration baby names 👶

- The US Social Security Administration (SSA) keeps track of the **first name**, **birth year**, and **assigned gender at birth** for all babies born in the US.
- We can somehow combine the SSA's dataset with the `salaries` dataset to infer the gender of San Diego employees.

In [None]:
names_path = util.safe_download('https://www.ssa.gov/oact/babynames/names.zip')

In [None]:
import pathlib

dfs = []
for path in pathlib.Path('data/names/').glob('*.txt'):
    year = int(path.stem[3:])  # file names look like 'yob1880.txt'
    if year >= 1964:
        df = pd.read_csv(path, names=['firstname', 'gender', 'count']).assign(year=year)
        dfs.append(df)
        
names = pd.concat(dfs)
names

> We began compiling the baby name list in 1997, with names dating back to 1880. At the time of a child’s birth, parents supply the name to us when applying for a child’s Social Security card, thus making Social Security America’s source for the most popular baby names. Please share this with your friends and family—and help us spread the word on social media. - [Social Security’s Top Baby Names for 2020
](https://blog.ssa.gov/social-securitys-top-baby-names-for-2020/)

### Exploring `names`

- The only values of `'gender'` in `names` are `'M'` and `'F'`.
- Many names have non-zero counts for both `'M'` and `'F'`.
- Most names occur only a few times per year, but a few names occur very often.

In [None]:
names.head()

In [None]:
# Get the count of each unique value in the 'gender' column
names['gender'].value_counts()

In [None]:
# Look at a single name
names[names['firstname'] == 'Billy']

In [None]:
# Look at various summary statistics
names.describe()

### Data Modeling

<center><img src="imgs/DSLC.png" width="40%"></center>

### Determining the most common gender for each name

- Recall, our goal is to infer the gender of each San Diego city employee. To do this, we need a mapping of first names to genders.

- **A (very imperfect) model:** If someone has a name that is predominantly used by gender $g$, we'll infer their gender to be $g$.

- **Approach:** Create a DataFrame indexed by `'firstname'` that describes the total number of `'F'` and `'M'` babies in `names` for each unique `'firstname'`.
    - If there are more female babies born with a given name than male babies, we will "classify" the name as female.
    - Otherwise, we will classify the name as male.

### Determining the most common gender for each name

In [None]:
counts_by_gender = (
    names
    .groupby(['firstname', 'gender'])
    .sum()
    .reset_index()
    .pivot(index='firstname', columns='gender', values='count')
    .fillna(0)
)
counts_by_gender

In [None]:
counts_by_gender['F'] > counts_by_gender['M']

In [None]:
genders = counts_by_gender.assign(gender=np.where(counts_by_gender['F'] > counts_by_gender['M'], 'F', 'M'))
genders
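
To see the classification rule on a small scale, here is a minimal sketch using a made-up `toy_names` DataFrame (the names and counts are invented for illustration and are not taken from the SSA data). It follows the same steps: total the counts per name and gender, pivot so that each gender becomes a column, and label each name by the larger count.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the same layout as `names` (values are made up)
toy_names = pd.DataFrame({
    'firstname': ['Alex', 'Alex', 'Alex', 'Maria', 'John'],
    'gender':    ['M',    'F',    'M',    'F',     'M'],
    'count':     [30,     20,     15,     50,      40],
    'year':      [1990,   1990,   1991,   1990,    1990],
})

toy_counts = (
    toy_names
    .groupby(['firstname', 'gender'])['count']
    .sum()
    .reset_index()
    .pivot(index='firstname', columns='gender', values='count')
    .fillna(0)  # a name never recorded for a gender gets a count of 0
)

# Ties and male-majority names are both labeled 'M', matching the rule above
toy_genders = toy_counts.assign(
    gender=np.where(toy_counts['F'] > toy_counts['M'], 'F', 'M')
)
toy_genders
```

Note that a tie is classified as `'M'`, which is one of the many ways this model is imperfect.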

### Adding a `'gender'` column to `salaries`

This involves two steps:
1. Extracting just the first name from `'Employee Name'`.
2. **Merging** `salaries` and `genders`.

In [None]:
# Add firstname column
salaries['firstname'] = salaries['Employee Name'].str.split().str[0]
salaries

In [None]:
# Merge salaries and genders
salaries_with_gender = salaries.merge(genders[['gender']], on='firstname', how='left')
salaries_with_gender

### Predictions and Inference

<center><img src="imgs/DSLC.png" width="40%"></center>

### Question: Does gender influence pay?

This was our original question. Let's find out!

In [None]:
pd.concat([
    salaries_with_gender.groupby('gender')['Total Pay'].describe().T,
    salaries_with_gender['Total Pay'].describe().rename('All')
], axis=1)

- Unfortunately, there's a fairly large difference between the mean salaries of male employees and female employees.
- A similar difference also exists for the median.
- Can this difference be explained by random chance?

### A hypothesis test

- **Null Hypothesis:** Gender is independent of salary, and any observed differences are due to random chance.
- **Alternate Hypothesis:** Gender is not independent of salary. Female employees earn less than male employees.

In [None]:
n_female = np.count_nonzero(salaries_with_gender['gender'] == 'F')
n_female

**Strategy:** 
- Randomly select 4075 employees from `salaries_with_gender` and compute their median salary.
- Repeat this many times.
- See where the observed median salary of female employees lies in this empirical distribution.

### Running the hypothesis test

In [None]:
# Observed statistic
female_median = salaries_with_gender.loc[salaries_with_gender['gender'] == 'F']['Total Pay'].median()

# Simulate 1000 samples of size n_female from the population
medians = np.array([])
for _ in np.arange(1000):
    median = salaries_with_gender.sample(n_female)['Total Pay'].median()
    medians = np.append(medians, median)

medians[:10]

In [None]:
title='Median salary of randomly chosen groups from population'
pd.Series(medians).plot(kind='hist', density=True, ec='w', title=title);
plt.axvline(x=female_median, color='red')
plt.legend(['Observed Median Salary of Female Employees', 'Median Salaries of Random Groups']);

- Our hypothesis test has a p-value of 0, so we reject the null.
    - Under the assumption that gender is independent of salary, the chance of seeing a median salary this low is essentially 0.
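
The p-value can be computed directly from the simulated medians: it is the proportion of simulated medians that are at least as low as the observed one. The sketch below uses synthetic salaries generated with NumPy, so `pay`, `is_group`, and all of the numbers are assumptions for illustration, not the San Diego data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic population (made up): 1,000 salaries, 300 of them in a lower-paid group
pay = pd.Series(np.concatenate([
    rng.normal(60_000, 5_000, size=300),   # group of interest
    rng.normal(80_000, 5_000, size=700),   # everyone else
]))
is_group = np.arange(1000) < 300

observed_median = pay[is_group].median()
n_group = int(is_group.sum())

# Medians of 500 random groups of the same size, each drawn without replacement
sim_medians = np.array([
    pay.sample(n_group, random_state=seed).median()
    for seed in range(500)
])

# Empirical p-value: share of simulated medians at or below the observed median
p_value = (sim_medians <= observed_median).mean()
p_value
```

In this synthetic setup the group of interest sits far below the rest of the population, so the p-value should come out at or very near 0, mirroring the result above.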

<center><img src="imgs/DSLC.png" width="40%"></center>

### Even more questions...

While trying to answer one question, many more popped up.

* Is our dataset representative of all San Diego employees?
* How reliable is our name-to-gender assignment?
* How reliable is our join between `salaries` and `names`?
* Is the pay disparity between genders correlated with pay type? Job status? Job type?
* What is the **cause** of the disparity?

### Is our dataset representative of all San Diego employees?

- In this case, yes – the dataset we downloaded from [Transparent California](https://transparentcalifornia.com/salaries/san-diego/) is a **census**, meaning that it accounts for all members of the population.
- But perhaps `'Total Pay'` is not the most relevant column, as it may include reimbursements that are separate from take-home pay (e.g. gas for driving a car).

### How reliable is our join between `salaries` and `names`?
* Are there names in the salaries dataset that aren't in the SSA dataset?
    - Who might not be in the SSA dataset? 
    - Could individuals with those names be biased towards certain salaries?
* Does the salaries dataset have a disproportionately large portion of unisex names?
* Is it better to use a subset of the SSA dataset (e.g. by state)?
    - Does the gender associated with a name typically vary by geography?

### How reliable is our join between `salaries` and `names`?

In [None]:
salaries_with_gender[salaries_with_gender['gender'].isnull()]

In [None]:
# Proportion of employees whose names aren't in SSA dataset
salaries_with_gender['gender'].isnull().mean()

In [None]:
# Description of total pay by joined vs. not joined
(
    salaries_with_gender
    .assign(joined=salaries_with_gender['gender'].notnull())
    .groupby('joined')['Total Pay']
    .describe()
    .T
)

In [None]:
nonjoins = salaries_with_gender.loc[salaries_with_gender['gender'].isnull()]

title = 'Distribution of Salaries'
nonjoins['Total Pay'].plot(kind='hist', bins=np.arange(0, 320000, 10000), alpha=0.5, density=True, sharex=True)
salaries_with_gender['Total Pay'].plot(kind='hist', bins=np.arange(0, 320000, 10000), alpha=0.5, density=True, sharex=True, title=title)
plt.legend(['Not in SSA','All']);

### How reliable is our join between `salaries` and `names`?

**Lesson:** joining to another dataset can bias your sample! 
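
One way to check how much a join loses is the `indicator=True` argument of `merge`, which adds a `'_merge'` column recording whether each row found a match. The sketch below uses two small made-up tables; `toy_salaries` and `toy_genders` are hypothetical stand-ins, not the lecture's data.

```python
import pandas as pd

# Hypothetical stand-ins for `salaries` and `genders` (values are made up)
toy_salaries = pd.DataFrame({
    'firstname': ['Maria', 'John', 'Xiuying', 'Alex'],
    'Total Pay': [90_000, 70_000, 40_000, 65_000],
})
toy_genders = pd.DataFrame({
    'firstname': ['Maria', 'John', 'Alex'],
    'gender': ['F', 'M', 'M'],
})

merged = toy_salaries.merge(toy_genders, on='firstname', how='left', indicator=True)

# Proportion of rows with no match in the lookup table
unmatched_rate = (merged['_merge'] == 'left_only').mean()

# Mean pay for matched ('both') vs. unmatched ('left_only') rows
pay_by_match = merged.groupby(merged['_merge'].astype(str))['Total Pay'].mean()
pay_by_match
```

If the unmatched rows have a noticeably different pay distribution from the matched rows, dropping them (or ignoring them) changes what the remaining sample represents.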

## Introduction to `pandas`

<center><img src='imgs/babypanda.jpg' width=400></center>

<center><img src='imgs/angrypanda.jpg' width=600></center>

### `pandas`

<center><img src='imgs/pandas.png' width=200></center>

- `pandas` is **the** Python library for tabular data manipulation.
- Before `pandas` was developed, the standard data science workflow involved using multiple languages (Python, R, Java) in a single project.
- Wes McKinney, the original developer of `pandas`, wanted a library which would allow everything to be done in Python.
    - Python is faster to develop in than Java, and is more production-capable than R.

### `pandas` data structures

There are three key data structures at the core of `pandas`:
- DataFrame: a 2-dimensional table.
- Series: a 1-dimensional (columnar) array.
- Index: an immutable sequence of column/row labels.

<center><img src='imgs/example-df.png' width=600></center>

### Importing `pandas` and related libraries

We've already run this at the top of the notebook, so we won't repeat it here. But `pandas` is almost always imported in conjunction with `numpy`:

```py
import pandas as pd
import numpy as np
```

### Series are "slices"
* Selecting a single row or column of a DataFrame gives a `pd.Series`.
* A `pd.Series` object is a one-dimensional sequence with labels (index).

In [None]:
names

In [None]:
names['firstname']

In [None]:
names.iloc[3]

### Initializing a Series

- The function `pd.Series` can create a new Series, given either an existing sequence or dictionary.
- By default, the index will be set to 0, 1, 2, 3,... and the Series will have no "name".
    - You can use optional `index` and `name` arguments to change this behavior.
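
As a minimal sketch of the optional `index` and `name` arguments (the labels `'a'`, `'b'`, `'c'` are arbitrary choices for illustration):

```python
import pandas as pd

# Supply custom labels and a name instead of the defaults
ages = pd.Series([10, 23, 45], index=['a', 'b', 'c'], name='people')

ages.index   # the custom labels 'a', 'b', 'c'
ages.name    # 'people'
ages['b']    # label-based lookup: 23
```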

In [None]:
pd.Series([10, 23, 45, 53, 87])

In [None]:
pd.Series({'a': 10, 'b': 23, 'c': 45, 'd': 53, 'e': 87}, name='people')

### Initializing a DataFrame

* `pd.DataFrame` initializes a DataFrame using either: 
    - a list of rows, or
    - a dictionary of columns.
* There are various optional arguments: `index`, `columns`, `dtype`, etc.
    - To see the signature of a function `f`, run `f?` in a cell (e.g. `pd.DataFrame?`).

In [None]:
pd.DataFrame?

### Method 1: Using a list of rows

In [None]:
row_data = [
    ['Granger, Hermione', 'A13245986', 1],
    ['Potter, Harry', 'A17645384', 1],
    ['Weasley, Ron', 'A32438694', 1],
    ['Longbottom, Neville', 'A52342436', 1]
]

row_data

By default, the column names are set to 0, 1, 2, ...

In [None]:
pd.DataFrame(row_data)

You can change that using the `columns` argument.

In [None]:
pd.DataFrame(row_data, columns=['Name', 'PID', 'LVL'])

### Method 2: Using a dictionary of columns

In [None]:
column_dict = {
    'Name': ['Granger, Hermione', 'Potter, Harry', 'Weasley, Ron', 'Longbottom, Neville'],
    'PID': ['A13245986', 'A17645384', 'A32438694', 'A52342436'],
    'LVL': [1, 1, 1, 1]
}
column_dict

In [None]:
enrollments = pd.DataFrame(column_dict)
enrollments
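
The two methods are interchangeable. As a quick sketch on a two-row subset of the data above, building the DataFrame from rows and from columns should give the same result, and the optional `index` argument replaces the default 0, 1, 2, ... labels (the labels `'s1'` and `'s2'` are made up for illustration).

```python
import pandas as pd

rows = [['Granger, Hermione', 'A13245986', 1],
        ['Potter, Harry', 'A17645384', 1]]
cols = {'Name': ['Granger, Hermione', 'Potter, Harry'],
        'PID': ['A13245986', 'A17645384'],
        'LVL': [1, 1]}

from_rows = pd.DataFrame(rows, columns=['Name', 'PID', 'LVL'])
from_cols = pd.DataFrame(cols)

# Same values, labels, and dtypes either way
same = from_rows.equals(from_cols)

# Hypothetical custom row labels via the `index` argument
labeled = pd.DataFrame(cols, index=['s1', 's2'])
```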

### DataFrame index and column labels

- Access column labels with the `columns` attribute.
- Access index labels with the `index` attribute.
- The default for both is 0-indexed position (0, 1, 2, ...).

In [None]:
enrollments.columns

In [None]:
enrollments.index

### Axis

- A single row or column of a DataFrame is a Series.
- The **axis** specifies the direction of a slice of a DataFrame.

<center><img src='imgs/axis.png' width=300></center>

- Axis 0 refers to the index.
- Axis 1 refers to the columns.

### DataFrame methods with `axis`

In [None]:
A = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
A

If we specify `axis=0`, `A.sum` will "compress" along axis 0, and keep the column labels intact.

In [None]:
A.sum(axis=0)

If we specify `axis=1`, `A.sum` will "compress" along axis 1, and keep the row labels (index) intact.

In [None]:
A.sum(axis=1)

<center><img src='imgs/axis-sum.png' width=600></center>
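
The `axis` argument also accepts the strings `'index'` and `'columns'`, which some people find easier to read than `0` and `1`. A short sketch using the same DataFrame `A`:

```python
import pandas as pd

A = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])

col_totals = A.sum(axis=0)   # one value per column: A=5, B=7, C=9
row_totals = A.sum(axis=1)   # one value per row: 0 -> 6, 1 -> 15

# String aliases for the same two axes
same_cols = A.sum(axis='index').equals(col_totals)
same_rows = A.sum(axis='columns').equals(row_totals)
```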

## Selecting rows and columns using `[]` and `loc`

### Throwback to `babypandas` 👶

- In `babypandas`, you accessed columns using the `.get` method.
- `.get` also works in `pandas`, but it is not **idiomatic** – people don't usually use it.

In [None]:
enrollments

In [None]:
enrollments.get('Name')

In [None]:
# Doesn't error – returns None when the column doesn't exist
enrollments.get('billy')

### Selecting columns with `[]`

- The standard way to access a column in `pandas` is by using the `[]` operator.
    - Think of a DataFrame as a dictionary of arrays!
* Specifying a column name returns the column as a Series.
* Specifying a list of column names returns a DataFrame.

In [None]:
enrollments

In [None]:
# Returns a Series
enrollments['Name']

In [None]:
# Returns a DataFrame
enrollments[['Name', 'PID']]

In [None]:
# 🤔
enrollments[['Name']]
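
The difference between a string key and a one-element list is the type of the result. A minimal sketch on a small made-up DataFrame (`df` is a stand-in for `enrollments`):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Potter, Harry', 'Weasley, Ron'], 'LVL': [1, 1]})

as_series = df['Name']     # a string key gives a Series
as_frame = df[['Name']]    # a one-element list gives a one-column DataFrame

type(as_series), as_series.shape   # Series with shape (2,)
type(as_frame), as_frame.shape     # DataFrame with shape (2, 1)
```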

In [None]:
# KeyError: 'billy' is not a column label
enrollments['billy']

### Selecting columns with attribute notation

- It is also possible to access columns using attribute notation, i.e. `.<column name>`.
- **Don't do this.**
    - What if the column name clashes with a DataFrame method, like `.mean`?
    - What if the column name contains spaces or special characters?
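
Both pitfalls are easy to reproduce. The sketch below uses a made-up DataFrame `scores` with a column named `'mean'` and a column whose name contains a space:

```python
import pandas as pd

# Hypothetical DataFrame with awkward column names
scores = pd.DataFrame({'mean': [1.0, 2.0], 'exam score': [90, 85]})

clash = scores.mean             # the DataFrame method, not the column
column = scores['mean']         # the column, as intended
spaced = scores['exam score']   # `scores.exam score` would be a syntax error
```

With `[]`, the column name is always treated as a label, so neither problem arises.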

In [None]:
enrollments.LVL

In [None]:
enrollments.mean

### Selecting rows with `loc`

If `df` is a DataFrame, then:
* `df.loc[idx]` returns the row whose index label is `idx`, as a Series.
* `df.loc[idx_list]` returns a DataFrame containing the rows whose index labels are in `idx_list`.

In [None]:
enrollments

In [None]:
enrollments.loc[3]

In [None]:
enrollments.loc[[1, 3]]

In [None]:
enrollments.loc[[3]]

### Boolean sequence selection

* The `loc` operator also supports Boolean sequences (lists, arrays, Series) as input. 
* The length of the sequence must be the same as the number of rows in the DataFrame. 
* The result is a filtered DataFrame, containing only the rows in which the sequence contained `True`.

In [None]:
enrollments

In [None]:
bool_arr = [
    False,  # Hermione
    True,   # Harry
    False,  # Ron
    True    # Neville
]

enrollments.loc[bool_arr]

### Querying

- Comparisons with arrays (Series) result in Boolean arrays (Series).
- We can use comparisons along with the `loc` operator to **query** a DataFrame.
- Querying is the act of selecting rows in a DataFrame that satisfy certain condition(s).

In [None]:
enrollments

In [None]:
enrollments['Name'].str.contains('on')

In [None]:
# Rows where Name includes 'on'
enrollments.loc[enrollments['Name'].str.contains('on')]

In [None]:
# Rows where the first letter of Name is between A and L
enrollments.loc[enrollments['Name'] < 'M']

When using a Boolean sequence, e.g. `enrollments['Name'] < 'M'`, `loc` is not strictly necessary:

In [None]:
enrollments[enrollments['Name'] < 'M']
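
Queries often involve more than one condition. Boolean Series can be combined element-wise with `&` (and), `|` (or), and `~` (not). The sketch below rebuilds the same `enrollments` DataFrame so that it runs on its own.

```python
import pandas as pd

enrollments = pd.DataFrame({
    'Name': ['Granger, Hermione', 'Potter, Harry', 'Weasley, Ron', 'Longbottom, Neville'],
    'PID': ['A13245986', 'A17645384', 'A32438694', 'A52342436'],
    'LVL': [1, 1, 1, 1],
})

# Wrap each comparison in parentheses: & and | bind more tightly than < and ==
both = enrollments.loc[(enrollments['Name'] < 'M') & enrollments['Name'].str.contains('on')]
either = enrollments.loc[(enrollments['Name'] < 'M') | (enrollments['PID'] == 'A17645384')]
negated = enrollments.loc[~(enrollments['Name'] < 'M')]
```

Python's `and`, `or`, and `not` keywords don't work here, because they try to reduce an entire Series to a single `True` or `False`.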

### Selecting columns and rows simultaneously

So far, we used `[]` to select columns and `loc` to select rows.

In [None]:
enrollments.loc[enrollments['Name'] < 'M']['PID']

### Selecting columns and rows simultaneously

`loc` can also be used to select both rows and columns. The general pattern is:

```
df.loc[<row selector>, <column selector>]
```

Examples:
- `df.loc[idx_list, col_list]` returns a DataFrame containing the rows in `idx_list` and columns in `col_list`.
- `df.loc[bool_arr, col_list]` returns a DataFrame containing the rows for which `bool_arr` is `True` and columns in `col_list`.
- If `:` is used as the first input, all rows are kept. If `:` is used as the second input, all columns are kept.

In [None]:
enrollments

In [None]:
enrollments.loc[enrollments['Name'] < 'M', 'PID']

In [None]:
enrollments.loc[enrollments['Name'] < 'M', ['PID']]

### Even more ways of selecting rows and columns

In `df.loc[<row selector>, <column selector>]`:

- Both the first and second inputs can be Boolean sequences.
- Both the first and second inputs can be **slices**, which use `:` syntax (e.g. `0:2`, `'Name': 'PID'`). Unlike Python list slices, `loc` slices include **both** endpoints.
- If both the first and second inputs are single labels (strings or numbers), the result is a single value, not a DataFrame or Series.
- The first input can be a **function** that takes the DataFrame as input and returns a valid row selector, such as a Boolean Series.

There are many more options; see the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for details.
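
A few of these options are sketched below on a self-contained copy of `enrollments`. Note that a function passed to `loc` is called once with the whole DataFrame and should return a selector such as a Boolean Series; the length threshold of 14 is an arbitrary choice for illustration.

```python
import pandas as pd

enrollments = pd.DataFrame({
    'Name': ['Granger, Hermione', 'Potter, Harry', 'Weasley, Ron', 'Longbottom, Neville'],
    'PID': ['A13245986', 'A17645384', 'A32438694', 'A52342436'],
    'LVL': [1, 1, 1, 1],
})

# Function as the row selector: receives the DataFrame, returns a Boolean Series
short_names = enrollments.loc[lambda df: df['Name'].str.len() < 14]

# Boolean sequence as the column selector: keep the first and third columns
name_and_lvl = enrollments.loc[:, [True, False, True]]

# Two single labels give a single value rather than a Series or DataFrame
one_value = enrollments.loc[2, 'LVL']
```

The function form is most useful in method chains, where the intermediate DataFrame has no variable name to refer to.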

In [None]:
enrollments

In [None]:
enrollments.loc[2, 'LVL']

In [None]:
enrollments.loc[0:2, 'Name': 'PID']

### Don't forget `iloc`!

- `iloc` stands for "integer location".
- `iloc` is like `loc`, but it selects rows and columns based on integer positions only.

In [None]:
enrollments

In [None]:
enrollments.iloc[2:4, 0:2]

In [None]:
other = enrollments.set_index('Name')
other

In [None]:
other.iloc[2]

In [None]:
# KeyError: 2 is not a label in the index of `other`
other.loc[2]
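
The sketch below contrasts `loc` and `iloc` on a small DataFrame with string index labels. The shortened PIDs and first-name labels are made up for illustration. Besides labels versus positions, the two differ in how slices treat the stop value.

```python
import pandas as pd

# Hypothetical DataFrame indexed by first name
df = pd.DataFrame(
    {'PID': ['A1', 'A2', 'A3', 'A4'], 'LVL': [1, 1, 1, 1]},
    index=['Hermione', 'Harry', 'Ron', 'Neville'],
)

by_label = df.loc['Harry':'Ron']   # label slice: both endpoints are included
by_position = df.iloc[1:2]         # position slice: the stop is excluded, as in Python lists

third_row = df.iloc[2]             # works: position 2 exists
try:
    df.loc[2]                      # fails: 2 is not one of the index labels
    raised = False
except KeyError:
    raised = True
```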

### Discussion Question

Let's return to the `names` DataFrame.

In [None]:
names

**Question:** How many babies were born with the name `'Billy'` and gender `'M'`?

In [None]:
...

### More Practice

Consider the DataFrame below.

In [None]:
jack = pd.DataFrame({1: ['fee', 'fi'], '1': ['fo', 'fum']})
jack

For each of the following pieces of code, predict what the output will be. Then, uncomment the line of code and see for yourself.

In [None]:
# jack[1]

In [None]:
# jack[[1]]

In [None]:
# jack['1']

In [None]:
# jack[[1,1]]

In [None]:
# jack.loc[1]

In [None]:
# jack.loc[jack[1] == 'fo']

In [None]:
# jack[1, ['1', 1]]

In [None]:
# jack.loc[1,1]

## Summary, next time

### Summary

- `pandas` is **the** library for tabular data manipulation in Python.
- There are three key data structures in `pandas`: DataFrame, Series, and Index.
- Refer to the lecture notebook and the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) for tips.
- **Next time:** useful methods for working with DataFrames and Series.