In [2]:
import pandas as pd
import numpy as np
import os
import seaborn as sns

from IPython.display import display, IFrame
def show_paradox_slides():
    src = 'https://docs.google.com/presentation/d/e/2PACX-1vSbFSaxaYZ0NcgrgqZLvjhkjX-5MQzAITWAsEFZHnix3j1c0qN8Vd1rogTAQP7F7Nf5r-JWExnGey7h/embed?start=false'
    width = 960
    height = 569
    display(IFrame(src, width, height))

# Lecture 8 – Pivoting, Simpson's Paradox, and Concatenation

## DSC 80, Spring 2022

### Announcements

- Project 1 is due **tomorrow at 11:59PM**.
- Discussion section 3 is today from 7-8PM, and the discussion notebook is due (for extra credit!) on **Saturday, April 16th at 11:59PM**.
- Lab 3 is due on **Monday, April 18th 11:59PM**.
    - Check [here](https://campuswire.com/c/G325FA25B/feed/507) for clarifications.
- Lab 1 (+more) grades are released – see [this post](https://campuswire.com/c/G325FA25B/feed/509) for details, and [this post](https://campuswire.com/c/G325FA25B/feed/508) for assignment solutions.
- Watch [this video 🎥](https://www.youtube.com/watch?v=uUawZfAgA64) for tips on how to work with the command-line.

### Agenda

- Grouping with multiple columns.
- Pivoting.
- Simpson's paradox.
- Concatenation.
- Time permitting: time series data.

## Grouping

<center><h1>🐧</h1></center>

In [None]:
penguins = sns.load_dataset('penguins').dropna()
penguins.head()

### Discussion Question

For each species, find the island on which the heaviest penguin of that species lives.

In [None]:
# Why doesn't this work?
penguins.groupby('species').max()

In [None]:
penguins.sort_values('body_mass_g', ascending=False).groupby('species').first()
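
The first attempt above fails because `max` is computed separately for each column, so the `'island'` in each row is just the alphabetically last island for that species, not the island of the heaviest penguin. A minimal sketch of an alternative, using a small made-up DataFrame (the rows below are hypothetical, not taken from `penguins`) and `idxmax` to locate the heaviest row in each group:

```python
import pandas as pd

# Hypothetical toy data, standing in for `penguins`.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'island': ['Torgersen', 'Biscoe', 'Biscoe', 'Biscoe'],
    'body_mass_g': [3500, 4200, 5000, 6100],
})

# Column-wise max: 'island' is the alphabetically last island per species,
# which need not be where the heaviest penguin lives.
columnwise = toy.groupby('species').max()

# idxmax returns the index label of the heaviest row in each group;
# .loc then pulls out those complete rows.
heaviest_rows = toy.loc[toy.groupby('species')['body_mass_g'].idxmax()]
heaviest_island = heaviest_rows.set_index('species')['island']
```

Here `columnwise` reports `'Torgersen'` for Adelie, even though the heaviest Adelie in the toy data lives on Biscoe, which is what `heaviest_island` reports.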

### Grouping with multiple columns

When we group with multiple columns, one group is created for **every unique combination** of elements in the specified columns.

In [None]:
double_group = penguins.groupby(['species', 'island'])
double_group

In [None]:
double_group.groups

In [None]:
for key, df in double_group:
    display(df.head())

In [None]:
penguins.groupby(['species', 'island']).mean(numeric_only=True)

### Grouping and indexes

- The `groupby` method creates an index based on the specified columns.
- When grouping by multiple columns, the resulting DataFrame has a `MultiIndex`.
- Advice: When working with a `MultiIndex`, use `reset_index` or set `as_index=False` in `groupby`.

In [None]:
weird = penguins.groupby(['species', 'island']).mean(numeric_only=True)
weird

In [None]:
weird['body_mass_g']

In [None]:
weird.loc['Adelie']

In [None]:
weird.loc[('Adelie', 'Torgersen')]

In [None]:
weird.reset_index()

In [None]:
penguins.groupby(['species', 'island'], as_index=False).mean(numeric_only=True)

## Pivoting

### Average body mass for every combination of species and island

To find the above information, we can group by both `'species'` and `'island'`.

In [None]:
penguins.groupby(['species', 'island'])['body_mass_g'].mean()

But we can also create a **pivot table**.

In [None]:
penguins.pivot_table(index='species', 
                     columns='island', 
                     values='body_mass_g', 
                     aggfunc='mean')

Note that the DataFrame above shows the same information as the Series above it, just in a different arrangement.

### `pivot_table`

- The `pivot_table` DataFrame method aggregates a DataFrame using two columns. To use it:

```py
df.pivot_table(index=index_col,
               columns=columns_col,
               values=values_col,
               aggfunc=func)
```
- The resulting DataFrame will have:
    - One row for every unique value in `index_col`.
    - One column for every unique value in `columns_col`.
    - Values determined by applying `func` on values in `values_col`.

### Example

Find the number of penguins per island and species.

In [None]:
penguins.pivot_table(index='island', 
                     columns='species', 
                     values='bill_length_mm', 
                     aggfunc='count')

Note that there is a `NaN` at the intersection of `'Biscoe'` and `'Chinstrap'`, because there were no Chinstrap penguins on Biscoe Island.

We can either use the `fillna` method afterwards or the `fill_value` argument to fill in `NaN`s.

In [None]:
penguins.pivot_table(index='island', 
                     columns='species', 
                     values='bill_length_mm', 
                     aggfunc='count').fillna(0)

In [None]:
penguins.pivot_table(index='island', 
                     columns='species', 
                     values='bill_length_mm', 
                     aggfunc='count', 
                     fill_value=0)

### Example

Find the mean body mass per species and sex.

In [None]:
penguins.pivot_table(index='species', columns='sex', values='body_mass_g', aggfunc='mean')

**Important:** In `penguins`, each row corresponds to an individual/observation. In the pivot table above, that is no longer true.

### Joint and conditional distributions

When using `aggfunc='count'`, a pivot table describes the joint distribution of two categorical variables.

In [None]:
counts = penguins.pivot_table(index='species', 
                              columns='sex', 
                              values='body_mass_g', 
                              aggfunc='count', 
                              fill_value=0)

counts

We can normalize the DataFrame by dividing by the total number of penguins. The resulting numbers can be interpreted as **probabilities** that a randomly selected penguin from the dataset belongs to a given combination of species and sex.

In [None]:
joint = counts / counts.sum().sum()
joint

If we sum over one of the axes, we can compute **marginal probabilities**.

In [None]:
joint

In [None]:
joint.sum(axis=1)

In [None]:
joint.sum(axis=0)

For instance, the first Series tells us that a randomly selected penguin has a 0.357357 chance of being of species `'Gentoo'`.

If we divide `counts` by row or column sums, we can compute **conditional probabilities**.

In [None]:
counts

In [None]:
counts.sum(axis=0)

The conditional distribution of species **given** sex is below.

In [None]:
counts / counts.sum(axis=0)

For instance, the above DataFrame tells us that the probability that a randomly selected penguin is of species `'Adelie'` **given** that they are of sex `'Female'` is 0.442424.

The conditional distribution of sex given species is below.

In [None]:
counts.T / counts.sum(axis=1)
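
The transpose above is needed because `/` aligns the Series of row sums with the DataFrame's *columns*. As a hedged alternative sketch, the `div` method with `axis=0` divides each row by its own sum and keeps species as the rows. The `toy_counts` values below are made up for illustration.

```python
import pandas as pd

# Hypothetical counts, standing in for `counts`.
toy_counts = pd.DataFrame(
    {'Female': [30, 10], 'Male': [20, 40]},
    index=pd.Index(['Adelie', 'Gentoo'], name='species'),
)

# Conditional distribution of sex given species: each row sums to 1.
sex_given_species = toy_counts.div(toy_counts.sum(axis=1), axis=0)

# Same numbers as the transpose-based approach, transposed back for comparison.
via_transpose = (toy_counts.T / toy_counts.sum(axis=1)).T
```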

### `pivot_table` aggregates and reshapes

- The `pivot_table` method does two things. It:
    - Aggregates based on two columns.
    - Reshapes the data from "long" to "wide".
        - Rows no longer correspond to observations.
- At times, we may only want to do the second step – reshape the data.

### Example: Tic-tac-toe

<center><img src='imgs/tic-tac-toe.png' width=20%></center>

In [None]:
moves = pd.DataFrame([
    [1, 1, 'O'],
    [2, 1, 'X'],
    [2, 2, 'X'],
    [2, 3, 'O'],
    [3, 1, 'O'],
    [3, 3, 'X']
], columns=['i', 'j', 'move'])
moves

In [None]:
moves.pivot(index='i', columns='j', values='move').fillna('')

The `pivot` method **only** reshapes a DataFrame. It does not change any of the values in it (i.e. `aggfunc` doesn't work with `pivot`).

### `pivot_table` = `groupby` + `pivot`

- `pivot_table` is a shortcut for using `groupby` and then using `pivot`.
- For example, both of the following code cells find the mean body mass per species and sex.

In [None]:
(
    penguins.groupby(['species', 'sex'])[['body_mass_g']]
            .mean()
            .reset_index()
            .pivot(index='species', columns='sex', values='body_mass_g')
)

In [None]:
penguins.pivot_table(index='species', columns='sex', values='body_mass_g', aggfunc='mean')

`aggfunc='mean'` plays the same role that `.mean()` does.

### Reshaping

- `pivot_table` and `pivot` reshape DataFrames from "long" to "wide".
- Other DataFrame reshaping methods:
    - `melt`: un-pivots a DataFrame.
    - `stack`: pivots multi-level columns to multi-indices.
    - `unstack`: pivots multi-indices to columns.
    - Google and the documentation are your friends!
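
A brief sketch of two of these methods, on a small hypothetical wide DataFrame (the values are made up). `melt` turns it back into long format, and `unstack` moves an index level to the columns:

```python
import pandas as pd

# Hypothetical wide table: one row per species, one column per sex.
wide = pd.DataFrame({
    'species': ['Adelie', 'Gentoo'],
    'Female': [3400.0, 4700.0],
    'Male': [4000.0, 5500.0],
})

# melt: one row per (species, sex) pair, i.e. "wide" back to "long".
long = wide.melt(id_vars='species', var_name='sex', value_name='body_mass_g')

# unstack: move the inner index level ('sex') to the columns, recovering a wide layout.
rewide = long.set_index(['species', 'sex'])['body_mass_g'].unstack()
```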

## Simpson's paradox

<center><img src="imgs/image_2.png" width=50%></center>

### Example: Grades

- Two students, Lisa and Bart, just finished freshman year. They took different numbers of units each quarter (Fall, Winter, and Spring).

- Within each quarter, Lisa had a higher GPA than Bart.

- But Bart has a higher overall GPA.

- How is this possible? 🤔

**Note:** The number of "grade points" you earn for a course is

$$\text{number of units} \cdot \text{grade (out of 4)}$$

So an A- in a 4 unit course earns $3.7 \cdot 4 = 14.8$ grade points.

In [None]:
lisa = pd.DataFrame([
        [20, 46],
        [18, 54],
        [5, 20]
    ],
    columns=['Units', 'Grade Points Earned'], 
    index=['Fall', 'Winter', 'Spring'])

lisa

In [None]:
bart = pd.DataFrame([
        [5, 10],
        [5, 13.5],
        [22, 81.4]
    ],
    columns=['Units', 'Grade Points Earned'], 
    index=['Fall', 'Winter', 'Spring'])

bart

The following DataFrame shows that Lisa had a higher GPA in all three quarters.

In [None]:
quarterly_gpas = pd.DataFrame(
    {
        "Lisa's Quarter GPA": lisa['Grade Points Earned'] / lisa['Units'],
        "Bart's Quarter GPA": bart['Grade Points Earned'] / bart['Units']
    }
)

quarterly_gpas

But Lisa's overall GPA is less than Bart's overall GPA.

In [None]:
tot = lisa.sum()
tot['Grade Points Earned'] / tot['Units']

In [None]:
tot = bart.sum()
tot['Grade Points Earned'] / tot['Units']

### What happened?

- When Lisa and Bart both performed poorly, Lisa took more units than Bart.
    - This brings down Lisa's overall average.
- When Lisa and Bart both performed well, Bart took more units than Lisa.
    - This brings up Bart's overall average.
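
One way to see this: an overall GPA is a weighted average of the quarterly GPAs, with units as the weights. A small sketch using the numbers from the `lisa` and `bart` DataFrames above:

```python
import numpy as np

# Units and grade points per quarter (Fall, Winter, Spring), copied from above.
lisa_units, lisa_points = np.array([20, 18, 5]), np.array([46, 54, 20])
bart_units, bart_points = np.array([5, 5, 22]), np.array([10, 13.5, 81.4])

lisa_quarterly = lisa_points / lisa_units
bart_quarterly = bart_points / bart_units

# Weighting each quarterly GPA by its units recovers the overall GPA.
lisa_overall = np.average(lisa_quarterly, weights=lisa_units)
bart_overall = np.average(bart_quarterly, weights=bart_units)
```

Lisa's largest weights (20 and 18 units) sit on her two lowest quarterly GPAs, while Bart's largest weight (22 units) sits on his highest.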

In [None]:
quarterly_gpas.assign(Lisa_units=lisa['Units']) \
              .assign(Bart_units=bart['Units']) \
              .iloc[:, [0, 2, 1, 3]]

### Simpson's paradox

- Simpson's paradox occurs when **grouped data and ungrouped data show opposing trends**.
    - It is named after Edward H. Simpson, not Lisa or Bart Simpson.
    
- It is **purely arithmetic** – it is a consequence of weighted averages.

- It often happens because there is a hidden factor (i.e. a **confounder**) within the data that influences results.

- **Question:** What is the "correct" way to summarize your data? What if you had to act on these results?

### Example: How Berkeley was sued for gender discrimination (1973)

### What do you notice?

<center><img src='imgs/berkeley.png' width=70%></center>

In [None]:
show_paradox_slides()

### What happened?

- The overall acceptance rate for women (30%) was lower than it was for men (45%).
- However, most departments (A, B, D, F) had a higher acceptance rate for women.
- Department A had a 62% acceptance rate for men and an 82% acceptance rate for women!
    - 31% of men applied to Department A.
    - 6% of women applied to Department A.
- Department F had a 6% acceptance rate for men and a 7% acceptance rate for women!
    - 14% of men applied to Department F.
    - 19% of women applied to Department F.
- **Conclusion:** Women tended to apply to departments with a lower acceptance rate.

### Caution!

This doesn't mean that admissions are free from gender discrimination! 

From [Moss-Racusin et al., 2012, PNAS](https://www.pnas.org/doi/10.1073/pnas.1211286109) (cited 2600+ times):

> In a randomized double-blind study (n = 127), **science faculty** from research-intensive universities **rated the application materials of a student—who was randomly assigned either a male or female** name—for a laboratory manager position. Faculty **participants rated the male applicant as significantly more competent and hireable than the (identical) female applicant**. These participants also selected a higher starting salary and offered more career mentoring to the male applicant. The gender of the faculty participants did not affect responses, such that female and male faculty were equally likely to exhibit bias against the female student.

### But then...

From [Williams and Ceci, 2015, PNAS](https://www.pnas.org/doi/10.1073/pnas.1418878112):

> Here we report five hiring experiments in which faculty evaluated hypothetical female and male applicants, using systematically varied profiles disguising identical scholarship, for assistant professorships in biology, engineering, economics, and psychology. Contrary to prevailing assumptions, **men and women faculty members from all four fields preferred female applicants 2:1 over identically qualified males** with matching lifestyles (single, married, divorced), with the exception of male economists, who showed no gender preference.

### Do these conflict?

Not necessarily. One explanation, from Williams and Ceci:

> Instead, past studies have used ratings of students’ hirability for a range of posts that do not include tenure-track jobs, such as managing laboratories or performing math assignments for a company. However, hiring tenure-track faculty differs from hiring lower-level staff: it entails selecting among highly accomplished candidates, all of whom have completed Ph.D.s and amassed publications and strong letters of support. **Hiring bias may occur when applicants’ records are ambiguous, as was true in studies of hiring bias for lower-level staff posts, but such bias may not occur when records are clearly strong**, as is the case with tenure-track hiring.

### Do these conflict?

From Witteman et al., 2019, in *The Lancet*:

> Thus, evidence of scientists favouring women comes exclusively from hypothetical scenarios, whereas evidence of scientists favouring men comes from hypothetical scenarios and real behaviour. This **might reflect academics' growing awareness of the social desirability of achieving gender balance, while real academic behaviour might not yet put such ideals into action**.

### Example: Restaurant reviews and phone types

* You are deciding whether to eat at Dirty Birds or The Loft.
* Suppose Yelp shows ratings aggregated by phone type (Android vs. iPhone).
* Should you choose Dirty Birds or The Loft? 

|Phone Type|Stars for Dirty Birds|Stars for The Loft|
|---|---|---|
|Android|4.24|4.0|
|iPhone|2.99|2.79|
|**All**|**3.32**|**3.37**|



### Restaurant reviews and phone types

* It's doubtful that your phone type will **cause** you to prefer one restaurant over another.
* Again, Simpson's paradox is merely a property of weighted averages!
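
To make the weighted-average point concrete: each restaurant's overall rating is a mix of its Android and iPhone ratings, weighted by the share of reviews from each phone type. A rough sketch that backs out the implied Android share from the rounded numbers in the table above (so the results are approximate; `implied_android_share` is a helper name made up for this illustration):

```python
def implied_android_share(android, iphone, overall):
    # Solve overall = w * android + (1 - w) * iphone for the weight w.
    return (overall - iphone) / (android - iphone)

dirty_birds_share = implied_android_share(4.24, 2.99, 3.32)  # roughly 0.26
the_loft_share = implied_android_share(4.0, 2.79, 3.37)      # roughly 0.48
```

The Loft's reviews skew more toward Android users, who rate both restaurants higher. That is enough to lift its overall rating above that of Dirty Birds, despite its lower rating within each phone type.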

### Verifying Simpson's paradox

In [None]:
ratings = pd.read_csv('data/ratings.csv')
ratings.sample(5)

Mean rating for each combination of phone type and restaurant:

In [None]:
ratings.pivot_table(index='phone', columns='restaurant', values='rating', aggfunc='mean')

Overall mean rating for each restaurant, ignoring phone type:

In [None]:
ratings.groupby('restaurant')['rating'].mean()

### Takeaways

Be skeptical of...
- Aggregate statistics.
- People misusing statistics to "prove" that discrimination doesn't exist.
- Drawing conclusions from individual publications (p-hacking, publication bias, narrow focus, etc.).
- Everything!

### Further reading

- [Gender Bias in Admission Statistics?](https://www.cantorsparadise.com/gender-bias-in-admission-statistics-eaabca650810)
    - Contains a **great** visualization.
- [Simpson's paradox](https://en.wikipedia.org/wiki/Simpson%27s_paradox#UC_Berkeley_gender_bias) on Wikipedia.

## Concatenating vertically

### Segue

For the rest of this week, we will look at how to combine multiple DataFrames.

### Data spread across multiple files

<center><img src="imgs/many_files.png" width="50%"></center>
<center>The SSA baby names dataset from Lecture 1/2 was downloaded as multiple files – one per year.</center>

**Question:** How do we combine multiple datasets?

### Row-wise combination of data:  `pd.concat`

<center><img src="imgs/merging_append3.png" width="30%"></center>




* The `pd.concat` function combines DataFrame and Series objects.
* By default, the **rows of objects are stacked on top of one another**.
* `pd.concat` has many options; we'll learn them slowly.

### Example: Grades

By default, `pd.concat` stacks DataFrames row-wise, i.e. on top of one another.

In [3]:
section_A = pd.DataFrame({
    'Name': ['Annie', 'Billy', 'Sally', 'Tommy'],
    'Midterm': [98, 82, 23, 45],
    'Final': [88, 100, 99, 67]
})

section_A

| |Name|Midterm|Final|
|---|---|---|---|
|0|Annie|98|88|
|1|Billy|82|100|
|2|Sally|23|99|
|3|Tommy|45|67|


In [4]:
section_B = pd.DataFrame({
    'Name': ['Junior', 'Rex', 'Flash'],
    'Midterm': [70, 99, 81],
    'Final': [42, 25, 90]
})

section_B

| |Name|Midterm|Final|
|---|---|---|---|
|0|Junior|70|42|
|1|Rex|99|25|
|2|Flash|81|90|


Let's use `pd.concat` on a list of the above two DataFrames.

In [5]:
pd.concat([section_A, section_B])

| |Name|Midterm|Final|
|---|---|---|---|
|0|Annie|98|88|
|1|Billy|82|100|
|2|Sally|23|99|
|3|Tommy|45|67|
|0|Junior|70|42|
|1|Rex|99|25|
|2|Flash|81|90|


Setting the optional argument `ignore_index` to `True` fixes the index (which `.reset_index(drop=True)` could also do).

In [6]:
pd.concat([section_A, section_B], ignore_index=True)

| |Name|Midterm|Final|
|---|---|---|---|
|0|Annie|98|88|
|1|Billy|82|100|
|2|Sally|23|99|
|3|Tommy|45|67|
|4|Junior|70|42|
|5|Rex|99|25|
|6|Flash|81|90|
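
As a quick check that the two approaches agree, here is a sketch on two small stand-in DataFrames (hypothetical values). Note that `reset_index` needs `drop=True`; otherwise the old index is kept as a new column.

```python
import pandas as pd

# Small stand-ins for section_A and section_B.
first = pd.DataFrame({'Name': ['Annie', 'Billy'], 'Final': [88, 100]})
second = pd.DataFrame({'Name': ['Junior'], 'Final': [42]})

with_ignore = pd.concat([first, second], ignore_index=True)
with_reset = pd.concat([first, second]).reset_index(drop=True)
```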


To keep track of which original DataFrame each row came from, we can use the `keys` optional argument.

In [7]:
combined = pd.concat([section_A, section_B], keys=['Section A', 'Section B'])
combined

| | |Name|Midterm|Final|
|---|---|---|---|---|
|Section A|0|Annie|98|88|
|Section A|1|Billy|82|100|
|Section A|2|Sally|23|99|
|Section A|3|Tommy|45|67|
|Section B|0|Junior|70|42|
|Section B|1|Rex|99|25|
|Section B|2|Flash|81|90|


The resulting DataFrame has a MultiIndex, though.

In [None]:
combined.loc['Section A']
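
If the MultiIndex is inconvenient, one alternative sketch is to record the source in an ordinary column before concatenating. The `'Section'` column name is an assumption for illustration, and the DataFrames are small stand-ins with made-up values:

```python
import pandas as pd

sec_a = pd.DataFrame({'Name': ['Annie', 'Billy'], 'Final': [88, 100]})
sec_b = pd.DataFrame({'Name': ['Junior'], 'Final': [42]})

# Tag each DataFrame with its source, then stack with a fresh 0..n-1 index.
labeled = pd.concat(
    [sec_a.assign(Section='A'), sec_b.assign(Section='B')],
    ignore_index=True,
)
```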

### Missing columns?

If we concatenate two DataFrames that don't share the same column names, `NaN`s are added in the columns that aren't shared.

In [8]:
section_C = pd.DataFrame({
    'Name': ['Justin', 'Marina'],
    'Final': [98, 52]
})

section_C

| |Name|Final|
|---|---|---|
|0|Justin|98|
|1|Marina|52|


In [9]:
section_D = pd.DataFrame({
    'Name': ['Janine', 'Aaron', 'Suraj'],
    'Midterm': [10, 80, 40]
})

section_D

| |Name|Midterm|
|---|---|---|
|0|Janine|10|
|1|Aaron|80|
|2|Suraj|40|


In [10]:
pd.concat([section_C, section_D])

| |Name|Final|Midterm|
|---|---|---|---|
|0|Justin|98.0|NaN|
|1|Marina|52.0|NaN|
|0|Janine|NaN|10.0|
|1|Aaron|NaN|80.0|
|2|Suraj|NaN|40.0|


### ⚠️ Warning: No loops!

- `pd.concat` returns a copy; it does not modify any of the input DataFrames.
- Do **not** use `pd.concat` in a loop, as it has terrible time and space efficiency: every call copies all of the rows accumulated so far.

```py
# Anti-pattern: don't do this!
total = pd.DataFrame()
for df in dataframes:
    total = pd.concat([total, df])
```

- Instead, use `pd.concat(dataframes)`, where `dataframes` is a list of DataFrames.
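
A minimal sketch of the recommended pattern: collect the pieces in a list, then concatenate once at the end. The `make_piece` helper is hypothetical and stands in for however each DataFrame is produced (e.g. `pd.read_csv`).

```python
import pandas as pd

def make_piece(i):
    # Hypothetical stand-in for loading the i-th DataFrame.
    return pd.DataFrame({'piece': [i, i], 'value': [10 * i, 10 * i + 1]})

# Building a list of DataFrames does not copy any rows.
dataframes = [make_piece(i) for i in range(3)]

# A single call to pd.concat copies each row only once.
total = pd.concat(dataframes, ignore_index=True)
```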

### Aside: accessing file names programmatically

- At times, you'll need to load in all of the files in a given folder.
- `os.listdir(dirname)` returns a **list** of the names of the files and folders in the folder `dirname`.

In [3]:
os.listdir('data')

['ratings.csv']

In [4]:
os.listdir('../')

['lec08',
 'lec03',
 'lec01',
 'lec06',
 'lec09',
 'lec05',
 'lec02',
 'lec07',
 'lec04']

The following does something similar.

In [None]:
!ls ../
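
Putting `os.listdir` and `pd.concat` together, here is a hedged sketch of loading every `.csv` file in a folder. It writes two small made-up files to a temporary directory so that it runs on its own; the file names and contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as dirname:
    # Create two hypothetical per-year files.
    pd.DataFrame({'name': ['Ann'], 'count': [5]}).to_csv(
        os.path.join(dirname, '2020.csv'), index=False)
    pd.DataFrame({'name': ['Bob'], 'count': [7]}).to_csv(
        os.path.join(dirname, '2021.csv'), index=False)

    # os.listdir returns names only, in arbitrary order,
    # so sort them and join each with the folder path.
    paths = [os.path.join(dirname, f)
             for f in sorted(os.listdir(dirname)) if f.endswith('.csv')]

    # Read every file, then concatenate once.
    all_years = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
```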

## Aside: Working with time series data

### Time series – why now?

- Data is often partitioned by time. For instance, there may be one `.csv` file per day for 1 year.
- To combine the datasets, we will need to load in the files as DataFrames and `pd.concat` the DataFrames together.
- Note: "time series" is a general term and is not related to Series in `pandas`.

### Datetime types

When working with time data, you will see two different kinds of "times":

* **Datetimes** reference particular moments in time (e.g. November 26th, 1998 at 8:26AM).
    - Could just be a date, e.g. September 15, 2014.
    - Could just be a time, e.g. 4:45 AM.
    - Datetimes typically don't keep track of timezones.
* **Timedeltas**, or durations, reference an exact length of time (e.g. a duration of 3 hours).

### The `datetime` module

Python has a built-in `datetime` module, which contains `datetime` and `timedelta` types. These are much more convenient to deal with than strings that contain times.

In [None]:
import datetime

In [None]:
datetime.datetime.now()

In [None]:
datetime.datetime.now() + datetime.timedelta(days=3, hours=5)

Recall, Unix timestamps count the number of seconds since January 1st, 1970 (at midnight UTC).

In [None]:
datetime.datetime.now().timestamp()

### Times in `pandas`

- `pd.Timestamp` is the `pandas` equivalent of `datetime`.
- `pd.to_datetime` converts strings to `pd.Timestamp` objects.

In [None]:
pd.Timestamp(year=1998, month=11, day=26)

In [None]:
final_start = pd.to_datetime('June 4th, 2022, 11:30AM')
final_start

In [None]:
final_finish = pd.to_datetime('June 4th, 2022, 2:30PM')
final_finish

Timestamps have time-related attributes, e.g. `dayofweek`, `hour`, `minute`, `second`.

In [None]:
final_finish.dayofweek

In [None]:
final_finish.year

Subtracting timestamps yields `pd.Timedelta` objects.

In [None]:
final_finish - final_start

### Timestamps in DataFrames

- If we create a Series of datetimes with `pd.to_datetime`, `pandas` stores them as yet *another* type: `np.datetime64`.
    - These are similar to `pd.Timestamp`, but optimized for memory and speed efficiency.
- If we access a single time, we get a `pd.Timestamp` back.
- See [the documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html) for more details.

In [None]:
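# Note: these strings are in three different formats. Newer versions of pandas (2.0+)
# may require passing format='mixed' to pd.to_datetime to parse them together.
times = pd.DataFrame({'finish': pd.to_datetime(['Sun, Jan 01, 1989', 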
                                                '2022-04-13T11:00', 
                                                '1/1/1970'])})
times

In [None]:
times.info()

In [None]:
times.iloc[0, 0]

In [None]:
times.sort_values('finish')

### Example: Exam speeds

Below, we have the Final Exam starting and ending times for two sections of a course.

In [None]:
times_A = pd.DataFrame({
    'Name': ['Annie', 'Billy', 'Sally', 'Tommy'],
    'start_exam': ['15:00', '15:02', '15:01', '15:00'],
    'finish_exam': ['16:00', '17:58', '17:05', '16:55']
})

times_B = pd.DataFrame({
    'Name': ['Junior', 'Rex', 'Flash'],
    'start_exam': ['18:00', '18:06', '19:07'],
    'finish_exam': ['20:00', '20:50', '20:59']
})

display(times_A)
display(times_B)

**Question:** Who finished the exam the fastest amongst all students in the course?

Approach:
1. Concatenate the two DataFrames.
2. Convert the time columns to `pd.Timestamp`.
3. Find the difference between `'finish_exam'` and `'start_exam'`.
4. Sort.
5. Pick the fastest exam taker.

In [None]:
# Step 1
both_versions = pd.concat([times_A, times_B])
both_versions

In [None]:
# Step 2
both_versions = both_versions.assign(
    start_exam=pd.to_datetime(both_versions['start_exam']),
    finish_exam=pd.to_datetime(both_versions['finish_exam'])
)

both_versions.info()

In [None]:
# Step 3
both_versions = both_versions.assign(
    elapsed=both_versions['finish_exam'] - both_versions['start_exam']
)

both_versions

In [None]:
# Steps 4 and 5
both_versions.sort_values('elapsed').iloc[0].loc['Name']
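
Timedeltas can be awkward to read or plot. As a hedged sketch on a two-row stand-in (the date is arbitrary and the `'minutes'` column name is an assumption), dividing a Series of Timedeltas by `pd.Timedelta(minutes=1)` gives plain floats:

```python
import pandas as pd

# Stand-in for the first two rows of times_A, with an arbitrary date attached.
exam = pd.DataFrame({
    'Name': ['Annie', 'Billy'],
    'start_exam': pd.to_datetime(['2022-06-04 15:00', '2022-06-04 15:02']),
    'finish_exam': pd.to_datetime(['2022-06-04 16:00', '2022-06-04 17:58']),
})

# Subtracting datetime columns gives a Series of Timedeltas.
elapsed = exam['finish_exam'] - exam['start_exam']

# Dividing by a one-minute Timedelta converts each duration to a float number of minutes.
exam = exam.assign(minutes=elapsed / pd.Timedelta(minutes=1))

fastest = exam.sort_values('minutes').iloc[0]['Name']
```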

## Summary, next time

### Summary

- `pivot_table` aggregates data based on two categorical columns, and reshapes the result to be "wide" instead of "long".
- Simpson's paradox occurs when grouped data and ungrouped data show opposing trends.
    - It is a consequence of arithmetic.
- To "stack" different DataFrames on top of one another vertically, use `pd.concat` with a list of DataFrames.
- Times in `pandas` are represented using `pd.Timestamp` and `pd.Timedelta` objects.
- **Next time:** Horizontal concatenation. Merging.