In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib_inline.backend_inline import set_matplotlib_formats
from IPython.display import display, IFrame

# Pandas Tutor setup
%reload_ext pandas_tutor
%set_pandas_tutor_options {"maxDisplayCols": 8, "nohover": True, "projectorMode": True}

set_matplotlib_formats("svg")
sns.set_context("poster")
sns.set_style("whitegrid")
plt.rcParams["figure.figsize"] = (10, 5)
pd.set_option("display.max_rows", 8)
pd.set_option("display.max_columns", 8)
pd.set_option("display.precision", 2)

def show_paradox_slides():
    src = 'https://docs.google.com/presentation/d/e/2PACX-1vSbFSaxaYZ0NcgrgqZLvjhkjX-5MQzAITWAsEFZHnix3j1c0qN8Vd1rogTAQP7F7Nf5r-JWExnGey7h/embed?start=false'
    width = 960
    height = 569
    display(IFrame(src, width, height))

# Lecture 3 – Aggregating, Simpson's paradox

## DSC 80, Fall 2023

### Agenda

- Data granularity.
- Grouping using `.groupby()` and `.resample()`.
- Pivot tables using `.pivot_table()`.
- Conditional probabilities.
- Simpson's paradox.

## 📣 Announcements 📣

- Good job turning in Lab 1!
- Lab 2 out, due Monday.
- Project 1 checkpoint due tomorrow.
    - Project 1 due next Wed.

## Data granularity

### Granularity

- **Granularity** refers to what each observation in a dataset represents.
    - Fine: small details.
    - Coarse: bigger picture.
- Most commonly, rows in a DataFrame correspond to observations, and columns correspond to attributes. Data formatted in this way is called [tidy data](https://r4ds.had.co.nz/tidy-data.html).

### Example: Baby Names

What is a single observation in the baby names data?

In [None]:
baby = pd.read_csv('data/baby.csv')
baby

### Example: CO2 readings

What is a single observation in this dataset of CO2 readings?

In [None]:
# Don't worry about this code, we'll cover it when we talk about data cleaning
co2 = pd.read_csv('data/co2_mm_mlo.txt', 
                  header=None, skiprows=72, sep=r'\s+',
                  names=['Yr', 'Mo', 'DecDate', 'Avg', 'co2', 'Trend', 'days'],
                  usecols=['Yr', 'Mo', 'DecDate', 'co2'])
co2

### Example: CO2 readings by month

In [None]:
# Don't worry about understanding this code for now
sns.lineplot(data=co2, x='DecDate', y='co2');

### Collecting data

- If you can control how your dataset is created, you should opt for **finer granularity**, i.e. for more detail.
- You can easily remove detail, but it's difficult to add detail if it is not already present in the dataset.
- Tradeoff: obtaining fine-grained data can take more time/money.

### Manipulating granularity

- We'll now explore how to change the level of granularity present in our dataset.
    - While it may seem like we are "losing information," removing detail can help us understand bigger-picture trends in our data.

### Example: Penguins

<center><img src="imgs/lter_penguins.png" width=60%>
<i><a href="https://github.com/allisonhorst/palmerpenguins/blob/main/README.md">Artwork by @allison_horst</a></i>

</center>

The dataset we'll work with for the rest of the lecture involves various measurements taken of three species of penguins in Antarctica.

In [None]:
import seaborn as sns
penguins = sns.load_dataset('penguins').dropna()
penguins

### Video: Palmer Penguins

In [None]:
IFrame('https://www.youtube-nocookie.com/embed/CCrNAHXUstU?si=-DntSyUNp5Kwitjm&amp;start=11',
       width=560, height=315)

### Aggregating: Basics

We know how to find the mean body mass for all the penguins:

In [None]:
...

### 💡 Pro-Tip: Using f-strings

[Python f-strings](https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals) give an easy way to print variables nicely:

In [None]:
mean_body_mass = penguins['body_mass_g'].mean()
print(...)

### Aggregating: Basics

But what about the mean for each type of penguin?

In [None]:
penguins['body_mass_g'].mean()

### Naive approach: looping through unique values

In [None]:
species_map = pd.Series([], dtype=float)

for species in penguins['species'].unique():
    species_only = penguins.loc[penguins['species'] == species]
    species_map.loc[species] = species_only['body_mass_g'].mean()
    
species_map

- For each unique `'species'`, we make a pass through the entire dataset.
    - The asymptotic runtime of this procedure is $\Theta(ns)$, where $n$ is the number of rows and $s$ is the number of unique species.

- While there are other loop-based solutions that only involve a single pass over the DataFrame, we'd like to avoid Python loops entirely, as they're slow.
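
As a rough sketch of the single-pass idea mentioned above, the loop below keeps a running sum and count per species while reading each row once. The `toy` DataFrame and the names `totals`, `counts`, and `species_means` are hypothetical stand-ins for illustration, not part of the lecture code.

```python
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame.
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Adelie', 'Chinstrap', 'Gentoo'],
    'body_mass_g': [3700.0, 5000.0, 3900.0, 3600.0, 5200.0],
})

totals = {}  # species -> running sum of body mass
counts = {}  # species -> number of rows seen so far

# One pass over the rows: the work grows with n, not with n times s.
for species, mass in zip(toy['species'], toy['body_mass_g']):
    totals[species] = totals.get(species, 0.0) + mass
    counts[species] = counts.get(species, 0) + 1

species_means = pd.Series({s: totals[s] / counts[s] for s in totals})
print(species_means)
```

This is still a Python-level loop, so it avoids the repeated passes but not the per-row interpreter overhead.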

## Grouping

In [None]:
# Before:
penguins['body_mass_g'].mean()

# After:
...

Somehow, the `groupby` method computes what we're looking for in just one line. How?

In [None]:
%%pt

...

### "Split-apply-combine" paradigm

The `groupby` method involves three steps: **split**, **apply**, and **combine**. This is the same terminology that the [`pandas` documentation](https://pandas.pydata.org/docs/user_guide/groupby.html) uses.

<center><img src="imgs/image_0.png" width=40%></center>

- **Split** breaks up and "groups" the rows of a DataFrame according to the specified **key**. There is one "group" for every unique value of the key.

- **Apply** uses a function (e.g. aggregation, transformation, filtration) within the individual groups.

- **Combine** stitches the results of these operations into an output DataFrame.

- The split-apply-combine pattern can be **parallelized** to work on multiple computers or threads, by sending computations for each group to different processors.
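
To make the three steps concrete, here is a minimal sketch that performs split, apply, and combine by hand and compares the result with `groupby`. The `toy` DataFrame is hypothetical, and the manual version only illustrates the idea; it is not how `pandas` implements grouping internally.

```python
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame.
toy = pd.DataFrame({
    'species': ['Adelie', 'Gentoo', 'Adelie', 'Chinstrap', 'Gentoo'],
    'body_mass_g': [3700.0, 5000.0, 3900.0, 3600.0, 5200.0],
})

# Split: one sub-DataFrame per unique value of the key.
pieces = {key: toy[toy['species'] == key] for key in toy['species'].unique()}

# Apply: reduce each piece to a single number.
applied = {key: piece['body_mass_g'].mean() for key, piece in pieces.items()}

# Combine: stitch the per-group results into one Series, sorted by key.
manual = pd.Series(applied).sort_index()

# groupby performs all three steps in one expression.
grouped = toy.groupby('species')['body_mass_g'].mean()

print(manual)
print(grouped)
```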

### More examples

Before we dive into the internals, let's look at a few more examples.

In [None]:
penguins.head()

In [None]:
penguins.shape

Which `'species'` has the highest median `'bill_length_mm'`?

In [None]:
...

What proportion of penguins of each `'species'` live on `'Dream'` island?

In [None]:
...

## `DataFrameGroupBy` objects and aggregation

### `DataFrameGroupBy` objects

We've just evaluated a few expressions of the following form.

In [None]:
penguins.groupby('species').mean()

There are two method calls in the expression above: `.groupby('species')` and `.mean()`. What happens if we remove the latter?

In [None]:
penguins.groupby('species')

### Peeking under the hood

If `df` is a DataFrame, then `df.groupby(key)` returns a `DataFrameGroupBy` object.

This object represents the "split" in "split-apply-combine".

In [None]:
# Simplified table for demonstration:
penguins_small = penguins.iloc[[0, 1, 150, 151, 251, 300, 301], [0, 5, 6]]

# Creates one group for each unique value in the species column.
penguin_groups = penguins_small.groupby('species')
penguin_groups

In [None]:
%%pt
penguin_groups

`DataFrameGroupBy` objects have a `groups` attribute, which is a dictionary in which the keys are group names and the values are lists of row labels.

In [None]:
penguin_groups#

`DataFrameGroupBy` objects also have a `get_group(key)` method, which returns a DataFrame with only the values for the given key.

In [None]:
penguin_groups#

In [None]:
# Same as the above!
penguins_small.query('species == "Chinstrap"')

We usually don't use these attributes and methods, but they're useful in understanding how `groupby` works under the hood.
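
For reference, a small sketch of both features on a hypothetical `toy` DataFrame (the row labels and values are made up for illustration):

```python
import pandas as pd

# Hypothetical table with non-consecutive row labels.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3750.0, 3800.0, 3500.0, 5000.0, 5700.0],
}, index=[0, 1, 150, 251, 300])

toy_groups = toy.groupby('species')

# .groups maps each group name to the row labels belonging to that group.
group_labels = {name: list(labels) for name, labels in toy_groups.groups.items()}
print(group_labels)

# .get_group returns the rows of a single group as a DataFrame.
chinstraps = toy_groups.get_group('Chinstrap')
print(chinstraps)
```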

### Aggregation

- Once we create a `DataFrameGroupBy` object, we need to **apply** some function to each group, and **combine** the results.

- The most common operation we apply to each group is an **aggregation**.
    - Aggregation refers to the process of reducing many values to one.

- To perform an aggregation, use an aggregation method on the `DataFrameGroupBy` object, e.g. `.mean()`, `.max()`, or `.median()`.

Let's look at some examples.

In [None]:
penguins_small

### Column independence

Within each group, the aggregation method is applied to **each column independently**.

In [None]:
penguins_small.groupby('species').max()

It **is not** telling us that there is a male `'Adelie'` penguin with a `'body_mass_g'` of `3800.0`!

In [None]:
# This penguin is Female!
penguins_small.loc[(penguins_small['species'] == 'Adelie') &
                   (penguins_small['body_mass_g'] == 3800.0)]

### Discussion Question

Find the species and weights of the heaviest `Male` and `Female` penguins.

In [None]:
# Fill in this cell

### Column selection and performance implications

- By default, the aggregator will be applied to **all** columns that it can be applied to.
    - `max` and `min` are defined on strings, while `median` and `mean` are not.

- If we only care about one column, we can select that column before aggregating **to save time**.
    - `DataFrameGroupBy` objects support `[]` notation, just like `DataFrame`s.

In [None]:
# Back to the big penguins dataset
penguins.groupby('species').mean()

To demonstrate that aggregating every column and then selecting one is slower than selecting the column before aggregating, we can use `%%timeit`. For reference, we'll also include our earlier `for`-loop-based solution.

In [None]:
%%timeit
penguins.groupby('species').mean()['bill_length_mm']

In [None]:
%%timeit
penguins.groupby('species')['bill_length_mm'].mean()

In [None]:
%%timeit
species_map = pd.Series([], dtype=float)

for species in penguins['species'].unique():
    species_only = penguins.loc[penguins['species'] == species]
    species_map.loc[species] = species_only['body_mass_g'].mean()
    
species_map

### Takeaways

- It's important to understand _what_ each piece of your code evaluates to – in the first two timed examples, the code is almost identical, but the performance is quite different.

                # Slower
                penguins.groupby('species').mean()['bill_length_mm']

                # Faster
                penguins.groupby('species')['bill_length_mm'].mean()

- The `groupby` method is much quicker than `for`-looping over the DataFrame in Python. It can often produce results using just a **single, fast pass** over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
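
As a quick check that the two orderings differ only in how much work they do, the sketch below compares their outputs on a hypothetical `toy` DataFrame. It contains only numeric columns besides the key, so `.mean()` applies to every column regardless of `pandas` version.

```python
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Gentoo', 'Gentoo'],
    'bill_length_mm': [39.0, 41.0, 49.0, 46.0, 50.0],
    'body_mass_g': [3700.0, 3900.0, 3600.0, 5000.0, 5200.0],
})

# Aggregates every column, then discards all but one.
slower = toy.groupby('species').mean()['bill_length_mm']

# Selects one column first, so only that column is aggregated.
faster = toy.groupby('species')['bill_length_mm'].mean()

print(slower.equals(faster))
```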

### Beyond default aggregation methods

- There are many built-in aggregation methods.
- What if you want to apply different aggregation methods to different columns?
- What if the aggregation method you want to use doesn't already exist in `pandas`?

### The `aggregate` method

- The `DataFrameGroupBy` object has a general `aggregate` method, which aggregates using one or more operations.
    - Remember, aggregation refers to the process of reducing many values to one.
- There are many ways of using `aggregate`; refer to [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html) for a comprehensive list.
- Example arguments:
    - A single function.
    - A list of functions.
    - A dictionary mapping column names to functions.
- Per [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html), `agg` is an alias for `aggregate`.
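
The three argument styles listed above might look like this in practice. This is a sketch on a hypothetical `toy` DataFrame; the choice of `'nunique'` for the `'island'` column is just one possible aggregation.

```python
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Gentoo', 'Gentoo'],
    'island': ['Torgersen', 'Dream', 'Dream', 'Biscoe', 'Biscoe'],
    'bill_length_mm': [39.0, 41.0, 49.0, 46.0, 50.0],
    'body_mass_g': [3700.0, 3900.0, 3600.0, 5000.0, 5200.0],
})
groups = toy.groupby('species')

# A single function: applied to the selected column within each group.
single = groups['body_mass_g'].agg('mean')

# A list of functions: one output column per function.
several = groups['body_mass_g'].agg(['count', 'mean'])

# A dictionary: a different aggregation for each column.
per_column = groups.agg({'bill_length_mm': 'max', 'island': 'nunique'})

print(several)
print(per_column)
```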

### Example

How many penguins are there of each `'species'`, and what is the mean `'body_mass_g'` of each species?

Note what happens when we don't select a column before aggregating.

### Example

What is the maximum `'bill_length_mm'` of each species, and which `'island'`s is each `'species'` found on?

### Example

What is the **interquartile range** of the `'body_mass_g'` of each `'species'`?
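
There is no built-in `'iqr'` aggregation string in `pandas`, so one option is to pass a custom function to `agg`. The sketch below assumes a hypothetical `toy` DataFrame and a helper named `iqr` that is not part of the lecture code.

```python
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame.
toy = pd.DataFrame({
    'species': ['Adelie'] * 5 + ['Gentoo'] * 5,
    'body_mass_g': [3000.0, 3400.0, 3800.0, 4200.0, 4600.0,
                    5000.0, 5100.0, 5200.0, 5300.0, 5400.0],
})

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

# agg calls iqr once per group, passing that group's values as a Series.
mass_iqr = toy.groupby('species')['body_mass_g'].agg(iqr)
print(mass_iqr)
```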

## Other `DataFrameGroupBy` methods

### Split-apply-combine, revisited

When we introduced the split-apply-combine pattern, the "apply" step involved **aggregation** – our final DataFrame had one row for each group.

<center><img src="imgs/image_0.png" width=40%></center>

Instead of aggregating during the apply step, we could instead perform a:

- **Transformation**, in which we perform operations to every value within each group.

- **Filtration**, in which we keep only the groups that satisfy some condition.

### Transformations

- Suppose we want to convert the `'body_mass_g'` column to z-scores (i.e. standard units):

$$z(x_i) = \frac{x_i - \text{mean of } x}{\text{SD of } x}$$
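
One possible definition of a `z_score` function matching the formula above is sketched here. The choice of the population SD (`ddof=0`) is an assumption; `Series.std()` in `pandas` uses `ddof=1` by default, which gives slightly different values.

```python
import pandas as pd

def z_score(x):
    """Convert a Series to standard units.

    Assumption: uses the population SD (ddof=0), not pandas' default ddof=1.
    """
    return (x - x.mean()) / x.std(ddof=0)

# Hypothetical example values.
masses = pd.Series([3000.0, 4000.0, 5000.0])
z = z_score(masses)
print(z)
```

After standardizing, the values have mean 0 and (population) SD 1.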

In [None]:
z_score(penguins['body_mass_g'])

### Transformations within groups

- Now, what if we wanted the z-score within each group?

- To do so, we can use the `transform` method on a `DataFrameGroupBy` object. The `transform` method takes in a function, which itself takes in a Series and returns a new Series.

- A transformation produces a DataFrame or Series of the same size – it is **not** an aggregation!

In [None]:
z_mass = ...
z_mass

In [None]:
penguins.assign(z_mass=z_mass)

Note that below, penguin 340 has a larger `'body_mass_g'` than penguin 0, but a lower `'z_mass'`. 
- Penguin 0 has an above average `'body_mass_g'` among `'Adelie'` penguins.
- Penguin 340 has a below average `'body_mass_g'` among `'Gentoo'` penguins. Remember from earlier that the average `'body_mass_g'` of `'Gentoo'` penguins is much higher than for other species.
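
The same effect can be reproduced on a small hypothetical table. In this sketch, `z_score` is an assumed helper (using the population SD), and the row labels differ from those in the real dataset.

```python
import pandas as pd

def z_score(x):
    # Assumption: population SD (ddof=0).
    return (x - x.mean()) / x.std(ddof=0)

# Hypothetical stand-in for the penguins DataFrame.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo'],
    'body_mass_g': [3600.0, 3800.0, 4000.0, 4800.0, 5400.0, 6000.0],
})

# transform returns one value per original row, aligned to the original index.
z_mass = toy.groupby('species')['body_mass_g'].transform(z_score)
result = toy.assign(z_mass=z_mass)
print(result)
```

Row 3 (a `'Gentoo'`) is heavier than row 2 (an `'Adelie'`), yet has the lower z-score, because each row is compared only against its own species.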

### Filtering Groups

- To keep only the groups that satisfy a particular condition, use the `filter` method on a `DataFrameGroupBy` object.

- The `filter` method takes in a function, which itself takes in a DataFrame/Series and returns a single Boolean. The result is a new DataFrame/Series with only the groups for which the filter function returned `True`.

For example, suppose we want only the `'species'` whose average `'bill_length_mm'` is above 39.

In [None]:
...

No more `'Adelie'`s!

Or, as another example, suppose we only want `'species'` with at least 100 penguins:

In [None]:
...

No more `'Chinstrap'`s!
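
Both kinds of filter can be sketched on a small hypothetical table. The thresholds below are scaled down to suit the toy data and are not the ones used with the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame.
toy = pd.DataFrame({
    'species': ['Adelie'] * 3 + ['Chinstrap'] + ['Gentoo'] * 2,
    'bill_length_mm': [38.0, 39.0, 37.0, 49.0, 46.0, 48.0],
})

# Keep only rows of species whose mean bill length exceeds 39.
long_bills = toy.groupby('species').filter(
    lambda df: df['bill_length_mm'].mean() > 39
)

# Keep only rows of species that have at least 2 rows.
common = toy.groupby('species').filter(lambda df: df.shape[0] >= 2)

print(long_bills)
print(common)
```

Note that `filter` keeps or drops entire groups; the rows that survive are returned unaggregated.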

### Grouping with multiple columns

When we group with multiple columns, one group is created for **every unique combination** of elements in the specified columns.

In [None]:
species_and_island = ...
species_and_island

### Grouping and indexes

- The `groupby` method creates an index based on the specified columns.
- When grouping by multiple columns, the resulting DataFrame has a `MultiIndex`.
- Advice: When working with a `MultiIndex`, use `reset_index` or set `as_index=False` in `groupby`.
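
A minimal sketch of the advice above, using a hypothetical `toy` DataFrame:

```python
import pandas as pd

# Hypothetical stand-in for the penguins DataFrame.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo'],
    'island': ['Dream', 'Dream', 'Biscoe', 'Biscoe', 'Biscoe'],
    'body_mass_g': [3700.0, 3900.0, 3600.0, 5000.0, 5200.0],
})

# Grouping by two columns gives a MultiIndex of (species, island) pairs.
multi = toy.groupby(['species', 'island'])[['body_mass_g']].mean()

# Two ways to turn the group keys back into ordinary columns.
flat_a = multi.reset_index()
flat_b = toy.groupby(['species', 'island'], as_index=False)[['body_mass_g']].mean()

print(multi)
print(flat_a)
```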

In [None]:
species_and_island

## Discussion: Checking your knowledge

Find the most popular male and female baby name for each year in the dataset. Exclude years where there were fewer than 1 million births recorded.

In [None]:
baby = pd.read_csv('data/baby.csv')
baby

In [None]:
# Fill me in

## Pivot Tables: An extension of grouping

Pivot tables are a compact way to display tables for humans to read:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>Sex</th>
      <th>F</th>
      <th>M</th>
    </tr>
    <tr>
      <th>Year</th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>2018</th>
      <td>1698373</td>
      <td>1813377</td>
    </tr>
    <tr>
      <th>2019</th>
      <td>1675139</td>
      <td>1790682</td>
    </tr>
    <tr>
      <th>2020</th>
      <td>1612393</td>
      <td>1721588</td>
    </tr>
    <tr>
      <th>2021</th>
      <td>1635800</td>
      <td>1743913</td>
    </tr>
    <tr>
      <th>2022</th>
      <td>1628730</td>
      <td>1733166</td>
    </tr>
  </tbody>
</table>

- Notice that each value in the table is a sum over the counts, split by year and sex.
- You can think of pivot tables as grouping using two columns, then "pivoting" one of the group keys into columns.

### `pivot_table`

The `pivot_table` DataFrame method aggregates a DataFrame using two columns. To use it:

```py
df.pivot_table(index=index_col,
               columns=columns_col,
               values=values_col,
               aggfunc=func)
```
The resulting DataFrame will have:
- One row for every unique value in `index_col`.
- One column for every unique value in `columns_col`.
- Values determined by applying `func` on values in `values_col`.
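
To connect this back to grouping, the sketch below builds the same table two ways on a hypothetical `toy` DataFrame: once with `pivot_table`, and once by grouping on both columns and moving one key into the columns with `unstack`.

```python
import pandas as pd

# Hypothetical stand-in for the baby names data.
toy = pd.DataFrame({
    'Year': [2021, 2021, 2021, 2022, 2022, 2022],
    'Sex': ['F', 'M', 'F', 'F', 'M', 'M'],
    'Count': [10, 20, 5, 7, 8, 9],
})

pivoted = toy.pivot_table(index='Year', columns='Sex',
                          values='Count', aggfunc='sum')

# The same numbers: group by both keys, then move 'Sex' into the columns.
unstacked = toy.groupby(['Year', 'Sex'])['Count'].sum().unstack()

print(pivoted)
```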

In [None]:
last_5_years = baby.query('Year >= 2018')

In [None]:
last_5_years

In [None]:
# Look at the similarity to the snippet above
(last_5_years
#  ...
)

### Example

Find the number of penguins per island and species.

In [None]:
penguins.pivot_table(
    index=..., 
    columns=..., 
    values=...,
    aggfunc=...,
)

Note that there is a `NaN` at the intersection of `'Biscoe'` and `'Chinstrap'`, because there were no Chinstrap penguins on Biscoe Island.

We can either use the `fillna` method afterwards or the `fill_value` argument to fill in `NaN`s.
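
Both options are sketched below on a hypothetical `toy` DataFrame in which one species/island combination never occurs.

```python
import pandas as pd

# Hypothetical data: there is no Chinstrap row for Biscoe.
toy = pd.DataFrame({
    'species': ['Adelie', 'Adelie', 'Chinstrap', 'Gentoo'],
    'island': ['Biscoe', 'Dream', 'Dream', 'Biscoe'],
    'body_mass_g': [3700.0, 3900.0, 3600.0, 5000.0],
})

with_nans = toy.pivot_table(index='species', columns='island',
                            values='body_mass_g', aggfunc='count')

# Option 1: fill the missing cells after pivoting.
filled_after = with_nans.fillna(0)

# Option 2: fill them while pivoting.
filled_during = toy.pivot_table(index='species', columns='island',
                                values='body_mass_g', aggfunc='count',
                                fill_value=0)

print(filled_during)
```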

### Granularity, revisited

Take another look at the pivot table from the previous cell. Each row of the original `penguins` represented a single penguin, and each column represented features of the penguins.

What is the granularity of this table?

In [None]:
penguins.pivot_table(
    index='species', 
    columns='island', 
    values='bill_length_mm', 
    aggfunc='count',
    fill_value=0,
)

### Reshaping

- `pivot_table` reshapes DataFrames from "long" to "wide".
- Other DataFrame reshaping methods:
    - `melt`: Un-pivots a DataFrame. Very useful in data cleaning.
    - `pivot`: Like `pivot_table`, but doesn't do aggregation.
    - `stack`: Pivots multi-level columns to multi-indices.
    - `unstack`: Pivots multi-indices to columns.
    - Google and the documentation are your friends!
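
As one illustration of going the other way, `melt` can turn a "wide" table back into a "long" one. The `wide` DataFrame below is hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one column per sex.
wide = pd.DataFrame({'Year': [2021, 2022], 'F': [15, 7], 'M': [20, 17]})

# Each (Year, Sex) pair becomes its own row.
long = wide.melt(id_vars='Year', var_name='Sex', value_name='Count')
print(long)
```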

## Distributions

### Joint distribution

When using `aggfunc='count'`, a pivot table describes the **joint distribution** of two categorical variables. This is also called a **contingency table**.

In [None]:
counts = penguins.pivot_table(
    index='species', 
    columns='sex', 
    values='body_mass_g', 
    aggfunc='count', 
    fill_value=0
)
counts

We can normalize the DataFrame by dividing by the total number of penguins. The resulting numbers can be interpreted as **probabilities** that a randomly selected penguin from the dataset belongs to a given combination of species and sex.

In [None]:
joint = ...
joint

### Marginal probabilities

If we sum over one of the axes, we can compute **marginal probabilities**, i.e. unconditional probabilities.

In [None]:
joint

For instance, the second Series tells us that a randomly selected penguin has a 0.36 chance of being of species `'Gentoo'`.
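
The normalization and the two marginals can be sketched as follows. The `toy_counts` table contains made-up counts, not the real penguin counts.

```python
import pandas as pd

# Hypothetical contingency table of counts.
toy_counts = pd.DataFrame(
    {'Female': [3, 1, 2], 'Male': [2, 1, 1]},
    index=['Adelie', 'Chinstrap', 'Gentoo'],
)

# Joint distribution: divide every cell by the grand total.
toy_joint = toy_counts / toy_counts.sum().sum()

# Marginals: sum the joint distribution over one axis.
sex_marginal = toy_joint.sum(axis=0)      # collapses the rows (species)
species_marginal = toy_joint.sum(axis=1)  # collapses the columns (sex)

print(sex_marginal)
print(species_marginal)
```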

### Conditional probabilities

Using `counts`, how might we compute conditional probabilities like $$P(\text{species } = \text{"Adelie"} \mid \text{sex } = \text{"Female"})?$$

In [None]:
counts

$$\begin{align*}
P(\text{species} = c \mid \text{sex} = x) &= \frac{P(\text{species} = c \text{ and } \text{sex} = x)}{P(\text{sex = }x)} \\
&= \frac{\frac{\# \: (\text{species } = \: c \text{ and } \text{sex } = \: x)}{N}}{\frac{\# \: (\text{sex } = \: x)}{N}} \\
&= \frac{\# \: (\text{species} = c \text{ and } \text{sex} = x)}{\# \: (\text{sex} = x)}
\end{align*}$$

**Answer**: To find conditional probabilities of **species given sex**, divide by **column sums**. To find conditional probabilities of **sex given species**, divide by **row sums**.

### Conditional probabilities

To find conditional probabilities of **species given sex**, divide by **column sums**. To find conditional probabilities of **sex given species**, divide by **row sums**.

In [None]:
counts

The conditional distribution of **species given sex** is below. Note that in this new DataFrame, the `'Female'` and `'Male'` columns each sum to 1.

For instance, the above DataFrame tells us that the probability that a randomly selected penguin is of species `'Adelie'` **given** that they are of sex `'Female'` is 0.442424.
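
Dividing by column sums can be sketched like this on a hypothetical `toy_counts` table (made-up counts). Dividing a DataFrame by a Series aligns the Series' index with the DataFrame's columns, so each column is divided by its own total.

```python
import pandas as pd

# Hypothetical contingency table of counts.
toy_counts = pd.DataFrame(
    {'Female': [3, 1, 2], 'Male': [2, 1, 1]},
    index=['Adelie', 'Chinstrap', 'Gentoo'],
)

# Column totals: the number of penguins of each sex.
column_sums = toy_counts.sum(axis=0)

# Species given sex: each column is divided by its own total.
species_given_sex = toy_counts / column_sums
print(species_given_sex)
```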

**Exercise**: Try and find the conditional distribution of **sex given species**.

## Simpson's paradox

<center><img src="imgs/simpsons.png" width=50%></center>

### Example: Grades

- Two students, Lisa and Bart, just finished freshman year. They both took a different number of classes in Fall, Winter, and Spring.

- Each quarter, Lisa had a higher GPA than Bart.

- But Bart has a higher overall GPA.

- How is this possible? 🤔

Run this cell to create DataFrames that contain each student's grades.

In [None]:
lisa = pd.DataFrame([
        [20, 46],
        [18, 54],
        [5, 20],
    ],
    columns=['Units', 'Grade Points Earned'], 
    index=['Fall', 'Winter', 'Spring'],
)

bart = pd.DataFrame([
        [5, 10],
        [5, 13.5],
        [22, 81.4],
    ],
    columns=['Units', 'Grade Points Earned'], 
    index=['Fall', 'Winter', 'Spring'],
)

### Quarter-specific vs. overall GPAs

**Note:** The number of "grade points" earned for a course is

$$\text{number of units} \cdot \text{grade (out of 4)}$$

For instance, an A- in a 4 unit course earns $3.7 \cdot 4 = 14.8$ grade points.

In [None]:
lisa

In [None]:
bart

Lisa had a higher GPA in all three quarters:

In [None]:
quarterly_gpas = pd.DataFrame({
    "Lisa's Quarter GPA": lisa['Grade Points Earned'] / lisa['Units'],
    "Bart's Quarter GPA": bart['Grade Points Earned'] / bart['Units'],
})

quarterly_gpas

But Lisa's overall GPA was less than Bart's overall GPA:

In [None]:
tot = lisa.sum()
tot['Grade Points Earned'] / tot['Units']

In [None]:
tot = bart.sum()
tot['Grade Points Earned'] / tot['Units']

### What happened?

In [None]:
(quarterly_gpas
 .assign(Lisa_units=lisa['Units'],
         Bart_units=bart['Units']) 
 .iloc[:, [0, 2, 1, 3]]
)

- When Lisa and Bart both performed poorly, Lisa took more units than Bart. **This brought down 📉 Lisa's overall average.**

- When Lisa and Bart both performed well, Bart took more units than Lisa. **This brought up 📈 Bart's overall average.**

### Simpson's paradox

- Simpson's paradox occurs when **grouped data and ungrouped data show opposing trends**.
    - It is named after Edward H. Simpson, not Lisa or Bart Simpson.

- It is **purely arithmetic** – it is a consequence of weighted averages.

- It often happens because there is a hidden factor (i.e. a **confounder**) within the data that influences results.

- **Question:** What is the "correct" way to summarize your data? What if you had to act on these results?
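
To see the weighted-average arithmetic directly, the sketch below rebuilds each overall GPA as a sum of quarterly GPAs weighted by the share of units taken that quarter. The numbers are Lisa's and Bart's from above; the variable names (`units`, `points`, `weights`, `overall`) are introduced here for illustration.

```python
import pandas as pd

quarters = ['Fall', 'Winter', 'Spring']
units = pd.DataFrame({'Lisa': [20, 18, 5], 'Bart': [5, 5, 22]}, index=quarters)
points = pd.DataFrame({'Lisa': [46, 54, 20], 'Bart': [10, 13.5, 81.4]},
                      index=quarters)

quarter_gpa = points / units

# Share of each student's total units taken in each quarter.
weights = units / units.sum()

# Overall GPA = sum over quarters of (weight * quarter GPA).
overall = (weights * quarter_gpa).sum()

print(weights)
print(overall)
```

Lisa's weights are concentrated in her two weakest quarters, while most of Bart's weight falls on his strongest quarter.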

### Example: How Berkeley was _almost_ sued for gender discrimination (1973)

What do you notice?

<center><img src='imgs/berkeley.png' width=70%></center>

In [None]:
show_paradox_slides()

### What happened?

- The overall acceptance rate for women (30%) was lower than it was for men (45%).

- However, most departments (A, B, D, F) had a higher acceptance rate for women.


- Department A had a 62% acceptance rate for men and an 82% acceptance rate for women!
    - 31% of men applied to Department A.
    - 6% of women applied to Department A.

- Department F had a 6% acceptance rate for men and a 7% acceptance rate for women!
    - 14% of men applied to Department F.
    - 19% of women applied to Department F.

- **Conclusion:** Women tended to apply to departments with a lower acceptance rate; the data don't support the hypothesis that there was major gender discrimination against women.
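
The same weighted-average effect shows up using only the Department A and F figures quoted above. This sketch restricts attention to those two departments, so the combined rates it prints are not the campus-wide 45% and 30%; the other departments' figures are not reproduced here.

```python
import pandas as pd

# Acceptance rates and application shares quoted above (Departments A and F only).
rates = pd.DataFrame({'Men': [0.62, 0.06], 'Women': [0.82, 0.07]},
                     index=['A', 'F'])
shares = pd.DataFrame({'Men': [0.31, 0.14], 'Women': [0.06, 0.19]},
                      index=['A', 'F'])

# Re-normalize the shares so they sum to 1 within these two departments.
weights = shares / shares.sum()

# Combined acceptance rate = weighted average of the department rates.
combined = (weights * rates).sum()
print(combined)
```

Women have the higher rate in both departments, but their applications are weighted toward the far more selective Department F, so their combined rate across the two is lower.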

### Caution!

This doesn't mean that admissions are free from gender discrimination! 

From [Moss-Racusin et al., 2012, PNAS](https://www.pnas.org/doi/10.1073/pnas.1211286109) (cited 2600+ times):

> In a randomized double-blind study (n = 127), **science faculty** from research-intensive universities **rated the application materials of a student—who was randomly assigned either a male or female** name—for a laboratory manager position. Faculty **participants rated the male applicant as significantly more competent and hireable than the (identical) female applicant**. These participants also selected a higher starting salary and offered more career mentoring to the male applicant. The gender of the faculty participants did not affect responses, such that female and male faculty were equally likely to exhibit bias against the female student.

### But then...

From [Williams and Ceci, 2015, PNAS](https://www.pnas.org/doi/10.1073/pnas.1418878112):

> Here we report five hiring experiments in which faculty evaluated hypothetical female and male applicants, using systematically varied profiles disguising identical scholarship, for assistant professorships in biology, engineering, economics, and psychology. Contrary to prevailing assumptions, **men and women faculty members from all four fields preferred female applicants 2:1 over identically qualified males** with matching lifestyles (single, married, divorced), with the exception of male economists, who showed no gender preference.

### Do these conflict?

Not necessarily. One explanation, from Williams and Ceci:

> Instead, past studies have used ratings of students’ hirability for a range of posts that do not include tenure-track jobs, such as managing laboratories or performing math assignments for a company. However, hiring tenure-track faculty differs from hiring lower-level staff: it entails selecting among highly accomplished candidates, all of whom have completed Ph.D.s and amassed publications and strong letters of support. **Hiring bias may occur when applicants’ records are ambiguous, as was true in studies of hiring bias for lower-level staff posts, but such bias may not occur when records are clearly strong**, as is the case with tenure-track hiring.

### Do these conflict?

From Witteman et al., 2019, in *The Lancet*:

> Thus, evidence of scientists favouring women comes exclusively from hypothetical scenarios, whereas evidence of scientists favouring men comes from hypothetical scenarios and real behaviour. This **might reflect academics' growing awareness of the social desirability of achieving gender balance, while real academic behaviour might not yet put such ideals into action**.

### Example: Restaurant reviews and phone types

* You are deciding whether to eat at Dirty Birds or The Loft.

* Suppose Yelp shows ratings aggregated by phone type (Android vs. iPhone).

|Phone Type|Stars for Dirty Birds|Stars for The Loft|
|---|---|---|
|Android|4.24|4.0|
|iPhone|2.99|2.79|
|**All**|**3.32**|**3.37**|


* **Question**: Should you choose Dirty Birds or The Loft? 


* **Answer**: The type of phone you use likely has nothing to do with your taste in food – pick the restaurant that is rated higher overall.

* Remember, Simpson's paradox is merely a property of weighted averages!

### Takeaways

Be skeptical of...

- Aggregate statistics.
- People misusing statistics to "prove" that discrimination doesn't exist.
- Drawing conclusions from individual publications ($p$-hacking, publication bias, narrow focus, etc.).
- Everything!

**We need to apply domain knowledge and human judgement calls to decide what to do when Simpson's paradox is present.**

### Really?

To handle Simpson's paradox with rigor, we need some ideas from causal inference, which we don't have time to cover in DSC 80. If you're curious, this video has a good example of how to approach Simpson's paradox using a minimal amount of causal inference (not required for DSC 80).

In [None]:
IFrame('https://www.youtube-nocookie.com/embed/zeuW1Z2EtLs?si=l2Dl7P-5RCq3ODpo',
       width=560, height=315)

### Further reading

- [Gender Bias in Admission Statistics?](https://www.cantorsparadise.com/gender-bias-in-admission-statistics-eaabca650810)
    - Contains a **great** visualization, but seems to be paywalled now.
- [What is Simpson's Paradox?](https://statisticsbyjim.com/basics/simpsons-paradox/) 

## Summary, next time

- Grouping allows us to change the level of granularity in a DataFrame.
- Grouping involves three steps – split, apply, and combine. The apply step can be an aggregation, a transformation, or a filtration.
- `pivot_table` aggregates data based on two categorical columns, and reshapes the result to be "wide" instead of "long".
- Simpson's paradox occurs when grouped data and ungrouped data show opposing trends.
    - It is a consequence of arithmetic.
- Next time: Data cleaning! 🧼