
# UC Berkeley Admissions (1973) — Simpson's Paradox (Simple)

This notebook uses **basic pandas + Matplotlib** to show the classic Simpson's paradox **flip**:
- **Pooled** across departments, men have a higher admission rate than women.
- **Within each department**, women have the higher rate in most departments.
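
Why such a flip is possible can be sketched with a little algebra. The notation here is introduced only for illustration and does not appear elsewhere in the notebook: $a_{g,d}$ is the number of admitted applicants of gender $g$ in department $d$, and $n_{g,d}$ is the number of applicants.

$$
R_g \;=\; \frac{\sum_d a_{g,d}}{\sum_d n_{g,d}} \;=\; \sum_d w_{g,d}\, r_{g,d},
\qquad
w_{g,d} = \frac{n_{g,d}}{\sum_{d'} n_{g,d'}},
\quad
r_{g,d} = \frac{a_{g,d}}{n_{g,d}}
$$

The pooled rate $R_g$ is a weighted average of the department rates $r_{g,d}$, with weights $w_{g,d}$ equal to the share of that gender's applications sent to each department. Because the weights differ by gender, $R_F < R_M$ can hold even when $r_{F,d} > r_{M,d}$ in most or all departments.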


In [29]:

import os
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 110


## 1) Load data (from local `/content/berkeley.csv`)

In [None]:

path = "/content/berkeley.csv"

if not os.path.exists(path):
    # Fail early with a clear message instead of a NameError on `agg` later
    raise FileNotFoundError(f"Expected the admissions CSV at {path}")

df = pd.read_csv(path)
# Standardize column names if the alternative names are present
df = df.rename(columns={'Major': 'Dept', 'Sex': 'Gender', 'Admission': 'Admit'})
if 'Year' in df.columns:
    df = df[df['Year'] == 1973]
if 'Dept' in df.columns:
    # Keep only departments A-F
    df = df[df['Dept'].isin(list('ABCDEF'))]

# One row per applicant, so counting rows gives the frequency of each combination
agg = df.groupby(['Admit', 'Gender', 'Dept']).size().reset_index(name='Freq')

agg.head()
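
The aggregation step turns one row per applicant into one count per (Admit, Gender, Dept) combination. The following is a minimal sketch of that idea on a handful of made-up records (the `rows` frame is hypothetical, not the Berkeley data), using `groupby(...).size()` to count rows per group:

```python
import pandas as pd

# Hypothetical applicant-level records, made up for illustration only
rows = pd.DataFrame({
    'Dept':   ['A', 'A', 'A', 'B', 'B'],
    'Gender': ['F', 'M', 'M', 'F', 'F'],
    'Admit':  ['Accepted', 'Accepted', 'Rejected', 'Rejected', 'Rejected'],
})

# Count the rows that fall into each (Admit, Gender, Dept) combination
counts = rows.groupby(['Admit', 'Gender', 'Dept']).size().reset_index(name='Freq')
print(counts)
```

Combinations that never occur in the data (for example, an accepted woman in department B here) do not appear in `counts` at all, which is why the later `unstack` calls pass `fill_value=0`.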

## 2) Overall admission rate by gender (pooled)

In [31]:

overall = agg.groupby(['Gender','Admit'])['Freq'].sum().unstack(fill_value=0)
overall['Rate'] = overall['Accepted'] / (overall['Accepted'] + overall['Rejected'])
print(overall[['Accepted','Rejected','Rate']])

Admit   Accepted  Rejected      Rate
Gender                              
F            557      1278  0.303542
M           1511      1493  0.502996


In [None]:

# Simple bar chart (overall)
plt.figure(figsize=(5,3.5))
plt.bar(overall.index, overall['Rate'])
for i, v in enumerate(overall['Rate']):
    plt.text(i, v, f"{v*100:.1f}%", ha='center', va='bottom')
plt.ylim(0,1)
plt.ylabel("Admission rate")
plt.title("Overall admission rate by gender (pooled)")
plt.show()

## 3) Within-department admission rates (stratified)

In [33]:

dept = agg.groupby(['Dept','Gender','Admit'])['Freq'].sum().unstack(fill_value=0).reset_index()
dept['Rate'] = dept['Accepted'] / (dept['Accepted'] + dept['Rejected'])
wide = dept.pivot(index='Dept', columns='Gender', values='Rate').sort_index()
print(wide)

Gender         F         M
Dept                      
A       0.824074  0.724956
B       0.680000  0.630357
C       0.338954  0.369231
D       0.349333  0.330935
E       0.239186  0.277487
F       0.073314  0.058981


In [None]:

# Grouped bars per department (Female vs Male)
x = range(len(wide.index))
w = 0.35

plt.figure(figsize=(7,4))
plt.bar([i - w/2 for i in x], wide['F'].values, width=w, label='Female')
plt.bar([i + w/2 for i in x], wide['M'].values,   width=w, label='Male')
plt.xticks(x, wide.index)
plt.ylim(0,1)
plt.ylabel("Admission rate")
plt.title("Within each department")
plt.legend()
plt.tight_layout()
plt.show()

## 4) Show the flip

In [None]:

print("OVERALL:", f"Male={overall.loc['M','Rate']*100:.1f}%",
                 f"Female={overall.loc['F','Rate']*100:.1f}%")
wins = (wide['F'] > wide['M']).sum()
total = len(wide)
print(f"WITHIN DEPARTMENTS: Female higher in {wins}/{total} departments.")


### Conclusion
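- Pooled across departments, **men** have the higher admission rate (50.3% vs. 30.4%).
- Within departments, **women** have the higher rate in 4 of the 6 departments (A, B, D, and F).
- The flip happens because each gender applied to a different **mix of departments**: women applied disproportionately to competitive departments with low admission rates, which pulls their pooled rate down.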
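
The department-mix explanation can be checked on a toy example. This sketch uses two hypothetical departments with made-up counts (`Easy` and `Hard` are illustrative names, not Berkeley departments) and computes each gender's pooled rate as the sum of application share times department rate:

```python
import pandas as pd

# Hypothetical counts, chosen only to illustrate the mechanism (not the Berkeley data)
toy = pd.DataFrame({
    'Dept':     ['Easy', 'Easy', 'Hard', 'Hard'],
    'Gender':   ['F', 'M', 'F', 'M'],
    'Accepted': [18, 80, 24, 5],
    'Rejected': [2, 20, 56, 15],
})
toy['Applied'] = toy['Accepted'] + toy['Rejected']
toy['Rate'] = toy['Accepted'] / toy['Applied']

# Share of each gender's applications that went to each department
toy['Share'] = toy['Applied'] / toy.groupby('Gender')['Applied'].transform('sum')

# Pooled rate per gender = sum over departments of (share * department rate)
pooled = (toy['Share'] * toy['Rate']).groupby(toy['Gender']).sum()

# Department-level rates side by side, one column per gender
within = toy.pivot(index='Dept', columns='Gender', values='Rate')
print(within)
print(pooled)
```

In this toy data women have the higher rate in both departments, yet their pooled rate is lower, because 80% of their applications go to the department with the low admission rate while most of the men's go to the other one.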
