# SAT Outlier Detection & Fairness Impact

This notebook simulates a college admissions / donor prediction scenario.
We will:
1. Generate a synthetic dataset with SAT scores, family income, and a donor label.
2. Detect SAT score outliers.
3. Train a simple predictive model using all data.
4. Compare this to a model trained after removing SAT outliers.
5. Evaluate basic **fairness impacts** for an advantaged vs. disadvantaged group.

The goal is not to build a perfect model, but to *illustrate* how handling SAT outliers can
affect both performance and fairness.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

np.random.seed(42)
pd.set_option('display.max_columns', None)

## 1. Simulate a Synthetic Dataset

We simulate applicants from two groups:
- `group = 0`: Disadvantaged (on average lower income, lower SAT)
- `group = 1`: Advantaged (on average higher income, higher SAT)

We then define a `future_donor` label that depends on SAT, income, and group.
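
For reference, the label model used in the simulation code can be written out as a single formula. The coefficients below are read directly from that code; the symbols $s$, $i$, and $g$ are shorthand introduced here for the scaled SAT score, scaled log-income, and group indicator.

$$
P(\text{future\_donor} = 1) = \sigma\left(-3.0 + 1.2\,s + 2.5\,i + 0.5\,g\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
s = \frac{\text{SAT} - 1000}{400}, \qquad i = \frac{\log(1 + \text{income})}{13}
$$

Because $g$ enters the logit directly, the advantaged group has a higher donor probability even at identical SAT and income.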

In [None]:
n = 5000

# Group label: 0 = disadvantaged, 1 = advantaged
group = np.random.binomial(1, 0.5, size=n)

# Income: log-normal-ish, higher for advantaged group
base_income = np.random.lognormal(mean=10, sigma=0.5, size=n)  # median ~22k (exp(10))
income_multiplier = np.where(group == 1, 3.0, 1.0)  # advantaged ~3x richer
family_income = base_income * income_multiplier

# SAT: normal distributions with different means by group
sat_mean_disadv = 1050
sat_mean_adv = 1300

sat_score = np.where(
    group == 1,
    np.random.normal(loc=sat_mean_adv, scale=80, size=n),
    np.random.normal(loc=sat_mean_disadv, scale=90, size=n)
)

# Clip SAT to valid range 400–1600
sat_score = np.clip(sat_score, 400, 1600)

# Introduce a few extreme outliers (e.g., data issues or anomalies)
num_outliers = 20
outlier_indices = np.random.choice(n, size=num_outliers, replace=False)
sat_score[outlier_indices[:10]] = 200   # unrealistically low
sat_score[outlier_indices[10:]] = 1800  # unrealistically high

# Donor label: logistic function of SAT, income, and group
sat_scaled = (sat_score - 1000) / 400.0
income_scaled = np.log1p(family_income) / 13.0

logit = -3.0 + 1.2 * sat_scaled + 2.5 * income_scaled + 0.5 * group
prob_donor = 1 / (1 + np.exp(-logit))
future_donor = np.random.binomial(1, prob_donor)

df = pd.DataFrame({
    'sat_score': sat_score,
    'family_income': family_income,
    'group': group,
    'future_donor': future_donor
})

df.head()

## 2. Explore SAT Distribution and Detect Outliers

We treat SAT outliers in two ways here:
1. **Rule-based bounds**: SAT should be between 400 and 1600. Values outside are almost certainly errors.
2. **Statistical bounds**: using the IQR rule to flag extreme values, even if within the valid range.
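
As a rough illustration of how the two rules differ, the sketch below applies both to a tiny hand-made array. The helper name `iqr_bounds` and its `k` parameter are hypothetical names introduced for this example, and the scores are made up.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy scores: eight plausible values plus two impossible ones (200 and 1800).
scores = np.array([200, 980, 1010, 1050, 1100, 1150, 1200, 1250, 1300, 1800])

# Rule-based flag: anything outside the valid SAT range.
out_of_range = (scores < 400) | (scores > 1600)

# Statistical flag: anything outside the IQR fences.
lower, upper = iqr_bounds(scores)
iqr_outlier = (scores < lower) | (scores > upper)

print('Out of range:', scores[out_of_range])
print('IQR outliers:', scores[iqr_outlier], 'fences:', (lower, upper))
```

On this toy array the two rules agree, but on real data the IQR fences depend on the spread of the sample, so they can be tighter or looser than the fixed 400 to 1600 range.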

In [None]:
# Histogram of SAT scores
plt.figure(figsize=(6,4))
plt.hist(df['sat_score'], bins=40)
plt.xlabel('SAT score')
plt.ylabel('Count')
plt.title('Distribution of SAT Scores (with simulated outliers)')
plt.show()

# Simple rule-based outlier flags
df['sat_out_of_range'] = (df['sat_score'] < 400) | (df['sat_score'] > 1600)
print('Number of SAT scores outside [400, 1600]:', df['sat_out_of_range'].sum())

# IQR-based outlier detection (can also flag extreme values inside the valid range)
q1 = df['sat_score'].quantile(0.25)
q3 = df['sat_score'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

df['sat_iqr_outlier'] = (df['sat_score'] < lower_bound) | (df['sat_score'] > upper_bound)
print('Number of IQR-based SAT outliers:', df['sat_iqr_outlier'].sum())
print('Lower bound:', lower_bound, 'Upper bound:', upper_bound)

df[['sat_score', 'sat_out_of_range', 'sat_iqr_outlier']].head(10)

## 3. Baseline Model Using All Data

We now train a simple logistic regression model to predict `future_donor` using:
- SAT score (raw; the code below does not standardize it)
- log-transformed family income
- group (0/1)

We then compute overall performance and simple group fairness metrics:
- Accuracy overall
- Positive prediction rate by group
- True positive rate (TPR) by group
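
For a group $g$, with predicted label $\hat{y}$ and true label $y$, the two group metrics are the standard definitions:

$$
\text{PPR}_g = P(\hat{y} = 1 \mid \text{group} = g),
\qquad
\text{TPR}_g = P(\hat{y} = 1 \mid y = 1,\ \text{group} = g) = \frac{TP_g}{TP_g + FN_g}
$$

A gap in PPR between groups is a demographic parity gap; a gap in TPR is often called an equal opportunity gap.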

In [None]:
df['log_income'] = np.log1p(df['family_income'])

features = ['sat_score', 'log_income', 'group']
X = df[features]
y = df['future_donor']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

model_all = LogisticRegression(max_iter=1000)
model_all.fit(X_train, y_train)

y_pred_all = model_all.predict(X_test)
y_prob_all = model_all.predict_proba(X_test)[:, 1]

acc_all = accuracy_score(y_test, y_pred_all)
cm_all = confusion_matrix(y_test, y_pred_all)
acc_all, cm_all

### 3.1. Basic Fairness Metrics by Group

We define a helper function that computes, for each group (0 = disadvantaged, 1 = advantaged):
- Positive prediction rate (PPR)
- True positive rate (TPR)

In [None]:
def group_metrics(y_true, y_pred, group_test, group_value):
    mask = group_test == group_value
    y_t = y_true[mask]
    y_p = y_pred[mask]
    
    if len(y_t) == 0:
        return {'ppr': np.nan, 'tpr': np.nan}
    
    # Positive prediction rate
    ppr = (y_p == 1).mean()
    
    # True positive rate
    tp = np.sum((y_t == 1) & (y_p == 1))
    fn = np.sum((y_t == 1) & (y_p == 0))
    tpr = tp / (tp + fn) if (tp + fn) > 0 else np.nan
    
    return {'ppr': ppr, 'tpr': tpr}

group_test = X_test['group'].values

metrics_all_g0 = group_metrics(y_test.values, y_pred_all, group_test, 0)
metrics_all_g1 = group_metrics(y_test.values, y_pred_all, group_test, 1)

print('Overall accuracy (all data):', acc_all)
print('\nGroup 0 (disadvantaged):', metrics_all_g0)
print('Group 1 (advantaged):   ', metrics_all_g1)

## 4. Model After Removing SAT Outliers

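Now we **remove SAT outliers** (using IQR-based detection) and retrain the model.
We then compare performance and fairness metrics to the baseline.
Because the train/test split is redone on the filtered data, the two models are scored on
different test sets, so small differences between them should be read with care.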
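
One possible way to make the comparison like-for-like, which is not what this notebook does, is to split once and filter outliers from the training rows only, so both models are scored on the same test rows. The sketch below shows just the index handling on toy data; the variable names (`train_idx_clean` and so on) and the toy values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SAT column with two injected, impossible values (illustrative only).
sat = rng.normal(1150, 100, size=200).clip(400, 1600)
sat[[3, 150]] = [200.0, 1800.0]

# 1) Split first, so the test rows are fixed before any filtering.
idx = rng.permutation(len(sat))
test_idx, train_idx = idx[:60], idx[60:]

# 2) Compute IQR fences on the training rows only.
q1, q3 = np.percentile(sat[train_idx], [25, 75])
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)

# 3) Drop outliers from the training rows; leave the test rows untouched.
keep = (sat[train_idx] >= lower) & (sat[train_idx] <= upper)
train_idx_clean = train_idx[keep]

print('Train rows before/after filtering:', len(train_idx), len(train_idx_clean))
print('Test rows (unchanged):', len(test_idx))
```

With this layout, a baseline model would be fit on `train_idx`, the filtered model on `train_idx_clean`, and both evaluated on `test_idx`.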

In [None]:
# Filter out IQR-based SAT outliers
df_no_outliers = df[~df['sat_iqr_outlier']].copy()
df_no_outliers['log_income'] = np.log1p(df_no_outliers['family_income'])

X2 = df_no_outliers[features]
y2 = df_no_outliers['future_donor']

X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.3, stratify=y2, random_state=42
)

model_no = LogisticRegression(max_iter=1000)
model_no.fit(X2_train, y2_train)

y2_pred = model_no.predict(X2_test)
y2_prob = model_no.predict_proba(X2_test)[:, 1]

acc_no = accuracy_score(y2_test, y2_pred)
cm_no = confusion_matrix(y2_test, y2_pred)
acc_no, cm_no

In [None]:
# Fairness metrics after removing outliers
group2_test = X2_test['group'].values

metrics_no_g0 = group_metrics(y2_test.values, y2_pred, group2_test, 0)
metrics_no_g1 = group_metrics(y2_test.values, y2_pred, group2_test, 1)

print('Overall accuracy (no SAT outliers):', acc_no)
print('\nGroup 0 (disadvantaged):', metrics_no_g0)
print('Group 1 (advantaged):   ', metrics_no_g1)

## 5. Comparing Performance and Fairness

We now compare the **baseline model** (all data) versus the **no-outlier model**:
- Changes in overall accuracy
- Changes in PPR (Positive Prediction Rate) by group
- Changes in TPR (True Positive Rate) by group

In a real institutional setting, we would discuss whether removing outliers makes the model more or less fair,
and whether it disproportionately helps or harms a particular group.
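
To make "disparity" concrete, one simple option is to compute the advantaged-minus-disadvantaged gap in each metric per model. The sketch below assumes a table with the same column names as the summary table in this section; the helper `add_gaps` is a hypothetical name, and the numbers are placeholders, not results from this notebook.

```python
import pandas as pd

def add_gaps(summary):
    """Append advantaged-minus-disadvantaged gaps in PPR and TPR."""
    out = summary.copy()
    out['ppr_gap'] = out['g1_ppr'] - out['g0_ppr']
    out['tpr_gap'] = out['g1_tpr'] - out['g0_tpr']
    return out

# Placeholder numbers, NOT results from this notebook.
toy = pd.DataFrame({
    'model': ['all_data', 'no_outliers'],
    'g0_ppr': [0.40, 0.45], 'g1_ppr': [0.80, 0.75],
    'g0_tpr': [0.50, 0.50], 'g1_tpr': [0.90, 0.75],
})

gaps = add_gaps(toy)
print(gaps[['model', 'ppr_gap', 'tpr_gap']])
```

A gap closer to zero means the two groups are treated more similarly on that metric; it says nothing on its own about whether either rate is appropriate.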

In [None]:
summary = pd.DataFrame({
    'model': ['all_data', 'no_outliers'],
    'accuracy': [acc_all, acc_no],
    'g0_ppr': [metrics_all_g0['ppr'], metrics_no_g0['ppr']],
    'g0_tpr': [metrics_all_g0['tpr'], metrics_no_g0['tpr']],
    'g1_ppr': [metrics_all_g1['ppr'], metrics_no_g1['ppr']],
    'g1_tpr': [metrics_all_g1['tpr'], metrics_no_g1['tpr']]
})

summary

## 6. Interpretation

Use the table above to discuss:

1. **Performance impact**: Did removing SAT outliers noticeably change accuracy?
2. **Fairness impact**: How did PPR and TPR change for group 0 vs. group 1?
   - If the disparity between groups shrinks, removing outliers may have improved fairness.
   - If the disparity grows, removing outliers may have unintentionally harmed fairness.

3. **Ethical implications**:
   - Are SAT outliers mostly coming from advantaged or disadvantaged groups in our simulation?
   - Would removing them erase legitimate high (or low) performance from certain students?
   - In a real dataset, outliers could represent errors, rare but real talent, or structural inequality.
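
The first question above can be checked with a per-group count of flagged rows. The sketch below uses a toy frame with the same column names as `df` (`group`, `sat_iqr_outlier`); the values are made up for illustration.

```python
import pandas as pd

# Toy frame with the same column names as `df`; values are made up.
toy = pd.DataFrame({
    'group': [0, 0, 0, 1, 1, 1, 1, 0],
    'sat_iqr_outlier': [True, False, False, False, True, True, False, False],
})

# Count ('sum') and share ('mean') of flagged rows within each group.
by_group = toy.groupby('group')['sat_iqr_outlier'].agg(['sum', 'mean'])
print(by_group)
```

If the share of flagged rows differs a lot between groups, removing outliers changes the group composition of the training data, which is worth reporting alongside the model metrics.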

This notebook is a starting point: in practice you would add more robust fairness metrics
(e.g., equalized odds, which adds false positive rates to the TPR comparison, and demographic
parity, which formalizes the PPR gap), multiple models, and sensitivity analyses
before making any policy decisions about how to handle SAT outliers.
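
Equalized odds compares false positive rates across groups in addition to TPR. As a minimal sketch of that extra piece, the hypothetical helper `group_fpr` below mirrors the style of `group_metrics`; the arrays are toy values, not notebook output.

```python
import numpy as np

def group_fpr(y_true, y_pred, group, group_value):
    """False positive rate within one group; NaN if the group has no negatives."""
    mask = (group == group_value) & (y_true == 0)
    return float((y_pred[mask] == 1).mean()) if mask.any() else float('nan')

# Toy labels, predictions, and group membership (illustrative only).
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 1, 0])
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])

fpr_g0 = group_fpr(y_true, y_pred, group, 0)
fpr_g1 = group_fpr(y_true, y_pred, group, 1)
print('FPR group 0:', fpr_g0, 'FPR group 1:', fpr_g1)
```

Equalized odds would hold, approximately, when both the TPR gap and the FPR gap between groups are close to zero.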