# Welcome to the Bias-Athon 2025!

This notebook introduces the learning objectives, prepares the datasets, and provides an overview of the workshop.


### Learning Objectives:
1. Explore how **data biases** (e.g., measurement errors or missingness) impact downstream tasks. 
2. Understand and simulate **concept drift**: altering the relationship between features and the target variable.
3. Simulate **prior probability drift**: changing the incidence rate of the target variable.

By the end of this notebook, you will:
- Create and save datasets with introduced biases (e.g., SpO2 and lactate modifications).
- Generate drifted datasets to simulate real-world challenges in data analysis.
- Split the data into train and test sets for further analysis.



# Schedule (2 Hours)

TBD

## Materials

- **WiDS dataset** - Download the dataset ("training_v2.csv") [here](https://www.kaggle.com/competitions/widsdatathon2020/data).

- **Data Dictionary** - Refer to the provided documentation for variable definitions.

- **Bias-Athon GitHub Repository** - Clone the repository for all notebooks and datasets.


## Dataset Preparation

# Step 1: Import Libraries

In [None]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

# Step 2: Load Data

In [None]:

data = pd.read_csv("data/training_v2.csv")
with pd.option_context('display.max_rows', 5, 'display.max_columns', None):
    display(data.head())

# Step 3: SpO2 Modifications (Bias 1)

## Baseline data distributions

In [None]:
data['d1_spo2_min'].isna().mean()

In [None]:
data['d1_spo2_min'].describe()

In [None]:
data['d1_spo2_max'].describe()

## Adding Bias to Black Patients' SpO2

In [None]:
# Increase SpO2 of Black patients by 10 percentage points, capped at 100
print("Adding bias to SpO2 for Black patients...")
delta_to_add = 10

is_black = data['ethnicity'] == 'African American'
# Shift and cap at 100; missing values stay missing
shifted = (data['d1_spo2_min'] + delta_to_add).clip(upper=100)
# Use the shifted value for Black patients and the original value for everyone else
data['d1_spo2_min_new'] = data['d1_spo2_min'].where(~is_black, shifted)


## Compare the SpO2 Modifications

In [None]:
# Compare distributions before and after bias
print("Before modification:")
print(data.loc[data.ethnicity == 'African American','d1_spo2_min'].describe())
print("After modification:")
print(data.loc[data.ethnicity == 'African American','d1_spo2_min_new'].describe())
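
To make the effect of the capped shift concrete, here is a minimal sketch on a made-up five-row frame. The `toy` frame and its values are illustrative assumptions, not WiDS data. It applies a +10 shift with a ceiling of 100 to the target group and shows that other rows and missing values are left alone.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the WiDS dataset)
toy = pd.DataFrame({
    "ethnicity": ["African American", "African American", "Caucasian",
                  "African American", "Asian"],
    "d1_spo2_min": [85.0, 95.0, 85.0, np.nan, 99.0],
})

delta_to_add = 10
is_target = toy["ethnicity"] == "African American"

# Shift and cap at 100, then use the shifted value only for the target group
shifted = (toy["d1_spo2_min"] + delta_to_add).clip(upper=100)
toy["d1_spo2_min_new"] = toy["d1_spo2_min"].where(~is_target, shifted)

print(toy)
```

In this toy case the second row would reach 105 and is held at 100, which is why the maximum of the modified column can never exceed the physiological ceiling.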


# Step 4: Lactate Modifications (Bias 2)

## Drop all lactate values for Black patients

In [None]:

print("Dropping lactate values for Black patients...")
# Set lactate to NaN for Black patients; keep it unchanged for everyone else
data['d1_lactate_max_new'] = data['d1_lactate_max'].mask(
    data['ethnicity'] == 'African American'
)

## Check new missingness

In [None]:
print("New missingness for lactate:")
print(data.loc[data.ethnicity == 'African American', 'd1_lactate_max_new'].isna().mean())
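
A single missingness number for one group hides the comparison that matters here: how the group differs from everyone else, before and after the change. The sketch below computes that table on a small made-up frame (the `toy` values are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the WiDS dataset)
toy = pd.DataFrame({
    "ethnicity": ["African American", "African American", "Caucasian", "Caucasian"],
    "d1_lactate_max": [2.1, np.nan, 1.4, np.nan],
})

# Blank out lactate for the target group only
toy["d1_lactate_max_new"] = toy["d1_lactate_max"].mask(
    toy["ethnicity"] == "African American"
)

# Fraction of missing values per group, before and after the modification
missing_by_group = (
    toy[["d1_lactate_max", "d1_lactate_max_new"]]
    .isna()
    .groupby(toy["ethnicity"])
    .mean()
)
print(missing_by_group)
```

The same pattern should carry over to the full `data` frame, where the untouched groups keep whatever missingness they started with.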

# Step 5: Introduce Concept Drift in Target Variable

## Concept Drift
**Goal:** Alter the relationship between SpO2 and hospital_death for all patients by relabeling every patient whose minimum day-1 SpO2 is below 92 as a death.

In [None]:
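print("Introducing Concept Drift...")
# Relabel patients with minimum SpO2 below 92 as deaths.
# A missing SpO2 compares as False, so those patients keep their original label.
low_spo2 = data['d1_spo2_min'] < 92
data['hospital_death_concept_drift'] = data['hospital_death'].where(~low_spo2, 1)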
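
One way to confirm that this is concept drift, rather than a change in the features, is to look at the death rate conditional on low SpO2 before and after relabeling. The following is a minimal sketch on made-up values (the `toy` frame is an assumption for illustration): the SpO2 column is untouched, while the conditional rate moves to 1.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not from the WiDS dataset)
toy = pd.DataFrame({
    "d1_spo2_min": [88.0, 90.0, 91.0, 95.0, 97.0, np.nan],
    "hospital_death": [0, 1, 0, 0, 1, 0],
})

# NaN < 92 evaluates to False, so the row with missing SpO2 keeps its label
low_spo2 = toy["d1_spo2_min"] < 92
toy["hospital_death_concept_drift"] = toy["hospital_death"].where(~low_spo2, 1)

# Death rate among low-SpO2 patients, before and after the relabeling
rate_before = toy.loc[low_spo2, "hospital_death"].mean()
rate_after = toy.loc[low_spo2, "hospital_death_concept_drift"].mean()
print(f"P(death | SpO2 < 92): before={rate_before:.2f}, after={rate_after:.2f}")
```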



## Prior Probability Drift
**Goal:** Lower the incidence of hospital_death among African American patients by relabeling a random half (approximately) of their deaths as survivals.

In [None]:
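print("Introducing Prior Probability Drift...")
# Fixed seed so the relabeling is reproducible (42 mirrors the split's random_state)
rng = np.random.default_rng(42)
# Relabel a random ~50% of deaths among African American patients as survivals
flip = (
    (data['hospital_death'] == 1)
    & (data['ethnicity'] == 'African American')
    & (rng.random(len(data)) < 0.5)
)
data['hospital_death_prior_drift'] = data['hospital_death'].mask(flip, 0)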
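
Because the flip is random, the drop in incidence is only approximately one half, and it should appear in the target group alone. The sketch below illustrates this on simulated labels; the group names, the 20% base rate, the sample size, and the seed are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Simulated data (hypothetical, not from the WiDS dataset)
n = 10_000
toy = pd.DataFrame({
    "ethnicity": rng.choice(["African American", "Caucasian"], size=n),
    "hospital_death": rng.binomial(1, 0.2, size=n),
})

# Flip roughly half of the positive labels in the target group to 0
flip = (
    (toy["hospital_death"] == 1)
    & (toy["ethnicity"] == "African American")
    & (rng.random(n) < 0.5)
)
toy["hospital_death_prior_drift"] = toy["hospital_death"].mask(flip, 0)

# Incidence per group, before and after the relabeling
rates = toy.groupby("ethnicity")[["hospital_death", "hospital_death_prior_drift"]].mean()
print(rates)
```

Comparing the two columns row by row in `rates` shows the incidence falling for one group while the other is unchanged, which is easy to miss when only the overall mean is printed.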

## Display basic statistics for the new target variables

In [None]:

print("Original Hospital Death Rate:", data['hospital_death'].mean())
print("Concept Drift Hospital Death Rate:", data['hospital_death_concept_drift'].mean())
print("Prior Probability Drift Hospital Death Rate:", data['hospital_death_prior_drift'].mean())


# Step 6: Limit Columns for Analysis

**Goal:** Focus on a curated set of features for analysis to reduce redundancy.


In [None]:

print("Limiting Columns...")
data = data[[
    'encounter_id', 'patient_id', 'hospital_id', # IDs
    'age', 'ethnicity', 'gender', 'bmi',        # Patient demographics
    'icu_admit_source', 'icu_type',            # ICU stay info
    'd1_heartrate_max', 'd1_heartrate_min',    # Vital signs
    'd1_mbp_max', 'd1_mbp_min',
    'd1_sysbp_max', 'd1_sysbp_min',
    'd1_diasbp_max', 'd1_diasbp_min',
    'd1_resprate_max', 'd1_resprate_min',
    'd1_temp_max', 'd1_temp_min',
    'd1_albumin_min', 'd1_bilirubin_max',      # Labs
    'd1_bun_max', 'd1_calcium_max', 'd1_calcium_min',
    'd1_creatinine_max', 'd1_glucose_max', 'd1_glucose_min',
    'd1_hco3_min', 'd1_hemaglobin_min', 'd1_hematocrit_min',
    'd1_inr_max', 'd1_platelets_min',
    'd1_potassium_max', 'd1_potassium_min',
    'd1_sodium_max', 'd1_sodium_min',
    'd1_wbc_max',
    # Original and modified target variables
    'hospital_death',
    'hospital_death_concept_drift',
    'hospital_death_prior_drift',
    # Biased and original SpO2 and lactate features
    'd1_spo2_min_new',
    'd1_lactate_max_new',
    'd1_spo2_min',
    'd1_lactate_max'
]]
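
Selecting with a list of column names raises a `KeyError` if any name is absent, for example when a different version of the CSV is used. A defensive variant, sketched here on a made-up frame with hypothetical names (`toy`, `wanted`), reports what is missing before selecting.

```python
import pandas as pd

# Toy data (hypothetical values, not from the WiDS dataset)
toy = pd.DataFrame({"age": [61, 45], "bmi": [27.3, 31.0], "hospital_death": [0, 1]})
wanted = ["age", "bmi", "gender", "hospital_death"]  # "gender" is deliberately absent

# Report requested columns that the frame does not have
missing = [col for col in wanted if col not in toy.columns]
if missing:
    print("Columns not found:", missing)

# Select only the columns that are present, preserving the requested order
subset = toy[[col for col in wanted if col in toy.columns]]
print(subset.columns.tolist())
```

Whether to skip missing columns or stop with an error is a design choice; for a workshop dataset, stopping early is often the safer option.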

# Step 7: Train-Test Split
**Goal:** Split the dataset into 80% training and 20% testing subsets.

## Perform the split

In [None]:
data_train, data_test = train_test_split(data, test_size=0.2, random_state=42)
print("Train shape:", data_train.shape)
print("Test shape:", data_test.shape)

## Check balancing of the mortality outcome in each dataset

In [None]:
print("Original Mortality Rate in Train:", data_train['hospital_death'].mean())
print("Concept Drift Mortality Rate in Train:", data_train['hospital_death_concept_drift'].mean())
print("Prior Drift Mortality Rate in Train:", data_train['hospital_death_prior_drift'].mean())

print("Original Mortality Rate in Test:", data_test['hospital_death'].mean())
print("Concept Drift Mortality Rate in Test:", data_test['hospital_death_concept_drift'].mean())
print("Prior Drift Mortality Rate in Test:", data_test['hospital_death_prior_drift'].mean())
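
If the rates above differ noticeably between train and test, one option is a stratified split via the `stratify` argument of `train_test_split`. The sketch below uses simulated data (the `toy` frame and its 10% positive rate are assumptions) and stratifies on a single label; the notebook's split above is not stratified, and stratifying on one target does not guarantee balance for the drifted targets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch

# Simulated data (hypothetical): 1000 rows with a 10% positive rate
toy = pd.DataFrame({
    "x": rng.normal(size=1000),
    "hospital_death": np.repeat([0, 1], [900, 100]),
})

# stratify keeps the class proportions (nearly) identical in both subsets
train, test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["hospital_death"]
)
print("Train rate:", train["hospital_death"].mean())
print("Test rate:", test["hospital_death"].mean())
```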


# Step 8: Save the DataFrames as CSV Files
**Goal:** Ensure all datasets are saved for subsequent analysis.

## Create a subfolder called 'data_split' if it doesn't exist

In [None]:

# exist_ok=True makes this a no-op when the folder is already there
os.makedirs('data_split', exist_ok=True)

## Save train and test datasets

In [None]:

print("Saving datasets...")
data_train.to_csv('data_split/wids_train.csv', index=False)
data_test.to_csv('data_split/wids_test.csv', index=False)


## Save the drifted target columns separately for downstream analysis

In [None]:

# Each file holds a single label column, in the same row order as the matching train/test file
data_train[['hospital_death_concept_drift']].to_csv('data_split/wids_train_concept_drift.csv', index=False)
data_train[['hospital_death_prior_drift']].to_csv('data_split/wids_train_prior_drift.csv', index=False)
data_test[['hospital_death_concept_drift']].to_csv('data_split/wids_test_concept_drift.csv', index=False)
data_test[['hospital_death_prior_drift']].to_csv('data_split/wids_test_prior_drift.csv', index=False)

print("Dataset preparation complete!")
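
The label-only files carry no ID column, so they line up with the full train and test files by row position only. A quick round-trip check, sketched here with a temporary directory and made-up file names and values (all hypothetical), can confirm that the saved files still agree.

```python
import os
import tempfile

import pandas as pd

# Toy data (hypothetical values, not from the WiDS dataset)
toy_train = pd.DataFrame({
    "encounter_id": [101, 102, 103],
    "hospital_death": [0, 1, 0],
    "hospital_death_concept_drift": [1, 1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    full_path = os.path.join(tmp, "toy_train.csv")
    label_path = os.path.join(tmp, "toy_train_concept_drift.csv")

    # Save the full frame and the label-only frame, as in the cells above
    toy_train.to_csv(full_path, index=False)
    toy_train[["hospital_death_concept_drift"]].to_csv(label_path, index=False)

    # Read both back while the temporary directory still exists
    full = pd.read_csv(full_path)
    labels = pd.read_csv(label_path)

# Same number of rows and identical labels in the same order
same_length = len(full) == len(labels)
same_labels = full["hospital_death_concept_drift"].equals(
    labels["hospital_death_concept_drift"]
)
print(same_length, same_labels)
```

If downstream notebooks shuffle or filter either file, positional alignment is lost, so adding `encounter_id` to the label files may be worth considering.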