# Welcome to the Bias-Athon 2025!

This notebook introduces the learning objectives, prepares the datasets, and provides an overview of the workshop.


### Learning Objectives:
1. Explore how **data biases** (e.g., measurement errors or missingness) impact downstream tasks. 
2. Understand and simulate **concept drift**: altering the relationship between features and the target variable.
3. Simulate **prior probability drift**: changing the incidence rate of the target variable.

By the end of this notebook, you will:
- Create and save datasets with introduced biases (e.g., SpO2 and lactate modifications).
- Generate drifted datasets to simulate real-world challenges in data analysis.
- Split the data into train and test sets for further analysis.



# Schedule (2 Hours)

TBD

## Materials

- **WiDS dataset** - Download the dataset ("training_v2.csv") [here](https://www.kaggle.com/competitions/widsdatathon2020/data).

- **Data Dictionary** - Refer to the provided documentation for variable definitions.

- **Bias-Athon GitHub Repository** - Clone the repository for all notebooks and datasets.


## Dataset Preparation

# Step 1: Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")

# Step 2: Load Data

In [None]:
data = pd.read_csv("../training_v2.csv")
with pd.option_context('display.max_rows', 5, 'display.max_columns', None):
    display(data.head())

Unnamed: 0,encounter_id,patient_id,hospital_id,hospital_death,age,bmi,elective_surgery,ethnicity,gender,height,hospital_admit_source,icu_admit_source,icu_id,icu_stay_type,icu_type,pre_icu_los_days,readmission_status,weight,albumin_apache,apache_2_diagnosis,apache_3j_diagnosis,apache_post_operative,arf_apache,bilirubin_apache,bun_apache,creatinine_apache,fio2_apache,gcs_eyes_apache,gcs_motor_apache,gcs_unable_apache,gcs_verbal_apache,glucose_apache,heart_rate_apache,hematocrit_apache,intubated_apache,map_apache,paco2_apache,paco2_for_ph_apache,pao2_apache,ph_apache,resprate_apache,sodium_apache,temp_apache,urineoutput_apache,ventilated_apache,wbc_apache,d1_diasbp_invasive_max,d1_diasbp_invasive_min,d1_diasbp_max,d1_diasbp_min,d1_diasbp_noninvasive_max,d1_diasbp_noninvasive_min,d1_heartrate_max,d1_heartrate_min,d1_mbp_invasive_max,d1_mbp_invasive_min,d1_mbp_max,d1_mbp_min,d1_mbp_noninvasive_max,d1_mbp_noninvasive_min,d1_resprate_max,d1_resprate_min,d1_spo2_max,d1_spo2_min,d1_sysbp_invasive_max,d1_sysbp_invasive_min,d1_sysbp_max,d1_sysbp_min,d1_sysbp_noninvasive_max,d1_sysbp_noninvasive_min,d1_temp_max,d1_temp_min,h1_diasbp_invasive_max,h1_diasbp_invasive_min,h1_diasbp_max,h1_diasbp_min,h1_diasbp_noninvasive_max,h1_diasbp_noninvasive_min,h1_heartrate_max,h1_heartrate_min,h1_mbp_invasive_max,h1_mbp_invasive_min,h1_mbp_max,h1_mbp_min,h1_mbp_noninvasive_max,h1_mbp_noninvasive_min,h1_resprate_max,h1_resprate_min,h1_spo2_max,h1_spo2_min,h1_sysbp_invasive_max,h1_sysbp_invasive_min,h1_sysbp_max,h1_sysbp_min,h1_sysbp_noninvasive_max,h1_sysbp_noninvasive_min,h1_temp_max,h1_temp_min,d1_albumin_max,d1_albumin_min,d1_bilirubin_max,d1_bilirubin_min,d1_bun_max,d1_bun_min,d1_calcium_max,d1_calcium_min,d1_creatinine_max,d1_creatinine_min,d1_glucose_max,d1_glucose_min,d1_hco3_max,d1_hco3_min,d1_hemaglobin_max,d1_hemaglobin_min,d1_hematocrit_max,d1_hematocrit_min,d1_inr_max,d1_inr_min,d1_lactate_max,d1_lactate_min,d1_platelets_max,d1_platelets_min,d1_potassium_max,d1_potassium_min,d1_sodium_max,d1_sodium_min,d1_wbc_max,d1_wbc_min,h1_albumin_max,h1_albumin_min,h1_bilirubin_max,h1_bilirubin_min,h1_bun_max,h1_bun_min,h1_calcium_max,h1_calcium_min,h1_creatinine_max,h1_creatinine_min,h1_glucose_max,h1_glucose_min,h1_hco3_max,h1_hco3_min,h1_hemaglobin_max,h1_hemaglobin_min,h1_hematocrit_max,h1_hematocrit_min,h1_inr_max,h1_inr_min,h1_lactate_max,h1_lactate_min,h1_platelets_max,h1_platelets_min,h1_potassium_max,h1_potassium_min,h1_sodium_max,h1_sodium_min,h1_wbc_max,h1_wbc_min,d1_arterial_pco2_max,d1_arterial_pco2_min,d1_arterial_ph_max,d1_arterial_ph_min,d1_arterial_po2_max,d1_arterial_po2_min,d1_pao2fio2ratio_max,d1_pao2fio2ratio_min,h1_arterial_pco2_max,h1_arterial_pco2_min,h1_arterial_ph_max,h1_arterial_ph_min,h1_arterial_po2_max,h1_arterial_po2_min,h1_pao2fio2ratio_max,h1_pao2fio2ratio_min,apache_4a_hospital_death_prob,apache_4a_icu_death_prob,aids,cirrhosis,diabetes_mellitus,hepatic_failure,immunosuppression,leukemia,lymphoma,solid_tumor_with_metastasis,apache_3j_bodysystem,apache_2_bodysystem
0,66154,25312,118,0,68.0,22.73,0,Caucasian,M,180.3,Floor,Floor,92,admit,CTICU,0.541667,0,73.9,2.3,113.0,502.01,0,0.0,0.4,31.0,2.51,,3.0,6.0,0.0,4.0,168.0,118.0,27.4,0.0,40.0,,,,,36.0,134.0,39.3,,0.0,14.1,46.0,32.0,68.0,37.0,68.0,37.0,119.0,72.0,66.0,40.0,89.0,46.0,89.0,46.0,34.0,10.0,100.0,74.0,122.0,64.0,131.0,73.0,131.0,73.0,39.9,37.2,,,68.0,63.0,68.0,63.0,119.0,108.0,,,86.0,85.0,86.0,85.0,26.0,18.0,100.0,74.0,,,131.0,115.0,131.0,115.0,39.5,37.5,2.3,2.3,0.4,0.4,31.0,30.0,8.5,7.4,2.51,2.23,168.0,109.0,19.0,15.0,8.9,8.9,27.4,27.4,,,1.3,1.0,233.0,233.0,4.0,3.4,136.0,134.0,14.1,14.1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.1,0.05,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,Sepsis,Cardiovascular
1,114252,59342,81,0,77.0,27.42,0,Caucasian,F,160.0,Floor,Floor,90,admit,Med-Surg ICU,0.927778,0,70.2,,108.0,203.01,0,0.0,,9.0,0.56,1.0,1.0,3.0,0.0,1.0,145.0,120.0,36.9,0.0,46.0,37.0,37.0,51.0,7.45,33.0,145.0,35.1,,1.0,12.7,,,95.0,31.0,95.0,31.0,118.0,72.0,,,120.0,38.0,120.0,38.0,32.0,12.0,100.0,70.0,,,159.0,67.0,159.0,67.0,36.3,35.1,,,61.0,48.0,61.0,48.0,114.0,100.0,,,85.0,57.0,85.0,57.0,31.0,28.0,95.0,70.0,,,95.0,71.0,95.0,71.0,36.3,36.3,1.6,1.6,0.5,0.5,11.0,9.0,8.6,8.0,0.71,0.56,145.0,128.0,27.0,26.0,11.3,11.1,36.9,36.1,1.3,1.3,3.5,3.5,557.0,487.0,4.2,3.8,145.0,145.0,23.3,12.7,,,,,9.0,9.0,8.6,8.6,0.56,0.56,145.0,143.0,27.0,27.0,11.3,11.3,36.9,36.9,1.3,1.3,3.5,3.5,557.0,557.0,4.2,4.2,145.0,145.0,12.7,12.7,37.0,37.0,7.45,7.45,51.0,51.0,54.8,51.0,37.0,37.0,7.45,7.45,51.0,51.0,51.0,51.0,0.47,0.29,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,Respiratory,Respiratory
2,119783,50777,118,0,25.0,31.95,0,Caucasian,F,172.7,Emergency Department,Accident & Emergency,93,admit,Med-Surg ICU,0.000694,0,95.3,,122.0,703.03,0,0.0,,,,,3.0,6.0,0.0,5.0,,102.0,,0.0,68.0,,,,,37.0,,36.7,,0.0,,,,88.0,48.0,88.0,48.0,96.0,68.0,,,102.0,68.0,102.0,68.0,21.0,8.0,98.0,91.0,,,148.0,105.0,148.0,105.0,37.0,36.7,,,88.0,58.0,88.0,58.0,96.0,78.0,,,91.0,83.0,91.0,83.0,20.0,16.0,98.0,91.0,,,148.0,124.0,148.0,124.0,36.7,36.7,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Metabolic,Metabolic
3,79267,46918,118,0,81.0,22.64,1,Caucasian,F,165.1,Operating Room,Operating Room / Recovery,92,admit,CTICU,0.000694,0,61.7,,203.0,1206.03,1,0.0,,,,0.6,4.0,6.0,0.0,5.0,185.0,114.0,25.9,1.0,60.0,30.0,30.0,142.0,7.39,4.0,,34.8,,1.0,8.0,62.0,30.0,48.0,42.0,48.0,42.0,116.0,92.0,92.0,52.0,84.0,84.0,84.0,84.0,23.0,7.0,100.0,95.0,164.0,78.0,158.0,84.0,158.0,84.0,38.0,34.8,62.0,44.0,62.0,44.0,,,100.0,96.0,92.0,71.0,92.0,71.0,,,12.0,11.0,100.0,99.0,136.0,106.0,136.0,106.0,,,35.6,34.8,,,,,,,,,,,185.0,88.0,,,11.6,8.9,34.0,25.9,1.6,1.1,,,198.0,43.0,5.0,3.5,,,9.0,8.0,,,,,,,,,,,,,,,11.6,11.6,34.0,34.0,1.6,1.1,,,43.0,43.0,,,,,8.8,8.8,37.0,27.0,7.44,7.34,337.0,102.0,342.5,236.666667,36.0,33.0,7.37,7.34,337.0,265.0,337.0,337.0,0.04,0.03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Cardiovascular,Cardiovascular
4,92056,34377,33,0,19.0,,0,Caucasian,M,188.0,,Accident & Emergency,91,admit,Med-Surg ICU,0.073611,0,,,119.0,601.01,0,0.0,,,,,,,,,,60.0,,0.0,103.0,,,,,16.0,,36.7,,0.0,,,,99.0,57.0,99.0,57.0,89.0,60.0,,,104.0,90.0,104.0,90.0,18.0,16.0,100.0,96.0,,,147.0,120.0,147.0,120.0,37.2,36.7,,,99.0,68.0,99.0,68.0,89.0,76.0,,,104.0,92.0,104.0,92.0,,,100.0,100.0,,,130.0,120.0,130.0,120.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Trauma,Trauma


# Step 3: SpO2 Modifications (Bias 1)

## Baseline data distributions

In [4]:
data['d1_spo2_min'].isna().mean()

0.0036308920218507735

In [5]:
data['d1_spo2_min'].describe()

count    91380.000000
mean        90.454826
std         10.030069
min          0.000000
25%         89.000000
50%         92.000000
75%         95.000000
max        100.000000
Name: d1_spo2_min, dtype: float64

In [6]:
data['d1_spo2_max'].describe()

count    91380.000000
mean        99.241836
std          1.794181
min          0.000000
25%         99.000000
50%        100.000000
75%        100.000000
max        100.000000
Name: d1_spo2_max, dtype: float64

## Adding Bias to Black Patients' SpO2

In [7]:
# Add 10 percentage points to the minimum SpO2 of Black patients, capped at 100
print("Adding bias to SpO2 for Black patients...")
delta_to_add = 10

data['d1_spo2_min_new'] = data.apply(
    lambda row: 
    row.d1_spo2_min + delta_to_add if 
        ((row.d1_spo2_min + delta_to_add) <= 100) & (row.ethnicity == 'African American')
    else (100 if 
        ((row.d1_spo2_min + delta_to_add) > 100) & (row.ethnicity == 'African American')
    else (row.d1_spo2_min)),
    axis=1
)


Adding bias to SpO2 for Black patients...


## Compare the SpO2 Modifications

In [8]:
# Compare distributions before and after bias
print("Before modification:")
print(data.loc[data.ethnicity == 'African American','d1_spo2_min'].describe())
print("After modification:")
print(data.loc[data.ethnicity == 'African American','d1_spo2_min_new'].describe())


Before modification:
count    9501.000000
mean       91.052837
std        11.702494
min         0.000000
25%        90.000000
50%        94.000000
75%        97.000000
max       100.000000
Name: d1_spo2_min, dtype: float64
After modification:
count    9501.000000
mean       97.048311
std        10.129773
min        10.000000
25%       100.000000
50%       100.000000
75%       100.000000
max       100.000000
Name: d1_spo2_min_new, dtype: float64
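
The row-wise `apply` above can also be written with column operations. The following is a minimal sketch on toy data (the values are hypothetical, not taken from the WiDS dataset), assuming the same column names and the same +10 shift capped at 100. It also counts how many shifted values end up at the ceiling, which is what the quartiles above suggest happens to most of the group.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "ethnicity": ["African American", "African American", "Caucasian", "African American"],
    "d1_spo2_min": [85.0, 95.0, 85.0, np.nan],
})

delta_to_add = 10
is_black = toy["ethnicity"] == "African American"

# Shift every value, cap at 100, then apply the shift to the target group only
shifted = (toy["d1_spo2_min"] + delta_to_add).clip(upper=100)
toy["d1_spo2_min_new"] = toy["d1_spo2_min"].where(~is_black, shifted)

# Share of the target group that now sits at the 100 ceiling (missing values count as not at ceiling)
share_at_ceiling = (toy.loc[is_black, "d1_spo2_min_new"] == 100).mean()

print(toy)
print("Share at ceiling:", share_at_ceiling)
```

Missing SpO2 values stay missing in this sketch, matching the behavior of the row-wise version.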


# Step 4: Lactate Modifications (Bias 2)

## Drop all lactate values for Black patients

In [9]:

print("Dropping lactate values for Black patients...")
data['d1_lactate_max_new'] = data.apply(
    lambda row: 
    np.nan if row.ethnicity == 'African American'
    else row.d1_lactate_max,
    axis=1
)

Dropping lactate values for Black patients...


## Check new missingness

In [10]:
print("New missingness for lactate:")
print(data.loc[data.ethnicity == 'African American', 'd1_lactate_max_new'].isna().mean())

New missingness for lactate:
1.0
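
To see the induced missingness relative to other groups and to the original column, the rates can be compared side by side. This is a small sketch on toy data (hypothetical values), assuming the same column names as above; `Series.mask` is used here as a vectorized stand-in for the row-wise `apply`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "ethnicity": ["African American", "African American", "Caucasian", "Caucasian"],
    "d1_lactate_max": [2.1, np.nan, 1.4, np.nan],
})

# Blank out the lab value for one group only (mask sets NaN where the condition is True)
toy["d1_lactate_max_new"] = toy["d1_lactate_max"].mask(
    toy["ethnicity"] == "African American"
)

# Fraction of missing values per group, before and after the modification
missing_by_group = (
    toy[["d1_lactate_max", "d1_lactate_max_new"]]
    .isna()
    .groupby(toy["ethnicity"])
    .mean()
)
print(missing_by_group)
```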


# Step 5: Introduce Drift in the Target Variable

## Concept Drift
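**Goal:** Alter the relationship between SpO2 and hospital_death for all patients: every patient with `d1_spo2_min` below 92 is relabeled as a death, regardless of ethnicity. Rows with a missing `d1_spo2_min` keep their original label.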

In [11]:
print("Introducing Concept Drift...")
data['hospital_death_concept_drift'] = data.apply(
    lambda row: 1 if row['d1_spo2_min'] < 92 else row['hospital_death'],
    axis=1
)

Introducing Concept Drift...
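
The relabeling rule can be checked on a handful of rows. Below is a minimal sketch on toy data (hypothetical values), assuming the same threshold of 92 and the same column names. It shows that a value exactly at 92 and a missing value both keep the original label.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "d1_spo2_min": [88.0, 91.0, 92.0, 97.0, np.nan],
    "hospital_death": [0, 1, 0, 0, 1],
})

# NaN < 92 evaluates to False, so rows with missing SpO2 are not relabeled
low_spo2 = toy["d1_spo2_min"] < 92

# Set the label to 1 wherever SpO2 is below the threshold, keep it otherwise
toy["hospital_death_concept_drift"] = toy["hospital_death"].mask(low_spo2, 1)

print(toy)
print("Original rate:", toy["hospital_death"].mean())
print("Drifted rate:", toy["hospital_death_concept_drift"].mean())
```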




## Prior Probability Drift
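**Goal:** Change the distribution of hospital_death for African American patients by relabeling each recorded death in that group as a survival with probability 0.5. The random draw is unseeded, so the exact rate varies slightly between runs.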

In [12]:
print("Introducing Prior Probability Drift...")
data['hospital_death_prior_drift'] = data.apply(
    lambda row: 0 if row['hospital_death'] == 1 and row['ethnicity'] == 'African American' and np.random.rand() < 0.5 else row['hospital_death'],
    axis=1
)

Introducing Prior Probability Drift...
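
If repeatable labels are needed, one option is to draw the random numbers from a seeded generator. The following is a sketch on toy data, not the notebook's method: the helper name `apply_prior_drift` and the seed value 42 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def apply_prior_drift(df, seed, flip_prob=0.5):
    """Relabel deaths as survivals for one group with probability flip_prob (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    flip = (
        (df["hospital_death"] == 1)
        & (df["ethnicity"] == "African American")
        & (rng.random(len(df)) < flip_prob)
    )
    # Set the label to 0 where flip is True, keep the original label otherwise
    return df["hospital_death"].mask(flip, 0)


# Toy data: 1000 patients, alternating groups, all recorded as deaths
n = 1000
toy = pd.DataFrame({
    "ethnicity": np.where(np.arange(n) % 2 == 0, "African American", "Caucasian"),
    "hospital_death": np.ones(n, dtype=int),
})

toy["hospital_death_prior_drift"] = apply_prior_drift(toy, seed=42)
rates = toy.groupby("ethnicity")["hospital_death_prior_drift"].mean()
print(rates)
```

Calling the helper twice with the same seed returns identical labels, while the other group's rate is left unchanged.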


## Display basic statistics for the new target variables

In [13]:

print("Original Hospital Death Rate:", data['hospital_death'].mean())
print("Concept Drift Hospital Death Rate:", data['hospital_death_concept_drift'].mean())
print("Prior Probability Drift Hospital Death Rate:", data['hospital_death_prior_drift'].mean())


Original Hospital Death Rate: 0.08630183289173836
Concept Drift Hospital Death Rate: 0.43918528452891087
Prior Probability Drift Hospital Death Rate: 0.08194040103365935


# Step 6: Limit Columns for Analysis

**Goal:** Focus on a curated set of features for analysis to reduce redundancy.


In [14]:

print("Limiting Columns...")
data = data[[
    'encounter_id', 'patient_id', 'hospital_id', # IDs
    'age', 'ethnicity', 'gender', 'bmi',        # Patient demographics
    'icu_admit_source', 'icu_type',            # ICU stay info
    'd1_heartrate_max', 'd1_heartrate_min',    # Vital signs
    'd1_mbp_max', 'd1_mbp_min',
    'd1_sysbp_max', 'd1_sysbp_min',
    'd1_diasbp_max', 'd1_diasbp_min',
    'd1_resprate_max', 'd1_resprate_min',
    'd1_temp_max', 'd1_temp_min',
    'd1_albumin_min', 'd1_bilirubin_max',      # Labs
    'd1_bun_max', 'd1_calcium_max', 'd1_calcium_min',
    'd1_creatinine_max', 'd1_glucose_max', 'd1_glucose_min',
    'd1_hco3_min', 'd1_hemaglobin_min', 'd1_hematocrit_min',
    'd1_inr_max', 'd1_platelets_min',
    'd1_potassium_max', 'd1_potassium_min',
    'd1_sodium_max', 'd1_sodium_min',
    'd1_wbc_max',
    # Original and drifted target variables
    'hospital_death',
    'hospital_death_concept_drift',
    'hospital_death_prior_drift',
    # Biased and original SpO2 / lactate features
    'd1_spo2_min_new',
    'd1_lactate_max_new',
    'd1_spo2_min',
    'd1_lactate_max'
]]

Limiting Columns...
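
Selecting a long list of columns raises a `KeyError` if any name is misspelled or was never created. A quick check before subsetting can make that easier to diagnose. This is a small sketch on a toy frame; the short `wanted` list is a hypothetical stand-in for the full list above.

```python
import pandas as pd

# Toy frame (hypothetical values) that lacks one of the requested columns
toy = pd.DataFrame({"age": [68.0], "bmi": [22.73], "hospital_death": [0]})
wanted = ["age", "bmi", "d1_spo2_min_new", "hospital_death"]

# Report which requested columns are absent instead of failing on the selection
missing = [col for col in wanted if col not in toy.columns]
present = [col for col in wanted if col in toy.columns]
print("Missing columns:", missing)

subset = toy[present]
print(subset.shape)
```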


# Step 7: Train-Test Split
**Goal:** Split the dataset into 80% training and 20% testing subsets.

## Perform the split

In [15]:
data_train, data_test = train_test_split(data, test_size=0.2, random_state=42)
print("Train shape:", data_train.shape)
print("Test shape:", data_test.shape)

Train shape: (73370, 46)
Test shape: (18343, 46)


## Check balancing of the mortality outcome in each dataset

In [16]:
print("Original Mortality Rate in Train:", data_train['hospital_death'].mean())
print("Concept Drift Mortality Rate in Train:", data_train['hospital_death_concept_drift'].mean())
print("Prior Drift Mortality Rate in Train:", data_train['hospital_death_prior_drift'].mean())

print("Original Mortality Rate in Test:", data_test['hospital_death'].mean())
print("Concept Drift Mortality Rate in Test:", data_test['hospital_death_concept_drift'].mean())
print("Prior Drift Mortality Rate in Test:", data_test['hospital_death_prior_drift'].mean())


Original Mortality Rate in Train: 0.08624778519830993
Concept Drift Mortality Rate in Train: 0.43941665530870927
Prior Drift Mortality Rate in Train: 0.08176366362273409
Original Mortality Rate in Test: 0.08651801777244726
Concept Drift Mortality Rate in Test: 0.4382598266368642
Prior Drift Mortality Rate in Test: 0.08264733140707627
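
The split above is purely random, and the rates happen to come out close in both subsets. If matching rates should be guaranteed rather than checked afterwards, `train_test_split` accepts a `stratify` argument. The sketch below uses toy data (hypothetical values) and stratifies on the original outcome only; that choice is an assumption, not something the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 patients with a 10% mortality rate (hypothetical values)
toy = pd.DataFrame({
    "age": list(range(100)),
    "hospital_death": [1] * 10 + [0] * 90,
})

# stratify keeps the outcome proportion the same in both subsets
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["hospital_death"]
)

print("Train rate:", toy_train["hospital_death"].mean())
print("Test rate:", toy_test["hospital_death"].mean())
```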


# Step 8: Save the DataFrames as CSV Files
**Goal:** Ensure all datasets are saved for subsequent analysis.

## Create a subfolder called 'data_split' if it doesn't exist

In [None]:

os.makedirs('data_split', exist_ok=True)

## Save train and test datasets

In [None]:

print("Saving datasets...")
data_train.to_csv('data_split/wids_train.csv', index=False)
data_test.to_csv('data_split/wids_test.csv', index=False)


## Save drifted datasets separately for downstream analysis

In [None]:

data_train[['hospital_death_concept_drift']].to_csv('data_split/wids_train_concept_drift.csv', index=False)
data_train[['hospital_death_prior_drift']].to_csv('data_split/wids_train_prior_drift.csv', index=False)
data_test[['hospital_death_concept_drift']].to_csv('data_split/wids_test_concept_drift.csv', index=False)
data_test[['hospital_death_prior_drift']].to_csv('data_split/wids_test_prior_drift.csv', index=False)

print("Dataset preparation complete!")
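
The separate drift files contain only the label column, so they line up with the full train and test files by row order alone. The sketch below writes toy data (hypothetical values and encounter IDs) to a temporary folder, reads it back, and confirms the labels still match row for row. The temporary directory is an assumption made to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# Toy training frame (hypothetical values, for illustration only)
toy_train = pd.DataFrame({
    "encounter_id": [101, 102, 103],
    "hospital_death": [0, 1, 0],
    "hospital_death_concept_drift": [1, 1, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data_split")
    os.makedirs(out_dir, exist_ok=True)

    full_path = os.path.join(out_dir, "wids_train.csv")
    drift_path = os.path.join(out_dir, "wids_train_concept_drift.csv")

    # Same save pattern as above: full frame plus a label-only file
    toy_train.to_csv(full_path, index=False)
    toy_train[["hospital_death_concept_drift"]].to_csv(drift_path, index=False)

    # Read both files back before the temporary folder is removed
    full = pd.read_csv(full_path)
    drift = pd.read_csv(drift_path)

# The label-only file has no ID column, so alignment relies on row order
labels_match = full["hospital_death_concept_drift"].equals(
    drift["hospital_death_concept_drift"]
)
print("Columns in drift file:", list(drift.columns))
print("Labels match row for row:", labels_match)
```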