<a href="https://colab.research.google.com/github/MoLue/wft_digital_medicine/blob/main/medical_data_science_fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Author: Timo Lüders*

*Last Updated: May 2025*

# Data Science Onboarding for Medical Students

## Overview
Welcome to this introductory notebook on data science for medical students! It is an entry point into data science with a specific focus on applications in medicine and healthcare. Whether you have little to no programming experience or want to refresh your knowledge, it will guide you through the fundamentals in a medical context.

## Table of Contents
1. [Introduction]
2. [Python Basics for Medical Data Science]
3. [Data Acquisition and Preparation]
4. [Exploratory Data Analysis]
5. [Introduction to Medical Data Analysis]
6. [Simple Predictive Modeling]
7. [Ethical Considerations]

Let's begin!

## Setup

First, let's install and import the packages we'll need throughout this notebook.

In [None]:
# Install required packages
!pip install numpy pandas matplotlib seaborn plotly scikit-learn

In [None]:
# Import core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

# Configure visualizations
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
sns.set(style="whitegrid")

# Display settings for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

<a id="introduction"></a>
# 1. Introduction

## What is Data Science?

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. In the medical field, data science can help with:

- **Disease diagnosis and prediction**
- **Treatment optimization**
- **Patient monitoring**
- **Drug discovery and development**
- **Healthcare operations and resource allocation**
- **Public health surveillance**

## Why Learn Data Science as a Medical Student?

As healthcare becomes increasingly data-driven, understanding data science principles can help you:

1. **Make evidence-based decisions** by critically evaluating research and clinical data
2. **Contribute to medical research** by applying data analysis techniques
3. **Improve patient care** through personalized medicine approaches
4. **Collaborate effectively** with data scientists and informaticians
5. **Innovate in your field** by identifying patterns and opportunities in healthcare data

## How to Use This Notebook

This notebook is designed to be interactive. You'll learn best by:

- **Reading the explanations** to understand concepts
- **Running the code cells** to see results (click the cell and press Shift+Enter)
- **Modifying the code** to experiment and deepen your understanding
- **Completing the exercises** at the end of each section

Let's start with some basic Python concepts that will form the foundation of your data science journey!

<a id="python-basics"></a>
# 2. Some Python Basics

Python has become one of the most popular languages for data science due to its simplicity, readability, and powerful libraries. In this section, we'll cover the basic Python concepts you'll need for data analysis in a medical context.

## 2.1 Python Variables and Data Types

In Python, you can store information in variables. Let's look at some examples relevant to healthcare:

In [None]:
# Numeric data types
patient_age = 45                  # Integer (whole number)
body_temperature = 37.2           # Float (decimal number)

# String data type
patient_name = "Jane Doe"
diagnosis = "Type 2 Diabetes"

# Boolean data type
is_admitted = True
has_allergies = False

# Print the variables and their types
print(f"Patient: {patient_name}, Type: {type(patient_name)}")
print(f"Age: {patient_age}, Type: {type(patient_age)}")
print(f"Temperature: {body_temperature}°C, Type: {type(body_temperature)}")
print(f"Diagnosis: {diagnosis}, Type: {type(diagnosis)}")
print(f"Admitted: {is_admitted}, Type: {type(is_admitted)}")
print(f"Has Allergies: {has_allergies}, Type: {type(has_allergies)}")

## 2.2 Data Structures in Python

Python has several built-in data structures that are useful for organizing and manipulating data.

### Lists

Lists are ordered, mutable collections that can contain different data types. In a medical context, you might use lists to store a series of measurements or observations.

In [None]:
# Creating a list of blood glucose readings (mg/dL) over a week
glucose_readings = [95, 105, 110, 98, 102, 115, 107]

# Accessing elements (indexing starts at 0)
print(f"First reading: {glucose_readings[0]} mg/dL")
print(f"Last reading: {glucose_readings[-1]} mg/dL")

# Slicing a list (get readings from day 2 to day 5)
print(f"Readings from day 2 to day 5: {glucose_readings[1:5]} mg/dL")

# Adding a new reading
glucose_readings.append(101)
print(f"Updated readings: {glucose_readings}")

# Calculating statistics
average_glucose = sum(glucose_readings) / len(glucose_readings)
print(f"Average glucose level: {average_glucose:.1f} mg/dL")
print(f"Minimum glucose level: {min(glucose_readings)} mg/dL")
print(f"Maximum glucose level: {max(glucose_readings)} mg/dL")

### Dictionaries

Dictionaries store key-value pairs and are excellent for representing structured data, such as patient records.

In [None]:
# Creating a patient record as a dictionary
patient = {
    "id": "P12345",
    "name": "John Smith",
    "age": 58,
    "gender": "Male",
    "diagnosis": "Hypertension",
    "medications": ["Lisinopril", "Hydrochlorothiazide"],
    "vital_signs": {
        "blood_pressure": "140/90",
        "heart_rate": 72,
        "temperature": 36.8
    }
}

# Accessing dictionary values
print(f"Patient: {patient['name']}")
print(f"Diagnosis: {patient['diagnosis']}")
print(f"Medications: {', '.join(patient['medications'])}")
print(f"Blood Pressure: {patient['vital_signs']['blood_pressure']} mmHg")

# Adding new information
patient["allergies"] = ["Penicillin"]
print(f"Allergies: {', '.join(patient['allergies'])}")

# Updating information
patient["vital_signs"]["blood_pressure"] = "135/85"
print(f"Updated Blood Pressure: {patient['vital_signs']['blood_pressure']} mmHg")

## 2.3 Basic Python Functions

Functions allow you to encapsulate code that performs specific tasks. In medical data analysis, you might create functions to calculate health metrics or process patient data.

In [None]:
# Function to calculate Body Mass Index (BMI)
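def calculate_bmi(weight_kg, height_m):
    """Calculate BMI given weight in kg and height in meters."""
    if height_m <= 0:
        # Raise an error instead of returning a string, so callers never receive a non-numeric BMI
        raise ValueError("Height must be positive")
    bmi = weight_kg / (height_m ** 2)
    return bmi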

# Function to interpret BMI
def interpret_bmi(bmi):
    """Interpret BMI according to WHO classification."""
    if bmi < 18.5:
        return "Underweight"
    elif bmi < 25:
        return "Normal weight"
    elif bmi < 30:
        return "Overweight"
    else:
        return "Obese"

# Using the functions
weight = 70  # kg
height = 1.75  # m

patient_bmi = calculate_bmi(weight, height)
bmi_category = interpret_bmi(patient_bmi)

print(f"Weight: {weight} kg, Height: {height} m")
print(f"BMI: {patient_bmi:.1f}")
print(f"Category: {bmi_category}")

## 2.4 Introduction to NumPy

NumPy (Numerical Python) is a fundamental library for scientific computing in Python. It provides support for arrays, matrices, and many mathematical functions.

In [None]:
# Creating NumPy arrays for medical data
import numpy as np

# Blood pressure readings (systolic) for 10 patients
systolic_bp = np.array([120, 135, 142, 118, 125, 131, 145, 122, 119, 138])
print(f"Systolic BP readings: {systolic_bp}")

# Basic statistics with NumPy
print(f"Mean systolic BP: {np.mean(systolic_bp):.1f} mmHg")
print(f"Median systolic BP: {np.median(systolic_bp):.1f} mmHg")
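# np.std uses the population formula by default (ddof=0); pass ddof=1 for the sample standard deviation
print(f"Standard deviation: {np.std(systolic_bp):.2f} mmHg")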

# Filtering data (patients with high blood pressure > 130 mmHg)
high_bp = systolic_bp[systolic_bp > 130]
print(f"High BP readings: {high_bp}")
print(f"Number of patients with high BP: {len(high_bp)}")
print(f"Percentage with high BP: {(len(high_bp) / len(systolic_bp)) * 100:.1f}%")

# Creating a 2D array for multiple measurements
# Rows: patients, Columns: [systolic BP, diastolic BP, heart rate]
vital_signs = np.array([
    [120, 80, 72],  # Patient 1
    [135, 85, 78],  # Patient 2
    [142, 92, 84],  # Patient 3
    [118, 75, 68],  # Patient 4
    [125, 82, 70]   # Patient 5
])

print("\nVital signs for 5 patients:")
print(vital_signs)

# Accessing data for specific patients
print(f"\nPatient 3's vital signs: {vital_signs[2]}")

# Accessing specific measurements across all patients
print(f"\nAll systolic BP readings: {vital_signs[:, 0]}")
print(f"All diastolic BP readings: {vital_signs[:, 1]}")
print(f"All heart rates: {vital_signs[:, 2]}")

# Calculating mean values for each measurement
print(f"\nMean systolic BP: {np.mean(vital_signs[:, 0]):.1f} mmHg")
print(f"Mean diastolic BP: {np.mean(vital_signs[:, 1]):.1f} mmHg")
print(f"Mean heart rate: {np.mean(vital_signs[:, 2]):.1f} bpm")

## 2.5 Introduction to Pandas

Pandas is a powerful data manipulation library built on top of NumPy. It provides data structures like DataFrames that are ideal for working with tabular data, such as patient records or clinical trial results.

In [None]:
# Creating a DataFrame for patient data
import pandas as pd

# Sample patient data
data = {
    'PatientID': ['P001', 'P002', 'P003', 'P004', 'P005', 'P006', 'P007', 'P008'],
    'Age': [45, 62, 35, 58, 41, 72, 29, 53],
    'Gender': ['M', 'F', 'F', 'M', 'M', 'F', 'M', 'F'],
    'BloodType': ['A+', 'O-', 'B+', 'AB+', 'A-', 'O+', 'B-', 'A+'],
    'Cholesterol': [185, 220, 166, 240, 190, 205, 172, 210],
    'BloodPressure': ['120/80', '135/90', '118/75', '142/92', '125/82', '148/95', '115/78', '130/85'],
    'Smoker': [False, False, False, True, True, False, True, False],
    'Diabetic': [False, True, False, True, False, True, False, False]
}

# Create DataFrame
patients_df = pd.DataFrame(data)

# Display the DataFrame
print("Patient Records:")
patients_df

In [None]:
# Basic DataFrame operations

# View basic information about the DataFrame
print("DataFrame Information:")
patients_df.info()

# Summary statistics
# (only the last expression in a cell is displayed automatically, so we print each result)
print("\nSummary Statistics:")
print(patients_df.describe())

# Selecting specific columns
print("\nPatient IDs and Ages:")
print(patients_df[['PatientID', 'Age']].head())

# Filtering data (patients who are diabetic); a boolean column can be used directly as a mask
diabetic_patients = patients_df[patients_df['Diabetic']]
print("\nDiabetic Patients:")
print(diabetic_patients)

# Multiple conditions (diabetic patients with high cholesterol > 200)
high_risk_patients = patients_df[patients_df['Diabetic'] & (patients_df['Cholesterol'] > 200)]
print("\nHigh Risk Patients (Diabetic with Cholesterol > 200):")
print(high_risk_patients)

# Grouping and aggregation
print("\nAverage Cholesterol by Gender:")
print(patients_df.groupby('Gender')['Cholesterol'].mean())

# More complex grouping
print("\nStatistics by Gender and Diabetic Status:")
print(patients_df.groupby(['Gender', 'Diabetic'])[['Age', 'Cholesterol']].agg(['mean', 'count']))

## Exercise: Working with Medical Data

Now it's your turn to practice! Try to complete the following tasks using the patients DataFrame we created above:

1. Calculate the average age of smokers vs. non-smokers
2. Find the patient with the highest cholesterol level
3. Create a new column called 'Risk_Category' that categorizes patients as:
   - 'High Risk' if they are both diabetic and smokers
   - 'Medium Risk' if they are either diabetic or smokers (but not both)
   - 'Low Risk' if they are neither diabetic nor smokers
4. Count how many patients fall into each risk category

In [None]:
# 1. Calculate average age of smokers vs. non-smokers
print("\nAverage Age by Smoking Status:")

# 2. Find patient with highest cholesterol
print("\nPatient with Highest Cholesterol:")

# 3. Create Risk_Category column
print("\nRisk Category Column:")

# 4. Count patients in each risk category
print("\nPatients in Each Risk Category:")

<a id="data-acquisition"></a>
# 3. Data Acquisition and Preparation

In this section, we'll learn how to acquire data from different sources and prepare it for analysis. These are crucial steps in any data science project, especially in healthcare where data quality directly impacts patient outcomes.

## 3.1 Loading Data from Different Sources

Let's explore how to load data from various sources commonly used in healthcare.

CSV (Comma-Separated Values) files are one of the most common formats for storing tabular data. Let's load a synthetic diabetes dataset.

The dataset used here is synthetic. Synthetic data is artificially generated data that mimics the statistical properties and patterns of real-world data without containing any actual patient information. In healthcare, it is created with techniques such as statistical modeling, machine learning algorithms, or generative AI to produce realistic but entirely fictional medical records, images, or clinical narratives.

Key points about synthetic data in healthcare:

- It helps preserve privacy by greatly reducing the risk of exposing protected health information (PHI)
- It can be used to augment limited datasets for training AI models
- It helps overcome data sharing restrictions in medical research
- It can be designed to represent rare conditions or diverse patient populations
- It enables testing of algorithms in scenarios that might be rare in real data
- It addresses regulatory concerns related to HIPAA and other privacy laws

The synthetic diabetes dataset contains the following columns:
- PatientAge: Age of the patient in years
- BodyMassIndex: BMI value (weight in kg / height in m²)
- BloodGlucose: Blood glucose level in mg/dL
- SystolicBP: Systolic blood pressure in mmHg
- InsulinLevel: Insulin level in μU/mL
- SkinFold: Skin fold thickness in mm
- FamilyHistory: Diabetes pedigree function (a score of diabetes genetic influence)
- ActivityLevel: Physical activity level (0-4, where 0=Sedentary, 4=Very Active)
- DiabetesStatus: Target variable (1=Diabetic, 0=Non-diabetic)
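
To make the idea of synthetic data more concrete, the following sketch generates a small table with a few of the columns listed above. The distributions, parameters, and the labeling rule are illustrative assumptions chosen for this example; they are not the procedure used to create `synthetic_diabetes.csv`, and `toy_synthetic` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical generator: all distributions and parameters are illustrative assumptions
rng = np.random.default_rng(seed=42)
n_patients = 200

toy_synthetic = pd.DataFrame({
    "PatientAge": rng.integers(18, 90, size=n_patients),  # ages 18 to 89
    "BodyMassIndex": rng.normal(loc=27, scale=5, size=n_patients).clip(15, 50).round(1),
    "BloodGlucose": rng.normal(loc=110, scale=30, size=n_patients).clip(60, 300).round(0),
    "ActivityLevel": rng.integers(0, 5, size=n_patients),  # codes 0 to 4
})

# Toy labeling rule: higher glucose raises the probability of a positive label
risk = 1 / (1 + np.exp(-(toy_synthetic["BloodGlucose"] - 125) / 15))
toy_synthetic["DiabetesStatus"] = (rng.random(n_patients) < risk).astype(int)

print(toy_synthetic.head())
print(toy_synthetic["DiabetesStatus"].value_counts())
```

No row in this table corresponds to a real person, yet the columns still have plausible ranges and a built-in relationship between glucose and the label.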

In [None]:
# The dataset path
synthetic_diabetes_url = "https://raw.githubusercontent.com/MoLue/wft_digital_medicine/main/data/synthetic_diabetes.csv"

# Load the dataset
diabetes_df = pd.read_csv(synthetic_diabetes_url)


# Uncomment the following lines if you want to load the dataset from a local file instead
# This assumes the file is in the 'data' directory next to this notebook
# file_path = os.path.join('data', 'synthetic_diabetes.csv')
# diabetes_df = pd.read_csv(file_path)

# Display the first few rows
# (printed explicitly because only the last expression in a cell is displayed automatically)
print("Synthetic Diabetes Dataset:")
print(diabetes_df.head())

# Display basic information about the dataset
print("\nDataset Information:")
diabetes_df.info()

# Display statistical summary
print("\nStatistical Summary:")
diabetes_df.describe()

## 3.2 Handling Missing Values in Medical Data

Missing values are common in healthcare datasets and can significantly impact analysis results. Let's explore different approaches to identify and handle missing values in our Synthetic Diabetes Dataset.

In [None]:
# Check for missing values in the dataset
print("\nMissing values in each column:")
print(diabetes_df.isnull().sum())

# In medical datasets, missing values might be represented as zeros or extreme values
# Let's identify potential implicit missing values (zeros in certain columns where zero is not physiologically possible)
print("\nPotential implicit missing values (zeros in columns where zero is physiologically unlikely):")
for column in ['BloodGlucose', 'SystolicBP', 'InsulinLevel', 'BodyMassIndex']:
    zero_count = len(diabetes_df[diabetes_df[column] == 0])
    print(f"{column}: {zero_count} zeros ({zero_count/len(diabetes_df)*100:.1f}% of data)")

# Create a copy of the dataset to work with
diabetes_clean = diabetes_df.copy()

# Method 1: Replace with mean (for numerical data)
print("\nMethod 1: Replace with mean")
for column in ['BloodGlucose', 'SystolicBP', 'InsulinLevel', 'BodyMassIndex']:
    # Replace zeros with NaN
    diabetes_clean[column] = diabetes_clean[column].replace(0, np.nan)
    # Replace NaN with mean
    mean_value = diabetes_clean[column].mean()
    diabetes_clean[column] = diabetes_clean[column].fillna(mean_value)
    print(f"{column} - Mean value used for replacement: {mean_value:.2f}")

# Method 2: Replace with median (more robust to outliers)
print("\nMethod 2: Replace with median (more robust to outliers)")
diabetes_clean2 = diabetes_df.copy()
for column in ['BloodGlucose', 'SystolicBP', 'InsulinLevel', 'BodyMassIndex']:
    # Replace zeros with NaN
    diabetes_clean2[column] = diabetes_clean2[column].replace(0, np.nan)
    # Replace NaN with median
    median_value = diabetes_clean2[column].median()
    diabetes_clean2[column] = diabetes_clean2[column].fillna(median_value)
    print(f"{column} - Median value used for replacement: {median_value:.2f}")

# Method 3: KNN Imputation (more sophisticated approach)
print("\nMethod 3: KNN Imputation")
from sklearn.impute import KNNImputer

# Create a copy of the dataset
diabetes_clean3 = diabetes_df.copy()

# Replace zeros with NaN
for column in ['BloodGlucose', 'SystolicBP', 'InsulinLevel', 'BodyMassIndex']:
    diabetes_clean3[column] = diabetes_clean3[column].replace(0, np.nan)

# Apply KNN imputation
imputer = KNNImputer(n_neighbors=5)
diabetes_imputed = pd.DataFrame(imputer.fit_transform(diabetes_clean3), 
                                columns=diabetes_clean3.columns)

print("Before and after KNN imputation (first 5 rows):")
print("\nBefore imputation:")
print(diabetes_clean3.head())
print("\nAfter imputation:")
print(diabetes_imputed.head())

## 3.3 Data Cleaning Techniques for Medical Data

Clean data is crucial for accurate analysis and modeling, especially in healthcare where decisions can impact patient care. Let's explore techniques to clean our synthetic diabetes dataset.

## Data Cleaning and Preprocessing Fundamentals

The following techniques demonstrate essential preprocessing steps for medical datasets:

### 1. Outlier Detection and Handling

Outliers are extreme values that deviate significantly from other observations. In medical data, outliers might represent measurement errors or genuinely unusual cases.

**The Interquartile Range (IQR) Method:**
- Calculate Q1 (25th percentile) and Q3 (75th percentile)
- Compute IQR = Q3 - Q1
- Define boundaries: Lower bound = Q1 - 1.5×IQR, Upper bound = Q3 + 1.5×IQR
- Values outside these boundaries are considered outliers

This method is particularly useful for clinical variables like blood glucose, BMI, and insulin levels where extreme values can skew analysis.
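
As a small worked example of the IQR rule, the sketch below applies the four steps to ten made-up glucose readings. The values are hypothetical and chosen so that one reading is clearly extreme.

```python
import numpy as np

# Hypothetical fasting glucose readings (mg/dL); 310 is a deliberately extreme value
glucose = np.array([88, 92, 95, 99, 101, 104, 108, 112, 118, 310])

# Steps 1 and 2: quartiles and interquartile range
q1, q3 = np.percentile(glucose, [25, 75])
iqr = q3 - q1

# Step 3: boundaries at 1.5 x IQR beyond the quartiles
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Step 4: values outside the boundaries are flagged as outliers
outliers = glucose[(glucose < lower_bound) | (glucose > upper_bound)]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Bounds: {lower_bound} to {upper_bound}")
print(f"Outliers: {outliers}")
```

Here Q1 is 96 and Q3 is 111, so the boundaries are 73.5 and 133.5 mg/dL and only the reading of 310 is flagged. Whether such a value is an error or a genuinely unusual patient still needs clinical judgment.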

### 2. Outlier Treatment with Winsorization

Winsorization (capping) replaces extreme values with less extreme ones rather than removing them entirely:
- Values below the lower bound are set to the lower bound
- Values above the upper bound are set to the upper bound

This approach preserves the overall data structure while reducing the impact of extreme values on statistical analyses and machine learning models.
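
The sketch below illustrates capping on a handful of made-up BMI values, using the IQR boundaries as limits. Note that classical winsorization caps at fixed percentiles (for example the 5th and 95th); capping at the IQR boundaries, as done here, is a closely related variant. The data are hypothetical.

```python
import pandas as pd

# Hypothetical BMI values; 61.0 is an extreme entry
bmi = pd.Series([19.5, 22.0, 23.4, 24.8, 26.1, 27.3, 29.0, 61.0], name="BodyMassIndex")

q1, q3 = bmi.quantile(0.25), bmi.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip() replaces values outside the limits with the nearest limit
bmi_capped = bmi.clip(lower=lower_bound, upper=upper_bound)

print(f"Mean before capping: {bmi.mean():.2f}")
print(f"Mean after capping:  {bmi_capped.mean():.2f}")
print(f"Rows kept: {len(bmi_capped)} of {len(bmi)}")
```

All eight rows survive; only the extreme value is pulled in to the upper boundary, which moves the mean closer to the bulk of the data.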

### 3. Standardizing Categorical Data

In healthcare datasets, categorical variables often need standardization to ensure consistency:
- Converting numeric codes to meaningful labels (e.g., activity levels)
- Ensuring consistent terminology across the dataset
- Standardizing units and measurement scales

This improves interpretability and facilitates proper analysis of categorical medical data.
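
A minimal sketch of both ideas follows: mapping numeric activity codes to labels and harmonizing inconsistent text entries. The `Sex` column and its spellings are hypothetical and only serve to show inconsistent terminology; the activity labels follow the 0 to 4 coding used for `ActivityLevel`.

```python
import pandas as pd

# Hypothetical records with numeric codes and inconsistently typed text
records = pd.DataFrame({
    "ActivityLevel": [0, 2, 4, 1, 3],
    "Sex": ["m", "F ", "female", "Male", "f"],
})

# Convert numeric codes to meaningful labels
activity_labels = {0: "Sedentary", 1: "Light", 2: "Moderate", 3: "Active", 4: "Very Active"}
records["ActivityLabel"] = records["ActivityLevel"].map(activity_labels)

# Harmonize terminology: trim whitespace, lowercase, then map to one spelling
sex_map = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}
records["Sex"] = records["Sex"].str.strip().str.lower().map(sex_map)

print(records)
```

Values that are missing from a mapping become `NaN` with `map`, which makes unexpected categories easy to spot.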

### 4. Duplicate Detection and Removal

Duplicate records can arise from:
- Multiple patient visits
- Data entry errors
- Merging of datasets

Removing duplicates ensures that each observation is counted only once, preventing bias in statistical analyses and machine learning models.
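
The sketch below separates two situations that are easy to confuse: rows that are exact copies of each other, and legitimate repeat visits by the same patient. The table and its column names (`PatientID`, `VisitDate`) are hypothetical.

```python
import pandas as pd

# Hypothetical visit table: P002 was entered twice, P003 has two genuine visits
visits = pd.DataFrame({
    "PatientID": ["P001", "P002", "P002", "P003", "P003"],
    "VisitDate": ["2025-01-10", "2025-01-12", "2025-01-12", "2025-02-01", "2025-03-15"],
    "BloodGlucose": [98, 142, 142, 110, 104],
})

# Exact duplicates: every column is identical
exact_duplicates = visits.duplicated().sum()
deduplicated = visits.drop_duplicates()

# One row per patient (here: the most recent visit), if the analysis requires it
latest_per_patient = (
    deduplicated.sort_values("VisitDate")
    .drop_duplicates(subset="PatientID", keep="last")
)

print(f"Exact duplicate rows: {exact_duplicates}")
print(f"Rows after removing exact duplicates: {len(deduplicated)}")
print(latest_per_patient)
```

Dropping exact copies is usually safe. Reducing repeat visits to one row per patient is an analysis decision, since it discards real measurements.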

These preprocessing techniques form the foundation of reliable healthcare data analysis, ensuring that subsequent modeling efforts are based on clean, consistent data.

In [None]:
# Create a working copy
diabetes_working = diabetes_imputed.copy()

# 1. Handling Outliers
print("\n1. Detecting and Handling Outliers")

# Function to detect outliers using IQR method
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Check for outliers in key columns
for column in ['BloodGlucose', 'BodyMassIndex', 'InsulinLevel']:
    outliers, lower, upper = detect_outliers_iqr(diabetes_working, column)
    print(f"\nOutliers in {column}: {len(outliers)} values")
    print(f"Lower bound: {lower:.2f}, Upper bound: {upper:.2f}")
    print(f"Min value: {diabetes_working[column].min():.2f}, Max value: {diabetes_working[column].max():.2f}")
    
    # Display a few outliers if they exist
    if len(outliers) > 0:
        print("Sample outliers:")
        print(outliers[[column]].head(3))

# 2. Handling outliers - Capping method
print("\n2. Handling outliers - Capping method (Winsorization)")
diabetes_capped = diabetes_working.copy()

for column in ['BloodGlucose', 'BodyMassIndex', 'InsulinLevel']:
    _, lower, upper = detect_outliers_iqr(diabetes_capped, column)
    # Cap the values
    diabetes_capped[column] = diabetes_capped[column].clip(lower=lower, upper=upper)
    print(f"{column} after capping - Min: {diabetes_capped[column].min():.2f}, Max: {diabetes_capped[column].max():.2f}")

# 3. Standardizing text data (if applicable)
print("\n3. Standardizing Text Data (Example)")
print("In medical datasets, text fields like diagnoses or medications often need standardization.")
print("Example code for standardizing a categorical variable:")
print("diabetes_df['ActivityLevel'] = diabetes_df['ActivityLevel'].replace({0: 'Sedentary', 1: 'Light', 2: 'Moderate', 3: 'Active', 4: 'Very Active'})")

# 4. Removing duplicates
print("\n4. Checking and Removing Duplicates")
duplicates = diabetes_working.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")
if duplicates > 0:
    diabetes_working = diabetes_working.drop_duplicates()
    print(f"After removing duplicates: {len(diabetes_working)} rows")


## 3.4 Data Validation Methods

Data validation ensures the quality and reliability of our dataset. This is particularly important in medical research where data quality directly impacts conclusions that may affect patient care.
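
Before the exercise, here is a sketch of range validation on a separate toy table, so the exercise itself is not given away. The columns and the plausibility limits are assumptions for illustration only, not clinical reference ranges.

```python
import pandas as pd

# Hypothetical vital signs with two implausible entries
vitals = pd.DataFrame({
    "HeartRate": [72, 65, 310, 88],               # bpm; 310 is implausible
    "BodyTemperature": [36.8, 37.4, 36.5, 3.7],   # degrees Celsius; 3.7 looks like a typo
})

# Illustrative plausibility limits (assumptions, not clinical reference ranges)
plausible_ranges = {"HeartRate": (20, 250), "BodyTemperature": (30.0, 43.0)}

# Collect the row indices that fall outside each range
flagged = {
    column: vitals.index[~vitals[column].between(low, high)].tolist()
    for column, (low, high) in plausible_ranges.items()
}

for column, rows in flagged.items():
    print(f"{column}: {len(rows)} value(s) out of range at rows {rows}")
```

Flagged rows should be reviewed rather than deleted automatically, because an out-of-range value can be a data entry error or a real but rare finding.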


In [None]:
# EXERCISE:
# 1. Range validation
print("\n1. Range Validation")
# TODO: Choose appropriate ranges for each variable based on medical knowledge
# (the (0, 0) placeholders below flag almost every value until you replace them)
validation_ranges = {
    'PatientAge': (0, 0),
    'BloodGlucose': (0, 0),
    'SystolicBP': (0, 0),
    'InsulinLevel': (0, 0),
    'BodyMassIndex': (0, 0),
    'FamilyHistory': (0, 0),
    'ActivityLevel': (0, 0)
}

# Check if values are within expected ranges
for column, (min_val, max_val) in validation_ranges.items():
    out_of_range = diabetes_working[(diabetes_working[column] < min_val) | 
                                   (diabetes_working[column] > max_val)]
    if len(out_of_range) > 0:
        print(f"{column}: {len(out_of_range)} values out of expected range ({min_val}-{max_val})")
    else:
        print(f"{column}: All values within expected range")


In [None]:
# More validation approaches

# 2. Consistency checks
print("\n2. Consistency Checks")
# Example: In diabetes data, patients with high glucose should generally have higher insulin levels
# This is a simplified example and not always medically accurate
high_glucose = diabetes_working[diabetes_working['BloodGlucose'] > 180]
low_insulin_count = high_glucose[high_glucose['InsulinLevel'] < 50].shape[0]
print(f"Potential inconsistency: {low_insulin_count} patients with high glucose (>180) but low insulin (<50)")

# 3. Completeness check
print("\n3. Completeness Check")
completeness = (diabetes_working.count() / len(diabetes_working)) * 100
print("Completeness percentage for each column:")
for column, percentage in completeness.items():
    print(f"{column}: {percentage:.2f}%")

# 4. Data type validation
print("\n4. Data Type Validation")
print(diabetes_working.dtypes)

# 5. Creating a validation report function
print("\n5. Creating a Validation Report Function")

def validate_diabetes_data(df):
    """Validate diabetes dataset and return a report of issues found."""
    issues = []
    
    # Check for missing values
    missing = df.isnull().sum()
    if missing.sum() > 0:
        for column, count in missing.items():
            if count > 0:
                issues.append(f"Missing values in {column}: {count}")
    
    # Check for out-of-range values
    for column, (min_val, max_val) in validation_ranges.items():
        out_of_range = df[(df[column] < min_val) | (df[column] > max_val)]
        if len(out_of_range) > 0:
            issues.append(f"Out-of-range values in {column}: {len(out_of_range)}")
    
    # Check for potential inconsistencies
    high_glucose = df[df['BloodGlucose'] > 180]
    low_insulin_count = high_glucose[high_glucose['InsulinLevel'] < 50].shape[0]
    if low_insulin_count > 0:
        issues.append(f"Potential inconsistency: {low_insulin_count} patients with high glucose but low insulin")
    
    # Return report
    if issues:
        return f"Validation issues found ({len(issues)}):\n" + "\n".join(issues)
    else:
        return "Validation passed: No issues found."

# Run validation on our cleaned dataset
validation_report = validate_diabetes_data(diabetes_working)
print(validation_report)

## 3.5 Data Transformation Techniques

Data transformation is often necessary to prepare data for analysis and modeling. Let's explore common transformation techniques used in medical data science.
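
Two of the most common transformations are simple formulas: min-max scaling computes `(x - min) / (max - min)` and standardization computes `(x - mean) / std`. The sketch below checks the manual formulas against scikit-learn's `MinMaxScaler` and `StandardScaler` on four made-up glucose values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical glucose values as a single-feature column
glucose = np.array([[80.0], [100.0], [120.0], [200.0]])

# Manual formulas
manual_minmax = (glucose - glucose.min()) / (glucose.max() - glucose.min())
manual_zscore = (glucose - glucose.mean()) / glucose.std()  # population std (ddof=0)

# scikit-learn equivalents
sk_minmax = MinMaxScaler().fit_transform(glucose)
sk_zscore = StandardScaler().fit_transform(glucose)

print("Min-max scaled:", manual_minmax.ravel())
print("Z-scores:      ", manual_zscore.ravel().round(2))
print("Match scikit-learn:", np.allclose(manual_minmax, sk_minmax),
      np.allclose(manual_zscore, sk_zscore))
```

Min-max scaling maps values into the range 0 to 1, while standardization centers them at 0 with unit variance; the extreme value of 200 compresses the other min-max values toward 0.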


In [None]:
# Create a working copy
diabetes_transform = diabetes_working.copy()

# 1. Normalization (Min-Max Scaling)
print("\n1. Normalization (Min-Max Scaling)")
from sklearn.preprocessing import MinMaxScaler

# Select numerical columns to normalize
numeric_cols = ['PatientAge', 'BloodGlucose', 'SystolicBP', 'InsulinLevel', 
                'BodyMassIndex', 'FamilyHistory']

# Apply Min-Max scaling
scaler = MinMaxScaler()
diabetes_transform[numeric_cols] = scaler.fit_transform(diabetes_transform[numeric_cols])

print("After normalization (first 5 rows):")
print(diabetes_transform.head())

# 2. Standardization (Z-score)
print("\n2. Standardization (Z-score)")
from sklearn.preprocessing import StandardScaler

# Create a new copy for standardization
diabetes_standardized = diabetes_working.copy()

# Apply standardization
std_scaler = StandardScaler()
diabetes_standardized[numeric_cols] = std_scaler.fit_transform(diabetes_standardized[numeric_cols])

print("After standardization (first 5 rows):")
print(diabetes_standardized.head())

# 3. Log Transformation (useful for skewed data)
print("\n3. Log Transformation")
# Create a copy for log transformation
diabetes_log = diabetes_working.copy()

# Apply log transformation to insulin (often right-skewed in diabetes data)
# np.log1p computes log(1 + x), so zero values map to 0 instead of -inf
diabetes_log['Insulin_Log'] = np.log1p(diabetes_log['InsulinLevel'])

# Compare distributions
print("Insulin distribution before and after log transformation:")
print(f"Original - Mean: {diabetes_log['InsulinLevel'].mean():.2f}, Std: {diabetes_log['InsulinLevel'].std():.2f}")
print(f"Log transformed - Mean: {diabetes_log['Insulin_Log'].mean():.2f}, Std: {diabetes_log['Insulin_Log'].std():.2f}")

# 4. Binning/Categorization
print("\n4. Binning/Categorization")
# Create age groups
diabetes_binned = diabetes_working.copy()
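# include_lowest=True keeps patients aged exactly 18; ages outside 18-90 become NaN
diabetes_binned['Age_Group'] = pd.cut(diabetes_binned['PatientAge'],
                                     bins=[18, 35, 50, 65, 90],
                                     labels=['18-35', '36-50', '51-65', '66+'],
                                     include_lowest=True)

# Create BMI categories according to WHO classification
# right=False gives the bins [0, 18.5), [18.5, 25), [25, 30), [30, 100), matching interpret_bmi above
diabetes_binned['BMI_Category'] = pd.cut(diabetes_binned['BodyMassIndex'],
                                        bins=[0, 18.5, 25, 30, 100],
                                        labels=['Underweight', 'Normal', 'Overweight', 'Obese'],
                                        right=False)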

# Show the distribution of these categories
print("\nAge Group Distribution:")
print(diabetes_binned['Age_Group'].value_counts().sort_index())

print("\nBMI Category Distribution:")
print(diabetes_binned['BMI_Category'].value_counts().sort_index())

# 5. One-Hot Encoding
print("\n5. One-Hot Encoding")
# One-hot encode the categorical variables we created
diabetes_encoded = pd.get_dummies(diabetes_binned, columns=['Age_Group', 'BMI_Category'])

# Display the new columns created
print("New columns after one-hot encoding:")
new_columns = [col for col in diabetes_encoded.columns if 'Age_Group' in col or 'BMI_Category' in col]
print(new_columns)
print(diabetes_encoded[new_columns].head())

# 6. Feature Engineering
print("\n6. Feature Engineering")
# Create new features that might be relevant for diabetes
diabetes_features = diabetes_working.copy()

# BMI * Glucose interaction (higher values might indicate higher risk)
diabetes_features['BMI_Glucose_Interaction'] = diabetes_features['BodyMassIndex'] * diabetes_features['BloodGlucose'] / 100

# Insulin-to-glucose ratio (a crude proxy related to insulin resistance, not a validated clinical index)
diabetes_features['Insulin_Glucose_Ratio'] = diabetes_features['InsulinLevel'] / diabetes_features['BloodGlucose']

# Age-adjusted diabetes pedigree
diabetes_features['Age_Adjusted_Pedigree'] = diabetes_features['FamilyHistory'] * (diabetes_features['PatientAge'] / 50)

print("Newly engineered features (first 5 rows):")
print(diabetes_features[['BMI_Glucose_Interaction', 'Insulin_Glucose_Ratio', 'Age_Adjusted_Pedigree']].head())

# 7. Saving the processed dataset
print("\n7. Saving the Processed Dataset")
print("Code to save the processed dataset:")
print("diabetes_processed = diabetes_features  # or whichever version you prefer")
print("diabetes_processed.to_csv('processed_synthetic_diabetes.csv', index=False)")


# 4. Simple Predictive Modeling

This section introduces basic predictive modeling concepts using the diabetes dataset.

## 4.1 Introduction to Predictive Modeling Concepts

Predictive modeling uses statistical techniques to predict outcomes from current data. In healthcare, predictive models can help:

- Identify patients at risk for developing certain conditions
- Predict treatment outcomes
- Forecast disease progression
- Support clinical decision-making

Let's explore the key concepts of predictive modeling:

In [None]:
# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error, r2_score
import os

# Set some styling in a single call
# (calling sns.set(font_scale=1.2) on its own would reset the style to seaborn's default)
sns.set(style="whitegrid", font_scale=1.2)

## Predictive Modeling Workflow in Healthcare

### The Predictive Modeling Process

1. **Problem Definition**: Determine what we want to predict
   - In healthcare, this might be disease risk, treatment outcomes, or patient readmission probability

2. **Data Collection**: Gather relevant data
   - Medical records, lab results, imaging data, patient surveys, etc.

3. **Data Preprocessing**: Clean and prepare data for modeling
   - Handle missing values, outliers, and inconsistencies
   - Normalize or standardize numerical features
   - Encode categorical variables

4. **Feature Selection**: Identify variables most relevant to our prediction
   - Reduce dimensionality and focus on clinically relevant features
   - Use statistical methods or domain knowledge to select important predictors

5. **Model Selection**: Choose appropriate algorithm(s)
   - Based on the nature of the prediction task and data characteristics
   - Consider interpretability requirements in medical contexts

6. **Training**: Fit the model to training data
   - Optimize model parameters to best capture patterns in the data

7. **Evaluation**: Assess model performance
   - Use appropriate metrics (accuracy, sensitivity, specificity, AUC, etc.)
   - Validate against clinical standards

8. **Deployment**: Apply the model to make predictions
   - Integrate into clinical workflows or decision support systems

9. **Monitoring**: Track model performance over time
   - Ensure continued accuracy as patient populations change
   - Update as medical knowledge evolves
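
Steps 3 to 7 of the workflow above can be sketched in a few lines with a scikit-learn `Pipeline`. This is a minimal illustration, not the notebook's own method: the data is synthetic, and the names `X_demo`, `y_demo`, and `workflow` are made up for this example. Bundling imputation and scaling into the pipeline means both are learned from the training data only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the diabetes dataset): 200 "patients", 3 features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

# Simulate roughly 5% missing values
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# Hold out 25% of the patients for evaluation
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Preprocessing + model in one object; fit() learns everything from X_tr only
workflow = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
workflow.fit(X_tr, y_tr)

test_accuracy = workflow.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Deployment and monitoring (steps 8 and 9) are organizational as much as technical and are not covered by this sketch.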

### Types of Predictive Models

1. **Classification Models**: Predict categorical outcomes
   - Example: Predicting whether a patient will develop diabetes (yes/no)
   - Common algorithms: Logistic Regression, Random Forest, Support Vector Machines
   - Evaluation metrics: Accuracy, Precision, Recall, F1-score, AUC-ROC

2. **Regression Models**: Predict continuous values
   - Example: Predicting a patient's future blood glucose level
   - Common algorithms: Linear Regression, Ridge/Lasso Regression, Gradient Boosting
   - Evaluation metrics: RMSE, MAE, R-squared

### Training and Testing Data

- **Training data**: Used to build the model (~70-80% of data)
  - The model learns patterns and relationships from this subset

- **Testing data**: Used to evaluate model performance (~20-30% of data)
  - Simulates how the model will perform on new, unseen patients
  - Helps detect overfitting (when a model performs well on training data but poorly on new data)

- This split is crucial in healthcare applications where model generalizability directly impacts patient care
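
As a small illustration of the split described above, the sketch below divides 100 made-up patients into training and test sets. It uses the optional `stratify` argument of `train_test_split`, which is an addition for this example: it keeps the share of positive cases similar in both subsets, which can matter when the condition is uncommon.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 patients, 20 of them with the condition (label 1)
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)  # a single placeholder feature

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(f"Training set: {len(y_a)} patients, share of positives: {y_a.mean():.2f}")
print(f"Test set:     {len(y_b)} patients, share of positives: {y_b.mean():.2f}")
```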

In [None]:
# Basic cleaning for demonstration
# Convert empty strings to NaN; physiologically implausible zeros are converted in the loop below
diabetes_df = diabetes_df.replace('', np.nan)
numeric_cols = ['PatientAge', 'BodyMassIndex', 'BloodGlucose', 'SystolicBP', 
                'InsulinLevel', 'SkinFold', 'FamilyHistory', 'ActivityLevel']

for col in numeric_cols:
    diabetes_df[col] = pd.to_numeric(diabetes_df[col], errors='coerce')
    # Replace zeros with NaN for columns where zero is not physiologically possible
    if col in ['BloodGlucose', 'SystolicBP', 'InsulinLevel', 'BodyMassIndex']:
        diabetes_df[col] = diabetes_df[col].replace(0, np.nan)
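    # Fill NaN with the column mean (fine for a demo; in a real project, compute the mean
    # on the training set only to avoid leaking information from the test set)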
    diabetes_df[col] = diabetes_df[col].fillna(diabetes_df[col].mean())

# Ensure target variable is properly formatted
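# Fill any NaN values in DiabetesStatus with 0 for this demo. Note that this labels patients
# with unknown status as non-diabetic; in a real analysis, dropping those rows is usually safer.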
diabetes_df['DiabetesStatus'] = diabetes_df['DiabetesStatus'].fillna(0).astype(int)

# Display the first few rows of the cleaned dataset
print("Cleaned Diabetes Dataset:")
display(diabetes_df.head())

# Demonstrate the train-test split
# YOUR CODE HERE:
# Choose the feature columns (X) and the target column (y) from:
# 'PatientAge', 'BodyMassIndex', 'BloodGlucose', 'SystolicBP', 'DiabetesStatus',
# 'InsulinLevel', 'SkinFold', 'FamilyHistory', 'ActivityLevel'
# Note: the split below raises a NameError until X and y are defined.
#
# X = diabetes_df[[]]
# y = diabetes_df[]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("\nData split for modeling:")
print(f"Total dataset size: {len(diabetes_df)} patients")
print(f"Training set: {len(X_train)} patients ({len(X_train)/len(diabetes_df)*100:.1f}%)")
print(f"Testing set: {len(X_test)} patients ({len(X_test)/len(diabetes_df)*100:.1f}%)")

## 6.2 Simple Classification Example: Predicting Diabetes

Classification models predict categorical outcomes. In this example, we'll build a simple model to predict whether a patient has diabetes based on their health indicators.

We'll use logistic regression, a fundamental classification algorithm that's both interpretable and commonly used in medical research.

In [None]:
# Logistic Regression for Diabetes Classification

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Prepare the data


# YOUR CODE HERE:
# Select features that are clinically relevant for diabetes prediction
features = ['...']
target = ''
X = diabetes_df[features]
y = diabetes_df[target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the logistic regression model
# C is the inverse of regularization strength: a larger C means weaker regularization,
# so the model follows the training data more closely (at a higher risk of overfitting)

# YOUR CODE HERE:
# Experiment with the hyperparameters and observe how the results change.
# The following values are reasonable starting points.
C = 10
max_iter = 1000 


log_reg = LogisticRegression(C=C, max_iter=max_iter, random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred = log_reg.predict(X_test_scaled)
y_prob = log_reg.predict_proba(X_test_scaled)[:, 1]  # Probability of diabetes

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)


In [None]:
# Let's look at the results
print("Logistic Regression Model for Diabetes Prediction")
print(f"Accuracy: {accuracy:.2f}")

print("\nConfusion Matrix:")
print(conf_matrix)
print("\nTrue Negatives (correctly predicted non-diabetic):", conf_matrix[0, 0])
print("False Positives (incorrectly predicted diabetic):", conf_matrix[0, 1])
print("False Negatives (incorrectly predicted non-diabetic):", conf_matrix[1, 0])
print("True Positives (correctly predicted diabetic):", conf_matrix[1, 1])

print("\nClassification Report:")
print(class_report)

# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Non-Diabetic', 'Diabetic'],
            yticklabels=['Non-Diabetic', 'Diabetic'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix for Diabetes Prediction')
plt.show()


Take a moment to think about these results: what does the model actually tell you, and what does it not tell you?

Next, let us look at feature importance, that is, which features contribute most to the model's predictions.

We can do this by inspecting the coefficients of the logistic regression model. Each coefficient describes the direction and strength of the association between a feature and the log-odds of the outcome, holding the other features constant. A positive coefficient means that higher feature values raise the predicted probability of diabetes; a negative coefficient means they lower it. Because the features were standardized, the coefficients can be compared across features: the larger the absolute value, the stronger the association.
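
Clinicians often prefer odds ratios to raw coefficients. The sketch below shows the conversion, `exp(coefficient)`, on synthetic data in which only the made-up `glucose` column influences the outcome. All names here (`toy`, `toy_model`, `odds_ratios`) are illustrative and not part of the notebook's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: the outcome depends on glucose, not on age
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "glucose": rng.normal(120, 30, n),
    "age": rng.normal(50, 12, n),
})
true_logit = 0.04 * (toy["glucose"] - 120)
toy["outcome"] = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

# Standardize so that each coefficient refers to a one-standard-deviation increase
X_std = StandardScaler().fit_transform(toy[["glucose", "age"]])
toy_model = LogisticRegression(max_iter=1000).fit(X_std, toy["outcome"])

# exp(coefficient) = multiplicative change in the odds per one-SD increase
odds_ratios = pd.Series(np.exp(toy_model.coef_[0]), index=["glucose", "age"])
print(odds_ratios.round(2))
```

An odds ratio above 1 means the odds of the outcome rise as the feature increases, a value below 1 means they fall, and a value near 1 suggests little association in this model.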

In [None]:
# Analyze feature importance
coefficients = log_reg.coef_[0]
feature_importance = pd.DataFrame({'Feature': features, 'Coefficient': coefficients})
feature_importance = feature_importance.sort_values('Coefficient', ascending=False)

print("\nFeature Importance (Coefficients):")
display(feature_importance)

# Visualize feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Coefficient', y='Feature', data=feature_importance)
plt.title('Feature Importance for Diabetes Prediction')
plt.axvline(x=0, color='gray', linestyle='--')
plt.show()


In [None]:
# Medical interpretation of the model
print("\nMedical Interpretation:")
print("1. The model achieved an accuracy of {:.1f}%, meaning it correctly classified {:.1f}% of the patients in the test set.".format(
    accuracy*100, accuracy*100))
print("2. What else? Think about what you just found out so far.")


In [None]:
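# Demonstrate prediction for a new patient
# Note: the columns below must match the `features` list used for training (same names, same order);
# adjust this example if you selected different features.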
print("\nPrediction Example for a New Patient:")
new_patient = pd.DataFrame({
    'PatientAge': [55],
    'BodyMassIndex': [32],  # Obese
    'BloodGlucose': [140],  # Elevated
    'SystolicBP': [150],    # Hypertension
    'InsulinLevel': [150],
    'FamilyHistory': [0.8]  # Strong family history
})

print("New Patient Data:")
display(new_patient)

# Scale the new patient data
new_patient_scaled = scaler.transform(new_patient)

# Make prediction
prediction = log_reg.predict(new_patient_scaled)[0]
probability = log_reg.predict_proba(new_patient_scaled)[0, 1]

print(f"Prediction: {'Diabetic' if prediction == 1 else 'Non-Diabetic'}")
print(f"Probability of Diabetes: {probability:.2f} ({probability*100:.1f}%)")


In [None]:
# Attempt a clinical interpretation by setting meaningful probability thresholds for this hypothetical scenario.
# The values below are placeholders: with a threshold of 1.0, the "high risk" branch can never trigger.

high_risk_probability = 1.0  # Choose a better value based on your analysis so far
moderate_risk_probability = 0.1  # Choose a better value based on your analysis so far

print("\nClinical Interpretation:")
if probability > high_risk_probability:
    print("High risk of diabetes. Recommend comprehensive screening and lifestyle intervention.")
elif probability > moderate_risk_probability:
    print("Moderate risk of diabetes. Recommend follow-up testing and preventive measures.")
else:
    print("Low risk of diabetes. Recommend routine screening as per age-appropriate guidelines.")

## 6.3 Regression Example: Predicting Blood Glucose Levels

Regression models predict continuous values. In this example, we'll build a model to predict a patient's blood glucose level based on other health indicators.

This type of model could help identify factors that contribute to elevated blood glucose and potentially predict future glucose levels based on current health metrics.
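
Before fitting a regression model, it can help to see how the usual regression metrics are computed. The sketch below works through MAE, RMSE, and R² by hand on four invented glucose values; the numbers are purely illustrative.

```python
import numpy as np

# Four invented patients: measured vs. predicted blood glucose (mg/dL)
actual = np.array([100.0, 120.0, 140.0, 160.0])
predicted = np.array([110.0, 115.0, 150.0, 150.0])

errors = predicted - actual  # [10, -5, 10, -10]

# Mean absolute error: average size of the errors
demo_mae = np.mean(np.abs(errors))

# Root mean squared error: like MAE, but large errors count more
demo_rmse = np.sqrt(np.mean(errors ** 2))

# R-squared: 1 minus (unexplained variation / total variation)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
demo_r2 = 1 - ss_res / ss_tot

print(f"MAE:  {demo_mae:.2f} mg/dL")
print(f"RMSE: {demo_rmse:.2f} mg/dL")
print(f"R²:   {demo_r2:.2f}")
```

RMSE is never smaller than MAE; a large gap between the two hints that a few predictions are far off.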

In [None]:
# Linear Regression for Blood Glucose Prediction

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Prepare the data
# Select features that might influence blood glucose levels

# YOUR CODE HERE:
# How do you select the features and target now?

features = []  # Update this list
target = ''  # Update this string
X = diabetes_df[features]
y = diabetes_df[target]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train the linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)

# Make predictions
y_pred = lin_reg.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print("Linear Regression Model for Blood Glucose Prediction")
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f} mg/dL")
print(f"R² Score: {r2:.2f}")

# Visualize actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
plt.xlabel('Actual Blood Glucose (mg/dL)')
plt.ylabel('Predicted Blood Glucose (mg/dL)')
plt.title('Actual vs Predicted Blood Glucose Levels')
plt.show()

# Analyze coefficients
coefficients = pd.DataFrame({'Feature': features, 'Coefficient': lin_reg.coef_})
coefficients = coefficients.sort_values('Coefficient', ascending=False)

print("\nModel Coefficients:")
display(coefficients)

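# Visualize coefficients
# Note: these features are not standardized, so each coefficient is expressed per unit of
# its own feature and the bar lengths are not directly comparable across features.
plt.figure(figsize=(10, 6))
sns.barplot(x='Coefficient', y='Feature', data=coefficients)
plt.title('Linear Regression Coefficients for Blood Glucose Prediction')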
plt.axvline(x=0, color='gray', linestyle='--')
plt.show()

# Medical interpretation
print("\nMedical Interpretation:")
print(f"1. The model explains {r2*100:.1f}% of the variation in blood glucose levels in the test set.")
print(f"2. The typical prediction error (RMSE) is about {rmse:.1f} mg/dL.")
print("3. Check the coefficient table: does insulin level show a positive relationship with blood glucose?")
print("4. Is BMI positively associated with blood glucose levels?")
print("5. Does physical activity level show a negative relationship (more activity, lower glucose)?")

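# Demonstrate prediction for a new patient
# Note: the columns below must match the `features` list used for training (same names, same order).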
print("\nBlood Glucose Prediction Example:")
new_patient = pd.DataFrame({
    'PatientAge': [45],
    'BodyMassIndex': [28],  # Overweight
    'SystolicBP': [130],    # Elevated
    'InsulinLevel': [120],
    'ActivityLevel': [2],   # Moderate activity
    'FamilyHistory': [0.5]  # Moderate family history
})

print("New Patient Data:")
display(new_patient)

# Make prediction
predicted_glucose = lin_reg.predict(new_patient)[0]

print(f"Predicted Blood Glucose: {predicted_glucose:.1f} mg/dL")

## 6.4 Model Evaluation Guide for Beginners

Evaluating predictive models is crucial in healthcare applications where decisions can impact patient care. This section introduces key metrics and approaches for assessing model performance.

In [None]:
# Model Evaluation Metrics and Techniques

from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.metrics import roc_curve, auc, precision_recall_curve

# Let's use our diabetes classification model for demonstration
# Prepare the data (same as before)
features = ['PatientAge', 'BodyMassIndex', 'BloodGlucose', 'SystolicBP', 'InsulinLevel', 'FamilyHistory']
X = diabetes_df[features]
y = diabetes_df['DiabetesStatus']

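# Standardize features (for simplicity the whole dataset is scaled at once; strictly speaking,
# the scaler should be fitted inside each cross-validation fold, e.g. via a Pipeline)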
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create a logistic regression model
log_reg = LogisticRegression(C=10, max_iter=1000, random_state=42)

# 1. Cross-Validation (more robust than a single train-test split)
print("1. Cross-Validation")
cv_scores = cross_val_score(log_reg, X_scaled, y, cv=5, scoring='accuracy')
print(f"Cross-validation accuracy scores: {cv_scores}")
print(f"Mean CV accuracy: {cv_scores.mean():.2f}")
print(f"Standard deviation: {cv_scores.std():.2f}")

# 2. ROC Curve and AUC
print("\n2. ROC Curve and AUC")
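# Train the model on the full dataset for demonstration
# Note: the curves below are computed on the same data the model was trained on, so they are
# optimistic; use held-out test data for a realistic estimate.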
log_reg.fit(X_scaled, y)
y_prob = log_reg.predict_proba(X_scaled)[:, 1]

# Calculate ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y, y_prob)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

print(f"Area Under the ROC Curve (AUC): {roc_auc:.2f}")
print("- AUC of 0.5 suggests no discrimination (equivalent to random guessing)")
print("- AUC of 1.0 suggests perfect discrimination")
print("- In medical contexts, AUC > 0.8 is often considered good")

# 3. Precision-Recall Curve (especially useful for imbalanced datasets)
print("\n3. Precision-Recall Curve")
precision, recall, _ = precision_recall_curve(y, y_prob)

# Plot Precision-Recall curve
plt.figure(figsize=(8, 6))
plt.plot(recall, precision, color='blue', lw=2)
plt.xlabel('Recall (Sensitivity)')
plt.ylabel('Precision (Positive Predictive Value)')
plt.title('Precision-Recall Curve')
plt.show()

print("Precision: Proportion of positive identifications that were actually correct")
print("Recall: Proportion of actual positives that were identified correctly")
print("- In screening tests, high recall (sensitivity) is often prioritized")
print("- In confirmatory tests, high precision is often prioritized")

# 4. Learning Curves (to diagnose overfitting/underfitting)
print("\n4. Learning Curves")
train_sizes, train_scores, test_scores = learning_curve(
    log_reg, X_scaled, y, cv=5, scoring='accuracy', 
    train_sizes=np.linspace(0.1, 1.0, 10))

# Calculate mean and standard deviation for training and test scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot learning curve
plt.figure(figsize=(10, 6))
plt.grid()
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color="r")
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1, color="g")
plt.plot(train_sizes, train_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_mean, 'o-', color="g", label="Cross-validation score")
plt.xlabel("Training examples")
plt.ylabel("Accuracy")
plt.title("Learning Curves for Diabetes Classification Model")
plt.legend(loc="best")
plt.show()

print("Learning curves help diagnose model performance issues:")
print("- If training score is much higher than validation score: Model is overfitting")
print("- If both scores are low: Model is underfitting")
print("- If scores converge at a high value: Model is well-fitted")

### Advanced Model Evaluation in Medical Contexts

#### Choosing the Right Evaluation Metrics for Medical Contexts

Different medical applications require different evaluation priorities:

##### a) Screening Tests (e.g., initial diabetes risk assessment):
- **Prioritize Sensitivity/Recall**: Minimize false negatives
- **Key metrics**: Recall, NPV (Negative Predictive Value)
- **Goal**: Identify as many potential cases as possible
- *Clinical rationale*: Missing a patient with disease (false negative) is often more harmful than incorrectly flagging a healthy patient for follow-up

##### b) Diagnostic Tests (e.g., confirming diabetes diagnosis):
- **Prioritize Specificity**: Minimize false positives
- **Key metrics**: Specificity, PPV (Positive Predictive Value/Precision)
- **Goal**: Ensure those diagnosed truly have the condition
- *Clinical rationale*: Incorrect diagnosis (false positive) can lead to unnecessary treatment, anxiety, and resource utilization

##### c) Prognostic Models (e.g., predicting diabetes complications):
- **Prioritize Calibration**: Accurate probability estimates
- **Key metrics**: Calibration plots, Brier score
- **Goal**: Reliable risk stratification
- *Clinical rationale*: Clinicians need accurate risk estimates to make appropriate treatment decisions and counsel patients

##### d) Resource Allocation Models (e.g., targeting interventions):
- **Prioritize overall accuracy and cost-effectiveness**
- **Key metrics**: Accuracy, AUC, cost-benefit analysis
- **Goal**: Optimize resource utilization
- *Clinical rationale*: Limited healthcare resources must be directed where they will have the greatest impact
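
The metrics named above all derive from the four cells of a confusion matrix. The sketch below computes sensitivity, specificity, PPV, and NPV for ten invented patients; the labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Ten invented patients: 1 = has the condition, 0 = does not
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

sensitivity = tp / (tp + fn)  # share of diseased patients who are detected
specificity = tn / (tn + fp)  # share of healthy patients correctly cleared
ppv = tp / (tp + fp)          # share of positive predictions that are correct
npv = tn / (tn + fn)          # share of negative predictions that are correct

print(f"Sensitivity: {sensitivity:.2f}")
print(f"Specificity: {specificity:.2f}")
print(f"PPV:         {ppv:.2f}")
print(f"NPV:         {npv:.2f}")
```

Unlike sensitivity and specificity, PPV and NPV depend on how common the condition is in the population being tested, so they do not transfer directly between settings with different prevalence.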

### Clinical Validation Considerations

Statistical performance is necessary but not sufficient for clinical utility:

- **External validation**: Test on different patient populations
  - Models should perform consistently across diverse healthcare settings and patient demographics

- **Temporal validation**: Test on newer data
  - Healthcare patterns change over time; models must remain accurate with recent data

- **Impact analysis**: Assess whether the model improves clinical outcomes
  - The ultimate test is whether the model leads to better patient outcomes when implemented

- **Fairness**: Ensure the model performs equitably across different patient groups
  - Models should not perpetuate or amplify existing healthcare disparities
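
One simple way to start checking fairness is to compute the same metric separately for each patient group. The sketch below compares sensitivity between two hypothetical groups, "A" and "B"; the data frame `audit` and its values are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import recall_score

# Invented audit table: true status and model prediction for two patient groups
audit = pd.DataFrame({
    "group":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "actual": [1, 1, 0, 0, 1, 1, 0, 0],
    "pred":   [1, 1, 0, 0, 1, 0, 0, 1],
})

# Sensitivity (recall) computed within each group
sens_by_group = {
    name: recall_score(part["actual"], part["pred"])
    for name, part in audit.groupby("group")
}

for name, value in sens_by_group.items():
    print(f"Group {name}: sensitivity = {value:.2f}")
```

A clear gap between groups, as in this toy example, would be a signal to investigate further, for instance whether one group is under-represented in the training data.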

## 6.5 Interpreting Results in Clinical Context

Translating model predictions into clinically meaningful insights is essential for healthcare applications. This section explores how to interpret predictive model results in ways that can inform clinical decision-making.
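
A predicted probability becomes a clinical decision only once a threshold is chosen. The sketch below uses eight invented probabilities to show how moving the threshold trades sensitivity against specificity; the helper `sens_spec` is written for this example and is not a library function.

```python
import numpy as np

# Invented predicted probabilities and true outcomes for eight patients
probs = np.array([0.05, 0.15, 0.30, 0.45, 0.55, 0.70, 0.85, 0.95])
truth = np.array([0, 0, 0, 1, 0, 1, 1, 1])

def sens_spec(threshold, probs=probs, truth=truth):
    """Return (sensitivity, specificity) when flagging probs >= threshold."""
    flagged = probs >= threshold
    tp = np.sum(flagged & (truth == 1))
    fn = np.sum(~flagged & (truth == 1))
    tn = np.sum(~flagged & (truth == 0))
    fp = np.sum(flagged & (truth == 0))
    return tp / (tp + fn), tn / (tn + fp)

for threshold in (0.3, 0.5, 0.7):
    sens, spec = sens_spec(threshold)
    print(f"Threshold {threshold:.1f}: sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

A lower threshold catches more true cases at the cost of more false alarms, which tends to suit screening; a higher threshold does the opposite.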

In [None]:
# Interpreting Predictive Models in Clinical Context

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

# Let's create a simple decision tree model for interpretability
# Prepare the data (similar to before)
features = ['PatientAge', 'BodyMassIndex', 'BloodGlucose', 'SystolicBP', 'InsulinLevel', 'FamilyHistory']
X = diabetes_df[features]
y = diabetes_df['DiabetesStatus']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 1. Decision Trees for Interpretable Predictions
print("1. Decision Trees for Interpretable Predictions")



# YOUR CODE HERE
# Create a simple decision tree (limited depth for interpretability)
# Use the DecisionTreeClassifier from sklearn.tree
# dt_model = ...


# After the model is set, we can predict and evaluate
dt_pred = dt_model.predict(X_test)
dt_accuracy = accuracy_score(y_test, dt_pred)
print(f"Decision Tree Accuracy: {dt_accuracy:.2f}")

# Visualize the decision tree
plt.figure(figsize=(20, 10))
plot_tree(dt_model, feature_names=features, class_names=['Non-Diabetic', 'Diabetic'], 
          filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree for Diabetes Classification")
plt.show()

# What does the decision tree tell us?


In [None]:
# Feature Importance for Clinical Relevance
print("\nFeature Importance for Clinical Relevance")

# YOUR CODE HERE:
# Use Random Forest for more stable feature importance
# rf_model = ...
# 

# Get feature importance
feature_importance = pd.DataFrame({
    'Feature': features,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Visualize feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance)
plt.title('Feature Importance for Diabetes Prediction')
plt.show()

print("\nClinical Relevance of Features:")
for _, row in feature_importance.iterrows():
    feature = row['Feature']
    importance = row['Importance']
    
    print(f"{feature} (Importance: {importance:.3f}):")


In [None]:
# Risk Stratification and Clinical Thresholds
print("\nRisk Stratification and Clinical Thresholds")

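# YOUR CODE HERE: 
# Create a logistic regression model and fit it on the training split (X_train, y_train),
# so that the probabilities computed for X_test below are out-of-sample.
# If you scale the features, apply the same scaler to X_test before predicting.
# log_reg = ...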

# Get probabilities for test set
y_prob = log_reg.predict_proba(X_test)[:, 1]

# Create risk categories
risk_df = pd.DataFrame({
    'Actual': y_test.values,
    'Probability': y_prob
})

# YOUR CODE HERE
risk_categories = [0, 0.1, 0.2, 0.3, 1.0]  # try different values...

# Create risk categories based on your input
risk_df['Risk Category'] = pd.cut(
    risk_df['Probability'], 
    bins=risk_categories,
    labels=['Low Risk', 'Moderate Risk', 'High Risk', 'Very High Risk']
)

# Count patients in each risk category
risk_counts = risk_df['Risk Category'].value_counts().sort_index()
print("Patients by Risk Category:")
for category, count in risk_counts.items():
    print(f"{category}: {count} patients ({count/len(risk_df)*100:.1f}%)")

# Calculate actual diabetes rates within each risk category
risk_accuracy = risk_df.groupby('Risk Category')['Actual'].mean()
print("\nActual Diabetes Prevalence by Predicted Risk Category:")
for category, rate in risk_accuracy.items():
    print(f"{category}: {rate*100:.1f}% have diabetes")

# Visualize risk stratification
plt.figure(figsize=(10, 6))
sns.countplot(x='Risk Category', hue='Actual', data=risk_df, palette=['green', 'red'])
plt.title('Predicted Risk Categories vs. Actual Diabetes Status')
plt.xlabel('Predicted Risk Category')
plt.ylabel('Number of Patients')
plt.legend(['Non-Diabetic', 'Diabetic'])
plt.show()

# Section 7: Ethical Considerations
## 7.1 Overview of Privacy Concerns in Medical Data Science

Medical data is among the most sensitive personal information. Privacy concerns in medical data science include:

- Protection of patient identities
- Secure handling of health information
- Appropriate data sharing practices
- Balancing research benefits with privacy risks

Let's explore these concerns and best practices for addressing them:

In [None]:
# Display the first few rows to identify potentially sensitive information
print("Sample of medical dataset:")
display(diabetes_df.head())

# Demonstrate privacy risks with quasi-identifiers
print("\nDemonstration of Re-identification Risk:")

# YOUR CODE HERE:
# Select the columns that, in combination, might identify individuals (quasi-identifiers).
# quasi_identifiers = [...] YOUR CODE HERE

# Then count how many patients share each combination. Hint: pandas "groupby" is helpful here.
# unique_combinations = ... YOUR CODE HERE

# Keep the combinations that are shared by 5 or fewer patients
# rare_combinations = ... YOUR CODE HERE

# Let's see the results:
print(f"Number of rare demographic combinations (≤5 patients): {len(rare_combinations)}")
print("These rare combinations could potentially be used for re-identification.")



## Privacy and Ethics in Healthcare Data Science

Working with medical data comes with significant ethical and legal responsibilities. Healthcare data scientists must navigate a complex landscape of regulations designed to protect patient privacy while enabling valuable research and innovation.

### Key Privacy Frameworks and Regulations in Healthcare

The regulatory environment for medical data varies globally, with several major frameworks guiding how we collect, process, and protect sensitive health information:

**HIPAA (Health Insurance Portability and Accountability Act)** serves as the cornerstone of health data protection in the United States. It safeguards individually identifiable health information through a comprehensive set of standards, including the definition of 18 specific protected health identifiers that must be carefully managed. HIPAA establishes clear boundaries for the use and disclosure of patient data, creating a framework that balances privacy protection with legitimate healthcare operations and research needs.

In the European context, the **General Data Protection Regulation (GDPR)** provides even broader protections for health data. Unlike HIPAA's healthcare-specific approach, GDPR treats health information as a special category of personal data deserving heightened protection. Processing such data requires a specific legal basis, of which explicit consent is the most prominent; others include provisions for public health and scientific research. The regulation also emphasizes individual autonomy by granting individuals rights to access, correct, and, in many circumstances, erase their personal information.

### Best Practices for Privacy Protection in Medical Data Science

Beyond regulatory compliance, responsible data scientists should implement robust privacy protection strategies:

**Data Minimization** is a fundamental principle in privacy-preserving data science. By collecting and retaining only the data elements that are necessary for the specific research or clinical question, we reduce privacy risks at their source. This approach includes strict access controls that restrict sensitive information to those with a legitimate need.

**De-identification Techniques** transform identifiable patient data into more anonymous forms while preserving analytical utility. This process involves removing direct identifiers (like names and medical record numbers), generalizing quasi-identifiers (such as zip codes or birth dates), and applying statistical disclosure controls to minimize re-identification risks. Well-executed de-identification allows researchers to work with realistic data while protecting individual privacy.

**Secure Data Handling** practices form the technical foundation of privacy protection. Implementing strong encryption for data both at rest and in transit ensures that even if unauthorized access occurs, the information remains protected. Conducting analysis within secure computing environments with appropriate access controls and authentication mechanisms adds additional layers of protection against data breaches.

**Ethical Review and Governance** processes provide oversight and accountability. Obtaining IRB approval for research projects, establishing clear data use agreements between institutions, and maintaining transparent data practices all contribute to an ethical framework that respects patient autonomy while enabling scientific progress.

By integrating these regulatory frameworks and best practices into data science workflows, healthcare researchers can advance medical knowledge while honoring their ethical obligation to protect patient privacy.
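
The effect of generalizing quasi-identifiers can be made concrete by counting how many records share each combination of values. The sketch below does this for six invented records; the column names, the values, and the helper `smallest_group` are assumptions made for illustration.

```python
import pandas as pd

# Six invented records with two quasi-identifiers
records = pd.DataFrame({
    "age": [34, 36, 38, 52, 55, 57],
    "zip": ["60311", "60313", "60316", "60431", "60433", "60437"],
})

def smallest_group(df, cols):
    """Size of the smallest group of records sharing the same values in cols."""
    return int(df.groupby(cols).size().min())

# Before generalization every record is unique, so each person stands out
k_raw = smallest_group(records, ["age", "zip"])

# Generalize: age to decades, ZIP code to its first three digits
generalized = pd.DataFrame({
    "age_band": (records["age"] // 10 * 10).astype(str) + "s",
    "zip_prefix": records["zip"].str[:3],
})
k_generalized = smallest_group(generalized, ["age_band", "zip_prefix"])

print(f"Smallest group before generalization: {k_raw}")
print(f"Smallest group after generalization:  {k_generalized}")
```

The larger the smallest group, the harder it is to single out one person from these columns alone; the price is coarser data for analysis.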

## 7.2 Data De-identification Techniques

De-identification is the process of removing or modifying personal information to reduce the risk of identifying individuals. In medical data science, effective de-identification is essential for protecting patient privacy while enabling valuable research.
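
One detail worth knowing before looking at pseudonymization: hashing an identifier without a secret key offers limited protection when the identifiers are guessable (such as sequential patient numbers), because anyone can hash all candidates and compare. A keyed hash avoids this. The sketch below uses Python's standard `hmac` module; the key value and the function name `pseudonymize` are placeholders for this example, and a real key would be stored securely, separate from the data.

```python
import hashlib
import hmac

# Placeholder key for illustration only; never hard-code a real key
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(patient_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable pseudonym that cannot be recomputed without the key."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# The same ID always maps to the same pseudonym, so records remain linkable
print(pseudonymize("P1000"))
print(pseudonymize("P1000") == pseudonymize("P1000"))

# A different key yields unrelated pseudonyms
print(pseudonymize("P1000", key=b"another-key") == pseudonymize("P1000"))
```

Truncating the digest to 16 characters keeps pseudonyms readable; for very large datasets a longer prefix lowers the chance of two patients receiving the same pseudonym.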

In [None]:
%pip install faker

In [None]:
# Data De-identification Techniques

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from faker import Faker
import hashlib

# Set up a random seed for reproducibility
np.random.seed(42)
fake = Faker()
Faker.seed(42)

# For demonstration, let's add some fictional identifiers to our dataset
# Note: In real medical datasets, these would already exist
patient_ids = [f"P{1000+i}" for i in range(len(diabetes_df))]
names = [fake.name() for _ in range(len(diabetes_df))]
birth_dates = [fake.date_of_birth(minimum_age=18, maximum_age=90) for _ in range(len(diabetes_df))]
zip_codes = [fake.zipcode() for _ in range(len(diabetes_df))]

# Add these to our dataset
diabetes_identified = diabetes_df.copy()
diabetes_identified['PatientID'] = patient_ids
diabetes_identified['Name'] = names
diabetes_identified['BirthDate'] = birth_dates
diabetes_identified['ZipCode'] = zip_codes

# Display the dataset with identifiers
print("Medical Dataset with Identifiers:")
display(diabetes_identified.head())

# 1. Removal of Direct Identifiers
print("\n1. Removal of Direct Identifiers")
print("The simplest de-identification approach is removing direct identifiers.")

# Remove direct identifiers
diabetes_deident1 = diabetes_identified.copy()
diabetes_deident1 = diabetes_deident1.drop(columns=['PatientID', 'Name'])

print("Dataset after removing direct identifiers:")
display(diabetes_deident1.head())
print("Note: This still contains quasi-identifiers that could re-identify individuals.")

# 2. Pseudonymization
print("\n2. Pseudonymization")
print("Pseudonymization replaces identifiers with artificial values that can't be attributed to individuals without additional information.")

# Create pseudonyms using hashing
diabetes_deident2 = diabetes_identified.copy()
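# Hash the patient IDs (in practice, use a keyed hash with a secret key: an unkeyed hash of a
# guessable ID such as "P1000" can be reversed by hashing all candidate IDs and comparing)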
diabetes_deident2['PatientID'] = diabetes_deident2['PatientID'].apply(
    lambda x: hashlib.sha256(x.encode()).hexdigest()[:8]
)
# Remove names
diabetes_deident2 = diabetes_deident2.drop(columns=['Name'])

print("Dataset after pseudonymization:")
display(diabetes_deident2.head())
print("The hashed IDs allow linking records without revealing identities.")

# 3. Generalization
print("\n3. Generalization")
print("Generalization reduces precision of data to protect privacy while preserving utility.")

# Generalize quasi-identifiers
diabetes_deident3 = diabetes_identified.copy()
# Age ranges instead of exact ages
diabetes_deident3['AgeGroup'] = pd.cut(
    diabetes_deident3['PatientAge'], 
    bins=[0, 30, 45, 60, 100],
    labels=['<=30', '31-45', '46-60', '>60']  # pd.cut intervals include the right edge by default
)
# First 3 digits of ZIP code
diabetes_deident3['ZipRegion'] = diabetes_deident3['ZipCode'].apply(lambda x: x[:3])
# Year of birth instead of full date
diabetes_deident3['BirthYear'] = pd.to_datetime(diabetes_deident3['BirthDate']).dt.year
# Remove original identifiers
diabetes_deident3 = diabetes_deident3.drop(
    columns=['PatientID', 'Name', 'PatientAge', 'ZipCode', 'BirthDate']
)

print("Dataset after generalization:")
display(diabetes_deident3.head())
print("Note how specific values are replaced with ranges or less precise information.")

# 4. Perturbation (Adding Noise)
print("\n4. Perturbation (Adding Noise)")
print("Perturbation adds random noise to numerical values to protect privacy.")

# Add noise to numerical values
diabetes_deident4 = diabetes_identified.copy()
# Add small random noise to continuous variables
for col in ['BodyMassIndex', 'BloodGlucose', 'SystolicBP', 'InsulinLevel']:
    # Calculate standard deviation
    std = diabetes_deident4[col].std()
    # Add noise (2% of standard deviation)
    diabetes_deident4[col] = diabetes_deident4[col] + np.random.normal(0, 0.02 * std, len(diabetes_deident4))
    # Round to reasonable precision
    diabetes_deident4[col] = diabetes_deident4[col].round(1)
# Remove direct identifiers
diabetes_deident4 = diabetes_deident4.drop(columns=['PatientID', 'Name', 'BirthDate', 'ZipCode'])

print("Dataset after perturbation:")
display(diabetes_deident4.head())
print("Small random noise has been added to numerical values.")

# 5. Synthetic Data Generation
print("\n5. Synthetic Data Generation")
print("Synthetic data aims to reproduce statistical properties of the original data without containing real individuals.")

# For demonstration, we'll create a very simple synthetic dataset
# In practice, more sophisticated methods would be used
# Calculate statistics from original data
means = diabetes_df[numeric_cols].mean()
stds = diabetes_df[numeric_cols].std()
# Computed for reference only: the simple sampler below does not use the correlations
correlations = diabetes_df[numeric_cols].corr()

# Generate synthetic data with similar per-column properties
n_synthetic = 100
synthetic_data = pd.DataFrame()

# Each column is drawn independently from a normal distribution, so the means and
# standard deviations are approximated but correlations between columns are not preserved
for col in numeric_cols:
    synthetic_data[col] = np.random.normal(means[col], stds[col], n_synthetic)

# Add a synthetic diabetes status
synthetic_data['DiabetesStatus'] = np.random.binomial(1, diabetes_df['DiabetesStatus'].mean(), n_synthetic)

print("Synthetic dataset:")
display(synthetic_data.head())
print("This data approximates each column's mean and standard deviation without containing real patient information.")
print("Because columns are sampled independently, relationships between variables are not preserved.")

# 6. K-Anonymity
print("\n6. K-Anonymity")
print("K-anonymity ensures each combination of quasi-identifiers appears at least k times.")

# Demonstrate k-anonymity concept
diabetes_deident6 = diabetes_identified.copy()
# Generalize age into decades
diabetes_deident6['AgeDecade'] = (diabetes_deident6['PatientAge'] // 10) * 10
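# Generalize BMI into categories
# right=False makes the bins left-closed, so 25.0 counts as Overweight and 30.0 as Obese
diabetes_deident6['BMICategory'] = pd.cut(
    diabetes_deident6['BodyMassIndex'],
    bins=[0, 18.5, 25, 30, 100],
    right=False,
    labels=['Underweight', 'Normal', 'Overweight', 'Obese']
)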
# Remove direct identifiers and original values
diabetes_deident6 = diabetes_deident6.drop(
    columns=['PatientID', 'Name', 'BirthDate', 'ZipCode', 'PatientAge', 'BodyMassIndex']
)

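# Check k-anonymity for quasi-identifiers
quasi_ids = ['AgeDecade', 'BMICategory', 'ActivityLevel']
# observed=True counts only combinations that actually occur; without it, unobserved
# BMICategory combinations can appear as empty groups and push k down to 0
group_counts = diabetes_deident6.groupby(quasi_ids, observed=True).size().reset_index(name='count')
k_value = group_counts['count'].min()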

print(f"Dataset achieves {k_value}-anonymity for the selected quasi-identifiers.")
print("This means each unique combination of age decade, BMI category, and activity level")
print(f"appears at least {k_value} times in the dataset.")

# Display the k-anonymized dataset
print("\nK-anonymized dataset:")
display(diabetes_deident6.head())

# 7. Comparing De-identification Methods
print("\n7. Comparing De-identification Methods")

print("\nPrivacy-Utility Tradeoff:")
methods = [
    "Removal of Direct Identifiers",
    "Pseudonymization",
    "Generalization",
    "Perturbation",
    "Synthetic Data",
    "K-Anonymity"
]

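# Illustrative ordinal scores (1=low, 5=high) for teaching purposes, not measured values;
# actual protection and utility depend on the dataset and on how each method is configured
privacy_levels = [1, 2, 3, 4, 5, 4]
utility_levels = [5, 4, 3, 3, 2, 3]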

comparison_df = pd.DataFrame({
    'Method': methods,
    'Privacy Protection': privacy_levels,
    'Data Utility': utility_levels
})

print("Comparison of de-identification methods:")
display(comparison_df)

# Visualize the privacy-utility tradeoff
plt.figure(figsize=(10, 6))
plt.scatter(comparison_df['Privacy Protection'], comparison_df['Data Utility'], s=100)
for i, method in enumerate(methods):
    plt.annotate(method, 
                 (comparison_df['Privacy Protection'][i], comparison_df['Data Utility'][i]),
                 xytext=(5, 5), textcoords='offset points')
plt.xlabel('Privacy Protection (higher is better)')
plt.ylabel('Data Utility (higher is better)')
plt.title('Privacy-Utility Tradeoff for De-identification Methods')
plt.xlim(0, 6)
plt.ylim(0, 6)
plt.grid(True)
plt.show()

# Best practices for de-identification
print("\nBest Practices for De-identification in Medical Research:")
print("1. Use multiple techniques in combination for stronger protection")
print("2. Consider the specific sensitivity of the data and research context")
print("3. Perform risk assessments to evaluate re-identification risk")
print("4. Document de-identification methods for transparency")
print("5. Implement additional safeguards (e.g., secure access, data use agreements)")
print("6. Regularly review de-identification as new methods and risks emerge")
print("7. Consult with privacy experts and ethics committees")
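
The pseudonymization step above notes that a secure key would be used in practice. A minimal sketch of that idea, using the standard-library `hmac` module, is shown below. The function name `pseudonymize`, the example IDs, and the key value are assumptions made for illustration; a real key would be generated randomly and stored separately from the data.

```python
import hashlib
import hmac


def pseudonymize(patient_id, secret_key, length=16):
    """Return a keyed pseudonym (HMAC-SHA256) for a patient identifier."""
    digest = hmac.new(secret_key, str(patient_id).encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep only the first `length` hex characters to make the pseudonym shorter
    return digest[:length]


# Hypothetical key for illustration only; never hard-code a real key in a notebook
demo_key = b"replace-with-a-randomly-generated-secret"

# The same ID always maps to the same pseudonym, so records can still be linked
for pid in ["P001", "P002", "P001"]:
    print(pid, "->", pseudonymize(pid, demo_key))
```

Without the key, an attacker cannot simply hash a list of candidate IDs and compare the results, which is the main weakness of an unkeyed hash.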

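The synthetic-data step above computes a correlation matrix but samples each column independently, which reproduces means and standard deviations while losing the relationships between variables. The sketch below compares independent sampling with joint sampling from a multivariate normal distribution. It uses a made-up two-column dataset rather than the notebook's `diabetes_df`, and the normality assumption is a simplification that can produce implausible values for skewed clinical variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy "original" data with a built-in BMI-glucose relationship (made-up values)
n = 500
bmi = rng.normal(28, 5, n)
glucose = 60 + 2.0 * bmi + rng.normal(0, 10, n)
original = pd.DataFrame({"BodyMassIndex": bmi, "BloodGlucose": glucose})

# Independent sampling: matches each column's mean and std, but not the correlation
independent = pd.DataFrame({
    col: rng.normal(original[col].mean(), original[col].std(), n)
    for col in original.columns
})

# Joint sampling from a multivariate normal: also reproduces the covariance structure
correlated = pd.DataFrame(
    rng.multivariate_normal(original.mean().to_numpy(), original.cov().to_numpy(), size=n),
    columns=original.columns,
)

# Compare the BMI-glucose correlation in the three datasets
corrs = {}
for name, df in [("original", original), ("independent", independent), ("correlated", correlated)]:
    corrs[name] = df["BodyMassIndex"].corr(df["BloodGlucose"])
    print(f"{name:12s} correlation: {corrs[name]:.2f}")
```

The independently sampled data should show a correlation near zero, while the jointly sampled data should stay close to the original. Better correlation fidelity also means the synthetic data carries more information about the original cohort, so the privacy-utility tradeoff still applies.
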
## 7.4 Bias in Medical Data and Algorithms

Bias in medical data science can lead to unfair treatment, inaccurate predictions, and perpetuation of health disparities. Understanding and mitigating bias is essential for developing equitable healthcare algorithms.
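
One concrete way to see "inaccurate predictions" for a subgroup is to compute sensitivity and specificity separately for each group instead of relying on overall accuracy. The sketch below does this on made-up labels and predictions for two hypothetical groups; the helper name `rates_by_group` is an assumption for illustration and is not part of scikit-learn.

```python
import pandas as pd


def rates_by_group(y_true, y_pred, groups):
    """Return sensitivity and specificity for each group as a DataFrame."""
    df = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "group": groups})
    rows = []
    for name, g in df.groupby("group"):
        tp = int(((g["y_true"] == 1) & (g["y_pred"] == 1)).sum())
        fn = int(((g["y_true"] == 1) & (g["y_pred"] == 0)).sum())
        tn = int(((g["y_true"] == 0) & (g["y_pred"] == 0)).sum())
        fp = int(((g["y_true"] == 0) & (g["y_pred"] == 1)).sum())
        rows.append({
            "group": name,
            "n": len(g),
            # A rate is undefined (NaN) when the group has no cases of that class
            "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        })
    return pd.DataFrame(rows).set_index("group")


# Made-up labels and predictions for two hypothetical groups of five patients each
y_true = [1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
groups = ["A"] * 5 + ["B"] * 5

result = rates_by_group(y_true, y_pred, groups)
print(result)
```

In this toy example the overall accuracy is 70%, yet every error falls in group B: two of its three diabetic patients are missed. An aggregate metric alone would hide that gap.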

In [None]:
# Bias in Medical Data and Algorithms

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

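# Reuse the diabetes_df that is already in memory from the previous sections.
# To reload it from disk instead, uncomment the read_csv line below.
import os
data_dir = os.path.join(os.path.dirname(os.path.abspath('')), 'data')
file_path = os.path.join(data_dir, 'synthetic_diabetes.csv')
# diabetes_df = pd.read_csv(file_path)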

# For demonstration, let's add synthetic demographic variables
np.random.seed(42)

# Add gender (binary for simplicity, though real applications should be more inclusive)
diabetes_df['Gender'] = np.random.choice(['Male', 'Female'], size=len(diabetes_df))

# Add ethnicity groups
ethnicities = ['Group A', 'Group B', 'Group C', 'Group D']
# Create an imbalanced distribution
ethnicity_probs = [0.6, 0.2, 0.15, 0.05]  # Intentionally imbalanced
diabetes_df['Ethnicity'] = np.random.choice(ethnicities, size=len(diabetes_df), p=ethnicity_probs)

# Add socioeconomic status (SES)
diabetes_df['SES'] = np.random.choice(['Low', 'Medium', 'High'], size=len(diabetes_df), p=[0.3, 0.5, 0.2])

# Introduce synthetic bias: make diabetes more prevalent in certain groups
# This is for educational purposes to demonstrate bias detection and mitigation
# Only the label changes; the clinical features stay the same
n_rows = len(diabetes_df)
# Higher diabetes rates for Group C and D: relabel about 30% of these rows as diabetic
ethnicity_flip = diabetes_df['Ethnicity'].isin(['Group C', 'Group D']) & (np.random.random(n_rows) < 0.3)
# Higher diabetes rates for low SES: relabel about 20% of these rows as diabetic
ses_flip = (diabetes_df['SES'] == 'Low') & (np.random.random(n_rows) < 0.2)
diabetes_df.loc[ethnicity_flip | ses_flip, 'DiabetesStatus'] = 1

# Display the dataset with demographic variables
print("Medical Dataset with Demographic Variables:")
display(diabetes_df.head())

# 1. Types of Bias in Medical Data Science
print("\n1. Types of Bias in Medical Data Science")

print("\na) Selection Bias")
print("   - Occurs when the data doesn't represent the population it's meant to serve")
print("   - Example: Clinical trials historically underrepresented women and minorities")
print("   - Consequence: Models may not work well for underrepresented groups")

print("\nb) Measurement Bias")
print("   - Occurs when data collection methods vary across groups")
print("   - Example: Different diagnostic criteria or testing rates across populations")
print("   - Consequence: False differences in disease prevalence or severity")

print("\nc) Confounding Bias")
print("   - Occurs when unmeasured variables affect both predictors and outcomes")
print("   - Example: Socioeconomic status affecting both healthcare access and health outcomes")
print("   - Consequence: Spurious associations that don't reflect causal relationships")

print("\nd) Historical/Prejudice Bias")
print("   - Occurs when historical inequities are encoded in the data")
print("   - Example: Historical underdiagnosis of certain conditions in minority populations")
print("   - Consequence: Algorithms may perpetuate or amplify existing disparities")

print("\ne) Algorithmic Bias")
print("   - Occurs when models perform differently across demographic groups")
print("   - Example: Higher error rates for certain populations")
print("   - Consequence: Unfair allocation of resources or clinical recommendations")

# 2. Detecting Bias in Medical Datasets
print("\n2. Detecting Bias in Medical Datasets")

# Analyze diabetes prevalence across demographic groups
print("\nDiabetes Prevalence by Demographic Groups:")

# By gender
gender_diabetes = diabetes_df.groupby('Gender')['DiabetesStatus'].mean()
print("\nDiabetes Prevalence by Gender:")
for gender, rate in gender_diabetes.items():
    print(f"{gender}: {rate*100:.1f}%")

# By ethnicity
ethnicity_diabetes = diabetes_df.groupby('Ethnicity')['DiabetesStatus'].mean()
print("\nDiabetes Prevalence by Ethnicity:")
for ethnicity, rate in ethnicity_diabetes.items():
    print(f"{ethnicity}: {rate*100:.1f}%")
    
# By socioeconomic status
ses_diabetes = diabetes_df.groupby('SES')['DiabetesStatus'].mean()
print("\nDiabetes Prevalence by Socioeconomic Status:")
for ses, rate in ses_diabetes.items():
    print(f"{ses}: {rate*100:.1f}%")

# Visualize the disparities
plt.figure(figsize=(15, 5))

# Gender plot
plt.subplot(1, 3, 1)
sns.barplot(x='Gender', y='DiabetesStatus', data=diabetes_df)
plt.title('Diabetes Prevalence by Gender')
plt.ylabel('Prevalence')

# Ethnicity plot
plt.subplot(1, 3, 2)
sns.barplot(x='Ethnicity', y='DiabetesStatus', data=diabetes_df)
plt.title('Diabetes Prevalence by Ethnicity')
plt.ylabel('')

# SES plot
plt.subplot(1, 3, 3)
sns.barplot(x='SES', y='DiabetesStatus', data=diabetes_df)
plt.title('Diabetes Prevalence by Socioeconomic Status')
plt.ylabel('')

plt.tight_layout()
plt.show()

print("\nObservations:")
print("- Diabetes prevalence should be higher in Group C and Group D, where labels were raised on purpose")
print("- Prevalence should also be higher in the low-SES group than in the medium- and high-SES groups")
print("- Here the disparities were injected synthetically; in real data they could reflect genuine health inequities or bias in how the data were collected")

# 3. Bias in Predictive Models
print("\n3. Bias in Predictive Models")

# Prepare data for modeling
features = ['PatientAge', 'BodyMassIndex', 'BloodGlucose', 'SystolicBP', 'Gender', 'Ethnicity', 'SES']

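# Convert categorical variables to dummy variables
# drop_first=True drops one reference category per variable (here Female, Group A, and High),
# so those groups have no dummy column of their own
diabetes_model_df = pd.get_dummies(diabetes_df, columns=['Gender', 'Ethnicity', 'SES'], drop_first=True)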

# Select features for modeling (after dummy encoding)
model_features = [col for col in diabetes_model_df.columns 
                 if col.startswith(('PatientAge', 'BodyMassIndex', 'BloodGlucose', 'SystolicBP', 'Gender', 'Ethnicity', 'SES'))]

X = diabetes_model_df[model_features]
y = diabetes_model_df['DiabetesStatus']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Overall model performance
print("\nOverall Model Performance:")
print(classification_report(y_test, y_pred))

# Evaluate model performance across demographic groups
print("\nModel Performance by Demographic Groups:")

# Before evaluating group performance, print the actual column names to help debug
print("\nAvailable columns in X_test:")
for col in X_test.columns:
    print(f"  - {col}")

# Function to evaluate model performance for a specific group
def evaluate_group_performance(group_name, group_value, X, y_true):
    # Build the mask from the original (un-encoded) column in diabetes_df.
    # get_dummies(..., drop_first=True) drops one reference category per variable
    # (Female, Group A, High), so a dummy-column lookup would skip those groups.
    mask = (diabetes_df.loc[X.index, group_name] == group_value).to_numpy()
    
    # Filter data for this group
    X_group = X[mask]
    y_true_group = y_true[mask]
    
    # Skip if no samples
    if len(y_true_group) == 0:
        print(f"No samples found for {group_name}_{group_value}")
        return None
    
    # Make predictions
    y_pred_group = model.predict(X_group)
    
    # labels=[0, 1] guarantees a 2x2 matrix even if the group contains only one class
    tn, fp, fn, tp = confusion_matrix(y_true_group, y_pred_group, labels=[0, 1]).ravel()
    
    # A metric is undefined (NaN) when its denominator is zero
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else np.nan
    specificity = tn / (tn + fp) if (tn + fp) > 0 else np.nan
    ppv = tp / (tp + fp) if (tp + fp) > 0 else np.nan
    npv = tn / (tn + fn) if (tn + fn) > 0 else np.nan
    
    return {
        'Group': f"{group_name}_{group_value}",
        'Samples': len(y_true_group),
        'Sensitivity': sensitivity,
        'Specificity': specificity,
        'PPV': ppv,
        'NPV': npv
    }

# Evaluate every category of each demographic variable in the test set,
# including the reference categories that have no dummy column
gender_results, ethnicity_results, ses_results = [], [], []
for group_name, results in [('Gender', gender_results),
                            ('Ethnicity', ethnicity_results),
                            ('SES', ses_results)]:
    group_values = sorted(diabetes_df.loc[X_test.index, group_name].unique())
    print(f"\n{group_name} groups in the test set: {group_values}")
    for group_value in group_values:
        result = evaluate_group_performance(group_name, group_value, X_test, y_test)
        if result:
            results.append(result)

# Combine results
all_results = gender_results + ethnicity_results + ses_results
results_df = pd.DataFrame(all_results)

print("\nPerformance Metrics Across Groups:")
display(results_df)

# Visualize performance disparities
plt.figure(figsize=(12, 6))
sns.barplot(x='Group', y='Sensitivity', data=results_df)
plt.title('Model Sensitivity Across Demographic Groups')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()