# Introduction to Clinical Data Analysis

**Stanford Data Science in Precision Medicine - Module 5**

---

## iPOP Study: Longitudinal Clinical Biomarker Analysis

This notebook walks through an analysis of clinical laboratory data using the Stanford iPOP study framework. We focus on three distinct diabetes progression patterns:

- **OGTT-First Progression** (Patient ZNDMXI3): Glucose tolerance impairment precedes fasting abnormalities
- **FPG-First Progression** (Patient ZNED4XZ): Fasting glucose homeostasis disruption occurs first
- **Infection-Triggered Diabetes** (Patient ZOZOWIT): Environmental factors trigger metabolic dysfunction

### Learning Objectives
By completing this analysis, you will:
1. Master clinical data processing and quality control
2. Understand biomarker relationships and progression patterns
3. Apply machine learning to clinical prediction problems
4. Interpret results in a precision medicine context

### Clinical Markers
- **FPG**: Fasting Plasma Glucose (glucose homeostasis)
- **OGTT2HR**: 2-hour Oral Glucose Tolerance Test (glucose processing)
- **SSPG**: Steady-State Plasma Glucose (insulin resistance)
- **HbA1C**: Hemoglobin A1C (long-term glycemic control)


In [None]:
import sys
import os
sys.path.append("./scripts")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")

from clinical_utils import (
    ClinicalDataProcessor,
    ClinicalBiomarkerAnalyzer,
    ClinicalDataVisualizer,
    ClinicalPredictiveModeling
)
from config import DEMO_CONFIG, get_diabetes_thresholds

plt.style.use("seaborn-v0_8")
sns.set_palette("husl")

print("✓ Libraries imported successfully")
print("✓ Clinical analysis utilities loaded")
print("✓ iPOP study configuration loaded")

## 1. Clinical Data Loading and Overview

We begin by generating demo clinical data modeled on patterns observed in the iPOP study.

In [None]:
# Initialize clinical data processor
processor = ClinicalDataProcessor()

# Generate iPOP study demo data
clinical_data = processor.generate_demo_data(DEMO_CONFIG)

print(f"Clinical Data Shape: {clinical_data.shape}")
print(f"Patients: {clinical_data['PatientID'].nunique()}")
print(f"Timepoints: {clinical_data['Days'].nunique()}")
print(f"Total measurements: {len(clinical_data)}")

print("\nPatient Information:")
for patient in DEMO_CONFIG["patients"]:
    print(f"• {patient['id']}: {patient['description']}")

print("\nClinical Data Preview:")
clinical_data.head(10)

## 2. Data Quality Control and Validation

Quality control is essential in clinical data analysis to ensure reliable results.
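
The internals of `processor.quality_control` are not shown in this notebook. As a rough illustration of the kind of checks such a step might perform, the sketch below computes per-marker missingness and counts outliers with the 1.5 × IQR rule on a small made-up table. The helper name `simple_qc`, the choice of the IQR rule, and the toy values are assumptions for illustration only; the result keys simply mirror those printed in the cell below.

```python
import numpy as np
import pandas as pd


def simple_qc(df, markers):
    """Hypothetical QC helper: missingness and IQR-based outlier counts."""
    complete = df[markers].dropna()
    report = {
        "total_samples": len(df),
        "complete_cases": len(complete),
        "completeness_rate": 100.0 * len(complete) / len(df),
        "missing_values": {},
        "outliers": {},
    }
    for marker in markers:
        values = df[marker]
        missing_pct = 100.0 * values.isna().mean()
        if missing_pct > 0:
            report["missing_values"][marker] = missing_pct
        # Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
        if mask.sum() > 0:
            report["outliers"][marker] = int(mask.sum())
    return report


# Toy data (made up for illustration; not iPOP measurements)
toy = pd.DataFrame({
    "FPG (mg/dl)": [92, 95, 98, 101, 97, 250],
    "HbA1C (%)": [5.4, np.nan, 5.6, 5.5, 5.7, 5.6],
})
qc = simple_qc(toy, ["FPG (mg/dl)", "HbA1C (%)"])
print(qc)
```

In this toy table the FPG value of 250 is flagged as an outlier and one HbA1C measurement is missing. An outlier flag is a prompt for review, not an automatic exclusion: in a diabetes cohort, a very high glucose value may be a genuine clinical signal.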

In [None]:
# Perform comprehensive quality control
qc_results = processor.quality_control(clinical_data)

print("Quality Control Results:")
print("=" * 40)
print(f"Total samples: {qc_results['total_samples']}")
print(f"Complete cases: {qc_results['complete_cases']}")
print(f"Completeness rate: {qc_results['completeness_rate']:.1f}%")

if qc_results["missing_values"]:
    print("\nMissing Values by Marker:")
    for marker, percent in qc_results["missing_values"].items():
        print(f"  {marker}: {percent:.1f}%")

if qc_results["outliers"]:
    print("\nOutliers Detected:")
    for marker, count in qc_results["outliers"].items():
        print(f"  {marker}: {count} outliers")

# Display data summary statistics
print("\nSummary Statistics:")
clinical_data.describe()

## 3. Clinical Biomarker Analysis

We analyze relationships between clinical biomarkers and their diagnostic significance.
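
How `biomarker_correlation_analysis` computes its results is not shown here. One plausible approach, sketched below on made-up values, is to take pairwise Pearson correlations with `DataFrame.corr` and keep the pairs with |r| > 0.5, the cutoff printed in the cell below. The strength labels and their 0.7 boundary are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy measurements (made up for illustration; not iPOP data)
toy = pd.DataFrame({
    "FPG (mg/dl)": [90, 100, 110, 120, 130],
    "OGTT2HR (mg/dl)": [130, 150, 175, 190, 215],
    "HbA1C (%)": [5.9, 5.3, 6.1, 5.5, 5.8],
})

# Pairwise Pearson correlation; missing values are excluded pair by pair
corr_matrix = toy.corr(method="pearson")

significant = []
for m1, m2 in combinations(corr_matrix.columns, 2):
    r = corr_matrix.loc[m1, m2]
    if abs(r) > 0.5:
        # Hypothetical strength labels; the 0.7 boundary is an assumption
        strength = "strong" if abs(r) > 0.7 else "moderate"
        significant.append(
            {"marker1": m1, "marker2": m2, "correlation": float(r), "strength": strength}
        )

for item in significant:
    print(f"{item['marker1']} <-> {item['marker2']}: r={item['correlation']:.3f} ({item['strength']})")
```

Only the FPG and OGTT2HR pair passes the cutoff in this toy table. With longitudinal data, repeated measurements from the same patient are not independent, so pooled correlations like these are best read as descriptive rather than inferential.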

In [None]:
# Initialize biomarker analyzer
analyzer = ClinicalBiomarkerAnalyzer()

# Analyze biomarker correlations
correlation_results = analyzer.biomarker_correlation_analysis(clinical_data)

print("Biomarker Correlation Analysis:")
print("=" * 40)
print("\nCorrelation Matrix:")
print(correlation_results["correlation_matrix"])

print("\nSignificant Correlations (|r| > 0.5):")
for corr in correlation_results["significant_correlations"]:
    print(f"• {corr['marker1']} ↔ {corr['marker2']}: r={corr['correlation']:.3f} ({corr['strength']})")

# Get diabetes thresholds for reference
thresholds = get_diabetes_thresholds()
print("\nDiabetes Diagnostic Thresholds:")
for marker, threshold in thresholds.items():
    unit = "mg/dl" if marker != "HbA1C" else "%"
    print(f"• {marker}: ≥{threshold} {unit}")

## 4. Diabetes Progression Pattern Analysis

We analyze how different patients progress to diabetes through distinct pathways.
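
The core idea behind identifying a progression pathway can be sketched in a few lines: walk through a patient's visits in time order and report the first visit at which any marker reaches its diagnostic cutoff. The cutoffs below (FPG ≥ 126 mg/dl, OGTT2HR ≥ 200 mg/dl, HbA1C ≥ 6.5%) are the ones used later in this notebook; the function name `first_diabetic_event`, the visit format, and the toy timeline are assumptions, and `diabetes_progression_analysis` may work differently.

```python
# Diagnostic cutoffs as used elsewhere in this notebook
THRESHOLDS = {"FPG": 126, "OGTT2HR": 200, "HbA1C": 6.5}


def first_diabetic_event(visits):
    """Return (day, markers) for the earliest visit with a marker at or above
    its cutoff, or None if no visit qualifies.

    `visits` is a list of dicts with a "day" key plus marker values;
    missing measurements are represented as None.
    """
    for visit in sorted(visits, key=lambda v: v["day"]):
        hits = [
            marker
            for marker, cutoff in THRESHOLDS.items()
            if visit.get(marker) is not None and visit[marker] >= cutoff
        ]
        if hits:
            return visit["day"], hits
    return None


# Toy timeline (made up for illustration; not a real iPOP patient)
toy_visits = [
    {"day": 0, "FPG": 98, "OGTT2HR": 150, "HbA1C": 5.6},
    {"day": 180, "FPG": 104, "OGTT2HR": 205, "HbA1C": None},
    {"day": 360, "FPG": 128, "OGTT2HR": 220, "HbA1C": 6.6},
]
print(first_diabetic_event(toy_visits))  # (180, ['OGTT2HR'])
```

Here OGTT2HR crosses its cutoff on day 180 while FPG is still below 126 mg/dl, which is the shape of an OGTT-first pattern.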

In [None]:
# Analyze diabetes progression patterns
progression_results = analyzer.diabetes_progression_analysis(clinical_data)

print("Diabetes Progression Analysis:")
print("=" * 50)

for patient, progression in progression_results.items():
    print(f"Patient {patient}:")
    
    if progression["first_diabetic_marker"]:
        print(f"  • First diabetic marker: {progression['first_diabetic_marker']}")
        print(f"  • First detected on day: {progression['first_diabetic_day']}")
        print(f"  • Total diabetic markers: {progression['total_diabetic_markers']}")
        print(f"  • Progression events: {len(progression['progression_pattern'])}")

        if progression["progression_pattern"]:
            print("  • Progression timeline:")
            for event in progression["progression_pattern"][:3]:  # Show first 3 events
                markers = ", ".join(event["markers"])
                print(f"    Day {event['day']}: {markers} in diabetic range")
    else:
        print("  • No diabetic markers detected")
    
    print("-" * 30)

## 5. Clinical Risk Stratification

We calculate composite risk scores and categorize patients by diabetes risk level.
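
The scoring rules inside `risk_stratification` are not shown in this notebook. As one possible way to build a composite score, the sketch below awards two points for each marker in the diabetic range and one point for each marker in the prediabetic range, then maps the total to the three categories used in this notebook. The point values and category boundaries are purely illustrative assumptions; the prediabetic cutoffs follow commonly cited ADA criteria.

```python
# (prediabetic cutoff, diabetic cutoff) per marker
CUTOFFS = {
    "FPG": (100, 126),      # mg/dl
    "OGTT2HR": (140, 200),  # mg/dl
    "HbA1C": (5.7, 6.5),    # percent
}


def risk_score(latest):
    """Hypothetical scoring: 2 points per diabetic-range marker, 1 per prediabetic."""
    score, factors = 0, []
    for marker, (pre, dia) in CUTOFFS.items():
        value = latest.get(marker)
        if value is None:
            continue  # skip markers that were not measured
        if value >= dia:
            score += 2
            factors.append(f"{marker} in diabetic range ({value})")
        elif value >= pre:
            score += 1
            factors.append(f"{marker} in prediabetic range ({value})")
    # Category boundaries are illustrative assumptions
    if score >= 4:
        category = "High Risk"
    elif score >= 2:
        category = "Moderate Risk"
    else:
        category = "Low Risk"
    return {"risk_score": score, "risk_category": category, "risk_factors": factors}


result = risk_score({"FPG": 110, "OGTT2HR": 210, "HbA1C": 5.5})
print(result["risk_category"], result["risk_score"])  # Moderate Risk 3
```

A point-based score like this is easy to audit because every point traces back to a named risk factor, at the cost of ignoring how far a value lies beyond its cutoff.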

In [None]:
# Perform risk stratification
risk_results = analyzer.risk_stratification(clinical_data)

print("Clinical Risk Stratification:")
print("=" * 40)

risk_summary = {}
for patient, risk_info in risk_results.items():
    category = risk_info["risk_category"]
    risk_summary[category] = risk_summary.get(category, 0) + 1
    
    print(f"Patient {patient}:")
    print(f"  • Risk Category: {risk_info['risk_category']}")
    print(f"  • Risk Score: {risk_info['risk_score']} points")

    if risk_info["risk_factors"]:
        print(f"  • Risk Factors: {len(risk_info['risk_factors'])}")
        for factor in risk_info["risk_factors"]:
            print(f"    - {factor}")
    else:
        print("  • No significant risk factors")
    print()

print("Risk Distribution Summary:")
for category, count in risk_summary.items():
    print(f"• {category}: {count} patients")

## 6. Comprehensive Clinical Visualizations

We create publication-ready visualizations to explore clinical patterns and progression.

In [None]:
# Initialize visualizer
visualizer = ClinicalDataVisualizer()

# Create longitudinal plots for key patients
key_patients = ["ZNDMXI3", "ZNED4XZ", "ZOZOWIT"]
patient_descriptions = {
    "ZNDMXI3": "OGTT-First Diabetes Progression",
    "ZNED4XZ": "FPG-First Diabetes Progression",
    "ZOZOWIT": "Infection-Triggered Diabetes"
}

print("Creating longitudinal biomarker plots...")

for patient in key_patients:
    print(f"\nPatient {patient}: {patient_descriptions[patient]}")
    
    fig, ax = visualizer.plot_longitudinal_biomarkers(
        clinical_data, 
        patient,
        title=f"Patient {patient}: {patient_descriptions[patient]}"
    )
    
    plt.tight_layout()
    plt.show()
    
    # Print key observations
    patient_data = clinical_data[clinical_data["PatientID"] == patient]
    print(f"  • Data points: {len(patient_data)}")
    print(f"  • Study duration: {patient_data['Days'].max()} days")
    
    # Check for diabetic values
    diabetic_markers = []
    if not patient_data["FPG (mg/dl)"].isna().all() and (patient_data["FPG (mg/dl)"] >= 126).any():
        diabetic_markers.append("FPG")
    if not patient_data["OGTT2HR (mg/dl)"].isna().all() and (patient_data["OGTT2HR (mg/dl)"] >= 200).any():
        diabetic_markers.append("OGTT2HR")
    if not patient_data["HbA1C (%)"].isna().all() and (patient_data["HbA1C (%)"] >= 6.5).any():
        diabetic_markers.append("HbA1C")
    
    if diabetic_markers:
        print(f"  • Diabetic markers detected: {', '.join(diabetic_markers)}")
    else:
        print("  • No diabetic markers detected")

In [None]:
# Create diabetes progression heatmap
print("\nCreating diabetes progression heatmap...")
fig, ax = visualizer.plot_diabetes_progression_heatmap(progression_results)
plt.show()

# Create biomarker correlation heatmap
print("\nCreating biomarker correlation heatmap...")
fig, ax = visualizer.plot_biomarker_correlations(correlation_results["correlation_matrix"])
plt.show()

# Create risk stratification visualization
print("\nCreating risk stratification visualization...")
fig, axes = visualizer.plot_risk_stratification(risk_results)
plt.show()

## 7. Predictive Modeling for Clinical Outcomes

We apply machine learning to predict diabetes risk based on clinical biomarkers.
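
The cell below reports test accuracy and ROC AUC for the trained classifier. To make those two numbers concrete, the sketch below computes both from a handful of made-up labels and predicted probabilities using only the standard library. The function names and toy values are assumptions for illustration; the course utilities presumably rely on a machine learning library rather than code like this.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of cases whose thresholded prediction matches the true label."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, t in zip(y_prob, y_true) if t == 1]
    neg = [p for p, t in zip(y_prob, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (made up for illustration)
y_true = [0, 0, 0, 1, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.35, 0.8, 0.9]

print(f"Accuracy: {accuracy(y_true, y_prob):.3f}")
print(f"ROC AUC:  {roc_auc(y_true, y_prob):.3f}")
```

Accuracy depends on the chosen probability threshold, whereas ROC AUC summarizes ranking quality across all thresholds. With only a few patients contributing many correlated timepoints, both metrics can look optimistic unless the train/test split is made by patient.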

In [None]:
# Initialize predictive modeling
ml_model = ClinicalPredictiveModeling()

# Prepare features for machine learning
print("Preparing features for machine learning...")
feature_data = ml_model.prepare_features(clinical_data)

print(f"Feature data shape: {feature_data.shape}")
print(f"Diabetes cases: {feature_data['Diabetes'].sum()}")
print(f"Non-diabetes cases: {(feature_data['Diabetes'] == 0).sum()}")

print("\nFeature data preview:")
print(feature_data.head())

# Train diabetes prediction model
print("\nTraining diabetes prediction model...")
ml_results = ml_model.train_diabetes_classifier(feature_data)

print("\nMachine Learning Results:")
print("=" * 40)
print(f"Model Accuracy: {ml_results['test_accuracy']:.3f}")
print(f"ROC AUC Score: {ml_results['roc_auc']:.3f}")

print("\nFeature Importance:")
feature_importance = sorted(ml_results["feature_importance"].items(), key=lambda x: x[1], reverse=True)
for feature, importance in feature_importance:
    print(f"• {feature}: {importance:.3f} ({importance*100:.1f}%)")

print("\nClassification Report:")
print(ml_results["classification_report"])

## 8. Clinical Interpretation and Precision Medicine Applications

We interpret our findings in the context of precision medicine and clinical practice.

In [None]:
# Clinical interpretation summary
print("CLINICAL INTERPRETATION AND PRECISION MEDICINE INSIGHTS")
print("=" * 70)

print("1. DIABETES PROGRESSION PATHWAYS:")
print("   Our analysis reveals three distinct pathways to diabetes:")
print()

# Analyze each pathway
pathway_analysis = {
    "ZNDMXI3": "OGTT-First: Glucose tolerance impairment precedes fasting abnormalities",
    "ZNED4XZ": "FPG-First: Fasting glucose homeostasis disruption occurs first",
    "ZOZOWIT": "Infection-Triggered: Environmental factors trigger metabolic dysfunction"
}

for patient, description in pathway_analysis.items():
    patient_progression = progression_results.get(patient, {})
    print(f"   • {description}")
    
    if patient_progression.get("first_diabetic_marker"):
        first_marker = patient_progression["first_diabetic_marker"]
        first_day = patient_progression["first_diabetic_day"]
        print(f"     - First abnormal marker: {first_marker} (Day {first_day})")
        print(f"     - Total affected markers: {patient_progression['total_diabetic_markers']}")
    print()

print("2. BIOMARKER RELATIONSHIPS:")
print("   Strong correlations reveal underlying metabolic connections:")
for corr in correlation_results["significant_correlations"][:3]:
    print(f"   • {corr['marker1']} ↔ {corr['marker2']}: {corr['strength']} correlation (r={corr['correlation']:.3f})")
print()

print("3. RISK STRATIFICATION INSIGHTS:")
high_risk_patients = [p for p, r in risk_results.items() if r["risk_category"] == "High Risk"]
moderate_risk_patients = [p for p, r in risk_results.items() if r["risk_category"] == "Moderate Risk"]
low_risk_patients = [p for p, r in risk_results.items() if r["risk_category"] == "Low Risk"]

print(f"   • High Risk: {len(high_risk_patients)} patients - Require immediate intervention")
print(f"   • Moderate Risk: {len(moderate_risk_patients)} patients - Need close monitoring")
print(f"   • Low Risk: {len(low_risk_patients)} patients - Routine follow-up appropriate")
print()

print("4. MACHINE LEARNING INSIGHTS:")
top_feature = max(ml_results["feature_importance"].items(), key=lambda x: x[1])
# Map the AUC to a qualitative label before formatting the message
roc_auc = ml_results["roc_auc"]
auc_label = "excellent" if roc_auc > 0.9 else "good" if roc_auc > 0.8 else "moderate"
print(f"   • Model achieves {ml_results['test_accuracy']:.1%} accuracy in diabetes prediction")
print(f"   • Most predictive biomarker: {top_feature[0]} ({top_feature[1]:.1%} importance)")
print(f"   • ROC AUC of {roc_auc:.3f} indicates {auc_label} discriminative ability")
print()

print("5. PRECISION MEDICINE APPLICATIONS:")
print("   This analysis enables several precision medicine approaches:")
print("   • Personalized Risk Assessment: Individual biomarker profiles for tailored risk evaluation")
print("   • Pathway-Specific Interventions: Target specific metabolic pathways based on progression patterns")
print("   • Predictive Monitoring: Use ML models to forecast disease trajectory")
print("   • Treatment Optimization: Monitor intervention effectiveness through biomarker changes")
print("   • Early Detection: Identify pre-diabetic states before clinical symptoms appear")

## 9. Summary and Future Directions

### Key Findings

1. **Multiple Diabetes Pathways**: Our analysis illustrates that patients can progress to diabetes through distinct metabolic pathways, each requiring different monitoring and intervention strategies.

2. **Biomarker Interconnections**: Strong correlations between clinical markers reveal the interconnected nature of glucose metabolism and provide insights into disease mechanisms.

3. **Predictive Power**: Machine learning models can accurately predict diabetes risk using clinical biomarkers, enabling proactive healthcare interventions.

4. **Risk Stratification**: Composite risk scoring effectively categorizes patients, allowing for personalized monitoring and treatment protocols.

### Clinical Impact

This analysis demonstrates how clinical laboratory data can be transformed into actionable insights for precision medicine:

- **Early Detection**: Identify at-risk patients before clinical symptoms develop
- **Personalized Monitoring**: Tailor follow-up schedules based on individual risk profiles
- **Targeted Interventions**: Apply pathway-specific treatments based on progression patterns
- **Treatment Monitoring**: Track intervention effectiveness through biomarker changes

### Future Directions

1. **Integration with Omics Data**: Combine clinical data with genomics, proteomics, and metabolomics for comprehensive patient profiling

2. **Real-Time Monitoring**: Develop continuous glucose monitoring integration with clinical decision support systems

3. **Population Health**: Scale analysis to larger populations for public health insights

4. **Clinical Validation**: Validate predictive models in prospective clinical trials

5. **AI-Driven Insights**: Apply deep learning techniques for pattern recognition and outcome prediction

### Conclusion

Clinical laboratory data, when properly analyzed, provides the foundation for precision medicine approaches that can transform patient care. The iPOP study framework demonstrates how longitudinal biomarker analysis enables personalized healthcare through data-driven insights, predictive modeling, and targeted interventions.

By understanding individual progression patterns and leveraging machine learning for risk prediction, healthcare providers can move from reactive to proactive care, ultimately improving patient outcomes and reducing healthcare costs.