# Alzheimer's Disease Detection & Analysis
## Hack4Health Hackathon Project

**Goal**: Build machine learning models to support early detection, progression forecasting, and interpretability of Alzheimer's Disease risk using real biomedical data.

### Project Objectives:
1. **Early Detection**: Identify Alzheimer's Disease in its early stages
2. **Progression Forecasting**: Predict disease progression patterns  
3. **Interpretability**: Understand factors contributing to disease risk

### Dataset Information:
- **Source**: Real, de-identified biomedical data
- **Location**: `../data/raw/` directory
- **Privacy**: All data is de-identified and compliant with usage guidelines

---


In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Machine Learning Libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import xgboost as xgb
import lightgbm as lgb

# Model Interpretability
import shap
from sklearn.inspection import permutation_importance

# Set plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Configure pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("✅ All libraries imported successfully!")
print(f"📊 NumPy version: {np.__version__}")
print(f"🐼 Pandas version: {pd.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"🎨 Seaborn version: {sns.__version__}")


## 1. Data Loading and Initial Exploration

**Instructions**: 
1. Place your downloaded datasets in the `../data/raw/` directory
2. Update the file paths below to match your actual dataset filenames
3. Modify the data loading code based on your dataset format (CSV, Excel, etc.)
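
Step 3 can be sketched as a small dispatch on the file extension. This is a minimal illustration, not part of the original notebook: `load_table` is a hypothetical helper, the demo file is created on the fly, and the Excel and Parquet branches assume the corresponding optional engines (for example `openpyxl` or `pyarrow`) are installed.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path):
    """Hypothetical helper: pick a pandas reader from the file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(path)
    if suffix in {".tsv", ".txt"}:
        return pd.read_csv(path, sep="\t")
    if suffix in {".xls", ".xlsx"}:
        return pd.read_excel(path)  # needs an Excel engine such as openpyxl
    if suffix == ".parquet":
        return pd.read_parquet(path)  # needs pyarrow or fastparquet
    raise ValueError(f"Unsupported file type: {suffix}")


# Demo on a throwaway TSV file so the sketch runs without the real data
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.tsv"
    demo_path.write_text("chr\tpos\nchr1\t100\nchr2\t200\n")
    demo_df = load_table(demo_path)

print(demo_df.shape)
```

Raising on an unknown extension, rather than guessing a format, keeps a mistyped filename from silently producing a malformed DataFrame.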


In [None]:
# Data Loading - Genomic Alzheimer's Disease Variants
data_path = "../data/raw/"

# Load the genomic datasets
print("🧬 Loading Alzheimer's Disease genomic datasets...")

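# Load TSV file (variant data with positions).
# comment='#' is not used here: the header's first column is '#dbSNP_hg38_chr',
# so pandas would skip the whole header line as a comment.
df_variants = pd.read_csv(f"{data_path}advp.hg38.tsv", sep='\t')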
print(f"✅ Loaded TSV dataset: {df_variants.shape}")

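# Load BED file (genomic regions).
# Assumption: the file follows the BED convention of having no header row,
# so header=None keeps the first region from being consumed as column names.
df_regions = pd.read_csv(f"{data_path}advp.hg38.bed", sep='\t', comment='#', header=None)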
print(f"✅ Loaded BED dataset: {df_regions.shape}")

# Display basic information about the datasets
print(f"\n📊 TSV Dataset Info:")
print(f"   Shape: {df_variants.shape}")
print(f"   Columns: {list(df_variants.columns)}")
print(f"   Memory usage: {df_variants.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print(f"\n📊 BED Dataset Info:")
print(f"   Shape: {df_regions.shape}")
print(f"   Columns: {list(df_regions.columns)}")
print(f"   Memory usage: {df_regions.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Display first few rows
print(f"\n🔍 First 5 rows of TSV dataset:")
print(df_variants.head())

print(f"\n🔍 First 5 rows of BED dataset:")
print(df_regions.head())


In [None]:
# Genomic Data Analysis Functions

def analyze_genomic_variants(df_variants):
    """
    Analyze the genomic variants dataset for Alzheimer's Disease
    """
    print("🧬 GENOMIC VARIANTS ANALYSIS")
    print("=" * 50)
    
    # Basic statistics
    print(f"📊 Dataset Overview:")
    print(f"   Total variants: {len(df_variants):,}")
    print(f"   Unique chromosomes: {df_variants['#dbSNP_hg38_chr'].nunique()}")
    print(f"   Unique SNPs: {df_variants['Top SNP'].nunique()}")
    print(f"   Unique genes: {df_variants['nearest_gene_symb'].nunique()}")
    
    # Chromosome distribution
    print(f"\n📈 Chromosome Distribution:")
    chr_counts = df_variants['#dbSNP_hg38_chr'].value_counts().sort_index()
    print(chr_counts)
    
    # Study types
    print(f"\n🔬 Study Types:")
    study_types = df_variants['Study type'].value_counts()
    print(study_types)
    
    # Phenotypes
    print(f"\n🎯 Phenotypes:")
    phenotypes = df_variants['Phenotype'].value_counts()
    print(phenotypes.head(10))
    
    # P-value distribution
    print(f"\n📊 P-value Statistics:")
    p_values = pd.to_numeric(df_variants['P-value'], errors='coerce')
    print(f"   Significant variants (p < 0.05): {(p_values < 0.05).sum():,}")
    print(f"   Highly significant (p < 0.001): {(p_values < 0.001).sum():,}")
    print(f"   Genome-wide significant (p < 5e-8): {(p_values < 5e-8).sum():,}")
    
    return {
        'total_variants': len(df_variants),
        'unique_snps': df_variants['Top SNP'].nunique(),
        'unique_genes': df_variants['nearest_gene_symb'].nunique(),
        'significant_variants': (p_values < 0.05).sum(),
        'chr_counts': chr_counts,
        'study_types': study_types,
        'phenotypes': phenotypes
    }

def create_genomic_visualizations(df_variants):
    """
    Create visualizations for genomic data
    """
    print("📊 Creating genomic visualizations...")
    
    # Work on a copy so the caller's DataFrame is not modified,
    # then convert P-values to numeric
    df_variants = df_variants.copy()
    df_variants['P_value_numeric'] = pd.to_numeric(df_variants['P-value'], errors='coerce')
    
    # 1. Chromosome distribution
    plt.figure(figsize=(15, 6))
    chr_counts = df_variants['#dbSNP_hg38_chr'].value_counts().sort_index()
    plt.bar(chr_counts.index, chr_counts.values, color='skyblue')
    plt.title('Distribution of Variants Across Chromosomes')
    plt.xlabel('Chromosome')
    plt.ylabel('Number of Variants')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
    
    # 2. P-value distribution (Manhattan plot style)
    plt.figure(figsize=(15, 8))
    
    # Create Manhattan plot
    chr_positions = {}
    cumulative_pos = 0
    
    for chr_name in sorted(df_variants['#dbSNP_hg38_chr'].unique()):
        chr_data = df_variants[df_variants['#dbSNP_hg38_chr'] == chr_name]
        chr_positions[chr_name] = cumulative_pos
        cumulative_pos += chr_data['#dbSNP_hg38_position'].max()
    
    colors = ['red' if i % 2 == 0 else 'blue' for i in range(len(chr_positions))]
    
    for i, (chr_name, start_pos) in enumerate(chr_positions.items()):
        chr_data = df_variants[df_variants['#dbSNP_hg38_chr'] == chr_name]
        x_pos = start_pos + chr_data['#dbSNP_hg38_position']
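        # Missing p-values are plotted at 0; p-values parsed as exactly 0 are
        # clipped to the smallest positive float to avoid infinite heights
        y_values = -np.log10(
            chr_data['P_value_numeric'].fillna(1).clip(lower=np.finfo(float).tiny)
        )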
        
        plt.scatter(x_pos, y_values, c=colors[i], alpha=0.6, s=20)
    
    plt.axhline(y=-np.log10(5e-8), color='red', linestyle='--', alpha=0.7, label='Genome-wide significance')
    plt.axhline(y=-np.log10(0.05), color='orange', linestyle='--', alpha=0.7, label='Nominal significance')
    
    plt.title('Manhattan Plot - Alzheimer\'s Disease Variants')
    plt.xlabel('Genomic Position')
    plt.ylabel('-log10(P-value)')
    plt.legend()
    plt.tight_layout()
    plt.show()
    
    # 3. Study type distribution
    plt.figure(figsize=(12, 6))
    study_counts = df_variants['Study type'].value_counts()
    plt.pie(study_counts.values, labels=study_counts.index, autopct='%1.1f%%')
    plt.title('Distribution of Study Types')
    plt.tight_layout()
    plt.show()
    
    # 4. Top genes by variant count
    plt.figure(figsize=(12, 8))
    top_genes = df_variants['nearest_gene_symb'].value_counts().head(15)
    plt.barh(range(len(top_genes)), top_genes.values)
    plt.yticks(range(len(top_genes)), top_genes.index)
    plt.xlabel('Number of Variants')
    plt.title('Top 15 Genes by Number of Associated Variants')
    plt.tight_layout()
    plt.show()

def analyze_phenotype_associations(df_variants):
    """
    Analyze phenotype associations in the genomic data
    """
    print("🎯 PHENOTYPE ASSOCIATION ANALYSIS")
    print("=" * 50)
    
    # Group by phenotype
    phenotype_groups = df_variants.groupby('Phenotype')
    
    print(f"📊 Phenotype Summary:")
    print(f"   Total phenotypes: {len(phenotype_groups)}")
    
    # Top phenotypes by variant count
    phenotype_counts = phenotype_groups.size().sort_values(ascending=False)
    print(f"\n🔝 Top 10 Phenotypes by Variant Count:")
    print(phenotype_counts.head(10))
    
    # AD-specific analysis
    ad_variants = df_variants[df_variants['Phenotype'] == 'AD']
    print(f"\n🧠 AD-Specific Analysis:")
    print(f"   AD variants: {len(ad_variants):,}")
    print(f"   AD-associated genes: {ad_variants['nearest_gene_symb'].nunique()}")
    
    # Significant AD variants
    ad_p_values = pd.to_numeric(ad_variants['P-value'], errors='coerce')
    significant_ad = ad_variants[ad_p_values < 0.05]
    print(f"   Significant AD variants (p < 0.05): {len(significant_ad):,}")
    
    # Top AD-associated genes
    print(f"\n🧬 Top AD-Associated Genes:")
    ad_gene_counts = ad_variants['nearest_gene_symb'].value_counts().head(10)
    print(ad_gene_counts)
    
    return {
        'phenotype_counts': phenotype_counts,
        'ad_variants': len(ad_variants),
        'ad_genes': ad_variants['nearest_gene_symb'].nunique(),
        'significant_ad': len(significant_ad),
        'top_ad_genes': ad_gene_counts
    }

print("🧬 Genomic analysis functions loaded and ready!")


In [None]:
# Run Genomic Analysis
print("🧬 Starting Genomic Analysis...")

# Analyze the variants dataset
variant_stats = analyze_genomic_variants(df_variants)

# Create visualizations
create_genomic_visualizations(df_variants)

# Analyze phenotype associations
phenotype_stats = analyze_phenotype_associations(df_variants)

print("\n✅ Genomic analysis completed!")
print("📊 Key findings:")
print(f"   • Total variants analyzed: {variant_stats['total_variants']:,}")
print(f"   • Significant variants: {variant_stats['significant_variants']:,}")
print(f"   • AD-specific variants: {phenotype_stats['ad_variants']:,}")
print(f"   • AD-associated genes: {phenotype_stats['ad_genes']:,}")


In [None]:
# Genomic Data Preprocessing for Machine Learning

def prepare_genomic_features(df_variants):
    """
    Prepare genomic features for machine learning
    """
    print("🔧 Preparing genomic features for ML...")
    
    # Create a copy for processing
    df_ml = df_variants.copy()
    
    # Convert P-values to numeric and create significance features
    df_ml['P_value_numeric'] = pd.to_numeric(df_ml['P-value'], errors='coerce')
    df_ml['is_significant'] = (df_ml['P_value_numeric'] < 0.05).astype(int)
    df_ml['is_highly_significant'] = (df_ml['P_value_numeric'] < 0.001).astype(int)
    df_ml['is_genome_wide_significant'] = (df_ml['P_value_numeric'] < 5e-8).astype(int)
    
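    # Create log-transformed P-values. Missing values are treated as p = 1
    # (-log10 = 0); p-values parsed as exactly 0 are clipped to avoid infinities.
    df_ml['log_p_value'] = -np.log10(
        df_ml['P_value_numeric'].fillna(1).clip(lower=np.finfo(float).tiny)
    )
    
    # Extract chromosome number for numerical analysis.
    # Cast to str before using .str so this also works when the column was
    # parsed as numeric; non-numeric chromosomes (e.g. X, Y) become NaN.
    df_ml['chr_numeric'] = pd.to_numeric(
        df_ml['#dbSNP_hg38_chr'].astype(str).str.replace('chr', '', regex=False),
        errors='coerce'
    )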
    
    # Create study type features
    study_type_dummies = pd.get_dummies(df_ml['Study type'], prefix='study_type')
    df_ml = pd.concat([df_ml, study_type_dummies], axis=1)
    
    # Create phenotype features
    phenotype_dummies = pd.get_dummies(df_ml['Phenotype'], prefix='phenotype')
    df_ml = pd.concat([df_ml, phenotype_dummies], axis=1)
    
    # Create consequence type features
    consequence_dummies = pd.get_dummies(df_ml['most_severe_consequence'], prefix='consequence')
    df_ml = pd.concat([df_ml, consequence_dummies], axis=1)
    
    # Create gene-based features
    df_ml['gene_variant_count'] = df_ml.groupby('nearest_gene_symb')['nearest_gene_symb'].transform('count')
    
    # Create chromosome-based features
    df_ml['chr_variant_count'] = df_ml.groupby('#dbSNP_hg38_chr')['#dbSNP_hg38_chr'].transform('count')
    
    print(f"   ✅ Created {df_ml.shape[1]} features from {df_variants.shape[1]} original columns")
    
    return df_ml

def create_ad_prediction_dataset(df_ml):
    """
    Create dataset for AD prediction based on genomic variants
    """
    print("🎯 Creating AD prediction dataset...")
    
    # Focus on AD-related variants
    ad_related = df_ml[
        (df_ml['Phenotype'] == 'AD') | 
        (df_ml['Phenotype-derived'] == 'AD') |
        (df_ml.get('phenotype_AD', 0) == 1)
    ].copy()
    
    print(f"   📊 AD-related variants: {len(ad_related):,}")
    
    # Create binary target: significant AD variants vs non-significant
    ad_related['is_ad_significant'] = (
        (ad_related['Phenotype'] == 'AD') & 
        (ad_related['is_significant'] == 1)
    ).astype(int)
    
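    # Select features for ML.
    # The target is derived from the p-value threshold, so the p-value columns
    # (P_value_numeric, log_p_value, is_significant, is_highly_significant,
    # is_genome_wide_significant) are left out: using them as features would
    # leak the label into the model.
    feature_cols = ['chr_numeric', 'chr_variant_count', 'gene_variant_count']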
    
    # Add study type features
    study_cols = [col for col in ad_related.columns if col.startswith('study_type_')]
    feature_cols.extend(study_cols)
    
    # Add consequence features
    consequence_cols = [col for col in ad_related.columns if col.startswith('consequence_')]
    feature_cols.extend(consequence_cols)
    
    # Filter to existing columns
    feature_cols = [col for col in feature_cols if col in ad_related.columns]
    
    X = ad_related[feature_cols].fillna(0)
    y = ad_related['is_ad_significant']
    
    print(f"   📈 Features: {X.shape[1]}")
    print(f"   🎯 Target distribution: {y.value_counts().to_dict()}")
    
    return X, y, ad_related

# Prepare the data
df_ml_ready = prepare_genomic_features(df_variants)
X_genomic, y_genomic, ad_data = create_ad_prediction_dataset(df_ml_ready)

print("✅ Genomic data preprocessing completed!")


In [None]:
# Machine Learning Analysis on Genomic Data

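# Note: this cell calls helper functions that are defined in other cells
# (train_and_evaluate_models, plot_model_comparison, detailed_model_report,
# feature_importance_analysis, create_final_summary, generate_insights_report).
# Run the cells that define them before running this one.

# Check if we have enough data for ML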
if len(X_genomic) > 100 and y_genomic.sum() > 10:
    print("🤖 Running Machine Learning Analysis on Genomic Data...")
    
    # Split the data
    X_train_genomic, X_test_genomic, y_train_genomic, y_test_genomic = train_test_split(
        X_genomic, y_genomic, test_size=0.2, random_state=42, stratify=y_genomic
    )
    
    print(f"   📊 Training set: {X_train_genomic.shape}")
    print(f"   📊 Test set: {X_test_genomic.shape}")
    
    # Train and evaluate models
    genomic_results = train_and_evaluate_models(
        X_train_genomic, X_test_genomic, y_train_genomic, y_test_genomic
    )
    
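    # Create model comparison plots
    # (plot_model_comparison reads the true test labels from a variable named y_test)
    y_test = y_test_genomic
    plot_model_comparison(genomic_results)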
    
    # Generate detailed reports
    detailed_model_report(genomic_results, y_test_genomic)
    
    # Create final summary
    best_model_name, best_model, summary_df = create_final_summary(genomic_results)
    
    # Feature importance analysis
    feature_importance_df = feature_importance_analysis(X_train_genomic, y_train_genomic)
    
    # Generate insights
    generate_insights_report(summary_df, feature_importance_df)
    
    print("✅ Genomic ML analysis completed!")
    
else:
    print("⚠️  Insufficient data for machine learning analysis")
    print(f"   Total samples: {len(X_genomic)}")
    print(f"   Positive samples: {y_genomic.sum()}")
    print("   Need at least 100 samples and 10 positive cases")


## 8. Load Preprocessed NPZ Dataset

We'll inspect the contents of `preprocessed_alz_data.npz` (keys, shapes, and dtypes) and optionally plug it into the ML pipeline.
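
The inspection step can be sketched on a tiny in-memory archive, so it runs without the real file. The array names (`X_train`, `y_train`) and the helper `find_key` are assumptions made for illustration; the actual keys in `preprocessed_alz_data.npz` may differ.

```python
import io

import numpy as np

# Build a small archive in memory as a stand-in for the real NPZ file
buffer = io.BytesIO()
np.savez(buffer, X_train=np.zeros((4, 3)), y_train=np.array([0, 1, 0, 1]))
buffer.seek(0)

# np.load defaults to allow_pickle=False, which is enough for plain arrays
with np.load(buffer) as archive:
    arrays = {name: archive[name] for name in archive.files}

for name, arr in arrays.items():
    print(f"{name}: shape={arr.shape}, dtype={arr.dtype}")


def find_key(names, candidates):
    """Hypothetical helper: first key matching a candidate, ignoring case."""
    wanted = {c.lower() for c in candidates}
    return next((n for n in names if n.lower() in wanted), None)


X_key_demo = find_key(arrays, ["X", "features", "X_train"])
y_key_demo = find_key(arrays, ["y", "labels", "target", "y_train"])
print(X_key_demo, y_key_demo)
```

Lowercasing both the archive keys and the candidate names avoids the mismatch that arises when only one side is normalised.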


In [None]:
# Inspect NPZ contents
import numpy as np
import os

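npz_path = "../data/raw/preprocessed_alz_data.npz"
X_key = y_key = None  # keep both names defined even if the file is missing

if os.path.exists(npz_path):
    print(f"📦 Loading NPZ: {npz_path}")
    # allow_pickle=True can run arbitrary code stored in the file;
    # only use it for archives from a trusted source
    data = np.load(npz_path, allow_pickle=True)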
    keys = list(data.keys())
    print(f"🔑 Keys: {keys}")
    
    # Show shapes and dtypes
    for k in keys:
        arr = data[k]
        shape = getattr(arr, 'shape', None)
        dtype = getattr(arr, 'dtype', type(arr))
        print(f" - {k}: shape={shape}, dtype={dtype}")
    
    # Heuristics to find features/labels (case-insensitive match on key names;
    # the candidates must be lowercase because the key is lowercased first)
    X_key = next((k for k in keys if k.lower() in ["x", "features", "x_train", "x_test"]), None)
    y_key = next((k for k in keys if k.lower() in ["y", "labels", "target", "y_train", "y_test"]), None)
    
    if X_key is None:
        # fallback: pick first array-like
        for k in keys:
            if hasattr(data[k], 'shape'):
                X_key = k
                break
    
    print(f"\n📌 Selected feature key: {X_key}")
    print(f"📌 Selected target key: {y_key}")
else:
    print(f"❌ NPZ not found at {npz_path}")


In [None]:
# Optional: Train ML models on NPZ data if keys found
try:
    if 'data' in locals() and X_key is not None and y_key is not None:
        X_npz = data[X_key]
        y_npz = data[y_key]
        
        print(f"\n✅ Using NPZ dataset: X={X_npz.shape}, y={y_npz.shape}")
        
        # If y is not 1D, flatten
        if len(y_npz.shape) > 1 and y_npz.shape[1] == 1:
            y_npz = y_npz.ravel()
        
        # Train-test split
        X_train_npz, X_test_npz, y_train_npz, y_test_npz = train_test_split(
            X_npz, y_npz, test_size=0.2, random_state=42, stratify=y_npz if len(np.unique(y_npz)) > 1 else None
        )
        
        # Train models
        npz_results = train_and_evaluate_models(X_train_npz, X_test_npz, y_train_npz, y_test_npz)
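        # plot_model_comparison reads the true test labels from a variable named y_test
        y_test = y_test_npz
        plot_model_comparison(npz_results)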
        detailed_model_report(npz_results, y_test_npz)
        
        best_model_name, best_model, summary_df = create_final_summary(npz_results)
        generate_insights_report(summary_df)
    else:
        print("⚠️  Could not determine feature/label keys in NPZ. Please set X_key and y_key manually.")
except Exception as e:
    print(f"❌ Error running ML on NPZ data: {e}")


In [None]:
# Save NPZ results and best model
import os, json, joblib
from datetime import datetime

results_dir = "../results/"
os.makedirs(results_dir, exist_ok=True)

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

try:
    if 'npz_results' in locals():
        # Save summary CSV and JSON
        _, _, summary_df = create_final_summary(npz_results)
        summary_path = os.path.join(results_dir, f"npz_model_summary_{timestamp}.csv")
        summary_df.to_csv(summary_path, index=False)
        
        simple = {k: {m: float(v[m]) if v[m] is not None else None for m in ['accuracy','auc','cv_mean','cv_std']} for k, v in npz_results.items()}
        json_path = os.path.join(results_dir, f"npz_all_results_{timestamp}.json")
        with open(json_path, 'w') as f:
            json.dump(simple, f, indent=2)
        
        # Save best model
        best_model_name = max(npz_results.items(), key=lambda x: x[1]['accuracy'])[0]
        best_model = npz_results[best_model_name]['model']
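        # Model names such as "Random Forest" contain spaces; replace them for a safer filename
        safe_model_name = best_model_name.replace(' ', '_')
        model_path = os.path.join(results_dir, f"npz_best_model_{safe_model_name}_{timestamp}.pkl")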
        joblib.dump(best_model, model_path)
        
        print(f"💾 Saved: {summary_path}")
        print(f"💾 Saved: {json_path}")
        print(f"💾 Saved: {model_path}")
    else:
        print("⚠️  npz_results not found. Run the NPZ ML cell first.")
except Exception as e:
    print(f"❌ Error saving NPZ results: {e}")


## 9. MRI Scan Data Analysis (Parquet Files)

This section loads and models the MRI scan data stored in the Parquet files `train.parquet` and `test.parquet`, which complement the genomic data for a more comprehensive Alzheimer's analysis.
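
Before modelling a train/test pair like this, it can help to separate the numeric feature columns from the target and to impute the test set with statistics from the training set only. The sketch below uses two small made-up DataFrames; every column name in it (`feature_a`, `feature_b`, `scan_id`, `label`) is hypothetical and not taken from the actual Parquet files.

```python
import numpy as np
import pandas as pd

# Stand-ins for df_train / df_test with invented columns
train_demo = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 5.0],
    "feature_b": [10.0, 20.0, 30.0, 40.0],
    "scan_id": ["s1", "s2", "s3", "s4"],
    "label": [0, 1, 0, 1],
})
test_demo = pd.DataFrame({
    "feature_a": [np.nan, 2.0],
    "feature_b": [15.0, np.nan],
    "scan_id": ["s5", "s6"],
})

target_demo = "label"

# Keep only numeric predictors; identifiers and raw blobs are dropped
numeric_cols = (
    train_demo.drop(columns=[target_demo]).select_dtypes(include="number").columns
)

# Medians come from the training data and are reused for the test data
train_medians = train_demo[numeric_cols].median()
X_train_demo = train_demo[numeric_cols].fillna(train_medians)
X_test_demo = test_demo[numeric_cols].fillna(train_medians)

print(X_test_demo)
```

Reusing the training medians on the test set keeps information from the held-out rows out of the preprocessing step.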


In [None]:
# Load and analyze MRI scan data from parquet files
import pandas as pd
import numpy as np

# Load parquet files
train_path = "../data/raw/train.parquet"
test_path = "../data/raw/test.parquet"

print("🧠 Loading MRI scan datasets...")

try:
    df_train = pd.read_parquet(train_path)
    df_test = pd.read_parquet(test_path)
    
    print(f"✅ Train dataset: {df_train.shape}")
    print(f"✅ Test dataset: {df_test.shape}")
    
    # Display basic information
    print(f"\n📊 Train Dataset Info:")
    print(f"   Shape: {df_train.shape}")
    print(f"   Columns: {list(df_train.columns)}")
    print(f"   Memory usage: {df_train.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    print(f"\n📊 Test Dataset Info:")
    print(f"   Shape: {df_test.shape}")
    print(f"   Columns: {list(df_test.columns)}")
    print(f"   Memory usage: {df_test.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    
    # Display first few rows
    print(f"\n🔍 First 5 rows of train dataset:")
    print(df_train.head())
    
    print(f"\n🔍 First 5 rows of test dataset:")
    print(df_test.head())
    
    # Check for target variable
    if 'target' in df_train.columns or 'label' in df_train.columns or 'diagnosis' in df_train.columns:
        target_col = next(col for col in ['target', 'label', 'diagnosis'] if col in df_train.columns)
        print(f"\n🎯 Target variable found: {target_col}")
        print(f"   Distribution: {df_train[target_col].value_counts().to_dict()}")
    
except Exception as e:
    print(f"❌ Error loading parquet files: {e}")
    print("   Make sure train.parquet and test.parquet are in the ../data/raw/ directory")


In [None]:
# MRI Data Analysis and ML Pipeline
if 'df_train' in locals() and 'df_test' in locals():
    print("🧠 Running MRI Data Analysis...")
    
    # Find target column
    target_candidates = ['target', 'label', 'diagnosis', 'class', 'y']
    target_col = None
    for col in target_candidates:
        if col in df_train.columns:
            target_col = col
            break
    
    if target_col:
        print(f"🎯 Using target column: {target_col}")
        
        # Prepare features and target
        feature_cols = [col for col in df_train.columns if col != target_col]
        X_train_mri = df_train[feature_cols]
        y_train_mri = df_train[target_col]
        
        # Handle test data (might not have target)
        X_test_mri = df_test[feature_cols]
        if target_col in df_test.columns:
            y_test_mri = df_test[target_col]
        else:
            y_test_mri = None
            print("⚠️  Test data has no target column - using for prediction only")
        
        print(f"📊 Training features: {X_train_mri.shape}")
        print(f"📊 Training target: {y_train_mri.shape}")
        print(f"📊 Test features: {X_test_mri.shape}")
        
        # Handle missing values with medians computed once on the training set.
        # numeric_only=True skips non-numeric columns, which have no median.
        train_medians = X_train_mri.median(numeric_only=True)
        X_train_mri = X_train_mri.fillna(train_medians)
        X_test_mri = X_test_mri.fillna(train_medians)
        
        # Train models on MRI data
        if y_test_mri is not None:
            print("🤖 Training models on MRI data...")
            mri_results = train_and_evaluate_models(X_train_mri, X_test_mri, y_train_mri, y_test_mri)
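            # plot_model_comparison reads the true test labels from a variable named y_test
            y_test = y_test_mri
            plot_model_comparison(mri_results)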
            detailed_model_report(mri_results, y_test_mri)
            
            best_model_name, best_model, summary_df = create_final_summary(mri_results)
            generate_insights_report(summary_df)
            
            print("✅ MRI ML analysis completed!")
        else:
            print("📝 Test data ready for prediction (no ground truth available)")
            
    else:
        print("⚠️  No target column found in MRI data")
        print(f"   Available columns: {list(df_train.columns)}")
else:
    print("⚠️  MRI datasets not loaded. Run the previous cell first.")


## 2. Exploratory Data Analysis (EDA)

**This section will be populated once you load your data. It will include:**
- Dataset shape and basic information
- Missing value analysis
- Target variable distribution
- Feature distributions and correlations
- Demographic analysis
- Cognitive assessment patterns
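
The "feature distributions and correlations" item can be sketched as a ranking of feature pairs by absolute Pearson correlation. The data below is synthetic and the column names (`age`, `score`, `noise`) are invented for the example; they do not describe the hackathon dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: score is built to fall as age rises
rng = np.random.default_rng(0)
n = 200
age = rng.normal(72, 8, n)
eda_demo = pd.DataFrame({
    "age": age,
    "score": 30 - 0.2 * (age - 72) + rng.normal(0, 1, n),
    "noise": rng.normal(0, 1, n),
})

corr = eda_demo.corr(numeric_only=True)

# Keep the upper triangle so each pair of features appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=np.abs, ascending=False)
)

print(pairs.head())
```

Sorting by absolute value surfaces strong negative relationships alongside strong positive ones.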


In [None]:
# EDA Functions - Ready to use once data is loaded

def basic_data_info(df, df_name="Dataset"):
    """Display basic information about the dataset"""
    print(f"📊 {df_name} Information:")
    print(f"   Shape: {df.shape}")
    print(f"   Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")
    print(f"   Columns: {list(df.columns)}")
    print(f"   Data types:\n{df.dtypes.value_counts()}")
    
def missing_value_analysis(df):
    """Analyze missing values in the dataset"""
    missing_data = df.isnull().sum()
    missing_percent = (missing_data / len(df)) * 100
    
    missing_df = pd.DataFrame({
        'Missing Count': missing_data,
        'Missing Percentage': missing_percent
    }).sort_values('Missing Percentage', ascending=False)
    
    # Only show columns with missing values
    missing_df = missing_df[missing_df['Missing Count'] > 0]
    
    if len(missing_df) > 0:
        print("🔍 Missing Value Analysis:")
        print(missing_df)
        
        # Visualize missing values
        plt.figure(figsize=(12, 6))
        sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
        plt.title('Missing Values Heatmap')
        plt.tight_layout()
        plt.show()
    else:
        print("✅ No missing values found!")

def plot_target_distribution(df, target_col):
    """Plot distribution of target variable"""
    plt.figure(figsize=(10, 6))
    
    # Count plot
    plt.subplot(1, 2, 1)
    sns.countplot(data=df, x=target_col)
    plt.title(f'Distribution of {target_col}')
    plt.xticks(rotation=45)
    
    # Pie chart
    plt.subplot(1, 2, 2)
    df[target_col].value_counts().plot(kind='pie', autopct='%1.1f%%')
    plt.title(f'Proportion of {target_col}')
    
    plt.tight_layout()
    plt.show()
    
    # Print statistics
    print(f"📈 {target_col} Distribution:")
    print(df[target_col].value_counts())
    print(f"\nProportions:")
    print(df[target_col].value_counts(normalize=True))

print("🔧 EDA functions loaded and ready to use!")
print("💡 Call these functions on your DataFrame once you have loaded your data")


## 3. Data Preprocessing

**This section will include:**
- Handling missing values
- Feature engineering
- Data normalization/scaling
- Train-test split
- Feature selection
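
One way to combine these steps without leaking test-set statistics is a scikit-learn `Pipeline`, where imputation and scaling are fitted on the training split only. This is a sketch on synthetic data under assumed column names (`age`, `sex`, `label`), offered as an alternative pattern rather than as the notebook's own pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with one numeric and one categorical feature
rng = np.random.default_rng(42)
n = 120
prep_demo = pd.DataFrame({
    "age": rng.normal(70, 10, n),
    "sex": rng.choice(["F", "M"], n),
})
prep_demo["label"] = (prep_demo["age"] + rng.normal(0, 5, n) > 70).astype(int)
prep_demo.loc[prep_demo.index[::10], "age"] = np.nan  # inject missing values

X_demo = prep_demo.drop(columns=["label"])
y_demo = prep_demo["label"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Impute and scale numeric columns; one-hot encode categorical columns
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# fit() learns medians, scaling and encoding from the training split only
clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

Because every preprocessing step lives inside the pipeline, the same object can be passed to `cross_val_score` and each fold is preprocessed independently.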


In [None]:
# Data Preprocessing Functions

def preprocess_data(df, target_col, test_size=0.2, random_state=42):
    """
    Comprehensive data preprocessing pipeline
    """
    print("🔧 Starting data preprocessing...")
    
    # Create a copy to avoid modifying original data
    df_processed = df.copy()
    
    # 1. Handle missing values
    print("   📝 Handling missing values...")
    
    # For numerical columns - fill with median (the target column is never imputed)
    numerical_cols = df_processed.select_dtypes(include=[np.number]).columns.drop(
        target_col, errors='ignore'
    )
    df_processed[numerical_cols] = df_processed[numerical_cols].fillna(df_processed[numerical_cols].median())
    
    # For categorical columns - fill with mode
    categorical_cols = df_processed.select_dtypes(include=['object']).columns
    for col in categorical_cols:
        if col != target_col:  # Don't fill target column
            df_processed[col] = df_processed[col].fillna(df_processed[col].mode()[0])
    
    # 2. Encode categorical variables
    print("   🔤 Encoding categorical variables...")
    label_encoders = {}
    for col in categorical_cols:
        if col != target_col:
            le = LabelEncoder()
            df_processed[col] = le.fit_transform(df_processed[col].astype(str))
            label_encoders[col] = le
    
    # 3. Separate features and target
    X = df_processed.drop(columns=[target_col])
    y = df_processed[target_col]
    
    # 4. Train-test split
    print(f"   ✂️  Splitting data (test_size={test_size})...")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    
    # 5. Feature scaling
    print("   📏 Scaling features...")
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    # Convert back to DataFrames
    X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
    X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)
    
    print("✅ Data preprocessing completed!")
    print(f"   Training set shape: {X_train_scaled.shape}")
    print(f"   Test set shape: {X_test_scaled.shape}")
    
    return X_train_scaled, X_test_scaled, y_train, y_test, scaler, label_encoders

def feature_importance_analysis(X_train, y_train, feature_names=None):
    """
    Analyze feature importance using Random Forest
    """
    print("🌲 Analyzing feature importance...")
    
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    # Get feature importance
    importance = rf.feature_importances_
    
    if feature_names is None:
        feature_names = X_train.columns
    
    # Create importance DataFrame
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': importance
    }).sort_values('importance', ascending=False)
    
    # Plot feature importance
    plt.figure(figsize=(12, 8))
    sns.barplot(data=importance_df.head(20), x='importance', y='feature')
    plt.title('Top 20 Feature Importance (Random Forest)')
    plt.xlabel('Importance')
    plt.tight_layout()
    plt.show()
    
    return importance_df

print("🔧 Preprocessing functions loaded and ready!")


## 4. Machine Learning Models

**This section will implement multiple ML models for comparison:**
- Logistic Regression (baseline)
- Random Forest
- XGBoost
- LightGBM
- Support Vector Machine
- Neural Network (optional)
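
A comparison across several models is often easier to trust when every model is scored on the same stratified folds. The sketch below shows that pattern for two of the listed models on a synthetic dataset from `make_classification`; the dataset, the fold count, and the names `candidates` and `cv_summary` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem as a stand-in for real features
X_syn, y_syn = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=42
)

candidates = {
    # Scaling sits inside the pipeline so it is refitted on each training fold
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# The same shuffled, stratified folds are used for every model
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_summary = {}
for name, model in candidates.items():
    scores = cross_validate(
        model, X_syn, y_syn, cv=cv, scoring=["accuracy", "roc_auc"]
    )
    cv_summary[name] = {
        "accuracy": scores["test_accuracy"].mean(),
        "roc_auc": scores["test_roc_auc"].mean(),
    }
    print(
        f"{name}: accuracy={cv_summary[name]['accuracy']:.3f}, "
        f"AUC={cv_summary[name]['roc_auc']:.3f}"
    )
```

Reporting ROC AUC next to accuracy is useful when classes are imbalanced, since accuracy alone can look high for a model that rarely predicts the minority class.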


In [None]:
# Machine Learning Model Pipeline

def train_and_evaluate_models(X_train, X_test, y_train, y_test):
    """
    Train multiple models and compare their performance
    """
    models = {
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
        'XGBoost': xgb.XGBClassifier(random_state=42, eval_metric='logloss'),
        'LightGBM': lgb.LGBMClassifier(random_state=42, verbose=-1),
        'SVM': SVC(random_state=42, probability=True)
    }
    
    results = {}
    
    print("🤖 Training and evaluating models...")
    print("=" * 50)
    
    for name, model in models.items():
        print(f"\n🔄 Training {name}...")
        
        # Train model
        model.fit(X_train, y_train)
        
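        # Make predictions
        y_pred = model.predict(X_test)
        # The AUC computed below assumes a binary target, so the positive-class
        # probability is only extracted when there are exactly two classes
        is_binary = len(np.unique(y_train)) == 2
        y_pred_proba = (
            model.predict_proba(X_test)[:, 1]
            if is_binary and hasattr(model, 'predict_proba') else None
        )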
        
        # Calculate metrics
        accuracy = model.score(X_test, y_test)
        auc_score = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None
        
        # Cross-validation score
        cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
        
        results[name] = {
            'model': model,
            'accuracy': accuracy,
            'auc': auc_score,
            'cv_mean': cv_scores.mean(),
            'cv_std': cv_scores.std(),
            'predictions': y_pred,
            'probabilities': y_pred_proba
        }
        
        print(f"   ✅ Accuracy: {accuracy:.4f}")
        print(f"   📊 AUC: {auc_score:.4f}" if auc_score is not None else "   📊 AUC: N/A")
        print(f"   🔄 CV Score: {cv_scores.mean():.4f} (±{cv_scores.std():.4f})")
    
    return results

def plot_model_comparison(results, y_test=None):
    """
    Create comparison plots for all models.

    y_test holds the true test labels needed for the ROC curves. If it is
    omitted, a notebook-level variable named y_test is used when one exists;
    otherwise the ROC curves are skipped.
    """
    if y_test is None:
        y_test = globals().get('y_test')
    
    # Extract metrics
    model_names = list(results.keys())
    accuracies = [results[name]['accuracy'] for name in model_names]
    aucs = [results[name]['auc'] for name in model_names if results[name]['auc'] is not None]
    cv_means = [results[name]['cv_mean'] for name in model_names]
    cv_stds = [results[name]['cv_std'] for name in model_names]
    
    # Create subplots
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Accuracy comparison
    axes[0, 0].bar(model_names, accuracies, color='skyblue')
    axes[0, 0].set_title('Model Accuracy Comparison')
    axes[0, 0].set_ylabel('Accuracy')
    axes[0, 0].tick_params(axis='x', rotation=45)
    
    # AUC comparison (only for models with probabilities)
    auc_names = [name for name in model_names if results[name]['auc'] is not None]
    if auc_names:
        axes[0, 1].bar(auc_names, aucs, color='lightcoral')
        axes[0, 1].set_title('Model AUC Comparison')
        axes[0, 1].set_ylabel('AUC Score')
        axes[0, 1].tick_params(axis='x', rotation=45)
    
    # Cross-validation scores
    axes[1, 0].bar(model_names, cv_means, yerr=cv_stds, capsize=5, color='lightgreen')
    axes[1, 0].set_title('Cross-Validation Scores')
    axes[1, 0].set_ylabel('CV Score')
    axes[1, 0].tick_params(axis='x', rotation=45)
    
    # ROC Curves (drawn only when the true test labels are available)
    axes[1, 1].plot([0, 1], [0, 1], 'k--', label='Random')
    if y_test is not None:
        for name in auc_names:
            fpr, tpr, _ = roc_curve(y_test, results[name]['probabilities'])
            axes[1, 1].plot(fpr, tpr, label=f'{name} (AUC={results[name]["auc"]:.3f})')
    else:
        print("⚠️  y_test not provided - skipping ROC curves")
    
    axes[1, 1].set_xlabel('False Positive Rate')
    axes[1, 1].set_ylabel('True Positive Rate')
    axes[1, 1].set_title('ROC Curves')
    axes[1, 1].legend()
    
    plt.tight_layout()
    plt.show()

def detailed_model_report(results, y_test):
    """
    Generate detailed classification reports for all models
    """
    print("\n📊 DETAILED MODEL REPORTS")
    print("=" * 60)
    
    for name, result in results.items():
        print(f"\n🔍 {name.upper()}")
        print("-" * 30)
        print(classification_report(y_test, result['predictions']))
        
        # Confusion Matrix
        cm = confusion_matrix(y_test, result['predictions'])
        plt.figure(figsize=(6, 4))
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
        plt.title(f'{name} - Confusion Matrix')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')
        plt.show()

print("🤖 ML model functions loaded and ready!")


## 5. Model Interpretability & Feature Analysis

**This section focuses on understanding what drives the predictions:**
- SHAP values for model interpretability
- Feature importance analysis
- Partial dependence plots
- Individual prediction explanations
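
`shap.TreeExplainer` only supports tree-based models, so the SVM and logistic regression models need a different route to feature importance. One model-agnostic option is `permutation_importance`, which the notebook already imports. The following is a minimal sketch on synthetic data, not the project's actual analysis; the feature names and the choice of `roc_auc` scoring are assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset; column names are hypothetical.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# A model without feature_importances_ and without TreeExplainer support.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and measure the drop in AUC.
perm = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
)

perm_df = (
    pd.DataFrame({
        "feature": X.columns,
        "importance_mean": perm.importances_mean,
        "importance_std": perm.importances_std,
    })
    .sort_values("importance_mean", ascending=False)
    .reset_index(drop=True)
)
print(perm_df)
```

Because the importance is measured on the test split, it reflects what the model relies on for generalization rather than what it memorized during training.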


In [None]:
# Model Interpretability Functions

def shap_analysis(model, X_train, X_test, model_name="Model"):
    """
    Perform SHAP analysis for model interpretability
    """
    print(f"🔍 Performing SHAP analysis for {model_name}...")
    
    try:
        # Create SHAP explainer
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X_test)
        
        # Summary plot
        plt.figure(figsize=(12, 8))
        shap.summary_plot(shap_values, X_test, show=False)
        plt.title(f'{model_name} - SHAP Summary Plot')
        plt.tight_layout()
        plt.show()
        
        # Feature importance plot
        plt.figure(figsize=(10, 6))
        shap.summary_plot(shap_values, X_test, plot_type="bar", show=False)
        plt.title(f'{model_name} - SHAP Feature Importance')
        plt.tight_layout()
        plt.show()
        
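        # Waterfall plot for first prediction.
        # shap.plots.waterfall expects a single-row shap.Explanation object.
        # Classifiers may return per-class SHAP values (a list or a 3-D array);
        # in that case the last (positive) class is explained.
        if isinstance(shap_values, list):
            row_values = shap_values[-1][0]
        elif np.ndim(shap_values) == 3:
            row_values = shap_values[0, :, -1]
        else:
            row_values = shap_values[0]
        explanation = shap.Explanation(
            values=row_values,
            base_values=float(np.ravel(explainer.expected_value)[-1]),
            data=X_test.iloc[0].values,
            feature_names=list(X_test.columns)
        )
        plt.figure(figsize=(12, 6))
        shap.plots.waterfall(explanation, show=False)
        plt.title(f'{model_name} - SHAP Waterfall Plot (First Prediction)')
        plt.tight_layout()
        plt.show()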
        
        return explainer, shap_values
        
    except Exception as e:
        print(f"⚠️  SHAP analysis failed for {model_name}: {str(e)}")
        print("   This might be due to model type incompatibility")
        return None, None

def feature_importance_comparison(models_dict, X_train):
    """
    Compare feature importance across different models
    """
    print("📊 Comparing feature importance across models...")
    
    importance_data = []
    
    for name, model in models_dict.items():
        if hasattr(model, 'feature_importances_'):
            importance = model.feature_importances_
            for i, feature in enumerate(X_train.columns):
                importance_data.append({
                    'model': name,
                    'feature': feature,
                    'importance': importance[i]
                })
    
    if importance_data:
        importance_df = pd.DataFrame(importance_data)
        
        # Create pivot table for comparison
        pivot_df = importance_df.pivot(index='feature', columns='model', values='importance')
        
        # Plot comparison
        plt.figure(figsize=(15, 10))
        sns.heatmap(pivot_df, annot=True, fmt='.3f', cmap='YlOrRd')
        plt.title('Feature Importance Comparison Across Models')
        plt.xlabel('Models')
        plt.ylabel('Features')
        plt.tight_layout()
        plt.show()
        
        return importance_df
    else:
        print("⚠️  No models with feature_importances_ found")
        return None

def partial_dependence_analysis(model, X_train, feature_names, top_features=5):
    """
    Create partial dependence plots for top features
    """
    print(f"📈 Creating partial dependence plots for top {top_features} features...")
    
    # Get feature importance
    if hasattr(model, 'feature_importances_'):
        importance = model.feature_importances_
        feature_importance = list(zip(feature_names, importance))
        feature_importance.sort(key=lambda x: x[1], reverse=True)
        top_feature_names = [f[0] for f in feature_importance[:top_features]]
    else:
        # Use first few features if no importance available
        top_feature_names = feature_names[:top_features]
    
    # Create partial dependence plots
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    axes = axes.flatten()
    
    for i, feature in enumerate(top_feature_names):
        if i < len(axes):
            # Calculate partial dependence
            from sklearn.inspection import partial_dependence
            
            try:
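                pd_result = partial_dependence(model, X_train, [feature])
                # scikit-learn >= 1.3 returns the grid under 'grid_values';
                # older releases used 'values' (removed in 1.5)
                grid_key = 'grid_values' if 'grid_values' in pd_result else 'values'
                
                axes[i].plot(pd_result[grid_key][0], pd_result['average'][0])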
                axes[i].set_title(f'Partial Dependence: {feature}')
                axes[i].set_xlabel(feature)
                axes[i].set_ylabel('Partial Dependence')
                axes[i].grid(True, alpha=0.3)
            except Exception as e:
                axes[i].text(0.5, 0.5, f'Error: {str(e)}', 
                           ha='center', va='center', transform=axes[i].transAxes)
                axes[i].set_title(f'Error: {feature}')
    
    # Hide unused subplots
    for i in range(len(top_feature_names), len(axes)):
        axes[i].set_visible(False)
    
    plt.tight_layout()
    plt.show()

print("🔍 Interpretability functions loaded and ready!")


## 6. Results Summary & Conclusions

**This section will provide:**
- Final model performance comparison
- Key findings and insights
- Clinical relevance of results
- Recommendations for future work
- Model deployment considerations
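
Ranking models on a single test-set accuracy can be sensitive to how the split happened to fall. One alternative is to rank on the cross-validation mean and use AUC as a tie-breaker. The sketch below assumes a `results` dictionary with the same keys that the training cell produces (`accuracy`, `auc`, `cv_mean`, `cv_std`); the model names and all numbers are placeholders for illustration, not results from this project.

```python
import pandas as pd

# Placeholder values for illustration only; NOT results from this project.
results = {
    "Model A": {"accuracy": 0.80, "auc": 0.85, "cv_mean": 0.78, "cv_std": 0.03},
    "Model B": {"accuracy": 0.82, "auc": None, "cv_mean": 0.75, "cv_std": 0.06},
    "Model C": {"accuracy": 0.79, "auc": 0.88, "cv_mean": 0.79, "cv_std": 0.02},
}

summary = pd.DataFrame.from_dict(results, orient="index")

# Keep AUC numeric: a missing AUC becomes NaN instead of the string 'N/A',
# so the column can still be sorted and formatted.
summary["auc"] = summary["auc"].astype(float)

# Rank by cross-validation mean first, then by AUC; models without AUC go last
# among ties.
ranked = summary.sort_values(
    ["cv_mean", "auc"], ascending=False, na_position="last"
)
best_name = ranked.index[0]

print(ranked)
print(f"Selected by CV mean: {best_name}")
```

In this made-up example the model with the highest test accuracy is not the one with the highest cross-validation mean, which is the situation where the two ranking rules disagree.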


In [None]:
# Results Summary and Model Selection

def create_final_summary(results):
    """
    Create a comprehensive summary of all model results
    """
    print("📋 FINAL MODEL PERFORMANCE SUMMARY")
    print("=" * 60)
    
    # Create summary DataFrame
    summary_data = []
    for name, result in results.items():
        summary_data.append({
            'Model': name,
            'Accuracy': result['accuracy'],
            'AUC': result['auc'] if result['auc'] is not None else 'N/A',
            'CV_Mean': result['cv_mean'],
            'CV_Std': result['cv_std']
        })
    
    summary_df = pd.DataFrame(summary_data)
    summary_df = summary_df.sort_values('Accuracy', ascending=False)
    
    print(summary_df.to_string(index=False))
    
    # Find best model
    best_model_name = summary_df.iloc[0]['Model']
    best_model = results[best_model_name]['model']
    
    print(f"\n🏆 BEST PERFORMING MODEL: {best_model_name}")
    print(f"   Accuracy: {summary_df.iloc[0]['Accuracy']:.4f}")
    print(f"   AUC: {summary_df.iloc[0]['AUC']}")
    print(f"   CV Score: {summary_df.iloc[0]['CV_Mean']:.4f} (±{summary_df.iloc[0]['CV_Std']:.4f})")
    
    return best_model_name, best_model, summary_df

def save_results(results, summary_df, output_dir="../results/"):
    """
    Save model results and summary to files
    """
    import os
    import joblib
    from datetime import datetime
    
    # Create results directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
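    # Save summary DataFrame
    summary_df.to_csv(os.path.join(output_dir, f"model_summary_{timestamp}.csv"), index=False)
    
    # Save best model (highest test accuracy), reusing the summary that was
    # passed in instead of recomputing and re-printing it
    best_model_name = summary_df.loc[summary_df['Accuracy'].idxmax(), 'Model']
    best_model = results[best_model_name]['model']
    joblib.dump(best_model, os.path.join(output_dir, f"best_model_{timestamp}.pkl"))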
    
    # Save all results
    results_to_save = {}
    for name, result in results.items():
        results_to_save[name] = {
            'accuracy': result['accuracy'],
            'auc': result['auc'],
            'cv_mean': result['cv_mean'],
            'cv_std': result['cv_std']
        }
    
    import json
    with open(os.path.join(output_dir, f"all_results_{timestamp}.json"), 'w') as f:
        json.dump(results_to_save, f, indent=2)
    
    print(f"💾 Results saved to {output_dir}")
    print(f"   - Model summary: model_summary_{timestamp}.csv")
    print(f"   - Best model: best_model_{timestamp}.pkl")
    print(f"   - All results: all_results_{timestamp}.json")

def generate_insights_report(summary_df, feature_importance_df=None):
    """
    Generate insights and recommendations based on results
    """
    print("\n🔍 KEY INSIGHTS AND RECOMMENDATIONS")
    print("=" * 50)
    
    # Model performance insights
    best_accuracy = summary_df['Accuracy'].max()
    accuracy_range = summary_df['Accuracy'].max() - summary_df['Accuracy'].min()
    
    print(f"📊 Model Performance:")
    print(f"   • Best accuracy achieved: {best_accuracy:.4f}")
    print(f"   • Performance range: {accuracy_range:.4f}")
    
    if accuracy_range < 0.05:
        print(f"   • Models show consistent performance (low variance)")
    else:
        print(f"   • Significant performance differences between models")
    
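    # Feature importance insights
    if feature_importance_df is not None:
        # Importance scales differ between libraries, so normalize within each
        # model before averaging across models and ranking
        normalized = feature_importance_df['importance'] / (
            feature_importance_df.groupby('model')['importance'].transform('sum')
        )
        top_features = (
            normalized.groupby(feature_importance_df['feature'])
            .mean()
            .nlargest(10)
        )
        print(f"\n🎯 Top Contributing Features (mean normalized importance):")
        for feature, importance in top_features.items():
            print(f"   • {feature}: {importance:.4f}")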
    
    # Clinical relevance (the accuracy thresholds are rough heuristics, not clinical standards)
    print(f"\n🏥 Clinical Relevance:")
    if best_accuracy > 0.85:
        print(f"   • Strong test-set performance; external validation is needed before clinical use")
    elif best_accuracy > 0.75:
        print(f"   • Good test-set performance; may be suitable for screening after validation")
    else:
        print(f"   • Moderate performance, may need additional features")
    
    print(f"\n🔬 Recommendations for Future Work:")
    print(f"   • Collect additional biomarkers and cognitive assessments")
    print(f"   • Implement longitudinal data collection")
    print(f"   • Validate on independent datasets")
    print(f"   • Consider ensemble methods for improved performance")

print("📋 Results summary functions loaded and ready!")


## 7. Next Steps & Usage Instructions

### 🚀 Getting Started

1. **Download your datasets** and place them in the `../data/raw/` directory
2. **Update the data loading section** (Section 1) with your actual file paths
3. **Run the notebook cells sequentially** to perform the analysis
4. **Customize the analysis** based on your specific dataset characteristics
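
For steps 1 and 2, a small loader can make the file paths explicit and fail early when the directory is empty. This is a sketch that assumes the raw data arrives as CSV files; `load_raw_tables` is a hypothetical helper, not a function defined elsewhere in this notebook.

```python
from pathlib import Path

import pandas as pd


def load_raw_tables(raw_dir="../data/raw/"):
    """Hypothetical helper: read every CSV in raw_dir into a dict of DataFrames.

    Keys are the file names without the .csv extension.
    """
    raw_path = Path(raw_dir)
    if not raw_path.is_dir():
        raise FileNotFoundError(f"Raw data directory not found: {raw_path}")

    tables = {p.stem: pd.read_csv(p) for p in sorted(raw_path.glob("*.csv"))}
    if not tables:
        raise FileNotFoundError(f"No CSV files found in: {raw_path}")
    return tables


# Example usage (uncomment once the files are in place):
# tables = load_raw_tables("../data/raw/")
# for name, table in tables.items():
#     print(name, table.shape)
```

Other formats (Excel, Parquet, and so on) would need the matching pandas reader in place of `pd.read_csv`.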

### 📝 Customization Tips

- **Target Variable**: Update the `target_col` parameter in preprocessing functions
- **Feature Engineering**: Add domain-specific features in the preprocessing section
- **Model Selection**: Modify the models dictionary to include/exclude specific algorithms
- **Visualization**: Customize plots based on your data characteristics
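
The model selection tip can be sketched as a small registry from which models are included or excluded by name. The names `MODEL_FACTORIES` and `build_models` are hypothetical, and the registry below lists only three scikit-learn classifiers as an assumption; the notebook's own dictionary also contains LightGBM and possibly others.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical registry. Factories (not instances) are stored so that every
# call to build_models returns fresh, unfitted estimators.
MODEL_FACTORIES = {
    "Random Forest": lambda: RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": lambda: LogisticRegression(max_iter=1000, random_state=42),
    "SVM": lambda: SVC(probability=True, random_state=42),
}


def build_models(include=None, exclude=()):
    """Return a {name: estimator} dict restricted to the requested models."""
    names = list(MODEL_FACTORIES) if include is None else list(include)
    unknown = [name for name in names if name not in MODEL_FACTORIES]
    if unknown:
        raise KeyError(f"Unknown model name(s): {unknown}")
    return {name: MODEL_FACTORIES[name]() for name in names if name not in exclude}


# Skip the SVM, which is usually the slowest of the three on larger datasets.
models = build_models(exclude=("SVM",))
print(list(models))
```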

### 🔧 Troubleshooting

- **Memory Issues**: Reduce dataset size or use data sampling for initial exploration
- **Missing Libraries**: Install missing packages using `pip install -r requirements.txt`
- **Data Format Issues**: Modify the data loading code based on your file format
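
For the memory tip, sampling the same fraction from every class keeps the class balance intact during initial exploration. A minimal sketch on synthetic data follows; `stratified_sample` is a hypothetical helper and `diagnosis` is an assumed target column name.

```python
import numpy as np
import pandas as pd


def stratified_sample(df, target_col, frac=0.25, random_state=42):
    """Hypothetical helper: sample the same fraction of rows from every class."""
    return df.groupby(target_col).sample(frac=frac, random_state=random_state)


# Synthetic stand-in: 800 rows of class 0 and 200 rows of class 1.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "diagnosis": [0] * 800 + [1] * 200,
})

subset = stratified_sample(df, target_col="diagnosis", frac=0.25)
print(subset["diagnosis"].value_counts())
```

A plain `df.sample(frac=...)` would also reduce memory use, but with imbalanced classes it can leave very few minority-class rows in the subset.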

### 📊 Expected Outputs

- Comprehensive EDA visualizations
- Model performance comparisons
- Feature importance analysis
- SHAP interpretability plots
- Results saved to `../results/` directory
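
Since `save_results` writes files named `best_model_<timestamp>.pkl`, a saved model can be picked up again in a later session. The sketch below assumes that naming pattern; `load_latest_model` is a hypothetical helper. Note that `joblib.load` unpickles arbitrary objects, so it should only be used on files from a trusted source.

```python
from pathlib import Path

import joblib


def load_latest_model(output_dir="../results/"):
    """Hypothetical helper: load the most recent best_model_<timestamp>.pkl."""
    candidates = sorted(Path(output_dir).glob("best_model_*.pkl"))
    if not candidates:
        raise FileNotFoundError(f"No saved models found in: {output_dir}")
    # Timestamps use the YYYYMMDD_HHMMSS format, so lexicographic order
    # matches chronological order and the last entry is the newest.
    latest = candidates[-1]
    return joblib.load(latest), latest


# Example usage (uncomment after save_results has been run):
# model, path = load_latest_model("../results/")
# print(f"Loaded {type(model).__name__} from {path.name}")
```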

---

**Ready to analyze your Alzheimer's Disease data! 🧠🔬**
