# 🧠 Alzheimer's Disease Prediction - Colab Training Notebook

This notebook provides a complete training pipeline for Alzheimer's Disease prediction with:
- Data loading with automatic fallback
- Runtime validation (GPU detection, version checks)
- Multiple ML model training
- Bootstrap confidence intervals

## 📋 Colab Setup Instructions

1. **Restart Runtime**: Runtime → Restart runtime (keyboard shortcut: Ctrl+M, then `.`)
2. **Enable GPU**: Runtime → Change runtime type → Hardware accelerator → GPU
3. **Mount Google Drive** (optional): Only needed if your data files are stored in Drive
   ```python
   from google.colab import drive
   drive.mount('/content/drive')
   ```

4. **Clone Repository**: Run the repository setup cell to get all code from the repo
5. **Install Dependencies**: Run the setup cell below
6. **Upload Data** (optional): Upload your `preprocessed_data.npz` or `fallback_data.csv` file to Colab
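
The data-loading cell further down accepts an NPZ archive that contains either `X` and `y` arrays or the four arrays `X_train`, `X_test`, `y_train`, `y_test`. A minimal sketch of writing a compatible file with `np.savez` follows; the random arrays and their shapes are placeholders, not the project's real data:

```python
import numpy as np

# Placeholder data: 100 samples, 5 features, 3 classes (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 3, size=100)

# Write the X/y layout, which the loader checks for first
np.savez("preprocessed_data.npz", X=X, y=y)

# Read the archive back to confirm its keys and shapes
with np.load("preprocessed_data.npz") as data:
    keys = sorted(data.files)
    shapes = {k: data[k].shape for k in keys}

print(keys, shapes)
```

Saving plain numeric arrays this way keeps the archive free of pickled objects.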

## 📥 Repository Setup


In [None]:
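# Step 1: Clone repository (Colab only - skip if running locally)
import os

try:
    import google.colab

    if os.path.basename(os.getcwd()) == 'Alzheimer-s':
        # Cell was re-run from inside the repo: pull instead of cloning a nested copy
        print("📁 Already inside the repository, pulling latest changes...")
        !git pull origin main
        print("✅ Pulled latest changes")
    elif os.path.exists('Alzheimer-s'):
        print("📁 Directory already exists, pulling latest changes...")
        %cd Alzheimer-s
        !git pull origin main
        print("✅ Pulled latest changes")
    else:
        !git clone https://github.com/Arnabs-ops/Alzheimer-s.git
        %cd Alzheimer-s
        print("✅ Repository cloned and directory changed")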
except ImportError:
    print("ℹ️ Running locally - skipping git clone")
except Exception as e:
    print(f"⚠️ Git operation failed: {e}")
    print("💡 Trying to continue with existing directory...")
    if os.path.exists('Alzheimer-s'):
        %cd Alzheimer-s
        print("✅ Changed to existing directory")
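
Shell magics such as `!git clone` do not raise a Python exception when the command fails, so the `except Exception` branch above will not catch a failed clone. A plain-Python sketch of the same clone-or-pull decision using `subprocess` is shown below; the helper names `build_git_command` and `sync_repo` are hypothetical, and the `main` branch name is taken from the cell above:

```python
import subprocess
from pathlib import Path

REPO_URL = "https://github.com/Arnabs-ops/Alzheimer-s.git"  # from the cell above

def build_git_command(repo_url, dest):
    """Return the git command to run: pull if dest is already a clone, else clone."""
    dest = Path(dest)
    if (dest / ".git").is_dir():
        return ["git", "-C", str(dest), "pull", "origin", "main"]
    return ["git", "clone", repo_url, str(dest)]

def sync_repo(repo_url, dest):
    """Run the command; check=True raises CalledProcessError if git fails."""
    subprocess.run(build_git_command(repo_url, dest), check=True)

# Inspect the chosen command without touching the network
print(build_git_command(REPO_URL, "Alzheimer-s"))
```

Calling `sync_repo(REPO_URL, "Alzheimer-s")` would then perform the clone or pull and surface any git failure as an exception.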


## 🔧 Setup & Installation

In [None]:
# Install dependencies
!pip install -q scikit-learn xgboost lightgbm optuna numpy pandas matplotlib seaborn

## 🔍 Runtime Validation

In [None]:
import sys
import platform
import warnings
warnings.filterwarnings('ignore')

print("🔍 Runtime Validation")
print("=" * 50)

# Python version
print(f"🐍 Python version: {sys.version.split()[0]}")
print(f"💻 Platform: {platform.system()} {platform.release()}")

# GPU detection
try:
    import torch
    if torch.cuda.is_available():
        print(f"⚡ GPU detected: {torch.cuda.get_device_name(0)}")
        print(f"📊 GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    else:
        print("⚠️ No GPU detected - using CPU")
except ImportError:
    try:
        import tensorflow as tf
        if tf.config.list_physical_devices('GPU'):
            gpu = tf.config.list_physical_devices('GPU')[0]
            print(f"⚡ GPU detected: {gpu}")
        else:
            print("⚠️ No GPU detected - using CPU")
    except ImportError:
        print("⚠️ PyTorch/TensorFlow not installed - cannot detect GPU")
        print("💡 To enable GPU: Runtime → Change runtime type → GPU")

# Package versions
print("\n📦 Package Versions:")
packages = ['sklearn', 'xgboost', 'lightgbm', 'optuna']
for pkg in packages:
    try:
        mod = __import__(pkg)
        version = getattr(mod, '__version__', 'unknown')
        print(f"  ✅ {pkg}: {version}")
    except ImportError:
        print(f"  ❌ {pkg}: not installed")

print("\n✅ Runtime validation complete")
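
When neither PyTorch nor TensorFlow is installed, the cell above cannot tell whether a GPU is attached. One possible fallback, sketched here under the assumption that the NVIDIA driver tools are on the `PATH`, is to query `nvidia-smi` directly; the helper name `detect_gpu_with_nvidia_smi` is hypothetical:

```python
import shutil
import subprocess

def detect_gpu_with_nvidia_smi():
    """Return a list of 'name, memory' strings, or an empty list if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # tool not installed, so no NVIDIA GPU can be reported
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []  # tool present but failed (e.g. no driver loaded)
    return [line.strip() for line in out.splitlines() if line.strip()]

gpus = detect_gpu_with_nvidia_smi()
print(gpus if gpus else "No NVIDIA GPU visible")
```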

## 📊 Data Loading with Fallback

In [None]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

print("📦 Loading data...")

# Debug: List available files
print("\n🔍 Checking available files...")
print(f"   Current directory: {os.getcwd()}")

# Check both current directory and /content (where Colab uploads files)
print(f"   Files in current directory:")
try:
    current_files = [f for f in os.listdir('.') if f.endswith(('.npz', '.csv'))]
    for f in current_files:
        print(f"      ✅ {f}")
    if not current_files:
        print("      (none found)")
except OSError as e:
    print(f"      (could not list directory: {e})")

# Also check /content directory (where files are uploaded before cd)
print(f"   Files in /content directory:")
try:
    if os.path.exists('/content'):
        content_files = [f for f in os.listdir('/content') if f.endswith(('.npz', '.csv'))]
        for f in content_files:
            print(f"      ✅ /content/{f}")
        if content_files:
            print(f"   💡 Found files in /content - will check there too")
except OSError:
    pass

X = None
y = None
data_source = None

# First, check current directory for ANY .npz files
npz_files_in_dir = []
try:
    npz_files_in_dir = [f for f in os.listdir('.') if f.endswith('.npz')]
except OSError:
    pass

# Also check /content directory (where Colab uploads files)
npz_files_in_content = []
try:
    if os.path.exists('/content'):
        npz_files_in_content = [f'/content/{f}' for f in os.listdir('/content') if f.endswith('.npz')]
except OSError:
    pass

if npz_files_in_dir or npz_files_in_content:
    total = len(npz_files_in_dir) + len(npz_files_in_content)
    print(f"   📁 Found {total} NPZ file(s):")
    for f in npz_files_in_dir:
        print(f"      - {f} (current dir)")
    for f in npz_files_in_content:
        print(f"      - {f} (/content)")

# Build list of paths to check (prioritize files in current directory, then /content)
npz_paths = []
# Add any .npz files found in current directory first
npz_paths.extend(npz_files_in_dir)
# Add files from /content
npz_paths.extend(npz_files_in_content)
# Then add standard paths (both in current dir and /content)
npz_paths.extend([
    'preprocessed_data.npz',  # Standard name (current dir)
    'preprocessed_alz_data.npz',  # Alternative name (current dir)
    '/content/preprocessed_data.npz',  # Standard name (/content)
    '/content/preprocessed_alz_data.npz',  # Alternative name (/content)
    'data/processed/preprocessed_alz_data.npz',  # Repo structure
    'data/processed/preprocessed_data.npz',  # Alternative repo path
])
# Remove duplicates while preserving order
npz_paths = list(dict.fromkeys(npz_paths))

csv_paths = [
    'fallback_data.csv',  # Current directory (uploaded)
    'data/processed/alz_clean.csv',  # Repo structure
    'data/processed/fallback_data.csv',  # Alternative repo path
]

# Try loading NPZ file from multiple locations
data_file = None
for path in npz_paths:
    if os.path.exists(path):
        data_file = path
        break

if data_file:
    try:
        data = np.load(data_file, allow_pickle=True)
        data_keys = list(data.keys())
        
        # Check for X/y format
        if 'X' in data and 'y' in data:
            X = data['X']
            y = data['y']
            data_source = 'NPZ'
            print(f"✅ Data loaded from {data_file} (X/y format)")
            print(f"   Shape: X={X.shape}, y={y.shape}")
        # Check for train/test split format
        elif 'X_train' in data and 'X_test' in data and 'y_train' in data and 'y_test' in data:
            X_train = data['X_train']
            X_test = data['X_test']
            y_train = data['y_train']
            y_test = data['y_test']
            
            # Merge train and test for initial processing
            X = np.vstack([X_train, X_test])
            y = np.concatenate([y_train, y_test])
            
            data_source = 'NPZ_SPLIT'
            print(f"✅ Data loaded from {data_file} (train/test split format)")
            print(f"   Train: X={X_train.shape}, y={y_train.shape}")
            print(f"   Test: X={X_test.shape}, y={y_test.shape}")
            print(f"   Combined: X={X.shape}, y={y.shape}")
            
            # Skip the train_test_split later since we already have splits
            has_existing_split = True
        else:
            print("⚠️ NPZ file found but missing expected keys")
            print(f"   Available keys: {data_keys}")
            print("   Expected: ['X', 'y'] OR ['X_train', 'X_test', 'y_train', 'y_test']")
    except Exception as e:
        print(f"⚠️ Error loading NPZ from {data_file}: {e}")
        print("   Trying CSV fallback...")
        data_file = None
        X = None
        y = None
else:
    print("⚠️ NPZ file not found in any location, trying CSV fallback...")
    
# Use the existing split only if the NPZ file supplied one and loading succeeded.
# Deriving the flag from data_source avoids a NameError for X/y-format files.
has_existing_split = (data_source == 'NPZ_SPLIT') and X is not None and y is not None

# Fallback to CSV
if X is None or y is None:
    csv_file = None
    for path in csv_paths:
        if os.path.exists(path):
            csv_file = path
            break
    
    if csv_file:
        try:
            df = pd.read_csv(csv_file)
            print(f"✅ Data loaded from {csv_file}")
            print(f"   Shape: {df.shape}")
            
            # Assume last column is target
            y = df.iloc[:, -1].values
            X = df.iloc[:, :-1].values
            data_source = 'CSV'
            print(f"   Extracted: X={X.shape}, y={y.shape}")
        except Exception as e:
            print(f"❌ Error loading CSV from {csv_file}: {e}")
            raise
    else:
        print("❌ CSV file not found in any location")
        print("💡 Creating sample data for demonstration...")
        # Create sample data
        np.random.seed(42)
        X = np.random.randn(1000, 50)
        y = np.random.choice([0, 1, 2], 1000)
        data_source = 'SAMPLE'
        print(f"   Generated sample: X={X.shape}, y={y.shape}")

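# Handle multi-dimensional y (column vectors and one-hot encodings)
def flatten_labels(labels):
    labels = np.asarray(labels)
    if labels.ndim > 1:
        return labels.ravel() if labels.shape[1] == 1 else np.argmax(labels, axis=1)
    return labels

y = flatten_labels(y)
if has_existing_split:
    # The pre-split label arrays need the same treatment as the merged y
    y_train, y_test = flatten_labels(y_train), flatten_labels(y_test)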

# Data preprocessing
print(f"\n🔧 Preprocessing data (source: {data_source})...")

if has_existing_split:
    # Preprocess train and test separately (preserving splits)
    print("   Processing train and test sets separately...")
    
    # Treat infinities as missing values
    X_train = np.where(np.isinf(X_train), np.nan, X_train)
    X_test = np.where(np.isinf(X_test), np.nan, X_test)
    
    # Impute with medians learned from the train set only, then reuse them on test
    if np.any(np.isnan(X_train)) or np.any(np.isnan(X_test)):
        print("   Cleaning NaN and infinity values (imputer fit on train set)...")
        imputer = SimpleImputer(strategy='median')
        X_train = imputer.fit_transform(X_train)
        X_test = imputer.transform(X_test)
    
    # Scale features (fit on train, transform test)
    print("   Scaling features (fit on train, transform test)...")
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    print(f"✅ Using existing train/test splits from NPZ file")
else:
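    # Split first so the imputer and scaler are fit on training data only
    # (fitting them on the combined data would leak test statistics into training)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print("✅ Created new train/test split")
    
    # Handle NaN and infinity
    if np.any(np.isnan(X)) or np.any(np.isinf(X)):
        print("   Cleaning NaN and infinity values...")
        imputer = SimpleImputer(strategy='median')
        X_train = imputer.fit_transform(np.where(np.isinf(X_train), np.nan, X_train))
        X_test = imputer.transform(np.where(np.isinf(X_test), np.nan, X_test))
    
    # Scale features (fit on train, transform test)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)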

print(f"\n✅ Data preprocessing complete")
print(f"   Train: X={X_train.shape}, y={y_train.shape}")
print(f"   Test: X={X_test.shape}, y={y_test.shape}")
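# np.unique handles float and string labels, which np.bincount rejects
classes, counts = np.unique(y_train, return_counts=True)
print(f"   Classes: {len(classes)} (distribution: {dict(zip(classes.tolist(), counts.tolist()))})")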
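
The "fit on train, transform test" rule used in the cell above means that every statistic (imputation medians, scaling means and standard deviations) is computed from the training rows only and then reused on the test rows. A NumPy-only sketch of that idea with tiny made-up arrays is given below; it illustrates the principle and is not a replacement for `SimpleImputer` and `StandardScaler`:

```python
import numpy as np

# Tiny illustrative arrays (made up); NaN marks a missing value
X_tr = np.array([[1.0, 10.0], [3.0, np.nan], [5.0, 30.0]])
X_te = np.array([[np.nan, 20.0], [7.0, 40.0]])

# 1. Learn statistics from the training rows only
medians = np.nanmedian(X_tr, axis=0)              # per-column median, ignoring NaN
X_tr_filled = np.where(np.isnan(X_tr), medians, X_tr)
mean = X_tr_filled.mean(axis=0)
std = X_tr_filled.std(axis=0)

# 2. Apply those same statistics to both sets
X_te_filled = np.where(np.isnan(X_te), medians, X_te)
X_tr_scaled = (X_tr_filled - mean) / std
X_te_scaled = (X_te_filled - mean) / std

print("train medians:", medians)
print("scaled test rows:\n", X_te_scaled)
```

The scaled training columns have zero mean by construction, while the test columns generally do not, which is expected.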

## 🤖 Model Training Loop

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import xgboost as xgb
import lightgbm as lgb

print("🤖 Training Models")
print("=" * 50)

# Define models
models = {
    'Random Forest': RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42,
        n_jobs=-1
    ),
    'XGBoost': xgb.XGBClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        random_state=42,
        eval_metric='logloss',
        verbosity=0
    ),
    'LightGBM': lgb.LGBMClassifier(
        n_estimators=100,
        max_depth=6,
        learning_rate=0.1,
        random_state=42,
        verbose=-1
    ),
    'SVM': SVC(
        kernel='rbf',
        probability=True,
        random_state=42
    ),
    'Logistic Regression': LogisticRegression(
        max_iter=1000,
        random_state=42,
        n_jobs=-1
    )
}

# Train and evaluate each model
results = {}

for name, model in models.items():
    print(f"\n🔁 Training {name}...")
    
    try:
        # Train
        model.fit(X_train, y_train)
        
        # Predict
        y_pred = model.predict(X_test)
        
        # Calculate accuracy
        accuracy = accuracy_score(y_test, y_pred)
        
        results[name] = {
            'model': model,
            'accuracy': accuracy,
            'predictions': y_pred
        }
        
        print(f"   ✅ Accuracy: {accuracy:.4f}")
        
    except Exception as e:
        print(f"   ❌ Error: {e}")
        import traceback
        traceback.print_exc()

print("\n" + "=" * 50)
print("📊 Model Performance Summary:")
print("=" * 50)

# Sort by accuracy
sorted_results = sorted(results.items(), key=lambda x: x[1]['accuracy'], reverse=True)

for name, res in sorted_results:
    print(f"{name:20s}: {res['accuracy']:.4f}")

if sorted_results:
    best_name, best_result = sorted_results[0]
    print(f"\n🏆 Best Model: {best_name} (Accuracy: {best_result['accuracy']:.4f})")
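
If a model fails in the loop above with a complaint about class labels, one possible cause is label encoding: recent XGBoost releases expect the classes to be the consecutive integers `0..n_classes-1`, while the CSV fallback uses whatever the last column holds. A small sketch, assuming that is the problem, of remapping arbitrary labels with `np.unique`; the helper name `encode_labels` and the example labels are made up:

```python
import numpy as np

def encode_labels(y_train, y_test):
    """Map labels to 0..n_classes-1 using the classes seen in y_train."""
    classes, y_train_enc = np.unique(y_train, return_inverse=True)
    # searchsorted is valid here because np.unique returns sorted classes
    y_test_enc = np.searchsorted(classes, y_test)
    # Guard against test labels that never appeared in training
    idx = np.clip(y_test_enc, 0, len(classes) - 1)
    if not np.array_equal(classes[idx], np.asarray(y_test)):
        raise ValueError("y_test contains labels not present in y_train")
    return y_train_enc, y_test_enc, classes

y_tr, y_te, classes = encode_labels(np.array([1, 2, 3, 2]), np.array([3, 1]))
print(y_tr, y_te, classes)
```

Predictions made on the encoded labels can be mapped back to the original values with `classes[y_pred]`.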

## 📈 Bootstrap Confidence Intervals (Optional)

In [None]:
def bootstrap_confidence_interval(model, X_test, y_test, n_bootstrap=300, confidence=0.95):
    """
    Calculate bootstrap confidence intervals for model accuracy.
    """
    n_samples = len(y_test)
    accuracies = []
    
    print(f"🔄 Running {n_bootstrap} bootstrap iterations...")
    
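    # Predictions on a fixed test set are deterministic, so predict once
    # and resample the per-sample correctness instead of re-predicting
    correct = np.asarray(model.predict(X_test)) == np.asarray(y_test)
    rng = np.random.default_rng(42)  # seeded so the interval is reproducible
    
    for i in range(n_bootstrap):
        # Bootstrap sample (indices drawn with replacement)
        indices = rng.integers(0, n_samples, size=n_samples)
        
        # Accuracy is the fraction of correct predictions in the resample
        accuracies.append(correct[indices].mean())
        
        # Progress indicator
        if (i + 1) % 50 == 0:
            print(f"   Completed {i + 1}/{n_bootstrap} iterations")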
    
    # Calculate statistics
    accuracies = np.array(accuracies)
    mean_acc = np.mean(accuracies)
    std_acc = np.std(accuracies)
    
    # Calculate confidence interval
    alpha = 1 - confidence
    lower = np.percentile(accuracies, 100 * alpha / 2)
    upper = np.percentile(accuracies, 100 * (1 - alpha / 2))
    
    return {
        'mean': mean_acc,
        'std': std_acc,
        'lower': lower,
        'upper': upper,
        'confidence': confidence
    }

# Run bootstrap CI for top models
if results:
    print("\n📊 Bootstrap Confidence Intervals for Top Models")
    print("=" * 50)
    
    # Get top 3 models
    top_models = sorted_results[:3]
    
    bootstrap_results = {}
    
    for name, res in top_models:
        print(f"\n🔍 Analyzing {name}...")
        ci = bootstrap_confidence_interval(res['model'], X_test, y_test, n_bootstrap=300)
        bootstrap_results[name] = ci
        
        print(f"   Mean Accuracy: {ci['mean']:.4f} ± {ci['std']:.4f}")
        print(f"   {int(ci['confidence']*100)}% CI: [{ci['lower']:.4f}, {ci['upper']:.4f}]")
    
    print("\n✅ Bootstrap analysis complete")
else:
    print("⚠️ No trained models available for bootstrap analysis")
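
The percentile interval computed in `bootstrap_confidence_interval` takes the α/2 and 1 − α/2 quantiles of the resampled accuracies, where α = 1 − confidence. A worked example on a made-up, evenly spaced array (not real results) shows which percentiles a 95% interval uses:

```python
import numpy as np

# Made-up accuracies: 101 evenly spaced values from 0.80 to 0.90
accuracies = np.linspace(0.80, 0.90, 101)

confidence = 0.95
alpha = 1 - confidence                 # share of mass left outside the interval
lower_pct = 100 * alpha / 2            # lower cut-off percentile
upper_pct = 100 * (1 - alpha / 2)      # upper cut-off percentile

lower = np.percentile(accuracies, lower_pct)
upper = np.percentile(accuracies, upper_pct)
print(f"{lower_pct:.1f}th / {upper_pct:.1f}th percentiles: [{lower:.4f}, {upper:.4f}]")
```

With more bootstrap iterations the two tail percentiles are estimated from more samples, so the interval endpoints become more stable.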

## 💾 Save Results (Optional)

In [None]:
import joblib
import json
from datetime import datetime

# Save best model
if results:
    best_name, best_result = sorted_results[0]
    
    # Save model
    model_filename = f'best_model_{best_name.replace(" ", "_")}.pkl'
    joblib.dump(best_result['model'], model_filename)
    print(f"💾 Saved best model: {model_filename}")
    
    # Save results summary
    summary = {
        'timestamp': datetime.now().isoformat(),
        'best_model': best_name,
        'best_accuracy': float(best_result['accuracy']),
        'all_results': {name: float(res['accuracy']) for name, res in results.items()}
    }
    
    if 'bootstrap_results' in locals():
        summary['bootstrap'] = {
            name: {k: float(v) for k, v in ci.items() if k != 'confidence'}
            for name, ci in bootstrap_results.items()
        }
    
    summary_filename = 'training_results.json'
    with open(summary_filename, 'w') as f:
        json.dump(summary, f, indent=2)
    
    print(f"💾 Saved results summary: {summary_filename}")
    print("\n📁 Files saved to current directory")
    print("   💡 To download: Right-click file → Download")
else:
    print("⚠️ No results to save")

## 📝 Notes

- **Runtime Management**: If you encounter memory issues, restart the runtime (Runtime → Restart runtime)
- **GPU Usage**: To attach a GPU, go to Runtime → Change runtime type → Hardware accelerator → GPU. The models in this notebook train on the CPU as configured, so a GPU is optional
- **Data Upload**: Upload `preprocessed_data.npz` or `fallback_data.csv` to the Colab file system using the file browser
- **Download Results**: Right-click on saved files in the file browser to download them
- **Long Training**: If training outlasts your session limit, consider Colab Pro, which offers longer runtime sessions
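
The download step mentioned in the notes can also be scripted. Below is a sketch using `google.colab.files.download`, guarded so that it does nothing outside Colab; the helper name `download_if_colab` is hypothetical, and the filename is the one written by the save cell:

```python
import os

def download_if_colab(path):
    """Trigger a browser download in Colab; return False anywhere else."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return False
    files.download(path)
    return True

for name in ("training_results.json",):
    if os.path.exists(name):
        status = "downloaded" if download_if_colab(name) else "left in place (not in Colab)"
        print(name, status)
```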