# Sleep Disorder Risk Prediction — Data Prep & Baseline Model

This notebook runs the end-to-end data preparation, baseline training, evaluation, and model export steps. **Install the dependencies listed in `requirements.txt` before running it.**

**High-level steps:**
1. Download NHANES cycles (2005–2016) and convert XPT → CSV (uses `src/check_nhanes_downloads.py`).
2. Merge converted CSV files and engineer features (uses `src/merge_and_engineer.py`).
3. Load the merged CSV, preprocess, split, scale, train a baseline XGBoost model, evaluate metrics, and compute SHAP explanations.

Run cells sequentially. If you prefer to run parts separately, run the downloader and merge scripts from the terminal first, then continue with the notebook cells.


## 0) Environment & Requirements

Install dependencies from `requirements.txt`, preferably inside a virtual environment:

```bash
pip install -r requirements.txt
```

After the downloader and the merge script have run, this notebook expects:
- `data/raw/<cycle>/csv/` (converted CSVs, written by the downloader)
- `reports/` (per-cycle JSON reports, written by the downloader)
- `data/processed/merged_clean.csv` (written by the merge script)
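
Before continuing, it can help to confirm that this layout is actually in place. The sketch below is a minimal check and not part of the project scripts: `missing_paths` is a hypothetical helper, and it assumes every cycle directory under `data/raw/` should contain a `csv/` subfolder.

```python
from pathlib import Path


def missing_paths(root="."):
    """Return the expected paths (relative to root) that do not exist yet.

    Hypothetical helper: the layout is taken from the list above.
    """
    root = Path(root)
    missing = []
    raw = root / "data" / "raw"
    if not raw.is_dir():
        missing.append("data/raw")
    else:
        # Every cycle folder is expected to hold its converted CSVs in csv/.
        for cycle_dir in sorted(p for p in raw.iterdir() if p.is_dir()):
            if not (cycle_dir / "csv").is_dir():
                missing.append(f"data/raw/{cycle_dir.name}/csv")
    if not (root / "reports").is_dir():
        missing.append("reports")
    if not (root / "data" / "processed" / "merged_clean.csv").is_file():
        missing.append("data/processed/merged_clean.csv")
    return missing


if __name__ == "__main__":
    for path in missing_paths():
        print("missing:", path)
```

An empty result means the downloader and merge outputs are where the later cells look for them.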


In [None]:
# 1) (Optional) Run the downloader from the notebook.
# Running this cell downloads 10 or more files per cycle and may take a long time, depending on your connection.
# If you prefer, run in a terminal: python src/check_nhanes_downloads.py --out data/raw --report reports --convert

# !python src/check_nhanes_downloads.py --out data/raw --report reports --convert

print('Downloader command (commented out). To run, remove comment and execute this cell or run from terminal.')

## 2) Inspect per-cycle reports
The downloader writes `reports/<cycle>_report.json` describing which files were downloaded and which variables were present/missing. Always inspect these reports before merging.

In [None]:
import json
from pathlib import Path
report_dir = Path('reports')
if report_dir.exists():
    for p in sorted(report_dir.glob('*_report.json')):
        print('\nReport:', p)
        with open(p, 'r', encoding='utf-8') as fh:
            rep = json.load(fh)
        # Print summary of missing variables per file
        for fname, info in rep.get('files', {}).items():
            missing = info.get('missing_variables')
            if missing:
                print(f"  {fname}: missing {len(missing)} key vars -> {missing}")
            else:
                print(f"  {fname}: OK")
else:
    print('No reports found. Run the downloader first (see instructions).')

## 3) Merge & engineer features
You can run the merge script from the notebook or from a terminal. The script reads CSVs from `data/raw/<cycle>/csv/`, harmonizes them, merges on `SEQN`, engineers features, and writes `data/processed/merged_clean.csv`. If variable names differ across cycles, you may need to edit `src/merge_and_engineer.py`.
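
To make the merge step concrete, here is a minimal sketch of joining per-component tables on `SEQN` with pandas. It is not the code in `src/merge_and_engineer.py`: the table and column names (`demo`, `bmx`, `age`, `BMI`) are made up for illustration, and the choice of an outer join is an assumption.

```python
from functools import reduce

import pandas as pd


def merge_on_seqn(frames):
    """Outer-join a list of DataFrames on the respondent ID column SEQN."""
    return reduce(
        # validate raises MergeError if SEQN is duplicated in either table,
        # which would otherwise silently multiply rows.
        lambda left, right: pd.merge(
            left, right, on="SEQN", how="outer", validate="one_to_one"
        ),
        frames,
    )


# Toy tables standing in for two converted CSVs (hypothetical values).
demo = pd.DataFrame({"SEQN": [1, 2, 3], "age": [34, 51, 67]})
bmx = pd.DataFrame({"SEQN": [2, 3, 4], "BMI": [27.1, 31.4, 22.8]})

merged = merge_on_seqn([demo, bmx])
print(merged)
```

Respondents present in only one table keep their row and get missing values for the other table's columns, which is worth keeping in mind when reading the missingness figures later.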

In [None]:
# Run merge script (uncomment to execute here)
# !python src/merge_and_engineer.py --input data/raw --output data/processed/merged_clean.csv
print('Merge command (commented out). Run from terminal or uncomment to run here.')

## 4) Load merged data and inspect
Load `data/processed/merged_clean.csv` and inspect label balance and missingness.

In [None]:
import pandas as pd
from pathlib import Path
p = Path('data/processed/merged_clean.csv')
if not p.exists():
    print('Merged CSV not found. Run merge_and_engineer.py first.')
else:
    df = pd.read_csv(p)
    print('Merged shape:', df.shape)
    print('\nTarget distribution (sleep_disorder):')
    print(df['sleep_disorder'].value_counts(dropna=False))
    display(df.head())
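
The cell above covers label balance; missingness can be summarized per column in a similar way. A minimal sketch, assuming the frame fits in memory: `missing_summary` is a hypothetical helper, and the toy frame only stands in for `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return per-column missing counts and fractions, most missing first."""
    summary = pd.DataFrame({
        "n_missing": frame.isna().sum(),
        "frac_missing": frame.isna().mean(),
    })
    return summary.sort_values("frac_missing", ascending=False)


# Toy data (hypothetical values); in the notebook, pass df instead.
toy = pd.DataFrame({
    "age": [34, 51, 67, 45],
    "BMI": [27.1, np.nan, 22.8, np.nan],
    "sleep_disorder": [0, 1, 0, np.nan],
})
print(missing_summary(toy))
```

Columns near the top of this table are the ones that will cost the most rows under complete-case analysis.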

## 5) Preprocess, train/test split, scaling, and baseline training
This cell drops rows with missing values in the core features, makes a stratified 80/20 train/test split, standardizes the features, and trains a baseline XGBoost classifier. Adjust hyperparameters as needed.

In [None]:
# Baseline training (example) - DO NOT run until merged CSV is present
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score, classification_report, confusion_matrix

if 'df' not in globals():
    print('Load merged CSV first (see previous cell).')
else:
    feature_cols = [
        'age', 'sex', 'BMI', 'exercise_min_week',
        'calories_day', 'fiber_g_day', 'added_sugar_g_day', 'caffeine_mg_day',
        'alcohol_drinks_week', 'current_smoker', 'depression_score',
        'systolic_bp', 'diastolic_bp'
    ]
    # Keep only rows where target is not null
    df2 = df[df['sleep_disorder'].notna()].copy()
    # Drop rows missing core features
    df_model = df2[feature_cols + ['sleep_disorder']].dropna()
    print('After dropping missing:', df_model.shape)
    X = df_model[feature_cols]
    y = df_model['sleep_disorder'].astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
    scaler = StandardScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)

    # Train XGBoost baseline
    try:
        from xgboost import XGBClassifier
    except ImportError as e:
        print('xgboost is not installed; install requirements.txt first:', e)
    else:
        # use_label_encoder is omitted: recent XGBoost releases deprecate and ignore it.
        model = XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1, eval_metric='logloss', random_state=42)
        model.fit(X_train_s, y_train)
        y_proba = model.predict_proba(X_test_s)[:, 1]  # probability of the positive class
        y_pred = model.predict(X_test_s)
        roc = roc_auc_score(y_test, y_proba)
        pr = average_precision_score(y_test, y_proba)
        print(f'ROC-AUC: {roc:.3f} | PR-AUC: {pr:.3f}')
        print('\nClassification report:')
        print(classification_report(y_test, y_pred))
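
The classification report uses the model's default decision threshold. The training cell imports `confusion_matrix` but never calls it, so one natural follow-up is to count the error types at a chosen threshold. This is a sketch on toy arrays: `confusion_at_threshold` and its 0.5 default are assumptions, and in the notebook `y_test` and `y_proba` would be passed instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def confusion_at_threshold(y_true, y_proba, threshold=0.5):
    """Return (tn, fp, fn, tp) after thresholding predicted probabilities."""
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    # labels=[0, 1] fixes the 2x2 layout even if one class is absent.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(tn), int(fp), int(fn), int(tp)


# Toy labels and scores (hypothetical values).
y_true = [0, 0, 1, 1, 1, 0]
y_proba = [0.1, 0.6, 0.8, 0.4, 0.9, 0.2]

tn, fp, fn, tp = confusion_at_threshold(y_true, y_proba)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"sensitivity={tp / (tp + fn):.2f} specificity={tn / (tn + fp):.2f}")
```

Lowering the threshold trades false negatives for false positives, which may be the preferred direction for a screening-style risk score.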

## 6) SHAP explainability (global & local)
Use SHAP to compute feature importances and visualize per-person explanations. The example cell computes SHAP values on the first 200 test rows only, since SHAP can be slow on large datasets.

In [None]:
# SHAP example (requires shap and matplotlib)
if 'model' in globals() and model is not None:
    try:
        import shap
        import matplotlib.pyplot as plt
        explainer = shap.TreeExplainer(model)
        # Use a small sample to compute shap values for speed
        sample = X_test_s[:200]
        shap_values = explainer.shap_values(sample)
        shap.summary_plot(shap_values, sample, feature_names=feature_cols)
    except Exception as e:
        print('Error computing SHAP (install shap and matplotlib):', e)
else:
    print('Train the model first (run the training cell above).')
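
The summary plot gives a global picture. The same SHAP matrix can also be read numerically: averaging absolute values per column gives a global ranking, and a single row gives that person's per-feature contributions. The sketch below uses only NumPy on a small made-up matrix. `global_ranking` and `local_explanation` are hypothetical helpers; in the notebook, `shap_values` and `feature_cols` would be passed in, assuming `shap_values` is a 2-D array of shape rows × features.

```python
import numpy as np


def global_ranking(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    mean_abs = np.abs(np.asarray(shap_values)).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]
    return [(feature_names[i], float(mean_abs[i])) for i in order]


def local_explanation(shap_values, row, feature_names, top_k=3):
    """Return the top_k signed contributions for one row, by magnitude."""
    contrib = np.asarray(shap_values)[row]
    order = np.argsort(np.abs(contrib))[::-1][:top_k]
    return [(feature_names[i], float(contrib[i])) for i in order]


# Made-up SHAP matrix: 3 people x 3 features (hypothetical values).
names = ["age", "BMI", "caffeine_mg_day"]
values = np.array([
    [0.30, -0.10, 0.05],
    [-0.20, 0.40, 0.00],
    [0.10, 0.25, -0.05],
])

print(global_ranking(values, names))
print(local_explanation(values, row=0, feature_names=names, top_k=2))
```

A positive contribution pushes that person's prediction toward the positive class and a negative one pushes it away, relative to the explainer's baseline.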

## 7) Save model & scaler
Save trained artifacts to `models/` for deployment with Streamlit.

In [None]:
import joblib, os
# feature_cols is saved too, so check for it along with the model and scaler
if all(name in globals() for name in ('model', 'scaler', 'feature_cols')):
    os.makedirs('models', exist_ok=True)
    joblib.dump(model, 'models/sleep_risk_model.pkl')
    joblib.dump(scaler, 'models/feature_scaler.pkl')
    joblib.dump(feature_cols, 'models/feature_names.pkl')
    print('Saved model, scaler, and feature names to models/')
else:
    print('No model/scaler found in memory. Run training cell first.')
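
On the deployment side, the three files are loaded back and applied in the same order as in training: select the features by name, scale, then predict. A minimal sketch, assuming the file names written by the cell above; `load_artifacts` and `predict_risk` are hypothetical helpers, not part of the project's Streamlit app.

```python
from pathlib import Path

import pandas as pd


def load_artifacts(model_dir="models"):
    """Load the model, scaler, and feature names saved by the cell above."""
    import joblib  # imported lazily so predict_risk works without joblib

    model_dir = Path(model_dir)
    model = joblib.load(model_dir / "sleep_risk_model.pkl")
    scaler = joblib.load(model_dir / "feature_scaler.pkl")
    feature_names = joblib.load(model_dir / "feature_names.pkl")
    return model, scaler, feature_names


def predict_risk(record, model, scaler, feature_names):
    """Return the predicted probability of the positive class for one person.

    record is a dict mapping feature name to value; a missing feature
    raises KeyError rather than being silently filled.
    """
    # Build a one-row frame in the training column order.
    row = pd.DataFrame(
        [[record[name] for name in feature_names]], columns=feature_names
    )
    scaled = scaler.transform(row)
    return float(model.predict_proba(scaled)[0, 1])


# Usage (requires the files in models/ to exist):
# model, scaler, feature_names = load_artifacts()
# risk = predict_risk(record, model, scaler, feature_names)
```

Saving the feature names alongside the model is what lets `predict_risk` rebuild the column order without relying on the caller.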

## Next steps & notes
- Inspect reports for variable coverage per cycle and edit `src/merge_and_engineer.py` if variable names differ across cycles.
- Consider multiple imputation (MICE) for missingness instead of complete-case analysis.
- Use temporal external validation: train on early cycles and test on later cycles to assess generalization.
- Tune hyperparameters with `GridSearchCV` or `Optuna` and log experiments with MLflow or Weights & Biases.
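
As one illustration of the temporal validation idea, the split can be driven by the survey cycle instead of a random shuffle. This sketch assumes the merged data has a column identifying the cycle; the column name `cycle`, its string format, and the cutoff are all hypothetical, since the merged schema is not shown here.

```python
import pandas as pd


def temporal_split(frame, cycle_col, train_cycles):
    """Split rows into train/test by survey cycle rather than at random."""
    is_train = frame[cycle_col].isin(train_cycles)
    return frame[is_train].copy(), frame[~is_train].copy()


# Toy frame; the cycle labels are hypothetical placeholders.
toy = pd.DataFrame({
    "cycle": ["2005-2006", "2007-2008", "2013-2014", "2015-2016"],
    "age": [34, 51, 67, 45],
    "sleep_disorder": [0, 1, 0, 1],
})

train, test = temporal_split(toy, "cycle", train_cycles={"2005-2006", "2007-2008"})
print(len(train), "train rows;", len(test), "test rows")
```

With a split like this, any scaler or imputer should be fit on the training cycles only, so that nothing from the later cycles leaks into training.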

**Limitations:** See README for full limitations and ethical notes.
