# 🧠 Complete ASD Screening ML Model Training

## Step-by-Step Guide for Jupyter Notebook / Google Colab

This notebook walks you through training an ASD screening ML model, with safeguards against common pitfalls such as data leakage and uncalibrated probabilities.

### Key Features:
- ✅ Age-normalized feature engineering
- ✅ Child-level train-test splitting (prevents data leakage)
- ✅ Multiple model comparison (LR, RF, XGBoost)
- ✅ Probability calibration
- ✅ Ablation study (justifies multi-domain approach)
- ✅ Comprehensive evaluation with confidence intervals
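
As a rough illustration of the child-level splitting mentioned in the list above, the sketch below uses scikit-learn's `GroupShuffleSplit` on a toy table. The `child_id` column name and all values are assumptions made for the example; the real dataset may identify children differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 6 children with 3 sessions each (names and values are illustrative)
toy = pd.DataFrame({
    "child_id": np.repeat(np.arange(6), 3),  # hypothetical identifier column
    "feature": np.arange(18, dtype=float),
    "target": np.repeat([0, 1, 0, 1, 0, 1], 3),
})

# Split on children rather than rows, so no child contributes to both sets
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(
    splitter.split(toy, toy["target"], groups=toy["child_id"])
)

train_children = set(toy.iloc[train_idx]["child_id"])
test_children = set(toy.iloc[test_idx]["child_id"])
print("Train children:", sorted(train_children))
print("Test children:", sorted(test_children))
print("Overlap:", train_children & test_children)
```

A plain row-level `train_test_split` could place different sessions of the same child on both sides of the split, which tends to inflate test scores.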

---


## Step 1: Setup and Install Libraries


In [None]:
# Install required packages (Google Colab)
# Skip this if using local Jupyter
!pip install pandas numpy scikit-learn xgboost lightgbm matplotlib seaborn scipy joblib -q

print("✅ All packages installed!")


In [None]:
# Import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GroupKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, confusion_matrix, classification_report
)
from sklearn.calibration import CalibratedClassifierCV
import xgboost as xgb
from scipy import stats
from scipy.stats import mannwhitneyu, pearsonr
import joblib
import pickle
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ All libraries imported successfully!")


## Step 2: Load Data

**Option 1**: Upload CSV to Google Colab  
**Option 2**: Load from local file (Jupyter)


In [None]:
# Option 1: Upload CSV to Google Colab (skipped automatically outside Colab)
try:
    from google.colab import files
    uploaded = files.upload()
except ImportError:
    # Option 2: local Jupyter, the file is read straight from disk
    pass

# Load the CSV (adjust filename)
df = pd.read_csv('training_data.csv')  # Change to your filename

print(f"✅ Data loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print("\nFirst few rows:")
df.head()
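
Since the following cells access several columns by name, a quick check right after loading can surface a mismatched CSV early. The sketch below is one possible way to do that; `check_columns` is a hypothetical helper, and the column lists are taken from the names this notebook references (`group`, `age_months`, `gender`, `session_type`, `age_group`).

```python
import pandas as pd

# Columns this notebook accesses by name
REQUIRED_COLUMNS = ["group", "age_months"]
OPTIONAL_COLUMNS = ["gender", "session_type", "age_group"]

def check_columns(frame, required=REQUIRED_COLUMNS, optional=OPTIONAL_COLUMNS):
    """Raise if a required column is missing; return the absent optional ones."""
    missing = [c for c in required if c not in frame.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    return [c for c in optional if c not in frame.columns]

# Illustrative frame; in the notebook you would pass df instead
example = pd.DataFrame({"group": ["asd"], "age_months": [30], "gender": ["m"]})
absent_optional = check_columns(example)
print("Optional columns not found:", absent_optional)
```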


In [None]:
# Explore data
print("Data Info:")
df.info()  # info() prints directly and returns None, so no print() wrapper

print("\n" + "="*50)
print("Missing Values:")
print(df.isnull().sum().sort_values(ascending=False))

print("\n" + "="*50)
print("Target Distribution:")
print(df['group'].value_counts())  # Adjust column name if different

print("\n" + "="*50)
print("Age Distribution:")
print(df['age_months'].describe())


## Step 3: Data Preprocessing


In [None]:
# Handle missing values
# Fill numeric columns with median
numeric_cols = df.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    if df[col].isnull().any():
        median_val = df[col].median()
        # Assign back rather than using inplace=True on a column selection,
        # which newer pandas versions warn about and may not apply to df
        df[col] = df[col].fillna(median_val)
        print(f"Filled {col} with median: {median_val:.2f}")

# Fill categorical with mode
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if df[col].isnull().any():
        mode_val = df[col].mode()[0]
        df[col] = df[col].fillna(mode_val)
        print(f"Filled {col} with mode: {mode_val}")

print("\n✅ Missing values handled!")
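
The cell above computes fill values over the full dataset, before any train-test split. If stricter leakage control is wanted, one option is to learn the fill values from the training rows only and reuse them on the test rows. The sketch below shows that pattern with scikit-learn's `SimpleImputer`; the `train` and `test` frames and the `score` column are made-up stand-ins for an already-split dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative frames standing in for an already-split train and test set
train = pd.DataFrame({"score": [1.0, 2.0, np.nan, 9.0]})
test = pd.DataFrame({"score": [np.nan, 100.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # median learned from train only
test_filled = imputer.transform(test)        # same median reused for test

print("Median learned from train:", imputer.statistics_[0])
print("Imputed test value:", test_filled[0, 0])
```

The test set's own values (here the 100.0) never influence the fill value, which keeps the held-out evaluation independent of the preprocessing step.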


In [None]:
# Encode target variable
# LabelEncoder assigns codes in alphabetical order ('asd' -> 0), so use an
# explicit mapping to guarantee ASD = 1, Control = 0
target_map = {'typically_developing': 0, 'asd': 1}  # Adjust labels if different
df['target'] = df['group'].map(target_map)
if df['target'].isnull().any():
    raise ValueError(f"Unmapped group labels: {df.loc[df['target'].isnull(), 'group'].unique()}")
df['target'] = df['target'].astype(int)

print("Target encoding:")
print(f"ASD = {target_map['asd']}")
print(f"Control = {target_map['typically_developing']}")

# Encode other categorical variables
categorical_features = ['gender', 'session_type', 'age_group']
for col in categorical_features:
    if col in df.columns:
        le = LabelEncoder()
        df[f'{col}_encoded'] = le.fit_transform(df[col])
        print(f"Encoded {col}")

print("\n✅ Categorical variables encoded!")
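
`LabelEncoder` assigns integer codes in sorted label order, so it is worth inspecting `classes_` before interpreting any encoded column. The short sketch below uses the two group labels from this notebook as sample input; the same check applies to any column encoded this way.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["typically_developing", "asd", "asd"])

# classes_ is sorted, and each label's code is its position in classes_
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(list(codes))
```

With these labels, `asd` sorts first and receives code 0, so which class counts as "positive" should be confirmed before reading metrics such as recall or precision.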
