# 🧠 Brain Connectome Analysis - Local Demo

This notebook demonstrates the Brain Connectome analysis pipeline for **local Jupyter** users.

## Prerequisites
Make sure you have installed the package:
```bash
pip install -e .
```
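
To confirm that the editable install is visible to the kernel you are running, a quick check along these lines may help. It is a minimal sketch using only the standard library. The module name `brain_connectome` is taken from the imports later in this notebook, and `is_installed` is a hypothetical helper, not part of the package.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("brain_connectome"):
    print("brain_connectome is importable")
else:
    print("brain_connectome not found; run `pip install -e .` from the project root")
```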

## What this notebook does:
1. 📊 Loads HCP connectome data
2. 🔬 Runs sexual dimorphism analysis
3. 🤖 Trains Random Forest classifier
4. 🧠 Trains EBM (Explainable Boosting Machine)
5. 📈 Compares model performance

---


In [None]:
import sys
from pathlib import Path

# Locate the project root whether the kernel starts in the project root or in notebooks/
cwd = Path.cwd()
project_root = cwd if cwd.name == "Brain-Connectome" else cwd.parent
if project_root.name == "Brain-Connectome":
    sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print("✅ Imports successful!")


## Step 1: Load Data

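Load the processed HCP data from `data/processed/full_data.csv`. If that file is missing, the cell generates synthetic sample data (200 subjects, 60 structural PCs, with a built-in sex difference in PCs 1, 3, 12, 23, and 33) so the rest of the notebook still runs.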


In [None]:
# Try to load real data, fall back to sample data
data_path = project_root / "data" / "processed" / "full_data.csv"

if data_path.exists():
    print(f"Loading data from {data_path}")
    data = pd.read_csv(data_path)
else:
    print("Creating sample data for demonstration...")
    rng = np.random.default_rng(42)
    n_subjects = 200
    n_pcs = 60
    
    data = {"Subject": range(1, n_subjects + 1)}
    data["Gender"] = rng.choice(["M", "F"], n_subjects)
    
    for i in range(1, n_pcs + 1):
        # Give a few PCs a built-in sex difference (group means of +/-0.5); the rest are pure noise
        if i in [1, 3, 12, 23, 33]:
            male_mean = 0.5 if i % 2 == 0 else -0.5
            data[f"Struct_PC{i}"] = np.where(
                np.array(data["Gender"]) == "M",
                rng.normal(male_mean, 1, n_subjects),
                rng.normal(-male_mean, 1, n_subjects)
            )
        else:
            data[f"Struct_PC{i}"] = rng.normal(0, 1, n_subjects)
    
    data = pd.DataFrame(data)

print(f"\n📊 Dataset: {data.shape[0]} subjects, {data.shape[1]} columns")
print(f"\nGender distribution:\n{data['Gender'].value_counts()}")
data.head()


## Step 2: Sexual Dimorphism Analysis
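
The cell below reads a `Cohen_D` column and a `Significant` flag from the `DimorphismAnalysis` results. The internals of that class are not shown in this notebook, so the following is only a sketch of one common way such quantities are computed: Cohen's d with a pooled standard deviation, Welch's t-test from SciPy, and Benjamini-Hochberg FDR correction. The helper names `cohens_d` and `benjamini_hochberg`, the first-group-minus-second sign convention, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from scipy import stats


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (mean of a minus mean of b)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_var = (
        (len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)
    ) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, float)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank that meets its threshold
        reject[order[: k + 1]] = True
    return reject


# Synthetic demo: 10 features, only feature 0 has a true group difference.
rng = np.random.default_rng(0)
male = rng.normal(0, 1, size=(100, 10))
female = rng.normal(0, 1, size=(100, 10))
male[:, 0] += 1.0

d = np.array([cohens_d(male[:, j], female[:, j]) for j in range(10)])
p = np.array(
    [stats.ttest_ind(male[:, j], female[:, j], equal_var=False).pvalue for j in range(10)]
)
significant = benjamini_hochberg(p)
print(np.round(d, 2))
print(significant)
```

Under this convention a positive d means the first group has the larger mean. Whether `DimorphismAnalysis` uses the same sign convention is not confirmed here.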


In [None]:
from brain_connectome.analysis import DimorphismAnalysis

# Run analysis
analysis = DimorphismAnalysis(data, gender_column="Gender")
struct_pcs = [col for col in data.columns if col.startswith("Struct_PC")]
results = analysis.analyze(feature_columns=struct_pcs)

n_significant = results["Significant"].sum()
print(f"🔬 Found {n_significant} significant features (FDR < 0.05)")

# Plot effect sizes
fig, ax = plt.subplots(figsize=(10, 8))
top20 = results.head(20)
colors = ["#1f77b4" if d < 0 else "#d62728" for d in top20["Cohen_D"]]
ax.barh(range(len(top20)), top20["Cohen_D"].values, color=colors)
ax.set_yticks(range(len(top20)))
ax.set_yticklabels(top20["Feature"])
ax.set_xlabel("Cohen's d")
ax.set_title("Sexual Dimorphism: Effect Sizes")
ax.axvline(0, color="black", linewidth=0.5)
ax.invert_yaxis()
plt.tight_layout()
plt.show()


## Step 3: Machine Learning Classification
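
The cell below scores the model on a single 70/30 split. With the 200-subject sample data that leaves only 60 test cases, so the reported accuracy can shift by several points from one split to another. Stratified k-fold cross-validation gives a steadier estimate. The sketch below uses scikit-learn's `RandomForestClassifier` directly, on the assumption that `ConnectomeRandomForest` wraps something similar (not confirmed here), and runs on synthetic data so that it stands alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the notebook's X and y: 200 subjects, 60 features,
# with a group shift of one standard deviation in the first five columns.
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.normal(0, 1, size=(200, 60))
X_demo[:, :5] += y_demo[:, None] * 1.0

# Five stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"Accuracy per fold: {np.round(scores, 3)}")
print(f"Mean +/- SD: {scores.mean():.3f} +/- {scores.std(ddof=1):.3f}")
```

Once the next cell has defined `X` and `y`, they can be passed in place of `X_demo` and `y_demo`.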


In [None]:
from brain_connectome.models import ConnectomeRandomForest
from sklearn.model_selection import train_test_split

# Prepare data
X = data[struct_pcs].values
y = (data["Gender"] == "M").astype(int).values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train and evaluate
clf = ConnectomeRandomForest(n_estimators=200, random_state=42)
clf.fit(X_train, y_train, feature_names=struct_pcs)
metrics = clf.evaluate(X_test, y_test)

print(f"🎯 Test Accuracy: {metrics['accuracy']:.2%}")

# Plot feature importance
importance = clf.get_top_features(n=15)
fig, ax = plt.subplots(figsize=(10, 6))
top15 = importance.iloc[::-1]  # barh draws the first row at the bottom, so reverse the order
bar_colors = plt.colormaps["viridis"](np.linspace(0.3, 0.9, len(top15)))
ax.barh(top15["Feature"], top15["Importance"], color=bar_colors)
ax.set_xlabel("Importance")
ax.set_title("Top 15 Features for Classification")
plt.tight_layout()
plt.show()


## Step 4: EBM (Explainable Boosting Machine)


In [None]:
from brain_connectome.models import ConnectomeEBM

# Train EBM (interpretable model)
ebm = ConnectomeEBM(learning_rate=0.01, max_bins=32, max_leaves=3, interactions=0, random_state=42)
ebm.fit(X_train, y_train, feature_names=struct_pcs)
ebm_metrics = ebm.evaluate(X_test, y_test)

print(f"🎯 EBM Accuracy: {ebm_metrics['accuracy']:.2%}")
print(f"📊 Random Forest: {metrics['accuracy']:.2%}")

# Show the EBM's top features
ebm_importance = ebm.get_top_features(n=10)
print("\nEBM Top Features:")
print(ebm_importance.to_string(index=False))


## Summary
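
To put the two models side by side, the metric dictionaries returned by `evaluate` can be collected into one table. This is a minimal sketch: it relies only on the `accuracy` key used above, `compare_models` is a hypothetical helper, and the numbers in the demo are placeholders, not results.

```python
import pandas as pd


def compare_models(metrics_by_model):
    """Build a table with one row per model, sorted by accuracy (best first)."""
    table = pd.DataFrame.from_dict(metrics_by_model, orient="index")
    table.index.name = "Model"
    return table.sort_values("accuracy", ascending=False)


# Placeholder values for illustration only, not results from this notebook.
demo = {
    "Random Forest": {"accuracy": 0.50},
    "EBM": {"accuracy": 0.75},
}
print(compare_models(demo))
```

In this notebook the call would be `compare_models({"Random Forest": metrics, "EBM": ebm_metrics})`. If those dictionaries hold further scalar entries, they should appear as additional columns.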

Analysis complete! For the full pipeline with PCA and VAE, run:
```bash
python Runners/run_pipeline.py
```

Or use Docker:
```bash
docker run -v "$(pwd)/data:/app/data" -v "$(pwd)/output:/app/output" ghcr.io/sean0418/brain-connectome:latest
```
