# 🧠 Connectopy Analysis - One-Click Demo

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Sean0418/connectopy/blob/main/notebooks/colab_demo.ipynb)

This notebook demonstrates the Connectopy analysis pipeline using **real HCP data**. Just click **Runtime → Run all** to execute the entire analysis!

## What this notebook does:
1. 📦 Clones the repository and installs the connectopy package
2. 📊 Loads HCP connectome data (cognitive + brain features)
3. 🔬 Runs sexual dimorphism analysis
4. 🍷 **Alcohol Classification**: Predicts alcohol use disorder using RF + EBM
   - Sex-stratified models (separate for Males/Females)
   - GridSearchCV for hyperparameter tuning
   - Class imbalance handling via sample weights
   - Comprehensive metrics (AUC, balanced accuracy, etc.)
5. 🔗 **Mediation Analysis**: Tests brain network mediation of cognitive-alcohol relationships
6. 📈 Visualizes the results

---


## Step 1: Setup Environment

First, we'll clone the repository and install dependencies. This takes about 2-3 minutes.


In [None]:
# Install interpret FIRST (required for EBM) - must be before package import
%pip install -q interpret

# Clone or update the repository
import os
import shutil
import sys

# Always start from /content
%cd /content

# Clean up any old directories
for old_dir in ["Brain-Connectome", "connectopy"]:
    if os.path.exists(old_dir):
        print(f"Removing old {old_dir} directory...")
        shutil.rmtree(old_dir)

# Clear any cached imports BEFORE cloning
for mod in list(sys.modules.keys()):
    if "connectopy" in mod:
        del sys.modules[mod]

print("Cloning repository...")
!git clone https://github.com/Sean0418/connectopy.git
%cd /content/connectopy

# Verify structure
print(f"Current directory: {os.getcwd()}")
print(f"Contents: {os.listdir('.')}")

# Install the package
%pip install -q -e .

# Add src to path (needed for editable install with src layout in Colab)
# `sys` is already imported at the top of this cell.
src_path = "/content/connectopy/src"
if src_path not in sys.path:
    sys.path.insert(0, src_path)

# Verify import works
from connectopy.analysis import DimorphismAnalysis
print(f"✅ Import test passed: {DimorphismAnalysis}")

print("✅ Setup complete!")

## Step 2: Load Data

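We'll load the preprocessed HCP dataset from `data/processed/full_data.csv` and derive a binary alcohol target (`alc_y`) if it is not already present. If the file is missing, the cell raises an error listing how to supply it, for example by mounting Google Drive or uploading your HCP data.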


In [None]:
from pathlib import Path

import numpy as np
import pandas as pd

# Load the HCP data
data_path = Path("data/processed/full_data.csv")

if not data_path.exists():
    raise FileNotFoundError(
        f"Data file not found at {data_path}\n"
        "Please ensure the HCP data is available.\n"
        "Options:\n"
        "  1. Mount Google Drive with your data: from google.colab import drive; drive.mount('/content/drive')\n"
        "  2. Upload full_data.csv to data/processed/\n"
        "  3. Download HCP data from https://db.humanconnectome.org/"
    )

print("Loading HCP data...")
data = pd.read_csv(data_path)

# Create alcohol target from SSAGA_Alc_D4_Ab_Dx if not present
# HCP coding: 1 = No diagnosis, 5 = Yes diagnosis (alcohol abuse/dependence)
if "alc_y" not in data.columns:
    if "SSAGA_Alc_D4_Ab_Dx" in data.columns:
        data["alc_y"] = np.where(data["SSAGA_Alc_D4_Ab_Dx"] == 5, 1, 0).astype(int)
        print("Created alcohol target (alc_y) from SSAGA_Alc_D4_Ab_Dx")
    else:
        raise ValueError("No alcohol target column found. Need 'alc_y' or 'SSAGA_Alc_D4_Ab_Dx'")

print(f"\nüìä Dataset loaded: {data.shape[0]} subjects, {data.shape[1]} features")
print("\nGender distribution:")
print(data["Gender"].value_counts())
print("\nüç∑ Alcohol diagnosis (alc_y) distribution:")
print(data["alc_y"].value_counts())
print(f"   Positive rate: {data['alc_y'].mean():.1%}")
data.head()

## Step 3: Sexual Dimorphism Analysis

We'll analyze which brain connectivity features differ significantly between males and females.
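
As a rough illustration of what a per-feature dimorphism test involves, the sketch below computes Cohen's d for one feature and applies a Benjamini-Hochberg correction to a list of p-values. It uses only the standard library. The helper names (`cohens_d`, `benjamini_hochberg`) and the toy numbers are illustrative assumptions, and `DimorphismAnalysis` may compute these quantities differently (including the sign convention of the effect size).

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference (a - b) using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def benjamini_hochberg(p_values, alpha=0.05):
    """Return booleans: True where the p-value passes BH-FDR at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Every p-value at or below that rank is declared significant.
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant


# Toy values for a single feature (not HCP data).
group_a = [2.0, 4.0, 6.0]
group_b = [1.0, 3.0, 5.0]
d = cohens_d(group_a, group_b)

# Toy p-values, one per feature.
flags = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```

In this sketch a positive `d` means the first group has the higher mean; only the first toy p-value survives the correction, because the others exceed their rank-scaled thresholds.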


In [None]:
from connectopy.analysis import DimorphismAnalysis

# Run dimorphism analysis
analysis = DimorphismAnalysis(data, gender_column="Gender")

# Analyze ALL connectome features (structural + functional from all variants)
all_conn_features = []
for prefix in ["Struct_PC", "Func_PC", "Raw_Struct_PC", "Raw_Func_PC", "VAE_Struct_LD", "VAE_Func_LD"]:
    all_conn_features.extend([c for c in data.columns if c.startswith(prefix)])

print(f"Analyzing {len(all_conn_features)} connectome features for sexual dimorphism...")
results = analysis.analyze(feature_columns=all_conn_features)

# Show results
n_significant = results["Significant"].sum()
print(f"\nüî¨ Found {n_significant} significant features (FDR < 0.05)")
print("\nüìã Top 10 features by effect size:")
results.head(10)

In [None]:
import matplotlib.pyplot as plt

# Plot effect sizes
fig, ax = plt.subplots(figsize=(10, 8))

top20 = results.head(20)
colors = ["#1f77b4" if d < 0 else "#d62728" for d in top20["Cohen_D"]]

ax.barh(range(len(top20)), top20["Cohen_D"].values, color=colors)
ax.set_yticks(range(len(top20)))
ax.set_yticklabels(top20["Feature"])
ax.set_xlabel("Cohen's D (Effect Size)")
ax.set_title("Sexual Dimorphism: Top 20 Features by Effect Size")
ax.axvline(0, color="black", linestyle="-", linewidth=0.5)
ax.invert_yaxis()

plt.tight_layout()
plt.show()

print("\nüìä Blue bars: Feature is higher in females")
print("üìä Red bars: Feature is higher in males")

## Step 4: Machine Learning Classification

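Now we'll train Random Forest classifiers to predict alcohol diagnosis (`alc_y`) from cognitive and brain connectivity features. A separate model is trained for each feature set (TNPCA, PCA, VAE, ALL) and each sex, with a small hyperparameter grid and class imbalance handling.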
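
When the positive class is rare, plain accuracy is misleading, which is why the training loop enables imbalance handling and reports balanced accuracy. The sketch below shows one common way to derive balanced sample weights and to compute balanced accuracy by hand. It is a standard-library illustration with hypothetical helper names, not necessarily what `fit_with_cv` does internally.

```python
from collections import Counter


def balanced_sample_weights(labels):
    """Weight each sample by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return [n_samples / (n_classes * counts[y]) for y in labels]


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels with a 20% positive rate (not HCP data).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_sample_weights(y_true)

# A classifier that always predicts the majority class is 80% accurate,
# but its balanced accuracy is only chance level.
y_pred = [0] * 10
majority_bal_acc = balanced_accuracy(y_true, y_pred)
```

With these weights each class contributes the same total weight during training, and the always-negative baseline scores 0.5 on balanced accuracy rather than 0.8.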


In [None]:
from connectopy.models import (
    ConnectomeRandomForest,
    get_cognitive_features,
    get_connectome_features,
)

# Get cognitive and connectome features for each variant
cog_features = get_cognitive_features(data, include_age=True)
tnpca_features = get_connectome_features(data, "tnpca")  # Struct_PC*, Func_PC*
vae_features = get_connectome_features(data, "vae")      # VAE_Struct_LD*, VAE_Func_LD*
pca_features = get_connectome_features(data, "pca")      # Raw_Struct_PC*, Raw_Func_PC*

# Define feature sets to train on
feature_sets = {
    "TNPCA": cog_features + tnpca_features,
    "PCA": cog_features + pca_features,
    "VAE": cog_features + vae_features,
    "ALL": cog_features + tnpca_features + vae_features + pca_features,
}

# Remove empty feature sets (e.g., if VAE data not available)
feature_sets = {k: v for k, v in feature_sets.items() if len(v) > len(cog_features)}

print("üìä Feature Sets:")
print(f"   Cognitive: {len(cog_features)}")
for name, feats in feature_sets.items():
    conn_count = len(feats) - len(cog_features)
    print(f"   {name}: {len(cog_features)} cog + {conn_count} conn = {len(feats)} total")

# Store results for all models
all_results = []
rf_models = {}

# Train RF for each feature set × sex combination
for feat_name, feature_cols in feature_sets.items():
    for sex in ["M", "F"]:
        df_sex = data[data["Gender"] == sex].copy()
        sub = df_sex[feature_cols + ["alc_y"]].dropna()
        
        if len(sub) < 30:
            continue
        
        X = sub[feature_cols].values
        y = sub["alc_y"].astype(int).values
        
        if len(np.unique(y)) < 2:
            continue
        
        print(f"\n{'='*50}")
        print(f"üî¨ RF: {feat_name} features, Sex={sex}")
        print(f"{'='*50}")
        print(f"   Features: {len(feature_cols)}, Samples: {len(y)}, Positive: {y.mean():.1%}")
        
        rf = ConnectomeRandomForest(n_estimators=200, class_weight="balanced", random_state=42, n_jobs=-1)
        metrics = rf.fit_with_cv(
            X, y,
            feature_names=feature_cols,
            handle_imbalance=True,
            param_grid={"rf__n_estimators": [100, 200], "rf__max_depth": [None, 10]},
        )
        
        metrics["sex"] = sex
        metrics["model"] = "RF"
        metrics["features"] = feat_name
        all_results.append(metrics)
        rf_models[(feat_name, sex)] = rf
        
        print(f"   ✅ Test AUC: {metrics['test_auc']:.3f}, Bal Acc: {metrics['test_bal_acc']:.3f}")

In [None]:
# Plot RF feature importance for best model (ALL features) per sex
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

for idx, sex in enumerate(["M", "F"]):
    # Use "ALL" features model if available, else first available
    key = ("ALL", sex) if ("ALL", sex) in rf_models else None
    if key is None:
        for k in rf_models:
            if k[1] == sex:
                key = k
                break
    if key is None:
        continue
    
    rf = rf_models[key]
    importance = rf.get_top_features(n=15)
    top15 = importance.head(15).iloc[::-1]
    
    ax = axes[idx]
    colors = plt.colormaps["viridis"](np.linspace(0.3, 0.9, len(top15)))
    ax.barh(top15["Feature"], top15["Importance"], color=colors)
    ax.set_xlabel("Importance")
    ax.set_title(f"RF Top 15 Features ({key[0]}, {sex}) - Alcohol Classification")

plt.tight_layout()
plt.show()

## Step 5: EBM (Explainable Boosting Machine)

EBM is an interpretable ML model that provides both global and local explanations.
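
EBMs are generalized additive models: the predicted log-odds is an intercept plus one learned term per feature, which is what makes per-feature explanations possible. The sketch below imitates that structure with hand-written lookup tables. The feature names, bin edges, and scores are invented for illustration and do not come from a trained `ConnectomeEBM`.

```python
from bisect import bisect_right
from math import exp

# Hypothetical learned shape functions: bin edges and one score per bin.
shape_functions = {
    "feature_a": {"edges": [0.0, 1.0], "scores": [-0.5, 0.1, 0.6]},
    "feature_b": {"edges": [10.0], "scores": [0.2, -0.3]},
}
intercept = -1.0


def explain(sample):
    """Return per-feature contributions (log-odds) and the predicted probability."""
    contributions = {}
    for name, fn in shape_functions.items():
        # Locate the bin the feature value falls into and read off its score.
        bin_index = bisect_right(fn["edges"], sample[name])
        contributions[name] = fn["scores"][bin_index]
    logit = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + exp(-logit))
    return contributions, probability


contributions, probability = explain({"feature_a": 1.5, "feature_b": 4.0})
```

The `contributions` dictionary plays the role of a local explanation for one subject, while the full set of score tables corresponds to the global explanation. Setting `interactions=0`, as the training cell does, keeps the model to single-feature terms like these.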


In [None]:
from connectopy.models import ConnectomeEBM

ebm_models = {}

# Train EBM for each feature set × sex combination
for feat_name, feature_cols in feature_sets.items():
    for sex in ["M", "F"]:
        df_sex = data[data["Gender"] == sex].copy()
        sub = df_sex[feature_cols + ["alc_y"]].dropna()
        
        if len(sub) < 30:
            continue
        
        X = sub[feature_cols].values
        y = sub["alc_y"].astype(int).values
        
        if len(np.unique(y)) < 2:
            continue
        
        print(f"\n{'='*50}")
        print(f"üî¨ EBM: {feat_name} features, Sex={sex}")
        print(f"{'='*50}")
        print(f"   Features: {len(feature_cols)}, Samples: {len(y)}, Positive: {y.mean():.1%}")
        
        ebm = ConnectomeEBM(max_bins=32, learning_rate=0.01, max_leaves=3, interactions=0, random_state=42)
        ebm_metrics = ebm.fit_with_cv(
            X, y,
            feature_names=feature_cols,
            handle_imbalance=True,
            param_grid={"max_leaves": [2, 3]},  # Simplified for speed
        )
        
        ebm_metrics["sex"] = sex
        ebm_metrics["model"] = "EBM"
        ebm_metrics["features"] = feat_name
        all_results.append(ebm_metrics)
        ebm_models[(feat_name, sex)] = ebm
        
        print(f"   ✅ Test AUC: {ebm_metrics['test_auc']:.3f}, Bal Acc: {ebm_metrics['test_bal_acc']:.3f}")

# Summary comparison table
print("\n" + "="*70)
print("üìä MODEL COMPARISON SUMMARY (by Feature Set)")
print("="*70)
results_df = pd.DataFrame(all_results)
summary_cols = ["model", "features", "sex", "test_auc", "test_bal_acc"]
summary_cols = [c for c in summary_cols if c in results_df.columns]
print(results_df[summary_cols].sort_values(["features", "model", "sex"]).to_string(index=False))

# Best model per feature set
print("\n" + "="*70)
print("üèÜ BEST MODEL PER FEATURE SET")
print("="*70)
for feat in results_df["features"].unique():
    subset = results_df[results_df["features"] == feat]
    best = subset.loc[subset["test_auc"].idxmax()]
    print(f"   {feat}: {best['model']} ({best['sex']}) - AUC={best['test_auc']:.3f}")

## Step 6: Mediation Analysis

Test whether brain networks **mediate** the relationship between cognitive traits and alcohol outcomes, stratified by sex.

**Research Question**: *How do structural brain networks mediate the relationship between cognitive traits and alcohol dependence differently across sexes?*

We'll use the following HCP columns:
- **X (Cognitive)**: Fluid intelligence (PMAT24_A_CR) or similar
- **M (Brain)**: Structural connectome PC (Struct_PC1)
- **Y (Alcohol)**: Alcohol diagnosis (alc_y)

```
Cognitive Traits (X) → Brain Networks (M) → Alcohol Outcomes (Y)
                              ↑
                         Sex (moderator)
```
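
To make the indirect effect concrete, the sketch below estimates a×b with two ordinary least-squares fits and a percentile bootstrap, using only the standard library and synthetic data. It is a minimal illustration under stated assumptions: the helper names are hypothetical, the outcome here is continuous rather than binary, and `SexStratifiedMediation` may use a different estimator.

```python
import random


def _cov(u, v):
    """Sum of cross-products of two sequences after centering."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def indirect_effect(x, m, y):
    """a*b, where a is the slope of M ~ X and b is the M coefficient in Y ~ X + M."""
    sxx, smm, sxm = _cov(x, x), _cov(m, m), _cov(x, m)
    sxy, smy = _cov(x, y), _cov(m, y)
    a = sxm / sxx
    b = (smy * sxx - sxy * sxm) / (sxx * smm - sxm ** 2)
    return a * b


def bootstrap_ci(x, m, y, n_bootstrap=500, seed=42):
    """Percentile 95% CI for the indirect effect, resampling subjects with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ms = [m[i] for i in idx]
        ys = [y[i] for i in idx]
        estimates.append(indirect_effect(xs, ms, ys))
    estimates.sort()
    return estimates[int(0.025 * n_bootstrap)], estimates[int(0.975 * n_bootstrap) - 1]


# Synthetic data (not HCP): M depends on X, and Y depends on M.
data_rng = random.Random(0)
x = [data_rng.gauss(0, 1) for _ in range(200)]
m = [0.5 * xi + data_rng.gauss(0, 1) for xi in x]
y = [0.4 * mi + data_rng.gauss(0, 1) for mi in m]

effect = indirect_effect(x, m, y)
ci_low, ci_high = bootstrap_ci(x, m, y)
significant = ci_low > 0 or ci_high < 0  # CI excludes zero
```

Running the same estimate separately on male and female subjects, and bootstrapping the difference between the two indirect effects, gives the sex-stratified comparison this step reports.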


In [None]:
from connectopy.analysis import SexStratifiedMediation

# Use actual HCP data columns for mediation analysis
# Mediation model: Cognitive (X) → Brain Network (M) → Alcohol (Y)

# Select cognitive predictor (use fluid intelligence if available)
cognitive_options = ["PMAT24_A_CR", "ListSort_Unadj", "ReadEng_Unadj", "ProcSpeed_Unadj"]
cognitive_col = None
for col in cognitive_options:
    if col in data.columns:
        cognitive_col = col
        break

if cognitive_col is None:
    raise ValueError(f"No cognitive column found. Need one of: {cognitive_options}")

# Select brain network mediator (try all variants)
brain_options = [
    "Struct_PC1", "Func_PC1",  # TN-PCA
    "Raw_Struct_PC1", "Raw_Func_PC1",  # PCA
    "VAE_Struct_LD1", "VAE_Func_LD1",  # VAE
]
brain_col = None
for col in brain_options:
    if col in data.columns:
        brain_col = col
        break

if brain_col is None:
    raise ValueError(f"No brain network column found. Need one of: {brain_options}")

# Use alcohol target (alc_y is binary, but mediation works with it)
alcohol_col = "alc_y"

print("üìä Mediation Analysis Variables (from actual HCP data):")
print(f"   Cognitive (X): {cognitive_col}")
print(f"      mean={data[cognitive_col].mean():.2f}, std={data[cognitive_col].std():.2f}")
print(f"   Brain Network (M): {brain_col}")
print(f"      mean={data[brain_col].mean():.2f}, std={data[brain_col].std():.2f}")
print(f"   Alcohol Outcome (Y): {alcohol_col}")
print(f"      positive rate={data[alcohol_col].mean():.1%}")

In [None]:
# Run sex-stratified mediation analysis using actual HCP columns
print("üî¨ Running Sex-Stratified Mediation Analysis...")
print(f"   Model: {cognitive_col} ‚Üí {brain_col} ‚Üí {alcohol_col}")
print()

mediation = SexStratifiedMediation(n_bootstrap=1000, random_state=42)
result = mediation.fit(
    data=data,
    cognitive_col=cognitive_col,
    brain_col=brain_col,
    alcohol_col=alcohol_col,
    sex_col="Gender",
)

# Display results
print("=" * 50)
print("MEDIATION RESULTS")
print("=" * 50)

print("\nüë® MALES:")
print(f"   Indirect effect (a√ób): {result.male.indirect_effect:.4f}")
print(f"   95% CI: [{result.male.ci_low:.4f}, {result.male.ci_high:.4f}]")
print(f"   Significant: {'‚úÖ Yes' if result.male.significant else '‚ùå No'}")

print("\nüë© FEMALES:")
print(f"   Indirect effect (a√ób): {result.female.indirect_effect:.4f}")
print(f"   95% CI: [{result.female.ci_low:.4f}, {result.female.ci_high:.4f}]")
print(f"   Significant: {'‚úÖ Yes' if result.female.significant else '‚ùå No'}")

print("\n‚öñÔ∏è SEX DIFFERENCE:")
print(f"   Difference (M - F): {result.difference:.4f}")
print(f"   95% CI: [{result.diff_ci_low:.4f}, {result.diff_ci_high:.4f}]")
print(f"   Significant: {'‚úÖ Yes' if result.diff_significant else '‚ùå No'}")

In [None]:
# Visualize mediation comparison
fig, ax = plt.subplots(figsize=(8, 5))

# Bar positions
x = np.array([0, 1])
width = 0.6

# Data
effects = [result.male.indirect_effect, result.female.indirect_effect]
errors = [
    [
        result.male.indirect_effect - result.male.ci_low,
        result.female.indirect_effect - result.female.ci_low,
    ],
    [
        result.male.ci_high - result.male.indirect_effect,
        result.female.ci_high - result.female.indirect_effect,
    ],
]

colors = ["#3498db", "#e74c3c"]
bars = ax.bar(x, effects, width, color=colors, edgecolor="black", linewidth=1.5)
ax.errorbar(x, effects, yerr=errors, fmt="none", color="black", capsize=5, capthick=2)

ax.set_xticks(x)
ax.set_xticklabels(["Males", "Females"], fontsize=12)
ax.set_ylabel("Indirect Effect (Mediation)", fontsize=12)
ax.set_title("Sex Differences in Brain Network Mediation\nCognitive → Brain → Alcohol", fontsize=14)
ax.axhline(0, color="black", linestyle="-", linewidth=0.5)

# Add significance stars
sig_list = [result.male.significant, result.female.significant]
for i, (effect, sig) in enumerate(zip(effects, sig_list)):
    if sig:
        ax.annotate("*", (x[i], effect + 0.02), ha="center", fontsize=16, fontweight="bold")

plt.tight_layout()
plt.show()

print("\nüìä Interpretation:")
print("   - The indirect effect represents brain network mediation strength")
print("   - Error bars show 95% bootstrap confidence intervals")
print("   - * indicates significant mediation (CI excludes zero)")

## 📋 Summary

This notebook demonstrated the Connectopy analysis pipeline:

1. **Data Loading**: Loaded HCP cognitive and connectome features and derived the binary alcohol target (`alc_y`)
2. **Dimorphism Analysis**: Identified sexually dimorphic brain connectivity patterns
3. **Random Forest**: Trained sex-stratified ensemble classifiers for alcohol diagnosis
4. **EBM**: Trained sex-stratified interpretable boosting models for the same target
5. **Mediation Analysis**: Tested sex-stratified mediation (Cognitive → Brain → Alcohol)
6. **Visualization**: Created publication-ready plots

### Research Question Addressed

> *How do structural brain networks mediate the relationship between cognitive traits and alcohol dependence differently across sexes?*

### Next Steps

- **Use your own data**: Upload HCP data to Google Drive and mount it
- **Run full pipeline**: Use `!python Runners/run_pipeline.py` for complete analysis
- **Use Docker**: `docker pull ghcr.io/sean0418/connectopy:latest`

### Links

- 📦 [GitHub Repository](https://github.com/Sean0418/connectopy)
- 🐳 [Docker Image](https://ghcr.io/sean0418/connectopy)


In [None]:
print("\n" + "=" * 60)
print("üéâ Analysis Complete!")
print("=" * 60)
print(f"\nüìä Analyzed {data.shape[0]} subjects")
print(f"üî¨ Found {n_significant} significant dimorphic features")

# Show best model results
if all_results:
    best_result = max(all_results, key=lambda x: x.get("test_auc", 0))
    print(f"\nüç∑ Alcohol Classification Results:")
    print(f"   Best model: {best_result['model']} ({best_result['sex']})")
    print(f"   Test AUC: {best_result['test_auc']:.3f}")
    print(f"   Test Balanced Accuracy: {best_result['test_bal_acc']:.3f}")

print(f"\nüîó Mediation sex difference: {result.difference:.4f} (sig: {result.diff_significant})")
print("\n‚≠ê Star us on GitHub: https://github.com/Sean0418/connectopy")