# Quick Start: Complete HIIT Methylation Analysis Pipeline

## Overview

This notebook is a condensed walkthrough of the complete HIIT (high-intensity interval training) methylation analysis pipeline, from raw GEO data through classification results to biological interpretation.

For detailed explanations and methodology discussions, refer to the individual notebooks (01-05).

### Pipeline Steps

1. Data Acquisition
2. Preprocessing
3. Feature Selection (Ten-Level Framework)
4. Classification
5. Enrichment Analysis

## Setup

In [None]:
# Standard imports
import sys
import pickle
from pathlib import Path

import numpy as np
import pandas as pd

# Add project root to path (assumes the working directory is two levels below the project root)
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Define paths
DATA_DIR = project_root / 'data' / 'raw'
PROCESSED_DIR = project_root / 'data' / 'processed'
MODELS_DIR = project_root / 'models'
RESULTS_DIR = project_root / 'results'

# Create directories
for dir_path in [DATA_DIR, PROCESSED_DIR, MODELS_DIR, RESULTS_DIR]:
    dir_path.mkdir(parents=True, exist_ok=True)

print(f"Project root: {project_root}")

## Step 1: Data Acquisition

Download the GSE171140 series matrix from GEO, load the methylation matrix and sample metadata, and build the sample mapping.
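For orientation, a GEO series matrix file is plain text: metadata lines start with `!`, and the data table sits between the `!series_matrix_table_begin` and `!series_matrix_table_end` markers. The sketch below parses a tiny made-up file with pandas. The function name `parse_series_matrix` and the sample values are illustrative assumptions; this is not necessarily how `GEODataLoader` reads the file.

```python
import io

import pandas as pd

# Tiny stand-in for an extracted series matrix file (all values are made up).
SERIES_MATRIX_TEXT = """\
!Series_geo_accession\t"GSE000000"
!Sample_title\t"sample A"\t"sample B"
!series_matrix_table_begin
"ID_REF"\t"GSM1"\t"GSM2"
"cg00000001"\t0.81\t0.79
"cg00000002"\t0.12\t
!series_matrix_table_end
"""


def parse_series_matrix(text: str) -> pd.DataFrame:
    """Return the probes x samples table embedded in a series matrix file."""
    lines = text.splitlines()
    # Locate the two marker lines that delimit the data table.
    start = next(
        i for i, line in enumerate(lines)
        if line.startswith("!series_matrix_table_begin")
    ) + 1
    end = next(
        i for i, line in enumerate(lines)
        if line.startswith("!series_matrix_table_end")
    )
    table = "\n".join(lines[start:end])
    # Empty cells become NaN, which later steps treat as missing values.
    return pd.read_csv(io.StringIO(table), sep="\t", index_col="ID_REF")


beta = parse_series_matrix(SERIES_MATRIX_TEXT)
print(beta.shape)
```

Probes end up as rows and samples as columns, which is the orientation the rest of this notebook assumes when it indexes `imputed_data.columns` by sample ID.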

In [None]:
from src.data.loader import GEODataLoader
from src.data.sample_mapping import SampleMapper

# Initialize loader and download data
loader = GEODataLoader('GSE171140', data_dir=DATA_DIR)

# Download and extract (skip if already exists)
loader.download_series_matrix(force=False)
loader.extract_gz_file(force=False)

# Load methylation data and metadata
methylation_data = loader.load_methylation_matrix()
metadata = loader.get_metadata()

# Create sample mapping
mapper = SampleMapper()
sample_mapping = mapper.create_sample_mapping(
    metadata, 
    output_path=str(DATA_DIR / 'sample_mapping.csv')
)

print(f"\nData loaded: {methylation_data.shape[0]:,} probes x {methylation_data.shape[1]} samples")

## Step 2: Preprocessing

Normalize beta values, filter low-variance probes (standard-deviation threshold 0.02), and impute missing values with the median.
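As a rough illustration of what these two operations do, the sketch below applies them to a toy matrix with plain pandas. The thresholds match the values this notebook passes to `MethylationPreprocessor`, but the exact rules (drop probes with more than 20% missing values, drop probes whose standard deviation is below 0.02, fill the rest with the per-probe median) are assumptions about its behavior, not a description of its implementation.

```python
import numpy as np
import pandas as pd

# Toy beta-value matrix: probes as rows, samples as columns (values are made up).
beta = pd.DataFrame(
    {
        "s1": [0.10, 0.50, 0.90, np.nan],
        "s2": [0.11, 0.70, np.nan, np.nan],
        "s3": [0.10, 0.30, 0.20, np.nan],
        "s4": [0.11, 0.60, 0.50, 0.45],
        "s5": [0.10, 0.40, 0.70, 0.50],
        "s6": [0.11, 0.55, 0.60, 0.40],
    },
    index=["cg_flat", "cg_variable", "cg_one_missing", "cg_mostly_missing"],
)

STD_THRESHOLD = 0.02      # assumed rule: drop probes with std below this
MISSING_THRESHOLD = 0.2   # assumed rule: drop probes with more than 20% missing

# 1. Remove probes with too many missing values.
missing_fraction = beta.isna().mean(axis=1)
kept = beta.loc[missing_fraction <= MISSING_THRESHOLD]

# 2. Remove probes that barely vary across samples.
kept = kept.loc[kept.std(axis=1) >= STD_THRESHOLD]

# 3. Fill remaining gaps with each probe's own median.
imputed = kept.apply(lambda row: row.fillna(row.median()), axis=1)

print(imputed.index.tolist())  # ['cg_variable', 'cg_one_missing']
```

Imputing per probe keeps each CpG on its own scale; a single global median would pull sparsely measured probes toward the array-wide average.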

In [None]:
from src.data.preprocessing import MethylationPreprocessor, normalize_beta_values

# Normalize beta values
methylation_normalized = normalize_beta_values(methylation_data)

# Initialize preprocessor
preprocessor = MethylationPreprocessor(
    std_threshold=0.02,
    missing_threshold=0.2
)

# Filter and impute
filtered_data = preprocessor.filter_low_variance(methylation_normalized)
imputed_data = preprocessor.handle_missing_values(filtered_data, strategy='median')

# Create data versions
batch_info = sample_mapping.set_index('sample_id').loc[
    imputed_data.columns, 'study_group'
]
data_versions = preprocessor.create_data_versions(imputed_data, batch_info)

# Save preprocessed data
with open(PROCESSED_DIR / 'methyl_data_preprocessed.pkl', 'wb') as f:
    pickle.dump(imputed_data, f)

print(f"Preprocessed: {imputed_data.shape[0]:,} probes")

## Step 3: Feature Selection

Apply the Ten-Level Feature Selection Framework at moderate stringency (`L5_moderate`) to the binary HIIT vs. Control comparison.
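The general idea of a stringency level can be sketched without the framework itself: each level pairs a statistical cutoff with an effect-size cutoff, and stricter levels keep fewer probes. The example below uses a Welch t-statistic and a mean beta difference on synthetic data. The three level names and their cutoffs are hypothetical and are not the definitions used by `TenLevelFeatureSelector`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the samples x probes matrix X and binary labels y.
n_per_group, n_probes = 12, 200
X = rng.normal(loc=0.5, scale=0.05, size=(2 * n_per_group, n_probes))
y = np.repeat([0, 1], n_per_group)
X[y == 1, :10] += 0.15  # the first 10 probes carry a real group difference


def welch_t(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Per-probe Welch t-statistic between class 1 and class 0."""
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se


t_stat = welch_t(X, y)
delta_beta = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)

# Hypothetical stringency levels: (minimum |t|, minimum |mean difference|).
LEVELS = {
    "lenient": (2.0, 0.00),
    "moderate": (3.0, 0.05),
    "strict": (5.0, 0.10),
}

selected = {
    name: np.flatnonzero(
        (np.abs(t_stat) >= t_min) & (np.abs(delta_beta) >= d_min)
    )
    for name, (t_min, d_min) in LEVELS.items()
}
for name, idx in selected.items():
    print(f"{name}: {len(idx)} probes")
```

Because each level tightens both cutoffs, the selected sets are nested: every probe that passes a stricter level also passes the more lenient ones.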

In [None]:
from src.features import TenLevelFeatureSelector, FeatureSelectionConfig

# Prepare data for binary classification
sample_ids = imputed_data.columns.tolist()
sample_info = sample_mapping.set_index('sample_id').loc[sample_ids].reset_index()

binary_mask = sample_info['binary_class'].isin(['HIIT', 'Control'])
binary_samples = sample_info[binary_mask]['sample_id'].tolist()
binary_labels = (sample_info[binary_mask]['binary_class'] == 'HIIT').astype(int).values

# Extract feature matrix (samples x probes); column order follows imputed_data.index
X = imputed_data[binary_samples].T.values
y = binary_labels

# Initialize selector
config = FeatureSelectionConfig()
selector = TenLevelFeatureSelector(config)

# Select features at moderate stringency
selected_features = selector.select_binary_features(
    X, y, level='L5_moderate'
)

print(f"Selected {len(selected_features)} features at L5 stringency")

# Save features
features_dir = PROCESSED_DIR / 'features'
features_dir.mkdir(exist_ok=True)
pd.DataFrame({'probe_id': list(selected_features)}).to_csv(
    features_dir / 'selected_features.csv', index=False
)

## Step 4: Classification

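Train and evaluate a batch-aware classifier with 5-fold cross-validation repeated 10 times. Note that the features were selected in Step 3 using all samples, so the cross-validation scores below are likely optimistic; for an unbiased estimate, repeat feature selection within each training fold.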
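The setting `batch_handling='covariate'` suggests that batch membership enters the model as additional predictors. Assuming that reading is right, a minimal version of the idea is to one-hot encode the batch labels and append them to the feature matrix. The helper `add_batch_covariates` and the toy values are illustrative and are not part of `BatchAwareClassifier`.

```python
import numpy as np
import pandas as pd

# Toy feature matrix (samples x probes) and batch labels; values are made up.
X_toy = np.array([[0.2, 0.8], [0.3, 0.7], [0.6, 0.4], [0.7, 0.3]])
batch_toy = np.array(["study_A", "study_A", "study_B", "study_B"])


def add_batch_covariates(X: np.ndarray, batch: np.ndarray) -> np.ndarray:
    """Append one-hot batch indicators (first level dropped) as extra columns."""
    dummies = pd.get_dummies(pd.Series(batch), drop_first=True)
    return np.hstack([X, dummies.to_numpy(dtype=float)])


X_with_batch = add_batch_covariates(X_toy, batch_toy)
print(X_with_batch.shape)  # (4, 3)
```

Dropping the first level avoids a redundant column, since with two batches a single indicator already identifies both. In a real cross-validation loop the encoding would need to be fitted on the training fold and reused on the test fold so that both share the same columns.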

In [None]:
from src.models import (
    ClassifierConfig,
    BatchAwareClassifier,
    CrossValidationStrategy
)

# Prepare feature matrix with selected features
# (assumes selected_features contains probe IDs present in imputed_data.index)
available_features = [f for f in selected_features if f in imputed_data.index]
X_selected = imputed_data.loc[available_features, binary_samples].T.values
batch = sample_info[binary_mask]['study_group'].values

# Configure and train classifier
clf_config = ClassifierConfig(
    classifier_type='random_forest',
    n_estimators=100,
    random_state=42
)

classifier = BatchAwareClassifier(
    config=clf_config,
    batch_handling='covariate'
)

# Cross-validation
cv_strategy = CrossValidationStrategy(n_splits=5, n_repeats=10, random_state=42)
cv_results = cv_strategy.evaluate(classifier, X_selected, y, batch=batch)

print("\nCross-Validation Results:")
print(f"  Accuracy: {cv_results['accuracy_mean']:.3f} +/- {cv_results['accuracy_std']:.3f}")
print(f"  AUC-ROC: {cv_results['auc_mean']:.3f} +/- {cv_results['auc_std']:.3f}")

# Train final model and save
classifier.fit(X_selected, y, batch=batch)
with open(MODELS_DIR / 'classifier.pkl', 'wb') as f:
    pickle.dump(classifier, f)

## Step 5: Enrichment Analysis

Map the selected CpG probes to genes and run functional enrichment analysis on them.
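Two ideas underlie this step: looking up the genes annotated to each probe, and asking whether a gene set is over-represented among them. The sketch below shows both on toy data. The column names `IlmnID` and `UCSC_RefGene_Name` (semicolon-separated gene symbols) follow Illumina manifest conventions, but whether `EPICAnnotationMapper` and `EnrichmentAnalyzer` work this way is an assumption, and the helper functions and numbers are hypothetical.

```python
from math import comb

import pandas as pd

# Toy annotation table with Illumina-style column names; rows are made up.
manifest = pd.DataFrame(
    {
        "IlmnID": ["cg01", "cg02", "cg03", "cg04"],
        "UCSC_RefGene_Name": ["GENE_A;GENE_A", "GENE_B;GENE_C", None, "GENE_D"],
    }
).set_index("IlmnID")


def probes_to_genes(probes, manifest):
    """Collect the unique gene symbols annotated to the given probes."""
    found = manifest.index.intersection(probes)
    names = manifest.loc[found, "UCSC_RefGene_Name"]
    genes = set()
    for entry in names.dropna():
        genes.update(g for g in entry.split(";") if g)
    return genes


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K belong to the term."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


genes = probes_to_genes(["cg01", "cg02", "cg03"], manifest)
print(sorted(genes))  # ['GENE_A', 'GENE_B', 'GENE_C']

# Hypothetical numbers: 3 of 5 selected genes fall in a 10-gene term,
# against a background of 100 genes.
p = hypergeom_pvalue(k=3, n=5, K=10, N=100)
print(f"p = {p:.4g}")
```

A probe can be annotated to several genes or to none, so the number of genes generally differs from the number of probes. Real enrichment tools also correct the resulting p-values for testing many terms at once, which this sketch omits.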

In [None]:
from src.enrichment import EnrichmentAnalyzer, EPICAnnotationMapper

# Map CpGs to genes (requires the EPIC manifest CSV at data/external/EPIC_manifest.csv)
annotation_mapper = EPICAnnotationMapper(
    annotation_file=project_root / 'data' / 'external' / 'EPIC_manifest.csv'
)
genes = annotation_mapper.map_probes_to_genes(available_features)

print(f"Mapped {len(available_features)} CpGs to {len(genes)} genes")

# Run enrichment analysis
analyzer = EnrichmentAnalyzer()
results = analyzer.run_comprehensive_analysis(cpg_sites=available_features)

# Save results
enrichment_dir = RESULTS_DIR / 'enrichment'
enrichment_dir.mkdir(exist_ok=True)

if 'GO' in results and 'BP' in results['GO']:
    results['GO']['BP'].to_csv(enrichment_dir / 'go_bp_results.csv', index=False)
    print("\nTop enriched biological processes saved.")

## Results Summary

In [None]:
print("=" * 60)
print("PIPELINE COMPLETE")
print("=" * 60)
print("\nData:")
print(f"  Original probes: {methylation_data.shape[0]:,}")
print(f"  After preprocessing: {imputed_data.shape[0]:,}")
print(f"  Samples: {len(binary_samples)}")

print("\nFeature Selection:")
print(f"  Selected features: {len(selected_features)}")

print("\nClassification:")
print(f"  CV Accuracy: {cv_results['accuracy_mean']:.3f}")
print(f"  CV AUC-ROC: {cv_results['auc_mean']:.3f}")

print("\nEnrichment:")
print(f"  Genes analyzed: {len(genes)}")

print("\nOutput files:")
print(f"  - {PROCESSED_DIR / 'methyl_data_preprocessed.pkl'}")
print(f"  - {features_dir / 'selected_features.csv'}")
print(f"  - {MODELS_DIR / 'classifier.pkl'}")

## Next Steps

For more detailed analysis:

1. **01_data_acquisition.ipynb**: Understanding the experimental design
2. **02_preprocessing.ipynb**: Quality control and batch effect analysis
3. **03_feature_selection.ipynb**: Exploring different stringency levels
4. **04_classification.ipynb**: Model comparison and feature importance
5. **05_enrichment_analysis.ipynb**: Biological interpretation

For customization:
- See **examples/custom_analysis.ipynb** for advanced configurations