# 02 - Data Preprocessing for ML

## Overview
In this notebook, we'll prepare the TCGA data for machine learning:

1. **Handle high dimensionality** - Select the most informative genes
2. **Normalize the data** - Scale features for ML algorithms
3. **Split data** - Create train/test sets with proper stratification

### Why Preprocessing Matters
- 20,000 features vs 801 samples: far more features than samples, so models can easily overfit
- Many genes barely vary across samples and carry little information
- Different genes are measured on different scales
- A properly held-out test set is needed to evaluate models honestly
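
To make the overfitting risk concrete, here is a minimal sketch on synthetic noise (not the TCGA data; every name in it is illustrative). With more features than samples, a linear model can reproduce random training labels exactly while doing no better than chance on new samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 50, 500  # many more features than samples

# Pure noise: the labels have no relationship to the features
X_noise = rng.normal(size=(n_samples, n_features))
y_noise = rng.integers(0, 2, size=n_samples).astype(float)

# Minimum-norm least-squares fit (the system is underdetermined)
coef, *_ = np.linalg.lstsq(X_noise, y_noise, rcond=None)
train_acc = ((X_noise @ coef > 0.5).astype(float) == y_noise).mean()

# Fresh noise drawn the same way plays the role of unseen data
X_new = rng.normal(size=(1000, n_features))
y_new = rng.integers(0, 2, size=1000).astype(float)
test_acc = ((X_new @ coef > 0.5).astype(float) == y_new).mean()

print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```

The training accuracy is perfect even though there is nothing to learn, which is why the steps below reduce the number of genes and keep a held-out test set.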

## 1. Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif, mutual_info_classif
from sklearn.decomposition import PCA

import sys
sys.path.append('..')
from src.data_loader import load_tcga_data, CANCER_TYPE_INFO

plt.style.use('seaborn-v0_8-whitegrid')

# For reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries loaded!")

In [None]:
# Load data
X, y, gene_names, sample_ids = load_tcga_data(verbose=True)
print(f"\nOriginal shape: {X.shape}")

## 2. Train/Test Split

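**Important**: Split BEFORE fitting any preprocessing step (feature selection, scaling) so that no information from the test set leaks into training!

We use a stratified split so that each cancer type keeps the same proportion in the training and test sets.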
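
The leakage warning can be illustrated with a small experiment on synthetic noise. This is a sketch under assumed settings (100 samples, 5,000 random features, 20 selected features, logistic regression), not part of the notebook's pipeline. Selecting features on all samples before cross-validation makes pure noise look predictive, while selecting inside each training fold does not.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 5000))  # features are pure noise
y_noise = np.repeat([0, 1], 50)         # labels unrelated to the features

# Leaky: the selector sees every sample, including future validation folds
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y_noise, cv=5
).mean()

# Honest: the selector is refit inside each training fold
honest_model = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest_acc = cross_val_score(honest_model, X_noise, y_noise, cv=5).mean()

print(f"Selection before CV (leaky):  {leaky_acc:.2f}")
print(f"Selection inside CV (honest): {honest_acc:.2f}")
```

The leaky estimate should come out well above the honest one, even though no real signal exists in either case.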

In [None]:
# Encode labels to numeric
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print("Label encoding:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {label} -> {i}")

In [None]:
# Stratified train/test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, 
    test_size=0.2, 
    random_state=RANDOM_STATE,
    stratify=y_encoded  # Maintain class proportions
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Verify stratification
print("\nClass distribution:")
for i, label in enumerate(label_encoder.classes_):
    train_pct = (y_train == i).sum() / len(y_train) * 100
    test_pct = (y_test == i).sum() / len(y_test) * 100
    print(f"  {label}: Train {train_pct:.1f}%, Test {test_pct:.1f}%")

## 3. Feature Selection Strategy

We'll use a multi-step approach:
1. **Variance filtering** - Remove near-constant genes
2. **Statistical selection** - Keep genes most associated with cancer type
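
As an alternative packaging of the same two steps plus scaling, they could be chained in a scikit-learn `Pipeline`. The sketch below uses toy data and illustrative names (`X_toy`, `pipe`) and is not the code used in this notebook, which applies each step explicitly so the surviving gene names can be tracked.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy "expression" matrix: 60 samples, 40 features with different scales
X_toy = rng.normal(size=(60, 40)) * rng.uniform(0.1, 3.0, size=40)
y_toy = np.repeat([0, 1, 2], 20)
X_toy[:, 0] += 5 * y_toy  # make column 0 clearly class-dependent

# Median variance as the cut-off, computed on the fitting data only
threshold = np.median(np.var(X_toy, axis=0))

pipe = Pipeline([
    ("variance", VarianceThreshold(threshold=threshold)),
    ("anova", SelectKBest(f_classif, k=5)),
    ("scale", StandardScaler()),
])
X_out = pipe.fit_transform(X_toy, y_toy)

# Map the surviving columns back to their original indices
var_mask_toy = pipe.named_steps["variance"].get_support()
kbest_mask_toy = pipe.named_steps["anova"].get_support()
selected_columns = np.flatnonzero(var_mask_toy)[kbest_mask_toy]

print(X_out.shape)
print(selected_columns)
```

A pipeline like this is refit as a unit, so it can be passed to cross-validation without leaking information between folds.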

### Why not use PCA for classification?
PCA is great for visualization, but for classification we want:
- Interpretable features (real genes, not abstract components)
- Supervised selection (genes that predict the target)
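
The difference between unsupervised PCA and supervised selection can be seen on a toy example (synthetic data, illustrative names). One column has large variance but no relation to the label; another has small variance but separates the classes. PCA's first component follows the high-variance column, while the F-test picks the informative one.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200
y_demo = np.repeat([0, 1], n // 2)

X_demo = rng.normal(size=(n, 10))
X_demo[:, 0] *= 10.0          # column 0: high variance, unrelated to the label
X_demo[:, 1] += 3.0 * y_demo  # column 1: modest variance, strongly class-dependent

# PCA ignores the labels and chases variance
pca_demo = PCA(n_components=1).fit(X_demo)
top_pca_feature = int(np.argmax(np.abs(pca_demo.components_[0])))

# The ANOVA F-test uses the labels and ranks by class separation
kbest_demo = SelectKBest(f_classif, k=1).fit(X_demo, y_demo)
top_f_feature = int(np.argmax(kbest_demo.scores_))

print(f"Column dominating PC1:     {top_pca_feature}")
print(f"Column with best F-score:  {top_f_feature}")
```

Note also that `pca_demo.components_` assigns a weight to every column, whereas `kbest_demo.get_support()` returns a mask over the original columns, which is what keeps the selected features interpretable as real genes.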

In [None]:
# Step 1: Remove low-variance genes
# Only use training data to determine threshold!

gene_variance = np.var(X_train, axis=0)

# Use median variance as threshold (removes ~50% of least variable genes)
variance_threshold = np.percentile(gene_variance, 50)
print(f"Variance threshold (50th percentile): {variance_threshold:.4f}")

# Apply variance filter
var_selector = VarianceThreshold(threshold=variance_threshold)
X_train_var = var_selector.fit_transform(X_train)
X_test_var = var_selector.transform(X_test)

# Track which genes remain
var_mask = var_selector.get_support()
genes_after_var = [g for g, m in zip(gene_names, var_mask) if m]

print(f"\nGenes before: {X_train.shape[1]}")
print(f"Genes after variance filter: {X_train_var.shape[1]}")
print(f"Removed: {X_train.shape[1] - X_train_var.shape[1]} low-variance genes")

In [None]:
# Step 2: Select top K genes using ANOVA F-test
# F-test measures how well each gene separates the cancer types

K_FEATURES = 1000  # Keep top 1000 genes (a tunable choice; adjust as needed)

# Fit selector on training data only
kbest_selector = SelectKBest(score_func=f_classif, k=K_FEATURES)
X_train_selected = kbest_selector.fit_transform(X_train_var, y_train)
X_test_selected = kbest_selector.transform(X_test_var)

# Get selected gene names
kbest_mask = kbest_selector.get_support()
selected_genes = [g for g, m in zip(genes_after_var, kbest_mask) if m]

print(f"Final feature count: {X_train_selected.shape[1]}")
print(f"\nTop 20 genes by F-score:")

# Get F-scores for selected genes
f_scores = kbest_selector.scores_[kbest_mask]
gene_scores = list(zip(selected_genes, f_scores))
gene_scores.sort(key=lambda x: x[1], reverse=True)

for gene, score in gene_scores[:20]:
    print(f"  {gene}: {score:.1f}")

In [None]:
# Visualize F-score distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# All F-scores
ax1 = axes[0]
ax1.hist(kbest_selector.scores_, bins=50, alpha=0.7, edgecolor='black')
ax1.axvline(np.sort(kbest_selector.scores_)[-K_FEATURES], color='red',
            linestyle='--', label='Selection threshold')
ax1.set_xlabel('F-score', fontsize=12)
ax1.set_ylabel('Number of Genes', fontsize=12)
ax1.set_title('Distribution of F-scores (ANOVA)', fontsize=14)
ax1.legend()

# Top genes
ax2 = axes[1]
top_20 = gene_scores[:20]
ax2.barh([g[0] for g in top_20][::-1], [g[1] for g in top_20][::-1], color='steelblue')
ax2.set_xlabel('F-score', fontsize=12)
ax2.set_title('Top 20 Most Discriminative Genes', fontsize=14)

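plt.tight_layout()
# Make sure the output directory exists before saving
import os
os.makedirs('../results/figures', exist_ok=True)
plt.savefig('../results/figures/feature_selection.png', dpi=150, bbox_inches='tight')
plt.show()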

## 4. Feature Scaling

Standardization (z-score normalization) rescales each gene to mean 0 and standard deviation 1, using statistics computed on the training set.

This is important for scale-sensitive algorithms such as SVMs, logistic regression, and neural networks.
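
A minimal worked example of the z-score, z = (x - mean) / std, on a tiny made-up matrix (the values and names are illustrative). The mean and standard deviation come from the training rows only and are then reused for the test row; `StandardScaler` uses the population standard deviation (`ddof=0`), which the manual calculation mirrors.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two "genes" on very different scales
train = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

# Statistics from the training rows only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # ddof=0, as in StandardScaler

manual_train = (train - mu) / sigma
manual_test = (test - mu) / sigma

# Same result via scikit-learn
scaler_demo = StandardScaler().fit(train)
sk_test = scaler_demo.transform(test)

print(manual_test)
print(sk_test)
```

After scaling, both columns of the test row have the same z-score even though the raw values differ by a factor of 100. The test row is not centred at 0 because it was scaled with the training statistics, which is the intended behaviour.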

In [None]:
# Fit scaler on training data only
scaler = StandardScaler()
X_train_final = scaler.fit_transform(X_train_selected)
X_test_final = scaler.transform(X_test_selected)

print("After scaling:")
print(f"  Training mean: {X_train_final.mean():.6f} (should be ~0)")
print(f"  Training std: {X_train_final.std():.6f} (should be ~1)")
print(f"  Test mean: {X_test_final.mean():.6f}")
print(f"  Test std: {X_test_final.std():.6f}")

## 5. Verify Preprocessing with PCA Visualization

In [None]:
# Quick PCA to check whether the classes still separate after preprocessing
pca = PCA(n_components=2, random_state=RANDOM_STATE)
X_train_pca = pca.fit_transform(X_train_final)
X_test_pca = pca.transform(X_test_final)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

colors = sns.color_palette('husl', len(label_encoder.classes_))

# Training data
ax1 = axes[0]
for i, label in enumerate(label_encoder.classes_):
    mask = y_train == i
    ax1.scatter(X_train_pca[mask, 0], X_train_pca[mask, 1], 
                c=[colors[i]], label=label, alpha=0.6, s=50)
ax1.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
ax1.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
ax1.set_title('Training Data (Preprocessed)', fontsize=14)
ax1.legend()

# Test data
ax2 = axes[1]
for i, label in enumerate(label_encoder.classes_):
    mask = y_test == i
    ax2.scatter(X_test_pca[mask, 0], X_test_pca[mask, 1], 
                c=[colors[i]], label=label, alpha=0.6, s=50)
ax2.set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
ax2.set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
ax2.set_title('Test Data (Preprocessed)', fontsize=14)
ax2.legend()

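plt.tight_layout()
# Make sure the output directory exists before saving
import os
os.makedirs('../results/figures', exist_ok=True)
plt.savefig('../results/figures/preprocessed_pca.png', dpi=150, bbox_inches='tight')
plt.show()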

print("Compare the two panels: test samples should fall in the same class clusters as the training samples.")

## 6. Save Preprocessed Data

In [None]:
# Save everything needed for model training
import os
import pickle

# Make sure the output directory exists before saving
os.makedirs('../data/processed', exist_ok=True)

# Data
np.save('../data/processed/X_train.npy', X_train_final)
np.save('../data/processed/X_test.npy', X_test_final)
np.save('../data/processed/y_train.npy', y_train)
np.save('../data/processed/y_test.npy', y_test)

# Gene names (for interpretation)
with open('../data/processed/selected_genes.pkl', 'wb') as f:
    pickle.dump(selected_genes, f)

# Label encoder (for converting predictions back to names)
with open('../data/processed/label_encoder.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)

# Preprocessing objects (for new data)
with open('../data/processed/preprocessors.pkl', 'wb') as f:
    pickle.dump({
        'var_selector': var_selector,
        'kbest_selector': kbest_selector,
        'scaler': scaler
    }, f)

print("Saved preprocessed data and pipeline objects!")
print("\nFiles created:")
print("  - X_train.npy, X_test.npy (feature matrices)")
print("  - y_train.npy, y_test.npy (labels)")
print("  - selected_genes.pkl (gene names for interpretation)")
print("  - label_encoder.pkl (convert numeric labels to cancer names)")
print("  - preprocessors.pkl (for preprocessing new samples)")
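
To show how the saved preprocessing objects might be reused on new samples, here is a sketch with toy stand-ins for the fitted objects. The helper `preprocess_new_samples` is hypothetical (it is not defined elsewhere in this project), and the pickle round trip is done in memory rather than through `preprocessors.pkl`. The key point is that the steps must be applied in the order they were fitted.

```python
import pickle

import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.preprocessing import StandardScaler


def preprocess_new_samples(X_new, preprocessors):
    """Apply the fitted steps in fitting order (hypothetical helper)."""
    X_out = preprocessors['var_selector'].transform(X_new)
    X_out = preprocessors['kbest_selector'].transform(X_out)
    return preprocessors['scaler'].transform(X_out)


# Toy stand-ins for the objects fitted in this notebook
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(40, 30))
y_fit = np.repeat([0, 1], 20)

var_sel = VarianceThreshold(threshold=0.0).fit(X_fit)
X_fit_var = var_sel.transform(X_fit)
kbest_sel = SelectKBest(f_classif, k=5).fit(X_fit_var, y_fit)
scaler_toy = StandardScaler().fit(kbest_sel.transform(X_fit_var))

# Round trip through pickle, in memory instead of a file on disk
saved_bytes = pickle.dumps({
    'var_selector': var_sel,
    'kbest_selector': kbest_sel,
    'scaler': scaler_toy,
})
loaded = pickle.loads(saved_bytes)

# New samples must have the same columns, in the same order, as the fitting data
X_new = rng.normal(size=(3, 30))
X_new_ready = preprocess_new_samples(X_new, loaded)
print(X_new_ready.shape)
```

With the real file, the dictionary would instead be read with `pickle.load` from `../data/processed/preprocessors.pkl`, and new samples would need the same genes in the same column order as the original matrix.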

## 7. Preprocessing Summary

In [None]:
print("="*60)
print("PREPROCESSING PIPELINE SUMMARY")
print("="*60)
print()
print("INPUT:")
print(f"  - {X.shape[0]} samples x {X.shape[1]:,} genes")
print()
print("STEPS:")
print("  1. Train/Test Split: 80/20 stratified")
print(f"  2. Variance Filter: Removed {X_train.shape[1] - X_train_var.shape[1]:,} low-variance genes")
print(f"  3. SelectKBest: Kept top {K_FEATURES} genes by F-score")
print("  4. StandardScaler: Normalized to mean=0, std=1")
print()
print("OUTPUT:")
print(f"  - Training: {X_train_final.shape[0]} samples x {X_train_final.shape[1]} features")
print(f"  - Test: {X_test_final.shape[0]} samples x {X_test_final.shape[1]} features")
print()
print("READY FOR MODEL TRAINING!")