# Solution: The Unreproducible Model

This notebook walks through the solution to the reproducibility drill: a training pipeline whose AUC changes from run to run, and the fix that makes it stable.

---

In [None]:
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [None]:
# Generate sample data
np.random.seed(42)
n_samples = 1000

data = pd.DataFrame({
    'feature_1': np.random.normal(0, 1, n_samples),
    'feature_2': np.random.normal(0, 1, n_samples),
    'feature_3': np.random.normal(0, 1, n_samples),
})
data['target'] = ((data['feature_1'] + data['feature_2'] * 0.5) > 0).astype(int)

print(f"Data shape: {data.shape}")

## The Problem: Missing Random Seeds

The original pipeline does not set `random_state` in `train_test_split` or in the model, so every call draws a different train/test split and the reported AUC changes from run to run.
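When `random_state` is left as `None`, scikit-learn falls back to NumPy's global random state, which advances on every call. The sketch below isolates that effect on a toy array; the array and variable names are illustrative and are not part of the drill's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, labels equal to the row index so a split is easy to inspect.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Unseeded: each call consumes the global NumPy random state,
# so the held-out rows generally differ between calls.
_, _, _, unseeded_a = train_test_split(X, y, test_size=0.2)
_, _, _, unseeded_b = train_test_split(X, y, test_size=0.2)
print("Unseeded splits identical:", np.array_equal(unseeded_a, unseeded_b))

# Seeded: the same integer always yields the same held-out rows.
_, _, _, seeded_a = train_test_split(X, y, test_size=0.2, random_state=0)
_, _, _, seeded_b = train_test_split(X, y, test_size=0.2, random_state=0)
print("Seeded splits identical:  ", np.array_equal(seeded_a, seeded_b))
```

The seeded pair always matches. The unseeded pair will almost always differ, which is the source of the changing AUC in the broken pipeline.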

In [None]:
def broken_train_pipeline(df):
    """Training pipeline with reproducibility bugs."""
    
    features = ['feature_1', 'feature_2', 'feature_3']
    X = df[features]
    y = df['target']
    
    # BUG 1: No random_state in train_test_split!
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2
    )
    
    # BUG 2: No random_state in model!
    model = GradientBoostingClassifier(
        n_estimators=100,
        max_depth=3
    )
    
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    return roc_auc_score(y_test, y_prob)

# Test broken pipeline
print("=== Broken Pipeline ===")
results = [broken_train_pipeline(data) for _ in range(5)]
print(f"AUC across 5 runs: {[f'{r:.4f}' for r in results]}")
print(f"Variance: {np.var(results):.6f}")
print("❌ Results vary between runs!")

## The Solution: Add Random State Everywhere

The fix adds `random_state` to both `train_test_split` and the model.

In [None]:
def fixed_train_pipeline(df, random_state=42):
    """Reproducible training pipeline."""
    
    features = ['feature_1', 'feature_2', 'feature_3']
    X = df[features]
    y = df['target']
    
    # FIX 1: Add random_state to train_test_split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    
    # FIX 2: Add random_state to model
    model = GradientBoostingClassifier(
        n_estimators=100,
        max_depth=3,
        random_state=random_state
    )
    
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    return roc_auc_score(y_test, y_prob)

print("✓ Fixed pipeline defined")

## Comparing Results

In [None]:
print("=== Fixed Pipeline ===")
fixed_results = [fixed_train_pipeline(data) for _ in range(5)]
print(f"AUC across 5 runs: {[f'{r:.4f}' for r in fixed_results]}")
print(f"Variance: {np.var(fixed_results):.6f}")

is_reproducible = len(set(fixed_results)) == 1
print(f"\n{'✓ Reproducible!' if is_reproducible else '❌ Still not reproducible'}")

In [None]:
# Verification
results = [fixed_train_pipeline(data) for _ in range(3)]
assert len(set(results)) == 1, f"Results should be identical: {results}"

print("\n✓ All checks passed!")
print(f"✓ Consistent AUC: {results[0]:.4f}")

## Key Insight

For reproducible ML experiments:

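1. **Set `random_state` in `train_test_split`** - Ensures the same train/test split on every run
2. **Set `random_state` in the model** - Fixes the randomness inside training (for example, the feature permutation at each tree split), so the same data yields the same model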
3. **Use config files** - Store all hyperparameters including seeds
4. **Test explicitly** - Run multiple times and verify identical results
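As a minimal sketch of point 3, the seed can live in the same file as the other hyperparameters and be injected into every randomized component. The JSON layout, key names, and helper functions below are hypothetical, chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical experiment config: key names are illustrative only.
config = {
    "seed": 42,
    "split": {"test_size": 0.2},
    "model": {"n_estimators": 100, "max_depth": 3},
}

def save_config(cfg, path):
    """Write the config as sorted, indented JSON so diffs stay readable."""
    Path(path).write_text(json.dumps(cfg, indent=2, sort_keys=True))

def load_config(path):
    """Read the config back as a plain dict."""
    return json.loads(Path(path).read_text())

# Round-trip through a temporary file to mimic loading a stored config.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "experiment.json"
    save_config(config, config_path)
    loaded = load_config(config_path)

# Every randomized component receives the same stored seed.
split_kwargs = {**loaded["split"], "random_state": loaded["seed"]}
model_kwargs = {**loaded["model"], "random_state": loaded["seed"]}

print(split_kwargs)
print(model_kwargs)
```

The resulting dictionaries can be unpacked into `train_test_split(X, y, **split_kwargs)` and `GradientBoostingClassifier(**model_kwargs)`, so the seed used for a run is recorded alongside its hyperparameters.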

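Components that accept `random_state`:
- `train_test_split`
- `KFold`, `StratifiedKFold` (only when `shuffle=True`; without shuffling the folds are already deterministic)
- `RandomForestClassifier`, `GradientBoostingClassifier`
- `LogisticRegression` (used by the `sag`, `saga`, and `liblinear` solvers)
- Any data shuffling, such as `sklearn.utils.shuffle` or `DataFrame.sample`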
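For the cross-validation splitters, the seed only matters once shuffling is enabled. The sketch below compares `KFold` with and without shuffling on a toy array; the helper `fold_test_indices` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 10 rows, so each of the 5 folds holds out 2 rows.
X = np.arange(20).reshape(10, 2)

def fold_test_indices(splitter):
    """Return the held-out row indices of every fold as plain lists."""
    return [test_idx.tolist() for _, test_idx in splitter.split(X)]

# Without shuffling, folds are contiguous blocks and need no seed.
plain_a = fold_test_indices(KFold(n_splits=5))
plain_b = fold_test_indices(KFold(n_splits=5))
print("Unshuffled folds identical:", plain_a == plain_b)

# With shuffling, a fixed random_state makes the folds repeatable.
shuffled_a = fold_test_indices(KFold(n_splits=5, shuffle=True, random_state=42))
shuffled_b = fold_test_indices(KFold(n_splits=5, shuffle=True, random_state=42))
print("Seeded shuffled folds identical:", shuffled_a == shuffled_b)
```

Shuffled folds created without a `random_state` would depend on the global random state, in the same way as the unseeded `train_test_split` call in the broken pipeline.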