## Phase 6: Model Training, Evaluation & Selection

### What is Model Training?

Training is the process of fitting a model's parameters to labeled data so that it can predict the label of examples it has not seen.

### Model Selection Strategy

**1. Start Simple (Baseline Models)**

Start with a simple model as a baseline. It sets a reference score that any more complex model must beat to justify its extra cost, which guards against over-engineering.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: Logistic Regression
baseline_model = LogisticRegression(max_iter=1000, random_state=42)
baseline_model.fit(X_train, y_train)

y_pred = baseline_model.predict(X_test)
y_proba = baseline_model.predict_proba(X_test)[:, 1]

# Evaluate
print("Baseline Model Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_proba):.4f}")
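
Before reading much into the baseline's numbers, it can help to know what a model with no signal at all would score. The sketch below is an illustration, not part of the original pipeline: it assumes synthetic data from `make_classification` as a stand-in for the real `X` and `y`, and uses scikit-learn's `DummyClassifier`, which always predicts the majority class. The names `X_demo`, `floor_model`, and `logreg` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: roughly 85% negatives, 15% positives
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Floor: always predict the most frequent class
floor_model = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
floor_pred = floor_model.predict(X_te)
floor_proba = floor_model.predict_proba(X_te)[:, 1]

# Simple baseline for comparison
logreg = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
logreg_proba = logreg.predict_proba(X_te)[:, 1]

floor_accuracy = accuracy_score(y_te, floor_pred)
floor_recall = recall_score(y_te, floor_pred)
floor_auc = roc_auc_score(y_te, floor_proba)
logreg_auc = roc_auc_score(y_te, logreg_proba)

print(f"Floor accuracy: {floor_accuracy:.4f}")  # high only because negatives dominate
print(f"Floor recall:   {floor_recall:.4f}")    # no positives are ever predicted
print(f"Floor ROC-AUC:  {floor_auc:.4f}")       # constant scores carry no ranking information
print(f"LogReg ROC-AUC: {logreg_auc:.4f}")
```

If a candidate model cannot clearly beat this floor on recall and ROC-AUC, a high accuracy figure on its own says little.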


**2. Try Multiple Algorithms**


In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
import pandas as pd

# Dictionary of models to try
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    # use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42)
}

# Train and evaluate each model
results = []
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'ROC-AUC': roc_auc_score(y_test, y_proba)
    })

results_df = pd.DataFrame(results)
print(results_df)


### Model Evaluation Metrics

**For Classification Problems:**

```
1. Accuracy: (TP + TN) / Total
   - Overall correctness
   - Misleading with imbalanced data

2. Precision: TP / (TP + FP)
   - Of positive predictions, how many were correct?
   - Important when false positives are costly
   - Churn: Giving incentive to non-churners costs money

3. Recall: TP / (TP + FN)
   - Of actual positives, how many did we catch?
   - Important when false negatives are costly
   - Churn: Missing churners loses customers

4. F1-Score: 2 * (Precision * Recall) / (Precision + Recall)
   - Balanced combination of precision and recall
   - Best when both matter equally

5. ROC-AUC: Area Under Receiver Operating Characteristic Curve
   - Measures model's ability to distinguish classes
   - Ranges 0-1; 0.5 is random guessing, higher is better
   - Does not depend on the class ratio, so it can look optimistic when positives are rare

6. PR-AUC: Area Under Precision-Recall Curve
   - Better for imbalanced datasets
   - More informative than ROC-AUC for rare events
```
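
The formulas above can be checked by hand on a tiny example. The sketch below uses ten made-up labels (hypothetical data, chosen so the arithmetic is easy to follow), computes each metric from the confusion-matrix counts, and compares the results with scikit-learn's helpers.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Ten hypothetical customers: 1 = churned, 0 = stayed
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)          # (3 + 5) / 10
precision = tp / (tp + fp)                          # 3 / (3 + 1)
recall = tp / (tp + fn)                             # 3 / (3 + 1)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy={accuracy:.2f} Precision={precision:.2f} "
      f"Recall={recall:.2f} F1={f1:.2f}")

# The hand-computed values agree with scikit-learn's implementations
sk_accuracy = accuracy_score(y_true_demo, y_pred_demo)
sk_precision = precision_score(y_true_demo, y_pred_demo)
sk_recall = recall_score(y_true_demo, y_pred_demo)
sk_f1 = f1_score(y_true_demo, y_pred_demo)
```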

**Example: Interpreting Metrics for Churn Prediction**


In [None]:
from sklearn.metrics import confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns

# y_pred holds the test-set predictions of the model being inspected
# Confusion matrix (rows = actual, columns = predicted)
cm = confusion_matrix(y_test, y_pred)
print(cm)
# [[TN  FP]
#  [FN  TP]]

print(classification_report(y_test, y_pred,
                            target_names=['No Churn', 'Churned']))

# Visualization
labels = ['No Churn', 'Churned']
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=labels, yticklabels=labels)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

# Business interpretation:
# If Precision=80%: Of 100 customers we predict will churn, 80 actually will
# If Recall=70%: Of all customers who actually churn, we catch 70%
# Cost analysis:
#   - Cost of retention offer (false positive): $50
#   - Cost of losing customer (false negative): $500
#   - A missed churner costs 10x an unnecessary offer, so favor recall
#     (catching churners) over precision
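
The cost figures above can be turned into a number to compare decision thresholds. The sketch below is a simplified illustration on hypothetical probabilities: it counts only the two error costs from the comment ($50 per false positive, $500 per false negative) and ignores the cost of offers sent to real churners. The helper `total_cost` is not part of any library.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

FP_COST = 50    # retention offer sent to a customer who would have stayed
FN_COST = 500   # churner we failed to flag

def total_cost(y_true, y_proba, threshold):
    """Total error cost when flagging customers whose probability exceeds threshold."""
    y_pred = (np.asarray(y_proba) > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return int(fp * FP_COST + fn * FN_COST)

# Hypothetical labels and predicted churn probabilities
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_proba_demo = [0.90, 0.60, 0.40, 0.35, 0.45, 0.32, 0.20, 0.10, 0.05, 0.02]

cost_default = total_cost(y_true_demo, y_proba_demo, threshold=0.5)
cost_lower = total_cost(y_true_demo, y_proba_demo, threshold=0.3)

print(f"Cost at threshold 0.5: ${cost_default}")  # 2 missed churners -> $1000
print(f"Cost at threshold 0.3: ${cost_lower}")    # 2 unnecessary offers -> $100
```

On this toy data the lower threshold trades two missed churners for two unnecessary offers, which is the cheaper mistake under these assumed costs.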


**For Regression Problems:**


In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Assumes y_test and y_pred hold continuous targets and predictions
# from a regression model, not the class labels used above

# MAE: Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.4f}")  # Average absolute error, in the same units as the target

# RMSE: Root Mean Squared Error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.4f}")  # Penalizes larger errors more

# MAPE: Mean Absolute Percentage Error
# Undefined when y_test contains zeros and unstable when values are near zero
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
print(f"MAPE: {mape:.2f}%")  # Scale-free, so it compares across targets

# R-squared: Coefficient of Determination
r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.4f}")  # Proportion of variance explained; at most 1, negative if worse than predicting the mean
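
A small worked example may make the differences between these metrics concrete. The four target values and predictions below are made up for illustration. One prediction is badly off, which shows how RMSE reacts more strongly than MAE, and two deliberately poor predictors show where R² sits relative to zero.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true_reg = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_reg = np.array([12.0, 18.0, 33.0, 20.0])  # last prediction is badly off

mae_demo = mean_absolute_error(y_true_reg, y_pred_reg)
rmse_demo = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
print(f"MAE:  {mae_demo:.2f}")   # (2 + 2 + 3 + 20) / 4 = 6.75
print(f"RMSE: {rmse_demo:.2f}")  # sqrt((4 + 4 + 9 + 400) / 4), about 10.21

# Predicting the mean of y_true for every row gives R² = 0
r2_mean = r2_score(y_true_reg, np.full(4, y_true_reg.mean()))
# Predictions worse than the mean push R² below zero
r2_reversed = r2_score(y_true_reg, y_true_reg[::-1])
print(f"R² (mean predictor):       {r2_mean:.2f}")
print(f"R² (reversed predictions): {r2_reversed:.2f}")
```

The single 20-unit miss accounts for most of the RMSE, while MAE weights it the same per unit as the small errors.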


### Cross-Validation (Robust Evaluation)


In [None]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

# k-Fold Cross-Validation
# Splits data into k folds, trains k times, evaluates on each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Get scores for each fold (training data only, so the test set stays unseen)
cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='roc_auc')

print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.4f}")
print(f"Std: {cv_scores.std():.4f}")

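# If mean=0.82 and std=0.03, scores are consistent across folds, so the estimate is reliable
# If mean=0.85 and std=0.12, performance depends heavily on the split: the model may be
# overfitting, or the folds may be too small to give a stable estimate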
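
Fold-to-fold spread is one signal; the gap between training and validation scores is another. As a sketch under stated assumptions (synthetic data from `make_classification` in place of the real dataset, hypothetical names `X_demo` and `cv_demo`), scikit-learn's `cross_validate` can report several metrics at once and return the training scores for each fold.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=42
)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo,
    cv=cv_demo,
    scoring=["roc_auc", "average_precision"],
    return_train_score=True,  # also score each fold's training portion
)

for metric in ["roc_auc", "average_precision"]:
    train = cv_results[f"train_{metric}"]
    valid = cv_results[f"test_{metric}"]
    print(f"{metric}: train={train.mean():.3f}  "
          f"validation={valid.mean():.3f} (std={valid.std():.3f})")
```

A training score far above the validation score points to overfitting even when the validation scores themselves are stable across folds.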


### Handling Class Imbalance


In [None]:
# Problem: Churn dataset has 85% no-churn, 15% churn
# Simple accuracy becomes misleading (predicting all "no-churn" gives 85%)

# Solution 1: Class Weights
model = LogisticRegression(class_weight='balanced', max_iter=1000)
# Automatically gives more importance to minority class

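# Solution 2: SMOTE (Synthetic Minority Over-sampling Technique)
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# Creates synthetic minority examples by interpolating between neighbors
# Resample the training split only; evaluate on the untouched test set

# Solution 3: Threshold Adjustment
# Instead of predicting the positive class when prob > 0.5, use a custom threshold
# (choose the value on validation data, not on the test set)
y_pred_custom = (y_proba > 0.3).astype(int)  # Lower threshold catches more churners, at the cost of more false positives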
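
Rather than fixing the threshold by hand, one option is to derive it from a recall target on a held-out validation split. The sketch below assumes synthetic data and a hypothetical requirement of 80% recall (`TARGET_RECALL`). It uses `precision_recall_curve`, which evaluates predictions of the form `score >= threshold`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data, with a validation split for threshold choice
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.85, 0.15], random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_proba = clf.predict_proba(X_val)[:, 1]

TARGET_RECALL = 0.80  # hypothetical business requirement

precision, recall, thresholds = precision_recall_curve(y_val, val_proba)
# precision and recall have one more entry than thresholds; drop the final point
meets_target = recall[:-1] >= TARGET_RECALL
# Recall falls as the threshold rises, so take the highest threshold that still qualifies
idx = np.where(meets_target)[0][-1]
chosen_threshold = thresholds[idx]

val_pred = (val_proba >= chosen_threshold).astype(int)
val_recall = recall_score(y_val, val_pred)
print(f"Chosen threshold: {chosen_threshold:.3f}")
print(f"Validation precision: {precision[idx]:.3f}, recall: {val_recall:.3f}")
```

Taking the highest qualifying threshold keeps precision as high as the recall target allows on this split. The chosen value should then be confirmed once on the test set.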


### Hyperparameter Tuning


In [None]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid Search: Try all combinations
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,  # Use all CPU cores
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# Random Search: Try random combinations (faster for large spaces)
param_dist = {
    'n_estimators': np.arange(50, 301, 10),
    'max_depth': np.arange(3, 20),
    'min_samples_split': np.arange(2, 11)
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
best_model = random_search.best_estimator_
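
To see why random search is the cheaper option for large spaces, it helps to count the model fits each search performs. The sketch below copies the two search spaces from the cell above into hypothetical `_demo` variables and does the arithmetic. It counts cross-validation fits only; with the default `refit=True`, each search also refits the best candidate once.

```python
import math
from sklearn.model_selection import ParameterGrid

cv_folds = 5

# Same grid as the GridSearchCV example
param_grid_demo = {
    "n_estimators": [50, 100, 150],
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
n_grid_candidates = len(ParameterGrid(param_grid_demo))  # 3 * 3 * 3 * 3
n_grid_fits = n_grid_candidates * cv_folds
print(f"Grid search: {n_grid_candidates} candidates, {n_grid_fits} CV fits")

# Same value ranges as the RandomizedSearchCV example, written as plain lists
param_dist_demo = {
    "n_estimators": list(range(50, 301, 10)),
    "max_depth": list(range(3, 20)),
    "min_samples_split": list(range(2, 11)),
}
n_possible = math.prod(len(values) for values in param_dist_demo.values())
n_iter = 20
n_random_fits = n_iter * cv_folds
print(f"Random search: samples {n_iter} of {n_possible} combinations, "
      f"{n_random_fits} CV fits")
```

An exhaustive grid over the second space would need thousands of fits, while random search caps the cost at `n_iter * cv` regardless of how large the space is.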


### Model Comparison & Selection


In [None]:
# Compare baseline vs tuned model
baseline_auc = roc_auc_score(y_test, baseline_model.predict_proba(X_test)[:, 1])
tuned_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])

print(f"Baseline ROC-AUC: {baseline_auc:.4f}")
print(f"Tuned ROC-AUC: {tuned_auc:.4f}")
print(f"Improvement: {(tuned_auc - baseline_auc):.4f}")

# Select model based on:
# 1. Performance metrics (ROC-AUC, F1, precision/recall trade-off)
# 2. Latency requirements (simpler models usually predict faster)
# 3. Interpretability needs (linear models and single trees are easier to explain than ensembles or neural nets)
# 4. Complexity (prefer the simpler model when scores are close; it is less likely to overfit)
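
The latency criterion is easy to measure rather than guess. The sketch below is a rough illustration on synthetic data: the helper `median_predict_seconds` is hypothetical, and the timings depend entirely on hardware and batch size, so only the relative comparison is meaningful.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

def median_predict_seconds(model, X, repeats=5):
    """Median wall-clock time of predict_proba over several repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        model.predict_proba(X)
        timings.append(time.perf_counter() - start)
    return sorted(timings)[len(timings) // 2]

latency = {}
for name, model in candidates.items():
    model.fit(X_demo, y_demo)
    latency[name] = median_predict_seconds(model, X_demo)
    print(f"{name}: {latency[name] * 1000:.2f} ms for {len(X_demo)} rows")
```

Using the median over a few repeats reduces the effect of one-off pauses such as cache warm-up.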


### Tools Used in Model Training

| Tool | Purpose |
|------|---------|
| Scikit-learn | ML algorithms, evaluation |
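| XGBoost | Gradient-boosted trees with regularization |
| LightGBM | Histogram-based gradient boosting, often faster on large datasets |
| CatBoost | Gradient boosting with built-in handling of categorical features |
| TensorFlow/PyTorch | Deep learning |
| Hyperopt | Hyperparameter optimization (Tree-structured Parzen Estimator) |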

---
