# SMT-WEEX Notebook 2: Model Training (v2)
**Project:** smt-weex-2025
**Author:** Jannet Ekka

**Updates in v2:**
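- CatBoost: 80/20 split, NO early stopping (per research)
- XGBoost and LightGBM: 70/10/20 split, with the 10% validation set used for early stopping
- Random Forest: trained on the same 70% split (no early stopping; the validation set is unused)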
- Added 5-fold Stratified CV for all models
- Fixed CatBoost underfitting issue

## 1. Setup

In [None]:
!pip install -q catboost xgboost lightgbm scikit-learn pandas numpy matplotlib seaborn google-cloud-storage

In [None]:
from google.colab import auth
auth.authenticate_user()

PROJECT_ID = 'smt-weex-2025'
BUCKET = 'smt-weex-2025-models'

!gcloud config set project {PROJECT_ID}

In [None]:
import pandas as pd
import numpy as np
import json
import pickle
from datetime import datetime

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, make_scorer
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

print("Libraries loaded")

## 2. Load Cleaned Data from GCS

In [None]:
!gsutil cp gs://{BUCKET}/data/whale_features_cleaned.csv /content/
!gsutil cp gs://{BUCKET}/data/feature_config.json /content/

df = pd.read_csv('/content/whale_features_cleaned.csv')

with open('/content/feature_config.json', 'r') as f:
    config = json.load(f)

FEATURES = config['features']
TARGET = config['target']

print(f"Loaded {len(df)} samples, {len(FEATURES)} features")
print(f"Categories: {df[TARGET].value_counts().to_dict()}")

In [None]:
# Prepare X and y
X = df[FEATURES].values
y_raw = df[TARGET].values

# Encode labels
le = LabelEncoder()
y = le.fit_transform(y_raw)

# Save label mapping
label_mapping = {i: label for i, label in enumerate(le.classes_)}
n_classes = len(label_mapping)

print("Label mapping:")
for idx, label in label_mapping.items():
    count = (y == idx).sum()
    print(f"  {idx}: {label} ({count} samples)")

## 3. Data Splits

**Strategy:**
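- CatBoost: 80/20 train/test (no validation, no early stopping per research)
- XGBoost and LightGBM: 70/10/20 train/val/test (validation set used for early stopping)
- Random Forest: same 70% training split and 20% test set (no early stopping)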

In [None]:
# Common test set for all models (20%)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, 
    test_size=0.20, 
    random_state=42, 
    stratify=y
)

# For CatBoost: use all trainval as training (80/20 split)
X_train_cb = X_trainval
y_train_cb = y_trainval

# For other models: split trainval into train/val (70/10/20 overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval,
    test_size=0.125,  # 0.125 * 0.8 = 0.1 (10% of total)
    random_state=42,
    stratify=y_trainval
)

print("=== Data Splits ===")
print(f"CatBoost Train: {len(X_train_cb)} ({len(X_train_cb)/len(X)*100:.1f}%)")
print(f"Other Train:    {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Other Val:      {len(X_val)} ({len(X_val)/len(X)*100:.1f}%)")
print(f"Test (all):     {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
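
The two-stage split arithmetic can be checked by hand. The sketch below uses exact fractions and assumes only the two `test_size` values from the cell above; the variable names are illustrative and not part of the pipeline.

```python
from fractions import Fraction

# Values taken from the two train_test_split calls above
test_frac = Fraction(1, 5)             # test_size=0.20 of the full data
val_frac_of_trainval = Fraction(1, 8)  # test_size=0.125 of the remaining 80%

trainval_frac = 1 - test_frac                    # 4/5 of the data
val_frac = trainval_frac * val_frac_of_trainval  # 1/10 of the data
train_frac = trainval_frac - val_frac            # 7/10 of the data

print(f"train={float(train_frac):.2f}, val={float(val_frac):.2f}, test={float(test_frac):.2f}")
```

On real data the printed percentages may differ slightly from 70/10/20 because sample counts are rounded to whole rows.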

In [None]:
# Check stratification
print("\n=== Class Distribution in Splits ===")
for name, y_subset in [('CatBoost Train', y_train_cb), ('Other Train', y_train), 
                        ('Val', y_val), ('Test', y_test)]:
    unique, counts = np.unique(y_subset, return_counts=True)
    dist = {label_mapping[u]: int(c) for u, c in zip(unique, counts)}
    print(f"{name:15s}: {dist}")

## 4. Model Definitions

In [None]:
# Store models and results
models = {}
results_holdout = {}
results_cv = {}

# Evaluation function
def evaluate_model(model, X_test, y_test):
    """Evaluate model and return metrics"""
    y_pred = model.predict(X_test)
    
    return {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision_macro': precision_score(y_test, y_pred, average='macro', zero_division=0),
        'recall_macro': recall_score(y_test, y_pred, average='macro', zero_division=0),
        'f1_macro': f1_score(y_test, y_pred, average='macro', zero_division=0),
        'f1_weighted': f1_score(y_test, y_pred, average='weighted', zero_division=0)
    }, y_pred
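
Macro and weighted F1 can diverge sharply on imbalanced labels, which is why `evaluate_model` reports both. A toy illustration with made-up labels (not the project data) and a degenerate model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 8 samples of class 0, 2 samples of class 1
y_true_toy = [0] * 8 + [1] * 2
# A degenerate "model" that always predicts the majority class
y_pred_toy = [0] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)
f1_macro = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)
f1_weighted = f1_score(y_true_toy, y_pred_toy, average='weighted', zero_division=0)

# Macro F1 averages the per-class F1 scores equally, so the completely
# missed minority class pulls it well below accuracy and weighted F1.
print(f"accuracy={acc:.3f}  f1_macro={f1_macro:.3f}  f1_weighted={f1_weighted:.3f}")
```

Here accuracy is 0.8 while macro F1 is about 0.44, so ranking models by macro F1 penalizes ignoring small classes.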

## 5. Model Training

### 5.1 CatBoost (Primary - Per Research Recommendations)
- 80/20 split, NO early stopping
- depth=5, learning_rate=0.03, l2_leaf_reg=3
- auto_class_weights='Balanced'

In [None]:
print("=" * 60)
print("Training CatBoost (80/20 split, NO early stopping)")
print("=" * 60)

catboost_model = CatBoostClassifier(
    iterations=300,           # Fixed iterations, no early stopping
    learning_rate=0.03,       # Per research
    depth=5,                  # Per research (4-5)
    l2_leaf_reg=3,            # Per research
    loss_function='MultiClass',
    eval_metric='TotalF1:average=Macro',
    random_seed=42,
    verbose=50,
    auto_class_weights='Balanced',  # Handle imbalance
    # NO early_stopping_rounds
)

# Train on 80% data (no validation set)
catboost_model.fit(X_train_cb, y_train_cb)

models['CatBoost'] = catboost_model
print("\nCatBoost training complete!")

### 5.2 XGBoost (70/10/20 with early stopping)

In [None]:
print("=" * 60)
print("Training XGBoost (70/10/20 split, with early stopping)")
print("=" * 60)

# Calculate sample weights
sample_weights = compute_sample_weight('balanced', y_train)

xgb_model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.03,
    max_depth=4,
    reg_alpha=1,              # L1 regularization
    reg_lambda=3,             # L2 regularization (heavy per research)
    objective='multi:softmax',
    num_class=n_classes,
    random_state=42,
    early_stopping_rounds=50,
    eval_metric='mlogloss'
)

xgb_model.fit(
    X_train, y_train,
    sample_weight=sample_weights,
    eval_set=[(X_val, y_val)],
    verbose=100
)

models['XGBoost'] = xgb_model
print("\nXGBoost training complete!")
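
The XGBoost cell above handles imbalance through per-sample weights rather than a class-weight argument. As a rough sketch of what `compute_sample_weight('balanced', ...)` produces, the example below recomputes the weights by hand on made-up labels and compares them with scikit-learn's output.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Made-up encoded labels: class counts are 6, 3 and 1
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
counts = np.bincount(y_toy)
class_weights = len(y_toy) / (len(counts) * counts)
manual_weights = class_weights[y_toy]  # one weight per sample

sklearn_weights = compute_sample_weight('balanced', y_toy)

print("per-class weights:", np.round(class_weights, 3))
print("matches sklearn:  ", np.allclose(manual_weights, sklearn_weights))
```

Each class ends up with the same total weight, so the rarest class contributes as much to the loss as the most common one.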

### 5.3 Random Forest (70/10/20)

In [None]:
print("=" * 60)
print("Training Random Forest")
print("=" * 60)

rf_model = RandomForestClassifier(
    n_estimators=500,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=3,
    random_state=42,
    n_jobs=-1,
    class_weight='balanced'
)

rf_model.fit(X_train, y_train)

models['RandomForest'] = rf_model
print("Random Forest training complete!")

### 5.4 LightGBM (70/10/20 - Not Recommended per Research)

In [None]:
print("=" * 60)
print("Training LightGBM (WARNING: Not ideal for <10K samples per research)")
print("=" * 60)

lgbm_model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.03,
    max_depth=4,
    num_leaves=15,            # Conservative for small data
    reg_alpha=1,
    reg_lambda=3,
    objective='multiclass',
    num_class=n_classes,
    random_state=42,
    n_jobs=-1,
    class_weight='balanced',
    verbosity=-1
)

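# eval_set alone only tracks validation metrics; the callback below
# is what actually stops training early (50 rounds, matching XGBoost)
from lightgbm import early_stopping

lgbm_model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[early_stopping(stopping_rounds=50, verbose=False)]
)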

models['LightGBM'] = lgbm_model
print("LightGBM training complete!")

## 6. Holdout Test Set Evaluation

In [None]:
print("=" * 60)
print("HOLDOUT TEST SET EVALUATION")
print("=" * 60)

for name, model in models.items():
    metrics, y_pred = evaluate_model(model, X_test, y_test)
    results_holdout[name] = metrics
    
    print(f"\n{name}:")
    print(f"  Accuracy:          {metrics['accuracy']:.4f}")
    print(f"  Precision (macro): {metrics['precision_macro']:.4f}")
    print(f"  Recall (macro):    {metrics['recall_macro']:.4f}")
    print(f"  F1 (macro):        {metrics['f1_macro']:.4f}")
    print(f"  F1 (weighted):     {metrics['f1_weighted']:.4f}")

In [None]:
# Holdout results table
holdout_df = pd.DataFrame(results_holdout).T.round(4)
print("\n=== Holdout Test Results ===")
print(holdout_df.sort_values('f1_macro', ascending=False))

## 7. 5-Fold Stratified Cross-Validation

Per research: Use 5-fold (not 10-fold) to maintain sufficient samples per class in each fold.
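
To make the fold-size argument concrete, the sketch below estimates how many validation samples each class gets per fold. The class counts and label names are hypothetical placeholders, not the real dataset.

```python
# Hypothetical class counts in the 80% train/val portion (not the real data)
class_counts = {'class_a': 400, 'class_b': 120, 'class_c': 30}

def min_val_samples_per_class(counts, n_splits):
    """Smallest number of validation samples a class gets in any one fold.

    StratifiedKFold spreads each class as evenly as possible, so every
    fold receives at least floor(count / n_splits) samples of that class.
    """
    return {label: count // n_splits for label, count in counts.items()}

for k in (5, 10):
    print(f"{k}-fold:", min_val_samples_per_class(class_counts, k))
```

With these placeholder counts the rarest class has 6 validation samples per fold under 5-fold CV but only 3 under 10-fold, which makes per-fold macro F1 noisier.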

In [None]:
print("=" * 60)
print("5-FOLD STRATIFIED CROSS-VALIDATION")
print("=" * 60)

# Use full trainval data for CV (80% of total)
X_cv = X_trainval
y_cv = y_trainval

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Custom scorer for macro F1
f1_macro_scorer = make_scorer(f1_score, average='macro')

In [None]:
# CatBoost 5-fold CV
print("\nCatBoost 5-fold CV...")
cb_cv_scores = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X_cv, y_cv)):
    X_fold_train, X_fold_val = X_cv[train_idx], X_cv[val_idx]
    y_fold_train, y_fold_val = y_cv[train_idx], y_cv[val_idx]
    
    cb_fold = CatBoostClassifier(
        iterations=300,
        learning_rate=0.03,
        depth=5,
        l2_leaf_reg=3,
        loss_function='MultiClass',
        random_seed=42,
        verbose=0,
        auto_class_weights='Balanced'
    )
    cb_fold.fit(X_fold_train, y_fold_train)
    y_pred = cb_fold.predict(X_fold_val)
    score = f1_score(y_fold_val, y_pred, average='macro')
    cb_cv_scores.append(score)
    print(f"  Fold {fold+1}: F1={score:.4f}")

results_cv['CatBoost'] = {
    'mean': np.mean(cb_cv_scores),
    'std': np.std(cb_cv_scores),
    'scores': cb_cv_scores
}
print(f"  Mean: {np.mean(cb_cv_scores):.4f} (+/- {np.std(cb_cv_scores):.4f})")

In [None]:
# XGBoost 5-fold CV
print("\nXGBoost 5-fold CV...")
xgb_cv_scores = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X_cv, y_cv)):
    X_fold_train, X_fold_val = X_cv[train_idx], X_cv[val_idx]
    y_fold_train, y_fold_val = y_cv[train_idx], y_cv[val_idx]
    
    sample_weights_fold = compute_sample_weight('balanced', y_fold_train)
    
    xgb_fold = XGBClassifier(
        n_estimators=300,
        learning_rate=0.03,
        max_depth=4,
        reg_alpha=1,
        reg_lambda=3,
        objective='multi:softmax',
        num_class=n_classes,
        random_state=42,
        verbosity=0
    )
    xgb_fold.fit(X_fold_train, y_fold_train, sample_weight=sample_weights_fold)
    y_pred = xgb_fold.predict(X_fold_val)
    score = f1_score(y_fold_val, y_pred, average='macro')
    xgb_cv_scores.append(score)
    print(f"  Fold {fold+1}: F1={score:.4f}")

results_cv['XGBoost'] = {
    'mean': np.mean(xgb_cv_scores),
    'std': np.std(xgb_cv_scores),
    'scores': xgb_cv_scores
}
print(f"  Mean: {np.mean(xgb_cv_scores):.4f} (+/- {np.std(xgb_cv_scores):.4f})")

In [None]:
# RandomForest 5-fold CV
print("\nRandomForest 5-fold CV...")
rf_cv_scores = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X_cv, y_cv)):
    X_fold_train, X_fold_val = X_cv[train_idx], X_cv[val_idx]
    y_fold_train, y_fold_val = y_cv[train_idx], y_cv[val_idx]
    
    rf_fold = RandomForestClassifier(
        n_estimators=500,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=3,
        random_state=42,
        n_jobs=-1,
        class_weight='balanced'
    )
    rf_fold.fit(X_fold_train, y_fold_train)
    y_pred = rf_fold.predict(X_fold_val)
    score = f1_score(y_fold_val, y_pred, average='macro')
    rf_cv_scores.append(score)
    print(f"  Fold {fold+1}: F1={score:.4f}")

results_cv['RandomForest'] = {
    'mean': np.mean(rf_cv_scores),
    'std': np.std(rf_cv_scores),
    'scores': rf_cv_scores
}
print(f"  Mean: {np.mean(rf_cv_scores):.4f} (+/- {np.std(rf_cv_scores):.4f})")

In [None]:
# LightGBM 5-fold CV
print("\nLightGBM 5-fold CV...")
lgbm_cv_scores = []

for fold, (train_idx, val_idx) in enumerate(skf.split(X_cv, y_cv)):
    X_fold_train, X_fold_val = X_cv[train_idx], X_cv[val_idx]
    y_fold_train, y_fold_val = y_cv[train_idx], y_cv[val_idx]
    
    lgbm_fold = LGBMClassifier(
        n_estimators=300,
        learning_rate=0.03,
        max_depth=4,
        num_leaves=15,
        reg_alpha=1,
        reg_lambda=3,
        objective='multiclass',
        num_class=n_classes,
        random_state=42,
        n_jobs=-1,
        class_weight='balanced',
        verbosity=-1
    )
    lgbm_fold.fit(X_fold_train, y_fold_train)
    y_pred = lgbm_fold.predict(X_fold_val)
    score = f1_score(y_fold_val, y_pred, average='macro')
    lgbm_cv_scores.append(score)
    print(f"  Fold {fold+1}: F1={score:.4f}")

results_cv['LightGBM'] = {
    'mean': np.mean(lgbm_cv_scores),
    'std': np.std(lgbm_cv_scores),
    'scores': lgbm_cv_scores
}
print(f"  Mean: {np.mean(lgbm_cv_scores):.4f} (+/- {np.std(lgbm_cv_scores):.4f})")

In [None]:
# CV Results Summary
print("\n" + "=" * 60)
print("5-FOLD CV RESULTS SUMMARY (F1 Macro)")
print("=" * 60)

cv_summary = pd.DataFrame({
    'Model': list(results_cv.keys()),
    'Mean F1': [results_cv[m]['mean'] for m in results_cv],
    'Std': [results_cv[m]['std'] for m in results_cv]
}).sort_values('Mean F1', ascending=False)

print(cv_summary.to_string(index=False))
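
The four CV cells above repeat the same loop. For estimators that handle imbalance through `class_weight` (so no per-fold sample weights are needed), the loop can probably be replaced by `cross_val_score` with a macro-F1 scorer. The sketch below checks this on synthetic data; the `*_demo` names and the small Random Forest are stand-ins, not the notebook's models.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X_cv / y_cv (3 imbalanced classes)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3,
    weights=[0.6, 0.3, 0.1], random_state=42
)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_macro_scorer_demo = make_scorer(f1_score, average='macro')
rf_demo = RandomForestClassifier(n_estimators=25, random_state=42,
                                 class_weight='balanced')

# Manual loop, same shape as the cells above
manual_scores = []
for train_idx, val_idx in skf_demo.split(X_demo, y_demo):
    fold_model = clone(rf_demo).fit(X_demo[train_idx], y_demo[train_idx])
    y_pred_fold = fold_model.predict(X_demo[val_idx])
    manual_scores.append(f1_score(y_demo[val_idx], y_pred_fold, average='macro'))

# Same folds and same metric in a single call
helper_scores = cross_val_score(rf_demo, X_demo, y_demo,
                                cv=skf_demo, scoring=f1_macro_scorer_demo)

print("manual:", np.round(manual_scores, 4))
print("helper:", np.round(helper_scores, 4))
```

The explicit loop remains the simpler choice for XGBoost here, because its balanced sample weights are recomputed from each fold's training labels.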

## 8. Results Comparison

In [None]:
# Compare Holdout vs CV
print("\n" + "=" * 60)
print("HOLDOUT vs 5-FOLD CV COMPARISON")
print("=" * 60)

comparison = []
for model_name in models.keys():
    comparison.append({
        'Model': model_name,
        'Holdout F1': results_holdout[model_name]['f1_macro'],
        'CV Mean F1': results_cv[model_name]['mean'],
        'CV Std': results_cv[model_name]['std'],
        'Holdout Acc': results_holdout[model_name]['accuracy']
    })

comparison_df = pd.DataFrame(comparison).sort_values('CV Mean F1', ascending=False)
print(comparison_df.round(4).to_string(index=False))

In [None]:
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Holdout results
model_names = list(models.keys())
holdout_f1 = [results_holdout[m]['f1_macro'] for m in model_names]
cv_f1 = [results_cv[m]['mean'] for m in model_names]
cv_std = [results_cv[m]['std'] for m in model_names]

x = np.arange(len(model_names))
width = 0.35

axes[0].bar(x - width/2, holdout_f1, width, label='Holdout', color='steelblue')
axes[0].bar(x + width/2, cv_f1, width, label='5-Fold CV', color='coral', yerr=cv_std, capsize=3)
axes[0].set_ylabel('F1 Macro')
axes[0].set_title('Holdout vs 5-Fold CV (F1 Macro)')
axes[0].set_xticks(x)
axes[0].set_xticklabels(model_names, rotation=45)
axes[0].legend()
axes[0].set_ylim(0, 1)

# CV scores distribution
cv_data = [results_cv[m]['scores'] for m in model_names]
axes[1].boxplot(cv_data, labels=model_names)
axes[1].set_ylabel('F1 Macro')
axes[1].set_title('5-Fold CV Score Distribution')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Best model selection
best_holdout = max(results_holdout, key=lambda x: results_holdout[x]['f1_macro'])
best_cv = max(results_cv, key=lambda x: results_cv[x]['mean'])

print("\n" + "=" * 60)
print("BEST MODEL SELECTION")
print("=" * 60)
print(f"Best by Holdout F1:  {best_holdout} ({results_holdout[best_holdout]['f1_macro']:.4f})")
print(f"Best by CV Mean F1:  {best_cv} ({results_cv[best_cv]['mean']:.4f} +/- {results_cv[best_cv]['std']:.4f})")

# Research target check
print("\n--- Research Target Check ---")
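print("Expected F1 Macro: 65-75% (per research)")
# Report CV and holdout separately; taking the max of the two would overstate performance
print(f"Best CV Mean F1:   {results_cv[best_cv]['mean']*100:.1f}%")
print(f"Best Holdout F1:   {results_holdout[best_holdout]['f1_macro']*100:.1f}%")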

## 9. Save Models and Results

In [None]:
import os
os.makedirs('/content/models', exist_ok=True)

# Save CatBoost
catboost_model.save_model('/content/models/catboost_whale_classifier.cbm')

# Save others as pickle
with open('/content/models/xgboost_whale_classifier.pkl', 'wb') as f:
    pickle.dump(xgb_model, f)

with open('/content/models/randomforest_whale_classifier.pkl', 'wb') as f:
    pickle.dump(rf_model, f)

with open('/content/models/lightgbm_whale_classifier.pkl', 'wb') as f:
    pickle.dump(lgbm_model, f)

# Save label encoder
with open('/content/models/label_encoder.pkl', 'wb') as f:
    pickle.dump(le, f)

# Save all results
all_results = {
    'holdout': results_holdout,
    'cv': {k: {'mean': v['mean'], 'std': v['std'], 'scores': v['scores']} 
           for k, v in results_cv.items()},
    'best_model_holdout': best_holdout,
    'best_model_cv': best_cv,
    'label_mapping': label_mapping
}

with open('/content/models/training_results.json', 'w') as f:
    json.dump(all_results, f, indent=2, default=str)

print("Models and results saved locally")
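
A downstream notebook would need to reload both a model and the label encoder to turn integer predictions back into category names. The sketch below shows that round trip on made-up data with a small Random Forest; the file names, labels and `*_demo` variables are illustrative only.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Made-up string labels and random features standing in for the real data
y_raw_demo = np.array(['class_a', 'class_b', 'class_c'] * 10)
X_demo = np.random.RandomState(42).rand(len(y_raw_demo), 4)

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(y_raw_demo)
model_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.pkl')
    encoder_path = os.path.join(tmp_dir, 'label_encoder.pkl')

    # Save both artifacts, mirroring the cell above
    with open(model_path, 'wb') as f:
        pickle.dump(model_demo, f)
    with open(encoder_path, 'wb') as f:
        pickle.dump(le_demo, f)

    # Reload them as a later notebook would
    with open(model_path, 'rb') as f:
        loaded_model = pickle.load(f)
    with open(encoder_path, 'rb') as f:
        loaded_encoder = pickle.load(f)

# Integer predictions -> original string labels
pred_ids = loaded_model.predict(X_demo[:3])
pred_labels = loaded_encoder.inverse_transform(pred_ids)
print(pred_labels)
```

Pickled models should be loaded with the same library versions that wrote them. The CatBoost `.cbm` file is not a pickle and would instead be read back with `CatBoostClassifier().load_model(...)`.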

In [None]:
# Upload to GCS
!gsutil -m cp -r /content/models/* gs://{BUCKET}/models/initial/

# Save data splits
np.savez('/content/data_splits.npz',
         X_train_cb=X_train_cb, y_train_cb=y_train_cb,
         X_train=X_train, y_train=y_train,
         X_val=X_val, y_val=y_val,
         X_test=X_test, y_test=y_test)

!gsutil cp /content/data_splits.npz gs://{BUCKET}/data/data_splits.npz

print(f"\nUploaded to gs://{BUCKET}/models/initial/")

## Summary

**Training Complete:**
- CatBoost: 80/20 split, 300 iterations, NO early stopping
- XGBoost/LightGBM: 70/10/20 split with a validation set
- Random Forest: same 70% training split (validation set unused)
- 5-Fold Stratified CV for all models

**Key Findings:**
- Best Holdout: [see output]
- Best CV: [see output]
- Research target (65-75% F1): [see output]

**Next:** Run Notebook 03 for detailed evaluation (confusion matrices, feature importance, per-class metrics)