# ML Footprint Prediction - GPU Training

> Multi-output XGBoost training with Colab GPU (VS Code Local Runtime)

This notebook runs locally in VS Code with a connected Colab GPU runtime.
All files are accessed from your local filesystem - no uploads needed.

---

## Prerequisites

1. Connect to Colab GPU: `Ctrl+Shift+P` -> "Notebook: Select Notebook Kernel" -> "Connect to Google Colab"
2. Ensure you have the training data in `data/data_splitter/output/`

## Step 1: Verify GPU and Install Dependencies

In [None]:
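import shutil
import subprocess
import sys

# Check GPU availability. nvidia-smi is absent on machines without NVIDIA
# drivers, so look it up first instead of letting subprocess raise.
nvidia_smi = shutil.which('nvidia-smi')
result = subprocess.run([nvidia_smi], capture_output=True, text=True) if nvidia_smi else None
if result is not None and result.returncode == 0:
    print("GPU Information:")
    print(result.stdout)
else:
    print("[WARNING] No GPU detected. Training will be slower on CPU.")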

# Install XGBoost with GPU support if needed
try:
    import xgboost as xgb
    print(f"\nXGBoost version: {xgb.__version__}")
except ImportError:
    print("Installing XGBoost...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "xgboost"])
    import xgboost as xgb
    print(f"XGBoost version: {xgb.__version__}")

## Step 2: Verify Local Files

In [None]:
from pathlib import Path

# Locate the project root (the parent of models/). Notebooks do not define
# __file__, so fall back to the parent of the current working directory.
PROJECT_ROOT = Path(__file__).parent.parent if '__file__' in globals() else Path.cwd().parent
if PROJECT_ROOT.name == 'models':
    PROJECT_ROOT = PROJECT_ROOT.parent

print(f"Project root: {PROJECT_ROOT}")

# Define paths
PATHS = {
    'train': PROJECT_ROOT / 'data/data_splitter/output/train.csv',
    'validate': PROJECT_ROOT / 'data/data_splitter/output/validate.csv',
    'material_dataset': PROJECT_ROOT / 'data/data_calculations/input/material_dataset_final.csv',
    'src': PROJECT_ROOT / 'models/src',
    'save_dir': PROJECT_ROOT / 'models/saved/gpu_training',
    'logs': PROJECT_ROOT / 'models/logs',
}

# Verify all required files exist
print("\nChecking required files:")
all_ok = True
for name, path in PATHS.items():
    exists = path.exists()
    status = "[OK]" if exists else "[MISSING]"
    # Only regular files get a size; directories and missing paths do not
    size_str = f" ({path.stat().st_size / (1024*1024):.1f} MB)" if path.is_file() else ""
    print(f"  {status} {name}: {path}{size_str}")
    # Output directories are created later, so their absence is not an error
    if not exists and name not in ['save_dir', 'logs']:
        all_ok = False

if all_ok:
    print("\n[SUCCESS] All required files found!")
else:
    print("\n[ERROR] Some files are missing. Check paths above.")
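
The root detection above assumes the notebook sits in `models/` or one level below it. A more defensive alternative, sketched here, walks up from a starting directory until it finds a folder that contains both `data/` and `models/`. The function name `find_project_root` and the choice of marker directories are assumptions for illustration, not part of the project code.

```python
from pathlib import Path


def find_project_root(start: Path, markers=("data", "models")) -> Path:
    """Return the nearest ancestor of `start` (inclusive) containing every marker directory."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if all((candidate / m).is_dir() for m in markers):
            return candidate
    raise FileNotFoundError(f"No directory containing {markers} found above {start}")
```

If this fits your layout, `PROJECT_ROOT = find_project_root(Path.cwd())` could replace the conditional logic, and a wrong working directory would fail with a clear error instead of producing a list of missing paths.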

## Step 3: Add Source to Path and Import Modules

In [None]:
import sys

# Add models/ to sys.path so that `src` can be imported as a package
models_path = str(PATHS['src'].parent)
if models_path not in sys.path:
    sys.path.insert(0, models_path)

# Import project modules
from src.data_loader import load_data, FEATURE_COLUMNS, TARGET_COLUMNS, MATERIAL_COLUMNS
from src.formula_features import add_formula_features
from src.preprocessor import FootprintPreprocessor
from src.trainer import FootprintModelTrainer
from src.evaluator import ModelEvaluator
from src.config import BASELINE_CONFIG
from src.utils import set_random_seed

print("[OK] All modules imported successfully")

## Step 4: Load and Prepare Data

In [None]:
# Set random seed for reproducibility
set_random_seed(42)

# Load data (use sample_size for quick testing, None for full dataset)
SAMPLE_SIZE = None  # Set to e.g. 10000 for quick test, None for full ~676K samples

print(f"Loading data{f' (sample: {SAMPLE_SIZE})' if SAMPLE_SIZE else ' (full dataset)'}...")
X_train, y_train, X_val, y_val = load_data(
    train_path=str(PATHS['train']),
    val_path=str(PATHS['validate']),
    sample_size=SAMPLE_SIZE
)

print(f"\nTraining set: {len(X_train):,} samples")
print(f"Validation set: {len(X_val):,} samples")
print(f"Features: {len(FEATURE_COLUMNS)}")
print(f"Targets: {TARGET_COLUMNS}")
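
The implementation of `set_random_seed` lives in `src/utils` and is not shown in this notebook. As a rough idea of what such a helper usually does, the sketch below seeds the standard-library generator and, if it is installed, NumPy's global generator. The name `seed_everything` is hypothetical, and the real helper may seed other libraries as well.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the stdlib RNG and, if it is installed, NumPy's global RNG."""
    random.seed(seed)
    # Recorded for child processes started later; it does not change hash
    # randomisation in the interpreter that is already running.
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
    except ImportError:
        return
    np.random.seed(seed)
```

Calling the helper with the same seed before two runs should make any sampling that relies on these generators repeatable.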

## Step 5: Feature Engineering

In [None]:
# Keep raw validation for robustness testing
X_val_raw = X_val.copy()

# Add formula-based features (physics-informed)
print("Adding formula-based features...")
X_train = add_formula_features(X_train, MATERIAL_COLUMNS, str(PATHS['material_dataset']))
X_val = add_formula_features(X_val, MATERIAL_COLUMNS, str(PATHS['material_dataset']))

# Preprocess (encode categoricals, scale numericals)
print("Preprocessing features...")
preprocessor = FootprintPreprocessor()
X_train_processed = preprocessor.fit_transform(X_train)
X_val_processed = preprocessor.transform(X_val)

# Get final feature set
feature_cols = preprocessor.get_feature_names()
X_train_final = X_train_processed[feature_cols]
X_val_final = X_val_processed[feature_cols]

print(f"\n[OK] Feature engineering complete")
print(f"Final feature count: {len(feature_cols)}")
print(f"Training shape: {X_train_final.shape}")
print(f"Validation shape: {X_val_final.shape}")

## Step 6: Train Model (GPU Accelerated)

In [None]:
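# Configure for GPU training
config = BASELINE_CONFIG.copy()
# XGBoost >= 2.0 selects the GPU with device='cuda' plus tree_method='hist';
# 'gpu_hist' is deprecated there. On XGBoost < 2.0, use tree_method='gpu_hist'
# and drop the 'device' key.
config['tree_method'] = 'hist'
config['device'] = 'cuda'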

print("Training Configuration:")
for key, value in config.items():
    if key != 'random_state':
        print(f"  {key}: {value}")

# Initialize trainer
trainer = FootprintModelTrainer(**config)

# Train model
print("\nStarting training (this may take 15-30 minutes on GPU)...")
trainer.train(
    X_train_final, y_train,
    X_val_final, y_val,
    verbose=True
)

print("\n[OK] Training complete!")
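
Which parameters enable GPU training depends on the installed XGBoost release: version 2.0 introduced the `device` parameter and deprecated `tree_method='gpu_hist'`. The sketch below picks the matching settings from a version string. The helper name `gpu_params` is an assumption for illustration, and it relies on the trainer passing these keys through to XGBoost unchanged.

```python
def gpu_params(xgb_version: str) -> dict:
    """Return GPU-related XGBoost parameters suited to the given version string."""
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0 introduced `device` and deprecated 'gpu_hist'
        return {"tree_method": "hist", "device": "cuda"}
    # Older releases select the GPU through the tree method itself
    return {"tree_method": "gpu_hist"}
```

In this notebook it could be applied as `config.update(gpu_params(xgb.__version__))` before the trainer is created.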

## Step 7: Evaluate Model

In [None]:
# Create save directory
save_path = PATHS['save_dir'] / 'baseline'
save_path.mkdir(parents=True, exist_ok=True)
eval_dir = save_path / 'evaluation'

# Initialize evaluator
evaluator = ModelEvaluator(save_dir=str(eval_dir))

# Run baseline evaluation
print("Running baseline evaluation...")
baseline_metrics = evaluator.evaluate_baseline(trainer, X_val_final, y_val)

# Display results
print("\n" + "="*60)
print("BASELINE PERFORMANCE")
print("="*60)
for target in TARGET_COLUMNS:
    if target in baseline_metrics:
        m = baseline_metrics[target]
        print(f"\n{target}:")
        print(f"  MAE:  {m['mae']:.4f}")
        print(f"  RMSE: {m['rmse']:.4f}")
        print(f"  R2:   {m['r2']:.4f}")
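
For reference, the three reported metrics can be computed as in the sketch below. It uses the standard definitions of MAE, RMSE, and R2 in plain Python; how `ModelEvaluator` computes them internally is not shown here, so treat the function name `regression_metrics` as illustrative only.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R2 for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                    # undefined when y_true is constant
    return {"mae": mae, "rmse": rmse, "r2": r2}
```

MAE and RMSE are in the units of the target, while R2 is unitless, which is why only R2 is directly comparable across the carbon and water targets.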

In [None]:
# Test robustness with missing values
print("Testing robustness with missing values...")
robustness_results = evaluator.test_missing_value_robustness(
    trainer,
    preprocessor,
    X_val_raw,
    y_val,
    missing_levels=[0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    n_trials=3
)

# Generate full report
evaluator.generate_report()
print("\n[OK] Evaluation complete! Report saved to:", eval_dir)
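
The robustness test evaluates the model on validation data in which a fraction of values has been removed. The exact masking strategy inside `test_missing_value_robustness` is not shown in this notebook. The sketch below illustrates one simple possibility, in which each cell is independently replaced by `None` with probability `missing_rate`. The function name `mask_values` and the list-of-lists input are assumptions made to keep the example free of dependencies.

```python
import random


def mask_values(rows, missing_rate, seed=0):
    """Return a copy of `rows` with roughly `missing_rate` of the cells set to None."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    return [
        [None if rng.random() < missing_rate else value for value in row]
        for row in rows
    ]
```

Repeating the masking with different seeds, as `n_trials=3` suggests, averages out the effect of which particular cells happen to be dropped.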

## Step 8: Visualize Results

In [None]:
from IPython.display import Image, display
import matplotlib.pyplot as plt

# Display robustness curves if generated
plot_path = eval_dir / 'robustness_curves.png'
if plot_path.exists():
    display(Image(filename=str(plot_path)))
else:
    # Plot manually from results
    import pandas as pd
    
    df = pd.DataFrame(robustness_results)
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # R2 vs Missing %
    axes[0].plot(df['missing_pct'] * 100, df['carbon_total_r2'], 'b-o', label='Carbon Total')
    axes[0].plot(df['missing_pct'] * 100, df['water_total_r2'], 'g-o', label='Water Total')
    axes[0].set_xlabel('Missing Data (%)')
    axes[0].set_ylabel('R2 Score')
    axes[0].set_title('Model Robustness: R2 vs Missing Data')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # MAE vs Missing %
    axes[1].plot(df['missing_pct'] * 100, df['carbon_total_mae'], 'b-o', label='Carbon Total')
    axes[1].plot(df['missing_pct'] * 100, df['water_total_mae'], 'g-o', label='Water Total')
    axes[1].set_xlabel('Missing Data (%)')
    axes[1].set_ylabel('MAE')
    axes[1].set_title('Model Robustness: MAE vs Missing Data')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

## Step 9: Save Model

In [None]:
# Save trained model and preprocessor
print(f"Saving model to: {save_path}")

trainer.save(str(save_path))
preprocessor.save(str(save_path / 'preprocessor.pkl'))

print("\nSaved files:")
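for f in sorted(save_path.iterdir()):
    if f.is_dir():
        # A directory's own stat size says nothing about its contents
        print(f"  {f.name}/ (directory)")
    else:
        print(f"  {f.name}: {f.stat().st_size / 1024:.1f} KB")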

print("\n[OK] Model saved successfully!")

## Done!

Training complete. Your model is saved locally at:
- `models/saved/gpu_training/baseline/`

### Model Files:
- `xgb_model.json` - XGBoost model weights
- `trainer_config.pkl` - Training configuration  
- `preprocessor.pkl` - Fitted preprocessor
- `evaluation/` - Performance metrics and plots
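
Before loading the model elsewhere, it can help to confirm that all of these files were written. The sketch below checks a directory for the three file names listed above. The helper name `missing_artifacts` is hypothetical, and the check only covers the presence of each file, not its contents.

```python
from pathlib import Path

# File names taken from the list above
EXPECTED_FILES = ("xgb_model.json", "trainer_config.pkl", "preprocessor.pkl")


def missing_artifacts(save_dir):
    """Return the expected file names that are absent from `save_dir`."""
    save_dir = Path(save_dir)
    return [name for name in EXPECTED_FILES if not (save_dir / name).is_file()]
```

An empty result from `missing_artifacts(save_path)` means every expected file is in place.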

### Expected Performance:
- **R2 > 0.90** for all targets
- **MAE < 0.10 kg CO2e** for carbon predictions
- Graceful degradation with up to 30% missing data
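
The first two targets can be checked programmatically against the metrics dictionary from Step 7. The sketch below assumes that dictionary maps each target name to a dict with `'mae'` and `'r2'` keys, as the display loop in Step 7 suggests, and that the carbon target is named `carbon_total`, which is inferred from the plot columns in Step 8. The helper name `meets_targets` is hypothetical, and the missing-data criterion is not covered.

```python
def meets_targets(metrics, r2_min=0.90, carbon_mae_max=0.10, carbon_target="carbon_total"):
    """Check per-target metrics against the R2 and carbon MAE thresholds."""
    # Every target must exceed the R2 threshold
    r2_ok = all(m["r2"] > r2_min for m in metrics.values())
    # The MAE threshold applies to the carbon target only
    carbon = metrics.get(carbon_target)
    mae_ok = carbon is not None and carbon["mae"] < carbon_mae_max
    return r2_ok and mae_ok
```

For example, `meets_targets(baseline_metrics)` returns `False` if any target falls at or below the R2 threshold, or if the carbon MAE is 0.10 or higher.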