# 🚀 ML Footprint Prediction - Google Colab GPU Training

> Multi-output XGBoost training on Google Colab's free GPU

---

## 📋 Setup Instructions

### 1. Enable GPU Runtime
- Click **Runtime** → **Change runtime type**
- Set **Hardware accelerator** to **GPU** (T4 recommended)
- Click **Save**

### 2. Upload Your Data
You'll need to upload:
- `train.csv` (from `data/data_splitter/output/`)
- `validate.csv` (from `data/data_splitter/output/`)
- `material_dataset_final.csv` (from `data/data_calculations/input/`)

### 3. Run All Cells
- Click **Runtime** → **Run all**
- Wait for training to complete (~30-60 min)
- Download trained model at the end

---

## 🔧 Step 1: Environment Setup

Install the required packages and verify that a GPU is available.

In [None]:
# Install dependencies
!pip install -q xgboost matplotlib seaborn joblib

# Verify GPU availability
import subprocess
result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
print("GPU Information:")
print(result.stdout)

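# Check XGBoost GPU support
import xgboost as xgb
print(f"\nXGBoost version: {xgb.__version__}")
# build_info() reports whether this XGBoost build was compiled with CUDA
print(f"CUDA-enabled build: {xgb.build_info().get('USE_CUDA', False)}")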
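The GPU check above calls `nvidia-smi` directly, and `subprocess.run` raises `FileNotFoundError` if that executable is not installed, which may be the case on a CPU-only runtime. A minimal sketch of a more defensive check, using only the standard library (the helper name `gpu_visible` is hypothetical):

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi exists and exits successfully."""
    # shutil.which returns None when the executable is not on PATH
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # A non-zero exit code usually means no driver or no device is attached
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```

If this prints `False`, revisit the runtime type in the setup instructions before starting a long training run.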

## 📦 Step 2: Clone Repository or Upload Code

**Option A**: Clone from GitHub (if you have a repository)

**Option B**: Upload code manually (see below)

In [None]:
# Option A: Clone from GitHub (replace with your repo URL)
# !git clone https://github.com/your-username/bulk_product_generator.git
# %cd bulk_product_generator/models

# Option B: Create directory structure manually
!mkdir -p models/src
!mkdir -p models/data
!mkdir -p models/logs
!mkdir -p models/saved
%cd models

print("✓ Directory structure created")

### Upload Python Source Files

If you didn't clone from GitHub, upload these files from your local `models/src/` directory:
- `__init__.py`
- `config.py`
- `data_loader.py`
- `formula_features.py`
- `preprocessor.py`
- `trainer.py`
- `evaluator.py`
- `utils.py`

And from `models/`:
- `train_max_accuracy.py`

In [None]:
from google.colab import files
import shutil

# Upload all Python source files to src/
print("Upload all .py files from your local models/src/ directory:")
uploaded = files.upload()

# Move each uploaded file to its target location
for filename in uploaded.keys():
    if filename.startswith('src_'):  # Strip an optional 'src_' prefix
        target = f"src/{filename[4:]}"
    elif filename == 'train_max_accuracy.py':
        target = filename  # The training script stays in models/
    else:
        target = f"src/{filename}"

    # shutil.move handles filenames with spaces, unlike an unquoted !mv
    if target != filename:
        shutil.move(filename, target)
    print(f"✓ Moved {filename} → {target}")

print("\n✓ All source files uploaded")
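The routing rules in the upload cell can be checked without uploading anything. The sketch below restates them as a pure function (the name `target_path` is hypothetical) and prints where a few example filenames would end up:

```python
def target_path(filename: str) -> str:
    """Mirror the routing rules used in the upload cell."""
    if filename.startswith("src_"):
        # An optional 'src_' prefix is stripped before moving into src/
        return f"src/{filename[4:]}"
    if filename == "train_max_accuracy.py":
        # The training script stays in the current (models/) directory
        return filename
    return f"src/{filename}"


for name in ["src_config.py", "train_max_accuracy.py", "utils.py"]:
    print(f"{name} -> {target_path(name)}")
```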

## 📊 Step 3: Upload Data Files

Upload your training data and material factors.

In [None]:
from google.colab import files
import shutil

print("Upload train.csv, validate.csv, and material_dataset_final.csv")
print("NOTE: These files may be large (train.csv ~500MB). Upload may take a few minutes.")
print("")

uploaded_data = files.upload()

# Move to data directory
for filename in uploaded_data.keys():
    shutil.move(filename, f"data/{filename}")
    print(f"✓ Moved {filename} → data/")

# Verify files
print("\nData files in data/:")
!ls -lh data/
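Large uploads can be truncated, so it may be worth confirming that each CSV has a header and a plausible row count before training. A minimal sketch using only the standard library; `csv_summary` is a hypothetical helper, and it streams the file rather than loading it into memory:

```python
import csv


def csv_summary(path: str) -> tuple[list[str], int]:
    """Return the header columns and the number of data rows in a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])  # empty list if the file has no lines
        n_rows = sum(1 for _ in reader)  # counts rows without storing them
    return header, n_rows
```

For example, `columns, n_rows = csv_summary('data/train.csv')` returns the column names and the number of data rows, which can be compared against the local copy.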

## ✅ Step 4: Verify Setup

Check that everything is ready for training.

In [None]:
import os

# Check source files
required_src = [
    'src/__init__.py',
    'src/config.py',
    'src/data_loader.py',
    'src/formula_features.py',
    'src/preprocessor.py',
    'src/trainer.py',
    'src/evaluator.py',
    'src/utils.py',
    'train_max_accuracy.py'
]

print("Checking source files:")
all_src_ok = True
for file in required_src:
    exists = os.path.exists(file)
    status = "✓" if exists else "✗"
    print(f"  {status} {file}")
    if not exists:
        all_src_ok = False

# Check data files
required_data = [
    'data/train.csv',
    'data/validate.csv',
    'data/material_dataset_final.csv'
]

print("\nChecking data files:")
all_data_ok = True
for file in required_data:
    exists = os.path.exists(file)
    status = "✓" if exists else "✗"
    size = os.path.getsize(file) / (1024*1024) if exists else 0
    print(f"  {status} {file} ({size:.1f} MB)" if exists else f"  {status} {file}")
    if not exists:
        all_data_ok = False

if all_src_ok and all_data_ok:
    print("\n✅ All files present! Ready to train.")
else:
    print("\n⚠️  Some files missing. Please upload them before proceeding.")

## 🔧 Step 5: Update Data Paths for Colab

Rewrite the hard-coded local paths in `src/data_loader.py` so that they point to the Colab `data/` directory.

In [None]:
# Update data_loader.py to use Colab paths
with open('src/data_loader.py', 'r') as f:
    content = f.read()

# Replace default paths with Colab paths
content = content.replace(
    "'/home/tr4moryp/Projects/bulk_product_generator/data/data_splitter/output/train.csv'",
    "'data/train.csv'"
)
content = content.replace(
    "'/home/tr4moryp/Projects/bulk_product_generator/data/data_splitter/output/validate.csv'",
    "'data/validate.csv'"
)
content = content.replace(
    "'/home/tr4moryp/Projects/bulk_product_generator/data/data_calculations/input/material_dataset_final.csv'",
    "'data/material_dataset_final.csv'"
)

with open('src/data_loader.py', 'w') as f:
    f.write(content)

print("✓ Paths updated for Google Colab")
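Note that `str.replace` does nothing, and raises no error, when the old path is not present in the file, so the success message prints even if no path was changed. The sketch below shows one way to detect that; `replace_paths` is a hypothetical helper and the sample string stands in for the real file contents, which are not shown in this notebook:

```python
def replace_paths(content: str, mapping: dict[str, str]) -> tuple[str, list[str]]:
    """Apply each old -> new replacement and collect old strings that were not found."""
    not_found = []
    for old, new in mapping.items():
        if old in content:
            content = content.replace(old, new)
        else:
            not_found.append(old)
    return content, not_found


# Illustrative input only; not the actual contents of data_loader.py
sample = "TRAIN_PATH = '/old/location/train.csv'\n"
updated, missing = replace_paths(
    sample,
    {
        "'/old/location/train.csv'": "'data/train.csv'",
        "'/old/location/validate.csv'": "'data/validate.csv'",
    },
)
print(updated, end="")
print("Not found:", missing)
```

A non-empty `missing` list suggests the default paths in your copy of `data_loader.py` differ from the ones this notebook expects.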

## 🎯 Step 6: Start Training

### Training Configuration

This will run the 3-phase training pipeline:
1. **Phase 1**: Baseline training (2000 rounds, ~15-20 min)
2. **Phase 2**: Evaluation & robustness testing (~10-15 min)
3. **Phase 3**: Augmented retraining if needed (~20-30 min)

**Total time**: roughly 30-60 minutes on GPU, depending on whether Phase 3 runs

---

In [None]:
# Run full training pipeline
!python train_max_accuracy.py \
  --tree-method gpu_hist \
  --save-dir saved/colab_training

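# Note: The trailing backslashes (\) continue the command onto the next line.
# Write the command on a single line if your shell does not support them (e.g. Windows cmd).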
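XGBoost 2.0 deprecated `tree_method="gpu_hist"` in favor of `tree_method="hist"` combined with `device="cuda"`. How `train_max_accuracy.py` forwards `--tree-method` to XGBoost is not shown here, so the following is only a sketch of choosing GPU parameters by version; `gpu_params` is a hypothetical helper, not part of the project:

```python
def gpu_params(xgb_version: str) -> dict[str, str]:
    """Pick GPU training parameters that suit the given XGBoost version string."""
    major = int(xgb_version.split(".")[0])
    if major >= 2:
        # XGBoost 2.0+ selects the GPU through the `device` parameter
        return {"tree_method": "hist", "device": "cuda"}
    # Older releases select the GPU through the tree method itself
    return {"tree_method": "gpu_hist"}


print(gpu_params("2.0.3"))
print(gpu_params("1.7.6"))
```

If training logs a deprecation warning about `gpu_hist`, this version difference is a likely cause.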

### Quick Test (Optional)

If you want to test with a smaller dataset first:

In [None]:
# Quick test with 10K samples (5-10 min)
# !python train_max_accuracy.py --sample-size 10000 --save-dir saved/quick_test

## 📊 Step 7: View Results

Check training logs and evaluation metrics.

In [None]:
# View last 100 lines of training log
!tail -100 logs/training_max_accuracy.log

In [None]:
# Display evaluation report
import json

report_path = 'saved/colab_training/baseline/evaluation/evaluation_report.json'
with open(report_path, 'r') as f:
    report = json.load(f)

print("="*60)
print("EVALUATION RESULTS")
print("="*60)

if 'baseline' in report:
    print("\nBaseline Performance (Complete Data):")
    for target in ['carbon_material', 'carbon_transport', 'carbon_total', 'water_total']:
        if target in report['baseline']:
            metrics = report['baseline'][target]
            print(f"\n{target}:")
            print(f"  MAE:  {metrics['mae']:.4f}")
            print(f"  RMSE: {metrics['rmse']:.4f}")
            print(f"  R²:   {metrics['r2']:.4f}")

if 'robustness' in report:
    print("\n" + "="*60)
    print("Robustness (30% Missing Data):")
    for r in report['robustness']:
        if abs(r['missing_pct'] - 0.3) < 0.01:
            print(f"  carbon_total MAE: {r['carbon_total_mae']:.4f}")
            print(f"  carbon_total R²:  {r['carbon_total_r2']:.4f}")
            break
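The cell above reports only the 30% missing-data entry. To see how performance changes across all tested levels, the `robustness` list can be tabulated in full. This sketch assumes the same keys the cell reads (`missing_pct`, `carbon_total_mae`, `carbon_total_r2`); the helper name and the numbers are placeholders, not real results:

```python
def robustness_rows(report: dict) -> list[str]:
    """Format one line per robustness entry, ordered by missing-data fraction."""
    entries = sorted(report.get("robustness", []), key=lambda r: r["missing_pct"])
    return [
        f"{r['missing_pct']:.0%} missing: "
        f"MAE={r['carbon_total_mae']:.4f}, R2={r['carbon_total_r2']:.4f}"
        for r in entries
    ]


# Placeholder numbers for illustration only, not real results
example = {
    "robustness": [
        {"missing_pct": 0.3, "carbon_total_mae": 0.2, "carbon_total_r2": 0.8},
        {"missing_pct": 0.0, "carbon_total_mae": 0.1, "carbon_total_r2": 0.9},
    ]
}
for line in robustness_rows(example):
    print(line)
```

With the real report loaded, pass `report` instead of `example`.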

In [None]:
# Display robustness curve
from IPython.display import Image, display
import os

plot_path = 'saved/colab_training/baseline/evaluation/robustness_curves.png'
if os.path.exists(plot_path):
    display(Image(filename=plot_path))
else:
    print("Plot not found. Training may still be in progress.")

## 💾 Step 8: Download Trained Model

Download the trained model to your local machine.

In [None]:
# Create ZIP archive of trained model
!zip -r trained_model.zip saved/colab_training/baseline/ logs/

print("\n✓ Model files zipped")
!ls -lh trained_model.zip
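Before downloading, it can help to confirm that the archive actually contains files, since `zip` still produces output when a directory is empty. A minimal sketch with the standard `zipfile` module; `archive_summary` is a hypothetical helper:

```python
import zipfile


def archive_summary(path: str) -> tuple[int, int]:
    """Return (number of files, total uncompressed bytes) for a ZIP archive."""
    with zipfile.ZipFile(path) as zf:
        # Skip directory entries so that only real files are counted
        infos = [i for i in zf.infolist() if not i.is_dir()]
        return len(infos), sum(i.file_size for i in infos)
```

For example, `archive_summary('trained_model.zip')` returns the file count and total size, which should be non-zero after a successful run.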

In [None]:
# Download the ZIP file
from google.colab import files

print("Downloading trained model...")
files.download('trained_model.zip')
print("\n✓ Download complete!")
print("\nExtract the ZIP file on your local machine:")
print("  - saved/colab_training/baseline/ → Contains model files")
print("  - logs/ → Contains training logs")

## 🔮 Step 9: Test Predictions (Optional)

Make predictions on sample data.

In [None]:
from src.trainer import FootprintModelTrainer
from src.preprocessor import FootprintPreprocessor
import pandas as pd

# Load trained model
trainer = FootprintModelTrainer.load('saved/colab_training/baseline')
preprocessor = FootprintPreprocessor.load('saved/colab_training/baseline/preprocessor.pkl')

# Load sample from validation set
val_df = pd.read_csv('data/validate.csv')
X_sample = val_df.head(10)

# Prepare features (simplified - normally you'd add formula features too)
from src.data_loader import FEATURE_COLUMNS
X_features = X_sample[FEATURE_COLUMNS]

# Note: In production, you'd need to add formula features and preprocess
# This is just a quick demo

print("Model loaded successfully!")
print(f"Best iteration: {trainer.model.best_iteration}")

## 🎉 Training Complete!

### Next Steps:

1. **Review Results**: Check the evaluation report and plots above
2. **Download Model**: The trained model has been zipped and is ready to download
3. **Use Locally**: Extract `trained_model.zip` and use for predictions

### Model Files Included:
- `xgb_model.json` - XGBoost model
- `trainer_config.pkl` - Training configuration
- `preprocessor.pkl` - Fitted preprocessor
- `evaluation_report.json` - Performance metrics
- `robustness_curves.png` - Performance plots
- Training logs

### Expected Performance:
- **R² > 0.90** for all targets (complete data)
- **MAE < 0.10 kg CO2e** for carbon predictions
- **Physics constraint violation < 0.01**
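The first two thresholds can be checked automatically against the `baseline` section of the evaluation report. The sketch below assumes each target maps to a dict with `mae` and `r2` keys, as read in Step 7. The helper name and the sample metrics are placeholders, and the physics constraint is not covered because its location in the report is not shown here:

```python
R2_MIN = 0.90   # R² threshold from the list above
MAE_MAX = 0.10  # kg CO2e, applied to carbon targets only


def failed_checks(baseline: dict) -> list[str]:
    """Return a description of every target that misses an expected threshold."""
    failures = []
    for target, m in baseline.items():
        if m["r2"] <= R2_MIN:
            failures.append(f"{target}: R2 {m['r2']:.4f} <= {R2_MIN}")
        if target.startswith("carbon_") and m["mae"] >= MAE_MAX:
            failures.append(f"{target}: MAE {m['mae']:.4f} >= {MAE_MAX}")
    return failures


# Placeholder metrics for illustration only, not real results
example = {
    "carbon_total": {"mae": 0.05, "rmse": 0.08, "r2": 0.95},
    "water_total": {"mae": 0.50, "rmse": 0.70, "r2": 0.85},
}
print(failed_checks(example) or "all checks passed")
```

With the real report loaded, call `failed_checks(report['baseline'])`; an empty list means both thresholds were met for every target.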

---

**Questions or issues?** Check the troubleshooting section in the README.
