# 🚀 RoastFormer Transformer Training - Google Colab

**Complete training pipeline for coffee roast profile generation**

Author: Charlee Kraiss  
Project: RoastFormer - Transformer-Based Roast Profile Generation  
Date: November 2024

---

## 📋 What This Notebook Does

1. ✅ Sets up GPU environment
2. ✅ Extracts your preprocessed data from Google Drive
3. ✅ Trains the full transformer
4. ✅ Saves results & checkpoints
5. ✅ Generates downloadable results package

**Estimated Runtime:** 30-60 minutes (with free T4 GPU)

---

## 🎯 Quick Start

1. **Runtime → Change runtime type → GPU (T4)**
2. Run cells in order
3. Upload the data zip to Google Drive and set `zip_path` in section 2 to its location
4. Download results at the end

---

## 1️⃣ Setup Environment

In [1]:
# Check GPU availability
import torch
print("="*80)
print("GPU CHECK")
print("="*80)
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")
    print("✅ GPU ready for training!")
else:
    print("⚠️  No GPU detected. Go to Runtime → Change runtime type → GPU")
print("="*80)

GPU CHECK
CUDA available: True
GPU: NVIDIA L4
CUDA version: 12.6
✅ GPU ready for training!


In [2]:
# Install required packages (if needed)
!pip install -q pandas scikit-learn matplotlib

print("✅ Dependencies installed")

✅ Dependencies installed


In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [4]:
%cd /content/gdrive/MyDrive/"Colab Notebooks"/"GEN_AI"

/content/gdrive/MyDrive/Colab Notebooks/GEN_AI


In [5]:
pwd

'/content/gdrive/MyDrive/Colab Notebooks/GEN_AI'

In [6]:
ls -a

RoastFormer_Colab_Training.ipynb   roastformer_data_20251111_092727.zip
roastformer_data_20251111_092727/


## 2️⃣ Upload Your Data

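Upload the data zip created by the packaging script to Google Drive, then update `zip_path` in the cell below to match its folder and filename. The cell extracts the archive into `/content`.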

In [9]:
import zipfile
import os

print("="*80)
print("EXTRACTING DATA FROM GOOGLE DRIVE")
print("="*80)

zip_path = '/content/gdrive/MyDrive/Colab Notebooks/GEN_AI/roastformer_data_20251111_092727.zip'

if os.path.exists(zip_path):
    print("✅ Found zip file")

    # KEY FIX: Change to /content first
    os.chdir('/content')
    print(f"Working directory: {os.getcwd()}")

    print("\n📦 Extracting...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall('.')  # Extract to current directory

    print("✅ Extraction complete")

    # Verify
    print("\n📁 Verifying:")
    !ls -lh preprocessed_data/

    # Show stats
    import json
    with open('preprocessed_data/dataset_stats.json', 'r') as f:
        stats = json.load(f)
    print(f"\n📊 Dataset: {stats['total_profiles']} profiles")
    print("✅ Ready to train!")
else:
    print(f"❌ Zip not found at: {zip_path}")

EXTRACTING DATA FROM GOOGLE DRIVE
✅ Found zip file
Working directory: /content

📦 Extracting...
✅ Extraction complete

📁 Verifying:
total 13M
-rw-r--r-- 1 root root  267 Nov 13 18:41 dataset_stats.json
-rw-r--r-- 1 root root  19K Nov 13 18:41 train_metadata.csv
-rw-r--r-- 1 root root  11M Nov 13 18:41 train_profiles.json
-rw-r--r-- 1 root root 3.3K Nov 13 18:41 val_metadata.csv
-rw-r--r-- 1 root root 1.8M Nov 13 18:41 val_profiles.json

📊 Dataset: 144 profiles
✅ Ready to train!
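
Before extracting, it can help to confirm that the archive actually contains the files the later cells expect. The following is a minimal sketch of that check; `missing_members` is a hypothetical helper (not part of the RoastFormer code), and it assumes the paths inside the zip match the paths checked in the verification step.

```python
import zipfile


def missing_members(zip_path, required):
    """Return the required paths that are absent from the archive."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        names = set(zf.namelist())
    return [path for path in required if path not in names]


# Example usage inside the notebook (zip_path is defined in the cell above):
# missing = missing_members(zip_path, [
#     "preprocessed_data/dataset_stats.json",
#     "train_transformer.py",
# ])
# if missing:
#     print("Missing from archive:", missing)
```

An empty list means every required file is present, so extraction can proceed.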


## 3️⃣ Verify Data Loaded Correctly

In [10]:
import os
import json

print("="*80)
print("DATA VERIFICATION")
print("="*80)

# Check structure
expected_files = [
    'preprocessed_data/train_profiles.json',
    'preprocessed_data/val_profiles.json',
    'preprocessed_data/train_metadata.csv',
    'preprocessed_data/val_metadata.csv',
    'preprocessed_data/dataset_stats.json',
    'src/dataset/preprocessed_data_loader.py',
    'src/model/transformer_adapter.py',
    'train_transformer.py'
]

all_good = True
for filepath in expected_files:
    exists = os.path.exists(filepath)
    status = "✅" if exists else "❌"
    print(f"{status} {filepath}")
    if not exists:
        all_good = False

if all_good:
    print("\n✅ All files present!")

    # Load dataset stats
    with open('preprocessed_data/dataset_stats.json', 'r') as f:
        stats = json.load(f)

    print("\n📊 Dataset Statistics:")
    print(f"   Total profiles: {stats['total_profiles']}")
    print(f"   Training: {stats['train_size']}")
    print(f"   Validation: {stats['val_size']}")
    print(f"   Unique origins: {stats['unique_origins']}")
    print(f"   Unique processes: {stats['unique_processes']}")
    print(f"   Unique varieties: {stats['unique_varieties']}")
else:
    print("\n❌ Some files missing! Please re-upload the data package.")

print("="*80)

DATA VERIFICATION
✅ preprocessed_data/train_profiles.json
✅ preprocessed_data/val_profiles.json
✅ preprocessed_data/train_metadata.csv
✅ preprocessed_data/val_metadata.csv
✅ preprocessed_data/dataset_stats.json
✅ src/dataset/preprocessed_data_loader.py
✅ src/model/transformer_adapter.py
✅ train_transformer.py

✅ All files present!

📊 Dataset Statistics:
   Total profiles: 144
   Training: 123
   Validation: 21
   Unique origins: 18
   Unique processes: 13
   Unique varieties: 24
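
Beyond checking that the files exist, a quick consistency check on the reported sizes can catch a stale or mismatched `dataset_stats.json`. The sketch below uses the numbers printed above; `check_split_sizes` and `example_stats` are hypothetical names introduced only for illustration.

```python
def check_split_sizes(stats):
    """Raise ValueError if train and validation sizes do not sum to the total."""
    total = stats["total_profiles"]
    train, val = stats["train_size"], stats["val_size"]
    if train + val != total:
        raise ValueError(f"train ({train}) + val ({val}) != total ({total})")
    return val / total  # fraction of profiles held out for validation


# Values copied from the verification output above.
example_stats = {"total_profiles": 144, "train_size": 123, "val_size": 21}
val_fraction = check_split_sizes(example_stats)
print(f"Validation fraction: {val_fraction:.1%}")  # Validation fraction: 14.6%
```

With only 21 validation profiles, validation loss will be noisy from epoch to epoch, which is worth keeping in mind when reading the training curves.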


## 4️⃣ Configure Training

Choose your model configuration:

In [11]:
# Training Configuration
# Modify these parameters as needed

config = {
    # Model architecture
    'd_model': 256,              # Model dimension (128=small, 256=medium, 512=large)
    'nhead': 8,                  # Attention heads
    'num_layers': 6,             # Transformer layers
    'dim_feedforward': 1024,     # FFN dimension
    'embed_dim': 32,             # Categorical embedding size
    'dropout': 0.1,              # Dropout rate
    'positional_encoding': 'sinusoidal',  # 'sinusoidal' or 'learned'

    # Training hyperparameters
    'batch_size': 8,             # Batch size (4-16 for small dataset)
    'num_epochs': 100,           # Number of epochs
    'learning_rate': 1e-4,       # Learning rate
    'weight_decay': 0.01,        # L2 regularization
    'grad_clip': 1.0,            # Gradient clipping
    'early_stopping_patience': 15,  # Early stopping patience
    'max_sequence_length': 800,  # Max profile length

    # System
    'device': 'cuda' if torch.cuda.is_available() else 'cpu',
    'preprocessed_dir': 'preprocessed_data',
    'checkpoint_dir': 'checkpoints',
    'results_dir': 'results',
    'save_every': 10             # Save checkpoint every N epochs
}

print("="*80)
print("TRAINING CONFIGURATION")
print("="*80)
for key, value in config.items():
    print(f"  {key}: {value}")
print("="*80)

# Estimate parameters
if config['d_model'] == 128:
    params = "~2M"
    time_est = "15-30 min"
elif config['d_model'] == 256:
    params = "~10M"
    time_est = "30-60 min"
else:
    params = "~40M"
    time_est = "1-2 hours"

print(f"\n📊 Estimated model size: {params}")
print(f"⏱️  Estimated training time: {time_est} (on GPU)")

TRAINING CONFIGURATION
  d_model: 256
  nhead: 8
  num_layers: 6
  dim_feedforward: 1024
  embed_dim: 32
  dropout: 0.1
  positional_encoding: sinusoidal
  batch_size: 8
  num_epochs: 100
  learning_rate: 0.0001
  weight_decay: 0.01
  grad_clip: 1.0
  early_stopping_patience: 15
  max_sequence_length: 800
  device: cuda
  preprocessed_dir: preprocessed_data
  checkpoint_dir: checkpoints
  results_dir: results
  save_every: 10

📊 Estimated model size: ~10M
⏱️  Estimated training time: 30-60 min (on GPU)
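
When changing `d_model`, `nhead`, or `batch_size`, a couple of cheap checks can catch a bad combination before training starts. The sketch below is an illustration only: `check_config` is a hypothetical helper, it assumes the model uses standard multi-head attention (where `d_model` must be divisible by `nhead`), and it assumes the data loader keeps the final partial batch.

```python
import math


def check_config(config, train_size):
    """Sanity-check a few settings; return (per-head dimension, batches per epoch)."""
    if config["d_model"] % config["nhead"] != 0:
        raise ValueError("d_model must be divisible by nhead")
    head_dim = config["d_model"] // config["nhead"]
    # Assumes the last, smaller batch is kept rather than dropped.
    steps_per_epoch = math.ceil(train_size / config["batch_size"])
    return head_dim, steps_per_epoch


# Values taken from the configuration and dataset statistics above.
head_dim, steps = check_config(
    {"d_model": 256, "nhead": 8, "batch_size": 8}, train_size=123
)
print(head_dim, steps)  # 32 16
```

Under these assumptions, 100 epochs correspond to at most 1,600 optimizer steps.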


## 5️⃣ Train the Transformer

**This cell will take 30-60 minutes with GPU.**

You can monitor progress in real-time below.

In [None]:
# Import training script
import sys
sys.path.append('.')

from train_transformer import TransformerTrainer

print("="*80)
print("STARTING TRAINING")
print("="*80)
print(f"Device: {config['device']}")
print(f"Epochs: {config['num_epochs']}")
print("="*80)

# Initialize trainer
trainer = TransformerTrainer(config)

# Load data
trainer.load_data()

# Initialize model
trainer.initialize_model()

# Train!
trainer.train()

print("\n" + "="*80)
print("✅ TRAINING COMPLETE!")
print("="*80)
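
The `early_stopping_patience` setting means training may end well before `num_epochs`. The internals of `TransformerTrainer` are not shown in this notebook, so the following is only a sketch of the usual patience-based pattern; the `EarlyStopping` class and the loss values are made up for illustration.

```python
class EarlyStopping:
    """Track validation loss and signal when it stops improving."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch's loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Toy loss sequence with patience=2 for brevity (the notebook uses 15).
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([5.0, 4.0, 4.2, 4.1, 3.9], start=1):
    if stopper.step(loss):
        print(f"Stopping at epoch {epoch}, best loss {stopper.best_loss}")
        break
```

In this toy run the loop stops at epoch 4, after two epochs without beating the best loss of 4.0, and never sees the later improvement. A larger patience trades extra training time for a lower chance of stopping too early.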

## 6️⃣ Generate Results Summary

Create a comprehensive summary of training results.

In [None]:
import json
import matplotlib.pyplot as plt
from datetime import datetime

# Load training results
with open('results/transformer_training_results.json', 'r') as f:
    results = json.load(f)

print("="*80)
print("TRAINING RESULTS SUMMARY")
print("="*80)

print(f"\n📊 Model Configuration:")
print(f"   d_model: {results['config']['d_model']}")
print(f"   Layers: {results['config']['num_layers']}")
print(f"   Heads: {results['config']['nhead']}")
print(f"   Parameters: {results['num_parameters']:,}")

print(f"\n📈 Training Progress:")
print(f"   Final epoch: {results['final_epoch']}/{results['config']['num_epochs']}")
print(f"   Best val loss: {results['best_val_loss']:.4f}°F")
print(f"   Final train loss: {results['train_losses'][-1]:.4f}°F")
print(f"   Final val loss: {results['val_losses'][-1]:.4f}°F")

# Plot training curves
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(results['train_losses'], label='Train Loss', linewidth=2)
plt.plot(results['val_losses'], label='Val Loss', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (°F)')
plt.title('Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(results['train_losses'], label='Train Loss', linewidth=2)
plt.plot(results['val_losses'], label='Val Loss', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (°F)')
plt.title('Training Progress (Log Scale)')
plt.yscale('log')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('results/training_curves.png', dpi=150)
plt.show()

print("\n✅ Training curves saved to results/training_curves.png")
print("="*80)
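
The summary above reports the best and final losses but not which epoch was best or how far validation loss sits above training loss. A small sketch of that calculation follows; `summarize_losses` is a hypothetical helper, and it assumes `train_losses` and `val_losses` hold one value per epoch. The numbers are invented for illustration.

```python
def summarize_losses(train_losses, val_losses):
    """Return the 1-based best epoch, its val loss, and the final val-train gap."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    gap = val_losses[-1] - train_losses[-1]
    return best_index + 1, val_losses[best_index], gap


# Made-up losses; in the notebook, pass results['train_losses'] and
# results['val_losses'] instead.
best_epoch, best_loss, gap = summarize_losses(
    [9.0, 6.0, 4.0, 3.0], [9.5, 6.5, 5.0, 5.5]
)
print(best_epoch, best_loss, gap)  # 3 5.0 2.5
```

A best epoch well before the final epoch, together with a growing gap, is a common sign of overfitting on a dataset this small.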

## 7️⃣ Package Results for Download

Create a downloadable package with all results.

In [None]:
import zipfile
import os
from datetime import datetime

# Create results package
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
package_name = f'roastformer_results_{timestamp}.zip'

print("="*80)
print("PACKAGING RESULTS")
print("="*80)

with zipfile.ZipFile(package_name, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Add checkpoint
    zipf.write('checkpoints/best_transformer_model.pt',
               'best_transformer_model.pt')
    print("✅ Added: best_transformer_model.pt")

    # Add results
    zipf.write('results/transformer_training_results.json',
               'transformer_training_results.json')
    print("✅ Added: transformer_training_results.json")

    # Add training curves
    zipf.write('results/training_curves.png',
               'training_curves.png')
    print("✅ Added: training_curves.png")

    # Create a summary text file
    summary = f"""RoastFormer Training Results Summary
Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}

MODEL CONFIGURATION
-------------------
d_model: {results['config']['d_model']}
Layers: {results['config']['num_layers']}
Heads: {results['config']['nhead']}
Parameters: {results['num_parameters']:,}
Positional Encoding: {results['config']['positional_encoding']}

TRAINING RESULTS
----------------
Epochs Trained: {results['final_epoch']}
Best Validation Loss: {results['best_val_loss']:.4f}°F
Final Train Loss: {results['train_losses'][-1]:.4f}°F
Final Val Loss: {results['val_losses'][-1]:.4f}°F

DATASET
-------
Total Profiles: {stats['total_profiles']}
Origins: {results['feature_dims']['num_origins']}
Processes: {results['feature_dims']['num_processes']}
Varieties: {results['feature_dims']['num_varieties']}
Flavors: {results['feature_dims']['num_flavors']}

FILES INCLUDED
--------------
1. best_transformer_model.pt - Best model checkpoint
2. transformer_training_results.json - Complete results
3. training_curves.png - Training visualization
4. training_summary.txt - This file

TO USE THESE RESULTS
--------------------
1. Download this zip file
2. Extract to your RoastFormer project
3. Share training_summary.txt with Claude
4. Use evaluate_transformer.py to analyze the model
5. Use generate_profiles.py to create new profiles
"""

    zipf.writestr('training_summary.txt', summary)
    print("✅ Added: training_summary.txt")

print(f"\n📦 Package created: {package_name}")
print(f"   Size: {os.path.getsize(package_name) / 1024 / 1024:.2f} MB")
print("="*80)
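
`ZipFile.write` raises `FileNotFoundError` if a source file is missing, for example when the curves plot was never generated because section 6 was skipped. The sketch below shows one way to package whatever exists and report the rest; `package_existing` is a hypothetical helper, not part of the notebook's code.

```python
import os
import zipfile


def package_existing(package_path, files):
    """Zip the files that exist; return (added, skipped) archive names.

    `files` maps a path on disk to the name it should have inside the archive.
    """
    added, skipped = [], []
    with zipfile.ZipFile(package_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for src, arcname in files.items():
            if os.path.exists(src):
                zf.write(src, arcname)
                added.append(arcname)
            else:
                skipped.append(arcname)
    return added, skipped


# Example usage with the paths from the cell above:
# added, skipped = package_existing(package_name, {
#     "checkpoints/best_transformer_model.pt": "best_transformer_model.pt",
#     "results/training_curves.png": "training_curves.png",
# })
```

Returning the skipped names lets the caller decide whether a partial package is acceptable or whether an earlier cell needs to be re-run.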

## 8️⃣ Download Results

Download the complete results package to your Mac.

In [None]:
from google.colab import files

print("="*80)
print("DOWNLOAD RESULTS")
print("="*80)
print(f"Downloading: {package_name}")
print("\nThis package contains:")
print("  • Trained model checkpoint")
print("  • Complete training results (JSON)")
print("  • Training curves visualization")
print("  • Summary text file")
print("\nOnce downloaded:")
print("  1. Extract the zip file")
print("  2. Share 'training_summary.txt' with Claude")
print("  3. Move checkpoint to checkpoints/ folder")
print("  4. Run evaluation and generation scripts")
print("="*80)

files.download(package_name)

print("\n✅ Download complete!")
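
Once the package is on your machine, the summary can be read straight from the zip without extracting everything. A minimal sketch, assuming the archive layout produced by the packaging cell; `read_summary` is a hypothetical helper name.

```python
import zipfile


def read_summary(package_path, member="training_summary.txt"):
    """Return the summary text stored in the results package."""
    with zipfile.ZipFile(package_path, "r") as zf:
        return zf.read(member).decode("utf-8")


# Example usage (substitute the timestamped filename of your download):
# print(read_summary("roastformer_results_<timestamp>.zip"))
```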

## 🎉 Training Complete!

### What You Have Now:

1. ✅ **Trained transformer model** - Ready for profile generation
2. ✅ **Training results** - Complete metrics and curves
3. ✅ **Downloadable package** - Everything you need

### Next Steps:

**On Your Mac:**

1. **Extract the results:**
   ```bash
   cd ~/VANDY/FALL_2025/GEN_AI_THEORY/ROASTFormer
   unzip roastformer_results_*.zip
   ```

2. **Share results with Claude:**
   - Open `training_summary.txt`
   - Paste contents in chat with Claude
   - Claude will analyze and suggest next steps

3. **Evaluate the model:**
   ```bash
   python evaluate_transformer.py --plot --num_samples 10
   ```

4. **Generate custom profiles:**
   ```bash
   python generate_profiles.py \
     --origin "Ethiopia" \
     --flavors "berries,floral" \
     --plot
   ```

### For Ablation Studies:

**Modify the config in section 4 (Configure Training) and re-run from there:**

- Try `positional_encoding: 'learned'`
- Try different model sizes (d_model: 128, 256, 512)
- Compare results
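
The ablation settings listed above can be enumerated programmatically rather than edited by hand. The sketch below builds one config per combination; `ablation_configs`, `base`, and `grid` are hypothetical names, and `base` is a trimmed stand-in for the notebook's full `config` dict.

```python
import itertools


def ablation_configs(base_config, grid):
    """Yield one config per combination of the values in `grid`."""
    keys = list(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        # Copy the base config and override only the ablated keys.
        yield {**base_config, **dict(zip(keys, values))}


# Trimmed stand-in for the notebook's config dict.
base = {"d_model": 256, "nhead": 8, "positional_encoding": "sinusoidal"}
grid = {
    "positional_encoding": ["sinusoidal", "learned"],
    "d_model": [128, 256, 512],
}
runs = list(ablation_configs(base, grid))
print(len(runs))  # 6
```

If you train each config in turn, consider giving every run its own `checkpoint_dir` and `results_dir` so that later runs do not overwrite earlier results.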

---

**Questions?** Share the training summary with Claude for analysis!
