# Knowledge Distillation Training on Google Colab

This notebook runs the production KD training pipeline on a free Google Colab GPU.

**Setup:**
1. Runtime > Change runtime type > GPU (T4)
2. Run all cells

**Expected time:** ~30-60 minutes on a T4 GPU

## 1. Setup Environment

In [None]:
# Clone repository
!git clone https://github.com/YOUR_USERNAME/semantic-kd.git
%cd semantic-kd

In [None]:
# Install dependencies
!pip install -q poetry
!poetry install --only main

In [None]:
# Verify GPU
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

## 2. Download Data from GCS

In [None]:
# Authenticate with GCS (if needed)
from google.colab import auth
auth.authenticate_user()

# Create the destination directories, then download data
!mkdir -p data/raw data/chunks artifacts/indexes
!gsutil -m cp -r gs://plotpointe-semantic-kd-data/raw/msmarco data/raw/
!gsutil -m cp -r gs://plotpointe-semantic-kd-data/chunks/msmarco data/chunks/
!gsutil -m cp -r gs://plotpointe-semantic-kd-data/indexes/bm25_msmarco artifacts/indexes/
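
Before starting the training run, it can help to confirm that the three downloads landed where later cells expect them. The following is a minimal sketch, assuming `gsutil cp -r` created `msmarco` and `bm25_msmarco` subdirectories under the destination paths shown above. `missing_paths` and `EXPECTED_PATHS` are hypothetical names, not part of the repository.

```python
from pathlib import Path

# Paths the later cells rely on; assumed to match the gsutil destinations above.
EXPECTED_PATHS = [
    "data/raw/msmarco",
    "data/chunks/msmarco",
    "artifacts/indexes/bm25_msmarco",
]

def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    return [p for p in paths if not (Path(root) / p).exists()]

missing = missing_paths(EXPECTED_PATHS)
if missing:
    print("Missing after download:", ", ".join(missing))
else:
    print("All expected data paths are present.")
```

If anything is reported missing, re-running the download cell is usually cheaper than discovering the gap partway through training.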

## 3. Run Training

In [None]:
# Run production training
!poetry run python scripts/train_kd_pipeline.py \
    --max-samples 1000 \
    --epochs 3 \
    --batch-size 16 \
    --stage 2 \
    --device cuda \
    --output-dir ./artifacts/models/kd_student_production \
    --log-level INFO
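
For a rough sense of the run's size, the number of optimizer steps can be estimated from the flags above. This is a sketch under stated assumptions: each of the `--max-samples` items yields exactly one training example, the last partial batch is kept, and there is no gradient accumulation. The script itself may behave differently. `estimated_steps` is a hypothetical helper.

```python
import math

def estimated_steps(max_samples, batch_size, epochs):
    """Estimate optimizer steps, assuming one example per sample and no accumulation."""
    steps_per_epoch = math.ceil(max_samples / batch_size)
    return steps_per_epoch * epochs

# Values taken from the training command above.
print(estimated_steps(max_samples=1000, batch_size=16, epochs=3))  # 189
```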

## 4. Evaluate Model

In [None]:
# Evaluate vanilla student
!poetry run python scripts/simple_eval.py \
    --model-path "intfloat/e5-small-v2" \
    --data-path "data/raw/msmarco/train.jsonl" \
    --output-path "artifacts/evaluation_production/vanilla_results.json" \
    --max-samples 1000 \
    --device cuda

In [None]:
# Evaluate KD student
!poetry run python scripts/simple_eval.py \
    --model-path "./artifacts/models/kd_student_production/best_model" \
    --data-path "data/raw/msmarco/train.jsonl" \
    --output-path "artifacts/evaluation_production/kd_results.json" \
    --max-samples 1000 \
    --device cuda

## 5. View Results

In [None]:
import json

# Load results
with open('artifacts/evaluation_production/vanilla_results.json') as f:
    vanilla = json.load(f)

with open('artifacts/evaluation_production/kd_results.json') as f:
    kd = json.load(f)

# Display comparison
print("\n" + "="*60)
print("RESULTS COMPARISON")
print("="*60)
print("\nVanilla Student:")
print(f"  nDCG@10: {vanilla.get('ndcg@10', 0):.4f}")
print(f"  MRR@10:  {vanilla.get('mrr@10', 0):.4f}")

print("\nKD Student:")
print(f"  nDCG@10: {kd.get('ndcg@10', 0):.4f}")
print(f"  MRR@10:  {kd.get('mrr@10', 0):.4f}")

def pct_change(base, new):
    # Relative change in percent; NaN when the baseline is 0 or missing
    return (new - base) / base * 100 if base else float("nan")

ndcg_improvement = pct_change(vanilla.get('ndcg@10', 0), kd.get('ndcg@10', 0))
mrr_improvement = pct_change(vanilla.get('mrr@10', 0), kd.get('mrr@10', 0))

print("\nImprovement:")
print(f"  nDCG@10: {ndcg_improvement:+.2f}%")
print(f"  MRR@10:  {mrr_improvement:+.2f}%")
print("\n" + "="*60)
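
The comparison above reads two fixed keys. If the result files hold additional metrics, a more general version could loop over every numeric key the two dictionaries share. The sketch below assumes the JSON files are flat dictionaries of metric name to number, which is not confirmed by the source. `relative_changes` is a hypothetical helper, and the input numbers are made up for illustration; they are not results from this notebook.

```python
def relative_changes(baseline, candidate):
    """Percent change per shared numeric metric; None where the baseline is 0."""
    changes = {}
    for name in sorted(set(baseline) & set(candidate)):
        base, cand = baseline[name], candidate[name]
        if not all(isinstance(v, (int, float)) for v in (base, cand)):
            continue  # skip non-numeric entries such as config strings
        changes[name] = None if base == 0 else (cand - base) / base * 100
    return changes

# Illustrative inputs only, not measured results.
example = relative_changes(
    {"ndcg@10": 0.40, "mrr@10": 0.0, "model": "a"},
    {"ndcg@10": 0.44, "mrr@10": 0.1, "model": "b"},
)
for name, pct in example.items():
    print(name, "n/a" if pct is None else f"{pct:+.2f}%")
```

Returning `None` for a zero baseline keeps the undefined case explicit instead of producing a misleading percentage.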

## 6. Upload Results to GCS

In [None]:
# Upload trained model
!gsutil -m cp -r artifacts/models/kd_student_production gs://plotpointe-semantic-kd-models/

# Upload evaluation results
!gsutil -m cp -r artifacts/evaluation_production gs://plotpointe-semantic-kd-models/

print("\n✓ Results uploaded to GCS!")
print("\nModel: gs://plotpointe-semantic-kd-models/kd_student_production")
print("Results: gs://plotpointe-semantic-kd-models/evaluation_production")

## 7. Download Results Locally (Optional)

In [None]:
# Download to your local machine
from google.colab import files

# Zip results
!zip -r kd_training_results.zip artifacts/models/kd_student_production artifacts/evaluation_production

# Download
files.download('kd_training_results.zip')