# ♟️ Archimedes Chess AI - Google Colab Training

This notebook provides a complete environment for training the Archimedes chess AI on Google Colab with GPU acceleration.

## Features:
- 🚀 Automatic GPU detection and setup
- 💾 Resumable training with checkpoints
- 📊 Live metrics dashboard with ngrok
- 🎮 Interactive play vs AI
- 📈 Comprehensive performance tracking

## Quick Start:
1. Enable GPU: Runtime → Change runtime type → GPU
2. Run all cells in order
3. Access dashboard via ngrok URL

## 1. Setup Environment

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA version: {torch.version.cuda}")

In [None]:
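# Install dependencies
# Note: the -f wheel index must match the PyTorch and CUDA versions printed by
# the cell above; edit the URL if Colab ships something other than 2.1.0+cu118.
!pip install -q torch torchvision torchaudio
!pip install -q torch-geometric torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-2.1.0+cu118.html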
!pip install -q python-chess h5py numpy tqdm requests zstandard psutil plotly pandas streamlit pyngrok pynvml
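
The wheel index URL in the install cell is tied to one PyTorch/CUDA build. A minimal sketch of deriving it from a version string is shown below. The helper name `pyg_wheel_index` is hypothetical, and the rule that wheels are indexed per minor release (patch set to 0) is an assumption to check against the PyG installation notes.

```python
import re


def pyg_wheel_index(torch_version: str) -> str:
    """Build a PyG wheel index URL from a string such as '2.1.2+cu118'."""
    match = re.match(r"^(\d+)\.(\d+)\.\d+(?:\+(\w+))?", torch_version)
    if match is None:
        raise ValueError(f"Unrecognized torch version: {torch_version!r}")
    major, minor, build = match.groups()
    # Assumption: the index is published per minor release (patch = 0), and a
    # version without a '+cuXXX' suffix is treated as a CPU-only build.
    build = build or "cpu"
    return f"https://data.pyg.org/whl/torch-{major}.{minor}.0+{build}.html"


print(pyg_wheel_index("2.1.2+cu118"))
```

In a notebook cell the result could be passed to pip through IPython's variable interpolation, for example `-f {pyg_wheel_index(torch.__version__)}`.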

## 2. Clone/Upload Project Files

In [None]:
# Option 1: Clone from GitHub (if you have a repo)
# !git clone https://github.com/yourusername/archimedes-chess-ai.git
# %cd archimedes-chess-ai

# Option 2: Upload files manually
# Use the file browser on the left to upload:
# - model.py
# - mcts.py
# - metrics.py
# - train_end_to_end.py
# - dashboard.py

# Option 3: Download from a URL
# !wget https://your-url.com/archimedes-files.zip
# !unzip archimedes-files.zip

# No files are created automatically; pick one of the options above first
print("Upload project files or run setup script")

## 3. Mount Google Drive (Optional - for checkpoint persistence)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create checkpoint directory in Drive
import os
checkpoint_dir = '/content/drive/MyDrive/archimedes_checkpoints'
os.makedirs(checkpoint_dir, exist_ok=True)
print(f"Checkpoints will be saved to: {checkpoint_dir}")
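
Because mounting Drive is optional, it can help to choose the checkpoint directory in one place. The sketch below assumes a hypothetical helper, `resolve_checkpoint_dir`, that uses the Drive folder when the mount point exists and a local folder otherwise.

```python
import os


def resolve_checkpoint_dir(drive_root: str, fallback: str = "checkpoints") -> str:
    """Return a checkpoint folder under drive_root if it exists, else fallback."""
    if os.path.isdir(drive_root):
        # Same folder name as the Drive cell above
        target = os.path.join(drive_root, "archimedes_checkpoints")
    else:
        # Drive is not mounted: keep checkpoints on the local runtime disk
        target = fallback
    os.makedirs(target, exist_ok=True)
    return target
```

On Colab this could be called as `resolve_checkpoint_dir('/content/drive/MyDrive')`. Checkpoints written to the local fallback are lost when the runtime is recycled.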

## 4. Quick Test - Model & MCTS

In [None]:
# Test model creation
from model import ChessResNet, AlphaZeroEncoder
import chess

print("Creating model...")
model = ChessResNet(hidden_dim=256, num_layers=4, num_heads=8)
encoder = AlphaZeroEncoder()

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# Test forward pass
board = chess.Board()
data = encoder.board_to_graph(board)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
model.eval()  # inference mode: disables dropout, uses running batch-norm statistics
data = data.to(device)

with torch.no_grad():
    policy, value, aux = model(data)

print(f"\nPolicy shape: {policy.shape}")
print(f"Value: {value.item():.3f}")
print("\n✅ Model test passed!")

In [None]:
# Test MCTS
from mcts import MCTS
import time

print("Testing MCTS...")
mcts = MCTS(model, encoder, num_simulations=100)

board = chess.Board()
start = time.time()
best_move, stats = mcts.search(board)
elapsed = time.time() - start

print(f"\nBest move: {best_move}")
print(f"Search time: {elapsed:.2f}s")
print(f"Nodes per second: {stats['nodes_per_second']:.0f}")
print(f"Max depth: {stats['max_search_depth']}")
print("\n✅ MCTS test passed!")

## 5. Start Training

In [None]:
# Training configuration
EPOCHS = 50
GAMES_PER_EPOCH = 20  # Reduced for Colab
BATCH_SIZE = 16
LEARNING_RATE = 0.001

# Use Drive checkpoint directory if mounted, otherwise local
try:
    CHECKPOINT_DIR = checkpoint_dir
except NameError:  # the Drive cell was skipped, so checkpoint_dir is undefined
    CHECKPOINT_DIR = 'checkpoints'

print("Training Configuration:")
print(f"  Epochs: {EPOCHS}")
print(f"  Games per epoch: {GAMES_PER_EPOCH}")
print(f"  Batch size: {BATCH_SIZE}")
print(f"  Learning rate: {LEARNING_RATE}")
print(f"  Checkpoint dir: {CHECKPOINT_DIR}")
print(f"  Device: {device}")

In [None]:
# Start training (run in background)
import subprocess
import sys
import threading

def run_training():
    # Pass arguments as a list (no shell) so paths containing spaces work
    cmd = [sys.executable, "train_end_to_end.py",
           "--epochs", str(EPOCHS),
           "--games-per-epoch", str(GAMES_PER_EPOCH),
           "--batch-size", str(BATCH_SIZE),
           "--lr", str(LEARNING_RATE),
           "--checkpoint-dir", CHECKPOINT_DIR]
    subprocess.run(cmd)

# Start training in background thread
training_thread = threading.Thread(target=run_training, daemon=True)
training_thread.start()

print("üöÄ Training started in background!")
print("You can now start the dashboard in the next cell.")
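
A background process started this way gives no obvious place to read its output. One possible alternative, sketched here with a hypothetical `launch_logged` helper, starts the command without blocking and sends stdout and stderr to a log file.

```python
import subprocess
import sys


def launch_logged(cmd, log_path):
    """Start cmd without blocking; write its stdout and stderr to log_path."""
    with open(log_path, "w") as log_file:
        proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    # The child process holds its own copy of the file handle, so closing
    # ours here does not interrupt logging.
    return proc


# Stand-in command; the training script and its flags would go here instead.
demo = launch_logged([sys.executable, "-c", "print('training stub')"], "demo_train.log")
demo.wait()
```

`proc.poll()` returns `None` while the process is still running, and the log file can be inspected at any time from another cell.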

## 6. Launch Dashboard with Ngrok

In [None]:
# Setup ngrok authentication
# Get your auth token from: https://dashboard.ngrok.com/get-started/your-authtoken
NGROK_AUTH_TOKEN = "YOUR_NGROK_TOKEN_HERE"  # Replace with your token

from pyngrok import ngrok, conf
import os

# Set auth token
if NGROK_AUTH_TOKEN != "YOUR_NGROK_TOKEN_HERE":
    ngrok.set_auth_token(NGROK_AUTH_TOKEN)
    print("✅ Ngrok authenticated")
else:
    print("⚠️ Please set your ngrok auth token above")
    print("Get it from: https://dashboard.ngrok.com/get-started/your-authtoken")
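
Hard-coding the token in a notebook makes it easy to share by accident. As a sketch, the token could instead be read from an environment variable. The variable name `NGROK_AUTH_TOKEN` and the helper `load_ngrok_token` are assumptions for illustration, not part of pyngrok.

```python
import os

PLACEHOLDER = "YOUR_NGROK_TOKEN_HERE"


def load_ngrok_token(env=None, var_name="NGROK_AUTH_TOKEN"):
    """Return the token from the environment, or None if unset or a placeholder."""
    env = os.environ if env is None else env
    token = env.get(var_name, "").strip()
    if not token or token == PLACEHOLDER:
        return None
    return token


token = load_ngrok_token()
print("token found" if token else "no token configured")
```

A non-`None` result could then be passed to `ngrok.set_auth_token` as in the cell above.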

In [None]:
# Launch Streamlit dashboard
!streamlit run dashboard.py --server.port 8501 &>/dev/null &

# Wait for Streamlit to start
import time
time.sleep(5)

# Create ngrok tunnel; connect() returns a tunnel object, so read its URL
public_url = ngrok.connect(8501).public_url

print("\n" + "="*60)
print("🎉 DASHBOARD IS LIVE!")
print("="*60)
print("\n📊 Access your dashboard at:")
print(f"\n🔗 {public_url}")
print("\n" + "="*60)
print("\nThe dashboard will show:")
print("  • Live training metrics")
print("  • MCTS performance")
print("  • Chess-specific stats")
print("  • Hardware utilization")
print("  • Play vs AI interface")
print("  • Position analysis")
print("\nKeep the runtime connected to maintain the tunnel!")
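
A fixed five-second sleep may be too short on a slow runtime and wasteful on a fast one. A rough alternative is to poll the port until it accepts connections. `wait_for_port` below is a hypothetical helper built only on the standard library.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=30.0):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Nothing is listening yet; retry after a short pause
            time.sleep(0.5)
    return False
```

With this in place, the tunnel would only be opened if `wait_for_port(8501)` returns `True`.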

## 7. Monitor Training Progress

In [None]:
# Quick metrics check
from metrics import MetricsLogger
import pandas as pd

logger = MetricsLogger("training_logs.db")

# Get latest training metrics
metrics = logger.get_latest_metrics('training_metrics', limit=5)

if metrics:
    df = pd.DataFrame(metrics)
    print("\nüìä Latest Training Metrics:")
    print(df[['epoch', 'loss_total', 'loss_policy', 'loss_value', 'accuracy_top1']].to_string(index=False))
else:
    print("No metrics available yet. Training may still be starting...")

logger.close()
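
Beyond printing the latest rows, it is often useful to know which epoch had the lowest loss. The sketch below assumes the metrics come back as a list of dicts with the column names used above (`epoch`, `loss_total`); the actual return type of `get_latest_metrics` is not confirmed here, and `best_epoch` is a hypothetical helper.

```python
from typing import Optional


def best_epoch(rows, key: str = "loss_total") -> Optional[dict]:
    """Return the row with the lowest value for key, skipping rows without it."""
    valid = [row for row in rows if row.get(key) is not None]
    if not valid:
        return None
    return min(valid, key=lambda row: row[key])


# Illustrative rows only, not real training results
sample = [
    {"epoch": 1, "loss_total": 2.0},
    {"epoch": 2, "loss_total": 1.5},
    {"epoch": 3},
]
print(best_epoch(sample))
```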

## 8. Play Against the AI

In [None]:
# Quick play interface (text-based)
from model import ChessResNet, AlphaZeroEncoder
from mcts import MCTS
import chess
import torch

# Load latest checkpoint
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ChessResNet().to(device)
encoder = AlphaZeroEncoder()

try:
    checkpoint = torch.load(f"{CHECKPOINT_DIR}/latest_checkpoint.pt", map_location=device)
    model.load_state_dict(checkpoint['model_state_dict'])
    print(f"✅ Loaded checkpoint from epoch {checkpoint['epoch']}")
except FileNotFoundError:
    print("⚠️ No checkpoint found, using untrained model")

model.eval()
mcts = MCTS(model, encoder, num_simulations=400)

# Play a game
board = chess.Board()
print("\n" + str(board) + "\n")

while not board.is_game_over():
    if board.turn == chess.WHITE:
        # Human move
        move_uci = input("Your move (e.g., e2e4): ")
        try:
            move = chess.Move.from_uci(move_uci)
            if move in board.legal_moves:
                board.push(move)
            else:
                print("Illegal move!")
                continue
        except ValueError:  # raised by Move.from_uci on malformed input
            print("Invalid format!")
            continue
    else:
        # AI move
        print("AI thinking...")
        ai_move, stats = mcts.search(board, add_noise=False)
        board.push(ai_move)
        print(f"AI plays: {ai_move}")
    
    print("\n" + str(board) + "\n")
    
    if board.fullmove_number > 50:
        print("Game too long, stopping...")
        break

print(f"\nGame over! Result: {board.result()}")
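
User input in the loop above is passed to the parser as typed, so stray spaces or uppercase letters are rejected. A small pre-check, sketched with a hypothetical `normalize_uci` helper, can tidy the string first. It only checks the shape of a standard UCI move; legality still has to be checked against `board.legal_moves`.

```python
import re

# Source square, target square, optional promotion piece (e.g. e7e8q)
UCI_MOVE = re.compile(r"[a-h][1-8][a-h][1-8][qrbn]?")


def normalize_uci(text):
    """Return a cleaned UCI move string, or None if the format is wrong."""
    candidate = text.strip().lower()
    return candidate if UCI_MOVE.fullmatch(candidate) else None


for raw in [" E2E4 ", "e7e8q", "e2-e4"]:
    print(repr(raw), "->", normalize_uci(raw))
```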

## 9. Download Checkpoints

In [None]:
# Download checkpoints to local machine
from google.colab import files
import os

checkpoint_files = [
    f"{CHECKPOINT_DIR}/latest_checkpoint.pt",
    f"{CHECKPOINT_DIR}/best_checkpoint.pt",
    "training_logs.db"
]

for file_path in checkpoint_files:
    if os.path.exists(file_path):
        print(f"Downloading {file_path}...")
        files.download(file_path)
    else:
        print(f"File not found: {file_path}")

print("\n✅ Download complete!")
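
Downloading several files triggers one browser prompt per file. As a sketch, the files that exist could be bundled into a single archive first. `bundle_files` is a hypothetical helper using the standard `zipfile` module.

```python
import os
import zipfile


def bundle_files(paths, archive_path):
    """Zip the files in paths that exist; return the list actually added."""
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            if os.path.isfile(path):
                # Store by base name so the archive has a flat layout
                archive.write(path, arcname=os.path.basename(path))
                added.append(path)
    return added
```

The resulting archive could then be handed to `files.download` in a single call.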

## 10. Cleanup

In [None]:
# Stop ngrok tunnel
ngrok.kill()
print("✅ Ngrok tunnel closed")

# Note: Training will continue in background until you stop the runtime

---

## 📚 Additional Resources

- **GitHub**: [Your Repository URL]
- **Documentation**: See README.md
- **Issues**: Report bugs on GitHub

## 💡 Tips

1. **Save checkpoints to Drive** to persist across sessions
2. **Use GPU runtime** for 10-20x faster training
3. **Monitor the dashboard** for real-time metrics
4. **Adjust hyperparameters** in the training configuration cell
5. **Export games** regularly for analysis

## ⚠️ Important Notes

- Free-tier Colab sessions run for at most about 12 hours and may disconnect earlier when idle
- GPU usage is limited on free tier
- Save checkpoints frequently to avoid data loss
- The dashboard URL changes each time you restart ngrok
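
A session that disconnects in the middle of a save can leave a truncated checkpoint. One common mitigation, sketched below under the assumption that the checkpoint is available as bytes, is to write to a temporary file and then rename it over the target. `atomic_write` is a hypothetical helper, and the rename may not be atomic on a mounted Drive folder.

```python
import os
import tempfile


def atomic_write(path, payload: bytes) -> None:
    """Write payload to path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        # Readers see either the old file or the new one, never a partial file
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With PyTorch, the bytes could come from calling `torch.save` on an `io.BytesIO` buffer and passing `buffer.getvalue()` as the payload.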

---

**Happy Training! ♟️🚀**