# Symbol Detection - Resume Training on Google Colab

Clean notebook optimized for **resuming training** from existing checkpoints. Designed to avoid the dependency issues of the original notebook.

## ⚡ Quick Start:
1. Run sections 1-3 (setup & install)
2. **⚠️ Restart Runtime** (Runtime → Restart session)
3. After restart: Run sections 1-2 again, **skip section 3**, continue from section 4
4. Configure resume settings in section 6
5. Run sections 7-8 (trainer initialization and training)

**Note**: The runtime restart after installation is REQUIRED to fix numpy/torchvision compatibility.

## 1. Environment Detection and Setup

In [6]:
import sys
import os
import torch

# Detect environment
IN_COLAB = 'google.colab' in sys.modules

print("=" * 70)
print("ENVIRONMENT DETECTION")
print("=" * 70)
print(f"Running on: {'Google Colab' if IN_COLAB else 'Local Machine'}")
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("GPU: Not available (will use CPU - training will be slow!)")

print("=" * 70)

ENVIRONMENT DETECTION
Running on: Google Colab
Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
PyTorch version: 2.9.0+cu128
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
GPU Memory: 79.3 GB


In [7]:
# Mount Google Drive if on Colab
if IN_COLAB:
    try:
        from google.colab import drive
        drive.mount('/content/drive', force_remount=False)
        print("✓ Google Drive mounted at /content/drive")
    except Exception as e:
        print(f"⚠ Could not mount Google Drive: {e}")
        print("Proceeding without Drive - checkpoints will save locally")
else:
    print("Running locally - not attempting Drive mount")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✓ Google Drive mounted at /content/drive


## 2. Clone/Setup Repository

In [8]:
import subprocess

if IN_COLAB:
    repo_path = '/content/symbol-detection'
    if not os.path.exists(repo_path):
        print("Cloning repository...")
        subprocess.run(['git', 'clone', 'https://github.com/BhanukaDev/symbol-detection.git', repo_path], check=True)
    else:
        print("Repository already exists, pulling latest changes...")
        os.chdir(repo_path)
        subprocess.run(['git', 'pull'], check=True)
else:
    # Local development - repository should already be present
    repo_path = os.path.dirname(os.path.dirname(os.path.abspath('.')))
    
print(f"✓ Repository path: {repo_path}")
os.chdir(repo_path)

Repository already exists, pulling latest changes...
✓ Repository path: /content/symbol-detection


## 3. Dependency Installation with Version Lock

**This cell fixes the numpy/PyTorch compatibility issue that broke the previous notebook.**

In [9]:
import subprocess
import sys

if IN_COLAB:
    print("Installing dependencies for Google Colab...")
    print("=" * 70)

    def pip_install(*args):
        # Run pip with the kernel's interpreter; check=True stops on the first failure,
        # so the success message below is printed only when every install worked
        subprocess.run([sys.executable, "-m", "pip", "install", "--no-cache-dir", *args], check=True)

    # Step 1: Install numpy first with exact version
    print("Step 1: Installing numpy==1.26.4 (locked version)...")
    pip_install("--force-reinstall", "numpy==1.26.4")

    # Step 2: Install PyTorch and dependencies
    print("\nStep 2: Installing PyTorch ecosystem...")
    pip_install("torch", "torchvision", "torchmetrics", "pycocotools", "timm")

    # Step 3: Install local packages (relative paths resolve against the new cwd)
    print("\nStep 3: Installing local packages...")
    os.chdir(f'{repo_path}/python')
    pip_install("-e", "./floor-grid", "--no-deps")
    pip_install("-e", "./effects", "--no-deps")
    pip_install("-e", ".", "--no-deps")

    print("\n" + "=" * 70)
    print("✓ Dependencies installed successfully")
    print("=" * 70)
    print("\n⚠️  IMPORTANT: You MUST restart the runtime now!")
    print("Go to: Runtime → Restart session")
    print("\nThis is CRITICAL - torchvision needs to reload with the correct numpy version.")
    print("After restart, skip this cell and run from the next cell.")
    print("=" * 70)
else:
    print("Running locally - skipping Colab-specific installation")
    print("Make sure you have dependencies installed: pip install -e ./python[dev]")

Installing dependencies for Google Colab...
Step 1: Installing numpy==1.26.4 (locked version)...

Step 2: Installing PyTorch ecosystem...

Step 3: Installing local packages...

✓ Dependencies installed successfully

⚠️  IMPORTANT: You MUST restart the runtime now!
Go to: Runtime → Restart session

This is CRITICAL - torchvision needs to reload with the correct numpy version.
After restart, skip this cell and run from the next cell.


### ⚠️ STOP HERE - Restart Runtime Required!

After running the cell above, you **MUST** restart the runtime:
1. Go to: **Runtime → Restart session**
2. After restart, **re-run sections 1-2** (they define `IN_COLAB` and `repo_path`), **skip section 3**, and continue from section 4 (Package Verification) below

This restart is necessary for torchvision to reload with the correct numpy version.

## 4. Package Import Verification

In [10]:
import sys
import numpy

print("=" * 70)
print("PACKAGE VERIFICATION")
print("=" * 70)

# Check numpy version first
print(f"Numpy version: {numpy.__version__}")
if not numpy.__version__.startswith("1.26"):
    print("❌ Wrong numpy version detected!")
    print("You MUST restart the runtime (Runtime → Restart session) after installing dependencies.")
    print("Then skip the installation cell and run from here.")
    raise RuntimeError(f"Numpy {numpy.__version__} detected, need 1.26.x. Restart runtime!")

# Ensure Python path includes the source directory
python_path = f'{repo_path}/python/src'
if python_path not in sys.path:
    sys.path.insert(0, python_path)

print("Verifying package imports...")
try:
    from symbol_detection.training import Trainer, CIoULoss
    from symbol_detection.training.data import COCODetectionDataset
    print("✓ symbol_detection.training")
    print("✓ COCODetectionDataset (training.data)")
except ImportError as e:
    print(f"✗ Import error: {e}")
    print("\nDid you restart the runtime after installing dependencies?")
    print("Go to: Runtime → Restart session, then skip installation and re-run from here")
    raise

print("\n✓ All required packages imported successfully")
print("=" * 70)

PACKAGE VERIFICATION
Numpy version: 1.26.4
Verifying package imports...
✓ symbol_detection.training
✓ COCODetectionDataset (training.data)

✓ All required packages imported successfully


## 5. Dataset and Checkpoint Path Configuration

In [None]:
from pathlib import Path

# Fast storage toggle: use /content local disk (much faster than Drive)
USE_LOCAL_FAST = IN_COLAB

DRIVE_MOUNTED = IN_COLAB and os.path.exists('/content/drive/MyDrive')

if IN_COLAB and USE_LOCAL_FAST:
    dataset_dir = Path('/content/symbol-detection/dataset')
    checkpoints_dir = Path('/content/symbol-detection/checkpoints')
    print("Using LOCAL /content storage (fast). Copy data/checkpoints here if needed.")
elif IN_COLAB and DRIVE_MOUNTED:
    dataset_dir = Path('/content/drive/MyDrive/symbol-detection/dataset')
    checkpoints_dir = Path('/content/drive/MyDrive/symbol-detection/checkpoints')
    print("Using Google Drive storage (slower I/O)")
elif IN_COLAB:
    dataset_dir = Path('/content/symbol-detection/dataset')
    checkpoints_dir = Path('/content/symbol-detection/checkpoints')
    print("Drive not mounted - using local /content storage")
else:
    dataset_dir = Path(repo_path) / 'python' / 'dataset'
    checkpoints_dir = Path(repo_path) / 'python' / 'checkpoints'
    print("Using local storage")

# Create directories
dataset_dir.mkdir(parents=True, exist_ok=True)
checkpoints_dir.mkdir(parents=True, exist_ok=True)

print(f"\nDataset directory: {dataset_dir}")
print(f"Checkpoints directory: {checkpoints_dir}")

# Check for existing checkpoints
checkpoints = list(checkpoints_dir.glob('*.pth'))
if checkpoints:
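    print(f"\n✓ Found {len(checkpoints)} existing checkpoints:")
    # Sort by modification time so the 5 most recent are shown
    # (a plain name sort would place model_epoch_50 after model_epoch_340)
    for ckpt in sorted(checkpoints, key=lambda p: p.stat().st_mtime)[-5:]:
        size_mb = ckpt.stat().st_size / (1024 * 1024)
        print(f"  - {ckpt.name} ({size_mb:.1f} MB)")
else:
    print("\n⚠ No checkpoints found in this location - copy them here if needed")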

Using Google Drive for storage

Dataset directory: /content/drive/MyDrive/symbol-detection/dataset
Checkpoints directory: /content/drive/MyDrive/symbol-detection/checkpoints

✓ Found 20 existing checkpoints:
  - model_epoch_330.pth (315.0 MB)
  - model_epoch_340.pth (315.0 MB)
  - model_epoch_50.pth (315.0 MB)
  - model_epoch_80.pth (315.0 MB)
  - model_epoch_final.pth (315.0 MB)
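
When `USE_LOCAL_FAST` is enabled, the cell above looks for checkpoints under `/content` and asks you to copy them there if needed. A minimal sketch of that copy step follows. The helper name `copy_checkpoints` is hypothetical and not part of the repository; the paths in the usage comment are the ones this notebook prints for Drive and local storage.

```python
import shutil
from pathlib import Path


def copy_checkpoints(src_dir, dst_dir, pattern="*.pth"):
    """Copy files matching `pattern` from src_dir to dst_dir.

    Hypothetical helper for illustration. Files that already exist in
    dst_dir with the same size are skipped. Returns the copied paths.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.glob(pattern)):
        dst = dst_dir / src.name
        # Cheap "already copied" check: same name and same size
        if dst.exists() and dst.stat().st_size == src.stat().st_size:
            continue
        shutil.copy2(src, dst)  # copy2 also tries to preserve the modification time
        copied.append(dst)
    return copied


# Example usage on Colab:
# copy_checkpoints('/content/drive/MyDrive/symbol-detection/checkpoints',
#                  '/content/symbol-detection/checkpoints')
```

`shutil.copy2` is used rather than `shutil.copy` because the auto-detection in section 6 picks the checkpoint with the newest modification time, so preserving timestamps keeps that choice stable after the copy.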


## 6. Training Configuration

In [31]:
import gc
import torch

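# CUDA memory optimization
if torch.cuda.is_available():
    # Note: the allocator reads this setting when it initializes, so it may
    # have no effect if CUDA memory was already allocated in this session
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
    gc.collect()              # drop unreachable Python objects first
    torch.cuda.empty_cache()  # then release the cached blocks they held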

# Training configuration
TRAINING_CONFIG = {
    'num_epochs': 50,          # Total epochs to train (can resume and extend)
    'batch_size': 8,            # Reduced to prevent OOM errors
    'learning_rate': 0.005,
    'num_classes': 7,           # Electrical symbols
    'use_ciou_loss': True,
    'eval_every_n': 10,
    'enable_ap_eval': True,
}

# Resume configuration (KEY PART)
RESUME_TRAINING = True                              # Set to False to start from scratch
RESUME_FROM_CHECKPOINT = None                       # Auto-detect latest or specify manually
EXTEND_TRAINING_TO = 400                            # Extend training to this epoch number

print("=" * 70)
print("TRAINING CONFIGURATION")
print("=" * 70)
for key, value in TRAINING_CONFIG.items():
    print(f"{key:20s}: {value}")

print(f"\nResume training: {RESUME_TRAINING}")
if RESUME_FROM_CHECKPOINT is None and RESUME_TRAINING:
    # Auto-detect latest checkpoint
    checkpoints = list(checkpoints_dir.glob('*.pth'))
    if checkpoints:
        RESUME_FROM_CHECKPOINT = max(checkpoints, key=lambda x: x.stat().st_mtime)
        print(f"Auto-detected: {RESUME_FROM_CHECKPOINT.name}")
    
print(f"Extend to epoch: {EXTEND_TRAINING_TO}")
print("=" * 70)

TRAINING CONFIGURATION
num_epochs          : 50
batch_size          : 8
learning_rate       : 0.005
num_classes         : 7
use_ciou_loss       : True
eval_every_n        : 10
enable_ap_eval      : True

Resume training: True
Auto-detected: model_epoch_340.pth
Extend to epoch: 400
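
Auto-detection above picks the most recently modified `.pth` file, which can differ from the highest epoch if files were copied or re-saved. As an alternative sketch, assuming checkpoints keep the `model_epoch_<N>.pth` naming seen in the listing in section 5, the epoch number can be parsed from the filename instead. `latest_by_epoch` is a hypothetical helper, not part of `symbol_detection`.

```python
import re
from pathlib import Path

# Assumed naming scheme, taken from the checkpoint listing in section 5
EPOCH_PATTERN = re.compile(r"model_epoch_(\d+)\.pth$")


def latest_by_epoch(paths):
    """Return the path with the highest epoch number, or None if none match.

    Files without a numeric epoch (e.g. model_epoch_final.pth) are ignored.
    """
    numbered = []
    for path in map(Path, paths):
        match = EPOCH_PATTERN.search(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    return max(numbered, key=lambda item: item[0])[1] if numbered else None


names = ["model_epoch_330.pth", "model_epoch_340.pth", "model_epoch_50.pth",
         "model_epoch_80.pth", "model_epoch_final.pth"]
print(latest_by_epoch(names))  # model_epoch_340.pth
```

In this notebook it could be used as `RESUME_FROM_CHECKPOINT = latest_by_epoch(checkpoints_dir.glob('*.pth'))` before the auto-detect block runs.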


## 7. Initialize Trainer and Load Checkpoint

In [32]:
from symbol_detection.training import Trainer

# Initialize trainer (with fresh model)
print("Initializing Trainer...")
trainer = Trainer(
    dataset_dir=str(dataset_dir),
    output_dir=str(checkpoints_dir),
    num_classes=TRAINING_CONFIG['num_classes'],
    batch_size=TRAINING_CONFIG['batch_size'],
    learning_rate=TRAINING_CONFIG['learning_rate'],
    num_epochs=TRAINING_CONFIG['num_epochs'],
    device='cuda' if torch.cuda.is_available() else 'cpu',
    use_ciou_loss=TRAINING_CONFIG['use_ciou_loss'],
    eval_every_n=TRAINING_CONFIG['eval_every_n'],
    enable_ap_eval=TRAINING_CONFIG['enable_ap_eval'],
)

print(f"✓ Trainer initialized on device: {trainer.device}")
print("✓ Model: ResNet50+FPN")
print(f"✓ CIoU Loss: {'enabled' if TRAINING_CONFIG['use_ciou_loss'] else 'disabled'}")

# Load checkpoint if resuming
if RESUME_TRAINING and RESUME_FROM_CHECKPOINT:
    print(f"\nLoading checkpoint: {RESUME_FROM_CHECKPOINT.name}")
    try:
        trainer.load_checkpoint(str(RESUME_FROM_CHECKPOINT), resume_training=True)
        print(f"✓ Loaded epoch: {trainer.start_epoch}")
        print("✓ Restored optimizer state")
        print("✓ Restored training history")
        
        # Extend training duration if needed
        if EXTEND_TRAINING_TO > trainer.num_epochs:
            print(f"\nExtending training from {trainer.num_epochs} to {EXTEND_TRAINING_TO} epochs")
            trainer.num_epochs = EXTEND_TRAINING_TO
            
    except Exception as e:
        print(f"✗ Failed to load checkpoint: {e}")
        print("Starting fresh training instead...")
        RESUME_TRAINING = False
elif RESUME_TRAINING:
    print("No checkpoint found - starting fresh training")
else:
    print("Starting fresh training (resume=False)")

Initializing Trainer...
Using device: cuda
✓ Trainer initialized on device: cuda
✓ Model: ResNet50+FPN
✓ CIoU Loss: enabled

Loading checkpoint: model_epoch_340.pth
Loaded checkpoint: /content/drive/MyDrive/symbol-detection/checkpoints/model_epoch_340.pth
✓ Loaded epoch: 340
✓ Restored optimizer state
✓ Restored training history

Extending training from 50 to 400 epochs


## 8. Execute Training

In [None]:
import traceback

print("=" * 70)
print("STARTING TRAINING")
print("=" * 70)

try:
    # Clear any remaining GPU memory
    if torch.cuda.is_available():
        gc.collect()              # free unreachable tensors first
        torch.cuda.empty_cache()  # then release the cached blocks they held
    
    # Execute training
    trainer.train()
    
    print("\n" + "=" * 70)
    print("✓ TRAINING COMPLETED SUCCESSFULLY")
    print("=" * 70)
    
except Exception as e:
    print("\n" + "=" * 70)
    print("✗ TRAINING FAILED")
    print("=" * 70)
    print(f"Error: {e}")
    traceback.print_exc()
except KeyboardInterrupt:
    print("\n\n⚠ Training interrupted by user")
    print("Checkpoint may have been saved - you can resume from the latest one")

STARTING TRAINING
Training for 400 epochs (starting from epoch 341)...
Training samples: 800, Validation samples: 200
AP evaluation every 10 epochs

Epoch 341/400 [Training...]

## 9. Visualize Training Metrics

In [None]:
import matplotlib.pyplot as plt
import json

metrics_file = checkpoints_dir / 'metrics.json'

if metrics_file.exists():
    with open(metrics_file, 'r') as f:
        metrics = json.load(f)
    
    # Plot losses
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Loss plot
    axes[0].plot(metrics['train_losses'], label='Train Loss', marker='o', markersize=3)
    axes[0].plot(metrics['val_losses'], label='Val Loss', marker='s', markersize=3)
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Loss')
    axes[0].set_title('Training Progress')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # AP metrics plot (if available)
    if metrics.get('ap_history') and len(metrics['ap_history']) > 0:
        ap_epochs = [x['epoch'] for x in metrics['ap_history']]
        map_values = [x['mAP'] for x in metrics['ap_history']]
        ap50_values = [x['AP50'] for x in metrics['ap_history']]
        ap75_values = [x['AP75'] for x in metrics['ap_history']]
        
        axes[1].plot(ap_epochs, map_values, label='mAP', marker='o')
        axes[1].plot(ap_epochs, ap50_values, label='AP50', marker='s')
        axes[1].plot(ap_epochs, ap75_values, label='AP75', marker='^')
        axes[1].set_xlabel('Epoch')
        axes[1].set_ylabel('AP Score')
        axes[1].set_title('AP Metrics')
        axes[1].legend()
        axes[1].grid(True, alpha=0.3)
    else:
        axes[1].text(0.5, 0.5, 'No AP metrics yet', 
                     ha='center', va='center', fontsize=12)
        axes[1].set_title('AP Metrics')
    
    plt.tight_layout()
    plt.savefig(checkpoints_dir / 'training_curve.png', dpi=150)
    plt.show()
    
    # Print summary
    print("\n" + "=" * 70)
    print("TRAINING SUMMARY")
    print("=" * 70)
    print(f"Final train loss: {metrics['train_losses'][-1]:.4f}")
    print(f"Final val loss: {metrics['val_losses'][-1]:.4f}")
    
    if metrics.get('ap_history') and len(metrics['ap_history']) > 0:
        latest_ap = metrics['ap_history'][-1]
        print(f"\nLatest AP Metrics (Epoch {latest_ap['epoch']}):")
        print(f"  mAP:  {latest_ap['mAP']:.3f}")
        print(f"  AP50: {latest_ap['AP50']:.3f}")
        print(f"  AP75: {latest_ap['AP75']:.3f}")
        print(f"  mAR:  {latest_ap['mAR']:.3f}")
    
    print("=" * 70)
else:
    print("No metrics file found - training may not have started yet")
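
The plotting cell reads `train_losses`, `val_losses`, and an `ap_history` list whose entries carry `epoch`, `mAP`, `AP50`, `AP75`, and `mAR`. Assuming that layout (inferred from the keys accessed above, not from the `Trainer` source), a small sketch for finding the best evaluated epoch could look like this. `best_ap_entry` is a hypothetical helper and the sample values are made up for illustration only.

```python
def best_ap_entry(metrics, key="mAP"):
    """Return the ap_history entry with the highest `key`, or None if there is none."""
    history = metrics.get("ap_history") or []
    return max(history, key=lambda entry: entry[key]) if history else None


# Illustrative data only; real values come from metrics.json
sample_metrics = {
    "train_losses": [0.9, 0.7, 0.6],
    "val_losses": [1.0, 0.8, 0.75],
    "ap_history": [
        {"epoch": 10, "mAP": 0.41, "AP50": 0.70, "AP75": 0.43, "mAR": 0.50},
        {"epoch": 20, "mAP": 0.47, "AP50": 0.74, "AP75": 0.50, "mAR": 0.55},
        {"epoch": 30, "mAP": 0.45, "AP50": 0.73, "AP75": 0.48, "mAR": 0.54},
    ],
}

best = best_ap_entry(sample_metrics)
print(f"Best mAP {best['mAP']:.3f} at epoch {best['epoch']}")
```

With the real file, pass the `metrics` dictionary loaded in the cell above. The best epoch can then guide which checkpoint to keep or resume from, provided a checkpoint was saved for that epoch.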

## 📋 Usage Instructions

### First Time Setup (Google Colab):
1. **Run sections 1-3** (Environment and Drive Mount, Clone Repo, Install Dependencies)
2. **⚠️ CRITICAL: Restart Runtime** (Runtime → Restart session)
3. **After restart, run sections 1-2 again** (to re-establish environment)
4. **Skip section 3** (dependencies already installed)
5. **Continue from section 4** (Package Verification) and run all remaining cells

### To Resume Training:
1. After setup, in section 6 ensure:
   - `RESUME_TRAINING = True`
   - `RESUME_FROM_CHECKPOINT = None` (auto-detects latest) or a `Path` to a specific checkpoint (the loading cell calls `.name` on it)
   - `EXTEND_TRAINING_TO = 400` (or your desired total epochs)
2. Run sections 7-9
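
`EXTEND_TRAINING_TO` is a total epoch count, not a number of additional epochs. In the run recorded above, a checkpoint from epoch 340 extended to 400 resumes at epoch 341. A minimal sketch of that arithmetic, using a hypothetical helper name and assuming the trainer runs from the epoch after the loaded one through the target epoch, as the training log suggests:

```python
def epochs_remaining(loaded_epoch: int, extend_to: int) -> int:
    """Epochs still to run when resuming after `loaded_epoch` (never negative)."""
    return max(0, extend_to - loaded_epoch)


print(epochs_remaining(340, 400))  # 60 (epochs 341 through 400)
```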

### To Start Fresh:
1. In section 6, set `RESUME_TRAINING = False`
2. Run sections 7-9

### Why Restart Runtime?
The numpy/torchvision compatibility issue you encountered happens because torchvision's C extensions need to be loaded with the correct numpy version. Installing numpy isn't enough - you must restart to reload the Python interpreter.
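
One way to see the stale-module effect is to compare the version of a module already loaded in the interpreter with the version installed on disk. The sketch below uses only the standard library. `restart_needed` is a hypothetical helper, and it assumes the module exposes `__version__` (numpy does).

```python
import sys
from importlib import metadata


def restart_needed(module_name, dist_name=None):
    """True if `module_name` is imported and its version differs from the installed one."""
    module = sys.modules.get(module_name)
    if module is None:
        return False  # not imported yet: the next import loads the installed version
    try:
        installed = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        return False  # nothing installed under that name to compare against
    loaded = getattr(module, "__version__", installed)
    return loaded != installed


print(f"numpy restart needed: {restart_needed('numpy')}")
```

If this reports `True` right after the installation cell, the interpreter is still holding the old numpy, which is the situation the restart resolves.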

### Key Features:
- **Auto-resume**: Automatically finds and loads the latest checkpoint
- **Metrics preserved**: Training history is restored from metrics.json
- **Flexible extension**: Easily extend training duration
- **Memory optimized**: Uses expandable CUDA segments and automatic cleanup
- **Dependency fixed**: Locks numpy==1.26.4 to prevent PyTorch conflicts