# 🤟 Sign-to-Text Training Notebook (GPU Version)

This notebook trains the **Hybrid CTC-Attention Model V2** with dual decoders (GRU + Transformer) for sign language to text translation.

## 🔧 Setup Instructions

1. **Clone the repository** - Uses the `fixed` branch with all NaN-safe fixes
2. **Configure paths** - Set `DATA_DIR` to your extracted landmarks folder
3. **Adjust batch size** - Based on your GPU VRAM (see config cell)
4. **Run training** - Checkpoints are saved automatically

## 📊 Model Features

- **Dual Decoder**: GRU (fast) + Transformer (accurate)
- **CTC + Attention Loss**: Hybrid training objective
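- **NaN-Safe**: Guards against NaN losses from known edge cases (sequences too short for CTC, fully masked attention rows)
- **FP16 Mixed Precision**: Faster training, less VRAM
- **Gradient Checkpointing**: Recomputes activations in the backward pass, trading extra compute for lower memory (roughly 2x the batch size in the same VRAM)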
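
The hybrid objective mixes the CTC loss with the attention losses of both decoders. The exact formula lives in `HybridLoss` in `train_hybrid.py` and is not reproduced in this notebook; the sketch below assumes one common formulation, in which a CTC weight interpolates between the CTC loss and an attention loss that is itself a weighted sum of the GRU and Transformer losses (defaults 0.6/0.4, matching `gru_loss_weight` and `transformer_loss_weight` in the config cell). `hybrid_loss` is a hypothetical helper, not the repository's API.

```python
def hybrid_loss(ctc_loss: float, gru_loss: float, tf_loss: float,
                ctc_weight: float, gru_weight: float = 0.6,
                tf_weight: float = 0.4) -> float:
    """Combine CTC and dual-decoder attention losses (assumed formulation)."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    # Attention branch: weighted mix of the two decoders' losses.
    attention_loss = gru_weight * gru_loss + tf_weight * tf_loss
    # Interpolate between the CTC objective and the attention objective.
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss


# Early in training, with ctc_weight_start = 0.3:
print(hybrid_loss(ctc_loss=2.0, gru_loss=1.0, tf_loss=1.5, ctc_weight=0.3))
```

Because the config decays the CTC weight over training, the attention decoders gradually dominate the objective under this formulation.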

## 💾 Required Data

- **extracted_landmarks_v2/**: Preprocessed landmark files (.npy)
  - 204 features per frame: hands (126) + body (12) + mouth (40) + head_pose (12) + eyes (14)
  - With velocity + acceleration: 204 × 3 = 612 dims per frame
- **iSign_v1.1.csv**: Label mapping file
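
The 612-dim input is the 204 per-frame features stacked with their first and second temporal derivatives. The sketch below assumes simple frame-to-frame differences, with the first frame repeated so that velocity and acceleration start at zero and the sequence length is unchanged; the repository's extraction or dataset code may use a different derivative scheme. `add_motion_features` is a hypothetical helper.

```python
import numpy as np


def add_motion_features(frames: np.ndarray) -> np.ndarray:
    """Stack position, velocity and acceleration along the feature axis.

    frames:  (T, 204) array of per-frame landmark features.
    returns: (T, 612) array laid out as [position | velocity | acceleration].
    """
    if frames.ndim != 2:
        raise ValueError(f"expected a (T, F) array, got shape {frames.shape}")
    # Velocity: difference between consecutive frames. Prepending frame 0
    # makes its velocity zero and keeps the output length at T.
    velocity = np.diff(frames, axis=0, prepend=frames[:1])
    # Acceleration: the same difference applied to the velocity.
    acceleration = np.diff(velocity, axis=0, prepend=velocity[:1])
    return np.concatenate([frames, velocity, acceleration], axis=1)


seq = np.random.rand(50, 204).astype(np.float32)
print(add_motion_features(seq).shape)  # (50, 612)
```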

## 🖥️ GPU Memory Guide

| GPU VRAM | Batch Size | Grad Accum | Effective Batch |
|----------|------------|------------|-----------------|
| 8GB      | 12         | 2          | 24              |
| 12GB     | 16         | 2          | 32              |
| 16GB     | 24         | 2          | 48              |
| 24GB     | 32         | 2          | 64              |
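
The effective batch is `batch_size × gradient_accumulation`: gradients from several small batches are accumulated before each optimizer step, so a batch of 12 accumulated twice updates the weights approximately like a batch of 24. A small helper that picks the table row for a given card might look like this; `suggest_batch_config` is hypothetical, and its thresholds are the table's values rather than measured limits.

```python
# (minimum VRAM in GB, batch_size, gradient_accumulation), copied from the table.
GPU_MEMORY_GUIDE = [
    (24, 32, 2),
    (16, 24, 2),
    (12, 16, 2),
    (8, 12, 2),
]


def suggest_batch_config(vram_gb: float) -> dict:
    """Return the largest table row whose VRAM requirement fits in vram_gb."""
    for min_vram, batch_size, accum in GPU_MEMORY_GUIDE:
        if vram_gb >= min_vram:
            return {
                "batch_size": batch_size,
                "gradient_accumulation": accum,
                "effective_batch": batch_size * accum,
            }
    raise ValueError(f"{vram_gb:.1f} GB is below the 8 GB minimum in the guide")


print(suggest_batch_config(11.0))
```

In the notebook, `vram_gb` could be the same `total_memory / 1e9` value the GPU check cell prints.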

In [None]:
# ============================================================
# Sign-to-Text Training Notebook (College GPU Version)
# ============================================================
# This notebook trains the Hybrid CTC-Attention model with dual decoders
# Optimized for: Any CUDA GPU with 8GB+ VRAM
# ============================================================

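# Check GPU (nvidia-smi is missing on machines without the NVIDIA driver)
import subprocess
try:
    result = subprocess.run(['nvidia-smi'], capture_output=True, text=True)
    print(result.stdout)
except FileNotFoundError:
    print("nvidia-smi not found; is the NVIDIA driver installed?")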

import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Clone repository from the 'fixed' branch
# This contains all the NaN-safe fixes and v2 model updates
import os

REPO_DIR = "/home/user/kortex_5th_sem"  # Change this to your preferred location

if not os.path.exists(REPO_DIR):
    # Clone from 'fixed' branch specifically
    !git clone --branch fixed https://github.com/Surya-Narayan-M/kortex_5th_sem.git {REPO_DIR}
    print(f"✅ Cloned 'fixed' branch to {REPO_DIR}")
else:
    # Pull latest changes
    os.chdir(REPO_DIR)
    !git checkout fixed
    !git pull origin fixed
    print(f"✅ Updated 'fixed' branch in {REPO_DIR}")

os.chdir(f"{REPO_DIR}/sign_to_text")
print(f"Working directory: {os.getcwd()}")

In [None]:
# Install dependencies (if not already installed); matplotlib is used by the plotting cell
!pip install tqdm pandas numpy scipy mediapipe matplotlib --quiet

In [None]:
# ============================================================
# CONFIGURE PATHS - EDIT THESE FOR YOUR ENVIRONMENT
# ============================================================
import os
import sys
from pathlib import Path

# ========== IMPORTANT: SET YOUR PATHS ==========
# Option 1: point DATA_DIR at existing v2 extracted landmarks
# Option 2: run the optional extraction cell below (it updates DATA_DIR)
DATA_DIR = "/path/to/extracted_landmarks_v2"  # <-- CHANGE THIS


CSV_PATH = f"{REPO_DIR}/data/iSign_v1.1.csv"
CHECKPOINT_DIR = "./checkpoints_v2"

# Create checkpoint directory
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

# Add project to path
sys.path.insert(0, f"{REPO_DIR}/sign_to_text")
sys.path.insert(0, f"{REPO_DIR}/data")

# Verify paths
print("=" * 60)
print("PATH CONFIGURATION")
print("=" * 60)
print(f"Repository: {REPO_DIR}")
print(f"Data dir: {DATA_DIR}")
print(f"CSV path: {CSV_PATH}")
print(f"Checkpoint dir: {CHECKPOINT_DIR}")
print()

if os.path.exists(DATA_DIR):
    num_files = len([f for f in os.listdir(DATA_DIR) if f.endswith('.npy')])
    print(f"✅ Found {num_files} landmark files")
else:
    print(f"⚠️ Data directory not found: {DATA_DIR}")
    print("   You need to either:")
    print("   1. Upload extracted_landmarks_v2 folder, OR")
    print("   2. Run the landmark extraction cell below")
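
Before training, it can be worth scanning the landmark files for the inputs the NaN-safe code has to guard against. The sketch below assumes each `.npy` file holds a 2-D `(frames, features)` array with either 204 raw features or 612 features including motion channels; the README does not say which layout is stored on disk, so adjust `EXPECTED_DIMS` if needed. `scan_landmarks` is a hypothetical helper.

```python
import os

import numpy as np

EXPECTED_DIMS = {204, 612}  # assumption: raw features, or with velocity/acceleration


def scan_landmarks(data_dir: str, min_frames: int = 2) -> dict:
    """Sort landmark files into ok / unreadable / bad shape / too short / NaN-or-Inf."""
    report = {"ok": 0, "unreadable": [], "bad_shape": [], "too_short": [], "non_finite": []}
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith(".npy"):
            continue
        try:
            arr = np.load(os.path.join(data_dir, name))
        except (OSError, ValueError, EOFError):
            report["unreadable"].append(name)
            continue
        if arr.ndim != 2 or arr.shape[1] not in EXPECTED_DIMS:
            report["bad_shape"].append(name)
        elif arr.shape[0] < min_frames:
            report["too_short"].append(name)
        elif not np.isfinite(arr).all():
            report["non_finite"].append(name)
        else:
            report["ok"] += 1
    return report
```

In this cell it could be run as `scan_landmarks(DATA_DIR)` once the directory exists; scanning a large dataset loads every file, so it may take a while.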

In [None]:
# ============================================================
# [OPTIONAL] EXTRACT LANDMARKS FROM VIDEOS
# ============================================================
# Only run this if you don't have extracted_landmarks_v2 already!
# This requires the raw video files

EXTRACT_LANDMARKS = False  # Set to True to run extraction

if EXTRACT_LANDMARKS:
    import subprocess
    
    VIDEOS_DIR = "/path/to/raw_videos"  # <-- CHANGE THIS: folder containing the raw videos
    OUTPUT_DIR = "./extracted_landmarks_v2"
    NUM_WORKERS = 8  # Adjust based on CPU cores

    print("Starting landmark extraction...")
    print("This can take several hours, depending on the number of videos and your hardware")
    
    cmd = [
        "python", f"{REPO_DIR}/data/extract_landmarks_v2.py",
        "--input-dir", VIDEOS_DIR,
        "--output-dir", OUTPUT_DIR,
        "--workers", str(NUM_WORKERS)
    ]
    
    result = subprocess.run(cmd, capture_output=False)
    
    if result.returncode == 0:
        print(f"✅ Extraction complete! Files saved to {OUTPUT_DIR}")
        DATA_DIR = OUTPUT_DIR
    else:
        print("❌ Extraction failed!")
else:
    print("Skipping landmark extraction (EXTRACT_LANDMARKS=False)")
    print("Using existing landmarks from:", DATA_DIR)

In [None]:
# ============================================================
# TRAINING CONFIGURATION
# ============================================================
# This class mirrors TrainConfig from train_hybrid.py but with
# paths adjusted for your college GPU environment

from pathlib import Path

class GPUConfig:
    """Training configuration for college GPU"""
    
    # Data paths - ADJUST THESE
    data_dir = Path(DATA_DIR)
    csv_path = Path(CSV_PATH)
    vocab_path = Path(f"{REPO_DIR}/sign_to_text/vocabulary.json")
    checkpoint_dir = Path(CHECKPOINT_DIR)
    
    # Feature version: v2 with face landmarks
    feature_version = 'v2'
    
    # Model architecture (must match saved checkpoints)
    input_dim = 612  # 204 per-frame features (hands + body + face) x 3 (position, velocity, acceleration)
    hidden_dim = 384
    embedding_dim = 256
    encoder_layers = 3
    decoder_layers = 2
    num_heads = 4
    dropout = 0.4
    
    # Dual decoder settings
    use_dual_decoder = True
    primary_decoder = 'gru'
    
    # Gradient checkpointing (enables larger batches with less VRAM)
    use_gradient_checkpointing = True
    
    # ========== ADJUST BASED ON YOUR GPU VRAM ==========
    # 8GB GPU:  batch_size=12, gradient_accumulation=2 → effective=24
    # 12GB GPU: batch_size=16, gradient_accumulation=2 → effective=32
    # 16GB GPU: batch_size=24, gradient_accumulation=2 → effective=48
    # 24GB GPU: batch_size=32, gradient_accumulation=2 → effective=64
    
    batch_size = 16  # Adjust based on GPU memory
    gradient_accumulation = 2
    
    # Training hyperparameters
    epochs = 80
    learning_rate = 5e-4
    min_lr = 1e-6
    weight_decay = 1e-4
    warmup_epochs = 5
    
    # CTC/Attention balance (decays over training)
    ctc_weight_start = 0.3
    ctc_weight_end = 0.1
    ctc_weight_decay_epochs = 30
    
    # Dual decoder loss weights
    gru_loss_weight = 0.6
    transformer_loss_weight = 0.4
    
    # Label smoothing
    label_smoothing = 0.1
    
    # Teacher forcing schedule
    tf_start = 0.9
    tf_end = 0.2
    tf_decay_epochs = 15
    
    # Early stopping
    patience = 15
    min_delta = 0.001
    
    # Device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    use_amp = True  # Mixed precision (FP16)
    
    # Data loading
    num_workers = 4
    pin_memory = True
    prefetch_factor = 2
    persistent_workers = True
    
    # Logging
    log_interval = 50
    val_interval = 1
    save_interval = 1
    log_dir = Path("./logs")
    
    # Sequence limits
    max_src_len = 500
    max_tgt_len = 100
    
    # Data augmentation
    use_augmentation = True
    augmentation_intensity = 'medium'
    
    # Dataset subset (for faster baseline training)
    use_subset = True
    subset_ratio = 0.40  # Use 40% of data
    
    # Validation split
    val_ratio = 0.15
    seed = 42

print("=" * 60)
print("GPU TRAINING CONFIGURATION")
print("=" * 60)
print(f"Batch size: {GPUConfig.batch_size} x {GPUConfig.gradient_accumulation} = {GPUConfig.batch_size * GPUConfig.gradient_accumulation}")
print(f"Model: input={GPUConfig.input_dim}, hidden={GPUConfig.hidden_dim}")
print(f"Dual decoder: {GPUConfig.use_dual_decoder} (primary: {GPUConfig.primary_decoder})")
print(f"Gradient checkpointing: {GPUConfig.use_gradient_checkpointing}")
print(f"Mixed precision (FP16): {GPUConfig.use_amp}")
print(f"Dataset subset: {GPUConfig.subset_ratio*100:.0f}%" if GPUConfig.use_subset else "Dataset subset: full dataset")
print(f"Augmentation: {GPUConfig.augmentation_intensity}")
print("=" * 60)
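
The config describes three schedules: the CTC weight moving from `ctc_weight_start` to `ctc_weight_end` over `ctc_weight_decay_epochs`, teacher forcing moving from `tf_start` to `tf_end` over `tf_decay_epochs`, and a learning rate with `warmup_epochs` of warm-up that ends at `min_lr`. The notebook does not show their exact shapes; the sketch below assumes linear decay for the first two and linear warm-up followed by cosine annealing for the learning rate, a common choice that may not match `train_hybrid.py`. Epochs are 1-indexed, as in the resume cell.

```python
import math


def linear_decay(start: float, end: float, decay_epochs: int, epoch: int) -> float:
    """Move linearly from `start` (epoch 1) to `end` (epoch 1 + decay_epochs), then hold."""
    progress = min(max(epoch - 1, 0) / decay_epochs, 1.0)
    return start + (end - start) * progress


def warmup_cosine_lr(epoch: int, base_lr: float, min_lr: float,
                     warmup_epochs: int, total_epochs: int) -> float:
    """Linear warm-up to base_lr, then cosine annealing down to min_lr."""
    if epoch <= warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / max(total_epochs - warmup_epochs, 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (1, 5, 16, 31, 80):
    ctc_w = linear_decay(0.3, 0.1, 30, epoch)      # ctc_weight_start / end / decay_epochs
    tf_ratio = linear_decay(0.9, 0.2, 15, epoch)   # tf_start / tf_end / tf_decay_epochs
    lr = warmup_cosine_lr(epoch, 5e-4, 1e-6, 5, 80)
    print(f"epoch {epoch:2d}: ctc_weight={ctc_w:.3f} tf_ratio={tf_ratio:.3f} lr={lr:.2e}")
```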

In [None]:
# ============================================================
# IMPORT MODEL AND TRAINING COMPONENTS
# ============================================================
# Uses the updated v2 model with NaN-safe fixes

from model_hybrid_v2 import (
    HybridCTCAttentionModelV2,
    create_hybrid_model_v2,
)

# Test model creation
print("Creating model...")
model = HybridCTCAttentionModelV2(
    input_dim=GPUConfig.input_dim,
    hidden_dim=GPUConfig.hidden_dim,
    vocab_size=73,  # Placeholder for this smoke test; the Trainer builds its own model from the vocabulary
    encoder_layers=GPUConfig.encoder_layers,
    decoder_layers=GPUConfig.decoder_layers,
    num_heads=GPUConfig.num_heads,
    dropout=GPUConfig.dropout,
    use_dual_decoder=GPUConfig.use_dual_decoder,
    primary_decoder=GPUConfig.primary_decoder,
    use_gradient_checkpointing=GPUConfig.use_gradient_checkpointing
).to(GPUConfig.device)

print()
print("=" * 60)
print("MODEL SUMMARY")
print("=" * 60)
print(f"Total parameters: {model.get_num_params():,}")
print(f"Model size (FP32): {model.get_model_size_mb():.2f} MB")
print(f"Estimated size (INT8): {model.get_model_size_mb()/4:.2f} MB")
print(f"Dual decoder: {GPUConfig.use_dual_decoder}")
print(f"Gradient checkpointing: {GPUConfig.use_gradient_checkpointing}")
print("=" * 60)

# Quick forward pass test
print("\nTesting forward pass...")
device = GPUConfig.device
with torch.no_grad():
    x = torch.randn(4, 100, GPUConfig.input_dim, device=device)
    lens = torch.tensor([100, 90, 80, 70], device=device)
    tgt = torch.randint(0, 73, (4, 20), device=device)
    outputs = model(x, lens, tgt, tf_ratio=0.5)
    print(f"CTC log probs: {outputs['ctc_log_probs'].shape}")
    print(f"GRU outputs: {outputs['gru_outputs'].shape}")
    if 'tf_outputs' in outputs and outputs['tf_outputs'] is not None:
        print(f"Transformer outputs: {outputs['tf_outputs'].shape}")

print("\n✅ Model test passed!")

# Free the smoke-test model; the Trainer creates its own model in a later cell
del model, outputs
if torch.cuda.is_available():
    torch.cuda.empty_cache()
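
The forward pass also returns `ctc_log_probs`. A CTC head is usually decoded with best-path (greedy) decoding: take the argmax per frame, merge consecutive repeats, then drop the blank symbol. The sketch below works on a NumPy array of per-frame log-probabilities for one sequence and assumes the blank is index 0 (the default of PyTorch's `nn.CTCLoss`); check which index the repository's vocabulary actually reserves for blank.

```python
from typing import List, Optional

import numpy as np


def ctc_greedy_decode(log_probs: np.ndarray, length: Optional[int] = None,
                      blank: int = 0) -> List[int]:
    """Best-path CTC decoding for one sequence.

    log_probs: (T, vocab_size) per-frame log-probabilities.
    length:    number of valid (unpadded) frames; defaults to all T frames.
    """
    if length is not None:
        log_probs = log_probs[:length]
    best_path = log_probs.argmax(axis=-1)
    decoded, previous = [], None
    for token in best_path.tolist():
        # Emit only when the token changes from the previous frame and is not blank.
        if token != previous and token != blank:
            decoded.append(token)
        previous = token
    return decoded
```

For a batch-first tensor of shape `(B, T, V)`, one sequence would be passed as `log_probs[i].float().cpu().numpy()`. If the encoder downsamples time, the valid length shrinks accordingly.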

In [None]:
# ============================================================
# IMPORT TRAINING COMPONENTS
# ============================================================
from train_hybrid import (
    SignLanguageDataset,
    collate_fn,
    HybridLoss,
    LabelSmoothingLoss,
    Trainer,
    TrainConfig
)

print("✅ Training components imported successfully!")
print()
print("Components loaded:")
print("  - SignLanguageDataset (with NaN filtering)")
print("  - HybridLoss (CTC + Attention with length validation)")
print("  - LabelSmoothingLoss (with clamping)")
print("  - Trainer (with gradient NaN detection)")
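
`LabelSmoothingLoss` is listed as label-smoothed cross-entropy with clamping. The NumPy sketch below shows one common formulation: the true class receives `1 - smoothing` of the target probability, the remaining `smoothing` mass is spread evenly over the other classes, and log-probabilities are clamped so they never reach `-inf`. The repository's class may distribute the mass differently (for example over all classes, or excluding the padding index), so treat this as illustrative only.

```python
from typing import Optional

import numpy as np


def label_smoothing_loss(logits: np.ndarray, targets: np.ndarray,
                         smoothing: float = 0.1, pad_idx: Optional[int] = None,
                         min_log_prob: float = -100.0) -> float:
    """Mean label-smoothed cross-entropy over non-padding positions.

    logits: (N, V) unnormalized scores; targets: (N,) integer class indices.
    """
    n, vocab = logits.shape
    # Numerically stable log-softmax, then clamp so log(0) cannot become -inf.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_probs = np.maximum(log_probs, min_log_prob)
    # Smoothed targets: 1 - smoothing on the true class, the rest shared evenly.
    true_dist = np.full((n, vocab), smoothing / (vocab - 1))
    true_dist[np.arange(n), targets] = 1.0 - smoothing
    per_token = -(true_dist * log_probs).sum(axis=1)
    if pad_idx is not None:
        keep = targets != pad_idx  # padding positions do not contribute
        return float(per_token[keep].mean()) if keep.any() else 0.0
    return float(per_token.mean())


logits = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
print(label_smoothing_loss(logits, np.array([0, 2]), smoothing=0.1))
```

With `label_smoothing = 0.1` from the config, a model is penalized slightly for becoming fully confident, which tends to help generalization on small vocabularies.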

In [None]:
# ============================================================
# CREATE TRAINER WITH CUSTOM CONFIG
# ============================================================
# Override TrainConfig with our GPU-specific paths

# Monkey-patch the paths in TrainConfig
TrainConfig.data_dir = GPUConfig.data_dir
TrainConfig.csv_path = GPUConfig.csv_path
TrainConfig.vocab_path = GPUConfig.vocab_path
TrainConfig.checkpoint_dir = GPUConfig.checkpoint_dir
TrainConfig.log_dir = GPUConfig.log_dir

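# Apply GPU-specific settings
TrainConfig.batch_size = GPUConfig.batch_size
TrainConfig.gradient_accumulation = GPUConfig.gradient_accumulation
TrainConfig.num_workers = GPUConfig.num_workers

# Copy every other GPUConfig setting that TrainConfig also defines, so edits to
# GPUConfig (epochs, learning_rate, subset_ratio, ...) are not silently ignored
for name, value in vars(GPUConfig).items():
    if not name.startswith('_') and hasattr(TrainConfig, name):
        setattr(TrainConfig, name, value)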

# Create trainer
print("Initializing trainer...")
trainer = Trainer(TrainConfig())

print()
print("=" * 60)
print("DATASET SUMMARY")
print("=" * 60)
print(f"Train samples: {len(trainer.train_dataset):,}")
print(f"Validation samples: {len(trainer.val_dataset):,}")
print(f"Batches per epoch: {len(trainer.train_loader):,}")
print(f"Vocabulary size: {len(trainer.vocab)}")
print("=" * 60)
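
`GPUConfig` requests a 40% subset (`use_subset`, `subset_ratio`) with 15% of it held out for validation (`val_ratio`, `seed`). How `SignLanguageDataset` applies these is not shown here; the sketch below assumes a seeded shuffle, then the subset cut, then the train/validation split, which keeps both splits reproducible across runs. `subset_and_split` is a hypothetical helper.

```python
import random


def subset_and_split(sample_ids, subset_ratio=0.40, val_ratio=0.15, seed=42, use_subset=True):
    """Return (train_ids, val_ids) from a reproducible shuffle of sample_ids."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # local RNG: leaves the global random state alone
    if use_subset:
        ids = ids[: round(len(ids) * subset_ratio)]
    n_val = round(len(ids) * val_ratio)
    return ids[n_val:], ids[:n_val]


train_ids, val_ids = subset_and_split(range(1000))
print(len(train_ids), len(val_ids))  # 340 60
```

Shuffling before the subset cut avoids training only on whatever order the CSV happens to list samples in.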

In [None]:
# ============================================================
# [OPTIONAL] RESUME FROM CHECKPOINT
# ============================================================
# Set RESUME_PATH to a checkpoint file to continue training

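RESUME_PATH = None  # e.g., "./checkpoints_v2/latest.pth"

if RESUME_PATH and os.path.exists(RESUME_PATH):
    print(f"Resuming from checkpoint: {RESUME_PATH}")
    # weights_only=False: trusted local checkpoint that may hold non-tensor
    # objects (PyTorch >= 2.6 defaults to weights_only=True)
    checkpoint = torch.load(RESUME_PATH, map_location=trainer.device, weights_only=False)
    trainer.model.load_state_dict(checkpoint['model_state_dict'])
    trainer.optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    start_epoch = checkpoint.get('epoch', 0) + 1
    print(f"Resuming from epoch {start_epoch}")
    # Note: trainer.train() is called without start_epoch, so unless the Trainer
    # reads it elsewhere, its epoch counter and epoch-based schedules restart
    # at epoch 1; only the weights and optimizer state carry over.
else:
    if RESUME_PATH:
        print(f"⚠️ Checkpoint not found: {RESUME_PATH}")
    start_epoch = 1
    print("Starting fresh training (no checkpoint loaded)")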
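
Instead of typing `RESUME_PATH` by hand, the newest checkpoint could be picked automatically. This sketch relies on the `latest.pth` and `epoch_*.pth` names listed in the results cell and additionally assumes the periodic files are named `epoch_<N>.pth` with an integer `N`; `find_resume_checkpoint` is hypothetical.

```python
import re
from pathlib import Path
from typing import Optional


def find_resume_checkpoint(checkpoint_dir) -> Optional[Path]:
    """Prefer latest.pth; otherwise return the epoch_<N>.pth with the highest N."""
    directory = Path(checkpoint_dir)
    latest = directory / "latest.pth"
    if latest.is_file():
        return latest
    best_epoch, best_path = -1, None
    for path in directory.glob("epoch_*.pth"):
        match = re.fullmatch(r"epoch_(\d+)\.pth", path.name)
        # Compare numerically so epoch_10 beats epoch_9 (a string sort would not).
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

It could replace the `None` in the cell above, as `RESUME_PATH = find_resume_checkpoint(CHECKPOINT_DIR)`; `os.path.exists` and `torch.load` both accept a `Path`.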

In [None]:
# ============================================================
# START TRAINING
# ============================================================
# This will run for the configured number of epochs
# Checkpoints are saved automatically

print("=" * 60)
print("🚀 STARTING TRAINING")
print("=" * 60)
print(f"Epochs: {TrainConfig.epochs}")
print(f"Batch size: {TrainConfig.batch_size} x {TrainConfig.gradient_accumulation}")
print(f"Learning rate: {TrainConfig.learning_rate}")
print(f"Device: {TrainConfig.device}")
print(f"Mixed precision: {TrainConfig.use_amp}")
print("=" * 60)
print()
print("Training will save checkpoints to:", TrainConfig.checkpoint_dir)
print("Interrupt the kernel (or press Ctrl+C) to stop; checkpoints from completed epochs are kept")
print()

# Run training
trainer.train()
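
Training may stop before `epochs` once the validation loss stops improving (`patience = 15`, `min_delta = 0.001` in the config). The sketch below assumes the usual rule: an epoch counts as an improvement only if it beats the best loss so far by more than `min_delta`, and training halts after `patience` consecutive epochs without one. `Trainer` may compare a different metric, and `EarlyStopping` here is a hypothetical class.

```python
class EarlyStopping:
    """Track validation loss and signal when training should stop."""

    def __init__(self, patience: int = 15, min_delta: float = 0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # meaningful improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1   # no improvement, or one smaller than min_delta
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=3, min_delta=0.01)
for epoch, loss in enumerate([1.0, 0.8, 0.798, 0.797, 0.81], start=1):
    if stopper.step(loss):
        print(f"stopping after epoch {epoch}")
        break
```

The `min_delta` threshold keeps tiny, noise-level decreases from resetting the patience counter indefinitely.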

In [None]:
# ============================================================
# TRAINING COMPLETED - VIEW RESULTS
# ============================================================

print("=" * 60)
print("TRAINING COMPLETE")
print("=" * 60)

if hasattr(trainer, 'history') and trainer.history:
    history = trainer.history
    print(f"Total epochs trained: {len(history.get('train_loss', []))}")
    if history.get('val_loss'):
        print(f"Best validation loss: {min(history['val_loss']):.4f}")
    if history.get('val_acc'):
        print(f"Best validation accuracy: {max(history['val_acc'])*100:.2f}%")
else:
    print("No training history available")

print()
print("Checkpoints saved to:", TrainConfig.checkpoint_dir)
print("  - best.pth (best validation loss)")
print("  - latest.pth (most recent)")
print("  - epoch_*.pth (periodic saves)")

In [None]:
# ============================================================
# VISUALIZE TRAINING CURVES
# ============================================================
import matplotlib.pyplot as plt

if hasattr(trainer, 'history') and trainer.history:
    history = trainer.history
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Total Loss
    if 'train_loss' in history:
        axes[0,0].plot(history['train_loss'], label='Train', linewidth=2)
        axes[0,0].plot(history['val_loss'], label='Validation', linewidth=2)
        axes[0,0].set_title('Total Loss', fontsize=12, fontweight='bold')
        axes[0,0].set_xlabel('Epoch')
        axes[0,0].set_ylabel('Loss')
        axes[0,0].legend()
        axes[0,0].grid(True, alpha=0.3)
    
    # Accuracy
    if 'train_acc' in history:
        axes[0,1].plot([a*100 for a in history['train_acc']], label='Train', linewidth=2)
        axes[0,1].plot([a*100 for a in history['val_acc']], label='Validation', linewidth=2)
        axes[0,1].set_title('Token Accuracy (%)', fontsize=12, fontweight='bold')
        axes[0,1].set_xlabel('Epoch')
        axes[0,1].set_ylabel('Accuracy (%)')
        axes[0,1].legend()
        axes[0,1].grid(True, alpha=0.3)
    
    # CTC Loss
    if 'train_ctc_loss' in history:
        axes[1,0].plot(history['train_ctc_loss'], label='Train CTC', linewidth=2)
        axes[1,0].plot(history['val_ctc_loss'], label='Val CTC', linewidth=2)
        axes[1,0].set_title('CTC Loss', fontsize=12, fontweight='bold')
        axes[1,0].set_xlabel('Epoch')
        axes[1,0].set_ylabel('Loss')
        axes[1,0].legend()
        axes[1,0].grid(True, alpha=0.3)
    
    # GRU Attention Loss
    if 'train_gru_loss' in history:
        axes[1,1].plot(history['train_gru_loss'], label='Train GRU', linewidth=2)
        axes[1,1].plot(history['val_gru_loss'], label='Val GRU', linewidth=2)
        axes[1,1].set_title('GRU Decoder Loss', fontsize=12, fontweight='bold')
        axes[1,1].set_xlabel('Epoch')
        axes[1,1].set_ylabel('Loss')
        axes[1,1].legend()
        axes[1,1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    
    # Save figure
    save_path = os.path.join(str(TrainConfig.checkpoint_dir), 'training_curves.png')
    plt.savefig(save_path, dpi=150, bbox_inches='tight')
    print(f"Training curves saved to: {save_path}")
    
    plt.show()
else:
    print("No training history available. Run training first.")

In [None]:
# ============================================================
# QUICK INFERENCE TEST
# ============================================================
# Test the trained model on a sample from validation set

# Load best checkpoint
best_checkpoint = os.path.join(str(TrainConfig.checkpoint_dir), 'best.pth')

if os.path.exists(best_checkpoint):
    print("Loading best model for inference test...")
    
    # Load checkpoint onto the trainer's device (trusted local file)
    checkpoint = torch.load(best_checkpoint, map_location=trainer.device, weights_only=False)
    trainer.model.load_state_dict(checkpoint['model_state_dict'])
    trainer.model.eval()

    # Get a random validation sample
    import random
    test_idx = random.randint(0, len(trainer.val_dataset) - 1)
    sample = trainer.val_dataset[test_idx]

    # Prepare input (add a batch dimension)
    features = sample['src'].unsqueeze(0).to(trainer.device)
    src_lens = torch.tensor([features.shape[1]], device=trainer.device)
    
    # Run inference (greedy decoding)
    with torch.no_grad():
        # Get model outputs without teacher forcing
        outputs = trainer.model(features, src_lens, tgt=None, tf_ratio=0.0)
        
        # Get predictions from primary decoder (GRU)
        gru_logits = outputs['gru_outputs']  # (1, max_len, vocab_size)
        predictions = gru_logits.argmax(dim=-1)  # (1, max_len)
        
        # Convert to text
        pred_tokens = predictions[0].cpu().tolist()
        
        # Decode using vocabulary (reverse mapping)
        idx_to_char = {v: k for k, v in trainer.vocab.items()}
        pred_text = ''.join([idx_to_char.get(t, '?') for t in pred_tokens])
        
        # Remove padding and special tokens
        if '<eos>' in pred_text:
            pred_text = pred_text.split('<eos>')[0]
        pred_text = pred_text.replace('<pad>', '').replace('<sos>', '')
    
    # Get ground truth
    gt_tokens = sample['tgt'].tolist()
    gt_text = ''.join([idx_to_char.get(t, '?') for t in gt_tokens])
    gt_text = gt_text.replace('<pad>', '').replace('<sos>', '').replace('<eos>', '')
    
    print()
    print("=" * 60)
    print("INFERENCE TEST")
    print("=" * 60)
    print(f"Sample index: {test_idx}")
    print(f"Input shape: {features.shape}")
    print(f"Ground truth: '{gt_text}'")
    print(f"Prediction:   '{pred_text}'")
    print("=" * 60)
else:
    print(f"No checkpoint found at: {best_checkpoint}")
    print("Train the model first!")
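
The inference cell compares one prediction with its ground truth by eye. To turn that into a number, character error rate (CER) is a standard metric for character-level output (the decoding above joins tokens with `''`, which suggests a character vocabulary): the Levenshtein edit distance between prediction and reference, divided by the reference length. This is the textbook definition, not something the repository is confirmed to report; `character_error_rate` is a hypothetical helper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; insertions, deletions and substitutions each cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def character_error_rate(prediction: str, reference: str) -> float:
    """Edit distance normalized by reference length (can exceed 1.0)."""
    if not reference:
        # Convention chosen here: empty reference scores 0 only for an empty prediction.
        return 0.0 if not prediction else 1.0
    return edit_distance(prediction, reference) / len(reference)


print(character_error_rate("helo world", "hello world"))
```

At the end of the inference cell it could be applied as `character_error_rate(pred_text, gt_text)`; averaging it over the whole validation set gives a more reliable picture than a single random sample.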