# KonkaniVani ASR Training - Google Colab
## Resume from Checkpoint 15 with Memory Optimization

**Configuration:**
- Model: d_model=256, 12 encoder layers, 6 decoder layers
- Batch size: 2 (with gradient accumulation 4x)
- Mixed precision: FP16
- GPU: Tesla T4 (14GB)

---

## 1. Setup Environment

In [None]:
# Check GPU
!nvidia-smi

In [None]:
# Install dependencies
!pip install torch torchaudio librosa soundfile tensorboard tqdm pyyaml

## 2. Mount Google Drive (Optional - for backup)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Set backup path
DRIVE_BACKUP_PATH = '/content/drive/MyDrive/konkanivani_backup'
!mkdir -p {DRIVE_BACKUP_PATH}

## 3. Copy Project from Google Drive

**Your Drive folder**: https://drive.google.com/drive/folders/1-chxczmcNooqLDtsFgQ8ZT8NvzFuFARr

In [None]:
import os
from pathlib import Path

# Navigate to content directory
%cd /content

# Option A: If you have the project as a zip file in the shared folder
# Find the zip file in your mounted drive and extract it
# !cp "/content/drive/MyDrive/[YourFolder]/konkani_project.zip" .
# !unzip -q konkani_project.zip

# Option B: Copy entire project folder from Drive
# Replace with your actual Drive path
DRIVE_PROJECT_PATH = "/content/drive/MyDrive/konkani"  # Adjust this path

if Path(DRIVE_PROJECT_PATH).exists():
    print(f"✅ Found project at: {DRIVE_PROJECT_PATH}")
    print("📋 Copying to /content/konkani...")
    !cp -r {DRIVE_PROJECT_PATH} /content/konkani
    %cd /content/konkani
    print("✅ Project copied successfully!")
else:
    print(f"❌ Project not found at: {DRIVE_PROJECT_PATH}")
    print("\n📝 Please update DRIVE_PROJECT_PATH to match your Drive structure")
    print("   Or manually copy files to /content/konkani")
    
    # Create directory for manual upload
    !mkdir -p /content/konkani
    %cd /content/konkani

## 4. Verify Project Structure

In [None]:
import os

# Check required files
required_files = [
    'training_scripts/train_konkanivani_asr.py',
    'models/konkanivani_asr.py',
    'data/audio_processing/dataset.py',
    'data/audio_processing/text_tokenizer.py',
    'data/vocab.json',
    'data/konkani-asr-v0/splits/manifests/train.json',
    'data/konkani-asr-v0/splits/manifests/val.json',
    'archives/checkpoint_epoch_15.pt'
]

print("Checking project structure...\n")
for file in required_files:
    exists = "✅" if os.path.exists(file) else "❌"
    print(f"{exists} {file}")

print("\n" + "="*60)
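
The check above only prints a status per file. As a minimal sketch, the same test can be wrapped in a helper that returns the missing paths, so a later cell can stop early instead of failing inside the training script. The name `find_missing` is hypothetical and not part of the project.

```python
from pathlib import Path

def find_missing(required, root="."):
    """Return the entries of `required` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]
```

For example, `find_missing(required_files)` returns an empty list when every required file is present in the current directory.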

## 5. Copy Checkpoint to Working Directory

In [None]:
!mkdir -p checkpoints
!cp archives/checkpoint_epoch_15.pt checkpoints/
!ls -lh checkpoints/

## 6. Verify Checkpoint Configuration

In [None]:
import torch
import json

# Load checkpoint
checkpoint = torch.load('checkpoints/checkpoint_epoch_15.pt', map_location='cpu')

print("📋 Checkpoint Configuration:")
print("="*60)
print(json.dumps(checkpoint.get('config', {}), indent=2))

print("\n📊 Model Architecture:")
print("="*60)
state = checkpoint['model_state_dict']
encoder_layers = sum(1 for k in state.keys() if 'encoder.layers.' in k and '.ff1.0.weight' in k)
decoder_layers = sum(1 for k in state.keys() if 'decoder.decoder.layers.' in k and '.linear1.weight' in k)
d_model = state['encoder.input_proj.weight'].shape[0]
vocab_size = state['ctc_head.weight'].shape[0]

print(f"Encoder layers: {encoder_layers}")
print(f"Decoder layers: {decoder_layers}")
print(f"d_model: {d_model}")
print(f"vocab_size: {vocab_size}")
print(f"Epoch: {checkpoint['epoch']}")
print(f"Val loss: {checkpoint.get('val_loss', 'N/A')}")

# Clear memory
del checkpoint
torch.cuda.empty_cache()
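
The layer counts above depend on one specific weight name per layer (`.ff1.0.weight`, `.linear1.weight`). A sketch of an alternative, assuming only that layer keys follow a `<prefix><index>.` pattern, is to count the distinct layer indices. The helper `count_layers` and the example keys are hypothetical and for illustration only.

```python
import re

def count_layers(keys, prefix):
    """Count distinct layer indices among keys that start with `prefix`."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)\.")
    indices = {int(m.group(1)) for k in keys if (m := pattern.match(k))}
    return len(indices)

# Hypothetical key names, for illustration only
example_keys = [
    "encoder.layers.0.ff1.0.weight",
    "encoder.layers.0.ff1.0.bias",
    "encoder.layers.1.ff1.0.weight",
    "decoder.decoder.layers.0.linear1.weight",
]
print(count_layers(example_keys, "encoder.layers."))          # 2
print(count_layers(example_keys, "decoder.decoder.layers."))  # 1
```

With a real checkpoint this would be called as `count_layers(state.keys(), 'encoder.layers.')`.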

## 7. Set Memory Optimization Environment Variables

In [None]:
import os

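# Set allocator options to reduce fragmentation. Processes launched with
# `!python3` inherit os.environ, so these also apply to the training run.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
os.environ['CUDA_LAUNCH_BLOCKING'] = '0'  # 0 keeps kernel launches asynchronous (the default)

print("✅ Environment variables set:")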
print(f"   PYTORCH_CUDA_ALLOC_CONF={os.environ['PYTORCH_CUDA_ALLOC_CONF']}")
print(f"   CUDA_LAUNCH_BLOCKING={os.environ['CUDA_LAUNCH_BLOCKING']}")

## 8. Clear GPU Memory

In [None]:
import torch
import gc

# Clear cache
gc.collect()
torch.cuda.empty_cache()

# Check GPU memory
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    print(f"Allocated: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
    print(f"Cached: {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
else:
    print("⚠️ CUDA not available!")

## 9. Start Training (Memory Optimized)

### Configuration:
- **Batch size**: 2 (reduced from 8)
- **Gradient accumulation**: 4 steps (effective batch = 8)
- **Mixed precision**: FP16 enabled
- **Model**: d_model=256, 12 encoder, 6 decoder layers
- **Resume from**: checkpoint_epoch_15.pt
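
To see why batch size 2 with 4 accumulation steps can stand in for batch size 8, here is a toy sketch in plain Python. It assumes each micro-batch loss is a mean and is scaled by `1 / accum_steps` before the gradients are summed; whether `train_konkanivani_asr.py` scales the loss this way is an assumption, since the script is not shown here. The model `y = w * x` and the data are made up.

```python
def mean_grad(w, batch):
    """Gradient of the mean squared error of y = w * x over `batch`."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size, accum_steps):
    """Sum micro-batch gradients, each scaled by 1 / accum_steps."""
    total = 0.0
    for step in range(accum_steps):
        start = step * micro_batch_size
        micro = data[start:start + micro_batch_size]
        total += mean_grad(w, micro) / accum_steps
    return total

data = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]  # 8 toy samples
full = mean_grad(0.5, data)                               # one batch of 8
accum = accumulated_grad(0.5, data, micro_batch_size=2, accum_steps=4)
print(abs(full - accum) < 1e-9)  # True
```

The two gradients match, so the optimizer step is the same while peak activation memory is set by the micro-batch of 2. Layers with batch-dependent statistics, such as batch normalization, would still see the smaller batch.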

In [None]:
# Training command
!python3 training_scripts/train_konkanivani_asr.py \
    --train_manifest data/konkani-asr-v0/splits/manifests/train.json \
    --val_manifest data/konkani-asr-v0/splits/manifests/val.json \
    --vocab_file data/vocab.json \
    --batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_epochs 50 \
    --learning_rate 0.0005 \
    --device cuda \
    --d_model 256 \
    --encoder_layers 12 \
    --decoder_layers 6 \
    --mixed_precision \
    --checkpoint_dir checkpoints \
    --log_dir logs \
    --resume checkpoints/checkpoint_epoch_15.pt

## 10. Monitor Training (Run in Separate Cell)

In [None]:
# Monitor GPU usage
!nvidia-smi

In [None]:
# View TensorBoard logs
%load_ext tensorboard
%tensorboard --logdir logs

## 11. Back Up to Google Drive (Run Periodically)

In [None]:
import shutil
from pathlib import Path

# Backup checkpoints
drive_backup = '/content/drive/MyDrive/konkanivani_backup'

if Path(drive_backup).exists():
    print("📤 Backing up to Google Drive...")
    
    # Backup checkpoints
    !cp -r checkpoints {drive_backup}/
    
    # Backup logs
    !cp -r logs {drive_backup}/
    
    print("✅ Backup completed!")
    !ls -lh {drive_backup}/checkpoints/
else:
    print("⚠️ Drive not mounted or backup path doesn't exist")
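
The same backup can be sketched in pure Python with `shutil.copytree`, which skips source directories that do not exist yet (for example `logs` before the first epoch) instead of printing a `cp` error. The helper name `backup_dirs` is hypothetical, and `dirs_exist_ok=True` requires Python 3.8 or newer.

```python
import shutil
from pathlib import Path

def backup_dirs(sources, destination):
    """Copy each existing source directory into `destination`; return those copied."""
    destination = Path(destination)
    copied = []
    for src in map(Path, sources):
        if src.is_dir():
            # dirs_exist_ok=True lets repeated backups overwrite earlier copies
            shutil.copytree(src, destination / src.name, dirs_exist_ok=True)
            copied.append(src.name)
    return copied
```

In this notebook it would be called as `backup_dirs(['checkpoints', 'logs'], drive_backup)`.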

## 12. If Out of Memory - Reduce Batch Size Further

In [None]:
# Clear memory first
import torch
import gc
gc.collect()
torch.cuda.empty_cache()

# Run with batch_size=1
!python3 training_scripts/train_konkanivani_asr.py \
    --train_manifest data/konkani-asr-v0/splits/manifests/train.json \
    --val_manifest data/konkani-asr-v0/splits/manifests/val.json \
    --vocab_file data/vocab.json \
    --batch_size 1 \
    --gradient_accumulation_steps 8 \
    --num_epochs 50 \
    --learning_rate 0.0005 \
    --device cuda \
    --d_model 256 \
    --encoder_layers 12 \
    --decoder_layers 6 \
    --mixed_precision \
    --checkpoint_dir checkpoints \
    --log_dir logs \
    --resume checkpoints/checkpoint_epoch_15.pt

## 13. Download Best Model

In [None]:
from pathlib import Path

from google.colab import files

# Download best model
if Path('checkpoints/best_model.pt').exists():
    files.download('checkpoints/best_model.pt')
    print("✅ Downloaded best_model.pt")
else:
    print("⚠️ best_model.pt not found")

# List all checkpoints
!ls -lh checkpoints/

## 14. Test Inference (After Training)

In [None]:
import torch
import sys
sys.path.append('.')

from models.konkanivani_asr import create_konkanivani_model
from data.audio_processing.text_tokenizer import KonkaniTokenizer

# Load model
tokenizer = KonkaniTokenizer('data/vocab.json')
model = create_konkanivani_model(
    vocab_size=tokenizer.vocab_size,
    config={
        'input_dim': 80,
        'd_model': 256,
        'encoder_layers': 12,
        'decoder_layers': 6,
        'num_heads': 4,
        'conv_kernel_size': 31,
        'dropout': 0.1
    }
)

# Load checkpoint on the CPU first, then move the model to the GPU
checkpoint = torch.load('checkpoints/best_model.pt', map_location='cpu')
model.load_state_dict(checkpoint['model_state_dict'])
model = model.cuda()
model.eval()

print("✅ Model loaded successfully!")
print(f"   Trained for {checkpoint['epoch']} epochs")
print(f"   Best val loss: {checkpoint['val_loss']:.4f}")

---

## Troubleshooting

### Out of Memory Error
1. Clear cache: `torch.cuda.empty_cache()`
2. Reduce batch_size to 1
3. Increase gradient_accumulation_steps to 8
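
Steps 2 and 3 go together: halving the batch size while doubling the accumulation steps keeps the effective batch at 8. A small sketch that lists the combinations to try in order (the helper name `fallback_schedule` is hypothetical):

```python
def fallback_schedule(effective_batch):
    """List (batch_size, accumulation_steps) pairs whose product is
    `effective_batch`, ordered from largest to smallest batch size."""
    pairs = []
    batch_size = effective_batch
    while batch_size >= 1:
        if effective_batch % batch_size == 0:
            pairs.append((batch_size, effective_batch // batch_size))
        batch_size //= 2
    return pairs

print(fallback_schedule(8))  # [(8, 1), (4, 2), (2, 4), (1, 8)]
```

The third pair is the main configuration in section 9 and the last pair is the fallback in section 12.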

### Checkpoint Not Found
1. Check if file exists: `!ls -lh archives/`
2. Copy to checkpoints: `!cp archives/checkpoint_epoch_15.pt checkpoints/`

### Training Slow
1. Check GPU utilization: `!nvidia-smi`
2. Ensure mixed_precision is enabled
3. Verify GPU is being used: Check "Device: cuda" in training output
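
For step 1, `nvidia-smi` can also emit machine-readable CSV, which is easier to log than the full table. A minimal sketch, assuming the fields come back as plain integers (some GPUs report `[N/A]` for a field, which this parser does not handle); the function and key names are hypothetical:

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse one line per GPU: utilization (%), memory used (MiB), memory total (MiB)."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_pct": util, "used_mib": used, "total_mib": total})
    return stats

def read_gpu_stats():
    """Run nvidia-smi and parse its output; requires an NVIDIA driver."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)

# Sample text in the queried format (made-up values, not a measurement)
print(parse_gpu_stats("87, 9021, 15360\n"))
```

Utilization that stays low while training is running usually points at data loading rather than the model as the bottleneck.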

### Session Timeout
1. Back up to Drive regularly (every 5 epochs)
2. Use Colab Pro for longer sessions
3. Resume from latest checkpoint
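
To resume from the latest checkpoint after a timeout, the newest file has to be picked by epoch number, not by name, because `checkpoint_epoch_9.pt` sorts after `checkpoint_epoch_20.pt` alphabetically. A sketch, assuming the training script saves files named like `checkpoint_epoch_15.pt` (inferred from the one checkpoint shown in this notebook); `latest_checkpoint` is a hypothetical helper:

```python
import re
from pathlib import Path

CHECKPOINT_RE = re.compile(r"checkpoint_epoch_(\d+)\.pt$")

def latest_checkpoint(checkpoint_dir):
    """Return the checkpoint path with the highest epoch number, or None."""
    best_epoch, best_path = -1, None
    for path in Path(checkpoint_dir).glob("checkpoint_epoch_*.pt"):
        match = CHECKPOINT_RE.search(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch, best_path = int(match.group(1)), path
    return best_path
```

The result of `latest_checkpoint('checkpoints')` can then be passed to `--resume` in place of the hard-coded epoch 15 path.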

---