# KonkaniVani ASR Training - Final Setup
## Resume from Checkpoint 15 - Memory Optimized for Tesla T4

**Drive Folder**: https://drive.google.com/drive/u/5/folders/1KX7k_z2negFKq3qFjHJh-K1U-MEcNp7P

**Configuration:**
- Model: d_model=256, 12 encoder, 6 decoder layers (from checkpoint 15)
- Batch size: 2 × 4 gradient accumulation steps = effective batch size 8
- Mixed precision: FP16 (saves up to ~50% GPU memory)
- Resume from: Epoch 15 → Train to Epoch 50
- Expected time: 8-12 hours on Tesla T4
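
The effective batch size of 8 comes from averaging gradients over 4 micro-batches of 2 before a single optimizer step. The sketch below illustrates that equivalence in pure Python on a hypothetical one-parameter least-squares model (a stand-in, not the ASR model); it assumes equal-sized micro-batches and a mean-reduced loss.

```python
# Illustration only: toy model y = w * x with mean-squared-error loss.
# The data and model are hypothetical stand-ins for the real ASR model.

def mse_grad(w, batch):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, data, micro_batch_size, accumulation_steps):
    """Sum per-micro-batch gradients, each scaled by 1/accumulation_steps,
    which is what dividing each micro-batch loss by accumulation_steps does."""
    total = 0.0
    for step in range(accumulation_steps):
        start = step * micro_batch_size
        micro_batch = data[start:start + micro_batch_size]
        total += mse_grad(w, micro_batch) / accumulation_steps
    return total

data = [(float(i), 3.0 * i + 1.0) for i in range(1, 9)]  # 8 samples
full = mse_grad(0.5, data)                  # one batch of 8
accum = accumulated_grad(0.5, data, 2, 4)   # 4 micro-batches of 2
print("gradients match:", abs(full - accum) < 1e-9)
```

The gradients agree, but layers that depend on batch statistics (for example batch normalization) would still see micro-batches of 2, so training is not guaranteed to behave identically to a true batch of 8.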

---

## Step 1: Check GPU

In [None]:
!nvidia-smi
print("\n⚠️ Make sure you see 'Tesla T4' above!")
print("If not: Runtime → Change runtime type → GPU")
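
If you would rather not read the `nvidia-smi` table by eye, the check can be done programmatically. This is a minimal sketch: the helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and it assumes `nvidia-smi` reports memory as a plain number followed by a unit.

```python
import subprocess

def parse_gpu_csv(csv_text):
    """Parse lines like 'Tesla T4, 15360 MiB' into (name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, memory = [part.strip() for part in line.split(",", 1)]
        gpus.append((name, int(memory.split()[0])))
    return gpus

def query_gpus():
    """Ask nvidia-smi for GPU names and total memory; return [] if unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(result.stdout)

gpus = query_gpus()
if any("T4" in name for name, _ in gpus):
    print("Tesla T4 detected:", gpus)
else:
    print("No Tesla T4 found; detected GPUs:", gpus)
```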

## Step 2: Install Dependencies

In [None]:
print("📦 Installing dependencies...\n")
!pip install -q torch torchaudio librosa soundfile tensorboard tqdm pyyaml
print("✅ Dependencies installed!")

## Step 3: Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
print("\n✅ Drive mounted!")

## Step 4: Locate and Copy Project Files

**Your Drive folder ID**: `1KX7k_z2negFKq3qFjHJh-K1U-MEcNp7P`

First, let's find where your files are:

In [None]:
import os
from pathlib import Path

print("🔍 Searching for your project files...\n")

# Common Drive locations
search_paths = [
    "/content/drive/MyDrive",
    "/content/drive/Shareddrives",
]

# Look for the project
for base_path in search_paths:
    if os.path.exists(base_path):
        print(f"📂 Checking: {base_path}")
        !ls -la {base_path} | head -20
        print("\n" + "="*60 + "\n")

print("👆 Look for your project folder or zip file above")

## Step 5: Extract/Copy Project

**Update the path below** based on what you saw in Step 4:

In [None]:
import os
from pathlib import Path

# ============================================
# UPDATE THESE PATHS BASED ON YOUR DRIVE
# ============================================

# Option A: If you have a ZIP file
USE_ZIP = True
ZIP_PATH = "/content/drive/MyDrive/konkani_project.zip"  # UPDATE THIS

# Option B: If you have a folder
FOLDER_PATH = "/content/drive/MyDrive/konkani"  # UPDATE THIS

# ============================================

%cd /content

if USE_ZIP:
    print(f"📦 Extracting from: {ZIP_PATH}\n")
    if Path(ZIP_PATH).exists():
        # Quote the path so Drive folders with spaces still work
        !unzip -q "{ZIP_PATH}" -d /content/
        print("✅ Extracted!\n")
        !ls -la /content/
    else:
        print(f"❌ ZIP not found at: {ZIP_PATH}")
        print("\n📝 Please update ZIP_PATH above with the correct path from Step 4")
else:
    print(f"📋 Copying from: {FOLDER_PATH}\n")
    if Path(FOLDER_PATH).exists():
        # Quote the path so Drive folders with spaces still work
        !cp -r "{FOLDER_PATH}" /content/konkani
        print("✅ Copied!")
    else:
        print(f"❌ Folder not found at: {FOLDER_PATH}")
        print("\n📝 Please update FOLDER_PATH above with the correct path from Step 4")
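
The same extract-or-copy logic can be written without shell commands, which avoids quoting problems in paths. This is a sketch under assumptions: `fetch_project` is a hypothetical helper, and the `/content` default simply mirrors the cell above.

```python
import shutil
import zipfile
from pathlib import Path

def fetch_project(source, dest_root="/content"):
    """Extract a ZIP archive, or copy a folder, into dest_root.

    Returns dest_root for archives and dest_root/<folder name> for folders.
    """
    source, dest_root = Path(source), Path(dest_root)
    if not source.exists():
        raise FileNotFoundError(f"Not found: {source}")
    if source.is_file():
        if not zipfile.is_zipfile(source):
            raise ValueError(f"Not a ZIP archive: {source}")
        with zipfile.ZipFile(source) as archive:
            archive.extractall(dest_root)  # creates dest_root if needed
        return dest_root
    target = dest_root / source.name
    # dirs_exist_ok lets the cell be re-run without failing (Python 3.8+)
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

For example, `fetch_project(ZIP_PATH)` or `fetch_project(FOLDER_PATH)` would replace the two branches above; note that a copied folder keeps its own name rather than being renamed to `konkani`.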

## Step 6: Navigate to Project Directory

In [None]:
import os

# Try to find the project directory automatically
possible_dirs = [
    '/content/konkani',
    '/content/konkani_project',
    '/content',
]

project_dir = None
for dir_path in possible_dirs:
    check_file = f"{dir_path}/training_scripts/train_konkanivani_asr.py"
    if os.path.exists(check_file):
        project_dir = dir_path
        break

if project_dir:
    print(f"✅ Found project at: {project_dir}\n")
    %cd {project_dir}
    !pwd
    print("\nProject contents:")
    !ls -la
else:
    print("❌ Could not find project directory automatically.")
    print("\nPlease check the extracted folder name:")
    !ls -la /content/
    print("\n📝 Then manually set the path:")
    print("   %cd /content/[your_folder_name]")
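
If the archive extracts to an unexpected folder name, a shallow search for the training script can locate the project instead of a fixed list of candidates. A minimal sketch: `find_project_dir` is a hypothetical helper, and it skips `drive` so the mounted Drive is not crawled.

```python
from pathlib import Path

MARKER = "training_scripts/train_konkanivani_asr.py"

def find_project_dir(root, marker=MARKER, max_depth=2, skip=("drive",)):
    """Breadth-first search under root for a directory that contains marker.

    Looks at root itself and at most max_depth levels below it.
    Returns the matching directory as a Path, or None.
    """
    level = [Path(root)]
    for _ in range(max_depth + 1):
        next_level = []
        for directory in level:
            if (directory / marker).is_file():
                return directory
            try:
                children = sorted(p for p in directory.iterdir() if p.is_dir())
            except OSError:
                continue  # unreadable directory: ignore it
            next_level.extend(c for c in children if c.name not in skip)
        level = next_level
    return None
```

Calling `find_project_dir("/content")` returns the first match, which can then be passed to `%cd`.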

## Step 7: Verify All Required Files

In [None]:
import os

required_files = [
    'training_scripts/train_konkanivani_asr.py',
    'models/konkanivani_asr.py',
    'data/audio_processing/dataset.py',
    'data/audio_processing/text_tokenizer.py',
    'data/vocab.json',
    'data/konkani-asr-v0/splits/manifests/train.json',
    'data/konkani-asr-v0/splits/manifests/val.json',
    'archives/checkpoint_epoch_15.pt'
]

print("Checking required files...\n")
print("="*60)

all_good = True
for file in required_files:
    exists = os.path.exists(file)
    status = "✅" if exists else "❌"
    print(f"{status} {file}")
    if not exists:
        all_good = False

print("="*60)

if all_good:
    print("\n🎉 All files found! Ready to train!")
else:
    print("\n⚠️ Some files are missing. Please check your Drive folder.")
    print("\nMake sure your Drive folder contains:")
    print("  • training_scripts/")
    print("  • models/")
    print("  • data/ (with audio files and manifests)")
    print("  • archives/checkpoint_epoch_15.pt")

## Step 8: Prepare Checkpoint

In [None]:
!mkdir -p checkpoints
!cp archives/checkpoint_epoch_15.pt checkpoints/

print("✅ Checkpoint copied to checkpoints/\n")
!ls -lh checkpoints/

## Step 9: Verify Checkpoint Configuration

In [None]:
import torch
import json

print("📋 Loading checkpoint...\n")
checkpoint = torch.load('checkpoints/checkpoint_epoch_15.pt', map_location='cpu')

print("="*60)
print("CHECKPOINT CONFIGURATION")
print("="*60)
print(json.dumps(checkpoint.get('config', {}), indent=2))

print("\n" + "="*60)
print("MODEL ARCHITECTURE")
print("="*60)

state = checkpoint['model_state_dict']
encoder_layers = sum(1 for k in state.keys() if 'encoder.layers.' in k and '.ff1.0.weight' in k)
decoder_layers = sum(1 for k in state.keys() if 'decoder.decoder.layers.' in k and '.linear1.weight' in k)
d_model = state['encoder.input_proj.weight'].shape[0]
vocab_size = state['ctc_head.weight'].shape[0]

print(f"Encoder layers:  {encoder_layers}")
print(f"Decoder layers:  {decoder_layers}")
print(f"d_model:         {d_model}")
print(f"vocab_size:      {vocab_size}")
print(f"Last epoch:      {checkpoint['epoch']}")
print(f"Val loss:        {checkpoint.get('val_loss', 'N/A')}")

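# Count parameters directly from the stored tensors
# (this also counts any buffers saved in the state dict)
num_params = sum(v.numel() for v in state.values())
print(f"Parameters:      {num_params:,}")

# Drop both references so the checkpoint tensors can be freed
del checkpoint, state
torch.cuda.empty_cache()

print("\n✅ Checkpoint verified!")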
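
The layer counts above rely on one specific weight name per layer. An alternative is to count the distinct layer indices that follow a prefix, which does not depend on any sub-module name. This is a sketch: `count_layers` is a hypothetical helper, and the example keys are made up to resemble the key prefixes used in the cell above.

```python
import re

def count_layers(keys, prefix):
    """Count distinct integer layer indices that directly follow prefix."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)\.")
    indices = set()
    for key in keys:
        match = pattern.search(key)
        if match:
            indices.add(int(match.group(1)))
    return len(indices)

# Hypothetical state-dict keys, for illustration only
example_keys = [
    "encoder.input_proj.weight",
    "encoder.layers.0.ff1.0.weight",
    "encoder.layers.0.ff1.0.bias",
    "encoder.layers.1.ff1.0.weight",
    "decoder.decoder.layers.0.linear1.weight",
]
print(count_layers(example_keys, "encoder.layers."))          # 2
print(count_layers(example_keys, "decoder.decoder.layers."))  # 1
```

With the real checkpoint, `count_layers(state.keys(), "encoder.layers.")` should agree with the count printed above, provided it runs before `state` is deleted.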

## Step 10: Setup Environment & Clear GPU Memory

In [None]:
import os
import torch
import gc

# Set environment variables for memory optimization
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
os.environ['CUDA_LAUNCH_BLOCKING'] = '0'

# Clear GPU memory
gc.collect()
torch.cuda.empty_cache()

print("="*60)
print("GPU STATUS")
print("="*60)

if torch.cuda.is_available():
    print(f"GPU:             {torch.cuda.get_device_name(0)}")
    total_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    allocated = torch.cuda.memory_allocated(0) / 1e9
    reserved = torch.cuda.memory_reserved(0) / 1e9
    # Reserved (cached) memory already includes allocated memory,
    # so subtract it only once
    free = total_mem - reserved
    
    print(f"Total memory:    {total_mem:.2f} GB")
    print(f"Allocated:       {allocated:.2f} GB")
    print(f"Reserved:        {reserved:.2f} GB")
    print(f"Free:            {free:.2f} GB")
    
    print("\n✅ GPU ready for training!")
    
    if total_mem < 14:
        print("\n⚠️ Warning: GPU has less than 14GB. You may need batch_size=1")
else:
    print("❌ CUDA not available!")
    print("\nPlease set GPU: Runtime → Change runtime type → GPU")

## Step 11: 🚀 START TRAINING

### Memory-Optimized Configuration:

| Setting | Value | Purpose |
|---------|-------|----------|
| Batch size | 2 | Fits in 14GB GPU |
| Gradient accumulation | 4 | Effective batch = 8 |
| Mixed precision | FP16 | Saves up to ~50% memory |
| d_model | 256 | From checkpoint |
| Encoder layers | 12 | From checkpoint |
| Decoder layers | 6 | From checkpoint |
| Resume from | Epoch 15 | Continue training |
| Target | Epoch 50 | 35 more epochs |

**Expected time**: ~8-12 hours
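
To judge early on whether the run is on pace, it helps to convert the estimate into a per-epoch budget. A small sketch of that arithmetic, using only the numbers stated above (`minutes_per_epoch` is a hypothetical helper):

```python
def minutes_per_epoch(total_hours, start_epoch, target_epoch):
    """Average minutes available per epoch for the remaining epochs."""
    remaining = target_epoch - start_epoch
    if remaining <= 0:
        raise ValueError("target_epoch must be greater than start_epoch")
    return total_hours * 60 / remaining

low = minutes_per_epoch(8, 15, 50)
high = minutes_per_epoch(12, 15, 50)
print(f"{low:.1f} to {high:.1f} minutes per epoch")
# prints: 13.7 to 20.6 minutes per epoch
```

If the first few epochs take noticeably longer than this, the full run is unlikely to finish within the estimate.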

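**Keep this tab open!** Colab disconnects idle sessions (after roughly 90 minutes of inactivity), and sessions can also end once they reach their maximum lifetime, so back up checkpoints regularly with Step 14.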

In [None]:
print("="*70)
print("🚀 STARTING KONKANIVANI ASR TRAINING")
print("="*70)
print("Configuration:")
print("  • Batch size: 2 (gradient accumulation: 4x)")
print("  • Mixed precision: FP16")
print("  • Model: d_model=256, 12 encoder, 6 decoder layers")
print("  • Resume from: Epoch 15")
print("  • Target: Epoch 50")
print("="*70)
print("\n⏰ Estimated time: 8-12 hours")
print("⚠️  Keep this tab open to prevent disconnection!\n")
print("="*70)
print("\n")

!python3 training_scripts/train_konkanivani_asr.py \
    --train_manifest data/konkani-asr-v0/splits/manifests/train.json \
    --val_manifest data/konkani-asr-v0/splits/manifests/val.json \
    --vocab_file data/vocab.json \
    --batch_size 2 \
    --gradient_accumulation_steps 4 \
    --num_epochs 50 \
    --learning_rate 0.0005 \
    --device cuda \
    --d_model 256 \
    --encoder_layers 12 \
    --decoder_layers 6 \
    --mixed_precision \
    --checkpoint_dir checkpoints \
    --log_dir logs \
    --resume checkpoints/checkpoint_epoch_15.pt

## Step 12: Monitor GPU (Run While Training)

In [None]:
!nvidia-smi

print("\n📊 Expected values during training:")
print("  • GPU Utilization: 80-95%")
print("  • Memory Used: ~7-8 GB / 14 GB")
print("  • Temperature: 60-80°C")

## Step 13: View TensorBoard Logs

In [None]:
%load_ext tensorboard
%tensorboard --logdir logs

## Step 14: Backup to Drive (Run Every Few Hours)

In [None]:
import time

BACKUP_PATH = "/content/drive/MyDrive/konkanivani_backup"
# Compute the timestamp once so checkpoints and logs share the same suffix
STAMP = time.strftime('%Y%m%d_%H%M%S')

print(f"📤 Backing up to: {BACKUP_PATH}")
print(f"   Time: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")

!mkdir -p {BACKUP_PATH}
!cp -r checkpoints {BACKUP_PATH}/checkpoints_{STAMP}
!cp -r logs {BACKUP_PATH}/logs_{STAMP}

print("\n✅ Backup completed!")
!ls -lh {BACKUP_PATH}/
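
For repeated backups, the same steps can be wrapped in a reusable function so every run produces one consistently named snapshot. A minimal sketch: `backup_run` is a hypothetical helper, and it assumes the `checkpoints` and `logs` folders used in this notebook.

```python
import shutil
import time
from pathlib import Path

def backup_run(backup_root, sources=("checkpoints", "logs"), stamp=None):
    """Copy each existing source folder to backup_root/<name>_<stamp>.

    Returns the list of created backup paths; missing sources are skipped.
    """
    stamp = stamp or time.strftime("%Y%m%d_%H%M%S")
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in map(Path, sources):
        if source.is_dir():
            target = backup_root / f"{source.name}_{stamp}"
            shutil.copytree(source, target, dirs_exist_ok=True)
            copied.append(target)
    return copied
```

Calling `backup_run(BACKUP_PATH)` from the project directory performs one backup. Each call copies all checkpoints again, so Drive usage grows with every snapshot.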

## Step 15: Download Best Model (After Training)

In [None]:
from google.colab import files
from pathlib import Path

print("Available checkpoints:\n")
!ls -lh checkpoints/

if Path('checkpoints/best_model.pt').exists():
    print("\n📥 Downloading best_model.pt...")
    files.download('checkpoints/best_model.pt')
    print("✅ Downloaded!")
else:
    print("\n⚠️ best_model.pt not found yet. Training may still be in progress.")

## 🆘 Emergency: Out of Memory

If you get an out-of-memory (OOM) error, run this cell instead of Step 11:

In [None]:
# Clear memory
import torch
import gc
gc.collect()
torch.cuda.empty_cache()

print("🔧 Running with MINIMAL memory settings:")
print("  • Batch size: 1 (reduced from 2)")
print("  • Gradient accumulation: 8 (maintains effective batch = 8)")
print("\n⏰ This will be slower but should fit in memory\n")

!python3 training_scripts/train_konkanivani_asr.py \
    --train_manifest data/konkani-asr-v0/splits/manifests/train.json \
    --val_manifest data/konkani-asr-v0/splits/manifests/val.json \
    --vocab_file data/vocab.json \
    --batch_size 1 \
    --gradient_accumulation_steps 8 \
    --num_epochs 50 \
    --learning_rate 0.0005 \
    --device cuda \
    --d_model 256 \
    --encoder_layers 12 \
    --decoder_layers 6 \
    --mixed_precision \
    --checkpoint_dir checkpoints \
    --log_dir logs \
    --resume checkpoints/checkpoint_epoch_15.pt
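
The fallback keeps the effective batch size at 8 by doubling the accumulation steps when the batch size is halved. The general rule is sketched below (`accumulation_steps` is a hypothetical helper, not part of the training script):

```python
def accumulation_steps(effective_batch, batch_size):
    """Accumulation steps needed so batch_size * steps == effective_batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if effective_batch % batch_size != 0:
        raise ValueError("effective_batch must be a multiple of batch_size")
    return effective_batch // batch_size

for batch_size in (2, 1):
    steps = accumulation_steps(8, batch_size)
    print(f"--batch_size {batch_size} --gradient_accumulation_steps {steps}")
```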

---

## 📚 Quick Reference

### Check GPU
```python
!nvidia-smi
```

### Clear Memory
```python
import torch
torch.cuda.empty_cache()
```

### List Checkpoints
```python
!ls -lh checkpoints/
```

### Resume from Different Checkpoint
```python
# In Step 11, change:
--resume checkpoints/checkpoint_epoch_20.pt
```
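
To pick the most recent checkpoint automatically, the epoch number can be parsed from the file name. This sketch assumes the training script saves files as `checkpoint_epoch_<N>.pt`, as in the examples above; `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path

def latest_checkpoint(checkpoint_dir="checkpoints"):
    """Return (epoch, path) for the highest-numbered checkpoint, or None."""
    pattern = re.compile(r"checkpoint_epoch_(\d+)\.pt$")
    best = None
    for path in Path(checkpoint_dir).glob("checkpoint_epoch_*.pt"):
        match = pattern.search(path.name)
        if match:
            epoch = int(match.group(1))
            # Compare numerically: as strings, "5" would sort after "20"
            if best is None or epoch > best[0]:
                best = (epoch, path)
    return best

found = latest_checkpoint()
if found:
    print(f"--resume {found[1]}")
else:
    print("No checkpoint_epoch_*.pt files found")
```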

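### Check Training Progress
```python
# TensorBoard event files are binary, so `tail` prints unreadable output.
# List the log files here, then view the curves with Step 13.
!ls -lh logs/
```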

---

## ✅ Success Indicators

- Training starts without OOM error
- GPU memory stable at ~7-8 GB
- Batch processing: 2-3 batches/sec
- Loss decreasing over epochs
- Checkpoints saving every 5 epochs

---

## 🎯 After Training

1. Run Step 14 to backup everything
2. Run Step 15 to download best model
3. Evaluate on test set
4. Deploy for inference

Good luck! 🚀

---