# KonkaniVani ASR Training - Kaggle
## Resume from Checkpoint 15 - Optimized for P100/T4

**Why Kaggle?**
- ✅ 30 hours GPU per week (vs Colab's 12h max)
- ✅ P100 GPU with 16GB memory (more than T4's 15GB)
- ✅ Auto-saves outputs (checkpoints persist)
- ✅ Can restart and continue training
- ✅ Better for long training runs

**Configuration:**
- Model: d_model=256, 12 encoder, 6 decoder layers
- Batch size: 4 on P100 (16GB) or 2 on T4 (15GB)
- Gradient accumulation: 2x or 4x
- Mixed precision: FP16
- Resume from: Epoch 15
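
The two batch-size and accumulation pairings above are meant to yield the same effective batch size. A minimal sketch of that arithmetic follows; `CONFIGS` and `effective_batch_size` are illustrative names, not part of the project code.

```python
# Per-GPU settings taken from the configuration list above.
CONFIGS = {
    "P100": {"batch_size": 4, "grad_accum": 2},
    "T4": {"batch_size": 2, "grad_accum": 4},
}


def effective_batch_size(batch_size: int, grad_accum: int) -> int:
    """Number of samples that contribute to each optimizer step."""
    return batch_size * grad_accum


for gpu, cfg in CONFIGS.items():
    print(f"{gpu}: effective batch size = {effective_batch_size(**cfg)}")
```

Both settings give 8 samples per optimizer step, so switching GPUs should not change the optimization schedule, only the memory footprint per forward pass.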

---

## Setup Instructions

### Before Running:

1. **Upload Dataset**:
   - Go to Kaggle → Datasets → New Dataset
   - Upload your `konkani_project.zip`
   - Make it private
   - Note the dataset name (e.g., `yourusername/konkani-asr`)

2. **Create Notebook**:
   - New Notebook
   - Settings → Accelerator → GPU P100 (or T4)
   - Settings → Internet → ON (for setup only)
   - Add your dataset to notebook

3. **Update Step 2**:
   - Change `DATASET_PATH` to your dataset location
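
As a rough guide to what `DATASET_PATH` should look like, the sketch below derives it from the dataset name noted in step 1. It assumes the `/kaggle/input/[your-dataset-name]/` mount convention described in the Step 2 comments; `kaggle_input_path` is a hypothetical helper, so confirm the real path with `ls /kaggle/input/` in your session.

```python
from pathlib import PurePosixPath


def kaggle_input_path(dataset_ref: str, root: str = "/kaggle/input") -> str:
    """Map 'username/dataset-slug' (or just 'dataset-slug') to its assumed mount path."""
    slug = dataset_ref.strip("/").split("/")[-1]
    return str(PurePosixPath(root) / slug)


print(kaggle_input_path("yourusername/konkani-asr"))
```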

---

## Step 1: Check GPU

In [None]:
!nvidia-smi

import torch
print(f"\n{'='*60}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
print(f"{'='*60}")

# Determine batch size based on GPU
gpu_name = torch.cuda.get_device_name(0)
if 'P100' in gpu_name:
    print("\n✅ P100 detected! Using batch_size=4")
    BATCH_SIZE = 4
    GRAD_ACCUM = 2
elif 'T4' in gpu_name:
    print("\n✅ T4 detected! Using batch_size=2")
    BATCH_SIZE = 2
    GRAD_ACCUM = 4
else:
    print(f"\n⚠️ Unknown GPU: {gpu_name}. Using conservative batch_size=2")
    BATCH_SIZE = 2
    GRAD_ACCUM = 4

print(f"Effective batch size: {BATCH_SIZE * GRAD_ACCUM}")

## Step 2: Extract Dataset

In [None]:
import os
from pathlib import Path

# ============================================
# UPDATE THIS PATH
# ============================================
# After adding your dataset to the notebook, it will be in:
# /kaggle/input/[your-dataset-name]/

DATASET_PATH = "/kaggle/input/konkani-asr-complete-dataset"

# ============================================

print(f"Looking for dataset at: {DATASET_PATH}\n")

if os.path.exists(DATASET_PATH):
    print("✅ Dataset found!\n")
    print("Contents:")
    !ls -la {DATASET_PATH}
    
    # Find zip file
    zip_files = list(Path(DATASET_PATH).glob('*.zip'))
    if zip_files:
        zip_file = zip_files[0]
        print(f"\n📦 Extracting: {zip_file}")
        # -o overwrites existing files without prompting, so re-running this cell does not hang
        !unzip -q -o {zip_file} -d /kaggle/working/
        print("✅ Extracted!")
    else:
        print("\n📋 Copying files...")
        !cp -r {DATASET_PATH}/* /kaggle/working/
        print("✅ Copied!")
else:
    print("❌ Dataset not found!")
    print("\nAvailable datasets:")
    !ls -la /kaggle/input/
    print("\n📝 Please:")
    print("   1. Add your dataset to this notebook (click Add Data)")
    print("   2. Search for: stavin12/stavinkonkani-asr")
    print("   3. Update DATASET_PATH above if needed")
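
If the archive was created on macOS, it may contain `__MACOSX/` folders and `._*` metadata files (the same `._` files that Step 10 filters out). The sketch below is one possible alternative to `unzip` that skips those entries using the standard `zipfile` module; `extract_clean` is a hypothetical helper, not part of the project.

```python
import zipfile
from pathlib import Path


def extract_clean(zip_path, dest):
    """Extract zip_path into dest, skipping macOS metadata entries.

    Returns the list of archive member names that were extracted.
    """
    dest = Path(dest)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            member = Path(info.filename)
            # Skip the __MACOSX resource-fork tree and AppleDouble "._" files.
            if "__MACOSX" in member.parts or member.name.startswith("._"):
                continue
            zf.extract(info, dest)
            extracted.append(info.filename)
    return extracted


# Example (paths are placeholders):
# extract_clean("/kaggle/input/konkani-asr/konkani_project.zip", "/kaggle/working")
```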

## Step 3: Navigate to Project

In [None]:
import os

# Find project directory
possible_dirs = [
    '/kaggle/working/kaggle_package',
    '/kaggle/working/kaggle_minimal',
    '/kaggle/working/konkani',
    '/kaggle/working/konkani_project',
    '/kaggle/working',
]

project_dir = None
for dir_path in possible_dirs:
    if os.path.exists(f"{dir_path}/training_scripts/train_konkanivani_asr.py"):
        project_dir = dir_path
        break

if project_dir:
    print(f"✅ Found project at: {project_dir}\n")
    os.chdir(project_dir)
    !pwd
    print("\nContents:")
    !ls -la
else:
    print("❌ Project not found. Checking /kaggle/working:")
    !ls -la /kaggle/working/
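
If the archive unpacks into a folder that is not in `possible_dirs`, a recursive search for the training script can locate the project root instead. This is a sketch under the assumption that the root is the directory containing `training_scripts/train_konkanivani_asr.py`; `find_project_dir` is a hypothetical helper.

```python
from pathlib import Path

# Relative path that identifies the project root (taken from the check above).
MARKER = Path("training_scripts") / "train_konkanivani_asr.py"


def find_project_dir(root):
    """Return the shallowest directory under root that contains MARKER, or None."""
    root = Path(root)
    # Shallowest matches first, so a top-level project wins over nested copies.
    matches = sorted(root.rglob(MARKER.name), key=lambda p: len(p.parts))
    for script in matches:
        if script.parent.name == MARKER.parent.name:
            return script.parent.parent
    return None


# Example: project_dir = find_project_dir("/kaggle/working")
```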

## Step 4: Install Dependencies

In [None]:
# Kaggle has most packages pre-installed, but let's ensure we have everything
!pip install -q librosa soundfile tensorboard
print("✅ Dependencies ready!")

## Step 5: Verify Files

In [None]:
import os

required_files = [
    'training_scripts/train_konkanivani_asr.py',
    'models/konkanivani_asr.py',
    'data/audio_processing/dataset.py',
    'data/audio_processing/text_tokenizer.py',
    'data/vocab.json',
    'data/konkani-asr-v0/splits/manifests/train.json',
    'data/konkani-asr-v0/splits/manifests/val.json',
    'archives/checkpoint_epoch_15.pt'
]

print("Checking required files...\n")
print("="*60)

all_good = True
for file in required_files:
    exists = os.path.exists(file)
    status = "✅" if exists else "❌"
    print(f"{status} {file}")
    if not exists:
        all_good = False

print("="*60)

if all_good:
    print("\n🎉 All files found! Ready to train!")
else:
    print("\n⚠️ Some files are missing.")

## Step 6: Prepare Checkpoint

In [None]:
!mkdir -p checkpoints
!cp archives/checkpoint_epoch_15.pt checkpoints/

print("✅ Checkpoint ready\n")
!ls -lh checkpoints/

## Step 7: Verify Checkpoint

In [None]:
import torch
import json

checkpoint = torch.load('checkpoints/checkpoint_epoch_15.pt', map_location='cpu')

print("="*60)
print("CHECKPOINT CONFIGURATION")
print("="*60)
print(json.dumps(checkpoint.get('config', {}), indent=2))

print("\n" + "="*60)
print("MODEL ARCHITECTURE")
print("="*60)

state = checkpoint['model_state_dict']
encoder_layers = sum(1 for k in state.keys() if 'encoder.layers.' in k and '.ff1.0.weight' in k)
decoder_layers = sum(1 for k in state.keys() if 'decoder.decoder.layers.' in k and '.linear1.weight' in k)
d_model = state['encoder.input_proj.weight'].shape[0]
vocab_size = state['ctc_head.weight'].shape[0]

print(f"Encoder layers:  {encoder_layers}")
print(f"Decoder layers:  {decoder_layers}")
print(f"d_model:         {d_model}")
print(f"vocab_size:      {vocab_size}")
print(f"Last epoch:      {checkpoint['epoch']}")
print(f"Val loss:        {checkpoint.get('val_loss', 'N/A')}")

del checkpoint
torch.cuda.empty_cache()

print("\n✅ Checkpoint verified!")
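
The layer counts above rely on each layer having exactly one key with a particular suffix. An alternative that does not depend on the suffix is to count the distinct layer indices that appear after a prefix. The sketch below works on any iterable of key names; `count_layers` and the sample keys are illustrative, not taken from the real checkpoint.

```python
import re


def count_layers(keys, prefix):
    """Count distinct integer layer indices that follow `prefix` in the key names."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)\.")
    indices = set()
    for key in keys:
        match = pattern.search(key)
        if match:
            indices.add(int(match.group(1)))
    return len(indices)


# Hypothetical key names in the style of the state dict inspected above.
sample_keys = [
    "encoder.layers.0.ff1.0.weight",
    "encoder.layers.0.ff1.0.bias",
    "encoder.layers.1.ff1.0.weight",
    "decoder.decoder.layers.0.linear1.weight",
]
print(count_layers(sample_keys, "encoder.layers."))
print(count_layers(sample_keys, "decoder.decoder.layers."))
```

With a loaded checkpoint this could be called as `count_layers(state.keys(), 'encoder.layers.')`.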

## Step 8: Setup Environment

In [None]:
import os
import torch
import gc

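# Environment variables
# These are inherited by the training subprocess started in Step 13. They may not
# take effect in this kernel if CUDA was already initialized by an earlier cell.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
os.environ['CUDA_LAUNCH_BLOCKING'] = '0'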

# Clear memory
gc.collect()
torch.cuda.empty_cache()

print("="*60)
print("GPU STATUS")
print("="*60)
print(f"GPU:             {torch.cuda.get_device_name(0)}")
print(f"Total memory:    {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
print(f"Allocated:       {torch.cuda.memory_allocated(0) / 1e9:.2f} GB")
print(f"Cached:          {torch.cuda.memory_reserved(0) / 1e9:.2f} GB")
print("\n✅ Ready to train!")

## Step 9: Check Audio Files

In [None]:
import os
from pathlib import Path

# Check if audio files exist
audio_dir = Path('data/konkani-asr-v0/data/processed_segments_diarized/audio_segments')

if audio_dir.exists():
    audio_files = list(audio_dir.glob('*.wav'))
    print(f"✅ Found {len(audio_files)} audio files")
    if audio_files:
        print(f"   First file: {audio_files[0].name}")
        print(f"   Last file: {audio_files[-1].name}")
else:
    print(f"❌ Audio directory not found: {audio_dir}")
    print("\nLooking for audio files...")
    !find data -name '*.wav' -type f | head -20

## Step 10: Fix Manifest Paths

The manifest files contain absolute paths from your local machine. We need to update them to Kaggle paths.

In [None]:
import json
import os
from pathlib import Path

def fix_manifest_paths(manifest_path):
    """Update absolute paths in manifest to relative paths"""
    print(f"Fixing paths in: {manifest_path}")
    
    # Read JSONL format (one JSON object per line)
    data = []
    with open(manifest_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))
    
    fixed_count = 0
    missing_count = 0
    valid_data = []
    
    for item in data:
        old_path = item['audio_filepath']
        
        # Extract just the filename (e.g., segment_000001.wav)
        filename = Path(old_path).name
        
        # New path: data/konkani-asr-v0/audio/segment_XXXXX.wav
        new_path = f"data/konkani-asr-v0/audio/{filename}"
        item['audio_filepath'] = new_path
        fixed_count += 1
        
        # Check if file exists (skip ._ files which are macOS metadata)
        if not filename.startswith('._') and os.path.exists(new_path):
            valid_data.append(item)
        else:
            missing_count += 1
    
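    # Refuse to overwrite the manifest if no audio file was found: that usually
    # means the audio directory above is wrong, not that every sample is missing.
    if not valid_data:
        raise FileNotFoundError(
            f"No audio files found for {manifest_path}; manifest left unchanged"
        )
    
    # Save back in JSONL format (only valid files)
    with open(manifest_path, 'w', encoding='utf-8') as f:
        for item in valid_data:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    
    print(f"  ✅ Fixed {fixed_count} paths")
    print(f"  ✅ Kept {len(valid_data)} valid files")
    if missing_count > 0:
        print(f"  ⚠️  Removed {missing_count} missing files")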
    
    # Show example
    if valid_data:
        print(f"  Example: {valid_data[0]['audio_filepath']}")
    
    return len(valid_data)

# Fix both manifests
print("="*60)
train_count = fix_manifest_paths('data/konkani-asr-v0/splits/manifests/train.json')
val_count = fix_manifest_paths('data/konkani-asr-v0/splits/manifests/val.json')
print("="*60)
print(f"\n✅ Manifest paths fixed!")
print(f"   Training samples: {train_count}")
print(f"   Validation samples: {val_count}")
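
Because the cell above rewrites the manifests in place, it can be useful to preview how many entries would survive before running it. The sketch below only reads the manifest and reports counts. It assumes the same JSONL layout and `audio_filepath` field used above, and POSIX-style paths in the manifest; `preview_manifest_fix` is a hypothetical helper.

```python
import json
from pathlib import Path


def preview_manifest_fix(manifest_path, audio_dir):
    """Return (kept, dropped) counts without modifying the manifest."""
    audio_dir = Path(audio_dir)
    kept = dropped = 0
    with open(manifest_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            # Keep only the filename, as the path-fixing step does.
            name = Path(json.loads(line)["audio_filepath"]).name
            if not name.startswith("._") and (audio_dir / name).exists():
                kept += 1
            else:
                dropped += 1
    return kept, dropped


# Example:
# preview_manifest_fix("data/konkani-asr-v0/splits/manifests/train.json",
#                      "data/konkani-asr-v0/audio")
```

If `kept` comes back as 0, the audio directory is probably wrong, and it is worth comparing it against the directory checked in Step 9 before rewriting anything.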

## Step 11: Enable Multi-GPU Training (Optional)

If you selected GPU T4 x2, this will enable both GPUs for faster training!

In [None]:
import torch

# Check available GPUs
gpu_count = torch.cuda.device_count()
print(f"Available GPUs: {gpu_count}")

if gpu_count > 1:
    print(f"\n✅ Multi-GPU detected! Will use {gpu_count} GPUs")
    print("   This will speed up training significantly!")
    
    for i in range(gpu_count):
        print(f"   GPU {i}: {torch.cuda.get_device_name(i)}")
    
    # Patch the training script to use DataParallel
    with open('training_scripts/train_konkanivani_asr.py', 'r') as f:
        script = f.read()
    
    # Add DataParallel wrapper after model.to(device)
    target = 'self.model = model.to(device)'
    multi_gpu_ready = True
    if 'torch.nn.DataParallel' in script:
        print("\n✅ Multi-GPU already enabled")
    elif target not in script:
        # str.replace would silently do nothing here, so report it instead
        print(f"\n⚠️ Could not find '{target}' in the training script; multi-GPU not enabled")
        multi_gpu_ready = False
    else:
        script = script.replace(
            target,
            '''self.model = model.to(device)
        # Enable multi-GPU training
        if torch.cuda.device_count() > 1:
            print(f"Using {torch.cuda.device_count()} GPUs for training!")
            self.model = torch.nn.DataParallel(self.model)'''
        )
        
        with open('training_scripts/train_konkanivani_asr.py', 'w') as f:
            f.write(script)
        
        print("\n✅ Multi-GPU training enabled!")
    
    # Increase batch size for multi-GPU (only if the script was actually patched)
    if multi_gpu_ready:
        BATCH_SIZE = BATCH_SIZE * gpu_count
        print(f"\n📈 Increased batch size to {BATCH_SIZE} (x{gpu_count})")
        print(f"   Effective batch size: {BATCH_SIZE * GRAD_ACCUM}")
    
    
elif gpu_count == 1:
    print("\n✅ Single GPU mode")
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
else:
    print("\n⚠️  No GPU detected!")
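
One side effect of `torch.nn.DataParallel` is that the wrapped model's `state_dict()` keys gain a `module.` prefix. Whether the training script already handles this when saving or resuming is not known from this notebook, so the sketch below is only a precaution: a hypothetical helper that strips the prefix so a checkpoint saved from a wrapped model can be loaded into an unwrapped one.

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Plain values stand in for tensors in this illustration.
wrapped = {"module.encoder.input_proj.weight": 1, "module.ctc_head.weight": 2}
print(strip_module_prefix(wrapped))
```

Keys without the prefix pass through unchanged, so the helper can be applied to any checkpoint regardless of how it was saved.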

## Step 12: Add Project to Python Path

In [None]:
import sys
import os

# Add project directory to Python path
project_dir = os.getcwd()
if project_dir not in sys.path:
    sys.path.insert(0, project_dir)
    print(f"✅ Added to Python path: {project_dir}")

# Verify imports work
try:
    from models.konkanivani_asr import create_konkanivani_model
    from data.audio_processing.dataset import KonkaniASRDataset
    print("✅ All imports working!")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print(f"\nCurrent directory: {os.getcwd()}")
    print(f"Python path: {sys.path[:3]}")

## Step 13: 🚀 START TRAINING

### Kaggle Advantages:
- **30 hours GPU/week** (vs Colab's 12h)
- **P100 with 16GB** (can use larger batch size)
- **Auto-saves outputs** (checkpoints persist)
- **Can restart** and continue training

### Configuration:
- Batch size: Auto-detected based on GPU
- Mixed precision: FP16
- Resume from: Epoch 15
- Target: Epoch 50

**Note**: You can turn OFF internet after this cell starts to save your 9h/week quota!

In [None]:
print("="*70)
print("🚀 STARTING KONKANIVANI ASR TRAINING ON KAGGLE")
print("="*70)
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Gradient accumulation: {GRAD_ACCUM}x")
print(f"Effective batch size: {BATCH_SIZE * GRAD_ACCUM}")
print(f"Mixed precision: FP16")
print(f"Resume from: Epoch 15")
print(f"Target: Epoch 50")
print("="*70)
print("\n💡 TIP: Turn OFF internet now to save your quota!")
print("   (Settings → Internet → OFF)\n")
print("="*70)
print("\n")

# Set PYTHONPATH so the training script can find modules
import os
os.environ['PYTHONPATH'] = os.getcwd()

!PYTHONPATH={os.getcwd()} python3 training_scripts/train_konkanivani_asr.py \
    --train_manifest data/konkani-asr-v0/splits/manifests/train.json \
    --val_manifest data/konkani-asr-v0/splits/manifests/val.json \
    --vocab_file data/vocab.json \
    --batch_size {BATCH_SIZE} \
    --gradient_accumulation_steps {GRAD_ACCUM} \
    --num_epochs 50 \
    --learning_rate 0.0005 \
    --device cuda \
    --d_model 256 \
    --encoder_layers 12 \
    --decoder_layers 6 \
    --mixed_precision \
    --checkpoint_dir checkpoints \
    --log_dir logs \
    --resume checkpoints/checkpoint_epoch_15.pt
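
The same command can also be assembled as a Python argument list, which avoids shell interpolation and makes the resume path easy to change between sessions. This is a sketch that reuses only the flags shown above; `build_train_command` is a hypothetical helper, not part of the training script.

```python
import shlex
import sys


def build_train_command(batch_size, grad_accum, resume_path, num_epochs=50):
    """Return the training command from this step as an argument list."""
    return [
        sys.executable, "training_scripts/train_konkanivani_asr.py",
        "--train_manifest", "data/konkani-asr-v0/splits/manifests/train.json",
        "--val_manifest", "data/konkani-asr-v0/splits/manifests/val.json",
        "--vocab_file", "data/vocab.json",
        "--batch_size", str(batch_size),
        "--gradient_accumulation_steps", str(grad_accum),
        "--num_epochs", str(num_epochs),
        "--learning_rate", "0.0005",
        "--device", "cuda",
        "--d_model", "256",
        "--encoder_layers", "12",
        "--decoder_layers", "6",
        "--mixed_precision",
        "--checkpoint_dir", "checkpoints",
        "--log_dir", "logs",
        "--resume", str(resume_path),
    ]


cmd = build_train_command(4, 2, "checkpoints/checkpoint_epoch_15.pt")
print(shlex.join(cmd))
# To launch from Python instead of `!`:
# subprocess.run(cmd, check=True, env={**os.environ, "PYTHONPATH": os.getcwd()})
```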

## Step 14: Monitor GPU

In [None]:
!nvidia-smi

## Step 15: Check Training Progress

In [None]:
# List checkpoints
print("Saved checkpoints:\n")
!ls -lh checkpoints/

# Check latest checkpoint
import torch
from pathlib import Path

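# Sort numerically by epoch (assumes names like checkpoint_epoch_<N>.pt);
# a plain string sort would place epoch 9 after epoch 15.
checkpoints = sorted(
    Path('checkpoints').glob('checkpoint_epoch_*.pt'),
    key=lambda p: int(p.stem.rsplit('_', 1)[-1]),
)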
if checkpoints:
    latest = checkpoints[-1]
    ckpt = torch.load(latest, map_location='cpu')
    print(f"\nLatest checkpoint: {latest.name}")
    print(f"Epoch: {ckpt['epoch']}")
    print(f"Val loss: {ckpt.get('val_loss', 'N/A')}")
else:
    print("\nNo checkpoints saved yet.")

## Step 16: Download Checkpoints

**Note**: Kaggle auto-saves outputs, but you can also download manually

In [None]:
# Checkpoints are automatically saved to /kaggle/working/checkpoints/
# They will be available in the "Output" tab after the notebook finishes

print("📦 Checkpoints will be available in the Output tab")
print("\nTo download:")
print("  1. Wait for training to complete (or stop notebook)")
print("  2. Go to Output tab")
print("  3. Download checkpoints/ folder")
print("\nCurrent checkpoints:")
!ls -lh checkpoints/

---

## 🔄 To Continue Training After Session Ends

Kaggle sessions last 12 hours. To continue:

1. **Download checkpoints** from Output tab
2. **Upload to your dataset** (update it with latest checkpoint)
3. **Start new session** and run this notebook again
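4. **Point the notebook at the new checkpoint**: Steps 5, 6, 7 and 13 hard-code `checkpoint_epoch_15.pt`, so as written the run restarts from epoch 15 unless you replace that filename with your latest checkpoint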
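
To find which checkpoint to resume from, one option is to pick the highest epoch number present in the checkpoints directory. A minimal sketch, assuming the `checkpoint_epoch_<N>.pt` naming used in this notebook; `latest_checkpoint` is a hypothetical helper.

```python
import re
from pathlib import Path

EPOCH_RE = re.compile(r"checkpoint_epoch_(\d+)\.pt$")


def latest_checkpoint(checkpoint_dir):
    """Return (epoch, path) for the highest-numbered checkpoint, or None if there is none."""
    best = None
    for path in Path(checkpoint_dir).glob("checkpoint_epoch_*.pt"):
        match = EPOCH_RE.search(path.name)
        if match is None:
            continue
        epoch = int(match.group(1))
        # Compare epochs as integers so that 9 < 15 < 100.
        if best is None or epoch > best[0]:
            best = (epoch, path)
    return best


# Example:
# found = latest_checkpoint("checkpoints")
# if found:
#     epoch, path = found
#     print(f"Resume with: --resume {path}")
```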

---

## 📊 Kaggle vs Colab Summary

### Kaggle Wins:
- ✅ 30h GPU/week (vs 12h)
- ✅ P100 with 16GB (vs T4 with 15GB)
- ✅ Auto-saves outputs
- ✅ Better for long training

### Colab Wins:
- ✅ 90min idle timeout (vs 20min)
- ✅ Always-on internet
- ✅ More storage (100GB vs 20GB)

### Recommendation:
**Use Kaggle** for this training! The 30h/week and P100 GPU make it ideal for your 8-12 hour training run.

---

## 💡 Kaggle Tips

1. **Turn off internet** after setup to save quota (9h/week)
2. **Enable GPU** in notebook settings (P100 preferred)
3. **Save outputs** - Kaggle auto-saves /kaggle/working/
4. **Use datasets** - Upload once, use many times
5. **Monitor quota** - Check GPU hours remaining

---