# KonkaniVani ASR Training - Kaggle (OPTIMIZED)
## Resume from Checkpoint 15 - Fixed Overfitting

**🔥 NEW: Optimized to reduce overfitting!**
- ✅ Reduced learning rate (0.0005 → 0.0001)
- ✅ Increased dropout (0.1 → 0.2)
- ✅ Stronger weight decay (0.000001 → 0.0001)
- ✅ Better CTC/Attention balance (0.3 → 0.5)

**Why These Changes?**
At Epoch 15:
- Train Loss: 5.32
- Val Loss: 9.59 (about 1.8x the train loss, which indicates overfitting)

These settings will help the model generalize better.
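
The gap can be expressed as a validation-to-train loss ratio, which is the same quantity Step 10 checks on the latest checkpoint. A minimal sketch using the epoch 15 figures above; the helper name `loss_ratio` is illustrative and not part of the training script:

```python
def loss_ratio(train_loss: float, val_loss: float) -> float:
    """Return val_loss / train_loss; values well above 1 suggest overfitting."""
    if train_loss <= 0:
        raise ValueError("train_loss must be positive")
    return val_loss / train_loss

# Epoch 15 losses quoted above
ratio = loss_ratio(train_loss=5.32, val_loss=9.59)
print(f"Val/Train ratio: {ratio:.2f}")  # prints "Val/Train ratio: 1.80"
```

A ratio of about 1.80 falls just past the 1.8 cutoff that Step 10 reports as high overfitting.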

---

## Step 1: Check GPU

In [None]:
!nvidia-smi

import torch
print(f"\n{'='*60}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Total Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
print(f"{'='*60}")

# Determine batch size based on GPU
gpu_name = torch.cuda.get_device_name(0)
if 'P100' in gpu_name:
    print("\n✅ P100 detected! Using batch_size=4")
    BATCH_SIZE = 4
    GRAD_ACCUM = 2
elif 'T4' in gpu_name:
    print("\n✅ T4 detected! Using batch_size=2")
    BATCH_SIZE = 2
    GRAD_ACCUM = 4
else:
    print(f"\n⚠️ Unknown GPU: {gpu_name}. Using conservative batch_size=2")
    BATCH_SIZE = 2
    GRAD_ACCUM = 4

print(f"Effective batch size: {BATCH_SIZE * GRAD_ACCUM}")
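
Every branch above yields an effective batch size of 8 (4 × 2 or 2 × 4), so the optimizer sees the same number of samples per update on either GPU. The selection logic can be sketched as a pure function that also runs on a machine without a GPU. `pick_batch_config` is a hypothetical helper name, the device-name strings in the loop are only examples, and the returned values simply mirror the cell above:

```python
from typing import Tuple

def pick_batch_config(gpu_name: str) -> Tuple[int, int]:
    """Map a GPU name to (batch_size, grad_accum_steps), mirroring the cell above."""
    if "P100" in gpu_name:
        return 4, 2
    # T4 and unrecognized GPUs share the conservative setting
    return 2, 4

# Example device names (assumed for illustration)
for name in ("Tesla P100-PCIE-16GB", "Tesla T4", "Some Other GPU"):
    batch_size, grad_accum = pick_batch_config(name)
    print(f"{name}: batch_size={batch_size}, grad_accum={grad_accum}, "
          f"effective={batch_size * grad_accum}")
```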

## Step 2: Extract Dataset

In [None]:
import os
from pathlib import Path

# UPDATE THIS PATH to your dataset
DATASET_PATH = "/kaggle/input/konkani-asr-complete-dataset"

print(f"Looking for dataset at: {DATASET_PATH}\n")

if os.path.exists(DATASET_PATH):
    print("✅ Dataset found!\n")
    print("Contents:")
    !ls -la {DATASET_PATH}
    
    zip_files = list(Path(DATASET_PATH).glob('*.zip'))
    if zip_files:
        zip_file = zip_files[0]
        print(f"\n📦 Extracting: {zip_file}")
        !unzip -q {zip_file} -d /kaggle/working/
        print("✅ Extracted!")
    else:
        print("\n📋 Copying files...")
        !cp -r {DATASET_PATH}/* /kaggle/working/
        print("✅ Copied!")
else:
    print("❌ Dataset not found!")
    print("\nAvailable datasets:")
    !ls -la /kaggle/input/
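
The cell above prints a success message even when `unzip` or `cp` fails, because the `print` call runs regardless of the shell command's exit status. A minimal pure-Python alternative, assuming the same layout (use the first `*.zip` in the dataset directory, otherwise copy everything), raises an exception instead. `extract_or_copy` is a hypothetical helper, not part of the project:

```python
import shutil
import zipfile
from pathlib import Path

def extract_or_copy(dataset_dir: str, dest_dir: str) -> str:
    """Extract the first .zip in dataset_dir into dest_dir, or copy the whole tree."""
    src, dest = Path(dataset_dir), Path(dest_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"Dataset not found: {src}")
    dest.mkdir(parents=True, exist_ok=True)

    zip_files = sorted(src.glob("*.zip"))  # sorted so the choice is deterministic
    if zip_files:
        # ZipFile raises zipfile.BadZipFile if the archive is invalid
        with zipfile.ZipFile(zip_files[0]) as zf:
            zf.extractall(dest)
        return "extracted"

    # No archive: copy the dataset contents as-is (dirs_exist_ok needs Python 3.8+)
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return "copied"
```

In the notebook this would be called as `extract_or_copy(DATASET_PATH, "/kaggle/working")`.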

## Step 3: Navigate to Project

In [None]:
import os

possible_dirs = [
    '/kaggle/working/kaggle_package',
    '/kaggle/working/kaggle_minimal',
    '/kaggle/working/konkani',
    '/kaggle/working',
]

project_dir = None
for dir_path in possible_dirs:
    if os.path.exists(f"{dir_path}/training_scripts/train_konkanivani_asr.py"):
        project_dir = dir_path
        break

if project_dir:
    print(f"✅ Found project at: {project_dir}\n")
    os.chdir(project_dir)
    !pwd
else:
    print("❌ Project not found")
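
If the project is not found, the cell above only prints a message, and later cells then fail with less obvious errors. One option is to isolate the search in a function and stop early when it returns nothing. This is a sketch that assumes the training script path is the only marker needed; `find_project_dir` and `TRAIN_SCRIPT` are illustrative names:

```python
from pathlib import Path
from typing import Iterable, Optional

# Relative path that identifies the project root (taken from the cell above)
TRAIN_SCRIPT = Path("training_scripts") / "train_konkanivani_asr.py"

def find_project_dir(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate directory that contains the training script."""
    for candidate in candidates:
        root = Path(candidate)
        if (root / TRAIN_SCRIPT).is_file():
            return root
    return None
```

In the notebook, `find_project_dir(possible_dirs)` returning `None` could be turned into a `FileNotFoundError` so that execution stops at this step.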

## Step 4: Install Dependencies

In [None]:
!pip install -q librosa soundfile tensorboard
print("✅ Dependencies ready!")

## Step 5: Prepare Checkpoint

In [None]:
!mkdir -p checkpoints
!cp archives/checkpoint_epoch_15.pt checkpoints/
print("✅ Checkpoint ready\n")
!ls -lh checkpoints/
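
The `cp` command above does not stop the notebook if `archives/checkpoint_epoch_15.pt` is missing, so the "Checkpoint ready" message can be misleading. A hedged sketch of the same step with an explicit check; `stage_checkpoint` is a hypothetical helper name:

```python
import shutil
from pathlib import Path

def stage_checkpoint(src: str, dest_dir: str = "checkpoints") -> Path:
    """Copy a checkpoint file into dest_dir and return the path of the copy."""
    src_path = Path(src)
    if not src_path.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {src_path}")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # copy2 returns the destination file path when dest is a directory
    return Path(shutil.copy2(src_path, dest))
```

Calling `stage_checkpoint("archives/checkpoint_epoch_15.pt")` from the project directory would reproduce the cell above while failing loudly on a missing file.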

## Step 6: Fix Manifest Paths

In [None]:
import json
import os
from pathlib import Path

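def fix_manifest_paths(manifest_path):
    """Rewrite audio paths in a JSON-lines manifest and drop entries with no audio file."""
    print(f"Fixing paths in: {manifest_path}")
    
    # Read one JSON object per non-empty line; UTF-8 is explicit for non-ASCII transcripts
    data = []
    with open(manifest_path, 'r', encoding='utf-8') as f:
        for line in f:
            if line.strip():
                data.append(json.loads(line))
    
    valid_data = []
    for item in data:
        # Keep only the file name and point it at the extracted audio directory
        filename = Path(item['audio_filepath']).name
        new_path = f"data/konkani-asr-v0/audio/{filename}"
        item['audio_filepath'] = new_path
        
        # Skip macOS '._*' sidecar files and entries whose audio is not on disk
        if not filename.startswith('._') and os.path.exists(new_path):
            valid_data.append(item)
    
    # Overwrite the manifest in place with the surviving entries
    with open(manifest_path, 'w', encoding='utf-8') as f:
        for item in valid_data:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    
    print(f"  ✅ Kept {len(valid_data)} valid files")
    return len(valid_data)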

print("="*60)
train_count = fix_manifest_paths('data/konkani-asr-v0/splits/manifests/train.json')
val_count = fix_manifest_paths('data/konkani-asr-v0/splits/manifests/val.json')
print("="*60)
print(f"\n✅ Training samples: {train_count}")
print(f"✅ Validation samples: {val_count}")
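
Because `fix_manifest_paths` overwrites each manifest in place, a wrong audio directory would silently shrink the dataset, and rerunning the cell cannot bring the dropped rows back. A read-only check can be run first. The sketch below is an assumption-laden illustration: `count_missing_audio` is a hypothetical helper, and the demo manifest (including its `text` field) is made up:

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Tuple

def count_missing_audio(manifest_path: str, audio_dir: str) -> Tuple[int, int]:
    """Return (kept, dropped) counts without modifying the manifest."""
    kept = dropped = 0
    with open(manifest_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            filename = Path(json.loads(line)["audio_filepath"]).name
            exists = os.path.exists(os.path.join(audio_dir, filename))
            if exists and not filename.startswith("._"):
                kept += 1
            else:
                dropped += 1
    return kept, dropped

# Self-contained demo: two manifest rows, only one audio file on disk
with tempfile.TemporaryDirectory() as tmp:
    audio_dir = Path(tmp) / "audio"
    audio_dir.mkdir()
    (audio_dir / "utt1.wav").write_bytes(b"")
    manifest = Path(tmp) / "train.json"
    rows = [
        {"audio_filepath": "/old/path/utt1.wav", "text": "example one"},
        {"audio_filepath": "/old/path/utt2.wav", "text": "example two"},
    ]
    manifest.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    demo_counts = count_missing_audio(str(manifest), str(audio_dir))
    print(demo_counts)  # (1, 1)
```

A large `dropped` count is a signal to check the audio directory before letting the manifest be rewritten.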

## Step 7: Setup Environment

In [None]:
import os
import torch
import gc
import sys

os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'
os.environ['PYTHONPATH'] = os.getcwd()

if os.getcwd() not in sys.path:
    sys.path.insert(0, os.getcwd())

gc.collect()
torch.cuda.empty_cache()

print("✅ Environment ready!")

## Step 8: 🚀 START TRAINING (OPTIMIZED)

### Key Changes from Original:
- **Learning Rate**: 0.0005 → **0.0001** (5x lower, for more stable updates)
- **Dropout**: 0.1 → **0.2** (stronger regularization)
- **Weight Decay**: 0.000001 → **0.0001** (100x stronger)
- **CTC Weight**: 0.3 → **0.5** (better CTC/attention balance)

### Expected Results:
- Val loss should stop increasing
- Gap between train/val loss should narrow
- Better generalization to new audio
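
To make the CTC weight change concrete: hybrid CTC/attention models commonly interpolate the two losses with a single weight. Whether `train_konkanivani_asr.py` uses exactly this form is an assumption, since the script is not shown here. Under that assumption, a weight of 0.5 gives both losses equal influence. The loss values below are made up for illustration:

```python
def joint_loss(ctc_loss: float, att_loss: float, ctc_weight: float) -> float:
    """Interpolate: ctc_weight * CTC loss + (1 - ctc_weight) * attention loss."""
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must be in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss

# Made-up per-batch losses, only to show how the weighting shifts the total
ctc, att = 8.0, 4.0
old_total = joint_loss(ctc, att, ctc_weight=0.3)
new_total = joint_loss(ctc, att, ctc_weight=0.5)
print(f"ctc_weight=0.3 -> {old_total:.2f}")  # 5.20
print(f"ctc_weight=0.5 -> {new_total:.2f}")  # 6.00
```

One consequence of this form is that totals logged before and after the change are not directly comparable, because the same CTC and attention losses produce a different weighted sum.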

In [None]:
print("="*70)
print("🚀 STARTING OPTIMIZED KONKANIVANI ASR TRAINING")
print("="*70)
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Batch size: {BATCH_SIZE}")
print(f"Gradient accumulation: {GRAD_ACCUM}x")
print(f"Effective batch size: {BATCH_SIZE * GRAD_ACCUM}")
print("\n🔥 OPTIMIZED SETTINGS:")
print(f"  Learning rate: 0.0001 (was 0.0005)")
print(f"  Dropout: 0.2 (was 0.1)")
print(f"  Weight decay: 0.0001 (was 0.000001)")
print(f"  CTC weight: 0.5 (was 0.3)")
print("="*70)
print("\n💡 TIP: Turn OFF internet now to save quota!")
print("="*70)
print("\n")

!python3 training_scripts/train_konkanivani_asr.py \
    --train_manifest data/konkani-asr-v0/splits/manifests/train.json \
    --val_manifest data/konkani-asr-v0/splits/manifests/val.json \
    --vocab_file data/vocab.json \
    --batch_size {BATCH_SIZE} \
    --gradient_accumulation_steps {GRAD_ACCUM} \
    --num_epochs 50 \
    --learning_rate 0.0001 \
    --weight_decay 0.0001 \
    --dropout 0.2 \
    --ctc_weight 0.5 \
    --device cuda \
    --d_model 256 \
    --encoder_layers 12 \
    --decoder_layers 6 \
    --mixed_precision \
    --checkpoint_dir checkpoints \
    --log_dir logs \
    --resume checkpoints/checkpoint_epoch_15.pt

## Step 9: Monitor Progress

In [None]:
!nvidia-smi

print("\nSaved checkpoints:")
!ls -lh checkpoints/

## Step 10: Check Latest Checkpoint

In [None]:
import torch
from pathlib import Path

checkpoints = sorted(Path('checkpoints').glob('checkpoint_epoch_*.pt'))
if checkpoints:
    latest = checkpoints[-1]
    ckpt = torch.load(latest, map_location='cpu')
    print(f"Latest checkpoint: {latest.name}")
    print(f"Epoch: {ckpt['epoch']}")
    print(f"Train loss: {ckpt.get('train_loss', 'N/A')}")
    print(f"Val loss: {ckpt.get('val_loss', 'N/A')}")
    
    # Check if overfitting is improving
    if 'train_loss' in ckpt and 'val_loss' in ckpt:
        ratio = ckpt['val_loss'] / ckpt['train_loss']
        print(f"\nVal/Train ratio: {ratio:.2f}")
        if ratio < 1.5:
            print("✅ Good! Overfitting is under control")
        elif ratio < 1.8:
            print("⚠️ Moderate overfitting")
        else:
            print("❌ High overfitting - may need more regularization")
else:
    print("No checkpoints saved yet.")
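
Note that `sorted()` on paths compares file names as strings, so `checkpoint_epoch_9.pt` sorts after `checkpoint_epoch_15.pt`. Resuming from epoch 15 up to epoch 50 keeps every epoch number at two digits, so the cell above picks the right file in that range, but single-digit or three-digit epochs would break it. A sketch of a numeric sort, assuming the `checkpoint_epoch_<N>.pt` naming implied by the glob pattern; `sort_checkpoints` is an illustrative helper:

```python
import re
from pathlib import Path
from typing import List

def sort_checkpoints(paths: List[Path]) -> List[Path]:
    """Sort checkpoint_epoch_<N>.pt paths by N, skipping names without a numeric epoch."""
    pattern = re.compile(r"checkpoint_epoch_(\d+)\.pt$")
    numbered = []
    for path in paths:
        match = pattern.search(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    return [path for _, path in sorted(numbered, key=lambda pair: pair[0])]

names = ["checkpoint_epoch_9.pt", "checkpoint_epoch_15.pt", "checkpoint_epoch_100.pt"]
paths = [Path("checkpoints") / name for name in names]
print(sorted(paths)[-1].name)            # checkpoint_epoch_9.pt (string order)
print(sort_checkpoints(paths)[-1].name)  # checkpoint_epoch_100.pt (numeric order)
```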