# Javanese ASR Training on Google Colab

This notebook trains a Listen, Attend and Spell (LAS) style sequence-to-sequence ASR model for Javanese on a Google Colab GPU.

## Setup Instructions:
1. Upload this notebook to Google Colab
2. Enable GPU: Runtime → Change runtime type → GPU (T4)
3. Upload your data:
   - `audio_input/` folder (all WAV files)
   - `transcripts.csv` file
4. Upload the project's Python files (listed in section 4)
5. Run all cells

## Expected Training Time:
- 100 epochs: ~3-5 hours on T4 GPU
- 200 epochs: ~6-10 hours on T4 GPU

## 1. Check GPU Availability

In [7]:
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    print("⚠️ WARNING: No GPU detected! Training will be very slow.")
    print("Please enable GPU: Runtime → Change runtime type → GPU")

PyTorch version: 2.9.0+cu126
CUDA available: False
Please enable GPU: Runtime → Change runtime type → GPU


## 2. Install Dependencies

In [2]:
!pip install -q editdistance soundfile tqdm

## 3. Upload Your Data

**Option A: Upload from local computer**
- Click the folder icon on the left sidebar
- Create folder: `audio_input`
- Upload all WAV files to `audio_input/`
- Upload `transcripts.csv` to root

**Option B: Mount Google Drive (if data is in Drive)**

In [3]:
# Option B: Uncomment to mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# Then copy your data from Drive:
# !cp -r /content/drive/MyDrive/javanese_asr/audio_input ./
# !cp /content/drive/MyDrive/javanese_asr/transcripts.csv ./

## 4. Upload Code Files

Upload these Python files to Colab:
- `model.py`
- `features.py`
- `vocab.py`
- `dataset.py`
- `metrics.py`
- `decoder.py`
- `utils.py`
- `config.py`
- `train.py`

In [4]:
# Verify files are uploaded
import os
required_files = ['model.py', 'features.py', 'vocab.py', 'dataset.py', 
                  'metrics.py', 'decoder.py', 'utils.py', 'config.py', 'train.py']

missing_files = [f for f in required_files if not os.path.exists(f)]
if missing_files:
    print("❌ Missing files:")
    for f in missing_files:
        print(f"  - {f}")
    print("\nPlease upload these files to continue.")
else:
    print("✅ All code files found!")

# Check data
if os.path.exists('audio_input') and os.path.exists('transcripts.csv'):
    num_audio = len([f for f in os.listdir('audio_input') if f.endswith('.wav')])
    print(f"✅ Data found: {num_audio} audio files")
else:
    print("❌ Data not found. Please upload audio_input/ and transcripts.csv")

❌ Missing files:
  - model.py
  - features.py
  - vocab.py
  - dataset.py
  - metrics.py
  - decoder.py
  - utils.py
  - config.py
  - train.py

Please upload these files to continue.
❌ Data not found. Please upload audio_input/ and transcripts.csv


## 5. Configure Training Settings

In [5]:
# Overwrite config.py with settings suited to Colab
config_content = '''
from dataclasses import dataclass

@dataclass
class Config:
    # Data paths
    audio_dir: str = "audio_input"
    transcript_file: str = "transcripts.csv"
    vocab_path: str = "vocab.json"
    
    # Training
    batch_size: int = 16  # Increased for GPU
    num_epochs: int = 100  # Full training
    learning_rate: float = 1e-3
    grad_clip_norm: float = 5.0
    teacher_forcing_ratio: float = 1.0
    
    # Model architecture
    input_dim: int = 80
    encoder_hidden_size: int = 128
    encoder_num_layers: int = 3
    decoder_dim: int = 256
    attention_dim: int = 128
    embedding_dim: int = 64
    dropout: float = 0.3
    
    # CTC settings
    use_ctc: bool = False
    ctc_weight: float = 0.3
    
    # Feature extraction
    sample_rate: int = 16000
    n_mels: int = 80
    win_length_ms: float = 25.0
    hop_length_ms: float = 10.0
    
    # Augmentation
    apply_cmvn: bool = True
    apply_spec_augment: bool = True
    speed_perturb: bool = False
    
    # Validation
    val_split: float = 0.1
    val_every_n_steps: int = 500
    
    # Checkpointing
    checkpoint_dir: str = "checkpoints"
    save_every_n_epochs: int = 10
    
    # Decoding
    max_decode_len: int = 200
    beam_size: int = 5
    
    # Device
    device: str = "cuda"  # Use GPU on Colab
    
    # Random seed
    seed: int = 42
'''

with open('config.py', 'w') as f:
    f.write(config_content)

print("✅ Config updated for Colab (GPU enabled, batch_size=16, epochs=100)")

✅ Config updated for Colab (GPU enabled, batch_size=16, epochs=100)
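
As a quick sanity check on the feature settings in the config, the window and hop lengths given in milliseconds can be converted to sample counts, and from there to an approximate number of feature frames per utterance. This is a minimal sketch that assumes plain STFT framing without padding; `ms_to_samples` and `num_frames` are hypothetical helpers and are not taken from `features.py`, which may frame audio differently.

```python
def ms_to_samples(ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(ms * sample_rate / 1000.0))


def num_frames(num_samples: int, win_length: int, hop_length: int) -> int:
    """Number of full windows that fit in the signal (no padding assumed)."""
    if num_samples < win_length:
        return 0
    return 1 + (num_samples - win_length) // hop_length


sample_rate = 16000                      # Config.sample_rate
win = ms_to_samples(25.0, sample_rate)   # Config.win_length_ms
hop = ms_to_samples(10.0, sample_rate)   # Config.hop_length_ms

# Frames produced by a 3-second utterance under the no-padding assumption
frames = num_frames(3 * sample_rate, win, hop)

print(f"window = {win} samples, hop = {hop} samples, frames for 3 s = {frames}")
```

With the values above, each 3-second clip yields roughly 100 frames per second of audio, each with `n_mels = 80` features, which is why `input_dim` is also 80.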


## 6. Build Vocabulary

In [6]:
from vocab import build_vocab_from_file

print("Building vocabulary from transcripts...")
vocab = build_vocab_from_file("transcripts.csv", save_path="vocab.json")
print(f"\n✅ Vocabulary built with {len(vocab)} tokens")
print(f"Special tokens: {vocab.special_tokens}")

ModuleNotFoundError: No module named 'vocab'

## 7. Start Training

This will take several hours. You can monitor progress in real time.

In [None]:
# Run training
!python train.py

## 8. Monitor Training Progress

While training is running, watch for:
- Training loss: should decrease from ~4.0 to <1.0
- Validation CER (character error rate): should decrease from ~90% to <20%
- The best checkpoint, which is saved at `checkpoints/best_model.pt`
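
For readers unfamiliar with the metric, CER is the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below is a pure-Python illustration; `levenshtein` and `cer` are hypothetical names, and the project's own `metrics.py` may compute it differently (for example with `editdistance.eval` from the `editdistance` package installed earlier).

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to turn `hyp` into `ref` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


# One wrong character out of 13 in the reference
print(f"CER: {cer('sugeng enjing', 'sugeng enjang'):.1%}")
```

Note that CER can exceed 100% when the hypothesis is much longer than the reference, so very high values early in training are not necessarily a bug.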

## 9. Download Trained Model

After training completes, download your model:

In [None]:
from google.colab import files
import os

# Download best model
if os.path.exists('checkpoints/best_model.pt'):
    files.download('checkpoints/best_model.pt')
    print("✅ Downloaded best_model.pt")

# Download vocabulary
if os.path.exists('vocab.json'):
    files.download('vocab.json')
    print("✅ Downloaded vocab.json")

# Download all checkpoints (optional)
# !zip -r checkpoints.zip checkpoints/
# files.download('checkpoints.zip')

## 10. Test Inference

Test the trained model on a sample audio file:

In [None]:
import os
import torch
from model import Seq2SeqASR
from vocab import Vocabulary
from features import LogMelFeatureExtractor, load_audio, CMVN
from decoder import GreedyDecoder
from utils import load_checkpoint
from config import Config

# Load config and vocab
cfg = Config()
vocab = Vocabulary.load('vocab.json')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create model
model = Seq2SeqASR(
    vocab_size=len(vocab),
    input_dim=cfg.input_dim,
    encoder_hidden_size=cfg.encoder_hidden_size,
    encoder_num_layers=cfg.encoder_num_layers,
    decoder_dim=cfg.decoder_dim,
    attention_dim=cfg.attention_dim,
    embedding_dim=cfg.embedding_dim,
    dropout=cfg.dropout,
    use_ctc=cfg.use_ctc,
    ctc_weight=cfg.ctc_weight
).to(device)

# Load checkpoint
load_checkpoint('checkpoints/best_model.pt', model, device=device)
model.eval()

# Create decoder
decoder = GreedyDecoder(model, vocab, max_len=cfg.max_decode_len, device=device)

# Test on a sample file
test_file = 'audio_input/speaker03_m_nn_utt01.wav'  # Change to your file
if os.path.exists(test_file):
    # Load and process audio
    feature_extractor = LogMelFeatureExtractor()
    waveform, sr = load_audio(test_file, target_sr=cfg.sample_rate)
    features = feature_extractor(waveform)
    cmvn = CMVN()
    features = cmvn(features)
    
    # Add batch dimension
    features = features.unsqueeze(0).to(device)
    feature_lengths = torch.tensor([features.size(1)], dtype=torch.long).to(device)
    
    # Decode without tracking gradients
    with torch.no_grad():
        transcript = decoder.decode(features, feature_lengths)[0]
    
    print(f"\n🎤 Audio: {test_file}")
    print(f"📝 Transcript: {transcript}")
else:
    print(f"File not found: {test_file}")

## 11. Save to Google Drive (Optional)

Save your trained model to Google Drive for permanent storage:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Create directory in Drive
!mkdir -p /content/drive/MyDrive/javanese_asr_trained

# Copy checkpoints and vocab
!cp -r checkpoints /content/drive/MyDrive/javanese_asr_trained/
!cp vocab.json /content/drive/MyDrive/javanese_asr_trained/

print("✅ Saved to Google Drive: /MyDrive/javanese_asr_trained/")

## Tips for Better Results:

1. **Training Duration**: Train for at least 100 epochs (3-5 hours on T4 GPU)
2. **Monitor CER**: Good models achieve <20% CER on validation set
3. **Batch Size**: Increase to 32 if you have enough GPU memory
4. **Enable CTC**: Set `use_ctc=True` in config for better alignment
5. **Data Quality**: Remove corrupted audio files before training
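
For tip 5, one way to spot corrupted files is to try opening each WAV and flag those that fail or contain no audio. This is a rough sketch using only the standard-library `wave` module; `find_bad_wavs` is a hypothetical helper, not part of the project code. Note that `wave` only reads PCM WAV files, so a valid file in another encoding (such as 32-bit float) would also be flagged here and should be checked by hand before deleting it.

```python
import os
import wave


def find_bad_wavs(audio_dir: str):
    """Return (filename, reason) pairs for WAV files that cannot be
    opened as PCM WAV or that contain zero audio frames."""
    problems = []
    for name in sorted(os.listdir(audio_dir)):
        if not name.lower().endswith(".wav"):
            continue
        path = os.path.join(audio_dir, name)
        try:
            with wave.open(path, "rb") as wf:
                if wf.getnframes() == 0:
                    problems.append((name, "no audio frames"))
        except (wave.Error, EOFError) as exc:
            problems.append((name, f"unreadable: {exc}"))
    return problems


# Only runs when the data folder from this notebook is present
if os.path.isdir("audio_input"):
    for name, reason in find_bad_wavs("audio_input"):
        print(f"{name}: {reason}")
```

The function only reports problems; removing the files (and their rows in `transcripts.csv`) is left as a manual step so nothing is deleted by accident.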

## Troubleshooting:

- **Out of Memory**: Reduce `batch_size` to 8 or 4
- **Slow Training**: Make sure GPU is enabled (see section 1)
- **High CER**: Train longer (200+ epochs) or enable CTC
- **Disconnection**: Colab sessions can disconnect after ~12 hours, and sooner when idle. Save checkpoints to Drive regularly!
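
To guard against disconnections, checkpoints can be mirrored to a mounted Drive folder as they are written. The helper below is a minimal sketch under the assumption that Drive is already mounted at `/content/drive`; `backup_checkpoints` is a hypothetical name and is not part of `utils.py`. It copies only files that are new or have changed since the last backup.

```python
import os
import shutil


def backup_checkpoints(src_dir: str, dst_dir: str):
    """Copy new or updated files from src_dir to dst_dir.
    Returns the names of the files that were copied."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        dst = os.path.join(dst_dir, name)
        if not os.path.isfile(src):
            continue
        # copy2 preserves the modification time, so unchanged files are skipped next time
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            shutil.copy2(src, dst)
            copied.append(name)
    return copied


# Example call, guarded so it only runs when both locations exist
drive_root = "/content/drive/MyDrive"
if os.path.isdir("checkpoints") and os.path.isdir(drive_root):
    done = backup_checkpoints("checkpoints", os.path.join(drive_root, "javanese_asr_trained", "checkpoints"))
    print(f"Backed up {len(done)} file(s)")
```

Because `!python train.py` blocks the notebook while it runs, this would need to be called from inside the training loop after each save. A simpler alternative may be to mount Drive first and point `checkpoint_dir` in `config.py` at a Drive folder, so checkpoints are written there directly.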