# 🎯 NeMo Bangla ASR Fine-tuning Pipeline

**Fine-tuning a pretrained Bangla Conformer ASR model on a large audio dataset**

- **Model**: `hishab/titu_stt_bn_conformer_large`
- **Framework**: NVIDIA NeMo + PyTorch Lightning
- **Hardware**: Kaggle P100 GPU (16GB VRAM)
- **Audio**: 40+ minute files → 30-second chunks (on-the-fly, no file saving)

---

## 📦 Setup and Installation

Install NeMo toolkit and dependencies.

In [None]:
# Install NeMo and dependencies
!pip install -q "nemo_toolkit[all]"
!pip install -q soundfile librosa

# Verify installation
import nemo
import nemo.collections.asr as nemo_asr
print(f"NeMo version: {nemo.__version__}")

In [None]:
# Import required libraries
import os
import json
import numpy as np
import soundfile as sf
import librosa
import pandas as pd
import gc
import torch
from pathlib import Path
from tqdm.auto import tqdm
from typing import List, Tuple, Dict
import pytorch_lightning as pl
from omegaconf import OmegaConf, open_dict

print("✅ All imports successful")
print(f"🔥 PyTorch version: {torch.__version__}")
print(f"💾 CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

In [None]:
# Configuration
# ⚠️ UPDATE THESE PATHS FOR YOUR KAGGLE ENVIRONMENT

# Input paths (Kaggle dataset)
AUDIO_DIR = "/kaggle/input/your-dataset/audio"  # Directory containing long audio files
TRANSCRIPT_DIR = "/kaggle/input/your-dataset/transcripts"  # Directory containing .txt transcript files
TEST_AUDIO_DIR = "/kaggle/input/your-dataset/test"  # Directory containing test audio files

# Output paths
OUTPUT_DIR = "/kaggle/working/processed_data"
MANIFEST_DIR = os.path.join(OUTPUT_DIR, "manifests")
CHECKPOINT_DIR = "/kaggle/working/checkpoints"
FINAL_MODEL_PATH = "/kaggle/working/nemo_bangla_asr_finetuned.nemo"
SUBMISSION_PATH = "/kaggle/working/"

# Audio parameters
CHUNK_DURATION = 30.0  # seconds
SAMPLE_RATE = 16000  # Hz

# Dataset parameters
USE_FIRST_50_PERCENT = True  # Use only first 50% of training data

# Training parameters (MEMORY OPTIMIZED for P100 16GB)
BATCH_SIZE = 1  # Keep at 1 for memory safety
MAX_EPOCHS = 10
LEARNING_RATE = 2e-5
VAL_SPLIT = 0.1  # 10% validation
GRADIENT_ACCUMULATION = 4  # Simulate larger batch size
INFERENCE_BATCH_SIZE = 4  # For test inference

# Memory optimization settings
USE_GRADIENT_CHECKPOINTING = True
EMPTY_CACHE_EVERY_N_STEPS = 10

# Create directories
os.makedirs(MANIFEST_DIR, exist_ok=True)
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

print("📁 Directory structure created")
print(f"  Manifests: {MANIFEST_DIR}")
print(f"  Checkpoints: {CHECKPOINT_DIR}")
print(f"  Submission: {SUBMISSION_PATH}")
print("\n⚡ Using ON-THE-FLY chunking (no intermediate files saved)")
print(f"💾 Memory optimizations: Gradient accumulation={GRADIENT_ACCUMULATION}, Checkpointing={USE_GRADIENT_CHECKPOINTING}")
if USE_FIRST_50_PERCENT:
    print("📊 Dataset: Using first 50% of training data only")

### 📁 Expected Dataset Structure

```
/kaggle/input/your-dataset/
├── audio/
│   ├── audio1.wav
│   ├── audio2.wav
│   └── ...
├── transcripts/
│   ├── audio1.txt  ← Same name as audio file
│   ├── audio2.txt
│   └── ...
└── test/
    ├── test1.wav
    ├── test2.wav
    └── ...
```

**Important**: Each audio file must have a corresponding .txt file with the SAME filename (except extension).
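
A minimal sketch of how this pairing rule could be checked up front, assuming the directory layout shown above. The helper name `find_unpaired_audio` is hypothetical and only compares file stems; it does not read any audio or text.

```python
from pathlib import Path
from typing import List


def find_unpaired_audio(audio_dir: str, transcript_dir: str) -> List[str]:
    """Return names of .wav files that have no same-stem .txt transcript."""
    # Collect transcript stems once so each audio file is a set lookup
    transcript_stems = {p.stem for p in Path(transcript_dir).glob("*.txt")}
    return sorted(
        p.name
        for p in Path(audio_dir).glob("*.wav")
        if p.stem not in transcript_stems
    )
```

An empty list means every `.wav` file has a transcript with the same stem.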

### ⚙️ Dataset Size Configuration

**Current setting**: `USE_FIRST_50_PERCENT = True`

- ✅ **True**: Uses only the **first 50%** of audio files for training (faster, less memory)
- ❌ **False**: Uses **all 100%** of audio files for training (full dataset, better accuracy)

This is useful for:
- Quick testing and iteration with smaller dataset
- Reducing training time and memory usage
- Prototyping before full training run
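
A minimal sketch of what "first 50%" means in practice, assuming files are ordered by sorted filename before slicing. `take_first_fraction` is a hypothetical helper used only for illustration.

```python
from typing import List, Sequence


def take_first_fraction(names: Sequence[str], fraction: float) -> List[str]:
    """Sort names, then keep the leading `fraction` of them (rounded down)."""
    ordered = sorted(names)
    if fraction >= 1.0:
        return ordered
    return ordered[: int(len(ordered) * fraction)]


files = ["b.wav", "a.wav", "d.wav", "c.wav", "e.wav"]
print(take_first_fraction(files, 0.5))  # ['a.wav', 'b.wav']
```

Sorting is lexicographic, so `audio10.wav` sorts before `audio2.wav`; which files land in the first half depends on the naming scheme.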

In [None]:
# Validate paths (optional but recommended)
print("🔍 Validating dataset paths...\n")

paths_to_check = {
    "Audio Directory": AUDIO_DIR,
    "Transcript Directory": TRANSCRIPT_DIR,
    "Test Audio Directory": TEST_AUDIO_DIR
}

all_valid = True
for name, path in paths_to_check.items():
    exists = os.path.exists(path)
    status = "✅" if exists else "❌"
    print(f"{status} {name}: {path}")
    if not exists:
        all_valid = False
        print("   ⚠️  WARNING: Path does not exist!")

if not all_valid:
    print("\n⚠️  Some paths are invalid. Please update the configuration above.")
    print("   The notebook will still run, but you may encounter errors later.")
else:
    print("\n✅ All paths validated successfully!")

---

## 🎵 Audio Preprocessing (On-the-Fly)

Create chunk metadata WITHOUT saving files. Uses **offset-based** manifest entries.
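
To illustrate the offset-based idea, here is a small sketch of the arithmetic, assuming fixed-length chunks and that a trailing partial chunk is dropped. `chunk_offsets` is a hypothetical name; only the total duration is needed, so no audio is decoded.

```python
from typing import List, Tuple


def chunk_offsets(
    total_duration: float, chunk_duration: float = 30.0
) -> List[Tuple[float, float]]:
    """Return (offset, duration) pairs for every complete chunk."""
    num_chunks = int(total_duration // chunk_duration)
    return [(i * chunk_duration, chunk_duration) for i in range(num_chunks)]


# A 100-second file yields three full chunks; the last 10 s are dropped
print(chunk_offsets(100.0))  # [(0.0, 30.0), (30.0, 30.0), (60.0, 30.0)]
```

Each pair later becomes the `offset` and `duration` of one manifest entry that points back at the original long file.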

In [None]:
def create_chunk_metadata(
    audio_path: str,
    chunk_duration: float = 30.0
) -> List[Tuple[str, float, float]]:
    """
    Create metadata for audio chunks WITHOUT saving files.
    Uses offset-based approach for on-the-fly loading.
    
    Args:
        audio_path: Path to input audio file
        chunk_duration: Duration of each chunk in seconds (default: 30.0)
        
    Returns:
        List of (audio_path, offset, duration) tuples
    """
    # Get audio info without loading entire file
    info = sf.info(audio_path)
    total_duration = info.duration
    
    # Calculate number of complete chunks
    num_chunks = int(total_duration // chunk_duration)
    
    chunks_metadata = []
    
    # Create metadata for each chunk (offset-based)
    for i in range(num_chunks):
        offset = i * chunk_duration
        chunks_metadata.append((audio_path, offset, chunk_duration))
    
    # Note: We discard the final incomplete chunk (< 30s)
    
    return chunks_metadata


def get_audio_duration(audio_path: str) -> float:
    """
    Get audio file duration in seconds.
    
    Args:
        audio_path: Path to audio file
        
    Returns:
        Duration in seconds
    """
    info = sf.info(audio_path)
    return info.duration


print("✅ Audio chunking functions defined (offset-based, no file saving)")

In [None]:
def split_transcript_into_chunks(
    transcript: str,
    num_chunks: int
) -> List[str]:
    """
    Split a transcript into roughly equal chunks.
    
    Args:
        transcript: Full transcript text
        num_chunks: Number of chunks to split into
        
    Returns:
        List of transcript chunks
    """
    if num_chunks <= 0:
        return []
    
    if num_chunks == 1:
        return [transcript]
    
    # Split by sentences (simple approach using common punctuation).
    # Covers the Bengali danda (।) as well as Latin sentence punctuation.
    import re
    
    sentences = re.split(r'[।.!?]+', transcript)
    sentences = [s.strip() for s in sentences if s.strip()]
    
    # Fall back to words when there are too few sentences to fill every chunk
    units = sentences if len(sentences) >= num_chunks else transcript.split()
    
    # Distribute units evenly; proportional boundaries avoid piling the
    # whole remainder into the last chunk
    chunks = []
    for i in range(num_chunks):
        start_idx = i * len(units) // num_chunks
        end_idx = (i + 1) * len(units) // num_chunks
        chunk = ' '.join(units[start_idx:end_idx])
        if chunk:
            chunks.append(chunk)
    
    return chunks


def process_dataset(
    audio_dir: str,
    transcript_data: Dict[str, str],
    chunk_duration: float = 30.0,
    use_first_n_percent: float = 1.0
) -> List[Dict]:
    """
    Process all audio files: create chunk metadata and pair with transcripts.
    NO FILES ARE SAVED - uses offset-based approach.
    
    Args:
        audio_dir: Directory containing audio files
        transcript_data: Dictionary mapping audio files to FULL transcripts
                        Format: {"filename.wav": "full transcript text", ...}
        chunk_duration: Duration of each chunk in seconds
        use_first_n_percent: Fraction of data to use (0.5 = first 50%)
        
    Returns:
        List of dictionaries with audio_filepath, offset, duration, text
    """
    all_data = []
    
    # Get all audio files
    audio_files = sorted(Path(audio_dir).glob("*.wav"))
    
    # Take only first N% if specified
    total_files = len(audio_files)
    if use_first_n_percent < 1.0:
        num_files_to_use = int(total_files * use_first_n_percent)
        audio_files = audio_files[:num_files_to_use]
        print(f"📊 Using first {use_first_n_percent*100:.0f}% of data: {num_files_to_use}/{total_files} files")
    
    print(f"Found {len(audio_files)} audio files to process")
    
    for audio_path in tqdm(audio_files, desc="Processing audio files"):
        filename = audio_path.name
        
        # Check if transcript exists
        if filename not in transcript_data:
            print(f"⚠️  Skipping {filename}: no transcript found")
            continue
        
        # Get full transcript
        full_transcript = transcript_data[filename]
        
        # Create chunk metadata (no file saving!)
        chunks_metadata = create_chunk_metadata(
            str(audio_path),
            chunk_duration
        )
        
        num_chunks = len(chunks_metadata)
        
        # Split transcript into chunks
        transcript_chunks = split_transcript_into_chunks(full_transcript, num_chunks)
        
        # Pair audio chunks with transcript chunks
        for idx, (audio_file, offset, duration) in enumerate(chunks_metadata):
            # Check if transcript exists for this chunk
            if idx >= len(transcript_chunks):
                print(f"⚠️  No transcript chunk for audio chunk {idx} of {filename}")
                # Use empty string for remaining chunks
                text = ""
            else:
                text = transcript_chunks[idx]
            
            # Validate transcript
            if not text or not text.strip():
                print(f"⚠️  Empty transcript for chunk {idx} of {filename}")
                continue
            
            # Add to dataset with offset and duration
            all_data.append({
                "audio_filepath": audio_file,
                "offset": offset,
                "duration": duration,
                "text": text.strip()
            })
    
    print(f"\n✅ Processed {len(all_data)} valid audio-transcript pairs")
    print("💾 Space saved: No chunk files created!")
    return all_data


print("✅ Dataset processing function defined")

### Load Transcripts from .txt Files

**Format**: Each audio file has a corresponding .txt file with the same name.
- Audio: `audio1.wav` → Transcript: `audio1.txt`
- Each .txt file contains the FULL transcript for that audio file
- Transcripts will be automatically split into chunks matching audio segments

In [None]:
def load_transcript_from_file(transcript_path: str) -> str:
    """
    Load transcript from a .txt file.
    
    Args:
        transcript_path: Path to transcript .txt file
        
    Returns:
        Transcript text as string
    """
    try:
        with open(transcript_path, 'r', encoding='utf-8') as f:
            transcript = f.read().strip()
        return transcript
    except Exception as e:
        print(f"⚠️  Error reading {transcript_path}: {str(e)}")
        return ""


def load_transcripts_from_directory(
    audio_dir: str,
    transcript_dir: str
) -> Dict[str, str]:
    """
    Load all transcripts from directory.
    Matches audio files (*.wav) with transcript files (*.txt).
    
    Args:
        audio_dir: Directory containing audio files
        transcript_dir: Directory containing transcript .txt files
        
    Returns:
        Dictionary mapping audio filename to full transcript text
        Format: {"audio1.wav": "full transcript text", ...}
    """
    transcript_dict = {}
    
    # Get all audio files
    audio_files = sorted(Path(audio_dir).glob("*.wav"))
    
    print(f"📂 Loading transcripts from: {transcript_dir}")
    print(f"   Found {len(audio_files)} audio files in {audio_dir}")
    
    found_count = 0
    missing_count = 0
    
    for audio_path in audio_files:
        audio_filename = audio_path.name
        # Get corresponding transcript file (same name, .txt extension)
        transcript_filename = audio_path.stem + ".txt"
        transcript_path = Path(transcript_dir) / transcript_filename
        
        if transcript_path.exists():
            transcript = load_transcript_from_file(str(transcript_path))
            if transcript:
                transcript_dict[audio_filename] = transcript
                found_count += 1
            else:
                print(f"⚠️  Empty transcript: {transcript_filename}")
                missing_count += 1
        else:
            print(f"⚠️  Transcript not found: {transcript_filename}")
            missing_count += 1
    
    print(f"\n✅ Loaded {found_count} transcripts")
    if missing_count > 0:
        print(f"⚠️  Missing/empty: {missing_count} transcripts")
    
    return transcript_dict


print("✅ Transcript loading functions defined")

In [None]:
# Load transcripts from directory
transcript_data = load_transcripts_from_directory(
    audio_dir=AUDIO_DIR,
    transcript_dir=TRANSCRIPT_DIR
)

print(f"\n📊 Loaded transcripts for {len(transcript_data)} audio files")

In [None]:
# Display sample transcripts
print("\n📄 Sample transcripts:")
print("-" * 80)

sample_count = min(3, len(transcript_data))
for i, (filename, transcript) in enumerate(list(transcript_data.items())[:sample_count]):
    print(f"\n{i+1}. File: {filename}")
    print(f"   Transcript length: {len(transcript)} characters")
    print(f"   Preview: {transcript[:150]}...")
    print("-" * 80)

if len(transcript_data) > sample_count:
    print(f"\n... and {len(transcript_data) - sample_count} more files")

In [None]:
# Process the dataset (NO FILES SAVED!)
# This creates offset-based metadata only

all_data = process_dataset(
    audio_dir=AUDIO_DIR,
    transcript_data=transcript_data,
    chunk_duration=CHUNK_DURATION,
    use_first_n_percent=0.5 if USE_FIRST_50_PERCENT else 1.0
)

print("\n📊 Dataset Statistics:")
print(f"  Total samples: {len(all_data)}")
print(f"  Total duration: {sum(item['duration'] for item in all_data) / 3600:.2f} hours")
print("\n📝 Sample data point (offset-based):")
if all_data:
    print(json.dumps(all_data[0], indent=2, ensure_ascii=False))
    print(f"\n  Sample transcript length: {len(all_data[0]['text'])} characters")
    print(f"  Sample transcript preview: {all_data[0]['text'][:150]}...")

---

## 📄 Manifest Generation

Create NeMo-compatible JSON manifest files with **offset** and **duration** fields.
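
For reference, a sketch of what a single manifest line might look like. The path and the Bengali sample text are placeholders, not values from the dataset, and `to_manifest_line` is a hypothetical helper.

```python
import json
from typing import Dict, Union


def to_manifest_line(entry: Dict[str, Union[str, float]]) -> str:
    """Serialize one entry as a single JSON line, keeping Bengali text readable."""
    return json.dumps(entry, ensure_ascii=False)


entry = {
    "audio_filepath": "/kaggle/input/your-dataset/audio/audio1.wav",  # placeholder path
    "offset": 60.0,    # chunk starts 60 s into the long file
    "duration": 30.0,  # chunk length in seconds
    "text": "আমি বাংলায় গান গাই",  # placeholder transcript
}
line = to_manifest_line(entry)
print(line)
```

The manifest file is one such JSON object per line; `ensure_ascii=False` keeps the Bengali text human-readable instead of escaping it as `\u` sequences.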

In [None]:
def create_manifest(
    data: List[Dict],
    manifest_path: str,
    validate: bool = True
) -> None:
    """
    Create NeMo manifest file with offset support.
    
    Args:
        data: List of dicts with audio_filepath, offset, duration, text
        manifest_path: Output manifest file path
        validate: Whether to validate data before writing
    """
    valid_count = 0
    invalid_count = 0
    
    with open(manifest_path, 'w', encoding='utf-8') as f:
        for item in data:
            # Validation
            if validate:
                # Check audio file exists
                if not os.path.exists(item['audio_filepath']):
                    print(f"⚠️  Audio file not found: {item['audio_filepath']}")
                    invalid_count += 1
                    continue
                
                # Check text is not empty
                if not item['text'] or not item['text'].strip():
                    print(f"⚠️  Empty text for: {item['audio_filepath']}")
                    invalid_count += 1
                    continue
                
                # Check duration is positive
                if item['duration'] <= 0:
                    print(f"⚠️  Invalid duration for: {item['audio_filepath']}")
                    invalid_count += 1
                    continue
            
            # Write to manifest (one JSON per line)
            # Include offset for on-the-fly loading
            json_line = json.dumps(item, ensure_ascii=False)
            f.write(json_line + '\n')
            valid_count += 1
    
    print(f"\n✅ Manifest created: {manifest_path}")
    print(f"  Valid entries: {valid_count}")
    print("  Format: Offset-based (no chunk files needed)")
    if invalid_count > 0:
        print(f"  ⚠️  Skipped invalid entries: {invalid_count}")


def train_val_split(
    data: List[Dict],
    val_ratio: float = 0.1,
    shuffle: bool = True,
    seed: int = 42
) -> Tuple[List[Dict], List[Dict]]:
    """
    Split data into train and validation sets.
    
    Args:
        data: List of data items
        val_ratio: Fraction of data for validation
        shuffle: Whether to shuffle before splitting
        seed: Random seed for reproducibility
        
    Returns:
        (train_data, val_data) tuple
    """
    if shuffle:
        np.random.seed(seed)
        indices = np.random.permutation(len(data))
        data = [data[i] for i in indices]
    
    split_idx = int(len(data) * (1 - val_ratio))
    train_data = data[:split_idx]
    val_data = data[split_idx:]
    
    return train_data, val_data


print("✅ Manifest functions defined")

In [None]:
# Memory management utilities
def clear_memory():
    """Clear GPU and CPU memory cache"""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

def print_memory_stats():
    """Print current GPU memory usage"""
    if torch.cuda.is_available():
        allocated = torch.cuda.memory_allocated(0) / 1e9
        reserved = torch.cuda.memory_reserved(0) / 1e9
        print(f"💾 GPU Memory: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")

print("✅ Memory management utilities defined")

In [None]:
# Split data into train/val
train_data, val_data = train_val_split(
    all_data,
    val_ratio=VAL_SPLIT,
    shuffle=True,
    seed=42
)

print("📊 Data Split:")
print(f"  Training samples: {len(train_data)}")
print(f"  Validation samples: {len(val_data)}")
print(f"  Train duration: {sum(item['duration'] for item in train_data) / 3600:.2f} hours")
print(f"  Val duration: {sum(item['duration'] for item in val_data) / 3600:.2f} hours")

In [None]:
# Create manifest files
train_manifest_path = os.path.join(MANIFEST_DIR, "train_manifest.json")
val_manifest_path = os.path.join(MANIFEST_DIR, "val_manifest.json")

create_manifest(train_data, train_manifest_path, validate=True)
create_manifest(val_data, val_manifest_path, validate=True)

print("\n✅ Manifests ready:")
print(f"  Train: {train_manifest_path}")
print(f"  Val: {val_manifest_path}")

In [None]:
# Optional: Verify manifest samples
print("🔍 Verifying manifest samples...\n")

def verify_manifest_samples(manifest_path, num_samples=3):
    """Verify that manifest entries can be loaded correctly"""
    print(f"Checking: {os.path.basename(manifest_path)}")
    
    with open(manifest_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    
    valid = 0
    invalid = 0
    
    for i, line in enumerate(lines[:num_samples]):
        try:
            data = json.loads(line)
            
            # Check required fields
            required = ['audio_filepath', 'offset', 'duration', 'text']
            missing = [field for field in required if field not in data]
            
            if missing:
                print(f"  ❌ Entry {i+1}: Missing fields {missing}")
                invalid += 1
                continue
            
            # Check audio file exists
            if not os.path.exists(data['audio_filepath']):
                print(f"  ❌ Entry {i+1}: Audio file not found: {data['audio_filepath']}")
                invalid += 1
                continue
            
            # Check text is not empty
            if not data['text'].strip():
                print(f"  ⚠️  Entry {i+1}: Empty text")
                invalid += 1
                continue
            
            print(f"  ✅ Entry {i+1}: OK")
            print(f"     Audio: {os.path.basename(data['audio_filepath'])}")
            print(f"     Offset: {data['offset']:.1f}s, Duration: {data['duration']:.1f}s")
            print(f"     Text: {data['text'][:50]}...")
            valid += 1
            
        except Exception as e:
            print(f"  ❌ Entry {i+1}: Error - {str(e)}")
            invalid += 1
    
    print(f"\n  Summary: {valid} valid, {invalid} invalid\n")

# Verify both manifests
verify_manifest_samples(train_manifest_path, num_samples=3)
verify_manifest_samples(val_manifest_path, num_samples=2)

print("✅ Manifest verification complete")

---

## 🧠 Model Fine-tuning

Load pretrained model and configure training for Kaggle P100.

In [None]:
# Load pretrained model
print("📥 Loading pretrained model: hishab/titu_stt_bn_conformer_large")
print("   This may take a few minutes...\n")

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    "hishab/titu_stt_bn_conformer_large"
)

print("✅ Model loaded successfully")
print(f"   Model type: {type(asr_model).__name__}")
print(f"   Sample rate: {asr_model.cfg.sample_rate}")

### 💡 Memory Troubleshooting Tips

If you encounter Out-of-Memory (OOM) errors:

**During Training:**
- Increase `GRADIENT_ACCUMULATION` to 8 or 16
- Keep `BATCH_SIZE = 1`
- Set `USE_GRADIENT_CHECKPOINTING = True`
- Reduce `MAX_EPOCHS` for faster testing

**During Inference:**
- Reduce `INFERENCE_BATCH_SIZE` to 2 or 1
- Reduce `CHUNK_DURATION` to 15 or 20 seconds
- Process test files one at a time (set batch_size=1 in transcribe function)

**General:**
- Restart kernel to clear all memory
- Close other notebooks/processes
- Monitor memory with `print_memory_stats()`
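
The accumulation tip can be made concrete with a little arithmetic. This sketch assumes one optimizer step per `GRADIENT_ACCUMULATION` batches, with a final partial group still triggering a step at the end of the epoch; the function names are hypothetical.

```python
import math


def effective_batch_size(batch_size: int, grad_accum: int) -> int:
    """Number of samples contributing to each optimizer update."""
    return batch_size * grad_accum


def optimizer_steps_per_epoch(num_samples: int, batch_size: int, grad_accum: int) -> int:
    """Optimizer updates per epoch, counting a trailing partial group as one step."""
    batches = math.ceil(num_samples / batch_size)
    return math.ceil(batches / grad_accum)


# Compare the suggested accumulation values for 1000 samples at BATCH_SIZE = 1
for accum in (4, 8, 16):
    print(accum, effective_batch_size(1, accum), optimizer_steps_per_epoch(1000, 1, accum))
```

Raising accumulation leaves the per-batch activation memory unchanged, since that is set by `BATCH_SIZE`, but it reduces the number of optimizer updates per epoch, so a fixed warmup length covers more epochs.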

In [None]:
# Configure training data
print("⚙️  Configuring training data...")

# Update training data config
with open_dict(asr_model.cfg):
    asr_model.cfg.train_ds.manifest_filepath = train_manifest_path
    asr_model.cfg.train_ds.batch_size = BATCH_SIZE
    asr_model.cfg.train_ds.shuffle = True
    asr_model.cfg.train_ds.num_workers = 2
    asr_model.cfg.train_ds.pin_memory = False  # Disabled for memory efficiency
    asr_model.cfg.train_ds.sample_rate = SAMPLE_RATE
    
    # Offset-based loading needs no extra flag: NeMo's manifest-driven ASR
    # datasets read the `offset` and `duration` fields from each manifest entry
    asr_model.cfg.train_ds.use_start_end_token = False
    
    # Update validation data config
    asr_model.cfg.validation_ds.manifest_filepath = val_manifest_path
    asr_model.cfg.validation_ds.batch_size = BATCH_SIZE
    asr_model.cfg.validation_ds.shuffle = False
    asr_model.cfg.validation_ds.num_workers = 2
    asr_model.cfg.validation_ds.sample_rate = SAMPLE_RATE
    
    # Optimizer config
    asr_model.cfg.optim.name = 'adam'
    asr_model.cfg.optim.lr = LEARNING_RATE
    asr_model.cfg.optim.betas = [0.9, 0.999]
    asr_model.cfg.optim.weight_decay = 1e-6
    
    # Learning rate schedule (warmup followed by cosine decay)
    asr_model.cfg.optim.sched.name = 'CosineAnnealing'
    asr_model.cfg.optim.sched.warmup_steps = 500
    asr_model.cfg.optim.sched.min_lr = 1e-7
    
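    # Memory optimization: set the gradient checkpointing flag only if the
    # encoder config exposes one. The encoder is already built at this point,
    # so a config change alone may not alter the loaded module.
    if USE_GRADIENT_CHECKPOINTING and hasattr(asr_model.cfg, 'encoder'):
        if hasattr(asr_model.cfg.encoder, 'gradient_checkpointing'):
            asr_model.cfg.encoder.gradient_checkpointing = True
            print("✅ Gradient checkpointing flag set in encoder config")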

# Setup training and validation data loaders
asr_model.setup_training_data(asr_model.cfg.train_ds)
asr_model.setup_validation_data(asr_model.cfg.validation_ds)

# Clear memory before training
clear_memory()
print_memory_stats()

print("✅ Data configuration complete")
print("⚡ Data loaders will read audio chunks on-the-fly using offsets")

In [None]:
# Configure PyTorch Lightning Trainer
print("⚙️  Configuring trainer...")

from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping, Callback
from pytorch_lightning.loggers import TensorBoardLogger

# Custom callback for memory management
class MemoryCleanupCallback(Callback):
    """Clear GPU cache periodically during training"""
    def __init__(self, cleanup_every_n_steps=10):
        self.cleanup_every_n_steps = cleanup_every_n_steps
    
    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if batch_idx % self.cleanup_every_n_steps == 0:
            clear_memory()

# Checkpoint callback
checkpoint_callback = ModelCheckpoint(
    dirpath=CHECKPOINT_DIR,
    filename='nemo-asr-{epoch:02d}-{val_wer:.4f}',
    monitor='val_wer',
    mode='min',
    save_top_k=2,  # Reduced to save disk space
    save_last=True,
    verbose=True
)

# Early stopping (optional)
early_stop_callback = EarlyStopping(
    monitor='val_wer',
    patience=3,
    mode='min',
    verbose=True
)

# Memory cleanup callback
memory_callback = MemoryCleanupCallback(cleanup_every_n_steps=EMPTY_CACHE_EVERY_N_STEPS)

# Logger
logger = TensorBoardLogger(
    save_dir='/kaggle/working',
    name='nemo_asr_logs'
)

# Trainer
trainer = pl.Trainer(
    devices=1,
    accelerator='gpu',
    max_epochs=MAX_EPOCHS,
    precision='16-mixed',  # Mixed precision for P100
    callbacks=[checkpoint_callback, early_stop_callback, memory_callback],
    logger=logger,
    gradient_clip_val=1.0,
    accumulate_grad_batches=GRADIENT_ACCUMULATION,  # Memory-efficient training
    log_every_n_steps=10,
    val_check_interval=1.0,  # Validate every epoch
    enable_progress_bar=True,
    enable_model_summary=True,
    # Memory optimizations
    enable_checkpointing=True,
    deterministic=False  # Faster training
)

print("✅ Trainer configured")
print(f"   Devices: {trainer.num_devices} GPU")
print(f"   Precision: {trainer.precision}")
print(f"   Max epochs: {MAX_EPOCHS}")
print(f"   Batch size: {BATCH_SIZE}")
print(f"   Gradient accumulation: {GRADIENT_ACCUMULATION}")
print(f"   Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
print(f"   Learning rate: {LEARNING_RATE}")
print(f"   Memory cleanup: Every {EMPTY_CACHE_EVERY_N_STEPS} steps")

In [None]:
# Start training
print("\n🚀 Starting fine-tuning...\n")
print("=" * 60)

# Clear memory before training
clear_memory()
print_memory_stats()

trainer.fit(asr_model)

# Clear memory after training
clear_memory()
print_memory_stats()

print("\n" + "=" * 60)
print("✅ Training complete!")

### 🔄 Resume Training (Optional)

If training was interrupted, you can resume from the last checkpoint:

```python
# Uncomment and run this instead of trainer.fit(asr_model)
# checkpoint_path = "/kaggle/working/checkpoints/last.ckpt"
# trainer.fit(asr_model, ckpt_path=checkpoint_path)
```
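
A small sketch for choosing which checkpoint to resume from, assuming all checkpoints are written to a single directory and that `last.ckpt` is present when `save_last=True` was used. `find_resume_checkpoint` is a hypothetical helper, not part of NeMo or PyTorch Lightning.

```python
from pathlib import Path
from typing import Optional


def find_resume_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Prefer last.ckpt; otherwise fall back to the most recently modified .ckpt."""
    ckpt_dir = Path(checkpoint_dir)
    last = ckpt_dir / "last.ckpt"
    if last.exists():
        return str(last)
    candidates = sorted(ckpt_dir.glob("*.ckpt"), key=lambda p: p.stat().st_mtime)
    return str(candidates[-1]) if candidates else None
```

The result could then be passed as `ckpt_path` to `trainer.fit`; a value of `None` starts a fresh run.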

---

## 💾 Model Export

Save the fine-tuned model in NeMo format.

In [None]:
# Save model to .nemo format
print("💾 Saving model...")

# Clear memory before saving
clear_memory()

asr_model.save_to(FINAL_MODEL_PATH)

print(f"✅ Model saved to: {FINAL_MODEL_PATH}")
print(f"   File size: {os.path.getsize(FINAL_MODEL_PATH) / (1024**3):.2f} GB")

# Delete trainer to free memory
del trainer
clear_memory()
print_memory_stats()

---

## 🎤 Test Data Inference & Submission

Process test audio files and generate submission CSV.

In [None]:
# Load saved model for inference
print("📥 Loading saved model for inference...")

# Clear memory first
clear_memory()
print_memory_stats()

loaded_model = nemo_asr.models.ASRModel.restore_from(FINAL_MODEL_PATH)
loaded_model.eval()
loaded_model = loaded_model.cuda()  # Move to GPU

print("✅ Model loaded successfully")
print_memory_stats()

In [None]:
# Quick validation test (optional)
# Test model on a validation sample before processing entire test set
if val_data and len(val_data) > 0:
    print("🧪 Quick model validation test...\n")
    
    sample = val_data[0]
    
    try:
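        # Extract segment for testing; frame indices use the file's native rate
        native_sr = sf.info(sample['audio_filepath']).samplerate
        audio, sr = sf.read(
            sample['audio_filepath'],
            start=int(sample['offset'] * native_sr),
            stop=int((sample['offset'] + sample['duration']) * native_sr)
        )
        
        # Save to a temp file at the native rate, then clean up
        import tempfile
        with tempfile.NamedTemporaryFile(suffix='.wav', delete=False) as tmp:
            tmp_path = tmp.name
        try:
            sf.write(tmp_path, audio, sr)
            prediction = loaded_model.transcribe([tmp_path])[0]
        finally:
            os.unlink(tmp_path)
        # Some NeMo versions return Hypothesis objects rather than plain strings
        prediction = getattr(prediction, 'text', prediction)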
        
        print(f"Ground Truth: {sample['text'][:100]}...")
        print(f"Prediction:   {prediction[:100]}...")
        print("\n✅ Model working correctly!")
        
    except Exception as e:
        print(f"❌ Model test failed: {str(e)}")
        print("   Check if model loaded correctly and paths are valid")
    
    clear_memory()
else:
    print("⚠️  No validation data available for testing")

In [None]:
# Get all test audio files
print(f"🔍 Scanning test directory: {TEST_AUDIO_DIR}")

test_audio_files = sorted(Path(TEST_AUDIO_DIR).glob("*.wav"))
print(f"Found {len(test_audio_files)} test audio files")

if len(test_audio_files) == 0:
    print("⚠️  WARNING: No test audio files found! Check TEST_AUDIO_DIR path.")
else:
    # Check file sizes and durations
    total_duration = 0
    print("\n📊 Test dataset info:")
    sample_files = test_audio_files[:3]  # Show first 3
    for f in sample_files:
        duration = get_audio_duration(str(f))
        size_mb = f.stat().st_size / (1024 ** 2)
        total_duration += duration
        print(f"  {f.name}: {duration/60:.2f} min, {size_mb:.2f} MB")
    
    if len(test_audio_files) > 3:
        print(f"  ... and {len(test_audio_files) - 3} more files")
    
    # Add the remaining files to the total duration
    for f in test_audio_files[3:]:
        total_duration += get_audio_duration(str(f))
    
    print(f"\n  Total test duration: {total_duration/3600:.2f} hours")
    print(f"  Average file duration: {total_duration/len(test_audio_files)/60:.2f} minutes")

In [None]:
# Estimate inference time
if len(test_audio_files) > 0:
    # Rough estimate: processing takes ~1.5x the audio duration on a P100 (30s of audio takes ~45s)
    avg_duration_per_file = total_duration / len(test_audio_files)
    processing_time_per_file = avg_duration_per_file * 1.5  # Conservative estimate
    total_estimated_time = (processing_time_per_file * len(test_audio_files)) / 60  # in minutes
    
    print("\n⏱️  Estimated inference time:")
    print(f"   ~{total_estimated_time:.1f} minutes for {len(test_audio_files)} files")
    print(f"   (~{total_estimated_time/60:.1f} hours)")
    print("\n💡 Tip: Adjust INFERENCE_BATCH_SIZE to speed up or reduce memory usage\n")

In [None]:
def transcribe_long_audio_chunked(
    model,
    audio_path: str,
    chunk_duration: float = 30.0,
    sample_rate: int = 16000,
    batch_size: int = 4
) -> str:
    """
    Transcribe long audio file by chunking and merging transcripts.
    Memory-efficient: processes in batches without saving chunks.
    
    Args:
        model: NeMo ASR model
        audio_path: Path to audio file
        chunk_duration: Duration of each chunk in seconds
        sample_rate: Target sample rate
        batch_size: Number of chunks to process at once
        
    Returns:
        Complete transcript (merged from all chunks)
    """
    import tempfile
    
    # Get audio info
    audio_info = sf.info(audio_path)
    total_duration = audio_info.duration
    
    # Calculate chunks
    num_chunks = int(np.ceil(total_duration / chunk_duration))
    
    all_transcripts = []
    
    # Process in batches
    for batch_start in range(0, num_chunks, batch_size):
        batch_end = min(batch_start + batch_size, num_chunks)
        temp_files = []
        
        try:
            # Extract batch of chunks to temp files
            for i in range(batch_start, batch_end):
                offset = i * chunk_duration
                duration = min(chunk_duration, total_duration - offset)
                
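                # Read segment; start/stop are frame indices at the file's
                # native sample rate, not the target rate
                native_sr = audio_info.samplerate
                audio_segment, sr = sf.read(
                    audio_path,
                    start=int(offset * native_sr),
                    stop=int((offset + duration) * native_sr),
                    dtype="float32"
                )
                
                # Downmix multi-channel audio to mono before resampling
                if audio_segment.ndim > 1:
                    audio_segment = audio_segment.mean(axis=1)
                
                # Resample if needed
                if sr != sample_rate:
                    audio_segment = librosa.resample(
                        audio_segment,
                        orig_sr=sr,
                        target_sr=sample_rate
                    )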
                
                # Save to temp file
                tmp = tempfile.NamedTemporaryFile(suffix='.wav', delete=False)
                sf.write(tmp.name, audio_segment, sample_rate)
                temp_files.append(tmp.name)
                tmp.close()
                
                # Clear memory
                del audio_segment
            
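            # Transcribe batch; pass batch_size explicitly so the whole batch
            # is processed together instead of using transcribe()'s default
            batch_transcripts = model.transcribe(temp_files, batch_size=len(temp_files))
            all_transcripts.extend(batch_transcripts)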
            
        finally:
            # Clean up temp files
            for tmp_file in temp_files:
                try:
                    os.unlink(tmp_file)
                except OSError:
                    pass
            
            # Clear memory
            clear_memory()
    
    # Merge transcripts with single spaces, skipping empty chunks
    full_transcript = " ".join(t.strip() for t in all_transcripts if t and t.strip())
    return full_transcript


print("✅ Chunked transcription function defined")
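
The function above slices the file by time offsets. Because `sf.read` takes `start` and `stop` as frame indices, the chunk boundaries have to be expressed in frames at the file's own sample rate. The sketch below shows that arithmetic in isolation; `chunk_frame_ranges` is a hypothetical helper, not part of soundfile or NeMo.

```python
import math


def chunk_frame_ranges(total_frames, native_sr, chunk_duration=30.0):
    """Return (start, stop) frame ranges covering a file in fixed-length chunks.

    total_frames and native_sr describe the file as stored on disk; the last
    range is shorter when the file length is not a multiple of the chunk size.
    """
    frames_per_chunk = int(round(chunk_duration * native_sr))
    if frames_per_chunk <= 0:
        raise ValueError("chunk_duration * native_sr must be at least one frame")
    num_chunks = math.ceil(total_frames / frames_per_chunk)
    return [
        (i * frames_per_chunk, min((i + 1) * frames_per_chunk, total_frames))
        for i in range(num_chunks)
    ]
```

For a 70-second file stored at 44.1 kHz this yields three ranges, the last one covering the final 10 seconds. Working in integer frames also avoids the rounding drift that can accumulate when offsets are kept as floating-point seconds.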

In [None]:
# Process all test files
print("\n🎤 Starting test inference...\n")

results = []

for audio_path in tqdm(test_audio_files, desc="Transcribing test files"):
    filename = audio_path.name
    
    try:
        # Get audio duration
        duration = get_audio_duration(str(audio_path))
        
        # Choose transcription method based on duration
        if duration > CHUNK_DURATION:
            # Long audio: use chunked transcription
            transcript = transcribe_long_audio_chunked(
                loaded_model,
                str(audio_path),
                chunk_duration=CHUNK_DURATION,
                sample_rate=SAMPLE_RATE,
                batch_size=INFERENCE_BATCH_SIZE
            )
        else:
            # Short audio: transcribe directly
            transcript = loaded_model.transcribe([str(audio_path)])[0]
        
        # Store result
        results.append({
            "filename": filename,
            "transcript": transcript
        })
        
    except Exception as e:
        print(f"\n⚠️  Error processing {filename}: {e}")
        # Add empty transcript for failed files
        results.append({
            "filename": filename,
            "transcript": ""
        })
    
    # Clear memory periodically
    if len(results) % 5 == 0:
        clear_memory()

print(f"\n✅ Inference complete! Processed {len(results)} files")
print_memory_stats()

### 📊 Generate Submission File

In [None]:
# Create submission DataFrame
print("📝 Creating submission file...")

submission_df = pd.DataFrame(results)
submission_df = submission_df[["filename", "transcript"]]

# Remove .wav extension from filenames
submission_df["filename"] = submission_df["filename"].str.replace(r"\.wav$", "", regex=True)

# Fill any empty transcriptions
submission_df["transcript"] = submission_df["transcript"].fillna("")

# Save submission
submission_csv_path = os.path.join(SUBMISSION_PATH, "submission.csv")
submission_df.to_csv(submission_csv_path, index=False, encoding="utf-8")

print(f"✅ Submission saved to: {submission_csv_path}")
print(f"   Total rows: {len(submission_df)}")

# Display preview
print("\n📄 Submission preview:")
print(submission_df.head(10))

# Statistics
print("\n📊 Submission statistics:")
print(f"   Files with transcripts: {(submission_df['transcript'] != '').sum()}")
print(f"   Empty transcripts: {(submission_df['transcript'] == '').sum()}")
print(f"   Average transcript length: {submission_df['transcript'].str.len().mean():.1f} characters")
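
Beyond the statistics above, it can help to confirm that every expected test file appears exactly once before submitting. The sketch below works on a list of `{"filename", "transcript"}` dictionaries, the same shape as `results`. `check_submission` is a hypothetical helper, and the expected names are whatever identifiers the competition requires.

```python
from collections import Counter


def check_submission(rows, expected_names):
    """Report missing, duplicated, and empty entries in submission rows."""
    names = [row["filename"] for row in rows]
    counts = Counter(names)
    return {
        # Expected files that never made it into the submission
        "missing": sorted(set(expected_names) - set(names)),
        # Filenames that appear more than once
        "duplicates": sorted(name for name, count in counts.items() if count > 1),
        # Rows whose transcript is empty or whitespace only
        "empty": sorted(
            row["filename"] for row in rows if not str(row["transcript"]).strip()
        ),
    }
```

With a pandas DataFrame, `submission_df.to_dict("records")` produces rows in this shape.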

In [None]:
# Optional: Display some example transcriptions
print("\n🎯 Sample transcriptions:\n")

for i in range(min(5, len(submission_df))):
    row = submission_df.iloc[i]
    print(f"File: {row['filename']}")
    print(f"Transcript: {row['transcript'][:200]}{'...' if len(row['transcript']) > 200 else ''}")
    print("-" * 80)

# Clear memory at end
clear_memory()
print("\n✅ All done!")
print_memory_stats()

---

## 📊 Summary

### Completed Steps:

1. ✅ **Setup & Installation**: Installed NeMo toolkit and dependencies
2. ✅ **Transcript Loading**: Loaded transcripts from individual .txt files (one per audio file)
3. ✅ **Audio Preprocessing**: Created offset-based chunk metadata (NO files saved)
4. ✅ **Transcript Chunking**: Automatically split full transcripts into segments matching audio chunks
5. ✅ **Manifest Generation**: Created NeMo JSON manifests with offset/duration fields
6. ✅ **Model Fine-tuning**: Fine-tuned pretrained Bangla Conformer on P100 with memory optimizations
7. ✅ **Model Export**: Saved model to `.nemo` format
8. ✅ **Test Inference**: Processed all test audio files with chunked transcription
9. ✅ **Submission Generation**: Created CSV file ready for submission

### 🎯 Data Processing Features:

- 📂 **Individual Transcript Files**: Each audio file (e.g., `audio1.wav`) has a corresponding transcript file (`audio1.txt`)
- ✂️ **Smart Transcript Splitting**: Full transcripts automatically split into chunks based on sentence boundaries
- 📊 **Configurable Dataset Size**: Use `USE_FIRST_50_PERCENT` to train on subset of data (memory/time optimization)
- ⚖️ **Intelligent Chunking**: Distributes sentences evenly across audio segments
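
As an illustration of the splitting idea listed above, here is a minimal sketch that splits a transcript at sentence boundaries and spreads the sentences evenly over a given number of audio chunks. It is not the notebook's actual implementation. `split_sentences` and `distribute_sentences` are hypothetical names, and treating the Bangla danda (`।`) plus `.`, `?`, `!` as sentence terminators is an assumption.

```python
import re


def split_sentences(text):
    """Split text after a danda, period, question mark, or exclamation mark."""
    parts = re.split(r"(?<=[।.?!])\s+", text.strip())
    return [part for part in parts if part]


def distribute_sentences(sentences, num_chunks):
    """Assign sentences to num_chunks groups in order, as evenly as possible."""
    if num_chunks <= 0:
        raise ValueError("num_chunks must be positive")
    groups = [[] for _ in range(num_chunks)]
    total = len(sentences)
    for index, sentence in enumerate(sentences):
        # Map sentence position proportionally onto a chunk index
        target = min(index * num_chunks // total, num_chunks - 1)
        groups[target].append(sentence)
    return [" ".join(group) for group in groups]
```

Distributing by sentence count assumes a roughly constant speaking rate. Where that does not hold, the text assigned to a chunk can drift away from the audio it is paired with, which is worth checking on a few samples.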

### 💾 Memory Optimization Features:

- ⚡ **Gradient Accumulation**: Simulates a larger effective batch size without the memory cost of a larger per-step batch
- ⚡ **Mixed Precision (FP16)**: Reduces activation memory (the ~50% figure is an upper bound; actual savings depend on the model)
- ⚡ **Gradient Checkpointing**: Trades extra computation for lower activation memory
- ⚡ **Periodic Cache Clearing**: Releases cached GPU memory to help limit fragmentation
- ⚡ **On-the-fly Audio Loading**: No intermediate chunk files saved during training
- ⚡ **Batch Inference**: Processes multiple test chunks efficiently
- ⚡ **Pin Memory Disabled**: Reduces pinned (page-locked) host memory use
- 📉 **50% Data Option**: Train on first half of dataset for faster iterations
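
To make the gradient accumulation item concrete, the sketch below uses a toy one-parameter model to show that averaging gradients over equal-sized micro-batches reproduces the full-batch gradient. In practice PyTorch Lightning performs this through the `accumulate_grad_batches` argument of `Trainer`. `grad_mse` and `accumulated_grad` are hypothetical illustration functions.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch_size):
    """Average per-micro-batch gradients, as gradient accumulation does.

    Matches the full-batch gradient when len(xs) is a multiple of
    micro_batch_size, so that every micro-batch has the same size.
    """
    starts = range(0, len(xs), micro_batch_size)
    total = 0.0
    for start in starts:
        stop = start + micro_batch_size
        total += grad_mse(w, xs[start:stop], ys[start:stop]) / len(starts)
    return total
```

Only one micro-batch of activations is alive at a time, which is why the effective batch size can grow without a matching increase in peak memory.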

### 🎯 Key Benefits:

- ⚡ **No intermediate training files**: Saves disk space and processing time
- ⚡ **On-the-fly loading**: NeMo reads segments directly from original files
- ⚡ **Memory efficient**: Optimized for P100 16GB VRAM
- ⚡ **Handles long audio**: Automatic chunking for files > 30 seconds
- ⚡ **Flexible transcripts**: Works with full transcripts per audio file
- ⚡ **End-to-end**: Complete inference pipeline with per-file error handling

### 📁 Output Files:

- **Train Manifest**: `/kaggle/working/processed_data/manifests/train_manifest.json`
- **Val Manifest**: `/kaggle/working/processed_data/manifests/val_manifest.json`
- **Checkpoints**: `/kaggle/working/checkpoints/`
- **Final Model**: `/kaggle/working/nemo_bangla_asr_finetuned.nemo`
- **Submission**: `/kaggle/working/submission.csv`

---

### 🚀 Next Steps:

1. ✅ Download `submission.csv` for competition submission
2. 💡 Monitor TensorBoard logs for training insights
3. 🔧 Fine-tune hyperparameters if needed (learning rate, batch size, etc.)
4. 📊 Evaluate WER on validation set
5. 🎯 Further optimize for better accuracy
6. 🔄 If needed, train on full dataset by setting `USE_FIRST_50_PERCENT = False`
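
For the WER evaluation step, a minimal pure-Python sketch of word error rate (word-level edit distance divided by the number of reference words) is shown below. `word_error_rate` here is a hypothetical standalone function written for illustration; a maintained metric implementation may be preferable for reported results.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Applied to pairs of validation references and model outputs, this gives a per-utterance score. A corpus-level WER is usually computed by summing edit distances and reference lengths across utterances rather than averaging the per-utterance ratios.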

---

### 📝 Notes:

- **Transcript Format**: Each audio file needs a corresponding .txt file with the same base name
- **Dataset Size**: Currently using first 50% of training data (controlled by `USE_FIRST_50_PERCENT`)
- **Transcript Splitting**: Transcripts are split by sentence boundaries for better alignment
- **Inference**: Adjust `INFERENCE_BATCH_SIZE` if you encounter OOM during test inference
- **Long Audio**: Files are automatically chunked into 30-second segments
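- **Error Handling**: Files that fail during inference are still written to the submission, with an empty transcript
- **Memory**: Training chunks are read on the fly via manifest offsets; inference writes each chunk to a temporary WAV file that is deleted after its batch is transcribed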

---