# Phase II Evaluation Notebook

This notebook documents the current status of the Marathi ASR + ST pipeline, demonstrates the preprocessing outputs, and collects the latest training/evaluation artefacts for presentation at the Phase II review.

## 1. Environment quick check
Ensure you activate the project virtual environment (`source ml/bin/activate`) before running this notebook.

In [None]:
import os
import sys
from pathlib import Path

print(f"Python executable: {sys.executable}")
print(f"Current working directory: {Path.cwd()}")
print(f"CUDA visible devices: {os.environ.get('CUDA_VISIBLE_DEVICES', 'not set')}")

## 2. Datasets
The pipeline currently relies on two corpora:
- **Common Voice 22.0 (Marathi)** for ASR pre-training.
- **IWSLT 2023 (Marathi→Hindi)** for speech translation fine-tuning.

Both corpora have already been preprocessed into 16 kHz WAV clips with aligned manifests under `processed_data/`.
If you need to regenerate the manifests, invoke the helper scripts in `scripts/`.

In [None]:
COMMON_VOICE_ROOT = Path('/home/anamika-rajesh/Desktop/ml/processed_data/common_voice_16khz')
IWSLT_ROOT = Path('/home/anamika-rajesh/Desktop/ml/processed_data/iwslt_16khz')

for name, root in [('Common Voice', COMMON_VOICE_ROOT), ('IWSLT', IWSLT_ROOT)]:
    print(f'\n{name} assets:')
    print('  manifests:', sorted((root / 'manifests').glob('*.tsv')))
    print('  sentencepiece model:', sorted((root / 'spm').glob('*.model')))
    print('  dictionary:', (root / 'dict.txt').exists())
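
To spot-check that the clips really are 16 kHz, the WAV headers can be read with the standard-library `wave` module. This is a minimal sketch with several assumptions: the clips are uncompressed PCM WAV files, they sit somewhere below the dataset root, and checking a small sample is enough. The names `find_unexpected_rates` and `clips_root` are illustrative.

```python
import wave
from itertools import islice
from pathlib import Path

EXPECTED_RATE = 16_000  # Hz, as stated in the dataset description above


def wav_sample_rate(path: Path) -> int:
    """Read the sample rate from a PCM WAV header without loading the audio."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getframerate()


def find_unexpected_rates(root: Path, limit: int = 20) -> list:
    """Inspect up to `limit` WAV files under `root`; return (path, rate) mismatches."""
    mismatches = []
    for wav_path in islice(root.rglob("*.wav"), limit):
        rate = wav_sample_rate(wav_path)
        if rate != EXPECTED_RATE:
            mismatches.append((wav_path, rate))
    return mismatches


# Assumed location relative to the repository root; adjust to your layout.
clips_root = Path("processed_data/common_voice_16khz")
if clips_root.exists():
    print("Unexpected sample rates:", find_unexpected_rates(clips_root))
else:
    print(f"{clips_root} not found; adjust clips_root before running.")
```

An empty list means every sampled clip matched the expected rate; it does not prove the whole corpus is consistent.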

In [None]:
from itertools import islice

def preview_manifest(manifest_path: Path, num_lines: int = 5) -> None:
    """Print the first `num_lines` lines of a manifest file."""
    if not manifest_path.exists():
        raise FileNotFoundError(f'Manifest missing: {manifest_path}')
    print(f'Previewing {manifest_path}:')
    with manifest_path.open('r', encoding='utf-8') as handle:
        # islice stops after num_lines lines and prints nothing when num_lines is 0.
        for line in islice(handle, num_lines):
            print(line.rstrip())

preview_manifest(COMMON_VOICE_ROOT / 'manifests' / 'train.tsv')
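
Beyond previewing the first lines, it can help to know how many rows a manifest holds and which columns it exposes. The sketch below assumes the manifest is a tab-separated file whose first row is a header; the actual column names depend on the preparation scripts and are not assumed here. The helper name `summarize_manifest` is illustrative.

```python
import csv
from pathlib import Path


def summarize_manifest(manifest_path: Path) -> dict:
    """Return the header columns and the number of data rows in a TSV manifest."""
    with manifest_path.open("r", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside transcripts from being
        # interpreted as CSV quoting.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader, [])
        num_rows = sum(1 for _ in reader)
    return {"columns": header, "num_rows": num_rows}


# Assumed path relative to the repository root; adjust to your layout.
manifest = Path("processed_data/common_voice_16khz/manifests/train.tsv")
if manifest.exists():
    print(summarize_manifest(manifest))
else:
    print(f"{manifest} not found; adjust the path before running.")
```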

## 3. Preprocessing summary
Scripts (`scripts/prepare_common_voice_dataset.py` and `scripts/prepare_iwslt_dataset.py`) cover resampling, manifest generation, and text normalization.

Re-run them if the raw corpora are updated or if you need to regenerate artefacts with different parameters.

In [None]:
# Example: regenerate Common Voice manifests (commented out to avoid accidental reprocessing)
# Note: each backslash must be the last character on its line for the continuation to work.
# !python scripts/prepare_common_voice_dataset.py \
#     --input-root cv-corpus-22.0-2025-06-20-mr/cv-corpus-22.0-2025-06-20/mr \
#     --output-root processed_data/common_voice_16khz \
#     --num-workers 8

# Example: regenerate IWSLT manifests (also commented out)
# !python scripts/prepare_iwslt_dataset.py \
#     --input-root datasets/iwslt2023_mr-hi \
#     --output-root processed_data/iwslt_16khz \
#     --num-workers 8
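
To give a rough idea of what a text normalization step can look like, here is a small illustration. It is not the logic of the preparation scripts, which is not shown in this notebook. The choices it makes are assumptions for the example: Unicode NFC normalization, removal of ASCII punctuation and the Devanagari danda marks, and whitespace collapsing. The function name `normalize_transcript` is illustrative.

```python
import re
import string
import unicodedata

# ASCII punctuation plus the Devanagari danda (U+0964) and double danda (U+0965).
_PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "\u0964\u0965]")
_SPACE_RE = re.compile(r"\s+")


def normalize_transcript(text: str) -> str:
    """Apply NFC normalization, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _PUNCT_RE.sub(" ", text)  # replace punctuation with spaces
    return _SPACE_RE.sub(" ", text).strip()


print(normalize_transcript("  नमस्कार,   जग!  "))
```

Whatever rules the real scripts apply, the same normalization should be used for references and hypotheses at scoring time, otherwise WER and BLEU are not comparable across runs.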

## 4. Training configuration
Fairseq Hydra configs live in `configs/fairseq/`.
The current ASR pretraining run relies on `asr_pretrain.yaml`, while `st_finetune.yaml` consumes the ASR checkpoint for translation fine-tuning.

In [None]:
ASR_CONFIG = Path('/home/anamika-rajesh/Desktop/ml/configs/fairseq/asr_pretrain.yaml')
ST_CONFIG = Path('/home/anamika-rajesh/Desktop/ml/configs/fairseq/st_finetune.yaml')

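# Print the first 800 characters of each config, flagging missing files.
for config_path in (ASR_CONFIG, ST_CONFIG):
    text = config_path.read_text(encoding='utf-8')[:800] if config_path.exists() else 'not found'
    print(f'--- {config_path.name} ---\n{text}')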

## 5. Launch commands
Training scripts thinly wrap `fairseq-hydra-train`.
Execute them from the repository root with the `ml` virtual environment activated.

In [None]:
print('ASR pretrain command:')
print('  source ml/bin/activate && ./scripts/run_asr_pretrain.sh')
print('ST fine-tune command:')
print('  source ml/bin/activate && ./scripts/run_st_finetune.sh \\')
print('        --encoder-checkpoint checkpoints/asr_pretrain/checkpoint_best.pt')
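
Because the ST fine-tune command depends on the ASR checkpoint, a quick pre-flight check can save a failed launch. The sketch below only tests that the file exists and is non-empty; it does not validate the checkpoint contents. The path is the one from the command above, taken relative to the repository root, and the helper name `checkpoint_ready` is illustrative.

```python
from pathlib import Path

# Path from the fine-tune command above, relative to the repository root.
ENCODER_CHECKPOINT = Path("checkpoints/asr_pretrain/checkpoint_best.pt")


def checkpoint_ready(path: Path) -> bool:
    """Return True if `path` is an existing, non-empty file."""
    return path.is_file() and path.stat().st_size > 0


if checkpoint_ready(ENCODER_CHECKPOINT):
    print(f"Encoder checkpoint found: {ENCODER_CHECKPOINT}")
else:
    print(f"Encoder checkpoint missing or empty: {ENCODER_CHECKPOINT}")
```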

## 6. Tracking metrics
Once a training job completes, Fairseq logs appear in `checkpoints/**/train.log`. The snippet below extracts the last few updates for quick inspection.

In [None]:
from collections import deque

LOG_PATH = Path('/home/anamika-rajesh/Desktop/ml/checkpoints/asr_pretrain/train.log')
if LOG_PATH.exists():
    print('Latest log entries:')
    with LOG_PATH.open('r', encoding='utf-8') as handle:
        # deque(maxlen=10) keeps only the last 10 lines without loading the whole log.
        for line in deque(handle, maxlen=10):
            print(line.rstrip())
else:
    print('Train log not found yet. Run a training job to generate it.')
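
If the training run is configured to emit JSON-formatted progress stats, the raw lines can be turned into dictionaries for easier inspection. The sketch below assumes each stats line ends with a JSON object and simply skips lines that do not. Whether your logs look like this depends on the logging settings in the config, so treat the format as an assumption. The `sample` line is synthetic and only illustrates the shape; it is not output from a real run.

```python
import json
from collections import deque
from pathlib import Path


def parse_json_stats(line: str):
    """Return the JSON object embedded at the end of a log line, or None."""
    start = line.find("{")
    if start == -1:
        return None
    try:
        stats = json.loads(line[start:])
    except json.JSONDecodeError:
        return None
    return stats if isinstance(stats, dict) else None


def latest_stats(log_path: Path, count: int = 5) -> list:
    """Collect the last `count` parseable stats entries from a log file."""
    recent = deque(maxlen=count)
    with log_path.open("r", encoding="utf-8") as handle:
        for line in handle:
            stats = parse_json_stats(line.strip())
            if stats is not None:
                recent.append(stats)
    return list(recent)


# Synthetic line for illustration only (not taken from an actual training log).
sample = '2025-01-01 00:00:00 | INFO | train_inner | {"epoch": 1, "loss": "5.432"}'
print(parse_json_stats(sample))
```

Calling `latest_stats(LOG_PATH)` on an existing log would then return the most recent parsed entries, or an empty list if the log is not in JSON format.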

## 7. Current findings
Summarise early WER/BLEU numbers, qualitative observations, and open questions here before the evaluation meeting.
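
As a reminder of what the WER figure measures, here is a minimal word-level sketch: the edit distance between reference and hypothesis word sequences (substitutions, deletions, insertions) divided by the number of reference words. It assumes whitespace tokenization and already-normalized text. For reported numbers, an established scoring tool is the safer choice; this version is only meant for sanity checks, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# One inserted word against a three-word reference.
print(word_error_rate("the cat sat", "the cat sat down"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions.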

## 8. Next actions
- [ ] Finish the ongoing ASR pre-training run and archive the best checkpoint.
- [ ] Kick off ST fine-tuning with the updated encoder.
- [ ] Run inference on the dev/test sets and record WER/BLEU.
- [ ] Update this notebook with final metrics, plots, and analysis.