Model: facebook/nllb-200-distilled-600M (Meta AI)
Dataset: LinCE — Linguistic Code-switching Evaluation Benchmark
Task: Translate Hinglish (Hindi-English) and Spanglish (Spanish-English) → English
```text
codemixed_mt/
├── config.py          # All hyperparameters and language code mappings
├── data_pipeline.py   # LinCE loader, text cleaning, NLLB tokenization
├── trainer.py         # Model loading, metrics (BLEU/ChrF/COMET), Seq2SeqTrainer
├── inference.py       # Translation pipeline, ONNX export, interactive demo
├── main.py            # CLI entry point (train / evaluate / translate / demo)
├── notebook.ipynb     # End-to-end Jupyter walkthrough
└── requirements.txt   # Python dependencies
```
```text
Code-Mixed Input (Hinglish/Spanglish)
         │
         ▼
┌─────────────────────┐
│ Text Cleaner        │  Unicode normalization, URL/emoji removal,
│                     │  repeated-char normalization, Romanized Hindi norms
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ NLLB Tokenizer      │  src_lang=hin_Deva / spa_Latn
│ AutoTokenizer       │  Handles multilingual subword tokenization
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ NLLB-200 (600M)     │  Encoder-Decoder Transformer
│ Distilled Model     │  Fine-tuned on code-mixed parallel data
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Beam Search         │  forced_bos_token_id=eng_Latn
│ (num_beams=4)       │  Guides output to target language
└────────┬────────────┘
         │
         ▼
English Translation
```
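The Text Cleaner stage could look roughly like the sketch below. It is an illustration only, not the code in `data_pipeline.py`: the function name `clean_text`, the regular expressions, and the choice to collapse repeated letters to two are all assumptions, and the Romanized Hindi normalization step is not covered.

```python
import re
import unicodedata

# Hypothetical patterns; the project's actual cleaning rules may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji/pictograph ranges, not an exhaustive list.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF\u2600-\u27BF\uFE0F\u200D]"
)
# A letter repeated three or more times (digits are left alone so "1000" survives).
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop URLs and emoji, and shorten elongated words."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)  # "sooooo" -> "soo"
    return " ".join(text.split())  # collapse leftover whitespace


print(clean_text("Yaar this is sooooo good \U0001F602 https://t.co/abc"))
```

Collapsing to two letters rather than one keeps legitimate double letters ("good", "yaar") intact.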
| Language | Script | NLLB Code |
|---|---|---|
| English | Latin | eng_Latn |
| Hindi | Devanagari | hin_Deva |
| Spanish | Latin | spa_Latn |
| French | Latin | fra_Latn |
| Arabic | Arabic | arb_Arab |
Important: NLLB uses FLORES-200 language codes, which combine an ISO 639-3 language code with an ISO 15924 script code (for example, `hin_Deva`). Always set
`forced_bos_token_id` to the target language token to steer generation.
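As a minimal sketch of how the codes in the table feed into generation: the short keys (`hi`, `en`, ...) and the helper names `resolve_lang_pair` and `translate` are hypothetical and only mirror the `--lang-pair hi-en` style used by the CLI; the real mapping lives in `config.py`. The `transformers` calls are standard, and `translate` downloads the base model on first use.

```python
from typing import Tuple

# Hypothetical short-code mapping built from the table above.
LANG_CODES = {
    "en": "eng_Latn",
    "hi": "hin_Deva",
    "es": "spa_Latn",
    "fr": "fra_Latn",
    "ar": "arb_Arab",
}


def resolve_lang_pair(pair: str) -> Tuple[str, str]:
    """Turn a pair such as 'hi-en' into (source_code, target_code)."""
    src, tgt = pair.split("-")
    return LANG_CODES[src], LANG_CODES[tgt]


def translate(text: str, pair: str = "hi-en",
              model_name: str = "facebook/nllb-200-distilled-600M") -> str:
    """Translate one sentence, forcing the target language token."""
    # Imported lazily so the mapping helpers work without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    src_code, tgt_code = resolve_lang_pair(pair)
    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src_code)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    output_ids = model.generate(
        **inputs,
        # First decoder token is the target language ID.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),
        num_beams=4,
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]


print(resolve_lang_pair("hi-en"))
```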
Install dependencies:

```bash
pip install -r requirements.txt
```

Train:

```bash
python main.py --mode train --lang-pair hi-en --epochs 5 --batch-size 16
```

Train on custom CSV files:

```bash
python main.py \
  --mode train \
  --lang-pair hi-en \
  --train-csv data/train.csv \
  --val-csv data/val.csv \
  --test-csv data/test.csv
```

CSV format:

```csv
source,target
"Mujhe bahut hunger lag raha hai","I am feeling very hungry"
```
Translate a single sentence:

```bash
python main.py \
  --mode translate \
  --model-path ./nllb_codemixed_output \
  --lang-pair hi-en \
  --text "Mujhe bahut zyada hunger lag raha hai"
```

Translate a file:

```bash
python main.py \
  --mode translate \
  --model-path ./nllb_codemixed_output \
  --lang-pair hi-en \
  --input-file inputs.txt \
  --output-file translations.txt
```

Evaluate:

```bash
python main.py \
  --mode evaluate \
  --model-path ./nllb_codemixed_output \
  --lang-pair hi-en
```

Interactive demo:

```bash
python main.py --mode demo --model-path ./nllb_codemixed_output
```

Option 1 (Hugging Face Hub):

```python
from datasets import load_dataset

# Config name as given here; check the dataset card for the configs actually available.
dataset = load_dataset("lince", "mt_hineng")
```

Option 2 (Kaggle):

```bash
pip install kaggle
kaggle datasets download -d <lince-dataset-slug>
```

Option 3 (Official Website):

Download from https://ritual.uh.edu/lince/ and convert to CSV format with `source,target` columns.
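Whichever option you use, the training CSVs need a `source,target` header. The sketch below shows one way to write and validate such a file with the standard `csv` module. The helper names `write_pairs` and `read_pairs` are hypothetical, and it assumes you have already extracted (code-mixed, English) sentence pairs from the download; the LinCE file layout itself is not handled here.

```python
import csv
import io


def write_pairs(pairs, fh):
    """Write (source, target) pairs with the header the trainer expects."""
    writer = csv.writer(fh)  # quotes fields containing commas automatically
    writer.writerow(["source", "target"])
    writer.writerows(pairs)


def read_pairs(fh):
    """Read pairs back, checking the header and skipping incomplete rows."""
    reader = csv.DictReader(fh)
    missing = {"source", "target"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing columns: {sorted(missing)}")
    pairs = []
    for row in reader:
        src = (row["source"] or "").strip()
        tgt = (row["target"] or "").strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# In-memory round trip; for real files use open(path, newline="", encoding="utf-8").
buffer = io.StringIO()
write_pairs(
    [("Mujhe bahut hunger lag raha hai", "I am feeling very hungry"),
     ("Yes, chalo", "Yes, let's go")],
    buffer,
)
buffer.seek(0)
print(read_pairs(buffer))
```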
| Parameter | Default | Notes |
|---|---|---|
| Model | nllb-200-distilled-600M | ~600M params |
| Max Input Length | 128 | Tokens |
| Max Target Length | 128 | Tokens |
| Batch Size | 16 | Per GPU |
| Gradient Accumulation | 2 | Effective batch = 32 |
| Learning Rate | 3e-5 | AdamW |
| Warmup Ratio | 5% | Of total steps |
| Epochs | 5 | With early stopping |
| FP16 | True | GPU only |
| Scheduler | Linear decay | With warmup |
GPU Memory Requirements:
- FP16 training: ~12 GB VRAM (batch=16)
- FP16 training: ~8 GB VRAM (batch=8, grad_accum=4)
- CPU training: Supported but very slow
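The two GPU configurations above give the same effective batch size, which is why they can share the other hyperparameters. The arithmetic can be sketched as follows; the function names are hypothetical, the dataset size of 8,000 examples is an arbitrary illustration rather than a LinCE figure, and the step count is an approximation of what `Seq2SeqTrainer` computes.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def schedule_steps(num_examples, per_device_batch, grad_accum_steps,
                   epochs, warmup_ratio=0.05, num_devices=1):
    """Approximate total optimizer steps and warmup steps for a run."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    steps_per_epoch = math.ceil(num_examples / batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return total_steps, warmup_steps


print(effective_batch_size(16, 2))  # default configuration
print(effective_batch_size(8, 4))   # lower-memory configuration
print(schedule_steps(8000, 16, 2, epochs=5))  # 8000 is a made-up dataset size
```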
| Metric | Description | Range |
|---|---|---|
| SacreBLEU | BLEU with standardized, reproducible tokenization | 0–100 (higher=better) |
| ChrF | Character n-gram F-score | 0–100 (higher=better) |
| COMET | Neural metric (GPU recommended) | Roughly 0–1 (higher=better) |
Typical scores on code-mixed MT:
- SacreBLEU: 20–40 (task is harder than clean MT)
- ChrF: 40–60
- COMET: 0.5–0.8
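To give a feel for what a character n-gram F-score measures, here is a simplified, dependency-free sketch. It is not the sacreBLEU implementation of ChrF (which the project's metrics should rely on), and the name `simple_chrf` is hypothetical; it averages n-gram precision and recall over orders 1 to 6 and combines them with a recall-weighted F-score (beta = 2).

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score on a 0-100 scale for one sentence pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # sentence too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(simple_chrf("I am very hungry", "I am feeling very hungry"))
```

Because it works on characters rather than words, a score like this gives partial credit for near-miss spellings, which is common in Romanized Hindi.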
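Gradient checkpointing (trades extra compute for lower memory use):

```bash
python main.py --mode train --gradient-checkpointing
```

Quantized inference:

```python
from inference import CodeMixedTranslator  # assumed to live in inference.py

translator = CodeMixedTranslator(
    model_path="./nllb_codemixed_output",
    use_quantization=True,  # Requires bitsandbytes
)
```

ONNX export:

```python
from inference import export_to_onnx

export_to_onnx(
    model_path="./nllb_codemixed_output",
    output_dir="./onnx_model",
)
# Requires: pip install optimum[onnxruntime]
```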
- `forced_bos_token_id`: Critical for NLLB. It forces the decoder's first token to be the target language ID, steering generation to the correct language.
- Label padding with `-100`: PyTorch cross-entropy skips targets equal to `-100` (its default `ignore_index`), so padding positions do not contribute to the loss.
- `group_by_length=True`: Batches sequences of similar length together, reducing padding and improving efficiency.
- `EarlyStoppingCallback`: Stops training if BLEU does not improve for 3 consecutive evaluations, limiting overfitting.
- Code-mixed `src_lang`: We use the dominant language's code (e.g., `hin_Deva` for Hinglish). This is a simplification; true code-mixed handling would require token-level language identification (LID).
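Two of the notes above, label masking and early stopping, can be sketched in plain Python. These are illustrations of the ideas only: the function names are hypothetical, the pad token ID of `1` in the example is an assumption, and in the real pipeline the data collator and `EarlyStoppingCallback` do this work.

```python
def mask_label_padding(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding IDs in each label sequence so the loss skips them."""
    return [
        [tok if tok != pad_token_id else ignore_index for tok in seq]
        for seq in label_ids
    ]


def should_stop(bleu_history, patience=3):
    """True when the last `patience` evaluations failed to beat the earlier best."""
    if len(bleu_history) <= patience:
        return False  # not enough evaluations yet
    best_before = max(bleu_history[:-patience])
    return all(score <= best_before for score in bleu_history[-patience:])


# Assumed pad_token_id=1 for illustration; read it from the tokenizer in practice.
print(mask_label_padding([[256047, 59, 870, 2, 1, 1]], pad_token_id=1))
print(should_stop([20.0, 25.0, 24.0, 25.0, 23.0]))  # no gain in 3 evaluations
print(should_stop([20.0, 22.0, 21.0, 23.0, 22.0]))  # 23.0 beat the earlier best
```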
```bibtex
@inproceedings{aguilar2020lince,
  title={LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation},
  author={Aguilar, Gustavo and others},
  booktitle={LREC},
  year={2020}
}

@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={NLLB Team, Meta AI},
  year={2022}
}
```