# Fine-tune NLLB-200-distilled-600M on Google Colab

This notebook fine-tunes `facebook/nllb-200-distilled-600M` on the OPUS-100 dataset with:
- ✅ **Automatic checkpoint saving to HuggingFace Hub** (no data loss)
- ✅ **Resume from checkpoint** if interrupted
- ✅ **Optimized for T4 GPU** (free on Colab)

**Hardware**: Runtime → Change runtime type → T4 GPU

**Time**: ~2-3 hours for 3 language pairs

## 1. Setup and Installation

In [None]:
# Install dependencies
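# Quote the version specifiers: an unquoted ">" is treated by the shell as output redirection
!pip install -q "transformers>=4.35.0" "datasets>=2.14.0" "accelerate>=0.24.0" sentencepiece protobuf tensorboard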

In [None]:
# Check GPU
!nvidia-smi
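
If you prefer a check that fails loudly instead of reading the `nvidia-smi` table by eye, the following is a minimal sketch using only the standard library. The helper name `gpu_summary` is hypothetical; the `--query-gpu` and `--format` options are standard `nvidia-smi` flags.

```python
import shutil
import subprocess


def gpu_summary():
    """Return the GPU name and total memory reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # No NVIDIA driver tools on this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # Tool is present but no usable GPU was found
    return result.stdout.strip() or None


summary = gpu_summary()
print(summary if summary else "No GPU detected: switch the runtime type to T4 GPU")
```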

## 2. HuggingFace Login

Get a token with **write** access (needed to push checkpoints) from: https://huggingface.co/settings/tokens

In [None]:
from huggingface_hub import login

# Enter your HuggingFace token
HF_TOKEN = ""  # Paste your token here

login(token=HF_TOKEN)
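
Pasting a token into a cell makes it easy to leak when the notebook is shared. As an alternative, here is a small sketch that reads the token from the `HF_TOKEN` environment variable and otherwise prompts for it without echoing. The helper name `get_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import getpass
import os


def get_hf_token(environ=os.environ, prompt=getpass.getpass):
    """Return a token from the HF_TOKEN environment variable, else prompt for it."""
    token = environ.get("HF_TOKEN", "").strip()
    if not token:
        # getpass hides the typed value, so it does not end up in the cell output
        token = prompt("HuggingFace token: ").strip()
    if not token:
        raise ValueError("No HuggingFace token provided")
    return token
```

You would then call `login(token=get_hf_token())` instead of hard-coding the value.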

## 3. Configuration

Customize your training parameters here:

In [None]:
# Training configuration
CONFIG = {
    "model_name": "facebook/nllb-200-distilled-600M",
    "language_pairs": ["en-fr", "en-de", "en-es"],  # Customize!
    "max_samples_per_pair": 50000,
    "epochs": 3,
    "batch_size": 4,
    "learning_rate": 5e-5,
    
    # HuggingFace Hub settings
    "hf_username": "YOUR_USERNAME",  # CHANGE THIS!
    "hub_repo_name": "fine-tuned-nllb-600M",
    
    # Resume from checkpoint (leave None to start fresh)
    "resume_from_checkpoint": None,  # Or "YOUR_USERNAME/fine-tuned-nllb-600M"
}

# Validate configuration
assert CONFIG["hf_username"] != "YOUR_USERNAME", "Please set your HuggingFace username!"

print("✅ Configuration loaded")
print(f"Training pairs: {CONFIG['language_pairs']}")
print(f"Hub repo: {CONFIG['hf_username']}/{CONFIG['hub_repo_name']}")
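
The single `assert` above only catches an unset username. A slightly broader check can be sketched as below. The function `validate_config` and the `xx-yy` pattern for language pairs are assumptions based on the values shown in `CONFIG`, not something the training script is known to enforce.

```python
import re

# Language pairs in this notebook look like "en-fr": two short lowercase codes
PAIR_PATTERN = re.compile(r"^[a-z]{2,3}-[a-z]{2,3}$")


def validate_config(config):
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    if not config.get("hf_username") or config["hf_username"] == "YOUR_USERNAME":
        problems.append("hf_username is not set")
    pairs = config.get("language_pairs") or []
    if not pairs:
        problems.append("language_pairs is empty")
    for pair in pairs:
        if not PAIR_PATTERN.match(pair):
            problems.append(f"language pair {pair!r} is not of the form 'en-fr'")
    for key in ("max_samples_per_pair", "epochs", "batch_size"):
        value = config.get(key)
        if not isinstance(value, int) or value <= 0:
            problems.append(f"{key} must be a positive integer, got {value!r}")
    lr = config.get("learning_rate")
    if not isinstance(lr, (int, float)) or lr <= 0:
        problems.append(f"learning_rate must be positive, got {lr!r}")
    return problems
```

Calling `validate_config(CONFIG)` right after the configuration cell and printing the returned list surfaces every problem at once rather than stopping at the first.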

## 4. Download Training Script

In [None]:
# Option A: download the script from your repository
# (replace YOUR_REPO with the <user>/<repo>/<branch> path that hosts it)
!wget https://raw.githubusercontent.com/YOUR_REPO/train_nllb_hf_spaces.py -O train_nllb_hf_spaces.py

# Option B: upload it manually: click the folder icon → Upload → select train_nllb_hf_spaces.py
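
A failed download can leave an empty file or an HTML error page under the expected name, and the problem then only shows up when training starts. The sketch below checks that the file exists, is non-empty, and parses as Python. The helper `check_script` is hypothetical and says nothing about what the script itself does.

```python
import ast
from pathlib import Path


def check_script(path):
    """Raise if the file is missing, empty, or not parseable as Python.

    Returns the number of lines in the script on success.
    """
    script = Path(path)
    if not script.is_file():
        raise FileNotFoundError(f"{path} not found: download or upload it first")
    source = script.read_text(encoding="utf-8")
    if not source.strip():
        raise ValueError(f"{path} is empty: the download probably failed")
    # ast.parse raises SyntaxError if the content is, for example, an HTML page
    ast.parse(source, filename=str(script))
    return len(source.splitlines())
```

For example, `check_script("train_nllb_hf_spaces.py")` can be run in a cell before the training step.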

## 5. Start Training

**Important**: Checkpoints are automatically saved to HuggingFace Hub every 500 steps!

Monitor your Hub repo: https://huggingface.co/YOUR_USERNAME/fine-tuned-nllb-600M

In [None]:
# Build command
hub_model_id = f"{CONFIG['hf_username']}/{CONFIG['hub_repo_name']}"

cmd = [
    "python", "train_nllb_hf_spaces.py",
    "--language_pairs", *CONFIG["language_pairs"],
    "--max_samples", str(CONFIG["max_samples_per_pair"]),
    "--epochs", str(CONFIG["epochs"]),
    "--batch_size", str(CONFIG["batch_size"]),
    "--learning_rate", str(CONFIG["learning_rate"]),
    "--push_to_hub",
    "--hub_repo_name", CONFIG["hub_repo_name"],
    "--hf_username", CONFIG["hf_username"],
]

# Add resume checkpoint if specified
if CONFIG["resume_from_checkpoint"]:
    cmd.extend(["--resume_from_checkpoint", CONFIG["resume_from_checkpoint"]])

print("Starting training with command:")
print(" ".join(cmd))
print("\n" + "="*60)

# Run training; check=True raises CalledProcessError if the script exits with an error
import subprocess
subprocess.run(cmd, check=True)
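
Depending on the notebook frontend, output from a child process started with `subprocess.run` may not appear in the cell. One way around this, sketched below under that assumption, is to read the child's output line by line and print it from the notebook process. The helper `run_and_stream` is hypothetical, and the demonstration uses a trivial child process rather than the training script.

```python
import subprocess
import sys


def run_and_stream(cmd):
    """Run cmd, echo its output line by line, and raise if it exits non-zero."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so tracebacks appear in order
        text=True,
        bufsize=1,  # line-buffered
    )
    lines = []
    for line in process.stdout:
        print(line, end="")
        lines.append(line.rstrip("\n"))
    returncode = process.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
    return lines


# Demonstration with a trivial child process instead of the training script
run_and_stream([sys.executable, "-c", "print('training would start here')"])
```

Passing the `cmd` list built above to `run_and_stream` would show the training log as it is produced.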

## 6. Monitor Training

Check your HuggingFace Hub repository to see checkpoints:

https://huggingface.co/YOUR_USERNAME/fine-tuned-nllb-600M

You should see:
- `checkpoint-500/`
- `checkpoint-1000/`
- `checkpoint-1500/`
- etc.

**If Colab disconnects**: Reconnect, re-run the setup, login, and download cells (a new runtime starts empty), and set this in the configuration cell:
```python
CONFIG["resume_from_checkpoint"] = "YOUR_USERNAME/fine-tuned-nllb-600M"
```

Then re-run the training cell!
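
To see which checkpoint a resumed run would pick up, you can look for the highest-numbered `checkpoint-N` folder in the repo's file list. The sketch below assumes the folder layout listed above; the helper `latest_checkpoint` is hypothetical.

```python
import re

# Matches "checkpoint-1500" or "checkpoint-1500/..." at the start of a repo path
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)(?:/|$)")


def latest_checkpoint(paths):
    """Return the highest-numbered 'checkpoint-N' folder name in paths, or None."""
    steps = {int(m.group(1)) for p in paths if (m := CHECKPOINT_RE.match(p))}
    return f"checkpoint-{max(steps)}" if steps else None


files = [
    "checkpoint-500/config.json",
    "checkpoint-1000/config.json",
    "checkpoint-1500/model.safetensors",
    "README.md",
]
print(latest_checkpoint(files))  # checkpoint-1500
```

The file list for a real repo can be obtained with `huggingface_hub.list_repo_files(repo_id)`, which returns paths relative to the repo root.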

## 7. Test Fine-tuned Model

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load your fine-tuned model from Hub
model_name = f"{CONFIG['hf_username']}/{CONFIG['hub_repo_name']}"

print(f"Loading model: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Test translation (English to French)
test_text = "Hello, how are you today?"
print(f"\nTest input: '{test_text}'")

tokenizer.src_lang = "eng_Latn"
inputs = tokenizer(test_text, return_tensors="pt").to(device)

translated_tokens = model.generate(
    **inputs,
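    # convert_tokens_to_ids works across transformers versions;
    # tokenizer.lang_code_to_id was deprecated and is absent from newer releases
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),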
    max_length=512,
    num_beams=5
)

translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(f"Test output (French): '{translation}'")
print("\n✅ Model test successful!")
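
The test above hard-codes `eng_Latn` and `fra_Latn`. To test the other pairs in `CONFIG["language_pairs"]`, the two-letter codes have to be mapped to NLLB's FLORES-200 codes. Here is a minimal sketch covering only the four languages used in this notebook; the names `NLLB_CODES` and `pair_to_nllb` are hypothetical.

```python
# Two-letter codes used in CONFIG["language_pairs"] -> NLLB (FLORES-200) codes.
# Only the languages that appear in this notebook are listed; extend as needed.
NLLB_CODES = {
    "en": "eng_Latn",
    "fr": "fra_Latn",
    "de": "deu_Latn",
    "es": "spa_Latn",
}


def pair_to_nllb(pair):
    """Turn a pair such as 'en-fr' into (source_code, target_code) for NLLB."""
    src, tgt = pair.split("-")
    try:
        return NLLB_CODES[src], NLLB_CODES[tgt]
    except KeyError as exc:
        raise ValueError(f"No NLLB code known for {exc.args[0]!r} in pair {pair!r}") from exc


for pair in ["en-fr", "en-de", "en-es"]:
    print(pair, "->", pair_to_nllb(pair))
```

The first element would go into `tokenizer.src_lang` and the second into `tokenizer.convert_tokens_to_ids(...)` for `forced_bos_token_id`.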

## 8. Download to Local Machine

On your local machine, run:

```bash
python download_finetuned_model.py \
    --model_name YOUR_USERNAME/fine-tuned-nllb-600M \
    --test
```

## Tips for Success

### Preventing Colab Disconnects

1. **Keep browser tab active** - Colab may disconnect if tab is inactive
2. **Use Colab Pro** - Longer timeout periods ($10/month)
3. **Enable background execution** - Colab Pro feature
4. **Don't worry about disconnects** - Checkpoints are on Hub!

### If Training Stops

1. Check your Hub repo for last checkpoint
2. Set `CONFIG["resume_from_checkpoint"]` to your Hub repo
3. Re-run training cell
4. Training continues from last checkpoint!

### Monitoring Progress

- **Hub repo**: https://huggingface.co/YOUR_USERNAME/fine-tuned-nllb-600M
- **TensorBoard**: In another cell, run `%load_ext tensorboard` and then `%tensorboard --logdir ./fine-tuned-nllb/logs` on separate lines (line magics cannot be chained with `;`)

### Estimated Timeline

- First checkpoint (~500 steps): 10-15 minutes
- Total training (3 language pairs, 50k each): 2-3 hours
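- Checkpoints every 500 steps: ~10-15 checkpoints total (the exact count depends on the sample count and effective batch size the training script actually uses)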
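
To sanity-check these estimates against your own settings, the number of optimizer steps can be approximated from the configuration. The sketch below assumes that every sample is used for training on a single GPU. `grad_accum` (gradient accumulation steps) is an assumed parameter, since the training script's internals are not shown here.

```python
import math


def training_steps(num_pairs, samples_per_pair, epochs, batch_size, grad_accum=1):
    """Approximate optimizer steps for a single-GPU run.

    grad_accum is an assumed setting; the script's actual value is unknown.
    """
    samples = num_pairs * samples_per_pair
    steps_per_epoch = math.ceil(samples / (batch_size * grad_accum))
    return steps_per_epoch * epochs


total = training_steps(num_pairs=3, samples_per_pair=50_000, epochs=3, batch_size=4)
print(total)         # 112500
print(total // 500)  # 225
```

With the `CONFIG` values and no gradient accumulation, this gives far more steps and checkpoints than the estimates listed above. Those figures presumably reflect a larger effective batch size or a smaller sample count inside the script, so it is worth recomputing once you know its settings.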