Multilingual LLM training pipeline for 40+ Ugandan languages. Adapts open-weight models (Gemma, Qwen) via continued pretraining, translation-focused SFT, and GRPO post-training.
uv sync # installs everything pinned in pyproject.toml
apt-get install -y ffmpeg libnpp-12-8 # needed for torchcodec (speech data only)
hf auth login # Hugging Face token for gated models
export MLFLOW_TRACKING_USERNAME=... MLFLOW_TRACKING_PASSWORD=...uv sync pulls torch + CUDA wheels from the PyTorch cu128 index (see
[tool.uv.sources]), so you don't pip install unsloth separately.
Optional -- Flash Attention 2, significantly faster than the xformers fallback (used by TranslateGemma 12B and Qwen3 14B). Match CUDA + torch + Python from mjun0812/flash-attention-prebuild-wheels:
uv pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whlGemma 4 has global_head_dim=512, which FA2/FA3 don't support. The
Gemma 4 config pins attn_implementation: sdpa to route through cuDNN
SDPA -- FA2 stays installed but is bypassed for that model only.
Every command is <script> --config <yaml>. Add --dry-run to validate
config + load model without training.
# Continued pretraining
uv run python scripts/pretrain.py --config configs/pretrain/translategemma-12b.yml
# SFT -- text-only (translation)
uv run python scripts/sft.py --config configs/sft/translategemma-12b.yml
# SFT -- multimodal (translation + speech, Gemma 4)
uv run python scripts/sft.py --config configs/sft/gemma4-e4b-speech.yml
# GRPO
uv run python scripts/rl.py --config configs/rl/translategemma-grpo.yml
# Multi-GPU (DDP)
torchrun --nproc_per_node=2 scripts/sft.py --config configs/sft/translategemma-12b.ymlThe SFT script auto-detects modality: if any entry under sft_datasets
in the YAML has modality: audio, it switches to the Unsloth Gemma-4
vision pipeline (multimodal processor, UnslothVisionDataCollator,
per-modality eval losses). Otherwise it runs the legacy text pipeline
with response-only masking. Existing TranslateGemma and Qwen3 configs
are unchanged.
uv run python scripts/eval.py --config configs/eval/translation.yml
uv run python scripts/eval.py --config configs/eval/translation.yml --model-id Sunbird/other-model
uv run python scripts/analysis/analyze_evals.py outputs/eval/Copy the closest existing config in configs/<stage>/, change
base_model and base_model_original, adjust LoRA rank if needed. If
the arch is new to Unsloth (like Gemma 4), add its audio/vision
lora_target_modules from the relevant Unsloth notebook -- see
configs/sft/gemma4-e4b-speech.yml for the Gemma 4 template.
Speech data lives in sft_datasets alongside text. The model does not
auto-detect language -- you identify it in the user_message prompt
per dataset entry. Train-time prompts become the inference-time prompt,
so pick phrasing you'll actually use at deploy time.
Example multi-language SFT (append to sft_datasets in your config):
sft_datasets:
# Text translation -- no modality flag needed
- repo: Sunbird/ug40-instructions
config: translations-fine-tuning
split: train
# Luganda speech
- repo: Sunbird/salt
config: multispeaker-lug
split: train
eval_split: "dev[:200]"
modality: audio
user_message: "Please transcribe this Luganda audio."
# Acholi speech -- same shape, different subset + prompt
- repo: Sunbird/salt
config: multispeaker-ach
split: train
eval_split: "dev[:200]"
modality: audio
user_message: "Please transcribe this Acholi audio."
# Swahili speech from a different repo
- repo: Sunbird/external-speech-data
config: common-voice-sample-packed-swa
split: train
modality: audio
user_message: "Please transcribe this Swahili audio."
sample: 5000 # optional: count (int) or fraction (float < 1)Required fields on each audio entry: repo, config, split,
modality: audio, user_message. Optional: eval_split, sample,
audio_column (default audio), text_column (default text).
SALT and most HF audio datasets use audio + text column names. If a
dataset uses something else (e.g. transcript, sentence), set
text_column explicitly on that entry.
Every script supports --dry-run — loads model, builds dataset, stops
before the optimizer step. Use this when adding a new entry to catch
column name / schema / chat-template issues without burning GPU time.
configs/ YAML configs per stage (pretrain, sft, rl, eval)
scripts/ Training and eval runners
analysis/ Post-hoc analysis and plotting
sunflower/ Reusable library code
data/ Data loading (pretraining, finetuning)
eval_translation.py Translation metrics
rl_rewards.py GRPO reward functions
utils.py Shared model loading, MLflow, merge utilities
Configs define what to train, scripts define how, sunflower/ holds shared logic.
- Pretraining: Plain text from
Sunbird/sunflower-pretrain-dataplus legacy sources. Assembled bysunflower/data/pretraining.py. - SFT (text): Translation pairs from
Sunbird/ug40-instructions. Combine multiple subsets in one YAML; seesunflower/data/finetuning.py. - SFT (speech): Audio-transcript pairs from
Sunbird/salt,Sunbird/external-speech-data, etc. Loaded bysunflower/data/speech.pywhenmodality: audiois set on ansft_datasetsentry. - GRPO: Translation RL data from
Sunbird/sunflower-posttrain-data.
| Model | Stage | BLEU | chrF | Config |
|---|---|---|---|---|
| Sunflower 14B + GRPO | CPT+SFT+GRPO | 14.77 | 41.4 | sunflower-14b-grpo.yml |
| Sunflower 32B | baseline | 14.29 | 39.6 | -- |
| TranslateGemma 12B + GRPO | CPT+SFT+GRPO | 12.48 | 38.5 | translategemma-grpo.yml |
| Sunflower 14B (Qwen) | baseline | 12.30 | 37.1 | -- |
| TranslateGemma 12B SFT | CPT+SFT | 12.71 | 36.3 | translategemma-12b.yml |