Skip to content

SunbirdAI/sunbird-tutor-modelling

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

209 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Sunflower

Multilingual LLM training pipeline for 40+ Ugandan languages. Adapts open-weight models (Gemma, Qwen) via continued pretraining, translation-focused SFT, and GRPO post-training.

Quick start

Setup (once per machine)

uv sync                          # installs everything pinned in pyproject.toml
apt-get install -y ffmpeg libnpp-12-8     # needed for torchcodec (speech data only)
hf auth login                    # Hugging Face token for gated models
export MLFLOW_TRACKING_USERNAME=... MLFLOW_TRACKING_PASSWORD=...

uv sync pulls torch + CUDA wheels from the PyTorch cu128 index (see [tool.uv.sources]), so you don't pip install unsloth separately.

Optional -- Flash Attention 2, significantly faster than the xformers fallback (used by TranslateGemma 12B and Qwen3 14B). Match CUDA + torch + Python from mjun0812/flash-attention-prebuild-wheels:

uv pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl

Gemma 4 has global_head_dim=512, which FA2/FA3 don't support. The Gemma 4 config pins attn_implementation: sdpa to route through cuDNN SDPA -- FA2 stays installed but is bypassed for that model only.

Train

Every command is <script> --config <yaml>. Add --dry-run to validate config + load model without training.

# Continued pretraining
uv run python scripts/pretrain.py --config configs/pretrain/translategemma-12b.yml

# SFT -- text-only (translation)
uv run python scripts/sft.py --config configs/sft/translategemma-12b.yml

# SFT -- multimodal (translation + speech, Gemma 4)
uv run python scripts/sft.py --config configs/sft/gemma4-e4b-speech.yml

# GRPO
uv run python scripts/rl.py --config configs/rl/translategemma-grpo.yml

# Multi-GPU (DDP)
torchrun --nproc_per_node=2 scripts/sft.py --config configs/sft/translategemma-12b.yml

The SFT script auto-detects modality: if any entry under sft_datasets in the YAML has modality: audio, it switches to the Unsloth Gemma-4 vision pipeline (multimodal processor, UnslothVisionDataCollator, per-modality eval losses). Otherwise it runs the legacy text pipeline with response-only masking. Existing TranslateGemma and Qwen3 configs are unchanged.

Eval

uv run python scripts/eval.py --config configs/eval/translation.yml
uv run python scripts/eval.py --config configs/eval/translation.yml --model-id Sunbird/other-model
uv run python scripts/analysis/analyze_evals.py outputs/eval/

Adding a new model

Copy the closest existing config in configs/<stage>/, change base_model and base_model_original, adjust LoRA rank if needed. If the arch is new to Unsloth (like Gemma 4), add its audio/vision lora_target_modules from the relevant Unsloth notebook -- see configs/sft/gemma4-e4b-speech.yml for the Gemma 4 template.

Adding speech datasets / new languages

Speech data lives in sft_datasets alongside text. The model does not auto-detect language -- you identify it in the user_message prompt per dataset entry. Train-time prompts become the inference-time prompt, so pick phrasing you'll actually use at deploy time.

Example multi-language SFT (append to sft_datasets in your config):

sft_datasets:
  # Text translation -- no modality flag needed
  - repo: Sunbird/ug40-instructions
    config: translations-fine-tuning
    split: train

  # Luganda speech
  - repo: Sunbird/salt
    config: multispeaker-lug
    split: train
    eval_split: "dev[:200]"
    modality: audio
    user_message: "Please transcribe this Luganda audio."

  # Acholi speech -- same shape, different subset + prompt
  - repo: Sunbird/salt
    config: multispeaker-ach
    split: train
    eval_split: "dev[:200]"
    modality: audio
    user_message: "Please transcribe this Acholi audio."

  # Swahili speech from a different repo
  - repo: Sunbird/external-speech-data
    config: common-voice-sample-packed-swa
    split: train
    modality: audio
    user_message: "Please transcribe this Swahili audio."
    sample: 5000   # optional: count (int) or fraction (float < 1)

Required fields on each audio entry: repo, config, split, modality: audio, user_message. Optional: eval_split, sample, audio_column (default audio), text_column (default text).

SALT and most HF audio datasets use audio + text column names. If a dataset uses something else (e.g. transcript, sentence), set text_column explicitly on that entry.

Validate before training

Every script supports --dry-run — loads model, builds dataset, stops before the optimizer step. Use this when adding a new entry to catch column name / schema / chat-template issues without burning GPU time.

Project structure

configs/            YAML configs per stage (pretrain, sft, rl, eval)
scripts/            Training and eval runners
  analysis/         Post-hoc analysis and plotting
sunflower/          Reusable library code
  data/             Data loading (pretraining, finetuning)
  eval_translation.py  Translation metrics
  rl_rewards.py     GRPO reward functions
  utils.py          Shared model loading, MLflow, merge utilities

Configs define what to train, scripts define how, sunflower/ holds shared logic.

Data

  • Pretraining: Plain text from Sunbird/sunflower-pretrain-data plus legacy sources. Assembled by sunflower/data/pretraining.py.
  • SFT (text): Translation pairs from Sunbird/ug40-instructions. Combine multiple subsets in one YAML; see sunflower/data/finetuning.py.
  • SFT (speech): Audio-transcript pairs from Sunbird/salt, Sunbird/external-speech-data, etc. Loaded by sunflower/data/speech.py when modality: audio is set on an sft_datasets entry.
  • GRPO: Translation RL data from Sunbird/sunflower-posttrain-data.

Results

Model Stage BLEU chrF Config
Sunflower 14B + GRPO CPT+SFT+GRPO 14.77 41.4 sunflower-14b-grpo.yml
Sunflower 32B baseline 14.29 39.6 --
TranslateGemma 12B + GRPO CPT+SFT+GRPO 12.48 38.5 translategemma-grpo.yml
Sunflower 14B (Qwen) baseline 12.30 37.1 --
TranslateGemma 12B SFT CPT+SFT 12.71 36.3 translategemma-12b.yml

Documentation

About

Gemma4 - Sunbird Tutor: a translanguage education assistant

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors