Sunflower

Multilingual LLM training pipeline for 40+ Ugandan languages. Adapts open-weight models (Gemma, Qwen) via continued pretraining, translation-focused SFT, and GRPO post-training.

Quick start

Setup (once per machine)

uv sync                          # installs everything pinned in pyproject.toml
apt-get install -y ffmpeg libnpp-12-8     # needed for torchcodec (speech data only)
hf auth login                    # Hugging Face token for gated models
export MLFLOW_TRACKING_USERNAME=... MLFLOW_TRACKING_PASSWORD=...

uv sync pulls torch + CUDA wheels from the PyTorch cu128 index (see [tool.uv.sources]), so you don't pip install unsloth separately.

Optional -- Flash Attention 2, significantly faster than the xformers fallback (used by TranslateGemma 12B and Qwen3 14B). Match CUDA + torch + Python from mjun0812/flash-attention-prebuild-wheels:

uv pip install https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.7.16/flash_attn-2.8.3+cu128torch2.10-cp311-cp311-linux_x86_64.whl

Gemma 4 has global_head_dim=512, which FA2/FA3 don't support. The Gemma 4 config pins attn_implementation: sdpa to route through cuDNN SDPA -- FA2 stays installed but is bypassed for that model only.

Train

Every command is <script> --config <yaml>. Add --dry-run to validate config + load model without training.

# Continued pretraining
uv run python scripts/pretrain.py --config configs/pretrain/translategemma-12b.yml

# SFT -- text-only (translation)
uv run python scripts/sft.py --config configs/sft/translategemma-12b.yml

# SFT -- multimodal (translation + speech, Gemma 4)
uv run python scripts/sft.py --config configs/sft/gemma4-e4b-speech.yml

# GRPO
uv run python scripts/rl.py --config configs/rl/translategemma-grpo.yml

# Multi-GPU (DDP)
torchrun --nproc_per_node=2 scripts/sft.py --config configs/sft/translategemma-12b.yml

The SFT script auto-detects modality: if any entry under sft_datasets in the YAML has modality: audio, it switches to the Unsloth Gemma-4 vision pipeline (multimodal processor, UnslothVisionDataCollator, per-modality eval losses). Otherwise it runs the legacy text pipeline with response-only masking. Existing TranslateGemma and Qwen3 configs are unchanged.

Eval

uv run python scripts/eval.py --config configs/eval/translation.yml
uv run python scripts/eval.py --config configs/eval/translation.yml --model-id Sunbird/other-model
uv run python scripts/analysis/analyze_evals.py outputs/eval/

Adding a new model

Copy the closest existing config in configs/<stage>/, change base_model and base_model_original, adjust LoRA rank if needed. If the arch is new to Unsloth (like Gemma 4), add its audio/vision lora_target_modules from the relevant Unsloth notebook -- see configs/sft/gemma4-e4b-speech.yml for the Gemma 4 template.

Adding speech datasets / new languages

Speech data lives in sft_datasets alongside text. The model does not auto-detect language -- you identify it in the user_message prompt per dataset entry. Train-time prompts become the inference-time prompt, so pick phrasing you'll actually use at deploy time.

Example multi-language SFT (append to sft_datasets in your config):

sft_datasets:
  # Text translation -- no modality flag needed
  - repo: Sunbird/ug40-instructions
    config: translations-fine-tuning
    split: train

  # Luganda speech
  - repo: Sunbird/salt
    config: multispeaker-lug
    split: train
    eval_split: "dev[:200]"
    modality: audio
    user_message: "Please transcribe this Luganda audio."

  # Acholi speech -- same shape, different subset + prompt
  - repo: Sunbird/salt
    config: multispeaker-ach
    split: train
    eval_split: "dev[:200]"
    modality: audio
    user_message: "Please transcribe this Acholi audio."

  # Swahili speech from a different repo
  - repo: Sunbird/external-speech-data
    config: common-voice-sample-packed-swa
    split: train
    modality: audio
    user_message: "Please transcribe this Swahili audio."
    sample: 5000   # optional: count (int) or fraction (float < 1)

Required fields on each audio entry: repo, config, split, modality: audio, user_message. Optional: eval_split, sample, audio_column (default audio), text_column (default text).

SALT and most HF audio datasets use audio + text column names. If a dataset uses something else (e.g. transcript, sentence), set text_column explicitly on that entry.

Validate before training

Every script supports --dry-run — loads model, builds dataset, stops before the optimizer step. Use this when adding a new entry to catch column name / schema / chat-template issues without burning GPU time.

Project structure

configs/            YAML configs per stage (pretrain, sft, rl, eval)
scripts/            Training and eval runners
  analysis/         Post-hoc analysis and plotting
sunflower/          Reusable library code
  data/             Data loading (pretraining, finetuning)
  eval_translation.py  Translation metrics
  rl_rewards.py     GRPO reward functions
  utils.py          Shared model loading, MLflow, merge utilities

Configs define what to train, scripts define how, sunflower/ holds shared logic.

Data

Pretraining: Plain text from Sunbird/sunflower-pretrain-data plus legacy sources. Assembled by sunflower/data/pretraining.py.
SFT (text): Translation pairs from Sunbird/ug40-instructions. Combine multiple subsets in one YAML; see sunflower/data/finetuning.py.
SFT (speech): Audio-transcript pairs from Sunbird/salt, Sunbird/external-speech-data, etc. Loaded by sunflower/data/speech.py when modality: audio is set on an sft_datasets entry.
GRPO: Translation RL data from Sunbird/sunflower-posttrain-data.

Results

Model	Stage	BLEU	chrF	Config
Sunflower 14B + GRPO	CPT+SFT+GRPO	14.77	41.4	`sunflower-14b-grpo.yml`
Sunflower 32B	baseline	14.29	39.6	--
TranslateGemma 12B + GRPO	CPT+SFT+GRPO	12.48	38.5	`translategemma-grpo.yml`
Sunflower 14B (Qwen)	baseline	12.30	37.1	--
TranslateGemma 12B SFT	CPT+SFT	12.71	36.3	`translategemma-12b.yml`

Name		Name	Last commit message	Last commit date
Latest commit History 209 Commits
configs		configs
scripts		scripts
sunflower		sunflower
tests		tests
tools		tools
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sunflower

Quick start

Setup (once per machine)

Train

Eval

Adding a new model

Adding speech datasets / new languages

Validate before training

Project structure

Data

Results

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Sunflower

Quick start

Setup (once per machine)

Train

Eval

Adding a new model

Adding speech datasets / new languages

Validate before training

Project structure

Data

Results

Documentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages