Local-first post-training toolkit for Qwen2.5 7B on Apple Silicon.
This project is building a private workflow to fine-tune
Qwen2.5-7B-Instruct on a Mac mini M4 with 16 GB unified memory using
MLX-friendly adapter training. The target path is:
- Interview the user to define a custom SFT dataset.
- Generate or collect about 1000 high-quality training examples.
- Validate and normalize data into canonical SFT/DPO JSONL.
- Run SFT locally with LoRA/QLoRA adapters.
- Run DPO locally when memory allows.
- Run the fine-tuned adapter from CLI.
- Serve the same local model over HTTP later.
The default design keeps private data, training, and inference on the local machine. Cloud LLMs or hosted training can be added later only as explicit opt-in paths.
This repository is in the planning/bootstrap stage.
Implemented documentation artifacts:
- ROADMAP.md: full local training roadmap, 16 GB constraints, synthetic data pipeline, SFT/DPO formats, CLI/HTTP goals, and 64 GB scale plan.
- skills/sft-dataset-interviewer/SKILL.md: repo-local skill for interviewing the user and producing a concrete SFT dataset brief before data generation.
- docs/references/sglang-post-training.md: review of SGLang's post-training infrastructure ideas and which ones fit this local-first project.
Implementation status:
- Beta local SFT loop is complete.
- DPO has an explicit readiness check, but local DPO training is gated because the current MLX-LM install does not expose a DPO training command.
Implemented code paths:
- Real local Qwen CLI inference through qwenpt chat.
- SFT dataset brief generation through qwenpt data brief.
- Deterministic mock SFT data generation and split validation through qwenpt data generate and qwenpt data validate-splits.
- Conservative MLX-LM LoRA SFT command construction, split preparation, and per-run metadata through qwenpt train sft.
Use a repo-local virtual environment for real model inference. The project expects Python 3.11+; Python 3.12 is the tested local setup.
cd /Users/ricktu/qwen-post-training
curl -LsSf https://astral.sh/uv/install.sh | sh
~/.local/bin/uv python install 3.12
~/.local/bin/uv venv --python 3.12 .venv
~/.local/bin/uv pip install -r requirements.txt
Check the environment and run a real local prompt:
.venv/bin/qwenpt doctor
.venv/bin/qwenpt chat --max-tokens 64 "Say hi in five words."
The first real run downloads mlx-community/Qwen2.5-7B-Instruct-4bit into the
local Hugging Face cache. On the Mac mini M4 16 GB smoke test, a short prompt
completed successfully with about 4.4 GB peak memory reported by MLX-LM.
Useful test commands:
make doctor
make chat-dry-run PROMPT="hello"
.venv/bin/qwenpt chat --dry-run "hello"
.venv/bin/qwenpt chat --backend mock "hello"
If Python 3.12 is already installed, standard venv also works:
python3.12 -m venv .venv
.venv/bin/python -m pip install -r requirements.txt
Custom data is first-class. The project should support both user-provided data and LLM-generated synthetic data.
The SFT data workflow starts with the dataset interviewer skill:
- Talk with the user for a few focused rounds.
- Capture the desired model behavior, audience, task mix, style, boundaries, and evaluation prompts.
- Produce a dataset brief.
- Generate a first smoke batch of about 100 examples.
- Review, score, deduplicate, and validate.
- Scale toward about 1000 examples.
- Export accepted records to canonical SFT chat JSONL.
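The deduplication step in the workflow above could be sketched as follows. This is a minimal illustration, not the project's implementation: the helper names (normalize, record_key, deduplicate) are hypothetical, and it assumes that two records count as duplicates when their user turns match after lowercasing and whitespace collapsing.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def record_key(record: dict) -> str:
    """Hash the normalized user turns of one chat-format SFT record."""
    user_turns = [
        normalize(message["content"])
        for message in record["messages"]
        if message["role"] == "user"
    ]
    return hashlib.sha256("\n".join(user_turns).encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each distinct user prompt (hypothetical rule)."""
    seen: set[str] = set()
    kept: list[dict] = []
    for record in records:
        key = record_key(record)
        if key not in seen:
            seen.add(key)
            kept.append(record)
    return kept
```

Exact-match hashing only catches near-identical prompts. A stricter review pass might add fuzzy or embedding-based similarity, which this sketch does not attempt.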
Create a dataset brief and generate deterministic mock SFT data:
.venv/bin/qwenpt data brief \
--slug local-qwen-helper \
--domain "local Qwen post-training" \
--task "answer project questions" \
--task "draft dataset examples" \
--eval-prompt "Help me create an SFT dataset."
.venv/bin/qwenpt data generate \
--brief dataset_briefs/local-qwen-helper.md \
--provider mock \
--count 1000
.venv/bin/qwenpt data validate-splits data/processed/sft/local-qwen-helper
Use --provider mlx to generate through the local Qwen teacher model instead
of the deterministic mock provider. Start with --smoke or a small --count
before attempting 1000 examples.
Canonical SFT record:
{"messages":[{"role":"system","content":"..."},{"role":"user","content":"..."},{"role":"assistant","content":"..."}]}
Canonical DPO record:
{"prompt":"...","chosen":"...","rejected":"..."}
Check local DPO readiness:
.venv/bin/qwenpt train dpo
This writes metadata under runs/dpo/<run_id> and reports the current DPO
backend status. Today it is expected to report unavailable for MLX-LM DPO
training on this setup.
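As an illustration of the canonical SFT and DPO record shapes shown earlier in this section, here is a minimal validation sketch. It is not the logic behind qwenpt data validate-splits: the function names are hypothetical, and the rules that the last SFT message must come from the assistant and that chosen and rejected must differ are assumptions rather than documented requirements.

```python
import json

SFT_ROLES = ("system", "user", "assistant")
DPO_FIELDS = ("prompt", "chosen", "rejected")


def sft_errors(record: dict) -> list[str]:
    """Return problems found in one SFT chat record (empty list means OK)."""
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    errors = []
    for i, message in enumerate(messages):
        if not isinstance(message, dict):
            errors.append(f"messages[{i}] must be an object")
            continue
        if message.get("role") not in SFT_ROLES:
            errors.append(f"messages[{i}] has an unknown role")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            errors.append(f"messages[{i}] needs non-empty string content")
    last = messages[-1]
    # Assumed rule: a training example should end with the assistant reply.
    if isinstance(last, dict) and last.get("role") != "assistant":
        errors.append("last message must come from the assistant")
    return errors


def dpo_errors(record: dict) -> list[str]:
    """Return problems found in one DPO preference record."""
    errors = []
    for field in DPO_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"'{field}' must be a non-empty string")
    # Assumed rule: identical completions carry no preference signal.
    if not errors and record["chosen"] == record["rejected"]:
        errors.append("'chosen' and 'rejected' must differ")
    return errors


def validate_line(line: str, kind: str) -> list[str]:
    """Parse one JSONL line and validate it as 'sft' or 'dpo'."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    return sft_errors(record) if kind == "sft" else dpo_errors(record)
```

Calling validate_line on each line of a JSONL file and collecting the non-empty results gives a quick per-line error report before any training run.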
After generating and validating an SFT dataset, run a conservative MLX-LM LoRA smoke training job:
.venv/bin/qwenpt train sft \
--dataset data/processed/sft/local-qwen-helper \
--smoke
To inspect the exact mlx_lm.lora invocation and generated run metadata
without starting training:
.venv/bin/qwenpt train sft \
--dataset data/processed/sft/local-qwen-helper \
--smoke \
--dry-run
Each run writes MLX-ready split files, metadata, training.log, and
metrics.json under runs/sft/<run_id>, and saves adapters under
adapters/sft/<run_id>.
The default non-smoke profile is local_16gb, which keeps batch size, context,
rank, and layer count small but trains longer than the smoke path:
make sft-local PYTHON=.venv/bin/python DATASET_SLUG=local-qwen-helper
Use --profile smoke or --profile local_16gb to choose a profile explicitly.
Use --iters for one-off iteration overrides without editing
configs/sft.yaml.
Run recorded before/after evals against fixed prompt files:
.venv/bin/qwenpt eval \
--prompts data/eval/local-qwen-helper.jsonl \
--run-id baseline-local-qwen-helper
.venv/bin/qwenpt eval \
--prompts data/eval/local-qwen-helper.jsonl \
--adapter adapters/sft/sft-local16-20260511T105810Z \
--run-id sft-local16-local-qwen-helper
Eval runs write metadata.json and results.jsonl under runs/eval/<run_id>.
Compare two recorded eval runs:
.venv/bin/qwenpt eval compare \
--baseline runs/eval/eval-baseline-local-qwen-helper-20260511T105810Z/results.jsonl \
--candidate runs/eval/eval-sft-local16-local-qwen-helper-20260511T105810Z/results.jsonl \
--output runs/eval/compare-local-qwen-helper.json
Serve the same local generation path behind an OpenAI-compatible chat endpoint:
.venv/bin/qwenpt serve \
--host 127.0.0.1 \
--port 8080 \
--adapter adapters/sft/sft-local16-20260511T105810Z
Then call it from another terminal:
curl http://127.0.0.1:8080/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"mlx-community/Qwen2.5-7B-Instruct-4bit","messages":[{"role":"user","content":"Help me create an SFT dataset."}],"max_tokens":128}'
The server also exposes GET /health and GET /v1/models. Use
--backend mock for endpoint smoke tests without loading MLX.
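The same request can be made from Python with only the standard library. The sketch below mirrors the curl call above. The function names are hypothetical, and the response parsing assumes the server returns the usual OpenAI chat-completions layout (choices[0].message.content), which this README does not spell out.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"
MODEL = "mlx-community/Qwen2.5-7B-Instruct-4bit"


def build_chat_request(
    prompt: str, max_tokens: int = 128, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, max_tokens: int = 128, timeout: float = 120.0) -> str:
    """Send the request to a running qwenpt serve instance and return the reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response shape.
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Help me create an SFT dataset.") should return the generated text. Starting the server with --backend mock is a quick way to exercise this client without loading MLX.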
On the current 16 GB Mac mini M4, the project should use:
- 4-bit MLX Qwen2.5 7B instruct base model.
- LoRA/QLoRA adapter training only.
- Small batch size, short context, low LoRA rank, and conservative DPO tests.
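A rough back-of-the-envelope calculation shows why the 4-bit base model fits this budget. The sketch below rests on stated assumptions, not measurements: about 7.6 billion parameters for Qwen2.5-7B, and group-wise quantization with 64 weights per group plus a 16-bit scale and 16-bit bias per group. It counts weights only and ignores the KV cache, activations, and adapter or optimizer state.

```python
def quantized_weight_gb(
    n_params: float,
    bits: int = 4,
    group_size: int = 64,
    scale_bits: int = 16,
    bias_bits: int = 16,
) -> float:
    """Estimate weight storage in GB (1e9 bytes) for group-wise quantization."""
    # Each group of weights shares one scale and one bias value.
    bits_per_weight = bits + (scale_bits + bias_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7.6e9  # assumption: approximate Qwen2.5-7B parameter count

four_bit = quantized_weight_gb(N_PARAMS)
half_precision = quantized_weight_gb(N_PARAMS, bits=16, scale_bits=0, bias_bits=0)
print(f"4-bit weights:  ~{four_bit:.1f} GB")
print(f"16-bit weights: ~{half_precision:.1f} GB")
```

Under these assumptions the 4-bit weights come to a little over 4 GB, the same range as the roughly 4.4 GB peak reported for the short-prompt smoke test, while 16-bit weights alone would take most of a 16 GB machine.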
On a future 64 GB M4/M5-class machine, the project should scale by increasing:
- Context length.
- LoRA rank.
- Trainable layers.
- Validation and preference datasets.
- DPO experiment depth.
- Optional less-aggressive quantization.
Full-weight 7B fine-tuning is not the default local target. The near-term goal is reliable local adapter training that can grow with larger Apple Silicon machines.
- Base model: mlx-community/Qwen2.5-7B-Instruct-4bit.
- Upstream reference: Qwen/Qwen2.5-7B-Instruct.
- Training approach: MLX LoRA/QLoRA SFT, then DPO if stable.
- Inference: local CLI first.
- Serving: localhost HTTP API after CLI inference works.
See ROADMAP.md for the detailed build plan.