[0.5.0] - 2026-05-07
Added
Distributed inference & training
- `pmetal-distributed` crate (Phases 1-4 + 7): Thunderbolt-fabric-aware multi-Mac cluster runtime with feature-gated tensor, expert, context, ZeRO, and pipeline parallelism modules
- `pmetal cluster` CLI: per-node launch, ring/mesh topology discovery, fabric handshake
- Pipeline harness with overlap of computation and Thunderbolt transfers
- Canonical expert-rank mapping + per-architecture MoE/MLA tensor-parallel plans
- Ring all-reduce / all-gather with corrected chunk indexing
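
For readers unfamiliar with why chunk indexing matters in a ring all-reduce, the following is a minimal single-process simulation of the textbook algorithm (reduce-scatter followed by all-gather). It is not the `pmetal-distributed` implementation: the function names `ring_step` and `ring_all_reduce` are hypothetical, the "ranks" are plain vectors, and the sketch assumes the buffer length is divisible by the rank count.

```rust
/// One ring step: every rank r sends chunk `chunk_of(r)` to rank (r + 1) % n.
/// The receiver either accumulates (reduce-scatter) or overwrites (all-gather).
fn ring_step(
    bufs: &mut [Vec<f32>],
    chunk: usize,
    chunk_of: impl Fn(usize) -> usize,
    accumulate: bool,
) {
    let n = bufs.len();
    // Snapshot all outgoing chunks first so every rank sends its pre-step data.
    let sends: Vec<(usize, usize, Vec<f32>)> = (0..n)
        .map(|r| {
            let c = chunk_of(r);
            ((r + 1) % n, c, bufs[r][c * chunk..(c + 1) * chunk].to_vec())
        })
        .collect();
    for (dst, c, data) in sends {
        let slot = &mut bufs[dst][c * chunk..(c + 1) * chunk];
        if accumulate {
            for (x, y) in slot.iter_mut().zip(data.iter()) {
                *x += *y;
            }
        } else {
            slot.copy_from_slice(&data);
        }
    }
}

/// Simulated ring all-reduce (sum) over `bufs.len()` ranks.
fn ring_all_reduce(bufs: &mut [Vec<f32>]) {
    let n = bufs.len();
    if n < 2 {
        return;
    }
    let len = bufs[0].len();
    assert!(bufs.iter().all(|b| b.len() == len), "buffers must match in length");
    assert!(len % n == 0, "sketch assumes length divisible by rank count");
    let chunk = len / n;
    // Reduce-scatter: at step s, rank r sends chunk (r - s) mod n.
    // Afterwards rank r holds the fully reduced chunk (r + 1) mod n.
    for s in 0..n - 1 {
        ring_step(bufs, chunk, |r| (r + n - s) % n, true);
    }
    // All-gather: at step s, rank r forwards chunk (r + 1 - s) mod n.
    // An off-by-one here makes ranks overwrite reduced chunks with partial sums.
    for s in 0..n - 1 {
        ring_step(bufs, chunk, |r| (r + 1 + n - s) % n, false);
    }
}
```

After the call, every rank's buffer holds the element-wise sum of all inputs. The all-gather offset must start from the chunk each rank finished reducing, which is the kind of indexing detail the "corrected chunk indexing" entry refers to.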
TurboQuant KV cache (production-ready)
- TurboQuant KV cache quantization: Provably near-optimal KV cache compression based on random rotation + Lloyd-Max scalar quantization + QJL residual for unbiased inner products (arXiv:2504.19874). Achieves 4-6x KV cache compression with near-zero quality loss. Available via `--kv-turboquant` or presets `--kv-turboquant-preset q3_5` (near-lossless) / `q2_5` (6.4x compression)
  - Separate key/value runtimes with independent bit widths and outlier-aware mixed-precision
  - Direct attention path for single-token decode avoids full cache dequantization
  - Data-oblivious (no calibration data required): quantizes KV entries online as generated
  - Precomputed codebooks via Lloyd-Max algorithm for Beta distribution (deterministic from seed)
  - Metal kernel backend with CPU fallback
  - Phase 0: split monolithic `mod.rs` (6101 → 222 LOC) into config/core/state/bits/math submodules
  - Phase 3: GPU-resident hot/cold pipeline + Mixed K/V storage; `mixed_score` as layout oracle
  - Phase A/B: QJL ablation harness (feature-gated) + per-row `key_slot_scale` codebook adaptation
  - Phase C/C′: Variant F drop-QJL opt-in path; d128/d256 `no_qjl_2pass` fast paths (4..=8 bits)
  - Phase D: `TurboQuantPackMode` config + Fullbyte dense-values kernel
  - Phase E: `TurboQuantOutlierMode`: encode-side top-K outlier storage, zero pre-quant + decode override, outlier-bias on d128/d256 fullbyte score kernel; CPU mirror in scalar encode/decode
  - Phase F: Hamming skip-list dispatch: `skiplist_threshold` config, GPU `sign_hash` buffer, Metal Hamming-distances kernel + FFI, GQA support
  - Mixed-precision attention parity baseline; defensive residual-norm clamp + NaN-safe encode
- Asymmetric K/V head dimensions: KV cache, TurboQuant, and fused attention now support models where key and value projections have different widths (e.g. DeepSeek MLA with `qk_head_dim != v_head_dim`)
- `pmetal serve --kv-turboquant`: TurboQuant KV cache in the serving engine with `--kv-turboquant-preset q3_5` for near-lossless 4.6x KV compression in production
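
As a sanity check on the quoted compression figures, here is a small back-of-the-envelope sketch. It assumes (this is not stated in the notes) that the preset names `q3_5` and `q2_5` denote roughly 3.5 and 2.5 average bits per cached value, that the dense baseline is 16-bit, and that per-block metadata is negligible. The function names and the model shape in the usage note are hypothetical.

```rust
/// Approximate KV cache size in bytes for a model with symmetric K/V head dims.
/// Assumption: metadata (scales, codebooks, outlier slots) is ignored.
fn kv_cache_bytes(
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
    tokens: u64,
    bits_per_value: f64,
) -> f64 {
    // Factor 2 covers keys and values.
    let values = 2 * layers * kv_heads * head_dim * tokens;
    values as f64 * bits_per_value / 8.0
}

/// Ideal compression ratio of a quantized cache versus a dense baseline.
fn ideal_kv_compression(baseline_bits: f64, quant_bits: f64) -> f64 {
    baseline_bits / quant_bits
}
```

Under these assumptions, `ideal_kv_compression(16.0, 2.5)` gives 6.4 and `ideal_kv_compression(16.0, 3.5)` gives about 4.57, which is consistent with the 6.4x and 4.6x figures quoted above. For a hypothetical 32-layer model with 8 KV heads of dimension 128, `kv_cache_bytes` reports 4 GiB at 16 bits for a 32768-token context.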
Quantization & model formats
- Optimized FP8 checkpoint loading: Hugging Face FP8 `weight_scale_inv` sidecars are dequantized or repacked into MLX `mxfp8` weights for Qwen3-family native paths; mode-aware quantized matmul plumbing handles floating-point quantized weights without dense fallback
- Expanded GGUF quantization/export: `pmetal quantize` now writes standard GGUF metadata from Hugging Face configs, tokenizer/pre-tokenizer metadata, HF-to-GGUF tensor names, stacked MoE expert tensors, and method-specific file types
- Broader GGUF format coverage: quantization/dequantization support now includes K-quants, legacy Q4/Q5/Q8 variants, Q1_0, TQ1_0/TQ2_0, MXFP4, NVFP4, BF16, F16, and F32 round trips
- MLX safetensors quantization path: quality-based bit allocation with `--target-bpw`, GPU-resident weight loading, and tokenizer/config sidecar copy for MLX-format quantized exports
Inference server (OpenAI- + Anthropic-compatible)
- Continuous batching with paged-KV-style admission + shared prefix cache in `pmetal-serve`: per-request slot scheduling, KV-cache prefix sharing, concurrent decode for many simultaneous chats
  - Token-block admission budget (`--cb-block-size`, `--cb-max-blocks`) prevents over-admitting active contexts and skips head-of-line requests when a smaller queued request fits the remaining block budget
  - Continuous batching now reuses the shared prompt prefix cache, prefills only uncached suffix tokens, and saves extended prefixes after final prefill
  - Continuous batching derives the same cache mode as the single-request serving path, honoring `--kv-quant` and `--kv-turboquant`
  - Hybrid/recurrent models are rejected from continuous batching instead of silently running without recurrent state
- Anthropic-compatible `/v1/messages` endpoint: streaming `message_start` → `content_block_start` → `content_block_delta`* → `content_block_stop` → `message_delta` → `message_stop` events; non-streaming JSON path
- `/v1/embeddings` endpoint: 17 architectures supported via `forward_hidden` (Llama/Llama4/Qwen2/Qwen3/Qwen3MoE/Qwen3Next/Mistral/Gemma/Gemma4/Phi/Phi4/DeepSeek/Cohere/Granite/GptOss/NemotronH/BERT); pooling via `pmetal_models::pooling`
- Token logprobs: `SamplingParams.logprobs_top_n` plumbed end-to-end through non-streaming and SSE streaming on both `/v1/chat/completions` and `/v1/completions`. New `pmetal_models::generation::token_logprobs` primitive; ANE/CPU paths emit `logprob: None`
- Best-effort tool calling on `/v1/chat/completions`: `try_parse_tool_calls` accepts `{name, arguments}` or `{tool_calls: [...]}`. `ChatCompletionRequest.tools` gates the attempt; chat templating threads tool defs into the rendered prompt
- `IncrementalDecoder<Aux>` SSE buffer: shared UTF-8 boundary buffer + per-token aux pipelining (used for logprobs alignment) across chat/completions/anthropic streams
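
The token-block admission budget described above can be illustrated with a short sketch. This is a guess at the policy from its description, not the `pmetal-serve` scheduler: the function names, the `(id, token_count)` request tuple, and the choice to scan the whole queue in one pass are all assumptions.

```rust
use std::collections::VecDeque;

/// Number of fixed-size blocks needed to hold `tokens` tokens (ceiling division).
fn blocks_needed(tokens: usize, block_size: usize) -> usize {
    assert!(block_size > 0, "block size must be positive");
    (tokens + block_size - 1) / block_size
}

/// Admits queued `(request_id, token_count)` entries while they fit in the
/// remaining block budget. A request that does not fit stays queued in its
/// original order, and smaller requests behind it may still be admitted.
fn admit(
    queue: &mut VecDeque<(u32, usize)>,
    block_size: usize,
    max_blocks: usize,
    used_blocks: &mut usize,
) -> Vec<u32> {
    let mut admitted = Vec::new();
    let mut waiting = VecDeque::new();
    while let Some((id, tokens)) = queue.pop_front() {
        let need = blocks_needed(tokens, block_size);
        if *used_blocks + need <= max_blocks {
            *used_blocks += need;
            admitted.push(id);
        } else {
            // Head-of-line request is skipped, not dropped.
            waiting.push_back((id, tokens));
        }
    }
    *queue = waiting;
    admitted
}
```

With a block size of 16, a budget of 4 blocks, and 1 block already in use, a queue of requests needing 100, 20, and 16 tokens admits the second and third and leaves the first waiting. A production scheduler would also need a fairness rule so that large requests are not starved, which this sketch omits.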
Job orchestration substrate (TUI / GUI / MCP / CLI parity)
- `JobSpec` substrate: 16 canonical spec types in `pmetal-core` (Train, Distill, GRPO, Bench, Eval, Pretrain, Tokenize, Serve, Generate, RLKD, EmbedTrain, DFlash, Memory, Ollama, …) with `#[derive(JobSpec)]` proc-macro
- `JobEvent` canonical streaming protocol: progress / metric / log / artefact / complete / failed events emitted by all 4 surfaces (CLI, TUI, GUI, MCP)
- CLI: 8 specced `Commands` variants flattened (613 LOC removed from `main.rs`); `cli/<sub>.rs` Args structs and JobSpec argv round-trip tests; `--log-events` flag stub
- TUI: 14 tabs with full CLI parity, `?`-key help overlay, `Ctrl+1..9` tab jump, active-job footer badge, descriptor-driven forms with shared `FormTabState` primitive; channel-based metrics streaming (`ChannelMetricsCallback`) for direct-path train/distill/grpo/bench/eval/pretrain
- GUI (Tauri): complete 9-DTO frontend-lockstep migration to `*Spec` types; Serve, Bench, Eval, Jobs, Pretrain pages; embed-train + rlkd + ollama routes; channel-based metrics streaming
- MCP: 51-tool server with migrated train/pretrain/tokenize/memory/dflash/generate coverage, allowlisted CLI passthrough tools for newly added CLI flags, and a JobEvent JSONL consumer for managed background jobs
SOTA distillation (pmetal-distill)
- Universal Logit Distillation (ULD): Wasserstein-1 over sorted logit distributions for cross-tokenizer KD (Boizard et al. 2024); optional `top_k` truncation; permutation-invariant by design
- Generalized Knowledge Distillation (GKD): λ-weighted off-policy + on-policy KL blend (Agarwal et al. 2024); `OnPolicySampler` trait with `GreedySampler` reference impl; `compute_full(t_off, s_off, t_on, s_on, T)`
- MiniLLM: reverse-KL with optional teacher-mix `target = mix·T + (1-mix)·S` (Gu et al. 2024)
- Skewed JSD (DistiLLM-2): `α·KL(T||M_α) + (1-α)·KL(S||M_α)` with `M_α = α·T + (1-α)·S`, log-sum-exp computation; α=0.5 reduces to standard symmetric JSD (Ko et al. 2024)
- Attention-transfer loss + weighted Metal path for hidden-state distillation
- Offline teacher-logit caching: `pmetal distill --offline-cache <path>` precomputes teacher logits to disk; new `Int8PerToken` compressed-block variant replaces NaN-sentinel scheme with explicit `per_token_meta` field (legacy `Int8` variant retained for read-back)
- `DistillLossOutput.metrics: HashMap<&'static str, f32>`: lazily-evaluated `teacher_entropy`, `student_entropy`, `kl_per_token`, `top1_agreement` exposed to trainer JSONL/TUI streaming
- TAID difficulty-aware observability: `alpha_var` surfaced for per-step monitoring
- Configurable `ignore_index`: PyTorch-standard `-100` default on `TrainingConfig`; safe label clamping before gather
- Hidden-state shape assertions before matmul (clear error vs. silent broadcast bug)
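
To make the skewed JSD formula above concrete, here is a minimal sketch for a single token position. It works in probability space after a max-shifted softmax rather than reproducing the log-sum-exp formulation mentioned in the entry, and the function names are hypothetical; treat it as an illustration of the formula, not the `pmetal-distill` kernel.

```rust
/// Numerically stable softmax over one row of logits.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// KL(p || q) for discrete distributions; terms with p_i == 0 contribute 0.
fn kl(p: &[f64], q: &[f64]) -> f64 {
    let mut total = 0.0;
    for (&pi, &qi) in p.iter().zip(q.iter()) {
        if pi > 0.0 {
            total += pi * (pi / qi).ln();
        }
    }
    total
}

/// Skewed JSD: alpha * KL(T || M) + (1 - alpha) * KL(S || M),
/// where M = alpha * T + (1 - alpha) * S.
fn skewed_jsd(teacher_logits: &[f64], student_logits: &[f64], alpha: f64) -> f64 {
    assert_eq!(teacher_logits.len(), student_logits.len());
    let t = softmax(teacher_logits);
    let s = softmax(student_logits);
    let m: Vec<f64> = t
        .iter()
        .zip(s.iter())
        .map(|(&ti, &si)| alpha * ti + (1.0 - alpha) * si)
        .collect();
    alpha * kl(&t, &m) + (1.0 - alpha) * kl(&s, &m)
}
```

With `alpha = 0.5` the result is symmetric in teacher and student and bounded above by ln 2, matching the standard Jensen-Shannon divergence. Identical logits give a divergence of zero for any `alpha`.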
SOTA model merging (pmetal-merge)
- Fisher merging (Matena & Raffel 2022): diagonal-Fisher-weighted average `θ = Σ F_i⊙θ_i / (Σ F_i + ε)`; lazy-loaded Fisher safetensors; `fallback_to_mean` for tensors without Fisher entries
- RegMean (Jin et al. 2023): closed-form linear-layer merge `W = (Σ G_i)⁻¹ · (Σ G_i W_i)` via hand-rolled Gauss-Jordan `pseudo_inverse_2d` with Tikhonov ridge; falls back to mean for non-2D weights
- MoE expert permutation alignment: per-(model, layer) Hungarian solver (Jonker-Volgenant style, O(N³)) over L2-normalized cosine similarity of expert fingerprints; tensor-name remapping `experts.{i}.` → `experts.{π(i)}.` before merge; gated by `align_moe_experts`
- Honor `config.dtype` in save path: `MergeBuilder.dtype` builder, `TensorWriter::with_dtype` plumbing, per-dtype byte packing for F16/BF16/F32; previously hardcoded to F16
- Cross-model dtype consistency check: `verify_source_dtypes` errors on mismatch unless `allow_mixed_dtype` is set
- Tied-embedding detection: `lm_head.weight` and `embed_tokens.weight` aliasing detected and merged once under canonical name
- Tokenizer + config sidecar copy: `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `config.json`, `generation_config.json` copied on full-model merge; `config.json.torch_dtype` patched to match output dtype
- Post-merge sanity sweep (`SanityLevel::{Off,Quick,Full}`, default `Quick`): NaN/inf detection aborts save; full mode reports per-tensor `mean/std/abs_max/sparsity`
- `MergeConfig.dry_run`: short-circuits write phase, logs would-write summary
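
The Fisher-weighted average quoted above, `θ = Σ F_i⊙θ_i / (Σ F_i + ε)`, can be sketched for a single flattened tensor as follows. The function name and the `Vec<f32>` representation are illustrative assumptions; the actual `pmetal-merge` code operates on safetensors and is not shown in these notes.

```rust
/// Element-wise diagonal-Fisher-weighted average of several parameter vectors.
/// `thetas[i]` and `fishers[i]` belong to model i and must share one length.
fn fisher_merge(thetas: &[Vec<f32>], fishers: &[Vec<f32>], eps: f32) -> Vec<f32> {
    assert!(!thetas.is_empty(), "need at least one model");
    assert_eq!(thetas.len(), fishers.len(), "one Fisher vector per model");
    let n = thetas[0].len();
    let mut numerator = vec![0.0f32; n];
    let mut denominator = vec![0.0f32; n];
    for (theta, fisher) in thetas.iter().zip(fishers.iter()) {
        assert_eq!(theta.len(), n);
        assert_eq!(fisher.len(), n);
        for j in 0..n {
            numerator[j] += fisher[j] * theta[j];
            denominator[j] += fisher[j];
        }
    }
    // eps keeps the division finite where every model has zero Fisher mass.
    numerator
        .iter()
        .zip(denominator.iter())
        .map(|(&num, &den)| num / (den + eps))
        .collect()
}
```

For two models with parameters `[1, 4]` and `[3, 0]` and Fisher weights `[1, 1]` and `[3, 1]`, the merged tensor is `[2.5, 2.0]` when `eps` is zero: the first coordinate leans toward the second model because its Fisher weight there is three times larger.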
LoRA / QLoRA — full text-architecture coverage
- New LoRA adapters: Granite, Llama4, DeepSeek, NemotronH, MLlama, Cohere, Phi, Gemma4, GPT-OSS, Qwen3-MoE, Qwen3-Next
- New QLoRA adapters: Granite, Llama4, DeepSeek, NemotronH, Cohere, Phi, Gemma4, GPT-OSS, Qwen3-MoE, Qwen3-Next
- LoRA+ wired into `run_compiled` training path; Gemma4 QLoRA KV-cache path
- DeepSeek `merge_lora` properly implemented; Phi4 dispatched to existing `PhiLoraForCausalLM`
- Interface-parity gradient-checkpointing hooks across 7 adapters
Bridge & native paths
- `pmetal-bridge` crate: Zero-allocation MLX C++ bridge replacing mlx-rs as the core runtime. Native inference at 201 tok/s (Qwen3.5 0.8B), 4-bit quantized inference (28 tok/s on 27B), compiled attention, KV cache trimming, and full training ops (autograd, optimizer, random, math, reduction, comparison), all without mlx-rs overhead
- Fused [T=1] decode kernels for `gpt_oss` and `llama4` (Bridge Phase 4)
- Fused [N,1] batched-decode path for Tier-1/2 architectures
- Cheap native KV cache fork support for Qwen3, GPT-OSS, Llama4, DeepSeek, and generic `KVCache`, preserving dense, quantized, and TurboQuant cache state for serving prefix reuse
- `BRIDGE_TRY_{DST,VOID}` error coverage: thread-local exception slot replaces process-abort across most ops; `pmetal_bridge::check_last_error()?` surfaces `BridgeError::CxxException` after any op; `InlineArray::try_*` variants for matmul/softmax/reshape/sdpa/gather_mm/dequantize/etc.
- Scalar dtype footgun fix: `InlineArray::scalar_like(value, peer)` + `mul_scalar`/`add_scalar`/`sub_scalar`/`div_scalar` eliminate manual `.as_dtype(model_dtype)` calls
- `async_eval` actually async: prior implementation blocked the calling thread
- Bridge file splits: `bridge.h` and `bridge.cpp` carved into `cpp/bridge/` sub-headers + 6 source files; `bridge_turboquant.cpp` split by kernel family; `inline_array.rs`, `qwen3_native.rs`, `deepseek_native.rs`, `llama4_native.rs`, `gpt_oss_native.rs` split into submodule directories
- `forward_hidden` for 17 architectures (embeddings + retrieval support)
Preference & RL trainers (pmetal-trainer)
- `PairedPreferenceTrainer<L>` trait + `DpoLoss` kernel: `DpoTrainer::train` and `OnlineDpoTrainer::train_step` now delegate to the shared trainer; `ReferenceStrategy::{StopGradient, Zero, Precomputed}` covers the three reference-logp sources
- Shared log-prob helpers fanned out to KTO/ORPO/GRPO/OnlineDPO via `logprob_utils::{compute_log_probs, compute_log_probs_with_avg, shifted_selective_log_softmax}`
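
For context on what a paired-preference loss computes, the following sketches the standard DPO objective for one (chosen, rejected) pair of sequence log-probabilities. It is not the `DpoLoss` kernel: the function names and the `beta` parameter are assumptions here, and how each `ReferenceStrategy` variant supplies the reference log-probabilities is not described in these notes.

```rust
/// Numerically stable log(sigmoid(x)).
fn log_sigmoid(x: f64) -> f64 {
    if x >= 0.0 {
        -(-x).exp().ln_1p()
    } else {
        x - x.exp().ln_1p()
    }
}

/// Standard DPO loss for a single preference pair:
/// -log sigmoid(beta * ((pi_c - pi_r) - (ref_c - ref_r))).
fn dpo_loss(
    policy_chosen_logp: f64,
    policy_rejected_logp: f64,
    ref_chosen_logp: f64,
    ref_rejected_logp: f64,
    beta: f64,
) -> f64 {
    let policy_margin = policy_chosen_logp - policy_rejected_logp;
    let reference_margin = ref_chosen_logp - ref_rejected_logp;
    -log_sigmoid(beta * (policy_margin - reference_margin))
}
```

When the policy and the reference agree on the margin, the loss is ln 2; it falls as the policy prefers the chosen response more strongly than the reference does. Passing zeros for both reference terms yields a reference-free variant of the same formula.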
Model, training & data
- Full-parameter pretraining: End-to-end `pmetal pretrain` pipeline for training models from scratch
  - Model factory supporting llama, qwen, gemma, mistral, phi, and gpt-oss architectures
  - Gradient accumulation, cosine/linear/constant LR scheduling, gradient clipping
  - Full model + optimizer checkpoint save/restore for resumable runs
  - Memory-mapped streaming shard reader (`StreamingShardReader`) with zero-copy I/O via memmap2
  - `pmetal tokenize` command for converting JSONL corpora to binary shards
  - Pretrain tab in TUI and GUI with real-time loss/throughput/ETA monitoring
- Gradient checkpointing: `checkpoint_apply()` wraps forward functions via `mlx::core::checkpoint()` to recompute activations during backward, reducing peak training memory from O(layers) to O(1)
- Optimizer checkpoint/resume: `AdamW` gains `step_count()`, `set_lr()`, and `restore_state()` for saving and restoring optimizer state across training runs
- Gemma 4 architecture: Full Gemma 4 model support with sliding-window attention and per-layer KV head configuration
- DFlash speculative decoding: Native Rust port of the dflash-mlx speculative decoding pipeline for accelerated generation
- Jinja chat templates: Real upstream jinja rendering via minijinja with 16 parity-audited template types (ChatML, Llama2/3/4, Mistral, Gemma/Gemma4, Phi3/4, Qwen, DeepSeek, Cohere, Alpaca, Vicuna, Zephyr, GptOss)
- Qwen3.5 MoE dispatch improvements: Expert prefetch reset per generation, configurable GDN chunk size, chunked prefill, and generation helpers
- Adaptive sequence packing: `compute_pack_seq_len()` uses p99 of actual dataset sequence lengths instead of `max_position_embeddings`, giving up to 256x reduction in wasted compute for short-sequence datasets. `--pack-max-seq-len` for explicit override
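
The adaptive packing idea can be sketched as a percentile computation. The function below is a hypothetical stand-in for `compute_pack_seq_len()`, whose real signature and rounding behaviour are not given in these notes; it assumes a nearest-rank p99 and caps the result at the model's position limit.

```rust
/// Nearest-rank 99th percentile of observed sequence lengths,
/// capped at `max_position_embeddings`. Falls back to the cap when empty.
fn p99_pack_len(lengths: &[usize], max_position_embeddings: usize) -> usize {
    if lengths.is_empty() {
        return max_position_embeddings;
    }
    let mut sorted = lengths.to_vec();
    sorted.sort_unstable();
    // Nearest-rank: ceil(0.99 * n), 1-indexed, computed in integers.
    let rank = (sorted.len() * 99 + 99) / 100;
    let p99 = sorted[rank - 1];
    p99.min(max_position_embeddings)
}
```

Using p99 rather than the maximum means a single very long outlier does not inflate the pack length for the whole dataset. As an illustration of the scale involved, a dataset whose p99 is 128 tokens on a model with a 32768-token position limit would pack at 1/256 of the limit, which is one combination that yields a 256x figure.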
Metal 4 / MPP backend
- Metal 4 / MPP kernel backend (#14): Trait-based kernel dispatch with `Metal3Backend` and `Metal4Backend` for M5+ (Apple10/NAX) GPUs
  - `KernelBackend` trait with 16 methods covering GEMM, attention, fused linear, training, MoE, distillation
  - `KernelDispatch` router on `MetalContext`: selects Metal 4 for large GEMMs (M>1, K%32==0) on M5, Metal 3 for everything else
  - `Metal4CommandBuffer` with correct begin/end lifecycle, `CommandAllocatorPool`, `ResidencyManager`
  - Compile-time `#[cfg(has_metal4)]` gating + runtime `has_nax` check; zero overhead on M1-M4
- 15 MPP-optimized Metal 4 shaders: All following Apple MPP best practices (single simdgroup execution, Morton-order threadgroup walk, K-dimension alignment to 32, accumulation-loop barriers at BK=128)
  - 8 existing shaders optimized: `mpp_gemm`, `mpp_flash_attention`, `mpp_quantized`, `mpp_fused_swiglu`, `mpp_fused_norm_lora`, `mpp_dw_gemm`, `mpp_grouped_gemm`, `mpp_fused_lora`
  - 5 new shaders: `mpp_fused_training` (AdamW), `mpp_fused_cross_entropy`, `mpp_fused_rope`, `mpp_fused_moe`, `mpp_fused_distill`
  - 2 additional: `mpp_fused_mlp` (gate+up+down combined), quantized MoE expert variants
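
The GEMM routing rule stated for `KernelDispatch` (Metal 4 when M>1 and K%32==0 on NAX-capable hardware, Metal 3 otherwise) reduces to a small predicate. The enum and function below are hypothetical names for illustration and do not reflect the real trait-based dispatch types.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Metal3,
    Metal4,
}

/// Picks a GEMM backend from the rule quoted in the notes.
/// `has_nax` stands in for the runtime capability check.
fn select_gemm_backend(has_nax: bool, m: usize, k: usize) -> Backend {
    if has_nax && m > 1 && k % 32 == 0 {
        Backend::Metal4
    } else {
        Backend::Metal3
    }
}
```

Single-row products (M = 1, the typical decode case) and K dimensions that are not multiples of 32 stay on the Metal 3 path even on capable hardware, and older chips never take the Metal 4 branch.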
Benchmarking & inference UX
- Benchmark enhancements: workload presets, custom dataset/expert-dir controls, inference session repeats, train sample/step/batch/sequence controls, warmup passes, GDN prefill stage profiling, TurboQuant flag for bench commands, and fused gate/up expert packing with auto-detected tensor layout
- `--mode` sampling presets: Per-model-family recommended sampling parameters (Qwen3/3.5 thinking/instruct modes). `--mode auto` selects based on `--no-thinking` flag
- `--detect-repetition`: Opt-in n-gram repetition loop detection (8-token pattern x 4 repeats), force-stops infinite loops
- Chip name in decode stats: Inference output now shows the Apple Silicon chip (e.g., `[M4 Max]`)
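
One simple way to implement the "8-token pattern x 4 repeats" check is to compare the tail of the generated sequence against its own last pattern. This is a sketch under that reading of the description; the function name is hypothetical and the real `--detect-repetition` logic may differ, for example by scanning other pattern lengths.

```rust
/// Returns true when the last `pattern_len * repeats` tokens consist of one
/// `pattern_len`-token pattern repeated `repeats` times back to back.
fn has_repetition_loop(tokens: &[u32], pattern_len: usize, repeats: usize) -> bool {
    let window = pattern_len * repeats;
    if pattern_len == 0 || repeats < 2 || tokens.len() < window {
        return false;
    }
    let tail = &tokens[tokens.len() - window..];
    let pattern = &tail[..pattern_len];
    // Every consecutive chunk of the tail must equal the first chunk.
    tail.chunks(pattern_len).all(|chunk| chunk == pattern)
}
```

A generation loop would call `has_repetition_loop(&generated, 8, 4)` after each new token and stop when it returns true. Because only the trailing 32 tokens are inspected, the check costs constant time per step.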
Changed
- mlx-rs removed: All crates migrated from mlx-rs to the `pmetal-bridge` compat API. Entire model, training, serving, and GUI stacks now use the zero-allocation C++ bridge
- Thinking trace shown by default for thinking models; use `--hide-thinking` to suppress
- Migrated scattered `has_nax()` checks to `MetalContext::dispatch()` for centralized backend routing
- Split `compat.rs` (3620 lines) into 7-file `compat/` module directory
- Split `bridge.cpp` (6749 lines) into 6 C++ source files with shared `bridge_internal.h`
- `Array::id()` replaces `data_ptr()` for weight change detection (safe with lazy evaluation)
- Persistent MLX build cache in CI and `build.rs`: skips redundant cmake compilation across CI runs
- `pmetal_hub::resolve_model_path` adopted across `core`, `cli`, and `gui` for consistent local-cache → Hub-ID resolution
- Distillation orchestration stubs removed: `Distiller::run_online/run_offline/run_progressive` deleted; orchestration now lives entirely in `pmetal-trainer`
- MLX MoE routing audit: confirmed no `argpartition(-scores, -k)` anti-top-k regressions remain; documented as a permanent footgun
- `MergeMethod` trait extended with `merge_named(name, …)` (default forwards to `merge`); Fisher and RegMean dispatch through name-aware path
- MSRV: Raised workspace `rust-version` to 1.89 to match the updated `turbomcp` dependency
Fixed
- Release GUI workflow: Installs the Tauri frontend with pnpm before `tauri-action`, matching the package manager that the action detects from `pnpm-lock.yaml`
- MCP adaptive training controls: LR/checkpoint/stop commands now handle `--output=...` jobs and create the control directory before writing `.lr_control.json`
- AdamW bias correction: step counter was advancing per-parameter instead of per-step
- Cross-entropy loss masking: ignored labels are masked before gather, using a selective `logsumexp - target_logit` path that avoids materializing full `log_softmax`
- Gradient clipping in compiled training path now uses `_clipped` step variants
- FFI exception safety: ~33 C++ bridge functions wrapped in try/catch
- LoRA inference segfault: put_along_axis crash during generation
- UTF-8 char boundary panics in inference/GUI output stream handling
- Distributed ring reduce: all-gather chunk indexing used wrong offset, corrupting gradient aggregation
- Distributed transport/compression audit: namespace PSK handshakes, TCP fallback hardening, bounded compressed-gradient deserialization, and out-of-range sparse-index guards
- Serving parameter validation: OpenAI-compatible routes validate sampling parameters before streaming and non-streaming generation dispatch
- select_axis parameter order: standardized (data, index, axis) across all call sites
- Lazy array segfaults: diffusion sampler now evals sigmas/timesteps before slice access
- Sampling penalties: correctly wired through native bridge decode path
- Qwen3-Next MoE routing: corrected anti-top-k expert selection bug (sign/slice pair)
- Qwen3-Next hybrid cache flag in LoRA + distillation paths
- TurboQuant d128 pass-2 cross-simdgroup reduction; lazy-transpose footgun in mixed-precision attention
- TurboQuant serving prefix-cache compatibility: prefix cache now stores forked KV caches instead of dense snapshots, preserving compressed TurboQuant history without fp16 re-inflation
- Native attention correctness/perf audit: fail-fast TurboQuant dispatch, checked SDPA wrappers, true hot-ring cache behavior, centralized quantized tuple growth, and unsupported-cache rejection in tree verification
- GKD `compute_weighted` no longer scales by `(1-λ)` (silently zeroed training at λ=1.0)
- `pmetal_serve::sse::IncrementalDecoder` prevents UTF-8 boundary panics on partial codepoint emission
- Zero clippy warnings across entire workspace
- Stale `pmetal-mlx-sys` metallib path in release workflow (renamed to `pmetal-bridge`)
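
The cross-entropy fix listed above combines two ideas: skip ignored labels before they are ever used as an index, and compute each token's loss as `logsumexp(logits) - logits[target]` instead of building a full log-softmax. The sketch below shows both on plain vectors; the function names and the `Vec<Vec<f32>>` layout are illustrative assumptions, not the trainer's tensor code.

```rust
/// Numerically stable log-sum-exp over one row of logits.
fn logsumexp(row: &[f32]) -> f32 {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    max + row.iter().map(|&x| (x - max).exp()).sum::<f32>().ln()
}

/// Mean cross-entropy over rows whose label is not `ignore_index`.
/// Returns 0.0 when every label is ignored.
fn masked_cross_entropy(logits: &[Vec<f32>], labels: &[i64], ignore_index: i64) -> f32 {
    assert_eq!(logits.len(), labels.len());
    let mut total = 0.0f32;
    let mut count = 0usize;
    for (row, &label) in logits.iter().zip(labels.iter()) {
        // Mask first: an ignored label such as -100 must never index the row.
        if label == ignore_index {
            continue;
        }
        let target_logit = row[label as usize];
        total += logsumexp(row) - target_logit;
        count += 1;
    }
    if count == 0 {
        0.0
    } else {
        total / count as f32
    }
}
```

For logits `[[0, 0], [5, 1]]` with labels `[1, -100]`, only the first row contributes and the loss is ln 2. Only one logit per row is gathered, so no per-class log-probability vector is materialized.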
Removed
- mlx-rs dependency: Fully replaced by `pmetal-bridge`; removes ~15K lines of Rust FFI bindings
- 1065 lines of dead code: `qwen3_train.rs`, unused LoRA functions in `qwen3_native.rs`
- 5 superseded LoRA training modules
- Dropped model support for StarCoder2, FalconH1, RecurrentGemma, and Jamba
- `Distiller::run_*` orchestration stubs (now lives in `pmetal-trainer`)
Downloads
| Asset | Description |
|---|---|
| `pmetal-*-aarch64-apple-darwin.tar.gz` | CLI binary + mlx.metallib (Apple Silicon) |
| `PMetal-*-aarch64-apple-darwin-*.dmg` | Desktop GUI app (Apple Silicon) |
| `mlx.metallib` | MLX Metal shader library (standalone) |
CLI Quick Start
tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output
GUI
Mount the DMG and drag PMetal to Applications.
Full Changelog: v0.4.0...v0.5.0