PMetal v0.5.0

github-actions released this 08 May 11:38
· 1 commit to main since this release

[0.5.0] - 2026-05-07

Added

Distributed inference & training

  • pmetal-distributed crate (Phases 1-4 + 7): Thunderbolt-fabric-aware multi-Mac cluster runtime with feature-gated tensor, expert, context, ZeRO, and pipeline parallelism modules
    • pmetal cluster CLI: per-node launch, ring/mesh topology discovery, fabric handshake
    • Pipeline harness with overlap of computation and Thunderbolt transfers
    • Canonical expert-rank mapping + per-architecture MoE/MLA tensor-parallel plans
    • Ring all-reduce / all-gather with corrected chunk indexing
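The chunk indexing behind a ring all-reduce is easy to get wrong, so here is a minimal single-process sketch of the textbook algorithm (reduce-scatter followed by all-gather). It simulates ranks as in-memory vectors; `ring_all_reduce` and its chunk layout are hypothetical and are not the pmetal-distributed API or its transport.

```rust
/// Simulated ring all-reduce (element-wise sum) across in-memory "ranks".
/// Illustrative only: `ring_all_reduce` is a hypothetical helper, not the pmetal-distributed API.
fn ring_all_reduce(bufs: &mut [Vec<f32>]) {
    let n = bufs.len();
    if n < 2 {
        return;
    }
    let len = bufs[0].len();
    assert!(bufs.iter().all(|b| b.len() == len), "all ranks need equal-length buffers");
    // Chunk c spans [start(c), start(c + 1)); this also handles len not divisible by n.
    let start = |c: usize| c * len / n;

    // Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) mod n to rank r + 1,
    // which adds it into its own copy of that chunk.
    for s in 0..n - 1 {
        // Snapshot all sends first so every rank "sends" before anyone "receives".
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..n)
            .map(|r| {
                let c = (r + n - s) % n;
                ((r + 1) % n, c, bufs[r][start(c)..start(c + 1)].to_vec())
            })
            .collect();
        for (dst, c, data) in sends {
            for (x, y) in bufs[dst][start(c)..start(c + 1)].iter_mut().zip(data) {
                *x += y;
            }
        }
    }

    // Rank r now holds the fully reduced chunk (r + 1) mod n.
    // Phase 2, all-gather: at step s, rank r forwards chunk (r + 1 - s) mod n to rank r + 1,
    // which overwrites its stale copy.
    for s in 0..n - 1 {
        let sends: Vec<(usize, usize, Vec<f32>)> = (0..n)
            .map(|r| {
                let c = (r + 1 + n - s) % n;
                ((r + 1) % n, c, bufs[r][start(c)..start(c + 1)].to_vec())
            })
            .collect();
        for (dst, c, data) in sends {
            bufs[dst][start(c)..start(c + 1)].copy_from_slice(&data);
        }
    }
}
```

The all-gather phase must start from chunk (r + 1) mod n, the chunk each rank finished reducing. Starting from a different offset forwards partially reduced data.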

TurboQuant KV cache (production-ready)

  • TurboQuant KV cache quantization: Provably near-optimal KV cache compression based on random rotation + Lloyd-Max scalar quantization + QJL residual for unbiased inner products (arXiv:2504.19874). Achieves 4.6-6.4x KV cache compression with near-zero quality loss. Available via --kv-turboquant or the presets --kv-turboquant-preset q3_5 (near-lossless, 4.6x compression) / q2_5 (6.4x compression)

    • Separate key/value runtimes with independent bit widths and outlier-aware mixed-precision
    • Direct attention path for single-token decode avoids full cache dequantization
    • Data-oblivious (no calibration data required) — quantizes KV entries online as generated
    • Precomputed codebooks via Lloyd-Max algorithm for Beta distribution (deterministic from seed)
    • Metal kernel backend with CPU fallback
    • Phase 0: split monolithic mod.rs (6101 → 222 LOC) into config/core/state/bits/math submodules
    • Phase 3: GPU-resident hot/cold pipeline + Mixed K/V storage; mixed_score as layout oracle
    • Phase A/B: QJL ablation harness (feature-gated) + per-row key_slot_scale codebook adaptation
    • Phase C/C′: Variant F drop-QJL opt-in path; d128/d256 no_qjl_2pass fast paths (4..=8 bits)
    • Phase D: TurboQuantPackMode config + Fullbyte dense-values kernel
    • Phase E: TurboQuantOutlierMode — encode-side top-K outlier storage, zero pre-quant + decode override, outlier-bias on d128/d256 fullbyte score kernel; CPU mirror in scalar encode/decode
    • Phase F: Hamming skip-list dispatch — skiplist_threshold config, GPU sign_hash buffer, Metal Hamming-distances kernel + FFI, GQA support
    • Mixed-precision attention parity baseline; defensive residual-norm clamp + NaN-safe encode
  • Asymmetric K/V head dimensions: KV cache, TurboQuant, and fused attention now support models where key and value projections have different widths (e.g. DeepSeek MLA with qk_head_dim != v_head_dim)

  • pmetal serve --kv-turboquant: TurboQuant KV cache in the serving engine with --kv-turboquant-preset q3_5 for near-lossless 4.6x KV compression in production
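As a rough illustration of the Lloyd-Max scalar quantization step named above, the sketch below fits a small codebook by alternating nearest-centroid assignment and centroid updates. It is a simplification under stated assumptions: it fits empirical samples rather than a Beta distribution, it omits the random rotation and QJL residual, and `lloyd_max` and `nearest` are hypothetical names, not PMetal functions.

```rust
/// Index of the codebook entry closest to `x` (ties go to the lower index).
fn nearest(codebook: &[f32], x: f32) -> usize {
    let mut best = 0;
    for (k, &c) in codebook.iter().enumerate() {
        if (x - c).abs() < (x - codebook[best]).abs() {
            best = k;
        }
    }
    best
}

/// One-dimensional Lloyd-Max on empirical samples (hypothetical helper, not PMetal's API).
/// Alternates nearest-centroid assignment with a mean update of each centroid.
fn lloyd_max(samples: &[f32], levels: usize, iters: usize) -> Vec<f32> {
    assert!(levels >= 1 && !samples.is_empty());
    let (lo, hi) = samples
        .iter()
        .fold((f32::INFINITY, f32::NEG_INFINITY), |(l, h), &x| (l.min(x), h.max(x)));
    // Start with centroids spread uniformly over the sample range.
    let mut codebook: Vec<f32> = (0..levels)
        .map(|i| lo + (hi - lo) * (i as f32 + 0.5) / levels as f32)
        .collect();
    for _ in 0..iters {
        let mut sum = vec![0.0f32; levels];
        let mut count = vec![0usize; levels];
        for &x in samples {
            let k = nearest(&codebook, x);
            sum[k] += x;
            count[k] += 1;
        }
        for k in 0..levels {
            // Empty cells keep their previous centroid.
            if count[k] > 0 {
                codebook[k] = sum[k] / count[k] as f32;
            }
        }
    }
    codebook
}
```

Quantizing a value then amounts to storing `nearest(&codebook, x)` in a few bits and decoding it as `codebook[index]`.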

Quantization & model formats

  • Optimized FP8 checkpoint loading: Hugging Face FP8 weight_scale_inv sidecars are dequantized or repacked into MLX mxfp8 weights for Qwen3-family native paths; mode-aware quantized matmul plumbing handles floating-point quantized weights without dense fallback
  • Expanded GGUF quantization/export: pmetal quantize now writes standard GGUF metadata from Hugging Face configs, tokenizer/pre-tokenizer metadata, HF-to-GGUF tensor names, stacked MoE expert tensors, and method-specific file types
  • Broader GGUF format coverage: quantization/dequantization support now includes K-quants, legacy Q4/Q5/Q8 variants, Q1_0, TQ1_0/TQ2_0, MXFP4, NVFP4, BF16, F16, and F32 round trips
  • MLX safetensors quantization path: quality-based bit allocation with --target-bpw, GPU-resident weight loading, and tokenizer/config sidecar copy for MLX-format quantized exports

Inference server (OpenAI- + Anthropic-compatible)

  • Continuous batching with paged-KV-style admission + shared prefix cache in pmetal-serve: per-request slot scheduling, KV-cache prefix sharing, concurrent decode for many simultaneous chats
    • Token-block admission budget (--cb-block-size, --cb-max-blocks) prevents over-admitting active contexts and skips head-of-line requests when a smaller queued request fits the remaining block budget
    • Continuous batching now reuses the shared prompt prefix cache, prefills only uncached suffix tokens, and saves extended prefixes after final prefill
    • Continuous batching derives the same cache mode as the single-request serving path, honoring --kv-quant and --kv-turboquant
    • Hybrid/recurrent models are rejected from continuous batching instead of silently running without recurrent state
  • Anthropic-compatible /v1/messages endpoint: streaming message_start → content_block_start → content_block_delta* → content_block_stop → message_delta → message_stop events; non-streaming JSON path
  • /v1/embeddings endpoint: 17 architectures supported via forward_hidden (Llama/Llama4/Qwen2/Qwen3/Qwen3MoE/Qwen3Next/Mistral/Gemma/Gemma4/Phi/Phi4/DeepSeek/Cohere/Granite/GptOss/NemotronH/BERT) — pooling via pmetal_models::pooling
  • Token logprobs: SamplingParams.logprobs_top_n plumbed end-to-end through non-streaming and SSE streaming on both /v1/chat/completions and /v1/completions. New pmetal_models::generation::token_logprobs primitive; ANE/CPU paths emit logprob: None
  • Best-effort tool calling on /v1/chat/completions: try_parse_tool_calls accepts {name, arguments} or {tool_calls: [...]}. ChatCompletionRequest.tools gates the attempt; chat templating threads tool defs into the rendered prompt
  • IncrementalDecoder<Aux> SSE buffer: shared UTF-8 boundary buffer + per-token aux pipelining (used for logprobs alignment) across chat/completions/anthropic streams
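The token-block admission budget described under continuous batching can be pictured as a first-fit scan over the queue. The sketch below is a minimal model of that idea, assuming each request reserves ceil(tokens / block_size) blocks up front; `Request`, `blocks_needed`, and `admit` are hypothetical names and do not reflect the pmetal-serve scheduler's real types.

```rust
use std::collections::VecDeque;

/// Hypothetical queued request; only its total token budget matters for admission.
struct Request {
    id: u32,
    tokens: usize,
}

/// Number of fixed-size KV blocks needed to hold `tokens` (ceiling division).
fn blocks_needed(tokens: usize, block_size: usize) -> usize {
    assert!(block_size > 0);
    (tokens + block_size - 1) / block_size
}

/// Admit every queued request that fits in the remaining block budget, in queue order.
/// A request that does not fit stays queued, and smaller requests behind it may still
/// be admitted, which avoids head-of-line blocking.
fn admit(queue: &mut VecDeque<Request>, block_size: usize, free_blocks: &mut usize) -> Vec<u32> {
    let mut admitted = Vec::new();
    let mut i = 0;
    while i < queue.len() {
        let need = blocks_needed(queue[i].tokens, block_size);
        if need <= *free_blocks {
            *free_blocks -= need;
            let req = queue.remove(i).expect("index is in bounds");
            admitted.push(req.id);
            // Do not advance i: the next request has shifted into this slot.
        } else {
            i += 1;
        }
    }
    admitted
}
```

With a block size of 16 and 4 free blocks, a 100-token request at the head of the queue (7 blocks) is skipped while a 20-token request behind it (2 blocks) is admitted.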

Job orchestration substrate (TUI / GUI / MCP / CLI parity)

  • JobSpec substrate: 16 canonical spec types in pmetal-core (Train, Distill, GRPO, Bench, Eval, Pretrain, Tokenize, Serve, Generate, RLKD, EmbedTrain, DFlash, Memory, Ollama, …) with #[derive(JobSpec)] proc-macro
  • JobEvent canonical streaming protocol: progress / metric / log / artefact / complete / failed events emitted by all 4 surfaces (CLI, TUI, GUI, MCP)
  • CLI: 8 specced Commands variants flattened — 613 LOC removed from main.rs; cli/<sub>.rs Args structs and JobSpec argv round-trip tests; --log-events flag stub
  • TUI: 14 tabs with full CLI parity, ?-key help overlay, Ctrl+1..9 tab jump, active-job footer badge, descriptor-driven forms with shared FormTabState primitive; channel-based metrics streaming (ChannelMetricsCallback) for direct-path train/distill/grpo/bench/eval/pretrain
  • GUI (Tauri): complete 9-DTO frontend-lockstep migration to *Spec types; Serve, Bench, Eval, Jobs, Pretrain pages; embed-train + rlkd + ollama routes; channel-based metrics streaming
  • MCP: 51-tool server with migrated train/pretrain/tokenize/memory/dflash/generate coverage, allowlisted CLI passthrough tools for newly added CLI flags, and a JobEvent JSONL consumer for managed background jobs

SOTA distillation (pmetal-distill)

  • Universal Logit Distillation (ULD) — Wasserstein-1 over sorted logit distributions for cross-tokenizer KD (Boizard et al. 2024); optional top_k truncation; permutation-invariant by design
  • Generalized Knowledge Distillation (GKD) — λ-weighted off-policy + on-policy KL blend (Agarwal et al. 2024); OnPolicySampler trait with GreedySampler reference impl; compute_full(t_off, s_off, t_on, s_on, T)
  • MiniLLM — reverse-KL with optional teacher-mix target = mix·T + (1-mix)·S (Gu et al. 2024)
  • Skewed JSD (DistiLLM-2): α·KL(T||M_α) + (1-α)·KL(S||M_α) with M_α = α·T + (1-α)·S, log-sum-exp computation; α=0.5 reduces to standard symmetric JSD (Ko et al. 2024)
  • Attention-transfer loss + weighted Metal path for hidden-state distillation
  • Offline teacher-logit caching: pmetal distill --offline-cache <path> precomputes teacher logits to disk; new Int8PerToken compressed-block variant replaces NaN-sentinel scheme with explicit per_token_meta field (legacy Int8 variant retained for read-back)
  • DistillLossOutput.metrics: HashMap<&'static str, f32>: lazily-evaluated teacher_entropy, student_entropy, kl_per_token, top1_agreement exposed to trainer JSONL/TUI streaming
  • TAID difficulty-aware observability: alpha_var surfaced for per-step monitoring
  • Configurable ignore_index: PyTorch-standard -100 default on TrainingConfig; safe label clamping before gather
  • Hidden-state shape assertions before matmul (clear error vs. silent broadcast bug)
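To make the Skewed JSD formula above concrete, here is a small sketch that evaluates α·KL(T||M_α) + (1-α)·KL(S||M_α) on explicit probability vectors. It is a simplification: the release notes describe a log-sum-exp computation on logits, whereas this version assumes already-normalized probabilities and 0 < α < 1. The function names are hypothetical.

```rust
/// KL(p || q) in nats for discrete distributions; terms with p_i = 0 contribute 0.
fn kl(p: &[f64], q: &[f64]) -> f64 {
    assert_eq!(p.len(), q.len());
    let mut total = 0.0;
    for (&pi, &qi) in p.iter().zip(q.iter()) {
        if pi > 0.0 {
            total += pi * (pi / qi).ln();
        }
    }
    total
}

/// Skewed JSD between teacher `t` and student `s` with mixture M = alpha*T + (1-alpha)*S.
/// Assumes both inputs are probability vectors and 0 < alpha < 1, so M_i > 0 wherever
/// T_i > 0 or S_i > 0.
fn skewed_jsd(t: &[f64], s: &[f64], alpha: f64) -> f64 {
    assert!(alpha > 0.0 && alpha < 1.0);
    let m: Vec<f64> = t
        .iter()
        .zip(s.iter())
        .map(|(&ti, &si)| alpha * ti + (1.0 - alpha) * si)
        .collect();
    alpha * kl(t, &m) + (1.0 - alpha) * kl(s, &m)
}
```

At α = 0.5 the two arguments can be swapped without changing the result, which is the symmetric JSD case mentioned above; the value is then bounded by ln 2.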

SOTA model merging (pmetal-merge)

  • Fisher merging (Matena & Raffel 2022): diagonal-Fisher-weighted average θ = Σ F_i⊙θ_i / (Σ F_i + ε); lazy-loaded Fisher safetensors; fallback_to_mean for tensors without Fisher entries
  • RegMean (Jin et al. 2023): closed-form linear-layer merge W = (Σ G_i)⁻¹ · (Σ G_i W_i) via hand-rolled Gauss-Jordan pseudo_inverse_2d with Tikhonov ridge; falls back to mean for non-2D weights
  • MoE expert permutation alignment: per-(model, layer) Hungarian solver (Jonker-Volgenant style, O(N³)) over L2-normalized cosine similarity of expert fingerprints; tensor-name remapping experts.{i}. → experts.{π(i)}. before merge; gated by align_moe_experts
  • Honor config.dtype in save path: MergeBuilder.dtype builder, TensorWriter::with_dtype plumbing, per-dtype byte packing for F16/BF16/F32; previously hardcoded to F16
  • Cross-model dtype consistency check: verify_source_dtypes errors on mismatch unless allow_mixed_dtype is set
  • Tied-embedding detection: lm_head.weight and embed_tokens.weight aliasing detected and merged once under canonical name
  • Tokenizer + config sidecar copy: tokenizer.json, tokenizer_config.json, special_tokens_map.json, config.json, generation_config.json copied on full-model merge; config.json.torch_dtype patched to match output dtype
  • Post-merge sanity sweep (SanityLevel::{Off,Quick,Full}, default Quick): NaN/inf detection aborts save; full mode reports per-tensor mean/std/abs_max/sparsity
  • MergeConfig.dry_run: short-circuits write phase, logs would-write summary
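The Fisher merging formula above, θ = Σ F_i⊙θ_i / (Σ F_i + ε), is an element-wise weighted average. A minimal sketch on flat `f32` vectors follows; `fisher_merge` is a hypothetical helper that ignores tensor loading, the `fallback_to_mean` path, and dtype handling.

```rust
/// Element-wise diagonal-Fisher-weighted average of several parameter vectors:
/// theta[j] = sum_i F_i[j] * theta_i[j] / (sum_i F_i[j] + eps).
/// Hypothetical helper for illustration; not the pmetal-merge API.
fn fisher_merge(thetas: &[Vec<f32>], fishers: &[Vec<f32>], eps: f32) -> Vec<f32> {
    assert!(!thetas.is_empty());
    assert_eq!(thetas.len(), fishers.len());
    let n = thetas[0].len();
    let mut num = vec![0.0f32; n];
    let mut den = vec![0.0f32; n];
    for (theta, fisher) in thetas.iter().zip(fishers.iter()) {
        assert!(theta.len() == n && fisher.len() == n);
        for j in 0..n {
            num[j] += fisher[j] * theta[j];
            den[j] += fisher[j];
        }
    }
    // eps keeps the division finite where every model has zero Fisher mass.
    num.iter().zip(den.iter()).map(|(&a, &b)| a / (b + eps)).collect()
}
```

Where one model has zero Fisher weight for a parameter, the merged value follows the other models, and where all weights are equal the result is the plain mean.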

LoRA / QLoRA — full text-architecture coverage

  • New LoRA adapters: Granite, Llama4, DeepSeek, NemotronH, MLlama, Cohere, Phi, Gemma4, GPT-OSS, Qwen3-MoE, Qwen3-Next
  • New QLoRA adapters: Granite, Llama4, DeepSeek, NemotronH, Cohere, Phi, Gemma4, GPT-OSS, Qwen3-MoE, Qwen3-Next
  • LoRA+ wired into run_compiled training path; Gemma4 QLoRA KV-cache path
  • DeepSeek merge_lora properly implemented; Phi4 dispatched to existing PhiLoraForCausalLM
  • Interface-parity gradient-checkpointing hooks across 7 adapters

Bridge & native paths

  • pmetal-bridge crate: Zero-allocation MLX C++ bridge replacing mlx-rs as the core runtime. Native inference at 201 tok/s (Qwen3.5 0.8B), 4-bit quantized inference (28 tok/s on 27B), compiled attention, KV cache trimming, and full training ops (autograd, optimizer, random, math, reduction, comparison) — all without mlx-rs overhead
  • Fused [T=1] decode kernels for gpt_oss and llama4 (Bridge Phase 4)
  • Fused [N,1] batched-decode path for Tier-1/2 architectures
  • Cheap native KV cache fork support for Qwen3, GPT-OSS, Llama4, DeepSeek, and generic KVCache, preserving dense, quantized, and TurboQuant cache state for serving prefix reuse
  • BRIDGE_TRY_{DST,VOID} error coverage: thread-local exception slot replaces process-abort across most ops; pmetal_bridge::check_last_error()? surfaces BridgeError::CxxException after any op; InlineArray::try_* variants for matmul/softmax/reshape/sdpa/gather_mm/dequantize/etc.
  • Scalar dtype footgun fix: InlineArray::scalar_like(value, peer) + mul_scalar/add_scalar/sub_scalar/div_scalar eliminate manual .as_dtype(model_dtype) calls
  • async_eval is now actually asynchronous: the prior implementation blocked the calling thread
  • Bridge file splits: bridge.h and bridge.cpp carved into cpp/bridge/ sub-headers + 6 source files; bridge_turboquant.cpp split by kernel family; inline_array.rs, qwen3_native.rs, deepseek_native.rs, llama4_native.rs, gpt_oss_native.rs split into submodule directories
  • forward_hidden for 17 architectures (embeddings + retrieval support)

Preference & RL trainers (pmetal-trainer)

  • PairedPreferenceTrainer<L> trait + DpoLoss kernel: DpoTrainer::train and OnlineDpoTrainer::train_step now delegate to the shared trainer; ReferenceStrategy::{StopGradient, Zero, Precomputed} covers the three reference-logp sources
  • Shared log-prob helpers fanned out to KTO/ORPO/GRPO/OnlineDPO via logprob_utils::{compute_log_probs, compute_log_probs_with_avg, shifted_selective_log_softmax}
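For orientation, the per-pair quantity a DPO loss kernel computes is the standard DPO objective, -ln σ(β·margin), where the margin compares policy and reference log-probabilities for the chosen and rejected responses. The sketch below shows only that scalar formula; it is not the `DpoLoss` kernel, and treating a "zero" reference as `ref_chosen = ref_rejected = 0.0` is an assumption about what such a strategy would mean.

```rust
/// Numerically stable softplus: ln(1 + e^x).
fn softplus(x: f64) -> f64 {
    if x > 0.0 {
        x + (-x).exp().ln_1p()
    } else {
        x.exp().ln_1p()
    }
}

/// Standard DPO loss for one preference pair, given summed sequence log-probabilities.
/// loss = -ln sigmoid(beta * margin) = softplus(-beta * margin), where
/// margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected).
/// Hypothetical free function for illustration only.
fn dpo_loss(pol_chosen: f64, pol_rejected: f64, ref_chosen: f64, ref_rejected: f64, beta: f64) -> f64 {
    let margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected);
    softplus(-beta * margin)
}
```

When the policy matches the reference the margin is zero and the loss is ln 2; the loss falls as the policy prefers the chosen response more strongly than the reference does.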

Model, training & data

  • Full-parameter pretraining: End-to-end pmetal pretrain pipeline for training models from scratch

    • Model factory supporting llama, qwen, gemma, mistral, phi, and gpt-oss architectures
    • Gradient accumulation, cosine/linear/constant LR scheduling, gradient clipping
    • Full model + optimizer checkpoint save/restore for resumable runs
    • Memory-mapped streaming shard reader (StreamingShardReader) with zero-copy I/O via memmap2
    • pmetal tokenize command for converting JSONL corpora to binary shards
    • Pretrain tab in TUI and GUI with real-time loss/throughput/ETA monitoring
  • Gradient checkpointing: checkpoint_apply() wraps forward functions via mlx::core::checkpoint() to recompute activations during backward, reducing peak training memory from O(layers) to O(1)

  • Optimizer checkpoint/resume: AdamW gains step_count(), set_lr(), and restore_state() for saving and restoring optimizer state across training runs

  • Gemma 4 architecture: Full Gemma 4 model support with sliding-window attention and per-layer KV head configuration

  • DFlash speculative decoding: Native Rust port of the dflash-mlx speculative decoding pipeline for accelerated generation

  • Jinja chat templates: Real upstream jinja rendering via minijinja with 16 parity-audited template types (ChatML, Llama2/3/4, Mistral, Gemma/Gemma4, Phi3/4, Qwen, DeepSeek, Cohere, Alpaca, Vicuna, Zephyr, GptOss)

  • Qwen3.5 MoE dispatch improvements: Expert prefetch reset per generation, configurable GDN chunk size, chunked prefill, and generation helpers

  • Adaptive sequence packing: compute_pack_seq_len() uses p99 of actual dataset sequence lengths instead of max_position_embeddings — up to 256x reduction in wasted compute for short-sequence datasets. --pack-max-seq-len for explicit override
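The adaptive packing idea above, choosing the pack length from the p99 of observed sequence lengths, can be sketched with a nearest-rank percentile. `pack_seq_len_p99` is a hypothetical stand-in for `compute_pack_seq_len()`; the nearest-rank definition and the cap at `max_position_embeddings` are assumptions, not confirmed behavior.

```rust
/// Nearest-rank 99th percentile of sequence lengths, capped by the model's position limit.
/// Hypothetical sketch; the real compute_pack_seq_len() may round or align differently.
fn pack_seq_len_p99(lengths: &[usize], max_position_embeddings: usize) -> usize {
    assert!(!lengths.is_empty());
    let mut sorted = lengths.to_vec();
    sorted.sort_unstable();
    // Nearest-rank percentile: rank = ceil(0.99 * n), using integer arithmetic.
    let rank = (sorted.len() * 99 + 99) / 100;
    sorted[rank - 1].min(max_position_embeddings)
}
```

For a made-up corpus of 99 sequences of 128 tokens plus one 32768-token outlier, this returns 128, whereas packing to a `max_position_embeddings` of 32768 would pad every pack to the outlier's length.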

Metal 4 / MPP backend

  • Metal 4 / MPP kernel backend (#14): Trait-based kernel dispatch with Metal3Backend and Metal4Backend for M5+ (Apple10/NAX) GPUs

    • KernelBackend trait with 16 methods covering GEMM, attention, fused linear, training, MoE, distillation
    • KernelDispatch router on MetalContext — selects Metal 4 for large GEMMs (M>1, K%32==0) on M5, Metal 3 for everything else
    • Metal4CommandBuffer with correct begin/end lifecycle, CommandAllocatorPool, ResidencyManager
    • Compile-time #[cfg(has_metal4)] gating + runtime has_nax check — zero overhead on M1-M4
  • 15 MPP-optimized Metal 4 shaders: All following Apple MPP best practices (single simdgroup execution, Morton-order threadgroup walk, K-dimension alignment to 32, accumulation-loop barriers at BK=128)

    • 8 existing shaders optimized: mpp_gemm, mpp_flash_attention, mpp_quantized, mpp_fused_swiglu, mpp_fused_norm_lora, mpp_dw_gemm, mpp_grouped_gemm, mpp_fused_lora
    • 5 new shaders: mpp_fused_training (AdamW), mpp_fused_cross_entropy, mpp_fused_rope, mpp_fused_moe, mpp_fused_distill
    • 2 additional: mpp_fused_mlp (gate+up+down combined), quantized MoE expert variants
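The routing rule stated for KernelDispatch (Metal 4 for GEMMs with M>1 and K%32==0 on NAX-capable hardware, Metal 3 otherwise) reduces to a single predicate. The sketch below restates that rule only; `Backend` and `select_gemm_backend` are hypothetical names, and the real router may apply further size thresholds.

```rust
/// Which kernel backend to use for a GEMM (hypothetical enum for illustration).
#[derive(Debug, PartialEq)]
enum Backend {
    Metal3,
    Metal4,
}

/// Restates the release-note rule: Metal 4 only when the GPU has NAX support,
/// the GEMM has more than one row (M > 1), and K is a multiple of 32.
fn select_gemm_backend(has_nax: bool, m: usize, k: usize) -> Backend {
    if has_nax && m > 1 && k % 32 == 0 {
        Backend::Metal4
    } else {
        Backend::Metal3
    }
}
```

Single-token decode (M = 1) therefore stays on the Metal 3 path even on NAX-capable hardware.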

Benchmarking & inference UX

  • Benchmark enhancements: workload presets, custom dataset/expert-dir controls, inference session repeats, train sample/step/batch/sequence controls, warmup passes, GDN prefill stage profiling, TurboQuant flag for bench commands, and fused gate/up expert packing with auto-detected tensor layout

  • --mode sampling presets: Per-model-family recommended sampling parameters (Qwen3/3.5 thinking/instruct modes). --mode auto selects based on --no-thinking flag

  • --detect-repetition: Opt-in n-gram repetition-loop detection (an 8-token pattern repeated 4 times) that force-stops runaway generation

  • Chip name in decode stats: Inference output now shows the Apple Silicon chip (e.g., [M4 Max])
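One plausible reading of the repetition check behind --detect-repetition is "the last 32 tokens are the same 8-token block four times in a row". The sketch below implements that reading; `is_repetition_loop` is a hypothetical function, and the actual detector may match patterns differently.

```rust
/// True when the last `pattern_len * repeats` tokens consist of one `pattern_len`-token
/// block repeated back to back. Hypothetical sketch of an n-gram loop detector.
fn is_repetition_loop(tokens: &[u32], pattern_len: usize, repeats: usize) -> bool {
    let window = pattern_len * repeats;
    if pattern_len == 0 || repeats < 2 || tokens.len() < window {
        return false;
    }
    let tail = &tokens[tokens.len() - window..];
    let pattern = &tail[..pattern_len];
    // Every pattern_len-sized chunk of the tail must equal the first one.
    tail.chunks(pattern_len).all(|chunk| chunk == pattern)
}
```

A generation loop would call this after each sampled token with `pattern_len = 8` and `repeats = 4`, and stop early when it returns true.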

Changed

  • mlx-rs removed: All crates migrated from mlx-rs to the pmetal-bridge compat API. Entire model, training, serving, and GUI stacks now use the zero-allocation C++ bridge
  • Thinking trace shown by default for thinking models; use --hide-thinking to suppress
  • Migrated scattered has_nax() checks to MetalContext::dispatch() for centralized backend routing
  • Split compat.rs (3620 lines) into 7-file compat/ module directory
  • Split bridge.cpp (6749 lines) into 6 C++ source files with shared bridge_internal.h
  • Array::id() replaces data_ptr() for weight change detection — safe with lazy evaluation
  • Persistent MLX build cache in CI and build.rs: skips redundant cmake compilation across CI runs
  • pmetal_hub::resolve_model_path adopted across core, cli, and gui for consistent local-cache → Hub-ID resolution
  • Distillation orchestration stubs removed: Distiller::run_online/run_offline/run_progressive deleted — orchestration now lives entirely in pmetal-trainer
  • MLX MoE routing audit: confirmed no argpartition(-scores, -k) anti-top-k regressions remain; documented as a permanent footgun
  • MergeMethod trait extended with merge_named(name, …) (default forwards to merge); Fisher and RegMean dispatch through name-aware path
  • MSRV: Raised workspace rust-version to 1.89 to match the updated turbomcp dependency

Fixed

  • Release GUI workflow: Installs the Tauri frontend with pnpm before tauri-action, matching the package manager that the action detects from pnpm-lock.yaml
  • MCP adaptive training controls: LR/checkpoint/stop commands now handle --output=... jobs and create the control directory before writing .lr_control.json
  • AdamW bias correction: step counter was advancing per-parameter instead of per-step
  • Cross-entropy loss masking: ignored labels are masked before gather, using a selective logsumexp - target_logit path that avoids materializing full log_softmax
  • Gradient clipping in compiled training path now uses _clipped step variants
  • FFI exception safety: ~33 C++ bridge functions wrapped in try/catch
  • LoRA inference segfault: put_along_axis crash during generation
  • UTF-8 char boundary panics in inference/GUI output stream handling
  • Distributed ring reduce: all-gather chunk indexing used wrong offset, corrupting gradient aggregation
  • Distributed transport/compression audit: namespace PSK handshakes, TCP fallback hardening, bounded compressed-gradient deserialization, and out-of-range sparse-index guards
  • Serving parameter validation: OpenAI-compatible routes validate sampling parameters before streaming and non-streaming generation dispatch
  • select_axis parameter order: standardized (data, index, axis) across all call sites
  • Lazy array segfaults: diffusion sampler now evals sigmas/timesteps before slice access
  • Sampling penalties: correctly wired through native bridge decode path
  • Qwen3-Next MoE routing: corrected anti-top-k expert selection bug (sign/slice pair)
  • Qwen3-Next hybrid cache flag in LoRA + distillation paths
  • TurboQuant d128 pass-2 cross-simdgroup reduction; lazy-transpose footgun in mixed-precision attention
  • TurboQuant serving prefix-cache compatibility: prefix cache now stores forked KV caches instead of dense snapshots, preserving compressed TurboQuant history without fp16 re-inflation
  • Native attention correctness/perf audit: fail-fast TurboQuant dispatch, checked SDPA wrappers, true hot-ring cache behavior, centralized quantized tuple growth, and unsupported-cache rejection in tree verification
  • GKD compute_weighted no longer scales by (1-λ) (silently zeroed training at λ=1.0)
  • pmetal_serve::sse::IncrementalDecoder prevents UTF-8 boundary panics on partial codepoint emission
  • Zero clippy warnings across entire workspace
  • Stale pmetal-mlx-sys metallib path in release workflow (renamed to pmetal-bridge)
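The UTF-8 boundary fixes listed above address a common streaming hazard: a token's bytes can end in the middle of a multi-byte codepoint. A minimal sketch of the buffering technique follows, using only the standard library; `Utf8Buffer` is a hypothetical stand-in and not the `IncrementalDecoder` type.

```rust
/// Buffers raw bytes until they form complete UTF-8 sequences.
/// Hypothetical stand-in for illustration; not pmetal_serve's IncrementalDecoder.
#[derive(Default)]
struct Utf8Buffer {
    pending: Vec<u8>,
}

impl Utf8Buffer {
    /// Append token bytes and return the longest decodable prefix, keeping an
    /// incomplete trailing codepoint buffered for the next call.
    fn push(&mut self, bytes: &[u8]) -> String {
        self.pending.extend_from_slice(bytes);
        let (valid_len, invalid) = match std::str::from_utf8(&self.pending) {
            Ok(s) => (s.len(), false),
            // error_len() is None when the input merely ends mid-codepoint;
            // Some(_) means the bytes can never become valid UTF-8.
            Err(e) => (e.valid_up_to(), e.error_len().is_some()),
        };
        if invalid {
            // Emit with replacement characters and reset so the stream cannot stall.
            let out = String::from_utf8_lossy(&self.pending).into_owned();
            self.pending.clear();
            return out;
        }
        // Keep the incomplete tail, hand back the validated prefix.
        let rest = self.pending.split_off(valid_len);
        let ready = std::mem::replace(&mut self.pending, rest);
        String::from_utf8(ready).expect("prefix was validated above")
    }
}
```

Slicing a `String` or `str` at an arbitrary byte offset panics when the offset is not a char boundary, which is why the buffer works on bytes and only converts a validated prefix.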

Removed

  • mlx-rs dependency: Fully replaced by pmetal-bridge — removes ~15K lines of Rust FFI bindings
  • 1065 lines of dead code: qwen3_train.rs, unused LoRA functions in qwen3_native.rs
  • 5 superseded LoRA training modules
  • Dropped model support for StarCoder2, FalconH1, RecurrentGemma, and Jamba
  • Distiller::run_* orchestration stubs (now lives in pmetal-trainer)

Downloads

Asset | Description
pmetal-*-aarch64-apple-darwin.tar.gz | CLI binary + mlx.metallib (Apple Silicon)
PMetal-*-aarch64-apple-darwin-*.dmg | Desktop GUI app (Apple Silicon)
mlx.metallib | MLX Metal shader library (standalone)

CLI Quick Start

tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI

Mount the DMG and drag PMetal to Applications.

Full Changelog: v0.4.0...v0.5.0