Release PMetal v0.4.0 · Epistates/pmetal

[0.4.0] - 2026-03-23

Added

pmetal-mcp crate: Full MCP (Model Context Protocol) server exposing 45 tools for Claude Desktop and other MCP clients. Covers all pmetal functionality — training, inference, distillation, GRPO, RLKD, quantization, model merging, dataset operations, evaluation, benchmarking, model search, and Ollama export
- Device & models: device_info, search_models, download_model, list_local_models, model_fit, model_info
- Inference: generate (blocking), chat (via running serve instance), start_serve, benchmark, bench_train, bench_gen, bench_corpus
- Training: train, distill, grpo, rlkd, embed_train — all as background jobs with full parameter coverage matching the CLI
- Runtime training control: job_set_lr, job_reduce_lr, job_reset_lr, job_save_checkpoint, job_graceful_stop — LLM-driven adaptive training via the control file protocol
- Job management: list_jobs, job_status, job_logs, stop_job
- Dataset ops: dataset_analyze, dataset_preview, dataset_validate, dataset_download, dataset_convert, dataset_filter, dataset_split, dataset_merge, dataset_sample, dataset_template, dataset_prepare
- Quantization & conversion: quantize, fuse_lora, merge_models, pack_experts, ollama_create, ollama_modelfile
- Evaluation: eval_perplexity
- All tools include rich #[description] annotations for parameter documentation in the MCP schema
- Standalone binary (pmetal-mcp) for Claude Desktop + pmetal mcp subcommand (behind mcp feature flag)
- Uses turbomcp v3.0.7 from crates.io
Runtime training control protocol: Extended the control file protocol (.lr_control.json) with SaveCheckpoint and GracefulStop commands. The adaptive LR controller now polls the control file before checking its enabled flag, so external agents (MCP, TUI) can always send commands regardless of whether automatic detection is active
--no-adaptive-lr flag: Disables automatic spike/plateau/divergence detection while keeping the control file protocol active. Enables fully LLM-driven learning rate control — the agent observes loss via job_status and manually adjusts LR via job_set_lr/job_reduce_lr
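The ordering described in the two entries above (external commands first, automatic detection second) can be sketched as follows. This is only an illustration of the control flow: `ControlCommand::SetLr`, `AdaptiveLr`, `StepAction`, `on_step`, and the halving on a loss spike are hypothetical names and values, not the pmetal API, and the on-disk schema of `.lr_control.json` is deliberately left out because the release notes do not specify it.

```rust
// Hypothetical command set. SaveCheckpoint and GracefulStop are named in the
// release notes; SetLr stands in for the existing LR commands.
#[derive(Debug, Clone, PartialEq)]
enum ControlCommand {
    SetLr(f64),
    SaveCheckpoint,
    GracefulStop,
}

#[derive(Debug, PartialEq)]
enum StepAction {
    Continue,
    SaveCheckpoint,
    Stop,
}

// Hypothetical controller state: `enabled` mirrors automatic detection
// (turned off by --no-adaptive-lr).
struct AdaptiveLr {
    enabled: bool,
    lr: f64,
}

impl AdaptiveLr {
    // `pending` is whatever command was read from the control file this step.
    fn on_step(&mut self, pending: Option<ControlCommand>, loss_spiked: bool) -> StepAction {
        // 1. External commands are handled before the enabled flag is checked,
        //    so they work even when automatic detection is off.
        if let Some(cmd) = pending {
            match cmd {
                ControlCommand::SetLr(lr) => self.lr = lr,
                ControlCommand::SaveCheckpoint => return StepAction::SaveCheckpoint,
                ControlCommand::GracefulStop => return StepAction::Stop,
            }
        }
        // 2. Automatic detection runs only when enabled. Halving the LR is an
        //    arbitrary placeholder reaction for this sketch.
        if self.enabled && loss_spiked {
            self.lr *= 0.5;
        }
        StepAction::Continue
    }
}
```

With `enabled: false`, a loss spike leaves the learning rate untouched, while a `SetLr` or `GracefulStop` command still takes effect.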
UltraFusion execution planner (pmetal-distributed): Per-die stage planner for M-series Ultra Macs with in-memory channel transport backend for same-process links, avoiding TCP overhead on UltraFusion interconnect
MPP FlashAttention for head_dim 64/96: Metal 4 MPP flash attention kernel now supports head_dim 64, 96, and 128 with stride-2/stride-3 SIMD lane packing and causal/non-causal variants
Tuna persistent disk cache: The auto-tuner now persists benchmark results to disk, avoiding re-tuning on restart. Expanded search covers FlashAttention, FusedCrossEntropy, FusedNormLora, and FusedSwiGLU via function constants
MoE GPU top-k selection: Expert top-k selection moved from CPU sort to GPU argpartition_axis, eliminating a sync point in the MoE forward path
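To illustrate why a partition is cheaper than a full sort for expert routing, here is a CPU-side sketch of argpartition-style top-k selection. It is not the GPU code path from this release: `top_k_indices` is a hypothetical helper, and it uses the standard library's `select_nth_unstable_by`, which only guarantees that the k largest scores end up in the first k slots, in no particular order.

```rust
/// Returns the indices of the `k` largest scores, in unspecified order.
/// Partial selection runs in O(n) on average, versus O(n log n) for a sort.
fn top_k_indices(scores: &[f32], k: usize) -> Vec<usize> {
    let k = k.min(scores.len());
    if k == 0 {
        return Vec::new();
    }
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Descending comparator: after this call, idx[..k] holds the indices of
    // the k largest scores (unordered among themselves).
    idx.select_nth_unstable_by(k - 1, |&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx
}
```

Callers that need the selected experts in rank order can sort just the k returned indices, which is cheap because k is small relative to the expert count.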
bench-workload CLI command: Benchmark a real cached workload for inference and short LoRA training with named presets (--preset dense-qwen3, --preset hybrid-qwen3next)
KV cache quantization auto-select: --kv-quant is now optional — omitting it auto-selects the fastest quantization mode that fits the device memory budget
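The selection rule in the entry above ("fastest mode that fits the memory budget") can be sketched as a filter followed by a maximum. Everything here is an assumption made for illustration: the `KvMode` struct, its fields, and the `auto_select` function are hypothetical, and the release notes do not say how pmetal estimates cache size or speed.

```rust
// Hypothetical description of one candidate KV cache quantization mode.
#[derive(Debug, Clone, PartialEq)]
struct KvMode {
    name: &'static str,
    cache_bytes: u64,    // estimated KV cache footprint for this mode
    tokens_per_sec: f64, // estimated or measured decode throughput
}

/// Picks the highest-throughput mode whose cache fits within `budget_bytes`.
/// Returns None when no candidate fits.
fn auto_select(modes: &[KvMode], budget_bytes: u64) -> Option<&KvMode> {
    modes
        .iter()
        .filter(|m| m.cache_bytes <= budget_bytes)
        .max_by(|a, b| a.tokens_per_sec.total_cmp(&b.tokens_per_sec))
}
```

Passing `--kv-quant` explicitly would bypass a rule like this entirely, since the flag remains available and is merely optional.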
UltraFusion info display: pmetal info shows UltraFusion topology, die count, and local executor plan on Ultra Macs
Qwen3 LoRA RoPE reset: Qwen3 LoRA and QLoRA gain dense attention and RoPE reset support
ANE real-time evaluation: Experimental _ANEClient real-time dispatch with automatic fallback to standard evaluation on failure. Propagated via --ane-real-time CLI flag
bench-corpus CLI command: Structured kernel benchmarking with device-tier-aware test cases, JSON reporting, and --quick/--output flags
GPU memory bandwidth probing: Real GPU copy benchmark replaces static spec-table lookup, with disk-cached results and spec-table fallback
Persistent runtime kernel backend selection: Benchmark-and-persist infrastructure races MLX vs MPP backends on Apple10/M5, validates numerical agreement, and caches the winner to disk for 4-bit quantized linear, fused attention, and LoRA matmul
MPP kernel tile variants: Metal 4 GEMM supports parameterized tile variants (32x32, 64x32, 32x64, 64x64) with Tuna auto-tuner selection per device and problem shape
Serve ANE/CPU-hybrid engine caching: Serve engine auto-selects optimal backend (ANE, CPU-hybrid, GPU) at startup with permanent downgrade on failure. Compiled engines cached across requests
Rollback enabled by default for LoRA: Best-loss checkpoint rollback now defaults to on, with an extended warmup grace period. The snapshot is persisted to disk via an atomic write, and a for_lora() factory provides the recommended defaults
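For readers unfamiliar with the atomic-write pattern mentioned above, one common way to get it is to write to a temporary file and then rename it over the target. The sketch below assumes that approach; the release notes do not describe the actual mechanism, and `write_atomic` and the `.tmp` suffix are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `bytes` to `path` so that readers see either the old file or the
/// complete new file, never a partially written one.
fn write_atomic(path: &Path, bytes: &[u8]) -> io::Result<()> {
    // Temporary sibling file in the same directory, so the rename stays on
    // one filesystem.
    let mut tmp_name = path.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp_path = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp_path)?;
    file.write_all(bytes)?;
    // Flush data to disk before the rename makes the file visible.
    file.sync_all()?;
    drop(file);

    // On POSIX filesystems, rename replaces the target atomically.
    fs::rename(&tmp_path, path)
}
```

If the process dies mid-write, only the temporary file is affected and the previous snapshot remains intact, which is the property a rollback checkpoint needs.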
Extended StepMetrics: gpu_fwd_bwd_ms, optimizer_ms, io_staging_ms, overhead_ms fields for fine-grained training profiling
Zero-copy MoE expert dispatch: ExpertBufferPool with read_experts_aligned + encode_expert_aligned for pread-to-Metal expert weight dispatch. Auto-enable KV-Q8 when memory-constrained
ANE dual-die support: On UltraFusion chips, compile variant-B kernel set with distinct MIL hashes and alternate per step for dual-die thermal distribution. Auto-recompile on throughput degradation (>15% or >25K dispatches)
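The alternation and recompile trigger above can be read as two small rules, sketched here. This is one interpretation of the parenthetical: a throughput drop of more than 15% relative to a baseline, or more than 25,000 dispatches since the last compile. The names `KernelSet`, `kernel_set_for_step`, and `should_recompile` are hypothetical, as is the choice of starting with set A on even steps.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum KernelSet {
    A,
    B, // variant-B kernels, compiled with distinct MIL hashes
}

/// Alternates kernel sets every step to spread load across both dies.
fn kernel_set_for_step(step: u64) -> KernelSet {
    if step % 2 == 0 {
        KernelSet::A
    } else {
        KernelSet::B
    }
}

/// Assumed reading of the thresholds: recompile when throughput has dropped
/// by more than 15% from the baseline, or after more than 25,000 dispatches.
fn should_recompile(baseline_tps: f64, current_tps: f64, dispatches: u64) -> bool {
    let degradation = if baseline_tps > 0.0 {
        (baseline_tps - current_tps) / baseline_tps
    } else {
        0.0
    };
    degradation > 0.15 || dispatches > 25_000
}
```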
Batched parameter eval: Model dispatcher evaluates parameters in batches of 128 tensors per sync instead of all-at-once, reducing peak memory during model loading
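The batching idea above amounts to chunking the parameter list and synchronizing once per chunk. A minimal sketch, assuming a caller-supplied `eval` callback stands in for the real evaluate-and-sync call (`eval_in_batches` is a hypothetical helper, not the dispatcher's API):

```rust
/// Calls `eval` once per batch of at most `batch_size` parameters and
/// returns the number of batches (sync points) issued.
fn eval_in_batches<T>(params: &[T], batch_size: usize, mut eval: impl FnMut(&[T])) -> usize {
    // Guard against a zero batch size, which `chunks` would reject.
    let batch_size = batch_size.max(1);
    let mut syncs = 0;
    for chunk in params.chunks(batch_size) {
        // Only this chunk's tensors need to be materialized at once.
        eval(chunk);
        syncs += 1;
    }
    syncs
}
```

With 300 parameters and a batch size of 128, this issues three syncs over chunks of 128, 128, and 44 tensors, trading a few extra sync points for a lower peak memory footprint.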
Architecture enhancements: DeepSeek V3/V3.2, GPT-OSS, Jamba, Llama 4, Qwen3, and Qwen3-MoE model improvements and weight sanitization refinements
Third-party attribution: Complete THIRD_PARTY_NOTICES with entries for mlx-lm, llama.cpp/GGML, Candle, and Burn

Changed

ANE is now opt-in: The --no-ane flag has been replaced with --ane across CLI, TUI, orchestrator, and MCP. ANE training is experimental and limited to small models, so it defaults to off. The orchestrator's DispatchConfig now sets ane: false by default
Gradient checkpointing support corrected: Qwen3 and Qwen3Next no longer claim gradient checkpointing support, which was previously advertised in error
Training loop refactored: Gradient checkpointing helper extracted, step logging tracks step numbers correctly, training loop tests expanded

Removed

Merge methods: Removed merge methods with incompatible licenses. Cleaned up related references across documentation and configuration

Fixed

MetalSampler use-after-free: Retained source logits array until GPU completion in serve engine
Fused merge Tuna cache: Now uses persistent disk cache instead of ephemeral per-session tuning

Downloads

| Asset | Description |
|---|---|
| `pmetal-*-aarch64-apple-darwin.tar.gz` | CLI binary + mlx.metallib (Apple Silicon) |
| `PMetal--aarch64-apple-darwin-.dmg` | Desktop GUI app (Apple Silicon) |
| `mlx.metallib` | MLX Metal shader library (standalone) |

CLI Quick Start

tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI

Mount the DMG and drag PMetal to Applications.

Full Changelog: v0.3.13...v0.4.0
