PMetal v0.4.0
[0.4.0] - 2026-03-23
Added
- `pmetal-mcp` crate: Full MCP (Model Context Protocol) server exposing 45 tools for Claude Desktop and other MCP clients. Covers all pmetal functionality: training, inference, distillation, GRPO, RLKD, quantization, model merging, dataset operations, evaluation, benchmarking, model search, and Ollama export
  - Device & models: `device_info`, `search_models`, `download_model`, `list_local_models`, `model_fit`, `model_info`
  - Inference: `generate` (blocking), `chat` (via running serve instance), `start_serve`, `benchmark`, `bench_train`, `bench_gen`, `bench_corpus`
  - Training: `train`, `distill`, `grpo`, `rlkd`, `embed_train`, all as background jobs with full parameter coverage matching the CLI
  - Runtime training control: `job_set_lr`, `job_reduce_lr`, `job_reset_lr`, `job_save_checkpoint`, `job_graceful_stop` for LLM-driven adaptive training via the control file protocol
  - Job management: `list_jobs`, `job_status`, `job_logs`, `stop_job`
  - Dataset ops: `dataset_analyze`, `dataset_preview`, `dataset_validate`, `dataset_download`, `dataset_convert`, `dataset_filter`, `dataset_split`, `dataset_merge`, `dataset_sample`, `dataset_template`, `dataset_prepare`
  - Quantization & conversion: `quantize`, `fuse_lora`, `merge_models`, `pack_experts`, `ollama_create`, `ollama_modelfile`
  - Evaluation: `eval_perplexity`
  - All tools include rich `#[description]` annotations for parameter documentation in the MCP schema
  - Standalone binary (`pmetal-mcp`) for Claude Desktop + `pmetal mcp` subcommand (behind `mcp` feature flag)
  - Uses `turbomcp` v3.0.7 from crates.io
- Runtime training control protocol: Extended the control file protocol (`.lr_control.json`) with `SaveCheckpoint` and `GracefulStop` commands. The adaptive LR controller now polls the control file before checking its `enabled` flag, so external agents (MCP, TUI) can always send commands regardless of whether automatic detection is active
- `--no-adaptive-lr` flag: Disables automatic spike/plateau/divergence detection while keeping the control file protocol active. Enables fully LLM-driven learning rate control: the agent observes loss via `job_status` and manually adjusts LR via `job_set_lr`/`job_reduce_lr`
- UltraFusion execution planner (`pmetal-distributed`): Per-die stage planner for M-series Ultra Macs with in-memory channel transport backend for same-process links, avoiding TCP overhead on UltraFusion interconnect
- MPP FlashAttention for head_dim 64/96: Metal 4 MPP flash attention kernel now supports head_dim 64, 96, and 128 with stride-2/stride-3 SIMD lane packing and causal/non-causal variants
- Tuna persistent disk cache: The auto-tuner now persists benchmark results to disk, avoiding re-tuning on restart. Expanded search covers FlashAttention, FusedCrossEntropy, FusedNormLora, and FusedSwiGLU via function constants
- MoE GPU top-k selection: Expert top-k selection moved from CPU sort to GPU `argpartition_axis`, eliminating a sync point in the MoE forward path
- `bench-workload` CLI command: Benchmark a real cached workload for inference and short LoRA training with named presets (`--preset dense-qwen3`, `--preset hybrid-qwen3next`)
- KV cache quantization auto-select: `--kv-quant` is now optional; omitting it auto-selects the fastest quantization mode that fits the device memory budget
- UltraFusion info display: `pmetal info` shows UltraFusion topology, die count, and local executor plan on Ultra Macs
- Qwen3 LoRA RoPE reset: Qwen3 LoRA and QLoRA gain dense attention and RoPE reset support
- ANE real-time evaluation: Experimental `_ANEClient` real-time dispatch with automatic fallback to standard evaluation on failure. Propagated via `--ane-real-time` CLI flag
- `bench-corpus` CLI command: Structured kernel benchmarking with device-tier-aware test cases, JSON reporting, and `--quick`/`--output` flags
- GPU memory bandwidth probing: Real GPU copy benchmark replaces static spec-table lookup, with disk-cached results and spec-table fallback
- Persistent runtime kernel backend selection: Benchmark-and-persist infrastructure races MLX vs MPP backends on Apple10/M5, validates numerical agreement, and caches the winner to disk for 4-bit quantized linear, fused attention, and LoRA matmul
- MPP kernel tile variants: Metal 4 GEMM supports parameterized tile variants (32x32, 64x32, 32x64, 64x64) with Tuna auto-tuner selection per device and problem shape
- Serve ANE/CPU-hybrid engine caching: Serve engine auto-selects optimal backend (ANE, CPU-hybrid, GPU) at startup with permanent downgrade on failure. Compiled engines cached across requests
- Rollback enabled by default for LoRA: Best-loss checkpoint rollback now defaults to on with extended warmup grace period. Persistent snapshot to disk via atomic write. `for_lora()` factory for recommended defaults
- Extended StepMetrics: `gpu_fwd_bwd_ms`, `optimizer_ms`, `io_staging_ms`, `overhead_ms` fields for fine-grained training profiling
- Zero-copy MoE expert dispatch: `ExpertBufferPool` with `read_experts_aligned` + `encode_expert_aligned` for pread-to-Metal expert weight dispatch. Auto-enable KV-Q8 when memory-constrained
- ANE dual-die support: On UltraFusion chips, compile variant-B kernel set with distinct MIL hashes and alternate per step for dual-die thermal distribution. Auto-recompile on throughput degradation (>15% or >25K dispatches)
- Batched parameter eval: Model dispatcher evaluates parameters in batches of 128 tensors per sync instead of all-at-once, reducing peak memory during model loading
- Architecture enhancements: DeepSeek V3/V3.2, GPT-OSS, Jamba, Llama 4, Qwen3, and Qwen3-MoE model improvements and weight sanitization refinements
- Third-party attribution: Complete THIRD_PARTY_NOTICES with entries for mlx-lm, llama.cpp/GGML, Candle, and Burn
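
The runtime training control entries in this list describe an external agent that watches the loss and sends commands through a control file. The following is a minimal Python sketch of that idea, not pmetal's implementation. These notes do not document the real `.lr_control.json` schema, so the `"command"` key, the decision policy, and both function names are assumptions made for illustration; only the command names `SaveCheckpoint` and `GracefulStop` come from the notes. The write uses a temp file plus rename so a polling reader never sees a half-written file.

```python
import json
import math
import os
import tempfile


def write_control_command(control_path, command):
    """Atomically write a command dict to the control file.

    The dict layout is hypothetical; the real .lr_control.json
    schema is not documented in these release notes.
    """
    directory = os.path.dirname(os.path.abspath(control_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(command, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file in one step on the same filesystem.
        os.replace(tmp_path, control_path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def choose_command(losses, divergence_ratio=3.0):
    """Pick a command from a history of positive losses (illustrative policy)."""
    if not losses:
        return None
    latest = losses[-1]
    if not math.isfinite(latest):
        return {"command": "GracefulStop"}  # NaN or inf loss: stop cleanly
    best_before = min(losses[:-1], default=None)
    if best_before is None:
        return None  # only one observation so far
    if latest > best_before * divergence_ratio:
        return {"command": "GracefulStop"}  # loss diverged from the best seen
    if latest < best_before:
        return {"command": "SaveCheckpoint"}  # new best loss
    return None
```

In an agent loop, `choose_command` would be fed the loss values read from `job_status`, and any non-`None` result would be passed to `write_control_command`.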
Changed
- ANE is now opt-in: The `--no-ane` flag has been replaced with `--ane` across CLI, TUI, orchestrator, and MCP. ANE training is experimental and limited to small models, so it defaults to off. The orchestrator's `DispatchConfig` now sets `ane: false` by default
- Gradient checkpointing support corrected: Qwen3 and Qwen3Next no longer claim gradient checkpointing support (was incorrectly advertised)
- Training loop refactored: Gradient checkpointing helper extracted, step logging tracks step numbers correctly, training loop tests expanded
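
The switch from `--no-ane` to `--ane` turns an opt-out flag into an opt-in one. The difference can be sketched with Python's `argparse`. The toy parser below is illustrative only: it is not pmetal's CLI definition, and `build_parser` and `ane_enabled` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a toy parser with an opt-in ANE flag (not pmetal's actual CLI)."""
    parser = argparse.ArgumentParser(prog="toy-train")
    # Opt-in: when the flag is absent, ANE stays off, mirroring `ane: false`.
    parser.add_argument(
        "--ane",
        action="store_true",
        default=False,
        help="enable experimental ANE training",
    )
    return parser


def ane_enabled(argv):
    """Return True only when --ane is passed explicitly."""
    return build_parser().parse_args(argv).ane
```

Under this scheme, scripts that relied on ANE being on by default need to add `--ane` explicitly.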
Removed
- Merge methods: Removed merge methods with incompatible licenses. Cleaned up related references across documentation and configuration
Fixed
- MetalSampler use-after-free: Retained source logits array until GPU completion in serve engine
- Fused merge Tuna cache: Now uses persistent disk cache instead of ephemeral per-session tuning
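
The Tuna cache entries in these notes describe persisting tuning results so a restart does not repeat the benchmark. A minimal sketch of that pattern follows, assuming a JSON file as the cache and a caller-supplied `measure` function that returns a cost for each candidate. The function name, key format, and cache layout are hypothetical and do not reflect Tuna's actual on-disk format.

```python
import json
import os


def tune_with_disk_cache(cache_path, key, candidates, measure):
    """Return the lowest-cost candidate for `key`, reusing a JSON cache on disk."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as handle:
            cache = json.load(handle)
    if key in cache:
        return cache[key]  # cache hit: skip re-tuning entirely

    # Cache miss: measure every candidate once and keep the cheapest.
    costs = {name: measure(name) for name in candidates}
    best = min(costs, key=costs.get)

    cache[key] = best
    with open(cache_path, "w") as handle:
        json.dump(cache, handle)
    return best
```

A second call with the same `key` returns the stored winner without invoking `measure`, which is the behavior an ephemeral per-session tuner cannot provide across restarts.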
Downloads
| Asset | Description |
|---|---|
| `pmetal-*-aarch64-apple-darwin.tar.gz` | CLI binary + mlx.metallib (Apple Silicon) |
| `PMetal-*-aarch64-apple-darwin-*.dmg` | Desktop GUI app (Apple Silicon) |
| `mlx.metallib` | MLX Metal shader library (standalone) |
CLI Quick Start

    tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
    ./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI
Mount the DMG and drag PMetal to Applications.
Full Changelog: v0.3.13...v0.4.0