PMetal v0.3.7
[0.3.7] - 2026-03-16
Added
- `pmetal merge` CLI command: Model merging exposed as a first-class CLI command supporting all merge methods (Linear, SLERP, TIES, DARE, DELLA, NearSwap, Model Stock) with `--method`, `--base`, `--t`, `--weight-a`, `--weight-b`, `--density`, and `--dtype` flags
- `pmetal eval` CLI command: Dataset evaluation command that measures loss/perplexity over a validation set with optional LoRA adapter, `--num-samples` cap, and `--json` output
- `pmetal info` CLI command: Prints device and runtime information; `--json` flag emits structured JSON for scripting
- `pmetal search --json` output: Structured JSON output mode for search results including fit estimates, download counts, parameter estimates, and tags, enabling scripting and GUI integration
- `QuantizeMethod` enum: Replaces the string `--method` argument for `pmetal quantize` with a typed enum (`dynamic`, `q8_0`, `q4_k_m`, etc.); invalid methods now fail at argument parsing rather than deep inside the quantizer
- GRPO CLI arguments: `--epochs`, `--lora-r`, `--lora-alpha`, `--max-completion-length`, and `--seed` exposed as CLI arguments, replacing previous hardcoded defaults
- `loraplus_lr_ratio` and `neftune_noise_alpha`: New fields on training loop configurations that enable LoRA+ differential learning rates and NEFTune noise injection directly from config
- `trainable_params()` helper: New utility in `pmetal-lora` for counting total vs. trainable parameters, useful for logging and memory estimation
- `lora_alpha: f32`: Distillation CLI and `run_distillation_cli` now accept `lora_alpha` as `f32` instead of `usize` for finer-grained scaling control
- `seed` parameter in distillation and GRPO CLI: Reproducible runs via explicit `--seed` flag in all training entry points
- Gemma3 sliding window auto-detection: `DynamicModel` loader now reads `model_type == "gemma3"` and sets `is_gemma3 = true` on the config, enabling the correct every-6th-layer global attention pattern without manual config overrides
- KV cache support for more architectures: `DynamicModel::forward_with_cache` now routes DeepSeek, Cohere, StarCoder2, and Llama4 to their native caching paths; RecurrentGemma and Jamba now get clear error messages that they require `forward()` directly; hybrid models (NemotronH, Qwen3Next) get a descriptive error directing to `forward_with_hybrid_cache`
- Speculative decoding greedy path: `SpeculativeDecoder::verify_greedy()` provides exact-correct verification for temperature=0 decoding using argmax equality; avoids the numerically unstable rejection-sampling limit as temperature→0
- Hub cache management (`pmetal-hub`): New `cache.rs` module with cache inspection, eviction, and size-reporting helpers
- Shared model utilities (`pmetal-models/utils.rs`): Common helpers extracted from per-architecture modules to reduce duplication
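
The greedy verification path listed above can be illustrated with a minimal sketch. This is not PMetal's `SpeculativeDecoder::verify_greedy()` implementation: the free function, its signature, and the plain `Vec<f32>` logit rows are assumptions made for illustration. It accepts draft tokens for as long as each one equals the target model's argmax, and returns the target's own token at the first mismatch (or a bonus token if every draft token matched).

```rust
/// Index of the largest logit; ties resolve to the lowest index.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best
}

/// Hypothetical greedy verifier (names and signature are assumptions).
/// `target_logits` holds one row per draft token plus one extra row
/// for the bonus position after the last draft token.
/// Returns (number of accepted draft tokens, next token from the target).
fn verify_greedy(draft: &[usize], target_logits: &[Vec<f32>]) -> (usize, usize) {
    assert_eq!(target_logits.len(), draft.len() + 1, "need one bonus row");
    for (i, &tok) in draft.iter().enumerate() {
        let expected = argmax(&target_logits[i]);
        if expected != tok {
            // First mismatch: keep the prefix, emit the target's token.
            return (i, expected);
        }
    }
    // Every draft token matched: emit the bonus token.
    (draft.len(), argmax(&target_logits[draft.len()]))
}

fn main() {
    let target_logits: Vec<Vec<f32>> = vec![
        vec![0.1, 2.0, 0.3], // argmax = 1
        vec![1.5, 0.2, 0.1], // argmax = 0
        vec![0.0, 0.1, 3.0], // argmax = 2
    ];
    let (accepted, next) = verify_greedy(&[1, 2], &target_logits);
    // prints "accepted 1 draft token(s), next token = 0"
    println!("accepted {accepted} draft token(s), next token = {next}");
}
```

Because acceptance is a pure equality check, no probability ratio is computed, so the division by a vanishing temperature never occurs.
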
Fixed
- Scale factor broadcasting in distillation: `squeeze` applied to the scale factor dimension so it broadcasts correctly across batch and sequence axes; previously caused shape mismatches on non-unit batch sizes
- TAID `mean_alpha` forcing GPU sync: `TaidLossOutput::mean_alpha` changed from `f32` to a lazy `Array`; the `.eval()` call is deferred until callers explicitly call `.item::<f32>()`, removing a forced GPU-CPU sync before the backward pass
- SLERP numerical stability: Added epsilon clamping in the SLERP merge path to prevent NaN when interpolation parameter is at the boundary values (0.0 or 1.0)
- Llama LoRA `trainable_params` / gradient application: Replaced 100+ lines of repeated field accesses with an `insert_adapter!` macro and loop over projection names, fixing DoRA `magnitude` parameter that was silently dropped from gradient maps
- GaLore improvements: Corrected projection matrix update schedule and subspace dimensionality handling
- Distillation hidden-state loss: Refactored alignment computation to correctly handle variable-rank teacher/student hidden state tensors
- Jensen-Shannon / KL divergence loss: Numerical stability improvements — log-sum-exp stabilization applied consistently across all reduction paths
- Offline distillation: Fixed logit cache loading to handle both single-file and sharded cache layouts
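
As an illustration of the kind of guard the SLERP fix describes, the sketch below interpolates two flattened weight vectors and falls back to linear interpolation when the sine of the angle between them is too small for the weight ratio to be stable. The function name, the `EPS` value, and the fallback strategy are assumptions, not PMetal's actual merge code.

```rust
/// Hypothetical SLERP over two flattened weight vectors.
/// `EPS` and the linear fallback are illustrative choices.
fn slerp(a: &[f32], b: &[f32], t: f32) -> Vec<f32> {
    const EPS: f32 = 1e-6;
    assert_eq!(a.len(), b.len(), "vectors must have equal length");

    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();

    // Guard the division and keep the cosine inside acos's domain.
    let cos_omega = (dot / (norm_a * norm_b).max(EPS)).clamp(-1.0, 1.0);
    let omega = cos_omega.acos();
    let sin_omega = omega.sin();

    let (wa, wb) = if sin_omega.abs() < EPS {
        // Nearly (anti)parallel inputs: sin(omega) is ~0, so use lerp weights.
        (1.0 - t, t)
    } else {
        (
            ((1.0 - t) * omega).sin() / sin_omega,
            (t * omega).sin() / sin_omega,
        )
    };
    a.iter().zip(b).map(|(x, y)| wa * x + wb * y).collect()
}

fn main() {
    let merged = slerp(&[1.0, 0.0], &[0.0, 1.0], 0.5);
    println!("{merged:?}");
}
```

Without the clamp, rounding can push the cosine slightly past 1.0, where `acos` returns NaN; without the fallback, the weights become a 0/0 ratio for identical inputs.
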
Changed
- `lm_groups.rs` / LoRA+ optimizer groups: `build_lora_param_groups` significantly reworked; LoRA+ differential LR ratio (`loraplus_lr_ratio`) applied to `lora_b` parameters, NEFTune noise injection integrated into group construction
- GRPO trainer: `epochs`, `lora_r`, `lora_alpha`, `max_completion_length`, and `seed` plumbed through from CLI args; previously these were hardcoded to `1`, `16`, `32`, `512`, and a fixed seed
- Training loop: `loraplus_lr_ratio` and `neftune_noise_alpha` read from config and forwarded to optimizer group construction
- `pmetal-core` config / scheduler / traits: Config structs gained `loraplus_lr_ratio` and `neftune_noise_alpha` fields; scheduler types and learning rate trait bounds refined; `TrainingCallback` trait extended with blanket impls for boxed callbacks
- Data pipeline: Tokenizer, packing, `vocab_compact`, dataset, and chat template modules updated; minor correctness and efficiency fixes accumulated across the release cycle
- GGUF reader / writer / quantize: Reader handles additional tensor metadata fields; writer improves alignment padding; quantize module uses `QuantizeMethod` enum instead of string matching
- Hub search: `search_models` returns richer result structs used by both the human-readable table and the new `--json` output path; upload path fixes for large model shards
- Metal kernels: GDN, LoRA, grouped GEMM, and fused SwiGLU Metal shaders updated, with improved numerical correctness and register pressure
- GUI app icons and Tauri config: Updated icons (32×32, 128×128, 128×128@2x, icns, ico) and `tauri.conf.json` for the 0.3.7 release build; Python vocoder `easy` API additions and mel spectrogram fix
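
The LoRA+ grouping mentioned in the optimizer-groups item can be sketched as follows, assuming parameters are identified by name and that any name containing `lora_b` belongs to a B matrix. `ParamGroup`, `build_groups`, and the example parameter names are hypothetical stand-ins, not the signature of `build_lora_param_groups`, and the ratio of 16 is only an illustrative value.

```rust
/// Hypothetical optimizer group: a set of parameter names sharing one LR.
struct ParamGroup {
    names: Vec<String>,
    lr: f32,
}

fn owned(names: Vec<&str>) -> Vec<String> {
    names.into_iter().map(String::from).collect()
}

/// Splits parameters into two groups: `lora_b` parameters train at
/// `base_lr * loraplus_lr_ratio`, everything else at `base_lr`.
fn build_groups(param_names: &[&str], base_lr: f32, loraplus_lr_ratio: f32) -> Vec<ParamGroup> {
    let (b_names, other_names): (Vec<&str>, Vec<&str>) = param_names
        .iter()
        .copied()
        .partition(|name| name.contains("lora_b"));

    vec![
        ParamGroup { names: owned(other_names), lr: base_lr },
        ParamGroup { names: owned(b_names), lr: base_lr * loraplus_lr_ratio },
    ]
}

fn main() {
    let names = ["layers.0.q_proj.lora_a", "layers.0.q_proj.lora_b"];
    for group in build_groups(&names, 1e-4, 16.0) {
        println!("lr = {:e}, params = {:?}", group.lr, group.names);
    }
}
```

A ratio of 1.0 reduces this to ordinary LoRA training, where both matrices share the same learning rate.
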
Downloads
| Asset | Description |
|---|---|
| `pmetal-*-aarch64-apple-darwin.tar.gz` | CLI binary + `mlx.metallib` (Apple Silicon) |
| `PMetal-*-aarch64-apple-darwin-*.dmg` | Desktop GUI app (Apple Silicon) |
| `mlx.metallib` | MLX Metal shader library (standalone) |
CLI Quick Start
tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output
GUI
Mount the DMG and drag PMetal to Applications.
Full Changelog: v0.3.6...v0.3.7