PMetal v0.3.6
[0.3.6] - 2026-03-15
Added
- Desktop GUI (Tauri + Svelte): Full desktop application for model management, training, distillation, GRPO, inference, merging, and quantization. 10 pages: Dashboard, Models, Datasets, Training, Distillation, GRPO, Inference, Merging, Quantize, Settings. Real-time training metrics with live loss charts via broadcast events. Model download with HuggingFace Hub integration, dataset browser, and inference chat interface with streaming token display
- GUI in-process execution: Training, distillation, GRPO, inference, model merging, LoRA fuse, and quantization run as direct library calls instead of shelling out to the `pmetal` binary. Eliminates binary discovery issues, reduces process overhead, and enables richer progress reporting. Device info and model metadata are also read from library APIs
- `easy::dpo()` / `easy::simpo()` / `easy::orpo()` / `easy::kto()` builders: `PreferenceTuneBuilder` in `easy.rs` for preference optimization methods. Full pipeline: model download → tokenizer → dataset loading → LoRA setup → training loop → weight saving. Supports method-specific config (DPO beta/loss type, SimPO gamma/CPO, ORPO beta, KTO desirable/undesirable weights)
- `easy::infer().generate_streaming()`: Streaming inference API with per-delta callback. Supports both base models and LoRA adapters. Return `false` from the callback to cancel early. ANE fallback emits the full result as a single delta
- Preference trainer `train()` methods: DPO, KTO, ORPO, and SimPO trainers now have self-contained `train()` methods with optimizer integration, batching, epoch loops, callback lifecycle, and metrics collection. Previously they only exposed per-step primitives
- `TrainingCallback::should_stop()`: Clean cancellation mechanism. Callbacks return `true` to request that the training loop finish the current step and exit with a `Cancelled` error. Checked after every step in all 5 `TrainingLoop::run*` methods, all 4 preference trainer `train()` loops, and `GrpoTrainer::run()`
- `PMetalError::Cancelled`: New error variant for clean training cancellation. Corresponding `Cancelled` variants added to `SftError`, `DpoError`, `KtoError`, `OrpoError`, `SimpoError`, and `GrpoError`
- Preference batch padding utilities: `pad_u32_sequences`, `pad_i64_sequences`, `pad_f32_sequences` in `preference_batch.rs` for batching variable-length preference pairs
- NemotronH runtime FP8 quantization: `quantize_fp8()` converts float weights to FP8 (E4M3) at runtime for all four block types (Mamba, attention, MLP, MoE). Shared helpers `materialize_linear_weight` and `linear_forward_with_optional_fp8` consolidate FP8 dequantization across the model. MoE weights are restacked after quantization for batched dispatch
- `FluxPipeline::from_pretrained`: Load Flux diffusion pipelines from HuggingFace-style model directories. Discovers components via `model_index.json`, parses both native and diffusers-style config keys for CLIP, T5, FluxDiT, and VAE
- Python training callbacks: `Trainer.add_callback()` now wires callbacks into the training loop. Built-in `ProgressCallback`, `LoggingCallback`, and `MetricsJsonCallback` map to native Rust implementations; arbitrary Python objects bridge through `PythonCallbackBridge`
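
The `should_stop()` cancellation flow listed above can be pictured with a minimal sketch. This is not PMetal's actual API: the trait below is a simplified stand-in, and `TrainError`, `StopAfter`, `on_step_end`, and `run_loop` are hypothetical names introduced for illustration. Only the idea of a callback returning `true` and the loop exiting with a `Cancelled` error after the current step comes from these notes.

```rust
use std::fmt;

/// Hypothetical error type; only the `Cancelled` variant name comes from the notes.
#[derive(Debug, PartialEq)]
pub enum TrainError {
    Cancelled,
}

impl fmt::Display for TrainError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TrainError::Cancelled => write!(f, "training cancelled"),
        }
    }
}

impl std::error::Error for TrainError {}

/// Simplified stand-in for a training callback (assumed shape, not the real trait).
pub trait TrainingCallback {
    fn on_step_end(&mut self, _step: usize, _loss: f32) {}
    /// Return `true` to ask the loop to stop after the current step.
    fn should_stop(&self) -> bool {
        false
    }
}

/// Illustrative callback that requests cancellation once `limit` steps have finished.
pub struct StopAfter {
    pub limit: usize,
    pub seen: usize,
}

impl TrainingCallback for StopAfter {
    fn on_step_end(&mut self, _step: usize, _loss: f32) {
        self.seen += 1;
    }
    fn should_stop(&self) -> bool {
        self.seen >= self.limit
    }
}

/// Runs up to `total_steps`, polling every callback after each completed step.
/// Returns the number of steps run, or `Err(Cancelled)` if a callback asked to stop.
pub fn run_loop(
    total_steps: usize,
    callbacks: &mut [Box<dyn TrainingCallback>],
) -> Result<usize, TrainError> {
    for step in 0..total_steps {
        // Placeholder for the real forward/backward/optimizer step.
        let loss = 1.0 / (step as f32 + 1.0);
        for cb in callbacks.iter_mut() {
            cb.on_step_end(step, loss);
        }
        // The check happens only after the step has fully completed.
        if callbacks.iter().any(|cb| cb.should_stop()) {
            return Err(TrainError::Cancelled);
        }
    }
    Ok(total_steps)
}
```

Because cancellation travels as an ordinary `Result`, a caller such as a GUI can match on the `Cancelled` variant and treat it as a normal outcome rather than a failure.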
Fixed
- Training cancellation via `panic_any` replaced: GUI and TUI previously used `std::panic::panic_any(CancelledRun)` + `catch_unwind` to abort training, which was fragile, UB-prone through FFI, and could be swallowed by an intermediate `catch_unwind`. Replaced with `TrainingCallback::should_stop()` returning a clean `Err(Cancelled)` from the training loop
- GUI QLoRA silently failed on non-Llama models: `run_qlora_training_in_process` hardcoded `LlamaConfig` deserialization, causing confusing errors or silent misconfiguration for Gemma/Qwen/Phi models. Now detects `model_type` from `config.json` and returns a clear error for unsupported architectures
- GUI `resume_from` silently ignored: Training config accepted `resume_from` but discarded it (`let _ = eval`). Now returns an error directing users to the CLI
- GUI GRPO with no reward function produced noise: `DummyReward` returning constant 0.1 for all completions made GRPO training meaningless when reasoning rewards were disabled. Now requires explicit reward configuration
- Preference trainers doubled compute per step: DPO, KTO, ORPO, and SimPO `train()` methods ran a second full forward pass after the gradient step solely for logging metrics. Replaced with `RefCell` side-channels that capture metric arrays from within the autograd closure: same metrics, zero extra compute
- Base model thinking mode: Auto-detect base vs instruct models and disable `<think>` tag prefill for base models. Base models don't understand thinking tags, causing infinite generation without a closing tag
- Fused model 5x slower than LoRA: Skip ANE-hybrid path for models under 2B parameters where GPU KV-cache decode is significantly faster (115 vs 20 tok/s). ANE-hybrid benefits larger models where prefill dominates
- DataLoader panics on bad images: Replace `panic!()` in VLM batch construction with proper `DataLoaderError` enum and `try_next_batch()` method. Image preprocessing failures and missing-image errors now propagate as `Result` instead of crashing
- Division by zero with `log_every=0`: Clamp `log_every` and `save_every` to minimum 1 across `TrainingLoop`, `LoggingCallback`, `CheckpointCallback`, and CLI
- LoRA scaling with rank 0: `LoraConfig::scaling()` returns 0.0 when rank is 0 instead of dividing by zero
- BF16 LoRA weights: `sanitize_loaded_weights()` converts BF16 tensors to FP16 since MLX doesn't natively support BF16 on Apple Silicon
- Qwen3Next silent weight mismatch: Weight loading now returns errors for unmatched or missing parameters instead of logging a warning and continuing with a partially loaded model
- Dataset download only fetched README: `download_dataset()` now enumerates repo files and downloads actual data files (parquet, json, jsonl, csv, arrow, etc.) with split-aware filtering
- Model download silent failures: `download_model()` tracks per-file failures and reports them instead of silently skipping failed downloads
- Flux loading via DynamicModel: `DynamicModel::load()` for Flux now returns an error directing to `FluxPipeline` instead of incorrectly loading a diffusion model as a causal LM
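
The two zero-guard fixes in the list above (`log_every=0` and LoRA rank 0) follow a common pattern, sketched below. This is an illustration under stated assumptions, not the project's code: the trimmed-down `LoraConfig` fields, the conventional `alpha / rank` scaling formula, and the helpers `clamp_interval` and `should_log` are all hypothetical.

```rust
/// Hypothetical, trimmed-down LoRA config; the real struct has more fields.
pub struct LoraConfig {
    pub rank: usize,
    pub alpha: f32,
}

impl LoraConfig {
    /// Assumes the conventional LoRA scaling `alpha / rank`.
    /// Rank 0 maps to 0.0; without the guard, the float division would
    /// yield infinity or NaN rather than a usable scale.
    pub fn scaling(&self) -> f32 {
        if self.rank == 0 {
            0.0
        } else {
            self.alpha / self.rank as f32
        }
    }
}

/// Clamps an interval to at least 1 so `step % interval` can never
/// hit an integer division by zero (which panics in Rust).
pub fn clamp_interval(every: usize) -> usize {
    every.max(1)
}

/// Returns true when `step` falls on a logging boundary.
pub fn should_log(step: usize, log_every: usize) -> bool {
    step % clamp_interval(log_every) == 0
}
```

With the clamp in place, an interval of 0 behaves like an interval of 1, so every step logs instead of the loop panicking.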
Changed
- GUI architecture: library calls replace subprocess spawning: Training, distillation, GRPO, inference, merge, fuse, and quantize commands now call `pmetal` library functions directly instead of spawning the `pmetal` CLI as a child process. System info reads from `MetalContext::global()` instead of parsing `pmetal memory` stdout. Removes `which` and `futures-util` dependencies
- TUI direct training execution: `command_runner.rs` dispatches `train`, `distill`, and `grpo` commands as in-process library calls via `run_direct_command()`, falling back to subprocess for other commands. Training parameters parsed from `CommandSpec` args with `parse_arg`/`required_arg`/`optional_arg` helpers
- ORPO loss computation refactored: `compute_orpo_loss_static` now contains the full computation directly instead of creating a throwaway `OrpoTrainer` instance. The instance method `compute_orpo_loss` delegates to it
- SimPO gradient-safe loss path: New `compute_loss_with_cpo_for_grad` static method keeps the computation graph lazy (no `.eval()`/`.item()` calls) for correct autograd. The existing `compute_loss_with_cpo` remains for non-grad contexts
- `FinetuneBuilder` expanded: New builder methods: `lora_dropout()`, `use_rslora()`, `use_dora()`, `gradient_checkpointing_layers()`, `callback()`, `metrics_path()`. LoRA config now forwards dropout, RSLoRA, and DoRA settings
- GRPO CLI gains new parameters: `epochs`, `lora_r`, `lora_alpha`, `max_completion_length` exposed as CLI arguments and TUI form fields. GRPO now saves `adapter_config.json` alongside LoRA weights
- CLI `emit_console_output` flag: Training, distillation, and GRPO CLI functions accept `emit_console_output: bool` and `extra_callbacks: Vec<Box<dyn TrainingCallback>>` to suppress terminal output when called from GUI/TUI
- DataLoader error handling: New `DataLoaderError` enum with `Mlx`, `ImagePreprocess`, and `MissingImages` variants. All 7 training loop entry points migrated from `next_batch()` to `try_next_batch()`
- AdapterManager validation: `load()` now validates path existence, checks for adapter artifacts in directories, and rejects unsupported file types
- Metal shader build isolation: Shader compiler cache redirected to build output directory, preventing pollution of user's home directory
- `unsafe_code` lint scoping: Moved blanket `#![allow(unsafe_code)]` from crate-level `lib.rs` into individual modules that contain unsafe blocks across pmetal-metal, pmetal-mlx, pmetal-models, pmetal-trainer, pmetal-distill, and pmetal-distributed
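
The move from `next_batch()` to `try_next_batch()` listed above can be sketched roughly as follows. The variant names `Mlx`, `ImagePreprocess`, and `MissingImages` come from these notes; everything else is an assumption for illustration, including the variant payloads, the `Sample` and `DataLoader` shapes, the return type, and the toy ".png only" preprocessing check.

```rust
use std::fmt;

/// Variant names follow the notes; the payload types are assumed.
#[derive(Debug, PartialEq)]
pub enum DataLoaderError {
    Mlx(String),
    ImagePreprocess(String),
    MissingImages,
}

impl fmt::Display for DataLoaderError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DataLoaderError::Mlx(msg) => write!(f, "MLX error: {}", msg),
            DataLoaderError::ImagePreprocess(msg) => write!(f, "image preprocessing failed: {}", msg),
            DataLoaderError::MissingImages => write!(f, "sample has no image"),
        }
    }
}

impl std::error::Error for DataLoaderError {}

/// Hypothetical VLM sample: token ids plus an optional image path.
pub struct Sample {
    pub tokens: Vec<u32>,
    pub image_path: Option<String>,
}

/// Hypothetical loader that hands out fixed-size batches of token sequences.
pub struct DataLoader {
    samples: Vec<Sample>,
    batch_size: usize,
    cursor: usize,
}

impl DataLoader {
    pub fn new(samples: Vec<Sample>, batch_size: usize) -> Self {
        // Clamp so a batch size of 0 cannot stall the loader.
        Self { samples, batch_size: batch_size.max(1), cursor: 0 }
    }

    /// Returns `Ok(None)` when the data is exhausted, `Ok(Some(batch))` on
    /// success, and an error (instead of panicking) for a bad sample.
    pub fn try_next_batch(&mut self) -> Result<Option<Vec<Vec<u32>>>, DataLoaderError> {
        if self.cursor >= self.samples.len() {
            return Ok(None);
        }
        let end = (self.cursor + self.batch_size).min(self.samples.len());
        let mut batch = Vec::with_capacity(end - self.cursor);
        for sample in &self.samples[self.cursor..end] {
            match &sample.image_path {
                None => return Err(DataLoaderError::MissingImages),
                // Toy stand-in for real image decoding and resizing.
                Some(path) if !path.ends_with(".png") => {
                    return Err(DataLoaderError::ImagePreprocess(format!(
                        "unsupported image: {}",
                        path
                    )));
                }
                Some(_) => batch.push(sample.tokens.clone()),
            }
        }
        self.cursor = end;
        Ok(Some(batch))
    }
}
```

A training loop built on this shape can use `?` on each call, so a single corrupt image surfaces as a reportable error rather than aborting the process.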
Downloads
| Asset | Description |
|---|---|
| `pmetal-*-aarch64-apple-darwin.tar.gz` | CLI binary + mlx.metallib (Apple Silicon) |
| `PMetal-*-aarch64-apple-darwin-*.dmg` | Desktop GUI app (Apple Silicon) |
| `mlx.metallib` | MLX Metal shader library (standalone) |
CLI Quick Start
```sh
tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output
```

GUI
Mount the DMG and drag PMetal to Applications.
Full Changelog: v0.3.5...v0.3.6