Release PMetal v0.3.6 · Epistates/pmetal

[0.3.6] - 2026-03-15

Added

Desktop GUI (Tauri + Svelte): Full desktop application for model management, training, distillation, GRPO, inference, merging, and quantization. 10 pages: Dashboard, Models, Datasets, Training, Distillation, GRPO, Inference, Merging, Quantize, Settings. Real-time training metrics with live loss charts via broadcast events. Model download with HuggingFace Hub integration, dataset browser, and inference chat interface with streaming token display
GUI in-process execution: Training, distillation, GRPO, inference, model merging, LoRA fuse, and quantization run as direct library calls instead of shelling out to the pmetal binary. Eliminates binary discovery issues, reduces process overhead, and enables richer progress reporting. Device info and model metadata also read from library APIs
easy::dpo() / easy::simpo() / easy::orpo() / easy::kto() builders: PreferenceTuneBuilder in easy.rs for preference optimization methods. Full pipeline: model download → tokenizer → dataset loading → LoRA setup → training loop → weight saving. Supports method-specific config (DPO beta/loss type, SimPO gamma/CPO, ORPO beta, KTO desirable/undesirable weights)
easy::infer().generate_streaming(): Streaming inference API with per-delta callback. Supports both base models and LoRA adapters. Returns false from callback to cancel early. ANE fallback emits full result as single delta
Preference trainer train() methods: DPO, KTO, ORPO, and SimPO trainers now have self-contained train() methods with optimizer integration, batching, epoch loops, callback lifecycle, and metrics collection. Previously only exposed per-step primitives
TrainingCallback::should_stop(): Clean cancellation mechanism — callbacks return true to request training loop to finish the current step and exit with Cancelled error. Checked after every step in all 5 TrainingLoop::run* methods, all 4 preference trainer train() loops, and GrpoTrainer::run()
PMetalError::Cancelled: New error variant for clean training cancellation. Corresponding Cancelled variants added to SftError, DpoError, KtoError, OrpoError, SimpoError, and GrpoError
Preference batch padding utilities: pad_u32_sequences, pad_i64_sequences, pad_f32_sequences in preference_batch.rs for batching variable-length preference pairs
NemotronH runtime FP8 quantization: quantize_fp8() converts float weights to FP8 (E4M3) at runtime for all four block types (Mamba, attention, MLP, MoE). Shared helpers materialize_linear_weight and linear_forward_with_optional_fp8 consolidate FP8 dequantization across the model. MoE weights are restacked after quantization for batched dispatch
FluxPipeline::from_pretrained: Load Flux diffusion pipelines from HuggingFace-style model directories. Discovers components via model_index.json, parses both native and diffusers-style config keys for CLIP, T5, FluxDiT, and VAE
Python training callbacks: Trainer.add_callback() now wires callbacks into the training loop. Built-in ProgressCallback, LoggingCallback, and MetricsJsonCallback map to native Rust implementations; arbitrary Python objects bridge through PythonCallbackBridge

Fixed

Training cancellation via panic_any replaced: GUI and TUI previously used std::panic::panic_any(CancelledRun) + catch_unwind to abort training — fragile, UB-prone through FFI, and could be swallowed by intermediate catch_unwind. Replaced with TrainingCallback::should_stop() returning a clean Err(Cancelled) from the training loop
GUI QLoRA silently failed on non-Llama models: run_qlora_training_in_process hardcoded LlamaConfig deserialization, causing confusing errors or silent misconfiguration for Gemma/Qwen/Phi models. Now detects model_type from config.json and returns a clear error for unsupported architectures
GUI resume_from silently ignored: Training config accepted resume_from but discarded it (let _ = eval). Now returns an error directing users to the CLI
GUI GRPO with no reward function produced noise: DummyReward returning constant 0.1 for all completions made GRPO training meaningless when reasoning rewards were disabled. Now requires explicit reward configuration
Preference trainers doubled compute per step: DPO, KTO, ORPO, and SimPO train() methods ran a second full forward pass after the gradient step solely for logging metrics. Replaced with RefCell side-channels that capture metric arrays from within the autograd closure — same metrics, zero extra compute
Base model thinking mode: Auto-detect base vs instruct models and disable <think> tag prefill for base models. Base models don't understand thinking tags, causing infinite generation without a closing tag
Fused model 5x slower than LoRA: Skip ANE-hybrid path for models under 2B parameters where GPU KV-cache decode is significantly faster (115 vs 20 tok/s). ANE-hybrid benefits larger models where prefill dominates
DataLoader panics on bad images: Replace panic!() in VLM batch construction with proper DataLoaderError enum and try_next_batch() method. Image preprocessing failures and missing-image errors now propagate as Result instead of crashing
Division by zero with log_every=0: Clamp log_every and save_every to minimum 1 across TrainingLoop, LoggingCallback, CheckpointCallback, and CLI
LoRA scaling with rank 0: LoraConfig::scaling() returns 0.0 when rank is 0 instead of dividing by zero
BF16 LoRA weights: sanitize_loaded_weights() converts BF16 tensors to FP16 since MLX doesn't natively support BF16 on Apple Silicon
Qwen3Next silent weight mismatch: Weight loading now returns errors for unmatched or missing parameters instead of logging a warning and continuing with a partially loaded model
Dataset download only fetched README: download_dataset() now enumerates repo files and downloads actual data files (parquet, json, jsonl, csv, arrow, etc.) with split-aware filtering
Model download silent failures: download_model() tracks per-file failures and reports them instead of silently skipping failed downloads
Flux loading via DynamicModel: DynamicModel::load() for Flux now returns an error directing to FluxPipeline instead of incorrectly loading a diffusion model as a causal LM

Changed

GUI architecture: library calls replace subprocess spawning: Training, distillation, GRPO, inference, merge, fuse, and quantize commands now call pmetal library functions directly instead of spawning pmetal CLI as a child process. System info reads from MetalContext::global() instead of parsing pmetal memory stdout. Removes which and futures-util dependencies
TUI direct training execution: command_runner.rs dispatches train, distill, and grpo commands as in-process library calls via run_direct_command(), falling back to subprocess for other commands. Training parameters parsed from CommandSpec args with parse_arg/required_arg/optional_arg helpers
ORPO loss computation refactored: compute_orpo_loss_static now contains the full computation directly instead of creating a throwaway OrpoTrainer instance. The instance method compute_orpo_loss delegates to it
SimPO gradient-safe loss path: New compute_loss_with_cpo_for_grad static method keeps the computation graph lazy (no .eval()/.item() calls) for correct autograd. The existing compute_loss_with_cpo remains for non-grad contexts
FinetuneBuilder expanded: New builder methods — lora_dropout(), use_rslora(), use_dora(), gradient_checkpointing_layers(), callback(), metrics_path(). LoRA config now forwards dropout, RSLoRA, and DoRA settings
GRPO CLI gains new parameters: epochs, lora_r, lora_alpha, max_completion_length exposed as CLI arguments and TUI form fields. GRPO now saves adapter_config.json alongside LoRA weights
CLI emit_console_output flag: Training, distillation, and GRPO CLI functions accept emit_console_output: bool and extra_callbacks: Vec<Box<dyn TrainingCallback>> to suppress terminal output when called from GUI/TUI
DataLoader error handling: New DataLoaderError enum with Mlx, ImagePreprocess, and MissingImages variants. All 7 training loop entry points migrated from next_batch() to try_next_batch()
AdapterManager validation: load() now validates path existence, checks for adapter artifacts in directories, and rejects unsupported file types
Metal shader build isolation: Shader compiler cache redirected to build output directory, preventing pollution of user's home directory
unsafe_code lint scoping: Moved blanket #![allow(unsafe_code)] from crate-level lib.rs into individual modules that contain unsafe blocks across pmetal-metal, pmetal-mlx, pmetal-models, pmetal-trainer, pmetal-distill, and pmetal-distributed

Downloads

Asset	Description
`pmetal-*-aarch64-apple-darwin.tar.gz`	CLI binary + mlx.metallib (Apple Silicon)
`PMetal--aarch64-apple-darwin-.dmg`	Desktop GUI app (Apple Silicon)
`mlx.metallib`	MLX Metal shader library (standalone)

CLI Quick Start

tar xzf pmetal-*-aarch64-apple-darwin.tar.gz
./pmetal train --model Qwen/Qwen3-0.6B --dataset train.jsonl --output ./output

GUI

Mount the DMG and drag PMetal to Applications.

Full Changelog: v0.3.5...v0.3.6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

PMetal v0.3.6

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

[0.3.6] - 2026-03-15

Added

Fixed

Changed

Downloads

CLI Quick Start

GUI

Uh oh!