Add MPS support for Apple Silicon devices #55
Conversation
- Added device field to TrainingConfig schema (auto/cuda/mps/cpu)
- Created device_utils.py with device resolution functions
- Updated hardware detector to detect both CUDA and MPS
- Added MPS support to system info detection

Co-authored-by: RETR0-OS <74290459+RETR0-OS@users.noreply.github.com>

- Added device resolution in training_service
- Updated HuggingFace provider to support MPS device
- Added MPS validation in Unsloth provider
- Disabled bitsandbytes on MPS automatically
- Pass device_type to model loading

Co-authored-by: RETR0-OS <74290459+RETR0-OS@users.noreply.github.com>

- Modified chat_playground to support CUDA/MPS/CPU
- Updated hardware service to expose device_type
- Replaced hardcoded CUDA references with device checks

Co-authored-by: RETR0-OS <74290459+RETR0-OS@users.noreply.github.com>

- Created macOS MPS installation guide
- Updated README with MPS support and platform table
- Updated HuggingFace provider docs for MPS
- Updated Unsloth provider docs with MPS limitations
- Added MPS troubleshooting section to common issues

Co-authored-by: RETR0-OS <74290459+RETR0-OS@users.noreply.github.com>

- Fixed type annotation to use List[str] for Python 3.8 compatibility
- Clarified comment about MPS device_map handling
- Updated docs README to include macOS MPS installation link

Co-authored-by: RETR0-OS <74290459+RETR0-OS@users.noreply.github.com>

- Created tests directory
- Added comprehensive unit tests for device resolution
- Tests cover auto-detection, availability checks, and error cases
- All tests use mocks to avoid torch dependency

Co-authored-by: RETR0-OS <74290459+RETR0-OS@users.noreply.github.com>
- Make bitsandbytes an optional dependency (moved to [cuda] extras) so MPS installations don't fail on import
- Add MPS memory management: torch.mps.empty_cache(), gc.collect(), PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to prevent OOM kills
- Load models with torch_dtype=float16 and low_cpu_mem_usage=True on MPS to halve peak memory during model loading
- Switch default optimizer from paged_adamw_32bit (requires bnb) to adamw_torch, which works on all platforms
- Force fp16 precision on MPS (bf16 unreliable); auto-default max_seq_length=512 for memory efficiency
- Add conditional BitsAndBytesConfig imports in quantization.py and legacy tuners to prevent import crashes on MPS
- Add get_mps_memory_info() for memory monitoring during training
- Clear device cache before and after training on MPS
- Update MPS documentation with new optimizer and memory settings

https://claude.ai/code/session_01RaPUzQs3pZjM2i9JoZ66Lz
Set dataloader_num_workers=0 and dataloader_pin_memory=False on MPS to avoid macOS multiprocessing fork issues and unnecessary pin_memory on unified memory systems. https://claude.ai/code/session_01RaPUzQs3pZjM2i9JoZ66Lz
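The dataloader settings described in this commit can be sketched as a small override helper (the function name is hypothetical, not from the PR):

```python
# Sketch (assumed helper): compute MPS-specific dataloader overrides to
# merge into TrainingArguments kwargs before training starts.
def mps_dataloader_overrides(device_type: str) -> dict:
    """Return dataloader settings that are safe for the given device."""
    if device_type == "mps":
        # Fork-based dataloader workers are unreliable on macOS, and
        # pin_memory is pointless on a unified-memory system.
        return {"dataloader_num_workers": 0, "dataloader_pin_memory": False}
    return {}
```

On CUDA the helper returns nothing, so existing worker and pin-memory defaults are untouched.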
- Restore QLoRA bf16 default to True for CUDA (MPS override in training_service already forces False)
- Add psutil to pyproject.toml dependencies (used by device_utils)
- Remove bitsandbytes from requirements.txt for consistency
- Add _BNB_AVAILABLE guards in legacy tuners to prevent NameError when bitsandbytes is not installed

https://claude.ai/code/session_01RaPUzQs3pZjM2i9JoZ66Lz
BitsAndBytesConfig imports from transformers even without bitsandbytes installed. Check for the actual bitsandbytes package to ensure the guard correctly detects when quantization is unavailable. https://claude.ai/code/session_01RaPUzQs3pZjM2i9JoZ66Lz
This env var must be set before `import torch` to take effect. Move it to app.py and cli.py entry points which run before any transitive torch imports from routers/strategies/providers. https://claude.ai/code/session_01RaPUzQs3pZjM2i9JoZ66Lz
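The entry-point ordering fix can be sketched as follows (a sketch of what app.py/cli.py would do at the very top, before any transitive torch import):

```python
# Must run before `import torch`: torch's MPS allocator reads this
# environment variable at import time, so setting it later has no effect.
import os

os.environ.setdefault("PYTORCH_MPS_HIGH_WATERMARK_RATIO", "0.0")

# Only now is it safe to import torch, directly or transitively via
# routers / strategies / providers:
# import torch
```

`setdefault` keeps a value the user exported themselves, while still providing the MPS-safe default.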
Previously only disabled use_4bit/use_8bit when they were already True. Now always sets them to False on MPS regardless, preventing any path where quantization could be attempted without bitsandbytes. Also changed QuantizationFactory.create_config default to False since the schema already provides True for CUDA users. https://claude.ai/code/session_01RaPUzQs3pZjM2i9JoZ66Lz
Resolve conflicts keeping MPS support while incorporating main's changes:

- Accept deletion of legacy tuners (CausalLLMTuner, QATuner, Seq2SeqLMTuner)
- Keep graceful bitsandbytes degradation (return None vs crash) for MPS
- Merge TensorBoard launch with pre-training cache clear
- Add model_validator import alongside Literal for device enum
- Keep MPS device detection in chat_playground
- Use main's [quantization] extra name (renamed from [cuda])
- Keep MPS documentation in README and troubleshooting docs

https://claude.ai/code/session_01RaPUzQs3pZjM2i9JoZ66Lz
fp16/bf16 TrainingArguments flags trigger CUDA-only Accelerate paths. On MPS the model dtype is already set natively at load time via torch_dtype=float16, so no precision flag is needed. Also added a second guard in _adjust_config_for_dtype to catch any remaining paths. Reduced SQLite pool_size to 1 to avoid multiprocessing conflicts that surface during MPS dataloader teardown.
Paths pasted with a trailing newline or space would fail the file existence check. Now strips before validating and before returning so downstream code receives a clean path.
Validation endpoint returns fields at the top level of the response object, not nested under result.validation. Updates field access to match the actual API shape.
Pins react-scripts to 5.0.1 (removing semver range) to prevent unexpected upgrades. Includes updated package-lock and rebuilt static assets.
Adds .venv to .gitignore so virtual environments aren't accidentally staged. Changes uvicorn host from 0.0.0.0 to 127.0.0.1 to restrict the dev server to local access only.
Add comprehensive Apple MPS support with memory management
Pull request overview
This PR adds Metal Performance Shaders (MPS) backend support for Apple Silicon Macs, enabling LLM fine-tuning on M1/M2/M3/M4 hardware. It introduces a device abstraction layer (device_utils.py), updates hardware detection to recognize MPS, and wires MPS-specific configuration overrides through the training pipeline. The HuggingFace provider is updated for MPS-compatible model loading, while Unsloth is blocked on MPS. Comprehensive documentation for macOS MPS setup is included.
Changes:
- New device resolution module (`device_utils.py`) with auto-detection hierarchy (CUDA → MPS → CPU), MPS memory management, and device-agnostic cache clearing
- Training service and provider updates to handle MPS-specific constraints: disabled quantization, optimizer fallback, native float16 loading, and conditional bitsandbytes imports
- Documentation additions including macOS installation guide, troubleshooting, and updated platform support tables across README and provider docs
Reviewed changes
Copilot reviewed 32 out of 42 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| `ModelForge/utilities/device_utils.py` | New module: device resolution, MPS memory config, cache management |
| `ModelForge/services/training_service.py` | MPS config overrides, device resolution wiring, Unsloth+MPS validation |
| `ModelForge/providers/huggingface_provider.py` | MPS-optimized model loading (float16, CPU→MPS transfer) |
| `ModelForge/providers/unsloth_provider.py` | MPS guard (blocks Unsloth on MPS hardware) |
| `ModelForge/utilities/hardware_detection/hardware_detector.py` | MPS detection as CUDA fallback in hardware profiling |
| `ModelForge/schemas/training_schemas.py` | Added device field with Literal type |
| `ModelForge/utilities/finetuning/quantization.py` | Conditional bitsandbytes import, graceful degradation |
| `ModelForge/strategies/sft_strategy.py` | Default optimizer change, dataloader settings |
| `ModelForge/strategies/qlora_strategy.py` | Default optimizer change, dataloader settings |
| `ModelForge/utilities/finetuning/settings_builder.py` | Default optimizer change |
| `ModelForge/utilities/chat_playground.py` | MPS device detection, guarded CUDA cache clearing |
| `ModelForge/cli.py` / `ModelForge/app.py` | Early MPS fallback env var, host change |
| `ModelForge/database/database_manager.py` | SQLite pool_size reduction |
| `ModelForge/notebook_cli/prompts.py` | Path stripping fix |
| `ModelForge/services/hardware_service.py` | Expose device_type in hardware specs |
| `tests/test_device_utils.py` | Unit tests for device resolution |
| `requirements.txt` | New requirements file (encoding issue) |
| `pyproject.toml` | Added psutil dependency |
| `docs/installation/macos-mps.md` | New macOS MPS installation guide |
| `docs/providers/huggingface.md` | MPS platform support, MPS config examples |
| `docs/providers/unsloth.md` | MPS unsupported notes |
| `docs/troubleshooting/common-issues.md` | MPS troubleshooting section |
| `README.md` / `docs/README.md` | Updated platform support tables |
| Frontend files | Build artifacts update, validation response fix |
| `.gitignore` | Added .venv |
```python
try:
    import torch
    if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        # Check if we're actually trying to use MPS
        # (this is a secondary check, primary check is in training_service)
        raise ProviderError(
            "Unsloth provider is not supported on Apple MPS devices. "
            "Unsloth requires NVIDIA CUDA GPUs for its optimized kernels. "
            "Please switch to 'huggingface' provider or use 'device: cuda' with an NVIDIA GPU."
        )
except ImportError:
    pass  # torch not available, will fail below anyway
```
Bug: This check unconditionally blocks Unsloth on any system where MPS is available, even when the user explicitly set device: cuda (e.g., a Mac with an external eGPU, or a theoretical dual-GPU setup). The comment says "this is a secondary check" but unlike the primary check in training_service.py (which correctly checks device_type == "mps"), this check looks at hardware availability rather than the user's device selection.
This should accept a device_type parameter (or check the config/kwargs) so that it only raises when the user is actually targeting MPS. As it stands, it will always block Unsloth on Apple Silicon, regardless of the chosen device.
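A minimal sketch of the suggested fix, gating on the resolved device target rather than hardware availability (`ProviderError` stands in for the project's exception type):

```python
class ProviderError(RuntimeError):
    """Stand-in for the project's provider exception type."""

def check_unsloth_supported(device_type: str) -> None:
    # Raise only when the user is actually targeting MPS; a Mac where MPS
    # hardware merely exists but device_type == "cuda" passes through.
    if device_type == "mps":
        raise ProviderError(
            "Unsloth provider is not supported on Apple MPS devices. "
            "Please switch to 'huggingface' provider or use 'device: cuda'."
        )
```

This mirrors the primary check in training_service.py, which already branches on `device_type == "mps"`.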
```python
# If no CUDA GPU, try to detect Apple MPS
if not cuda_available:
    try:
        import torch
```
torch is imported at module level on line 3, but then re-imported locally on line 117 (import torch) and line 298. The local imports are redundant and should be removed since the module-level import already makes torch available throughout the file.
All review feedback has been addressed in commit 69ee0d5:
Bugs Fixed:
- Unsloth now checks the `device_type` parameter instead of MPS hardware availability
- Removed redundant torch imports in hardware_detector.py
- Fixed MPS availability check to use a local variable

Breaking Changes Reverted:
- Optimizer defaults restored to `paged_adamw_32bit` for CUDA users
- Settings builder, SFT, and QLoRA strategies all use original defaults
- MPS override in training_service handles the switch for MPS devices

Documentation Corrected:
- Removed misleading `fp16: true` recommendations (automatically disabled on MPS)
- Added M4 to README for consistency
- Removed redundant device validator
Enhancements:
- Added MPS garbage collection in chat_playground cleanup
```python
if not cuda_available and not (hasattr(torch, 'backends') and hasattr(torch.backends, 'mps') and torch.backends.mps.is_available()):
    raise RuntimeError(
        "No GPU detected. ModelForge requires either an NVIDIA GPU with CUDA "
        "or Apple Silicon with MPS support. If you have a GPU, please ensure "
        "you have the correct drivers installed."
    )
```
Bug: This fallback check re-evaluates torch.backends.mps.is_available(), but if MPS detection succeeded earlier (lines 118-136), the function already returned. This line is only reached when MPS was either unavailable or when MPS detection threw an exception (caught on line 137). In the exception case, calling torch.backends.mps.is_available() again here could throw the same exception, resulting in an unhandled error instead of the intended RuntimeError.
A simpler and more robust approach: track MPS availability in a local variable (like cuda_available) and check that variable here instead of re-calling torch.backends.mps.is_available().
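The suggested restructuring can be sketched like this; the MPS probe is injected as a callable so the detection logic stays testable without torch:

```python
def detect_gpu(cuda_available: bool, probe_mps) -> str:
    """Probe each backend once, then branch on plain booleans so a flaky
    MPS probe cannot raise a second, unhandled exception later."""
    try:
        mps_available = probe_mps()
    except Exception:
        mps_available = False  # a failed probe means MPS is unusable
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    raise RuntimeError(
        "No GPU detected. ModelForge requires either an NVIDIA GPU with "
        "CUDA or Apple Silicon with MPS support."
    )
```

In the real code, `probe_mps` would be `torch.backends.mps.is_available`; the point is that it is called exactly once.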
```diff
 gradient_accumulation_steps=config.get("gradient_accumulation_steps", 4),
 # QLoRA uses paged optimizers for memory efficiency
-optim=config.get("optim", "paged_adamw_32bit"),
+optim=config.get("optim", "adamw_torch"),
```
Breaking change for existing CUDA users: Changing the QLoRA default optimizer from paged_adamw_32bit to adamw_torch increases memory usage for CUDA QLoRA workflows. The paged optimizer from bitsandbytes is specifically designed for QLoRA's memory-efficient training and is the recommended optimizer in the HuggingFace QLoRA documentation.
A better approach would be to keep paged_adamw_32bit as the default for CUDA and only override to adamw_torch for MPS in the training_service.py MPS adjustment block (which already does this on line 198-205). This way existing CUDA users get optimal performance while MPS users get compatibility.
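A sketch of the reviewer's suggestion (hypothetical helper name; the real MPS adjustment block lives in training_service.py):

```python
def adjust_config_for_device(config: dict, device_type: str) -> dict:
    """Keep the CUDA-optimal default and swap it only on MPS."""
    adjusted = dict(config)
    # Strategy default stays paged_adamw_32bit so CUDA QLoRA keeps the
    # memory-efficient paged optimizer from bitsandbytes.
    adjusted.setdefault("optim", "paged_adamw_32bit")
    if device_type == "mps":
        # bitsandbytes paged optimizers require CUDA kernels.
        adjusted["optim"] = "adamw_torch"
    return adjusted
```

Existing CUDA users see no change, while MPS users get the compatible optimizer without touching the strategy defaults.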
```diff
 per_device_eval_batch_size=config.get("per_device_eval_batch_size", 1),
 gradient_accumulation_steps=config.get("gradient_accumulation_steps", 4),
-optim=config.get("optim", "paged_adamw_32bit"),
+optim=config.get("optim", "adamw_torch"),
```
Breaking change for existing CUDA users: Same issue as in qlora_strategy.py — changing the SFT strategy default optimizer from paged_adamw_32bit to adamw_torch increases memory usage for CUDA workflows. The paged optimizer fallback to adamw_torch is already handled for MPS in training_service.py (lines 198-205), so the default here should remain paged_adamw_32bit to preserve existing CUDA behavior.
README.md
Outdated
```markdown
### Installation
- **[Windows Installation](https://github.com/forgeopus/modelforge/tree/main/docs/installation/windows.md)** - Complete Windows setup (including WSL and Docker)
- **[Linux Installation](https://github.com/forgeopus/modelforge/tree/main/docs/installation/linux.md)** - Linux setup guide
- **[macOS (Apple Silicon) Installation](https://github.com/forgeopus/modelforge/tree/main/docs/installation/macos-mps.md)** - Setup for M1/M2/M3 Macs with MPS support
```
Inconsistency: This line says "M1/M2/M3 Macs" but the rest of the documentation (including the PR description, docs/installation/macos-mps.md line 3, and README.md line 50) consistently references "M1/M2/M3/M4". M4 should be included here for consistency.
This feedback was already addressed in commit 69ee0d5. The README.md line 111 now correctly references "M1/M2/M3/M4 Macs" for consistency with the rest of the documentation.
docs/installation/macos-mps.md
Outdated
```markdown
- **HuggingFace provider only** - Unsloth provider is not supported on MPS
- **No 4-bit/8-bit quantization** - bitsandbytes library doesn't support MPS
- **Smaller models recommended** - Unified memory architecture limits model sizes
- **FP16 precision** - Use `fp16: true` for best performance
```
Documentation is inconsistent with the code. Line 13 says "Use fp16: true for best performance," but training_service.py (line 195) explicitly sets config["fp16"] = False on MPS, because fp16 triggers Accelerate's mixed-precision mode which is CUDA-only. The model is loaded with torch_dtype=float16 natively, but the TrainingArguments fp16 flag is always forced to False.
This recommendation (and similar ones on lines 89 and 119) will confuse users who set fp16: true in their config only to have it silently overridden to False. The documentation should explain that the model runs in float16 natively on MPS, and that the fp16 training argument is not needed (and will be ignored).
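A config example consistent with the code's actual behavior would omit the precision flags entirely (an assumed shape, mirroring the JSON examples elsewhere in the docs):

```json
{
  "device": "mps",
  "use_4bit": false,
  "use_8bit": false
}
```

The model runs in float16 natively on MPS via `torch_dtype=float16`, so no `fp16`/`bf16` key is needed, and any value given there is overridden to `false`.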
```python
@field_validator("device")
@classmethod
def validate_device(cls, v):
    if v not in VALID_DEVICES:
        raise ValueError(
            f"Invalid device: {v}. Must be one of {VALID_DEVICES}"
        )
    return v
```
The device field is already typed as Literal["auto", "cuda", "mps", "cpu"] on line 25, which means Pydantic automatically validates that the value is one of those four strings. This @field_validator for device is redundant — Pydantic's Literal type already raises a ValidationError for invalid values. This doesn't cause a bug, but it's unnecessary code duplication.
```python
# Clear device cache if CUDA
if self.device == "cuda" and torch.cuda.is_available():
    torch.cuda.empty_cache()
```
Missing MPS cache cleanup: The clean_up method now correctly guards torch.cuda.empty_cache() behind a CUDA check, but doesn't add equivalent cleanup for MPS. Since MPS devices are now supported in the constructor, MPS cache should also be cleared here. Consider using the new clear_device_cache utility from device_utils.py, or add an elif self.device == "mps" branch that calls gc.collect() and torch.mps.empty_cache().
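The suggested cleanup shape can be sketched as follows. To stay runnable without torch, the device-specific calls are returned as names; the real method would invoke `torch.cuda.empty_cache()` / `torch.mps.empty_cache()` at the marked points:

```python
import gc

def cleanup_actions(device: str) -> list:
    """Decide which cache-clearing calls apply for a device."""
    actions = ["gc.collect"]
    if device == "cuda":
        actions.append("torch.cuda.empty_cache")  # CUDA-only call
    elif device == "mps":
        actions.append("torch.mps.empty_cache")   # MPS-only call
    gc.collect()  # safe on every device
    return actions
```

On CPU only garbage collection runs; each GPU backend gets its own cache-clear call, matching what `clear_device_cache` in device_utils.py does.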
docs/providers/huggingface.md
Outdated
| "device": "mps", | ||
| "use_4bit": false, | ||
| "use_8bit": false, | ||
| "fp16": true, |
Same documentation inconsistency as in macos-mps.md: "fp16": true is recommended here, but training_service.py forces fp16=False and bf16=False on MPS. The model dtype is set natively at load time (torch_dtype=float16), so this config setting is silently overridden. The documentation should clarify that the fp16 flag is not needed on MPS and will be ignored.
…cumentation

- Fix Unsloth provider to check device_type param instead of hardware availability
- Remove redundant torch import in hardware_detector.py
- Fix MPS availability check to use local variable instead of re-calling
- Revert optimizer defaults to paged_adamw_32bit for CUDA compatibility
- Add M4 to README platform support
- Remove redundant device validator (Literal already validates)
- Add MPS cache cleanup to chat_playground
- Fix documentation: remove fp16 recommendations (automatically disabled on MPS)

Co-authored-by: devgunnu <139300251+devgunnu@users.noreply.github.com>
```python
# Store device info in config for downstream use
config["device_type"] = device_type
config["device_obj"] = device_obj
```
Storing device_obj (a torch.device object) in the config dict is unnecessary — it's never read back from config["device_obj"] anywhere in the codebase. It also requires the workaround on line 461 (json.dumps({k: str(v) if not isinstance(v, (str, int, float, bool, type(None))) else v ...})) to avoid a TypeError: Object of type device is not JSON serializable. Removing this line (and just keeping device_type) would simplify the code and avoid the need for the complex serialization logic downstream.
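The simplification can be sketched as (hypothetical helper name):

```python
import json

def store_device_info(config: dict, device_type: str) -> dict:
    """Persist only the device_type string so the config stays
    JSON-serializable, making the str(v) fallback in json.dumps
    unnecessary."""
    config = dict(config)
    config["device_type"] = device_type  # plain string, not torch.device
    return config
```

Any code that needs a `torch.device` can rebuild it cheaply with `torch.device(config["device_type"])` instead of reading a stored object.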
```diff
 # Create quantization config (default False — schema provides True for CUDA;
 # MPS override above already forces False)
 quant_config = QuantizationFactory.create_config(
-    use_4bit=config.get("use_4bit", True),
+    use_4bit=config.get("use_4bit", False),
```
The fallback default for use_4bit was changed from True to False. This is inconsistent with the schema default (TrainingConfig.use_4bit = True at line 33 of training_schemas.py) and with the CLI wizard default ("use_4bit": hyperparams.get("use_4bit", True) in wizard.py:226). While the comment explains the rationale ("schema provides True for CUDA"), having the fallback disagree with the schema creates a subtle behavioral difference for any caller that passes a config dict without use_4bit. The fallback should match the schema default of True to maintain consistent behavior, since the MPS override block above already forces it to False when device_type == "mps".
```json
{
  "use_4bit": false,
  "use_8bit": false,
  "fp16": true
}
```
The troubleshooting section recommends "fp16": true for MPS when disabling quantization, but the code in training_service.py (lines 195-196) explicitly forces both fp16 and bf16 to False on MPS because Accelerate's mixed-precision mode is CUDA-only. The macos-mps.md installation guide (line 119) also states "Do not set fp16: true or bf16: true." This documentation here is contradictory and will confuse MPS users. The "fp16": true should be removed from this JSON example.
```json
{
  "use_4bit": false,
  "use_8bit": false,
  "fp16": true
}
```
Same issue as in common-issues.md: this troubleshooting example recommends "fp16": true but the code forces fp16 to False on MPS (see training_service.py lines 195-196 and _auto_detect_precision_settings). This JSON example contradicts the statement on line 119 of this same file which says "Do not set fp16: true or bf16: true - they will be ignored." The "fp16": true line should be removed from this example.
| elif self.device == "mps": | ||
| # MPS doesn't have an empty_cache() method yet, but we can help with GC | ||
| import gc | ||
| gc.collect() |
The comment "MPS doesn't have an empty_cache() method yet" is incorrect — torch.mps.empty_cache() has been available since PyTorch 2.0. This PR itself uses torch.mps.empty_cache() in device_utils.py (line 243) and huggingface_provider.py (line 105). For consistency, this cleanup method should also call torch.mps.empty_cache() after gc.collect(), and the misleading comment should be corrected.
Changing the default to False caused CUDA users to silently train in full precision when use_4bit was not explicitly set, leading to higher VRAM usage and potential OOM on consumer GPUs. MPS path already forces use_4bit=False via the device override block above.
Changes Made

Bug Fixes
- Unsloth provider now checks the `device_type` parameter instead of hardware availability, allowing Unsloth on systems with MPS when the user explicitly selects CUDA

Breaking Change Reverted
- Optimizer defaults restored to `paged_adamw_32bit` for CUDA users (the MPS override in training_service handles the switch to `adamw_torch` for MPS)

Documentation Fixes
- Removed `fp16: true` recommendations; clarified that training precision flags are automatically disabled on MPS (the model loads natively in float16)

All changes maintain backward compatibility with existing CUDA workflows while properly supporting MPS devices.
Original prompt
Add MPS support for Apple Silicon devices
1. Define a device abstraction in config
Goal: introduce `"device"` with `"auto" | "cuda" | "mps" | "cpu"` and make it flow through the training stack.

In the backend config schema (the Pydantic model used for training config, referenced in `docs/api-reference/training-config.md`), add:
`device: Literal["auto", "cuda", "mps", "cpu"] = "auto"`

In the REST API layer (the file referenced by `docs/api-reference/rest-api.md`, at the endpoint that starts a training job):
- Accept `device` in the request body and pass it into the internal training config object.
- If `device` is missing, default to `"auto"`.

Implement a device resolution helper (e.g., in a `utils/devices.py` or whatever shared utility module exists):
- `device == "cuda"`: assert `torch.cuda.is_available()`.
- `device == "mps"`: assert `torch.backends.mps.is_available()`.
- `device == "cpu"`: return `torch.device("cpu")`.
- `device == "auto"`: if `cuda` is available → `cuda`; else if `mps` is available → `mps`; else `cpu`.
- Return both a `torch.device` and a string name so you can pass it to the HF Trainer.

2. Wire device into HuggingFace provider

Goal: make the HuggingFace provider fully device-agnostic instead of CUDA-specific.

Locate the HuggingFace provider implementation (see `docs/providers/huggingface.md`; class/file names are referenced there). In the provider's `train`/`prepare_model` path:
- Replace hardcoded `device="cuda"` or `torch.device("cuda")` with the resolved device from the config helper above.
- For `Trainer`/`SFTTrainer`/TRL trainers, set `training_args.device` or `training_args.device_map` using the resolved device string: `"cuda"`, `"mps"`, or `"cpu"`.

Remove/guard direct `torch.cuda.*` calls inside the HuggingFace provider:
- Replace `torch.cuda.is_available()` checks with calls into your new device helper.
- Guard `torch.cuda.empty_cache()` with a conditional that only calls it when the resolved device type is `"cuda"`.
- `torch.cuda.amp.autocast` usage should be replaced with `torch.autocast(device_type=device_type, dtype=...)` so it works on `"cuda"` and `"cpu"`; if `mps` autocast is flaky, you can disable autocast when `device_type == "mps"`.

Bitsandbytes / quantization in the HF provider:
- Locate where bitsandbytes options are set (`load_in_4bit=True`, `bnb_4bit_compute_dtype`, etc.).
- When `device_type == "mps"`, either set `load_in_4bit=False` and use fp16/bf16 (if supported), or raise: "4-bit quantization via bitsandbytes is not supported on MPS; use 'precision: fp16' and 'device: mps'".

3. Disable Unsloth on MPS

Goal: prevent users from selecting Unsloth with MPS, since its kernels assume CUDA.

In the Unsloth provider implementation (see `docs/providers/unsloth.md` for location), in `train` (or the constructor), insert a check: on `"mps"`, raise a fast, human-readable error: "Unsloth provider is not supported on Apple MPS. Please switch provider to 'huggingface' or device to 'cuda'."

In any provider selection logic (factory/registry): if `provider == "unsloth"` and `device == "mps"` in the config, reject the config before jobs start, with HTTP 400 and the same message.

Update `docs/providers/unsloth.md` with a short "Unsupported on macOS/MPS" note in the constraints section.

4. Extend hardware detection / profiles

Goal: make "hardware-aware" logic MPS-aware, not CUDA-only.

Locate the hardware detection logic referenced under `docs/configuration/hardware-profiles.md` ("Auto-detects GPU and recommends optimal models"). In that module:
- Find where it calls `torch.cuda.is_available()` and queries CUDA properties.
- Add a branch for `torch.backends.mps.is_available()`: set `device_type = "mps"` (use `psutil.virtual_memory().total` if there's no direct MPS memory query).

In any "recommended models" logic: