- [03/23/2026] 🎉 MagiCompiler is officially open-sourced! Delivering whole-graph compilation for multi-modality inference and FSDP-aware whole-layer compilation for large model training.
MagiCompiler is an advanced compiler and runtime augmentation framework built on top of torch.compile. Designed specifically for large-scale Transformer-like architectures, it addresses the critical bottlenecks of memory walls and operator overheads.
By stepping beyond traditional local operator optimization, MagiCompiler introduces system-level optimizations, seamlessly accelerating both training and multi-modality inference workloads with minimal code intrusion.
"Reimagining the compiler: from generating kernels to orchestrating the entire dataflow."
MagiCompiler's core philosophy is Compiler as Manager. We believe a modern deep learning compiler should not be restricted to mere kernel fusion. Instead, it acts as a global manager that owns the full lifecycle of execution. MagiCompiler actively manages subgraph dispatching, dynamically orchestrates dataflow (like offloading and prefetching), and controls memory allocation, ensuring optimal balance between compute efficiency and memory footprint.
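The "manager" idea can be sketched in a few lines of plain Python. Everything below (the `Subgraph` class, the memory-budget heuristic, the action names) is an illustrative toy, not MagiCompiler's actual API; the point is that the compiler's output is a per-subgraph execution plan (keep weights resident vs. offload-and-prefetch), not just a set of fused kernels.

```python
from dataclasses import dataclass

@dataclass
class Subgraph:
    """Toy stand-in for a captured FX subgraph (hypothetical, illustration only)."""
    name: str
    mem_mb: int  # parameter/activation footprint of this subgraph

def plan(subgraphs, mem_budget_mb):
    """Sketch of a 'compiler as manager' pass: besides compiling each subgraph,
    the manager decides whether its weights stay resident on the GPU or are
    offloaded to host memory and prefetched just before they are needed."""
    actions, resident = [], 0
    for sg in subgraphs:
        if resident + sg.mem_mb <= mem_budget_mb:
            resident += sg.mem_mb
            actions.append((sg.name, "compile+keep_resident"))
        else:
            actions.append((sg.name, "compile+offload_prefetch"))
    return actions

blocks = [Subgraph("block0", 600), Subgraph("block1", 600), Subgraph("block2", 600)]
print(plan(blocks, mem_budget_mb=1500))
```

With a 1500 MB budget, the first two 600 MB blocks stay resident and the third is scheduled for offload-and-prefetch; a real manager would additionally order the prefetches against the compute stream.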
Tailored for Transformer-like architectures with scenario-specific strategies:
- Inference: Achieves full-graph capture across Transformer boundaries, maximizing kernel fusion scope.
- Training: Introduces FSDP-aware layer-wise compilation. Unlocks aggressive cross-op fusion while keeping distributed parameter sharding entirely transparent.
No complex model refactoring needed. Just two decorators deliver 20%+ extra speedups out of the box, seamlessly integrating into SOTA multi-modality frameworks.
For memory-constrained setups, our built-in selective offloading policy perfectly overlaps H2D transfers with computation, eliminating pipeline bubbles.
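A back-of-the-envelope model shows why overlapping works: with a dedicated copy stream, a layer's H2D transfer only costs wall-clock time when it is longer than the compute it hides behind. The sketch below is a simplified two-engine queueing model with made-up per-layer costs, not MagiCompiler's actual scheduler:

```python
def serial_time(layers):
    """No overlap: each layer waits for its own H2D transfer, then computes."""
    return sum(t + c for t, c in layers)

def overlapped_time(layers):
    """Prefetching sketch: the copy engine streams transfers back to back,
    while each layer's compute starts once both its transfer and the
    previous layer's compute have finished."""
    transfer_end = compute_end = 0.0
    for t, c in layers:
        transfer_end += t                          # copy engine is sequential
        compute_end = max(transfer_end, compute_end) + c
    return compute_end

layers = [(2, 5)] * 4  # (transfer_time, compute_time) per layer, arbitrary units
print(serial_time(layers), overlapped_time(layers))
```

Here per-layer transfer (2) is shorter than compute (5), so only the first transfer is exposed: 2 + 4×5 = 22 units overlapped versus 28 serial, i.e. the pipeline bubbles vanish.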
Say goodbye to manual torch.utils.checkpoint. MagiCompiler automatically saves compute-bound ops (e.g., MatMul, Attention) and recomputes memory-bound ones, slashing peak memory without sacrificing throughput.
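The save-vs-recompute policy can be illustrated with a toy planner (the op names, the compute-bound set, and the byte counts are made up for illustration; MagiCompiler's real heuristics are internal): compute-bound outputs are saved, memory-bound ones are dropped and recomputed during backward, and the dropped bytes are the peak-memory saving.

```python
# Hypothetical classification: ops expensive to redo keep their activations.
COMPUTE_BOUND = {"matmul", "attention"}

def checkpoint_plan(ops):
    """Partition activations into saved (compute-bound) and recomputed
    (memory-bound) sets, and report the resulting peak-memory saving."""
    saved, recomputed = [], []
    for name, nbytes in ops:
        (saved if name in COMPUTE_BOUND else recomputed).append((name, nbytes))
    peak_saving = sum(b for _, b in recomputed)  # bytes freed until backward
    return saved, recomputed, peak_saving

ops = [("matmul", 64), ("gelu", 64), ("attention", 32), ("dropout", 32)]
saved, recomputed, saving = checkpoint_plan(ops)
print(saving)  # 96
```

Cheap elementwise ops like GeLU and dropout are rerun in backward at negligible cost, while MatMul and Attention outputs stay cached, which is why throughput is preserved.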
Toggle MAGI_ENABLE_FX_GRAPH_VIZ=1 and let our powerful introspection toolchain do the rest. All implicit artifacts from graphs to kernels are automatically dumped as human-readable files, making compiler debugging highly accessible.
Requirements:
- Python >= 3.12
- PyTorch >= 2.9
- CUDA Toolkit
# Step 1: Clone the repo
git clone https://github.com/SandAI-org/MagiCompiler.git
cd MagiCompiler
# Step 2: System dependencies (optional, for FX graph visualization; Debian/Ubuntu)
sudo apt update && sudo apt install -y graphviz
# Step 3: Python dependencies
pip install -r requirements.txt
# Step 4: Install MagiCompiler (pick one)
pip install . # End users (recommended)
# pip install -e . --no-build-isolation --config-settings editable_mode=compat # Developer / editable

Remove scattered torch.compile or torch.compiler.disable calls. Decorate your core Transformer block once for automatic full-graph capture and dynamic shape support (defaulting to dim 0).
import torch
from torch import nn
from magi_compiler import magi_compile
# Decorate your core module once. No more scattered compile tweaks!
@magi_compile
class TransformerBlock(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.attn = Attention(hidden_dim)
        self.mlp = MLP(hidden_dim)

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None) -> torch.Tensor:
        x = x + self.attn(x, mask)
        x = x + self.mlp(x)
        return x

model = TransformerBlock(hidden_dim=1024).cuda()

# Execute normally - whole-graph compilation handles dynamic batches automatically!
out = model(torch.randn(4, 128, 1024, device="cuda"), None)
out = model(torch.randn(8, 128, 1024, device="cuda"), None)

Using custom kernels (FlashAttention, MoE routers) that break FX tracing? Don't disable compilation. Wrap them to teach the compiler how to handle them during graph partitioning and recomputation.
from magi_compiler import magi_register_custom_op
@magi_register_custom_op(
    name="athena::flash_attn",
    infer_output_meta_fn=["q"],   # Output shape matches parameter 'q'
    is_subgraph_boundary=True,    # Split graph here for subgraph compilation
    is_compute_sensitive=True,    # Retain this output during recomputation
)
def flash_attn(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    ...  # Your custom kernel or C++ extension

Explore magi_compiler/config.py for power-user features like custom backend toggles and fine-grained memory management. (Comprehensive guides for popular training/inference frameworks are coming soon!)
On a single NVIDIA H100, MagiCompiler outperforms current SOTA solutions (like LightX2V) by 9% to 26% across mainstream open-source video generation models.
Thanks to our underlying JIT offloading engine, daVinci-MagiHuman achieves near real-time speeds, even on heavily VRAM-constrained consumer GPUs.
We are actively developing MagiCompiler. Here is a glimpse into our upcoming milestones:
- Ecosystem Integration: Benchmarks and out-of-the-box integration guides for popular frameworks (e.g., sglang-diffusion, vllm-omni, and LLaMA training).
- Official Hub & Tech Blog: A dedicated website for advanced tutorials, documentation, and frontier engineering insights.
- Hardware-Aware Auto-Scheduler: An adaptive engine that dynamically orchestrates optimal strategies (auto-recomputation boundaries, offloading) based on your hardware constraints.
- Next-Gen Custom Backend (v2.0): Pushing hardware limits with extreme kernel-level efficiency, native distributed communication and MegaKernels.
If you find MagiCompiler useful in your research or production, please consider citing us:
@software{magi_compiler_2026,
  author = {Hongyu Jia and Zhiyao Cen and Taoran Wang and Yunbo Zhang},
  title  = {MagiCompiler: Break the Boundaries of Local Compilation for Large Models},
  year   = {2026},
  url    = {https://github.com/SandAI-org/MagiCompiler}
}

MagiCompiler is deeply inspired by and builds upon the shoulders of giants. We extend our heartfelt gratitude to the PyTorch team for their foundational work on torch.compile and torch.fx, and to the vLLM community for their pioneering contributions to large model inference.
We are moving fast, and we want you on board! MagiCompiler is under rapid development. If you are passionate about pushing the limits of large model compilation, we'd love to have you with us. From opening issues and discussing architectures to submitting core PRs, every contribution matters. Let's engineer the future of AI infrastructure together!
This project is licensed under the Apache License 2.0.


