A lightweight vLLM-Omni-style diffusion implementation built around Wan2.2-TI2V-5B-Diffusers.
- 🚀 Real engine boundaries - explicit request -> scheduler -> runner -> pipeline
- 📖 Readable codebase - core implementation in ~1,079 lines of Python for studying diffusion serving
- ⚡ Step execution - preserves the prepare_encode -> denoise_step -> step_scheduler -> post_decode contract from vllm-omni
- 🧠 Minimal reuse path - CPU prompt-embedding cache as the diffusion analogue of prefix/KV reuse
- 💾 Practical memory path - explicit module-level CPU offload for a 24GB 3090
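To make the step-execution contract concrete, here is a minimal sketch of the four stages driven by an explicit loop. Only the stage names (prepare_encode, denoise_step, step_scheduler, post_decode) come from this README; the `RequestState` and `ToyPipeline` classes, their fields, and the toy arithmetic are hypothetical stand-ins, not the project's actual code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RequestState:
    """Hypothetical per-request state carried between steps."""
    prompt: str
    num_steps: int
    step: int = 0
    latents: List[float] = field(default_factory=list)
    sigmas: List[float] = field(default_factory=list)


class ToyPipeline:
    """Stand-in pipeline; a real one would call the text encoder, transformer, and VAE."""

    def prepare_encode(self, prompt: str, num_steps: int) -> RequestState:
        # Build initial latents and a linear sigma schedule from 1.0 down to 0.0.
        sigmas = [1.0 - i / num_steps for i in range(num_steps + 1)]
        latents = [float(len(prompt))] * 4
        return RequestState(prompt, num_steps, latents=latents, sigmas=sigmas)

    def denoise_step(self, state: RequestState) -> List[float]:
        # Toy "model": the predicted velocity is just the current latent.
        return list(state.latents)

    def step_scheduler(self, state: RequestState, velocity: List[float]) -> None:
        # Euler update: x <- x + (sigma_next - sigma) * v, then advance the step index.
        dt = state.sigmas[state.step + 1] - state.sigmas[state.step]
        state.latents = [x + dt * v for x, v in zip(state.latents, velocity)]
        state.step += 1

    def post_decode(self, state: RequestState) -> List[float]:
        # A real pipeline would decode latents to frames here.
        return [round(x, 6) for x in state.latents]


def run(pipe: ToyPipeline, prompt: str, num_steps: int) -> List[float]:
    """Drive one request through the four stages, one denoise step at a time."""
    state = pipe.prepare_encode(prompt, num_steps)
    while state.step < state.num_steps:
        velocity = pipe.denoise_step(state)
        pipe.step_scheduler(state, velocity)
    return pipe.post_decode(state)
```

Because the loop lives outside the pipeline, a scheduler can interleave or pause requests between any two steps, which a single opaque pipeline call does not allow.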
This project keeps the core vllm-omni diffusion shape:
- explicit scheduler-owned request lifecycle
- per-request mutable runner state
- step-wise denoising instead of one giant pipe(...) call
- a dedicated pipeline adapter instead of hiding all logic inside Diffusers
It does not try to reimplement distributed executors, cache backends, tensor parallel diffusion, or multi-model orchestration.
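The scheduler-owned lifecycle described above can be sketched as follows. This is an illustrative toy under stated assumptions: the `Request` and `StepScheduler` classes, their method names, and the `max_running` parameter are hypothetical and are not taken from the repository.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    request_id: str
    num_steps: int
    steps_done: int = 0


class StepScheduler:
    """Hypothetical scheduler that owns waiting/running/finished bookkeeping."""

    def __init__(self, max_running: int = 1) -> None:
        self.max_running = max_running
        self.waiting: Deque[Request] = deque()
        self.running: List[Request] = []
        self.finished: List[Request] = []

    def add_request(self, req: Request) -> None:
        self.waiting.append(req)

    def schedule(self) -> List[Request]:
        # Admit waiting requests while there is capacity.
        while self.waiting and len(self.running) < self.max_running:
            self.running.append(self.waiting.popleft())
        return list(self.running)

    def update(self, stepped: List[Request]) -> None:
        # Called after the runner has executed one denoise step per request.
        for req in stepped:
            req.steps_done += 1
            if req.steps_done >= req.num_steps:
                self.running.remove(req)
                self.finished.append(req)

    def has_unfinished(self) -> bool:
        return bool(self.waiting or self.running)


def engine_loop(scheduler: StepScheduler) -> int:
    """Drive the scheduler until every request finishes; return total steps run."""
    total_steps = 0
    while scheduler.has_unfinished():
        batch = scheduler.schedule()
        total_steps += len(batch)  # a real runner would denoise each request here
        scheduler.update(batch)
    return total_steps
```

With `max_running=1` this mirrors the single-process, one-step-at-a-time shape: each loop iteration advances exactly one request by exactly one denoising step.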
This project was validated with Python 3.10, a CUDA-capable NVIDIA GPU, and ffmpeg.
- Create an environment.
conda create -n nano-vllm-omni python=3.10 -y
conda activate nano-vllm-omni
- Install a CUDA-enabled PyTorch build that matches your system.
Example for CUDA 12.1:
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
If your machine uses a different CUDA version, use the selector on the official PyTorch site instead of this example.
- Install system ffmpeg.
Ubuntu:
sudo apt-get update
sudo apt-get install -y ffmpeg
- Install this project and the Hugging Face CLI.
This project currently depends on the Wan pipeline from the diffusers main branch, so pip install -e . will fetch diffusers from GitHub automatically.
python -m pip install -e .
python -m pip install huggingface_hub
This repo expects the Wan model under ./models/Wan2.2-TI2V-5B-Diffusers.
Create the directory and download the official Diffusers weights:
mkdir -p models
huggingface-cli download --resume-download Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--local-dir ./models/Wan2.2-TI2V-5B-Diffusers \
--local-dir-use-symlinks False
The repository already includes a demo image at ./assets/i2v_input.JPG, so no extra asset download is required for the default example.
The repo includes the official Wan TI2V sample input image under ./assets:
- i2v_input.JPG: the official Wan cat-on-surfboard example
Source and license details are listed in assets/README.md.
After the model is downloaded, this command should run end-to-end:
CUDA_VISIBLE_DEVICES=0 python example_wan22_i2v.py \
--model ./models/Wan2.2-TI2V-5B-Diffusers \
--image ./assets/i2v_input.JPG \
--preset quality \
--output ./output/example_wan22_i2v_quality.mp4
Or via the CLI entrypoint:
CUDA_VISIBLE_DEVICES=0 nano-vllm-omni \
--model ./models/Wan2.2-TI2V-5B-Diffusers \
--image ./assets/i2v_input.JPG \
--preset quality \
--output ./output/example_wan22_i2v_quality.mp4
The current defaults in nanovllm_omni/config.py point to these same repo-relative paths, so once the model is in place under ./models/Wan2.2-TI2V-5B-Diffusers, the explicit --model and --image flags become optional.
The CLI also accepts --negative-prompt. The default negative prompt already suppresses ghosting, double image, duplicate subject, and motion trails.
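As an illustration of how repo-relative defaults make the flags optional, here is a small argparse sketch. The flag names and the model/image paths come from this README; the default for --output and the use of `None` for --negative-prompt are assumptions, and the real entrypoint may be wired differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in this README."""
    parser = argparse.ArgumentParser(prog="nano-vllm-omni")
    parser.add_argument("--model", default="./models/Wan2.2-TI2V-5B-Diffusers")
    parser.add_argument("--image", default="./assets/i2v_input.JPG")
    parser.add_argument("--preset", default="quality")
    # Assumed default output path; the project's actual default may differ.
    parser.add_argument("--output", default="./output/example_wan22_i2v_quality.mp4")
    # Assumed convention: None means "fall back to the built-in negative prompt".
    parser.add_argument("--negative-prompt", default=None)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```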
- quality: target 480P-class area, 17 frames, 12 steps, flow_shift=3.0
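For readers unfamiliar with flow_shift, the sketch below shows a time-shift formula commonly used with flow-matching samplers, applied to a uniform 12-step schedule. Whether this project applies exactly this formula is an assumption; the function name `shifted_sigmas` is hypothetical.

```python
from typing import List


def shifted_sigmas(num_steps: int, flow_shift: float) -> List[float]:
    """Return num_steps + 1 sigmas from 1.0 down to 0.0 after time shifting."""
    sigmas = []
    for i in range(num_steps + 1):
        sigma = 1.0 - i / num_steps  # uniform schedule before shifting
        # Common flow-matching shift: sigma' = s * sigma / (1 + (s - 1) * sigma)
        sigmas.append(flow_shift * sigma / (1.0 + (flow_shift - 1.0) * sigma))
    return sigmas


if __name__ == "__main__":
    for sigma in shifted_sigmas(num_steps=12, flow_shift=3.0):
        print(f"{sigma:.4f}")
```

A shift above 1.0 keeps the endpoints fixed but pushes intermediate sigmas upward, so more of the step budget is spent at high noise levels; a shift of 1.0 leaves the schedule unchanged.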
See bench.py for the benchmark used below.
Test Configuration:
- Hardware: RTX 3090 24GB
- Model: Wan2.2-TI2V-5B-Diffusers
- Input: ./assets/i2v_input.JPG
- Resolution: 576x768
- Frames: 17
- Sampler: Euler
- Inference Steps: 12
- Metric: post-load generate time only, including text embedding and video generation
- Warmup: 1 run, then 5 timed runs
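The measurement protocol above (one warmup run, then five timed runs, reporting mean/min/max) can be sketched as a small harness. This is not the contents of bench.py; the `benchmark` function and its parameters are hypothetical. For real GPU work, the callable would also need to wait for queued kernels to finish (for example with `torch.cuda.synchronize()`) before the timer stops.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(generate: Callable[[], object], warmup: int = 1, runs: int = 5) -> Dict[str, float]:
    """Time `generate` after warmup and summarize the timed runs in seconds."""
    for _ in range(warmup):
        generate()  # untimed: absorbs one-off costs such as lazy initialization
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(timings),
        "min": min(timings),
        "max": max(timings),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a post-load generate call.
    print(benchmark(lambda: sum(range(100_000))))
```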
Performance Results:
| Inference Engine | Mean Generate Time (s) | Min (s) | Max (s) | Notes |
|---|---|---|---|---|
| vllm-omni | 28.2445 | 28.1874 | 28.3386 | official 0.18.0 |
| nano-vllm-omni | 25.8918 | 25.0985 | 26.7400 | current implementation |
On this single-GPU Wan2.2 I2V benchmark, nano-vllm-omni is about 9.1% faster than the official vllm-omni path while keeping the codebase small and readable.
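The 9.1% figure is a throughput-style ratio of the two mean times; expressed as wall-time saved, the same numbers give about 8.3%. A short check using only the means from the table:

```python
baseline_s = 28.2445   # vllm-omni mean generate time from the table
candidate_s = 25.8918  # nano-vllm-omni mean generate time from the table

speedup = baseline_s / candidate_s               # throughput ratio
time_reduction = 1.0 - candidate_s / baseline_s  # fraction of wall time saved

print(f"speedup: {speedup:.3f}x ({(speedup - 1) * 100:.1f}% faster)")
print(f"time reduction: {time_reduction * 100:.1f}%")
```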
- ffmpeg is required to export frames to mp4.
- Pure full-GPU decode OOMed on this 24GB card at higher resolutions, so CPU offload stays enabled by default.
- The current implementation is optimized for clarity first: single process, single GPU, one scheduled diffusion step at a time.
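The idea behind module-level CPU offload can be sketched with a context manager that moves one module to the GPU only while it is in use. This is an illustrative sketch, not the project's offload code: `FakeModule`, `on_device`, and `run_stages` are hypothetical, and `FakeModule` only imitates the `.to(device)` method so the example runs without PyTorch or a GPU.

```python
from contextlib import contextmanager
from typing import Iterator, List


class FakeModule:
    """Stand-in exposing a torch.nn.Module-like .to(device) method."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.device = "cpu"

    def to(self, device: str) -> "FakeModule":
        self.device = device
        return self


@contextmanager
def on_device(module, device: str = "cuda", offload_device: str = "cpu") -> Iterator[object]:
    """Move a module to `device` for the duration of the block, then offload it."""
    module.to(device)
    try:
        yield module
    finally:
        module.to(offload_device)  # runs even if the stage raises


def run_stages(modules: List[FakeModule]) -> List[str]:
    """Visit each stage in turn and record which modules are resident on the GPU."""
    trace = []
    for module in modules:
        with on_device(module):
            resident = [m.name for m in modules if m.device == "cuda"]
            trace.append(",".join(resident))
    return trace
```

The trade-off is extra host-to-device transfer time per stage in exchange for a peak memory footprint bounded by the largest single module rather than the sum of all of them.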
- config.py: engine/runtime configuration
- sampling_params.py: runtime sampling arguments and the validated quality preset
- request.py: user-facing request object
- cache.py: CPU-side prompt embedding cache
- sched/interface.py: scheduler contract and request state
- sched/base_scheduler.py: waiting/running/finished queue bookkeeping
- sched/step_scheduler.py: step-wise diffusion scheduler
- worker/utils.py: per-request runner state and runner output
- models/interface.py: minimal step-execution pipeline protocol
- models/wan22/pipeline.py: Wan2.2 TI2V/I2V step-execution pipeline adapter
- engine/model_runner.py: step-wise request execution and state cache
- engine/omni_engine.py: top-level engine loop
- llm.py: user-facing API
- utils.py: resize/export helpers
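To illustrate the role of a CPU-side prompt embedding cache, here is a minimal LRU sketch keyed on the prompt pair. It is not the contents of cache.py: the class name `PromptEmbeddingCache`, the `get_or_encode` method, the `max_entries` limit, and the LRU eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, Tuple


class PromptEmbeddingCache:
    """Hypothetical LRU cache mapping (prompt, negative_prompt) to embeddings."""

    def __init__(self, max_entries: int = 32) -> None:
        self.max_entries = max_entries
        self._store: "OrderedDict[Tuple[str, str], object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_encode(
        self,
        prompt: str,
        negative_prompt: str,
        encode: Callable[[str, str], object],
    ) -> object:
        key = (prompt, negative_prompt)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        value = encode(prompt, negative_prompt)  # expensive text-encoder call
        self._store[key] = value
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used entry
        return value
```

Repeated requests with the same prompt pair then skip the text encoder entirely, which is the sense in which the cache plays the role that prefix/KV reuse plays in LLM serving.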
