Local-first dataset management for diffusion model fine-tuning.
Ingest · Validate · Transform · Caption · Export
*(Demo video: `demo.mp4`)*
Training a video diffusion model requires datasets where every clip satisfies strict frame-count and resolution rules. Getting there from raw footage is tedious: trim clips, normalize frames, resize to model-specific multiples, add captions, package everything correctly. DiffForge automates all of that in a browser-based editor backed by a local FastAPI service.
Drop a folder of videos, images, or GIFs. DiffForge scans it, pairs .txt sidecar files with their media, flags orphaned captions and unsupported files, and loads everything into an indexed dataset in seconds. Session state persists across reloads via IndexedDB — your work is never lost.
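Under the hood the pairing rule is simple: a caption belongs to the media file that shares its basename. A minimal sketch of that idea (the function names and the extension list are illustrative, not DiffForge's actual code):

```ts
// Pair .txt sidecar captions with media files by shared basename (illustrative sketch).
const MEDIA_EXTENSIONS = new Set([".mp4", ".webm", ".gif", ".png", ".jpg", ".jpeg"]);

function scanFolder(paths: string[]) {
  const media = new Map<string, string>();    // basename -> media path
  const captions = new Map<string, string>(); // basename -> caption path
  const unsupported: string[] = [];

  for (const p of paths) {
    const dot = p.lastIndexOf(".");
    const ext = dot === -1 ? "" : p.slice(dot).toLowerCase();
    const base = dot === -1 ? p : p.slice(0, dot);
    if (ext === ".txt") captions.set(base, p);
    else if (MEDIA_EXTENSIONS.has(ext)) media.set(base, p);
    else unsupported.push(p);
  }

  const items = [...media].map(([base, path]) => ({ path, caption: captions.get(base) ?? null }));
  const orphanedCaptions = [...captions.keys()].filter((b) => !media.has(b)).map((b) => captions.get(b)!);
  return { items, orphanedCaptions, unsupported };
}
```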
Every file is validated against the target model's constraints directly in the browser — no round-trip needed:
| Model | Resolution | Frame Rule | Frame Range |
|---|---|---|---|
| LTX Video | ×32 multiples, min 64px | 8n+1 | 1–257 |
| WAN | ×32 multiples, min 32px | 4n+1 | 1–600 |
Invalid files are flagged with specific issue messages ("Width 854px is not a ×32 multiple", "120 frames does not satisfy 8n+1").
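A minimal sketch of what such a browser-side check can look like, using the constraint values from the table above (the function and field names are illustrative, not DiffForge's actual API):

```ts
// Illustrative validation check; constraint values come from the table above.
interface ModelConstraints {
  resMultiple: number;  // width and height must be multiples of this
  minDimension: number; // smallest allowed width/height in px
  frameStep: number;    // frame count must satisfy frameStep * n + 1
  maxFrames: number;
}

const LTX_VIDEO: ModelConstraints = { resMultiple: 32, minDimension: 64, frameStep: 8, maxFrames: 257 };
const WAN: ModelConstraints = { resMultiple: 32, minDimension: 32, frameStep: 4, maxFrames: 600 };

function validateClip(width: number, height: number, frames: number, c: ModelConstraints): string[] {
  const issues: string[] = [];
  if (width % c.resMultiple !== 0) issues.push(`Width ${width}px is not a ×${c.resMultiple} multiple`);
  if (height % c.resMultiple !== 0) issues.push(`Height ${height}px is not a ×${c.resMultiple} multiple`);
  if (Math.min(width, height) < c.minDimension) issues.push(`Resolution is below the ${c.minDimension}px minimum`);
  if ((frames - 1) % c.frameStep !== 0) issues.push(`${frames} frames does not satisfy ${c.frameStep}n+1`);
  if (frames > c.maxFrames) issues.push(`${frames} frames exceeds the ${c.maxFrames}-frame limit`);
  return issues; // empty array ⇒ valid for this model
}
```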
Apply resolution and frame normalisation to your entire dataset in one click:
- Resolution — Auto (round to nearest valid multiple) or Manual (explicit W×H)
- Frames — Auto (snap to nearest valid count per model rule) or Manual (fixed target)
- Both toggleable independently with ON/OFF switches
A 5-sample preview shows before/after metadata before you commit. Transforms run through the local backend using ffmpeg.
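Auto mode is plain arithmetic: round to the nearest valid multiple or the nearest valid frame count. A rough sketch, assuming the rules from the validation table (not the exact implementation):

```ts
// Round a dimension to the nearest valid multiple without dropping below the model minimum.
function snapDimension(px: number, multiple: number, minPx: number): number {
  return Math.max(minPx, Math.round(px / multiple) * multiple);
}

// Snap a frame count to the nearest value of the form step * n + 1, capped at maxFrames.
function snapFrames(frames: number, step: number, maxFrames: number): number {
  const n = Math.round((frames - 1) / step);
  return Math.min(maxFrames, Math.max(1, n * step + 1));
}

snapDimension(854, 32, 64); // → 864
snapFrames(120, 8, 257);    // → 121
```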
Open any file in a full-screen workspace for fine-grained control:
- Item-level transform config — override the global settings for just this file
- Frame slicer — split a clip into segments at arbitrary frame boundaries (manual or evenly-spaced; the even split is sketched after this list)
- Frame grid — visualise every frame after normalisation; click to delete individual frames before encoding
- Live "After Transform" preview — resolution and frame count update as you type, immediately showing whether the output will be valid
- Progress updates (%, message) with cancel support
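The evenly-spaced slicing mode, for example, reduces to computing segment boundaries from the frame count. A sketch of that arithmetic (illustrative only):

```ts
// Split a clip of `totalFrames` frames into `segments` contiguous slices (illustrative).
function evenSliceBoundaries(totalFrames: number, segments: number): Array<[number, number]> {
  const slices: Array<[number, number]> = [];
  for (let i = 0; i < segments; i++) {
    const start = Math.floor((i * totalFrames) / segments);
    const end = Math.floor(((i + 1) * totalFrames) / segments); // exclusive
    slices.push([start, end]);
  }
  return slices;
}

evenSliceBoundaries(257, 4); // → [[0, 64], [64, 128], [128, 192], [192, 257]]
```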
Generate text descriptions for every clip using three provider options:
| Provider | Models |
|---|---|
| Azure OpenAI | Any deployed vision model |
| OpenAI | gpt-4o, gpt-4.1, gpt-4.1-mini |
| Google Gemini | gemini-2.5-pro, gemini-2.5-flash, gemini-2.0-flash |
DiffForge builds a sprite sheet from up to 8 evenly-spaced frames and sends it to the vision model with a customisable system prompt. Preview 5 samples before running the full batch. Choose "empty only" or "override" mode.
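Selecting the sprite-sheet frames is even sampling across the clip. A sketch of the index selection (the actual pipeline also tiles the selected frames into the sprite sheet before sending it):

```ts
// Pick up to `maxFrames` evenly-spaced frame indices from a clip (illustrative).
function spriteFrameIndices(totalFrames: number, maxFrames = 8): number[] {
  const count = Math.min(maxFrames, totalFrames);
  if (count === 1) return [0];
  return Array.from({ length: count }, (_, i) =>
    Math.round((i * (totalFrames - 1)) / (count - 1)),
  );
}

spriteFrameIndices(121); // → [0, 17, 34, 51, 69, 86, 103, 120]
```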
Export your finished dataset as a ZIP ready to drop into a training script:
```text
my-dataset.zip
├── 0001_clip_name.mp4
├── 0001_clip_name.txt   ← "token, caption text"
├── 0002_another_clip.mp4
├── 0002_another_clip.txt
└── metadata.json
```
An optional trigger word is prepended to every caption automatically.
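As an illustration of how the archive and captions are assembled, here is a sketch using JSZip, which the frontend already depends on; the exact exporter and the `metadata.json` contents may differ:

```ts
import JSZip from "jszip";

// Build the export archive: numbered media files plus caption sidecars, with an
// optional trigger word prepended to every caption (illustrative, not the real exporter).
async function buildExportZip(
  items: { name: string; media: Blob; caption: string }[],
  triggerWord?: string,
): Promise<Blob> {
  const zip = new JSZip();
  items.forEach((item, i) => {
    const index = String(i + 1).padStart(4, "0"); // 0001, 0002, ...
    const caption = triggerWord ? `${triggerWord}, ${item.caption}` : item.caption;
    zip.file(`${index}_${item.name}.mp4`, item.media);
    zip.file(`${index}_${item.name}.txt`, caption);
  });
  zip.file("metadata.json", JSON.stringify({ count: items.length }, null, 2)); // placeholder contents
  return zip.generateAsync({ type: "blob" });
}
```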
Every destructive action (transform, delete, caption update) is reversible. A 50-step history stack covers the full session.
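Conceptually this is a bounded stack of snapshots. A minimal sketch (the real store may keep diffs rather than full snapshots, and redo is shown only for illustration):

```ts
// Bounded undo history: push a snapshot before each destructive action (illustrative).
class HistoryStack<T> {
  private past: T[] = [];
  private future: T[] = [];
  constructor(private readonly limit = 50) {}

  push(state: T): void {
    this.past.push(state);
    if (this.past.length > this.limit) this.past.shift(); // drop the oldest step
    this.future = []; // a new action invalidates the redo branch
  }

  undo(current: T): T | undefined {
    const previous = this.past.pop();
    if (previous !== undefined) this.future.push(current);
    return previous;
  }

  redo(current: T): T | undefined {
    const next = this.future.pop();
    if (next !== undefined) this.past.push(current);
    return next;
  }
}
```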
- Node.js 20+
- Python 3.10+
- ffmpeg on `$PATH`
```bash
cd frontend
npm install
npm run dev
# → http://localhost:3000
```

```bash
cd backend
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000
```

Set `NEXT_PUBLIC_API_URL` in `frontend/.env.local` if the backend runs elsewhere (default: `http://localhost:8000`).
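For example, a hypothetical `frontend/.env.local` when the API runs on port 8001 instead of the default:

```bash
# frontend/.env.local (only needed when the API is not at the default http://localhost:8000)
NEXT_PUBLIC_API_URL=http://localhost:8001
```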
```bash
docker-compose -f docker-compose.yaml up --build
```

Frontend — Next.js 16 (App Router) · React 19 · TypeScript · Tailwind CSS v4 · shadcn/ui · JSZip · Lucide
Backend — FastAPI · Uvicorn · NumPy · Pillow · ffmpeg
| Model | Status |
|---|---|
| LTX Video | Full (transform + export) |
| WAN | Config + validation (processor coming) |
The processor system is pluggable — see docs/extending.md to add support for a new model in ~50 lines.
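Purely as a hypothetical illustration of what such a plugin boils down to (the real interface is the one documented in docs/extending.md):

```ts
// Hypothetical shape of a model plugin. The actual interface is defined in docs/extending.md.
interface ModelPlugin {
  id: string; // e.g. "wan"
  constraints: {
    resMultiple: number;  // width/height must be multiples of this
    minDimension: number; // smallest allowed width/height in px
    frameStep: number;    // frame count must satisfy frameStep * n + 1
    maxFrames: number;
  };
  // Turn one clip's metadata into transform targets for the backend processor.
  planTransform(width: number, height: number, frames: number): { width: number; height: number; frames: number };
}
```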
If something here could be improved, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for more details.
