VLM-AutoYOLO

🖼️ image/video → 🔍 VLM / SAM3 detection → 🎯 SAM2/SAM3 mask → ✏️ refine → 📦 export → 🚀 YOLO → ✅ model

Images or videos in → YOLO model out, with VLM auto-labeling (LocateAnything-3B), SAM2.1 / SAM3 mask refinement, and human-in-the-loop correction. Multi-format export, one-click YOLO training (detect & segment), video keyframe extraction, and model validation — all GPU-accelerated on macOS MPS and Windows/Linux CUDA.

See Architecture & Workflow Documentation for detailed Mermaid diagrams.

Key Features

🤖 VLM auto-labeling: Open-vocabulary object detection with LocateAnything-3B
🎯 SAM2 / SAM3 segmentation: Bbox → pixel-precise mask with SAM 2.1 or SAM3 text-driven detection+segmentation in one pass, BBox/Mask toggle on canvas
🎥 Video annotation: Intelligent keyframe extraction (scene / motion / interval), SSIM dedup
✏️ Manual refinement: Canvas draw mode, NMS filtering, hide/show individual boxes
📦 Multi-format export/import: YOLO, YOLO-Seg, COCO JSON, Pascal VOC XML, CreateML JSON — import datasets via chunked ZIP upload (max 10GB, resume support)
🚀 Training queue: Sequential job processing with cancel support, one-click training (YOLOv8 / v11 / v26) with real-time SSE progress
✅ Model validation: Batch image / video testing, MJPEG live stream, SSE video inference
💾 Smart model management: Lazy loading, idle auto-unload, MPS/CUDA strategy pattern cleanup
🌐 i18n: English / 简体中文 / 日本語 · 🎨 Theme: Light / dark mode

Documentation

📚 User Guide (English) | 📚 用户指南 (中文)

Comprehensive guides: quick start, annotation best practices, training parameter tuning, model deployment.

Screenshots

VLM Pre-annotation & Refinement	YOLO Training

Video Keyframe Entry	Model Validation

Tech Stack

Layer	Technology
Visual Grounding	NVIDIA LocateAnything-3B (Qwen2.5-3B + MoonViT)
Segmentation	SAM 2.1 / SAM3 — Segment Anything Model 2 / 3
Object Detection	YOLOv8 / v11 / v26 — Detect & Segment (Ultralytics)
Backend	Python FastAPI + PostgreSQL + SSE
Frontend	React + TypeScript + Vite + Tailwind CSS + antd
GPU Memory	Strategy Pattern (`gpu_memory.py`) — CUDA expandable segments / MPS synchronize + empty_cache
State	Zustand + TanStack Query + ahooks
i18n	i18next (English / 简体中文 / 日本語)
Video	ffmpeg (scene / motion / interval extraction)
Tooling	pnpm, ESLint, Prettier, Husky, commitlint, Playwright

Quick Start

CLI (Recommended for macOS / Linux)

git clone https://github.com/Somnusochi/VLM-AutoYOLO.git
cd VLM-AutoYOLO
python3 cli.py all

The CLI handles everything: dependency checks, Python venv, pip install, pnpm install, database migrations, and launches both services. Open http://localhost:5173.

Commands:

python3 cli.py all                       # Setup + download models + start
python3 cli.py all --no-models           # Skip model download
python3 cli.py all --models=vlm          # Only download VLM model
python3 cli.py all --models=vlm,sam2     # Download VLM + SAM2
python3 cli.py setup                     # Install deps + init DB
python3 cli.py start                     # Launch services
python3 cli.py stop                      # Stop services
python3 cli.py status                    # Check if running
python3 cli.py download --models=vlm     # Re-download specific model

Docker Deployment

Requirements: Linux or Windows (WSL2) with NVIDIA GPU + NVIDIA Container Toolkit. macOS is not supported — Docker on Mac has no GPU passthrough. Use Manual Setup instead.

Quick start with pre-built images:

curl -O https://raw.githubusercontent.com/Somnusochi/VLM-AutoYOLO/master/docker-compose.yml
docker compose up -d
open http://localhost        # Frontend
open http://localhost:8000/docs  # API docs

Build from source:

git clone https://github.com/Somnusochi/VLM-AutoYOLO.git
cd VLM-AutoYOLO
docker compose up -d --build

Services:

Service	Port	Description
Frontend	80	React web UI (Nginx)
Backend	8000	FastAPI server
SAM3	8002	SAM3 standalone inference service
Database	5432	PostgreSQL

GPU Support — docker-compose.yml now has built-in GPU passthrough configured. No manual editing required.

Persistent Storage (Docker volumes):

pgdata — Database · model-cache — VLM, SAM2 & SAM3 models · uploads — User images/videos · training-data — YOLO training outputs

Backup / Restore:

docker compose exec db pg_dump -U postgres autolabeling > backup.sql
cat backup.sql | docker compose exec -T db psql -U postgres autolabeling

Manual Setup

Requirements:

Resource	Minimum	Recommended
Python	3.12+	3.12+
Node.js	22+	22+
PostgreSQL	16+	16+
ffmpeg	Any	—
macOS	Apple Silicon 16GB	24GB+
NVIDIA GPU	12GB VRAM	16GB+

Setup:

git clone https://github.com/Somnusochi/VLM-AutoYOLO.git
cd VLM-AutoYOLO

# Backend
cd backend
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
cd ..

# Frontend
cd frontend
pnpm install
cd ..

# Database (PostgreSQL recommended, but SQLite is supported out of the box)
# If using PostgreSQL:
# psql -d postgres -c "CREATE DATABASE autolabeling;"
# cp backend/.env.example backend/.env
# If you prefer a zero-setup SQLite database, just skip the two steps above. The system will auto-generate autolabeling.db

# Migrations
cd backend
PYTHONPATH=. alembic upgrade head

Pre-download models (optional):

huggingface-cli download nvidia/LocateAnything-3B --local-dir backend/model

Launch:

./start.sh   # macOS / Linux
start.bat    # Windows

Service	URL
Frontend	http://localhost:5173
Backend	http://localhost:8000
API Docs	http://localhost:8000/docs

Project Structure

Full directory tree: docs/STRUCTURE.md

Features

VLM Pre-annotation

Upload images or video keyframes with open-vocabulary descriptions (e.g. fire, smoke, red car). LocateAnything-3B automatically detects and draws bounding boxes.

Open-vocabulary natural language descriptions
Auto-resize by long-side cap (VRAM-based: 800–1333px)
Batch upload folders or video keyframes, streaming results

SAM2 Segmentation

Enable SAM2 (Segment Anything Model 2) to refine VLM bounding boxes into pixel-precise masks.

Check "Enable SAM2 Segmentation" before detection — runs automatically after VLM
SAM 2.1 model (base+), lazy-loaded with idle auto-unload
Score threshold slider for mask quality filtering
Masks rendered as semi-transparent overlays on canvas
BBox and Mask independently toggled on both main canvas and hover preview
Result table shows polygon vertex count per box

SAM3 Detection + Segmentation

Switch to SAM3 mode for text-driven detection and segmentation in a single pass — no VLM required.

Toggle between VLM+SAM2 and SAM3 via the model selector in the sidebar
Enter open-vocabulary text prompts (e.g. cat, red car) — SAM3 detects and segments all matching instances
Confidence threshold slider (0.0–1.0, default 0.5) controls detection sensitivity
Mask threshold slider (0.0–1.0, default 0.5) controls mask tightness
Enable/disable segmentation independently — bbox-only mode skips mask extraction for faster results
SAM3 runs as a standalone HTTP service on port 8002 with its own venv (backend/sam3-venv/)
Requires HF_TOKEN — set this env var before starting the backend. Two steps:
1. Open huggingface.co/facebook/sam3 in browser, click "Agree and access repository"
2. Create a Read token at huggingface.co/settings/tokens (no need for Fine-grained — a plain Read token inherits your account's permissions) Model cached in ~/.cache/huggingface/hub/ after first download
Auto-starts on first use, idle auto-unload after 10 min
Real-time loading status via SSE (starting → loading → loaded)
Manual unload button to free GPU memory
Backend auto-switches: using SAM3 unloads VLM/SAM2, and vice versa
Detection records tagged with model_type (VLM / VLM+SAM2 / SAM3) for traceability

Video Annotation

Upload a video, extract keyframes, select and batch-annotate.

Three extraction modes: scene change, motion detection (optical flow), fixed interval
SSIM deduplication: auto-removes near-duplicate frames
Timeline preview: horizontal scrollable strip, click for full-size view
Multi-select: check frames, select/cancel all, load to annotation queue

Manual Annotation

Canvas-based annotation with View / Draw modes.

Category quick-fill from history
VLM pre-annotation baseline → delete mistakes → draw missing boxes
All / Best / NMS filter modes, settings saved per detection
Hide individual boxes while inspecting dense results
Per-frame re-detection

History Management

Thumbnail + category tag previews, tag-based multi-select filtering
Click to view details, re-detect with updated labels, virtual scroll with infinite loading
Single / batch export in 5 formats: YOLO, YOLO-Seg, COCO JSON, Pascal VOC XML, CreateML JSON
Format selection via dropdown menu, one-click zip download

YOLO Training

Series: YOLOv8 / v11 / v26 (n/s/m/l/x)
Task types: Object Detection (Detect), Instance Segmentation (Segment)
Segmentation training auto-uses SAM2 polygon labels; falls back to bbox when unavailable
Tag filter + thumbnail preview for precise data selection
Virtual scroll with "Load All" button for large datasets
Dataset split presets (70/20/10, 80/20, 90/10, 60/20/20)
Real-time SSE progress: Epoch / Loss / mAP50
Rename training jobs for easier identification
Auto ONNX export; download PT / ONNX / dataset zip

Model Validation

Dual source: trained models or externally uploaded .pt files
Conf / IoU sliders for real-time threshold tuning
Batch image validation with bounding boxes and confidence scores
Video validation (three modes):
- MJPEG live stream with interactive play/pause
- SSE prediction stream with per-frame JSON events
- Sync batch prediction — all frames at once
Temporary results; export predictions as YOLO .txt files

Model Management

Lazy loading: VLM, SAM2, and SAM3 load on first use, unload after idle (default 10 min)
Idle watchdog: all three models auto-unload after MODEL_IDLE_TIMEOUT_SECONDS of inactivity
Unified SSE status: GET /api/v1/model/events streams VLM, SAM2, SAM3 status in one connection
Manual unload: each model has its own unload button and API endpoint
GPU memory: Strategy Pattern (gpu_memory.py) — CUDA expandable_segments / MPS synchronize+empty_cache+gc

API Reference

Full API documentation with request/response examples: docs/API.md

Cross-Platform

Platform	Inference	Training
macOS (Apple Silicon)	MPS	MPS
Linux / Windows (NVIDIA)	CUDA	CUDA

Auto-detection: CUDA → MPS. Override via DEVICE env. CPU not supported.

Inference Benchmarks

Tested locally on an Apple MacBook Pro (M4 Pro, 24GB Unified Memory) using Apple MPS hardware acceleration.

Image Resolution (Max Side)	Inference Latency	Actual Memory Footprint
Thumbnail (256px)	`~0.68s`	Stable around `~11.8GB`
High-Res (1024px)	`~4.35s`	Stable around `~11.8GB`

Full detailed benchmarks across different hardware configurations: docs/BENCHMARKS.md

Highlights

MPS / CUDA full-pipeline GPU acceleration — VLM, SAM2, and YOLO training all GPU-accelerated
Strategy Pattern GPU memory — gpu_memory.py centralizes CUDA / MPS cleanup; expandable_segments:True
SAM2 / SAM3 mask refinement — SAM2 refines VLM bboxes; SAM3 does text-driven detection+segmentation in one pass
5 export formats — YOLO, YOLO-Seg, COCO, Pascal VOC, CreateML
Detect & Segment training — polygon labels auto-used when SAM2 masks are available
Cross-platform — macOS MPS, Windows / Linux CUDA, unified codebase
Unified SSE model status — single EventSource for VLM, SAM2, SAM3 states; no polling

Development

# Frontend
cd frontend && pnpm install && pnpm run lint && pnpm run build

# Backend
cd backend && source .venv/bin/activate
PYTHONPATH=. alembic upgrade head
python -m compileall app alembic

Stargazers

License

Code: AGPL-3.0.

Third-party dependencies:

LocateAnything-3B model — NVIDIA License (non-commercial use only)
SAM3 model — Facebook Research License (gated repository, requires HuggingFace access token)
Ultralytics YOLO — AGPL-3.0 (copyleft; training/deployment may trigger obligations)

If this project helps you, please ⭐ star it on GitHub. I'm open to new opportunities — reach out: somnusochi@gmail.com

Name		Name	Last commit message	Last commit date
Latest commit History 339 Commits
.github/workflows		.github/workflows
.husky		.husky
backend		backend
docs		docs
frontend		frontend
scripts		scripts
test_images/cat		test_images/cat
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
README_ZH.md		README_ZH.md
cli.py		cli.py
commitlint.config.js		commitlint.config.js
docker-compose.yml		docker-compose.yml
start.bat		start.bat
start.sh		start.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VLM-AutoYOLO

Key Features

Documentation

Screenshots

Tech Stack

Quick Start

CLI (Recommended for macOS / Linux)

Docker Deployment

Manual Setup

Project Structure

Features

VLM Pre-annotation

SAM2 Segmentation

SAM3 Detection + Segmentation

Video Annotation

Manual Annotation

History Management

YOLO Training

Model Validation

Model Management

API Reference

Cross-Platform

Inference Benchmarks

Highlights

Development

Stargazers

License

About

Uh oh!

Releases 32

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

VLM-AutoYOLO

Key Features

Documentation

Screenshots

Tech Stack

Quick Start

CLI (Recommended for macOS / Linux)

Docker Deployment

Manual Setup

Project Structure

Features

VLM Pre-annotation

SAM2 Segmentation

SAM3 Detection + Segmentation

Video Annotation

Manual Annotation

History Management

YOLO Training

Model Validation

Model Management

API Reference

Cross-Platform

Inference Benchmarks

Highlights

Development

Stargazers

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 32

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages