Gemma Vision Agent

An agentic visual reasoning pipeline combining Falcon Perception (0.6B, instance segmentation) with Gemma 4 (4B, visual language model) for object detection, counting, tracking, and scene understanding. Runs fully local on Apple Silicon via MLX.

Quick Start

# Activate the environment
source .venv/bin/activate

# Launch the main app (recommended)
python vision_studio.py
# Open http://localhost:7860

What It Does

Upload an image, type a natural language query, and the agent decides what to do:

Query	What Happens
`dog`	Falcon segments all dogs, Gemma describes them
`How many cars?`	Falcon counts with exact bounding boxes + masks
`Are there more cars than people?`	Falcon detects both, compares counts
`Describe the largest dog`	Falcon detects, crops the largest, Gemma describes it
`What is happening here?`	Gemma analyzes, then re-plans (may trigger detection)
`Count everything`	Gemma lists object types, Falcon detects each one

Every step shows which model is running (Falcon Perception or Gemma 4), the task type, execution time, and visual output.

Applications

Agent Pipeline (main tab)

Step-by-step agentic reasoning with visual output at each stage. The agent can re-plan after seeing results — if Gemma identifies objects, it can trigger Falcon to detect them, then reason over the annotated image.

Compare Mode (second tab)

Runs the same query through two sub-pipelines side-by-side:

Gemma-only: VLM reasoning without detection
Falcon + Gemma: Full agent pipeline with segmentation

This demonstrates when segmentation is needed vs. when a VLM alone suffices.

Video Tracking

Frame-by-frame detection with IoU-based tracking to maintain object identities across video. Available via python demo.py (Video tab) or python video_tracker.py.

Models

Model	HuggingFace ID	Params	Quant	Disk	Role
Falcon Perception	`tiiuae/Falcon-Perception`	0.6B	float16	~1.2 GB	Detection + segmentation
Gemma 4 E4B	`mlx-community/gemma-4-e4b-it-8bit`	4B effective	8-bit	~8 GB	Visual reasoning

Models are downloaded automatically on first run. Total: ~9.2 GB. Requires 16 GB+ RAM (Apple Silicon M1+).

How the Pipeline Works

User: "Are there more cars than people?"
                    |
              Plan Router
           (pattern matching)
            /              \
    DETECT 'cars'    DETECT 'people'
    Falcon (0.6B)    Falcon (0.6B)
    15 found          19 found
            \              /
             COMPARE counts
             cars: 15 | people: 19
                    |
                 ANSWER
    "More people (19) than cars (15)"

For open-ended questions, the agent enters a re-planning loop where Gemma decides the next action after each step (max 8 steps).

Available Tools

Tool	Model	What It Does
DETECT	Falcon Perception (0.6B)	Instance segmentation with bounding boxes + masks
VLM	Gemma 4 E4B (4B)	Visual reasoning, scene description, Q&A
CROP	(utility)	Zoom into a specific detection for closer inspection
COMPARE	(utility)	Compare counts between two object types
DETECT_EACH	Falcon Perception	Detect multiple object types identified by VLM
VLM_PLAN	Gemma 4 E4B	Re-planning — Gemma decides what to do next

Falcon Perception Internals

Falcon uses a single early-fusion Transformer (no separate vision encoder). For each detected object, it outputs a Chain-of-Perception sequence:

<coord> token — predicts center (x, y) normalized to 0-1
<size> token — predicts height, width normalized to 0-1
<seg> token — produces a mask embedding, dot-producted with upsampled image features for a full-resolution binary mask

Output is COCO RLE-encoded and decoded via pycocotools.

Gemma-Only vs. Falcon+Gemma

Results from identical queries on the same images:

Task	Gemma-Only	Falcon+Gemma
Count dogs (2 in image)	"two" (correct)	2 exact + 2 masks
Count cars (dense scene)	"at least 10 taxis" (vague)	16 exact + 16 masks
Count people (crowded)	"approximately 25-30" (range)	31 exact + 31 masks
Spatial grounding	No bounding boxes or masks	Per-instance bbox + pixel mask
Scene understanding	Strong	Strong (with visual evidence)
Speed	1-4s	3-25s (depends on objects)

When to use Falcon: Exact counting (>5 objects), spatial localization, instance separation, visual grounding.

When Gemma alone suffices: Scene description, mood/context, simple yes/no, breed/type identification.

File Structure

Gemma-4/
├── vision_studio.py      # Main app — FastAPI + premium HTML/CSS/JS UI
│                          # Two tabs: Agent Pipeline + Compare
│
├── agent_studio.py       # Core pipeline logic (detection, VLM, planning,
│                          # re-planning, rendering, step metadata)
│
├── agent.py              # Standalone Gradio agent UI (older version)
├── demo.py               # Gradio unified UI (image + video tabs)
├── app.py                # Gradio image analysis app
├── video_tracker.py      # Video tracking with IoU tracker
├── main.py               # Combined Gradio launcher
│
├── ARCHITECTURE.md       # Detailed architecture docs with Mermaid diagrams
├── README.md             # This file
│
├── test_data/            # Example images + test videos
│   ├── dogs.jpg          # Two dogs (Corgi + Yorkshire Terrier)
│   ├── street.jpg        # NYC street scene (cars + people)
│   ├── kitchen.jpg       # Kitchen scene (people cooking)
│   ├── dogs_video.mp4    # Dogs zoom video (20 frames)
│   └── test_panning.mp4  # Street panning video (30 frames)
│
├── step_outputs/         # Temporary step images for UI rendering
└── .venv/                # Python 3.12 virtual environment

All Entry Points

Command	Port	Description
`python vision_studio.py`	7860	Main app — FastAPI, premium UI, Agent + Compare tabs
`python agent_studio.py`	7860	Gradio agent UI (step-by-step, older)
`python demo.py`	7860	Gradio unified UI (image + video tabs)
`python video_tracker.py`	7861	Gradio video tracking UI
`python app.py`	7860	Gradio image analysis (original)

Setup (from scratch)

# Create virtual environment
/opt/homebrew/bin/python3.12 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install mlx-vlm gradio opencv-python-headless Pillow numpy pycocotools
pip install "falcon-perception[mlx] @ git+https://github.com/tiiuae/falcon-perception.git"
pip install fastapi uvicorn

# Launch
python vision_studio.py

Requirements

macOS with Apple Silicon (M1/M2/M3/M4)
Python 3.10+
16 GB+ RAM recommended
~7 GB disk for model weights (auto-downloaded)

References

Falcon Perception — TII's open-vocabulary segmentation model
Falcon Perception Paper — arXiv:2603.27365
MLX-VLM — Vision language models on Apple Silicon
Gemma 4 — Google's efficient multimodal model
mlx-vlm-falcon — Inspiration for the combined pipeline

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gemma Vision Agent

Quick Start

What It Does

Applications

Agent Pipeline (main tab)

Compare Mode (second tab)

Video Tracking

Models

How the Pipeline Works

Available Tools

Falcon Perception Internals

Gemma-Only vs. Falcon+Gemma

File Structure

All Entry Points

Setup (from scratch)

Requirements

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
diagrams		diagrams
test_data		test_data
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
README.md		README.md
VIDEO_PLAN.md		VIDEO_PLAN.md
agent.py		agent.py
agent_studio.py		agent_studio.py
app.py		app.py
demo.py		demo.py
main.py		main.py
video_tracker.py		video_tracker.py
vision_studio.py		vision_studio.py

Folders and files

Latest commit

History

Repository files navigation

Gemma Vision Agent

Quick Start

What It Does

Applications

Agent Pipeline (main tab)

Compare Mode (second tab)

Video Tracking

Models

How the Pipeline Works

Available Tools

Falcon Perception Internals

Gemma-Only vs. Falcon+Gemma

File Structure

All Entry Points

Setup (from scratch)

Requirements

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages