A lightweight local AI vision agent built with Streamlit + LangGraph + Ollama that analyzes images and answers user questions using a multi-step pipeline.
- Upload any image (JPG, PNG, WebP, GIF)
- Multi-step reasoning pipeline:
  - Vision → image description (LLaVA)
  - Research → reasoning over description
  - Writer → clean final answer
- Fully local (no API costs)
- Clean modern UI with Streamlit
User Input
│
▼
[ Vision Model (llava-phi3) ]
│
▼
[ Research Agent (llama3.2) ]
│
▼
[ Writer Agent (llama3.2) ]
│
▼
Final Answer
The pipeline is built as a LangGraph state machine: each stage reads the shared state and adds its own output to it.
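The flow in the diagram can be sketched in plain Python, without importing LangGraph, to show how state moves from one stage to the next. This is only an illustration: the node functions, the state keys (`image_name`, `question`, `description`, `research`, `answer`) and `run_pipeline` are hypothetical names, and the model calls are replaced by string stubs. In the actual `backend.py`, functions like these would presumably be registered on a LangGraph `StateGraph` with `add_node` and connected with `add_edge`.

```python
from typing import Callable, Dict, List

# Shared state passed between stages (hypothetical keys, for illustration only).
State = Dict[str, str]


def vision_node(state: State) -> State:
    # Stand-in for the llava-phi3 call: produce an image description.
    return {**state, "description": f"description of {state['image_name']}"}


def research_node(state: State) -> State:
    # Stand-in for the llama3.2 reasoning step over the description.
    notes = f"notes on '{state['description']}' for question '{state['question']}'"
    return {**state, "research": notes}


def writer_node(state: State) -> State:
    # Stand-in for the llama3.2 writer step that produces the final answer.
    return {**state, "answer": f"Answer based on {state['research']}"}


# Linear order matching the diagram: Vision -> Research -> Writer.
PIPELINE: List[Callable[[State], State]] = [vision_node, research_node, writer_node]


def run_pipeline(state: State) -> State:
    # Each node receives the current state and returns an extended copy.
    for node in PIPELINE:
        state = node(state)
    return state


result = run_pipeline({"image_name": "cat.png", "question": "What animal is this?"})
print(result["answer"])
```

Keeping each stage as a function from state to state is what makes the stages easy to swap or test in isolation, whichever graph library wires them together.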
- Python 3.9+
- Ollama installed → https://ollama.com
- 8GB RAM recommended
# Install dependencies
pip install -r requirements.txt
# Start Ollama
ollama serve
# Pull models
ollama pull llava-phi3
ollama pull llama3.2:1b
# Run the app
streamlit run frontend.py
.
├── frontend.py # Streamlit frontend
├── backend.py # LangGraph pipeline
└── README.md
- The uploaded image is converted to base64
- The encoded image is sent to the LLaVA vision model (llava-phi3) via Ollama
- The resulting description is analyzed by the LLM (llama3.2)
- The final answer is generated and displayed
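The first two steps can be sketched with the standard library alone. This assumes the request goes to Ollama's REST endpoint `/api/generate` on the default local port 11434; the project itself may use a client library instead. The helper names `build_vision_request` and `ask_ollama` are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_vision_request(image_bytes: bytes, question: str,
                         model: str = "llava-phi3") -> dict:
    # Ollama expects images as base64-encoded strings in the "images" list.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "prompt": question,
        "images": [encoded],
        "stream": False,  # ask for a single JSON reply instead of a stream
    }


def ask_ollama(payload: dict, url: str = OLLAMA_URL) -> str:
    # Sends the payload; requires `ollama serve` to be running.
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


# Build a payload from placeholder bytes; no network call is made here.
payload = build_vision_request(b"\x89PNG placeholder bytes", "What is in this image?")
print(payload["model"], len(payload["images"]))
```

With Ollama running and the model pulled, `ask_ollama(payload)` would return the model's description as a string, which the research step can then consume.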
