High-Performance Vision-Language Models for Multimodal AI
Unified framework for training, fine-tuning, and deploying Vision-Language Models
FlashVLM is a production-ready framework for Vision-Language Models (VLMs) that unifies leading architectures under a single, consistent API. It supports visual question answering, image captioning, visual grounding, document understanding, and multimodal reasoning.
- Unified Architecture Support — LLaVA, Qwen-VL, InternVL, and Phi-Vision under one API
- Multiple Vision Encoders — CLIP, SigLIP, DINOv2 with flexible projection layers
- Efficient Training — LoRA/QLoRA, DeepSpeed ZeRO, gradient checkpointing
- Advanced Generation — Streaming, beam search, top-k/top-p sampling
- Task Modules — VQA, captioning, grounding, OCR, chain-of-thought reasoning
- Production Solutions — Multimodal chatbot, document QA, image analysis
- RLHF & DPO — Alignment training for safer, more helpful VLMs
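
The top-k/top-p sampling listed under Advanced Generation can be illustrated independently of FlashVLM. The sketch below is plain Python and is not FlashVLM's implementation; the function names `top_k_top_p_filter` and `sample` are hypothetical. It assumes one common ordering: keep the k most likely tokens, renormalize, then keep the smallest prefix whose cumulative probability reaches p.

```python
import math
import random


def top_k_top_p_filter(logits, top_k=0, top_p=1.0):
    """Return {token_id: probability} after top-k, then top-p filtering.

    Hypothetical helper for illustration; not part of the FlashVLM API.
    """
    # Numerically stable softmax over the raw logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample(logits, top_k=0, top_p=1.0, rng=random):
    """Draw one token id from the filtered distribution."""
    dist = top_k_top_p_filter(logits, top_k, top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(top_k_top_p_filter(logits, top_k=3, top_p=0.8)))  # [0, 1]
print(sample(logits, top_k=1))  # 0, since only the top token survives
```

Lower `top_p` or `top_k` values narrow the candidate set and make output more deterministic; `top_k=1` reduces to greedy decoding.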
# Install from source
git clone https://github.com/FlashVision/FlashVLM.git
cd FlashVLM
pip install -e ".[all]"

# Install from PyPI
pip install flashvlm

# Install with training extras
pip install "flashvlm[training]"

# Run the environment setup script
bash setup_env.sh

# Visual question answering
from flashvlm import FlashVLM
model = FlashVLM.from_pretrained("llava-v1.5-7b")
answer = model.ask("What is happening in this image?", image="photo.jpg")
print(answer)

# Image captioning
from flashvlm import FlashVLM
model = FlashVLM.from_pretrained("llava-v1.5-7b")
caption = model.caption("photo.jpg")
print(caption)

# Multimodal chat
from flashvlm.solutions import MultimodalChat
chat = MultimodalChat(model_name="llava-v1.5-7b")
chat.add_image("diagram.png")
response = chat.send("Explain this diagram step by step.")
print(response)

# Visual QA
flashvlm vqa --image photo.jpg --question "What color is the car?"
# Image captioning
flashvlm caption --image photo.jpg --max-tokens 100
# Interactive chat
flashvlm chat --model llava-v1.5-7b
# Export model
flashvlm export --model llava-v1.5-7b --format onnx
# Run benchmarks
flashvlm benchmark --model llava-v1.5-7b --dataset vqav2

| Model | Parameters | Vision Encoder | Architecture | Tasks |
|---|---|---|---|---|
| LLaVA-1.5-7B | 7B | CLIP ViT-L/14 | LLaVA | VQA, Caption, Chat |
| Qwen-VL-7B | 7B | ViT-bigG | Qwen-VL | VQA, OCR, Grounding |
| InternVL-7B | 7B | InternViT-6B | InternVL | VQA, Caption, Reasoning |
| Phi-Vision-4B | 4B | SigLIP | Phi-Vision | VQA, Caption, Chat |

# Full training from a YAML config
from flashvlm import FlashVLM, Trainer
from flashvlm.cfg import get_config
config = get_config("configs/flashvlm_llava_7b.yaml")
model = FlashVLM(config)
trainer = Trainer(model, config)
trainer.train()

# LoRA fine-tuning
from flashvlm import FlashVLM, apply_lora
model = FlashVLM.from_pretrained("llava-v1.5-7b")
model = apply_lora(model, rank=16, alpha=32, target_modules=["q_proj", "v_proj"])

FlashVLM/
├── configs/ # Model configuration YAML files
├── docker/ # Docker deployment files
├── docs/ # Documentation
├── examples/ # Usage examples
├── flashvlm/ # Main package
│ ├── cfg/ # Configuration management
│ ├── data/ # Datasets and transforms
│ ├── engine/ # Training and inference engines
│ ├── models/ # VLM architectures
│ ├── generation/ # Text generation utilities
│ ├── tasks/ # Task-specific modules
│ ├── training/ # Training strategies (SFT, DPO, RLHF)
│ ├── solutions/ # End-to-end solutions
│ ├── analytics/ # Benchmarking and metrics
│ └── utils/ # Utility functions
├── tests/ # Unit tests
├── pyproject.toml # Package configuration
└── README.md # This file
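
To give a sense of scale for the `apply_lora` call in the fine-tuning example (`rank=16`, `alpha=32`, targeting `q_proj` and `v_proj`), the sketch below estimates how many parameters such adapters would train. It is back-of-the-envelope arithmetic, not a description of FlashVLM internals. The hidden size (4096), layer count (32), square projection matrices, and the choice to count only the language model's layers are all assumptions typical of a LLaMA-7B-class backbone; the helper name `lora_trainable_params` is hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter.

    LoRA learns two low-rank factors, A (rank x d_in) and B (d_out x rank),
    so the count is rank * (d_in + d_out). Hypothetical helper.
    """
    return rank * (d_in + d_out)


# Assumed backbone shape (not taken from FlashVLM): LLaMA-7B-class decoder.
hidden_size = 4096
num_layers = 32

# Values from the README's fine-tuning example.
rank, alpha = 16, 32
target_modules = ["q_proj", "v_proj"]

per_matrix = lora_trainable_params(hidden_size, hidden_size, rank)
total = per_matrix * len(target_modules) * num_layers
scaling = alpha / rank  # factor applied to the low-rank update B @ A

print(per_matrix, total, scaling)  # 131072 8388608 2.0
```

Under these assumptions the adapters add roughly 8.4M trainable parameters, a small fraction of a 7B-parameter model, which is why LoRA lowers fine-tuning memory requirements.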

Benchmark accuracy (%), higher is better:

| Model | VQAv2 | GQA | TextVQA | MM-Bench |
|---|---|---|---|---|
| LLaVA-1.5-7B | 78.5 | 62.0 | 58.2 | 64.3 |
| Qwen-VL-7B | 78.8 | 59.3 | 63.8 | 38.2 |
| InternVL-7B | 79.3 | 62.9 | 57.0 | 65.4 |
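
No single model leads on every benchmark, so it can help to read the table one column at a time. The short sketch below copies the scores from the table and reports the top model per benchmark; the helper name `best_per_benchmark` is hypothetical, and the snippet does not call FlashVLM.

```python
# Scores copied from the benchmark table above.
scores = {
    "LLaVA-1.5-7B": {"VQAv2": 78.5, "GQA": 62.0, "TextVQA": 58.2, "MM-Bench": 64.3},
    "Qwen-VL-7B": {"VQAv2": 78.8, "GQA": 59.3, "TextVQA": 63.8, "MM-Bench": 38.2},
    "InternVL-7B": {"VQAv2": 79.3, "GQA": 62.9, "TextVQA": 57.0, "MM-Bench": 65.4},
}


def best_per_benchmark(table):
    """Map each benchmark name to the model with the highest score."""
    benchmarks = next(iter(table.values())).keys()
    return {b: max(table, key=lambda m: table[m][b]) for b in benchmarks}


best = best_per_benchmark(scores)
for benchmark, model in best.items():
    print(f"{benchmark}: {model} ({scores[model][benchmark]})")
```

On these numbers InternVL-7B leads on VQAv2, GQA, and MM-Bench, while Qwen-VL-7B leads on TextVQA.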

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

@software{flashvlm2024,
  title={FlashVLM: High-Performance Vision-Language Models},
  author={FlashVision Team},
  year={2024},
  url={https://github.com/FlashVision/FlashVLM}
}

This project is licensed under the MIT License; see LICENSE for details.