Vision-language models (VLMs) can describe images, answer questions about photographs, read documents, and solve visual puzzles. But how do they work? What architectural choices enable a language model to "see"? And what are the failure modes that reveal the limits of current approaches?

This post unpacks the internals of VLMs: the vision encoders, the adaptation layers, and the training regimes that produce multimodal understanding.



## Vision Encoders: What the Model Sees

Before an LLM can process an image, the image must be converted into tokens. This is the job of the **vision encoder**.

**Vision Transformer (ViT)**: The dominant architecture. Split the image into patches (typically 14×14 or 16×16 pixels), embed each patch as a vector, add positional embeddings, and run through transformer layers. The output is a sequence of patch embeddings—one per spatial location.
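To make the patch step concrete, here is a minimal sketch in plain Python. It assumes a single-channel image stored as nested lists, and `patchify` is a name made up for this post. A real ViT would then apply a learned linear projection to each flattened patch and add positional embeddings, which this sketch leaves out.

```python
def patchify(image, patch_size):
    """Split an H x W grid of pixel values into flattened, non-overlapping patches.

    Patches are returned in row-major order, each as a flat list of
    patch_size * patch_size values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 "image" split into 2x2 patches gives a 2x2 grid of 4 patches.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = patchify(toy_image, patch_size=2)
```

Each entry of `toy_patches` corresponds to one spatial location, which is why the encoder's output length depends on image size and patch size.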

**CLIP (Contrastive Language-Image Pre-training)**: Trained to align image and text embeddings using contrastive learning. Given image-caption pairs, CLIP learns to maximize similarity between matching pairs and minimize it for non-matching. The resulting vision encoder produces embeddings that are already somewhat aligned with language.
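The contrastive objective can be sketched in a few lines of plain Python. This is an illustration rather than CLIP's actual training code: the helper names are made up, the temperature is fixed at 0.07 as an assumption (CLIP learns it during training), and row `i` of each list is assumed to be a matching image-caption pair.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target_index):
    """Softmax cross-entropy for one row of logits (log-sum-exp form)."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target_index]


def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    images = [normalize(v) for v in image_embeds]
    texts = [normalize(v) for v in text_embeds]
    n = len(images)
    # logits[i][j] is the scaled cosine similarity of image i and caption j.
    logits = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Image-to-text: each image should pick out its own caption.
    image_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_image = sum(cross_entropy_row(columns[j], j) for j in range(n)) / n
    return (image_to_text + text_to_image) / 2
```

The loss is near zero when each image is most similar to its own caption, and large when the pairs are shuffled.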

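**SigLIP**: A variant of CLIP that replaces the softmax-based contrastive loss with a pairwise sigmoid loss. Each image-text pair is scored as an independent binary classification, so the loss needs no normalization across the whole batch. That makes training more memory-efficient and less dependent on batch size.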
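For comparison with the softmax version, here is a hedged sketch of a pairwise sigmoid loss. It assumes the embeddings are already L2-normalized and that row `i` of each list is a matching pair. The function names are made up, and `scale` and `bias` stand in for learned parameters; the fixed defaults here are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pairwise_loss(image_embeds, text_embeds, scale=10.0, bias=-10.0):
    """Treat every image-text pair as its own binary classification."""
    n = len(image_embeds)
    total = 0.0
    for i, img in enumerate(image_embeds):
        for j, txt in enumerate(text_embeds):
            similarity = sum(a * b for a, b in zip(img, txt))
            # +1 for the matching pair on the diagonal, -1 for every other pair.
            label = 1.0 if i == j else -1.0
            total -= log_sigmoid(label * (scale * similarity + bias))
    return total / n
```

Note that no term depends on a sum over the rest of the batch, which is the practical difference from the softmax formulation.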

**DINOv2**: Self-supervised vision encoder trained without text. Focuses purely on visual structure—produces embeddings useful for dense prediction tasks like segmentation.

The choice of vision encoder affects what information is available to the language model. CLIP-based encoders emphasize semantic concepts aligned with language; DINOv2 emphasizes visual structure. Some models use both.



## Fusion Strategies: How Modalities Meet

Given vision embeddings and a language model, how do you combine them? Several strategies exist:

**Early fusion**: Concatenate vision tokens directly with text tokens as input to the LLM. Simple but expensive—every forward pass processes all vision tokens through all layers.
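A minimal sketch of what "concatenate" means in practice, assuming the prompt contains a single placeholder marking where the image goes. The `<image>` marker and the function name are hypothetical, and the embeddings are toy two-dimensional vectors.

```python
IMAGE_PLACEHOLDER = "<image>"  # hypothetical marker for the image position


def splice_vision_tokens(text_tokens, text_embeds, vision_embeds):
    """Replace the single image placeholder with the vision embeddings."""
    position = text_tokens.index(IMAGE_PLACEHOLDER)
    return text_embeds[:position] + vision_embeds + text_embeds[position + 1:]


prompt_tokens = ["Describe", IMAGE_PLACEHOLDER, "briefly"]
prompt_embeds = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]
image_embeds = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

# 2 text positions + 4 vision positions = 6 positions seen by every LLM layer.
fused = splice_vision_tokens(prompt_tokens, prompt_embeds, image_embeds)
```

The cost concern follows directly: the fused sequence grows by one position per vision token, and all of it passes through every layer.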

**Late fusion**: Process modalities separately until final layers, then combine. Efficient but limits cross-modal interaction.

**Cross-attention**: Add cross-attention layers where text tokens attend to vision tokens (or vice versa). The LLM can selectively query visual information without processing all vision tokens through all layers.

**Gated fusion**: Learn gates that control how much visual information flows into the language model at each layer. Allows the model to ignore vision when it's irrelevant.
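One common way to implement such a gate is a residual connection scaled by `tanh` of a learned scalar. The sketch below assumes that form and uses made-up names. A gate parameter of zero leaves the language model's hidden state untouched.

```python
import math


def gated_residual(hidden, visual_update, gate_param):
    """Add a visual update to a hidden vector, scaled by tanh(gate_param)."""
    gate = math.tanh(gate_param)
    return [h + gate * v for h, v in zip(hidden, visual_update)]


# With the gate closed (parameter 0.0) the hidden state passes through unchanged.
closed = gated_residual([1.0, 2.0], [0.5, -0.5], gate_param=0.0)
# With a large parameter the gate saturates near 1 and the update is fully added.
opened = gated_residual([1.0, 2.0], [0.5, -0.5], gate_param=10.0)
```

Starting the gate at zero means the model initially behaves like the original text-only LLM, and visual information is blended in only as training moves the gate.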

Most modern VLMs use some form of early fusion with an adapter layer that compresses or transforms vision tokens before they enter the LLM.



## Adapter Modules: The Bridge Between Modalities

Vision encoders produce embeddings in their own space; LLMs expect tokens in their embedding space. The **adapter** bridges this gap.

**Linear projection**: The simplest approach is a learned linear layer that projects vision embeddings to the LLM's hidden dimension. Used in the original LLaVA, and surprisingly effective.
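As a sketch, the projection is a single matrix multiply plus a bias applied to every patch embedding. The dimensions and weights below are toy values chosen for readability, and `linear_project` is a made-up name. In a real model the weights are learned and the dimensions run into the thousands.

```python
def linear_project(vision_embeds, weight, bias):
    """Map each vision embedding (length d_vision) to length d_model.

    weight has shape d_model x d_vision: one row per output dimension.
    """
    return [
        [sum(w * x for w, x in zip(row, embed)) + b for row, b in zip(weight, bias)]
        for embed in vision_embeds
    ]


# Toy example: d_vision = 2, d_model = 3.
toy_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_bias = [0.0, 0.0, 0.5]
projected = linear_project([[2.0, 3.0], [1.0, -1.0]], toy_weight, toy_bias)
```

The number of tokens is unchanged; only the width of each token changes to match the LLM.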

**MLP adapter**: A small feedforward network (2-3 layers) with a nonlinearity. Slightly more expressive than linear projection; LLaVA-1.5 switched from a linear layer to a two-layer MLP.

**Q-Former (Querying Transformer)**: Used in BLIP-2. A set of learnable query tokens attend to (frozen) vision encoder outputs through cross-attention, producing a fixed number of output tokens regardless of image size. This compresses visual information and learns what aspects are relevant for language.

**Perceiver Resampler**: Similar idea to Q-Former—use a small number of latent tokens to "summarize" the visual input through cross-attention. Used in Flamingo.
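The shared idea behind Q-Former and the Perceiver Resampler can be sketched as a single cross-attention step. This is a stripped-down illustration with made-up names, not either paper's architecture: it omits the learned query/key/value projections, multiple heads, and feed-forward layers, and it uses the vision tokens directly as both keys and values.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def resample(queries, vision_tokens):
    """Single-head cross-attention: queries attend over all vision tokens."""
    scale = math.sqrt(len(queries[0]))
    dim = len(vision_tokens[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, token)) / scale
            for token in vision_tokens
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the vision tokens.
        outputs.append([
            sum(w * token[d] for w, token in zip(weights, vision_tokens))
            for d in range(dim)
        ])
    return outputs
```

However many vision tokens go in, the output has exactly one vector per query, which is where the fixed token budget comes from.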

**C-Abstractor and variants**: Learn hierarchical abstractions, preserving some spatial structure while compressing token count.

The number of visual tokens matters for efficiency. A 336×336 image split into 14×14-pixel patches yields a 24×24 grid, or 576 vision tokens. That's a lot of context for the LLM. Adapters that reduce token count (Q-Former, Perceiver Resampler) trade visual detail for efficiency.
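The arithmetic is easy to check. The helper below is a made-up name and assumes a square image with non-overlapping square patches and no extra special tokens.

```python
def vision_token_count(image_size, patch_size):
    """Number of patch tokens for a square image with square patches."""
    if image_size % patch_size:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


tokens_336 = vision_token_count(336, 14)  # 24 * 24 = 576
tokens_224 = vision_token_count(224, 14)  # 16 * 16 = 256
```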



## Training Regimes

VLMs typically train in stages:

**Stage 1: Vision-language alignment (pretraining)**
Freeze both vision encoder and LLM; only train the adapter. Use image-caption pairs. Goal: teach the adapter to project visual information into a form the LLM can use. This is cheap because most parameters are frozen.

**Stage 2: Instruction tuning**
Unfreeze the LLM (and sometimes vision encoder). Train on visual instruction-following data: visual question-answering, image description, visual reasoning puzzles. Goal: teach the model to follow instructions about images.

**Stage 3 (optional): RLHF or preference tuning**
Use human preferences to further refine outputs. Important for reducing hallucinations and improving helpfulness.
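The first two stages differ mainly in which components receive gradient updates. The sketch below records that as a plain table. The stage names, dictionary layout, and parameter counts are all illustrative assumptions, and keeping the vision encoder frozen in stage 2 is only one of the options mentioned above.

```python
# Which components are trainable in each stage (True = receives gradient updates).
TRAINING_STAGES = {
    "alignment": {"vision_encoder": False, "adapter": True, "llm": False},
    "instruction_tuning": {"vision_encoder": False, "adapter": True, "llm": True},
}


def trainable_parameter_count(stage, parameter_counts):
    """Sum the parameters of every component that is unfrozen in this stage."""
    flags = TRAINING_STAGES[stage]
    return sum(count for name, count in parameter_counts.items() if flags[name])


# Made-up round numbers purely for illustration, not taken from any real model.
example_counts = {
    "vision_encoder": 300_000_000,
    "adapter": 20_000_000,
    "llm": 7_000_000_000,
}
stage1_trainable = trainable_parameter_count("alignment", example_counts)
stage2_trainable = trainable_parameter_count("instruction_tuning", example_counts)
```

With numbers of this shape, stage 1 updates well under one percent of the total parameters, which is why it is the cheap stage.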

The quality and diversity of training data at each stage heavily influences capabilities. Models trained primarily on natural images struggle with documents, charts, and diagrams. Models trained on clean web data may hallucinate about photos with unusual content.



## Emergent Capabilities

Modern VLMs exhibit impressive capabilities that were never the target of a dedicated training objective:

**Visual reasoning**: "How many apples are on the table?" "Which object is closer to the camera?" "What would happen if I pushed the ball?"

**OCR and document understanding**: Reading text in images, understanding tables, parsing receipts and forms.

**Spatial reasoning**: Understanding relative positions, following directions ("the object to the left of..."), parsing maps.

**World knowledge grounding**: Identifying landmarks, recognizing species, understanding cultural context.

**Chart and graph interpretation**: Extracting data from visualizations, answering questions about trends.

These capabilities emerge from scale and diverse training data. No single supervision signal taught "chart reading"—it arose from exposure to many charts paired with textual descriptions.



## Failure Modes

VLMs fail in characteristic ways that reveal their limitations:

**Hallucination**: Confidently describing objects or properties that aren't in the image, such as "The woman is wearing a red hat" when there's no hat. This is among the most common failure modes, and a dangerous one because the output reads as plausible.

**Object binding errors**: Correctly identifying that there's a red ball and a blue cube, but incorrectly stating "the ball is blue." Visual properties get misattributed to wrong objects.

**Counting failures**: VLMs notoriously struggle with counting. "How many people are in the image?" often gets wrong answers, especially for larger numbers.

**Positional confusion**: Left/right, above/below, near/far relations are often unreliable. The model may understand "there are two objects" but not their spatial relationship.

**Negation blindness**: "Is there NOT a dog in the image?" is harder to answer reliably than "Is there a dog in the image?"

**Fine-grained discrimination**: Distinguishing similar breeds, species, or models. CLIP-style training emphasizes coarse categories over fine distinctions.

These failures suggest that current VLMs don't build robust internal representations of scene structure. They're doing something more like sophisticated pattern matching on visual features plus language context.



## Case Studies: Current Models

**LLaVA**: Clean, open architecture. ViT (CLIP-pretrained) → linear projection → Vicuna/LLaMA. Shows that sophisticated adapters aren't always necessary.

**GPT-4V**: Proprietary, likely uses a large ViT variant with substantial multimodal pretraining. Best-in-class on many benchmarks but expensive and closed.

**Gemini**: Trained natively multimodal from early stages (not a vision encoder bolted onto a language model). Interleaves modalities more fluidly.

**Claude's vision**: Focuses on safety and reliability. Strong at document understanding and refusing problematic requests about images.

**Qwen-VL, InternVL, CogVLM**: Open-source models pushing various frontiers: resolution, efficiency, specialized capabilities.

The landscape is moving fast. New models appear monthly, each with different trade-offs between capability, efficiency, and openness.



## Implementation Considerations

If you're working with VLMs, keep in mind:

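**Resolution matters**: Higher resolution preserves more detail but increases cost sharply. The number of vision tokens grows with the square of the image side length, and self-attention cost grows with the square of the token count, so doubling the side length quadruples the tokens and multiplies attention cost by roughly 16. Many models therefore use multi-resolution strategies, for example processing a low-resolution view first and then adding higher-resolution crops of regions of interest.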
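A quick back-of-the-envelope check of how cost scales with image side length, assuming 14-pixel patches and counting only pairwise attention interactions. The function name is made up for this post, and real costs also include terms that grow linearly with token count.

```python
def scaling_factors(side_before, side_after, patch_size=14):
    """Return (token growth, attention-pair growth) when the side length changes."""
    tokens_before = (side_before // patch_size) ** 2
    tokens_after = (side_after // patch_size) ** 2
    token_growth = tokens_after / tokens_before
    # Self-attention compares every token with every other token.
    attention_growth = token_growth ** 2
    return token_growth, attention_growth


# Doubling the side from 336 to 672 pixels: 576 tokens become 2304.
token_growth, attention_growth = scaling_factors(336, 672)
```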

**Aspect ratio handling**: Images aren't all squares. Naive resizing distorts content. Better approaches: pad, crop, or tile into multiple regions.
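Two of those options reduce to simple geometry. The sketch below computes the padding needed to center an image in a square, and the number of fixed-size tiles needed to cover it. Both function names are hypothetical, and the tile size of 336 in the example is an assumption chosen to match the encoder resolution used earlier in this post.

```python
import math


def pad_to_square(width, height):
    """Return (left, top, right, bottom) padding that centers the image in a square."""
    side = max(width, height)
    extra_width, extra_height = side - width, side - height
    left, top = extra_width // 2, extra_height // 2
    return left, top, extra_width - left, extra_height - top


def tile_grid(width, height, tile_size):
    """Return (columns, rows) of tile_size x tile_size tiles needed to cover the image."""
    return math.ceil(width / tile_size), math.ceil(height / tile_size)


padding = pad_to_square(640, 480)    # pad 80 pixels above and below
grid = tile_grid(1000, 700, 336)     # a 3 x 3 grid of tiles
```

Padding keeps the token count fixed but wastes tokens on empty borders; tiling keeps detail but multiplies the token count by the number of tiles.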

**Video**: Extend to video by sampling frames and treating each as an image. Challenges: reasoning about temporal order across frames, and the long-context cost, since every sampled frame adds its own set of vision tokens.
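Uniform sampling is the simplest frame-selection policy. The sketch below is one hedged way to do it, with a made-up function name: it picks the middle frame of each equal-length segment of the clip.

```python
def sample_frame_indices(total_frames, num_samples):
    """Pick num_samples frame indices spread evenly across the clip."""
    if num_samples >= total_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    segment = total_frames / num_samples
    # Take the middle frame of each equal-length segment.
    return [int(segment * i + segment / 2) for i in range(num_samples)]


indices = sample_frame_indices(total_frames=100, num_samples=4)
```

At 576 tokens per frame, even these four frames cost 2,304 vision tokens, which is why frame budgets and token compression matter so much for video.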

**Multiple images**: Some VLMs handle multiple images per query; others struggle. Important for comparison tasks, before/after, or document processing.

**Prompt engineering for vision**: Just like text, how you phrase visual questions matters. Being specific, asking for step-by-step reasoning, and providing context improves performance.



## Where This Is Heading

Several trends are shaping the future of VLMs:

**Native multimodal training**: Instead of adapting language models to vision, train on interleaved images and text from the start. Gemini and Chameleon point in this direction.

**More modalities**: Audio, 3D, video, embodied sensors. The same architectural patterns extend.

**Agentic capabilities**: VLMs that can interact with GUIs, browse the web, and manipulate visual environments.

**Specialized vision**: Domain-specific models for medical imaging, satellite imagery, scientific figures.

**Efficiency**: Making VLMs fast and cheap enough for edge deployment.

Vision-language models represent a significant step toward AI systems that perceive and reason about the world as humans do. Understanding how they work—and where they fail—is essential for using them effectively.

