
# Vision Transformers

Vision transformers didn’t pop out of a vacuum: before they were invented, there were RNNs with visual attention, and hybrid CNN–Transformer models. Let’s take a look at these ViT ancestors before we dive into some of the most influential ViTs.


## RNNs with Visual Attention

One of the first applications of attention mechanisms beyond NLP was to generate image captions using visual attention. Here, a convolutional neural network first processes the image and outputs feature maps, then a decoder RNN equipped with an attention mechanism generates the caption one token at a time.

The decoder uses an attention layer at each decoding step to focus on the most relevant part of the image. For example, when generating the word *“Frisbee”*, the model’s attention is concentrated on the Frisbee in the image.
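The attention step described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the original captioning model: the feature maps and decoder state are random stand-ins, and scaled dot-product scoring is used for brevity (early visual attention models typically used additive attention instead).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(42)

batch_size, channels, height, width = 2, 64, 7, 7
feature_maps = torch.randn(batch_size, channels, height, width)  # stand-in CNN output
decoder_state = torch.randn(batch_size, channels)                # stand-in RNN hidden state

# Treat each spatial location of the feature maps as one vector: [B, H*W, C]
locations = feature_maps.flatten(2).transpose(1, 2)

# Score every location against the current decoder state: [B, H*W]
scores = torch.bmm(locations, decoder_state.unsqueeze(2)).squeeze(2)
attn_weights = F.softmax(scores / channels ** 0.5, dim=1)

# Context vector fed to the decoder: attention-weighted average of locations, [B, C]
context = torch.bmm(attn_weights.unsqueeze(1), locations).squeeze(1)

# Reshaped to the feature-map grid, the weights form one heatmap per image
attn_map = attn_weights.view(batch_size, height, width)
```

Upsampling `attn_map` to the image resolution gives the kind of heatmap that shows where the model was "looking" when it produced a given word.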


## Explainability

A key benefit of attention mechanisms is **explainability**—the ability to understand what parts of the input led to a particular output.

For example, if a model labels a dog in the snow as a *wolf*, attention maps might reveal that the model focused heavily on the snow, suggesting it learned an incorrect correlation. This insight allows practitioners to fix the issue by rebalancing the training data.

In some domains, such as systems that decide whether to grant loans, explainability is not optional but a legal requirement.


Once transformers were introduced, they were quickly applied to visual tasks, often replacing RNNs while still relying on CNNs for feature extraction. Although transformers were involved, these models are usually not considered true vision transformers.

A notable example is DETR.


## DETR: A CNN–Transformer Hybrid for Object Detection

The Detection Transformer (DETR), introduced in 2020, combines a CNN backbone with an encoder–decoder transformer.

1. A CNN extracts feature maps from the image.
2. The feature maps are converted into a sequence of visual tokens.
3. An encoder–decoder transformer processes these tokens.
4. The output is a set of bounding box predictions.
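The four steps above can be sketched as follows. This is a heavily simplified toy, not the actual DETR implementation: the `TinyDETR` name and all sizes are hypothetical, a single strided convolution stands in for the CNN backbone, positional encodings are omitted, and the learned object queries and the extra "no object" class are details taken from the DETR design rather than from the text above.

```python
import torch
import torch.nn as nn

class TinyDETR(nn.Module):
    def __init__(self, d_model=64, num_queries=10, num_classes=5):
        super().__init__()
        # Stand-in backbone: one strided convolution instead of a real CNN
        self.backbone = nn.Conv2d(3, d_model, kernel_size=8, stride=8)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, dim_feedforward=128, batch_first=True)
        # One learned query per potential detection
        self.queries = nn.Parameter(torch.randn(num_queries, d_model) * 0.02)
        self.class_head = nn.Linear(d_model, num_classes + 1)  # +1 for "no object"
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, images):
        features = self.backbone(images)                  # 1. [B, D, H', W']
        tokens = features.flatten(2).transpose(1, 2)      # 2. [B, H' * W', D]
        queries = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        decoded = self.transformer(tokens, queries)       # 3. [B, Q, D]
        class_logits = self.class_head(decoded)           # 4. [B, Q, num_classes + 1]
        boxes = self.box_head(decoded).sigmoid()          #    [B, Q, 4], values in [0, 1]
        return class_logits, boxes

model = TinyDETR().eval()
with torch.no_grad():
    class_logits, boxes = model(torch.randn(2, 3, 64, 64))
```

Each of the 10 queries produces one class prediction and one box, so the model outputs a fixed-size set of candidate detections per image.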

This raised a natural question: can we remove the CNN entirely?


## The Original Vision Transformer (ViT)

In October 2020, Google introduced the first CNN-free vision transformer: **ViT**.

The idea is simple:
- Split the image into fixed-size patches (e.g., 16 × 16).
- Flatten each patch and project it into an embedding vector.
- Treat the sequence of patch embeddings like word embeddings.
- Add positional embeddings.
- Feed everything into a standard encoder-only transformer.


For a 224 × 224 image with 3 color channels:
- The image is split into 14 × 14 = 196 patches
- Each patch is flattened into a 768-dimensional vector
- A learnable class token is prepended
- The class token output is used for classification (BERT-style)
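The arithmetic above can be checked directly. The following sketch cuts a random stand-in image into non-overlapping 16 × 16 patches with `Tensor.unfold` and flattens each one; it only illustrates the shapes and is not how the patches are embedded in practice.

```python
import torch

img_size, patch_size, channels = 224, 16, 3
image = torch.randn(channels, img_size, img_size)  # stand-in image

grid = img_size // patch_size                      # 14 patches per side
# Cut along height, then along width: [C, 14, 14, 16, 16]
patches = image.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
# Put the grid dimensions first and flatten each patch: [196, 3 * 16 * 16]
patches = patches.permute(1, 2, 0, 3, 4).reshape(grid * grid, -1)

num_patches, patch_dim = patches.shape             # 196 patches of 768 values each
seq_len = num_patches + 1                          # 197 once the class token is prepended
```

So each flattened patch has 3 × 16 × 16 = 768 values, and the encoder sees a sequence of 197 tokens per image.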

ViT achieved state-of-the-art results, but only when pretrained on massive datasets: it lacks CNN inductive biases such as locality and translation invariance, so it has to learn these regularities from data.


### Inductive Bias

An inductive bias is an assumption built into a model’s architecture.

Examples:
- CNNs assume locality and translation invariance
- RNNs assume sequential order
- Transformers assume very little structure: relationships between tokens are learned via attention

More inductive bias → less data needed  
Less inductive bias → more data required
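To make the idea concrete, here is a small check of one CNN inductive bias. It is only a sketch under specific assumptions: circular padding is used so that shifting the input shifts the convolution output exactly (strictly speaking this property is translation equivariance), and an untrained fully connected layer serves as a contrast with no such built-in assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
image = torch.randn(1, 1, 16, 16)
shift = (3, 5)

def roll(t):
    # Shift the image with wrap-around along height and width
    return torch.roll(t, shifts=shift, dims=(2, 3))

conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular")
linear = nn.Linear(16 * 16, 16 * 16)

def apply_linear(t):
    # Fully connected layer over all pixels, reshaped back to an image
    return linear(t.flatten(1)).view(1, 1, 16, 16)

with torch.no_grad():
    # Convolution: shifting then convolving equals convolving then shifting
    conv_is_equivariant = torch.allclose(conv(roll(image)), roll(conv(image)), atol=1e-5)
    # Dense layer: nothing in the architecture enforces this property
    linear_is_equivariant = torch.allclose(
        apply_linear(roll(image)), roll(apply_linear(image)), atol=1e-5)
```

The convolution gets this behavior for free from weight sharing, whereas a model without the bias would have to learn it from many shifted examples.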


Now let’s implement a Vision Transformer from scratch using PyTorch.


In [None]:
import torch
import torch.nn as nn


In [None]:
class PatchEmbedding(nn.Module):
    def __init__(self, in_channels, embed_dim, patch_size=16):
        super().__init__()
        self.conv2d = nn.Conv2d(
            in_channels, embed_dim,
            kernel_size=patch_size, stride=patch_size
        )

    def forward(self, x):
        x = self.conv2d(x)            # [B, E, H', W']
        x = x.flatten(2)              # [B, E, H' * W']
        return x.transpose(1, 2)      # [B, L, E]


In [None]:
class ViT(nn.Module):
    def __init__(
        self,
        img_size=224,
        patch_size=16,
        in_channels=3,
        num_classes=1000,
        embed_dim=768,
        depth=12,
        num_heads=12,
        ff_dim=3072,
        dropout=0.1,
    ):
        super().__init__()

        self.patch_embed = PatchEmbedding(
            in_channels, embed_dim, patch_size
        )

        num_patches = (img_size // patch_size) ** 2

        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.pos_embed = nn.Parameter(
            torch.randn(1, num_patches + 1, embed_dim) * 0.02
        )

        self.dropout = nn.Dropout(dropout)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=ff_dim,
            dropout=dropout,
            activation="gelu",
            batch_first=True
        )

        self.encoder = nn.TransformerEncoder(encoder_layer, depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

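    def forward(self, x):
        x = self.patch_embed(x)                         # [B, L, E]
        cls = self.cls_token.expand(x.size(0), -1, -1)  # [B, 1, E]
        x = torch.cat((cls, x), dim=1)                  # [B, L + 1, E]
        x = x + self.pos_embed                          # broadcast over the batch
        x = self.dropout(x)
        x = self.encoder(x)                             # [B, L + 1, E]
        x = self.norm(x[:, 0])                          # class token only: [B, E]
        return self.head(x)                             # [B, num_classes]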


In [None]:
vit = ViT()
batch = torch.randn(4, 3, 224, 224)
logits = vit(batch)
logits.shape  # torch.Size([4, 1000])


The model can now be trained using cross-entropy loss. However, training ViTs from scratch is expensive, so pretrained models are typically fine-tuned instead.


In [None]:
from datasets import load_dataset

pets = load_dataset("timm/oxford-iiit-pet")


In [None]:
from transformers import ViTForImageClassification, AutoImageProcessor

model_id = "google/vit-base-patch16-224-in21k"

vit_model = ViTForImageClassification.from_pretrained(
    model_id, num_labels=37
)
vit_processor = AutoImageProcessor.from_pretrained(
    model_id, use_fast=True
)


In [None]:
def vit_collate_fn(batch):
    images = [x["image"] for x in batch]
    labels = [x["label"] for x in batch]
    inputs = vit_processor(
        images, return_tensors="pt", do_convert_rgb=True
    )
    inputs["labels"] = torch.tensor(labels)
    return inputs


In [None]:
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    "my_pets_vit",
    per_device_train_batch_size=16,
    eval_strategy="epoch",
    num_train_epochs=3,
    remove_unused_columns=False
)

trainer = Trainer(
    model=vit_model,
    args=args,
    data_collator=vit_collate_fn,
    train_dataset=pets["train"],
    eval_dataset=pets["test"]
)

trainer.train()


# Multimodal Transformers

Humans are multimodal creatures: we perceive the world through multiple senses—sight, hearing, smell, taste, touch, sense of balance, proprioception (i.e., sense of body position), and several others—and we act upon the world through movement, speech, writing, etc.

Each of these modalities can be considered at a very low level (e.g., sound waves) or at a higher level (e.g., words, intonations, melody). Importantly, modalities are heterogeneous: one modality may be continuous while another is discrete, one may be temporal while the other is spatial, one may be high-resolution (e.g., 48 kHz audio) while the other is not (e.g., text), one may be noisy while the other is clean, and so on.


## Fusion and Alignment

Multimodal machine learning requires designing models that can handle heterogeneous data and capture interactions between modalities.

There are two main challenges:

**Fusion**  
Combining different modalities, often by encoding them into the same representation space.

**Alignment**  
Discovering relationships between modalities (e.g., aligning spoken words with timestamps, grounding text queries in images).

Common multimodal tasks include image captioning, image retrieval, VQA, STT, TTS, embodied AI, and more.


## Why Transformers Work Well for Multimodality

Transformers can ingest almost any modality once it is tokenized into sequences:
- Text → words/subwords
- Images → patches
- Audio/video → short clips

Once embedded, modalities can be fused via concatenation, summation, or fusion encoders.  
Multi-head attention enables both intra-modal and cross-modal reasoning, which helps address alignment.
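As an illustration of fusion by concatenation, the sketch below joins text and image token embeddings into a single sequence and runs it through a small encoder. All tensors are random stand-ins, the sizes are arbitrary, and the learned modality embeddings are one common option assumed here rather than something every model uses.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 32
text_tokens = torch.randn(2, 5, embed_dim)   # stand-in text embeddings [B, T, E]
image_tokens = torch.randn(2, 9, embed_dim)  # stand-in patch embeddings [B, P, E]

# Learned modality embeddings tell the encoder where each token comes from
modality_embed = nn.Embedding(2, embed_dim)
text_tokens = text_tokens + modality_embed(torch.tensor(0))
image_tokens = image_tokens + modality_embed(torch.tensor(1))

# Fusion by concatenation along the sequence dimension: [B, T + P, E]
fused_input = torch.cat([text_tokens, image_tokens], dim=1)

layer = nn.TransformerEncoderLayer(
    embed_dim, nhead=4, dim_feedforward=64, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2).eval()

with torch.no_grad():
    fused = encoder(fused_input)  # self-attention mixes tokens within and across modalities
```

Because every token can attend to every other token, the same self-attention layers handle both intra-modal and cross-modal interactions.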


## VideoBERT (2019)

VideoBERT is a BERT-style multimodal transformer handling text + video.

Key ideas:
- Video clips encoded using a pretrained 3D CNN (S3D)
- Visual features clustered via hierarchical k-means into a discrete vocabulary
- Visual tokens treated like word tokens
- Pretraining tasks:
  - Masked token prediction
  - Linguistic–visual alignment (binary classification)


### Video Tokenization

- Videos split into 1.5s clips (30 frames)
- Each clip → 1024D vector
- Hierarchical k-means → 20,736 visual tokens
- Significant compression, but key semantics preserved
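The vocabulary size follows from the shape of the cluster tree. Assuming 4 levels of hierarchical k-means with 12 clusters per level (the configuration reported in the VideoBERT paper, as far as I can tell), a clip's token is simply the path it takes down the tree. The helper name below is hypothetical.

```python
branching, levels = 12, 4
vocab_size = branching ** levels  # number of leaves in the cluster tree

def path_to_token_id(path, branching=12):
    """Convert a path of cluster indices (one per level) into a single token ID."""
    token_id = 0
    for cluster_index in path:
        token_id = token_id * branching + cluster_index
    return token_id

first_token = path_to_token_id([0, 0, 0, 0])
last_token = path_to_token_id([11, 11, 11, 11])
```

Here `vocab_size` is 20,736, and the token IDs range from 0 to 20,735, so each 1024-dimensional clip vector is replaced by a single integer.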


## ViLBERT (2019)

ViLBERT introduced a **dual-stream** architecture:
- Separate text encoder and visual encoder
- Cross-modal interaction via **co-attention**

Motivation:
- Modalities require different processing depths
- Visual features often already high-level
- Avoid damaging pretrained BERT weights


### Co-Attention

In co-attention layers:
- Queries from one modality attend to keys/values of the other
- Enables bidirectional information flow
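A co-attention step can be sketched with two standard `nn.MultiheadAttention` modules, one per direction. This is only an illustration with random stand-in features and arbitrary sizes; the residual connections, normalization, and feedforward sublayers of a full block are left out.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_heads = 32, 4
text = torch.randn(2, 6, embed_dim)      # stand-in text stream [B, T, E]
regions = torch.randn(2, 10, embed_dim)  # stand-in visual stream [B, R, E]

text_attends_to_image = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
image_attends_to_text = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

with torch.no_grad():
    # Text queries look at visual keys/values
    new_text, text_weights = text_attends_to_image(
        query=text, key=regions, value=regions)
    # Visual queries look at text keys/values
    new_regions, region_weights = image_attends_to_text(
        query=regions, key=text, value=text)
```

Each stream keeps its own sequence length (`new_text` has 6 tokens, `new_regions` has 10), but every output token now carries information from the other modality.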


## CLIP (2021)

CLIP (Contrastive Language–Image Pretraining) uses:
- A text encoder
- A vision encoder
- Contrastive loss over large image–caption datasets

Goal:
Matching image–text pairs have similar embeddings; mismatched pairs are pushed apart.
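This goal can be written as a symmetric cross-entropy loss over a batch of image-caption pairs. The function below is a sketch, not CLIP's actual training code: its name is hypothetical, and the temperature is fixed at an assumed value of 0.07 whereas CLIP learns it during training.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_embeds, text_embeds, temperature=0.07):
    # Cosine similarity between every image and every caption in the batch
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature  # [N, N]

    # The i-th image matches the i-th caption, so the targets are the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.T, targets)
    return (loss_image_to_text + loss_text_to_image) / 2

torch.manual_seed(0)
embeds = torch.randn(8, 16)
aligned_loss = clip_style_loss(embeds, embeds)                   # perfectly matched pairs
shuffled_loss = clip_style_loss(embeds, embeds.roll(1, dims=0))  # every pair mismatched
```

The loss is low when each image is closest to its own caption and high when the pairs are shuffled, which is exactly the pressure that pulls matching pairs together and pushes mismatched pairs apart.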


In [None]:
from transformers import pipeline

model_id = "openai/clip-vit-base-patch32"
clip_pipeline = pipeline(
    task="zero-shot-image-classification",
    model=model_id,
    device_map="auto",
    dtype="auto"
)

candidate_labels = ["cricket", "ladybug", "spider"]
image_url = "https://homl.info/ladybug"

results = clip_pipeline(
    image_url,
    candidate_labels=candidate_labels,
    hypothesis_template="This is a photo of a {}."
)

results


CLIP enables powerful zero-shot classification by comparing image embeddings with text embeddings generated from class names or prompts.


In [None]:
from transformers import CLIPProcessor, CLIPModel
import PIL.Image
import urllib.request
import torch

processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)

image = PIL.Image.open(
    urllib.request.urlopen(image_url)
).convert("RGB")

captions = [f"This is a photo of a {label}." for label in candidate_labels]

inputs = processor(
    text=captions,
    images=[image],
    return_tensors="pt",
    padding=True
)

with torch.no_grad():
    outputs = model(**inputs)

image_features = outputs.image_embeds
text_features = outputs.text_embeds

# The returned embeddings are L2-normalized, so this gives cosine similarities
similarities = image_features @ text_features.T
similarities


## DALL·E

- **DALL·E (2021)**: GPT-like autoregressive image token generator
- **DALL·E 2 (2022)**: CLIP + diffusion
- **DALL·E 3 (2023)**: Diffusion + LLM prompt rewriting

Each version improved compositional accuracy and image fidelity.


## Perceiver (2021)

Perceiver processes raw inputs directly (pixels, audio frames, characters) using:

- Cross-attention from inputs → latent tokens
- Latent bottleneck to avoid quadratic scaling
- Modality-agnostic design


### Latent Bottleneck

Instead of attending over millions of inputs:
- A fixed number of latent tokens attend to the inputs (queries come from the latents, keys and values from the inputs)
- Complexity scales linearly with input size
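In the Perceiver, the queries come from a small learned latent array while the keys and values come from the inputs. The following sketch shows a single such cross-attention step with random stand-in inputs and arbitrary sizes; the real model interleaves it with self-attention layers over the latents.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, num_latents = 32, 16
inputs = torch.randn(2, 5000, embed_dim)  # stand-in for 5,000 embedded input elements
latents = nn.Parameter(torch.randn(num_latents, embed_dim) * 0.02)

cross_attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
queries = latents.unsqueeze(0).expand(inputs.size(0), -1, -1)  # [B, N, E]

with torch.no_grad():
    # Queries: latents; keys and values: inputs
    summary, weights = cross_attn(query=queries, key=inputs, value=inputs)
```

The attention matrix has shape `[B, 16, 5000]`, so doubling the number of inputs doubles the cost instead of quadrupling it, and all later layers only operate on the 16 latent tokens.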


## Flamingo (2022)

Flamingo is a few-shot vision-language model built from:
- Frozen vision encoder
- Frozen decoder-only LLM
- Perceiver-based visual resampler
- Gated cross-attention modules

Enables open-ended visual dialogue and reasoning.
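The gating idea can be sketched as a cross-attention layer whose output is scaled by `tanh` of a learnable scalar initialized to zero. This is a simplified illustration with a hypothetical class name and arbitrary sizes; the real Flamingo blocks also include a gated feedforward sublayer, which is omitted here.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        self.norm = nn.LayerNorm(embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0: the gate starts closed

    def forward(self, text_hidden, visual_tokens):
        # Text hidden states query the visual tokens
        attended, _ = self.cross_attn(
            self.norm(text_hidden), visual_tokens, visual_tokens)
        # The gate controls how much visual information is mixed in
        return text_hidden + torch.tanh(self.gate) * attended

torch.manual_seed(0)
layer = GatedCrossAttention(embed_dim=32, num_heads=4)
text_hidden = torch.randn(2, 7, 32)    # stand-in LLM hidden states
visual_tokens = torch.randn(2, 5, 32)  # stand-in resampled visual tokens
with torch.no_grad():
    output = layer(text_hidden, visual_tokens)
```

At initialization the layer is an identity function, so the frozen LLM behaves exactly as before; the gate then opens gradually during training as the visual signal becomes useful.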


## BLIP and BLIP-2

BLIP:
- Unified vision–language pretraining
- Image–text contrastive, matching, and LM objectives

BLIP-2:
- Reuses frozen vision model + frozen LLM
- Introduces the Q-Former
- Stronger performance with fewer trainable parameters


# Other Multimodal Models

We’ve covered quite a few multimodal models, with very different architectures and pretraining techniques, but of course there are many others. Below is a quick overview of some of the most notable multimodal models released in recent years.


## Notable Multimodal Models

**LayoutLM (Microsoft, Dec. 2019)**  
Document understanding based on text, vision, and document layout.  
Widely used for forms, invoices, and scanned documents.  
LayoutLMv3 was released in April 2022.

**GLIP (Microsoft, Dec. 2021)**  
A vision-language model for visual grounding and object detection.  
Unifies object detection and phrase grounding.  
GLIP-2 was released in 2022.

**Stable Diffusion (Stability AI, Dec. 2021)**  
A powerful text-to-image diffusion model.  
Open-weight and highly influential in generative image modeling.

**OFA (Alibaba, Feb. 2022)**  
“One For All”: a unified vision-language pretraining framework.  
Handles captioning, VQA, classification, and more with a single architecture.

**CoCa (Google, May 2022)**  
A vision-language model pretrained with both contrastive and captioning objectives.  
Strong at zero-shot and generative tasks.  
Influenced later models such as PaLI-X and Flamingo-2.


## Large-Scale Multimodal Foundation Models

**PaLI (Google, Sep. 2022)**  
Multilingual multimodal models for VQA, captioning, and reasoning.  
Strong zero-shot and cross-lingual performance.  
Follow-ups include PaLI-X and PaLI-3 (2023), and PaliGemma (May 2024).

**Kosmos-1 (Microsoft, Feb. 2023)**  
Vision-language model with strong visual grounding capabilities.  
Extended by Kosmos-2 and Kosmos-2.5 later in 2023.

**PaLM-E (Google, Mar. 2023)**  
Extends PaLM with visual inputs and embodied sensor data.  
A decoder-only LLM outputs textual action commands that are executed by a robot via a downstream system.

**LLaVA (H. Liu et al., Apr. 2023)**  
One of the strongest open-source vision-language chat models.  
Combines a vision encoder with a large language model.

**ImageBind (Meta, May 2023)**  
Extends CLIP-style contrastive learning to six modalities:  
image, text, audio, IMU, depth, and thermal data.

**RT-2 (DeepMind, Jul. 2023)**  
A vision-language-action model trained on large-scale robotic instruction data.  
Capable of reasoning and robotic control.


## Speech, Video, and Real-Time Multimodal Models

**SeamlessM4T (Meta, Aug. 2023)**  
A single unified model for:
- speech-to-text
- speech-to-speech
- text-to-speech
- text-to-text translation  
Supports nearly 100 languages.

**Qwen-VL (Alibaba, Sep. 2023)**  
An open vision-language model family (7B–72B).  
Became one of the strongest open multimodal baselines.  
Followed by:
- Qwen2-VL (Aug. 2024)
- Qwen3-Omni (Sep. 2025), expanding to video and audio and reaching trillion-parameter scale.

**Fuyu (Adept AI, Oct. 2023)**  
Processes interleaved image and text inputs in real time using a unified transformer.


## Generative and Interactive Multimodal Systems

**EMO (Alibaba, Feb. 2024)**  
Given:
- an image of a person
- an audio clip (speech or singing)

The model generates a video of the person synchronized with the audio.  
EMO-2 was released in January 2025.

**GLaMM (H. Rasheed et al., Jun. 2024)**  
A visual dialogue model that outputs:
- natural language responses
- object segmentation masks

**LaViDa (UCLA, Panasonic, Adobe, Salesforce, May 2025)**  
A family of open diffusion-based vision-language models.


## Tip

Short links for all models discussed in this chapter are available at:

https://homl.info/<modelname>

Use lowercase names without hyphens, for example:
https://homl.info/qwen2vl


## Commercial Multimodal Models

Several commercial multimodal models do not publicly disclose their full architectures, including:

- GPT-4.1 and Sora (OpenAI)
- Gemini 2.5 Pro (Google)
- Veo-3 (DeepMind)
- Claude Opus 4 (Anthropic)

Access usually requires an account, subscription, or API key.


## Example: Querying Gemini 2.5 via API

The following example shows how to query Gemini 2.5 using the `google-genai` library.
You must first obtain an API key from Google AI Studio.


In [None]:
from google import genai

gemini_api_key = [...]  # load from secrets, file, or environment variable
gemini_client = genai.Client(api_key=gemini_api_key)

cats_photo = gemini_client.files.upload(file="my_cats_photo.jpg")

question = "What animal and how many? Format: [animal, number]"

response = gemini_client.models.generate_content(
    model="gemini-2.5-flash",  # or "gemini-2.5-pro"
    contents=[cats_photo, question]
)

print(response.text)  # Example output: "[cat, 2]"


This example assumes that:
- the `google-genai` library is installed (it is preinstalled on Colab)
- the file `my_cats_photo.jpg` exists in the working directory


## Chapter Wrap-Up

This wraps up the chapter on multimodal transformers.

Transformers can now:
- read and write
- see and hear
- reason across text, images, audio, video, and sensor data

In the next chapter, we will explore advanced techniques for speeding up and scaling transformers.

As Daft Punk put it: *harder, better, faster, stronger.*
