# Introduction

I work with LLMs every day, mostly with text inputs but sometimes with images too. I keep spending time on the internals of the transformer architecture used in decoder-style LLMs, but most of that study has focused on text. I was always a little curious, and a little confused, about how images get passed into LLMs; I just never took the time to dig in. Now I am finally getting around to it. This blog post is a set of notes I'm taking as I learn more about the topic. I write about things so I can understand them better, and my future self is always grateful.

The main motivation for this post was Sebastian Raschka's recent blog post [Understanding Multimodal LLMs](https://magazine.sebastianraschka.com/p/understanding-multimodal-llms?utm_source=post-email-title&publication_id=1174659&post_id=151078631&utm_campaign=email-post-title&isFreemail=true&r=1urfra&triedRedirect=true&utm_medium=email). I highly recommend reading it. I'm not going to regurgitate his content here; instead I will focus on a subset of the topics he discusses, starting with what he refers to as *Method A: Unified Embedding Decoder Architecture*. I may go deeper on other topics in other blog posts, but for now this is a good start for me.

# Recap of Transformer Architecture for Text

We first need an understanding of the transformer architecture used in decoder-style LLMs.
Earlier this year I wrote my first blog post with some notes on the [transformer architecture](https://drchrislevy.github.io/posts/basic_transformer_notes/transformers.html). To get the most out of this post, it helps to have some familiarity with that material. Here is a quick reminder of the basic concepts.


We will load one of the [SmolLM2 LLM models](https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9) created by the Hugging Face team. This is not the instruction fine-tuned model, but rather the base pre-trained model.


In [1]:
# | warning: false

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")
model



LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm

The input to the transformer model is a **sequence of embeddings**. In the case of text inputs, the input first gets converted into a sequence of tokens. Then each token is converted into an embedding vector.

Here is the conversion of the input text to token ids.

In [2]:
inputs = tokenizer(["The dog jumped over the"], return_tensors="pt")
input_ids = inputs.input_ids
print(inputs)
print(input_ids.shape)
print(input_ids)

{'input_ids': tensor([[  504,  2767, 25437,   690,   260]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}
torch.Size([1, 5])
tensor([[  504,  2767, 25437,   690,   260]])


Each token id has an associated embedding vector. 
In the case of this SmolLM2 model, the embedding dimension is 576
and there are 49152 tokens in the vocabulary.

In [3]:
embedding_lkp = model.model.embed_tokens
print(embedding_lkp.weight.shape)

torch.Size([49152, 576])


We can get the token embeddings by passing the token ids to the embedding lookup table.
Each row of the returned tensor, ignoring the batch dimension, is a vector representation of a token. 

In [4]:
embedding_vectors = embedding_lkp(input_ids)
print(embedding_vectors.shape)
print(embedding_vectors)

torch.Size([1, 5, 576])
tensor([[[ 0.1177,  0.0199, -0.0942,  ...,  0.0405,  0.1182,  0.0762],
         [-0.0356,  0.1338,  0.0050,  ...,  0.0996,  0.0791,  0.0791],
         [-0.0093,  0.0122,  0.0197,  ...,  0.0613, -0.1021, -0.0923],
         [-0.0339,  0.0825, -0.1562,  ...,  0.0349,  0.1172, -0.0752],
         [-0.1514,  0.0181, -0.0742,  ...,  0.0430,  0.0986,  0.0664]]],
       grad_fn=<EmbeddingBackward0>)
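As a minimal sketch of what this lookup does, the embedding layer can be thought of as a table with one row per vocabulary entry, and the forward pass as plain row indexing. The tiny vocabulary, the embedding dimension, and all values below are made up for illustration; they are not the SmolLM2 weights, and the `embed` helper is a hypothetical stand-in for the real layer.

```python
# Toy embedding table: 6 "tokens" in the vocabulary, embedding dimension 3.
# All values are invented for illustration.
embedding_table = [
    [0.1, 0.2, 0.3],  # token id 0
    [0.4, 0.5, 0.6],  # token id 1
    [0.7, 0.8, 0.9],  # token id 2
    [1.0, 1.1, 1.2],  # token id 3
    [1.3, 1.4, 1.5],  # token id 4
    [1.6, 1.7, 1.8],  # token id 5
]


def embed(batch_of_token_ids, table):
    """Map (batch_size, sequence_length) ids to (batch_size, sequence_length, embedding_dim) vectors."""
    return [[table[token_id] for token_id in sequence] for sequence in batch_of_token_ids]


toy_input_ids = [[4, 0, 2, 2]]  # a batch with one sequence of four tokens
toy_embeddings = embed(toy_input_ids, embedding_table)

# Shape bookkeeping: (batch_size, sequence_length, embedding_dim)
toy_shape = (len(toy_embeddings), len(toy_embeddings[0]), len(toy_embeddings[0][0]))
print(toy_shape)  # (1, 4, 3)
```

Repeated token ids simply select the same row twice, which is why identical tokens start out with identical vectors before the transformer layers mix in context.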


It is this sequence of embedding vectors that flows through the transformer layers.
The input shape to the transformer layers is `(batch_size, sequence_length, embedding_dim)`
and the output shape is `(batch_size, sequence_length, hidden_size)`.
You can get the last hidden state by calling `model.model`, which runs the decoder stack
without the final classification head (`lm_head`).

In [5]:
last_hidden_state = model.model(**inputs).last_hidden_state
print(last_hidden_state.shape)
last_hidden_state

torch.Size([1, 5, 576])


tensor([[[ 0.3476,  0.7350,  0.1515,  ..., -0.0168,  0.8690,  1.1515],
         [ 0.0334,  0.6300,  0.7636,  ..., -0.6490,  0.0102, -0.2357],
         [-1.0193,  0.9439,  0.1579,  ..., -0.3536, -2.4959,  1.6141],
         [-2.0151, -0.3402, -0.6598,  ...,  1.7252, -1.6691,  1.4883],
         [-0.6080, -0.9785, -0.8922,  ...,  3.4061, -0.1228, -0.6294]]],
       grad_fn=<MulBackward0>)

This final transformer output is then passed to the classification head (`model.lm_head`).
The classification head is a single linear layer that maps each position's hidden state to logits over the vocabulary for the next token.
The output shape of the classification head is `(batch_size, sequence_length, vocab_size)`.


In [6]:
logits = model.lm_head(last_hidden_state)
assert torch.allclose(logits, model(**inputs).logits)
logits.shape

torch.Size([1, 5, 49152])
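To make the shape change concrete, here is a small sketch of a bias-free linear head applied independently at every position. The numbers are invented and the `linear_no_bias` helper is hypothetical; the weight is laid out as `(vocab_size, hidden_size)`, which is the layout `torch.nn.Linear` uses, and the absence of a bias term is an assumption of this sketch.

```python
def linear_no_bias(hidden_states, weight):
    """Apply y = x @ W^T to every position.

    hidden_states: (sequence_length, hidden_size)
    weight:        (vocab_size, hidden_size)
    returns:       (sequence_length, vocab_size)
    """
    return [
        [sum(h * w for h, w in zip(position, vocab_row)) for vocab_row in weight]
        for position in hidden_states
    ]


toy_hidden = [[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]]  # 3 positions, hidden_size 2
toy_head_weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]  # vocab_size 4

toy_head_logits = linear_no_bias(toy_hidden, toy_head_weight)
print(len(toy_head_logits), len(toy_head_logits[0]))  # 3 4
print(toy_head_logits[2])  # [2.0, 1.0, 3.0, -1.5]
```

Each position gets its own row of `vocab_size` scores, and no information moves between positions in this step; all the mixing across positions has already happened inside the transformer layers.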

Next we convert the logits to probabilities using the softmax function.
While this is useful for visualization and inference, during training we typically
use the raw logits directly with CrossEntropyLoss for better numerical stability.
Note that we get logits (and after softmax, probabilities) for the next token at **each position** in the sequence.
During inference, we typically only care about the last position's values since that's where we'll generate the next token.

In [7]:
probs = F.softmax(logits, dim=-1)
probs.shape

torch.Size([1, 5, 49152])
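To make the training remark concrete, here is a dependency-free sketch of next-token cross-entropy computed straight from logits using the log-sum-exp trick, rather than taking the log of a separately computed softmax. The toy logits and token ids are invented, and the helper names are hypothetical. The one-position shift between logits and targets is the usual next-token setup, not something specific to SmolLM2.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return [x - log_sum_exp for x in logits]


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy where position t is scored against token t + 1."""
    losses = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]
        losses.append(-log_softmax(logits_per_position[t])[target])
    return sum(losses) / len(losses)


toy_token_ids = [2, 0, 1]
toy_train_logits = [
    [0.0, 0.0, 0.0],     # after token 2: uniform guess, target is 0
    [0.0, 1000.0, 0.0],  # after token 0: confident and correct, target is 1
    [5.0, 1.0, 1.0],     # after token 1: no next token to score, so it is ignored
]

toy_loss = next_token_loss(toy_train_logits, toy_token_ids)
print(round(toy_loss, 4))  # 0.5493, i.e. (log(3) + 0) / 2

# A naive softmax would fail on the 1000.0 logit (math.exp(1000.0) raises
# OverflowError), while the max-shifted version above stays finite.
```

The last position contributes nothing to the loss because there is no following token to compare against, which is why the targets are one element shorter than the logits.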

The shape confirms that a single forward pass yields a next-token distribution at **each position** in the sequence, not only the last one.
Here we'll look at the top 5 predictions at each position.

In [8]:
K = 5  # Number of top predictions to show
top_probs, top_indices = torch.topk(probs[0], k=K, dim=-1)  # Remove batch dim and get top K

# Convert token indices to actual tokens and print predictions for each position
input_text = tokenizer.decode(input_ids[0])  # Original text
print(f"Original text: {input_text}\n")

for pos in range(len(input_ids[0])):
    token = tokenizer.decode(input_ids[0][pos])
    print(f"After token: '{token}'")
    print(f"Top {K} predicted next tokens:")
    for prob, idx in zip(top_probs[pos], top_indices[pos]):
        predicted_token = tokenizer.decode(idx)
        print(f"  {predicted_token}: {prob:.3f}")
    print()

Original text: The dog jumped over the

After token: 'The'
Top 5 predicted next tokens:
   first: 0.022
   same: 0.015
   most: 0.012
   world: 0.011
   last: 0.006

After token: ' dog'
Top 5 predicted next tokens:
   was: 0.063
   is: 0.062
  's: 0.047
  ’: 0.039
  ,: 0.031

After token: ' jumped'
Top 5 predicted next tokens:
   up: 0.200
   on: 0.135
   into: 0.068
   over: 0.063
   out: 0.062

After token: ' over'
Top 5 predicted next tokens:
   the: 0.793
   a: 0.032
   it: 0.030
   and: 0.017
   him: 0.013

After token: ' the'
Top 5 predicted next tokens:
   fence: 0.408
   wall: 0.029
   top: 0.017
   bridge: 0.017
   table: 0.013
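The table above shows predictions for every prefix, but generation only uses the last row. Text generation then reduces to a loop: run the model, take the logits at the final position, pick a token, append it, repeat. The sketch below uses greedy selection and a stand-in `toy_model` function (hypothetical, defined over a five-word vocabulary) in place of a real LLM forward pass.

```python
TOY_VOCAB = ["the", "dog", "jumped", "over", "fence"]


def toy_model(token_ids):
    """Stand-in for an LLM forward pass: returns one row of logits per position.

    This hypothetical model always favors the token whose id is one higher than
    the current token, wrapping around at the end of the vocabulary.
    """
    logits = []
    for token_id in token_ids:
        row = [0.0] * len(TOY_VOCAB)
        row[(token_id + 1) % len(TOY_VOCAB)] = 5.0
        logits.append(row)
    return logits  # shape: (sequence_length, vocab_size)


def greedy_generate(token_ids, max_new_tokens):
    """Repeatedly append the highest-scoring token at the last position."""
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        last_position_logits = toy_model(token_ids)[-1]  # ignore earlier positions
        next_id = max(range(len(last_position_logits)), key=last_position_logits.__getitem__)
        token_ids.append(next_id)
    return token_ids


generated = greedy_generate([0, 1], max_new_tokens=3)
print(" ".join(TOY_VOCAB[i] for i in generated))  # the dog jumped over fence
```

Real implementations usually cache intermediate results between steps and often sample from the distribution instead of always taking the maximum; both are left out here to keep the loop readable.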



In summary, the input to the transformer layers is a sequence of embeddings, of shape `(batch_size, sequence_length, embedding_dim)`.
The transformer layers process this sequence and return a new sequence of hidden states, of shape `(batch_size, sequence_length, hidden_size)`.
It is often the case that the hidden size is the same as the embedding dimension (both are 576 here), but this is not a requirement. Even if you forget the details of the inner workings of the transformer layers (self-attention, etc.), this is a useful mental model to keep in mind. The final classifier layer returns logits of shape `(batch_size, sequence_length, vocab_size)`, and a softmax over the last dimension turns them into a probability distribution over the next token at each position.

Note that the explanation above focused on auto-regressive decoders, which is what most text-generating LLMs are. There are also pure encoder models. Here are some of the key differences between decoder and encoder models:

- Attention Masking:
    - Decoder: Uses causal (or triangular) masked attention to ensure that each position can only attend to previous positions, enforcing an autoregressive quality. This prevents the model from "seeing the future," which is essential for tasks like text generation.
    - Encoder: Doesn't require causal masking because it processes the entire input sequence at once. Each token can attend to every other token in the sequence, providing a comprehensive context.

- Purpose and Data Flow:
    - Decoder: Designed for autoregressive tasks, where each output token is generated one by one, conditioning on previously generated tokens. This step-by-step generation is central to tasks like text generation, where each token "builds" upon the preceding tokens.
    - Encoder: Designed to encode the entire input sequence into a contextualized representation in one shot, capturing relationships across the whole sequence. It’s typically used in understanding or embedding tasks and classification tasks, where the model needs a holistic view of the input.
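The masking difference in the list above can be sketched in a few lines. The helper below (hypothetical, plain Python) turns raw attention scores into weights with a row-wise softmax, optionally hiding future positions the way a decoder does. The scores are all set to zero so that the effect of the mask is easy to read off.

```python
import math


def attention_weights(scores, causal):
    """Row-wise softmax over attention scores, optionally with a causal mask.

    scores[i][j] is the raw score of query position i attending to key position j.
    """
    weights = []
    for i, row in enumerate(scores):
        # With a causal mask, position i may only look at positions 0..i.
        visible = [s if (not causal or j <= i) else -math.inf for j, s in enumerate(row)]
        max_score = max(visible)
        exps = [math.exp(s - max_score) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


toy_scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # identical scores for clarity

decoder_weights = attention_weights(toy_scores, causal=True)
encoder_weights = attention_weights(toy_scores, causal=False)

print(decoder_weights[0])  # [1.0, 0.0, 0.0]
print(decoder_weights[1])  # [0.5, 0.5, 0.0]
print(encoder_weights[0])  # every entry is 1/3
```

With the causal mask the weight matrix is lower triangular, so the first position can only attend to itself; without it every position spreads attention over the whole sequence.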

Now we'll move on to how we can pass images into decoder-style LLMs, alongside text, to generate text outputs.
If you remember that the input to the transformer layers is a sequence of embeddings, then passing in images is no different.
We just need to convert the images into a sequence of embeddings suitable for the LLM decoder model.
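A rough sketch of that idea, in the spirit of the unified embedding approach: image features from some encoder are projected to the LLM's embedding dimension and then concatenated with the text token embeddings along the sequence axis. Everything here is a toy assumption, including the dimensions, the values, the `project` helper, and the choice to place the image embeddings before the text.

```python
def project(vectors, weight):
    """Bias-free linear projection: maps each vector from len(weight[0]) to len(weight) dims."""
    return [[sum(v * w for v, w in zip(vec, row)) for row in weight] for vec in vectors]


# Pretend an image encoder produced 3 patch embeddings of dimension 4.
toy_image_features = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

# Pretend the LLM uses embedding dimension 2 and the text prompt has 2 tokens.
toy_text_embeddings = [[0.1, 0.2], [0.3, 0.4]]

# Hypothetical projector weight of shape (llm_dim, image_dim) = (2, 4).
toy_projector = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]

image_embeddings = project(toy_image_features, toy_projector)  # (3, 2)
combined_sequence = image_embeddings + toy_text_embeddings     # (3 + 2, 2)

print(len(combined_sequence), len(combined_sequence[0]))  # 5 2
```

From the transformer's point of view the result is just a longer sequence of vectors with the expected embedding dimension, which is why nothing inside the decoder layers has to change.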

**TODO: A SIMPLE DIAGRAM HERE OF (B,T,C) ---> Transformer(B,T,C) ---> (B,T,C)**

# Introducing Images into LLMs

Now that we've revisited how transformers handle text inputs, let's explore how images can be incorporated into LLMs. Remember, the key idea is that transformers process sequences of embeddings. For text, these are token embeddings. For images, we'll need to convert them into embeddings as well.

## Image Embeddings with Vision Transformers (ViT)

To convert an image into a sequence of embeddings, we can use a Vision Transformer (ViT) image encoder.
For now we will treat this image encoder as a black box and go into more detail on how it works later on.
All we need to know at this point is that it takes an input image and produces a sequence of embeddings.

Why use an Image Encoder? Just like text needs to be tokenized and embedded before being processed by an LLM, images need to be transformed into a suitable format. The image encoder serves this purpose by translating visual information into a sequence of numerical embeddings.
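To give a feel for where the sequence comes from, a ViT-style encoder typically starts by cutting the image into fixed-size patches and flattening each patch into a vector, so a grid of patches becomes a sequence. The sketch below does this for a tiny single-channel 4x4 "image" with 2x2 patches. The `patchify` helper is hypothetical; real encoders work on RGB images, learn a projection for each patch, and add position information, all of which is left out here.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping square patches.

    Patches are returned in row-major order: left to right, then top to bottom.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]

toy_patches = patchify(toy_image, patch_size=2)
print(len(toy_patches))  # 4 patches, so a sequence length of 4
print(toy_patches[0])    # [0, 1, 4, 5]
print(toy_patches[3])    # [10, 11, 14, 15]
```

Each flattened patch plays the role that a token id plays for text: it is the raw unit that gets mapped to an embedding vector before any attention happens.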

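### SigLIP

SigLIP is an image-text model in the same family as CLIP: it has an image encoder and a text encoder, trained so that matching image-text pairs score highly. Its main difference from CLIP is that it is trained with a sigmoid loss instead of a softmax-based contrastive loss. Here we load it and score an image against two candidate captions.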

In [12]:
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("google/siglip-so400m-patch14-384")
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of 2 cats", "a photo of 2 dogs"]
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image) # these are the probabilities
print(f"{probs[0][0]:.1%} that image 0 is '{texts[0]}'")


40.5% that image 0 is 'a photo of 2 cats'
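The sigmoid in the cell above scores each image-text pair independently, so the probabilities for different captions need not sum to 1; a softmax over the same logits would force the captions to compete with each other. Here is a small sketch with made-up logits (not the values produced by the model above):

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Invented image-text logits for two captions of the same image.
toy_pair_logits = [-0.4, -6.0]

sigmoid_probs = [sigmoid(x) for x in toy_pair_logits]
softmax_probs = softmax(toy_pair_logits)

print([round(p, 3) for p in sigmoid_probs])  # [0.401, 0.002]
print([round(p, 3) for p in softmax_probs])  # [0.996, 0.004]
```

With the sigmoid, a caption that is only a moderate match gets a moderate probability even when every alternative is much worse, whereas the softmax reports it as the near-certain winner among the candidates offered.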


In [13]:
model

SiglipModel(
  (text_model): SiglipTextTransformer(
    (embeddings): SiglipTextEmbeddings(
      (token_embedding): Embedding(32000, 1152)
      (position_embedding): Embedding(64, 1152)
    )
    (encoder): SiglipEncoder(
      (layers): ModuleList(
        (0-26): 27 x SiglipEncoderLayer(
          (self_attn): SiglipSdpaAttention(
            (k_proj): Linear(in_features=1152, out_features=1152, bias=True)
            (v_proj): Linear(in_features=1152, out_features=1152, bias=True)
            (q_proj): Linear(in_features=1152, out_features=1152, bias=True)
            (out_proj): Linear(in_features=1152, out_features=1152, bias=True)
          )
          (layer_norm1): LayerNorm((1152,), eps=1e-06, elementwise_affine=True)
          (mlp): SiglipMLP(
            (activation_fn): PytorchGELUTanh()
            (fc1): Linear(in_features=1152, out_features=4304, bias=True)
            (fc2): Linear(in_features=4304, out_features=1152, bias=True)
          )
          (layer_norm2): L

# Random Notes


Zero-shot models

- CLIP: text encoder and image encoder; scores go through a softmax, so the probabilities sum to 1
- SigLIP (newer/better?): uses a sigmoid instead of a softmax
- OWL-ViT / OWLv2: CLIP architecture adapted for object detection
- Segment Anything Model (SAM): segmentation masks
- OWLSAM

Vision Language Models

- LLaVA - one of the first - see https://huggingface.co/blog/vlms for more details. Images there about pre-training and post-training.
- PaliGemma - SigLIP and Gemma model
- Qwen2-VL - and 2.5? - 675M ViT image encoder, MLP projector, Qwen2 LLM decoder
- Pixtral - 400M vision encoder, 12B Mistral Nemo text decoder
- Molmo - CLIP encoder with different decoders
- [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)

Newer Advancements 
[See here YT time stamp](https://youtu.be/_TlhKHTgWjY?t=2186)


# Resources

## Multimodal LLMs

[Understanding Multimodal LLMs](https://magazine.sebastianraschka.com/p/understanding-multimodal-llms?utm_source=post-email-title&publication_id=1174659&post_id=151078631&utm_campaign=email-post-title&isFreemail=true&r=1urfra&triedRedirect=true&utm_medium=email)

[AI Visions Live | Merve Noyan | Open-source Multimodality](https://www.youtube.com/watch?v=_TlhKHTgWjY)

[Vision Language Models Explained](https://huggingface.co/blog/vlms)


[PaliGemma – Google's Cutting-Edge Open Vision Language Model](https://huggingface.co/blog/paligemma)

[PaliGemma: A versatile 3B VLM for transfer: Paper](https://arxiv.org/pdf/2407.07726)

[Gemma explained: PaliGemma architecture: Google for Developers](https://developers.googleblog.com/en/gemma-explained-paligemma-architecture/)

[Awesome-Multimodal-Large-Language-Models](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models?tab=readme-ov-file#awesome-papers)

[Vision Arena](https://huggingface.co/spaces/WildVision/vision-arena)

[smol-vision](https://github.com/merveenoyan/smol-vision)


## Vision Transformer (ViT)

[An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/pdf/2010.11929)

[Vision Transformer (ViT: Hugging Face)](https://huggingface.co/docs/transformers/en/model_doc/vit)

[Fine-Tune ViT for Image Classification with 🤗 Transformers](https://huggingface.co/blog/fine-tune-vit)

[SigLIP Model Card](https://huggingface.co/google/siglip-so400m-patch14-384)