In [8]:
import os

# Set HF datasets cache
os.environ["HF_HOME"] = "/projectnb/ds598/admin/xthomas/sp2024_notebooks/discussion/tmp"

## Transformers and Attention Mechanism

![img](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*kf871smtQKXAf3dSYOlVPA.png)

![img](https://storrs.io/content/images/2021/08/Screen-Shot-2021-08-07-at-7.51.37-AM.png)

![img](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*7sy0KlRgyh3n0M0SH-6xqQ.gif)

![img](https://storrs.io/content/images/size/w1600/2021/08/image3--8-.png)

![img](https://storrs.io/content/images/size/w1600/2021/08/image8--1-.png)

<img src="https://storrs.io/content/images/size/w1600/2021/08/image7--2-.png" alt="Description of Image" width="800px">


<img src="https://storrs.io/content/images/size/w1600/2021/08/image4--2-.png" alt="Description of Image" width="800px">


<img src="https://storrs.io/content/images/size/w1600/2021/08/image6--3-.png" alt="Description of Image" width="800px">


<img src="https://storrs.io/content/images/size/w1600/2021/08/image5--3-.png" alt="Alt Text" width="800px">


![image](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*YE0dWWP7uzWIa5zmNN2xdA.png)

## Vision-Transformer

### Components of a Vision Transformer

The Vision Transformer (ViT) is based on the standard transformer architecture. The main difference is that it is applied to images rather than text, so its components are almost the same.

### Step 1: Split the image into image patches    

![title](https://cdn.sanity.io/images/vr8gru94/production/7a096efc8f3cc40849ee17a546dc0e685da2dc73-4237x1515.png)

### Step 2: Linear projection to patch embeddings

![title](https://cdn.sanity.io/images/vr8gru94/production/88ec9bd73a8c7c2d0d00098eebe1e1281e8c80e4-2193x1117.png)

The linear projection layer maps each patch array to a vector while maintaining its “physical dimensions”, meaning that similar image patches should be mapped to similar patch embeddings. The projection also gives the input the dimensionality that the transformer model expects.

Each patch is considered as a single token, and the image is flattened into a sequence of tokens.

The patches are flattened and linearly transformed. The linear transformation is applied to each patch independently, and the same linear transformation is applied to all patches.
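Steps 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the code of any particular ViT implementation: the image size (224), patch size (16), and embedding dimension (512) are assumed values, and the projection weights are random stand-ins for parameters that would normally be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes, chosen only for illustration
H = W = 224   # image height and width
C = 3         # colour channels
P = 16        # patch side length
D = 512       # embedding dimension expected by the transformer

image = rng.random((H, W, C))  # random stand-in for a real image

# Step 1: cut the image into non-overlapping P x P patches.
# (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C)
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)

# Flatten every patch into one vector: one row per patch (token).
tokens = patches.reshape(-1, P * P * C)

# Step 2: the same weight matrix and bias are applied to every patch.
W_proj = rng.normal(scale=0.02, size=(P * P * C, D))
b_proj = np.zeros(D)
patch_embeddings = tokens @ W_proj + b_proj

print(tokens.shape, patch_embeddings.shape)  # (196, 768) (196, 512)
```

Because the projection is a single matrix multiplication over all rows, every patch is transformed independently but with shared weights.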

### Step 3: Learnable Embeddings

One feature introduced to transformers with the popular BERT models was the use of a `[CLS]` (or “classification”) token. The `[CLS]` token was a “special token” prepended to every sentence fed into BERT.

<img src="https://cdn.sanity.io/images/vr8gru94/production/cc1a9b538be26c73a668540350e2485a046c2abb-2024x1309.png" alt="Description of Image" width="800px">


![title](https://cdn.sanity.io/images/vr8gru94/production/ef2026fc131e20c7fb0b0298ab88d5e365339514-4009x1821.png)

This `[CLS]` token is converted into a token embedding and passed through several encoding layers.

Two things make the `[CLS]` embedding special. First, it does not represent an actual token, meaning it begins as a “blank slate” for each sentence. Second, the final output from the `[CLS]` embedding is used as the input into a classification head during pretraining.

Using a “blank slate” token as the sole input to a classification head pushes the transformer to learn to encode a “general representation” of the entire sentence into that embedding. The model must do this to enable accurate classifier predictions.

ViT applies the same logic by adding a “learnable embedding”. This learnable embedding plays the same role as the `[CLS]` token used by BERT.

### Step 4: Positional Embeddings

Transformers do not have any default mechanism that considers the “order” of token or patch embeddings. Yet, order is essential. In language, the order of words can completely change their meaning.    

The same is true for images. If given a jumbled jigsaw set, it’s hard-to-impossible for a person to accurately predict what the complete puzzle represents. This applies to transformers too. We need a way of enabling the model to infer the order or position of the puzzle pieces.
We enable order with positional embeddings. For ViT, these positional embeddings are learned vectors with the same dimensionality as our patch embeddings.    
After creating the patch embeddings and prepending the “class” embedding, we sum them all with positional embeddings.
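As a rough sketch of this step, the snippet below prepends a class embedding to a set of patch embeddings and adds one positional embedding per position. The sizes (196 patches, 512 dimensions) are assumptions for illustration, and all three arrays are random stand-ins; in a real ViT the class and positional embeddings are parameters updated during training.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 196, 512  # assumed: 196 patches, 512-dimensional embeddings

# Stand-in for the output of the linear projection step.
patch_embeddings = rng.normal(size=(N, D))

# Learnable parameters (random here; training would update them).
cls_embedding = rng.normal(scale=0.02, size=(1, D))
pos_embeddings = rng.normal(scale=0.02, size=(N + 1, D))

# Prepend the class embedding, giving N + 1 positions in total.
sequence = np.concatenate([cls_embedding, patch_embeddings], axis=0)

# Element-wise sum with the positional embeddings.
encoder_input = sequence + pos_embeddings

print(encoder_input.shape)  # (197, 512)
```

The positional embeddings are added rather than concatenated, which is why they must share the dimensionality of the patch embeddings.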

**Note**: the positional embeddings in ViTs are typically learned, not fixed like the sinusoidal encodings in the original transformer model. During training, these embeddings become most similar to the embeddings of neighboring positions, particularly those sharing the same row or column:

<img src="https://cdn.sanity.io/images/vr8gru94/production/f29a1da461dbb154ce8bb2789962d20f8af65587-1911x1551.png" alt="Description of Image" width="800px">


The visualization indicates that positional embeddings are more similar to their immediate neighbors, particularly those within the same row or column. This makes intuitive sense, as adjacent patches in an image are more likely to be related to each other than patches that are further apart.

## Vision-Language Models

### What Are Vision-Language Models?

A vision-language model is a fusion of vision and natural language models. It ingests images and their respective textual descriptions as inputs and learns to associate the knowledge from the two modalities. The vision part of the model captures spatial features from the images, while the language model encodes information from the text.

![title](https://cdn.sanity.io/images/vr8gru94/production/a54a2f1fa0aeac03748c09df0fdfbb42aadc96b7-2430x1278.png)

Similar text and images are encoded to nearby points in a shared vector space, while dissimilar text and images are encoded far apart.

Both models “speak the same language” by encoding similar concepts in text and images into similar vectors. That means that the text “two dogs running across a frosty field” would output a vector similar to an image of two dogs running across a frosty field.

## CLIP

![title](https://cdn.sanity.io/images/vr8gru94/production/539716ea1571e459908c1fdc5a898fea239d8243-2803x1672.png)

CLIP consists of two models trained in parallel: a 12-layer text transformer for building text embeddings and a ResNet or vision transformer (ViT) for building image embeddings.

A text encoder and an image encoder have, by default, no understanding of one another. CLIP solves this with image-text contrastive pretraining, which trains the two encoders jointly.

Contrastive pretraining works by taking a (text, image) pair – where the text describes the image – and learning to encode the pairs as closely as possible in vector space.    
For this to work well, we also need negative pairs to provide a contrastive comparison. We need positive pairs that should output similar vectors and negative pairs that should output dissimilar vectors.

## Contrastive Learning

Contrastive learning is a technique that learns representations of data points by comparing them with one another. The method computes a similarity score between data instances and aims to minimize a contrastive loss. It’s most useful in semi-supervised learning, where only a few labeled samples guide the optimization process to label unseen data points.

![title](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*7xMtz23e_U1xfqVWYrsevA.png)

Positive pairs:

- (T1, I1)
- (T2, I2)

From these, we generate negative pairs by swapping the components:

- Negative pair 1: (T1, I2)
- Negative pair 2: (T2, I1)

The loss function aims to:

- Maximize similarity between (T1, I1) and (T2, I2)
- Minimize similarity between (T1, I2) and (T2, I1)
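One common way to express this objective is a symmetric cross-entropy over the batch similarity matrix, where the diagonal holds the positive pairs and everything else is a negative pair. The sketch below is an illustration under that assumption, not CLIP's training code: the function names are hypothetical, the temperature of 0.07 is an assumed value, and the 2-dimensional embeddings are made up.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric cross-entropy over a (text x image) similarity matrix."""
    # L2-normalise so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    # logits[a, b] = sim(T_a, I_b) / temperature
    logits = t @ i.T / temperature
    diag = np.arange(len(logits))
    # Each text should pick its own image, and each image its own text.
    loss_text_to_image = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_image_to_text = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_text_to_image + loss_image_to_text) / 2

# Made-up embeddings: row k of each array is meant to form pair (Tk, Ik).
text = np.array([[1.0, 0.0], [0.0, 1.0]])
image_matched = np.array([[0.9, 0.1], [0.1, 0.9]])
image_swapped = image_matched[::-1]  # pairs deliberately misaligned

loss_matched = contrastive_loss(text, image_matched)
loss_swapped = contrastive_loss(text, image_swapped)
print(loss_matched < loss_swapped)  # True
```

The loss is low when each positive pair scores higher than every negative pair in its row and column, and high when the pairing is wrong.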

![title](https://cdn.sanity.io/images/vr8gru94/production/d6868e6dae721512fed8f1287fc9ffe6b6a2cddd-2332x1342.png)

**Note**: A fundamental assumption is that there are no other positive pairs within a single batch. For example, we assume that “two dogs running across a frosty field” is only relevant to the image it is paired with. We assume there are no other texts or images with similar meanings.    

This assumption is reasonable because the datasets used for pretraining are diverse and large enough that two similar pairs rarely appear in the same batch, rarely enough to have little-to-no negative impact on pretraining performance.

### Zero-Shot Image Classification with CLIP

#### How does CLIP do zero-shot classification?

- We know that the image and text encoders each create a 512-dimensional vector, and that both vectors map to the same vector space. Considering this vector space alignment, what if we wrote the dataset classes as text sentences?

Given a task where we must identify whether a photo contains a car, bird, or cat, we could create and encode three text classes:

"a photo of a car" -> T1,  "a photo of a bird" -> T2,  "a photo of a cat" -> T3

Each of these "classes" is output from the text encoder as a vector: T1, T2, and T3, respectively. Given a photo of a cat, we encode it with the ViT model to create vector I1. When we compare these vectors using cosine similarity, we expect sim(T3, I1) to return the highest score.
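The comparison can be illustrated with a toy example. The 4-dimensional vectors below are made up for illustration and stand in for real encoder outputs; only the mechanics (cosine similarity followed by an argmax over the class prompts) reflect the procedure described above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

class_prompts = ["a photo of a car", "a photo of a bird", "a photo of a cat"]

# Made-up text vectors, standing in for text encoder outputs.
text_vectors = np.array([
    [0.9, 0.1, 0.0, 0.1],  # T1
    [0.1, 0.9, 0.1, 0.0],  # T2
    [0.0, 0.1, 0.9, 0.2],  # T3
])

# Made-up image vector, standing in for the encoded cat photo (I1).
image_vector = np.array([0.1, 0.0, 0.8, 0.3])

# Score the image against every class prompt and keep the best match.
scores = [cosine_similarity(t, image_vector) for t in text_vectors]
predicted = class_prompts[int(np.argmax(scores))]

print(predicted)  # a photo of a cat
```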

![title](https://cdn.sanity.io/images/vr8gru94/production/d9a6ebbc9a2f3334ec57a6b54d90155043c07595-1292x447.png)

#### Why bother with creating a sentence for each class?

- We format the one-word classes into sentences because we expect that CLIP saw more sentence-like text than isolated words during pretraining, so sentence prompts are closer to what the text encoder was trained on.

In [9]:
# import the imagenette dataset
from datasets import load_dataset

imagenette = load_dataset(
    'frgfm/imagenette',
    '320px',
    split='validation',
    revision="4d512db"
)
# show dataset info
imagenette

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


Dataset({
    features: ['image', 'label'],
    num_rows: 3925
})

In [10]:
# check labels in the dataset
set(imagenette['label'])

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

In [11]:
# labels names 
labels = imagenette.info.features['label'].names
labels

['tench',
 'English springer',
 'cassette player',
 'chain saw',
 'church',
 'French horn',
 'garbage truck',
 'gas pump',
 'golf ball',
 'parachute']

In [12]:
# generate sentences
clip_labels = [f"a photo of a {label}" for label in labels]
clip_labels

['a photo of a tench',
 'a photo of a English springer',
 'a photo of a cassette player',
 'a photo of a chain saw',
 'a photo of a church',
 'a photo of a French horn',
 'a photo of a garbage truck',
 'a photo of a gas pump',
 'a photo of a golf ball',
 'a photo of a parachute']

In [13]:
# initialization
from transformers import CLIPProcessor, CLIPModel

model_id = "openai/clip-vit-base-patch32"

processor = CLIPProcessor.from_pretrained(model_id, cache_dir="/projectnb/ds598/admin/xthomas/sp2024_notebooks/discussion/tmp")
model = CLIPModel.from_pretrained(model_id, cache_dir="/projectnb/ds598/admin/xthomas/sp2024_notebooks/discussion/tmp")

In [14]:
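# create label tokens
import torch

# `device` must exist before `.to(device)` is called below,
# and the model must sit on the same device as its inputs
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

label_tokens = processor(
    text=clip_labels,
    padding=True,
    images=None,
    return_tensors='pt'
).to(device)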

label_tokens['input_ids'][0][:10]

tensor([49406,   320,  1125,   539,   320,  1149,   634, 49407])

In [15]:
# encode tokens to sentence embeddings
label_emb = model.get_text_features(**label_tokens)
# detach from pytorch gradient computation
label_emb = label_emb.detach().cpu().numpy()
label_emb.shape

(10, 512)

In [16]:
import torch
import numpy as np

# if you have CUDA set it to the active device like this
device = "cuda" if torch.cuda.is_available() else "cpu"
# move the model to the device
model.to(device)

from tqdm.auto import tqdm

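preds = []
batch_size = 32

for i in tqdm(range(0, len(imagenette), batch_size)):
    i_end = min(i + batch_size, len(imagenette))
    # preprocess a batch of images into pixel tensors
    images = processor(
        text=None,
        images=imagenette[i:i_end]['image'],
        return_tensors='pt'
    )['pixel_values'].to(device)
    # encode the batch; gradients are not needed for inference
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=images)
    img_emb = img_emb.detach().cpu().numpy()
    # dot-product score of every image against every label embedding
    # (the embeddings are not L2-normalized, so this is not strictly cosine similarity)
    scores = np.dot(img_emb, label_emb.T)
    # predicted class = label with the highest score
    preds.extend(np.argmax(scores, axis=1))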

  0%|          | 0/123 [00:00<?, ?it/s]

In [17]:
true_preds = []
for i, label in enumerate(imagenette['label']):
    if label == preds[i]:
        true_preds.append(1)
    else:
        true_preds.append(0)

sum(true_preds) / len(true_preds)

0.965859872611465

#### Zero-Shot Accuracy: ~96.6%

## Object Localization

Prompt: `"A photo of a cat"`

<img src="https://cdn.sanity.io/images/vr8gru94/production/a8b3638cd76b8c92f728f71bb81b62bb584e4beb-1171x1784.png" width="292.75" height="446">


Slide a window across the image, get the similarity score between the prompt and each window, then weight the pixel values by those scores:
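A minimal sketch of this sliding-window idea follows. The function `localize` and its `window` and `stride` parameters are hypothetical, and `score_fn` is a placeholder: in practice it would return the CLIP similarity between a crop and the prompt, whereas here a simple brightness score is used so the example runs without a model.

```python
import numpy as np

def localize(image, score_fn, window=64, stride=32):
    """Weight each pixel by the mean score of the windows that cover it."""
    h, w = image.shape[:2]
    score_sum = np.zeros((h, w))
    counts = np.zeros((h, w))
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            crop = image[top:top + window, left:left + window]
            score = score_fn(crop)
            # Accumulate the window's score over every pixel it covers.
            score_sum[top:top + window, left:left + window] += score
            counts[top:top + window, left:left + window] += 1
    heat = score_sum / np.maximum(counts, 1)
    # Rescale to [0, 1] so the map can be used as a pixel weight.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return image * heat[..., None], heat

# Toy image: black background with a bright square in the bottom-right corner.
image = np.zeros((128, 128, 3))
image[64:, 64:] = 1.0

# Placeholder score: mean brightness instead of a CLIP similarity.
weighted, heat = localize(image, score_fn=lambda crop: float(crop.mean()))

print(heat[100, 100] > heat[10, 10])  # True
```

Regions covered by high-scoring windows keep their pixel values, while low-scoring regions are darkened, which is what produces the highlighted object in the images shown.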

![title](https://cdn.sanity.io/images/vr8gru94/production/a7a86913765c70dfa415ffbf28bdedaf49a89699-1706x913.png)

We can repeat the same but with the prompt `"A photo of a butterfly"` to return:

<img src="https://cdn.sanity.io/images/vr8gru94/production/df6a8d683c8d42ffc3dc8643506a91c4704fa348-894x1328.png" width="223.5" height="332">


<img src="https://cdn.sanity.io/images/vr8gru94/production/c01610aed767d1e42eb01575a71dfacc1d7f7097-409x578.png" width="223.5" height="332">

Ref: https://encord.com/blog/vision-language-models-guide/#:~:text=Vision%2Dlanguage%20models%20are%20a,visual%20semantics%20to%20textual%20representations.