# Week 14

Deep Learning Models and Embeddings

In [None]:
!wget -q https://github.com/PSAM-5020-2025F-A/5020-utils/raw/main/src/nn_utils.py

!wget -qO- https://github.com/PSAM-5020-2025F-A/5020-utils/releases/latest/download/bob-ross.tar.gz | tar xz
!wget -qO- https://github.com/PSAM-5020-2025F-A/5020-utils/releases/latest/download/lfw.tar.gz | tar xz
!wget -qO- https://github.com/PSAM-5020-2025F-A/5020-utils/releases/latest/download/metfaces.tar.gz | tar xz

In [None]:
from numpy import argsort, asarray

from os import listdir
from PIL import Image as PImage

from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

from torch import nn, Tensor, no_grad, cuda
from torch import float32 as t_float32, uint8 as t_uint8

from torchvision.models import resnet50, ResNet50_Weights
from torchvision.transforms import v2

from nn_utils import get_num_params

DEVICE = "cuda" if cuda.is_available() else "cpu"

## Review

### CNNs:

These are the networks that use convolution kernels to extract location-independent features from images.

<!-- <img src="./imgs/cnn_layers.jpg" height="320px"/> -->
<img src="https://i.postimg.cc/rpdq7DSd/cnn-layers.jpg" height="320px"/>

### ResNet:

[ResNet](https://arxiv.org/abs/1512.03385) is a specific CNN architecture. It combines convolution, pooling and residual layers to enable deep networks for visual tasks.

The [pre-trained ResNet](https://pytorch.org/hub/pytorch_vision_resnet/) models in the `PyTorch` library were trained on the [ImageNet](https://image-net.org/download.php) dataset, which contains $1\text{,}281\text{,}167$ training images classified into $1\text{,}000$ classes.

<!-- <img src="./imgs/resnet34_01.jpg" width="900px" /> -->
<img src="https://i.postimg.cc/hP20Rn9D/resnet34-01.jpg" width="900px" />

## Embeddings

### ResNet without a head:

Due to the diversity of the millions of images used in training, these pre-trained `ResNet` models are capable of detecting very specific patterns and features. The final layers of the network, right before the final, fully-connected, classification layer, contain very dense representations of the content and style of the images being passed through the network.

<!-- <img src="./imgs/resnet_embed.jpg" width="900px" /> -->
<img src="https://i.postimg.cc/W1v1shXh/resnet-embed.jpg" width="900px" />

We can instantiate a `ResNet`, remove its classification layer and use the outputs of its final neurons as encoded features. This is what is referred to as _embeddings_: the high dimensional information of an image gets _embedded_ into a lower-dimensional representation. Instead of $500 \times 500$ pixels of color information, we get $2048$ features of visual information.
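
To put those numbers side by side, here is a small arithmetic sketch. The $500 \times 500$ size and the $2048$ features come from the text above; the $3$ color channels are an assumption about the input being RGB.

```python
# Size of a raw image versus its ResNet50 embedding.
# Assumption: the image is RGB, so it has 3 values per pixel.
height, width, channels = 500, 500, 3

raw_values = height * width * channels   # one number per pixel per channel
embedding_values = 2048                  # length of the headless ResNet50 output

ratio = raw_values / embedding_values

print(f"raw values:       {raw_values}")
print(f"embedding values: {embedding_values}")
print(f"reduction factor: {ratio:.1f}x")
```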

### Instantiating a model

We'll use the pre-trained `ResNet50` model from the `PyTorch` library and replace its final, fully-connected, layer (`model.fc`) with an `Identity` layer, which does nothing except pass its input through to its output unchanged.

The `model.eval()` method puts the model in evaluation mode. It tells `PyTorch` that we're not training the model, just using it, so layers that behave differently during training, like `Dropout` or `BatchNorm` layers, switch to their inference behavior.

In [None]:
model = resnet50(weights=ResNet50_Weights.DEFAULT).to(DEVICE)
model.fc = nn.Identity()
model.eval()

get_num_params(model)

### `ResNet` pre-processing

Like in `WK13`, we have to pre-process our data so it looks like the data that was used during training.

This is as if someone ran `fit()` on the training data, and now we have to run `transform()` to get the data into the same units, size and shape as the training data. But there's no pre-fabricated `transform()` function, so we have to put together a pre-processing routine using the `Compose` functionality of `torchvision`, the `PyTorch` image library.

From https://pytorch.org/hub/pytorch_vision_resnet/:

_All pre-trained models expect input images normalized in the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]._
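
As a worked example of the recipe quoted above, the sketch below normalizes a single pixel by hand. The pixel values are made up for illustration; the mean and std values are the ones from the quote.

```python
# Normalize one hypothetical RGB pixel the way the quote describes.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

pixel = [255, 128, 0]  # made-up 8-bit RGB values

# Step 1: load the values into the range [0, 1]
scaled = [v / 255 for v in pixel]

# Step 2: subtract the per-channel mean and divide by the per-channel std
normalized = [(s - m) / sd for s, m, sd in zip(scaled, MEAN, STD)]

print([round(v, 3) for v in normalized])
```

After this step the values are no longer confined to $[0, 1]$: bright channels end up positive and dark channels negative.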

The following `PyTorch` transformations will do this for us.

In [None]:
res_transforms = v2.Compose([
  v2.ToDtype(t_uint8),
  v2.Resize(224),
  v2.ToDtype(t_float32, scale=True),
  v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

### Load Data

Let's load the data from `./data/image/bob-ross/`. This dataset contains $400$ images, covering most of the paintings Bob Ross made on his TV show between $1983$ and $1994$.

The images all have the same dimensions, so we could load them into a `DataFrame` and then do the processing, but it is a little easier to just use a for loop to iterate over all of the files and:
- Open each image as a `PIL` image
- Put pixels into a `Tensor` object
- Pre-process the image using the transformations defined above (size, normalize, etc)
- Pass the pre-processed image through the model and save the resulting embeddings as a normal `Python` list

Some observations about the code:
- `asarray()`: shortcut to extract pixels from a `PIL` image as an array that keeps the image's $2D$ layout (height x width, with the color channels as the last axis)
- `permute()`: moves the color-channel axis ahead of the height and width axes, giving $3$ single-channel layers instead of $1$ RGB layer; required by CNNs
- `out[0]`: `PyTorch` models are designed to analyze multiple inputs at once, `[0]` is the location of our result
- `tolist()`: moves the resulting embedding from the GPU to the CPU and converts it to a regular `Python` list
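
The axis re-ordering done by `permute()` can be hard to picture. The sketch below performs the same kind of re-ordering on a tiny, made-up $2 \times 2$ image stored as nested `Python` lists. It leaves out the batch axis, so it goes from (height, width, channel) to (channel, height, width); the variable names are illustrative.

```python
# A made-up 2 x 2 RGB image in (height, width, channel) order.
img_hwc = [
  [[255, 0, 0], [0, 255, 0]],      # row 0: red pixel, green pixel
  [[0, 0, 255], [255, 255, 255]],  # row 1: blue pixel, white pixel
]

n_rows = len(img_hwc)
n_cols = len(img_hwc[0])
n_channels = len(img_hwc[0][0])

# Re-order to (channel, height, width): one full 2 x 2 grid per color channel.
img_chw = [
  [[img_hwc[r][c][ch] for c in range(n_cols)] for r in range(n_rows)]
  for ch in range(n_channels)
]

print(img_chw[0])  # the red channel as a 2 x 2 grid
```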

In [None]:
IMG_DIR = "./data/image/bob-ross"
fnames = sorted([f for f in listdir(IMG_DIR) if f.endswith("jpg")])

img_embeddings = []

for f in fnames:
  img = PImage.open(f"{IMG_DIR}/{f}")
  img_t = Tensor(asarray([img])).permute(0,3,1,2)  # 1 x c x h x w
  img_t = res_transforms(img_t).to(DEVICE)
  out = model(img_t)
  img_embeddings.append(out[0].tolist())

In [None]:
print(len(img_embeddings))
print(len(img_embeddings[0]))

### Search

Ok. Now we have $2048$ features for each of our images.

Since these features encode dense visual information about the images, we can use them to navigate our data in different ways.

For example, we can use clustering to explore image groups, or use specific image embeddings to find similar images.

The process for the latter involves using:
- `euclidean_distances()` to compute embedding distances between every possible pair of images.
- `argsort()` to order the images by their distances to a reference image

For example, let's say a dataset of $4$ images has the following $3$-dimensional embeddings:

||embedding &nbsp;&nbsp;&nbsp;&nbsp;|
|-|-|
|img0|$\left[2.0, 5.0, 3.0\right]$|
|img1|$\left[1.0, 4.0, 6.0\right]$|
|img2|$\left[8.0, 2.0, 1.0\right]$|
|img3|$\left[1.0, 6.0, 1.0\right]$|

Using `euclidean_distances()` to compute the pairwise distances between image embeddings gives:

||img0|img1|img2|img3|
|-|-|-|-|-|
|**img0**|$0.00$|$3.32$|$7.00$|$2.45$|
|**img1**|$3.32$|$0.00$|$8.83$|$5.39$|
|**img2**|$7.00$|$8.83$|$0.00$|$8.06$|
|**img3**|$2.45$|$5.39$|$8.06$|$0.00$|

And `argsort()` gives us the indexes of the columns that would sort each row by distance values:

||1st|2nd|3rd|4th|
|-|-|-|-|-|
|**img0**|$0$|$3$|$1$|$2$|
|**img1**|$1$|$0$|$3$|$2$|
|**img2**|$2$|$0$|$3$|$1$|
|**img3**|$3$|$0$|$1$|$2$|

The `bob-ross` dataset has $400$ images and we're using $2048$-dimensional embeddings, but the idea is the same.

If we want to see what images are the most similar to image $10$, we can look in row $10$ of the sorted indexes: the very first value there will be the index of image $10$ itself, and the other indexes are sorted by how similar their corresponding image is to image $10$.
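
The toy tables above can be checked with a few lines of plain `Python`. This is only a sketch: `math.dist()` and `sorted()` stand in for `euclidean_distances()` and `argsort()`, and the variable names are illustrative.

```python
from math import dist

# The 4 toy embeddings from the table above.
embeddings = [
  [2.0, 5.0, 3.0],
  [1.0, 4.0, 6.0],
  [8.0, 2.0, 1.0],
  [1.0, 6.0, 1.0],
]

# Pairwise Euclidean distances: dists[i][j] is the distance between img i and img j.
dists = [[dist(a, b) for b in embeddings] for a in embeddings]

# For each row, the column indexes ordered from nearest to farthest.
sorted_idxs = [sorted(range(len(row)), key=lambda j: row[j]) for row in dists]

src_idx = 0
# Position 0 is the query image itself (distance 0), so skip it.
neighbors = sorted_idxs[src_idx][1:]

print(sorted_idxs[src_idx])
print(neighbors)
```

Skipping the first position is a design choice: it is useful when the query comes from the same dataset being searched, and unnecessary when the query is an outside image.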

In [None]:
src_idx = 10
dists = euclidean_distances(img_embeddings)
sorted_idxs = argsort(dists[src_idx])

for idx in sorted_idxs[:5]:
  img = PImage.open(f"{IMG_DIR}/{fnames[idx]}")
  img.thumbnail((256, 256))
  display(img)

### Repeat

We can repeat this exercise, but use other images as queries for our search.

For example, we can search for Bob Ross paintings that are most similar to landscape paintings or photographs from other artists.

These are just some examples, but feel free to explore other possibilities:


<img src="https://samuelearp.com/wp-content/uploads/2023/10/IMG_1512-scaled.jpeg" height="200px"><br>
[url](https://samuelearp.com/wp-content/uploads/2023/10/IMG_1512-scaled.jpeg)

<img src="https://i0.wp.com/inesepogagallery.com/wp-content/uploads/1-Old-grey-barn-in-meadow-framed-acrylic-painting.jpg" height="200px"><br>
[url](https://i0.wp.com/inesepogagallery.com/wp-content/uploads/1-Old-grey-barn-in-meadow-framed-acrylic-painting.jpg)

<img src="https://posterjack.ca/cdn/shop/articles/landscape_photography_tips_featured_image.jpg?v=1563408049&width=2048" height="200px"><br>
[url](https://posterjack.ca/cdn/shop/articles/landscape_photography_tips_featured_image.jpg?v=1563408049&width=2048)

<img src="https://preview.redd.it/ansel-adams-the-tetons-and-the-snake-river-1942-grand-teton-v0-jfoc6jdjzpt81.jpg?width=1080&crop=smart&auto=webp&s=f687f44a04706b0571d8b30fb1f0ef8dcf32691e" height="200px"><br>
[url](https://preview.redd.it/ansel-adams-the-tetons-and-the-snake-river-1942-grand-teton-v0-jfoc6jdjzpt81.jpg?width=1080&crop=smart&auto=webp&s=f687f44a04706b0571d8b30fb1f0ef8dcf32691e)

In [None]:
# TODO: Set up and perform search using non-Bob-Ross images

## CLIP

Embeddings are super useful and CNNs were the first networks that really allowed us to do this kind of feature extraction by leveraging learned characteristics of massive datasets. People even used CNNs on non-image data, by coming up with clever transformations to encode different types of data into image-like representations.

This started to change around $2017$ when a new type of network architecture was proposed for text translation models. By $2020$, _transformer_ networks had been adapted to work with any kind of input, not just text. 

`CLIP` is an example of a contrastive, transformer-based, model. Contrastive here means that it was trained on inputs from two modalities (images and their text descriptions) at the same time, learning to give matching image-text pairs similar embeddings and mismatched pairs dissimilar ones. This allows us to create text and image embeddings using the same _units_. Embedding values for the word _dog_ will be similar to embedding values for images of dogs.
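
As a rough illustration of what a contrastive objective rewards, the sketch below computes a simplified, `CLIP`-style symmetric loss on made-up $2D$ embeddings. It is not the actual training code: the function names, the toy vectors and the fixed `temperature` value are all assumptions made for this example.

```python
from math import exp, log, sqrt

def normalize(v):
  # Scale a vector to length 1 so dot products become cosine similarities.
  length = sqrt(sum(x * x for x in v))
  return [x / length for x in v]

def cross_entropy(logits, target):
  # -log(softmax(logits)[target]), computed in a numerically stable way.
  top = max(logits)
  log_sum = top + log(sum(exp(x - top) for x in logits))
  return log_sum - logits[target]

def contrastive_loss(img_embs, txt_embs, temperature=0.07):
  imgs = [normalize(v) for v in img_embs]
  txts = [normalize(v) for v in txt_embs]
  # sims[i][j]: scaled cosine similarity between image i and text j.
  sims = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts] for img in imgs]
  n = len(sims)
  # Each image should pick its own text (rows) ...
  img_loss = sum(cross_entropy(sims[i], i) for i in range(n)) / n
  # ... and each text should pick its own image (columns).
  txt_loss = sum(cross_entropy([sims[i][j] for i in range(n)], j) for j in range(n)) / n
  return (img_loss + txt_loss) / 2

# Made-up embeddings for 2 image/text pairs.
img_embs = [[1.0, 0.0], [0.0, 1.0]]
txt_matched = [[0.9, 0.1], [0.1, 0.9]]  # text i points the same way as image i
txt_swapped = [[0.1, 0.9], [0.9, 0.1]]  # text i points the same way as the other image

print(contrastive_loss(img_embs, txt_matched))
print(contrastive_loss(img_embs, txt_swapped))
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairs are mixed up, which is what pushes both modalities into a shared space during training.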

We can access pre-trained `CLIP` models using the `transformers` library. This is a library developed on top of `PyTorch` to facilitate the use of pre-trained transformer models.

We start by creating pre-processor and model objects from the same training instance:

In [None]:
from transformers import CLIPProcessor, CLIPModel

CLIP_MODEL_NAME = "openai/clip-vit-large-patch14"
DEVICE = "cuda" if cuda.is_available() else "cpu"

clip_processor = CLIPProcessor.from_pretrained(CLIP_MODEL_NAME)
clip_model = CLIPModel.from_pretrained(CLIP_MODEL_NAME).to(DEVICE)

### Embed

This will look familiar, and perhaps even simpler, since the `CLIP` pre-processor is already built for us and we can just use it directly on `PIL` images.

The `no_grad()` context manager below is another way to tell `PyTorch` that we're not training a network, so it can skip the gradient bookkeeping that training needs, which saves memory and time.

In [None]:
IMG_DIR = "./data/image/bob-ross"
fnames = sorted([f for f in listdir(IMG_DIR) if f.endswith("jpg")])

img = PImage.open(f"{IMG_DIR}/{fnames[0]}")

img_t = clip_processor(images=img, return_tensors="pt", padding=True).to(DEVICE)

with no_grad():
  clip_embedding = clip_model.get_image_features(**img_t)[0]

clip_embedding.shape

Our `CLIP` embeddings are $768$-dimensional.

Let's compute `CLIP` embeddings for all of the images in the `bob-ross` dataset.

In [None]:
IMG_DIR = "./data/image/bob-ross"
fnames = sorted([f for f in listdir(IMG_DIR) if f.endswith("jpg")])

img_embeddings = []

for f in fnames:
  img = PImage.open(f"{IMG_DIR}/{f}")
  img_t = clip_processor(images=img, return_tensors="pt", padding=True).to(DEVICE)
  with no_grad():
    out = clip_model.get_image_features(**img_t)
    img_embeddings.append(out[0].tolist())

As before, we can use `euclidean_distances()` and `argsort()` to find similar images in our dataset.
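
`cosine_distances()` was imported at the top of the notebook but has not been used yet. As a quick refresher on how it differs from Euclidean distance, the sketch below computes both by hand on made-up $2D$ vectors. The `cosine_distance()` helper is written just for this example; it follows the usual definition of $1$ minus the cosine similarity.

```python
from math import dist, sqrt

def cosine_distance(a, b):
  # 1 - cosine similarity: depends only on the angle between a and b.
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = sqrt(sum(x * x for x in a))
  norm_b = sqrt(sum(y * y for y in b))
  return 1 - dot / (norm_a * norm_b)

query = [1.0, 2.0]
same_direction = [3.0, 6.0]  # the query scaled by 3
close_by = [2.0, 1.0]        # near the query, but pointing elsewhere

print(dist(query, same_direction), cosine_distance(query, same_direction))
print(dist(query, close_by), cosine_distance(query, close_by))
```

Euclidean distance ranks `close_by` as the nearer vector, while cosine distance ranks `same_direction` as nearer because it ignores vector length. Whether that changes search results on real embeddings depends on how much the embedding lengths vary.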

In [None]:
src_idx = 10
dists = euclidean_distances(img_embeddings)
sorted_idxs = argsort(dists[src_idx])

for idx in sorted_idxs[:5]:
  img = PImage.open(f"{IMG_DIR}/{fnames[idx]}")
  img.thumbnail((256, 256))
  display(img)

# TODO: does cosine_distances() change results?

### But wait ! There's more !

So... `CLIP` is roughly $20$ times larger than our `ResNet50`, takes longer to run and the embeddings are "only" $768$ values...

What did we get?

Since this is a contrastive language-image model, we can encode text and use text embeddings to navigate images.

The process is similar. We pre-process our string and then get its embedding from the model.

In [None]:
txt = "barn"

txt_t = clip_processor(text=txt, padding="max_length", max_length=64, return_tensors="pt").to(DEVICE)

with no_grad():
  txt_embedding = clip_model.get_text_features(**txt_t)[0].tolist()

len(txt_embedding)

And now we can order our images by their distance from the text embedding:

In [None]:
dists = euclidean_distances([txt_embedding], img_embeddings)
sorted_idxs = argsort(dists[0])

for idx in sorted_idxs[:5]:
  img = PImage.open(f"{IMG_DIR}/{fnames[idx]}")
  img.thumbnail((256, 256))
  display(img)

### Any text

Text of any length gets embedded into the same number of features, so we can even try to search using full sentences and more specific terms.

In [None]:
txt = "cottage in the snow"

txt_t = clip_processor(text=txt, padding="max_length", max_length=64, return_tensors="pt").to(DEVICE)

with no_grad():
  txt_embedding = clip_model.get_text_features(**txt_t)[0].tolist()

len(txt_embedding)

In [None]:
dists = euclidean_distances([txt_embedding], img_embeddings)
sorted_idxs = argsort(dists[0])

for idx in sorted_idxs[:5]:
  img = PImage.open(f"{IMG_DIR}/{fnames[idx]}")
  img.thumbnail((256, 256))
  display(img)

### Embedding arithmetic (part 1)

The above worked ok, but some of the images didn't have a cottage, or snow.

This could be because the phrase "_cottage in the snow_" is nudging the model towards some more specific concepts and distancing it from the space of _paintings_.

One way we can embed multiple terms without steering the embedding towards some more specific concepts is by adding individual embeddings.

Instead of embedding "_cottage in the snow_", we'll embed the terms _cottage_ and _snow_ separately and add them together before searching for images.
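
Adding embeddings just means adding them position by position. The sketch below shows this on made-up $3D$ vectors; the `combine()` helper and its optional weights are hypothetical, introduced only for this example.

```python
# Made-up 3D embeddings; real CLIP embeddings have 768 values.
cottage = [0.9, 0.1, 0.0]
snow = [0.0, 0.2, 0.8]

def combine(a, b, weight_a=1.0, weight_b=1.0):
  # Element-wise weighted sum of two equal-length embeddings.
  return [weight_a * x + weight_b * y for x, y in zip(a, b)]

plain_sum = combine(cottage, snow)                # both terms count equally
more_snow = combine(cottage, snow, weight_b=2.0)  # the second term counts double

print(plain_sum)
print(more_snow)
```

The result has the same length as the inputs, so it can be used as a search query exactly like a single embedding.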

The following cell pre-processes and embeds a list of words, like we did for the list of images above:

In [None]:
txt = ["cottage", "snow"]

txt_embeddings = []

for t in txt:
  txt_t = clip_processor(text=t, padding="max_length", max_length=64, return_tensors="pt").to(DEVICE)
  with no_grad():
    out = clip_model.get_text_features(**txt_t)
    txt_embeddings.append(out[0].tolist())

print(f"txt_embeddings shape: ({len(txt_embeddings)}, {len(txt_embeddings[0])})")

We can now add the terms from the individual embeddings, and even add a bit of extra importance to _snow_ by multiplying its embedding terms by $2$, before performing the search:

In [None]:
txt_embedding = [t0 + 2 * t1 for t0,t1 in zip(txt_embeddings[0], txt_embeddings[1])]

dists = euclidean_distances([txt_embedding], img_embeddings)
sorted_idxs = argsort(dists[0])

for idx in sorted_idxs[:5]:
  img = PImage.open(f"{IMG_DIR}/{fnames[idx]}")
  img.thumbnail((256, 256))
  display(img)

### Zero-shot classification

Instead of finding images that are similar to a given text, we can change our code slightly and use embeddings to determine which label from a pre-determined set of possible labels best describes a given image.

This technique of leveraging generic model knowledge to create a classification model with dynamic labels is sometimes called zero-shot classification: we can get the model to sort images into our labels even though it was never trained on that specific classification task.
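
The core of the idea is a nearest-label lookup. The sketch below shows it on made-up $2D$ embeddings that stand in for real `CLIP` text and image embeddings; the `classify()` helper is hypothetical and uses `math.dist()` in place of `euclidean_distances()`.

```python
from math import dist

# Made-up 2D embeddings standing in for CLIP text embeddings of each label.
label_embeddings = {
  "mountain": [1.0, 0.0],
  "forest": [0.0, 1.0],
  "lake": [-1.0, 0.0],
}

def classify(img_embedding, label_embeddings):
  # Return the label whose embedding is nearest to the image embedding.
  return min(label_embeddings, key=lambda label: dist(img_embedding, label_embeddings[label]))

print(classify([0.8, 0.3], label_embeddings))
print(classify([-0.2, 0.9], label_embeddings))
```

Changing the set of labels only requires embedding new strings; nothing about the model itself has to change.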

In [None]:
txt = ["mountain", "forest", "lake"]

txt_t = clip_processor(text=txt, padding="max_length", max_length=64, return_tensors="pt").to(DEVICE)

with no_grad():
  txt_embeddings = clip_model.get_text_features(**txt_t).tolist()

In [None]:
for img_idx in range(5):
  dists = euclidean_distances([img_embeddings[img_idx]], txt_embeddings)
  sorted_idxs = argsort(dists[0])

  img = PImage.open(f"{IMG_DIR}/{fnames[img_idx]}")
  img.thumbnail((256, 256))
  display(img)

  print(txt[sorted_idxs[0]])

### More embedding arithmetic

Everything is a number, so why not ?

Since `CLIP` embeddings are able to represent similar concepts using similar numbers, regardless of whether those concepts are initially expressed as text or images, subtracting the embedding for the word "snow" from the embedding of an image of a Bob Ross painting with snow should leave us with something close to an embedding of a Bob Ross painting without snow.

We can't turn the resulting embedding into a brand new image, but we can use it to search existing images for the one that most closely resembles the Bob Ross painting with snow, but without the snow.
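
Here is the same idea on a toy scale. The $2D$ embeddings below are made up so that one axis loosely stands for "cabin" and the other for "snow"; real embeddings do not have such tidy axes. Excluding the query image from the results is an extra step added in this sketch.

```python
from math import dist

# Made-up 2D embeddings: axis 0 ~ "cabin-ness", axis 1 ~ "snow-ness".
img_embeddings = [
  [0.9, 0.8],  # 0: cabin with snow (the query image)
  [0.8, 0.1],  # 1: cabin without snow
  [0.1, 0.9],  # 2: snowy field, no cabin
]
snow_txt_embedding = [0.0, 0.7]

query_idx = 0
# Subtract the text embedding from the image embedding, item by item.
no_snow = [i - t for i, t in zip(img_embeddings[query_idx], snow_txt_embedding)]

# Rank all images by distance to the modified embedding, then drop the query itself.
ranked = sorted(range(len(img_embeddings)), key=lambda j: dist(no_snow, img_embeddings[j]))
ranked = [j for j in ranked if j != query_idx]

print(ranked[0])
```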

Let's encode "snow":

In [None]:
txt = "snow"

txt_t = clip_processor(text=txt, padding="max_length", max_length=64, return_tensors="pt").to(DEVICE)

with no_grad():
  txt_embedding = clip_model.get_text_features(**txt_t)[0].tolist()

Let's encode an image of a painting with snow.

The following indexes are for paintings with snow:

```python
snow_img_idxs = [201, 254, 61, 223, 16, 184, 89, 0, 124, 111]
```

In [None]:
img_idx = 201

snow_img = PImage.open(f"{IMG_DIR}/{fnames[img_idx]}")
snow_img.thumbnail((256, 256))
display(snow_img)

Let's subtract the "snow" embedding from the embedding of the selected image.

We'll use a comprehension and `zip()` to do the subtraction of each individual item.

Then, we use `euclidean_distances()` and `argsort()` to find the image that most resembles the resulting embedding.

In [None]:
no_snow_embedding = [i - t for i,t in zip(img_embeddings[img_idx], txt_embedding)]

dists = euclidean_distances([no_snow_embedding], img_embeddings)
sorted_idxs = argsort(dists[0])

no_snow_img = PImage.open(f"{IMG_DIR}/{fnames[sorted_idxs[0]]}")
no_snow_img.thumbnail((256, 256))

# combine images horizontally
img = PImage.new("RGB", (2 * snow_img.width, snow_img.height))
img.paste(snow_img, (0,0))
img.paste(no_snow_img, (snow_img.width, 0))

display(img)

## Search for a lookalike

Use the `metfaces` dataset to look for the faces most similar to Bob Ross:

<img src="https://www.bobross.com/content/bob_ross_img.png" height="300px">

[url](https://www.bobross.com/content/bob_ross_img.png)

In [None]:
# TODO: read and embed all metfaces images, and then search for Bob Ross lookalikes

## SigLIP

This is a newer contrastive model.

We can load its pre-processor and pre-trained weights using the `transformers` library and its interface will be the same as the interface for the `CLIP` model.

In [None]:
from transformers import AutoModel, AutoProcessor

SIGLIP_MODEL_NAME = "google/siglip2-giant-opt-patch16-256"

siglip_processor = AutoProcessor.from_pretrained(SIGLIP_MODEL_NAME)
# device_map="auto" already places the weights on the available device, so no extra .to(DEVICE) is needed
siglip_model = AutoModel.from_pretrained(SIGLIP_MODEL_NAME, device_map="auto")

In [None]:
# TODO: Try SigLIP (on a different dataset)