# Homework14

Pre-Trained Transformer Models and Embeddings

## Goals

- Practice setting up classification and clustering modeling tasks from scratch
- Experiment with pre-trained transformer models for embedding and feature extraction
- Get more familiar with the `argsort()` and `euclidean_distances()` functions

### Setup

Run the following two cells to download the dataset and import the libraries and helpers needed for this homework.

In [None]:
!wget -qO- https://github.com/PSAM-5020-2025F-A/5020-utils/releases/latest/download/flowers102.tar.gz | tar xz

In [None]:
import pandas as pd

from numpy import argsort
from os import listdir
from PIL import Image as PImage

from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics.pairwise import euclidean_distances

from torch import nn, Tensor, no_grad
from torch import float32 as t_float32, uint8 as t_uint8

from torchvision.models import resnet50, ResNet50_Weights
from torchvision.transforms import v2

from transformers import AutoModel, AutoProcessor
from transformers import CLIPProcessor, CLIPModel

# One-Shot Classification

## Intro

We're going to leverage the general knowledge and patterns learned by pre-trained deep learning models to create a one-shot classifier model.

One-shot classifiers are models that learn to describe or detect objects from just one example (or, as we do here, a small number of examples) of each class.

The overall flow for doing this using image embeddings can be something like:
- extract embeddings for all images in our dataset
- training: for each class, average a small number of embeddings to create class embeddings
- these embeddings now represent information about the classes we're trying to identify
- predicting: find the closest class embedding to each image embedding in the dataset

This is similar to one of the examples in our [Week 14](https://github.com/PSAM-5020-2025F-A/WK14) notebook where we used words to classify images.
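
The flow above can be sketched on toy data before touching any images. This is only a minimal illustration: the 2-D vectors, the label names and all variable names are made up and stand in for real model embeddings.

```python
import numpy as np

# Toy 2-D "embeddings" standing in for real model outputs (hypothetical values).
train_embeddings = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
train_labels = np.array(["daisy", "daisy", "rose", "rose"])

# "Training": average the embeddings that belong to each class.
class_names = np.unique(train_labels)
class_embeddings = np.array([
  train_embeddings[train_labels == name].mean(axis=0) for name in class_names
])

# "Predicting": for each new embedding, pick the class whose embedding is closest.
new_embeddings = np.array([[1.0, 1.0], [9.0, 11.0]])
diffs = new_embeddings[:, None, :] - class_embeddings[None, :, :]
distances = np.linalg.norm(diffs, axis=2)  # shape: (num_images, num_classes)
predictions = class_names[distances.argmin(axis=1)]

print(predictions)
```

No weights are updated anywhere in this sketch. The only "learned" values are the per-class averages.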

## The Data

We're going to classify the [Oxford Flowers Dataset](https://www.robots.ox.ac.uk/~vgg/data/flowers/). This is a dataset made up of images of flowers and their names.

All of the images we're going to be working with should have been downloaded by the first cell, into separate `train` and `test` subdirectories inside `./data/image/flowers102/`.

The images all have the same (or similar) height, but very different widths. Depending on the architecture/model that we choose to use for embedding, this might be something we have to standardize. Transformer architectures and their preprocessors will automatically deal with these differences, whereas CNN models are a bit more strict.
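
If the chosen model does need a fixed input size, one simple option is to resize every image to the same shape. The sketch below uses blank placeholder images, and the `224 x 224` target is only an assumed value, so check what the chosen model actually expects.

```python
from PIL import Image as PImage

# Blank stand-in images with the same height but different widths (hypothetical sizes).
images = [PImage.new("RGB", (500, 300)), PImage.new("RGB", (750, 300))]

# Assumed target size; the size a given model expects may differ.
TARGET_SIZE = (224, 224)

# resize() returns a new image and leaves the original untouched.
resized = [img.resize(TARGET_SIZE) for img in images]
sizes = [img.size for img in resized]

print(sizes)
```

A plain resize like this changes the aspect ratio of wide images. Cropping to a square before resizing is a common alternative when that distortion matters.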

Let's start by defining a function that helps parse the classification label from file names or paths.

In [None]:
def filepath_to_label(filepath):
  return filepath.split("/")[-1].split("_")[-1].split(".")[0]
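
To see what the function does, we can step through it with a made-up path. The file name below is hypothetical and only illustrates the parsing. The real files may follow a different naming pattern.

```python
def filepath_to_label(filepath):
  # Keep the file name, then the last "_"-separated token, then drop the extension.
  return filepath.split("/")[-1].split("_")[-1].split(".")[0]

# Hypothetical path, used only to illustrate the parsing steps.
example_path = "./data/image/flowers102/train/image_00001_daisy.jpg"
label = filepath_to_label(example_path)

print(label)
```

Because the function keeps only the text after the last underscore and before the first dot, a label that itself contained an underscore or a dot would be cut short.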

## The Model

Now we have to define an embedding model, and (probably) a pre-processing strategy.

What we need here is a pre-trained model that is able to turn images of various sizes into fixed-length feature lists.

Our [Week 14](https://github.com/PSAM-5020-2025F-A/WK14) notebook has a couple of examples of how to use `ResNet`, `CLIP` and `SigLIP` models to do this, but there are other options that could be used. Since we're not doing any text processing, any kind of deep learning visual model can (theoretically) be used.

Some other examples: [Nomic Vision](https://huggingface.co/nomic-ai/nomic-embed-vision-v1.5), [EfficientNet](https://pytorch.org/hub/nvidia_deeplearningexamples_efficientnet/), [ViT](https://huggingface.co/google/vit-base-patch16-224-in21k), [DINOv3](https://huggingface.co/facebook/dinov3-vitl16-pretrain-lvd1689m), etc.

In [None]:
# TODO: define model and pre-processing routine/function/strategy for images

## Train Data

Now we process the train data. Fun.

There are many ways to do this, but one possible strategy is to go through all of the files inside the `./data/image/flowers102/train` directory and append each image's label and embedding to separate lists, called `train_labels` and `train_embeddings`.

Then, create a `DataFrame` using the embeddings and add the labels to the same `DataFrame`.

Depending on the model chosen, this can take a few minutes.
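
Combining the two lists could look something like the sketch below. The values are hypothetical 3-dimensional embeddings, and the column name `label` is an assumption, not a requirement.

```python
import pandas as pd

# Hypothetical 3-dimensional embeddings; a real model returns many more values per image.
train_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
train_labels = ["daisy", "rose", "daisy"]

# One row per image, one column per embedding dimension.
train_df = pd.DataFrame(train_embeddings)

# Add the labels as an extra column next to the features.
train_df["label"] = train_labels

print(train_df.shape)
```

If the model returns tensors, it is probably easiest to convert each embedding to a flat list of numbers (for example with `.tolist()`) before appending it, so that every `DataFrame` column holds plain numeric values.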

In [None]:
# TODO: extract labels and embeddings for each image in the training dataset

TRAIN_DIR = "./data/image/flowers102/train"

train_fnames = sorted([f for f in listdir(TRAIN_DIR) if f.endswith("jpg")])

train_labels = []
train_embeddings = []

for fname in train_fnames:
  # TODO: replace this with code
  pass

# TODO: combine the lists into a DataFrame

### Train Data Questions

<span style="color:hotpink;">
How many images are in the training dataset ?<br>
How many <em>"features"</em> ?<br>
How many unique classes do we have for this data ?
</span>

<span style="color:hotpink;">ADD ANSWER TO THIS CELL</span>

## Test Data

Repeat the above process for the files inside the `./data/image/flowers102/test` directory: append each image's label and embedding to separate lists, called `test_labels` and `test_embeddings`.

Since we won't do any other kind of processing on this data, it's not as important to combine the labels and embeddings into a `DataFrame`.

And again, this can take a few minutes.

In [None]:
# TODO: Repeat label and embedding extraction for all images in the test dataset

### Test Data Questions

<span style="color:hotpink;">
How many images are in the test dataset ?<br>
Anything odd or unusual about this ?<br>
Is the test dataset balanced ?<br>
Does it matter ?
</span>

<span style="color:hotpink;">ADD ANSWER TO THIS CELL</span>

## Train the model

This is the unconventional part. We don't need to train a new model. Instead, we use the already-trained one to derive some information about our training data that can then be used to make new predictions.

There are different ways to do this, but a recommended strategy here could be:
- get a list of unique labels in our dataset
- iterate over the labels, filter the `DataFrame` by label and compute an average embedding for all images of each label
- these are now class embeddings, as they hold aggregate information about multiple instances of each class
- we should end up with as many class embeddings as there are unique labels in the dataset
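
On a toy `DataFrame`, the steps above might look like the sketch below. The column layout (numeric feature columns plus a `label` column) and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy data: two feature columns and a label column (hypothetical values).
train_df = pd.DataFrame({
  0: [0.0, 0.0, 10.0, 10.0],
  1: [0.0, 2.0, 10.0, 12.0],
  "label": ["daisy", "daisy", "rose", "rose"],
})

class_embeddings = {}
for label in train_df["label"].unique():
  # Keep only the rows of this class, then average the feature columns.
  rows = train_df[train_df["label"] == label]
  class_embeddings[label] = rows.drop(columns=["label"]).mean().values

# One row per class, one column per embedding dimension.
class_df = pd.DataFrame.from_dict(class_embeddings, orient="index")

print(class_df)
```

For a layout like this one, `train_df.groupby("label").mean()` should produce the same averages in a single call. The explicit loop is shown because it mirrors the steps listed above.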

In [None]:
# TODO: create class embeddings by averaging embeddings for all images of each label

## Predict and Evaluate

We have class embeddings, and we have instance embeddings from all images in the test dataset.

We can use the `euclidean_distances()` function to calculate pairwise distances between each image and each class embedding.

Then, we'll go through each image and get the index of the class embedding that is closest to its image embedding.

We can create a `test_preds` list to compare to the `test_labels` list we extracted above.
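
Here is a minimal sketch of the distance and lookup steps on toy values. The embeddings and class names are made up, and the result is stored in `test_preds` to match the `classification_report()` call further down.

```python
from numpy import argsort
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import euclidean_distances

# Hypothetical class embeddings, one per class, in the same order as class_names.
class_names = ["daisy", "rose"]
class_embeddings = [[0.0, 1.0], [10.0, 11.0]]

# Hypothetical test embeddings and their true labels.
test_embeddings = [[1.0, 1.0], [9.0, 11.0], [2.0, 0.0]]
test_labels = ["daisy", "rose", "daisy"]

# Rows are images, columns are classes: shape (3, 2).
distances = euclidean_distances(test_embeddings, class_embeddings)

# argsort orders the class indices of each row from nearest to farthest,
# so the first column holds the index of the closest class.
closest_idx = argsort(distances, axis=1)[:, 0]
test_preds = [class_names[i] for i in closest_idx]

print(accuracy_score(test_labels, test_preds))
```

Keeping more columns of the `argsort()` result (for example `[:, :3]`) would give the three nearest classes per image, which can be handy when looking at near misses.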

In [None]:
# TODO: use euclidean_distances() and argsort() to determine the closest class for each image

In [None]:
print(classification_report(test_labels, test_preds))

### Interpretation

<span style="color:hotpink;">
So... What happened ?<br>
What are some advantages and disadvantages of using this strategy for classification ?
</span>

<span style="color:hotpink;">ADD ANSWER TO THIS CELL</span>