# Week 3a: Cosine similarity and vector embeddings

This notebook introduces how embeddings generated by neural networks can be used to represent individual data samples such as images and text, using [the famous CLIP model from OpenAI](https://openai.com/index/clip/), and how we can use those embeddings to measure how similar two data points are to each other. Before that, though, there is an introduction to [cosine similarity](https://www.youtube.com/watch?v=5Ao8Ji-f3i8), a very handy mathematical formula for calculating the similarity of two vectors. It is widely used in AI across many domains and tasks, and is of fundamental importance to how powerful modern AI systems like LLMs operate (more on this later in the term).

First of all you will need to set up this notebook to use the right environment.

### Setting up your Python environment

Before you work through this notebook, please follow the instructions in [Setup-and-test-conda-environment.ipynb](Setup-and-test-conda-environment.ipynb)

Once you have done that you will need to make sure that the environment selected to run this notebook and all the other notebooks used in this unit is called `aim`. 

To do this click the **Select kernel** button in the top right corner of this notebook, and then select `aim`.

To make sure the environment is configured properly, hit the run cell button (▶) on the cell below:

In [1]:
import os
print(os.environ['CONDA_DEFAULT_ENV'])

aim


Does it output the text `aim`?

If it does not output the text `aim`, please revisit and follow the instructions in [Setup-and-test-conda-environment.ipynb](Setup-and-test-conda-environment.ipynb).

If you still cannot get it working, please raise this with the course instructor. 

#### Import libraries 

In [2]:
import os
import numpy as np
from PIL import Image
import torch
import open_clip

## Introduction to the Cosine Similarity

Cosine similarity is a mathematical formula that we can use to measure how closely related two vectors are to each other. It tells us how similar the directions of two vectors (measured from the origin) are, based on the angle between them and regardless of their lengths.

When we calculate the cosine similarity we get a number between $-1$ and $1$. The resulting number tells us:
- $1$: When the vectors are identical in direction (most similar).
- $0$: When the vectors are orthogonal (unrelated).
- $-1$: When the vectors are opposite in direction (total opposite).

The diagram below gives a visual representation of this:

<img src="media/cosine_similarity_diagram.png" alt="cosine_similarity_diagram" width="500"/>

The formula for calculating the cosine similarity between two vectors $\vec{a}$ & $\vec{b}$ is given as:
$$
\text{Cosine Similarity} = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \|\vec{b}\|}
$$
Where $\vec{a} \cdot \vec{b}$ is the [dot product of the two vectors](https://www.youtube.com/watch?v=0iNrGpwZwog), and $\|\vec{a}\| \|\vec{b}\|$ is the product of the [magnitude (aka length from the origin) of the vectors](https://www.youtube.com/watch?v=mGcZGiUn39k) $\vec{a}$ & $\vec{b}$. 
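
To make the formula concrete, here is a small worked sketch that computes each piece separately. The vectors `a` and `b` are arbitrary examples chosen so the arithmetic is easy to check by hand; they are not taken from the rest of the notebook.

```python
import numpy as np

# Two example vectors chosen so the numbers are easy to verify by hand
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

dot_product = np.dot(a, b)              # 3*4 + 4*3 = 24
magnitude_a = np.sqrt(np.sum(a ** 2))   # sqrt(9 + 16) = 5
magnitude_b = np.sqrt(np.sum(b ** 2))   # sqrt(16 + 9) = 5

# Dot product divided by the product of the magnitudes: 24 / 25 = 0.96
cosine_similarity = dot_product / (magnitude_a * magnitude_b)

# The angle between the vectors can be recovered with the inverse cosine
angle_degrees = np.degrees(np.arccos(cosine_similarity))

print(f'dot product: {dot_product}')
print(f'magnitudes: {magnitude_a}, {magnitude_b}')
print(f'cosine similarity: {cosine_similarity:.2f}')
print(f'angle between the vectors (degrees): {angle_degrees:.1f}')
```

A similarity of 0.96 corresponds to a small angle between the two vectors, which matches the intuition that they point in nearly the same direction.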

### Cosine similarity in numpy

In [3]:
def get_cosine_similarity(vec_a, vec_b):
    dot_product = vec_a @ vec_b
    product_of_magnitudes = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return dot_product / product_of_magnitudes

In [4]:
vec_1 = np.array([1,0])
vec_2 = np.array([0,1])
vec_3 = np.array([-1,0])

If we calculate the cosine similarity of a vector with itself we will get 1, as both are pointing in the same direction:

In [5]:
similarity = get_cosine_similarity(vec_1, vec_1)
print(f'cosine similarity of vec_1 with vec_1 is: {similarity}')

cosine similarity of vec_1 with vec_1 is: 1.0


Even if the magnitudes (lengths) of the vectors are different, if they are pointing in the same direction the cosine similarity will be the same:

In [6]:
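vec_1_scaled = vec_1 * 10
# Recalculate the similarity so the printed value is for this pair of vectors
similarity = get_cosine_similarity(vec_1, vec_1_scaled)
print(f'vec_1: {vec_1}')
print(f'vec_1_scaled: {vec_1_scaled}')
print(f'cosine similarity of vec_1 with vec_1_scaled is: {similarity}')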

vec_1: [1 0]
vec_1_scaled: [10  0]
cosine similarity of vec_1 with vec_1_scaled is: 1.0
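
One way to see why the length does not matter is to normalise each vector to unit length first: a vector and any positively scaled copy of it become the same unit vector, and the cosine similarity is then just the dot product of the unit vectors. The sketch below illustrates this; `normalise` is a hypothetical helper written for this example and it assumes the input is not a vector of all zeros.

```python
import numpy as np

def normalise(vec):
    # Divide by the magnitude so the result has length 1
    # (assumes vec is not all zeros)
    return vec / np.linalg.norm(vec)

vec_a = np.array([1.0, 2.0])
vec_a_scaled = vec_a * 10

unit_a = normalise(vec_a)
unit_a_scaled = normalise(vec_a_scaled)

# Both vectors collapse onto the same unit vector
print(np.allclose(unit_a, unit_a_scaled))

# For unit vectors the dot product is the cosine similarity
print(unit_a @ unit_a_scaled)
```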


Now let's compare [orthogonal vectors](https://www.youtube.com/watch?v=6nqMegdbxik):

In [7]:
similarity = get_cosine_similarity(vec_1, vec_2)
print(f'cosine similarity of vec_1 with vec_2 is: {similarity}')

cosine similarity of vec_1 with vec_2 is: 0.0


Finally, let's compare vectors pointing in opposite directions:

In [8]:
similarity = get_cosine_similarity(vec_1, vec_3)
print(f'cosine similarity of vec_1 with vec_3 is: {similarity}')

cosine similarity of vec_1 with vec_3 is: -1.0
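
One edge case is worth knowing about before experimenting further: a vector of all zeros has no direction, so its magnitude is 0 and the formula divides by zero. The sketch below shows one possible guard. `safe_cosine_similarity` is a hypothetical helper, not part of this notebook, and returning `0.0` for a zero vector is an assumption made for illustration rather than a standard convention.

```python
import numpy as np

def safe_cosine_similarity(vec_a, vec_b):
    # Hypothetical variant of get_cosine_similarity with a zero-vector guard
    product_of_magnitudes = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if product_of_magnitudes == 0:
        # A zero vector has no direction; treating it as "unrelated" (0.0)
        # is a design choice for this sketch
        return 0.0
    return float(vec_a @ vec_b / product_of_magnitudes)

zero_result = safe_cosine_similarity(np.array([0, 0]), np.array([1, 0]))
parallel_result = safe_cosine_similarity(np.array([2, 4]), np.array([4, 8]))

print(zero_result)
print(round(parallel_result, 6))
```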


Try experimenting with some different vectors (e.g. $\begin{bmatrix}2 \\ 4\end{bmatrix}$ or $\begin{bmatrix}-3 \\ 5\end{bmatrix}$) and see what the similarities between them are:

In [9]:
# Starting values taken from the examples suggested above; replace them with your own
vec_4 = np.array([2, 4])
vec_5 = np.array([-3, 5])
similarity = get_cosine_similarity(vec_4, vec_5)
print(f'cosine similarity of vec_4 with vec_5 is: {similarity}')

## Getting embeddings with CLIP

We will use the [CLIP model](https://openai.com/index/clip/) from OpenAI to get vector embeddings of both text and images. CLIP has two component neural networks that both output vectors:
- **The text encoder** processes text as input and outputs a vector embedding.
- **The image encoder** processes an image as input and outputs a vector embedding.

What is clever about CLIP is that these vector outputs are in the same 'embedding space', so you can compare the vector embeddings of text and images directly. The model is designed so that the cosine similarity between these vectors captures **semantically meaningful** relationships, even when they represent completely different types of data:

<img src="media/clip_diagram.png" alt="clip diagram" width="700"/>

The original CLIP was trained on approximately 400 million text and image pairings that OpenAI scraped from across the world wide web (without permission). The huge amount of data that went into training this model makes these vector embedding representations very powerful. This is a very large and complex neural network compared to what you have been looking at up to this point. For now, you do not need to worry about the details of how this network is built (the classes on computer vision and transformers later in the term will cover that in more detail).

##### Load CLIP

This code will download the pre-trained CLIP model from the internet onto your PC and load it. Once loaded, the code will show you how many individual weight parameters the CLIP model has:

In [10]:
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32', pretrained='openai')

model.eval()

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")



Model parameters: 151,277,313


## Processing an image with CLIP

Let's use CLIP to get an embedding for this cute doggo:

<img src="media/golden-retriever.jpg" alt="golden retriever" width="500"/>

Let's first load our image from a file using the [PIL (Python Imaging Library)](https://pillow.readthedocs.io/en/stable/reference/Image.html). This will create a `PIL.Image` object as a variable to store the image data:

In [11]:
dog_image = Image.open('media/golden-retriever.jpg').convert("RGB")
print(type(dog_image))

<class 'PIL.Image.Image'>


##### Converting an image to a torch tensor

We now need to convert this into a torch tensor. OpenCLIP has a handy `preprocess` function that gets the image into the right shape and size for the model and converts it into a torch tensor. Let's use it, and then use the `.shape` member variable of the resulting tensor to see its dimensionality:

In [12]:
image_tensor = preprocess(dog_image)
print(image_tensor.shape)

torch.Size([3, 224, 224])
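
The ordering of the dimensions in the shape above is worth a closer look. PyTorch image tensors are conventionally laid out as (channels, height, width), whereas an image converted to a NumPy array is usually laid out as (height, width, channels). The sketch below uses a small synthetic array as a stand-in for a real image (the sizes are made up for illustration) to show the axis reordering between the two layouts.

```python
import numpy as np

# A synthetic stand-in for an image: 4 pixels high, 6 pixels wide, 3 colour channels
height, width, channels = 4, 6, 3
image_hwc = np.zeros((height, width, channels), dtype=np.uint8)

# Move the channel axis to the front: (H, W, C) -> (C, H, W)
image_chw = np.transpose(image_hwc, (2, 0, 1))

print(image_hwc.shape)  # (4, 6, 3)
print(image_chw.shape)  # (3, 4, 6)
```

Because the CLIP preprocessing produces a square 224x224 image, height and width happen to be the same number here, which makes the ordering easy to overlook.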


The result should be a 3-dimensional tensor. The first dimension holds the colour channels (RGB), and the second and third dimensions represent the height and width of the image (in pixels). If you put this tensor directly into the neural network you will get this nasty-looking error:

In [13]:
with torch.no_grad():
    image_embedding = model.encode_image(image_tensor)

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 768 but got size 7 for tensor number 1 in the list.

#### Adding the batch dimension to a torch tensor

This is because PyTorch models such as CLIP expect the first dimension of the input tensor to be the **batch** dimension, used for batch processing and training of data. To get the tensor in the right format for CLIP we need to add a **fourth dimension** to the tensor, which by convention should be the first dimension (0).

This does not change the data contained in the tensor in any way; it simply adds an extra 'empty' dimension to the tensor (with length 1). Though this sounds like a pointless step, it is essential when preparing data for, and extracting data out of, PyTorch models.

To do this you can use the `.unsqueeze()` function. Because we want to make the first dimension of the tensor the 'empty' batch dimension, we pass the index of the first dimension (0) into this function. Run this code and see how the shape of the tensor has changed:

In [14]:
image_tensor = image_tensor.unsqueeze(0)
print(image_tensor.shape)

torch.Size([1, 3, 224, 224])
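
To convince yourself that adding a batch dimension leaves the data untouched, here is a small sketch of the same idea in NumPy, where `np.expand_dims` and `np.squeeze` play roles analogous to `.unsqueeze()` and `.squeeze()` in torch. The tiny array is a made-up stand-in for the real image tensor.

```python
import numpy as np

# Tiny stand-in for a (3, 224, 224) image tensor
single_image = np.arange(3 * 2 * 2).reshape(3, 2, 2)

# Add a length-1 batch dimension at position 0 (analogous to tensor.unsqueeze(0))
batched = np.expand_dims(single_image, axis=0)

# Remove it again (analogous to tensor.squeeze(0))
restored = np.squeeze(batched, axis=0)

print(single_image.shape)
print(batched.shape)
print(np.array_equal(single_image, restored))
```

The values are identical before and after the round trip; only the shape metadata changes.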


Now let's try processing that image again with CLIP:

In [15]:
with torch.no_grad():
    image_embedding = model.encode_image(image_tensor)

Now you can see the shape of this tensor. It should be a tensor with the shape 1x512. Though technically a matrix, this is essentially a vector with the first dimension (used for batch processing) being 'empty' with a length of 1:

In [17]:
print(image_embedding.shape)

torch.Size([1, 512])


To get rid of this empty dimension and convert this into a proper vector we can use the function `.squeeze()`:

In [19]:
image_embedding = image_embedding.squeeze()
print(image_embedding.shape)

torch.Size([512])


## Processing text with CLIP

Now let's process some text with CLIP. Let's take the following sentence as a string:


In [20]:
dog_text = 'A cute dog'
print(dog_text)

A cute dog


Now let's tokenize this text using the OpenCLIP text tokenizer. This will give us a tensor of 77 token IDs (with the empty batch dimension created for you this time by the tokenizer). 77 is the maximum number of tokens in a text string that CLIP can process:

In [21]:
text_tokens_tensor = open_clip.tokenizer.tokenize(dog_text)
print(text_tokens_tensor.shape)

torch.Size([1, 77])


This is essentially just a list of integers, each one being the ID number that represents a token (a fragment of text). [This website has a nice interactive demo of the OpenAI tokenizers used for the GPT models](https://platform.openai.com/tokenizer). You can see the list of tokens if you print the data contained in this tensor directly:

In [22]:
print(text_tokens_tensor)

tensor([[49406,   320,  2242,  1929, 49407,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]])
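
The printed tensor has a simple structure that is worth unpacking. In CLIP's tokenizer, `49406` is the start-of-text marker and `49407` is the end-of-text marker; the IDs in between presumably correspond to the words of the input string, and everything after the end marker is zero padding up to the fixed length of 77. The sketch below rebuilds the same list of IDs by hand (copied from the output above) and counts the pieces.

```python
import numpy as np

context_length = 77

# The first five IDs are copied from the printed tensor above; the rest is padding
token_ids = np.zeros(context_length, dtype=np.int64)
token_ids[:5] = [49406, 320, 2242, 1929, 49407]

num_non_padding = int(np.count_nonzero(token_ids))   # start + words + end
num_word_tokens = num_non_padding - 2                # exclude the start and end markers
num_padding = context_length - num_non_padding

print(f'non-padding tokens: {num_non_padding}')
print(f'word tokens: {num_word_tokens}')
print(f'padding tokens: {num_padding}')
```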


Now you can encode this list of tokens with CLIP:

In [23]:
with torch.no_grad():
    text_embedding = model.encode_text(text_tokens_tensor)

Just like when encoding images, this gives a tensor with the shape 1x512:

In [24]:
print(text_embedding.shape)

torch.Size([1, 512])


This can be converted into a proper vector using `.squeeze()`:

In [25]:
text_embedding = text_embedding.squeeze()
print(text_embedding.shape)

torch.Size([512])


### Calculating similarity with CLIP embeddings

Now you have your two vector embeddings (`image_embedding` and `text_embedding`). Next, you need to convert these vector embeddings from torch tensors to numpy arrays using the `.numpy()` function:

In [26]:
image_embedding = image_embedding.numpy()
text_embedding = text_embedding.numpy()
print(type(image_embedding))
print(type(text_embedding))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


You can now compare the similarity of these vector embeddings using the function `get_cosine_similarity` from the [cosine similarity section at the start of the notebook](#introduction-to-the-cosine-similarity).

In [27]:
similarity = get_cosine_similarity(image_embedding, text_embedding)
print(similarity)

0.26818934


This should give a similarity score of approximately $0.27$. Can you edit the text description given in the string variable `dog_text` and re-run this code to get a closer match?
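
If you want to compare one image against several candidate descriptions at once, the same calculation can be done for a whole stack of vectors. The sketch below is one possible way to do it under stated assumptions: `rank_by_cosine_similarity` is a hypothetical helper, and the 3-dimensional vectors and caption labels are made-up stand-ins for real 512-dimensional CLIP embeddings.

```python
import numpy as np

def rank_by_cosine_similarity(query, candidates):
    # Normalise to unit length so a matrix-vector product gives cosine similarities
    query_unit = query / np.linalg.norm(query)
    candidates_unit = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    similarities = candidates_unit @ query_unit
    # Indices of the candidates, from most to least similar
    order = np.argsort(similarities)[::-1]
    return order, similarities

# Toy vectors standing in for CLIP embeddings (values are made up)
toy_image_embedding = np.array([0.9, 0.1, 0.2])
toy_text_embeddings = np.array([
    [0.8, 0.2, 0.1],    # points in a similar direction to the image vector
    [0.1, 0.9, 0.3],    # points in a mostly different direction
    [-0.7, 0.1, 0.0],   # points in roughly the opposite direction
])
labels = ['caption A', 'caption B', 'caption C']

order, similarities = rank_by_cosine_similarity(toy_image_embedding, toy_text_embeddings)
for index in order:
    print(f'{labels[index]}: {similarities[index]:.3f}')
```

With real data, each row of the candidates array would be one text embedding produced as in the cells above, and the highest-ranked row would be the description CLIP considers closest to the image.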

## Next steps
Now that you have the basics down, you can move on to the next notebook, [Week-3b-Downloading-and-processing-museum-dataset-with-CLIP.ipynb](Week-3b-Downloading-and-processing-museum-dataset-with-CLIP.ipynb), where you will download a dataset of images from a museum collection and use CLIP to create vector embeddings for each image.
