In [9]:
pip install sentence-transformers==4.1.0 | tail -n 1

Note: you may need to restart the kernel to use updated packages.


The following libraries are used in this notebook:

- numpy for mathematical operations.
- scipy for additional mathematical operations not found in numpy.
- torch for vector operations, although this library is also typically used for deep learning.
- sentence_transformers for obtaining vector embeddings from text data using pre-trained models.

In [10]:
import math

import numpy as np
import scipy
import torch
from sentence_transformers import SentenceTransformer



## Obtain Vector Embeddings

To calculate distance and similarity metrics, we first need to generate vector embeddings for some text documents. 

In [11]:
# Example documents
documents = [
    'Bugs introduced by the intern had to be squashed by the lead developer.',
    'Bugs found by the quality assurance engineer were difficult to debug.',
    'Bugs are common throughout the warm summer months, according to the entomologist.',
    'Bugs, in particular spiders, are extensively studied by arachnologists.'
]


As shown above, there are four example documents, each consisting of a single sentence. These sentences are intentionally designed to be challenging for semantic similarity search.

All four sentences begin with the word "Bugs," but they refer to different meanings of the word depending on context. The first two sentences relate to software bugs in programming, while the last two refer to physical bugs, such as insects or spiders.

The key to distinguishing between these meanings lies in the context, particularly the type of professional mentioned in each sentence. For example, if we replaced the word "arachnologists" (scientists who study spiders and other arachnids) in the last sentence with "lead developers," the sentence would instead refer to programming bugs.

In [12]:
# Load a pre-trained model
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

The code above creates an instance of the SentenceTransformer class from the sentence_transformers library, which is commonly used to generate vector embeddings from pre-trained models.

In this example, we’re using the paraphrase-MiniLM-L6-v2 model. It was trained on pairs of paraphrased sentences, with the goal of generating similar embeddings for sentences that express the same meaning. While the model was originally designed for paraphrase identification, it also performs well on general semantic similarity tasks.


In [13]:
# Generate embeddings
embeddings = model.encode(documents)

In [14]:
embeddings.shape

(4, 384)

In [15]:
embeddings

array([[-0.22804335, -0.246477  , -0.00319243, ...,  0.4552812 ,
         0.6341975 ,  0.53750485],
       [-0.35791564, -0.32084018,  0.15963301, ..., -0.07050626,
         0.9275025 ,  0.34377253],
       [ 0.20302981, -0.2689862 ,  0.1628514 , ..., -0.19650973,
        -0.03379842,  0.5956156 ],
       [-0.04264269, -0.45721576, -0.09526495, ..., -0.5803074 ,
         0.17248403,  0.09127846]], shape=(4, 384), dtype=float32)

As shown above, the four text documents have been converted into a NumPy array with shape (4, 384). Each row in the array represents the embedding of one document, and the number 384 is the dimensionality of the embeddings produced by the paraphrase-MiniLM-L6-v2 model. In other words, each document has been transformed into a numerical vector containing 384 values.

## L2 (Euclidean) Distance
The L2 (Euclidean) distance between two vectors $a$ and $b$ can be calculated using <br>
$$ \text{L2}(a,b) = \sqrt{\sum_{i=1}^n (a_i - b_i)^2} $$

Let's try to implement the L2 distance manually before looking at off-the-shelf solutions available from third party libraries.


### Manual implementation of L2 distance calculation
The function below implements the L2 distance formula. It first calculates the sum of the squared differences between corresponding elements, then returns the square root of that sum:


In [16]:
def euclidean_distance_fn(vector1, vector2):
    squared_sum = sum((x - y) ** 2 for x, y in zip(vector1, vector2))
    return math.sqrt(squared_sum)

Note that euclidean_distance_fn computes the distance between two individual vectors, not across an entire array. To see how it works, let's calculate the distance between the first vector (index 0) and the second vector (index 1):

In [17]:
euclidean_distance_fn(embeddings[0], embeddings[1])

5.96179017134276

In [18]:
euclidean_distance_fn(embeddings[1], embeddings[0])

5.96179017134276
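
A quick sanity check on vectors whose distance is easy to work out by hand can confirm that the function behaves as expected. The sketch below repeats the function so that it runs on its own; the toy vectors `p` and `q` are made up for illustration and are not taken from the embeddings.

```python
import math

def euclidean_distance_fn(vector1, vector2):
    # Same logic as the function defined above, repeated so this snippet is standalone.
    squared_sum = sum((x - y) ** 2 for x, y in zip(vector1, vector2))
    return math.sqrt(squared_sum)

# Hypothetical toy vectors: the component differences are 3 and 4 (a 3-4-5 right triangle).
p = [1.0, 2.0]
q = [4.0, 6.0]

d_pq = euclidean_distance_fn(p, q)
d_qp = euclidean_distance_fn(q, p)
print(d_pq, d_qp)  # 5.0 5.0
```

As with the embeddings, swapping the arguments does not change the result. Note that `zip` stops at the shorter input, so the function assumes both vectors have the same length.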

In order to calculate all of the distances between all of the vectors in the embeddings array, we can use nested loops. In the following, i and j are the indices of the vectors in the array:

In [19]:
l2_dist_manual = np.zeros([4,4])
for i in range(embeddings.shape[0]):
    for j in range(embeddings.shape[0]):
        l2_dist_manual[i,j] = euclidean_distance_fn(embeddings[i], embeddings[j])

l2_dist_manual

array([[0.        , 5.96179017, 7.33939859, 7.15578223],
       [5.96179017, 0.        , 7.76861699, 7.39359074],
       [7.33939859, 7.76861699, 0.        , 5.919928  ],
       [7.15578223, 7.39359074, 5.919928  , 0.        ]])

`l2_dist_manual` is a 4×4 array where each element represents the L2 distance between two vectors: the vector at the row index and the vector at the column index. For example, the distance between the first vector (index 0) and the second vector (index 1) is located at position `[0, 1]` in the array:



In [20]:
l2_dist_manual[0,1]

np.float64(5.96179017134276)

In [21]:
l2_dist_manual[1,0]

np.float64(5.96179017134276)
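
The nested loops can also be replaced by NumPy broadcasting. The following is a minimal sketch of that idea rather than the notebook's own method; `toy_vectors` is a hypothetical stand-in for the `embeddings` array so that the snippet runs without downloading a model.

```python
import numpy as np

# Hypothetical stand-in for `embeddings`: 3 vectors with 2 dimensions each.
toy_vectors = np.array([[0.0, 0.0],
                        [3.0, 4.0],
                        [6.0, 8.0]])

# Shape (3, 1, 2) minus shape (1, 3, 2) broadcasts to (3, 3, 2):
# entry [i, j] holds the difference vector toy_vectors[i] - toy_vectors[j].
differences = toy_vectors[:, np.newaxis, :] - toy_vectors[np.newaxis, :, :]

# Taking the L2 norm over the last axis collapses each difference vector to one distance.
l2_dist_broadcast = np.linalg.norm(differences, axis=-1)
print(l2_dist_broadcast)
```

The intermediate `differences` array holds one difference vector per pair, so its memory use grows with the square of the number of vectors. That is fine for four documents but may not be for large collections.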

#### Exercise 1 - Make the manual calculation more efficient

The code used to populate the `l2_dist_manual` array is not very efficient. First, it redundantly calculates the distance between a vector and itself, even though the L2 distance in such cases is always zero. Second, the array is symmetric, meaning the distance between vectors at indices $i$ and $j$ is the same as between $j$ and $i$. Therefore, each distance only needs to be computed once.


In [22]:
l2_dist_manual_improved = np.zeros([4,4])
for i in range(embeddings.shape[0]):
    for j in range(embeddings.shape[0]):
        if i < j: 
            l2_dist_manual_improved[i,j] = euclidean_distance_fn(embeddings[i], embeddings[j])
        else:
            l2_dist_manual_improved[i,j] = l2_dist_manual_improved[j,i]

l2_dist_manual_improved

array([[0.        , 5.96179017, 7.33939859, 7.15578223],
       [5.96179017, 0.        , 7.76861699, 7.39359074],
       [7.33939859, 7.76861699, 0.        , 5.919928  ],
       [7.15578223, 7.39359074, 5.919928  , 0.        ]])
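
One possible variant starts the inner loop at `i + 1`, so that the diagonal and the lower triangle are never visited at all. This is a sketch under the assumption that the distance function is symmetric; `toy_vectors` and the `iterations` counter are illustrative names that do not appear elsewhere in the notebook.

```python
import math
import numpy as np

def euclidean_distance_fn(vector1, vector2):
    # Same logic as the function defined earlier, repeated so this snippet is standalone.
    squared_sum = sum((x - y) ** 2 for x, y in zip(vector1, vector2))
    return math.sqrt(squared_sum)

# Hypothetical stand-in for `embeddings`: 4 vectors with 2 dimensions each.
toy_vectors = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])
n = toy_vectors.shape[0]

l2_dist_upper = np.zeros((n, n))  # the diagonal stays at zero
iterations = 0
for i in range(n):
    for j in range(i + 1, n):  # starting at i + 1 skips the diagonal and the lower triangle
        distance = euclidean_distance_fn(toy_vectors[i], toy_vectors[j])
        l2_dist_upper[i, j] = distance
        l2_dist_upper[j, i] = distance  # mirror the value instead of recomputing it
        iterations += 1

print(iterations)  # 6 loop iterations for 4 vectors, rather than 16
```

For $n$ vectors this loop body runs $n(n-1)/2$ times, which is where the 6 comes from when $n = 4$.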

### Calculate L2 distance using `scipy`
Instead of writing your own function, you can use a variety of different functions provided by external libraries to calculate the L2 distance. The example below uses a function from `scipy`: 


In [23]:
l2_dist_scipy = scipy.spatial.distance.cdist(embeddings, embeddings, 'euclidean')
l2_dist_scipy

array([[0.        , 5.96179042, 7.33939999, 7.15578144],
       [5.96179042, 0.        , 7.76861693, 7.39359088],
       [7.33939999, 7.76861693, 0.        , 5.91992784],
       [7.15578144, 7.39359088, 5.91992784, 0.        ]])

The following verifies that `l2_dist_manual` and `l2_dist_scipy` are identical (after accounting for rounding errors) by using the `allclose()` function from `numpy`:


In [24]:
np.allclose(l2_dist_manual, l2_dist_scipy)

True
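
To see why `allclose()` is used here instead of an exact comparison, the sketch below compares a single pair of values copied from the two printed arrays. It relies on the documented default tolerances of `np.allclose` (`rtol=1e-05`, `atol=1e-08`); the variable names are illustrative.

```python
import numpy as np

# Values copied from position [0, 1] of the two arrays printed above.
manual_value = np.array([5.96179017])
scipy_value = np.array([5.96179042])

print(np.array_equal(manual_value, scipy_value))  # False: the values differ at the 7th decimal
print(np.allclose(manual_value, scipy_value))     # True: the gap is within the default tolerance

# allclose checks |a - b| <= atol + rtol * |b| element by element.
gap = abs(manual_value - scipy_value)[0]
tolerance = 1e-08 + 1e-05 * abs(scipy_value)[0]
print(gap <= tolerance)  # True
```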

### Interpret the L2 Distance Results

An analysis of `l2_dist_scipy` shows that, in this case, the L2 distance metric performed well for similarity search. For example, the first vector—corresponding to the sentence `Bugs introduced by the intern had to be squashed by the lead developer.`—had the smallest distance to the second vector, which represents the sentence `Bugs found by the quality assurance engineer were difficult to debug.` This result aligns with expectations, as both sentences refer to programming bugs.

Similarly, the third and fourth sentences—both related to physical bugs rather than programming—were closest to each other in terms of distance, which again matches our intuition.
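
The same reading of the matrix can be done programmatically. The sketch below hard-codes the distances printed for `l2_dist_scipy` and looks up each document's nearest neighbor; masking the diagonal with infinity is one possible way to exclude a document from matching itself.

```python
import numpy as np

# Distance values copied from the `l2_dist_scipy` output above.
l2_dist = np.array([
    [0.0,        5.96179042, 7.33939999, 7.15578144],
    [5.96179042, 0.0,        7.76861693, 7.39359088],
    [7.33939999, 7.76861693, 0.0,        5.91992784],
    [7.15578144, 7.39359088, 5.91992784, 0.0],
])

# Mask the diagonal so a vector is never reported as its own nearest neighbor.
masked = l2_dist.copy()
np.fill_diagonal(masked, np.inf)

# For each row, the column index holding the smallest remaining distance.
nearest_neighbor = masked.argmin(axis=1)
print(nearest_neighbor)  # [1 0 3 2]
```

Documents 0 and 1 point at each other, as do documents 2 and 3, which matches the pairing described above.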


## Dot Product Similarity and Distance
The dot product between vectors $a$ and $b$ is calculated using  <br>
$$ a \cdot b = \sum_{i=1}^{n} a_i b_i $$

Let's define a custom function for calculating the dot product between two vectors.


### Manual implementation of dot product calculation
The function below implements the dot product formula:


In [25]:
def dot_product_fn(vector1, vector2):
    return sum(x * y for x, y in zip(vector1, vector2))

Let's use the dot product function to calculate the dot product between the first and second vectors:


In [26]:
dot_product_fn(embeddings[0], embeddings[1])

np.float32(18.535397)

As with the L2 distance, a nested loop can compute the dot product for every pair of vectors:

In [27]:
dot_product_manual = np.empty([4,4])
for i in range(embeddings.shape[0]):
    for j in range(embeddings.shape[0]):
        dot_product_manual[i,j] = dot_product_fn(embeddings[i], embeddings[j])

dot_product_manual

array([[33.74440384, 18.53539658,  8.56981182,  7.83093166],
       [18.53539658, 38.86933899,  7.8899622 ,  8.6634016 ],
       [ 8.56981182,  7.8899622 , 37.26202011, 17.66955757],
       [ 7.83093166,  8.6634016 , 17.66955757, 33.12266159]])

First, observe that the diagonal of the `dot_product_manual` matrix contains non-zero values. This is expected: the dot product of a vector with itself equals the square of its magnitude, which is zero only for the zero vector. Since these self-similarity scores aren't particularly informative, we can safely ignore them.

The matrix is also symmetric, meaning the upper triangle mirrors the lower triangle, as was the case with the L2 distance matrix.

It's also important to remember that the dot product measures similarity, not distance. Higher values indicate greater similarity between vectors.

In this case, the dot product similarity performed well: the first two sentences are most similar to each other, and the last two sentences are most similar to each other, which aligns with our expectations.
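
The dot product matrix and the L2 distance matrix are closely related through the identity $\|a - b\|^2 = a \cdot a + b \cdot b - 2\,(a \cdot b)$. The sketch below checks this using three values copied from the `dot_product_manual` output; the variable names are illustrative.

```python
import math

# Values copied from the `dot_product_manual` output above.
a_dot_a = 33.74440384  # first vector with itself, position [0, 0]
b_dot_b = 38.86933899  # second vector with itself, position [1, 1]
a_dot_b = 18.53539658  # first vector with second vector, position [0, 1]

# The diagonal entries are squared magnitudes, so their square roots are vector lengths.
norm_a = math.sqrt(a_dot_a)
norm_b = math.sqrt(b_dot_b)

# Identity: ||a - b||^2 = a.a + b.b - 2 * a.b
l2_from_dots = math.sqrt(a_dot_a + b_dot_b - 2 * a_dot_b)
print(norm_a, norm_b, l2_from_dots)
```

The last value should agree, up to rounding, with the distance of roughly 5.96 found at position `[0, 1]` of the L2 distance matrix.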

### Calculate the dot product using matrix multiplication
We can compute the dot product efficiently using the [matrix multiplication](https://en.wikipedia.org/wiki/Matrix_multiplication) operator `@`. To do this, we multiply the `embeddings` matrix by its transpose. Since `embeddings` has a shape of 4×384, its transpose will be 384×4. Multiplying these gives us a 4×4 matrix, which is the desired result:


In [28]:
# Matrix multiplication operator
dot_product_operator = embeddings @ embeddings.T
dot_product_operator

array([[33.74442  , 18.5354   ,  8.569813 ,  7.8309317],
       [18.5354   , 38.869335 ,  7.8899646,  8.663406 ],
       [ 8.569813 ,  7.8899646, 37.262005 , 17.669558 ],
       [ 7.8309317,  8.663406 , 17.669558 , 33.122658 ]], dtype=float32)

We can verify that the matrix multiplication operator returns the same result as our custom function after accounting for rounding:


In [29]:
np.allclose(dot_product_manual, dot_product_operator, atol=1e-05)

True

The `@` operator is shorthand for the `matmul()` function from `numpy`, so calling the function directly returns the same result:



In [30]:
# `np.matmul()` is the function form of the `@` operator:
np.matmul(embeddings,embeddings.T)

array([[33.74442  , 18.5354   ,  8.569813 ,  7.8309317],
       [18.5354   , 38.869335 ,  7.8899646,  8.663406 ],
       [ 8.569813 ,  7.8899646, 37.262005 , 17.669558 ],
       [ 7.8309317,  8.663406 , 17.669558 , 33.122658 ]], dtype=float32)

Finally, `numpy` also provides a `dot()` function, which returns an identical result when both arrays are 2-D:


In [31]:
# `np.dot` returns an identical result, but `np.matmul` is recommended if both arrays are 2-D:
np.dot(embeddings,embeddings.T)

array([[33.74442  , 18.5354   ,  8.569813 ,  7.8309317],
       [18.5354   , 38.869335 ,  7.8899646,  8.663406 ],
       [ 8.569813 ,  7.8899646, 37.262005 , 17.669558 ],
       [ 7.8309317,  8.663406 , 17.669558 , 33.122658 ]], dtype=float32)
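
The recommendation to prefer `np.matmul` becomes clearer once arrays have more than two dimensions. The sketch below uses small hypothetical arrays (`toy_matrix`, `batch_a`, `batch_b`) to show that the three spellings agree for 2-D inputs but that `np.dot` and `np.matmul` produce differently shaped results for 3-D inputs.

```python
import numpy as np

# Hypothetical small matrix standing in for `embeddings`.
toy_matrix = np.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])

# For 2-D inputs, all three spellings compute the same matrix product.
via_operator = toy_matrix @ toy_matrix.T
via_matmul = np.matmul(toy_matrix, toy_matrix.T)
via_dot = np.dot(toy_matrix, toy_matrix.T)
print(via_operator)

# With more than two dimensions the functions diverge:
# matmul treats the leading axis as a batch, while dot pairs every matrix in
# the first array with every matrix in the second.
batch_a = np.ones((2, 3, 4))
batch_b = np.ones((2, 4, 5))
print(np.matmul(batch_a, batch_b).shape)  # (2, 3, 5)
print(np.dot(batch_a, batch_b).shape)     # (2, 3, 2, 5)
```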

### Calculate dot product distance
The dot product between two vectors provides a similarity score. If, on the other hand, we would like a distance, we can simply take the negative of the dot product:


In [32]:
dot_product_distance = -dot_product_manual
dot_product_distance

array([[-33.74440384, -18.53539658,  -8.56981182,  -7.83093166],
       [-18.53539658, -38.86933899,  -7.8899622 ,  -8.6634016 ],
       [ -8.56981182,  -7.8899622 , -37.26202011, -17.66955757],
       [ -7.83093166,  -8.6634016 , -17.66955757, -33.12266159]])

Although it might seem unusual for all the distances to be negative, the property that matters for ranking is preserved: smaller values indicate lower distance and thus greater similarity. The negative dot product is not a distance metric in the strict mathematical sense (it can be negative, and a vector's distance to itself is not zero), but the relative comparisons remain valid, so lower values still correspond to closer vectors.
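
To illustrate that negating the scores does not change the ranking, the sketch below takes the first row of `dot_product_manual` and sorts it both ways. The variable names are illustrative.

```python
import numpy as np

# Similarity scores of the first document against all four documents,
# copied from the first row of `dot_product_manual` above.
similarity_row = np.array([33.74440384, 18.53539658, 8.56981182, 7.83093166])
distance_row = -similarity_row

# Ascending order of distance ...
order_by_distance = np.argsort(distance_row)
# ... is the same as descending order of similarity.
order_by_similarity = np.argsort(similarity_row)[::-1]

print(order_by_distance)    # [0 1 2 3]
print(order_by_similarity)  # [0 1 2 3]
```

Position 0 is the document itself, so the closest other document is the one at index 1.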


## Cosine Similarity and Distance
The cosine similarity between vectors $a$ and $b$ is calculated using <br>
$$ \text{cossim}(a, b) = \frac{a \cdot b}{||a|| \ ||b||} =  \frac{a}{||a||} \cdot \frac{b}{||b||} $$

where $||a|| = \sqrt{\sum_{k=1}^n a_k^2}$ is the L2 norm, or the magnitude, of vector $a$, and $\cdot$ is the dot product.

Also note that $\frac{a}{||a||}$ represents a normalized vector. This means it has the same direction as vector $a$ but a magnitude (or length) of 1. Thus, cosine similarity can be calculated by taking the dot product of two normalized vectors.
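
A small worked example may help make the formula concrete. The helper `cosine_similarity_fn` below is a hypothetical function written for this illustration, not part of the notebook, and the vectors are toy values chosen so that the arithmetic can be checked by hand.

```python
import math

def cosine_similarity_fn(vector1, vector2):
    # Hypothetical helper: dot product divided by the product of both L2 norms.
    dot = sum(x * y for x, y in zip(vector1, vector2))
    norm1 = math.sqrt(sum(x * x for x in vector1))
    norm2 = math.sqrt(sum(y * y for y in vector2))
    return dot / (norm1 * norm2)

a = [3.0, 4.0]
b = [4.0, 3.0]
b_scaled = [8.0, 6.0]  # same direction as b, twice the length

print(cosine_similarity_fn(a, b))         # 24 / (5 * 5) = 0.96
print(cosine_similarity_fn(a, b_scaled))  # 48 / (5 * 10) = 0.96
```

Doubling the length of `b` leaves the result unchanged, because cosine similarity depends only on direction. The helper assumes neither vector is all zeros, since that would cause a division by zero.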


### Manual implementation of cosine similarity calculation
Since we’ve already covered how to compute the dot product, our strategy for manually calculating cosine similarity will focus on normalizing the vectors. This is because cosine similarity is simply the dot product of two normalized vectors, as the right-hand side of the cosine similarity formula shows.

However, in order to normalize vectors, we must first compute their L2 norms.



#### Calculate the L2 norm
The following calculates the L2 norms for all the vectors in the `embeddings` array. The calculation simply squares each vector component, sums across columns (note the `axis=1` parameter in the `sum`), and takes a square root:


In [33]:
# L2 norms
l2_norms = np.sqrt(np.sum(embeddings**2, axis=1))
l2_norms

array([5.8089943, 6.2345276, 6.1042614, 5.7552285], dtype=float32)

Note that the result is a vector of 4 numbers, one L2 norm (magnitude) per embedding vector. To normalize each vector to a length of one, we divide its components by its norm. To do that in a single array operation, let's first reshape the `l2_norms` vector into a 4×1 array so that each norm lines up with its own row of `embeddings`:


In [34]:
# L2 norms reshaped
l2_norms_reshaped = l2_norms.reshape(-1,1)
l2_norms_reshaped

array([[5.8089943],
       [6.2345276],
       [6.1042614],
       [5.7552285]], dtype=float32)
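
The reshape matters because of NumPy's broadcasting rules. The sketch below uses a hypothetical 2×3 array in place of `embeddings` to show that dividing by the flat vector of norms fails, while dividing by a column of norms works. It also uses `keepdims=True`, which is one alternative to calling `reshape(-1, 1)` afterwards.

```python
import numpy as np

# Hypothetical stand-in for `embeddings`: 2 vectors with 3 dimensions each.
toy_vectors = np.array([[3.0, 0.0, 4.0],
                        [0.0, 5.0, 12.0]])

norms_flat = np.sqrt(np.sum(toy_vectors**2, axis=1))                   # shape (2,)
norms_column = np.sqrt(np.sum(toy_vectors**2, axis=1, keepdims=True))  # shape (2, 1)

# Dividing by the flat vector fails: shapes (2, 3) and (2,) cannot be broadcast together.
try:
    toy_vectors / norms_flat
    flat_division_failed = False
except ValueError:
    flat_division_failed = True

# Dividing by the column works: each row is divided by its own norm.
normalized = toy_vectors / norms_column
print(flat_division_failed)  # True
print(normalized)
```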

#### Normalize embedding vectors
The following code calculates normalized embedding vectors by dividing every component in the vector by the vector's L2 norm:


In [35]:
normalized_embeddings_manual = embeddings/l2_norms_reshaped
normalized_embeddings_manual

array([[-0.03925694, -0.04243023, -0.00054957, ...,  0.07837522,
         0.10917509,  0.09252976],
       [-0.05740862, -0.05146183,  0.02560467, ..., -0.011309  ,
         0.1487687 ,  0.05514011],
       [ 0.03326034, -0.04406532,  0.02667831, ..., -0.03219222,
        -0.00553686,  0.09757374],
       [-0.00740938, -0.07944354, -0.01655276, ..., -0.10083134,
         0.02996997,  0.01586009]], shape=(4, 384), dtype=float32)

In [36]:
# Verify normalization
norms_after_normalization = np.sqrt(np.sum(normalized_embeddings_manual**2, axis=1))
print(norms_after_normalization)


[1.         1.         0.99999994 1.        ]


#### Normalize embeddings using PyTorch
You can normalize embeddings in PyTorch using `torch.nn.functional.normalize()`. If your data is in a NumPy array, first convert it to a PyTorch tensor using `torch.from_numpy()`. After normalization, convert the tensor back to a NumPy array using the `numpy()` method:


In [37]:
normalized_embeddings_torch = torch.nn.functional.normalize(
    torch.from_numpy(embeddings)
).numpy()
normalized_embeddings_torch

array([[-0.03925694, -0.04243023, -0.00054957, ...,  0.07837522,
         0.10917509,  0.09252976],
       [-0.05740862, -0.05146183,  0.02560467, ..., -0.011309  ,
         0.1487687 ,  0.05514011],
       [ 0.03326034, -0.04406532,  0.02667831, ..., -0.03219222,
        -0.00553686,  0.09757374],
       [-0.00740938, -0.07944354, -0.01655276, ..., -0.10083134,
         0.02996997,  0.01586009]], shape=(4, 384), dtype=float32)

We can verify that the normalized embeddings we calculated manually and the normalized embeddings calculated using `torch` are indeed identical using numpy's `allclose()` function:


In [38]:
np.allclose(normalized_embeddings_manual, normalized_embeddings_torch)

True

#### Calculate cosine similarity manually
To calculate cosine similarity between two normalized embedding vectors, we simply take their dot product. To do this, we can leverage the dot product function we defined before. For instance, the following calculates the cosine similarity between the vector embeddings of the first and second sentence:


In [41]:
dot_product_fn(normalized_embeddings_manual[0], normalized_embeddings_manual[1])

np.float32(0.5117967)

Likewise, to calculate the cosine similarities between all normalized vectors, we can use a nested loop:


In [42]:
cosine_similarity_manual = np.empty([4,4])
for i in range(normalized_embeddings_manual.shape[0]):
    for j in range(normalized_embeddings_manual.shape[0]):
        cosine_similarity_manual[i,j] = dot_product_fn(
            normalized_embeddings_manual[i], 
            normalized_embeddings_manual[j]
        )

cosine_similarity_manual

array([[0.99999982, 0.51179671, 0.24167807, 0.23423398],
       [0.51179671, 1.        , 0.20731872, 0.2414474 ],
       [0.24167807, 0.20731872, 1.00000072, 0.50295585],
       [0.23423398, 0.2414474 , 0.50295585, 1.00000024]])

### Calculate cosine similarity using matrix multiplication
Just like with the dot product, we can compute cosine similarity using matrix algebra. By multiplying the matrix of normalized embeddings with its transpose using the matrix multiplication operator, we obtain the cosine similarity matrix. This works because, once vectors are normalized, cosine similarity can be calculated by simply taking the dot product:


In [43]:
cosine_similarity_operator = normalized_embeddings_manual @ normalized_embeddings_manual.T
cosine_similarity_operator

array([[0.99999994, 0.5117967 , 0.24167809, 0.234234  ],
       [0.5117967 , 1.0000001 , 0.20731868, 0.24144733],
       [0.24167809, 0.20731868, 1.        , 0.50295603],
       [0.234234  , 0.24144733, 0.50295603, 1.0000001 ]], dtype=float32)

We can verify that the matrix algebra solution is the same as the one found using the nested loop:


In [44]:
np.allclose(cosine_similarity_manual, cosine_similarity_operator)

True

### Calculate cosine distance
The cosine distance between vectors $a$ and $b$ is simply 1 minus the cosine similarity between $a$ and $b$: <br>
$$ 1 - \text{cossim}(a, b) $$

Using numpy, this can be calculated as follows:


In [45]:
1 - cosine_similarity_manual

array([[ 1.78813934e-07,  4.88203287e-01,  7.58321926e-01,
         7.65766025e-01],
       [ 4.88203287e-01,  0.00000000e+00,  7.92681277e-01,
         7.58552596e-01],
       [ 7.58321926e-01,  7.92681277e-01, -7.15255737e-07,
         4.97044146e-01],
       [ 7.65766025e-01,  7.58552596e-01,  4.97044146e-01,
        -2.38418579e-07]])
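
Some diagonal entries above are tiny negative numbers. This is a floating-point rounding artifact: the corresponding self-similarities came out marginally above 1. If negative distances would cause problems downstream, one possible remedy is to clip the result at zero, as in the following sketch, which hard-codes the values printed for `cosine_similarity_manual`.

```python
import numpy as np

# Cosine similarity values copied from the `cosine_similarity_manual` output above.
cosine_similarity = np.array([
    [0.99999982, 0.51179671, 0.24167807, 0.23423398],
    [0.51179671, 1.0,        0.20731872, 0.2414474],
    [0.24167807, 0.20731872, 1.00000072, 0.50295585],
    [0.23423398, 0.2414474,  0.50295585, 1.00000024],
])

# Clip at zero so rounding error cannot produce a negative distance.
cosine_distance = np.clip(1 - cosine_similarity, 0.0, None)
print(cosine_distance.min())  # 0.0
```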

## Exercise 3 - Similarity Search Using a Query
In the above examples, we calculated similarity between 4 documents:

```python
documents = [
    'Bugs introduced by the intern had to be squashed by the lead developer.',
    'Bugs found by the quality assurance engineer were difficult to debug.',
    'Bugs are common throughout the warm summer months, according to the entomologist.',
    'Bugs, in particular spiders, are extensively studied by arachnologists.'
]
```

Now, your task is to find which of these 4 documents is most similar to the query `Who is responsible for a coding project and fixing others' mistakes?` using cosine similarity. You can reuse the `documents` list and the `normalized_embeddings_manual` array in your answer:


In [46]:
# First, embed the query:
query_embedding = model.encode(
    ["Who is responsible for a coding project and fixing others' mistakes?"]
)

In [47]:
# Second, normalize the query embedding:
normalized_query_embedding = torch.nn.functional.normalize(
    torch.from_numpy(query_embedding)
).numpy()

In [48]:
# Third, calculate the cosine similarity between the documents and the query by using the dot product:
cosine_similarity_q3 = normalized_embeddings_manual @ normalized_query_embedding.T

In [51]:
# Fourth, find the position of the vector with the highest cosine similarity:
highest_cossim_position = cosine_similarity_q3.argmax()
highest_cossim_position


np.int64(0)

In [50]:
# Fifth, find the document in that position in the `documents` list:
documents[highest_cossim_position]

'Bugs introduced by the intern had to be squashed by the lead developer.'
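
The steps above return only the single best match. A natural extension is to return the top $k$ matches in ranked order. The sketch below shows one way to do that; `top_k_documents` is a hypothetical helper, and the toy unit vectors stand in for the normalized document and query embeddings. Unlike `normalized_query_embedding`, which has shape (1, 384), the query here is assumed to be a 1-D vector.

```python
import numpy as np

def top_k_documents(normalized_docs, normalized_query, k):
    # Hypothetical helper: returns the indices and scores of the k most similar rows, best first.
    # Because all inputs are unit length, the dot product equals the cosine similarity.
    scores = normalized_docs @ normalized_query
    ranked = np.argsort(scores)[::-1]  # indices sorted by descending similarity
    return ranked[:k], scores[ranked[:k]]

# Toy unit vectors standing in for normalized document embeddings.
toy_docs = np.array([[1.0, 0.0],
                     [0.6, 0.8],
                     [0.0, 1.0]])
toy_query = np.array([0.8, 0.6])  # already unit length

indices, scores = top_k_documents(toy_docs, toy_query, k=2)
print(indices)  # [1 0]
```

With the real embeddings, the 2-D query array could be flattened first, for example with `normalized_query_embedding[0]`, before being passed to a helper like this.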