<a href="https://colab.research.google.com/github/StrategicalIT/PipedPiperAI/blob/main/Lab03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB3: Sentence Embeddings with SBERT
In this lab we are going to explore sentence embeddings using [Sentence Transformers, a.k.a. "SBERT"](https://sbert.net/).

This library was first released in 2019 and is still actively used and developed. Its [GitHub repo](https://github.com/UKPLab/sentence-transformers/) has 16K stars and 200 contributors. As its name indicates, it leverages transformers, which use the "attention" mechanism introduced by Google in 2017. You can choose from a "wide" selection of over 5,000 :) pre-trained Sentence Transformers models available on 🤗 Hugging Face. A common way of deciding which model to use is to check the [Massive Text Embedding Benchmark (MTEB) leaderboard](https://huggingface.co/spaces/mteb/leaderboard).


## Install dependencies

The first step is to install the necessary libraries. In this case we will install the [Sentence Transformers](https://pypi.org/project/sentence-transformers/) Python library. This library is a popular collection of tools and models to compute embeddings for sentences, paragraphs and images. The models are based on transformer networks like BERT, RoBERTa and XLM-RoBERTa.

In [1]:
!pip install sentence-transformers

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch>=1.11.0->sentence-transformers)
 

Let's import a few things we need.

In [2]:
from sentence_transformers import SentenceTransformer
import numpy as np

## Load the model and create embeddings

We are going to create a model by pulling it from Hugging Face. This one is about 440 MB, so it should come down in a minute or less.

In [3]:
model = SentenceTransformer('bert-base-nli-mean-tokens')  # a smaller alternative to try: 'all-MiniLM-L6-v2'

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/625 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/399 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Let's create a small corpus with a few sentences and use the model to encode them.

In [4]:
sentences = [
       "I ate dinner.",
       "We had a three-course meal.",
       "Brad came to dinner with us.",
       "He loves fish tacos.",
       "In the end, we all felt like we ate too much.",
       "We all agreed; it was a magnificent evening."]
sentence_embeddings = model.encode(sentences)

The sentences have now been encoded and we can examine them. Notice below the size of the vectors this model creates. Different models use different numbers of dimensions.

The variable "sentence_embeddings" we have just created is a NumPy array with one row per sentence, so what you are looking at is the embedding for the first sentence, i.e. the one with index=0. As you can see, it also includes negative values.

In [5]:
print('The size of a vector is', len(sentence_embeddings[0]))
print('This is the embedding vector for the first sentence', sentence_embeddings[0])

The size of a vector is 768
This is the embedding vector for the first sentence [ 1.71653569e-01  2.26478279e-02  1.93014562e+00 -2.18605831e-01
  3.40927951e-02  5.23479044e-01 -1.25486863e+00  9.17776406e-01
 -3.36900681e-01 -5.32600224e-01  3.54305506e-01  8.74753892e-01
  7.92069674e-01  1.30202636e-01  3.55883059e-03  5.88225685e-02
  5.33439279e-01 -2.87108511e-01  1.44994557e-01 -8.17292869e-01
 -2.01646611e-02  1.39118955e-01 -9.96068954e-01  1.77107334e-01
  4.00360562e-02  4.30928618e-01 -2.58670002e-01  4.17331070e-01
  1.20064068e+00  9.88091156e-02 -2.55320042e-01 -1.98270939e-02
  8.69255245e-01 -8.33961904e-01  1.74719572e-01 -7.85964727e-01
 -1.65305868e-01  2.84225494e-01 -4.37051386e-01  6.77007675e-01
 -3.97849709e-01  1.01804338e-01  7.46632755e-01  4.15896565e-01
 -2.42899861e-02  2.68644661e-01  1.04453194e+00  1.43222380e+00
  4.57947344e-01 -1.08110774e+00  9.75882709e-01 -1.14395511e+00
 -2.62539387e-01  6.03446484e-01 -5.55950940e-01  1.41757774e+00
  2.096866
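
Since `encode` returns a NumPy array by default, the embeddings can be inspected with ordinary array operations. The sketch below is only an illustration: it uses a random stand-in array (`stand_in_embeddings`, a made-up name) with the same 6 x 768 shape as this lab's embeddings, so that it runs without downloading the model. With the real model you would use `sentence_embeddings` instead.

```python
import numpy as np

# Stand-in for the lab's embeddings: 6 sentences x 768 dimensions.
# The values are random and carry no meaning; only the shape matters here.
rng = np.random.default_rng(0)
stand_in_embeddings = rng.normal(size=(6, 768)).astype(np.float32)

# One row per sentence, one column per embedding dimension
print(stand_in_embeddings.shape)     # (6, 768)

# Indexing with 0 selects the vector for the first sentence
first_vector = stand_in_embeddings[0]
print(first_vector.shape)            # (768,)

# Printing a short slice is easier to read than all 768 values
print(first_vector[:5])
```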

## Calculate similarities

We can use the "similarity" method to compare a query sentence to all the embeddings in our corpus. But first we need to convert our query into a vector.

In [6]:
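query = ["I had pizza and pasta for dinner"]
# Encode the query with the same model so it lives in the same vector space as the corpus
query_embedding = model.encode(query)
# The result has one row per query and one column per corpus sentence
similarities = model.similarity(query_embedding, sentence_embeddings)
for sentence, score in zip(sentences, similarities[0]):
    print(f'{sentence:<50}', " : ", score.detach().numpy())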


I ate dinner.                                       :  0.76924634
We had a three-course meal.                         :  0.59097755
Brad came to dinner with us.                        :  0.6792341
He loves fish tacos.                                :  0.6034061
In the end, we all felt like we ate too much.       :  0.3894853
We all agreed; it was a magnificent evening.        :  0.14968927


The closer the similarity score is to 1, the closer the query is semantically to that sentence. Does the result make sense?
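
To see what a similarity score measures, here is a minimal sketch of cosine similarity in NumPy. It assumes the model uses cosine similarity, which is the library's default unless a model is configured otherwise. The helper `cosine_similarity` and the toy 3-dimensional vectors are made up for illustration and are not part of the lab.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, invented for this example (the lab's real embeddings have 768 dimensions)
query_vec = np.array([1.0, 2.0, 0.0])
same_direction = np.array([2.0, 4.0, 0.0])   # longer, but points the same way
orthogonal = np.array([0.0, 0.0, 5.0])       # shares no direction with the query
opposite = np.array([-1.0, -2.0, 0.0])       # points the opposite way

print(cosine_similarity(query_vec, same_direction))  # close to 1
print(cosine_similarity(query_vec, orthogonal))      # 0
print(cosine_similarity(query_vec, opposite))        # close to -1
```

The score depends only on the direction of the vectors, not on their length, which is why the longer `same_direction` vector still scores close to 1.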


## Ideas to explore further

Try changing the queries and observe the results.
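
When trying several queries, it can help to sort the sentences by score so the best matches appear first. The sketch below is one possible approach, not part of the original lab: `rank_by_score` is a hypothetical helper, and the scores are copied from the output above so the snippet runs without the model.

```python
import numpy as np

sentences = [
    "I ate dinner.",
    "We had a three-course meal.",
    "Brad came to dinner with us.",
    "He loves fish tacos.",
    "In the end, we all felt like we ate too much.",
    "We all agreed; it was a magnificent evening.",
]

# Scores copied from the lab output for the query "I had pizza and pasta for dinner"
scores = np.array([0.76924634, 0.59097755, 0.6792341, 0.6034061, 0.3894853, 0.14968927])

def rank_by_score(sentences, scores, top_k=3):
    """Return the top_k (sentence, score) pairs, highest score first (hypothetical helper)."""
    order = np.argsort(scores)[::-1][:top_k]   # indices sorted by descending score
    return [(sentences[i], float(scores[i])) for i in order]

for sentence, score in rank_by_score(sentences, scores):
    print(f"{score:.3f}  {sentence}")
```

With the model loaded, passing `similarities[0].numpy()` as `scores` should give the same ranking for a live query.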

### End of Lab 3