
# **Understanding Embeddings with BERT**

In this notebook, we go into the world of embeddings, utilizing the BERT model to understand the semantic relationships between words in different contexts.

##  Setup and Installation

Start by installing the necessary libraries to ensure all functionalities are available.

In [1]:
!pip install transformers==4.29.2
!pip install scipy==1.7.3

Collecting transformers==4.29.2
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.1/7.1 MB[0m [31m24.1 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.29.2)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.8/7.8 MB[0m [31m35.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled tokenizers-0.19.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.41.1
    Uninstalling transformers-4.41.1:
      Successfully uninstalled transformers-4.41.1
Successfully installed tokenizers-0.13.3 transformers-4.29.2
Collecting scipy==1.7.3
  Down

##  Importing Libraries

Import essential modules for our tasks.

In [2]:
from transformers import BertModel, AutoTokenizer
from scipy.spatial.distance import cosine



##  Model Setup

Load the pre-trained BERT model and tokenizer. This model will help us extract embeddings for our analysis.

In [3]:
# Defining the model name
model_name = "bert-base-cased"

# Loading the pre-trained model and tokenizer
model = BertModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]


##  Function Definition: Predict

Define a function that encodes input text into tensors, which are then fed to the model to obtain embeddings.

In [4]:
# Defining a function to encode the input text and get model predictions
def predict(text):
    encoded_inputs = tokenizer(text, return_tensors="pt")
    # same as return model(**encoded_inputs)[0]
    return model(**encoded_inputs).last_hidden_state
    # The ** operator unpacks the encoded_inputs dictionary so that each key-value pair is passed as a separate argument to the model.
    # The model expects input in the form of keyword arguments like input_ids and attention_mask.

##  Defining the Sentences

Set up sentences to analyze. The

In [5]:
# Defining the sentences
sentence3 = "The course is made for people who want to learn DSA from A to Z for free in a well-organized and structured manner."
sentence4 = "The course quality is better than what you get in paid courses, the only thing we don’t provide is doubt support, but trust me our YouTube video comments resolve that as well, we have a wonderful community of 250K+ people who engage in all of the videos."
sentence1 = "There was a fly drinking from my soup"
sentence2 = "There is a fly swimming in my juice"
# sentence2 = "To become a commercial pilot, he had to fly for 1500 hours." # second fly example

# Tokenizing the sentences
tokens1 = tokenizer.tokenize(sentence1)
tokens2 = tokenizer.tokenize(sentence2)

##  Tokenization and Model Predictions

Tokenize the sentences and obtain predictions (embeddings) from the model.

In [6]:
# Getting model predictions for the sentences
out1 = predict(sentence1)
out2 = predict(sentence2)

# out1 is a tensor with shape [1, 8, 768]
# (assuming the hidden size is 768 and there are 8 tokens).
out1

tensor([[[ 0.5312,  0.2162,  0.0967,  ..., -0.2778,  0.2341, -0.2684],
         [ 0.1996, -0.2155, -0.1150,  ...,  0.1952,  0.5705,  0.0951],
         [ 0.2500, -0.0971,  0.5910,  ...,  0.1830, -0.0093,  0.0130],
         ...,
         [-0.1890, -0.1836,  0.0684,  ..., -0.1357,  0.2555,  0.3643],
         [ 0.2328,  0.0131,  0.2131,  ..., -0.2474, -0.2384,  0.0111],
         [ 1.3495,  0.1425,  0.1270,  ..., -0.5699,  0.5849, -0.7200]]],
       grad_fn=<NativeLayerNormBackward0>)

### Extracting Embeddings with Technical Details

The line `emb1 = out1[0, tokens1.index("fly"), :].detach()` is extracting the embedding (hidden state representation) of the word "fly" from the BERT model's output for a given sentence. Here’s a detailed breakdown of each part:

#### Model Output: `out1`

- `out1 = predict(sentence1)`
- The `predict` function takes a sentence as input, tokenizes it, and passes it through the BERT model.
- The model’s output, `out1`, is a tensor containing the hidden states for each token in the input sentence. The shape of this tensor is `[batch_size, sequence_length, hidden_size]`.

#### Batch Dimension: `[0, ...]`

- `out1[0, ...]`
- Since the batch size is 1 (we're processing one sentence at a time), `out1[0, ...]` selects the first (and only) batch.

#### Token Index: `tokens1.index("fly")`

- `tokens1 = tokenizer.tokenize(sentence1)`
- This tokenizes the input sentence into individual tokens (subwords in the BERT vocabulary).
- `tokens1.index("fly")` finds the index of the token "fly" in the list of tokens.
- This index is used to access the hidden state corresponding to the token "fly".

#### Hidden State Dimension: `[..., :]`

- `out1[0, tokens1.index("fly"), :]`
- The `:` operator is used to select all hidden units (the hidden size dimension) for the token "fly".



In [7]:
# Extracting embeddings for the word 'fly' in both sentences
emb1 = out1[0:, tokens1.index("fly"), :].detach()[0]
emb2 = out2[0:, tokens2.index("fly"), :].detach()[0]

# emb1 = out1[0:, 3, :].detach()
# emb2 = out2[0:, 3, :].detach()
emb1.size()

torch.Size([768])

In [8]:
emb1 = out1[0:, tokens1.index("fly"), :].detach()
emb1.size()

torch.Size([1, 768])

##  Calculating Cosine Similarity

Calculate the cosine similarity between the embeddings of the word "fly" from both sentences to measure how context affects meaning.

In [9]:
# Calculating the cosine similarity between the embeddings
cosine(emb1, emb2)

0.06798791885375977