# 24 Spring CISC 484/684 Assignment 1 - Analyzing Paragraph Similarity with BERT Embeddings on Google Colab


## Objective
This assignment aims to familiarize you with the use of pretrained BERT (Bidirectional Encoder Representations from Transformers) for generating sentence embeddings and comparing their similarities. By leveraging Google Colab and the HuggingFace transformers library, you will produce embeddings for selected paragraphs and compute dot-product and cosine similarity measures to evaluate the model's ability to discern contextual similarities and differences.

## Part I: Set up the Environment

1) To begin your assignment, please create your own version of this Colab notebook. Click on "File" at the top of the menu bar, then choose "Save a copy in Drive". This will generate a personal copy of the notebook in your Google Drive where you can complete the tasks. Upon finishing your homework, please turn in both a downloaded copy of the notebook (in .ipynb format) and a shareable link to your Colab notebook for review and grading. This is explained further at the end of this notebook.




2) The first task is to install the [HuggingFace Transformers library](https://huggingface.co/docs/transformers/en/index). The linked page is the official documentation of the HuggingFace `transformers` library; it lists the other transformer models that the library supports and provides further tutorials on using it.

Please execute the following cell to install HuggingFace's `transformers`.

In [None]:
!pip install transformers



3) After installing, import the BERT model from the `transformers` library along with the framework of your choice, TensorFlow or PyTorch.

3.1) If you are more familiar with TensorFlow, import your libraries by executing the following cell. Click [here](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.TFBertModel) to learn more about `TFBertModel`.

In [None]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel

3.2) If you are more familiar with PyTorch, you may import your libraries by executing the following cell. Click [here](https://huggingface.co/docs/transformers/en/model_doc/bert#transformers.BertModel) to learn more about `BertModel`.

In [None]:
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertModel

## Part II: Choose Your Paragraphs

Please choose three paragraphs for comparison, each consisting of 1-3 sentences. The first paragraph will serve as the base. The second should be contextually similar to the first, and the third should differ significantly in context from the first.

In [None]:
# base paragraph
paragraph_base = "Chennai Super Kings is a professional cricket franchise based in Chennai, Tamil Nadu, India, that competes in the Indian Premier League"

# Contextually similar paragraph
# paragraph_similar = "Chennai Super Kings won five titles in Indian Premier League"
paragraph_similar =  "Mumbai Indians is a professional cricket franchise based in Mumbai, Maharastra, India, that competes in the Indian Premier League"

# Contextually different paragraph
paragraph_different = "My name is Nikhil"

## Part III: Generate Embeddings and Compute Similarities

Before generating embeddings, you must convert your paragraphs into tokens, because BERT requires tokenized input. Begin by loading the pretrained BERT tokenizer. To learn more about the [tokenizer](https://huggingface.co/docs/transformers/v4.38.1/en/model_doc/bert#transformers.BertTokenizer) and the [BERT base model (uncased)](https://huggingface.co/google-bert/bert-base-uncased), follow the links to the corresponding documentation.

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Next, tokenize your paragraphs as the first step in the TensorFlow or PyTorch section below.



### TensorFlow

If you choose TensorFlow as your deep learning framework, follow the cells below. If you are using PyTorch, skip ahead to the PyTorch section that follows the TensorFlow one.



1) Tokenize your paragraphs by running the cell below.

Padding adds special [PAD] tokens to shorter sequences so that they match the length of the longest sequence in a batch (or a specified maximum length). Truncation shortens sequences that exceed a specified maximum length, which defaults to the model's maximum input size, in this case [512 tokens for the BERT base model](https://github.com/mim-solutions/bert_for_longer_texts). Both techniques handle sequences of varying lengths in batch processing. Because each paragraph here is tokenized on its own, and paragraphs of 1-3 sentences are far shorter than 512 tokens, neither option changes the result in this cell. To learn more about padding and truncation, refer to the [documentation](https://huggingface.co/docs/transformers/en/pad_truncation).
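
To make the two operations concrete, here is a minimal sketch on plain Python lists rather than the real tokenizer. The helper `pad_and_truncate` is hypothetical and written only for illustration; the IDs 101, 102, and 0 match the [CLS], [SEP], and [PAD] IDs of `bert-base-uncased`, while the other IDs are made up. Unlike the real tokenizer, this toy truncation simply cuts the list and does not keep [SEP] at the end.

```python
# Toy illustration of padding and truncation on lists of token IDs.
PAD_ID = 0  # [PAD] in bert-base-uncased


def pad_and_truncate(batch, max_length=None):
    """Cut each sequence to max_length (if given), then pad every sequence
    to the longest remaining one. Returns (input_ids, attention_mask)."""
    if max_length is not None:
        batch = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return input_ids, attention_mask


batch = [[101, 7, 8, 9, 102], [101, 7, 102]]

ids, mask = pad_and_truncate(batch)
print(ids)   # [[101, 7, 8, 9, 102], [101, 7, 102, 0, 0]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]

# A batch of one sequence has nothing to be padded against.
single_ids, single_mask = pad_and_truncate([[101, 7, 102]])
print(single_ids)  # [[101, 7, 102]]

# Truncation to 4 positions shortens the first sequence only.
trunc_ids, trunc_mask = pad_and_truncate(batch, max_length=4)
print(trunc_ids)  # [[101, 7, 8, 9], [101, 7, 102, 0]]
```

The attention mask marks which positions hold real tokens (1) and which hold padding (0), which matters once several paragraphs are encoded in one batch.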


In [None]:
encoded_input_b = tokenizer(paragraph_base, padding=True, truncation=True, return_tensors='tf')
encoded_input_s = tokenizer(paragraph_similar, padding=True, truncation=True, return_tensors='tf')
encoded_input_d = tokenizer(paragraph_different, padding=True, truncation=True, return_tensors='tf')

2) Load the [BERT base model](https://huggingface.co/google-bert/bert-base-uncased) for TensorFlow:


In [None]:
model_tf = TFBertModel.from_pretrained('bert-base-uncased')

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

3) Get model outputs

In [None]:
outputs_tf_b = model_tf(encoded_input_b)
outputs_tf_s = model_tf(encoded_input_s)
outputs_tf_d = model_tf(encoded_input_d)

The output from the model call `model_tf(encoded_input)` typically includes several components for BERT models:


1. `last_hidden_state`: The final layer's hidden states for each token in the sequence. For each token, this is a vector representing the token's contextual embedding.
2. `pooler_output`: The final-layer hidden state of the [CLS] token after a further dense layer and tanh activation, often used for classification tasks.
3. Optionally, `hidden_states` and `attentions`: If the model is configured to return these, they contain the hidden states and attention weights from all layers of the model, respectively.
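
To show how `pooler_output` relates to the [CLS] hidden state, the sketch below applies a dense layer followed by tanh to a toy vector, which is our understanding of what the HuggingFace BERT pooler does. The function name `pooler`, the hidden size of 3 (instead of 768), and all weight values are assumptions made up for illustration.

```python
import math


def pooler(cls_hidden, weight, bias):
    """Dense layer followed by tanh, applied to the [CLS] hidden state."""
    return [
        math.tanh(sum(w * h for w, h in zip(row, cls_hidden)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy [CLS] hidden state and made-up parameters (identity weights, zero bias).
cls_hidden = [0.5, -1.0, 2.0]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

pooled = pooler(cls_hidden, weight, bias)
print(pooled)  # every entry lies strictly between -1 and 1 because of tanh
```

Because of the tanh, `pooler_output` is a squashed transformation of the [CLS] vector, so it is not the same tensor as `last_hidden_state[:, 0, :]`.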

The `last_hidden_state` tensor can be used to obtain token-level embeddings, which we will explore next.

4) Obtain the embeddings

Here, we examine two different embedding methods:

1. Using the mean of the last hidden state outputs across the sequence length dimension to generate a sentence-level embedding (the mean pooling method). This approach averages the token embeddings produced by the BERT model, providing a single vector that represents the entire input sequence.

2. Using the [CLS] token embedding, which is intended to capture the context of the entire sequence for classification tasks. The [CLS] embedding is the first token's output from the last hidden layer and is meant for sequence-level tasks, whereas mean pooling combines information from all tokens into one representation. Although the [CLS] embedding is most commonly used for classification, its ability to summarize sequence context also makes it usable for similarity comparisons between texts. This method is worth understanding now, since it reappears in text classification topics such as sentiment analysis.

With the [CLS] token embedding introduced, we return to mean pooling, which can be done in two ways.

(1) One way is to compute the mean across the sequence dimension (axis=1) while excluding the first token of each sequence. This is useful when you want to leave the special [CLS] token out of the average embedding. The [CLS] token is often used for classification tasks, and excluding it may be preferable when only the content tokens' embeddings are relevant. Note that the slice `[:, 1:, :]` still includes the final [SEP] token.

In [None]:
# mean pooling method that excludes the [CLS] token

# embeddings_mean_b = tf.reduce_mean(outputs_tf_b.last_hidden_state[:, 1:, :], axis=1)
# embeddings_mean_s = tf.reduce_mean(outputs_tf_s.last_hidden_state[:, 1:, :], axis=1)
# embeddings_mean_d = tf.reduce_mean(outputs_tf_d.last_hidden_state[:, 1:, :], axis=1)

(2) The second common way is to compute the mean across the sequence dimension (axis=1) over all tokens, that is, the [CLS] token together with every other token in the sequence. This might be useful when the context captured by the [CLS] token is also considered relevant for your task.

In [None]:
# mean pooling method that includes the [CLS] token

embeddings_mean_b = tf.reduce_mean(outputs_tf_b.last_hidden_state[:, :, :], axis=1)
embeddings_mean_s = tf.reduce_mean(outputs_tf_s.last_hidden_state[:, :, :], axis=1)
embeddings_mean_d = tf.reduce_mean(outputs_tf_d.last_hidden_state[:, :, :], axis=1)

In practice, the choice between these two variants may come down to trial and error and/or the specific needs of your task. In this homework, you may run either of the two cells above to obtain the mean-pooling embeddings (`embeddings_mean_b`, `embeddings_mean_s`, `embeddings_mean_d`) for reporting your findings.
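
The slicing used for these variants can be seen on a tiny, made-up `last_hidden_state` held in nested Python lists. This is only a sketch: the values are invented, the hidden size is 2 rather than 768, and `mean_over_tokens` is a hypothetical helper, not a library function.

```python
# Toy last_hidden_state with shape (batch=1, seq_len=4, hidden=2).
# Row 0 plays the role of [CLS]; row 3 plays the role of [SEP].
last_hidden_state = [[[4.0, 0.0], [1.0, 2.0], [3.0, 2.0], [0.0, 4.0]]]


def mean_over_tokens(tokens):
    """Average a list of token vectors into a single vector."""
    n = len(tokens)
    return [sum(tok[i] for tok in tokens) / n for i in range(len(tokens[0]))]


seq = last_hidden_state[0]
mean_all = mean_over_tokens(seq)         # like [:, :, :] followed by a mean over axis 1
mean_no_cls = mean_over_tokens(seq[1:])  # like [:, 1:, :] followed by a mean over axis 1
cls_vec = seq[0]                         # like [:, 0, :]

print(mean_all)     # [2.0, 2.0]
print(mean_no_cls)  # first entry 4/3, second entry 8/3
print(cls_vec)      # [4.0, 0.0]
```

All three results are single vectors of the hidden size, which is what makes them directly comparable across paragraphs of different lengths.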

The cell below provides code for obtaining the [CLS] token embedding.

In [None]:
# [CLS] token embedding
embeddings_cls_b = outputs_tf_b.last_hidden_state[:, 0, :]
embeddings_cls_s = outputs_tf_s.last_hidden_state[:, 0, :]
embeddings_cls_d = outputs_tf_d.last_hidden_state[:, 0, :]

Feel free to explore this [post](https://stackoverflow.com/questions/62705268/why-bert-transformer-uses-cls-token-for-classification-instead-of-average-over) for a deeper understanding of the utilization of the [CLS] token embedding in BERT and its distinctions from the mean pooling method.


5) Compute dot-product and cosine similarity:

In [None]:
# Dot product and cosine similarity for mean pooling

dot_product_mean_s = tf.reduce_sum(tf.multiply(embeddings_mean_b, embeddings_mean_s), axis=1)
dot_product_mean_d = tf.reduce_sum(tf.multiply(embeddings_mean_b, embeddings_mean_d), axis=1)

cosine_similarity_mean_s = tf.reduce_sum(tf.multiply(embeddings_mean_b, embeddings_mean_s), axis=1) / (
    tf.norm(embeddings_mean_b, axis=1) * tf.norm(embeddings_mean_s, axis=1)
)
cosine_similarity_mean_d = tf.reduce_sum(tf.multiply(embeddings_mean_b, embeddings_mean_d), axis=1) / (
    tf.norm(embeddings_mean_b, axis=1) * tf.norm(embeddings_mean_d, axis=1)
)

# Dot product and cosine similarity for [CLS] token

dot_product_cls_s = tf.reduce_sum(tf.multiply(embeddings_cls_b, embeddings_cls_s), axis=1)
dot_product_cls_d = tf.reduce_sum(tf.multiply(embeddings_cls_b, embeddings_cls_d), axis=1)


cosine_similarity_cls_s = tf.reduce_sum(tf.multiply(embeddings_cls_b, embeddings_cls_s), axis=1) / (
    tf.norm(embeddings_cls_b, axis=1) * tf.norm(embeddings_cls_s, axis=1)
)
cosine_similarity_cls_d = tf.reduce_sum(tf.multiply(embeddings_cls_b, embeddings_cls_d), axis=1) / (
    tf.norm(embeddings_cls_b, axis=1) * tf.norm(embeddings_cls_d, axis=1)
)
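
To see how the two measures computed above differ, the following sketch evaluates both on small hand-picked vectors in plain Python. The vectors and the helper names `dot` and `cosine` are illustrative assumptions and have nothing to do with real BERT embeddings.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by the product of the two Euclidean norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a, twice the length
c = [2.0, -1.0, 0.0]  # orthogonal to a

print(dot(a, b), cosine(a, b))  # 18.0 1.0
print(dot(a, c), cosine(a, c))  # 0.0 0.0
```

Doubling the length of a vector doubles its dot product with `a` but leaves the cosine similarity unchanged, so cosine similarity compares direction only, while the dot product also reflects magnitude.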


6) Now you can evaluate these tensors to get the actual values

In [None]:
dot_product_mean_s = dot_product_mean_s.numpy()
dot_product_mean_d = dot_product_mean_d.numpy()

cosine_similarity_mean_s = cosine_similarity_mean_s.numpy()
cosine_similarity_mean_d = cosine_similarity_mean_d.numpy()

dot_product_cls_s = dot_product_cls_s.numpy()
dot_product_cls_d = dot_product_cls_d.numpy()

cosine_similarity_cls_s = cosine_similarity_cls_s.numpy()
cosine_similarity_cls_d = cosine_similarity_cls_d.numpy()

print(dot_product_mean_s, dot_product_mean_d)
print(cosine_similarity_mean_s, cosine_similarity_mean_d)

print(dot_product_cls_s, dot_product_cls_d)
print(cosine_similarity_cls_s, cosine_similarity_cls_d)

[81.15794] [44.172375]
[0.9235468] [0.49835715]
[256.81693] [115.12463]
[0.9825427] [0.4938733]


### PyTorch

1) Tokenize your paragraphs by running the cell below.

Padding adds special [PAD] tokens to shorter sequences so that they match the length of the longest sequence in a batch (or a specified maximum length). Truncation shortens sequences that exceed a specified maximum length, which defaults to the model's maximum input size, in this case [512 tokens for the BERT base model](https://github.com/mim-solutions/bert_for_longer_texts). Both techniques handle sequences of varying lengths in batch processing. Because each paragraph here is tokenized on its own, and paragraphs of 1-3 sentences are far shorter than 512 tokens, neither option changes the result in this cell. To learn more about padding and truncation, refer to the [documentation](https://huggingface.co/docs/transformers/en/pad_truncation).


In [None]:
encoded_input_b = tokenizer(paragraph_base, padding=True, truncation=True, return_tensors='pt')
encoded_input_s = tokenizer(paragraph_similar, padding=True, truncation=True, return_tensors='pt')
encoded_input_d = tokenizer(paragraph_different, padding=True, truncation=True, return_tensors='pt')

2) Load the [BERT base model](https://huggingface.co/google-bert/bert-base-uncased) for PyTorch:

In [None]:
model = BertModel.from_pretrained('bert-base-uncased')

3) Get model outputs

In [None]:
with torch.no_grad():
    outputs_pt_b = model(**encoded_input_b)
    outputs_pt_s = model(**encoded_input_s)
    outputs_pt_d = model(**encoded_input_d)

The output from the model call `model(**encoded_input)` typically includes several components for BERT models:


1. `last_hidden_state`: The final layer's hidden states for each token in the sequence. For each token, this is a vector representing the token's contextual embedding.
2. `pooler_output`: The final-layer hidden state of the [CLS] token after a further dense layer and tanh activation, often used for classification tasks.
3. Optionally, `hidden_states` and `attentions`: If the model is configured to return these, they contain the hidden states and attention weights from all layers of the model, respectively.

The `last_hidden_state` tensor can be used to obtain token-level embeddings, which we will explore next.

4) Obtain the embeddings

Here, we examine two different embedding methods:

1. Using the mean of the last hidden state outputs across the sequence length dimension to generate a sentence-level embedding (the mean pooling method). This approach averages the token embeddings produced by the BERT model, providing a single vector that represents the entire input sequence.

2. Using the [CLS] token embedding, which is intended to capture the context of the entire sequence for classification tasks. The [CLS] embedding is the first token's output from the last hidden layer and is meant for sequence-level tasks, whereas mean pooling combines information from all tokens into one representation. Although the [CLS] embedding is most commonly used for classification, its ability to summarize sequence context also makes it usable for similarity comparisons between texts. This method is worth understanding now, since it reappears in text classification topics such as sentiment analysis.

With the [CLS] token embedding introduced, we return to mean pooling, which can be done in two ways.

(1) One way is to compute the mean across the sequence dimension (`dim=1`) while excluding the first token of each sequence. This is useful when you want to leave the special [CLS] token out of the average embedding. The [CLS] token is often used for classification tasks, and excluding it may be preferable when only the content tokens' embeddings are relevant. Note that the slice `[:, 1:, :]` still includes the final [SEP] token.

In [None]:
# mean pooling method that excludes the [CLS] token

# with torch.no_grad():
#     embeddings_mean_b = torch.mean(outputs_pt_b.last_hidden_state[:, 1:, :], dim=1)
#     embeddings_mean_s = torch.mean(outputs_pt_s.last_hidden_state[:, 1:, :], dim=1)
#     embeddings_mean_d = torch.mean(outputs_pt_d.last_hidden_state[:, 1:, :], dim=1)

(2) The second common way is to compute the mean across the sequence dimension (`dim=1`) over all tokens, that is, the [CLS] token together with every other token in the sequence. This might be useful when the context captured by the [CLS] token is also considered relevant for your task.

In [None]:
# mean pooling method that includes the [CLS] token

with torch.no_grad():
    embeddings_mean_b = torch.mean(outputs_pt_b.last_hidden_state[:, :, :], dim=1)
    embeddings_mean_s = torch.mean(outputs_pt_s.last_hidden_state[:, :, :], dim=1)
    embeddings_mean_d = torch.mean(outputs_pt_d.last_hidden_state[:, :, :], dim=1)

In practice, the choice between these two variants may come down to trial and error and/or the specific needs of your task. In this homework, you may run either of the two cells above to obtain the mean-pooling embeddings (`embeddings_mean_b`, `embeddings_mean_s`, `embeddings_mean_d`) for reporting your findings.
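
One caveat applies if you ever encode several paragraphs in a single batch instead of one at a time: shorter paragraphs are then padded, and a plain mean would average the [PAD] positions too. A common remedy is to average only the positions where the attention mask is 1. The sketch below shows the idea on made-up numbers in plain Python; `masked_mean` is a hypothetical helper, and the cells in this notebook do not need it because each paragraph is encoded separately.

```python
def masked_mean(hidden, mask):
    """Mean over the token vectors whose attention-mask entry is 1."""
    kept = [vec for vec, m in zip(hidden, mask) if m == 1]
    return [sum(v[i] for v in kept) / len(kept) for i in range(len(kept[0]))]


# One padded sequence: 3 real tokens followed by 2 [PAD] positions.
hidden = [[1.0, 1.0], [2.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 1, 0, 0]

plain = [sum(v[i] for v in hidden) / len(hidden) for i in range(2)]
masked = masked_mean(hidden, mask)

print(plain)   # [4.8, 5.4], distorted by the padded positions
print(masked)  # [2.0, 3.0], the mean of the 3 real tokens only
```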

The cell below provides code for obtaining the [CLS] token embedding.

In [None]:
# [CLS] token embedding

with torch.no_grad():
    embeddings_cls_b = outputs_pt_b.last_hidden_state[:, 0, :]
    embeddings_cls_s = outputs_pt_s.last_hidden_state[:, 0, :]
    embeddings_cls_d = outputs_pt_d.last_hidden_state[:, 0, :]

Feel free to explore this [post](https://stackoverflow.com/questions/62705268/why-bert-transformer-uses-cls-token-for-classification-instead-of-average-over) for a deeper understanding of the utilization of the [CLS] token embedding in BERT and its distinctions from the mean pooling method.


5) Compute dot-product and cosine similarity:

In [None]:
# Dot product and cosine similarity for mean pooling
dot_product_mean_s = torch.sum(embeddings_mean_b * embeddings_mean_s, dim=1)
dot_product_mean_d = torch.sum(embeddings_mean_b * embeddings_mean_d, dim=1)

cosine_similarity_mean_s = torch.sum(embeddings_mean_b * embeddings_mean_s, dim=1) / (
    torch.norm(embeddings_mean_b, dim=1) * torch.norm(embeddings_mean_s, dim=1)
)
cosine_similarity_mean_d = torch.sum(embeddings_mean_b * embeddings_mean_d, dim=1) / (
    torch.norm(embeddings_mean_b, dim=1) * torch.norm(embeddings_mean_d, dim=1)
)

# Dot product and cosine similarity for [CLS] token
dot_product_cls_s = torch.sum(embeddings_cls_b * embeddings_cls_s, dim=1)
dot_product_cls_d = torch.sum(embeddings_cls_b * embeddings_cls_d, dim=1)

cosine_similarity_cls_s = torch.sum(embeddings_cls_b * embeddings_cls_s, dim=1) / (
    torch.norm(embeddings_cls_b, dim=1) * torch.norm(embeddings_cls_s, dim=1)
)
cosine_similarity_cls_d = torch.sum(embeddings_cls_b * embeddings_cls_d, dim=1) / (
    torch.norm(embeddings_cls_b, dim=1) * torch.norm(embeddings_cls_d, dim=1)
)
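
An equivalent way to read the cosine formula above: it is the dot product of the two vectors after each has been scaled to unit length. The sketch below checks this on toy vectors in plain Python; the helper names are assumptions for illustration. PyTorch also ships a built-in, `F.cosine_similarity`, which could replace the manual formula in the cell above.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0]
b = [4.0, 3.0]

cosine_direct = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
cosine_via_unit = dot(l2_normalize(a), l2_normalize(b))

# Both routes give 24 / 25, up to floating-point rounding.
print(cosine_direct, cosine_via_unit)
```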


6) Now you can evaluate these tensors to get the actual values

In [None]:
dot_product_mean_s = dot_product_mean_s.numpy()
dot_product_mean_d = dot_product_mean_d.numpy()

cosine_similarity_mean_s = cosine_similarity_mean_s.numpy()
cosine_similarity_mean_d = cosine_similarity_mean_d.numpy()

dot_product_cls_s = dot_product_cls_s.numpy()
dot_product_cls_d = dot_product_cls_d.numpy()

cosine_similarity_cls_s = cosine_similarity_cls_s.numpy()
cosine_similarity_cls_d = cosine_similarity_cls_d.numpy()

# Print results
print(dot_product_mean_s, dot_product_mean_d)
print(cosine_similarity_mean_s, cosine_similarity_mean_d)

print(dot_product_cls_s, dot_product_cls_d)
print(cosine_similarity_cls_s, cosine_similarity_cls_d)

[81.15797] [44.172398]
[0.92354697] [0.49835733]
[256.81686] [115.12464]
[0.9825426] [0.4938734]


## Part IV: Analyze Your Results

1. For the embeddings between paragraph_base and paragraph_similar, how similar are these two paragraphs according to BERT?

  The dot products between paragraph_base and paragraph_similar are 81.15797 (mean pooling) and 256.81686 ([CLS]), both higher than the corresponding dot products between paragraph_base and paragraph_different. The cosine similarities are 0.92 (mean pooling) and 0.98 ([CLS]), which are very high. These values suggest that BERT represents the two paragraphs as very similar.

2. For the embeddings between paragraph_base and paragraph_different, how similar are these two paragraphs according to BERT?

  The dot products between paragraph_base and paragraph_different are 44.172398 (mean pooling) and 115.12464 ([CLS]), both lower than the corresponding dot products between paragraph_base and paragraph_similar. The cosine similarities between paragraph_base and paragraph_different are 0.498 (mean pooling) and 0.494 ([CLS]), roughly half the values for the similar pair. These lower values suggest that BERT represents the two paragraphs as considerably less similar.

3. What does this tell you about the nature of sentence embeddings and their use in understanding contextual similarity?

  The results suggest that sentence embeddings, generated by BERT in this case, can capture contextual similarity between sentences: the similar pair scores clearly higher than the different pair under both measures. Cosine similarity compares only the direction of the two vectors, whereas the dot product also depends on their magnitudes. BERT embeddings consider not only individual words but also their surrounding context, enabling a nuanced understanding of similarity between sentences.
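
As a rough check on how the two measures relate, note that the dot product equals the cosine similarity times the product of the two vector norms, so dividing a reported dot product by the matching cosine similarity recovers that product. The sketch below does this with the PyTorch values printed in Part III; the dictionary layout and variable names are assumptions chosen for illustration.

```python
# Reported PyTorch results: (dot product, cosine similarity) for each pair.
results = {
    ("mean", "similar"): (81.15797, 0.92354697),
    ("mean", "different"): (44.172398, 0.49835733),
    ("cls", "similar"): (256.81686, 0.9825426),
    ("cls", "different"): (115.12464, 0.4938734),
}

# dot = cos * ||a|| * ||b||, so dot / cos gives the product of the two norms.
norm_products = {
    key: dot_value / cos_value for key, (dot_value, cos_value) in results.items()
}

for (method, pair), value in norm_products.items():
    print(method, pair, round(value, 1))
```

The quotient comes out near 88 for both mean-pooling pairs and between roughly 233 and 261 for the [CLS] pairs, so the larger [CLS] dot products mostly reflect longer vectors rather than greater similarity.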

4. Which embedding method, mean pooling or the [CLS] token, would you find more effective for this analysis? Please explain your choice.

  In this analysis, both mean pooling and the [CLS] token embedding rank the paragraphs correctly: the similar pair scores far higher than the different pair. The [CLS] token embedding appears slightly more effective here, because it gives a higher cosine similarity for the similar pair (0.98 versus 0.92) and a marginally lower one for the different pair (0.494 versus 0.498), so it separates the two cases more widely.

5. Please investigate the meaning and role of each dimension in the `last_hidden_state` output from the BERT model through comprehensive research. This investigation will deepen your understanding of how BERT processes and represents textual data. First print out the three dimensions by executing the code cells below.

  The output shape is (1, 26, 768): a batch of 1 sequence, 26 tokens in that sequence, and a 768-dimensional vector for each token (the hidden size of the BERT base model).
  The 26 is the token length of paragraph_base after tokenization. Its 21 words and 3 commas yield 24 tokens, and BERT adds two special tokens, [CLS] at the start and [SEP] at the end, giving 26. In general the token count can exceed the word count, because the WordPiece tokenizer may split a word into several sub-word tokens.

In [None]:
# for TensorFlow
print(outputs_tf_b.last_hidden_state[:, :, :].shape)

(1, 26, 768)


In [None]:
# for PyTorch
print(outputs_pt_b.last_hidden_state[:, :, :].shape)

torch.Size([1, 26, 768])
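
The sequence-length dimension counts tokens rather than words, and BERT's WordPiece tokenizer may split an unfamiliar word into several pieces. The sketch below is a simplified greedy longest-match-first splitter over a tiny hypothetical vocabulary; the real `bert-base-uncased` vocabulary has about 30,000 entries, and the way it actually splits any given word is not shown here.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one lowercase word into pieces.
    Pieces that continue a word carry a '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen so that one word needs two pieces.
vocab = {"my", "name", "is", "nik", "##hil"}

tokens = ["[CLS]"]
for word in "my name is nikhil".split():
    tokens += wordpiece(word, vocab)
tokens.append("[SEP]")

print(tokens)       # ['[CLS]', 'my', 'name', 'is', 'nik', '##hil', '[SEP]']
print(len(tokens))  # 7
```

With this toy vocabulary, four words become five word pieces plus the two special tokens, so the sequence length would be 7 rather than 4.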
