## TI3160TU: Natural Language Processing - Transformers and BERT models Lab

In this hands-on lab, we will explore BERT models. As we have seen in the lecture, BERT models are based on Transformers and allow us to extract contextual embeddings as well as perform various NLP tasks, either by acting as a feature extractor or by being fine-tuned for a specific task. In this lab, we will explore three main aspects of BERT models:

1. **Basic Functionalities of BERT models**
2. **Performing NLP Classification tasks using BERT models as a feature extractor**
3. **Performing NLP Classification tasks by fine-tuning BERT models and training a classification head**

For the purposes of this lab, we are going to use DistilBERT instead of the original version of BERT released by Google. The main reason for using DistilBERT is that it is a lighter and faster variant of BERT, designed for cases where computational resources or speed are constrained. Nevertheless, DistilBERT retains 97% of BERT's performance while using 40% fewer parameters. Everything that we will see in this lab applies to the original BERT model as well.

### 1. Basic Functionalities of BERT models

We will start this lab by demonstrating some basic functionalities of BERT models. We will focus on:
    
1. Extracting Contextual Word Embeddings
2. Extracting Contextual Sentence/Document Embeddings

#### 1.1. Extracting Contextual Word Embeddings

We begin by demonstrating how we can leverage BERT-based models to extract contextual embeddings for the words in our documents. We will follow these steps:
1. Load a BERT-based model using the Transformers library. As mentioned above, we will use DistilBERT.
2. Extract contextual word embeddings.
3. Demonstrate how the same token gets different embeddings depending on the provided context (i.e., the surrounding words).


In [1]:
!pip install transformers
!pip install tqdm



In [2]:
# Basic functionality: load DistilBERT and extract contextual word embeddings
from transformers import DistilBertTokenizer, DistilBertModel
import torch

# Load DistilBERT
distilbert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
distilbert_model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# Define two example documents. These are the examples included in the Lecture
example1 = "I hit the ball with a baseball bat."
example2 = "I was scared because I saw a bat."


# function to extract contextual embeddings from BERT-based models
# INPUT: the pre-trained BERT-based model, its tokenizer, and the document that we want to get embeddings for
# OUTPUT: a tensor of shape (number of tokens, hidden size) with one embedding per token
def extract_embeddings(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt")
    # gradients are not needed when we only extract features
    with torch.no_grad():
        outputs = model(**inputs)
    # drop the batch dimension, since we pass a single document
    embeddings = outputs.last_hidden_state.squeeze(0)
    return embeddings


# extract the embeddings for our examples
example1_embeddings = extract_embeddings(distilbert_model, distilbert_tokenizer, example1)
example2_embeddings = extract_embeddings(distilbert_model, distilbert_tokenizer, example2)

# let's inspect the shapes of our embeddings
print(f"Example 1: {example1} Shape of Embeddings for Example 1: {example1_embeddings.shape}")
print(f"Example 2: {example2} Shape of Embeddings for Example 2: {example2_embeddings.shape}")


Example 1: I hit the ball with a baseball bat. Shape of Embeddings for Example 1: torch.Size([11, 768])
Example 2: I was scared because I saw a bat. Shape of Embeddings for Example 2: torch.Size([11, 768])


We observe that for each of our examples, we obtained a PyTorch tensor that contains 11 embeddings (the first dimension), each with 768 dimensions (the size of the embeddings that DistilBERT generates).

The question here is what each of these 11 embeddings corresponds to. Looking at our examples, we observe that both documents consist of 8 words and 1 punctuation character. So you might be wondering why we get 11 embeddings. To demystify this, let's look at how our BERT-based model sees these examples after tokenization.

In [3]:
#helper function that tokenizes a document based on the tokenizer function of our BERT model
#INPUT: the tokenizer function from the BERT model and the document that we want to tokenize
#OUTPUT: a list of tokens
def extract_tokens(tokenizer_function, text):
    inputs = tokenizer_function(text, add_special_tokens=True, return_tensors="pt", truncation=True, padding=True)
    tokens = tokenizer_function.convert_ids_to_tokens(inputs["input_ids"][0])
    return tokens

# print the tokens of each example based on DistilBERT tokenizer
print(f"Example 1: {example1} Tokenized Example 1 using DistilBERT tokenizer: {extract_tokens(distilbert_tokenizer, example1)}")
print(f"Example 2: {example2} Tokenized Example 2 using DistilBERT tokenizer: {extract_tokens(distilbert_tokenizer, example2)}")



Example 1: I hit the ball with a baseball bat. Tokenized Example 1 using DistilBERT tokenizer: ['[CLS]', 'i', 'hit', 'the', 'ball', 'with', 'a', 'baseball', 'bat', '.', '[SEP]']
Example 2: I was scared because I saw a bat. Tokenized Example 2 using DistilBERT tokenizer: ['[CLS]', 'i', 'was', 'scared', 'because', 'i', 'saw', 'a', 'bat', '.', '[SEP]']


Remember that BERT-based models use a sub-word tokenizer and also include some special tokens, such as [CLS] and [SEP]. By tokenizing our examples using our model's tokenizer, we can observe that each example indeed consists of 11 tokens (each token has one embedding). For instance, for example 1 (counting positions from 1):
1. In position 1: Embedding of token **[CLS]**
2. In position 2: Embedding of token **i**
3. In position 3: Embedding of token **hit**
4. In position 4: Embedding of token **the**
5. In position 5: Embedding of token **ball**
6. In position 6: Embedding of token **with**
7. In position 7: Embedding of token **a**
8. In position 8: Embedding of token **baseball**
9. In position 9: Embedding of token **bat**
10. In position 10: Embedding of token **.**
11. In position 11: Embedding of token **[SEP]**
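
To make the sub-word idea concrete, here is a minimal sketch of greedy longest-match-first splitting, the strategy that WordPiece-style tokenizers apply to each word. The vocabulary `toy_vocab` and the function `wordpiece_split` are hypothetical and exist only for illustration. The real DistilBERT vocabulary is far larger, which is why a common word such as "baseball" stays in one piece in the output above, while it is split here.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces, always taking the longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # shrink the candidate from the right until it is found in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not the DistilBERT vocabulary)
toy_vocab = {"bat", "base", "##ball", "play", "##ing", "##s"}

for word in ["baseball", "bats", "playing", "xyz"]:
    print(word, "->", wordpiece_split(word, toy_vocab))
```

Each piece still receives its own embedding, so a word split into two pieces contributes two rows to the embedding tensor.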


As a reminder:

1. [CLS]: Stands for "classification". In BERT, every input sequence starts with this token. After processing the sequence, the embedding of this token is often used as a representation for the entire sequence. Especially after fine-tuning on a classification task, the [CLS] embedding captures aggregate information about the sequence, making it suitable for sequence-level predictions. Without fine-tuning, we usually do not use the embedding of this token, as it has not been trained to summarize the sequence for our task.

2. [SEP]: Stands for "separator". This token is used in BERT to separate two sequences when the model takes in a pair of sequences as input. For instance, in tasks like question answering or sentence-pair classification, the two sequences (e.g., question and context or sentence1 and sentence2) are separated by a [SEP] token to indicate the end of one sequence and the beginning of another.
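
The layout described above can be sketched in a few lines. The helper `build_input` below is hypothetical and only mimics where the special tokens end up; in practice the model's tokenizer adds them for us, as the tokenized examples earlier show.

```python
def build_input(tokens_a, tokens_b=None):
    """Place [CLS] and [SEP] around one sequence or a pair of sequences."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    if tokens_b is not None:
        # the second sequence is closed by its own [SEP]
        tokens += list(tokens_b) + ["[SEP]"]
    return tokens


single = build_input(["i", "saw", "a", "bat", "."])
pair = build_input(["what", "did", "i", "see", "?"], ["i", "saw", "a", "bat", "."])

print(single)
print(pair)
```

For a single sequence there is one [SEP] at the end, which matches the 11 tokens we saw for our examples; for a pair, the first [SEP] marks the boundary between the two sequences.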

Now, let's inspect our embeddings and demonstrate that they are contextual (note that these embeddings are substantially different from, and generally better than, the Word2vec embeddings that we saw earlier, which are generated in a context-free manner). Let's start with the token "bat". Both examples include the same word; however, the word has a different meaning in each example. We expect the generated embeddings for "bat" in the two examples to differ, capturing that in one example we mean a baseball bat, while in the other we mean the mammal.

The token "bat" is the ninth token in both examples, i.e., index 8 when counting from zero as Python does. So let's compare the two embeddings at index 8.


In [4]:
# necessary import to calculate metrics such as Cosine similarity between tensors
import torch.nn.functional as F

# extract the embeddings corresponding to the token "bat" in these two examples
bat_embedding_example1 = example1_embeddings[8]
bat_embedding_example2 = example2_embeddings[8]

# Calculate cosine similarity using PyTorch's function
similarity = F.cosine_similarity(bat_embedding_example1.unsqueeze(0), bat_embedding_example2.unsqueeze(0)).item()

# print the similarity
print(f"Similarity between the two embeddings for the token bat {similarity}")

Similarity between the two embeddings for the token bat 0.8347097635269165


We observe that these embeddings have a cosine similarity of 0.83. This indicates that the model still represents the token "bat" quite similarly in the two documents, yet the embeddings are not identical: they differ based on the context (i.e., the surrounding words). Remember that in Word2vec the word "bat" has the same embedding in both examples, and hence a cosine similarity of 1.0.
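
The Word2vec contrast can be illustrated with a minimal sketch. The lookup table `static_table` and its vectors are made up for illustration (they are not real Word2vec vectors); the point is only that a static lookup ignores the sentence, so both occurrences of "bat" receive exactly the same vector.

```python
import math

# Hypothetical static (Word2vec-style) lookup table: one vector per word
static_table = {
    "bat": [0.2, -0.5, 0.9],
    "baseball": [0.7, 0.1, 0.3],
    "scared": [-0.4, 0.8, 0.1],
}


def cosine_similarity(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# The lookup does not see the surrounding words, so both sentences get the same vector
bat_in_example1 = static_table["bat"]
bat_in_example2 = static_table["bat"]

static_similarity = cosine_similarity(bat_in_example1, bat_in_example2)
print(f"Static similarity for 'bat': {static_similarity:.2f}")
```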

Thus far, we have demonstrated that the same token can have different embeddings in different documents. But can a token have different embeddings within the same document? For instance, in example 2, the token "i" appears twice. We expect the two occurrences to have different embeddings because of the Transformer's attention mechanism. Let's demonstrate this with the token "i", which appears at indices 1 and 5 (counting from zero) in example 2.

In [5]:
# extract the embeddings corresponding to the token "i" in example 2
i1_embedding_example2 = example2_embeddings[1]
i2_embedding_example2 = example2_embeddings[5]

# Calculate cosine similarity using PyTorch's function
similarity = F.cosine_similarity(i1_embedding_example2.unsqueeze(0), i2_embedding_example2.unsqueeze(0)).item()

# print the similarity
print(f"Similarity between the two embeddings for the token i in example 2 {similarity}")

Similarity between the two embeddings for the token i in example 2 0.7227323651313782


We observe that these embeddings have a cosine similarity of 0.72. Despite being part of the same document and being the same exact token, the generated embeddings are different based on how the tokens are used in the document!
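
This observation can be reproduced in miniature. The sketch below is a toy single-head self-attention layer in NumPy with identity projections and random vectors; it is not DistilBERT's architecture, and all names in it are hypothetical. In this toy setting, two occurrences of the same token receive identical outputs when only token vectors are used, and different outputs once a position-dependent vector is added to each input, which suggests that position information is part of what lets attention tell the two "i" tokens apart.

```python
import numpy as np


def self_attention(x):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


rng = np.random.default_rng(0)

# Toy setup (an assumption): 4 word types, 5 positions, vectors of size 8
vocab = {"i": 0, "was": 1, "scared": 2, "because": 3}
token_table = rng.normal(size=(len(vocab), 8))
position_table = rng.normal(size=(5, 8))

# "i" occurs at index 0 and index 4 of this toy sequence
sequence = ["i", "was", "scared", "because", "i"]
token_vectors = token_table[[vocab[t] for t in sequence]]

without_positions = self_attention(token_vectors)
with_positions = self_attention(token_vectors + position_table)

same_without_positions = np.allclose(without_positions[0], without_positions[4])
same_with_positions = np.allclose(with_positions[0], with_positions[4])

print("Identical outputs without position vectors:", same_without_positions)
print("Identical outputs with position vectors:", same_with_positions)
```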

#### 1.2. Extracting Contextual Sentence Embeddings

BERT and its variants are primarily designed to generate token-level embeddings, but there are several strategies to derive sentence-level embeddings from these token-level embeddings. The three most used are:

- **[CLS] Token Embedding:** After feeding a sentence through BERT, the embedding corresponding to the [CLS] token can be used as the sentence representation. This method works well when BERT is fine-tuned on a downstream classification task since the [CLS] token is trained to aggregate sequence-level information for such tasks. However, for the pre-trained BERT model without any fine-tuning, the [CLS] embedding may not be the best representative for the whole sentence.

- **Mean/Average Pooling:** Take the average of all token embeddings in the sequence. This approach gives equal importance to all tokens, which might not always be the best representation, especially if the sentence is long.

- **Sentence-BERT (SBERT):** SBERT is a modification of the pre-trained BERT to derive fixed-size sentence embeddings. It uses a siamese or triplet network structure to derive semantically meaningful sentence embeddings that are then fine-tuned on various NLP tasks. The embeddings from SBERT can be used directly for sentence-level comparisons and are optimized to be more semantically meaningful than naive BERT embeddings. See https://www.sbert.net/ for more details.
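
The first two strategies differ only in how they reduce the matrix of token embeddings to a single vector. A minimal sketch with NumPy, using a made-up matrix `token_embeddings` (4 tokens, hidden size 3) in place of real model output:

```python
import numpy as np

# Hypothetical token embeddings for one sentence: one row per token.
# Row 0 plays the role of [CLS]; the last row plays the role of [SEP].
token_embeddings = np.array([
    [0.1, 0.2, 0.3],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.5],
])

# [CLS] strategy: keep only the first row
cls_embedding = token_embeddings[0]

# Mean pooling: average all rows, giving every token equal weight
mean_embedding = token_embeddings.mean(axis=0)

print("CLS embedding: ", cls_embedding)
print("Mean embedding:", mean_embedding)
```

Both results have the hidden size as their length, regardless of how many tokens the sentence contains, which is what makes them usable as fixed-size sentence representations.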


For the purposes of this lab, at this stage, we are going to use the second and third approaches to generate sentence embeddings. Later in the lab, we will fine-tune DistilBERT and use the [CLS] token to perform classification.

In [6]:
# Define a function to get embeddings
def get_document_embeddings(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)
    outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # Get the mean of the token embeddings as a document representation

example1_sentence_embedding = get_document_embeddings(distilbert_model, distilbert_tokenizer, example1)
example2_sentence_embedding = get_document_embeddings(distilbert_model, distilbert_tokenizer, example2)

# Calculate cosine similarity using PyTorch's function
similarity = F.cosine_similarity(example1_sentence_embedding, example2_sentence_embedding).item()

# print the similarity
print(f"Similarity between the sentence embeddings of our two examples {similarity}")


Similarity between the sentence embeddings of our two examples 0.8488353490829468


We observe that by using the mean of the word embeddings we obtain a high cosine similarity (0.85) between the two examples, even though they are semantically quite different. Let's try to do the same thing, but using Sentence Transformers, which are more suitable for assessing semantic similarity across sentences/documents.

In [7]:
!pip install sentence_transformers
!pip install datasets



In [8]:
from sentence_transformers import SentenceTransformer

# Load the distilbert model trained for sentence embeddings
model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# Create a list of sentences
sentences = [example1, example2]

# Get embeddings for the sentences
embeddings = model.encode(sentences)

# Calculate cosine similarity using PyTorch's function
similarity = F.cosine_similarity(torch.from_numpy(embeddings[0]).unsqueeze(0), torch.from_numpy(embeddings[1]).unsqueeze(0)).item()

# print the similarity
print(f"Similarity between the sentence embeddings of our two examples {similarity}")


Similarity between the sentence embeddings of our two examples 0.6303306818008423


We observe that by using the Sentence Transformers model, we obtain a lower cosine similarity (0.63) between the two examples than before, which better reflects how different they are. Overall, for semantic similarity tasks at the sentence/document level, it is better to use a Sentence Transformers model like the one we used here.
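
When more than two sentences are compared, it is convenient to compute all pairwise cosine similarities at once. Below is a minimal sketch in NumPy, assuming the embeddings come as a 2D array with one row per sentence (the form in which `model.encode` returned them above). The array `toy_embeddings` and the helper `cosine_similarity_matrix` are made up for illustration.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Return a matrix whose entry (i, j) is the cosine similarity of rows i and j."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms  # scale every row to unit length
    return unit @ unit.T       # dot products of unit vectors are cosine similarities


# Hypothetical 2-dimensional "sentence embeddings" for three sentences
toy_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

similarities = cosine_similarity_matrix(toy_embeddings)
print(np.round(similarities, 3))
```

The diagonal is always 1.0 (every sentence compared with itself), and the matrix is symmetric, so only the upper triangle needs to be inspected.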

### 2. Performing NLP Classification tasks using BERT models as a feature extractor

Here, we will demonstrate how we can use BERT to solve classification tasks where BERT models are used solely for feature extraction. Specifically, we will implement an NLP pipeline that takes a movie review (in text) as input and uses a BERT model (specifically DistilBERT) to convert the raw text into a dense vector representation that encapsulates the semantics of the review. Then, we will use a traditional ML classifier (e.g., Logistic Regression) to perform classification on the movie reviews.

#### 2.1 Extracting representations from BERT models

In [9]:
import torch
from transformers import DistilBertTokenizer, DistilBertModel
from datasets import load_dataset
import pandas as pd
import numpy as np

# use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


# Load Distilbert and move to GPU
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased').to(device)

# function to extract the sentence embeddings
def sentences_to_embeddings(sentences, tokenizer, model, device):
    # Tokenize a batch of sentences and prepare the tensors
    inputs = tokenizer(sentences, return_tensors='pt', padding=True, truncation=True, max_length=512).to(device)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    embeddings = outputs.last_hidden_state
    mean_embeddings = torch.mean(embeddings, dim=1)
    return mean_embeddings.cpu()  # Move embeddings back to CPU if necessary


# load the IMDB reviews dataset from a local CSV file (we use its 'text' and 'label' columns)
df = pd.read_csv('movie.csv')
reviews = df['text'].tolist()


# we extract the embeddings in batches to avoid memory issues
chunk_size = 100  # Adjust based on GPU memory
embeddings = []
for i in range(0, len(df), chunk_size):
    batch = df['text'][i:i + chunk_size].tolist()
    batch_embeddings = sentences_to_embeddings(batch, tokenizer, model, device)
    embeddings.extend(batch_embeddings)
    

# Convert embeddings to numpy for easy handling
embeddings = [embedding.detach().numpy() for embedding in embeddings]

# convert all embeddings to a 2D array that is our input
X = np.vstack(embeddings)

# our class labels which is the output of the classification task (ground truth)
Y = df['label'].tolist()

Using device: cuda
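
One detail of the cell above is worth noting: with `padding=True`, shorter reviews in a batch are padded to the length of the longest one, and the plain mean over `dim=1` also averages the embeddings at the padding positions. A common variant weights the average by the attention mask so that only real tokens count. The sketch below shows the arithmetic with NumPy on made-up numbers; `masked_mean` is a hypothetical helper, and the results reported in this lab were obtained with the plain mean, not with this variant.

```python
import numpy as np


def masked_mean(hidden_states, attention_mask):
    """Average token embeddings over real tokens only.

    hidden_states:  array of shape (batch, seq_len, hidden)
    attention_mask: array of shape (batch, seq_len), 1 = real token, 0 = padding
    """
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against division by zero
    return summed / counts


# Toy batch: 2 sequences, 3 positions, hidden size 2
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],  # the third position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

plain = hidden.mean(axis=1)
masked = masked_mean(hidden, mask)

print("Plain mean: ", plain)
print("Masked mean:", masked)
```

For the sequence without padding both averages agree; for the padded sequence the plain mean is pulled towards the padding vector. The same arithmetic carries over to PyTorch tensors, using the `attention_mask` that the tokenizer returns.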


#### 2.2 Sentiment Classification using BERT representations and Logistic Regression

Having extracted dense representations for each review in our dataset, now we are going to proceed and train a classifier using Logistic Regression. This part is identical to what we saw in the Vector Semantics Lab; the only difference is that here we are using BERT representations instead of TF-IDF representations.

In [10]:
# to split our dataset into training and test sets
from sklearn.model_selection import train_test_split

#import the logistic regression from sklearn
from sklearn.linear_model import LogisticRegression

# import evaluation metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Train a logistic regression classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict on the test set
y_pred = clf.predict(X_test)

# Evaluate the classifier
print("Accuracy: %.3f" %(accuracy_score(y_test, y_pred)))
print("Precision: %.3f" %(precision_score(y_test, y_pred)))
print("Recall: %.3f" %(recall_score(y_test, y_pred)))
print("F1-Score: %.3f" %(f1_score(y_test, y_pred)))


Accuracy: 0.880
Precision: 0.888
Recall: 0.871
F1-Score: 0.880


We observe that using BERT representations, our classifier can distinguish positive from negative reviews with an F1 score of 0.88!
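
As a reminder of what these metrics measure, the sketch below computes them by hand from a small set of made-up labels and predictions (1 = positive review, 0 = negative review). The helper `binary_metrics` is hypothetical; in the lab itself we rely on the scikit-learn functions imported above.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up ground truth and predictions for eight reviews
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]

toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```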

#### 2.3. **Exercise:** Train a Logistic Regression classifier to perform the sentiment classification task, but this time, instead of using the mean of the word embeddings as the document embedding, use the Sentence Transformers DistilBERT model to extract the features for each document.

In [11]:
# Write your code here

### 3. Performing NLP Classification tasks by fine-tuning BERT models and training a classification head

What we have done thus far is to use the pre-trained BERT model purely to extract BERT representations, so we did not update any of the model's weights based on our annotated dataset of IMDB movie reviews. In this section, we fine-tune DistilBERT together with a classification head on that dataset.

In [14]:
# Necessary Imports
import torch
# BertTokenizer and BertForSequenceClassification are only needed for the commented-out BERT alternative below
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')
    acc = accuracy_score(labels, predictions)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# To run on CPU instead of GPU: uncomment the line below and set no_cuda=True in the training arguments
#device = torch.device('cpu')


# Split data into training and test set
train_df = df.sample(frac=0.8, random_state=42)  # 80% for training
test_df = df.drop(train_df.index)

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

#tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=256)

tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize_function, batched=True)

# Load the DistilBERT model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2).to(device) # 2 labels: pos and neg

#model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2).to(device) # 2 labels: pos and neg

# Define training arguments and set up Trainer
training_args = TrainingArguments(
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    evaluation_strategy="epoch",  # note: newer transformers releases name this argument eval_strategy
    logging_dir="./logs",
    logging_steps=500,
    do_train=True,
    do_eval=True,
    #no_cuda=True,
    load_best_model_at_end=True,
    save_strategy="epoch",
    report_to="tensorboard", # for logging to TensorBoard
    output_dir="./results",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()

print(results)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|---|---|
| 1 | 0.223 | 0.220685 | 0.916375 | 0.91649 | 0.905972 | 0.927254 |




{'eval_loss': 0.22068540751934052, 'eval_accuracy': 0.916375, 'eval_f1': 0.9164898264885782, 'eval_precision': 0.9059723593287266, 'eval_recall': 0.9272543571608992, 'eval_runtime': 43.4892, 'eval_samples_per_second': 183.954, 'eval_steps_per_second': 11.497, 'epoch': 1.0}


We observe that by fine-tuning our BERT model and performing the classification task using the classification head on top of it, we improve the performance substantially. The BERT-based classifier achieves an F1 score of 0.92! This is a substantially better classifier than the one we implemented above using Logistic Regression on the pre-trained BERT embeddings (F1 score of 0.88). Also, keep in mind that here we only fine-tuned for 1 epoch to speed up the process. Usually we fine-tune for more epochs, evaluate the model after each epoch, and keep the best-performing one. So a way to improve performance is to fine-tune the model for more epochs and pick the top-performing model as the classifier.
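
The selection step described above amounts to keeping the epoch with the best evaluation score. A minimal sketch follows, in which the list `epoch_logs` contains made-up numbers for illustration only (they are not results from this lab). In the training cell above, `load_best_model_at_end=True` asks the `Trainer` to perform this kind of selection, by default based on the evaluation loss.

```python
# Hypothetical per-epoch evaluation metrics (made-up numbers, not results from this lab)
epoch_logs = [
    {"epoch": 1, "eval_f1": 0.90},
    {"epoch": 2, "eval_f1": 0.93},
    {"epoch": 3, "eval_f1": 0.92},
]

# Keep the epoch whose evaluation F1 is highest
best = max(epoch_logs, key=lambda log: log["eval_f1"])
print(f"Keep the checkpoint from epoch {best['epoch']} (F1 = {best['eval_f1']})")
```

In this made-up log the score drops again after the second epoch, which is why the last checkpoint is not automatically the one to keep.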
