# Learning BERT

## Notes

- [Original BERT Paper](https://arxiv.org/pdf/1810.04805)
    - BERT is trained in two steps
        - Pretraining -> unsupervised learning of relationships between words from a huge corpus of unlabeled text
        - Fine-tuning -> supervised learning, conducted by attaching a small task-specific network on top of BERT (after its final layer) and training it using examples + labels
            - **Ex.** Feeding BERT movie review text (input) and making it predict a rating (output/label)
                - We use the review's actual rating to measure how well BERT performed, and then use backpropagation + gradient descent to fine-tune BERT
    - BERT is pretrained using MLM (masked language modelling) and NSP (next sentence prediction)
        - Masked language modelling (MLM) -> Hide words from a text, and get BERT to predict each hidden word from the surrounding context
        - Next sentence prediction (NSP) -> Present two sentences, and get BERT to predict whether the second sentence follows the first
    - **NOTE:** MLM alone seems to provide decently high performance
        - In the paper's ablation study, removing NSP left accuracy on SST-2 (a sentiment analysis task, which is relevant to us) nearly unchanged
- [Basic explanation of BERT](https://jalammar.github.io/illustrated-bert/)
- [Hugging Face Transformers BertModel Documentation](https://huggingface.co/docs/transformers/en/model_doc/bert?usage=Pipeline#transformers.BertModel)
- [Youtube Series on BERT](https://www.youtube.com/watch?v=q9NS5WpfkrU)
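
To make the two pretraining objectives from the notes concrete, here is a toy, pure-Python sketch of how one training example for each could be built. The helper names (`make_mlm_example`, `make_nsp_example`), the whitespace "tokenizer", and the string labels are illustrative assumptions, not the paper's or Hugging Face's actual code; real BERT works on WordPiece tokens.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, rng=None):
    """Hide some tokens. Returns (masked_tokens, targets), where targets
    maps each hidden position to the original token the model must predict."""
    if rng is None:
        rng = random.Random()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

def make_nsp_example(sentences, index, rng=None):
    """Pair sentence `index` with its true successor half of the time,
    otherwise with a randomly chosen other sentence."""
    if rng is None:
        rng = random.Random()
    first = sentences[index]
    if rng.random() < 0.5:
        return first, sentences[index + 1], "IsNext"
    candidates = [s for j, s in enumerate(sentences) if j not in (index, index + 1)]
    return first, rng.choice(candidates), "NotNext"

rng = random.Random(0)
sentences = [
    "the show starts slowly",
    "then the plot picks up",
    "the soundtrack is great",
    "i would watch it again",
]
masked, targets = make_mlm_example(sentences[0].split(), mask_prob=0.5, rng=rng)
first, second, label = make_nsp_example(sentences, 0, rng=rng)
print(masked, targets)
print(first, "|", second, "|", label)
```

The `mask_prob=0.5` in the usage is only there so that a four-word sentence is likely to show at least one mask; the paper's rate is 15%.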

## MLM Training

In [8]:
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
print("INIT module_path: ", module_path)
if module_path not in sys.path:
    sys.path.append(module_path)

DATA_DIR = module_path + "/data"

INIT module_path:  c:\Users\Alan\Desktop\Open_Source\BERT-TLSA-paper


In [15]:
import pandas as pd

DATA_FILE = "myanimelist_reviews.csv"
data_df = pd.read_csv(f"{DATA_DIR}/{DATA_FILE}")
data_df

Unnamed: 0,site,user,review_target,review,score,max_score
0,MyAnimeList,Sorrowful,Shingeki no Kyojin,"Oh dear Shingeki no Kyojin, where do I even be...",10,10
1,MyAnimeList,Gladius650,Shingeki no Kyojin,I started to follow the manga after watching t...,10,10
2,MyAnimeList,SonDavid,Shingeki no Kyojin,"In the 80's, Mobile Suit Gundam catapulted ani...",10,10
3,MyAnimeList,Kerma_,Shingeki no Kyojin,Shingeki no Kyojin... Where do I start? In sum...,5,10
4,MyAnimeList,emberreviews,Shingeki no Kyojin,"Every once in a while, and even more frequentl...",9,10
...,...,...,...,...,...,...
13408,MyAnimeList,DaAn1meGuy,Odd Taxi,"If I had to summarize this show in two words, ...",6,10
13409,MyAnimeList,Edenharley,Odd Taxi,Odd Taxi was ab absolute ride from start to fi...,10,10
13410,MyAnimeList,SanaeK10,Odd Taxi,"If Odd Taxi was just the Gacha Episode, it wou...",9,10
13411,MyAnimeList,boykunron,Odd Taxi,I was unbelievable shocked by how good this sh...,10,10


In [None]:
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Create tokenizer + load the pretrained model
tokenizer: BertTokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model: BertForMaskedLM = BertForMaskedLM.from_pretrained("bert-base-uncased")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [103]:
# Extract just the review texts from the data, and take only the first 100
review_texts = data_df["review"].to_list()[:100]
print(f"review_texts len: {len(review_texts)}")
review_texts

review_texts len: 100


['Oh dear Shingeki no Kyojin, where do I even begin. If you\'ve talked with your friends about anime, then the couple anime that everyone talks about are Naruto, Bleach, One Piece, Dragon Ball, and... Shingeki no Kyojin. What\'s the difference between Shingeki and the rest? Shingeki only has 25 episodes so far yet it\'s on par in popularity with the other super long, Americanized anime. Why is it popular? Well that\'s simply because it\'s stunningly amazing. Those people that call Shingeki no Kyojin "overrated" may not have the same taste as me, and that\'s perfectly fine, but in my honest opinion, Shingeki no Kyojin isone of if not the greatest anime to be made. It\'s not popular for no reason.\n\nThe story is one of the most captivating stories I\'ve ever seen. 100 years prior to the start of the anime, humanity has been on the bridge of extinction due to the monstrous humanoid Titans that devour humans. Now, present day in the anime, the remaining small population of mankind lives c

In [None]:
from typing import TypedDict

# With return_tensors="pt" the tokenizer returns int64 (Long) tensors
class TokenizedInputs(TypedDict):
    input_ids: torch.LongTensor
    token_type_ids: torch.LongTensor
    attention_mask: torch.LongTensor
    labels: torch.LongTensor

# Tokenize the texts
#   First argument is the list of strings (sentences) to tokenize
#
#   return_tensors = "pt" -> Makes the tokenizer return PyTorch tensors
#       Other options for this parameter include "tf" for TensorFlow tensors.
#   max_length = 512 -> Max length of any input is 512 tokens
#   truncation = True -> cut inputs longer than max_length down to max_length
#   padding = "max_length" -> add padding to any input < max_length,
#       in order to reach max_length
encodings: TokenizedInputs = tokenizer(review_texts, return_tensors="pt", max_length=512, truncation=True, padding="max_length")

# The tokenizer returns a dictionary after tokenizing some text:
#   input_ids -> token IDs of each (WordPiece) token in a text
#       Ex. [101, 2821, 6203, ..., 2145, 2282, 102]
#           101 = CLS token -> used at start of input
#           102 = SEP token -> used to separate sentences
#           0 = PAD token -> filler added to reach max_length
#   token_type_ids -> Identifies which "segment" a token is in, within an input.
#       Ex. [0, 0, 0, 0, 1, 1, 1, 1]
#           means the given input has two segments, 0 and 1
#           The tokens in the first half are a part of segment 0
#           The tokens in the second half are a part of segment 1
#       Here every review is passed as a single segment, so all values are 0
#   attention_mask -> indicates which tokens should be paid attention to (1) or ignored (0)
#       We usually want to mask out any padding we introduced.
encodings

{'input_ids': tensor([[  101,  2821,  6203,  ...,  2145,  2282,   102],
        [  101,  1045,  2318,  ...,  1996,  2886,   102],
        [  101,  1999,  1996,  ...,  2003,  2073,   102],
        ...,
        [  101,  1000, 24462,  ...,  2203,  1012,   102],
        [  101,  1045,  1005,  ...,     0,     0,     0],
        [  101,  2821,  2158,  ...,  1996,  6919,   102]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]])}
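
As a small illustration of the fields described above, the sketch below rebuilds `attention_mask` and `token_type_ids` by hand from a made-up `input_ids` batch. The batch itself is hypothetical (two short rows padded to length 8); the special token IDs are the `bert-base-uncased` ones used throughout this notebook. In practice the tokenizer produces these fields for you, so this is only meant to show what they encode.

```python
import torch

PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # bert-base-uncased special token IDs

# Hypothetical hand-written batch, padded to length 8
input_ids = torch.tensor([
    [101, 2821, 6203, 102, 1045, 2318, 102, 0],  # [CLS] A A [SEP] B B [SEP] [PAD]
    [101, 1045, 2318, 102,    0,    0,   0, 0],  # [CLS] A A [SEP] + padding
])

# attention_mask: 1 for real tokens, 0 for padding
attention_mask = (input_ids != PAD_ID).long()

# token_type_ids: 0 up to and including the first [SEP], 1 for real tokens after it
is_sep = (input_ids == SEP_ID).long()
seps_before = is_sep.cumsum(dim=1) - is_sep  # number of [SEP] strictly before each position
token_type_ids = (seps_before > 0).long() * attention_mask

print(attention_mask)
print(token_type_ids)
```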

In [105]:
# Create a new field labels that is a clone of input_ids
encodings["labels"] = encodings["input_ids"].detach().clone()
encodings

{'input_ids': tensor([[  101,  2821,  6203,  ...,  2145,  2282,   102],
        [  101,  1045,  2318,  ...,  1996,  2886,   102],
        [  101,  1999,  1996,  ...,  2003,  2073,   102],
        ...,
        [  101,  1000, 24462,  ...,  2203,  1012,   102],
        [  101,  1045,  1005,  ...,     0,     0,     0],
        [  101,  2821,  2158,  ...,  1996,  6919,   102]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]]), 'labels': tensor([[  101,  2821,  6203,  ...,  2145,  2282,   102],
        [  101,  1045,  2318,  ...,  1996,  2886,   102],
        [  101,  1999, 
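
One caveat about cloning `input_ids` wholesale into `labels`: the loss is then computed at every position, including tokens that were never hidden and padding. The Hugging Face BERT heads ignore any position whose label is `-100` (the default `ignore_index` of PyTorch's cross-entropy), so a common variant keeps the real token ID only where a token was masked. The sketch below shows the effect on toy tensors; the vocabulary size, IDs, and mask are made up for illustration.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # default ignore_index of torch cross-entropy

torch.manual_seed(0)
vocab_size = 10  # toy vocabulary, not BERT's 30522
original_ids = torch.tensor([[1, 4, 5, 6, 2, 0, 0]])
# Hypothetical mask: positions 1 and 3 were chosen for masking
mask_arr = torch.tensor([[False, True, False, True, False, False, False]])

# Keep the true ID only at masked positions, ignore everything else
labels = original_ids.masked_fill(~mask_arr, IGNORE_INDEX)

logits = torch.randn(1, 7, vocab_size)  # stand-in for model output
flat_logits = logits.view(-1, vocab_size)

loss_all_positions = F.cross_entropy(flat_logits, original_ids.view(-1))
loss_masked_only = F.cross_entropy(flat_logits, labels.view(-1), ignore_index=IGNORE_INDEX)
print(labels)
print(loss_all_positions.item(), loss_masked_only.item())
```

With full-clone labels the model is also rewarded for copying visible tokens, which may make the reported loss look lower than the actual difficulty of filling in the masks.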

In [106]:
# In the BERT paper, each token has a 15% chance of being selected for masking

# First, create random vector that spans all of the input_ids (spans all the tokens)
rand = torch.rand(encodings["input_ids"].shape)
print(f"rand shape: {rand.shape}")
rand

rand shape: torch.Size([100, 512])


tensor([[0.9008, 0.0031, 0.7098,  ..., 0.1034, 0.2811, 0.7747],
        [0.9126, 0.2598, 0.4632,  ..., 0.4568, 0.1203, 0.9190],
        [0.1924, 0.5935, 0.8606,  ..., 0.2105, 0.6723, 0.9046],
        ...,
        [0.8821, 0.0805, 0.8507,  ..., 0.3863, 0.3781, 0.6440],
        [0.1248, 0.0683, 0.8789,  ..., 0.6844, 0.3718, 0.0996],
        [0.1908, 0.7232, 0.0146,  ..., 0.8112, 0.0295, 0.5860]])

In [107]:
# (rand < 0.15) -> Any token that has a corresponding random value of < 0.15, we mask
# In the mask_arr, we want
#   true = mask this element
#   false = don't mask this element
#
# (encodings["input_ids"] != 101) * ... -> We do not want to mask CLS (101), SEP (102), and padding (0) tokens
# The * operator is elementwise multiplication
# For boolean tensors, this is effectively an AND operation (see below for explanation)
#   In PyTorch (as in numpy), true = 1, and false = 0
#   Therefore elementwise multiplication of
#       true * true = 1 * 1 = 1 (true)
#       true * false = 1 * 0 = 0 (false)
#       false * true = 0 * 1 = 0 (false)
#       false * false = 0 * 0 = 0 (false)
mask_arr = (rand < 0.15) * (encodings["input_ids"] != 101) * (encodings["input_ids"] != 0) * (encodings["input_ids"] != 102)
mask_arr

tensor([[False,  True, False,  ...,  True, False, False],
        [False, False, False,  ..., False,  True, False],
        [False, False, False,  ..., False, False, False],
        ...,
        [False,  True, False,  ..., False, False, False],
        [False,  True, False,  ..., False, False, False],
        [False, False,  True,  ..., False,  True, False]])
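
The multiplication trick above can also be written with explicit boolean operators. The sketch below, on a small made-up row of token IDs, checks that `*` on boolean tensors gives the same result as `&`, and uses `torch.isin` to exclude all special tokens in one test. `SPECIAL_IDS` is a name introduced here for illustration.

```python
import torch

torch.manual_seed(0)
SPECIAL_IDS = torch.tensor([0, 101, 102])  # [PAD], [CLS], [SEP]

# Hypothetical single row: [CLS] w w w [SEP] [PAD] [PAD]
input_ids = torch.tensor([[101, 2821, 6203, 2145, 102, 0, 0]])
rand = torch.rand(input_ids.shape)

# Style used in the notebook: elementwise multiplication of boolean tensors
mask_mul = (rand < 0.15) * (input_ids != 101) * (input_ids != 0) * (input_ids != 102)

# Equivalent formulation: logical AND with "not a special token"
mask_and = (rand < 0.15) & ~torch.isin(input_ids, SPECIAL_IDS)

print(torch.equal(mask_mul, mask_and))
```

The `&` form states the intent more directly, and adding another special token only requires extending `SPECIAL_IDS`.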

In [108]:
selection = []

# Iterate over each row in the mask_arr (basically each sentence in our text data)
for i in range(mask_arr.shape[0]):
    # .nonzero() -> finds the indices where we have "true" values (since true = 1 and false = 0 in pytorch)
    selection.append(mask_arr[i].nonzero().flatten().tolist())

print("selection:")
print("\n".join([str(x) for x in selection[:10]]))

selection:
[1, 16, 20, 28, 38, 40, 44, 53, 62, 66, 78, 79, 95, 104, 118, 123, 126, 137, 138, 142, 143, 152, 154, 157, 158, 161, 164, 186, 220, 230, 234, 239, 240, 250, 267, 275, 286, 303, 306, 312, 332, 350, 351, 354, 357, 358, 359, 365, 372, 377, 378, 382, 386, 388, 396, 397, 401, 402, 405, 418, 445, 448, 449, 451, 457, 460, 463, 466, 474, 479, 485, 486, 502, 504, 509]
[3, 5, 17, 39, 44, 53, 68, 71, 94, 97, 100, 105, 113, 129, 136, 138, 146, 150, 156, 161, 164, 170, 179, 184, 189, 195, 199, 201, 204, 205, 219, 220, 222, 247, 248, 267, 272, 274, 286, 295, 297, 298, 299, 301, 304, 311, 315, 322, 323, 325, 326, 330, 335, 345, 347, 352, 363, 371, 373, 376, 382, 386, 392, 396, 416, 419, 428, 429, 430, 432, 448, 457, 464, 471, 486, 510]
[13, 15, 23, 25, 38, 56, 62, 71, 96, 104, 109, 113, 114, 117, 134, 138, 140, 154, 156, 157, 160, 161, 163, 164, 166, 176, 177, 179, 187, 205, 211, 226, 229, 231, 232, 237, 260, 270, 286, 296, 299, 309, 311, 313, 320, 336, 339, 347, 349, 353, 355, 363, 369, 3

In [109]:
# Apply our mask_arr in each row (each sentence)
for i in range(mask_arr.shape[0]):
    # Special Tensor syntax -> we can pass in a list of indices for any of the axes
    #   In this case, we pass in a list of indices in the column axis, to effectively
    #   select the columns (tokens) we want to mask out
    # 103 = MASK token ([MASK]) in the bert-base-uncased vocabulary
    encodings["input_ids"][i, selection[i]] = 103
encodings["input_ids"]

tensor([[  101,   103,  6203,  ...,   103,  2282,   102],
        [  101,  1045,  2318,  ...,  1996,   103,   102],
        [  101,  1999,  1996,  ...,  2003,  2073,   102],
        ...,
        [  101,   103, 24462,  ...,  2203,  1012,   102],
        [  101,   103,  1005,  ...,     0,     0,     0],
        [  101,  2821,   103,  ...,  1996,   103,   102]])
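
The loop above replaces every selected token with `[MASK]`. The BERT paper goes one step further: of the selected tokens, roughly 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. Below is a minimal vectorized sketch of that rule. The function name `bert_style_corrupt` and its parameters are assumptions for illustration, and for simplicity the random replacement may draw any ID in the vocabulary, including special tokens.

```python
import torch

def bert_style_corrupt(input_ids, mask_arr, vocab_size, mask_id=103, generator=None):
    """Apply the 80/10/10 rule to the positions selected in mask_arr."""
    corrupted = input_ids.clone()
    roll = torch.rand(input_ids.shape, generator=generator)
    use_mask = mask_arr & (roll < 0.8)                    # ~80%: replace with [MASK]
    use_random = mask_arr & (roll >= 0.8) & (roll < 0.9)  # ~10%: replace with a random token
    # remaining ~10% of selected positions keep their original token
    random_ids = torch.randint(0, vocab_size, input_ids.shape, generator=generator)
    corrupted[use_mask] = mask_id
    corrupted[use_random] = random_ids[use_random]
    return corrupted

g = torch.Generator().manual_seed(0)
input_ids = torch.tensor([[101, 2821, 6203, 2145, 2282, 102, 0, 0]])
# Hypothetical selection: positions 1, 3 and 4 were chosen for masking
mask_arr = torch.tensor([[False, True, False, True, True, False, False, False]])
corrupted = bert_style_corrupt(input_ids, mask_arr, vocab_size=30522, generator=g)
print(corrupted)
```

Boolean-mask indexing (`corrupted[use_mask] = mask_id`) does in one step what the `selection` loop does row by row. Hugging Face's `DataCollatorForLanguageModeling` offers a ready-made version of this kind of masking.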

In [None]:
from typing import Optional

class MaskedTextDatasetItem(TypedDict):
    input_ids: torch.LongTensor
    token_type_ids: torch.LongTensor
    attention_mask: torch.LongTensor
    labels: torch.LongTensor
    # Only present when the dataset was built with original_text
    original_text: str

class MaskedTextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings: TokenizedInputs, original_text: Optional[list[str]] = None):
        self.encodings = encodings
        self.original_text = original_text
    def __getitem__(self, index: int) -> MaskedTextDatasetItem:
        # Return the dictionary just like encodings, except it only
        # contains the entries for a specific row (sentence)
        res = {key: val[index] for key, val in self.encodings.items() }
        if self.original_text:
            res["original_text"] = self.original_text[index]
        return res
    def __len__(self):
        return len(self.encodings["input_ids"])

dataset = MaskedTextDataset(encodings)
# batch_size = 16 -> separates data into batches, and each batch contains 16 sentences
# shuffle = True -> load random sentences into each batch
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=True)
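
To see what the `DataLoader` hands back, the sketch below uses a toy stand-in dataset (`ToyDictDataset`, a hypothetical name) with the same dict-per-row shape as `MaskedTextDataset`. The default collate function stacks each field across the batch, so 100 rows with `batch_size=16` give 7 batches, the last one holding only 4 rows.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDictDataset(Dataset):
    """Hypothetical stand-in for MaskedTextDataset: 100 rows of length 8."""
    def __init__(self, n_rows=100, seq_len=8):
        self.data = {
            "input_ids": torch.arange(n_rows * seq_len).view(n_rows, seq_len),
            "attention_mask": torch.ones(n_rows, seq_len, dtype=torch.long),
        }

    def __getitem__(self, index):
        # One row per field, same pattern as MaskedTextDataset.__getitem__
        return {key: val[index] for key, val in self.data.items()}

    def __len__(self):
        return len(self.data["input_ids"])

loader = DataLoader(ToyDictDataset(), batch_size=16, shuffle=True)
batch_shapes = [tuple(batch["input_ids"].shape) for batch in loader]
print(len(loader), batch_shapes)
```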

In [135]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print(f"device: {device}")

# Move the model to the device we specified
#   Ideally use CUDA (GPU) if available
model.to(device)

device: cuda


BertForMaskedLM(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwi

In [112]:
# Create the optimizer
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

In [None]:
from tqdm import tqdm
from typing import Iterator, cast

# Set model to training mode
model.train()

# Training loop
epochs = 20
for epoch in range(epochs):
    # tqdm() -> Creates loading bar from iterator
    # leave = True -> Progress bar remains on screen after completion
    loop = tqdm(cast(Iterator[TokenizedInputs], dataloader), leave=True)
    log_data = []
    for batch in loop:
        # Reset optimizer gradient
        optim.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss

        # Apply backward propagation based on loss
        loss.backward()
        optim.step()

        # Set description of the tqdm progress bar
        loop.set_description(f"Epoch {epoch}")
        loop.set_postfix(loss=loss.item())


Epoch 0: 100%|██████████| 7/7 [00:02<00:00,  3.05it/s, loss=0.866]
Epoch 1: 100%|██████████| 7/7 [00:01<00:00,  3.66it/s, loss=2.42]
Epoch 2: 100%|██████████| 7/7 [00:01<00:00,  3.66it/s, loss=2.13]
Epoch 3: 100%|██████████| 7/7 [00:01<00:00,  3.66it/s, loss=1.74]
Epoch 4: 100%|██████████| 7/7 [00:01<00:00,  3.65it/s, loss=1.47]
Epoch 5: 100%|██████████| 7/7 [00:01<00:00,  3.65it/s, loss=1.61] 
Epoch 6: 100%|██████████| 7/7 [00:01<00:00,  3.66it/s, loss=0.521]
Epoch 7: 100%|██████████| 7/7 [00:01<00:00,  3.65it/s, loss=0.535]
Epoch 8: 100%|██████████| 7/7 [00:01<00:00,  3.63it/s, loss=0.739]
Epoch 9: 100%|██████████| 7/7 [00:01<00:00,  3.62it/s, loss=0.801]
Epoch 10: 100%|██████████| 7/7 [00:01<00:00,  3.65it/s, loss=0.463]
Epoch 11: 100%|██████████| 7/7 [00:01<00:00,  3.64it/s, loss=0.56] 
Epoch 12: 100%|██████████| 7/7 [00:01<00:00,  3.63it/s, loss=0.522]
Epoch 13: 100%|██████████| 7/7 [00:01<00:00,  3.6
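
The progress bar above shows the loss of whichever batch happened to be processed last, which is noisy with only 7 batches per epoch. One way to get a steadier signal is to track the mean loss per epoch. The sketch below does this on a toy regression problem; the `nn.Linear` model, the data, and the hyperparameters are stand-ins chosen so the example runs without BERT, and only the bookkeeping pattern is the point.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
features = torch.randn(100, 4)
targets = features.sum(dim=1, keepdim=True)
loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)

model = nn.Linear(4, 1)  # toy stand-in for BERT
optim = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

epoch_losses = []
model.train()
for epoch in range(5):
    total, count = 0.0, 0
    for x, y in loader:
        optim.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optim.step()
        # Weight by batch size so the short final batch does not skew the mean
        total += loss.item() * x.size(0)
        count += x.size(0)
    epoch_losses.append(total / count)
    print(f"Epoch {epoch}: mean loss = {epoch_losses[-1]:.4f}")
```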

In [None]:
# Test the model
import torch.utils.data.dataloader

def get_text_dataset(text: list[str]) -> MaskedTextDataset:
    test_encodings: TokenizedInputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True, padding="max_length")
    return MaskedTextDataset(test_encodings, text)

with torch.no_grad():
    model.eval()
    test_dataset = get_text_dataset([
        "I really really [MASK] this show! It's okay.",
        "I really really [MASK] this show! It sucks!",
        "I really really [MASK] this show! It's awesome and full of mystery!",
    ])
    print("Test model...")
    # Use batch size of 1, because we want to test sentence by sentence
    test_dataloader = torch.utils.data.dataloader.DataLoader(test_dataset, batch_size=1)
    for batch in cast(Iterator[TokenizedInputs], test_dataloader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        original_text = batch["original_text"]

        output = model(input_ids, attention_mask=attention_mask)
        logits: torch.FloatTensor = output.logits
        # dim = -1 -> Use argmax along the last dimension of the tensor
        # in this case, our tensor has shape [1, 512, 30522]
        #   We have 1 sentence in our batch
        #   Each sentence is padded/truncated to exactly 512 tokens
        #   There are 30522 tokens in the vocabulary
        #       The logits are unnormalized scores; applying softmax over the
        #       last dimension turns them into a probability distribution
        #       over the vocabulary for each position.
        #           Softmax preserves ordering, so taking the argmax of the logits
        #           gives the most likely token at each position
        #   NOTE: this decodes all 512 positions, including padding, where the
        #   predictions are meaningless (hence the repeated words in the output)
        predicted_token_ids = logits.argmax(dim = -1).squeeze()
        # print(f"predicted_token_ids: {predicted_token_ids}")
        predicted_text = tokenizer.decode(predicted_token_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        
        print(f"    original_text:  '{original_text}'")
        print(f"        pred_text:  '{predicted_text}'")

Test model...
    original_text:  '["I really really [MASK] this show! It's okay."]'
        pred_text:  ') i really really enjoyed this show! it's okay.! okay everything okay okay everything everything everything everything okay everything like freaking freaking freaking everything okay everything okay everything okay okay! everything okay everything everything everything enjoyed freaking freaking everything it'just okay okay okay everything okay okay everything everything everything okay everything everything everything sucks okay it it'okay okay! everything everything okay everything everything okay okay it it'just okay!! okay okay okay okay everything okay everything okay okay okay okay like like everything'is okay okay it it everything okay okay! ‖ everything everything okay it okay okay everything okay okay okay everything everything everything okay okay everything okay everything okay absolutely absolutely absolutely freaking okay okay okay okay okay freaking absolutely absolute
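
The long tail of repeated words comes from decoding every one of the 512 positions, padding included. To inspect only what the model predicts for the hidden word, one option is to look up the `[MASK]` position and read the logits there. The sketch below shows the idea on fabricated logits so it runs without the model; `top_k_at_mask` is a hypothetical helper, and the toy vocabulary size and token IDs are made up.

```python
import torch

MASK_ID = 103  # [MASK] in bert-base-uncased

def top_k_at_mask(input_ids, logits, k=3):
    """For one sequence, return (token_ids, probabilities) of the k best
    candidates at each [MASK] position.
    input_ids: [seq_len], logits: [seq_len, vocab_size]"""
    mask_positions = (input_ids == MASK_ID).nonzero(as_tuple=True)[0]
    probs = logits[mask_positions].softmax(dim=-1)
    top = probs.topk(k, dim=-1)
    return top.indices, top.values

# Fabricated inputs: toy vocabulary of 200 IDs, [MASK] at position 3
vocab_size = 200
input_ids = torch.tensor([101, 7, 7, 103, 9, 102, 0, 0])
logits = torch.zeros(8, vocab_size)
logits[3, 42] = 5.0  # pretend token 42 is the model's favourite
logits[3, 17] = 3.0  # and token 17 the runner-up

token_ids, probs = top_k_at_mask(input_ids, logits, k=2)
print(token_ids, probs)
```

In the notebook this would presumably be called with `batch["input_ids"][0]` and `output.logits[0]`, and the resulting IDs turned back into strings with `tokenizer.convert_ids_to_tokens`.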