## BERT Basics

BERT's architecture allows fine-tuning on specific tasks such as:
* Text summarization (typically extractive, since BERT has no decoder)
* Question Answering
* Sentiment Analysis

BERT uses an encoder-only architecture, which processes the entire sequence of text simultaneously.

Masked language modeling (`MLM`) involves randomly masking some of the input tokens and training BERT to predict the original masked tokens.


For prediction:

* The encoder outputs a set of contextual embeddings, one per input token.
* The contextual embeddings are passed through another layer and converted into a set of logits over the vocabulary.
* The masked word is predicted by selecting the vocabulary entry at the index with the highest logit value.
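The prediction steps above can be sketched in plain Python. This is only a toy illustration: the vocabulary, the input tokens, and the logit values are all made up, and in a real model the logits would come from a linear layer applied to the encoder's contextual embeddings.

```python
import math

# Toy vocabulary and input; both are hypothetical and exist only for this sketch.
vocab = ["[PAD]", "[MASK]", "the", "cat", "sat", "mat"]
tokens = ["the", "[MASK]", "sat"]

# One row of logits per input position, one column per vocabulary entry.
# The numbers are invented; a real model computes them from the embeddings.
logits = [
    [0.1, 0.0, 4.2, 0.3, 0.2, 0.1],
    [0.0, 0.1, 0.5, 3.7, 0.4, 1.9],
    [0.2, 0.0, 0.3, 0.6, 5.1, 0.4],
]


def softmax(row):
    """Convert a row of logits into probabilities (numerically stable)."""
    highest = max(row)
    exps = [math.exp(x - highest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def predict_masked(tokens, logits, vocab):
    """For every [MASK] position, return (predicted word, its probability)."""
    predictions = {}
    for pos, tok in enumerate(tokens):
        if tok == "[MASK]":
            row = logits[pos]
            # Index of the highest logit selects the vocabulary entry.
            best = max(range(len(row)), key=row.__getitem__)
            predictions[pos] = (vocab[best], softmax(row)[best])
    return predictions


predictions = predict_masked(tokens, logits, vocab)
print(predictions)
```

Only the masked position is decoded here; the logits at the other positions are computed too, but they are not needed to fill in the blank.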

Encoder models have access to the entire sequence.

The training method is `bidirectional`:

* It enables the model to use the context on both sides of any given word in a sentence.
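One way to picture the difference between bidirectional and left-to-right context is to write out which positions each token may attend to. The sketch below is a plain-Python illustration, not BERT's actual attention code; the helper name `visibility_mask` and the example sentence are hypothetical.

```python
def visibility_mask(seq_len, bidirectional):
    """Return mask[i][j] == True when position i may attend to position j."""
    return [
        [bidirectional or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


tokens = ["the", "cat", "sat", "down"]
encoder_mask = visibility_mask(len(tokens), bidirectional=True)   # encoder-style
causal_mask = visibility_mask(len(tokens), bidirectional=False)   # left-to-right

# Which tokens can the word "cat" (index 1) see under each scheme?
seen_by_encoder = [t for t, ok in zip(tokens, encoder_mask[1]) if ok]
seen_by_causal = [t for t, ok in zip(tokens, causal_mask[1]) if ok]

print(seen_by_encoder)  # every token, on both sides of "cat"
print(seen_by_causal)   # only "cat" and what precedes it
```

In practice, padding positions are still hidden from the encoder through an attention mask, but no position is hidden merely because it comes later in the sequence.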
    

### Importing Required Libraries

In [1]:
# PyTorch core, data utilities, and padding helper
import torch
import torch.nn as nn
from torch import Tensor
from torch.nn import Transformer
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

# Hugging Face tokenizer for BERT
from transformers import BertTokenizer

# torchtext utilities (tokenizer, vocabulary, IMDB dataset)
import torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab, build_vocab_from_iterator
from torchtext.datasets import IMDB

# Standard-library and general-purpose helpers
import csv
import json
import math
import random

import matplotlib.pyplot as plt
import pandas as pd

# Silence all warnings to keep the notebook output readable
import warnings

def warn(*args, **kwargs):
    pass

warnings.warn = warn
warnings.filterwarnings('ignore')

### Pretraining objectives

Pretraining objectives are crucial components of the pretraining process for transformers. These objectives define the tasks that the model is trained on during the pretraining phase, allowing it to learn meaningful contextual representations of language. Two commonly used pretraining objectives are masked language modeling (MLM) and next sentence prediction (NSP).

1. Masked Language Modeling (MLM):
   Masked language modeling involves randomly masking some words in a sentence and training the model to predict the masked words from the context provided by the surrounding words (i.e., words that appear before or after the masked word). The objective is to teach the model contextual understanding so that it can fill in missing information.

   Here's how MLM works:
   - Given an input sentence, a certain percentage of the words are randomly chosen and replaced with a special [MASK] token.
   - The model's task is to predict the original words that were masked, given the context of the surrounding words.
   - During training, the model learns to understand the relationship between the masked words and the rest of the sentence, effectively capturing the contextual information.

2. Next Sentence Prediction (NSP):
   Next sentence prediction involves training the model to predict whether two sentences are consecutive in the original text or randomly chosen from the corpus. This objective helps the model learn sentence-level relationships and understand the coherence between sentences.

   Here's how NSP works:
   - Given a pair of sentences, the model is trained to predict whether the second sentence follows the first sentence in the original text or if it is randomly selected from the corpus.
   - The model learns to capture the relationships between sentences and understand the flow of information in the text.

   NSP is particularly useful for tasks that involve understanding the relationship between multiple sentences, such as question answering or document classification. By training the model to predict the coherence of sentence pairs, it learns to capture the semantic connections between them.
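The masking procedure described under MLM above can be sketched as follows. The text only says that "a certain percentage" of words is masked; the 15% rate and the 80/10/10 split (replace with `[MASK]`, replace with a random token, keep unchanged) are assumptions taken from the original BERT recipe, not something this notebook's dataset is known to use. The function name `mask_tokens` and the `IGNORE` marker are hypothetical.

```python
import random

MASK = "[MASK]"
IGNORE = None  # marks positions that are not prediction targets


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language modeling.

    Assumed scheme (original BERT recipe): each token is selected with
    probability mask_prob; a selected token becomes [MASK] 80% of the time,
    a random vocabulary token 10% of the time, and stays unchanged 10%.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))
            else:
                inputs.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)  # no loss is computed here
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
for tok_in, tok_label in zip(inputs, labels):
    print(f"{tok_in:>8}  ->  {tok_label}")
```

Keeping some selected tokens unchanged or swapping them for random ones means the model cannot rely on `[MASK]` always marking the positions it has to predict.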

It's important to note that different pretrained models may use variations or combinations of these objectives, depending on the specific architecture and training setup.
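The NSP pairing described above can be sketched in a similar way. This is a minimal illustration under stated assumptions: the helper names `make_nsp_pairs` and `join_pair` are hypothetical, the 50/50 split between true and random pairs follows the original BERT recipe, the label `1` for "is next" is chosen for this sketch, and the `0`/`1` segment ids follow a common convention that the CSV used in this notebook may encode differently. A real pipeline would usually draw the random sentence from a different document in the corpus.

```python
import random


def make_nsp_pairs(sentences, rng=None):
    """Build (sentence_a, sentence_b, is_next) examples from ordered sentences."""
    rng = rng or random.Random()
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really follows.
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any sentence except the current and the true next.
            distractors = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(distractors), 0))
    return pairs


def join_pair(tokens_a, tokens_b):
    """Join two token lists as [CLS] A [SEP] B [SEP] with matching segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segments


sentences = [
    "the cat sat on the mat",
    "it purred quietly",
    "rain fell all night",
    "the streets were empty by dawn",
]
pairs = make_nsp_pairs(sentences, rng=random.Random(0))
for a, b, is_next in pairs:
    tokens, segments = join_pair(a.split(), b.split())
    print(is_next, tokens, segments)
```

The segment ids tell the model which tokens belong to the first sentence and which to the second, which is the information the NSP classifier needs alongside the tokens themselves.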


## Loading Data

In [4]:
!wget -O BERT_dataset.zip https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/bZaoQD52DcMpE7-kxwAG8A.zip
!unzip BERT_dataset.zip

--2025-07-31 11:50:54--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/bZaoQD52DcMpE7-kxwAG8A.zip
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.45.118.108
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.45.118.108|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 88958506 (85M) [application/zip]
Saving to: ‘BERT_dataset.zip’


2025-07-31 11:51:42 (2.06 MB/s) - ‘BERT_dataset.zip’ saved [88958506/88958506]

Archive:  BERT_dataset.zip
   creating: /Users/tinonturjamajumder/Generative AI Language Modelling with Transformers_3/bert_dataset
  inflating: bert_dataset/.DS_Store  
  inflating: bert_dataset/bert_train_data.csv  
  inflating: bert_dataset/bert_test_data_sampled.csv  
  inflating: bert_dataset/bert_test_data.csv  
  inflating: bert_dataset/bert_train_data_sampled.csv  

In [11]:
class BERTCSVDataset(Dataset):
    """Dataset over a CSV of pre-built BERT pretraining examples (MLM + NSP)."""

    def __init__(self, filename):
        self.data = pd.read_csv(filename)
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]

        try:
            # "BERT Input" and "BERT Label" are stored as JSON lists of token ids
            bert_input = torch.tensor(json.loads(row['BERT Input']), dtype=torch.long)
            bert_label = torch.tensor(json.loads(row['BERT Label']), dtype=torch.long)
            # "Segment Label" is stored as a comma-separated string of integers
            segment_label = torch.tensor([int(x) for x in row['Segment Label'].split(',')], dtype=torch.long)
            is_next = torch.tensor(int(row['Is Next']), dtype=torch.long)
            original_text = row['Original Text']

        except json.JSONDecodeError as e:
            # Report the malformed row; the collate function must skip None samples
            print(f"Error decoding JSON for row {idx}: {e}")
            print(f"BERT Input: {row['BERT Input']}")
            print(f"BERT Label: {row['BERT Label']}")
            return None

        # Tokenize the raw text as well, padded or truncated to 512 tokens
        encoded_input = self.tokenizer.encode_plus(
            original_text,
            add_special_tokens=True,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )

        # Drop the leading batch dimension added by return_tensors='pt'
        input_ids = encoded_input['input_ids'].squeeze(0)
        attention_mask = encoded_input['attention_mask'].squeeze(0)

        return bert_input, bert_label, segment_label, is_next, input_ids, attention_mask, original_text
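To make the parsing in `__getitem__` concrete, here is a small standard-library sketch of how one row would be decoded. The column names come from the class above, but the row itself is made up: the contents of the real CSV files are not shown here, so the token ids, segment values, and text are purely illustrative.

```python
import csv
import io
import json

# A hypothetical one-row CSV using the column names read by BERTCSVDataset.
raw = io.StringIO(
    'BERT Input,BERT Label,Segment Label,Is Next,Original Text\n'
    '"[1, 5, 3, 7, 2]","[0, 0, 9, 0, 0]","1,1,1,2,2",1,"hello world"\n'
)
row = next(csv.DictReader(raw))

# JSON-encoded lists of token ids
bert_input = json.loads(row["BERT Input"])
bert_label = json.loads(row["BERT Label"])
# Comma-separated string of integers
segment_label = [int(x) for x in row["Segment Label"].split(",")]
is_next = int(row["Is Next"])

print(bert_input, bert_label, segment_label, is_next)
```

The two encodings differ (JSON list versus comma-separated string), which is why the dataset class uses `json.loads` for some columns and `split(',')` for another.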
            

In [12]:
PAD_IDX = 0

def collate_batch(batch):
    # Skip samples for which the dataset returned None (rows with malformed JSON)
    batch = [sample for sample in batch if sample is not None]

    bert_inputs_batch, bert_label_batch, bert_segment_batch, is_next_batch = [], [], [], []
    # Collected for inspection only; these are not part of the returned batch
    input_ids_batch, attention_mask_batch, original_text_batch = [], [], []

    for bert_inputs, bert_label, bert_segment, is_next, input_ids, attention_mask, original_text in batch:
        # as_tensor accepts lists or tensors and avoids re-copying existing long tensors
        bert_inputs_batch.append(torch.as_tensor(bert_inputs, dtype=torch.long))
        bert_label_batch.append(torch.as_tensor(bert_label, dtype=torch.long))
        bert_segment_batch.append(torch.as_tensor(bert_segment, dtype=torch.long))
        is_next_batch.append(torch.as_tensor(is_next, dtype=torch.long))
        input_ids_batch.append(input_ids)
        attention_mask_batch.append(attention_mask)
        original_text_batch.append(original_text)

    # Pad the sequences in the batch; batch_first=False gives shape (max_len, batch_size)
    bert_inputs_final = pad_sequence(bert_inputs_batch, padding_value=PAD_IDX, batch_first=False)
    bert_labels_final = pad_sequence(bert_label_batch, padding_value=PAD_IDX, batch_first=False)
    segments_label_final = pad_sequence(bert_segment_batch, padding_value=PAD_IDX, batch_first=False)
    # Stack the scalar NSP labels into a single tensor of shape (batch_size,)
    is_nexts_final = torch.stack(is_next_batch)

    return bert_inputs_final, bert_labels_final, segments_label_final, is_nexts_final
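The layout produced by padding with `batch_first = False` can be easy to misread, so here is a plain-Python stand-in that mirrors it on nested lists. The helper `pad_time_major` is hypothetical and is not part of PyTorch; it only illustrates that the first axis is the position in the sequence and the second axis is the sample in the batch.

```python
PAD_IDX = 0


def pad_time_major(sequences, padding_value=PAD_IDX):
    """Pad variable-length lists and return them as [position][sample]."""
    max_len = max(len(seq) for seq in sequences)
    return [
        [seq[t] if t < len(seq) else padding_value for seq in sequences]
        for t in range(max_len)
    ]


# Two made-up token-id sequences of different lengths
batch = [[101, 7, 8, 102], [101, 9, 102]]
padded = pad_time_major(batch)

print(padded)                        # [[101, 101], [7, 9], [8, 102], [102, 0]]
print(len(padded), len(padded[0]))   # 4 2, i.e. (max_len, batch_size)
```

A model that expects batch-first input would need the two axes swapped, or the padding done with `batch_first = True` instead.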