# BERT

**BERT**, which stands for *Bidirectional Encoder Representations from Transformers*, is a method of pre-training language representations that was introduced by researchers at Google AI Language in 2018.

<img src="https://www.aionlinecourse.com/uploads/blog/image/BERT1.png" width="600" height="200" />

It is designed to capture the context of, and relations among, the words in a sentence. Unlike traditional word embeddings such as Word2Vec, which assign a single fixed vector to each word regardless of where it appears, BERT computes a representation for each occurrence of a word from its context in both directions, that is, from the words before and after the target word.

## Differences from the Classical Transformer

The classical Transformer model, introduced in the paper "Attention is All You Need" by Vaswani et al. (2017), consists of an encoder and a decoder. The encoder reads the input text, and the decoder produces the output text. This type of architecture excels at sequence-to-sequence tasks like machine translation.

Here are the key differences between BERT and the classical Transformer:

- **Directionality**:
    - BERT: Uses bidirectional context: the representation of each word is computed from the words that come both before and after it.
    - Transformer: The encoder also attends in both directions, but the decoder uses masked (causal) self-attention, so each output position sees only earlier positions. In both models, self-attention alone has no sense of word order; order information comes from the positional encodings.

- **Model Architecture**:
     - BERT: Utilizes only the encoder part of the Transformer architecture.
     - Transformer: Comprises both an encoder and a decoder.

- **Training Objective**:
     - BERT: Is trained on a masked language modeling task and next sentence prediction.
     - Transformer: Is trained to map input sequences to output sequences, like in machine translation.

- **Usage**:
     - BERT: More suited for tasks that require understanding of context and relationships between parts of a sentence.
     - Transformer: Better suited for sequence-to-sequence tasks.

- **Data**:
     - BERT: Can be fine-tuned on a smaller dataset after pre-training.
     - Transformer: When trained from scratch on a supervised task such as translation, typically requires a large amount of task-specific data.

BERT's bidirectional context and its recipe of pre-training followed by fine-tuning brought state-of-the-art results across a broad range of NLP tasks at the time of its release.
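
To make the directionality point concrete, the following sketch builds the two kinds of attention mask as plain nested lists. It is a toy illustration rather than code from either model: the helper names `bidirectional_mask` and `causal_mask` and the example sentence are made up here.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


tokens = ["the", "bank", "of", "the", "river"]
n = len(tokens)

enc = bidirectional_mask(n)
dec = causal_mask(n)

# Which tokens can "bank" (index 1) see under each mask?
visible_enc = [t for t, m in zip(tokens, enc[1]) if m]
visible_dec = [t for t, m in zip(tokens, dec[1]) if m]

print(visible_enc)  # all five tokens, including "river"
print(visible_dec)  # only "the" and "bank"
```

Under the bidirectional mask, the word "bank" can use "river" to resolve its meaning; under the causal mask it cannot, because "river" comes later in the sequence.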

In [41]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
# Hugging Face Transformers documentation: https://huggingface.co/docs/transformers/index
from torch.utils.data import DataLoader, TensorDataset
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import torch.nn as nn
from tqdm.notebook import tqdm
import torch.optim as optim

Other models that you can fine-tune:

- **RoBERTa**: A robustly optimized BERT approach that modifies key hyperparameters in BERT, including removing BERT's next-sentence pretraining objective and training with much larger mini-batches and learning rates.

- **GPT-2** and its larger variants (GPT-2 Medium, GPT-2 Large, and GPT-2 XL): GPT-2 is well suited to text generation, and it can also be fine-tuned for various downstream tasks.

- **DistilBERT**: A smaller, faster, cheaper, and lighter version of BERT, trained by distilling BERT base. Like its teacher model, it can be fine-tuned for various tasks.

# Binary Classification

In [84]:
# Replace this with loading your actual data
texts = ['I love programming!', 'Python is awesome', 'I hate bugs']
labels = [1, 1, 0]  # 1 = positive, 0 = negative

In [85]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_input = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')
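
For intuition about what the tokenizer call above does, here is a toy sketch of WordPiece-style tokenization, the sub-word scheme BERT uses. The vocabulary `toy_vocab`, the helper names, and the example sentence are hypothetical; the real tokenizer has a vocabulary of roughly 30,000 entries and also handles punctuation splitting and text normalization, which this sketch skips.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # nothing matched: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def encode(text, vocab):
    """Lowercase, split on whitespace, add [CLS]/[SEP], and map tokens to ids."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[t] for t in tokens]


# Hypothetical toy vocabulary (token -> id)
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "i": 4, "love": 5, "token": 6, "##ization": 7}

tokens, ids = encode("I love tokenization", toy_vocab)
print(tokens)  # ['[CLS]', 'i', 'love', 'token', '##ization', '[SEP]']
print(ids)     # [2, 4, 5, 6, 7, 3]
```

The word "tokenization" is not in the toy vocabulary, so it is split into the longest known prefix "token" followed by the continuation piece "##ization".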

In [86]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

A downstream task is a specific task for which a pre-trained model is fine-tuned.

The specific variant "bert-base-uncased" denotes a particular configuration of BERT:

- **Uncased**:
The text used for training was converted to lowercase before tokenization, so the model does not distinguish between uppercase and lowercase letters.

- **Base**:
"Base" refers to the size of the model. BERT-base is the smaller variant, with 12 layers (Transformer blocks), 12 attention heads, and 110 million parameters, in contrast to BERT-large, which has 24 layers, 16 attention heads, and 340 million parameters.

- **Training Data**:
BERT-base-uncased was pre-trained on two large corpora:
   - BooksCorpus: 800 million words from 11,038 unpublished books.
   - English Wikipedia: 2,500 million words. Only text passages were used; lists, tables, and headers were excluded.

- **Training Task**:
BERT was pre-trained using two unsupervised tasks:
   - Masked Language Modeling (MLM): Randomly masking out tokens and training the model to predict them from the surrounding context.
   - Next Sentence Prediction (NSP): Training the model to predict whether the second of two sentences actually follows the first in the original text.
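
The sketch below illustrates how MLM inputs can be corrupted, following the recipe described in the BERT paper: about 15% of tokens are selected, and of those, 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. The function name `mask_tokens`, the toy vocabulary, and the per-token coin flip used for selection are simplifying assumptions, not the original implementation.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted tokens, labels); labels are None where no loss applies."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]", "[PAD]"):
            continue  # special tokens are never selected
        if rng.random() >= mask_prob:
            continue  # this position is not selected
        labels[i] = tok  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["the", "cat", "sat", "on", "mat", "dog"]

# A high mask_prob is used here only so that the short example shows some masking.
masked, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

Only positions with a non-`None` label contribute to the MLM loss; every other position passes through unchanged.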



In [87]:
train_dataset = TensorDataset(tokenized_input['input_ids'], tokenized_input['attention_mask'], torch.tensor(labels))
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

In [88]:
tokenized_input['attention_mask']

tensor([[1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 0],
        [1, 1, 1, 1, 1, 0]])

In Transformer models such as BERT, tokenization converts each text into a sequence of tokens, which are then mapped to numerical indices in the model's vocabulary. The model also requires a few additional inputs to work correctly. One of these is the attention mask, which is `tokenized_input['attention_mask']` in the code above.

The attention mask is a binary mask that indicates which positions hold real tokens (including special tokens such as `[CLS]` and `[SEP]`) and which hold padding tokens, added so that all sequences in a batch have the same length. A value of 1 marks a real token, while a value of 0 marks a padding token.

In the tensor above:

- The first row `[1, 1, 1, 1, 1, 1]` indicates that every token in the first text sequence is a real token, with no padding.
- The second and third rows `[1, 1, 1, 1, 1, 0]` indicate that the last position in these sequences is a padding token (the 0), while all preceding tokens are real.

The attention mask lets the Transformer apply self-attention only to the real tokens in the sequence and ignore the padding tokens. This way, the model can process batches of sequences efficiently while still producing meaningful outputs.

*P.S.* The texts in a batch are rarely all the same length, yet most models require batched input with consistent dimensions. This is where padding comes in: a special padding token is appended to the shorter sequences so that every sequence in the batch has the same length. The padding token carries no real information, and masking ensures that the model ignores it during training and evaluation.
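
As a rough illustration of how padding and the attention mask fit together, the sketch below pads a batch by hand. The helper `pad_batch` is hypothetical, and the token splits are an assumption (one token per word plus `[CLS]` and `[SEP]`), chosen to be consistent with the sequence lengths implied by the mask shown earlier.

```python
def pad_batch(sequences, pad_token="[PAD]"):
    """Pad every sequence to the longest length and build matching 0/1 masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_token] * n_pad)
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks


batch = [
    ["[CLS]", "i", "love", "programming", "!", "[SEP]"],
    ["[CLS]", "python", "is", "awesome", "[SEP]"],
    ["[CLS]", "i", "hate", "bugs", "[SEP]"],
]

padded, masks = pad_batch(batch)
for row in masks:
    print(row)
# [1, 1, 1, 1, 1, 1]
# [1, 1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1, 0]
```

With `padding=True`, the Hugging Face tokenizer does the equivalent work on token ids and returns the mask alongside them.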

In [105]:
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# AdamW is a variant of Adam that decouples weight decay from the gradient-based update.
# It was proposed to fix the way L2 regularization interacts with Adam's adaptive step sizes,
# which tends to improve generalization.

loss_fn = nn.BCEWithLogitsLoss()
# nn.BCEWithLogitsLoss combines a sigmoid layer and binary cross-entropy (BCELoss) in one class,
# which is more numerically stable than applying them separately.
# Note: the training loop below reads outputs.loss, which the model computes internally
# when labels are passed, so loss_fn is not actually used there.
num_epochs = 3
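
A point worth spelling out: when `BertForSequenceClassification` is given integer labels with its default of two output labels, the Hugging Face implementation computes a softmax cross-entropy over two logits. Mathematically this matches binary cross-entropy applied to the difference of the two logits, as the following pure-Python sketch suggests. The function names and the logit values are made up for illustration.

```python
import math


def cross_entropy_two_class(z0, z1, label):
    """Softmax cross-entropy over two logits; label is 0 or 1."""
    m = max(z0, z1)  # subtract the max for numerical stability
    log_sum = m + math.log(math.exp(z0 - m) + math.exp(z1 - m))
    return log_sum - (z1 if label == 1 else z0)


def bce_with_logits(x, target):
    """Numerically stable sigmoid + binary cross-entropy on a single logit."""
    return max(x, 0.0) - x * target + math.log1p(math.exp(-abs(x)))


z0, z1 = -0.3, 1.2  # hypothetical logits for classes 0 and 1
for label in (0, 1):
    ce = cross_entropy_two_class(z0, z1, label)
    bce = bce_with_logits(z1 - z0, float(label))
    print(label, round(ce, 6), round(bce, 6))  # the two losses agree
```

So a two-logit classifier with cross-entropy and a one-logit classifier with `BCEWithLogitsLoss` optimize the same quantity; they differ only in how the output layer is parameterized.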

[PyTorch AdamW Documentation](https://pytorch.org/docs/stable/optim.html#torch.optim.AdamW)
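
To show what "decoupled weight decay" means, here is a minimal sketch of one AdamW update for a single scalar parameter. It is a simplification for illustration, not PyTorch's actual code: the function name `adamw_step` is hypothetical, and the default hyperparameters other than `lr` are assumed to mirror the PyTorch defaults.

```python
import math


def adamw_step(param, grad, m, v, t, lr=2e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    # Decoupled weight decay: shrink the parameter directly,
    # instead of adding weight_decay * param to the gradient.
    param = param * (1.0 - lr * weight_decay)

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient, only the weight decay term changes the parameter.
p, m, v = adamw_step(param=1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(p)

# With weight decay disabled, the first step moves by roughly lr.
p2, m2, v2 = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, t=1, weight_decay=0.0)
print(p2)
```

Because the decay is applied to the parameter itself, it is not rescaled by the adaptive denominator, which is the difference from adding an L2 penalty to the loss under Adam.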


In [90]:
model.train()  # Set the model to training mode

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for batch in tqdm(train_loader):
        optimizer.zero_grad()  # Clear old gradients; PyTorch accumulates them by default
        # Use a separate name so the module-level `labels` list is not overwritten
        input_ids, attention_mask, batch_labels = batch
        outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
        loss = outputs.loss  # Computed by the model because labels were passed
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    print(f'Epoch {epoch + 1}, Loss: {epoch_loss / len(train_loader)}')

# P.S. Gradients are zeroed at the start of each iteration so that the gradients computed
# for the current batch come solely from that batch and contain no remnant of the gradients
# from previous batches.

Epoch 1, Loss: 0.775959312915802


Epoch 2, Loss: 0.6330906748771667


Epoch 3, Loss: 0.5627160668373108
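
The training loop above calls `optimizer.zero_grad()` before every backward pass. The small experiment below, on a single scalar tensor rather than on BERT, sketches why: without clearing, each `backward()` call adds to the stored gradient. The tensor `w` and the toy expressions are made up for the demonstration.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

# First backward pass: d(3 * w)/dw = 3
(3 * w).backward()
first = w.grad.item()

# Second backward pass WITHOUT clearing: the new gradient (5) is added to the old one (3)
(5 * w).backward()
accumulated = w.grad.item()

# Clear the gradient, then repeat the second pass: only the new gradient remains
w.grad.zero_()
(5 * w).backward()
fresh = w.grad.item()

print(first, accumulated, fresh)  # 3.0 8.0 5.0
```

Accumulation is sometimes used deliberately, for example to simulate a larger batch across several small ones, but in a plain loop like the one above it would mix gradients from different batches.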


In [91]:
input_text = "I love data science"  # Replace with your text
tokens = tokenizer(input_text, padding=True, truncation=True, max_length=512, return_tensors='pt')

In [92]:
model.eval()  # Set the model to evaluation mode
with torch.no_grad():  # No gradient computation is needed
    outputs = model(**tokens)

In [93]:
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1).item()

In [94]:
label_map = {0: "Negative", 1: "Positive"}  # Adjust according to your task
predicted_label = label_map[predicted_class]
print(f'Predicted Label: {predicted_label}')

Predicted Label: Positive
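
The `argmax` above picks the class with the largest logit. To also see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python; the logit values are hypothetical, not the ones produced by the model in this notebook.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-0.4, 0.9]  # hypothetical logits for [Negative, Positive]
probs = softmax(logits)

# Softmax preserves ordering, so argmax over probs equals argmax over logits.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
label_map = {0: "Negative", 1: "Positive"}
print(label_map[predicted_class], [round(p, 3) for p in probs])
```

With a PyTorch tensor, the equivalent call would be `torch.softmax(logits, dim=1)`.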


# Multiclass Classification

In [95]:
# Assume you have a dataset with text and labels (2 for Positive, 1 for Neutral, 0 for Negative).
texts = ['I love programming!', 'Python is awesome', 'I hate bugs', 'The code is okay']
labels = [2, 2, 0, 1]

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_input = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors='pt')

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)  # Specify the number of labels

In [96]:
train_dataset = TensorDataset(tokenized_input['input_ids'], tokenized_input['attention_mask'], torch.tensor(labels))
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

In [97]:
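optimizer = optim.AdamW(model.parameters(), lr=2e-5)
# CrossEntropyLoss is the standard multiclass loss. The loop below reads outputs.loss,
# which the model computes internally from the labels, so loss_fn is shown for reference only.
loss_fn = nn.CrossEntropyLoss()
num_epochs = 3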

In [98]:
model.train()  # Set the model to training mode

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for batch in tqdm(train_loader):
        optimizer.zero_grad()  # Clear gradients left over from the previous step
        # Use a separate name so the module-level `labels` list is not overwritten
        input_ids, attention_mask, batch_labels = batch
        outputs = model(input_ids, attention_mask=attention_mask, labels=batch_labels)
        loss = outputs.loss  # Cross-entropy over the three classes
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()

    print(f'Epoch {epoch + 1}, Loss: {epoch_loss / len(train_loader)}')

Epoch 1, Loss: 1.1351358890533447


Epoch 2, Loss: 1.081408977508545


Epoch 3, Loss: 1.0835492610931396
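
The losses printed above stay between roughly 1.08 and 1.14. A useful reference point is the cross-entropy of a classifier that assigns equal probability to all three classes, which the sketch below computes by hand. The function name and the all-zero logits are illustrative assumptions.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example with an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[label]


uniform_logits = [0.0, 0.0, 0.0]  # equal scores, so each class gets probability 1/3
baseline = cross_entropy(uniform_logits, label=2)
print(round(baseline, 4))  # 1.0986, i.e. ln(3)
```

Since the observed losses sit close to ln(3), the model is still near chance level, which is unsurprising after three optimizer steps on four examples. A real dataset and more training would be needed for the predictions to be meaningful.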


In [104]:
input_text = "Today I love apple"  # Replace with your text
tokens = tokenizer(input_text, padding=True, truncation=True, max_length=512, return_tensors='pt')


model.eval()  # Set the model to evaluation mode
with torch.no_grad():  # No gradient computation is needed
    outputs = model(**tokens)


logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1).item()


label_map = {0: "Negative", 1: "Neutral", 2: "Positive"}  # Adjust according to your task
predicted_label = label_map[predicted_class]
print(f'Predicted Label: {predicted_label}')


Predicted Label: Positive
