# 8.3.4 BERT

## Explanation of **BERT (Bidirectional Encoder Representations from Transformers)**

**BERT (Bidirectional Encoder Representations from Transformers)** is a language model introduced by Google in 2018. It is built on the encoder of the Transformer architecture and is designed to capture the context of each word in a sentence. Unlike earlier language models that process text in one direction (left-to-right or right-to-left), BERT uses self-attention to let every token attend to all other tokens in the sequence, so context from both sides is used at once. This bidirectional approach lets BERT resolve the meaning of a word from the words on either side of it.
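To make the contrast concrete, here is a minimal sketch of which positions each token may attend to under a left-to-right model and under a bidirectional encoder. The helper names `causal_mask` and `bidirectional_mask` are illustrative and do not come from any library; a `1` in row *i*, column *j* means token *i* can see token *j*.

```python
def causal_mask(n):
    """Left-to-right model: token i sees only tokens 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style encoder: every token sees every token in the sequence."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "bank", "of", "the", "river"]
for name, mask in [("causal", causal_mask(len(tokens))),
                   ("bidirectional", bidirectional_mask(len(tokens)))]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>5}: {row}")
```

In the causal pattern, the row for "bank" has zeros in the column for "river", so that word cannot help disambiguate it. In the bidirectional pattern it can.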

Key features of BERT include:
- **Bidirectional Context**: BERT considers both the left and right context of a word, which enhances its understanding of the text.
- **Pre-training and Fine-tuning**: BERT is pre-trained on a large corpus of text using two tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP). It is then fine-tuned on specific tasks with additional task-specific layers.
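The Masked Language Model objective can be sketched in a few lines. This is a simplified illustration, not the original implementation: it follows the proportions described in the BERT paper (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged). The function name `mask_tokens` and the toy vocabulary are assumptions made for this example, and special tokens such as `[CLS]` and `[SEP]` are left out for brevity.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "tree", "run", "blue"]  # toy vocabulary (assumption)


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels); a label is None where nothing is predicted."""
    if rng is None:
        rng = random.Random(0)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                           # model must recover the original
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)                   # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(TOY_VOCAB))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                    # 10%: keep the token unchanged
        else:
            corrupted.append(tok)
            labels.append(None)                          # position not used in the loss
    return corrupted, labels


sentence = "the quick brown fox jumps over the lazy dog".split()
corrupted, labels = mask_tokens(sentence, rng=random.Random(42))
print(corrupted)
print(labels)
```

Because the loss is computed only at the selected positions, the model has to use the unmasked words on both sides to recover each original token.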

## Applications and Benefits of BERT

### Applications
1. **Text Classification**: BERT is used for sentiment analysis, spam detection, and topic categorization.
2. **Named Entity Recognition (NER)**: It identifies and classifies entities like names, dates, and locations in text.
3. **Question Answering**: BERT excels in understanding and answering questions based on a given context.
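4. **Text Summarization**: As an encoder-only model, BERT does not generate text by itself. It is typically used for extractive summarization (scoring and selecting sentences) or as the encoder inside an encoder-decoder summarizer.
5. **Translation**: BERT is not a translation model on its own, but its pre-trained representations can be used to initialize or augment the encoder of a machine translation system.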
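For classification-style tasks, the "task-specific layer" added during fine-tuning is usually small: a linear layer applied to the encoder's representation of the `[CLS]` token, followed by a softmax. The sketch below shows only that head, using a made-up 4-dimensional vector in place of a real encoder output. All numbers, and the names `linear` and `softmax`, are illustrative assumptions.

```python
import math


def linear(x, weights, bias):
    """Compute weights @ x + bias for a single input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


cls_vector = [0.2, -1.0, 0.5, 0.3]       # toy stand-in for the [CLS] representation
W = [[0.1, 0.4, -0.2, 0.0],              # one row of weights per class
     [-0.3, -0.5, 0.6, 0.2]]
b = [0.0, 0.1]

logits = linear(cls_vector, W, b)
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(logits, probs, predicted_class)
```

Token-level tasks such as NER follow the same idea, except that the linear layer is applied to every token's representation instead of only `[CLS]`.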

### Benefits
- **Enhanced Contextual Understanding**: BERT's bidirectional approach allows it to better capture the meaning of words in context, leading to improved performance on various NLP tasks.
- **Pre-trained Models**: BERT provides pre-trained models that can be fine-tuned for specific tasks, reducing the need for extensive training data and computational resources.
- **Versatility**: BERT can be adapted to a wide range of NLP tasks, making it a flexible tool for different applications.


___
___
### Readings:
- [What is BERT? How it is trained ? A High Level Overview](https://medium.com/@Suraj_Yadav/what-is-bert-how-it-is-trained-a-high-level-overview-1207a910aaed)
- [BERT — Bidirectional Encoder Representations from Transformer](https://gayathri-siva.medium.com/bert-bidirectional-encoder-representations-from-transformer-8c84bd4c9021)
- [Large Language Models: BERT](https://towardsdatascience.com/bert-3d1bf880386a)
- [BERT Explained: A Complete Guide with Theory and Tutorial](https://medium.com/@samia.khalid/bert-explained-a-complete-guide-with-theory-and-tutorial-3ac9ebc8fa7c)
- [Understanding BERT](https://pub.towardsai.net/understanding-bert-b69ce7ad03c1)
___
___

In [None]:
!pip install datasets "transformers[torch]" scikit-learn

In [1]:
# Import necessary libraries
from datasets import load_dataset
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
from transformers import Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
# Load the dataset
dataset = load_dataset('imdb')

# Initialize the tokenizer
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


In [3]:

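# Tokenize the dataset: truncate to the model's 512-token limit and pad every
# example to that same length so that all batches have a uniform shape
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512)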

encoded_dataset = dataset.map(preprocess_function, batched=True)
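What truncation and padding do to a single example can be shown without the tokenizer. The sketch below is a simplified stand-in, not the library's implementation: the `encode` helper is hypothetical, the word-piece ids in the examples are made up, and the special-token ids are the ones used by the uncased BERT vocabulary.

```python
CLS, SEP, PAD = 101, 102, 0  # [CLS], [SEP], [PAD] ids in the uncased BERT vocabulary


def encode(token_ids, max_length):
    """Add [CLS]/[SEP], truncate to max_length, then pad up to max_length."""
    body = token_ids[: max_length - 2]            # leave room for the two special tokens
    input_ids = [CLS] + body + [SEP]
    attention_mask = [1] * len(input_ids)         # 1 = real token
    n_pad = max_length - len(input_ids)
    return {
        "input_ids": input_ids + [PAD] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding, ignored by attention
    }


short = encode([2023, 3185], max_length=6)               # made-up ids, shorter than the limit
long = encode(list(range(1000, 1010)), max_length=6)     # made-up ids, longer than the limit
print(short)
print(long)
```

The short example is padded with zeros and its attention mask marks those positions as ignorable; the long example is cut so that it still ends with `[SEP]`.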




In [None]:
# Split the dataset into train and test sets
train_dataset = encoded_dataset['train'].shuffle(seed=42).select(range(10000))  # Using a smaller subset for faster training
test_dataset = encoded_dataset['test'].shuffle(seed=42).select(range(2000))  # Using a smaller subset for faster evaluation

# Initialize the model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)


In [5]:

# Define the metrics
def compute_metrics(p):
    preds = p.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(p.label_ids, preds, average='binary')
    acc = accuracy_score(p.label_ids, preds)
    return {'accuracy': acc, 'precision': precision, 'recall': recall, 'f1': f1}

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    num_train_epochs=2,              # Number of training epochs
    per_device_train_batch_size=16,  # Batch size for training
    per_device_eval_batch_size=16,   # Batch size for evaluation
    warmup_steps=500,                # Number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # Strength of weight decay
    logging_dir='./logs',            # Directory for storing logs
    logging_steps=10,                # Log the training loss every 10 steps
    evaluation_strategy="epoch",     # Evaluate every epoch (named `eval_strategy` in newer transformers releases)
    save_strategy="epoch",           # Save model every epoch
    load_best_model_at_end=True,     # Load the best model at the end of training
)

# Initialize the Trainer
trainer = Trainer(
    model=model,                         # The instantiated 🤗 Transformers model to be trained
    args=training_args,                  # Training arguments, defined above
    train_dataset=train_dataset,         # Training dataset
    eval_dataset=test_dataset,           # Evaluation dataset
    compute_metrics=compute_metrics,     # Function to compute metrics for evaluation
)
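To see what `warmup_steps=500` means for this run, the sketch below reproduces a linear warmup followed by linear decay. It assumes the `Trainer` defaults of a `5e-5` peak learning rate and a linear scheduler, a single device, and no gradient accumulation; the function `linear_schedule_lr` is a hypothetical helper written for illustration, not a library call.

```python
import math


def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


steps_per_epoch = math.ceil(10_000 / 16)   # 10,000 training examples, batch size 16
total_steps = steps_per_epoch * 2          # two epochs

for step in (0, 250, 500, 875, total_steps):
    lr = linear_schedule_lr(step, 5e-5, 500, total_steps)
    print(f"step {step:>4}: lr = {lr:.2e}")
```

Under these assumptions the run has 1,250 optimizer steps, so the warmup covers 40% of training. For a short fine-tuning run like this one, a smaller warmup may be worth trying.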


In [6]:
# Train the model
trainer.train()

# Evaluate the model
results = trainer.evaluate()

# Print the evaluation results
print(f"Evaluation Results: {results}")

| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.2333 | 0.270237 | 0.9085 | 0.896987 | 0.923 | 0.909808 |
| 2 | 0.1263 | 0.261713 | 0.9165 | 0.921132 | 0.911 | 0.916038 |


Evaluation Results: {'eval_loss': 0.26171255111694336, 'eval_accuracy': 0.9165, 'eval_precision': 0.9211324570273003, 'eval_recall': 0.911, 'eval_f1': 0.9160382101558573, 'eval_runtime': 34.959, 'eval_samples_per_second': 57.21, 'eval_steps_per_second': 3.576, 'epoch': 2.0}


In [7]:
import torch

# Prepare new data
sentences = ["This movie was fantastic!", "I didn't like the film at all."]
encodings = tokenizer(sentences, truncation=True, padding=True, max_length=512, return_tensors='pt')

# Move the model and the encodings to the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
encodings = {key: val.to(device) for key, val in encodings.items()}

# Get predictions with dropout disabled and gradient tracking turned off
model.eval()
with torch.no_grad():
    outputs = model(**encodings)
predictions = outputs.logits.argmax(dim=-1)

for sentence, prediction in zip(sentences, predictions):
    print(f"Sentence: {sentence}\nPrediction: {'Positive' if prediction.item() == 1 else 'Negative'}\n")

Sentence: This movie was fantastic!
Prediction: Positive

Sentence: I didn't like the film at all.
Prediction: Negative



In [8]:
from sklearn.metrics import confusion_matrix, classification_report

# Predict on the test set
predictions = trainer.predict(test_dataset)
preds = predictions.predictions.argmax(-1)

# Confusion Matrix
cm = confusion_matrix(test_dataset['label'], preds)
print(f"Confusion Matrix:\n{cm}")

# Classification Report
report = classification_report(test_dataset['label'], preds, target_names=['Negative', 'Positive'])
print(f"Classification Report:\n{report}")


Confusion Matrix:
[[922  78]
 [ 89 911]]
Classification Report:
              precision    recall  f1-score   support

    Negative       0.91      0.92      0.92      1000
    Positive       0.92      0.91      0.92      1000

    accuracy                           0.92      2000
   macro avg       0.92      0.92      0.92      2000
weighted avg       0.92      0.92      0.92      2000
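The headline metrics can be recomputed by hand from the confusion matrix above, which is a useful check on how precision, recall, and F1 are defined for the positive class. This is a plain-Python sketch; the variable names are chosen for this example.

```python
# Rows are true labels (Negative, Positive); columns are predicted labels.
cm = [[922, 78],
      [89, 911]]

tn, fp = cm[0]   # true negatives, false positives
fn, tp = cm[1]   # false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # of predicted positives, how many were correct
recall = tp / (tp + fn)               # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.9165 precision=0.9211 recall=0.9110 f1=0.9160
```

These values agree with the `eval_accuracy`, `eval_precision`, `eval_recall`, and `eval_f1` figures reported after training.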



## Conclusion

BERT (Bidirectional Encoder Representations from Transformers) has significantly advanced the field of Natural Language Processing (NLP) by introducing a model that uses context from both directions. This improves the model's ability to capture nuanced meanings in text and has led to substantial gains on tasks such as text classification, named entity recognition, and question answering. Its representations have also been used as components in summarization and translation systems.

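The model's pre-training on a large corpus, followed by fine-tuning on a specific task, allows it to achieve high performance with relatively little task-specific data: in the example above, 10,000 labeled reviews and two epochs of training were enough to reach about 92% accuracy on 2,000 held-out IMDB test reviews. This versatility and strong contextual understanding make BERT a powerful tool for both research and practical applications in NLP.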

Implementing BERT involves using the `transformers` library to load a pre-trained model and tokenizer, tokenizing the data, setting up training arguments, and using the `Trainer` for training and evaluation. The example in this section uses DistilBERT (`distilbert-base-uncased`), a smaller distilled variant of BERT, which keeps training time manageable; the same steps apply to the full BERT checkpoints.

Overall, BERT's contributions to NLP highlight the importance of context-aware models in achieving state-of-the-art results and set a benchmark for future advancements in the field.
