This Jupyter Notebook was tested with the following Python packages:
- torch==2.5.1
- transformers==4.48.1
- datasets==3.2.0

# BERT-style Encoder Language Models

## Overview

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model that is pre-trained to produce deep bidirectional representations by jointly conditioning on both left and right context in all layers. This lets the model represent each word in light of its surroundings, which makes it effective for a wide range of NLP tasks. It can therefore be seen as a model that produces contextualized embeddings, which can replace static word embeddings such as GloVe.

## Key Features

1. **Bidirectional Context**: Unlike traditional models that read text sequentially (left-to-right or right-to-left), BERT reads the entire sequence of words at once, allowing it to understand the context of a word based on both its preceding and following words.

2. **Pre-training Objectives**:
    - **Masked Language Model (MLM)**: Randomly masks some of the tokens in the input, and the objective is to predict the original vocabulary id of the masked word based only on its context.
    - **Next Sentence Prediction (NSP)**: Predicts whether a given pair of sentences is consecutive in the original text, helping the model understand sentence relationships.
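
To make the MLM objective more concrete, here is a minimal sketch of the input corruption step. It assumes the recipe from the original BERT paper (about 15% of the tokens are selected; of those, 80% are replaced by a mask token, 10% by a random token, and 10% are left unchanged). The function name `mask_tokens`, the toy vocabulary, and the use of string tokens instead of vocabulary ids are illustrative choices and not part of this notebook.

```python
import random

MASK_TOKEN = "<mask>"
IGNORE_LABEL = -100  # assumption: PyTorch's default "ignore" label for the loss


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for one masked language modeling step.

    labels[i] holds the original token at selected positions and
    IGNORE_LABEL everywhere else (real implementations use token ids).
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [IGNORE_LABEL] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK_TOKEN  # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


tokens = "the movie was surprisingly good".split()
vocab = ["the", "movie", "was", "surprisingly", "good", "bad", "film"]
# A high mask_prob is used here only so that the toy example shows some masking
corrupted, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=0)
print(corrupted)
print(labels)
```

During pre-training, the model sees `corrupted` as input and is trained to predict the original token only at the positions where `labels` is not `IGNORE_LABEL`.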

## Applications

BERT can be fine-tuned for various downstream tasks, including:

- **Text Classification**: Sentiment analysis, spam detection, etc.
- **Question Answering**: Extracting answers from a given context.
- **Named Entity Recognition (NER)**: Identifying entities like names, dates, and locations in text.
- **Extractive Text Summarization**: Selecting the most important sentences of a longer text.


We first install the needed packages:

In [None]:
%pip install transformers[torch] datasets

## Imports

In [None]:
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import (
    DataCollatorWithPadding,
    RobertaForSequenceClassification,
    RobertaModel,
    RobertaTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
    EvalPrediction,
)

We will again use the SST2 dataset. Let's load it:

In [None]:
# Load the dataset
dataset = load_dataset("stanfordnlp/sst2")
print(dataset)

When working with pre-trained models, you have to download the weights from somewhere.\
These weights have to fit the model architecture, so we rely on a library that provides both the model and its weights.\
Here we use the Hugging Face transformers library to load the model and its tokenizer.

There are many encoder language models; here we use RoBERTa, a version of BERT with an optimized pre-training setup (tuned hyperparameters, more training data, and no NSP objective).

In [None]:
# Load the tokenizer and model
tokenizer: RobertaTokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base")

We can inspect the model's architecture by printing it:

In [None]:

print(roberta)

As before, we prepare the dataset, but this time the tokenizer does most of the work converting text to token ids for the loaded model:

In [None]:
# Set to True to use a subset of the dataset for faster training
USE_SUBSET = False

In [None]:
# Tokenize the dataset
def tokenize(examples, tokenizer):
    return tokenizer(examples["sentence"], truncation=True, return_attention_mask=False)


# Prepare the datasets
tokenized_datasets = dataset.map(tokenize, fn_kwargs=dict(tokenizer=tokenizer), batched=True, num_proc=10)

train_dataset = tokenized_datasets["train"]
val_dataset = tokenized_datasets["validation"]
test_dataset = tokenized_datasets["test"]

# Use a subset for quick training
if USE_SUBSET:
    train_dataset = train_dataset.shuffle(seed=42).select(range(1000))
    val_dataset = val_dataset.shuffle(seed=42).select(range(100))
    test_dataset = test_dataset.shuffle(seed=42).select(range(100))

# Print the features
print(train_dataset.features)
# Print the first example
print(train_dataset[0])

## Task
Feed the first training example through the LM.
>Note: The model takes the token ids with the argument `input_ids`.

In [None]:
# Call model on the first example
output = roberta(input_ids=torch.tensor(train_dataset[0]["input_ids"]).unsqueeze(0))
print(output.last_hidden_state.shape)
print(output.pooler_output.shape)

The model's output is a contextualized representation of the input and can therefore be used as such in your neural network:

In [None]:
class SentimentAnalysisModel(torch.nn.Module):
    def __init__(self, base_model, freeze_embedder=False):
        super().__init__()
        # We are using the transformer model as base_model
        self.base_model = base_model
        # Freeze the base_model
        if freeze_embedder:
            for param in self.base_model.parameters():
                param.requires_grad = False
        # We add a linear layer on top of the base_model
        self.classifier = torch.nn.Linear(base_model.config.hidden_size, 2)
        # Initialize the classifier weights
        torch.nn.init.xavier_uniform_(self.classifier.weight) # Xavier uniform initialization
        torch.nn.init.zeros_(self.classifier.bias) # Initialize bias with zeros

    def forward(self, **model_inputs):
        # model_inputs is a dict
        # Pass the inputs to the model to produce an embedding
        base_model_output = self.base_model(**model_inputs)
        # Pass the embedding through the classifier
        # here we use the pooler_output as the representation of the sentence (depends on the model)
        output = self.classifier(base_model_output.pooler_output)
        return output



In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model = SentimentAnalysisModel(roberta, freeze_embedder=False).to(device) # Set to True to freeze the embedder for faster training
model(input_ids=torch.tensor([train_dataset[0]["input_ids"]], device=device))
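
The class above uses `pooler_output` as the sentence representation. A common alternative is to average the token vectors in `last_hidden_state` while ignoring padded positions. The following is a minimal sketch with random tensors standing in for real model output; the function name `masked_mean_pool` is hypothetical and not part of the transformers library.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over the sequence, skipping padded positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Dummy data: 2 sequences, 4 positions, hidden size 8;
# the second sequence ends with 2 padding positions
hidden = torch.randn(2, 4, 8)
mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 8])
```

In `SentimentAnalysisModel.forward`, such a pooled vector could take the place of `base_model_output.pooler_output`, provided that an `attention_mask` is passed in.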

So far we have only passed the token ids to the model.\
However, there are more inputs for transformer-based models which you need to be aware of:

1. **input_ids**: Token IDs to be fed to the model.
2. **attention_mask**: Mask to avoid performing attention on padding token indices.
3. **token_type_ids**: Segment token indices to indicate different portions of the inputs (used in models like BERT for tasks like question answering).
4. **position_ids**: Indices of positions of each input sequence token in the position embeddings.

If you don't pass them to the model, the library fills in sensible defaults!\
However, if you pad sequences during batching, the model cannot know which positions are padding, so you also have to provide the `attention_mask`!
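
To see why the `attention_mask` matters, consider how the attention weights for a single query token are computed. The sketch below is a simplified, pure-Python illustration and not the library's implementation: masked positions receive a score of negative infinity before the softmax, so padding ends up with zero weight. The function name `masked_softmax` and the example scores are made up, and the sketch assumes that at least one position is unmasked.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores where positions with mask == 0 get zero weight."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    max_score = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, 0.5]  # raw attention scores for 4 positions
attention_mask = [1, 1, 0, 0]  # the last two positions are padding

print(masked_softmax(scores, [1, 1, 1, 1]))  # without a mask, padding takes part of the weight
print(masked_softmax(scores, attention_mask))  # with the mask, padding gets weight 0.0
```

Without the mask, the padding tokens would influence the representation of the real tokens, and the result would depend on how much padding a batch happens to contain.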

Having the processed data and the model in place, we create the appropriate dataloader:

In [None]:
def collate_fn(features):
    # We need to pad the input to make sure all sentences have the same length
    input_ids = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(f["input_ids"]) for f in features],
        batch_first=True,
        padding_value=tokenizer.pad_token_id,
    )
    if "attention_mask" in features[0]:
        # If the features contain attention_mask, we should pad them as well
        attention_mask = torch.nn.utils.rnn.pad_sequence(
            [torch.tensor(f["attention_mask"]) for f in features],
            batch_first=True,
            padding_value=0,
        )
    else:
        # We need to create an attention mask from input_ids
        attention_mask = (input_ids != tokenizer.pad_token_id).int()
    labels = torch.tensor([f["label"] for f in features])
    batch = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }
    return batch


train_dataloader = DataLoader(
    train_dataset, batch_size=2, shuffle=True, collate_fn=collate_fn
)
val_dataloader = DataLoader(
    val_dataset, batch_size=8, shuffle=False, collate_fn=collate_fn
)

for batch in train_dataloader:
    print(batch)
    print(batch.keys())
    break

Instead of implementing the collate function ourselves, we can use the ones that the transformers library provides for different applications:

In [None]:
train_dataloader = DataLoader(
    train_dataset.with_format(columns=["input_ids", "label"]),
    batch_size=32,
    shuffle=True,
    collate_fn=DataCollatorWithPadding(tokenizer, padding="longest"),
)

for batch in train_dataloader:
    print(batch)
    print(batch.keys())
    break

In [None]:
val_dataloader = DataLoader(
    val_dataset.with_format(columns=["input_ids", "label"]),
    batch_size=128,
    shuffle=False,
    collate_fn=DataCollatorWithPadding(tokenizer, padding="longest"),
)

for batch in val_dataloader:
    print(batch)
    print(batch.keys())
    break

We also write an evaluation function that computes the accuracy of all samples in a dataloader using the model:

In [None]:
# Evaluate the model
def evaluate(model, dataloader, device=None):
    with torch.no_grad():
        model.eval()
        total = 0
        correct = 0
        for batch in tqdm(dataloader, desc="Evaluation", leave=False):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predicted = torch.argmax(outputs, dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total

In [None]:
print(f"Evaluation accuracy on validation data: {evaluate(model, val_dataloader, device=device)}")

Next, we need a training loop that is just the same as before.\
We have to make sure to feed in the correct arguments to our model:

In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.00005, weight_decay=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
num_epochs = 3
metric_dict = {"Loss": "-", "Val Acc": evaluate(model, val_dataloader, device=device)}

with tqdm(
    total=num_epochs * len(train_dataloader), desc="Training", unit="batch"
) as pbar:
    for epoch in range(num_epochs):
        # Set the model in training mode
        model.train()
        for batch in train_dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            output = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            metric_dict["Loss"] = loss.item()
            pbar.set_postfix(metric_dict)
            pbar.update(1)
        metric_dict["Val Acc"] = evaluate(model, val_dataloader, device=device)
        pbar.set_postfix(metric_dict)

Alternatively, the transformers library makes it simple to use LMs as it includes task-specific models for finetuning:

In [None]:
# RobertaForSequenceClassification model can be used for text classification tasks like sentiment analysis
# It has a sequence classification head, that is a linear layer on top of the RoBERTa model that outputs a classification label
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

You can also use their Trainer API so that you don't have to implement the training loop again and again:

In [None]:
# Define a function to compute the metrics
def compute_metrics(pred_and_label: EvalPrediction):
    return {
        "accuracy": (pred_and_label.predictions.argmax(axis=-1) == pred_and_label.label_ids)
        .mean()
        .item()
    }


# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # called `evaluation_strategy` in older transformers releases (deprecated)
    logging_strategy="epoch",
    optim="adamw_torch",  # PyTorch's AdamW, as in the manual training loop ("adamw_hf" is deprecated)
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=128,
    num_train_epochs=3,
    weight_decay=0.01,
    use_cpu=False,
    eval_on_start=True,
    save_strategy="no",  # We will not save the model for now to save disk space
)

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,  # enables padding of batches
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate()

### Text Generation Language Models with Prompting

Text generation language models, such as GPT-3, can be used with prompting to perform various natural language processing tasks, including sentiment analysis.\
Prompting involves providing the model with a specific input or "prompt" that guides it to generate the desired output.\
This technique leverages the model's pre-trained knowledge to perform tasks without requiring additional fine-tuning.

#### Applying Prompting to Sentiment Analysis

To use a text generation model for sentiment analysis, you can craft a prompt that instructs the model to classify the sentiment of a given text. The prompt should be designed to elicit a response that indicates whether the sentiment is positive or negative.


The transformers library makes it simple to load a text generation model.
>Note that they can be quite large and may not fit into your RAM or may run slowly on a CPU.

In [None]:
# Load the pipeline
pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-3.5-mini-instruct",
    trust_remote_code=True,  # Trust the remote code; this is required for some models, but always check the code first!
    device=device,  # Uses the GPU if one is available (see `device` above)
    torch_dtype=torch.bfloat16,  # Use bfloat16 for less memory usage and faster inference
)

### Input Format for Chat-Based Decoder Models

Chat-based decoder models, such as the Phi-3.5 model loaded above, typically require inputs in a specific format to generate coherent and contextually relevant responses.\
The input format generally consists of a sequence of messages, each with a role and content.\
The roles depend on the model, and often are "system", "user", or "assistant".

#### Example Input Format

```python
messages = [
    {
        "role": "system",
        "content": "Your task is to perform sentiment analysis. Classify the sentiment of the provided text into 'negative' or 'positive' and return only this label.",
    },
    {"role": "user", "content": "This movie was the worst movie I have ever seen."},
]
```

#### Key Components

1. **Role**: Indicates the role of the message sender. Common roles include:
   - `system`: Provides instructions or context for the conversation.
   - `user`: Represents the input from the user.
   - `assistant`: Represents the response from the model.

2. **Content**: The actual text of the message.
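
Before generation, the list of messages is flattened into a single string using a model-specific chat template; with transformers this is normally done by `tokenizer.apply_chat_template`, and the pipeline does it for you. The sketch below only illustrates the idea: the function `render_chat` is hypothetical, and the marker tokens are an assumption modeled on the Phi-3 style, so rely on the tokenizer's own template for real inputs.

```python
def render_chat(messages, add_generation_prompt=True):
    """Flatten role/content messages into one prompt string (illustrative template)."""
    parts = []
    for message in messages:
        # Each turn: role marker, content, end-of-turn marker
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so that the model continues with its answer
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "Classify the sentiment as 'negative' or 'positive'."},
    {"role": "user", "content": "This movie was the worst movie I have ever seen."},
]
print(render_chat(messages))
```

The trailing assistant marker is what tells the model that it is now its turn to respond.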

In [None]:
messages = [
    {
        "role": "system",
        "content": "Your task is to perform sentiment analysis. Classify the sentiment of the provided text into 'negative' or 'positive' and return only this label.",
    },
    {"role": "user", "content": "This movie was the worst movie I have ever seen."},
]

generation_args = {
    "max_new_tokens": 3,  # maximum number of tokens to generate
    "return_full_text": False,  # return only the newly generated text, not the prompt
    "temperature": 0.0,  # temperature for sampling (only used if do_sample=True)
    "do_sample": False,  # whether to sample from the output distribution
}

output = pipe(messages, **generation_args)
print(output)

There is much more that you can do with text generation models!\
For example, in-context learning (sometimes also called demonstration learning) is a technique where the model is provided with examples of the task it needs to perform within the prompt itself.\
This helps the model understand the task better and generate more accurate responses.\
For sentiment analysis, you can provide a few examples of sentences along with their sentiment labels in the prompt.\
The model will then use these examples to infer the sentiment of new sentences.
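
As a minimal sketch of in-context learning in the chat format used above, demonstrations can be added as alternating `user` and `assistant` messages before the actual query. The helper `build_few_shot_messages` and the example sentences are hypothetical; the resulting list could be passed to `pipe` in place of `messages`.

```python
def build_few_shot_messages(instruction, demonstrations, query):
    """Build a chat prompt with labeled examples placed before the query."""
    messages = [{"role": "system", "content": instruction}]
    for text, label in demonstrations:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages


instruction = (
    "Your task is to perform sentiment analysis. Classify the sentiment of the "
    "provided text into 'negative' or 'positive' and return only this label."
)
demonstrations = [  # made-up examples for illustration
    ("A charming and often affecting journey.", "positive"),
    ("The plot is thin and the jokes fall flat.", "negative"),
]
few_shot_messages = build_few_shot_messages(
    instruction, demonstrations, "This movie was the worst movie I have ever seen."
)
for message in few_shot_messages:
    print(f"{message['role']}: {message['content']}")
```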

Also, constraining the output tokens can help guide the model toward the expected outputs and makes the output easier to parse.
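
One simple way to constrain the output, sketched below, is to compare the model's scores only for the tokens that correspond to allowed labels and to ignore everything else. The scores here are invented numbers standing in for next-token logits, and `constrained_choice` is a hypothetical helper, not a transformers API.

```python
def constrained_choice(token_scores, allowed_labels):
    """Pick the highest-scoring token among the allowed labels only."""
    candidates = {
        label: token_scores[label] for label in allowed_labels if label in token_scores
    }
    if not candidates:
        raise ValueError("None of the allowed labels has a score.")
    return max(candidates, key=candidates.get)


# Invented next-token scores; an unconstrained argmax would return "The"
token_scores = {"The": 3.1, "positive": 0.4, "negative": 2.7, "I": 1.9}
label = constrained_choice(token_scores, ["negative", "positive"])
print(label)  # negative
```

Because the result is always one of the allowed labels, no further parsing of free-form text is needed.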