# Ideas:


1. **Transformers and BERT**:
    - **Overview of the Transformer architecture**:
        - Explanation of self-attention mechanism and multi-head attention.
        - Understanding positional encoding and its importance.
        - Comparison between RNNs and Transformers.
    - **Implementing a Transformer model in PyTorch**:
        - Building the Transformer encoder and decoder from scratch.
        - Training the Transformer model on a language modeling task (e.g., text generation).
    - **Introduction to BERT (Bidirectional Encoder Representations from Transformers)**:
        - Understanding the pre-training objectives: Masked Language Model (MLM) and Next Sentence Prediction (NSP).
        - Fine-tuning BERT for a specific task using PyTorch (e.g., sentiment analysis, question answering).
        - Exploring variations of BERT like RoBERTa, DistilBERT, and their use cases.

2. **Language Model Evaluation**:
    - **Metrics for evaluating language models**:
        - Explanation of perplexity and its calculation.
        - Understanding BLEU score for evaluating translation tasks.
        - Introduction to other metrics like ROUGE, METEOR, and their applications.
    - **Implementing evaluation metrics in PyTorch**:
        - Writing functions to calculate perplexity for a given language model.
        - Implementing BLEU score calculation for evaluating Seq2Seq models.
        - Using libraries like `nltk` or `sacrebleu` for metric calculations.
    - **Practical evaluation**:
        - Evaluating the performance of different models (RNN, LSTM, GRU, Transformer) on the same dataset.
        - Analyzing the results and discussing the strengths and weaknesses of each model.
        - Visualizing evaluation metrics using tools like Matplotlib or Seaborn.

3. **Advanced Topics**:
    - Exploring recent advancements like GPT-3 and their implementations.
    - Ethical considerations and biases in language models.
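
The self-attention item in the outline above can be made concrete with a minimal sketch. It uses plain Python lists rather than PyTorch tensors so that each arithmetic step is visible, and the function names are illustrative rather than part of any library. It covers a single attention head with no masking and no learned projections, which is a simplification of what a full Transformer layer does.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: a weighted average of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Two tokens with 2-dimensional queries, keys and values (made-up numbers)
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
attended = scaled_dot_product_attention(queries, keys, values)
print(attended)
```

Each output row is a convex combination of the value vectors, so a query that matches the first key more strongly yields an output closer to the first value.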


In [None]:
%pip install transformers[torch] datasets

In [None]:
import torch
from torch.utils.data import DataLoader
from transformers import (
    DataCollatorWithPadding,
    RobertaModel,
    RobertaTokenizer,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from tqdm.auto import tqdm
from datasets import load_dataset

Load the dataset:

In [None]:
# Load the dataset
dataset = load_dataset("stanfordnlp/sst2")
print(dataset)

Load model with tokenizer:

In [None]:
# Load the tokenizer and model
tokenizer: RobertaTokenizer = RobertaTokenizer.from_pretrained("roberta-base")
roberta = RobertaModel.from_pretrained("roberta-base")

Inspect the model:

In [None]:

print(roberta)

Prepare the dataset:

In [None]:
# Tokenize the dataset
def tokenize(examples):
    return tokenizer(examples["sentence"], truncation=True, return_attention_mask=False)


# Prepare the datasets
tokenized_datasets = dataset.map(tokenize, batched=True, num_proc=10)

train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]
val_dataset = tokenized_datasets["validation"]

# Use a subset for quick training
train_dataset = train_dataset.shuffle(seed=42).select(range(1000))
test_dataset = test_dataset.shuffle(seed=42).select(range(100))
val_dataset = val_dataset.shuffle(seed=42).select(range(100))

print(train_dataset.features)
print(train_dataset[0])

In [None]:
# Call model on the first example
output = roberta(input_ids=torch.tensor([train_dataset[0]["input_ids"]]))
print(output.last_hidden_state.shape)
print(output.pooler_output.shape)

The model's output is a contextualized representation of the input and can therefore be used as an embedding layer in your own neural network:

In [None]:
class SentimentAnalysisModel(torch.nn.Module):
    def __init__(self, embedder, freeze_embedder=True):
        super().__init__()
        # We are using the transformer model as embedder
        self.embedder = embedder
        # Freeze the embedder
        if freeze_embedder:
            for param in self.embedder.parameters():
                param.requires_grad = False
        # We add a linear layer on top of the embedder
        self.classifier = torch.nn.Linear(embedder.config.hidden_size, 2)

    def forward(self, **model_inputs):
        # Pass the inputs to the model to produce an embedding
        embeddings = self.embedder(**model_inputs)
        # Pass the embedding through the classifier
        # Here we use pooler_output as the sentence representation; this choice depends on the model.
        # The first-token state, last_hidden_state[:, 0], is a common alternative.
        output = self.classifier(embeddings.pooler_output)
        return output


model = SentimentAnalysisModel(roberta)
model(input_ids=torch.tensor([train_dataset[0]["input_ids"]]))

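The model accepts several inputs. `input_ids` are the token indices produced by the tokenizer. `attention_mask` marks real tokens with 1 and padding with 0, so that padded positions are ignored by attention. `token_type_ids` distinguish the two segments of a sentence pair in BERT; RoBERTa does not use them.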
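
How `input_ids` and `attention_mask` relate in a padded batch can be sketched in plain Python. The helper `pad_batch` is hypothetical, the token ids are made up, and the padding id of 1 is an assumption that should be read from `tokenizer.pad_token_id` in practice.

```python
def pad_batch(sequences, pad_token_id):
    """Right-pad token id lists to a common length and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = []
    attention_mask = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding that attention should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids for two sentences of different length
batch = pad_batch([[0, 100, 200, 2], [0, 300, 2]], pad_token_id=1)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Passing the mask along with the ids, as in `model(input_ids=..., attention_mask=...)`, keeps the padded positions from influencing the representations of the real tokens.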

Dataloader for batching:

In [None]:
def collate_fn(features):
    # We need to pad the input to make sure all sentences have the same length
    input_ids = torch.nn.utils.rnn.pad_sequence(
        [torch.tensor(f["input_ids"]) for f in features],
        batch_first=True,
        padding_value=tokenizer.pad_token_id,
    )
    labels = torch.tensor([f["label"] for f in features])
    batch = {
        "input_ids": input_ids,
        "labels": labels,
    }
    if "attention_mask" in features[0]:
        batch["attention_mask"] = torch.nn.utils.rnn.pad_sequence(
            [torch.tensor(f["attention_mask"]) for f in features],
            batch_first=True,
            padding_value=0,
        )
    return batch


train_dataloader = DataLoader(
    train_dataset, batch_size=8, shuffle=True, collate_fn=collate_fn
)

for batch in train_dataloader:
    print(batch)
    break

In [None]:
train_dataloader = DataLoader(
    train_dataset.with_format(columns=["input_ids", "label"]),
    batch_size=8,
    shuffle=True,
    collate_fn=DataCollatorWithPadding(tokenizer, padding="longest"),
)

for batch in train_dataloader:
    print(batch)
    break

In [None]:
val_dataloader = DataLoader(
    val_dataset.with_format(columns=["input_ids", "label"]),
    batch_size=8,
    shuffle=False,
    collate_fn=DataCollatorWithPadding(tokenizer, padding="longest"),
)

In [None]:
# Evaluate the model
def evaluate(model, dataloader):
    with torch.no_grad():
        model.eval()
        total = 0
        correct = 0
        for batch in tqdm(dataloader, desc="Evaluation", leave=False):
            input_ids = batch["input_ids"]
            attention_mask = batch["attention_mask"]
            labels = batch["labels"]
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predicted = torch.argmax(outputs, dim=1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    return correct / total

print(f"Evaluation accuracy on validation data: {evaluate(model, val_dataloader)}")

Training loop:

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
num_epochs = 10
metric_dict = {"Loss": "-", "Val Acc": evaluate(model, val_dataloader)}

with tqdm(
    total=num_epochs * len(train_dataloader), desc="Training", unit="batch"
) as pbar:
    for epoch in range(num_epochs):
        # Set the model in training mode
        model.train()
        for batch in train_dataloader:
            input_ids = batch["input_ids"]
            attention_mask = batch["attention_mask"]
            labels = batch["labels"]
            output = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = loss_fn(output, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            metric_dict["Loss"] = loss.item()
            pbar.set_postfix(metric_dict)
            pbar.update(1)
        metric_dict["Val Acc"] = evaluate(model, val_dataloader)
        pbar.set_postfix(metric_dict)
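
What `torch.nn.CrossEntropyLoss` computes in the loop above, with its default mean reduction, can be sketched in plain Python: a log-softmax over the logits followed by the negative log-likelihood of the correct class, averaged over the batch. The function name and the example logits are illustrative and not taken from the notebook's data.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-likelihood of the correct class for a batch of logits."""
    losses = []
    for row, label in zip(logits, labels):
        # log(sum(exp(row))) computed stably by subtracting the row maximum
        highest = max(row)
        log_sum_exp = highest + math.log(sum(math.exp(x - highest) for x in row))
        # -log softmax(row)[label] = log_sum_exp - row[label]
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


# Two examples with two classes each (made-up logits)
example_logits = [[2.0, 0.5], [0.1, 1.2]]
example_labels = [0, 1]
print(cross_entropy(example_logits, example_labels))
```

The loss shrinks as the logit of the correct class grows relative to the others, which is the quantity each `optimizer.step()` tries to reduce.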

Alternatively, the `transformers` library ships task-specific model classes and a `Trainer` that provide the classification head and the training loop for you:

In [None]:
# RobertaForSequenceClassification can be used for text classification tasks such as sentiment analysis.
# It adds a sequence classification head (a small feed-forward network) on top of the RoBERTa encoder
# and outputs one logit per label.
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Define the training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    optim="sgd",  # plain SGD; remove this line to use the Trainer's default AdamW optimizer
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    use_cpu=True,
)

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,  # the SST-2 test split has no public labels (label = -1)
    processing_class=tokenizer,  # enables padding of batches
)

# Train the model
trainer.train()

# Evaluate the model
trainer.evaluate()
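
Without a `compute_metrics` function, `trainer.evaluate()` reports the evaluation loss and runtime statistics but no accuracy. A minimal sketch of such a function follows. It assumes the `Trainer` hands it a pair of logits and labels, and it uses plain Python iteration so that it also runs on nested lists; the example values are made up. It would be supplied as `Trainer(..., compute_metrics=compute_metrics)`.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # Index of the largest logit in each row is the predicted class
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == int(y)) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Three made-up examples with two classes each
metrics = compute_metrics(([[0.1, 0.9], [2.0, -1.0], [0.3, 0.4]], [1, 0, 0]))
print(metrics)
```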