# Language Processing

In this notebook, we explore text classification and entity recognition. 

We start by introducing the main architecture currently used to process texts: the Transformer.

## The Transformer

Published in the paper [Attention is All you Need](https://arxiv.org/abs/1706.03762) by Vaswani et al. (2017), the Transformer and its variants have become the main neural network architecture across data modalities. The Transformer is [natively implemented in PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html) as a network model.

In our introduction to this architecture, we borrow from the following sources:
1. [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).
2. [A Very Gentle Introduction to Large Language Models without the Hype](https://mark-riedl.medium.com/a-very-gentle-introduction-to-large-language-models-without-the-hype-5f67941fa59e).
3. [Tutorial 6: Transformers and Multi-Head Attention](https://uvadlc-notebooks.readthedocs.io/en/latest/tutorial_notebooks/tutorial6/Transformers_and_MHAttention.html).
4. [The Illustrated GPT2](https://jalammar.github.io/illustrated-gpt2/).


### Language Models

**Language models** allow us to represent textual inputs numerically, all the while storing information about all layers of the linguistic stack, from syntax to semantics, and more.

Every language model starts with the need to represent text numerically. This step is often called **tokenization**: the slicing up of a text into units that are assigned a vector representation, or **embedding**, before being further processed by the language model. The most widely used tokenizers nowadays work at the subword level, i.e., short sequences of characters. Many different alternatives are possible.
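
To make subword tokenization concrete, here is a minimal sketch of a greedy longest-match tokenizer in the style of WordPiece. The vocabulary, the `##` continuation marker and the function names are assumptions chosen for illustration; real tokenizers learn their vocabulary from a large corpus.

```python
# Toy greedy longest-match subword tokenizer (WordPiece-style illustration).
# The vocabulary below is hypothetical and chosen only for this example.
VOCAB = ["[UNK]", "token", "##ization", "##ize", "##s", "un", "##believ", "##able", "the"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize_word(word):
    """Split one lowercase word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first and shrink it until a piece is found
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in TOKEN_TO_ID:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

def tokenize(text):
    """Lowercase, split on whitespace, then split each word into subwords."""
    return [piece for word in text.lower().split() for piece in tokenize_word(word)]

tokens = tokenize("the tokenization unbelievable")
ids = [TOKEN_TO_ID[t] for t in tokens]  # integer IDs index into the embedding table
print(tokens)
print(ids)
```

Each integer ID would then select one row of an embedding matrix, which is the vector representation the model actually processes.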

Once a text has been tokenized, the language model is trained through **self-supervision**: large quantities of human-generated texts are used with words masked at random or sequentially, and the model is trained on the task of predicting the missing words. This simple approach turns out to be extremely powerful in practice. See [2] for more.

Random masking:

<img src="figures/mask1.png" width="400px" heigth="400px">

Sequential masking, or next word prediction:

<img src="figures/mask2.png" width="400px" heigth="400px">
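
The two masking strategies can be sketched in a few lines of plain Python. This is only an illustration of how training examples could be derived from raw text: the function names, the `[MASK]` placeholder and the masking probability are assumptions, and real pipelines operate on token IDs rather than on whole words.

```python
import random

def random_masking(words, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Hide a random subset of words; the model must recover the hidden ones."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, word in enumerate(words):
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets[i] = word  # position -> word the model should predict
        else:
            masked.append(word)
    return masked, targets

def next_word_pairs(words):
    """Sequential masking: pair each prefix with the word that follows it."""
    return [(words[:i], words[i]) for i in range(1, len(words))]

sentence = "the cat sat on the mat".split()

masked, targets = random_masking(sentence, mask_prob=0.3)
print(masked, targets)

for context, target in next_word_pairs(sentence):
    print(" ".join(context), "->", target)
```

No manual annotation is needed in either case: the text itself provides both the input and the expected output.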

"GPT stands for **Generative Pre-trained Transformer**. Let’s break this down:

* Generative. The model is capable of generating continuations to the provided input. That is, given some text, the model tries to guess which words come next.
* Pre-trained. The model is trained on a very large corpus of general text and is meant to be trained once and used for a lot of different things without needing to be re-trained from scratch.

A **transformer** is a particular type of deep learning model that transforms the encoding in a particular way that makes it easier to guess the blanked out word. At the heart of a transformer is the classical **encoder-decoder network**. The encoder does a very standard encoding process. But then it adds something else called **self-attention**.

A transformer learns which words in an input sequence are related and then creates a new encoding for each position in the input sequence that is a merger of all the related words." [2]

<img src="figures/transformer_architecture.svg" width="400px" heigth="800px">

### Encoder-Decoder Architectures

<img src="figures/encoderdecoder1.png" width="400px" heigth="400px">

The basic idea of Encoder-Decoder architectures originates from the need to deal with so-called *"sequence to sequence tasks"*, e.g., translation. The input, for example your text in French, is first *encoded*, and then the output, for example a translation in English, is generated by the *decoder* on the basis of the encoded input. It turns out that these architectures are also generally useful when you want to encode a query, and generate an answer, like ChatGPT does.

<img src="figures/encoderdecoder3.png" width="500px" heigth="300px">

In the encoder and the decoder, we can use a variety of modules. 

With transformers, feed-forward layers are used in conjunction with self-attention layers. When the encoded input is passed to the decoder, another form of attention is applied, called cross-attention:

<img src="figures/encoderdecoder2.png" width="500px" heigth="300px">

In cross-attention, the queries (*Q*) come from the decoder hidden states, while the keys (*K*) and values (*V*) come from the encoder hidden states. We will see next what this means.

### Attention

<img src="figures/attention1.png" width="500px" heigth="400px">

The key idea of attention is to allow the model to look at, or attend to, any relevant part of the input when making a prediction, for example when predicting a masked word, or translating a single word into another language.

<img src="figures/attention2.png" width="400px" heigth="400px">

Attention units can be visualized and show how a model might consider different parts of the input at a given step, with varied intensity based on how relevant each part of the input is to the prediction at this step.

#### Self-Attention

Attention layers work by introducing three sets of parameters: *queries Q, keys K, and values V*.

Following [1], we have that at each calculation of self-attention:

1. The first step is to create three vectors from each of the encoder’s input vectors (in this case, the embedding of each word). So for each word, we create a query vector, a key vector, and a value vector. These vectors are created by multiplying the embedding by three weight matrices, one each for queries, keys, and values, which are learned during the training process.
2. The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention for the first word in this example, “Thinking”. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
3. The third step scales the scores (dividing them by the square root of the key dimension) and applies the softmax, which turns them into positive weights that sum to 1 across the input words.
4. The next step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and reduce the weight of irrelevant words.
5. The fifth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).

<img src="figures/selfattention.png" width="500px" heigth="800px">

These operations are easily implemented as matrix multiplications. 
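
As a rough illustration, the five steps above can be written with NumPy for a single attention head. The dimensions and the randomly initialised weight matrices `W_q`, `W_k`, `W_v` are assumptions made for this sketch; in a trained model they are learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence and one head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v               # step 1: queries, keys, values
    scores = Q @ K.T                                  # step 2: score every word against every word
    weights = softmax(scores / np.sqrt(K.shape[-1]))  # step 3: scale, then softmax per row
    return weights @ V, weights                       # steps 4-5: weighted sum of value vectors

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 3                      # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))              # one embedding per input word
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)          # one output vector per input position
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Row `i` of `weights` shows how strongly position `i` attends to every position in the input, which is what attention visualizations display.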

Note again that **cross-attention** works in a similar way, except that the queries come from the decoder, whilst the keys and values come from the encoder. Lastly, note that the transformer uses **multi-head attention**: several attention modules, or heads, run in parallel in each layer, and training determines how each of them is actually used. This gives the model much more capacity.
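
The following sketch, again in NumPy, combines both ideas: several heads with their own projections run in parallel, and the same function performs self-attention or cross-attention depending on where the queries and the keys/values come from. As described in the Encoder-Decoder section, cross-attention takes queries from the decoder states and keys and values from the encoder states. All sizes, names and random weights are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention."""
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V

def multi_head_attention(X_q, X_kv, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    outputs = [attention(X_q @ W_q, X_kv @ W_k, X_kv @ W_v) for W_q, W_k, W_v in heads]
    # Concatenate the head outputs and mix them with an output projection
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))

encoder_states = rng.normal(size=(5, d_model))  # 5 source tokens
decoder_states = rng.normal(size=(3, d_model))  # 3 target tokens generated so far

# Self-attention: queries, keys and values all come from the same sequence
self_out = multi_head_attention(encoder_states, encoder_states, heads, W_o)
# Cross-attention: queries from the decoder, keys and values from the encoder
cross_out = multi_head_attention(decoder_states, encoder_states, heads, W_o)
print(self_out.shape, cross_out.shape)
```

The output always has one row per query position, so cross-attention yields one vector per decoder position, each summarising the relevant parts of the encoded input.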

*See section 6.2, "How Does Self-Attention Work?", in [2], as well as [1], for an in-depth discussion.*

Note that the transformer is a general purpose architecture. It has been used in encoder-only architectures (e.g., BERT: Bidirectional Encoder Representations from Transformers), in vision architectures (e.g., ViT: Vision Transformer), and more.

Also note that further techniques are used to improve GPTs after language model training:
* **Instruction tuning**: basically, provide correct answers to the question you asked (i.e., supervised learning).
* **Reinforcement learning from human feedback**: provide signal on whether the answer is good or bad (e.g., thumbs up or down). This is used as signal by the LLM to improve itself.

See [4] for more details on these aspects.

*This is a very quick introduction to the transformer, aiming only at providing intuition. In most cases, as we will see below, transformers are used as pre-trained models, that we then fine-tune or otherwise adjust to our specific tasks.*

In [139]:
from sklearn.preprocessing import LabelEncoder
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForSequenceClassification, DataCollatorWithPadding
from torch.utils.data import DataLoader, Dataset
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
from torch.optim import AdamW
from sklearn.model_selection import train_test_split
import numpy as np

## Imports for plotting
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.colors import to_rgba
import seaborn as sns
sns.set_theme('notebook', style='whitegrid')

## Progress bar
from tqdm.notebook import tqdm

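# Set device: use CUDA or Apple MPS when available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")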

In [140]:
import torch
torch.manual_seed(42) # Setting the seed

print("Using torch", torch.__version__)

Using torch 2.2.1


## Text classification

In this exercise, we are given a dataset of digitized books from the British Library. These books belong to different genres (e.g., poetry or prose). We are interested in training a classifier to distinguish between such genres. [Read more about this dataset here](https://github.com/mromanello/ADA-DHOxSS/tree/master/data#british-library-19th-century-books).

In [141]:
import pandas as pd

In [142]:
df_books = pd.read_csv('data/bl_books/sample_tidy/df_book.csv')
df_texts = pd.read_csv('data/bl_books/sample_tidy/df_book_text.csv')

In [143]:
df_books.head(2)

Unnamed: 0,datefield,publisher,title,edition,place,issuance,first_pdf,number_volumes,identifier,fulltext_filename,type,genre
0,1841.0,Privately printed,"The Poetical Aviary, with a bird's-eye view of...",,Calcutta,monographic,lsidyv35c55757,1,196,000000196_01_text.json,poet,Poetry
1,1888.0,Rivingtons,A History of Greece. Part I. From the earliest...,,London,monographic,lsidyv376da437,1,4047,000004047_01_text.json,story,Prose


In [144]:
df_books.genre.value_counts()

genre
Music     119
Poetry    114
Drama     111
Prose     108
Name: count, dtype: int64

In [145]:
df = df_books.merge(df_texts, on='fulltext_filename')

In [146]:
df.head(2)

Unnamed: 0,datefield,publisher,title,edition,place,issuance,first_pdf,number_volumes,identifier,fulltext_filename,type,genre,fulltext,book_id
0,1841.0,Privately printed,"The Poetical Aviary, with a bird's-eye view of...",,Calcutta,monographic,lsidyv35c55757,1,196,000000196_01_text.json,poet,Poetry,"THE POETICAL AVIARY, WITH A B I R D'S-E ...",196
1,1888.0,Rivingtons,A History of Greece. Part I. From the earliest...,,London,monographic,lsidyv376da437,1,4047,000004047_01_text.json,story,Prose,HISTORY OF GREECE ABBOTT A HISTORY OF G...,4047


In [147]:
# divide all the data into training and testing

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

In [148]:
# Encode string labels into integers using LabelEncoder from sklearn
# The 'genre' column in both the training and testing dataframes contains string labels
label_encoder = LabelEncoder()
# Fit the encoder on the training labels and transform them to integers
train_df['label'] = label_encoder.fit_transform(train_df['genre'])
# Apply the same transformation on the test labels
test_df['label'] = label_encoder.transform(test_df['genre'])

In [149]:
# Define a custom dataset class for our text data
class TextDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.texts = dataframe['fulltext'].values  # Get the 'fulltext' column from the dataframe
        self.labels = dataframe['label'].values  # Get the 'label' column (which now contains integers)
        self.tokenizer = tokenizer  # BERT tokenizer passed as an argument
        self.max_len = max_len  # Maximum length for the tokenized input

    def __len__(self):
        return len(self.texts)  # Returns the number of samples in the dataset

    def __getitem__(self, index):
        text = self.texts[index]  # Get the text at the given index
        label = self.labels[index]  # Get the label corresponding to the text

        # Tokenize the text
        # encode_plus returns a dictionary with input_ids, attention_mask, etc.
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,  # Add [CLS] and [SEP] tokens
            max_length=self.max_len,  # Upper bound on the sequence length (used for truncation)
            return_token_type_ids=False,  # Single-segment input, so token type IDs are not needed
            padding='longest',  # No effect on a single text; batches are padded later by the data collator
            return_attention_mask=True,  # Return the attention mask (1 for real tokens, 0 for padding)
            return_tensors='pt',  # Return PyTorch tensors
            truncation=True  # Truncate sequences longer than max_length
        )

        # Return the input ids, attention mask, and label for this sample
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)  # Convert label to tensor
        }

# Function to train the model
def train_model(model, data_loader, optimizer, loss_fn, device, scheduler, n_examples):
    model = model.train()  # Set the model to training mode
    total_loss = 0  # Initialize the total loss
    correct_predictions = 0  # Initialize correct predictions count

    for batch in data_loader:
        # Move the data to the device (GPU or CPU)
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Forward pass through the model
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )

        loss = outputs.loss  # Get the loss from the output
        logits = outputs.logits  # Get the raw output (logits)

        # Backward pass and optimization
        optimizer.zero_grad()  # Zero out the gradients
        loss.backward()  # Perform backpropagation
        optimizer.step()  # Update model parameters
        scheduler.step()  # Update the learning rate based on the scheduler

        total_loss += loss.item()  # Accumulate the total loss
        _, preds = torch.max(logits, dim=1)  # Get the predicted labels (the highest logit)
        correct_predictions += torch.sum(preds == labels)  # Count correct predictions

    return correct_predictions.float() / n_examples, total_loss / len(data_loader)  # Return accuracy and average loss

# Function to evaluate the model on the validation set
def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()  # Set the model to evaluation mode
    total_loss = 0  # Initialize the total loss
    correct_predictions = 0  # Initialize correct predictions count

    with torch.no_grad():  # Disable gradient calculation (we're not training)
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)  # Move data to the device
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            # Forward pass
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )

            loss = outputs.loss  # Get the loss
            logits = outputs.logits  # Get the raw output (logits)

            total_loss += loss.item()  # Accumulate the loss
            _, preds = torch.max(logits, dim=1)  # Get the predicted labels
            correct_predictions += torch.sum(preds == labels)  # Count correct predictions

    return correct_predictions.float() / n_examples, total_loss / len(data_loader)  # Return accuracy and average loss

In [150]:
# Load pre-trained BERT tokenizer and model
PRE_TRAINED_MODEL_NAME = 'bert-base-uncased'  # Using the uncased version of BERT
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME, clean_up_tokenization_spaces=True)  # Load tokenizer

# Initialize a data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Hyperparameters
FREEZE = False  # Whether to freeze the BERT layers or not
MAX_LEN = 512  # Maximum length of input sequences
BATCH_SIZE = 16  # Batch size for training and evaluation
EPOCHS = 10  # Number of epochs to train the model
LEARNING_RATE = 5e-5  # Learning rate for the optimizer

# Assume train_df and test_df are the pandas DataFrames containing the dataset
# train_df = pd.DataFrame(...)  # DataFrame containing 'fulltext' and 'label'
# test_df = pd.DataFrame(...)   # DataFrame containing 'fulltext' and 'label'

# Create datasets for training and testing using the TextDataset class
train_dataset = TextDataset(train_df, tokenizer, MAX_LEN)
test_dataset = TextDataset(test_df, tokenizer, MAX_LEN)

# Create DataLoaders to feed data into the model in batches
train_data_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn=data_collator)  # Shuffle the training data
test_data_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=data_collator)  # No need to shuffle the test data

# Initialize the pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, num_labels=len(label_encoder.classes_))
model = model.to(device)  # Move the model to the device (GPU/CPU)

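# Freeze or unfreeze the pre-trained BERT encoder according to FREEZE
# (the newly initialized classification head always stays trainable)
# NB training all layers is costly in terms of memory and computation
for param in model.bert.parameters():
    param.requires_grad = not FREEZE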

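# Define the optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)  # AdamW optimizer (recommended for BERT)
total_steps = len(train_data_loader) * EPOCHS  # Total number of training steps (not used by StepLR below)

# StepLR with step_size=1 multiplies the learning rate by gamma=0.9 at every scheduler.step() call.
# train_model calls scheduler.step() once per batch, so the rate shrinks by 10% per batch, not per epoch.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)  # Learning rate scheduler
loss_fn = nn.CrossEntropyLoss().to(device)  # Cross entropy loss (the model also computes it internally when given labels)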

# Training loop
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')  # Print the current epoch
    print('-' * 10)

    # Train the model for one epoch
    train_acc, train_loss = train_model(model, train_data_loader, optimizer, loss_fn, device, scheduler, len(train_df))
    print(f'Train loss {train_loss} accuracy {train_acc}')  # Print training loss and accuracy

    # Evaluate the model on the test set
    test_acc, test_loss = eval_model(model, test_data_loader, loss_fn, device, len(test_df))
    print(f'Test loss {test_loss} accuracy {test_acc}')  # Print test loss and accuracy

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/10
----------
Train loss 1.3777993140013323 accuracy 0.31855955719947815
Test loss 1.3249172170956929 accuracy 0.4065934121608734
Epoch 2/10
----------
Train loss 1.321768563726674 accuracy 0.371191143989563
Test loss 1.3043472369511921 accuracy 0.450549453496933
Epoch 3/10
----------
Train loss 1.3007759166800457 accuracy 0.40166205167770386
Test loss 1.3017238179842632 accuracy 0.4285714328289032
Epoch 4/10
----------
Train loss 1.3041148755861365 accuracy 0.4487534761428833
Test loss 1.3015182216962178 accuracy 0.4285714328289032
Epoch 5/10
----------
Train loss 1.3011114027189172 accuracy 0.41274237632751465
Test loss 1.3015063405036926 accuracy 0.4285714328289032
Epoch 6/10
----------
Train loss 1.3070047730984895 accuracy 0.42382270097732544
Test loss 1.3015062808990479 accuracy 0.4285714328289032
Epoch 7/10
----------
Train loss 1.2946502488592397 accuracy 0.4736842215061188
Test loss 1.301506261030833 accuracy 0.4285714328289032
Epoch 8/10
----------
Train loss 1.304159

In [151]:
# Evaluation and metrics (after training is completed)
y_true = []  # List to store true labels
y_pred = []  # List to store predicted labels

model.eval()  # Set the model to evaluation mode
with torch.no_grad():  # Disable gradient calculation (not needed for inference)
    for batch in test_data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)  # Forward pass
        _, preds = torch.max(outputs.logits, dim=1)  # Get the predicted labels

        y_true.extend(labels.cpu().numpy())  # Store true labels
        y_pred.extend(preds.cpu().numpy())  # Store predicted labels

# Print the classification report (precision, recall, F1-score)
print(classification_report(y_true, y_pred, target_names=label_encoder.classes_, zero_division=1))

              precision    recall  f1-score   support

       Drama       0.48      0.46      0.47        24
       Music       0.33      0.35      0.34        20
      Poetry       0.38      0.31      0.34        26
       Prose       0.50      0.62      0.55        21

    accuracy                           0.43        91
   macro avg       0.42      0.43      0.43        91
weighted avg       0.42      0.43      0.42        91
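
The F1-score in the report is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall values printed above; `harmonic_f1` is a small helper defined here for illustration, not a scikit-learn function.

```python
def harmonic_f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per genre, copied from the classification report above
report = {
    "Drama": (0.48, 0.46),
    "Music": (0.33, 0.35),
    "Poetry": (0.38, 0.31),
    "Prose": (0.50, 0.62),
}

for genre, (precision, recall) in report.items():
    print(f"{genre:>6}: F1 = {harmonic_f1(precision, recall):.2f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, a class only gets a high F1-score when both precision and recall are high.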



### Exercise 1 (Easy): RoBERTa

The model clearly could use some improvement. One way is to switch to a pre-trained model that was trained for longer and on more data. Consider RoBERTa:

```Python
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Change the model name to a RoBERTa pre-trained model
PRE_TRAINED_MODEL_NAME = 'roberta-base'  # You can use 'roberta-large' for a larger model

# Load the RoBERTa tokenizer and model
tokenizer = RobertaTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
model = RobertaForSequenceClassification.from_pretrained(PRE_TRAINED_MODEL_NAME, num_labels=len(label_encoder.classes_))

# Move model to the appropriate device
model = model.to(device)
```

You can remove the line ```return_token_type_ids=False```, since RoBERTa does not use token type IDs and its tokenizer does not return them by default. How does the performance change? How about the training time?

### Exercise 2 (Medium): Book type

Try using the `type` column instead of genre. Is it evenly distributed or not? How does the classifier perform with it?

## Named Entity Recognition

In this example, we are given a dataset of toponyms in 19th-century English newspapers. We are interested in developing a model to perform Named Entity Recognition, i.e., detecting mentions of entities of interest in a text. [The dataset is accessible here](https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-topres19th.md#topres19th-dataset).

This is what the data looks like:

```
#FORMAT=WebAnno TSV 3.2
#T_SP=webanno.custom.Customentity|identifiier|value

#Text=THE POOLE AND SOUTH-WESTERN HERALD, THURSDAY, OCTOBER 20, 1864.
1-1	0-3	THE	_	_	
1-2	4-9	POOLE	_	_	
1-3	10-13	AND	_	_	
1-4	14-27	SOUTH-WESTERN	_	_	
1-5	28-34	HERALD	_	_	
1-6	34-35	,	_	_	
1-7	36-44	THURSDAY	_	_	
1-8	44-45	,	_	_	
1-9	46-53	OCTOBER	_	_	
1-10	54-56	20	_	_	
1-11	56-57	,	_	_	
1-12	58-62	1864	_	_	
1-13	62-63	.	_	_	

#Text=POOLE TOWN COUNCIL.
2-1	65-70	POOLE	https://en.wikipedia.org/wiki/Poole	LOC
2-2	71-75	TOWN	_	_	
2-3	76-83	COUNCIL	_	_	
2-4	83-84	.	_	_	
```
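
Each non-comment line holds tab-separated columns: a sentence-token position, character offsets, the token, an entity link and the entity label. A minimal sketch of reading such lines is shown below, using the second sentence of the sample above; the function name and the dictionary keys are chosen for illustration.

```python
sample = """#Text=POOLE TOWN COUNCIL.
2-1\t65-70\tPOOLE\thttps://en.wikipedia.org/wiki/Poole\tLOC
2-2\t71-75\tTOWN\t_\t_\t
2-3\t76-83\tCOUNCIL\t_\t_\t
2-4\t83-84\t.\t_\t_\t
"""

def parse_webanno(text):
    """Return one dict per token line, skipping blank and '#' lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' header/comment lines
        columns = line.rstrip("\t").split("\t")  # drop the trailing tab, then split
        position, offsets, token, link, label = columns[:5]
        rows.append({"token": token, "link": link, "label": label})
    return rows

rows = parse_webanno(sample)
for row in rows:
    print(row["token"], row["label"])
```

An underscore in the last two columns means that the token is not part of any annotated entity.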

In [152]:
# First, we need to load data into memory

import os, re, random
import pandas as pd
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Path to the dataset directory
DATA_DIR = "data/topRes19th_v2"

allowed_classes = ['LOC','_']
# Create the label_to_id dictionary by enumerating the allowed_classes
label_to_id = {label: idx for idx, label in enumerate(allowed_classes)}
def strip_square_brackets(word):
    # Use regex to remove brackets and their contents
    return re.sub(r'\[.*?\]', '', word)

class NERDataset(Dataset):
    def __init__(self, data_dir, split="train", context_window=3):
        """
        Initialize the dataset by loading all the files in the specified split (train/test).
        Strips out long sequences of non-entity tokens (labeled '_'), and keeps tokens 
        with 'LOC' labels along with a context window of surrounding tokens.

        Args:
        - data_dir (str): The directory where the dataset is stored.
        - split (str): 'train' or 'test'.
        - context_window (int): The number of surrounding tokens to keep around LOC occurrences.
        """
        self.data_dir = os.path.join(data_dir, split, "annotated_tsv")
        self.files = [f for f in os.listdir(self.data_dir) if f.endswith(".tsv")]
        self.context_window = context_window  # Context window around LOC tokens
        self.data = []
        self._load_data()

    def _load_data(self):
        """ Load and parse the data from the TSV files. """
        for file in self.files:
            file_path = os.path.join(self.data_dir, file)
            with open(file_path, "r", encoding="utf-8") as f:
                tokens, labels = [], []
                for line in f:
                    if line.strip() == "" or line.startswith("#"):  # Skip headers and empty lines
                        continue
                    columns = line.strip().split("\t")
                    token = columns[2]  # The word/token itself
                    label = strip_square_brackets(columns[-1])  # The NER label (LOC, BUILDINGS, etc.)
                    if label != "_":
                        label = "LOC"  # Simplify the label to LOC
                    tokens.append(token)
                    labels.append(label)

                # Now we strip the excess '_' tokens and keep the context around 'LOC' tokens
                stripped_tokens, stripped_labels = self._strip_non_loc(tokens, labels)

                if stripped_tokens:
                    self.data.append((stripped_tokens, stripped_labels))

    def _strip_non_loc(self, tokens, labels):
        """
        Strips long sequences of tokens labeled as '_', keeping tokens with 'LOC' labels and
        a context window of surrounding tokens.

        Args:
        - tokens (List[str]): List of tokens in a sentence.
        - labels (List[str]): List of corresponding labels.

        Returns:
        - stripped_tokens (List[str]): List of tokens after stripping excess '_'.
        - stripped_labels (List[str]): List of labels after stripping excess '_'.
        """
        stripped_tokens = []
        stripped_labels = []

        loc_indices = [i for i, label in enumerate(labels) if label == "LOC"]

        if not loc_indices:
            # No LOC labels, return empty (this sentence will be dropped)
            return stripped_tokens, stripped_labels

        # Loop through each LOC token and capture its context
        for loc_idx in loc_indices:
            # Start and end indices for the context window around LOC
            start_idx = max(0, loc_idx - self.context_window)
            end_idx = min(len(tokens), loc_idx + self.context_window + 1)

            # Append the tokens and labels in the context window
            stripped_tokens.extend(tokens[start_idx:end_idx])
            stripped_labels.extend(labels[start_idx:end_idx])

        return stripped_tokens, stripped_labels

    def __len__(self):
        """ Returns the total number of articles in the dataset. """
        return len(self.data)

    def __getitem__(self, idx):
        """
        Returns the tokens and labels for the given index.
        Args:
        - idx (int): The index of the article.
        Returns:
        - tokens (List[str]): The list of tokens.
        - labels (List[str]): The corresponding NER labels for the tokens.
        """
        tokens, labels = self.data[idx]
        return tokens, labels
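
Note that in `_strip_non_loc` the context windows of two nearby LOC tokens can overlap, in which case `extend` copies the shared tokens once per window. A possible variant, sketched here as a standalone function, collects the indices to keep in a set so that every token appears at most once and in its original order. The function name `keep_context` and the short example are assumptions for illustration, not part of the dataset class above.

```python
def keep_context(tokens, labels, context_window=3, target="LOC"):
    """Keep each token lying within `context_window` positions of a target label, once."""
    keep = set()
    for i, label in enumerate(labels):
        if label == target:
            start = max(0, i - context_window)
            end = min(len(tokens), i + context_window + 1)
            keep.update(range(start, end))  # overlapping windows merge in the set
    ordered = sorted(keep)                  # restore the original token order
    return [tokens[i] for i in ordered], [labels[i] for i in ordered]

# Illustrative example with two LOC tokens whose windows overlap
tokens = ["INDIA", ".", "Calcutta", ",", "Aug", ".", "The", "Lieutenant-Governor", "left"]
labels = ["LOC", "_", "LOC", "_", "_", "_", "_", "_", "_"]

kept_tokens, kept_labels = keep_context(tokens, labels, context_window=1)
print(kept_tokens)
print(kept_labels)
```

Whether duplicates matter depends on the goal: they inflate the number of LOC labels seen during training, which in turn affects class counts and class weights.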

In [153]:
# Example usage
# Load the dataset
train_dataset = NERDataset(DATA_DIR, split="train")
test_dataset = NERDataset(DATA_DIR, split="test")

# Example: Load one article
tokens, labels = train_dataset[0]
print("Tokens:", tokens)
print("Labels:", labels)

# PyTorch DataLoader
train_loader = DataLoader(train_dataset, batch_size=1, shuffle=True)

# Iterate through the DataLoader
for batch in train_loader:
    tokens, labels = batch
    print("Batch tokens:", tokens)
    print("Batch labels:", labels)
    break  # Stop after the first batch

Tokens: ['INDIA', '.', 'Calcutta', ',', 'INDIA', '.', 'Calcutta', ',', 'Aug', '.', 'The', 'Lieutenant-Governor', 'left', 'Calcutta', 'on', 'the', '23rd', '.', 'for', 'the', 'Assam', 'districts', '.', 'The', 'The', 'Bishop', 'of', 'Calcutta', 'is', 'about', 'to', 'Provinces', 'of', 'the', 'Punjaub', '.', 'Bengal', 'is', 'the', 'Punjaub', '.', 'Bengal', 'is', 'reported', 'healthy', 'are', 'good', '.', 'Bombay', ',', 'Aug', '.', 'the', 'Bank', 'of', 'Bombay', 'the', 'directors', 'reported', 'bank', '.', 'The', 'Poonah', 'branch', 'of', 'the', 'the', 'Bank', 'of', 'Bombay', 'expected', 'to', 'lose', 'the', 'Banks', 'of', 'Bombay', 'and', 'Bengal', 'has', 'of', 'Bombay', 'and', 'Bengal', 'has', 'been', 'abandoned', 'latest', 'news', 'from', 'Cabul', 'represents', 'Afzul', 'Khans', 'is', 'expected', 'that', 'Cabul', 'itself', 'will', 'soon']
Labels: ['LOC', '_', 'LOC', '_', 'LOC', '_', 'LOC', '_', '_', '_', '_', '_', '_', 'LOC', '_', '_', '_', '_', '_', '_', 'LOC', '_', '_', '_', '_', '_', '

In [154]:
from collections import Counter
import torch

def calculate_class_weights(train_dataset, label_to_id):
    label_counts = Counter()
    
    for example in train_dataset:
        labels = list(example[1])  # Each example is a (tokens, labels) tuple; take the labels
        label_counts.update(labels)
    
    total_labels = sum(label_counts.values())
    class_weights = {}
    print(label_counts)
    
    # Inverse weighting: More frequent classes get lower weights
    for label, count in label_counts.items():
        class_weights[label_to_id[label]] = total_labels / (len(label_to_id) * count)
    
    # Convert to tensor
    class_weights = torch.tensor([class_weights[i] for i in range(len(label_to_id))], dtype=torch.float)
    
    return class_weights

# Example: Calculate class weights
class_weights = calculate_class_weights(train_dataset, label_to_id)
print(f"Class Weights: {class_weights}")
print(label_to_id)

Counter({'_': 21848, 'LOC': 18581})
Class Weights: tensor([1.0879, 0.9252])
{'LOC': 0, '_': 1}
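
The weights printed above follow directly from the formula `total / (n_classes * count)`. The short check below redoes the arithmetic with the label counts from the output, without needing the dataset.

```python
label_counts = {"_": 21848, "LOC": 18581}  # counts printed above
label_to_id = {"LOC": 0, "_": 1}

total = sum(label_counts.values())
n_classes = len(label_to_id)

# Inverse-frequency weighting: weight = total / (n_classes * count)
weights = [0.0] * n_classes
for label, count in label_counts.items():
    weights[label_to_id[label]] = total / (n_classes * count)

print(total)
print([round(w, 4) for w in weights])
```

A perfectly balanced dataset would give every class a weight of exactly 1; here the rarer LOC class gets a weight slightly above 1 and the more frequent `_` class slightly below.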


In [155]:
def collate_fn(batch, tokenizer, label_to_id, max_length=512, label_pad_token=-100):
    """
    Collate function to process batches with tokens and labels stored in tuples.
    
    Args:
    - batch: List of (tokens, labels) tuples, as returned by NERDataset.__getitem__.
    - tokenizer: Pretrained tokenizer (RobertaTokenizer).
    - label_to_id: Dictionary mapping string labels (e.g., 'LOC') to integers.
    - max_length: Maximum sequence length (default 512 for RoBERTa).
    - label_pad_token: Token to pad labels, default is -100.
    
    Returns:
    - input_ids: Tensor of tokenized input sequences.
    - labels: Tensor of padded labels.
    - attention_mask: Tensor mask to ignore padding tokens.
    """
    # Extract tokens and labels from the (tokens, labels) tuples
    tokens = [list(item[0]) for item in batch]
    labels = [list(item[1]) for item in batch]

    # Tokenize the tokens with truncation and padding
    tokenized_inputs = tokenizer(tokens, 
                                 is_split_into_words=True, 
                                 padding=True, 
                                 truncation=True, 
                                 max_length=max_length,  # Ensure max length, also drops what is longer!
                                 return_tensors="pt")
    
    # Map string labels to integers and pad them to match input length
    max_len = tokenized_inputs['input_ids'].shape[1]  # Get max length from tokenized input
    padded_labels = []
    for label_seq in labels:
        # Convert string labels to IDs
        label_ids = [label_to_id.get(label, label_pad_token) for label in label_seq]
        padded_label = label_ids[:max_len] + [label_pad_token] * (max_len - len(label_ids))  # Truncate and pad
        padded_labels.append(padded_label)
    
    # Convert to tensors
    padded_labels = torch.tensor(padded_labels, dtype=torch.long)
    
    return tokenized_inputs['input_ids'], padded_labels, tokenized_inputs['attention_mask']
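
The tokenizer splits words into subword tokens and adds special tokens, so word-level labels should ideally be aligned with the tokenized sequence. A common approach relies on the word index of each subword, which fast Hugging Face tokenizers expose through `word_ids()` on the encoded batch (this is an assumption here, since the notebook loads the slow `RobertaTokenizer`). The helper `align_labels` below is hypothetical and works on a plain list of word indices, so the logic can be checked without a tokenizer.

```python
def align_labels(word_ids, word_labels, label_to_id, label_pad_token=-100):
    """Spread word-level labels over subword positions.

    word_ids: one entry per subword token; None marks special/padding tokens,
              otherwise the index of the word the token came from.
    word_labels: one string label per word.
    """
    aligned = []
    for word_id in word_ids:
        if word_id is None or word_id >= len(word_labels):
            aligned.append(label_pad_token)  # ignored by the loss
        else:
            aligned.append(label_to_id.get(word_labels[word_id], label_pad_token))
    return aligned

# Hypothetical sentence of three words, where the second word is split into two subwords
word_ids = [None, 0, 1, 1, 2, None, None]   # <s> w0 w1a w1b w2 </s> <pad>
word_labels = ['_', 'LOC', '_']
label_to_id = {'LOC': 0, '_': 1}

aligned = align_labels(word_ids, word_labels, label_to_id)
print(aligned)  # [-100, 1, 0, 0, 1, -100, -100]
```

This variant labels every subword of a word. Another common choice is to label only the first subword of each word and set the remaining ones to `-100`.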

In [156]:
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import RobertaTokenizer, RobertaForTokenClassification

# Load pre-trained RoBERTa model and tokenizer
# num_labels must match the number of class weights and the logits reshape in the loss below
model = RobertaForTokenClassification.from_pretrained('roberta-base', num_labels=len(label_to_id))
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', clean_up_tokenization_spaces=True)

BATCH_SIZE = 50  # Adjust batch size as necessary
EPOCHS = 10  # Adjust epochs as necessary

# Create PyTorch datasets and loaders
train_dataset = NERDataset(DATA_DIR, split="train")
test_dataset = NERDataset(DATA_DIR, split="test")
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=lambda batch: collate_fn(batch, tokenizer, label_to_id))
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=lambda batch: collate_fn(batch, tokenizer, label_to_id))

model.to(device)  # Move model to device

# Define the weighted loss function
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights.to(device), ignore_index=-100)

# Optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Ensure all layers are trainable
# NB this is costly in terms of memory and computation
for param in model.parameters():
    param.requires_grad = True

# Training loop with device placement
model.train()
for epoch in range(EPOCHS):  # Adjust epochs as necessary
    for tokens, labels, mask in train_loader:

        optimizer.zero_grad()

        # Move data to device
        tokens = tokens.to(device)
        labels = labels.to(device)
        mask = mask.to(device)

        # Forward pass
        outputs = model(input_ids=tokens, attention_mask=mask)
        
        # Compute loss with weighted CrossEntropyLoss
        loss = loss_fn(outputs.logits.view(-1, len(label_to_id)), labels.view(-1))
        
        # Backward pass
        loss.backward()
        optimizer.step()

    # Report the loss of the last batch in this epoch (not the epoch average)
    print(f'Epoch {epoch + 1}, Loss: {loss.item()}')

Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1, Loss: 0.6667300462722778
Epoch 2, Loss: 0.6363546848297119
Epoch 3, Loss: 0.6143303513526917
Epoch 4, Loss: 0.6364577412605286
Epoch 5, Loss: 0.6191195249557495
Epoch 6, Loss: 0.6238633394241333
Epoch 7, Loss: 0.5848383903503418
Epoch 8, Loss: 0.5780908465385437
Epoch 9, Loss: 0.5862990617752075
Epoch 10, Loss: 0.558539628982544
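
The loss printed for each epoch is that of the last batch only, which makes the curve noisy. A common alternative is to report the mean loss over all batches of the epoch. The sketch below shows only the bookkeeping, with made-up batch losses standing in for `loss.item()`.

```python
# Hypothetical per-batch losses for one epoch (stand-ins for loss.item())
batch_losses = [0.70, 0.66, 0.64, 0.60]

running_loss = 0.0
n_batches = 0
for batch_loss in batch_losses:
    running_loss += batch_loss   # in the training loop: running_loss += loss.item()
    n_batches += 1

mean_loss = running_loss / n_batches
print(f'Mean loss: {mean_loss:.4f}')  # Mean loss: 0.6500
```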


In [157]:
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np

# Prepare lists to accumulate predictions and true labels across batches
all_preds = []
all_labels = []

# Evaluation on the test set with device placement
model.eval()
with torch.no_grad():
    for tokens, labels, mask in test_loader:
        # Move data to device
        tokens = tokens.to(device)
        labels = labels.to(device)
        mask = mask.to(device)

        # Get model predictions
        outputs = model(input_ids=tokens, attention_mask=mask)
        predictions = torch.argmax(outputs.logits, dim=-1)

        # Convert predictions and labels back to CPU and flatten the arrays
        all_preds.append(predictions.cpu().numpy().flatten())
        all_labels.append(labels.cpu().numpy().flatten())

# Concatenate all batches into a single array
all_preds = np.concatenate(all_preds)
all_labels = np.concatenate(all_labels)

# Filter out padding tokens (ignore -100 in labels)
mask = all_labels != -100
all_preds = all_preds[mask]
all_labels = all_labels[mask]

# Get the class names (in the same order as `label_to_id`)
class_names = [x[0] for x in sorted(label_to_id.items(), key=lambda item: item[1])]

# Confusion matrix (rows: true labels, columns: predicted labels)
conf_matrix = confusion_matrix(all_labels, all_preds)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report with macro and weighted averages
report = classification_report(all_labels, all_preds, target_names=class_names, zero_division=0)
print("\nClassification Report:")
print(report)

Confusion Matrix:
[[1497 1912]
 [1815 5354]]

Classification Report:
              precision    recall  f1-score   support

         LOC       0.45      0.44      0.45      3409
           _       0.74      0.75      0.74      7169

    accuracy                           0.65     10578
   macro avg       0.59      0.59      0.59     10578
weighted avg       0.65      0.65      0.65     10578
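
To see how the report relates to the confusion matrix, the sketch below recomputes precision, recall, F1 and accuracy in plain Python from the matrix printed above (rows are true labels, columns are predictions). The variable names are illustrative.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
conf_matrix = [[1497, 1912],
               [1815, 5354]]
class_names = ['LOC', '_']

total = sum(sum(row) for row in conf_matrix)
correct = sum(conf_matrix[i][i] for i in range(len(class_names)))

metrics = {}
for i, name in enumerate(class_names):
    tp = conf_matrix[i][i]
    predicted = sum(row[i] for row in conf_matrix)  # column sum
    actual = sum(conf_matrix[i])                    # row sum (support)
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = (round(precision, 2), round(recall, 2), round(f1, 2))
    print(name, metrics[name], actual)

accuracy = round(correct / total, 2)
print('accuracy', accuracy)
# LOC (0.45, 0.44, 0.45) 3409
# _ (0.74, 0.75, 0.74) 7169
# accuracy 0.65
```

Fewer than half of the true `LOC` tokens are recovered (1497 out of 3409), which is what the low recall for that class reflects.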



---

### Exercise 1 (Easy)

The results are not outstanding, but there is a lot we can tweak and try out to make them better. Try changing the pretrained model or configuring the hyperparameters differently, and see if you can improve upon the results.

This model is computationally intensive to train. If training takes too long, reduce the amount of data you use to train it.

### Exercise 2 (Easy)

Experiment with the context window parameter. What happens when you reduce it or when you increase it?

### Exercise 3 (Medium)

The original dataset contains more classes than `LOC`. First, check them and see how frequent they are. Second, keep them in the dataset and train the model again using multiple classes. How does the model performance change?

---