# Natural Language Processing
![](https://i.imgur.com/qkg2E2D.png)

## Assignment 004 - BERT-based NER Tagger

> Notebook by:
> - NLP Course Staff

## Revision History

| Version | Date       | User        | Content / Changes                                                   |
|---------|------------|-------------|---------------------------------------------------------------------|
| 0.1.000 | 09/06/2024 | course staff| First version                                                       |


## Overview
In this assignment, you will continue the work from Assignment 3: you will build a complete training and testing pipeline for a neural sequence tagger for named entities, this time using BERT.

**This assignment is not mandatory: we will take the best 3 out of 4 grades, but we recommend doing it.**

## Dataset
You will work with the ReCoNLL 2003 dataset, a corrected version of the [CoNLL 2003 dataset](https://www.clips.uantwerpen.be/conll2003/ner/):

**Click on those links so you have access to the data!**
- [Train data](https://drive.google.com/file/d/1CqEGoLPVKau3gvVrdG6ORyfOEr1FSZGf/view?usp=sharing)

- [Dev data](https://drive.google.com/file/d/1rdUida-j3OXcwftITBlgOh8nURhAYUDw/view?usp=sharing)

- [Test data](https://drive.google.com/file/d/137Ht40OfflcsE6BIYshHbT5b2iIJVaDx/view?usp=sharing)

As you will see, the annotated texts are labeled according to the `IOB` annotation scheme (more on this below), for 3 entity types: Person, Organization, Location.

## Your Implementation

Please create a local copy of this template Colab notebook:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1KGkObwUn5QQm_v0nB0nAUlB4YrwThuzl#scrollTo=Z-fCqGh9ybgm)

The assignment's instructions are there; follow the notebook.

## Submission
- **Notebook Link**: Add the URL to your assignment's notebook in the `notebook_link.txt` file, following the format provided in the example.
- **Access**: Ensure the link has edit permissions enabled to allow modifications if needed.
- **Deadline**: <font color='green'>27/06/2024</font>.
- **Platform**: Continue using GitHub for submissions. Push your project to the team repository and monitor the test results under the actions section.

Good Luck 🤗




## NER Schemes

### IO
- **Description**: The simplest scheme for named entity recognition (NER).
- **Tags**:
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
- **Limitation**: Cannot correctly encode consecutive entities of the same type.

### IOB (BIO)
- **Description**: Adopted by the Conference on Computational Natural Language Learning (CoNLL).
- **Tags**:
  - `B`: Beginning of a named entity.
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
- **Advantage**: Can encode the boundaries of consecutive entities.

### IOE
- **Description**: Similar to IOB, but indicates the end of an entity.
- **Tags**:
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
  - `E`: End of a named entity.
- **Advantage**: Focuses on the end boundary of entities.

### IOBES
- **Description**: An extension of IOB with additional boundary information.
- **Tags**:
  - `B`: Beginning of a named entity.
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
  - `E`: End of a named entity.
  - `S`: Single-token named entity.
- **Advantage**: Provides more detailed boundary information for named entities.

### BI
- **Description**: Tags entities similarly to IOB and labels the beginning of non-entity words.
- **Tags**:
  - `B`: Beginning of a named entity.
  - `I`: Inside a named entity.
  - `B-O`: Beginning of a non-entity word.
  - `I-O`: Inside a non-entity word.
- **Advantage**: Distinguishes the beginning of non-entity sequences.

### IE
- **Description**: Similar to IOE, but additionally marks the end of non-entity sequences.
- **Tags**:
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
  - `E`: End of a named entity.
  - `E-O`: End of a non-entity word.
  - `I-O`: Inside a non-entity word.
- **Advantage**: Highlights the end of non-entity sequences.

### BIES
- **Description**: Encodes both entities and non-entity words using the IOBES method.
- **Tags**:
  - `B`: Beginning of a named entity.
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
  - `E`: End of a named entity.
  - `S`: Single-token named entity.
  - `B-O`: Beginning of a non-entity word.
  - `I-O`: Inside a non-entity word.
  - `S-O`: Single non-entity token.
- **Advantage**: Comprehensive encoding for both entities and non-entities.
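
To make the difference between the schemes concrete, the sketch below converts a sequence of IOB tags into IOBES tags. The function name `iob_to_iobes` and the example sentence are illustrative assumptions and are not part of the assignment template; the sketch also assumes typed tags such as `B-LOC` and `I-LOC`, as used in this dataset.

```python
def iob_to_iobes(tags: list[str]) -> list[str]:
    """Convert IOB tags (e.g. B-LOC, I-LOC, O) to IOBES tags."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, entity_type = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag of the same type.
        continues = next_tag == f"I-{entity_type}"
        if prefix == "B":
            # A B tag that is not continued is a single-token entity.
            converted.append(tag if continues else f"S-{entity_type}")
        else:
            # An I tag that is not continued closes the entity.
            converted.append(tag if continues else f"E-{entity_type}")
    return converted


example = ["O", "B-LOC", "I-LOC", "O", "B-PER", "B-ORG", "I-ORG", "I-ORG"]
print(iob_to_iobes(example))
```

Only the boundary information changes: the `O` tags and the entity types are preserved, which is why IOBES is described as an extension of IOB.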




In [None]:
!mkdir -p data
# Fetch data
# train_link = 'https://drive.google.com/file/d/1CqEGoLPVKau3gvVrdG6ORyfOEr1FSZGf/view?usp=sharing'
# dev_link   = 'https://drive.google.com/file/d/1rdUida-j3OXcwftITBlgOh8nURhAYUDw/view?usp=sharing'
# test_link  = 'https://drive.google.com/file/d/137Ht40OfflcsE6BIYshHbT5b2iIJVaDx/view?usp=sharing'

!wget -q --no-check-certificate 'https://docs.google.com/uc?export=download&id=1CqEGoLPVKau3gvVrdG6ORyfOEr1FSZGf' -O data/train.txt
!wget -q --no-check-certificate 'https://docs.google.com/uc?export=download&id=1rdUida-j3OXcwftITBlgOh8nURhAYUDw' -O data/dev.txt
!wget -q --no-check-certificate 'https://docs.google.com/uc?export=download&id=137Ht40OfflcsE6BIYshHbT5b2iIJVaDx' -O data/test.txt


In [None]:
# Any additional needed libraries
!pip install -qU transformers[torch] wandb

In [None]:
# Standard Library Imports
import os
import copy
import random
import warnings
from collections import defaultdict
from typing import Optional
import json
from google.colab import files

# ML
import numpy as np
import scipy as sp
import pandas as pd

# Visual
import matplotlib
import seaborn as sns
from tqdm import tqdm
from tabulate import tabulate
import matplotlib.pyplot as plt
from IPython.display import display

# DL
import torch as th
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
import wandb  # used later for logging (wandb.watch)

# Metrics
from sklearn import metrics
from sklearn.metrics import accuracy_score , roc_auc_score, classification_report, confusion_matrix, precision_recall_fscore_support


In [None]:
model_name = 'bert-base-uncased'
SEED = 42
# Set the random seed for Python
random.seed(SEED)

# Set the random seed for numpy
np.random.seed(SEED)

# Set the random seed for pytorch
th.manual_seed(SEED)

# If using CUDA (for GPU operations)
th.cuda.manual_seed(SEED)

# Set up the device
DEVICE = "cuda" if th.cuda.is_available() else "cpu"
# assert DEVICE == "cuda"

DataType = dict[str, list[list[str]]]

# Part 1 - Dataset Preparation

## Step 1: Read Data
Write a function for reading the data from a single file (one of those provided above).
- The function receives a filepath.
- The function encodes every sentence individually as a pair of lists: one list contains the words and the other contains the tags.
- The function returns a dictionary holding the texts as a list and the tags as a list.

Example output:
```
{
  "texts": [
    ['At','Trent','Bridge',':'],
    ...],
  "tags":[
    ['O','B-LOC','I-LOC','O'],
    ...]
  ...
}
```

In [None]:
def read_data(filepath:str) -> DataType:
  """
  Read data from a single file.
  The function receives a filepath.
  The function encodes every sentence using a pair of lists: one list contains the words and one list contains the tags.
  :param filepath: path to the file
  :return: dictionary with the keys "texts" and "tags", each holding one list per sentence
  """
  data = {
    "texts": [],
    "tags": []
  }
  # TO DO ----------------------------------------------------------------------
  with open(filepath, 'r', encoding='utf-8') as file:
        sentence = []
        tags = []
        
        for line in file:
            if line.strip() == '':
                # Check that the sentence and tags lists are not empty
                if sentence and tags:  
                    data["texts"].append(sentence)
                    data["tags"].append(tags)
                sentence = []
                tags = []
            else:
                splits = line.strip().split()
                word = splits[0]
                tag = splits[-1]
                sentence.append(word)
                tags.append(tag)
        
        # Add the last sentence if it's non-empty and not added
        if sentence and tags:
            data["texts"].append(sentence)
            data["tags"].append(tags)
  return data

In [None]:
train_raw = read_data("data/train.txt")
dev_raw = read_data("data/dev.txt")
test_raw = read_data("data/test.txt")
print(f"Train size: {len(train_raw['texts'])}")
print(f"Dev size: {len(dev_raw['texts'])}")
print(f"Test size: {len(test_raw['texts'])}")

## Step 2: Prepare Data
Write a function `prepare_data` that takes one of the splits (train, dev, or test) and encodes it as tensors.

### Your Task
1. Load the BERT Tokenizer
2. Tokenize the data and encode the labels
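
BERT splits words into subword tokens, so word-level tags have to be mapped onto token positions. The sketch below shows one common convention, not necessarily the one expected by the course tests: only the first subword of each word keeps the word's label, and every other position gets `-100`, the default `ignore_index` of PyTorch's cross-entropy loss. It assumes a fast Hugging Face tokenizer called with `is_split_into_words=True`, whose encoding exposes `word_ids()`; the helper name `align_labels` and the demo values are hypothetical.

```python
from typing import Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: list[Optional[int]], word_tags: list[str],
                 tag2id: dict[str, int]) -> list[int]:
    """Map word-level tags onto subword tokens.

    word_ids[i] is the index of the word that token i came from,
    or None for special tokens such as [CLS], [SEP] and padding.
    """
    labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            labels.append(IGNORE_INDEX)  # special token
        elif word_id != previous_word:
            labels.append(tag2id[word_tags[word_id]])  # first subword of a word
        else:
            labels.append(IGNORE_INDEX)  # continuation subword
        previous_word = word_id
    return labels


# Hypothetical example: "At Trent Bridge :" where "Trent" is split into two subwords.
tag2id_demo = {"O": 0, "B-LOC": 3, "I-LOC": 4}
word_ids_demo = [None, 0, 1, 1, 2, 3, None]
tags_demo = ["O", "B-LOC", "I-LOC", "O"]
print(align_labels(word_ids_demo, tags_demo, tag2id_demo))
```

Ignoring continuation subwords keeps one prediction per original word, which makes the later word-level evaluation straightforward. Labeling every subword is an alternative design choice.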

In [None]:
# Prepare tag2id dictionaries
tag2id = {}
id2tag = {}
tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
for tag in tags:
  tag2id[tag] = len(tag2id)
  id2tag[len(id2tag)] = tag

In [None]:
tag2id

In [None]:
tokenizer = None
# TO DO ----------------------------------------------------------------------

# TO DO ----------------------------------------------------------------------
tokenizer

In [None]:
def prepare_data(data: DataType, tag2id: dict[str, int]) -> dict[str, th.Tensor]:
  enc_data = {
    "texts": None,
    "labels": None
  }
  # TO DO ----------------------------------------------------------------------
  # Tokenize the texts

  # TO DO ----------------------------------------------------------------------
  return enc_data

In [None]:
train_sequences = prepare_data(train_raw, tag2id)
dev_sequences = prepare_data(dev_raw, tag2id)
test_sequences = prepare_data(test_raw, tag2id)

## Step 3: Dataset
Create a dataset for each split. Each dataset should return its samples as tensors.


In [None]:
class NERDataset(Dataset):
  # TO DO ----------------------------------------------------------------------

  # TO DO ----------------------------------------------------------------------

In [None]:
train_ds = None
dev_ds = None
test_ds = None
# TO DO ----------------------------------------------------------------------
train_ds = NERDataset(train_sequences)
dev_ds = NERDataset(dev_sequences)
test_ds = NERDataset(test_sequences)
# TO DO ----------------------------------------------------------------------

<br><br><br><br><br><br>

# Part 2 - NER Model Training

## Step 1: Load Model

Load a token classification model.

In [None]:
model = None
def load_model(model_name: str, tag2id) -> nn.Module:
  # TO DO ----------------------------------------------------------------------

  # TO DO ----------------------------------------------------------------------
model = load_model(model_name, tag2id)
model

## Step 2: Training

Write a training function that uses the Hugging Face `Trainer`. The function should periodically log the loss on both the train dataset and the dev dataset.

In [None]:
N_EPOCHS = 5
# TO DO ----------------------------------------------------------------------
BATCH_SIZE = 0
# TO DO ----------------------------------------------------------------------

In [None]:
def train_model(model, n_epochs: int, batch_size: int, train_ds: Dataset, dev_ds: Dataset) -> Trainer:
  """
  Train a model.
  :param model: model instance
  :param n_epochs: number of epochs to train on
  :param batch_size: batch size
  :param train_ds: train dataset
  :param dev_ds: dev dataset
  :return: the Trainer instance used for training
  """
  # TO DO ----------------------------------------------------------------------

  # TO DO ----------------------------------------------------------------------

In [None]:
wandb.watch(model, log_freq=15)
trainer = train_model(model, N_EPOCHS, BATCH_SIZE, train_ds, dev_ds)

<br><br><br><br><br><br>

# Part 3 - Evaluation


## Step 1: Evaluation Function

Write an evaluation function for a trained model using the dev and test datasets. This function will print the `Recall`, `Precision`, and `F1` scores and plot a `Confusion Matrix`.

Perform this evaluation twice:
1. For all labels (7 labels in total).
2. For all labels except "O" (6 labels in total).

## Metrics and Display

### Metrics
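- **Recall**: Also known as the True Positive Rate (TPR): the fraction of the true instances of a label that the model predicted correctly, TP / (TP + FN).
- **Precision**: The fraction of the predictions of a label that are correct, TP / (TP + FP).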
- **F1 Score**: The harmonic mean of Precision and Recall.

*Note*: For all these metrics, use **weighted** averaging:
Calculate metrics for each label, and find their average weighted by support. Refer to the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support) for more details.
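
As a dependency-free illustration of what support-weighted averaging computes, the sketch below derives per-label precision, recall, and F1 and averages them by support. In the notebook itself, `precision_recall_fscore_support(..., average="weighted")` from sklearn is the intended tool; the function name `weighted_prf` and the toy tag sequences are assumptions made for this example only.

```python
from collections import Counter


def weighted_prf(y_true: list[str], y_pred: list[str],
                 labels: list[str]) -> tuple[float, float, float]:
    """Support-weighted precision, recall and F1 over the given labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_tag, pred_tag in zip(y_true, y_pred):
        if true_tag == pred_tag:
            tp[true_tag] += 1
        else:
            fp[pred_tag] += 1
            fn[true_tag] += 1

    # Support of a label is the number of tokens whose true tag is that label.
    total_support = sum(tp[label] + fn[label] for label in labels)
    precision = recall = f1 = 0.0
    for label in labels:
        support = tp[label] + fn[label]
        predicted = tp[label] + fp[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total_support if total_support else 0.0
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


y_true_demo = ["O", "B-LOC", "I-LOC", "O", "B-PER", "O"]
y_pred_demo = ["O", "B-LOC", "O", "O", "B-PER", "B-PER"]
all_labels = ["O", "B-LOC", "I-LOC", "B-PER"]
print(weighted_prf(y_true_demo, y_pred_demo, all_labels))
# Second evaluation: same predictions, but "O" is left out of the label list.
print(weighted_prf(y_true_demo, y_pred_demo, [l for l in all_labels if l != "O"]))
```

Because "O" usually dominates the token count, leaving it out of the label list tends to give a more honest picture of entity-level performance.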

### Display
1. Print the `Recall`, `Precision`, and `F1` scores in a tabulated format.
2. Display a `Confusion Matrix` plot:
   - Rows represent the predicted labels.
   - Columns represent the true labels.
   - Include a title for the plot, axis names, and the names of the tags on the X-axis.
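
The orientation requested above (rows are predicted labels, columns are true labels) is the transpose of what sklearn's `confusion_matrix` returns, where rows are true labels. The following minimal sketch builds the counts directly in the requested orientation; the function name `confusion_counts` and the toy data are assumptions for illustration, and plotting is left out.

```python
def confusion_counts(y_true: list[str], y_pred: list[str],
                     labels: list[str]) -> list[list[int]]:
    """Return matrix[i][j] = number of tokens predicted as labels[i]
    whose true label is labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_tag, pred_tag in zip(y_true, y_pred):
        # Tokens whose true or predicted tag is not in `labels` are skipped.
        if true_tag in index and pred_tag in index:
            matrix[index[pred_tag]][index[true_tag]] += 1
    return matrix


cm_labels = ["O", "B-LOC", "I-LOC"]
cm_true = ["O", "B-LOC", "I-LOC", "O"]
cm_pred = ["O", "B-LOC", "O", "O"]
for label, row in zip(cm_labels, confusion_counts(cm_true, cm_pred, cm_labels)):
    print(f"{label:>6}: {row}")
```

When using sklearn instead, transposing its result (for example with `.T` on the returned array) yields the same layout.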

In [None]:
def evaluate(trainer: Trainer, title: str, dataset: Dataset, tag2id: dict[str, int]):
  """
  Evaluate a trained model on the given dataset.
  :param trainer: trainer instance containing a trained model
  :param title: title for the plot
  :param dataset: dataset
  :param tag2id: tag2id dictionary
  :return: Dictionary of evaluation results
  """
  results = {}
  # TO DO ----------------------------------------------------------------------

  # TO DO ----------------------------------------------------------------------
  return results


In [None]:
# Assuming the model is trained and dev_ds is the Dataset for the dev split
results = evaluate(trainer, "Evaluation on Dev Set", dev_ds, tag2id)

## Step 2 - Logs and Visualization
Explore and integrate [wandb](https://wandb.ai/home) as a logging and visualization tool in the training and evaluation steps. Look at the plots of the loss (train, eval) and see how useful they can be :) Also make sure to log some results, such as plots and final results, before printing.

## Step 3: Development
Experiment with different hyperparameters in your training and optimize them based on the results on the **development set**.

Decide which model performs best. Note that this time the parameter changes will be made inside the model initialization or the training function and will not be passed as parameters to the `load_model` function, so just hard-code them in the other functions.

In [None]:
# TO DO ----------------------------------------------------------------------

# TO DO ----------------------------------------------------------------------

## Step 4 - Final Evaluation
After configuring your parameters so that the loaded model is the best one, train it, evaluate it on the test set, and print the results. This part simulates real-world data.

In [None]:
model = None
# TO DO ----------------------------------------------------------------------

# TO DO ----------------------------------------------------------------------

<br><br><br><br><br>

In [None]:
####################
# PLACE TESTS HERE #
####################
