# Natural Language Processing
![](https://i.imgur.com/qkg2E2D.png)

## Assignment 003 - NER Tagger

> Notebook by:
> - NLP Course Staff

## Revision History

| Version | Date       | User        | Content / Changes                                                   |
|---------|------------|-------------|---------------------------------------------------------------------|
| 0.1.000 | 21/05/2024 | course staff| First version                                                       |
| 0.1.001 | 23/05/2024 | course staff| Updated instructions for `Vocab` class to allow flexible special tokens definition |

## Overview
In this assignment, you will build a complete training and testing pipeline for a neural sequential tagger for named entities using LSTM.

## Dataset
You will work with the ReCoNLL 2003 dataset, a corrected version of the [CoNLL 2003 dataset](https://www.clips.uantwerpen.be/conll2003/ner/):

**Click on those links so you have access to the data!**
- [Train data](https://drive.google.com/file/d/1CqEGoLPVKau3gvVrdG6ORyfOEr1FSZGf/view?usp=sharing)

- [Dev data](https://drive.google.com/file/d/1rdUida-j3OXcwftITBlgOh8nURhAYUDw/view?usp=sharing)

- [Test data](https://drive.google.com/file/d/137Ht40OfflcsE6BIYshHbT5b2iIJVaDx/view?usp=sharing)

As you will see, the annotated texts are labeled according to the `IOB` annotation scheme (more on this below), for 3 entity types: Person, Organization, Location.

## Your Implementation

Please create a local copy of this template Colab's Notebook:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1VVtBtlwZZnxQWdluNVkDgTMvDKVaqDOM?usp=sharing)

The assignment's instructions are there; follow the notebook.

## Submission
- **Notebook Link**: Add the URL to your assignment's notebook in the `notebook_link.txt` file, following the format provided in the example.
- **Access**: Ensure the link has edit permissions enabled to allow modifications if needed.
- **Deadline**: <font color='green'>06/06/2024</font>.
- **Platform**: Continue using GitHub for submissions. Push your project to the team repository and monitor the test results under the actions section.

Good Luck 🤗


<!-- ## NER schemes:  

> `IO`: is the simplest scheme that can be applied to this task. In this scheme, each token from the dataset is assigned one of two tags: an inside tag (`I`) and an outside tag (`O`). The `I` tag is for named entities, whereas the `O` tag is for normal words. This scheme has a limitation, as it cannot correctly encode consecutive entities of the same type.

> `IOB`: This scheme is also referred to in the literature as BIO and has been adopted by the Conference on Computational Natural Language Learning (CoNLL) [1]. It assigns a tag to each word in the text, determining whether it is the beginning (`B`) of a known named entity, inside (`I`) it, or outside (`O`) of any known named entities.

> `IOE`: This scheme works nearly identically to `IOB`, but it indicates the end of the entity (`E` tag) instead of its beginning.

> `IOBES`: An alternative to the IOB scheme is `IOBES`, which increases the amount of information related to the boundaries of named entities. In addition to tagging words at the beginning (`B`), inside (`I`), end (`E`), and outside (`O`) of a named entity. It also labels single-token entities with the tag `S`.

> `BI`: This scheme tags entities in a similar method to `IOB`. Additionally, it labels the beginning of non-entity words with the tag B-O and the rest as I-O.

> `IE`: This scheme works exactly like `IOE` with the distinction that it labels the end of non-entity words with the tag `E-O` and the rest as `I-O`.

> `BIES`: This scheme encodes the entities similar to `IOBES`. In addition, it also encodes the non-entity words using the same method. It uses `B-O` to tag the beginning of non-entity words, `I-O` to tag the inside of non-entity words, and `S-O` for single non-entity tokens that exist between two entities. -->


## NER Schemes

### IO
- **Description**: The simplest scheme for named entity recognition (NER).
- **Tags**:
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
- **Limitation**: Cannot correctly encode consecutive entities of the same type.

### IOB (BIO)
- **Description**: Adopted by the Conference on Computational Natural Language Learning (CoNLL).
- **Tags**:
  - `B`: Beginning of a named entity.
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
- **Advantage**: Can encode the boundaries of consecutive entities.

### IOE
- **Description**: Similar to IOB, but indicates the end of an entity.
- **Tags**:
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
  - `E`: End of a named entity.
- **Advantage**: Focuses on the end boundary of entities.

### IOBES
- **Description**: An extension of IOB with additional boundary information.
- **Tags**:
  - `B`: Beginning of a named entity.
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
  - `E`: End of a named entity.
  - `S`: Single-token named entity.
- **Advantage**: Provides more detailed boundary information for named entities.

### BI
- **Description**: Tags entities similarly to IOB and labels the beginning of non-entity words.
- **Tags**:
  - `B`: Beginning of a named entity.
  - `I`: Inside a named entity.
  - `B-O`: Beginning of a non-entity word.
  - `I-O`: Inside a non-entity word.
- **Advantage**: Distinguishes the beginning of non-entity sequences.

### IE
- **Description**: Similar to IOE but for non-entity words.
- **Tags**:
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
  - `E`: End of a named entity.
  - `E-O`: End of a non-entity word.
  - `I-O`: Inside a non-entity word.
- **Advantage**: Highlights the end of non-entity sequences.

### BIES
- **Description**: Encodes both entities and non-entity words using the IOBES method.
- **Tags**:
  - `B`: Beginning of a named entity.
  - `I`: Inside a named entity.
  - `O`: Outside any named entity.
  - `E`: End of a named entity.
  - `S`: Single-token named entity.
  - `B-O`: Beginning of a non-entity word.
  - `I-O`: Inside a non-entity word.
  - `S-O`: Single non-entity token.
- **Advantage**: Comprehensive encoding for both entities and non-entities.




In [23]:
!mkdir data
# Fetch data
# train_link = 'https://drive.google.com/file/d/1CqEGoLPVKau3gvVrdG6ORyfOEr1FSZGf/view?usp=sharing'
# dev_link   = 'https://drive.google.com/file/d/1rdUida-j3OXcwftITBlgOh8nURhAYUDw/view?usp=sharing'
# test_link  = 'https://drive.google.com/file/d/137Ht40OfflcsE6BIYshHbT5b2iIJVaDx/view?usp=sharing'

# !wget -q --no-check-certificate 'https://docs.google.com/uc?export=download&id=1CqEGoLPVKau3gvVrdG6ORyfOEr1FSZGf' -O data/train.txt
# !wget -q --no-check-certificate 'https://docs.google.com/uc?export=download&id=1rdUida-j3OXcwftITBlgOh8nURhAYUDw' -O data/dev.txt
# !wget -q --no-check-certificate 'https://docs.google.com/uc?export=download&id=137Ht40OfflcsE6BIYshHbT5b2iIJVaDx' -O data/test.txt

# # mac version:
!curl -L -o data/train.txt 'https://drive.google.com/uc?export=download&id=1CqEGoLPVKau3gvVrdG6ORyfOEr1FSZGf'
!curl -L -o data/dev.txt 'https://drive.google.com/uc?export=download&id=1rdUida-j3OXcwftITBlgOh8nURhAYUDw'
!curl -L -o data/test.txt 'https://drive.google.com/uc?export=download&id=137Ht40OfflcsE6BIYshHbT5b2iIJVaDx'



mkdir: data: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0:--:-- --:--:-- --:--:--     0
100  257k  100  257k    0     0   200k      0  0:00:01  0:00:01 --:--:--  200k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 36633  100 36633    0     0  32671      0  0:00:01  0:00:01 --:--:--  362k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 75891  100 75891    0     0  70768      0  0:00:01  0:00:01 

In [24]:
# Any additional needed libraries
!pip install --q

[31mERROR: You must give at least one requirement to install (see "pip help install")[0m[31m
[0m

In [25]:
# Standard Library Imports
import os
import copy
import random
import warnings
from collections import defaultdict
from typing import Optional

# ML
import numpy as np
import scipy as sp
import pandas as pd

# Visual
import matplotlib
import seaborn as sns
from tqdm.notebook import tqdm
from tabulate import tabulate
import matplotlib.pyplot as plt
from IPython.display import display

# DL
import torch as th
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader, Dataset

# Metrics
from sklearn import metrics
from sklearn.metrics import accuracy_score , roc_auc_score, classification_report, confusion_matrix, precision_recall_fscore_support


In [26]:
SEED = 42
# Set the random seed for Python
random.seed(SEED)

# Set the random seed for numpy
np.random.seed(SEED)

# Set the random seed for pytorch
th.manual_seed(SEED)

# If using CUDA (for GPU operations)
th.cuda.manual_seed(SEED)

# Set up the device
# TO DO ----------------------------------------------------------------------
# DEVICE = "cuda" if th.cuda.is_available() else ("mps" if th.backends.mps.is_available() else "cpu")
# TO DO ----------------------------------------------------------------------
# if not th.backends.mps.is_available():
    # assert DEVICE == "cuda"

DEVICE = 'cuda' if th.cuda.is_available() else 'cpu'

DataType = list[tuple[list[str],list[str]]]

# Part 1 - Dataset Preparation

## Step 1: Read Data
Write a function for reading the data from a single file (of the ones that are provided above).   
- The function recieves a filepath
- The funtion encodes every sentence individually using a pair of lists, one list contains the words and one list contains the tags.
- Each list pair will be added to a general list (data), which will be returned back from the function.

Example output:
```
[
  (['At','Trent','Bridge',':'],['O','B-LOC','I-LOC ','O']),
  ([...],[...]),
  ...
]
```

In [27]:
def read_data(filepath:str) -> DataType:
  """
  Read data from a single file.
  The function recieves a filepath
  The funtion encodes every sentence using a pair of lists, one list contains the words and one list contains the tags.
  :param filepath: path to the file
  :return: data as a list of tuples
  """
  data = []
  # TO DO ----------------------------------------------------------------------
  ## going through every sentence and every word, and prepering the data accordinaly to the requirments.
  with open(filepath, 'r') as f:
    sentence = []
    tags = []
    for line in f:
      if line.strip() == '':
        data.append((sentence, tags))
        sentence = []
        tags = []
      else:
        word, tag = line.strip().split(' ')
        sentence.append(word)
        tags.append(tag)

  # TO DO ----------------------------------------------------------------------
  return data

In [28]:
train = read_data("data/train.txt")
dev = read_data("data/dev.txt")
test = read_data("data/test.txt")

## Step 2: Create Vocab

The `Vocab` class will serve as a dictionary that maps words and tags into IDs. Ensure that you include special tokens to handle out-of-vocabulary words and padding.

### Your Task
1. **Define Special Tokens**: Define special tokens such as `PAD_TOKEN` and `UNK_TOKEN` and assign them unique IDs.
2. **Initialize Dictionaries**: Populate the word and tag dictionaries based on the training set.

*Note: You may change the `Vocab` class as needed.*

In [29]:
# Initinize ids for special tokens
PAD_TOKEN = 0
UNK_TOKEN = 1

class Vocab:
  def __init__(self, train: DataType):
    """
    Initialize a Vocab instance.
    :param train: train data
    """
    self.word2id = {"__unk__": UNK_TOKEN, "__pad__": PAD_TOKEN}
    self.id2word = {UNK_TOKEN: "__unk__", PAD_TOKEN: "__pad__"}
    self.n_words = 2

    self.tag2id = {}
    self.id2tag = {}
    self.n_tags = 0

    # Initialize dictionaries based on the training set
    # TO DO ----------------------------------------------------------------------
    ## itterating through all sentence+tags in the train dataset. ids are increamntaly increased.

    for sentence, tags in train:
      for word in sentence:
        if word not in self.word2id:
          self.word2id[word] = self.n_words
          self.id2word[self.n_words] = word
          self.n_words += 1
      for tag in tags:
        if tag not in self.tag2id:
          self.tag2id[tag] = self.n_tags + 1 
          self.id2tag[self.n_tags + 1] = tag
          self.n_tags += 1

    # TO DO ----------------------------------------------------------------------

  def __len__(self):
    return self.n_words

  def index_tags(self, tags: list[str]) -> list[int]:
    """
    Convert tags to Ids.
    :param tags: list of tags
    :return: list of Ids
    """
    tag_indexes = [self.tag2id[t] for t in tags]
    return tag_indexes

  def index_words(self, words: list[str]) -> list[int]:
    """
    Convert words to Ids.
    :param words: list of words
    :return: list of Ids
    """
    word_indexes = [self.word2id[w] if w in self.word2id else self.word2id["__unk__"] for w in words]
    return word_indexes

In [30]:
vocab = Vocab(train)

## Step 3: Prepare Data
Write a function `prepare_data` that takes one of the [train, dev, test] and the `Vocab` instance, for converting each pair of (words, tags) to a pair of indexes. Additionally, the function should pad the sequences to the maximum length sequence **of the given split**.

Note: Vocabulary is based only on the train set.

### Your Task
1. Convert each pair of (words, tags) to a pair of indexes using the Vocab instance.
2. Pad the sequences to the maximum length of the sequences in the given split.

In [31]:
def prepare_data(data: DataType, vocab: Vocab):
  data_sequences = []
  # TO DO ----------------------------------------------------------------------
  longest_sentence_len = 0
  for sentence, tags in data:
    if len(sentence) > longest_sentence_len:
      longest_sentence_len = len(sentence)
  data_copy = copy.deepcopy(data)
  for sentence, tags in data_copy:
    word_indexes = vocab.index_words(sentence)
    tag_indexes = vocab.index_tags(tags)
    word_indexes += [vocab.word2id[vocab.id2word[PAD_TOKEN]]] * (longest_sentence_len - len(word_indexes))
    tag_indexes = tag_indexes + [0] * (longest_sentence_len - len(tag_indexes))
    data_sequences.append((word_indexes, tag_indexes))
  # TO DO ----------------------------------------------------------------------
  return data_sequences

In [32]:
train_sequences = prepare_data(train, vocab)
dev_sequences = prepare_data(dev, vocab)
test_sequences = prepare_data(test, vocab)

### Your Task
Print the number of OOV in dev and test sets:

In [33]:
def count_oov(sequences) -> int:
  """
  Count the number of OOV words.
  :param sequences: list of sequences
  :return: number of OOV words
  """
  oov = 0
  # TO DO ----------------------------------------------------------------------
  for sentence, tags in sequences:
      for word in sentence:
        if word == vocab.word2id['__unk__']:
            oov += 1

  # TO DO ----------------------------------------------------------------------
  return -1 if oov == 0 else oov

## Step 4: Dataloaders
Create dataloaders for each split in the dataset. They should return the samples as Tensors.

**Hint** - you can create a Dataset to support this part.

For the training set, use shuffling, and for the dev and test, not.

In [34]:
def prepare_data_loader(sequences, batch_size: int, train: bool = True):
  """
  Create a dataloader from a list of sequences.
  :param sequences: list of sequences
  :param batch_size: batch size
  :param train: whether to shuffle the dataloader or not
  :return: dataloader
  """
  dataloader = None
  # TO DO ----------------------------------------------------------------------
  class OurDataset(Dataset):
    def __init__(self, data):
      self.data = data

    def __len__(self):
      return len(self.data)

    def __getitem__(self, idx):
      sentence, tags = self.data[idx]
      return th.tensor(sentence), th.tensor(tags)

  dataloader = DataLoader(OurDataset(sequences), batch_size=batch_size, shuffle=train)
  # TO DO ----------------------------------------------------------------------

  # TO DO ----------------------------------------------------------------------
  return dataloader

In [35]:
BATCH_SIZE = 16
dl_train = prepare_data_loader(train_sequences, batch_size=BATCH_SIZE)

dl_dev = prepare_data_loader(dev_sequences, batch_size=BATCH_SIZE, train=False)
dl_test = prepare_data_loader(test_sequences, batch_size=BATCH_SIZE, train=False)

# Part 2 - NER Model Training

## Step 1: Implement Model

Write NERNet, a PyTorch Module for labeling words with NER tags.

> `input_size`: the size of the vocabulary  
`embedding_size`: the size of the embeddings  
`hidden_size`: the LSTM hidden size  
`output_size`: the number tags we are predicting for  
`n_layers`: the number of layers we want to use in LSTM  
`directions`: could 1 or 2, indicating unidirectional or bidirectional LSTM, respectively  

<br>  

The input for your forward function should be a single sentence tensor.

*Note: the embeddings in this section are learned embedding. That means that you don't need to use pretrained embedding like the one used in the last excersie. You will use them in part 5.*

*Note: You may change the NERNet class.*

In [36]:
class NERNet(nn.Module):
  def __init__(self, input_size: int, embedding_size: int, hidden_size: int, output_size: int, n_layers: int, directions: int):
    """
    Initialize a NERNet instance.
    :param input_size: the size of the vocabulary
    :param embedding_size: the size of the embeddings
    :param hidden_size: the LSTM hidden size
    :param output_size: the number tags we are predicting for
    :param n_layers: the number of layers we want to use in LSTM
    :param directions: could be 1 or 2, indicating unidirectional or bidirectional LSTM, respectively
    """
    super(NERNet, self).__init__()
    # TO DO ----------------------------------------------------------------------
    self.embedding = nn.Embedding(input_size, embedding_size)
    self.lstm = nn.LSTM(
      embedding_size,
      hidden_size,
      n_layers,
      bidirectional=(directions==2),
      batch_first=False)
    self.fc = nn.Linear(hidden_size * directions, output_size)
    self.softmax = nn.LogSoftmax(dim=2)


    # TO DO ----------------------------------------------------------------------

  def forward(self, input_sentence):
    # TO DO ----------------------------------------------------------------------
    embedded = self.embedding(input_sentence)
    lstm_out, _ = self.lstm(embedded)
    tag_space = self.fc(lstm_out)
    tag_scores = self.softmax(tag_space)
    # TO DO ----------------------------------------------------------------------
    return tag_scores

In [37]:
model = NERNet(vocab.n_words, embedding_size=300, hidden_size=800, output_size=vocab.n_tags, n_layers=2, directions=1)
model.to(DEVICE)

NERNet(
  (embedding): Embedding(7163, 300)
  (lstm): LSTM(300, 800, num_layers=2)
  (fc): Linear(in_features=800, out_features=7, bias=True)
  (softmax): LogSoftmax(dim=2)
)

In [38]:
sampler = dl_train.batch_sampler
batch = next(iter(dl_train))
input_sentences = batch[0].to(DEVICE)
tags = batch[1].to(DEVICE)
output = model(input_sentences) # (batch_size, seq_len, num_tags)
# ecah word in the sentence has a probability for each tag
# the tag with the highest probability is the predicted tag
# tags look like: (batch_size, seq_len - with the actual tags)
# calculate the loss using CrossEntropyLoss:
output = output.view(-1, vocab.n_tags) # (batch_size * seq_len, num_tags)
tags = tags.view(-1) # (batch_size * seq_len)
# loss = nn.CrossEntropyLoss()(output, tags)
output.shape, tags.shape

(torch.Size([928, 7]), torch.Size([928]))

## Step 2: Training Loop

Write a training loop, which takes a model (instance of NERNet), number of epochs to train on, and the train&dev datasets.  

The function will return the `loss` and `accuracy` durring training.  
(If you're using a different/additional metrics, return them too)

The loss is always CrossEntropyLoss and the optimizer is always Adam.
Make sure to use `tqdm` while iterating on `n_epochs`.


In [56]:
def train_loop(model: NERNet, n_epochs: int, dataloader_train, dataloader_dev):
  """
  Train a model.
  :param model: model instance
  :param n_epochs: number of epochs to train on
  :param dataloader_train: train dataloader
  :param dataloader_dev: dev dataloader
  :return: loss and accuracy during training
  """
  # Optimizer (ADAM is a fancy version of SGD)
  optimizer = Adam(model.parameters(), lr=0.0001)

  # Record
  metrics = {'loss': {'train': [], 'dev': []}, 'accuracy': {'train': [], 'dev': []}}

  # Move model to device
  model.to(DEVICE)

  # TO DO ----------------------------------------------------------------------

  loss_function = nn.CrossEntropyLoss(ignore_index=0)
  
  for epoch in tqdm(range(n_epochs)):
    model.train()
    train_loss = 0.0
    train_correct = 0
    train_total = 0
    for sentences, tags in dataloader_train:
      sentences = sentences.to(DEVICE)
      tags = tags.to(DEVICE)
      optimizer.zero_grad()
      tags_scores = model(sentences)
      tags_scores = tags_scores.view(-1, vocab.n_tags)
      tags = tags.view(-1)
      loss = loss_function(tags_scores, tags)
      loss.backward()
      optimizer.step()
      
      train_loss += loss.item()
      train_correct += ((tags_scores.argmax(dim=-1) == tags) * (tags != 0)).sum().item()
      train_total += (tags != 0).sum().item()
    
    train_loss /= len(dataloader_train)
    train_accuracy = train_correct / train_total
    metrics['loss']['train'].append(train_loss)
    metrics['accuracy']['train'].append(train_accuracy)
      
    dev_loss = 0.0
    dev_correct = 0
    dev_total = 0
    model.eval()
    with th.no_grad():
        for sentences, tags in dataloader_dev:
            sentences = sentences.to(DEVICE)
            tags = tags.to(DEVICE)
            tags_scores = model(sentences)
            tags_scores = tags_scores.view(-1, vocab.n_tags)
            tags = tags.view(-1)
            print(tags_scores.shape, tags.shape)
            loss = loss_function(tags_scores, tags)
            dev_loss += loss.item()
            dev_correct += ((tags_scores.argmax(dim=-1) == tags) * (tags != 0)).sum().item()
            dev_total += (tags != 0).sum().item()
    dev_loss /= len(dataloader_dev)
    dev_accuracy = dev_correct / dev_total
    metrics['loss']['dev'].append(dev_loss)
    metrics['accuracy']['dev'].append(dev_accuracy)
      # TO DO ----------------------------------------------------------------------

  # TO DO ----------------------------------------------------------------------

  return metrics

In [55]:
metrics = train_loop(model, n_epochs=5, dataloader_train=dl_train, dataloader_dev=dl_dev)

  0%|          | 0/5 [00:00<?, ?it/s]

IndexError: Target 7 is out of bounds.

<br><br><br><br><br><br>

In [None]:
# show the loss and accuracy during training and dev in a plot
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(metrics['loss']['train'], label='Train')
plt.plot(metrics['loss']['dev'], label='Dev')
plt.title('Loss')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(metrics['accuracy']['train'], label='Train')
plt.plot(metrics['accuracy']['dev'], label='Dev')
plt.title('Accuracy')
plt.legend()
plt.show()

# Part 3 - Evaluation


## Step 1: Evaluation Function

Write an evaluation loop for a trained model using the dev and test datasets. This function will print the `Recall`, `Precision`, and `F1` scores and plot a `Confusion Matrix`.

Perform this evaluation twice:
1. For all labels (7 labels in total).
2. For all labels except "O" (6 labels in total).

## Metrics and Display

### Metrics
- **Recall**: True Positive Rate (TPR), also known as Recall.
- **Precision**: The opposite of False Positive Rate (FPR), also known as Precision.
- **F1 Score**: The harmonic mean of Precision and Recall.

*Note*: For all these metrics, use **weighted** averaging:
Calculate metrics for each label, and find their average weighted by support. Refer to the [sklearn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support) for more details.

### Display
1. Print the `Recall`, `Precision`, and `F1` scores in a tabulated format.
2. Display a `Confusion Matrix` plot:
   - Rows represent the predicted labels.
   - Columns represent the true labels.
   - Include a title for the plot, axis names, and the names of the tags on the X-axis.

In [None]:
def evaluate(model: NERNet, title: str, dataloader: DataLoader, vocab: Vocab):
  """
  Evaluate a trained model on the given dataset.
  :param model: model instance
  :param title: title for the plot
  :param dataloader: dataloader
  :param vocab: Vocab instance
  :return: Dictionary of evaluation results
  """
    
  results = {}
  
  # TO DO ----------------------------------------------------------------------
  
  model.eval()
  y_true = []
  y_pred = []
  y_true_wo_o = []
  y_pred_wo_o = []
  for sentences, tags in dataloader:
    sentences = sentences.to(DEVICE)
    tags = tags.to(DEVICE)
    tags_scores = model(sentences)
    _, predicted = th.max(tags_scores, 2)
    for i, tag in enumerate(tags.view(-1).tolist()):
      if tag != 0:
        y_true.append(tag)
        y_pred.append(predicted.view(-1).tolist()[i])
        if vocab.id2tag[tag] != 'O':
          y_true_wo_o.append(tag)
          y_pred_wo_o.append(predicted.view(-1).tolist()[i])
  
  results_values_with_o = precision_recall_fscore_support(y_true, y_pred, average='weighted')
  confusion_with_o = confusion_matrix(y_true, y_pred)
  plot_labels = ['__pad__'] + [tag for tag in vocab.id2tag.values()]
  plt.figure(figsize=(10, 10))
  sns.heatmap(confusion_with_o, annot=True, fmt='d', xticklabels=plot_labels, yticklabels=plot_labels)
  plt.title(title + " - All Labels")
  plt.ylabel('True Label')
  plt.xlabel('Predicted Label')
  plt.show()
  
  results_values_wo_o = precision_recall_fscore_support(y_true_wo_o, y_pred_wo_o, average='weighted')
  confusion_wo_o = confusion_matrix(y_true_wo_o, y_pred_wo_o)

  plt.figure(figsize=(10, 10))
  sns.heatmap(confusion_wo_o, annot=True, fmt='d', xticklabels=plot_labels, yticklabels=plot_labels)
  plt.title(title + " - Without 'O' Label")
  plt.ylabel('True Label')
  plt.xlabel('Predicted Label')
  plt.show()
  
  
  results['RECALL'] = results_values_with_o[0]
  results['PERCISION'] = results_values_with_o[1]
  results['F1'] = results_values_with_o[2]
  results['RECALL_WO_O'] = results_values_wo_o[0]
  results['PERCISION_WO_O'] = results_values_wo_o[1]
  results['F1_WO_O'] = results_values_wo_o[2]
# TO DO ----------------------------------------------------------------------
  return results

In [None]:
results_dev = evaluate(model, 'Dev Set', dl_dev, vocab)

## Step 2: Train & Evaluate on Dev Set

Train and evaluate (on the dev set) a few models, all with `embedding_size=300` and `N_EPOCHS=5` (for fairness and computational reasons), and with the following hyper parameters (you may use that as captions for the models as well):

- Model 1: (hidden_size: 500, n_layers: 1, directions: 1)
- Model 2: (hidden_size: 500, n_layers: 2, directions: 1)
- Model 3: (hidden_size: 500, n_layers: 3, directions: 1)
- Model 4: (hidden_size: 500, n_layers: 1, directions: 2)
- Model 5: (hidden_size: 500, n_layers: 2, directions: 2)
- Model 6: (hidden_size: 500, n_layers: 3, directions: 2)
- Model 7: (hidden_size: 800, n_layers: 1, directions: 2)
- Model 8: (hidden_size: 800, n_layers: 2, directions: 2)
- Model 9: (hidden_size: 800, n_layers: 3, directions: 2)




In [None]:
N_EPOCHS = 5
EMB_DIM = 300

Here is an example (random numbers) of the display of the results):

In [None]:
# Example:
results_acc = np.random.rand(9, 10)
columns = ['N_MODEL','HIDDEN_SIZE','N_LAYERS','DIRECTIONS','RECALL','PERCISION','F1','RECALL_WO_O','PERCISION_WO_O','F1_WO_O']
df = pd.DataFrame(results_acc, columns=columns)
df.N_MODEL = [f'model_{n}' for n in range(1,10)]
print(tabulate(df, headers='keys', tablefmt='psql',floatfmt=".4f"))

In [None]:
# Define models with their hyperparameters
models = {
  'Model1': {'embedding_size': EMB_DIM, 'hidden_size': 500, 'n_layers': 1, 'directions': 1},
  'Model2': {'embedding_size': EMB_DIM, 'hidden_size': 500, 'n_layers': 2, 'directions': 1},
  'Model3': {'embedding_size': EMB_DIM, 'hidden_size': 500, 'n_layers': 3, 'directions': 1},
  'Model4': {'embedding_size': EMB_DIM, 'hidden_size': 500, 'n_layers': 1, 'directions': 2},
  'Model5': {'embedding_size': EMB_DIM, 'hidden_size': 500, 'n_layers': 2, 'directions': 2},
  'Model6': {'embedding_size': EMB_DIM, 'hidden_size': 500, 'n_layers': 3, 'directions': 2},
  'Model7': {'embedding_size': EMB_DIM, 'hidden_size': 800, 'n_layers': 1, 'directions': 2},
  'Model8': {'embedding_size': EMB_DIM, 'hidden_size': 800, 'n_layers': 2, 'directions': 2},
  'Model9': {'embedding_size': EMB_DIM, 'hidden_size': 800, 'n_layers': 3, 'directions': 2},
}   

# TO DO ----------------------------------------------------------------------
def evaluate_models(data_loader):
    results = pd.DataFrame(columns=columns)
    for model_name, model_cfg in models.items():
        model = NERNet(input_size=vocab.n_words,output_size=vocab.n_tags, **model_cfg)
        model.to(DEVICE)
        metrics = train_loop(model, n_epochs=N_EPOCHS, dataloader_train=dl_train, dataloader_dev=dl_dev)
        curr_results = evaluate(model, model_name, data_loader, vocab)
        results = pd.concat([results, pd.DataFrame({
            "N_MODEL": model_name,
            "HIDDEN_SIZE": model_cfg['hidden_size'],
            "N_LAYERS": model_cfg['n_layers'],
            "DIRECTIONS": model_cfg['directions'],
            "RECALL": curr_results['RECALL'],
            "PERCISION": curr_results['PERCISION'],
            "F1": curr_results['F1'],
            "RECALL_WO_O": curr_results['RECALL_WO_O'],
            "PERCISION_WO_O": curr_results['PERCISION_WO_O'],
            "F1_WO_O": curr_results['F1_WO_O']
            }, index=[model_name])])
    return results.reset_index(drop=True)


results_dev = evaluate_models(dl_dev)
# TO DO ----------------------------------------------------------------------

# Print results in tabulated format
print(tabulate(results_dev, headers='keys', tablefmt='psql', floatfmt=".4f"))

## Step 3: Evaluate on Test Set
Evaluate your models on the test set and save the results as a CSV. Add this file to your repo for submission.

In [None]:
results = pd.DataFrame(columns=columns)
file_name = "NER_results.csv"
# TO DO ----------------------------------------------------------------------
results = evaluate_models(dl_test)
results.to_csv(file_name, index=False)
# TO DO ----------------------------------------------------------------------
print(tabulate(results, headers='keys', tablefmt='psql',floatfmt=".4f"))


## Step 4 - best model
Decide which model performs the best, write its configuration, train it for 5 more epochs and evaluate it on the test set.

In [None]:
best_model_cfg = {'embedding_size':EMB_DIM, 'hidden_size': -1, 'n_layers': -1, 'directions': -1}
# TO DO ----------------------------------------------------------------------
best_model_row = results.loc[results['F1'].idxmax()]
best_model_cfg ={'embedding_size': EMB_DIM, 'hidden_size': best_model_row['HIDDEN_SIZE'], 'n_layers': best_model_row['N_LAYERS'], 'directions': best_model_row['DIRECTIONS']}
best_model = NERNet(input_size=vocab.n_words, embedding_size=EMB_DIM, hidden_size=best_model_cfg['hidden_size'], output_size=vocab.n_tags, n_layers=best_model_cfg['n_layers'], directions=best_model_cfg['directions'])
best_model.to(DEVICE)
metrics = train_loop(best_model, n_epochs=N_EPOCHS + 5, dataloader_train=dl_train, dataloader_dev=dl_dev)
results_test = evaluate(best_model, 'Test Set', dl_test, vocab)
# TO DO ----------------------------------------------------------------------

<br><br><br><br><br>

# Part 4 - Pretrained Embeddings



To prepare for this task, please read [this discussion](https://discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222).

**TIP**: Ensure that the vectors are aligned with the IDs in your vocabulary. In other words, make sure that the word with ID 0 corresponds to the first vector in the GloVe matrix used to initialize `nn.Embedding`.



## Step 1: Get Data



Download the GloVe embeddings from [this link](https://nlp.stanford.edu/projects/glove/). Use the 300-dimensional vectors from `glove.6B.zip`.



In [None]:
# TO DO ----------------------------------------------------------------------

# !wget -q --no-check-certificate 'https://nlp.stanford.edu/data/glove.6B.zip' -O glove.6B.zip
# mac version:
!curl -L -o glove.6B.zip 'https://nlp.stanford.edu/data/glove.6B.zip'


!unzip -q glove.6B.zip

# TO DO ----------------------------------------------------------------------

## Step 2: Inject Embeddings

Then intialize the `nn.Embedding` module in your `NERNet` with these embeddings, so that you can start your training with pre-trained vectors.

In [None]:
def get_emb_matrix(filepath: str, vocab: Vocab) -> np.ndarray:
  emb_matrix = np.zeros((len(vocab.word2id), 300))
  # TO DO ----------------------------------------------------------------------

  # TO DO ----------------------------------------------------------------------
  return emb_matrix

In [None]:
def initialize_from_pretrained_emb(model: NERNet, emb_matrix: np.ndarray):
  """
  Inject the pretrained embeddings into the model.
  :param model: model instance
  :param emb_matrix: pretrained embeddings
  """
  # TO DO ----------------------------------------------------------------------

  # TO DO ----------------------------------------------------------------------

In [None]:
# Read embeddings and inject them to a model
emb_file = 'glove.6B.300d.txt'
emb_matrix = get_emb_matrix(emb_file, vocab)
ner_glove = NERNet(input_size=VOCAB_SIZE, embedding_size=EMB_DIM, hidden_size=500, output_size=NUM_TAGS, n_layers=1, directions=1)
initialize_from_pretrained_emb(ner_glove, emb_matrix)

## Step 3: Evaluate on Test Set

Same as the evaluation process before, please display:

1. Print a `RECALL-PERCISION-F1` scores in a tabulate format.
2. Display a `confusion matrix` plot: where the predicted labels are the rows, and the true labels are the columns.

Make sure to use the title for the plot, axis names, and the names of the tags on the X-axis.

Make sure to download and upload this CSV as well.

In [None]:
results = pd.DataFrame(columns=columns)
file_name = "NER_results_glove.csv"
# TO DO ----------------------------------------------------------------------

# TO DO ----------------------------------------------------------------------
print(tabulate(results, headers='keys', tablefmt='psql',floatfmt=".4f"))

## Step 4 - best model
Decide which model performs the best, write its configuration, train it for 5 more epochs and evaluate it on the test set.

In [None]:
best_model_glove_cfg = {'embedding_size':EMB_DIM, 'hidden_size': -1, 'n_layers': -1, 'directions': -1}
# TO DO ----------------------------------------------------------------------

# TO DO ----------------------------------------------------------------------

# Testing
Copy the content of the **tests.py** file from the repo and paste below. This will create the results.json file and download it to your machine.

In [None]:
import json
####################
# PLACE TESTS HERE #
train = read_data("data/train.txt")
dev = read_data("data/dev.txt")
test_set = read_data("data/test.txt")
def test_read_data():
    result = {
        'lengths': (len(train), len(dev), len(test_set)),
    }
    return result

vocab = Vocab(train)
def test_vocab():
    sent = vocab.index_words(["I", "am", "Spongebob"])
    return {
        'length': vocab.n_words,
        'tag2id_length': len(vocab.tag2id),
        "Spongebob": sent[2]
    }

train_sequences = prepare_data(train, vocab)
dev_sequences = prepare_data(dev, vocab)
test_sequences = prepare_data(test_set, vocab)

def test_count_oov():
    return {
        'dev_oov': count_oov(dev_sequences),
        'test_oov': count_oov(test_sequences)
    }

BATCH_SIZE = 16
dl_train = prepare_data_loader(train_sequences, batch_size=BATCH_SIZE)
dl_dev = prepare_data_loader(dev_sequences, batch_size=BATCH_SIZE, train=False)
dl_test = prepare_data_loader(test_sequences, batch_size=BATCH_SIZE, train=False)

def test_prepare_data_loader():
    return {
        'lengths': (len(dl_train), len(dl_dev), len(dl_test))
    }


def test_NERNet():
    # Extract best model configuration
    hidden_size = best_model_cfg['hidden_size']
    n_layers = best_model_cfg['n_layers']
    directions = best_model_cfg['directions']

    # Create model
    best_model = NERNet(vocab.n_words, embedding_size=300, hidden_size=hidden_size, n_layers=n_layers, directions=directions, output_size=vocab.n_tags)
    best_model.to(DEVICE)

    # Train model and evaluate
    _ = train_loop(model, n_epochs=10, dataloader_train=dl_train, dataloader_dev=dl_dev)
    results = evaluate(model, title="", dataloader=dl_test, vocab=vocab)

    return {
        'f1': results['F1'],
        'f1_wo_o': results['F1_WO_O'],
    }
    
def test_glove():
    # Get embeddings
    emb_file = 'glove.6B.300d.txt'
    emb_matrix = get_emb_matrix(emb_file, vocab)

    # Extract best model configuration
    hidden_size = best_model_glove_cfg['hidden_size']
    n_layers = best_model_glove_cfg['n_layers']
    directions = best_model_glove_cfg['directions']

    # Create model
    best_model = NERNet(vocab.n_words, embedding_size=300, hidden_size=hidden_size, output_size=vocab.n_tags, n_layers=n_layers, directions=directions)
    best_model.to(DEVICE)
    initialize_from_pretrained_emb(ner_glove, emb_matrix)

    # Train model and evaluate
    _ = train_loop(model, n_epochs=10, dataloader_train=dl_train, dataloader_dev=dl_dev)
    results = evaluate(model, title="", dataloader=dl_test, vocab=vocab)

    return {
        'f1': results['F1'],
        'f1_wo_o': results['F1_WO_O'],
    }

TESTS = [
    test_read_data,
    test_vocab,
    test_count_oov,
    test_prepare_data_loader,
    test_NERNet,
    test_glove
]

# Run tests and save results
res = {}
for test in TESTS:
    try:
        cur_res = test()
        res.update({test.__name__: cur_res})
    except Exception as e:
        res.update({test.__name__: repr(e)})

with open('results.json', 'w') as f:
    json.dump(res, f, indent=2)
