# Named Entity Recognition with Neural Networks

In this assignment you will build a full training and testing pipeline for a neural sequential tagger for named entities, using RNN / LSTM.

The dataset that you will be working on is called ReCoNLL 2003, which is a corrected version of the CoNLL 2003 dataset: https://www.clips.uantwerpen.be/conll2003/ner/

[Train data](https://drive.google.com/file/d/1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz/view?usp=sharing)

[Dev data](https://drive.google.com/file/d/1EAF-VygYowU1XknZhvzMi2CID65I127L/view?usp=sharing)

[Test data](https://drive.google.com/file/d/16gug5wWnf06JdcBXQbcICOZGZypgr4Iu/view?usp=sharing)

As you can see, the annotated texts are labeled according to the IOB annotation scheme, for 3 entity types: Person, Organization, Location.
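
To make the IOB scheme concrete, here is a minimal sketch that groups a tag sequence into entity spans. The helper name `iob_to_spans` is hypothetical and not part of the assignment. It assumes tags of the form `B-X`, `I-X`, or `O`, and treats a stray `I-X` with no preceding `B-X` as the start of a new span.

```python
def iob_to_spans(tags):
    """Collect (entity_type, start, end) spans from IOB tags; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag of the same entity type
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        # Open a new span on "B-", or on an "I-" tag when no span is open
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans

tags = ["B-PER", "O", "O", "B-PER", "I-PER", "O"]
print(iob_to_spans(tags))  # [('PER', 0, 1), ('PER', 3, 5)]
```

A `B-` tag opens an entity and the following `I-` tags of the same type extend it, so `B-PER I-PER` is one two-token person name.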

In [1]:
import re
import urllib.request
from random import sample

# External library imports
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
from sklearn import metrics
from IPython.display import display, HTML

# Additional imports
from itertools import zip_longest
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score
import matplotlib.pyplot as plt

# Set device to GPU if available, else CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

`load_dataset` reads the data from a single file (one of those provided above). It receives a file path and encodes every sentence individually as a list of `[word, tag]` pairs. Each sentence list is appended to a general list (`dataset`), which the function returns.

This function downloads the file from a remote drive. To read the files locally, adjust the function.

In [2]:
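def load_dataset(filepath):
    """Download a Google Drive file and parse it into sentences of [word, tag] pairs."""
    dataset = []
    # Turn the sharing URL into a direct-download link
    file_id = re.search(r'https://drive\.google\.com/file/d/(.*)/view', filepath).group(1)
    download_link = "https://drive.google.com/uc?export=download&id=" + file_id
    temp_path = 'temp.txt'
    urllib.request.urlretrieve(download_link, temp_path)
    current_sentence = []
    with open(temp_path) as file:
        for line in file:
            if line == "\n":
                # A blank line marks the end of a sentence
                dataset.append(current_sentence)
                current_sentence = []
            else:
                current_sentence.append(line.split())
    # Keep the final sentence if the file does not end with a blank line
    if current_sentence:
        dataset.append(current_sentence)
    return dataset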

train_data = load_dataset('https://drive.google.com/file/d/1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz/view?usp=sharing')
dev_data = load_dataset('https://drive.google.com/file/d/1EAF-VygYowU1XknZhvzMi2CID65I127L/view?usp=sharing')
test_data = load_dataset('https://drive.google.com/file/d/16gug5wWnf06JdcBXQbcICOZGZypgr4Iu/view?usp=sharing')
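
For reading the files locally instead of from the remote drive, a minimal sketch could look like the following. The name `load_dataset_local` is hypothetical. It assumes the file has one whitespace-separated `word tag` pair per line and a blank line between sentences, as in the output shown for the remote loader.

```python
def load_dataset_local(path):
    """Read a CoNLL-style file into a list of sentences of [word, tag] pairs."""
    dataset, current_sentence = [], []
    with open(path, encoding="utf-8") as file:
        for line in file:
            parts = line.split()
            if not parts:
                # A blank line closes the current sentence (skip repeated blanks)
                if current_sentence:
                    dataset.append(current_sentence)
                    current_sentence = []
            else:
                current_sentence.append(parts)
    # Keep the last sentence when the file lacks a trailing blank line
    if current_sentence:
        dataset.append(current_sentence)
    return dataset
```

Calling it with the path of a downloaded copy of the training file should return the same structure as the remote loader, assuming the file contents are identical.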

In [3]:
train_data[90]

[['Rubin', 'B-PER'],
 ["'s", 'O'],
 ['misfortune', 'O'],
 ['turned', 'O'],
 ['into', 'O'],
 ['a', 'O'],
 ['very', 'O'],
 ['lucky', 'O'],
 ['break', 'O'],
 ['for', 'O'],
 ['eighth-seeded', 'O'],
 ['Olympic', 'O'],
 ['champion', 'O'],
 ['Lindsay', 'B-PER'],
 ['Davenport', 'I-PER'],
 ['.', 'O']]

Note that each entry in the data is a list of `[word, tag]` pairs.

The following `Vocabulary` class serves as a dictionary that maps words and tags to IDs. `UNKNOWN_TOKEN` should be used for words that are not part of the training data.

In [4]:
UNKNOWN_TOKEN = 0

class Vocabulary:
    def __init__(self):
        self.word_to_index = {"__unk__": UNKNOWN_TOKEN}
        self.index_to_word = {UNKNOWN_TOKEN: "__unk__"}
        self.word_count = 1

        self.tag_to_index = {"O":0, "B-PER":1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4, "B-ORG": 5, "I-ORG": 6}
        self.index_to_tag = {0:"O", 1:"B-PER", 2:"I-PER", 3:"B-LOC", 4:"I-LOC", 5:"B-ORG", 6:"I-ORG"}

    def get_word_indices(self, words):
        word_indices = [self.get_word_index(w) for w in words]
        return word_indices

    def get_tag_indices(self, tags):
        tag_indices = [self.tag_to_index[t] for t in tags]
        return tag_indices

    def get_word_index(self, w):
        if w not in self.word_to_index:
            self.word_to_index[w] = self.word_count
            self.index_to_word[self.word_count] = w
            self.word_count += 1
        return self.word_to_index[w]
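
Note that `get_word_index` adds every word it has not seen before, so it never returns `UNKNOWN_TOKEN` by itself. A minimal sketch of a read-only lookup that maps unseen words to the unknown ID is shown below. The function name `lookup_word_indices` and the toy vocabulary are assumptions for illustration, not part of the class above.

```python
UNKNOWN_TOKEN = 0

def lookup_word_indices(words, word_to_index, unk_index=UNKNOWN_TOKEN):
    """Map words to IDs without growing the vocabulary; unseen words get unk_index."""
    return [word_to_index.get(w, unk_index) for w in words]

# Toy vocabulary standing in for one built from the training data only
word_to_index = {"__unk__": 0, "Lindsay": 1, "Davenport": 2, "won": 3}
print(lookup_word_indices(["Lindsay", "Davenport", "lost"], word_to_index))  # [1, 2, 0]
```

A lookup like this would be used for the dev and test sets, after the vocabulary has been built from the training set.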

## To do
Write a function `prepare_data` that takes a dataset (train, dev, or test) and a Vocab instance as inputs. This function should convert each pair of (words, tags) into a pair of corresponding indexes using the Vocab instance. Each indexed pair should be added to data_sequences, which will be returned by the function.

For the previous training instance, the answer should look like this:

```
train_sequences[90] =

[(799, 1),
 (163, 0),
 (800, 0),
 (604, 0),
 (801, 0),
 (65, 0),
 (802, 0),
 (803, 0),
 (804, 0),
 (29, 0),
 (805, 0),
 (806, 0),
 (255, 0),
 (807, 1),
 (808, 2),
 (24, 0)]
```

In [5]:
def preprocess_data(data, vocab):
    """Convert each sentence into a list of (word_id, tag_id) pairs."""
    processed_sequences = []
    for sentence in data:
        words = [pair[0] for pair in sentence]
        tags = [pair[1] for pair in sentence]

        # get_word_indices registers unseen words in the vocabulary as a side effect
        word_indices = vocab.get_word_indices(words)
        tag_indices = vocab.get_tag_indices(tags)
        processed_sequences.append(list(zip(word_indices, tag_indices)))

    return processed_sequences, vocab

vocab = Vocabulary()

# Note: words from the dev and test sets are also added to the vocabulary here,
# so UNKNOWN_TOKEN is never assigned to any word.
train_sequences, vocab = preprocess_data(train_data, vocab)
dev_sequences, vocab = preprocess_data(dev_data, vocab)
test_sequences, vocab = preprocess_data(test_data, vocab)

In [6]:
train_sequences[90]

[(799, 1),
 (163, 0),
 (800, 0),
 (604, 0),
 (801, 0),
 (65, 0),
 (802, 0),
 (803, 0),
 (804, 0),
 (29, 0),
 (805, 0),
 (806, 0),
 (255, 0),
 (807, 1),
 (808, 2),
 (24, 0)]

Let's Gooooo

## To do

Write NERNet, a PyTorch Module for labeling words with NER tags.

*input_size:* the size of the vocabulary

*embedding_size:* the size of the embeddings

*hidden_size:* the hidden size

*output_size:* the number of tags we are predicting

*n_layers:* the number of layers we want to use

The input for your forward function should be a single sentence tensor.

You are free to experiment with additional architectures such as LSTM.

In [7]:
class BiLSTMNERModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, tagset_size, num_layers):
        super(BiLSTMNERModel, self).__init__()
        self.embedding_layer = nn.Embedding(vocab_size, embed_size)
        # batch_first only affects batched (3-D) input; a single unbatched sentence is 2-D
        self.bilstm_layer = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated, hence hidden_size * 2
        self.linear_layer = nn.Linear(hidden_size * 2, tagset_size)

    def forward(self, input_seq):
        # input_seq is a single sentence of word IDs with shape (seq_length,)
        embedded_seq = self.embedding_layer(input_seq)  # (seq_length, embed_size)
        lstm_output, _ = self.bilstm_layer(embedded_seq)  # (seq_length, hidden_size * 2)
        output = self.linear_layer(lstm_output)  # (seq_length, tagset_size)
        return output
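
As a quick shape check for the single-sentence forward pass, the following sketch pushes a toy sentence through the same three layer types. The sizes (vocabulary of 20, embedding size 8, hidden size 16, 7 tags) are arbitrary assumptions, and passing an unbatched 2-D input to `nn.LSTM` assumes a reasonably recent PyTorch version.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(20, 8)                      # toy vocabulary of 20 words
bilstm = nn.LSTM(8, 16, num_layers=1, bidirectional=True)
linear = nn.Linear(16 * 2, 7)                        # 7 NER tags

sentence = torch.tensor([3, 7, 1, 12])               # one sentence of 4 word IDs
with torch.no_grad():
    embedded = embedding(sentence)                   # (4, 8)
    lstm_out, _ = bilstm(embedded)                   # (4, 32): both directions concatenated
    scores = linear(lstm_out)                        # (4, 7): one score per tag per word
predicted = scores.argmax(dim=1)                     # (4,): one tag ID per word

print(tuple(scores.shape), tuple(predicted.shape))   # (4, 7) (4,)
```

Each row of `scores` holds the unnormalized tag scores for one word, which is the layout `CrossEntropyLoss` expects for an unbatched sequence.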

## To do
Write a training loop that takes a model (an instance of NERNet) and the number of epochs to train for. The loss is always CrossEntropyLoss and the optimizer is always Adam.

In [8]:
def training_loop(model, epochs):
    loss_function = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.0001)
    loss_history = []
    model.train()

    for epoch in range(1, epochs + 1):
        epoch_loss = 0.0
        for sentence in train_sequences:
            # Build the tensors on the same device as the model
            inputs = torch.tensor([word_id for word_id, _ in sentence], device=device)
            targets = torch.tensor([tag_id for _, tag_id in sentence], device=device)

            optimizer.zero_grad()
            predictions = model(inputs)  # (seq_length, tagset_size)
            loss = loss_function(predictions, targets)

            # Compute gradients and update parameters
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()

        # Record the mean per-sentence loss of the epoch, not only the last sentence's loss
        loss_history.append(epoch_loss / len(train_sequences))

    return loss_history
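
To illustrate what `CrossEntropyLoss` computes for one sentence, the sketch below compares it with a manual calculation: the mean negative log-softmax score of the correct tag over all words. The logits and targets are made-up toy values.

```python
import torch
import torch.nn as nn

# Toy scores for a 2-word sentence and 3 tags: shape (seq_length, num_tags)
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 1.5, 0.3]])
targets = torch.tensor([0, 1])  # gold tag ID per word: shape (seq_length,)

loss = nn.CrossEntropyLoss()(logits, targets)

# Manual version: pick the log-probability of the gold tag for each word, then average
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(torch.isclose(loss, manual))  # tensor(True)
```

Because the loss takes raw scores, the model's final linear layer does not need a softmax of its own.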

## To do

Write an evaluation loop for a trained model, using the dev and test datasets. This function prints the precision, recall, F1, and support of each label separately (7 labels in total), and of all 6 labels (excluding O) together. The caption argument should be printed as a prefix of the output.

In [9]:
def evaluation_loop(model, caption):
    label_ids = [0, 1, 2, 3, 4, 5, 6]
    label_names = ['O', 'B-PER', 'I-PER', 'B-LOC', 'I-LOC', 'B-ORG', 'I-ORG']
    model.eval()  # disable training-only behavior such as dropout

    for name, data in [('dev', dev_sequences), ('test', test_sequences)]:
        all_predictions = []
        all_targets = []
        with torch.no_grad():  # gradients are not needed for evaluation
            for sentence in data:
                inputs = torch.tensor([word_id for word_id, _ in sentence], device=device)
                predictions = model(inputs)
                predicted_tags = torch.argmax(predictions, dim=1)

                # Move predictions to the CPU before handing them to scikit-learn
                all_predictions.extend(predicted_tags.cpu().tolist())
                all_targets.extend(tag_id for _, tag_id in sentence)

        print(f"{caption} - {name} Data:")
        print("Detailed Metrics by Label:")
        print(classification_report(all_targets, all_predictions, labels=label_ids,
                                    target_names=label_names, zero_division=0))

        report = classification_report(all_targets, all_predictions, labels=label_ids,
                                       target_names=label_names, output_dict=True, zero_division=0)

        # Support-weighted average of precision, recall, and F1-score over the 6 entity labels (excluding 'O')
        labels_to_consider = label_names[1:]

        precision_sum = 0
        recall_sum = 0
        f1_sum = 0
        total_support = 0

        for label in labels_to_consider:
            precision_sum += report[label]['precision'] * report[label]['support']
            recall_sum += report[label]['recall'] * report[label]['support']
            f1_sum += report[label]['f1-score'] * report[label]['support']
            total_support += report[label]['support']

        avg_precision = precision_sum / total_support
        avg_recall = recall_sum / total_support
        avg_f1 = f1_sum / total_support

        print("Aggregate Metrics (excluding 'O'):")
        print(f"Precision: {avg_precision:.2f}")
        print(f"Recall: {avg_recall:.2f}")
        print(f"F1-Score: {avg_f1:.2f}")
        print(f"Support: {total_support}\n")
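
The support-weighted aggregate over the entity labels can also be obtained directly from scikit-learn by restricting the `labels` argument. The sketch below uses made-up toy predictions with only three entity tag IDs; it is an illustration of the idea, not a replacement for the function above.

```python
from sklearn.metrics import precision_recall_fscore_support

# Toy tag IDs: 0 stands for 'O', 1-3 stand for entity tags
y_true = [0, 1, 2, 0, 3, 0, 1]
y_pred = [0, 1, 0, 0, 3, 3, 1]

# Leaving 0 out of `labels` excludes 'O' from the averaged scores
entity_labels = [1, 2, 3]
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=entity_labels, average="weighted", zero_division=0
)
print(f"Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")
```

With `average="weighted"`, each label's score is weighted by its support, which matches the manual loop over `labels_to_consider`.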


## To do:
Train and evaluate at least 5 models.

The hyperparameters you can choose are the embedding size, the hidden size and the number of layers in the network.

After training, we will use pre-trained GloVe embeddings and see if it affects the performance. Make sure one of the models has an embedding size of 50 so you would be able to load the pre-trained embeddings (if you want, you can go up to 300 dimensions for the pre-trained embeddings).

If you are using a CPU, you should keep the embedding size, the hidden size and the number of layers small (50,<100,1-2).

If you have a GPU or are using colab, you can easily train larger networks (300, >500, increasing the number of layers won't be as effective so you can still stay at 1-2).

In [10]:
# Adjusted hyperparameters for CPU
cpu_hyperparameters = [
    {'embed_size': 50, 'hidden_size': 64, 'num_layers': 1},
    {'embed_size': 100, 'hidden_size': 64, 'num_layers': 1},
    {'embed_size': 50, 'hidden_size': 32, 'num_layers': 2},
    {'embed_size': 100, 'hidden_size': 32, 'num_layers': 2},
    {'embed_size': 50, 'hidden_size': 64, 'num_layers': 2}
]

# Train and evaluate models
for idx, params in enumerate(cpu_hyperparameters):
    vocab_size = vocab.word_count  # Size of the vocabulary
    embed_size = params['embed_size']  # Size of the word embeddings
    hidden_size = params['hidden_size']  # Size of the hidden state in the RNN
    tagset_size = 7  # Number of NER tags
    num_layers = params['num_layers']  # Number of layers in the RNN
    epochs = 10  # Number of epochs to train for

    model = BiLSTMNERModel(vocab_size, embed_size, hidden_size, tagset_size, num_layers).to(device)
    print(f"Training model {idx+1} with hyperparameters: {params}")
    training_loop(model, epochs)
    
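    # Note: evaluation_loop reports both the dev and test splits on every call,
    # so the two calls below print each report twice under different captions.
    # Evaluate on Dev set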
    print(f"Evaluating model {idx+1} on Dev set")
    evaluation_loop(model, f"Model {idx+1} - Dev")
    
    # Evaluate on Test set
    print(f"Evaluating model {idx+1} on Test set")
    evaluation_loop(model, f"Model {idx+1} - Test")

Training model 1 with hyperparameters: {'embed_size': 50, 'hidden_size': 64, 'num_layers': 1}
Evaluating model 1 on Dev set
Model 1 - Dev - dev Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.88      0.98      0.92      3096
       B-PER       0.65      0.46      0.54       200
       I-PER       0.72      0.48      0.58       157
       B-LOC       0.74      0.49      0.59       183
       I-LOC       1.00      0.04      0.08        23
       B-ORG       0.53      0.39      0.45       168
       I-ORG       0.38      0.04      0.08       116

    accuracy                           0.85      3943
   macro avg       0.70      0.41      0.46      3943
weighted avg       0.83      0.85      0.83      3943

Aggregate Metrics (excluding 'O'):
Precision: 0.63
Recall: 0.39
F1-Score: 0.46
Support: 847

Model 1 - Dev - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.89 

Model 3 - Dev - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.90      0.97      0.93      6567
       B-PER       0.69      0.54      0.61       434
       I-PER       0.68      0.60      0.64       296
       B-LOC       0.64      0.53      0.58       343
       I-LOC       0.00      0.00      0.00        53
       B-ORG       0.67      0.39      0.49       350
       I-ORG       0.59      0.13      0.21       200

    accuracy                           0.87      8243
   macro avg       0.59      0.45      0.49      8243
weighted avg       0.84      0.87      0.85      8243

Aggregate Metrics (excluding 'O'):
Precision: 0.64
Recall: 0.45
F1-Score: 0.52
Support: 1676

Evaluating model 3 on Test set


Model 3 - Test - dev Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.89      0.97      0.93      3096
       B-PER       0.60      0.49      0.54       200
       I-PER       0.67      0.61      0.64       157
       B-LOC       0.69      0.52      0.59       183
       I-LOC       0.00      0.00      0.00        23
       B-ORG       0.64      0.39      0.48       168
       I-ORG       0.52      0.09      0.16       116

    accuracy                           0.86      3943
   macro avg       0.57      0.44      0.48      3943
weighted avg       0.83      0.86      0.84      3943

Aggregate Metrics (excluding 'O'):
Precision: 0.61
Recall: 0.43
F1-Score: 0.49
Support: 847



Model 3 - Test - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.90      0.97      0.93      6567
       B-PER       0.69      0.54      0.61       434
       I-PER       0.68      0.60      0.64       296
       B-LOC       0.64      0.53      0.58       343
       I-LOC       0.00      0.00      0.00        53
       B-ORG       0.67      0.39      0.49       350
       I-ORG       0.59      0.13      0.21       200

    accuracy                           0.87      8243
   macro avg       0.59      0.45      0.49      8243
weighted avg       0.84      0.87      0.85      8243

Aggregate Metrics (excluding 'O'):
Precision: 0.64
Recall: 0.45
F1-Score: 0.52
Support: 1676

Training model 4 with hyperparameters: {'embed_size': 100, 'hidden_size': 32, 'num_layers': 2}


Evaluating model 4 on Dev set
Model 4 - Dev - dev Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.91      0.96      0.94      3096
       B-PER       0.66      0.56      0.61       200
       I-PER       0.71      0.62      0.66       157
       B-LOC       0.72      0.63      0.67       183
       I-LOC       1.00      0.04      0.08        23
       B-ORG       0.59      0.49      0.53       168
       I-ORG       0.48      0.28      0.36       116

    accuracy                           0.87      3943
   macro avg       0.72      0.51      0.55      3943
weighted avg       0.86      0.87      0.86      3943

Aggregate Metrics (excluding 'O'):
Precision: 0.65
Recall: 0.52
F1-Score: 0.57
Support: 847

Model 4 - Dev - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.92      0.97      0.94      6567
       B-PER       0.74      0.59      0.66       434
       I-P

## Best Model

### Hyperparameters
- Bidirectional LSTM
- Embedding Size: 50
- Hidden Size: 64
- Number of Layers: 2

### Evaluation Metrics on Dev Set

#### Metrics for Each Label Separately
| Label  | Precision | Recall | F1-Score | Support |
|--------|-----------|--------|----------|---------|
| O      | 0.90      | 0.97   | 0.94     | 3096    |
| B-PER  | 0.73      | 0.57   | 0.64     | 200     |
| I-PER  | 0.85      | 0.64   | 0.73     | 157     |
| B-LOC  | 0.79      | 0.55   | 0.65     | 183     |
| I-LOC  | 1.00      | 0.13   | 0.23     | 23      |
| B-ORG  | 0.65      | 0.46   | 0.54     | 168     |
| I-ORG  | 0.51      | 0.26   | 0.34     | 116     |

#### Overall Metrics
- **Accuracy**: 0.87 (3943 samples)
- **Macro Average**:
  - Precision: 0.77
  - Recall: 0.51
  - F1-Score: 0.58
- **Weighted Average**:
  - Precision: 0.86
  - Recall: 0.87
  - F1-Score: 0.86
- **Metrics combined (excluding 'O')**:
  - Precision: 0.72
  - Recall: 0.50
  - F1-Score: 0.59
  - Support: 847

---

### Evaluation Metrics on Test Set

#### Metrics for Each Label Separately
| Label  | Precision | Recall | F1-Score | Support |
|--------|-----------|--------|----------|---------|
| O      | 0.90      | 0.97   | 0.94     | 6567    |
| B-PER  | 0.76      | 0.55   | 0.64     | 434     |
| I-PER  | 0.77      | 0.63   | 0.69     | 296     |
| B-LOC  | 0.73      | 0.56   | 0.63     | 343     |
| I-LOC  | 1.00      | 0.25   | 0.39     | 53      |
| B-ORG  | 0.69      | 0.43   | 0.53     | 350     |
| I-ORG  | 0.43      | 0.24   | 0.31     | 200     |

#### Overall Metrics
- **Accuracy**: 0.88 (8243 samples)
- **Macro Average**:
  - Precision: 0.75
  - Recall: 0.52
  - F1-Score: 0.59
- **Weighted Average**:
  - Precision: 0.86
  - Recall: 0.88
  - F1-Score: 0.86
- **Metrics combined (excluding 'O')**:
  - Precision: 0.71
  - Recall: 0.50
  - F1-Score: 0.58
  - Support: 1676

## To do
Download the GloVe embeddings from https://nlp.stanford.edu/projects/glove/ (use the 50-dim vectors from glove.6B.zip for shorter training, 300-dim for maximum performance). Then initialize the nn.Embedding module in your NERNet with these embeddings, so that you can start your training with pre-trained vectors. Repeat the previous part with the same hyperparameters and print the results for each model.

Note: make sure that the vectors are aligned with the IDs in your Vocab. For example, the word with ID 0 must correspond to the first vector in the GloVe matrix that you initialize nn.Embedding with. For a discussion on how to do that, check this link:
https://discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222

In [11]:
def fetch_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            word = parts[0]
            coefs = np.asarray(parts[1:], dtype='float32')
            embeddings[word] = coefs
    return embeddings

# Load the 50-dim and 100-dim embeddings
glove_50d = fetch_glove_embeddings('glove.6B/glove.6B.50d.txt')
glove_100d = fetch_glove_embeddings('glove.6B/glove.6B.100d.txt')

def construct_embedding_matrix(glove_embeds, vocab, embed_dim):
    matrix = np.zeros((vocab.word_count, embed_dim))
    for word, idx in vocab.word_to_index.items():
        vector = glove_embeds.get(word)
        if vector is not None:
            matrix[idx] = vector
        else:
            # Initialize with random embedding if word is not in GloVe
            matrix[idx] = np.random.normal(scale=0.6, size=(embed_dim,))
    return matrix

embedding_matrix_100d = construct_embedding_matrix(glove_100d, vocab, 100)
embedding_matrix_50d = construct_embedding_matrix(glove_50d, vocab, 50)
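
The `glove.6B` vectors are uncased, so a capitalized vocabulary word such as a name at the start of a sentence will not be found by an exact-match lookup. A minimal sketch of a lookup with a lowercase fallback is shown below. The helper `lookup_vector` and the toy data are assumptions for illustration; the results reported in this notebook were produced with the exact-match lookup above.

```python
import numpy as np

def lookup_vector(word, glove_embeds):
    """Return the GloVe vector for word, retrying with the lowercased form."""
    vector = glove_embeds.get(word)
    if vector is None:
        vector = glove_embeds.get(word.lower())
    return vector

# Toy stand-ins for the GloVe dictionary and the vocabulary mapping
glove_toy = {"paris": np.array([0.4, 0.5], dtype="float32")}
word_to_index = {"__unk__": 0, "Paris": 1, "Zzyzx": 2}

matrix = np.zeros((len(word_to_index), 2), dtype="float32")
for word, idx in word_to_index.items():
    vector = lookup_vector(word, glove_toy)
    if vector is not None:
        matrix[idx] = vector  # row idx holds the vector of the word with ID idx

print(matrix[word_to_index["Paris"]])
```

Writing each vector into row `idx` is what keeps the matrix aligned with the vocabulary IDs, regardless of the order in which GloVe lists its words.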

In [12]:
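class BiLSTMNERPreTrained(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, tagset_size, num_layers, embedding_matrix):
        super(BiLSTMNERPreTrained, self).__init__()
        # from_pretrained freezes the embedding weights by default (freeze=True),
        # so the GloVe vectors are not updated during training.
        # vocab_size is implied by embedding_matrix; embed_size must match its second dimension.
        self.embedding_layer = nn.Embedding.from_pretrained(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.bilstm_layer = nn.LSTM(embed_size, hidden_size, num_layers, bidirectional=True)
        self.linear_layer = nn.Linear(hidden_size * 2, tagset_size)

    def forward(self, input_seq):
        # input_seq is a single sentence of word IDs with shape (seq_length,)
        embedded_seq = self.embedding_layer(input_seq)  # (seq_length, embed_size)
        lstm_output, _ = self.bilstm_layer(embedded_seq)  # (seq_length, hidden_size * 2)
        output = self.linear_layer(lstm_output)  # (seq_length, tagset_size)
        return output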
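
Whether the pre-trained vectors are fine-tuned depends on the `freeze` argument of `nn.Embedding.from_pretrained`. The sketch below, using made-up toy weights, shows the difference between the default and `freeze=False`.

```python
import torch
import torch.nn as nn

# Toy pre-trained weights: 2 words, 2 dimensions
weights = torch.tensor([[0.0, 0.0], [0.4, 0.5]])

frozen = nn.Embedding.from_pretrained(weights)                   # freeze=True by default
trainable = nn.Embedding.from_pretrained(weights, freeze=False)  # vectors receive gradients

print(frozen.weight.requires_grad, trainable.weight.requires_grad)  # False True
```

With the default, the optimizer leaves the embeddings untouched and only the LSTM and linear layers are trained. Passing `freeze=False` would let training adapt the vectors to the NER task.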

In [13]:
# Train and evaluate models with pre-trained GloVe embeddings
for idx, params in enumerate(cpu_hyperparameters):
    vocab_size = vocab.word_count  # Size of the vocabulary
    embed_size = params['embed_size']  # Size of the word embeddings
    hidden_size = params['hidden_size']  # Size of the hidden state in the RNN
    tagset_size = 7  # Number of NER tags
    num_layers = params['num_layers']  # Number of layers in the RNN
    epochs = 10  # Number of epochs to train for

    if embed_size == 50:
        embedding_matrix = embedding_matrix_50d
    else:
        embedding_matrix = embedding_matrix_100d

    model = BiLSTMNERPreTrained(vocab_size, embed_size, hidden_size, tagset_size, num_layers, embedding_matrix).to(device)
    print(f"Training pre-trained model {idx+1} with hyperparameters: {params}")
    training_loop(model, epochs)
    
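    # Note: evaluation_loop reports both the dev and test splits on every call,
    # so the two calls below print each report twice under different captions.
    # Evaluate on Dev set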
    print(f"Evaluating pre-trained model {idx+1} on Dev set")
    evaluation_loop(model, f"Model {idx+1} - Dev")
    
    # Evaluate on Test set
    print(f"Evaluating pre-trained model {idx+1} on Test set")
    evaluation_loop(model, f"Model {idx+1} - Test")

Training pre-trained model 1 with hyperparameters: {'embed_size': 50, 'hidden_size': 64, 'num_layers': 1}
Evaluating pre-trained model 1 on Dev set
Model 1 - Dev - dev Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.92      0.98      0.95      3096
       B-PER       0.76      0.63      0.69       200
       I-PER       0.77      0.71      0.74       157
       B-LOC       0.70      0.52      0.60       183
       I-LOC       0.00      0.00      0.00        23
       B-ORG       0.56      0.48      0.52       168
       I-ORG       0.45      0.22      0.29       116

    accuracy                           0.88      3943
   macro avg       0.59      0.50      0.54      3943
weighted avg       0.86      0.88      0.87      3943

Aggregate Metrics (excluding 'O'):
Precision: 0.65
Recall: 0.52
F1-Score: 0.57
Support: 847

Model 1 - Dev - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support



Model 3 - Dev - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.94      0.97      0.96      6567
       B-PER       0.71      0.65      0.68       434
       I-PER       0.75      0.81      0.78       296
       B-LOC       0.66      0.58      0.62       343
       I-LOC       0.00      0.00      0.00        53
       B-ORG       0.64      0.46      0.54       350
       I-ORG       0.53      0.37      0.44       200

    accuracy                           0.89      8243
   macro avg       0.60      0.55      0.57      8243
weighted avg       0.88      0.89      0.88      8243

Aggregate Metrics (excluding 'O'):
Precision: 0.65
Recall: 0.57
F1-Score: 0.60
Support: 1676

Evaluating pre-trained model 3 on Test set


Model 3 - Test - dev Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.93      0.98      0.96      3096
       B-PER       0.75      0.71      0.73       200
       I-PER       0.75      0.87      0.80       157
       B-LOC       0.71      0.56      0.63       183
       I-LOC       0.00      0.00      0.00        23
       B-ORG       0.65      0.45      0.53       168
       I-ORG       0.54      0.22      0.31       116

    accuracy                           0.90      3943
   macro avg       0.62      0.54      0.57      3943
weighted avg       0.88      0.90      0.88      3943

Aggregate Metrics (excluding 'O'):
Precision: 0.67
Recall: 0.57
F1-Score: 0.60
Support: 847



Model 3 - Test - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.94      0.97      0.96      6567
       B-PER       0.71      0.65      0.68       434
       I-PER       0.75      0.81      0.78       296
       B-LOC       0.66      0.58      0.62       343
       I-LOC       0.00      0.00      0.00        53
       B-ORG       0.64      0.46      0.54       350
       I-ORG       0.53      0.37      0.44       200

    accuracy                           0.89      8243
   macro avg       0.60      0.55      0.57      8243
weighted avg       0.88      0.89      0.88      8243

Aggregate Metrics (excluding 'O'):
Precision: 0.65
Recall: 0.57
F1-Score: 0.60
Support: 1676

Training pre-trained model 4 with hyperparameters: {'embed_size': 100, 'hidden_size': 32, 'num_layers': 2}


Evaluating pre-trained model 4 on Dev set
Model 4 - Dev - dev Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.94      0.98      0.96      3096
       B-PER       0.75      0.70      0.73       200
       I-PER       0.80      0.85      0.82       157
       B-LOC       0.72      0.65      0.68       183
       I-LOC       0.00      0.00      0.00        23
       B-ORG       0.66      0.48      0.56       168
       I-ORG       0.59      0.36      0.45       116

    accuracy                           0.90      3943
   macro avg       0.64      0.58      0.60      3943
weighted avg       0.89      0.90      0.90      3943

Aggregate Metrics (excluding 'O'):
Precision: 0.69
Recall: 0.61
F1-Score: 0.64
Support: 847



Model 4 - Dev - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.94      0.98      0.96      6567
       B-PER       0.79      0.66      0.72       434
       I-PER       0.78      0.79      0.78       296
       B-LOC       0.66      0.66      0.66       343
       I-LOC       0.00      0.00      0.00        53
       B-ORG       0.71      0.51      0.59       350
       I-ORG       0.52      0.42      0.47       200

    accuracy                           0.90      8243
   macro avg       0.63      0.57      0.60      8243
weighted avg       0.89      0.90      0.90      8243

Aggregate Metrics (excluding 'O'):
Precision: 0.69
Recall: 0.60
F1-Score: 0.64
Support: 1676

Evaluating pre-trained model 4 on Test set


Model 4 - Test - dev Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.94      0.98      0.96      3096
       B-PER       0.75      0.70      0.73       200
       I-PER       0.80      0.85      0.82       157
       B-LOC       0.72      0.65      0.68       183
       I-LOC       0.00      0.00      0.00        23
       B-ORG       0.66      0.48      0.56       168
       I-ORG       0.59      0.36      0.45       116

    accuracy                           0.90      3943
   macro avg       0.64      0.58      0.60      3943
weighted avg       0.89      0.90      0.90      3943

Aggregate Metrics (excluding 'O'):
Precision: 0.69
Recall: 0.61
F1-Score: 0.64
Support: 847



Model 4 - Test - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.94      0.98      0.96      6567
       B-PER       0.79      0.66      0.72       434
       I-PER       0.78      0.79      0.78       296
       B-LOC       0.66      0.66      0.66       343
       I-LOC       0.00      0.00      0.00        53
       B-ORG       0.71      0.51      0.59       350
       I-ORG       0.52      0.42      0.47       200

    accuracy                           0.90      8243
   macro avg       0.63      0.57      0.60      8243
weighted avg       0.89      0.90      0.90      8243

Aggregate Metrics (excluding 'O'):
Precision: 0.69
Recall: 0.60
F1-Score: 0.64
Support: 1676

Training pre-trained model 5 with hyperparameters: {'embed_size': 50, 'hidden_size': 64, 'num_layers': 2}


Evaluating pre-trained model 5 on Dev set
Model 5 - Dev - dev Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.94      0.98      0.96      3096
       B-PER       0.74      0.74      0.74       200
       I-PER       0.80      0.85      0.82       157
       B-LOC       0.73      0.62      0.67       183
       I-LOC       0.57      0.17      0.27        23
       B-ORG       0.70      0.54      0.61       168
       I-ORG       0.62      0.40      0.48       116

    accuracy                           0.90      3943
   macro avg       0.73      0.61      0.65      3943
weighted avg       0.90      0.90      0.90      3943

Aggregate Metrics (excluding 'O'):
Precision: 0.72
Recall: 0.63
F1-Score: 0.67
Support: 847

Model 5 - Dev - test Data:
Detailed Metrics by Label:
              precision    recall  f1-score   support

           O       0.95      0.97      0.96      6567
       B-PER       0.74      0.69      0.71       43

## Best Pre-Trained Model

### Hyperparameters
- Bidirectional LSTM
- Embedding Size: 50
- Hidden Size: 64
- Number of Layers: 2

### Evaluation Metrics on Dev Set

#### Metrics for Each Label Separately
| Label  | Precision | Recall | F1-Score | Support |
|--------|-----------|--------|----------|---------|
| O      | 0.94      | 0.98   | 0.96     | 3096    |
| B-PER  | 0.74      | 0.74   | 0.74     | 200     |
| I-PER  | 0.80      | 0.85   | 0.82     | 157     |
| B-LOC  | 0.73      | 0.62   | 0.67     | 183     |
| I-LOC  | 0.57      | 0.17   | 0.27     | 23      |
| B-ORG  | 0.70      | 0.54   | 0.61     | 168     |
| I-ORG  | 0.62      | 0.40   | 0.48     | 116     |

#### Overall Metrics
- **Accuracy**: 0.90 (3943 tokens)
- **Macro Average**:
  - Precision: 0.73
  - Recall: 0.61
  - F1-Score: 0.65
- **Weighted Average**:
  - Precision: 0.90
  - Recall: 0.90
  - F1-Score: 0.90
- **Metrics combined (excluding 'O')**:
  - Precision: 0.72
  - Recall: 0.63
  - F1-Score: 0.67
  - Support: 847

---

### Evaluation Metrics on Test Set

#### Metrics for Each Label Separately
| Label  | Precision | Recall | F1-Score | Support |
|--------|-----------|--------|----------|---------|
| O      | 0.95      | 0.97   | 0.96     | 6567    |
| B-PER  | 0.74      | 0.69   | 0.71     | 434     |
| I-PER  | 0.76      | 0.80   | 0.78     | 296     |
| B-LOC  | 0.67      | 0.65   | 0.66     | 343     |
| I-LOC  | 0.69      | 0.21   | 0.32     | 53      |
| B-ORG  | 0.69      | 0.52   | 0.59     | 350     |
| I-ORG  | 0.58      | 0.49   | 0.53     | 200     |

#### Overall Metrics
- **Accuracy**: 0.90 (8243 tokens)
- **Macro Average**:
  - Precision: 0.72
  - Recall: 0.62
  - F1-Score: 0.65
- **Weighted Average**:
  - Precision: 0.90
  - Recall: 0.90
  - F1-Score: 0.90
- **Metrics combined (excluding 'O')**:
  - Precision: 0.70
  - Recall: 0.63
  - F1-Score: 0.65
  - Support: 1676
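
The combined metrics (excluding 'O') can be checked by hand against the per-label table. The sketch below assumes they are support-weighted averages over the six entity labels; the evaluation code is not shown in this section, so this is an inference from the numbers rather than a confirmed description of the notebook. The names `test_rows` and `weighted_average` are hypothetical and exist only for this illustration.

```python
# Per-label test-set rows copied from the table above:
# (precision, recall, f1, support). The 'O' label is left out on purpose.
test_rows = {
    "B-PER": (0.74, 0.69, 0.71, 434),
    "I-PER": (0.76, 0.80, 0.78, 296),
    "B-LOC": (0.67, 0.65, 0.66, 343),
    "I-LOC": (0.69, 0.21, 0.32, 53),
    "B-ORG": (0.69, 0.52, 0.59, 350),
    "I-ORG": (0.58, 0.49, 0.53, 200),
}


def weighted_average(rows, column):
    """Support-weighted mean of one metric column (0=precision, 1=recall, 2=f1)."""
    total = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total


total_support = sum(row[3] for row in test_rows.values())
precision = weighted_average(test_rows, 0)
recall = weighted_average(test_rows, 1)
f1 = weighted_average(test_rows, 2)

print(f"Precision: {precision:.2f}")  # Precision: 0.70
print(f"Recall: {recall:.2f}")        # Recall: 0.63
print(f"F1-Score: {f1:.2f}")          # F1-Score: 0.65
print(f"Support: {total_support}")    # Support: 1676
```

Under this assumption the F1-Score is an average of the per-label F1 values, not the harmonic mean of the combined precision and recall, which is why 0.70 and 0.63 can sit next to an F1 of 0.65. In scikit-learn, this would correspond to calling `precision_score`, `recall_score`, and `f1_score` with `average="weighted"` and a `labels` list that omits 'O'.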


We can see that there is an improvement between the network I created and the pre-trained