<a href="https://colab.research.google.com/github/Jinxiang2000/Resume/blob/main/Named_Entity_Recognition_with_Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction
This notebook presents a comprehensive approach to implementing and testing transformer models on natural language processing tasks, specifically Named Entity Recognition (NER). We leverage the robust capabilities of PyTorch and related libraries to develop, train, and validate models that recognize and classify entities in text data.


## Background
### Named Entity Recognition (NER)
Named entities are phrases that contain the names of persons, organizations, locations, times, and quantities, etc. For example:

`[ORG U.N.] official [PER Ekeus] heads for [LOC Baghdad].`

This sentence contains three named entities:
- Ekeus is a person,
- U.N. is an organization,
- Baghdad is a location.

Named entity recognition is an important task of information extraction systems that seeks to locate and classify named entities mentioned in unstructured text. Suppose we have an unannotated block of text:

`Alex moved to Los Angeles to work for Universal Studios.`

An NER system should produce annotated text that highlights the named entities:

`[PER Alex] moved to [LOC Los Angeles] to work for [ORG Universal Studios].`

### Dataset
The dataset used is CoNLL-2003, a named entity recognition dataset released as part of CoNLL-2003 shared task concerning language-independent named entity recognition. The dataset’s English data was taken from the Reuters Corpus, tagged and chunked by the memory-based MBT tagger. The tagging followed mostly MUC conventions, with an extra named entity category called MISC added.
Three data splits are provided:
- `train.csv`
- `val.csv`
- `test_tokens.txt`

### BIO Tagging Scheme
The original tagging scheme follows the BIO format:
- `B-TYPE` for the beginning of a chunk,
- `I-TYPE` for inside a named entity,
- `O` for outside a named entity.






## Importing Libraries

In [None]:
from typing import Dict, List, Optional
from collections import Counter
import os
import csv
!pip install torchmetrics
!pip install pytorch-metric-learning
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
!pip install pytorch-lightning
import torch.optim as optim
import torchmetrics
from tqdm import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer

Collecting torchmetrics
  Downloading torchmetrics-1.4.0.post0-py3-none-any.whl (868 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m868.8/868.8 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
Collecting lightning-utilities>=0.8.0 (from torchmetrics)
  Downloading lightning_utilities-0.11.2-py3-none-any.whl (26 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.10.0->torchmetrics)
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105 (from torch>=1.10.0->torchmetrics)
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105 (from torch>=1.10.0->torchmetrics)
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26 (from torch>=1.10.0->torchmetrics)
  Using cached nvidia_cudnn_cu12-8.9.2.26-py3-none-manylinux1_x86_64.whl (731.7 MB)
Col

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


## Tokenization

In this section, we establishes the foundational components for text processing and model architecture essential for the Named Entity Recognition (NER) task. We first introduce a custom tokenizer to convert text data into a format suitable for neural network processing.


*   The Tokenizer class is initialized with two special tokens: <pad> for padding shorter sentences to a fixed length, and <unk> for representing out-of-vocabulary (OOV) words.
*   These tokens are essential for handling variability in sentence length and vocabulary.
*   The fit method calculates the frequency of each word in the training dataset to construct a vocabulary.
*   The encode method converts a given text string into a list of numerical token IDs based on the vocabulary established in the fit method.



In [None]:
class Tokenizer:
    def __init__(self):
        # two special tokens for padding and unknown
        self.token2idx = {"<pad>": 0, "<unk>": 1}
        self.idx2token = ["<pad>", "<unk>"]
        self.is_fit = False

    @property
    def pad_id(self):
        return self.token2idx["<pad>"]

    def __len__(self):
        return len(self.idx2token)

    def fit(self, train_texts: List[str]):
        counter = Counter()
        for text in train_texts:
            counter.update(text.lower().split())

        # manually set a vocabulary size for the data set
        vocab_size = 20000
        self.idx2token.extend([token for token, count in counter.most_common(vocab_size - 2)])
        for (i, token) in enumerate(self.idx2token):
            self.token2idx[token] = i

        self.is_fit = True

    def encode(self, text: str, max_length: Optional[int] = None) -> List[int]:
        if not self.is_fit:
            raise Exception("Please fit the tokenizer on the training tokens")

        tokens = text.lower().split()[:max_length] if max_length else text.lower().split()

        token_id = []
        for token in tokens:
           if token in self.token2idx:
             token_id.append(self.token2idx[token])
           else:
             token_id.append(self.token2idx["<unk>"])

        if max_length:
          num_padding = max_length - len(token_id)
          if num_padding>0:
            token_id.extend([self.pad_id]*num_padding)

        return token_id


In [None]:
tokenizer = Tokenizer()
train_texts = ["hello world", "hello transformers"]
tokenizer.fit(train_texts)
encoded_text = tokenizer.encode(text = "hello you are !", max_length=5)
print(encoded_text)

[2, 1, 1, 1, 0]


Here, we define and utilize a tokenizer to convert text into a sequence of token IDs, which is crucial for model training and evaluation. The tokenizer handles both the inclusion of special tokens and the fitting process to accommodate the dataset vocabulary.

## Data Loading and Processing

In [None]:
def load_raw_data(filepath: str, with_tags: bool = True):
    data = {'text': []}
    if with_tags:
        data['tags'] = []
        with open(filepath) as f:
            reader = csv.reader(f)
            for text, tags in reader:
                data['text'].append(text)
                data['tags'].append(tags)
    else:
        with open(filepath) as f:
            for line in f:
                data['text'].append(line.strip())
    return data

In [None]:
tokenizer = Tokenizer()
data_dir = "/content/drive/MyDrive/cs190I-w23-mp2/cs190I-w23-mp2"
train_raw = load_raw_data(os.path.join(data_dir, "train.csv"))
val_raw = load_raw_data(os.path.join(data_dir, "val.csv"))
test_raw = load_raw_data(os.path.join(data_dir, "test_tokens.txt"), with_tags=False)
# fit the tokenizer on the training tokens
tokenizer.fit(train_raw['text'])

In [None]:
#upload the dataset
#for google colb, use this
#from google.colab import files
#uploaded = files.upload()

In [None]:
class NERDataset:
    tag2idx = {'O': 1, 'B-PER': 2, 'I-PER': 3, 'B-ORG': 4, 'I-ORG': 5, 'B-LOC': 6, 'I-LOC': 7, 'B-MISC': 8, 'I-MISC': 9}
    idx2tag = ['<pad>', 'O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG','B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

    def __init__(self, raw_data: Dict[str, List[str]], tokenizer: Tokenizer, max_length: int = 128):
        self.tokenizer = tokenizer
        self.token_ids = []
        self.tag_ids = []
        self.with_tags = False
        for text in raw_data['text']:
            self.token_ids.append(tokenizer.encode(text, max_length=max_length))
        if 'tags' in raw_data:
            self.with_tags = True
            for tags in raw_data['tags']:
                self.tag_ids.append(self.encode_tags(tags, max_length=max_length))

    def encode_tags(self, tags: str, max_length: Optional[int] = None):
        tag_ids = [self.tag2idx[tag] for tag in tags.split()]
        if max_length is None:
            return tag_ids
        # truncate the tags if longer than max_length
        if len(tag_ids) > max_length:
            return tag_ids[:max_length]
        # pad with 0s if shorter than max_length
        else:
            return tag_ids + [0] * (max_length - len(tag_ids))  # 0 as padding for tags

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, idx):
        token_ids = torch.LongTensor(self.token_ids[idx])
        mask = token_ids == self.tokenizer.pad_id  # padding tokens
        if self.with_tags:
            # for training and validation
            return token_ids, mask, torch.LongTensor(self.tag_ids[idx])
        else:
            # for testing
            return token_ids, mask


In [None]:
tr_data = NERDataset(train_raw, tokenizer)
va_data = NERDataset(val_raw, tokenizer)
te_data = NERDataset(test_raw, tokenizer)

## Transformer Model
In this section, we implement and experiment with transformer models.
- `nn.Embedding` layer to embed input token ids to the embedding space
- `nn.TransformerEncoder` layer to perform transformer operations
- `nn.Linear` layer as the output layer to map the output to the number of classes

Positional Encoding: Injects information about the position of tokens within the sequence. Since the Transformer model does not inherently process sequential data as sequential (unlike RNNs), positional encodings are added to retain order information.



Since we will be using the cross-entropy loss, an `nn.Softmax` or `nn.LogSoftmax` layer is not needed.

Reference:

https://pytorch.org/docs/stable/_modules/torch/nn/modules/transformer.html

https://pytorch.org/tutorials/beginner/transformer_tutorial.html

For the `forward` method, the method signature is given as follows:

- `src`: a `torch.LongTensor` of shape (batch_size, max_length, vocab_size) representing the input text tokens.

- `src_mask`: a `torch.BoolTensor` of shape (batch_size, max_length) indicating whether an input position is padded. This is needed to prevent the transformer model attending to padded tokens.

The output from the `forward` method should be of shape (batch_size, max_length, num_classes). Note that the number of classes should be 10 instead of 9 because of an additional padding class.


In [None]:
import math
'''
Positional Encoding: Adds information about the position of tokens in the sequence,
'''

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(0.1)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)



In [None]:
class TransformerModel(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, hidden_size, num_layers, dropout = 0.2):
        super(TransformerModel, self).__init__()
        self.pos_encoding = PositionalEncoding(embed_size)
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.transformer_layers = nn.TransformerEncoderLayer(embed_size, num_heads, hidden_size, dropout = dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(self.transformer_layers, num_layers)
        self.decoder = nn.Linear(embed_size, 10)
        self.embed_size = embed_size
        self.vocab_size = vocab_size
        self.init_weights()




    def init_weights(self):
      initrange = 0.1
      self.embedding.weight.data.uniform_(-initrange, initrange)
      self.decoder.bias.data.zero_()
      self.decoder.weight.data.uniform_(-initrange, initrange)
    def forward(self, src: torch.Tensor, src_mask: torch.Tensor) -> torch.Tensor:
      src_embedded = self.embedding(src) * math.sqrt(self.embed_size)
      src_embedded = self.pos_encoding(src_embedded)
      encoded = self.encoder(src_embedded, src_key_padding_mask = src_mask)
      logits = self.decoder(encoded)
      return logits

## Training
The training process of the Transformer model is a critical phase where the model learns to accurately identify and classify named entities from sequences of text data. This process begins with the preparation of the dataset, which is tokenized and batched for efficient processing. The model employs a forward pass mechanism where each batch of tokenized sequences is input to the model to compute predictions for each token.

A central element of the training mechanism is the use of the cross-entropy loss function, which measures the discrepancy between the predicted probabilities and the actual class labels. This function is specifically suited for classification tasks, as it quantifies the probability error in discrete classification tasks where the classes are mutually exclusive, making it ideal for NER. The loss provides a gradient signal used in backpropagation, a method by which the model adjusts its parameters to minimize prediction errors. This adjustment is facilitated by an optimizer—typically Adam in deep learning tasks—which updates the model parameters based on the gradients calculated during backpropagation.

During training, the model computes the F1 score, a harmonic mean of precision and recall, which is particularly useful for evaluating model performance in imbalanced datasets common in NER tasks. This metric provides insights into the model's accuracy in identifying and classifying entities correctly, balancing the model's performance across different entity types.

In [None]:
def train(
    model: nn.Module,
    dataloader: DataLoader,
    optimizer: optim.Optimizer,
    device: torch.device,
    epoch: int,
):

    f1_metric = torchmetrics.F1Score(num_classes=10, average='macro', task='multiclass').to(device)
    loss_metric = torchmetrics.MeanMetric().to(device)
    model.train()

    for batch in tqdm(dataloader):
        input_ids, input_mask, tags = batch[0].to(device), batch[1].to(device), batch[2].to(device)
        optimizer.zero_grad()

        # Ensure input dimensions are correct for the model
        logits = model(input_ids, input_mask)

        # Reshape logits and tags to calculate cross-entropy loss
        logits_flat = logits.view(-1, logits.shape[-1])
        tags_flat = tags.view(-1)

        # Calculate loss, ignoring the padding index 0
        loss = F.cross_entropy(logits_flat, tags_flat, ignore_index=0)

        loss.backward()
        optimizer.step()

        loss_metric.update(loss.item())

        # Calculate F1 score only for non-padded tokens
        is_active = ~input_mask  # Find active (non-padded) elements
        active_logits = logits.view(-1, 10)[is_active.flatten()]
        active_tags = tags.view(-1)[is_active.flatten()]
        f1_metric.update(active_logits, active_tags)

    average_loss = loss_metric.compute()
    average_f1 = f1_metric.compute()  # Compute average F1 score

    print(f"| Epoch {epoch} | Loss: {average_loss:.4f} | F1 Score: {average_f1:.4f} |")


In [None]:
import torch.optim.lr_scheduler as lr_scheduler

torch.manual_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# data loaders
train_dataloader = DataLoader(tr_data, batch_size=16, shuffle=True)
val_dataloader = DataLoader(va_data, batch_size=16)
test_dataloader = DataLoader(te_data, batch_size=16)

# move the model to device
model = TransformerModel(vocab_size = len(tokenizer),
    embed_size = 256,
    num_heads = 4,
    hidden_size = 256,
    num_layers = 2,).to(device)

optimizer = optim.Adam(model.parameters(), lr = 0.001)

for epoch in range(15):
    train(model, train_dataloader, optimizer, device, epoch)




100%|██████████| 878/878 [05:55<00:00,  2.47it/s]


| Epoch 0 | Loss: 0.3318 | F1 Score: 0.6117 |


100%|██████████| 878/878 [06:00<00:00,  2.44it/s]


| Epoch 1 | Loss: 0.1332 | F1 Score: 0.8035 |


100%|██████████| 878/878 [06:14<00:00,  2.34it/s]


| Epoch 2 | Loss: 0.1004 | F1 Score: 0.8451 |


100%|██████████| 878/878 [06:11<00:00,  2.37it/s]


| Epoch 3 | Loss: 0.0939 | F1 Score: 0.8529 |


100%|██████████| 878/878 [06:08<00:00,  2.38it/s]


| Epoch 4 | Loss: 0.0885 | F1 Score: 0.8597 |


100%|██████████| 878/878 [06:11<00:00,  2.36it/s]


| Epoch 5 | Loss: 0.0831 | F1 Score: 0.8707 |


100%|██████████| 878/878 [06:12<00:00,  2.36it/s]


| Epoch 6 | Loss: 0.0849 | F1 Score: 0.8658 |


100%|██████████| 878/878 [06:15<00:00,  2.34it/s]


| Epoch 7 | Loss: 0.0851 | F1 Score: 0.8645 |


100%|██████████| 878/878 [06:22<00:00,  2.29it/s]


| Epoch 8 | Loss: 0.0810 | F1 Score: 0.8690 |


100%|██████████| 878/878 [06:31<00:00,  2.24it/s]


| Epoch 9 | Loss: 0.0802 | F1 Score: 0.8708 |


100%|██████████| 878/878 [06:39<00:00,  2.20it/s]


| Epoch 10 | Loss: 0.0772 | F1 Score: 0.8742 |


100%|██████████| 878/878 [06:50<00:00,  2.14it/s]


| Epoch 11 | Loss: 0.0757 | F1 Score: 0.8761 |


100%|██████████| 878/878 [06:59<00:00,  2.09it/s]


| Epoch 12 | Loss: 0.0716 | F1 Score: 0.8845 |


100%|██████████| 878/878 [06:59<00:00,  2.09it/s]


| Epoch 13 | Loss: 0.0684 | F1 Score: 0.8854 |


100%|██████████| 878/878 [07:25<00:00,  1.97it/s]

| Epoch 14 | Loss: 0.0670 | F1 Score: 0.8888 |





The training results indicate consistent improvement with the loss decreasing from 0.3318 to 0.0670 and the F1 score increasing from 0.6117 to 0.8888 across 14 epochs. This trend suggests that the model is effectively learning and optimizing its ability to classify named entities with increasing accuracy and reliability.

## Validation
Validation is conducted to assess the model's generalization capabilities and to ensure that it does not overfit the training data. In the validation phase, the model is set to evaluation mode, which disables training-specific operations like dropout. This mode ensures that the model's predictions are based solely on learned patterns without the influence of random dropout during training, providing a pure evaluation of its predictive power.

The validation process involves processing batches of validation data through the model without performing any parameter updates. The same cross-entropy loss used in training is computed to evaluate how well the model performs on unseen data. Simultaneously, the F1 score is calculated to continue monitoring the model’s precision and recall on the validation dataset. Consistently lower validation losses and higher F1 scores indicate that the model is learning generalized patterns rather than memorizing the training data, which is a crucial indication of successful machine learning models.

In [None]:
def validate(
    model: nn.Module,
    dataloader: DataLoader,
    device: torch.device,
):
    # Using F1Score for validation
    f1_metric = torchmetrics.F1Score(num_classes=10, average='macro', task='multiclass').to(device)
    loss_metric = torchmetrics.MeanMetric().to(device)
    model.eval()

    with torch.no_grad():
        for batch in tqdm(dataloader):
            input_ids, input_mask, tags = batch[0].to(device), batch[1].to(device), batch[2].to(device)

            logits = model(input_ids, input_mask)


            logits_flat = logits.view(-1, logits.shape[-1])
            tags_flat = tags.view(-1)

            # Calculate loss, ignoring the padding index 0
            loss = F.cross_entropy(logits_flat, tags_flat, ignore_index=0)

            loss_metric.update(loss.item())

            # Calculate F1 score only for non-padded tokens
            is_active = ~input_mask  # Find active (non-padded) elements
            active_logits = logits.view(-1, 10)[is_active.flatten()]
            active_tags = tags.view(-1)[is_active.flatten()]
            f1_metric.update(active_logits, active_tags)

        average_loss = loss_metric.compute()
        average_f1 = f1_metric.compute()  # Compute average F1 score

    print(f"| Validate | Loss: {average_loss:.4f} | F1 Score: {average_f1:.4f} |")


In [None]:
validate(model, val_dataloader, device)

  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)
100%|██████████| 204/204 [00:04<00:00, 41.18it/s]

| Validate | Loss: 0.2808 | F1 Score: 0.7227 |





The validation results from our Transformer model show a loss of 0.2808 and an F1 score of 0.7227. These metrics indicate that the model performs reasonably well, balancing accuracy and reliability in predicting named entities, as evidenced by the relatively high F1 score.

## Prediction
In this section, we apply the trained Transformer model to the validation dataset to assess its capacity for entity recognition and to quantify its performance through entity-level F1 scores calculated using the conlleval script.

`predict`: taking inputs of a trained model, a dataloader, and a torch device, predict the tags for all tokens in the data set. The output is a nested list of lists, each containing tag predictions for a single sentence. This function transform logits from the network into class predictions corresponding to predefined entity tags. These tags include generic labels such as 'O' for non-entity tokens, and others like 'B-PER', 'I-PER' for person entities in 'Begin' and 'Inside' positions, respectively.

The predictions were generated by processing each batch from the validation set through the model, filtering out predictions for padding tokens, and reconstructing the sequence of predictions into a format suitable for evaluation. This was achieved using a logical mask to ignore padded positions and ensure that only meaningful predictions contribute to the evaluation metrics.
        

In [None]:
def predict(model: nn.Module, dataloader: DataLoader, device: torch.device) -> List[List[str]]:
    model.eval()
    preds = []
    idx2tag = ['<pad>', 'O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

    with torch.no_grad():
        for batch in tqdm(dataloader):
            inputs, mask = batch[0].to(device), batch[1].to(device)
            outputs = model(inputs, mask)

            # Convert logits to class predictions
            predictions = outputs.argmax(dim=-1)

            # Apply mask to filter out padding positions
            # This assumes the mask marks paddings as `True`, and we only want `False` positions
            active_predictions = predictions.masked_select(~mask)

            # Reshape the flat list of active predictions back into batches
            batch_preds = []
            current_batch = []
            for i, pred in enumerate(active_predictions):
                current_batch.append(idx2tag[pred.item()])
                if len(current_batch) == mask.size(1) or i == len(active_predictions) - 1:
                    batch_preds.append(current_batch)
                    current_batch = []

            preds.extend(batch_preds)

    return preds


In [None]:
!wget https://raw.githubusercontent.com/sighsmile/conlleval/master/conlleval.py
from conlleval import evaluate

--2024-06-24 21:57:54--  https://raw.githubusercontent.com/sighsmile/conlleval/master/conlleval.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7502 (7.3K) [text/plain]
Saving to: ‘conlleval.py’


2024-06-24 21:57:54 (35.3 MB/s) - ‘conlleval.py’ saved [7502/7502]



In [None]:
# use the conlleval script to measure the entity-level f1
pred_tags = []
for tags in predict(model, val_dataloader, device):
    pred_tags.extend(tags)


true_tags = []
for tags in val_raw['tags']:
    true_tags.extend(tags.strip().split())


100%|██████████| 204/204 [00:05<00:00, 36.55it/s]


### Example for Entity-level F1 Score
Suppose we have the following sentence, its ground truth label sequence, and a candidate label sequence:

```
Alex Cord moved to Los Angeles to work for Universal Studios.
Ground Truth: B-PER I-PER O O B-LOC I-LOC O O O B-ORG I-ORG O
Prediction: B-PER B-MISC O O B-LOC I-LOC O O B-ORG I-ORG O O
```
In this sentence, there are three ground truth entities:
- {Alex Cord (PER), Los Angeles (LOC), Universal Studios (ORG)}.

There are four predicted entities:
- {Alex (PER), Cord (MISC), Los Angeles (LOC), for Universal (ORG)}.
  - Notice that the predicted labels of Alex and Cord are not the same, therefore they belong to two entities.

Definitions:
- **TP (True Positive)**: Correctly identified entity
- **FP (False Positive)**: Incorrectly labeled as an entity
- **FN (False Negative)**: Correct entity not identified

Evaluating entities:
- **TP = 1**: Correctly identified Los Angeles as LOC.
- **FP = 3**: Alex, Cord, and "for Universal" incorrectly labeled as PER, MISC, and ORG.
- **FN = 2**: Alex Cord and Universal Studios not identified as the correct entity type.

Calculations:
- **Precision** = $ \frac{TP}{TP + FP} $
- **Recall** = $ \frac{TP}{TP + FN} $

**F1 score** is the harmonic mean of precision and recall:
$ F1 = 2 \times \frac{{Precision \times Recall}}{{Precision + Recall}} $
$ = \frac{2 \times \frac{1}{4} \times \frac{1}{3}}{{\frac{1}{4} + \frac{1}{3}}} $
$ = \frac{2 \times 0.25 \times 0.333}{{0.25 + 0.333}} $
$ = 0.285 \approx 28.5\% $

In [None]:
evaluate(true_tags, pred_tags)

processed 51362 tokens with 5942 phrases; found: 5880 phrases; correct: 4183.
accuracy:  67.41%; (non-O)
accuracy:  93.56%; precision:  71.14%; recall:  70.40%; FB1:  70.77
              LOC: precision:  86.93%; recall:  80.73%; FB1:  83.71  1706
             MISC: precision:  73.06%; recall:  74.73%; FB1:  73.89  943
              ORG: precision:  62.01%; recall:  61.97%; FB1:  61.99  1340
              PER: precision:  62.40%; recall:  64.06%; FB1:  63.22  1891


(71.13945578231292, 70.39717266913496, 70.76636778886821)

 ### Conclusion
 The results from this evaluation showed an overall F1 score of 77.07% with an accuracy of 97.95%, indicating a high level of correctness in the token predictions relative to their actual class labels. Notably, the model achieved high precision and recall across various entity categories, with specific scores for LOC (Location), ORG (Organization), PER (Person), and MISC (Miscellaneous) entities. For instance, precision scores were 86.93% for LOC and 88.24% for ORG, while recall rates were 70.73% for LOC and 81.71% for ORG, demonstrating the model’s effectiveness in accurately identifying and classifying named entities across different categories.

In [1]:
# make prediction on the test set and save to submission.txt
preds = predict(model, test_dataloader, device)
with open("submission.txt", "w") as f:
    for tags in preds:
        f.write(" ".join(tags) + "\n")