<a href="https://colab.research.google.com/github/ItaiKaplan/NLP/blob/main/HW_3_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 3
Training a neural named entity recognition (NER) tagger 

In [None]:
import torch
import torch.nn as nn


In [None]:
# Set device as cuda if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Device is {device}")

Device is cuda


In this assignment you are required to build a full training and testing pipeline for a neural sequential tagger for named entities, using an LSTM.

The dataset that you will be working on is called ReCoNLL 2003, which is a corrected version of the CoNLL 2003 dataset: https://www.clips.uantwerpen.be/conll2003/ner/

[Train data](https://drive.google.com/file/d/1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz/view?usp=sharing)

[Dev data](https://drive.google.com/file/d/1EAF-VygYowU1XknZhvzMi2CID65I127L/view?usp=sharing)

[Test data](https://drive.google.com/file/d/16gug5wWnf06JdcBXQbcICOZGZypgr4Iu/view?usp=sharing)

As you can see, the annotated texts are labeled according to the IOB annotation scheme, for 3 entity types: Person, Organization, Location.
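To make the IOB scheme concrete, here is a minimal sketch of how a tag sequence maps back to entity spans. The helper name `iob_to_spans` and the sample sentence are illustrative assumptions, not part of the assignment. The sketch is lenient: an `I-` tag with no preceding entity of the same type is treated as the start of a new entity.

```python
def iob_to_spans(tags):
    """Collect (entity_type, start, end) spans from IOB tags; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open entity continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        # A B- tag always opens an entity; a stray I- tag opens one too (lenient).
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hypothetical sentence, not taken from the dataset.
words = ["John", "Smith", "visited", "New", "York", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]
spans = iob_to_spans(tags)
for entity_type, start, end in spans:
    print(entity_type, " ".join(words[start:end]))
```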

**Task 1:** Write a function for reading the data from a single file (one of those provided above). The function receives a filepath and encodes every sentence individually as a pair of lists: one list contains the words and the other contains the tags. Each pair is added to a general list (data), which the function returns.

In [None]:
!gdown --id 1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz
!gdown --id 1EAF-VygYowU1XknZhvzMi2CID65I127L
!gdown --id 16gug5wWnf06JdcBXQbcICOZGZypgr4Iu

Downloading...
From: https://drive.google.com/uc?id=1hG66e_OoezzeVKho1w7ysyAx4yp0ShDz
To: /content/connl03_train.txt
100% 264k/264k [00:00<00:00, 28.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1EAF-VygYowU1XknZhvzMi2CID65I127L
To: /content/connl03_dev.txt
100% 36.6k/36.6k [00:00<00:00, 43.5MB/s]
Downloading...
From: https://drive.google.com/uc?id=16gug5wWnf06JdcBXQbcICOZGZypgr4Iu
To: /content/connl03_test.txt
100% 75.9k/75.9k [00:00<00:00, 84.9MB/s]


In [None]:
def read_data(filepath):
    data = []
    word_buffer = []
    tag_buffer = []

    with open(filepath, 'r') as f:
        for line in f:
            split_line = line.strip().split(" ")
            if len(split_line) != 2:
                # Any line that is not a "word tag" pair ends the current sentence.
                # Skip empty buffers so consecutive blank lines add no empty sentences.
                if word_buffer:
                    data.append((word_buffer, tag_buffer))
                    word_buffer = []
                    tag_buffer = []
                continue

            word, tag = split_line
            word_buffer.append(word)
            tag_buffer.append(tag)

    # Flush the last sentence when the file does not end with a blank line.
    if word_buffer:
        data.append((word_buffer, tag_buffer))

    return data

train = read_data('/content/connl03_train.txt')
dev = read_data('/content/connl03_dev.txt')
test = read_data('/content/connl03_test.txt')


The following Vocab class serves as a dictionary that maps words and tags to IDs. The UNK_TOKEN should be used for words that are not part of the training data.

In [None]:
UNK_TOKEN = 0

class Vocab:
    def __init__(self):
        self.word2id = {"__unk__": UNK_TOKEN}
        self.id2word = {UNK_TOKEN: "__unk__"}
        self.n_words = 1
        
        self.tag2id = {"O":0, "B-PER":1, "I-PER": 2, "B-LOC": 3, "I-LOC": 4, "B-ORG": 5, "I-ORG": 6}
        self.id2tag = {0:"O", 1:"B-PER", 2:"I-PER", 3:"B-LOC", 4:"I-LOC", 5:"B-ORG", 6:"I-ORG"}
        
    def index_words(self, words):
      word_indexes = [self.index_word(w) for w in words]
      return word_indexes

    def index_tags(self, tags):
      tag_indexes = [self.tag2id[t] for t in tags]
      return tag_indexes
    
    def index_word(self, w):
        if w not in self.word2id:
            self.word2id[w] = self.n_words
            self.id2word[self.n_words] = w
            self.n_words += 1
        return self.word2id[w]
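Note that `index_word` above adds every word it sees, so calling it on dev or test sentences grows the vocabulary and the unknown ID is never produced. One possible way to get the intended behavior is a read-only lookup for held-out data. The sketch below is an assumption about how that could look; `lookup_ids` and `UNK_ID` are hypothetical names, and the small `word2id` dictionary stands in for `vocab.word2id` after indexing the training set.

```python
UNK_ID = 0  # plays the same role as UNK_TOKEN in the Vocab class


def lookup_ids(words, word2id):
    """Map words to ids without growing the vocabulary; unseen words get UNK_ID."""
    return [word2id.get(w, UNK_ID) for w in words]


# Stand-in for a vocabulary built from training data only.
word2id = {"__unk__": UNK_ID, "John": 1, "lives": 2, "in": 3, "London": 4}
ids = lookup_ids(["Mary", "lives", "in", "Paris"], word2id)
print(ids)
```

With a lookup like this, the embedding row for the unknown ID only receives gradient updates if some training words are also mapped to it, for example by replacing rare training words with the unknown token.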
            

**Task 2:** Write a function prepare_data that takes one of [train, dev, test] and the Vocab instance, and converts each (words, tags) pair into a pair of index sequences. Each pair should be appended to data_sequences, which the function returns.

In [None]:
vocab = Vocab()

def prepare_data(data, vocab):
    data_sequences = []
    # TODO - your code...
    for data_list, tag_list in data:
      data_sequences.append((
          torch.tensor(vocab.index_words(data_list)).to(device),
          torch.tensor(vocab.index_tags(tag_list)).to(device)))
    return data_sequences, vocab

train_sequences, vocab = prepare_data(train, vocab)
dev_sequences, vocab = prepare_data(dev, vocab)
test_sequences, vocab = prepare_data(test, vocab)

**Task 3:** Write NERNet, a PyTorch Module for labeling words with NER tags. 

*input_size:* the size of the vocabulary

*embedding_size:* the size of the embeddings

*hidden_size:* the LSTM hidden size

*output_size:* the number of tags we are predicting

*n_layers:* the number of layers we want to use in LSTM

*directions:* can be 1 or 2, indicating a unidirectional or bidirectional LSTM, respectively

The input for your forward function should be a single sentence tensor.

*note:* the embeddings in this section are learned embeddings. That means you don't need to use pretrained embeddings like the ones used in class. You will use them in part 5.

In [None]:
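class NERNet(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, output_size, n_layers, directions):
        super(NERNet, self).__init__()
        # directions is 1 (unidirectional) or 2 (bidirectional); a bool is also
        # accepted, with True meaning bidirectional.
        bidirectional = directions is True or directions == 2
        num_directions = 2 if bidirectional else 1
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, n_layers, bidirectional=bidirectional)
        # A bidirectional LSTM concatenates both directions, doubling the feature size.
        self.out = nn.Linear(hidden_size * num_directions, output_size)

    def forward(self, input_sentence):
        embeds = self.embedding(input_sentence)
        lstm_out, _ = self.lstm(embeds)
        # Return raw scores (logits): nn.CrossEntropyLoss applies log-softmax itself,
        # and argmax over logits gives the same prediction as argmax over probabilities.
        return self.out(lstm_out)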
    

**Task 4:** Write a training loop, which takes a model (an instance of NERNet) and the number of epochs to train for. The loss is always CrossEntropyLoss and the optimizer is always Adam.
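One detail worth keeping in mind when pairing a model with `nn.CrossEntropyLoss`: the loss applies log-softmax internally, so it expects raw scores rather than probabilities. The following pure-Python sketch (no PyTorch; the helper names and the numbers are illustrative assumptions) shows what happens to the loss when softmax is applied twice.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target):
    """Negative log of the softmax probability assigned to the target class."""
    return -math.log(softmax(scores)[target])


logits = [6.0, 0.0, 0.0]  # a confident, correct prediction for class 0
loss_on_logits = cross_entropy(logits, 0)
loss_on_probs = cross_entropy(softmax(logits), 0)  # softmax applied twice

print(f"loss on logits:        {loss_on_logits:.4f}")
print(f"loss on probabilities: {loss_on_probs:.4f}")
```

Because probabilities lie in [0, 1], the second softmax flattens the distribution, and the loss cannot approach zero even for a perfect prediction. That weakens the gradient signal and can slow training considerably.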

In [None]:
def train_loop(model, n_epochs):
  # Loss function
  criterion = nn.CrossEntropyLoss()

  # Optimizer (ADAM is a fancy version of SGD)
  optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

  # Make sure the model is in training mode (evaluation switches it to eval mode).
  model.train()
  for e in range(1, n_epochs + 1):
    # One update per sentence: scores are (seq_len, n_tags), tags are (seq_len,).
    for words, tags in train_sequences:
      optimizer.zero_grad()
      scores = model(words)
      loss = criterion(scores, tags)
      loss.backward()
      optimizer.step()



In [None]:
test_model_1 = NERNet(len(vocab.id2word), 300, 500, len(vocab.id2tag), 1, False).to(device)
train_loop(test_model_1, 10)



In [None]:
train_sequences[:3]

[(tensor([1, 2, 3, 4]), tensor([5, 0, 5, 0])),
 (tensor([ 5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]),
  tensor([0, 3, 4, 0, 5, 0, 0, 1, 2, 0, 0])),
 (tensor([16, 17, 18, 19, 18, 20, 21, 22, 23, 24]),
  tensor([0, 0, 0, 3, 0, 5, 6, 0, 0, 0]))]

In [None]:
torch.argmax(test_model_1(train_sequences[10][0]), dim=-1).cpu()



tensor([0, 0, 0, 0, 0, 0, 0, 0])

**Task 5:** Write an evaluation loop for a trained model, using the dev and test datasets. The function prints recall (the true positive rate, TPR) and precision (the share of predicted positives that are correct, TP / (TP + FP)) for each label separately (7 labels in total) and for the 6 entity labels (all except O) together. The caption argument is used for printing: include it as a prefix in every line you print.
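As a small worked example of the two metrics, the sketch below computes token-level precision and recall for a single label from gold and predicted tag lists. The function name `precision_recall` and the toy sequences are assumptions made for illustration only.

```python
def precision_recall(gold, pred, label):
    """Token-level precision and recall for one label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    # Report 0.0 when a label is never predicted (precision) or never occurs (recall).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy example: one I-PER token is missed and one O token is tagged B-LOC.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "O", "O", "B-LOC", "B-LOC"]

for label in ["B-PER", "I-PER", "B-LOC", "O"]:
    p, r = precision_recall(gold, pred, label)
    print(f"{label}: precision={p:.2f}, recall={r:.2f}")
```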

In [None]:
a = torch.zeros(7).to(device)
b = [1,2,3,4,5,6,7]
b[int(a[0].item())]

1

In [None]:
from collections import Counter

def evaluate(model, caption):
  # Report metrics on both held-out sets, labeling each so the output is unambiguous.
  evaluate_dataset(model, f"{caption} [dev]", dev_sequences)
  evaluate_dataset(model, f"{caption} [test]", test_sequences)


def safe_div(numerator, denominator):
  # Return 0 when a label never occurs (recall) or is never predicted (precision).
  return numerator / denominator if denominator else 0.0


def evaluate_dataset(model, caption, seq):
  # One Counter per tag id, holding token-level TP / FP / FN counts.
  all_counts = [Counter() for _ in vocab.id2tag]

  model.eval()
  with torch.no_grad():
    for words, tags in seq:
      preds = torch.argmax(model(words), dim=-1)
      for pred, gold in zip(preds.tolist(), tags.tolist()):
        if pred == gold:
          all_counts[gold]['TP'] += 1
        else:
          # A wrong prediction is a miss for the gold tag and a false alarm for the predicted tag.
          all_counts[gold]['FN'] += 1
          all_counts[pred]['FP'] += 1

  for tag, tag_counts in enumerate(all_counts):
    recall = safe_div(tag_counts['TP'], tag_counts['TP'] + tag_counts['FN'])
    precision = safe_div(tag_counts['TP'], tag_counts['TP'] + tag_counts['FP'])
    print(f"{caption} --  {vocab.id2tag[tag]} recall: {recall} , precision: {precision}")

  # Micro-average over the six entity labels (everything except "O", which has id 0).
  entity_counts = sum(all_counts[1:], Counter())
  all_recall = safe_div(entity_counts['TP'], entity_counts['TP'] + entity_counts['FN'])
  all_precision = safe_div(entity_counts['TP'], entity_counts['TP'] + entity_counts['FP'])
  print(f"{caption} -- all together -- recall: {all_recall} , precision: {all_precision}")



**Task 6:** Train and evaluate a few models, all with embedding_size=300, and with the following hyper parameters (you may use that as captions for the models as well):

Model 1: (hidden_size: 500, n_layers: 1, directions: 1)

Model 2: (hidden_size: 500, n_layers: 2, directions: 1)

Model 3: (hidden_size: 500, n_layers: 3, directions: 1)

Model 4: (hidden_size: 500, n_layers: 1, directions: 2)

Model 5: (hidden_size: 500, n_layers: 2, directions: 2)

Model 6: (hidden_size: 500, n_layers: 3, directions: 2)

Model 7: (hidden_size: 800, n_layers: 1, directions: 2)

Model 8: (hidden_size: 800, n_layers: 2, directions: 2)

Model 9: (hidden_size: 800, n_layers: 3, directions: 2)
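The nine configurations listed above can also be generated programmatically, which avoids copy-paste mistakes in the captions. This is only a sketch: the dictionary keys mirror the hyperparameter names in the list, and the variable names `configs` and `captions` are assumptions.

```python
configs = []
for hidden_size in (500, 800):
    for directions in (1, 2):
        if hidden_size == 800 and directions == 1:
            continue  # the list has no unidirectional models with hidden_size 800
        for n_layers in (1, 2, 3):
            configs.append({
                "hidden_size": hidden_size,
                "n_layers": n_layers,
                "directions": directions,
            })

# Number the configurations consecutively, in the same order as the list above.
captions = [
    f"Model {i}: (hidden_size: {c['hidden_size']}, "
    f"n_layers: {c['n_layers']}, directions: {c['directions']})"
    for i, c in enumerate(configs, start=1)
]

for caption in captions:
    print(caption)
```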

In [None]:
embedding_size = 300
input_size = len(vocab.id2word)
output_size = len(vocab.id2tag)
n_epochs = 50

for hidden_size in [500, 800]:
  for n_layers in [1, 2, 3]:
    for bidirectional in [False, True]:
      # The task lists no unidirectional models with hidden_size 800.
      if hidden_size == 800 and not bidirectional:
        continue

      # Pass the loop variable so each model matches its caption.
      model = NERNet(input_size,
                     embedding_size,
                     hidden_size,
                     output_size,
                     n_layers,
                     bidirectional).to(device)
      train_loop(model, n_epochs)
      cap = f"hidden size: {hidden_size}, n_layers: {n_layers}, bidirectional: {bidirectional}"
      evaluate(model, cap)




hidden size: 500, n_layers: 1, bidirectional: False --  O recall: 0.9644702842377261 , precision: 0.9261786600496278
hidden size: 500, n_layers: 1, bidirectional: False --  B-PER recall: 0.72 , precision: 0.7128712871287128
hidden size: 500, n_layers: 1, bidirectional: False --  I-PER recall: 0.7070063694267515 , precision: 0.8161764705882353
hidden size: 500, n_layers: 1, bidirectional: False --  B-LOC recall: 0.7377049180327869 , precision: 0.7458563535911602
hidden size: 500, n_layers: 1, bidirectional: False --  I-LOC recall: 0.5217391304347826 , precision: 0.8571428571428571
hidden size: 500, n_layers: 1, bidirectional: False --  B-ORG recall: 0.5476190476190477 , precision: 0.6865671641791045
hidden size: 500, n_layers: 1, bidirectional: False --  I-ORG recall: 0.35344827586206895 , precision: 0.7884615384615384
hidden size: 500, n_layers: 1, bidirectional: False -- all toghether -- recall: 0.71900826446281 , precision: 0.847009735744089
hidden size: 500, n_layers: 1, bidirection

KeyboardInterrupt: ignored

**Task 7:** Download the GloVe embeddings from https://nlp.stanford.edu/projects/glove/ (use the 300-dim vectors from glove.6B.zip). Then initialize the nn.Embedding module in your NERNet with these embeddings, so that you can start your training with pre-trained vectors. Repeat Task 6 and print the results for each model.

Note: make sure that the vectors are aligned with the IDs in your Vocab. In other words, make sure that, for example, the word with ID 0 corresponds to the first vector in the GloVe matrix that you initialize nn.Embedding with. For a discussion on how to do that, see this link:
https://discuss.pytorch.org/t/can-we-use-pre-trained-word-embeddings-for-weight-initialization-in-nn-embedding/1222
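One possible way to do the alignment is sketched below in plain Python. It assumes the GloVe text format (a word followed by its space-separated vector values on each line) and contiguous word IDs starting at 0, as produced by the Vocab class. The function name `build_embedding_matrix`, the random initialization range, and the tiny 3-dimensional stand-in file are all assumptions for illustration. Since the glove.6B vocabulary is lowercase, the sketch matches words case-insensitively.

```python
import io
import random


def build_embedding_matrix(glove_lines, word2id, dim, seed=0):
    """Return (matrix, n_found) where matrix[i] is the vector for word id i."""
    rng = random.Random(seed)
    # Words missing from GloVe (including the unknown token) keep small random vectors.
    matrix = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(len(word2id))]

    # Several vocabulary entries may share one lowercase form ("London", "london").
    lower_to_ids = {}
    for word, idx in word2id.items():
        lower_to_ids.setdefault(word.lower(), []).append(idx)

    n_found = 0
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        for idx in lower_to_ids.get(word, ()):
            matrix[idx] = [float(v) for v in values]
            n_found += 1
    return matrix, n_found


# Tiny stand-in for glove.6B.300d.txt, with 3 dimensions instead of 300.
fake_glove = io.StringIO("london 0.1 0.2 0.3\nparis 0.4 0.5 0.6\n")
word2id = {"__unk__": 0, "London": 1, "london": 2, "Rome": 3}
matrix, n_found = build_embedding_matrix(fake_glove, word2id, dim=3)
print(n_found, "vocabulary entries initialized from GloVe")
```

With the real file, `glove_lines` would be the opened `glove.6B.300d.txt` and `dim` would be 300. The resulting matrix can be converted with `torch.tensor(matrix)` and passed to `nn.Embedding.from_pretrained(..., freeze=False)` so that the vectors continue to be fine-tuned during training.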

In [None]:
# TODO - your code goes here...

**Good luck!**