<a href="https://colab.research.google.com/github/Neilus03/NLP-2023/blob/main/RNNs_for_Sequence_Labeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNNs for Sequence Labeling

Credits: This notebook is based on code from https://github.com/bentrevett/pytorch-pos-tagging 

## Basic setup and module imports

In [44]:
import IPython
import random, time

import torch
from torchtext.vocab import build_vocab_from_iterator

import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

## Data download and pre-processing

We will use the UDPOS dataset (Universal Dependencies English Web Treebank dataset). We first download the data and unzip it.

In [45]:
!wget https://bitbucket.org/sivareddyg/public/downloads/en-ud-v2.zip
!unzip -o en-ud-v2.zip 


The dataset is split into three subsets for training, validation and testing. In the following cell we create a dictionary that contains a list of tuples for each split. Each tuple consists of a list of words and a list of the corresponding POS tags:

In [46]:
splits = ['train','dev','test']

data = {}
for split in splits:
  data[split] = []
  with open('en-ud-v2/en-ud-tag.v2.'+split+'.txt') as f:
    #get a list of lines
    lines = f.read().splitlines()
  data_line = ([],[])
  for line in lines:
    if line == '':
      data[split].append(data_line)
      data_line = ([],[])
    else:
      # each line looks like "story\tNOUN\tNN": the word, its UD tag and a fine-grained tag that we ignore
      text, ud_tag, _ = line.split('\t')
      data_line[0].append(text)
      data_line[1].append(ud_tag)
  # keep the last sentence if the file does not end with an empty line
  if data_line[0]:
    data[split].append(data_line)
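
To make the resulting structure concrete, here is a minimal sketch that applies the same parsing logic to a tiny in-memory sample instead of the downloaded files. The sample sentences and the helper name `parse_tagged_lines` are made up for illustration; only the tab-separated layout (word, UD tag, fine-grained tag) comes from the dataset.

```python
def parse_tagged_lines(lines):
    """Group 'word<TAB>ud_tag<TAB>fine_tag' lines into (words, tags) tuples.

    Sentences are separated by empty lines. Hypothetical helper, not part of the notebook.
    """
    sentences = []
    words, tags = [], []
    for line in lines:
        if line == '':
            if words:  # close the current sentence
                sentences.append((words, tags))
            words, tags = [], []
        else:
            word, ud_tag, _ = line.split('\t')
            words.append(word)
            tags.append(ud_tag)
    if words:  # last sentence when there is no trailing empty line
        sentences.append((words, tags))
    return sentences

# Made-up sample in the same layout as the dataset files.
sample = "A\tDET\tDT\nstory\tNOUN\tNN\n\nIt\tPRON\tPRP\nworks\tVERB\tVBZ"
parsed = parse_tagged_lines(sample.splitlines())
print(parsed)
```

Each element of `parsed` is one sentence: a tuple whose first entry is the list of words and whose second entry is the list of UD tags, aligned position by position.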

In [47]:
print(f"Number of training examples: {len(data['train'])}")
print(f"Number of validation examples: {len(data['dev'])}")
print(f"Number of testing examples: {len(data['test'])}")

Number of training examples: 12543
Number of validation examples: 2002
Number of testing examples: 2077


Here we define a utility function to visualize a sentence from our dataset by coloring each word depending on its assigned POS category label.

In [48]:
def visualize_pos(line):
  random.seed(42)
  text, labels = line
  unique_labels = list(set(labels))
  colors = {}
  for l in unique_labels:
      colors[l] = "%06x" % random.randint(0, 0xFFFFFF)
  html_content = ''
  for t,l in zip(text,labels):
    html_content += ' '+f'<span style="color:#{colors[l]}">{t}</span>'

  IPython.display.display(IPython.display.HTML(html_content))

visualize_pos(data['train'][0])


Now we build the training vocabularies using the [`build_vocab_from_iterator`](https://pytorch.org/text/stable/vocab.html#build-vocab-from-iterator) function from the [TorchText](https://pytorch.org/text/stable/index.html) library.


The input vocabulary is the set of all unique words in our training data, and the output vocabulary is the set of all POS tags in the training data.

In [49]:
text_vocab = build_vocab_from_iterator([d[0] for d in data['train']]) # == set([all words in my training sentences])
ud_tags_vocab = build_vocab_from_iterator([d[1] for d in data['train']]) # == set([all labels in my training sentences])
print(f"Unique tokens in TEXT vocabulary: {len(text_vocab)}")
print(f"Unique tokens in UD_TAG vocabulary: {len(ud_tags_vocab)}")

Unique tokens in TEXT vocabulary: 19672
Unique tokens in UD_TAG vocabulary: 17
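
As a rough mental model of what the vocabulary construction does, the sketch below counts tokens over an iterator of sentences and assigns indices in order of decreasing frequency. It uses only `collections.Counter` on made-up sentences. The helper `build_toy_vocab` is hypothetical, and the alphabetical tie-breaking is an assumption made here for determinism.

```python
from collections import Counter

def build_toy_vocab(sentences):
    counts = Counter(tok for sent in sentences for tok in sent)
    # Most frequent first; ties broken alphabetically (an assumption made for determinism).
    itos = sorted(counts, key=lambda tok: (-counts[tok], tok))
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

toy_tags = [['DET', 'NOUN', 'VERB'], ['DET', 'NOUN'], ['NOUN']]
itos, stoi = build_toy_vocab(toy_tags)
print(itos)          # index -> token
print(stoi['NOUN'])  # token -> index
```

The two mappings play the same role as `get_itos()` and `get_stoi()` on the TorchText vocabulary objects used in this notebook.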


We can inspect the vocabularies for both the text and the tags:

In [50]:
print(text_vocab.get_itos())
print(ud_tags_vocab.get_itos())

['NOUN', 'PUNCT', 'VERB', 'PRON', 'ADP', 'DET', 'PROPN', 'ADJ', 'AUX', 'ADV', 'CCONJ', 'PART', 'NUM', 'SCONJ', 'X', 'INTJ', 'SYM']


In [51]:
print(ud_tags_vocab.get_itos())
print(ud_tags_vocab.get_stoi())

['NOUN', 'PUNCT', 'VERB', 'PRON', 'ADP', 'DET', 'PROPN', 'ADJ', 'AUX', 'ADV', 'CCONJ', 'PART', 'NUM', 'SCONJ', 'X', 'INTJ', 'SYM']
{'SYM': 16, 'INTJ': 15, 'NOUN': 0, 'VERB': 2, 'PRON': 3, 'DET': 5, 'NUM': 12, 'PROPN': 6, 'X': 14, 'ADP': 4, 'SCONJ': 13, 'PUNCT': 1, 'ADJ': 7, 'AUX': 8, 'ADV': 9, 'CCONJ': 10, 'PART': 11}


Add the "\<PAD\>" and "\<UNK\>" special tokens.


### <font color=green>Exercise 1:</font>

* <font color=green>Q) Why do we need to add those special tokens?</font>
* <font color=green>Q) Why do we not add the "\<UNK\>" token to the `ud_tags_vocab`?</font>


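We add `'<UNK>'` to represent out-of-vocabulary words, i.e. words that appear at validation or test time but not in the training data. We add the padding token `'<PAD>'` so that sentences of different lengths can be brought to the same length and stacked in the same batch.

`'<UNK>'` is not added to the tags vocabulary because it is not a part-of-speech (POS) tag: the 17 UD tags form a closed set, so every tag we can encounter is already in the vocabulary.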
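
The role of the two special tokens can be sketched in plain Python. The toy vocabulary and the helper `encode` below are hypothetical and only illustrate the idea: unknown words fall back to the `<UNK>` index, and shorter sentences are right-padded with the `<PAD>` index.

```python
def encode(tokens, stoi, max_len):
    """Map tokens to indices, falling back to <UNK>, then right-pad with <PAD> up to max_len."""
    ids = [stoi.get(tok, stoi['<UNK>']) for tok in tokens]
    return ids + [stoi['<PAD>']] * (max_len - len(ids))

# Toy vocabulary (made up); the real one is built from the training split.
toy_stoi = {'the': 0, 'cat': 1, 'sleeps': 2, '<PAD>': 3, '<UNK>': 4}

batch = [['the', 'cat', 'sleeps'], ['the', 'dog']]
max_len = max(len(s) for s in batch)
encoded = [encode(s, toy_stoi, max_len) for s in batch]
print(encoded)
```

After encoding, both sentences have the same length, so they can be stacked into a single tensor; `dog` is not in the toy vocabulary and is therefore mapped to the `<UNK>` index.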

In [52]:
text_vocab.append_token('<PAD>')
text_vocab.append_token('<UNK>')
print(f"Unique tokens in TEXT vocabulary: {len(text_vocab)}")
print(text_vocab.get_stoi()['<PAD>'], text_vocab.get_stoi()['<UNK>'])

Unique tokens in TEXT vocabulary: 19674
19672 19673


In [53]:
ud_tags_vocab.append_token('<PAD>')
print(f"Unique tokens in UD_TAG vocabulary: {len(ud_tags_vocab)}")
ud_tags_vocab.get_stoi()['<PAD>']

Unique tokens in UD_TAG vocabulary: 18


17

## Create the UDPOSDataset class and our dataloaders

In [54]:
class UDPOSDataset(Dataset):

  def __init__(self, data, text_vocab, tags_vocab, max_len):

    self.texts = []
    self.labels = []
    self.text_vocab = text_vocab
    self.stoi = text_vocab.get_stoi()
    self.itos = text_vocab.get_itos()
    self.labels_to_class = tags_vocab.get_stoi()

    unk_idx = self.stoi['<UNK>']
    pad_idx = self.stoi['<PAD>']
    tag_pad_idx = self.labels_to_class['<PAD>']

    for text, label in data:
      # Truncate to max_len so that every example ends up with exactly the same length.
      text, label = text[:max_len], label[:max_len]
      n_pad = max_len - len(text)
      # Map out-of-vocabulary words to <UNK> without modifying the input data.
      self.texts.append([self.stoi.get(s, unk_idx) for s in text] + n_pad*[pad_idx])
      self.labels.append([self.labels_to_class[l] for l in label] + n_pad*[tag_pad_idx])

    # Number of examples.
    self.n_examples = len(self.labels)

  def __len__(self):
    return self.n_examples

  def __getitem__(self, item):
    return {'text': torch.tensor(self.texts[item], dtype=torch.long),
            'label': torch.tensor(self.labels[item], dtype=torch.long)}

In [55]:
BATCH_SIZE = 10
max_len = max([len(d[0]) for d in data['train']])

train_dataset = UDPOSDataset(data['train'], text_vocab, ud_tags_vocab, max_len)
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)

val_dataset = UDPOSDataset(data['dev'], text_vocab, ud_tags_vocab, max_len)
val_dataloader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False)

test_dataset = UDPOSDataset(data['test'], text_vocab, ud_tags_vocab, max_len)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)


In [56]:
for batch in train_dataloader:

  # Let's check batch size.
  print('Batch size: %d\n'% len(batch['text']))
  print('LABEL shape\tLENGTH\tTEXT shape'.ljust(10))

  # Print some info for each example.
  for text, label in zip(batch['text'], batch['label']):
    print('%s\t%d\t%s'.ljust(10) % (label.shape, len(text), text.shape))
  print('\n')

  # Only look at first batch. Reuse this code in training models.
  break


Batch size: 10

LABEL shape	LENGTH	TEXT shape
torch.Size([159])	159	torch.Size([159])  
torch.Size([159])	159	torch.Size([159])  
torch.Size([159])	159	torch.Size([159])  
torch.Size([159])	159	torch.Size([159])  
torch.Size([159])	159	torch.Size([159])  
torch.Size([159])	159	torch.Size([159])  
torch.Size([159])	159	torch.Size([159])  
torch.Size([159])	159	torch.Size([159])  
torch.Size([159])	159	torch.Size([159])  
torch.Size([159])	159	torch.Size([159])  




## Building the Model

Next up, we define our model in the `RNNPOSTagger` class.

`self.embedding` is an embedding layer and the input dimension should be the size of the input (text) vocabulary. We tell it what the index of the padding token is so it does not update the padding token's embedding entry.

`self.rnn` is the Recurrent Neural Network in our model.

`self.fc` defines the linear layer to make predictions using the RNN outputs. We double the size of the input if we are using a bi-directional RNN. The output dimensions should be the size of the tag vocabulary.

We also define a dropout layer with `nn.Dropout`, which we use in the `forward` method to apply dropout to the embeddings and the outputs of the final layer of the RNN.

### <font color=green>Exercise 2:</font>

* <font color=green>Q) Write the definition of your RNN (only one line of code). Do you want a simple RNN, an LSTM, or a GRU? Do you want it to be bi-directional? With more than one layer? </font>

<font color=green>(Take into account that some models’ configurations will take longer to train than others, maybe you want to start with something basic and then try to improve it afterwards)</font>


In [57]:
nn.LSTM??

In [58]:
class RNNPOSTagger(nn.Module):
    def __init__(self, 
                 input_dim, 
                 embedding_dim, 
                 hidden_dim, 
                 output_dim, 
                 n_layers, 
                 bidirectional, 
                 dropout, 
                 pad_idx):
        
        super().__init__()
        
        self.embedding = nn.Embedding(input_dim, embedding_dim, padding_idx = pad_idx)
        
        # ADD YOUR CODE HERE!
        self.rnn = nn.LSTM(embedding_dim,
                           hidden_dim,
                           n_layers,
                           batch_first=True,
                           dropout = dropout if n_layers > 1 else 0,
                           bidirectional = bidirectional,
                           )
        
        self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #pass text through embedding layer
        embedded = self.dropout(self.embedding(text))
        
        #pass embeddings into the RNN
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #we use our outputs to make a prediction of what the tag should be
        predictions = self.fc(self.dropout(outputs))
        
        return predictions
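
To see how the tensor shapes evolve through the three layers, the following sketch chains an embedding, a bi-directional two-layer LSTM and a linear layer on random token indices. All sizes are toy values chosen for illustration and are not the hyperparameters used later in this notebook.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration.
vocab_size, emb_dim, hid_dim, n_tags = 50, 8, 16, 5
batch_size, seq_len = 3, 7

embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
rnn = nn.LSTM(emb_dim, hid_dim, num_layers=2, batch_first=True, bidirectional=True)
fc = nn.Linear(hid_dim * 2, n_tags)  # forward and backward states are concatenated

text = torch.randint(0, vocab_size, (batch_size, seq_len))  # [batch, seq len]
embedded = embedding(text)                                   # [batch, seq len, emb dim]
outputs, (hidden, cell) = rnn(embedded)                      # [batch, seq len, 2 * hid dim]
predictions = fc(outputs)                                    # [batch, seq len, n tags]

print(embedded.shape, outputs.shape, predictions.shape)
```

Because the linear layer is applied to the RNN output at every time step, the model produces one score vector per token, which is what a sequence labeling task needs.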

## Training the Model

Next, we instantiate the model. 

The hyperparameters have been chosen as sensible defaults, though there may be a combination that performs better on this model and dataset.

The input and output dimensions are taken directly from the lengths of the respective vocabularies. The padding index is obtained using the vocabulary of the text.

In [59]:
INPUT_DIM = len(text_vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = len(ud_tags_vocab)
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25
PAD_IDX = text_vocab.get_stoi()['<PAD>']

model = RNNPOSTagger(INPUT_DIM, 
                        EMBEDDING_DIM, 
                        HIDDEN_DIM, 
                        OUTPUT_DIM, 
                        N_LAYERS, 
                        BIDIRECTIONAL, 
                        DROPOUT, 
                        PAD_IDX)

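We initialize the weights from a simple Normal distribution. Again, there may be a better initialization scheme for this model and dataset. Note that this also overwrites the padding token's embedding row, which `nn.Embedding` had set to zeros because of `padding_idx`.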

In [60]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.normal_(param.data, mean = 0, std = 0.1)
        
model.apply(init_weights)

RNNPOSTagger(
  (embedding): Embedding(19674, 100, padding_idx=19672)
  (rnn): LSTM(100, 128, num_layers=2, batch_first=True, dropout=0.25, bidirectional=True)
  (fc): Linear(in_features=256, out_features=18, bias=True)
  (dropout): Dropout(p=0.25, inplace=False)
)

Next, a small function to tell us how many parameters are in our model. Useful for comparing different models.

In [61]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} learnable parameters')

The model has 2,602,810 learnable parameters
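
The reported total can be reproduced by hand from the hyperparameters defined above. The sketch below is plain arithmetic and assumes the usual PyTorch LSTM layout: four gates per direction, each with input weights, recurrent weights and two bias vectors. The helper `lstm_direction_params` is introduced here for illustration only.

```python
vocab_size, emb_dim, hid_dim, n_tags = 19674, 100, 128, 18

def lstm_direction_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two bias vectors (b_ih, b_hh).
    return 4 * hidden_size * (input_size + hidden_size) + 2 * 4 * hidden_size

embedding = vocab_size * emb_dim
layer1 = 2 * lstm_direction_params(emb_dim, hid_dim)      # two directions
layer2 = 2 * lstm_direction_params(2 * hid_dim, hid_dim)  # input is the concatenated layer-1 output
fc = (2 * hid_dim) * n_tags + n_tags                      # weights plus biases

total = embedding + layer1 + layer2 + fc
print(f'{embedding:,} + {layer1:,} + {layer2:,} + {fc:,} = {total:,}')
```

Most of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model size in this configuration.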


We then define our optimizer, used to update our parameters w.r.t. their gradients. We use Adam with the default learning rate.

In [62]:
optimizer = optim.Adam(model.parameters())

Next, we define our loss function, cross-entropy loss.


### <font color=yellow>Exercise 3:</font>

* <font color=yellow>Q) Why do we use `ignore_index` in our loss function?</font>



Because we do not want the padding positions to contribute to the loss or to its gradients: we do not care whether the model tags the padding correctly or not.

In [63]:
TAG_PAD_IDX = ud_tags_vocab.get_stoi()['<PAD>']

criterion = nn.CrossEntropyLoss(ignore_index = TAG_PAD_IDX)
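
The effect of `ignore_index` can be checked on a small example: with the default mean reduction, the loss equals the cross-entropy averaged over the non-padding positions only. The logits, tags and the padding index `3` below are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

pad_idx = 3  # hypothetical index of the padding tag
logits = torch.randn(6, 4)               # 6 token positions, 4 classes
tags = torch.tensor([0, 2, 1, 3, 3, 3])  # the last three positions are padding

masked_loss = nn.CrossEntropyLoss(ignore_index=pad_idx)(logits, tags)

# Same quantity computed by hand: mean cross-entropy over non-padding positions only.
keep = tags != pad_idx
manual_loss = F.cross_entropy(logits[keep], tags[keep])

print(masked_loss.item(), manual_loss.item())
```

Both values agree, which shows that the padded positions neither change the loss value nor receive any gradient.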

We then place our model and loss function on our GPU, if we have one.

In [64]:
# setting device on GPU if available, else CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = model.to(device)
criterion = criterion.to(device)

We will be using the loss value between our predicted and actual tags to train the network, but ideally we'd like a more interpretable way to see how well our model is doing - accuracy.

The issue is that we don't want to calculate accuracy over the `<PAD>` tokens as we aren't interested in predicting them.

The function below only calculates accuracy over non-padded tokens. `non_pad_elements` is a tensor containing the indices of the non-pad tokens within an input batch. We then compare the predictions of those elements with the labels to get a count of how many predictions were correct. We then divide this by the number of non-pad elements to get our accuracy value over the batch.

In [65]:
def categorical_accuracy(preds, y, tag_pad_idx):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    max_preds = preds.argmax(dim = 1, keepdim = True) # get the index of the max probability
    non_pad_elements = (y != tag_pad_idx).nonzero()
    correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
    return correct.sum() / y[non_pad_elements].shape[0]
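
The same computation can also be written with a boolean mask, which some readers may find easier to follow. This is an alternative sketch, not the function used in the rest of the notebook; the name `masked_accuracy` and the toy scores are assumptions.

```python
import torch

def masked_accuracy(preds, y, tag_pad_idx):
    """Accuracy over non-padding positions. preds: [n positions, n tags], y: [n positions]."""
    mask = y != tag_pad_idx
    correct = preds.argmax(dim=1)[mask] == y[mask]
    return correct.float().mean()

pad_idx = 2  # hypothetical padding tag index
preds = torch.tensor([[2.0, 0.1, 0.0],   # predicts 0, correct
                      [0.2, 1.5, 0.0],   # predicts 1, correct
                      [0.9, 0.3, 0.0],   # predicts 0, wrong (label is 1)
                      [0.0, 0.0, 5.0]])  # padding position, ignored
y = torch.tensor([0, 1, 1, 2])
acc = masked_accuracy(preds, y, pad_idx)
print(acc.item())
```

Two of the three real tokens are tagged correctly, so the accuracy is 2/3; the padding position does not count even though its prediction matches the padding label.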

Next is the function that handles training our model.

We first set the model to `train` mode to turn on dropout/batch-norm/etc. (if used). Then we iterate over our iterator, which returns a batch of examples. 

For each batch: 
- we zero the gradients over the parameters from the last gradient calculation
- insert the batch of text into the model to get predictions
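- `nn.CrossEntropyLoss` expects the class dimension to come second, so we flatten the predictions from `[batch size, sequence length, n tags]` to `[batch size * sequence length, n tags]` and the tags to `[batch size * sequence length]`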
- calculate the loss and accuracy between the predicted tags and actual tags
- call `backward` to calculate the gradients of the parameters w.r.t. the loss
- take an optimizer `step` to update the parameters
- add to the running total of loss and accuracy

In [66]:
def train(model, iterator, optimizer, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        # the DataLoader already yields int64 tensors, so we only move them to the device
        text = batch['text'].to(device)
        tags = batch['label'].to(device)
        
        optimizer.zero_grad()
        
        predictions = model(text)
        
        predictions = predictions.view(-1, predictions.shape[-1])
        tags = tags.view(-1)
        
        loss = criterion(predictions, tags)
                
        acc = categorical_accuracy(predictions, tags, tag_pad_idx)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
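
The reshaping step deserves a closer look. `nn.CrossEntropyLoss` expects the class scores along dimension 1, so a `[batch, seq len, n tags]` tensor has to be either flattened, as done in `train`, or permuted. The sketch below, with made-up toy sizes, shows that both options give the same loss value.

```python
import torch
import torch.nn as nn

batch_size, seq_len, n_tags = 2, 5, 4
predictions = torch.randn(batch_size, seq_len, n_tags)  # [batch, seq len, n tags]
tags = torch.randint(0, n_tags, (batch_size, seq_len))  # [batch, seq len]
criterion = nn.CrossEntropyLoss()

# Option used in this notebook: flatten batch and time into one dimension.
flat_loss = criterion(predictions.view(-1, n_tags), tags.view(-1))

# Alternative: keep the batch dimension but move the classes to dimension 1.
permuted_loss = criterion(predictions.permute(0, 2, 1), tags)

print(flat_loss.item(), permuted_loss.item())
```

Flattening has the advantage that the accuracy function can work on the same two-dimensional predictions without any further reshaping.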

In [67]:
train_loss, train_acc = train(model, train_dataloader, optimizer, criterion, TAG_PAD_IDX)

The `evaluate` function is similar to the `train` function, except with changes made so we don't update the model's parameters.

`model.eval()` is used to put the model in evaluation mode, so dropout/batch-norm/etc. are turned off. 

The iteration loop is also wrapped in `torch.no_grad` to ensure we don't calculate any gradients. We also don't need to call `optimizer.zero_grad()` and `optimizer.step()`.

In [68]:
def evaluate(model, iterator, criterion, tag_pad_idx):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            # the DataLoader already yields int64 tensors, so we only move them to the device
            text = batch['text'].to(device)
            tags = batch['label'].to(device)
            
            predictions = model(text)
            
            predictions = predictions.view(-1, predictions.shape[-1])
            tags = tags.view(-1)
            
            loss = criterion(predictions, tags)
            
            acc = categorical_accuracy(predictions, tags, tag_pad_idx)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [69]:
valid_loss, valid_acc = evaluate(model, val_dataloader, criterion, TAG_PAD_IDX)

Next, we have a small function that tells us how long an epoch takes.

In [70]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
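
As a quick worked example of the conversion, the function is repeated below and applied to two made-up timestamps that are a little over two minutes apart.

```python
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# About 125.7 seconds elapsed: 2 whole minutes and 5 whole seconds.
mins, secs = epoch_time(10.0, 135.7)
print(f'{mins}m {secs}s')
```

Fractions of a second are truncated rather than rounded, which is sufficient for a progress log.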

Finally, we train our model!

After each epoch we check if our model has achieved the best validation loss so far. If it has then we save the parameters of this model and we will use these "best" parameters to calculate performance over our test set.

In [71]:
N_EPOCHS = 10

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    train_loss, train_acc = train(model, train_dataloader, optimizer, criterion, TAG_PAD_IDX)
    valid_loss, valid_acc = evaluate(model, val_dataloader, criterion, TAG_PAD_IDX)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best_model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 0m 10s
	Train Loss: 0.214 | Train Acc: 93.40%
	 Val. Loss: 0.325 |  Val. Acc: 90.02%
Epoch: 02 | Epoch Time: 0m 10s
	Train Loss: 0.129 | Train Acc: 96.07%
	 Val. Loss: 0.378 |  Val. Acc: 89.35%
Epoch: 03 | Epoch Time: 0m 10s
	Train Loss: 0.094 | Train Acc: 97.08%
	 Val. Loss: 0.401 |  Val. Acc: 89.73%
Epoch: 04 | Epoch Time: 0m 10s
	Train Loss: 0.073 | Train Acc: 97.75%
	 Val. Loss: 0.416 |  Val. Acc: 90.09%
Epoch: 05 | Epoch Time: 0m 10s
	Train Loss: 0.058 | Train Acc: 98.18%
	 Val. Loss: 0.439 |  Val. Acc: 90.10%
Epoch: 06 | Epoch Time: 0m 11s
	Train Loss: 0.048 | Train Acc: 98.49%
	 Val. Loss: 0.420 |  Val. Acc: 90.51%
Epoch: 07 | Epoch Time: 0m 11s
	Train Loss: 0.040 | Train Acc: 98.73%
	 Val. Loss: 0.478 |  Val. Acc: 90.26%
Epoch: 08 | Epoch Time: 0m 11s
	Train Loss: 0.032 | Train Acc: 98.94%
	 Val. Loss: 0.479 |  Val. Acc: 90.39%
Epoch: 09 | Epoch Time: 0m 11s
	Train Loss: 0.027 | Train Acc: 99.09%
	 Val. Loss: 0.488 |  Val. Acc: 90.47%
Epoch: 10 | Epoch T
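
The checkpointing rule can be replayed on the validation losses printed above. The sketch only uses the nine epochs that are fully visible in the log (the line for epoch 10 is cut off) and applies the same comparison as the training loop.

```python
# Validation losses copied from the training log above (epochs 1 to 9;
# the line for epoch 10 is cut off in the log, so it is left out here).
val_losses = [0.325, 0.378, 0.401, 0.416, 0.439, 0.420, 0.478, 0.479, 0.488]

best_valid_loss = float('inf')
best_epoch = None
for epoch, valid_loss in enumerate(val_losses, start=1):
    if valid_loss < best_valid_loss:  # same test as in the training loop
        best_valid_loss = valid_loss
        best_epoch = epoch            # this is where the checkpoint would be written

print(best_epoch, best_valid_loss)
```

Among these epochs the validation loss is lowest after the first one and grows afterwards while the training loss keeps falling, which is a typical sign of overfitting and the reason for selecting the checkpoint on validation loss.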

In [72]:
model.load_state_dict(torch.load('best_model.pt'))

test_loss, test_acc = evaluate(model, test_dataloader, criterion, TAG_PAD_IDX)

print(f'Test Loss: {test_loss:.3f} |  Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.301 |  Test Acc: 90.31%


## Inference

~90% accuracy looks pretty good, but let's see our model tag some actual sentences.

We define a `tag_sentence` function that will:
- put the model into evaluation mode
- tokenize the sentence with spaCy if it is not a list
- find out which tokens are not in the vocabulary and replace them with `<UNK>`
- numericalize the tokens using the vocabulary
- convert the numericalized tokens into a tensor and add a batch dimension
- feed the tensor into the model
- get the predictions over the sentence
- convert the predictions into readable tags

As well as returning the tokens and tags, it also returns which tokens were replaced by `<UNK>`.

In [73]:
import spacy

def tag_sentence(model, device, sentence, text_vocab, tag_vocab):
    
    model.eval()
    
    if isinstance(sentence, str):
        nlp = spacy.load('en_core_web_sm')
        tokens = [token.text for token in nlp(sentence)]
    else:
        tokens = [token for token in sentence]

    # use the vocabularies passed as arguments rather than the global train_dataset
    stoi = text_vocab.get_stoi()
    itos_tags = tag_vocab.get_itos()

    unks = []
    for i,s in enumerate(tokens):
        if s not in stoi:
          unks.append(tokens[i])
          tokens[i] = '<UNK>'
    
    numericalized_tokens = [stoi[t] for t in tokens]
        
    token_tensor = torch.LongTensor(numericalized_tokens)
    # the RNN is batch_first, so the batch dimension goes first: [1, sent len]
    token_tensor = token_tensor.unsqueeze(0).to(device)
         
    # no gradients are needed at inference time
    with torch.no_grad():
        predictions = model(token_tensor)
    # drop the batch dimension again: [sent len]
    top_predictions = predictions.argmax(-1).squeeze(0)
    predicted_tags = [itos_tags[t.item()] for t in top_predictions]
    
    return tokens, predicted_tags, unks
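
Where the batch dimension is added matters for a `batch_first=True` RNN. The sketch below uses a small stand-alone LSTM with made-up sizes and already-embedded tokens to compare the two placements: adding the new dimension in front yields one sequence of five tokens, while adding it after the token dimension (which is what `unsqueeze(-1)` on a 1-D tensor of token ids leads to) yields five independent sequences of one token each.

```python
import torch
import torch.nn as nn

emb_dim, hid_dim = 4, 6
rnn = nn.LSTM(emb_dim, hid_dim, batch_first=True)

sent_len = 5
embedded_tokens = torch.randn(sent_len, emb_dim)  # one sentence, no batch dimension yet

# [1, sent len, emb dim]: one sequence of five tokens, processed with context.
as_one_sentence, _ = rnn(embedded_tokens.unsqueeze(0))

# [sent len, 1, emb dim]: five independent sequences of one token each.
as_five_sentences, _ = rnn(embedded_tokens.unsqueeze(1))

print(as_one_sentence.shape, as_five_sentences.shape)
```

Both calls run without error, but only the first lets the hidden state carry information from one token to the next, so it is worth checking the input shape before trusting the predicted tags.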

Let's now make up our own sentence and see how well the model does.

Our example sentence below has every token within the model's vocabulary.

In [74]:
sentence = 'The Queen will deliver a speech about the conflict in North Korea at 1pm tomorrow.'

tokens, tags, unks = tag_sentence(model, 
                                  device, 
                                  sentence, 
                                  text_vocab, 
                                  ud_tags_vocab)

print(f'Unknown tokens: {unks}')
visualize_pos((tokens,tags))

Unknown tokens: []


Looking at the sentence it seems like it gave sensible tags to every token!

In [75]:
print("Pred. Tag\tToken\n")

for token, tag in zip(tokens, tags):
    print(f"{tag}\t\t{token}")

Pred. Tag	Token

DET		The
PROPN		Queen
AUX		will
VERB		deliver
DET		a
NOUN		speech
ADP		about
DET		the
NOUN		conflict
ADP		in
PROPN		North
PROPN		Korea
ADP		at
NUM		1
NOUN		pm
NOUN		tomorrow
PUNCT		.


# Optional, Named Entity Recognition (NER)

Both POS tagging and NER are Sequence Labeling tasks. It should be straightforward to reuse our code to perform NER.

Here is a link to a small NER dataset:

https://dew-new-eval.s3.us-east-2.amazonaws.com/nlp_data/ner_dataset.csv 

Download it, take a look at how the data is structured and modify the code of this notebook accordingly to work with it.


# Optional, pre-trained embeddings

It would be a good idea to initialize our model's embedding layer with pre-trained word embedding values.

In [76]:
pretrained_embeddings = #Load here some pre-trained word embeddings (e.g. GloVe, word2vec, etc.)

print(pretrained_embeddings.shape)

SyntaxError: ignored

In [None]:
model.embedding.weight.data.copy_(pretrained_embeddings)

It's common to initialize the embedding of the pad token to all zeros. This, along with setting the `padding_idx` in the model's embedding layer, means that the embedding should always output a tensor full of zeros when a pad token is input.

In [None]:
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)

print(model.embedding.weight.data)
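
One possible way to build such an embedding matrix is sketched below on a toy vocabulary. The dictionary `pretrained` is a hypothetical stand-in for real vectors (for example, vectors read from a GloVe text file), and all sizes are made up. Words without a pre-trained vector keep a small random embedding, and the padding row is set to zeros.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for real pre-trained vectors (e.g. loaded from a GloVe text file).
emb_dim = 3
pretrained = {'the': [0.1, 0.2, 0.3], 'cat': [0.4, 0.5, 0.6]}

itos = ['the', 'cat', 'dog', '<PAD>', '<UNK>']  # toy vocabulary
pad_idx = itos.index('<PAD>')

# Start from small random values so words without a pre-trained vector still get an embedding.
weights = torch.randn(len(itos), emb_dim) * 0.1
for idx, token in enumerate(itos):
    if token in pretrained:
        weights[idx] = torch.tensor(pretrained[token])
weights[pad_idx] = torch.zeros(emb_dim)  # padding embedding is all zeros

embedding = nn.Embedding(len(itos), emb_dim, padding_idx=pad_idx)
embedding.weight.data.copy_(weights)

print(embedding(torch.tensor([0, pad_idx])))
```

The rows of the matrix must follow the index order of the text vocabulary, and its second dimension must match `EMBEDDING_DIM` for the copy into `model.embedding` to succeed.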