---
Author: **`Crispen Gari`**

Year: **`2021`**

Date: **`2021-09-15`**

Language: **`Python`**

Libray: **`Pytorch`**

Topic: **`Named Entity Recognition (NER)`**

Main: **`Natural Language Processing (NLP)`**

---

### Named Entity Recognition (NER) using Bi-Directional LSTM (Bi-LSTM)

In this series of notebookswe are going to have a look at an intresting topic in Natural Language Processing (NLP) known as Named Entity Recognition (NER).

### Some of the uses
Named Entity Recognition can automatically scan entire articles and reveal which are the major people, organizations, and places discussed in them. Knowing the relevant tags for each article help in automatically categorizing the articles in defined hierarchies and enable smooth content discover.

### Imports

In [1]:
import time, os, torch, random, json

from torch import nn
from torch.nn import functional as F

from torchtext.legacy import data, datasets

import numpy as np

torch.__version__


'1.9.0+cu102'

### Seeds and Device

In [2]:
SEED = 42

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

torch.backends.cudnn.deterministic = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

### Data.

The data that we will be working with was found [here](https://github.com/yoseflaw/nerindo/tree/master/input). What i've done is to download the the three files and load uploaded them on my google drive so that it can be eaisly loaded in google colab. 

### Explaining the data.
The data files are tab deliminated, `tsv` where columns are separated by a `\t`.

To accommodate multi-word entities, the tags follow what is called **`BILOU`** format: **B**eginning, **I**nside, **L**ast, **O**uter, **U**nit. This is indicated by the character preceding the dash symbol. The type of the named entity is shown by the remaining part. For instance, *Universtas Gadjah Mada* is an `ORGANIZATION` and *Arie Sudjito* is a `PERSON`. 

Here is what a single file may look like:
```
Pengamat	O
politik	O
dari	O
Universitas	B-ORGANIZATION
Gadjah	I-ORGANIZATION
Mada	L-ORGANIZATION
,	O
Arie	B-PERSON
Sudjito	L-PERSON
,	O
menilai	O
,	O
keinginan	O
Ketua	O
Umum	O
Partai	B-ORGANIZATION
Golkar	L-ORGANIZATION
Aburizal	B-PERSON
Bakrie	L-PERSON
untuk	O
maju	O
kembali	O
sebagai	O
ketua	O
umum	O
merupakan	O
pemaksaan	O
kehendak	O
.	O

Menurut	O
dia	O
,	O
ada	O
kesan	O
bahwa	O
Aburizal	U-PERSON
menggunakan	O
segala	O
cara	O
untuk	O
memuluskan	O
jalannya	O
kembali	O
menduduki	O
Golkar	U-ORGANIZATION
1	O
.	O
```

In [3]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Root Path

In [4]:
root = '/content/drive/My Drive/NLP Data/ner/data'
os.path.exists(root)

True

### Fields

In [5]:
WORD = data.Field(lower=True)
TAG = data.Field(unk_token=None)

In [6]:
fields = (
 ("word", WORD),
 ("tag", TAG)
)

In [7]:
train_dataset, valid_dataset, test_dataset = datasets.SequenceTaggingDataset.splits(
    path=root,
    train="train.tsv",
    validation="val.tsv",
    test= "test.tsv",
    fields=fields
)

In [8]:
print(vars(train_dataset.examples[67]))

{'word': ['"', 'benar', ',', 'jaksa', 'agung', 'prasetyo', '.', 'tadi', 'pagi', '(', 'keputusannya', ')', ',', '"', 'kata', 'andi', 'kepada', 'tempo', 'di', 'jakarta', ',', 'kamis', ',', '20', 'november', '2014', '.'], 'tag': ['O', 'O', 'O', 'O', 'O', 'U-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-PERSON', 'O', 'U-ORGANIZATION', 'O', 'U-LOCATION', 'O', 'B-TIME', 'I-TIME', 'I-TIME', 'I-TIME', 'L-TIME', 'O']}


### Building vocabulary

In [9]:
WORD.build_vocab(train_dataset, min_freq=3)
TAG.build_vocab(train_dataset)

In [10]:
TAG.vocab.stoi

defaultdict(None,
            {'<pad>': 0,
             'B-LOCATION': 10,
             'B-ORGANIZATION': 7,
             'B-PERSON': 5,
             'B-QUANTITY': 13,
             'B-TIME': 18,
             'I-LOCATION': 17,
             'I-ORGANIZATION': 9,
             'I-PERSON': 15,
             'I-QUANTITY': 16,
             'I-TIME': 12,
             'L-LOCATION': 11,
             'L-ORGANIZATION': 8,
             'L-PERSON': 6,
             'L-QUANTITY': 14,
             'L-TIME': 19,
             'O': 1,
             'U-LOCATION': 4,
             'U-ORGANIZATION': 2,
             'U-PERSON': 3,
             'U-QUANTITY': 21,
             'U-TIME': 20})

### Creating iterators

In [12]:
BATCH_SIZE = 128
train_iter, valid_iter, test_iter = data.BucketIterator.splits(
    (train_dataset,valid_dataset, test_dataset),
    device = device,
    batch_size = BATCH_SIZE,
    sort_key = lambda x: x.word,
)

### Building the vocabulary

Every word that apears less that 2 times in the corpus will be converted to unknown token.

In [13]:
WORD.build_vocab(train_dataset, min_freq=2)
TAG.build_vocab(train_dataset)

### Counting examples

In [14]:
print(f"Train set: {len(train_dataset)} sentences")
print(f"Val set: {len(valid_dataset)} sentences")
print(f"Test set: {len(test_dataset)} sentences")


Train set: 3535 sentences
Val set: 470 sentences
Test set: 468 sentences


### Model


In [15]:
class BiLSTM(nn.Module):
  def __init__(self, input_dim,
               embedding_dim,
               hidden_dim,
               output_dim,
               word_pad_idx,
               dropout=.5,
               n_layers=2,
               bidirectional = True
               ):
    super(BiLSTM, self).__init__()
    self.embedding = nn.Embedding(input_dim, embedding_dim,)
    self.lstm = nn.LSTM(
        input_size=embedding_dim,
        hidden_size=hidden_dim,
        num_layers=n_layers,
        bidirectional=bidirectional,
        dropout=dropout if n_layers > 1 else 0
    )
    self.fc = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim,
                        output_dim)
    self.dropout= nn.Dropout(.5)

  def forward(self, sentence):
    # sentence = [sentence length, batch size]
    embedded= self.dropout(self.embedding(sentence)) # [sentence length, batch size, embedding dim]
    out, _ = self.lstm(embedded) # out = [sentence length, batch size, hidden dim * 2]
    out = self.fc(self.dropout(out)) # [sentence length, batch size, output dim]
    return out


In [16]:
INPUT_DIM=len(WORD.vocab)
EMBEDDING_DIM=300
HIDDEN_DIM=256
OUTPUT_DIM=len(TAG.vocab)
N_LAYERS=2
DROPOUT=0.1
WORD_PAD_IDX = WORD.vocab.stoi[WORD.pad_token]
TAG_PAD_IDX = TAG.vocab.stoi[TAG.pad_token]
BIDIRECTIONAL = True


model = BiLSTM(
    INPUT_DIM,
    EMBEDDING_DIM,
    HIDDEN_DIM,
    OUTPUT_DIM,
    WORD_PAD_IDX,
    DROPOUT,
    N_LAYERS,
    BIDIRECTIONAL
).to(device)
model

BiLSTM(
  (embedding): Embedding(5271, 300)
  (lstm): LSTM(300, 256, num_layers=2, dropout=0.1, bidirectional=True)
  (fc): Linear(in_features=512, out_features=22, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### Initializing model weights

In [17]:
def init_weights(m):
  for name, param in m.named_parameters():
    nn.init.normal_(param.data, mean=0, std=0.1)

model.apply(init_weights)

BiLSTM(
  (embedding): Embedding(5271, 300)
  (lstm): LSTM(300, 256, num_layers=2, dropout=0.1, bidirectional=True)
  (fc): Linear(in_features=512, out_features=22, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
)

### initialize embedding for padding as zero

In [18]:
model.embedding.weight.data[WORD_PAD_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data

tensor([[ 0.2319,  0.0509, -0.2044,  ...,  0.0600, -0.1761,  0.0536],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0905,  0.1235, -0.0220,  ...,  0.0997,  0.0513, -0.1714],
        ...,
        [ 0.1098,  0.0072, -0.1398,  ...,  0.0879, -0.1017,  0.0143],
        [ 0.0674, -0.0810,  0.2721,  ...,  0.0041, -0.0147,  0.0828],
        [-0.0022, -0.0720, -0.0810,  ..., -0.0563,  0.0377,  0.0070]],
       device='cuda:0')

### Counting model parameters.

In [19]:
def count_trainable_params(model):
  return sum(p.numel() for p in model.parameters()), sum(p.numel() for p in model.parameters() if p.requires_grad)

n_params, trainable_params = count_trainable_params(model)
print(f"Total number of paramaters: {n_params:,}\nTotal tainable parameters: {trainable_params:,}")

Total number of paramaters: 4,312,330
Total tainable parameters: 4,312,330


### Training the model.

In [20]:
def accuracy(preds, y):
  max_preds = preds.argmax(dim=1, keepdim=True).to(device)  # get the index of the max probability
  non_pad_elements = (y != TAG_PAD_IDX).nonzero().to(device)   # prepare masking for paddings
  correct = max_preds[non_pad_elements].squeeze(1).eq(y[non_pad_elements])
  return correct.sum() / torch.FloatTensor([y[non_pad_elements].shape[0]]).to(device) 


In [21]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

### Optimizer and Criterion

In [22]:
optimizer = torch.optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss().to(device)

### Train and evalutation functions

In [23]:
def train(model, iterator, optimizer, criterion):
  epoch_loss = 0
  epoch_acc = 0
  model.train()
  for batch in iterator:
    # text = [sent len, batch size]
    text = batch.word
    # tags = [sent len, batch size]
    true_tags = batch.tag
    optimizer.zero_grad()
    pred_tags = model(text)
    # to calculate the loss and accuracy, we flatten both prediction and true tags
    # flatten pred_tags to [sent len, batch size, output dim]
    pred_tags = pred_tags.view(-1, pred_tags.shape[-1])
    # flatten true_tags to [sent len * batch size]
    true_tags = true_tags.view(-1)
    batch_loss = criterion(pred_tags, true_tags)
    batch_acc = accuracy(pred_tags, true_tags)
    batch_loss.backward()
    optimizer.step()
    epoch_loss += batch_loss.item()
    epoch_acc += batch_acc.item()
  return epoch_loss / len(iterator), epoch_acc / len(iterator)

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()
    with torch.no_grad():
        # similar to epoch() but model is in evaluation mode and no backprop
        for batch in iterator:
            text = batch.word
            true_tags = batch.tag
            pred_tags = model(text)
            pred_tags = pred_tags.view(-1, pred_tags.shape[-1])
            true_tags = true_tags.view(-1)
            batch_loss = criterion(pred_tags, true_tags)
            batch_acc = accuracy(pred_tags, true_tags)
            epoch_loss += batch_loss.item()
            epoch_acc += batch_acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


### Training Loop

We are going to create helper functions that will help us to visualize our training.


1. Time to string.

In [24]:
def hms_string(sec_elapsed):
  h = int(sec_elapsed / (60 * 60))
  m = int((sec_elapsed % (60 * 60)) / 60)
  s = sec_elapsed % 60
  return "{}:{:>02}:{:>05.2f}".format(h, m, s)

2. tabulate training epoch.

In [25]:
from prettytable import PrettyTable

def visualize_training(start, end, train_loss, train_accuracy, val_loss, val_accuracy, title):
  data = [
       ["Training", f'{train_loss:.3f}', f'{train_accuracy:.3f}', f"{hms_string(end - start)}" ],
       ["Validation", f'{val_loss:.3f}', f'{val_accuracy:.3f}', "" ],       
  ]
  table = PrettyTable(["CATEGORY", "LOSS", "ACCURACY", "ETA"])
  table.align["CATEGORY"] = 'l'
  table.align["LOSS"] = 'r'
  table.align["ACCURACY"] = 'r'
  table.align["ETA"] = 'r'
  table.title = title
  for row in data:
    table.add_row(row)
  print(table)

In [26]:
N_EPOCHS = 10
best_valid_loss = float('inf')
for epoch in range(N_EPOCHS):
    start = time.time()
    train_loss, train_acc = train(model, train_iter, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iter, criterion)
    title = f"EPOCH: {epoch+1:02}/{N_EPOCHS:02} {'saving best model...' if valid_loss < best_valid_loss else 'not saving...'}"
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')
    end = time.time()
    visualize_training(start, end, train_loss, train_acc, valid_loss, valid_acc, title)


+--------------------------------------------+
|     EPOCH: 01/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.631 |    0.714 | 0:00:03.53 |
| Validation | 0.350 |    0.844 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 02/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   | 0.339 |    0.829 | 0:00:03.47 |
| Validation | 0.300 |    0.847 |            |
+------------+-------+----------+------------+
+--------------------------------------------+
|     EPOCH: 03/10 saving best model...      |
+------------+-------+----------+------------+
| CATEGORY   |  LOSS | ACCURACY |        ETA |
+------------+-------+----------+------------+
| Training   

### Evaluating the best model.

In [27]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(model, test_iter, criterion)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.139 | Test Acc: 90.66%


### Model Inference

In [28]:
from spacy.lang.id import Indonesian
nlp = Indonesian()

In [29]:
def infer(model, sentence, true_tags=None):
    model.eval()
    # tokenize sentence
    tokens = [token.text.lower() for token in nlp(sentence)]
    # transform to indices based on corpus vocab
    numericalized_tokens = [WORD.vocab.stoi[t] for t in tokens]
    # find unknown words
    unk_idx = WORD.vocab.stoi[WORD.unk_token]
    unks = [t for t, n in zip(tokens, numericalized_tokens) if n == unk_idx]
    # begin prediction
    token_tensor = torch.LongTensor(numericalized_tokens)
    token_tensor = token_tensor.unsqueeze(-1).to(device)
    predictions = model(token_tensor)
    # convert results to tags
    top_predictions = predictions.argmax(-1)
    predicted_tags = [TAG.vocab.itos[t.item()] for t in top_predictions]
    # print inferred tags
    max_len_token = max([len(token) for token in tokens] + [len("word")])
    max_len_tag = max([len(tag) for tag in predicted_tags] + [len("pred")])
    print(
        f"{'word'.ljust(max_len_token)}\t{'unk'.ljust(max_len_token)}\t{'pred tag'.ljust(max_len_tag)}" 
        + ("\ttrue tag" if true_tags else "")
        )
    for i, token in enumerate(tokens):
      is_unk = "✓" if token in unks else ""
      print(
          f"{token.ljust(max_len_token)}\t{is_unk.ljust(max_len_token)}\t{predicted_tags[i].ljust(max_len_tag)}" 
          + (f"\t{true_tags[i]}" if true_tags else "")
          )
    return tokens, predicted_tags, unks

sentence = "Sementara itu, Kepala Pelaksana BPBD Luwu Utara Muslim Muchtar mengatakan, terdapat 15.000 jiwa mengungsi akibat banjir bandang."
tags = ["O", "O", "O", "O", "O", "B-ORGANIZATION", "I-ORGANIZATION", "L-ORGANIZATION", "B-PERSON", "L-PERSON", "O", "O", "O", "U-QUANTITY", "O", "O", "O", "O", "O", "O"]
words, infer_tags, unknown_tokens = infer(model, sentence=sentence, true_tags=tags)


word      	unk       	pred tag  	true tag
sementara 	          	O         	O
itu       	          	O         	O
,         	          	O         	O
kepala    	          	O         	O
pelaksana 	          	O         	O
bpbd      	✓         	O         	B-ORGANIZATION
luwu      	✓         	O         	I-ORGANIZATION
utara     	          	O         	L-ORGANIZATION
muslim    	          	O         	B-PERSON
muchtar   	✓         	O         	L-PERSON
mengatakan	          	O         	O
,         	          	O         	O
terdapat  	          	O         	O
15.000    	✓         	B-QUANTITY	U-QUANTITY
jiwa      	          	L-QUANTITY	O
mengungsi 	          	O         	O
akibat    	          	O         	O
banjir    	          	O         	O
bandang   	          	O         	O
.         	          	O         	O
