---
Author:                 **`Crispen Gari`**

Topic:                  **`"Named Entity Recognition" (NER)`**
 
Main:                   **`Natural Language Processing (NLP)`**

Library:                **`TensorFlow (2.x)`**

Programming Language:   **`Python`**

Date:                   **`2021-09-20`**

---




### Named Entity Recognition

In this series of notebooks we are going to go through `NER` (Named Entity Recognition) using TensorFlow 2. The dataset for this notebook is the English version of [conll2003](https://www.clips.uantwerpen.be/conll2003/ner/). I've downloaded the data and uploaded it to my Google Drive so that it can be loaded easily here on Google Colab.


### Mounting the google drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Imports

In [2]:
import os, time, io

import tensorflow as tf
import numpy as np
from tensorflow import keras

from collections import Counter


### Path to files.

In [3]:
root = '/content/drive/My Drive/NLP Data/ner-CoNLL-2003'
os.path.exists(root)

True

### File structures
We have three files in the `ner-CoNLL-2003` folder which are:

1. train.txt
2. valid.txt
3. test.txt

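Each file contains data of the following form: one token per line with four space-separated columns (word, part-of-speech tag, syntactic chunk tag and named entity tag), and a blank line between sentences: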

```txt
-DOCSTART- -X- -X- O

CRICKET NNP B-NP O
- : O O
LEICESTERSHIRE NNP B-NP B-ORG
TAKE NNP I-NP O
OVER IN B-PP O
AT NNP B-NP O
TOP NNP I-NP O
AFTER NNP I-NP O
INNINGS NNP I-NP O
VICTORY NN I-NP O
. . O O

```
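
As a minimal sketch of how one line of this format breaks down, the snippet below splits a token line into its four columns. The helper name `parse_conll_line` is hypothetical and is not part of this notebook; it only illustrates the layout shown above.

```python
# Hypothetical helper for illustration only; not part of the original notebook.
def parse_conll_line(line):
    """Split one CoNLL-2003 token line into its four space-separated columns."""
    word, pos, chunk, ner = line.rstrip("\n").split(" ")
    return {"word": word, "pos": pos, "chunk": chunk, "ner": ner}


row = parse_conll_line("LEICESTERSHIRE NNP B-NP B-ORG\n")
print(row["word"], row["ner"])
```

Only the first and the last column (the word and its entity label) are needed for the rest of this notebook.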

### Data preprocessing
We are going to extract the words together with their named entity labels into a list of sentences, where each sentence is a list of `[word, label]` pairs, for example:

```
[
  [['EU', 'B-ORG'], ['rejects', 'O'], ...],
  ...
]
```
The following code defines a single function that we can reuse for all three sets.

In [4]:
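def split_text_label(filename):
  """Read a CoNLL file and return a list of sentences of [word, label] pairs."""
  split_labeled_text = []
  sentence = []
  # The context manager closes the file once it has been read.
  with open(os.path.join(root, filename)) as f:
    for line in f:
      # Blank lines separate sentences and -DOCSTART- lines separate documents.
      if len(line) == 0 or line.startswith('-DOCSTART') or line[0] == "\n":
        if len(sentence) > 0:
          split_labeled_text.append(sentence)
          sentence = []
        continue
      # Keep the first column (the word) and the last column (the entity label).
      splits = line.split(' ')
      sentence.append([splits[0], splits[-1].rstrip("\n")])
  # Flush the last sentence when the file does not end with a blank line.
  if len(sentence) > 0:
    split_labeled_text.append(sentence)
  return split_labeled_text

train = split_text_label("train.txt")
valid = split_text_label("valid.txt")
test = split_text_label("test.txt")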

In [5]:
valid[:5]

[[['CRICKET', 'O'],
  ['-', 'O'],
  ['LEICESTERSHIRE', 'B-ORG'],
  ['TAKE', 'O'],
  ['OVER', 'O'],
  ['AT', 'O'],
  ['TOP', 'O'],
  ['AFTER', 'O'],
  ['INNINGS', 'O'],
  ['VICTORY', 'O'],
  ['.', 'O']],
 [['LONDON', 'B-LOC'], ['1996-08-30', 'O']],
 [['West', 'B-MISC'],
  ['Indian', 'I-MISC'],
  ['all-rounder', 'O'],
  ['Phil', 'B-PER'],
  ['Simmons', 'I-PER'],
  ['took', 'O'],
  ['four', 'O'],
  ['for', 'O'],
  ['38', 'O'],
  ['on', 'O'],
  ['Friday', 'O'],
  ['as', 'O'],
  ['Leicestershire', 'B-ORG'],
  ['beat', 'O'],
  ['Somerset', 'B-ORG'],
  ['by', 'O'],
  ['an', 'O'],
  ['innings', 'O'],
  ['and', 'O'],
  ['39', 'O'],
  ['runs', 'O'],
  ['in', 'O'],
  ['two', 'O'],
  ['days', 'O'],
  ['to', 'O'],
  ['take', 'O'],
  ['over', 'O'],
  ['at', 'O'],
  ['the', 'O'],
  ['head', 'O'],
  ['of', 'O'],
  ['the', 'O'],
  ['county', 'O'],
  ['championship', 'O'],
  ['.', 'O']],
 [['Their', 'O'],
  ['stay', 'O'],
  ['on', 'O'],
  ['top', 'O'],
  [',', 'O'],
  ['though', 'O'],
  [',', 'O'],
  ...

### Building the vocabulary

Next we are going to build the vocabulary for all unique words and labels.

In [6]:
labelSet = set()
wordSet = set()
# words and labels
for data in [train, valid, test]:
  for labeled_text in data:
    for word, label in labeled_text:
      labelSet.add(label)
      wordSet.add(word.lower())

We are going to create index mappings, starting with our entity labels.

In [7]:

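label2Idx = {}
# Sort by length and then alphabetically, so that 'O' gets index 0
# and the label order is the same on every run.
for label in sorted(labelSet, key=lambda l: (len(l), l)):
  label2Idx[label] = len(label2Idx)
idx2Label = {v: k for k, v in label2Idx.items()}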

In [8]:
len(idx2Label)

9
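
The count of 9 matches the CoNLL-2003 tag set: `O` plus a `B-` and an `I-` tag for each of the four entity types. The sketch below rebuilds such a mapping on a hard-coded label set (an assumption made for illustration, rather than reading it from the data) and sorts by length and then alphabetically, so that `O` always lands on index 0. That is also the value `pad_sequences` fills in by default, so padded label positions read as `O`.

```python
# Assumed tag set: the nine CoNLL-2003 labels, hard-coded for illustration.
label_set = {
    "O",
    "B-PER", "I-PER",
    "B-ORG", "I-ORG",
    "B-LOC", "I-LOC",
    "B-MISC", "I-MISC",
}

# Sort by length first, then alphabetically, to get a reproducible order.
ordered = sorted(label_set, key=lambda label: (len(label), label))
label2idx = {label: index for index, label in enumerate(ordered)}
idx2label = {index: label for label, index in label2idx.items()}

print(label2idx)
```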

Word index mapping for the words

In [9]:
word2Idx = {}
word2Idx["<pad>"] = 0 # padding token
word2Idx["<unk>"] = 1 # unknown token 

# Sorting makes the word indices reproducible across runs.
for word in sorted(wordSet):
  word2Idx[word] = len(word2Idx)

idx2Word = {v:k for k, v in word2Idx.items()}
print(len(idx2Word))

26872


In [10]:
def createMatrices(data, word2Idx, label2Idx):
  """Convert sentences of [word, label] pairs into lists of indices."""
  sentences = []
  labels = []
  for split_labeled_text in data:
    wordIndices = []
    labelIndices = []
    for word, label in split_labeled_text:
      # Try the word as written, then lowercased, then fall back to <unk>.
      if word in word2Idx:
        wordIdx = word2Idx[word]
      elif word.lower() in word2Idx:
        wordIdx = word2Idx[word.lower()]
      else:
        wordIdx = word2Idx['<unk>']
      wordIndices.append(wordIdx)
      labelIndices.append(label2Idx[label])
    sentences.append(wordIndices)
    labels.append(labelIndices)
  return sentences, labels

train_sents, train_labels = createMatrices(
    train, word2Idx, label2Idx
)
valid_sents, valid_labels = createMatrices(
    valid, word2Idx, label2Idx
)
test_sents, test_labels = createMatrices(
    test, word2Idx, label2Idx
)
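
To make the lookup order concrete, here is a small sketch on a toy vocabulary. The vocabulary contents and the helper name `encode` are assumptions made for illustration; the fallback order (word as written, lowercased word, then `<unk>`) mirrors `createMatrices`.

```python
# Toy vocabulary for illustration; the real word2Idx is built from the dataset.
word2idx = {"<pad>": 0, "<unk>": 1, "eu": 2, "rejects": 3, "german": 4}


def encode(words, word2idx):
    """Map words to indices, falling back to the lowercased form and then <unk>."""
    unk = word2idx["<unk>"]
    return [word2idx.get(w, word2idx.get(w.lower(), unk)) for w in words]


ids = encode(["EU", "rejects", "Zimbabwean"], word2idx)
print(ids)  # [2, 3, 1]
```

Because the vocabulary in this notebook is built from all three splits, `<unk>` should rarely appear in these matrices; it matters mostly for new sentences at inference time.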

#### Padding the sequences

We are going to pad (or truncate) our sequences so that they all have the same length of 100.

In [11]:
def padding(sents, labels, max_len=100):
  padded_sentences = keras.preprocessing.sequence.pad_sequences(
      sents, maxlen=max_len, padding='post', truncating='post')
  padded_labels = keras.preprocessing.sequence.pad_sequences(
      labels, maxlen=max_len, padding='post', truncating='post')
  return padded_sentences, padded_labels


In [12]:
train_features, train_labels = padding(train_sents, train_labels)
valid_features, valid_labels = padding(valid_sents, valid_labels)
test_features, test_labels = padding(test_sents, test_labels)
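
The effect of `padding='post'` and `truncating='post'` can be sketched in plain Python. The function `pad_post` below is a hand-written illustration of that behaviour for a single sequence, not the Keras implementation.

```python
def pad_post(seq, max_len, value=0):
    """Cut the sequence at max_len, then right-pad it with `value`."""
    trimmed = list(seq[:max_len])
    return trimmed + [value] * (max_len - len(trimmed))


print(pad_post([5, 8, 2], 5))           # [5, 8, 2, 0, 0]
print(pad_post([5, 8, 2, 9, 4, 7], 5))  # [5, 8, 2, 9, 4]
```

Short sentences gain trailing zeros, while sentences longer than the limit lose their last tokens.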

### Word embeddings

We are going to load pretrained word embeddings from a local file, namely the `glove.6B.100d` word vectors.

In [13]:
embedding_path = "/content/drive/MyDrive/NLP Data/glove.6B/glove.6B.100d.txt"
os.path.exists(embedding_path)

True

In [14]:
embedding_dict = dict()
with open(embedding_path, encoding="utf8") as glove:
  for line in glove:
    records = line.split()
    word = records[0]
    vectors = np.asarray(records[1: ], dtype=np.float32)
    embedding_dict[word] = vectors

print(len(embedding_dict))
embedding_dict["what"].shape

400000


(100,)

### Embedding matrix
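We are then going to create an embedding matrix that suits our vocabulary: row `i` holds the GloVe vector of the word with index `i`, and words without a GloVe vector keep an all-zero row.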

In [15]:
vocab_size = len(word2Idx)

In [16]:
embedding_matrix = np.zeros((vocab_size, 100))

for word, index in word2Idx.items():
  vector = embedding_dict.get(word)
  if vector is not None:
    embedding_matrix[index] = vector
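
It can be useful to check how much of the vocabulary actually receives a pretrained vector. The sketch below does this on toy stand-ins for `word2Idx` and `embedding_dict`; the words and the two-dimensional vectors are made-up values for illustration only.

```python
# Toy stand-ins; the vectors are made-up numbers, not real GloVe values.
word2idx = {"<pad>": 0, "<unk>": 1, "eu": 2, "lamb": 3, "leicestershire": 4}
embedding_dict = {"eu": [0.1, 0.2], "lamb": [0.3, 0.4]}
dim = 2

# Start from all-zero rows and fill in the rows that have a pretrained vector.
matrix = [[0.0] * dim for _ in range(len(word2idx))]
missing = []
for word, index in word2idx.items():
    vector = embedding_dict.get(word)
    if vector is None:
        missing.append(word)
    else:
        matrix[index] = list(vector)

coverage = 1 - len(missing) / len(word2idx)
print(missing)  # ['<pad>', '<unk>', 'leicestershire']
print(f"coverage: {coverage:.0%}")
```

Rows for missing words stay at zero, so with `trainable=True` in the embedding layer they are learned from scratch during training.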

### Input Pipeline

We are going to use `tf.data.Dataset.from_tensor_slices` to build datasets from the padded arrays so that we can shuffle and batch them.

In [17]:
BATCH_SIZE = 64
BUFFER_SIZE = train_features.shape[0]

train_dataset = tf.data.Dataset.from_tensor_slices(
    (train_features, train_labels)
).shuffle(BUFFER_SIZE, reshuffle_each_iteration=True).batch(BATCH_SIZE, 
                                                             drop_remainder=True)

valid_dataset = tf.data.Dataset.from_tensor_slices(
    (valid_features, valid_labels)
).batch(BATCH_SIZE, drop_remainder=True)

test_dataset = tf.data.Dataset.from_tensor_slices(
    (test_features, test_labels)
).batch(BATCH_SIZE, drop_remainder=True)
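
One consequence of `drop_remainder=True` is that the last incomplete batch is discarded, which also applies to the validation and test sets here. The arithmetic can be sketched as follows; the helper `batch_counts` and the example size of 1000 are hypothetical and are not taken from the dataset.

```python
def batch_counts(num_examples, batch_size, drop_remainder=True):
    """Return (number of batches, number of examples left out)."""
    full, leftover = divmod(num_examples, batch_size)
    if drop_remainder:
        return full, leftover
    # Without drop_remainder the leftover examples form one smaller batch.
    return full + (1 if leftover else 0), 0


print(batch_counts(1000, 64))         # (15, 40)
print(batch_counts(1000, 64, False))  # (16, 0)
```

If every evaluation example should be counted, one option is to batch the validation and test sets with `drop_remainder=False`.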


### Model
We are going to build a bidirectional Long Short-Term Memory (BiLSTM) model using the Keras subclassing API. Feel free to use the Sequential API instead; it will work as well.

In [18]:
class NER(keras.Model):
  def __init__(self, max_seq_len,
               vocab_size,
               output_dim, weights):
    super(NER, self).__init__()

    # The first argument of Embedding is the vocabulary size and the second
    # is the vector size (100 for glove.6B.100d).
    self.embedding = keras.layers.Embedding(
        vocab_size, weights.shape[1],
        weights = [weights],
        trainable=True,
        input_length=max_seq_len
    )
    self.bilstm = keras.layers.Bidirectional(
        keras.layers.LSTM(128, 
                          dropout=.5,
                          return_sequences=True
        )
    )
    self.dropout = keras.layers.Dropout(rate=.5)
    self.out = keras.layers.Dense(output_dim)

  def call(self, x):
    x = self.dropout(self.embedding(x))
    x = self.bilstm(x)
    # One vector of unnormalised logits per token; dropout is only active
    # during training.
    return self.dropout(self.out(x))

In [19]:
model = NER(
    max_seq_len= 100,
    vocab_size=len(word2Idx),
    weights = embedding_matrix,
    output_dim= len(idx2Label)
)

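Now the model can be compiled and trained by calling `.compile()` and `.fit()` as follows. The loss uses `from_logits=True` because the final `Dense` layer has no activation: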

In [20]:
model.compile(
    optimizer = 'adam',
    loss = keras.losses.SparseCategoricalCrossentropy(
        from_logits=True
    ),
    metrics = ["acc"]
)

model.fit(
    train_dataset, epochs = 10, validation_data = valid_dataset
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fad003dfd50>

### Evaluating the model

In [21]:
model.evaluate(test_dataset, verbose=1)



[0.02051875926554203, 0.9941627383232117]
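
Note that the accuracy above is computed over all 100 positions of every sequence, padding included, so it is probably higher than the accuracy on real tokens alone. A minimal sketch of a token-level accuracy that skips padded positions is shown below, on made-up toy arrays. The function `masked_accuracy` is hypothetical and assumes that index 0 in the features marks padding, as in `word2Idx`.

```python
def masked_accuracy(features, labels, predictions, pad_id=0):
    """Accuracy over positions whose input token is not the padding index."""
    correct = 0
    total = 0
    for feat_row, label_row, pred_row in zip(features, labels, predictions):
        for token_id, label, pred in zip(feat_row, label_row, pred_row):
            if token_id == pad_id:
                continue  # skip padded positions
            total += 1
            correct += int(label == pred)
    return correct / total if total else 0.0


# Toy example: three real tokens followed by three padded positions.
features = [[7, 3, 9, 0, 0, 0]]
labels = [[2, 0, 1, 0, 0, 0]]
predictions = [[2, 0, 0, 0, 0, 0]]

plain = sum(l == p for l, p in zip(labels[0], predictions[0])) / len(labels[0])
masked = masked_accuracy(features, labels, predictions)
print(plain, masked)
```

In this toy case the plain accuracy is 5/6 while the masked accuracy is 2/3, because the three padded positions are trivially correct.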

### Model inference / Making predictions

Now our model is ready to make predictions. We are going to create a function called `predict_entities` which makes a prediction for each word in a sentence and returns its entity label.

In [22]:
sent = [t for t, _ in train[0]]
labels = [l for _, l in train[0]]
sent, labels

(['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
 ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'])
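
The labels above follow the BIO scheme seen in the data: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. A hedged sketch of how such a tag sequence could be grouped into entity spans is shown below. The helper `bio_to_spans` is hypothetical, is not part of this notebook, and leniently treats a stray `I-` tag as the start of a new entity.

```python
def bio_to_spans(tokens, tags):
    """Group BIO tags into a list of (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        # An I- tag of the same type extends the entity that is currently open.
        if prefix == "I" and current_type == entity_type:
            current_tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            current_type, current_tokens = entity_type, [token]
        else:
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
tags = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
print(bio_to_spans(tokens, tags))
# [('ORG', 'EU'), ('MISC', 'German'), ('MISC', 'British')]
```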

In [23]:
def predict_entities(sent):
  tokenized = sent.lower().split(" ")
  # Words outside the vocabulary fall back to <unk> instead of raising a KeyError.
  tokens = [word2Idx.get(token, word2Idx["<unk>"]) for token in tokenized]
  tokens_padded = keras.preprocessing.sequence.pad_sequences(
      [tokens], maxlen=100, padding='post', truncating='post')

  predictions = model(tokens_padded)
  # Take the most likely label for each token and drop the padded positions.
  predictions = tf.argmax(predictions, axis=-1)[0][:len(tokens)].numpy()
  predicted_labels = [idx2Label[i] for i in predictions]
  return tokenized, predicted_labels

predict_entities(" ".join(sent))

(['eu', 'rejects', 'german', 'call', 'to', 'boycott', 'british', 'lamb', '.'],
 ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'])

### Making more predictions


In [24]:
for idx in range(10):
  print("*"*50)
  sent = [t for t, _ in test[idx]]
  labels = [l for _, l in test[idx]]
  tokenized, preds = predict_entities(" ".join(sent))
  print("sentence: ", sent)
  print("actual labels: ", labels)
  print("predicted labels: ", preds)
  print("*"*50)
  print()

**************************************************
sentence:  ['\t-DOCSTART-']
actual labels:  ['O']
predicted labels:  ['O']
**************************************************

**************************************************
sentence:  ['SOCCER', '-', 'JAPAN', 'GET', 'LUCKY', 'WIN', ',', 'CHINA', 'IN', 'SURPRISE', 'DEFEAT', '.']
actual labels:  ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-PER', 'O', 'O', 'O', 'O']
predicted labels:  ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'B-LOC', 'O', 'O', 'O', 'O']
**************************************************

**************************************************
sentence:  ['Nadim', 'Ladki']
actual labels:  ['B-PER', 'I-PER']
predicted labels:  ['B-PER', 'O']
**************************************************

**************************************************
sentence:  ['AL-AIN', ',', 'United', 'Arab', 'Emirates', '1996-12-06']
actual labels:  ['B-LOC', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O']
predicted labels:  ['B-LOC', 'O', 'B-LOC', 'I-LOC', '