# Named Entity Recognition (NER)




<a name="0"></a>
# Introduction

Named entity recognition (NER) is a subtask of information extraction that locates and classifies named entities in text. Named entities can be organizations, persons, locations, times, and so on.

Any token that is not part of a named entity is labeled `O`.
In this notebook, you will train a named entity recognition system in a short run (about 30 seconds for 100 steps in the logs shown here), which reaches roughly 94% validation accuracy. You will then reload the saved model weights and evaluate them on the test set, where the recorded run scores about 93.5%. Finally, you will test the system on a sentence of your own.

In [1]:
!pip install trax

Collecting trax
  Downloading trax-1.4.1-py2.py3-none-any.whl.metadata (1.7 kB)
Collecting funcsigs (from trax)
  Downloading funcsigs-1.0.2-py2.py3-none-any.whl.metadata (14 kB)
Collecting tensorflow-text (from trax)
  Downloading tensorflow_text-2.17.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.8 kB)
Downloading trax-1.4.1-py2.py3-none-any.whl (637 kB)
Downloading funcsigs-1.0.2-py2.py3-none-any.whl (17 kB)
Downloading tensorflow_text-2.17.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.2 MB)
Installing collected packages: funcsigs, tensorflow-text, trax
Successfully installed funcsigs-1.0.2 tensorflow-text-2.17.0 trax-1.4.1


Create a directory **colab_data** in your Google Drive.

Copy the contents of the compressed file **Lab2_NER.rar** to this folder.

In [3]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [8]:
# Add the path to 'utils.py' to your Python path
import sys
sys.path.append('/content/drive/My Drive/colab_data') # Replace with your actual path

In [9]:
import trax
from trax import layers as tl
import os
import numpy as np
import pandas as pd

from utils import get_params, get_vocab
import random as rnd

# set random seeds
rnd.seed(33)
np.random.seed(33)

<a name="1"></a>
# Part 1:  Exploring the data

We will be using a dataset from Kaggle. The original data consists of four columns: the sentence number, the word, the part of speech of the word, and the tag. A few tags you might expect to see are:

* geo: geographical entity
* org: organization
* per: person
* gpe: geopolitical entity
* tim: time indicator
* art: artifact
* eve: event
* nat: natural phenomenon
* O: filler word


In [5]:
# display original kaggle data
data = pd.read_csv("/content/drive/My Drive/colab_data/NER_Dataset.csv", encoding="ISO-8859-1")
# read only the first line of each file; `with` closes the files afterwards
with open('/content/drive/My Drive/colab_data/data/small/train/sentences.txt', 'r') as f:
    train_sents = f.readline()
with open('/content/drive/My Drive/colab_data/data/small/train/labels.txt', 'r') as f:
    train_labels = f.readline()
print('SENTENCE:', train_sents)
print('SENTENCE LABEL:', train_labels)
print('ORIGINAL DATA:\n', data.head(5))
# free the memory; the preprocessed files are loaded in the next part
del data, train_sents, train_labels

SENTENCE: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

SENTENCE LABEL: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O

ORIGINAL DATA:
        Sentence_ID                                               Word  \
0      Sentence: 1  ['Thousands', 'of', 'demonstrators', 'have', '...   
1     Sentence: 10  ['Iranian', 'officials', 'say', 'they', 'expec...   
2    Sentence: 100  ['Helicopter', 'gunships', 'Saturday', 'pounde...   
3   Sentence: 1000  ['They', 'left', 'after', 'a', 'tense', 'hour-...   
4  Sentence: 10000  ['U.N.', 'relief', 'coordinator', 'Jan', 'Egel...   

                                                 POS  \
0  ['NNS', 'IN', 'NNS', 'VBP', 'VBN', 'IN', 'NNP'...   
1  ['JJ', 'NNS', 'VBP', 'PRP', 'VBP', 'TO', 'VB',...   
2  ['NN', 'NNS', 'NNP', 'VBD', 'JJ', 'NNS', 'IN',...   
3  ['PRP', 'VBD', 'IN', 'DT', 'NN', 'JJ', 'NN', '...   
4  ['NNP', 'NN', 'NN', 'NNP', '

<a name="1.1"></a>
## 1.1  Importing the Data

In this part, we will import the preprocessed data and explore it.

In [10]:
vocab, tag_map = get_vocab('/content/drive/My Drive/colab_data/data/large/words.txt', '/content/drive/My Drive/colab_data/data/large/tags.txt')
t_sentences, t_labels, t_size = get_params(vocab, tag_map, '/content/drive/My Drive/colab_data/data/large/train/sentences.txt', '/content/drive/My Drive/colab_data/data/large/train/labels.txt')
v_sentences, v_labels, v_size = get_params(vocab, tag_map, '/content/drive/My Drive/colab_data/data/large/val/sentences.txt', '/content/drive/My Drive/colab_data/data/large/val/labels.txt')
test_sentences, test_labels, test_size = get_params(vocab, tag_map, '/content/drive/My Drive/colab_data/data/large/test/sentences.txt', '/content/drive/My Drive/colab_data/data/large/test/labels.txt')

`vocab` is a dictionary that translates a word string to a unique number. Given a sentence, you can represent it as an array of numbers translating with this dictionary. The dictionary contains a `<PAD>` token.

When training an LSTM using batches, all your input sentences must be the same size. To accomplish this, you set the length of your sentences to a certain number and add the generic `<PAD>` token to fill all the empty spaces.
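
To make the padding step concrete, here is a minimal NumPy sketch. The helper `pad_batch` and the pad id `99` are hypothetical and used only for illustration; the notebook itself pads inside `data_generator` with `vocab['<PAD>']`.

```python
import numpy as np

def pad_batch(sentences, pad_id):
    """Right-pad integer-encoded sentences to the length of the longest one."""
    max_len = max(len(s) for s in sentences)
    # start from an array filled with the pad id, then copy each sentence in
    batch = np.full((len(sentences), max_len), pad_id, dtype=np.int64)
    for i, s in enumerate(sentences):
        batch[i, :len(s)] = s
    return batch

toy_pad_id = 99  # hypothetical pad id, not the one used by the notebook
toy_batch = pad_batch([[4, 8, 15], [16], [23, 42]], toy_pad_id)
print(toy_batch)
```

Every row ends up with the same length, so the batch can be stored in a single rectangular array.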

In [11]:
# vocab translates from a word to a unique number
print('vocab["the"]:', vocab["the"])
# Pad token
print('padded token:', vocab['<PAD>'])

vocab["the"]: 9
padded token: 35180


`tag_map` maps each possible tag to a unique integer. Run the cell below to see the classes you will be predicting. The prefixes in the tags mean:
* I: Token is inside an entity.
* B: Token begins an entity.
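
As an illustration of how the `B-` and `I-` prefixes combine, the sketch below groups tagged tokens into entity spans. The function `bio_to_spans` is a hypothetical helper, not part of `utils.py`; the toy tokens and tags are taken from the prediction output at the end of this notebook.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a B- tag always starts a new entity; close any open one first
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # an I- tag of the same type continues the open entity
            current_tokens.append(token)
        else:
            # 'O' (or a stray I- tag, which this sketch ignores) ends the entity
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans

tokens = ["the", "White", "House", "said", "on", "Sunday", "morning"]
tags = ["O", "B-org", "I-org", "O", "O", "B-tim", "I-tim"]
spans = bio_to_spans(tokens, tags)
print(spans)
```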

In [12]:
print(tag_map)

{'O': 0, 'B-geo': 1, 'B-gpe': 2, 'B-per': 3, 'I-geo': 4, 'B-org': 5, 'I-org': 6, 'B-tim': 7, 'B-art': 8, 'I-art': 9, 'I-per': 10, 'I-gpe': 11, 'I-tim': 12, 'B-nat': 13, 'B-eve': 14, 'I-eve': 15, 'I-nat': 16}


In [13]:
# Exploring information about the data
print('The number of outputs is tag_map', len(tag_map))
# The number of vocabulary tokens (including <PAD>)
g_vocab_size = len(vocab)
print(f"Num of vocabulary words: {g_vocab_size}")
print('The vocab size is', len(vocab))
print('The training size is', t_size)
print('The validation size is', v_size)

The number of outputs is tag_map 17
Num of vocabulary words: 35181
The vocab size is 35181
The training size is 33570
The validation size is 7194


As you can see, each sentence has already been encoded as a tensor by converting every word into its integer id. There are 17 possible classes, as shown in the tag map.


<a name="1.2"></a>
## 1.2  Data generator

In Python, a generator is a function that behaves like an iterator: each call to `next()` resumes the function and returns the next item it yields. Here is a [link](https://wiki.python.org/moin/Generators) to review Python generators.

In many AI applications it is very useful to have a data generator. We will now implement a data generator for our NER application.
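
Before the full implementation, here is a small sketch of the core idea: an endless generator that yields fixed-size batches and wraps around when it reaches the end of the data. The name `cycle_batches` is hypothetical and the sketch omits padding and shuffling.

```python
import itertools

def cycle_batches(items, batch_size):
    """Yield fixed-size batches forever, wrapping around at the end of items."""
    index = 0
    while True:
        batch = []
        for _ in range(batch_size):
            if index >= len(items):
                index = 0  # start over once every item has been used
            batch.append(items[index])
            index += 1
        yield batch

gen = cycle_batches(["a", "b", "c"], batch_size=2)
# islice takes a finite number of batches from the infinite generator
first_three = list(itertools.islice(gen, 3))
print(first_three)
```

Because the generator never terminates, the consumer decides how many batches to draw.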



In [14]:
def data_generator(batch_size, x, y, pad, shuffle=False, verbose=False):
    '''
      Input:
        batch_size - integer describing the batch size
        x - list containing sentences where words are represented as integers
        y - list containing tags associated with the sentences
        shuffle - Shuffle the data order
        pad - an integer representing a pad character
        verbose - Print information during runtime
      Output:
        a tuple containing 2 elements:
        X - np.ndarray of dim (batch_size, max_len) of padded sentences
        Y - np.ndarray of dim (batch_size, max_len) of tags associated with the sentences in X
    '''

    # count the number of examples in x
    num_lines = len(x)

    # create a list with the indexes of the examples so that it can be shuffled
    lines_index = [*range(num_lines)]

    # shuffle the indexes if shuffle is set to True
    if shuffle:
        rnd.shuffle(lines_index)

    index = 0 # tracks current location in x, y
    while True:
        buffer_x = [0] * batch_size # Temporary list to store the raw x data for this batch
        buffer_y = [0] * batch_size # Temporary list to store the raw y data for this batch

        max_len = 0
        for i in range(batch_size):
             # if the index is greater than or equal to the number of lines in x
            if index >= num_lines:
                # then reset the index to 0
                index = 0
                # re-shuffle the indexes if shuffle is set to True
                if shuffle:
                    rnd.shuffle(lines_index)

            # The current position is obtained using `lines_index[index]`
            # Store the x value at the current position into the buffer_x
            buffer_x[i] = x[lines_index[index]]

            # Store the y value at the current position into the buffer_y
            buffer_y[i] = y[lines_index[index]]

            lenx = len(x[lines_index[index]])    #length of current x[]
            if lenx > max_len:
                max_len = lenx                   #max_len tracks longest x[]

            # increment index by one
            index += 1


        # create X,Y, NumPy arrays of size (batch_size, max_len) 'full' of pad value
        X = np.full((batch_size, max_len), pad)
        Y = np.full((batch_size, max_len), pad)

        # copy values from lists to NumPy arrays. Use the buffered values
        for i in range(batch_size):
            # get the example (sentence as a tensor)
            # in `buffer_x` at the `i` index
            x_i = buffer_x[i]

            # similarly, get the example's labels
            # in `buffer_y` at the `i` index
            y_i = buffer_y[i]

            # Walk through each word in x_i
            for j in range(len(x_i)):
                # store the word in x_i at position j into X
                X[i, j] = x_i[j]

                # store the label in y_i at position j into Y
                Y[i, j] = y_i[j]

        if verbose: print("index=", index)
        yield((X,Y))

In [15]:
batch_size = 5
mini_sentences = t_sentences[0: 8]
mini_labels = t_labels[0: 8]
dg = data_generator(batch_size, mini_sentences, mini_labels, vocab["<PAD>"], shuffle=False, verbose=True)
X1, Y1 = next(dg)
X2, Y2 = next(dg)
print(Y1.shape, X1.shape, Y2.shape, X2.shape)
print(X1[0][:], "\n", Y1[0][:])

index= 5
index= 2
(5, 30) (5, 30) (5, 30) (5, 30)
[    0     1     2     3     4     5     6     7     8     9    10    11
    12    13    14     9    15     1    16    17    18    19    20    21
 35180 35180 35180 35180 35180 35180] 
 [    0     0     0     0     0     0     1     0     0     0     0     0
     1     0     0     0     0     0     2     0     0     0     0     0
 35180 35180 35180 35180 35180 35180]


<a name="2"></a>
# Part 2:  Building the model

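We will now implement the model using Trax (the `trax` package installed above), Google's deep learning library that runs on JAX. The model stacks an embedding layer, an LSTM, a dense layer with one unit per tag, and a log-softmax.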

In [16]:
def NER(vocab_size=35181, d_model=50, tags=tag_map):
    '''
      Input:
        vocab_size - integer containing the size of the vocabulary
        d_model - integer describing the embedding size
        tags - dictionary mapping each tag to an integer; its length sets the output size
      Output:
        model - a trax serial model
    '''
    model = tl.Serial(
      tl.Embedding(vocab_size, d_model), # Embedding layer
      tl.LSTM(d_model), # LSTM layer
      tl.Dense(len(tags)), # Dense layer with len(tags) units
      tl.LogSoftmax()  # LogSoftmax layer
      )
    return model

In [17]:
# initializing your model
model = NER()
# display your model
print(model)

Serial[
  Embedding_35181_50
  LSTM_50
  Dense_17
  LogSoftmax
]
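
The last layer turns the dense layer's scores into log-probabilities over the tags. The NumPy sketch below shows the standard log-softmax computation on a toy array; it illustrates the math only and is not Trax's implementation.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - np.max(logits, axis=axis, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=axis, keepdims=True))

# toy scores with shape (1 sentence, 2 words, 3 tags)
toy_logits = np.array([[[2.0, 0.5, -1.0],
                        [0.1, 0.1, 0.1]]])
toy_log_probs = log_softmax(toy_logits)
toy_probs = np.exp(toy_log_probs)
print(toy_log_probs.shape)
print(toy_probs.sum(axis=-1))  # each word's tag probabilities sum to 1
```

Log-softmax does not change which tag has the highest score, so taking the `argmax` of the output gives the same tag as taking the `argmax` of the raw scores.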


<a name="3"></a>
# Part 3:  Train the Model

This section will train our model.

Before we start, we need to create the data generators for training and validation data. It is important that we mask padding in the loss weights of our data, which can be done using the `id_to_mask` argument of `trax.data.inputs.add_loss_weights`.

In [18]:
rnd.seed(33)

batch_size = 64

# Create training data, mask pad id=35180 for training.
train_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, t_sentences, t_labels, vocab['<PAD>'], True),
    id_to_mask=vocab['<PAD>'])

# Create validation data, mask pad id=35180 for evaluation.
eval_generator = trax.data.inputs.add_loss_weights(
    data_generator(batch_size, v_sentences, v_labels, vocab['<PAD>'], True),
    id_to_mask=vocab['<PAD>'])
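
Conceptually, masking means that every position whose target equals the pad id receives a loss weight of 0, so it does not contribute to the average loss. The NumPy sketch below illustrates this idea with made-up per-token losses; it is an assumption-level illustration and not the code inside `add_loss_weights`.

```python
import numpy as np

toy_pad = 35180  # pad id printed earlier in this notebook
toy_targets = np.array([[0, 1, 0, toy_pad, toy_pad],
                        [2, 0, 0, 0, toy_pad]])

# weight 1.0 for real tokens, 0.0 for padding
toy_weights = (toy_targets != toy_pad).astype(np.float32)

# hypothetical per-token losses; the values at pad positions are arbitrary
toy_token_loss = np.array([[0.2, 1.0, 0.1, 5.0, 5.0],
                           [0.4, 0.3, 0.2, 0.1, 5.0]])

# weighted mean: pad positions are multiplied by 0 and drop out
masked_mean = (toy_token_loss * toy_weights).sum() / toy_weights.sum()
print(toy_weights)
print(masked_mean)
```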

<a name='3.1'></a>
## 3.1  Training the model

We will now write a function that takes in your model and trains it.

In [19]:
def train_model(NER, train_generator, eval_generator, train_steps=1, output_dir='model'):
    '''
    Input:
        NER - the model you are building
        train_generator - The data generator for training examples
        eval_generator - The data generator for validation examples,
        train_steps - number of training steps
        output_dir - folder to save your model
    Output:
        training_loop - a trax supervised training Loop
    '''
    train_task = training.TrainTask(
      train_generator, # A train data generator
      loss_layer = tl.CrossEntropyLoss(), # A cross-entropy loss function
      optimizer = trax.optimizers.Adam(0.01),  # The adam optimizer
    )

    eval_task = training.EvalTask(
      labeled_data = eval_generator, # A labeled data generator
      metrics = [tl.CrossEntropyLoss(), tl.Accuracy()], # Evaluate with cross-entropy loss and accuracy
      n_eval_batches = 10  # Number of batches to use on each evaluation
    )

    training_loop = training.Loop(
        NER, # A model to train
        train_task, # A train task
        eval_tasks=[eval_task],
        output_dir = output_dir) # The output directory

    # Train with train_steps
    training_loop.run(n_steps = train_steps)

    return training_loop

In [20]:
from trax.supervised import training

train_steps = 100
!rm -f 'model/model.pkl.gz'  # Remove old model.pkl if it exists

# Train the model
training_loop = train_model(NER(), train_generator, eval_generator, train_steps)

Step      1: Total number of trainable weights: 1780117
Step      1: Ran 1 train steps in 2.50 secs
Step      1: train CrossEntropyLoss |  4.04632664
Step      1: eval  CrossEntropyLoss |  2.90042813
Step      1: eval          Accuracy |  0.01860118

Step    100: Ran 99 train steps in 28.46 secs
Step    100: train CrossEntropyLoss |  0.55899328
Step    100: eval  CrossEntropyLoss |  0.25666570
Step    100: eval          Accuracy |  0.93673508


In [21]:
# create a fresh model and initialize it with a dummy input signature
model = NER()
model.init(trax.shapes.ShapeDtype((1, 1), dtype=np.int32))

# Load the weights saved by the training loop above
model.init_from_file('model/model.pkl.gz', weights_only=True)

((array([[ 0.00626876, -0.1672094 ,  0.04730672, ...,  0.02100409,
           0.09326503, -0.00318395],
         [-0.26170006, -0.12213242, -0.18048875, ...,  0.11632765,
           0.26826692, -0.00404759],
         [-0.11117691, -0.26779622, -0.22080895, ..., -0.04907783,
           0.18515159, -0.11594632],
         ...,
         [-0.19272566,  0.0865287 , -0.16018522, ...,  0.08917122,
          -0.03077034, -0.0886739 ],
         [-0.02528609,  0.11262495, -0.1404779 , ..., -0.06518547,
          -0.07217853, -0.15837154],
         [ 0.09185313, -0.01502389,  0.18619727, ...,  0.12835549,
          -0.02299821,  0.02762324]], dtype=float32),
  (((), ((), ())),
   ((array([[ 0.23717202,  0.25847733,  0.14883976, ..., -0.4172051 ,
              0.5753839 , -0.03026068],
            [-0.17705357, -0.12066317, -0.01919067, ...,  0.42146364,
              0.2907704 ,  0.40396497],
            [-0.3361445 , -0.1928285 , -0.3859951 , ...,  0.3743336 ,
             -0.42620894, -0.1325455

<a name="4"></a>
# Part 4:  Compute Accuracy

We will now evaluate on the test set. So far we have only seen the loss on the training set and the loss and accuracy on the validation (`eval`) set. For a fair measurement, we need a mask so that padding tokens are not counted when computing the accuracy.




<details>    
<summary>
    <font size="3" color="darkgreen"><b>More Detailed Instructions </b></font>
</summary>

* *Step 1*: `model(sentences)` will give you the predicted output.

* *Step 2*: Prediction will produce an output with an added dimension. For each sentence, for each word, there will be a vector of log-probabilities, one per tag type. For each (sentence, word) pair, you need to pick the tag with the maximum value. This will require `np.argmax` and careful use of the `axis` argument.
* *Step 3*: Create a mask to prevent counting pad characters. It has the same shape as the output of Step 2. Comparing an array against a scalar (for example `labels != pad`) yields a boolean array of that shape.
* *Step 4*: Compute the accuracy metric by comparing your outputs against your test labels. Take the sum of the matches and divide by the total number of **unpadded** tokens. Use your mask to exclude the padded tokens. Return the accuracy.
</details>
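
The mask and accuracy steps above can be sketched on a tiny example. The arrays below are made up for illustration; only the pad id is taken from this notebook.

```python
import numpy as np

toy_pad = 35180  # pad id printed earlier in this notebook
toy_labels = np.array([[0, 1, 0, toy_pad],
                       [2, 0, toy_pad, toy_pad]])
toy_outputs = np.array([[0, 1, 3, 0],
                        [2, 5, 0, 0]])

# element-wise comparison with a scalar gives a boolean array of the same shape
toy_mask = toy_labels != toy_pad

# a prediction counts only if it is correct AND sits on a real token
toy_correct = (toy_outputs == toy_labels) & toy_mask
toy_accuracy = toy_correct.sum() / toy_mask.sum()
print(toy_mask)
print(toy_accuracy)
```

Three of the five real tokens are predicted correctly here, so the masked accuracy is 0.6.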

In [22]:
# create the evaluation inputs
x, y = next(data_generator(len(test_sentences), test_sentences, test_labels, vocab['<PAD>']))
print("input shapes", x.shape, y.shape)

input shapes (7194, 70) (7194, 70)


In [23]:
# sample prediction
tmp_pred = model(x)
print(type(tmp_pred))
print(f"tmp_pred has shape: {tmp_pred.shape}")

<class 'jaxlib.xla_extension.ArrayImpl'>
tmp_pred has shape: (7194, 70, 17)


Note that the model's prediction has 3 axes:
- the number of examples
- the number of words in each example (padded to be as long as the longest sentence in the batch)
- the number of possible targets (the 17 named entity tags).
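
To go from these three axes to one tag per word, take the `argmax` over the last axis. A minimal NumPy sketch on a made-up array with 1 sentence, 3 words, and 4 tags:

```python
import numpy as np

# log-probabilities with shape (1 sentence, 3 words, 4 tags)
toy_pred = np.log(np.array([[[0.70, 0.10, 0.10, 0.10],
                             [0.10, 0.20, 0.60, 0.10],
                             [0.25, 0.25, 0.10, 0.40]]]))

# axis=2 is the tag axis, so the result has one tag id per (sentence, word)
toy_tags = np.argmax(toy_pred, axis=2)
print(toy_tags.shape)
print(toy_tags)
```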

In [24]:
def evaluate_prediction(pred, labels, pad):
    """
    Inputs:
        pred: prediction array with shape
            (num examples, max sentence length in batch, num of classes)
        labels: array of size (batch_size, seq_len)
        pad: integer representing pad character
    Outputs:
        accuracy: float
    """

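    ## step 1: pick the highest-scoring tag for every word ##
    outputs = np.argmax(pred, axis=2)
    print("outputs shape:", outputs.shape)

    ## step 2: mask is True for real tokens and False for padding ##
    mask = labels != pad
    print("mask shape:", mask.shape, "mask[0][20:30]:", mask[0][20:30])

    ## step 3: count correct predictions on unpadded tokens only ##
    accuracy = np.sum((outputs == labels) & mask) / float(np.sum(mask))
    return accuracy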


In [25]:
accuracy = evaluate_prediction(model(x), y, vocab['<PAD>'])
print("accuracy: ", accuracy)

outputs shape: (7194, 70)
mask shape: (7194, 70) mask[0][20:30]: [ True  True  True False False False False False False False]
accuracy:  0.93521255


<a name="5"></a>
# Part 5:  Testing with your own sentence


Below, we can test it out with our own sentence!

In [26]:
# Use this function to tag your own sentence.
def predict(sentence, model, vocab, tag_map):
    # map each space-separated token to its id; unknown words map to 'UNK'
    s = [vocab[token] if token in vocab else vocab['UNK'] for token in sentence.split(' ')]
    # shape (1, len(s)): a batch that contains a single sentence
    batch_data = np.array([s]).astype(int)
    output = model(batch_data)
    # pick the highest-scoring tag for every word
    outputs = np.argmax(output, axis=2)
    # tag_map ids follow insertion order, so the list position equals the tag id
    labels = list(tag_map.keys())
    pred = [labels[int(idx)] for idx in outputs[0]]
    return pred

In [27]:
# Try the output for the introduction example
#sentence = "Many French citizens are going to visit Morocco for summer"
#sentence = "Sharon Floyd flew to Miami last Friday"

# New York Times news:
sentence = "Peter Navarro, the White House director of trade and manufacturing policy of U.S, said in an interview on Sunday morning that the White House was working to prepare for the possibility of a second wave of the coronavirus in the fall, though he said it wouldn’t necessarily come"
predictions = predict(sentence, model, vocab, tag_map)
# print only the words that were tagged as part of an entity
for word, tag in zip(sentence.split(' '), predictions):
    if tag != 'O':
        print(word, tag)

Peter B-per
White B-org
House I-org
Sunday B-tim
morning I-tim
White B-org
House I-org
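
Note that `predict` splits on single spaces, so a token such as `Navarro,` is looked up with its comma attached. The training sentence shown in Part 1 keeps the final period as a separate token, which suggests the data treats punctuation as its own token; this mismatch may be one reason `Navarro,` is not tagged. The sketch below is a hypothetical pre-processing helper (`split_punctuation` is not part of `utils.py`) that separates a few common punctuation marks before tagging.

```python
import re

def split_punctuation(text):
    """Put spaces around , ; : ! ? and split on whitespace."""
    return re.sub(r"([,;:!?])", r" \1 ", text).split()

toy_tokens = split_punctuation("Peter Navarro, the White House director")
print(toy_tokens)
```

One way to use it, under the same assumption, is to call `predict(" ".join(split_punctuation(sentence)), model, vocab, tag_map)` and zip the predictions with `split_punctuation(sentence)`.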
