<a name="0"></a>
## Introduction

We first start by defining named entity recognition (NER). NER is a subtask of information extraction that locates and classifies named entities in a text. The named entities could be organizations, persons, locations, times, etc. 

For example:

<img src = 'images/ner.png' width="width" height="height" style="width:600px;height:150px;"/>

Is labeled as follows: 

- French: geopolitical entity
- Morocco: geographic entity 
- Christmas: time indicator

Everything else that is labeled with an `O` is not considered to be a named entity. In this assignment, we will train a named entity recognition system that could be trained in a few seconds (on a GPU) and will get around 75% accuracy. Then, we will load in the exact version of your model, which was trained for a longer period of time. We could then evaluate the trained version of our model to get 96% accuracy! Finally, we will be able to test our named entity recognition system with our own sentence.

In [1]:
import tensorflow as tf
import numpy as np
import os

2023-09-09 12:33:44.397673: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


<a name="1"></a>
## 1 - Exploring the Data

We will be using a dataset from Kaggle, which we will preprocess. The original data consists of four columns: the sentence number, the word, the part of speech of the word, and the tags.  A few tags we might expect to see are: 

* geo: geographical entity
* org: organization
* per: person 
* gpe: geopolitical entity
* tim: time indicator
* art: artifact
* eve: event
* nat: natural phenomenon
* O: filler word


In [2]:
def get_vocab(vocab_path, tags_path):
    vocab = {}
    with open(vocab_path, encoding="utf-8") as f:
        for i, l in enumerate(f.read().splitlines()):
            vocab[l] = i  # to avoid the 0
        # loading tags (we require this to map tags to their indices)
    vocab['<PAD>'] = len(vocab) # 35180
    tag_map = {}
    with open(tags_path) as f:
        for i, t in enumerate(f.read().splitlines()):
            tag_map[t] = i 
    
    return vocab, tag_map

In [3]:
def get_params(vocab, tag_map, sentences_file, labels_file):
    sentences = []
    labels = []

    with open(sentences_file, encoding="utf-8") as f:
        for sentence in f.read().splitlines():
            # replace each token by its index if it is in vocab
            # else use index of UNK_WORD
            s = [vocab[token] if token in vocab 
                 else vocab['UNK']
                 for token in sentence.split(' ')]
            sentences.append(s)

    with open(labels_file) as f:
        for sentence in f.read().splitlines():
            # replace each label by its index
            l = [tag_map[label] for label in sentence.split(' ')] # I added plus 1 here
            labels.append(l) 
    return sentences, labels, len(sentences)

In [4]:
vocab, tag_map = get_vocab('data/large/words.txt', 'data/large/tags.txt')
t_sentences, t_labels, t_size = get_params(vocab, tag_map, 'data/large/train/sentences.txt', 'data/large/train/labels.txt')
v_sentences, v_labels, v_size = get_params(vocab, tag_map, 'data/large/val/sentences.txt', 'data/large/val/labels.txt')
test_sentences, test_labels, test_size = get_params(vocab, tag_map, 'data/large/test/sentences.txt', 'data/large/test/labels.txt')

In [5]:
# vocab translates from a word to a unique number
print('vocab["the"]:', vocab["the"])
# Pad token
print('padded token:', vocab['<PAD>'])

vocab["the"]: 9
padded token: 35180


`vocab` is a dictionary that translates a word string to a unique number. Given a sentence, we can represent it as an array of numbers translating with this dictionary. The dictionary contains a `<PAD>` token. 

When training an LSTM using batches, all our input sentences must be the same size. To accomplish this, we set the length of our sentences to a certain number and add the generic `<PAD>` token to fill all the empty spaces. 

The `tag_map` is a dictionary that maps the tags that we could have to numbers. Run the cell below to see the possible classes we will be predicting. The prepositions in the tags mean:
* I: Token is inside an entity.
* B: Token begins an entity.

In [6]:
tag_map

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'B-art': 8,
 'I-art': 9,
 'I-per': 10,
 'I-gpe': 11,
 'I-tim': 12,
 'B-nat': 13,
 'B-eve': 14,
 'I-eve': 15,
 'I-nat': 16}

If we had the sentence 

**"Sharon flew to Miami on Friday"**

The tags would look like:

```
Sharon B-per
flew   O
to     O
Miami  B-geo
on     O
Friday B-tim
```

where we would have three tokens beginning with B-, since there are no multi-token entities in the sequence. But if we added Sharon's last name to the sentence:

**"Sharon Floyd flew to Miami on Friday"**

```
Sharon B-per
Floyd  I-per
flew   O
to     O
Miami  B-geo
on     O
Friday B-tim
```

our tags would change to show first "Sharon" as B-per, and "Floyd" as I-per, where I- indicates an inner token in a multi-token sequence.

In [7]:
# Exploring information about the data
print('The number of outputs is tag_map', len(tag_map))
# The number of vocabulary tokens (including <PAD>)
g_vocab_size = len(vocab)
print(f"Num of vocabulary words: {g_vocab_size}")
print('The training size is', t_size)
print('The validation size is', v_size)
print('An example of the first sentence is', t_sentences[0])
print('An example of its corresponding label is', t_labels[0])

The number of outputs is tag_map 17
Num of vocabulary words: 35181
The training size is 33570
The validation size is 7194
An example of the first sentence is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 9, 15, 1, 16, 17, 18, 19, 20, 21]
An example of its corresponding label is [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0]


### 1.2 - Data Generator

In [8]:
def stack_tensor(lines, labels, pad = vocab["<PAD>"]):
    max_len = 0
    for line in lines:
        if len(line) > max_len:
            max_len = len(line)
    
    stack_tensor_lines = []
    for line in lines:
        line += [pad] * (max_len - len(line))
        stack_tensor_lines.append(line)
        
    stack_tensor_labels = []
    for label_lines in labels:
        label_lines += [0] * (max_len - len(label_lines))
        stack_tensor_labels.append(label_lines)
    
    return (np.array(stack_tensor_lines), np.array(stack_tensor_labels))

In [9]:
train_lines, train_labels = stack_tensor(t_sentences, t_labels)
eval_lines, eval_labels = stack_tensor(v_sentences, v_labels)
test_lines, test_labels = stack_tensor(test_sentences, test_labels)
train_lines

array([[    0,     1,     2, ..., 35180, 35180, 35180],
       [   22,     1,    23, ..., 35180, 35180, 35180],
       [   42,     4,    18, ..., 35180, 35180, 35180],
       ...,
       [29838, 29839,  6586, ..., 35180, 35180, 35180],
       [ 1001, 29840, 29841, ..., 35180, 35180, 35180],
       [ 3175,   502,  2543, ..., 35180, 35180, 35180]])

In [10]:
train_lines = tf.data.Dataset.from_tensor_slices(train_lines)
eval_lines = tf.data.Dataset.from_tensor_slices(eval_lines)
test_lines = tf.data.Dataset.from_tensor_slices(test_lines)

2023-09-09 12:33:46.812164: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-09 12:33:46.812295: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-09-09 12:33:46.883632: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1960] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the 

In [11]:
train_lines

<_TensorSliceDataset element_spec=TensorSpec(shape=(104,), dtype=tf.int64, name=None)>

In [12]:
train_targets = tf.data.Dataset.from_tensor_slices(train_labels)
eval_targets = tf.data.Dataset.from_tensor_slices(eval_labels)
test_targets = tf.data.Dataset.from_tensor_slices(test_labels)

In [13]:
train_targets

<_TensorSliceDataset element_spec=TensorSpec(shape=(104,), dtype=tf.int64, name=None)>

In [14]:
train_dataset = tf.data.Dataset.zip((train_lines, train_targets))
eval_dataset = tf.data.Dataset.zip((eval_lines, eval_targets))
test_dataset = tf.data.Dataset.zip((test_lines, test_targets))

In [15]:
batch_size = 64
buffer_size = 10000

train_dataset = train_dataset.shuffle(buffer_size).batch(batch_size, drop_remainder = True)
eval_dataset = eval_dataset.shuffle(buffer_size).batch(batch_size, drop_remainder = True)
test_dataset = test_dataset.shuffle(buffer_size).batch(batch_size, drop_remainder = True)

## 2 - Building the Model

We will now implement the model that will be able to determining the tags of sentences like the following:
<table>
    <tr>
        <td>
<img src = 'images/ner1.png' width="width" height="height" style="width:500px;height:150px;"/>
        </td>
    </tr>
</table>

The model architecture will be as follows: 

<img src = 'images/ner2.png' width="width" height="height" style="width:600px;height:250px;"/>


Concretely, our inputs will be sentences represented as tensors that are fed to a model with:

* An Embedding layer,
* A LSTM layer
* A Dense layer
* A log softmax layer.

In [16]:
vocab_size = len(vocab)
embedding_size = 50

def build_model(tags, vocab_size = vocab_size, embedding_size = embedding_size):
    '''
    Input:
        tags - dictionary that maps the tags to the numbers
        vocab_size - integer containing the size of the vocabulary
        embedding_size - integer describing the embedding size
    Output:
        model - a sequential model
    '''
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_size),
        tf.keras.layers.Masking(mask_value=vocab["<PAD>"]),
        tf.keras.layers.LSTM(units=embedding_size,
                             return_sequences=True,
                             recurrent_initializer="glorot_uniform"),
        tf.keras.layers.Dense(len(tags), activation="softmax")
    ])
    
    return model

In [17]:
model = build_model(tag_map, vocab_size, embedding_size)
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 50)          1759050   
                                                                 
 masking (Masking)           (None, None, 50)          0         
                                                                 
 lstm (LSTM)                 (None, None, 50)          20200     
                                                                 
 dense (Dense)               (None, None, 17)          867       
                                                                 
Total params: 1780117 (6.79 MB)
Trainable params: 1780117 (6.79 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [18]:
def loss(y_true, y_pred):
    return tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred, from_logits=True)

In [19]:
model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])

## 3 - Train the Model 

This section will train our model.

In [20]:
checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)

EPOCHS = 15

history = model.fit(train_dataset,
                    validation_data=eval_dataset,
                    epochs=EPOCHS, 
                    callbacks=[checkpoint_callback, tf.keras.callbacks.EarlyStopping(patience=3)])

Epoch 1/15


  output, from_logits = _get_logits(


Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15


## 4 - Evaluation

In [21]:
loss, acc = model.evaluate(eval_dataset, batch_size=1)
print("Loss on test data: ", loss)
print("Accuracy on test data: ", acc)

Loss on test data:  0.04541099816560745
Accuracy on test data:  0.9869377613067627


## 5 - Testing with our own Sentence

In [22]:
# This is the function we will be using to test our own sentence.
def predict(sentence, model, vocab, tag_map):
    s = [vocab[token] if token in vocab else vocab['UNK'] for token in sentence.split(' ')]
    batch_data = np.ones((1, len(s)))
    batch_data[0][:] = s
    sentence = np.array(batch_data).astype(int)
    output = model(sentence)
    outputs = np.argmax(output, axis=2)
    labels = list(tag_map.keys())
    pred = []
    for i in range(len(outputs[0])):
        idx = outputs[0][i] 
        pred_label = labels[idx]
        pred.append(pred_label)
    return pred

In [23]:
# Try the output for the introduction example
#sentence = "Many French citizens are goin to visit Morocco for summer"
#sentence = "Sharon Floyd flew to Miami last Friday"

# New york times news:
sentence = "Peter Navarro, the White House director of trade and manufacturing policy of U.S, said in an interview on Sunday morning that the White House was working to prepare for the possibility of a second wave of the coronavirus in the fall, though he said it wouldn’t necessarily come"
s = [vocab[token] if token in vocab else vocab['UNK'] for token in sentence.split(' ')]
predictions = predict(sentence, model, vocab, tag_map)
for x,y in zip(sentence.split(' '), predictions):
    if y != 'O':
        print(x,y)

Peter B-per
Navarro, I-per
White B-org
House I-org
Sunday B-tim
morning I-tim
White B-org
House I-org
coronavirus B-org
fall, I-geo


In [24]:
sentence = "My name is Bob Robinson, I'm from Viet Nam and now I studying at Standford University. In Thursday I will have the first class at this University"
s = [vocab[token] if token in vocab else vocab['UNK'] for token in sentence.split(' ')]
predictions = predict(sentence, model, vocab, tag_map)
for x,y in zip(sentence.split(' '), predictions):
    if y != 'O':
        print(x,y)

Bob B-per
Robinson, I-per
I'm I-per
Viet B-geo
Nam I-per
University. I-per
Thursday B-tim
I I-tim
University B-org


In [25]:
sentence = "Tuesday the Manhattan New York city prosecutor unsealed a multi count indictment against China based limmt economic and trade company and Li Fang Wei one of the firm 's managers"
s = [vocab[token] if token in vocab else vocab['UNK'] for token in sentence.split(' ')]
predictions = predict(sentence, model, vocab, tag_map)
for x,y in zip(sentence.split(' '), predictions):
    if y != 'O':
        print(x,y)

Tuesday B-tim
Manhattan B-geo
New I-geo
York I-geo
China B-geo
Li B-per
Fang I-per
Wei I-per


In [26]:
sentence = "Sharon Floyd flew to Miami on Friday"
s = [vocab[token] if token in vocab else vocab['UNK'] for token in sentence.split(' ')]
predictions = predict(sentence, model, vocab, tag_map)
for x,y in zip(sentence.split(' '), predictions):
    if y != 'O':
        print(x,y)

Sharon B-per
Floyd I-per
Miami B-geo
Friday B-tim
