 <h2><b style="color:#30A2FF">Named Entity Recognition (NER) Model</b></h2>
1. Named Entity Recognition (NER) is one of the most widely used applications of Natural Language Processing.<br>
<b style="color:#0E2954">2. NER is a subtask of information extraction that locates and classifies named entities in text. Named entities can be organizations, persons, locations, times, etc.</b><br>
3. NER is useful for <b style="color:#3330E4">search engine efficiency, recommendation engines, etc.</b><br>
4. This code is divided into four parts:<br>
<em style="margin-left:25px";>a. Data Preprocessing</em><br>
<em style="margin-left:25px";>b. Building <b style="color:#3330E4"> Bidirectional LSTM model</b></em><br>
<em style="margin-left:25px";>c. Training and testing the model on the validation set</em><br>
<em style="margin-left:25px";>d. Predicting the tags of a new sentence</em><br>
5. I used an open-source dataset available on Kaggle.<br>
6. I used the open-source library <b>TensorFlow from Google</b> to design and build the model.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import pprint as pp
import random as rnd
from utils1 import *
from tensorflow.keras.preprocessing.sequence import pad_sequences

<h3><b style="color:#0E2954">Part-1. Data Pre-Processing</b></h3>
In this part, I preprocess the data by performing the following tasks:<br>
<em style="margin-left:25px";>a. Loading the data</em><br>
<em style="margin-left:25px";>b. Creating Vocabulary and tag-mapping from train data</em><br>
<em style="margin-left:25px";>c. Adding Padding to train_data</em><br>
<em style="margin-left:25px";>d. Creating Tensors for training and testing dataset</em><br>
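The helpers `get_vocab` and `get_params` used below come from `utils1`, which is not shown in this notebook. As a rough illustration of step b, the following sketch builds a word-to-index vocabulary and a tag-to-index map from a couple of toy sentences and encodes a new sentence as ids. The names `build_vocab`, `build_tag_map`, and `encode` are hypothetical, and the real helpers may behave differently.

```python
def build_vocab(sentences, specials=("UNK", "<PAD>")):
    """Map each distinct token to an integer id; special tokens are added last."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    for token in specials:
        vocab.setdefault(token, len(vocab))
    return vocab


def build_tag_map(label_lines):
    """Map each distinct tag to an integer id, with 'O' fixed at 0."""
    tag_map = {"O": 0}
    for line in label_lines:
        for tag in line.split():
            tag_map.setdefault(tag, len(tag_map))
    return tag_map


def encode(sentence, vocab):
    """Convert a sentence to ids, using the 'UNK' id for unseen tokens."""
    return [vocab.get(token, vocab["UNK"]) for token in sentence.split()]


toy_sentences = ["protest the war in Iraq", "the withdrawal of British troops"]
toy_labels = ["O O O O B-geo", "O O O B-gpe O"]

toy_vocab = build_vocab(toy_sentences)
toy_tag_map = build_tag_map(toy_labels)
print(encode("the war in London", toy_vocab))  # 'London' is unseen, so it maps to UNK
print(toy_tag_map)
```

Words that never appeared in the training text fall back to the `UNK` id, which is the same idea the `predict` function in Part 4 relies on.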

In [2]:
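# Read the first sentence and its label line; the context managers close the files
with open('data/small/train/sentences.txt', 'r') as f:
    train_sents = f.readline()
with open('data/small/train/labels.txt', 'r') as f:
    train_labels = f.readline()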
print('SENTENCE:', train_sents)
print('SENTENCE LABEL:', train_labels)

SENTENCE: Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .

SENTENCE LABEL: O O O O O O B-geo O O O O O B-geo O O O O O B-gpe O O O O O



In [4]:
vocab, tag_map = get_vocab('data/large/words.txt', 'data/large/tags.txt')
t_sentences, t_labels, t_size = get_params(
    vocab, tag_map, 'data/large/train/sentences.txt', 'data/large/train/labels.txt')
v_sentences, v_labels, v_size = get_params(
    vocab, tag_map, 'data/large/val/sentences.txt', 'data/large/val/labels.txt')
test_sentences, test_labels, test_size = get_params(
    vocab, tag_map, 'data/large/test/sentences.txt', 'data/large/test/labels.txt')

In [5]:
print('vocab["the"]:', vocab["the"])
# Pad token
print('padded token:', vocab['<PAD>'])

vocab["the"]: 9
padded token: 35180


<h3 style="color:#0E2954">Some tags and their semantics:</h3>
1. geo : geographical entity<br>
2. org : organization<br>
3. per : person<br>
4. gpe : geopolitical entity<br>
5. tim : time indicator<br>
6. art : artifact<br>
7. eve : event<br>
8. nat : natural phenomenon<br>
9. O : filler word (a token outside any named entity)<br><br>
<b style="color:#3330E4">The coding scheme that tags the entities is a minimal one: B- marks the first token of an entity (including single-token entities such as "London"), and I- marks each following token of the same multi-token entity.</b>
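To make the B-/I- scheme concrete, here is a small sketch that groups tagged tokens back into whole entities. The function name `bio_to_spans` and the example sentence are illustrative assumptions and are not part of the notebook's code.

```python
def bio_to_spans(tokens, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity
            current_tokens.append(token)
        else:
            # 'O', or an I- tag that does not continue the open entity, closes it
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = "Peter Navarro works at the White House in Washington".split()
tags = ["B-per", "I-per", "O", "O", "O", "B-org", "I-org", "O", "B-geo"]
print(bio_to_spans(tokens, tags))
# [('Peter Navarro', 'per'), ('White House', 'org'), ('Washington', 'geo')]
```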

In [6]:
pp.pprint(tag_map)
print(f'Number of labels available in tag_map {len(tag_map)}')

{'B-art': 8,
 'B-eve': 14,
 'B-geo': 1,
 'B-gpe': 2,
 'B-nat': 13,
 'B-org': 5,
 'B-per': 3,
 'B-tim': 7,
 'I-art': 9,
 'I-eve': 15,
 'I-geo': 4,
 'I-gpe': 11,
 'I-nat': 16,
 'I-org': 6,
 'I-per': 10,
 'I-tim': 12,
 'O': 0}
Number of labels available in tag_map 17


In [7]:
print('Exploring information from data\n')
print(f'The size of the training set is {t_size}')
print(f'The size of the validation set is {v_size}')
print(f'The size of the test set is {test_size}')
print(f'Example sentence from the training set: {t_sentences[2000]}')
print(f'Example tag line from the training set: {t_labels[2000]}')

Exploring information from data

The size of the training set is 33570
The size of the validation set is 7194
The size of the test set is 7194
Example sentence from the training set: [7049, 151, 1849, 7, 140, 1902, 21]
Example tag line from the training set: [3, 0, 0, 0, 0, 0, 0]


<h4 style="color:#0E2954">Padding</h4>
When training an LSTM in batches, all input sentences must have the same length. To accomplish this, we fix the sentence length to a set number and add the generic &lt;PAD&gt; token to fill all the empty positions.
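As a minimal sketch of the idea, the plain-Python function below mimics what `pad_sequences` does with `padding='post'` and `truncating='post'`: sequences longer than the target length are cut at the end, and shorter ones are filled at the end with a pad value. The name `pad_post` and the toy ids are assumptions made for illustration.

```python
def pad_post(sequences, maxlen, value=0):
    """Truncate each sequence to maxlen and right-pad it with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq[:maxlen])
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


batch = [[7049, 151, 1849], [9, 21], [5, 6, 7, 8, 9, 10]]
print(pad_post(batch, maxlen=5, value=0))
# [[7049, 151, 1849, 0, 0], [9, 21, 0, 0, 0], [5, 6, 7, 8, 9]]
```

Whatever pad value is chosen, the model and the evaluation code need to agree on it so that padded positions can be ignored consistently.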

In [9]:
def maxLengthProvider(sentences):
    """Return the length of the longest sentence (0 for an empty list)."""
    return max((len(sentence) for sentence in sentences), default=0)

In [11]:
maxLength=max(maxLengthProvider(t_sentences),maxLengthProvider(v_sentences),maxLengthProvider(test_sentences))
maxLength

104

In [12]:
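# Pad (or truncate) every sequence to maxLength, adding padding at the end.
# Note: as written, the pad values differ between splits. Training sentences
# are padded with token id 17, validation sentences with the default 0,
# training labels with the default 0 ('O'), and validation labels with 18
# (an id outside tag_map, so padded positions never match a prediction).
t_sentences = pad_sequences(
    t_sentences, padding='post', maxlen=maxLength, value=17.0, truncating='post')
t_labels = pad_sequences(
    t_labels, padding='post', maxlen=maxLength, truncating='post')
v_sentences = pad_sequences(
    v_sentences, padding='post', maxlen=maxLength, truncating='post')
v_labels = pad_sequences(
    v_labels, padding='post', maxlen=maxLength, value=18.0, truncating='post')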


In [13]:
# Preparing the training dataset.
train_dataset = tf.data.Dataset.from_tensor_slices((t_sentences, t_labels))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(64)

# Preparing the validation dataset.
val_dataset = tf.data.Dataset.from_tensor_slices((v_sentences, v_labels))
val_dataset = val_dataset.batch(64)

Metal device set to: Apple M1

systemMemory: 8.00 GB
maxCacheSize: 2.67 GB



<h3><b style="color:#0E2954">Part-2. Model Building and Designing</b></h3>

In [14]:
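# Model building: embedding -> bidirectional LSTM -> per-token dense layer
input_shape = (maxLength,)
inputs = keras.Input(shape=input_shape)
# mask_zero=True makes downstream layers skip positions whose token id is 0
embedded = layers.Embedding(input_dim=len(vocab), output_dim=64, mask_zero=True)(inputs)
# return_sequences=True keeps one output vector per token
x1 = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(embedded)
# The dense layer emits raw logits (no softmax), one score per tag
outputs = layers.Dense(units=len(tag_map), name='Prediction')(x1)
model = keras.Model(inputs=inputs, outputs=outputs, name='NER')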
model.summary()

Model: "NER"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 104)]             0         
                                                                 
 embedding (Embedding)       (None, 104, 64)           2251584   
                                                                 
 bidirectional (Bidirectiona  (None, 104, 128)         66048     
 l)                                                              
                                                                 
 Prediction (Dense)          (None, 104, 17)           2193      
                                                                 
Total params: 2,319,825
Trainable params: 2,319,825
Non-trainable params: 0
_________________________________________________________________
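The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below assumes a vocabulary of 35,181 entries, a value implied by the embedding row of the summary (2,251,584 / 64) and consistent with the pad token index of 35180 printed earlier.

```python
vocab_size = 35181   # assumption: implied by the embedding parameter count
embed_dim = 64
lstm_units = 64
num_tags = 17

# Embedding: one 64-dimensional vector per vocabulary entry
embedding_params = vocab_size * embed_dim

# One LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_params_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
bilstm_params = 2 * lstm_params_one_direction

# Dense: the input is the concatenation of both directions (128 features), plus biases
dense_params = (2 * lstm_units) * num_tags + num_tags

total = embedding_params + bilstm_params + dense_params
print(embedding_params, bilstm_params, dense_params, total)
# 2251584 66048 2193 2319825
```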


In [15]:
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
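Since the final dense layer has no softmax, `from_logits=True` tells the loss to apply the softmax itself. The plain-Python sketch below shows what that computation amounts to for a single token, and how a padding mask could be used to average the loss over real tokens only. The function names are hypothetical. The masking step is an illustration of one possible approach, not something the training loop in this notebook does explicitly.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one token: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def masked_mean_loss(batch_logits, batch_labels, batch_mask):
    """Average the per-token loss over positions where the mask is True."""
    total, count = 0.0, 0
    for logits, label, keep in zip(batch_logits, batch_labels, batch_mask):
        if keep:
            total += sparse_ce_from_logits(logits, label)
            count += 1
    return total / count


toy_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
toy_labels = [0, 1, 0]
toy_mask = [True, True, False]   # the last position is padding

print(masked_mean_loss(toy_logits, toy_labels, toy_mask))
print(masked_mean_loss(toy_logits, toy_labels, [True, True, True]))
```

In this toy case the unmasked average is noticeably larger, because the uninformative padded position contributes a loss of log(3).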

In [16]:
batch_size=64

<h3><b style="color:#0E2954">Part-3a. Training on train_dataset</b></h3>

In [17]:
#Training Loop
epochs = 5
for epoch in range(epochs):
    print("\nStart of epoch %d" % (epoch,))
    # Iterate over the batches of the dataset.
    for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
        # Open a GradientTape to record the operations run
        # during the forward pass, which enables auto-differentiation.
        with tf.GradientTape() as tape:
            # Run the forward pass of the layer.
            # The operations that the layer applies
            # to its inputs are going to be recorded
            # on the GradientTape.
            # Logits for this minibatch
            logits = model(x_batch_train, training=True)
            # Compute the loss value for this minibatch.
            loss_value = loss_fn(y_batch_train, logits)
            
        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss.
        grads = tape.gradient(loss_value, model.trainable_weights)

        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # Log every 200 batches.
        if step % 200 == 0:
            print(
                "Training loss (for one batch) at step %d: %.4f"
                % (step, float(loss_value))
            )
            print("Seen so far: %s samples" % ((step + 1) * batch_size))



Start of epoch 0
Training loss (for one batch) at step 0: 2.7969
Seen so far: 64 samples
Training loss (for one batch) at step 200: 0.1226
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 0.0735
Seen so far: 25664 samples

Start of epoch 1
Training loss (for one batch) at step 0: 0.0643
Seen so far: 64 samples
Training loss (for one batch) at step 200: 0.0510
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 0.0475
Seen so far: 25664 samples

Start of epoch 2
Training loss (for one batch) at step 0: 0.0300
Seen so far: 64 samples
Training loss (for one batch) at step 200: 0.0301
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 0.0350
Seen so far: 25664 samples

Start of epoch 3
Training loss (for one batch) at step 0: 0.0330
Seen so far: 64 samples
Training loss (for one batch) at step 200: 0.0111
Seen so far: 12864 samples
Training loss (for one batch) at step 400: 0.0135
Seen so far: 25664 samples

Start of epoch 4
Traini

<h3><b style="color:#0E2954">Part-3b. Testing on val_dataset</b></h3>

In [18]:
def predictions(model=model,val_dataset=val_dataset):
    return model.predict(val_dataset)

In [19]:
def masks(v_sentences=v_sentences):
    '''Boolean mask that is True for real tokens and False for padding (id 0).'''
    # Same mask an Embedding layer with mask_zero=True would compute, without
    # building a throwaway layer whose input_dim may not cover the vocabulary.
    return np.asarray(v_sentences) != 0


In [20]:
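def accuracy(test_sentences,test_labels,masks=masks,predictions=predictions):
    '''Token-level accuracy of the model predictions, ignoring padding.'''
    # Note: predictions() and masks() run with their default arguments
    # (val_dataset and v_sentences), so test_sentences is not used here.
    predicted_labels = np.argmax(predictions(), axis=2)
    # Padded label positions hold 18, which the model never predicts,
    # so they add nothing to the numerator.
    return np.sum(predicted_labels == test_labels) / float(np.sum(masks()))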
    

In [21]:
accuracy(v_sentences,v_labels)



0.9482130986673419
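The idea behind this number is that only real tokens should count, since padded positions are trivially easy or meaningless. As a small self-contained sketch, the function below computes the same kind of token-level accuracy on toy data. It differs from the notebook's code in one respect, stated here as an assumption: it detects padding from the label value (18) instead of from the sentence ids. The name `masked_accuracy` is hypothetical.

```python
def masked_accuracy(predicted, gold, pad_label):
    """Fraction of non-padding positions where the prediction equals the gold tag."""
    correct, total = 0, 0
    for pred_row, gold_row in zip(predicted, gold):
        for p, g in zip(pred_row, gold_row):
            if g == pad_label:
                continue  # skip padded positions entirely
            total += 1
            correct += int(p == g)
    return correct / total


toy_gold = [[3, 0, 0, 18, 18], [0, 1, 18, 18, 18]]
toy_pred = [[3, 0, 1, 0, 0], [0, 1, 0, 0, 0]]
print(masked_accuracy(toy_pred, toy_gold, pad_label=18))  # 0.8
```

Because most tokens carry the `O` tag, a high token-level accuracy can still hide weak performance on the rarer entity tags, so per-tag precision and recall are worth checking as well.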

<h3><b style="color:#0E2954">Part-4. Predictions on real-world sentences</b></h3>

In [22]:
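def predict(givenSentence,model=model,vocab=vocab,maxLength=maxLength,tag_map=tag_map):
    # Tokenize on spaces and keep at most maxLength tokens (the model's input length)
    words = givenSentence.split(' ')[:maxLength]
    # Map tokens to ids, falling back to the unknown-word id
    s = [vocab[token] if token in vocab else vocab['UNK'] for token in words]
    # Right-pad with 0, the id masked by the Embedding layer
    s = s + [0] * (maxLength - len(s))
    sentence = np.array([s]).astype(int)
    predictions = model.predict(sentence)
    outputs = np.argmax(predictions, axis=2)
    # Invert tag_map so the lookup does not depend on dictionary ordering
    idx_to_tag = {idx: tag for tag, idx in tag_map.items()}
    pred = []
    for i, word in enumerate(words):
        tag = idx_to_tag[int(outputs[0][i])]
        # Report only tokens tagged as part of an entity
        if tag != 'O':
            pred.append((word, tag))
    return pred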

In [23]:
# Trying Example Sentence for testing real world example
sentence = "Peter Navarro, the White House director of trade and manufacturing policy of U.S, said in an interview on Sunday morning that the White House was working to prepare for the possibility of a second wave of the coronavirus in the fall, though he said it wouldn’t necessarily come"
pp.pprint(predict(sentence))

[('Peter', 'B-per'),
 ('Navarro,', 'I-per'),
 ('White', 'B-org'),
 ('House', 'I-org'),
 ('Sunday', 'B-tim'),
 ('morning', 'I-tim'),
 ('White', 'B-org'),
 ('House', 'I-org')]


<b>References</b><br>
1. TensorFlow documentation: <a href="https://www.tensorflow.org/">Link</a><br>
2. Dataset: <a href="https://www.kaggle.com/datasets/debasisdotcom/name-entity-recognition-ner-dataset">Link</a>