# Named Entity Recognition
Named Entity Recognition (NER) is a common problem in Natural Language Processing (NLP). An entity is a span of text that is of interest. NER extracts these entities from a large corpus and classifies them into predefined categories such as location, organization, person name and so on.
Information about labels:
* geo = Geographical Entity
* org = Organization
* per = Person
* gpe = Geopolitical Entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural Phenomenon

        1. Total Words Count = 1354149 
        2. Target Data Column: Tag

#### Importing Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import sys
import keras
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Activation, Conv2D, Input, Embedding, Reshape, MaxPool2D, Concatenate, Flatten, Dropout, Dense, Conv1D, ZeroPadding2D
from keras.layers import MaxPool1D
from keras.models import Model, Sequential
from keras.callbacks import ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from keras.models import load_model    


In [4]:
#Reading the csv file
file = "C:/Users/Admin/Desktop/NLP/Sentimental/ner_dataset.csv"
df = pd.read_csv(file, encoding = "ISO-8859-1")

In [6]:
df.describe()

| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| count | 47959 | 1048575 | 1048575 | 1048575 |
| unique | 47959 | 35178 | 42 | 17 |
| top | Sentence: 2212 | the | NN | O |
| freq | 1 | 52573 | 145807 | 887908 |


#### Observations : 
* There are 47959 sentences in the dataset.
* There are 35178 unique words in the dataset.
* There are 17 labels (tags) in total.

In [7]:
#Displaying the unique Tags
df['Tag'].unique()

array(['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim',
       'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve',
       'I-eve', 'I-nat'], dtype=object)

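There are many missing values in the 'Sentence #' column, because only the first word of each sentence carries the sentence label. We therefore forward-fill the column with pandas' 'ffill' method, which propagates the last valid observation forward to the following rows.

In [9]:
# Equivalent to fillna(method = 'ffill'), which newer pandas versions deprecate
df = df.ffill()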
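
To make the forward-fill step concrete, here is a minimal pure-Python sketch of the same idea on a toy column. The helper name `forward_fill` and the sample values are hypothetical and chosen only for illustration; pandas applies the same rule to the whole DataFrame in one call.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled = []
    last_valid = None
    for value in values:
        if value is not None:
            last_valid = value
        filled.append(last_valid)
    return filled


# Toy 'Sentence #' column: only the first word of each sentence is labelled
column = ["Sentence: 1", None, None, "Sentence: 2", None]
filled_column = forward_fill(column)
print(filled_column)
```

A leading missing value stays missing, because there is no earlier observation to propagate.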

In [10]:
# Helper class that groups the rows by sentence.
# Each sentence becomes a list of (word, POS, tag) tuples.
class sentence(object):
    def __init__(self, df):
        self.n_sent = 1
        self.df = df
        self.empty = False
        agg = lambda s : [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                       s['POS'].values.tolist(),
                                                       s['Tag'].values.tolist())]
        self.grouped = self.df.groupby("Sentence #").apply(agg)
        self.sentences = [s for s in self.grouped]
        
    def get_text(self):
        # Return the next sentence, or None once the sentence number runs past the data
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent +=1
            return s
        except KeyError:
            return None

In [11]:
#Displaying one full sentence
getter = sentence(df)
sentences = [" ".join([s[0] for s in sent]) for sent in getter.sentences]
sentences[0]

'Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .'

In [12]:
#sentence with its pos and tag.
sent = getter.get_text()
print(sent)

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]
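
In the tags above, the prefix `B-` marks the first token of an entity, `I-` marks a continuation of the same entity, and `O` marks tokens outside any entity. The sketch below shows one way to group tagged tokens back into entity spans. The helper name `extract_entities` is hypothetical and not part of the original notebook.

```python
def extract_entities(tagged_tokens):
    """Group (word, pos, tag) tuples into (entity_text, entity_type) pairs."""
    entities = []
    current_words, current_type = [], None
    for word, _pos, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A new entity starts; close any entity that was still open
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_type:
            # Continuation of the currently open entity
            current_words.append(word)
        else:
            # 'O' (or a stray I- tag) ends the open entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


# A shortened version of the sentence printed above
sample = [("marched", "VBN", "O"), ("through", "IN", "O"), ("London", "NNP", "B-geo"),
          ("in", "IN", "O"), ("Iraq", "NNP", "B-geo"), ("British", "JJ", "B-gpe"),
          ("troops", "NNS", "O")]
entities = extract_entities(sample)
print(entities)
```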


Getting all the sentences in the dataset.

In [13]:
sentences = getter.sentences

#### Defining the parameters for the CNN

In [14]:
# Number of samples processed per gradient update
batch_size = 64 
# Number of passes through the entire training set
epochs = 8
# Maximum sentence length in tokens (shorter sentences are padded)
max_len = 75 
# Dimension of embedding vector
embedding = 40 

#### Preprocessing Data
We preprocess the text before feeding it to the network.
* The `word_to_index` dictionary converts each word into an index value, and `tag_to_index` does the same for the labels, so every word and tag is represented as an integer.

In [15]:
#Getting unique words and labels from data
words = list(df['Word'].unique())
tags = list(df['Tag'].unique())
# Dictionary word:index pair
# word is key and its value is corresponding index
word_to_index = {w : i + 2 for i, w in enumerate(words)}
word_to_index["UNK"] = 1
word_to_index["PAD"] = 0

# Dictionary label:index pair
# label is key and value is index.
tag_to_index = {t : i + 1 for i, t in enumerate(tags)}
tag_to_index["PAD"] = 0

idx2word = {i: w for w, i in word_to_index.items()}
idx2tag = {i: w for w, i in tag_to_index.items()}
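
The indexing scheme above reserves 0 for padding and 1 for unknown words, and numbers the real words from 2. The sketch below reproduces that scheme on a toy vocabulary and shows how a word missing from the vocabulary can fall back to the `UNK` index. The helper `build_index` and the `toy_` names are hypothetical and used only for illustration.

```python
def build_index(items, reserved):
    """Map each reserved token, then each unique item, to a consecutive integer."""
    index = {token: i for i, token in enumerate(reserved)}
    for item in items:
        if item not in index:
            index[item] = len(index)
    return index


toy_words = ["London", "is", "in", "London"]
toy_word_to_index = build_index(toy_words, reserved=["PAD", "UNK"])

# Words missing from the vocabulary fall back to UNK instead of raising KeyError
new_sentence = ["Paris", "is", "in", "London"]
encoded = [toy_word_to_index.get(w, toy_word_to_index["UNK"]) for w in new_sentence]
print(toy_word_to_index)
print(encoded)
```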

In [17]:
# Convert each sentence from a list of tokens into a list of word indices
X = [[word_to_index[w[0]] for w in s] for s in sentences]

# Pad each sequence to the same length (max_len)
X = pad_sequences(maxlen = max_len, sequences = X, padding = "post", value = word_to_index["PAD"])
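
As a rough illustration of what post-padding does, the pure-Python sketch below pads a short sequence at the end and trims a long one. The helper `pad_post` is hypothetical. It drops items from the front of an over-long sequence, mirroring what we understand to be the Keras default (`truncating="pre"`), which the call above leaves unchanged.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end up to maxlen; if too long, keep only the last maxlen items."""
    truncated = sequence[-maxlen:]
    return truncated + [value] * (maxlen - len(truncated))


short_padded = pad_post([5, 6, 7], maxlen=5)
long_trimmed = pad_post([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_padded)
print(long_trimmed)
```

Under this assumption, sentences longer than `max_len` would lose their first tokens rather than their last ones.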

In [18]:
# Convert label to index
y = [[tag_to_index[w[2]] for w in s] for s in sentences]

# padding
y = pad_sequences(maxlen = max_len, sequences = y, padding = "post", value = tag_to_index["PAD"])

In [19]:
num_tag = df['Tag'].nunique()
# One hot encoded labels
y = [to_categorical(i, num_classes = num_tag + 1) for i in y]
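
One-hot encoding turns each tag index into a vector with a single 1 at that index. The class count is `num_tag + 1` because index 0 is reserved for `PAD` in addition to the 17 tags. A minimal sketch with a hypothetical helper `one_hot` and a toy sequence of four classes:

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


# Toy padded tag sequence; 0 is the PAD index
padded_tags = [1, 2, 0]
one_hot_tags = [one_hot(i, num_classes=4) for i in padded_tags]
print(one_hot_tags)
```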

In [21]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15)

In [22]:
print("Size of training input data : ", X_train.shape)
print("Size of training output data : ", np.array(y_train).shape)
print("Size of testing input data : ", X_test.shape)
print("Size of testing output data : ", np.array(y_test).shape)

Size of training input data :  (40765, 75)
Size of training output data :  (40765, 75, 18)
Size of testing input data :  (7194, 75)
Size of testing output data :  (7194, 75, 18)


In [41]:
# Flatten the one-hot labels to one row per token: (sentences * max_len, 18)
y = np.asarray(y_train).reshape((-1, 18))

Y_test = np.asarray(y_test).reshape((-1, 18))
Y_test.shape

(539550, 18)
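
The shapes printed above can be checked by hand. The sketch below assumes that `train_test_split` rounds the test fraction up to a whole number of samples and gives the remainder to the training set; that rounding rule is our assumption about scikit-learn, while the numbers come from the outputs above.

```python
import math

n_sentences = 47959   # sentences in the dataset
test_size = 0.15      # fraction passed to train_test_split
max_len = 75          # tokens per padded sentence

n_test = math.ceil(test_size * n_sentences)
n_train = n_sentences - n_test
# Flattening (n_test, max_len, 18) to one row per token
flat_test_rows = n_test * max_len
print(n_train, n_test, flat_test_rows)
```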

#### CNN model

In [65]:

from tensorflow.keras.layers import TimeDistributed, SpatialDropout1D,Bidirectional
from keras.layers import MaxPool1D
# Model architecture
filters=300
kernel_size=3
inputs = Input(shape = (max_len,))
# Embedding output dimension is set to max_len (75) here, not the `embedding` parameter defined earlier
model = Embedding(input_dim = len(words) + 2, output_dim = max_len, input_length = max_len, mask_zero = True)(inputs)
model = Dropout(0.1)(model)
# Three stacked 1D convolutions; padding='same' keeps one output vector per token
model = Conv1D(filters, kernel_size, padding='same', activation='relu', strides=1)(model)
model = Conv1D(150, kernel_size, padding='same', activation='relu', strides=1)(model)
model = Conv1D(75, kernel_size, padding='same', activation='relu', strides=1)(model)
# Per-token softmax over the 17 tags plus the PAD class
out = Dense(num_tag + 1, activation = 'softmax')(model)
model = Model(inputs, out)
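# The labels are one-hot over mutually exclusive classes, so categorical cross-entropy is used
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])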
model.summary()


Model: "model_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        [(None, 75)]              0         
_________________________________________________________________
embedding_11 (Embedding)     (None, 75, 75)            2638500   
_________________________________________________________________
dropout_8 (Dropout)          (None, 75, 75)            0         
_________________________________________________________________
conv1d_24 (Conv1D)           (None, 75, 300)           67800     
_________________________________________________________________
conv1d_25 (Conv1D)           (None, 75, 150)           135150    
_________________________________________________________________
conv1d_26 (Conv1D)           (None, 75, 75)            33825     
_________________________________________________________________
dense_6 (Dense)              (None, 75, 18)            1368
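
The parameter counts in the summary can be reproduced by hand. An embedding has one vector per vocabulary entry, a `Conv1D` layer has `kernel_size * input_channels * filters` weights plus one bias per filter, and a `Dense` layer has `inputs * outputs` weights plus one bias per output. The helper names below are hypothetical; the sizes are taken from the summary and the dataset statistics above.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    return in_units * out_units + out_units


vocab_size = 35178 + 2  # unique words plus PAD and UNK
param_counts = {
    "embedding": embedding_params(vocab_size, 75),
    "conv1d_1": conv1d_params(3, 75, 300),
    "conv1d_2": conv1d_params(3, 300, 150),
    "conv1d_3": conv1d_params(3, 150, 75),
    "dense": dense_params(75, 18),
}
for name, count in param_counts.items():
    print(name, count)
```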

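We train the model and hold out 10% of the training data for validation. Saving a checkpoint at each epoch would let us keep the best-performing model and avoid using weights from later epochs where the validation loss worsens due to overfitting; note that the call below does not pass a checkpoint callback and runs for a single epoch.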

In [66]:
history = model.fit(X_train, np.asarray(y_train),  batch_size = 64, verbose = 1, epochs = 1, validation_split = 0.1)
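
The idea behind per-epoch checkpointing can be sketched without Keras: after each epoch, compare the validation loss with the best value seen so far and keep the weights only when it improves. The loss values below are made up purely for illustration and are not results from this notebook. In Keras this role is played by the `ModelCheckpoint` callback imported earlier, typically with `save_best_only=True`.

```python
def best_epoch(val_losses):
    """Return (epoch_number, loss) for the lowest validation loss, counting epochs from 1."""
    best_index, best_loss = None, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # This is the point where a checkpoint would be written
            best_index, best_loss = epoch, loss
    return best_index, best_loss


# Hypothetical validation losses over five epochs (not measured values)
hypothetical_val_losses = [0.30, 0.22, 0.19, 0.21, 0.25]
best = best_epoch(hypothetical_val_losses)
print(best)
```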




In [52]:
history.history.keys()

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

#### Testing the model on sample sentences

In [54]:
def test_input(text):
    # Split on whitespace and map each word to its index; unseen words fall back to "UNK"
    word_list = text.split()
    x_new = [word_to_index.get(word, word_to_index["UNK"]) for word in word_list]
    # Pad the single sequence to max_len so it matches the model input shape
    g = pad_sequences(maxlen = max_len, sequences = [x_new], padding = "post", value = word_to_index["PAD"])

    # Take the most probable tag index at each position
    p = model.predict(np.array(g))
    p = np.argmax(p, axis = -1)
    print("{:20}\t{}\n".format("Word", "Prediction"))
    print("-" * 35)

    for w, pred in zip(word_list, p[0]):
        print("{:20}\t{}".format(w, idx2tag[pred]))
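
The decoding step inside `test_input` takes, for each token, the index of the highest probability and looks it up in `idx2tag`. A minimal sketch with a hypothetical three-tag mapping and made-up probabilities (not model output):

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


toy_idx2tag = {0: "PAD", 1: "O", 2: "B-geo"}
# One probability row per token; each row sums to 1 like a softmax output
toy_probs = [
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
]
predicted_tags = [toy_idx2tag[argmax(row)] for row in toy_probs]
print(predicted_tags)
```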


In [73]:
test_inputs = "my friend Mohammed is travelling to London are"
test_input(test_inputs)

Word                	Prediction

-----------------------------------
my                  	O
friend              	O
Mohammed            	B-per
is                  	O
travelling          	O
to                  	O
London              	B-geo
are                 	O


In [77]:
test_inputs = "Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country"
test_input(test_inputs)

Word                	Prediction

-----------------------------------
Thousands           	O
of                  	O
demonstrators       	O
have                	O
marched             	O
through             	O
London              	B-geo
to                  	O
protest             	O
the                 	O
war                 	O
in                  	O
Iraq                	B-geo
and                 	O
demand              	O
the                 	O
withdrawal          	O
of                  	O
British             	B-gpe
troops              	O
from                	O
that                	O
country             	O
