Named Entity means anything that is a real-world object such as a person, a place, any organisation, any product which has a name. For example – “My name is Aman, and I and a Machine Learning Trainer”. In this sentence the name “Aman”, the field or subject “Machine Learning” and the profession “Trainer” are named entities.

In [7]:
import pandas as pd
import numpy as np

data=pd.read_csv(r'C:\Users\amany\Desktop\archive datasets\ner_dataset.csv',encoding= 'unicode_escape')
print(data.shape)
data.head()

(1048575, 4)


Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


In [8]:
data.isnull().sum()

Sentence #    1000616
Word                0
POS                 0
Tag                 0
dtype: int64

## Data Preparation for NN

I will train a Neural Network for the task of Named Entity Recognition (NER). So we need to do some modifications in the data to prepare it in such a manner so that it can easily fit into a neutral network. I will start this step by extracting the mappings that are required to train the neural network:

In [9]:
from itertools import chain
def get_dict_map(data,token_or_tags):
    tok2idx={}
    idx2tok={}
    
    if token_or_tags=='token':
        vocab=list(set(data['Word'].to_list()))
        print("Vocab : \n",vocab)
    else:
        vocab=list(set(data['Tag'].to_list()))
        print("Vocab : \n",vocab)
        
    idx2tok={idx:tok for idx,tok in enumerate(vocab)}
    tok2idx={tok:idx for idx,tok in enumerate(vocab)}
    
    return tok2idx,idx2tok

In [10]:
token2idx,idx2tok=get_dict_map(data,'token')

Vocab : 


In [11]:
tag2idx,idx2tag=get_dict_map(data,'tag')

Vocab : 
 ['I-geo', 'I-nat', 'B-art', 'I-org', 'B-geo', 'I-per', 'I-gpe', 'B-tim', 'B-per', 'B-gpe', 'I-eve', 'B-org', 'O', 'I-art', 'B-nat', 'I-tim', 'B-eve']


#### Now I will transform the columns in the data to extract the sequential data for our neural network:

In [16]:
data['Word_idx']=data['Word'].map(token2idx)
data['Tag_idx']=data['Tag'].map(tag2idx)
data_fillna=data.fillna(method='ffill',axis=0)

##Groupby and collects column

data_group=data_fillna.groupby(['Sentence #'],as_index=False)['Word','POS','Tag','Word_idx','Tag_idx'].agg(lambda x:list(x))

  import sys


In [17]:
print(data.shape)
data.head()

(1048575, 6)


Unnamed: 0,Sentence #,Word,POS,Tag,Word_idx,Tag_idx
0,Sentence: 1,Thousands,NNS,O,7350,12
1,,of,IN,O,4480,12
2,,demonstrators,NNS,O,17537,12
3,,have,VBP,O,34021,12
4,,marched,VBN,O,299,12


In [20]:
data_fillna.head()

Unnamed: 0,Sentence #,Word,POS,Tag,Word_idx,Tag_idx
0,Sentence: 1,Thousands,NNS,O,7350,12
1,Sentence: 1,of,IN,O,4480,12
2,Sentence: 1,demonstrators,NNS,O,17537,12
3,Sentence: 1,have,VBP,O,34021,12
4,Sentence: 1,marched,VBN,O,299,12


In [21]:
data_group

Unnamed: 0,Sentence #,Word,POS,Tag,Word_idx,Tag_idx
0,Sentence: 1,"[Thousands, of, demonstrators, have, marched, ...","[NNS, IN, NNS, VBP, VBN, IN, NNP, TO, VB, DT, ...","[O, O, O, O, O, O, B-geo, O, O, O, O, O, B-geo...","[7350, 4480, 17537, 34021, 299, 5469, 1513, 73...","[12, 12, 12, 12, 12, 12, 4, 12, 12, 12, 12, 12..."
1,Sentence: 10,"[Iranian, officials, say, they, expect, to, ge...","[JJ, NNS, VBP, PRP, VBP, TO, VB, NN, TO, JJ, J...","[B-gpe, O, O, O, O, O, O, O, O, O, O, O, O, O,...","[24260, 19149, 14239, 4764, 34324, 73, 535, 18...","[9, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12..."
2,Sentence: 100,"[Helicopter, gunships, Saturday, pounded, mili...","[NN, NNS, NNP, VBD, JJ, NNS, IN, DT, NNP, JJ, ...","[O, O, B-tim, O, O, O, O, O, B-geo, O, O, O, O...","[13915, 25477, 3424, 34806, 1746, 13942, 19535...","[12, 12, 7, 12, 12, 12, 12, 12, 4, 12, 12, 12,..."
3,Sentence: 1000,"[They, left, after, a, tense, hour-long, stand...","[PRP, VBD, IN, DT, NN, JJ, NN, IN, NN, NNS, .]","[O, O, O, O, O, O, O, O, O, O, O]","[32362, 25674, 34431, 17361, 6024, 30618, 1596...","[12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12]"
4,Sentence: 10000,"[U.N., relief, coordinator, Jan, Egeland, said...","[NNP, NN, NN, NNP, NNP, VBD, NNP, ,, NNP, ,, J...","[B-geo, O, O, B-per, I-per, O, B-tim, O, B-geo...","[2792, 3684, 16087, 18601, 26253, 4835, 30844,...","[4, 12, 12, 8, 5, 12, 7, 12, 4, 12, 9, 12, 9, ..."
...,...,...,...,...,...,...
47954,Sentence: 9995,"[Opposition, leader, Mir, Hossein, Mousavi, ha...","[NNP, NN, NNP, NNP, NNP, VBZ, VBN, PRP, VBZ, T...","[O, O, O, B-per, I-per, O, O, O, O, O, O, O, O...","[5959, 3806, 15988, 18083, 27883, 23195, 4835,...","[12, 12, 12, 8, 5, 12, 12, 12, 12, 12, 12, 12,..."
47955,Sentence: 9996,"[On, Thursday, ,, Iranian, state, media, publi...","[IN, NNP, ,, JJ, NN, NNS, VBN, DT, NN, IN, DT,...","[O, B-tim, O, B-gpe, O, O, O, O, O, O, O, O, B...","[11017, 14165, 27746, 24260, 9865, 17205, 2073...","[12, 7, 12, 9, 12, 12, 12, 12, 12, 12, 12, 12,..."
47956,Sentence: 9997,"[Following, Iran, 's, disputed, June, 12, elec...","[VBG, NNP, POS, JJ, NNP, CD, NNS, ,, NNS, NNS,...","[O, B-geo, O, O, B-tim, I-tim, O, O, O, O, O, ...","[2988, 2341, 4408, 31741, 13443, 32276, 3318, ...","[12, 4, 12, 12, 7, 15, 12, 12, 12, 12, 12, 12,..."
47957,Sentence: 9998,"[Since, then, ,, authorities, have, held, publ...","[IN, RB, ,, NNS, VBP, VBN, JJ, NNS, IN, DT, VB...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[8219, 14540, 27746, 21629, 34021, 5171, 6998,...","[12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1..."


Now I will split the data into training and test sets. I will create a function for splitting the data because the LSTM layers accept sequences of the same length only. So every sentence that appears as integer in the data must be padded with the same length:

In [36]:
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def get_pad_train_test_val(data_group, data):
    
    ##get max token and tags
    n_tokens=len(list(set(data['Word'].to_list())))
    n_tags=len(list(set(data['Tag'].to_list())))
    print("No. of tokens and tags : ",n_tokens,n_tags)
    
    ##pad tokens (X var)
    tokens=data_group['Word_idx'].tolist()
    maxlen=max([len(s) for s in tokens])
    pad_tokens=pad_sequences(tokens,maxlen=maxlen,dtype='int32',padding='post',value=n_tokens-1)
    
    ##pad tag (y var)
    tags=data_group['Tag_idx'].tolist()
    pad_tags=pad_sequences(tags,maxlen=maxlen,dtype='int32',padding='post',value=tag2idx["O"])
    n_tags=len(tag2idx)
    pad_tags=[to_categorical(i,num_classes=n_tags) for i in pad_tags]
    
    ##split train, test and validation set
    tokens_,test_tokens,tags_,test_tags=train_test_split(pad_tokens,pad_tags,test_size=0.1,train_size=0.9,random_state=2020)
    train_tokens,val_tokens,train_tags,val_tags=train_test_split(tokens_,tags_,test_size=0.25,train_size=0.75,random_state=2020)
    
    return train_tokens,val_tokens,test_tokens,train_tags,val_tags,test_tags

train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags = get_pad_train_test_val(data_group, data)

No. of tokens and tags :  35178 17


### Training Neural Network for Named Entity Recognition (NER)
Now, I will proceed with training the neural network architecture of our model. So let’s start with importing all the packages we need for training our neural network:

In [37]:
import numpy as np
import tensorflow
from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from tensorflow.keras.utils import plot_model
from numpy.random import seed
seed(1)
tensorflow.random.set_seed(2)

The layer below will take the dimensions from the LSTM layer and will give the maximum length and maximum tags as an output:

In [38]:
input_dim = len(list(set(data['Word'].to_list())))+1
output_dim = 64
input_length = max([len(s) for s in data_group['Word_idx'].tolist()])
n_tags = len(tag2idx)

In [39]:
input_dim,input_length,n_tags

(35179, 104, 17)

In [40]:
def get_bilstm_lstm_model():
    model = Sequential()

    # Add Embedding layer
    model.add(Embedding(input_dim=input_dim, output_dim=output_dim, input_length=input_length))

    # Add bidirectional LSTM
    model.add(Bidirectional(LSTM(units=output_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2), merge_mode = 'concat'))

    # Add LSTM
    model.add(LSTM(units=output_dim, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))

    # Add timeDistributed Layer
    model.add(TimeDistributed(Dense(n_tags, activation="relu")))

    #Optimiser 
    # adam = k.optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999)

    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    
    return model

In [41]:
def train_model(X, y, model):
    loss = list()
    for i in range(5):
        # fit model for one epoch on this sequence
        hist = model.fit(X, y, batch_size=1000, verbose=1, epochs=1, validation_split=0.2)
        loss.append(hist.history['loss'][0])
    return loss

In [44]:
results = pd.DataFrame()
model_bilstm_lstm = get_bilstm_lstm_model()
plot_model(model_bilstm_lstm)
results['with_add_lstm'] = train_model(train_tokens, np.array(train_tags), model_bilstm_lstm)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 104, 64)           2251456   
_________________________________________________________________
bidirectional (Bidirectional (None, 104, 128)          66048     
_________________________________________________________________
lstm_1 (LSTM)                (None, 104, 64)           49408     
_________________________________________________________________
time_distributed (TimeDistri (None, 104, 17)           1105      
Total params: 2,368,017
Trainable params: 2,368,017
Non-trainable params: 0
_________________________________________________________________
('Failed to import pydot. You must `pip install pydot` and install graphviz (https://graphviz.gitlab.io/download/), ', 'for `pydotprint` to work.')


In [45]:
results

Unnamed: 0,with_add_lstm
0,
1,
2,
3,
4,


In [47]:
!pip install -U spacy

Collecting spacy
  Downloading spacy-3.0.6-cp37-cp37m-win_amd64.whl (11.7 MB)
Collecting thinc<8.1.0,>=8.0.3
  Downloading thinc-8.0.3-cp37-cp37m-win_amd64.whl (1.0 MB)
Collecting wasabi<1.1.0,>=0.8.1
  Downloading wasabi-0.8.2-py3-none-any.whl (23 kB)
Collecting blis<0.8.0,>=0.4.0
  Downloading blis-0.7.4-cp37-cp37m-win_amd64.whl (6.5 MB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp37-cp37m-win_amd64.whl (35 kB)
Collecting catalogue<2.1.0,>=2.0.3
  Downloading catalogue-2.0.4-py3-none-any.whl (16 kB)
Collecting spacy-legacy<3.1.0,>=3.0.4
  Downloading spacy_legacy-3.0.5-py2.py3-none-any.whl (12 kB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting packaging>=20.0
  Downloading packaging-20.9-py2.py3-none-any.whl (40 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.5-cp37-cp37m-win_amd64.whl (108 kB)
Collecting pathy>=0.3.5
  Downloading pathy-0.5.2-py3-none-any.whl (42 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Do



In [49]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.0.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


2021-05-07 17:00:56.854591: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2021-05-07 17:00:56.854695: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


In [51]:
# pip install -U spacy
# python -m spacy download en_core_web_sm

import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
text = nlp('Hi, My name is Aman\n I am from India \n I want to work with Google \n Steve Jobs is My Inspiration')
displacy.render(text, style = 'ent', jupyter=True)