# NLP Poject No 1:
# Named Entity Recognition (NER) NLP Projects

Named Entity means anything that is a real-world object such as a person, a place, any organisation, any product which has a name. For example – “My name is Noor, and I and a Machine Learning Trainer”. In this sentence the name “Aman”, the field or subject “Machine Learning” and the profession “Trainer” are named entities.

In Machine Learning Named Entity Recognition (NER) is a task of Natural Language Processing to identify the named entities in a certain piece of text.

Have you ever used software known as Grammarly? It identifies all the incorrect spellings and punctuations in the text and corrects it. But it does not do anything with the named entities, as it is also using the same technique.

# Loading the Data for Named Entity Recognition (NER)
The dataset, that I will use for this task can be easily downloaded from here. Now the first thing I will fo is to load the data and have a look at it to know what I am working with. So let’s simply import the pandas library and load the data:

In the data, we can see that the words are broken into columns which will represent our feature X, and the Tag column in the right will represent our label Y.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('ner_datasetreference.csv')

In [3]:
df.head(10)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


In [4]:
df.shape

(1048575, 4)

In [5]:
df['Word'].values

array(['Thousands', 'of', 'demonstrators', ..., 'to', 'the', 'attack'],
      dtype=object)

In [6]:
df['Tag'].values

array(['O', 'O', 'O', ..., 'O', 'O', 'O'], dtype=object)

In [7]:
df.isna().sum()

Sentence #    1000616
Word                0
POS                 0
Tag                 0
dtype: int64

In [8]:
df.duplicated().sum()

953668

# Data Preparation for Neural Networks
I will train a Neural Network for the task of Named Entity Recognition (NER). So we need to do some modifications in the data to prepare it in such a manner so that it can easily fit into a neutral network. I will start this step by extracting the mappings that are required to train the neural network:

In [9]:
from itertools import chain

def get_dict_map(df,token_or_tag):
    token_to_index = {}
    index_to_token = {}
    
    if token_or_tag == "token":
        vocab = list(set(df['Word'].to_list()))
    else:
        vocab = list(set(df['Tag'].to_list()))
        
    token_to_index = {ktn:idx for ktn,idx in enumerate(vocab)}
    index_to_token = {idx:ktn for ktn,idx in enumerate(vocab)}
    
    return token_to_index,index_to_token

In [10]:
tkn_to_idx, idx_to_tkn = get_dict_map(df,'token')
tag_to_idx, idx_to_tag = get_dict_map(df,'tag')

In [35]:
tag_to_idx

{0: 'B-art',
 1: 'I-org',
 2: 'I-per',
 3: 'I-nat',
 4: 'B-nat',
 5: 'B-tim',
 6: 'B-eve',
 7: 'B-per',
 8: 'I-gpe',
 9: 'I-tim',
 10: 'B-geo',
 11: 'B-org',
 12: 'I-art',
 13: 'I-eve',
 14: 'I-geo',
 15: 'O',
 16: 'B-gpe'}

# Now I will transform the columns in the data to extract the sequential data for our neural network

In [16]:
df['word_idx'] = df['Word'].map(tkn_to_idx)
df['tag_idx'] = df['Tag'].map(tag_to_idx)

In [20]:
df_filled = df.fillna(method='ffill',axis=0)

In [23]:
df_filled.fillna(-1,axis=0,inplace=True)

In [24]:
df_filled

Unnamed: 0,Sentence #,Word,POS,Tag,word_idx,tag_idx
0,Sentence: 1,Thousands,NNS,O,-1.0,-1.0
1,Sentence: 1,of,IN,O,-1.0,-1.0
2,Sentence: 1,demonstrators,NNS,O,-1.0,-1.0
3,Sentence: 1,have,VBP,O,-1.0,-1.0
4,Sentence: 1,marched,VBN,O,-1.0,-1.0
...,...,...,...,...,...,...
1048570,Sentence: 47959,they,PRP,O,-1.0,-1.0
1048571,Sentence: 47959,responded,VBD,O,-1.0,-1.0
1048572,Sentence: 47959,to,TO,O,-1.0,-1.0
1048573,Sentence: 47959,the,DT,O,-1.0,-1.0


# Groupby and collect columns

In [25]:
# Groupby and collect columns
data_group = df_filled.groupby(['Sentence #'],as_index=False)['Word', 'POS', 'Tag', 'word_idx', 'tag_idx'].agg(lambda x: list(x))

  data_group = df_filled.groupby(['Sentence #'],as_index=False)['Word', 'POS', 'Tag', 'word_idx', 'tag_idx'].agg(lambda x: list(x))


# split the data into training and test sets

In [26]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [58]:
def get_pad_train_test_val(data_group,df):
    
    #get max token and tag length
    n_token = len(list(set(df['Word'].to_list())))
    n_tag = len(list(set(df['Tag'].to_list())))
    
    #Pad tokens (X var)
    tokens = data_group['word_idx'].fillna(-1).tolist()
    maxlen = max([len(s) for s in tokens])
    pad_tokens = pad_sequences(tokens,maxlen=maxlen,dtype='int32',padding='post',value=n_token - 1)
    
    #Pad Tags (y var) and convert it into one hot encoding
    tags = data_group['tag_idx'].fillna(-1).tolist()
#     tag_to_idx['O'] = len(tag_to_idx)
    pad_tags = pad_sequences(tags,maxlen=maxlen,dtype='int32',padding='post',value=tag_to_idx['O'])
    n_tags = len(tag_to_idx)
    pad_tags = [to_categorical(i, num_classes=n_tags) for i in pad_tags]
    
    #Split train, test and validation set
    tokens_, test_tokens, tags_, test_tags = train_test_split(pad_tokens, pad_tags, test_size=0.1, train_size=0.9, random_state=2023)
    train_tokens, val_tokens, train_tags, val_tags = train_test_split(tokens_,tags_,test_size = 0.25,train_size =0.75, random_state=2023)
    
    
    print(
        'train_tokens length:', len(train_tokens),
        '\ntrain_tokens length:', len(train_tokens),
        '\ntest_tokens length:', len(test_tokens),
        '\ntest_tags:', len(test_tags),
        '\nval_tokens:', len(val_tokens),
        '\nval_tags:', len(val_tags),
    )
    
    
    
    return train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags



train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags = get_pad_train_test_val(data_group, df)

train_tokens length: 32372 
train_tokens length: 32372 
test_tokens length: 4796 
test_tags: 4796 
val_tokens: 10791 
val_tags: 10791


# Training Neural Network for Named Entity Recognition (NER)
Now, I will proceed with training the neural network architecture of our model. So let’s start with importing all the packages we need for training our neural network:

In [59]:
import numpy as np
import tensorflow
from tensorflow.keras import Sequential, Model, Input
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from tensorflow.keras.utils import plot_model
from numpy.random import seed
seed(1)
tensorflow.random.set_seed(2)

# The layer below will take the dimensions from the LSTM layer and will give the maximum length and maximum tags as an output:

In [60]:
input_dim = len(list(set(df['Word'].to_list())))+1
output_dim = 64
input_length = max([len(s) for s in data_group['word_idx'].tolist()])
n_tags = len(tag_to_idx)

# Now I will create a helper function which will help us in giving the summary of every layer of the neural network model for Named Entity Recognition (NER):

In [61]:
def get_bilstm_lstm_model():
    model = Sequential()

    # Add Embedding layer
    model.add(Embedding(input_dim=input_dim, output_dim=output_dim, input_length=input_length))

    # Add bidirectional LSTM
    model.add(Bidirectional(LSTM(units=output_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2), merge_mode = 'concat'))

    # Add LSTM
    model.add(LSTM(units=output_dim, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))

    # Add timeDistributed Layer
    model.add(TimeDistributed(Dense(n_tags, activation="relu")))

    #Optimiser 
    # adam = k.optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999)

    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    
    return model

# Now I will create a helper function to train the Named Entity Recognition model:

In [62]:
def train_model(X, y, model):
    print(X.shape,'\n',y.shape)
    loss = list()
    for i in range(25):
        # fit model for one epoch on this sequence
        hist = model.fit(X, y, batch_size=1000, verbose=1, epochs=1,shuffle=True, validation_split=0.2)
        loss.append(hist.history['loss'][0])
    return loss

# Driver code:

In [63]:
results = pd.DataFrame()
model_bilstm_lstm = get_bilstm_lstm_model()
plot_model(model_bilstm_lstm)
results['with_add_lstm'] = train_model(train_tokens, np.array(train_tags), model_bilstm_lstm)

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 104, 64)           2251456   
                                                                 
 bidirectional_5 (Bidirectio  (None, 104, 128)         66048     
 nal)                                                            
                                                                 
 lstm_11 (LSTM)              (None, 104, 64)           49408     
                                                                 
 time_distributed_5 (TimeDis  (None, 104, 18)          1170      
 tributed)                                                       
                                                                 
Total params: 2,368,082
Trainable params: 2,368,082
Non-trainable params: 0
_________________________________________________________________
You must install pydot (`pip install pydot`)

InvalidArgumentError: Graph execution error:

Detected at node 'sequential_5/embedding_5/embedding_lookup' defined at (most recent call last):
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
      return _run_code(code, main_globals, None,
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
      exec(code, run_globals)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\ipykernel_launcher.py", line 17, in <module>
      app.launch_new_instance()
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\traitlets\config\application.py", line 976, in launch_instance
      app.start()
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\ipykernel\kernelapp.py", line 712, in start
      self.io_loop.start()
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\tornado\platform\asyncio.py", line 215, in start
      self.asyncio_loop.run_forever()
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 600, in run_forever
      self._run_once()
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\asyncio\base_events.py", line 1896, in _run_once
      handle._run()
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\asyncio\events.py", line 80, in _run
      self._context.run(self._callback, *self._args)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\ipykernel\kernelbase.py", line 510, in dispatch_queue
      await self.process_one()
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\ipykernel\kernelbase.py", line 499, in process_one
      await dispatch(*args)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\ipykernel\kernelbase.py", line 406, in dispatch_shell
      await result
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\ipykernel\kernelbase.py", line 730, in execute_request
      reply_content = await reply_content
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\ipykernel\ipkernel.py", line 383, in do_execute
      res = shell.run_cell(
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\ipykernel\zmqshell.py", line 528, in run_cell
      return super().run_cell(*args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 2885, in run_cell
      result = self._run_cell(
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 2940, in _run_cell
      return runner(coro)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\async_helpers.py", line 129, in _pseudo_sync_runner
      coro.send(None)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3139, in run_cell_async
      has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3318, in run_ast_nodes
      if await self.run_code(code, result, async_=asy):
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\IPython\core\interactiveshell.py", line 3378, in run_code
      exec(code_obj, self.user_global_ns, self.user_ns)
    File "C:\Users\Noor Saeed\AppData\Local\Temp\ipykernel_3832\329638798.py", line 4, in <module>
      results['with_add_lstm'] = train_model(train_tokens, np.array(train_tags), model_bilstm_lstm)
    File "C:\Users\Noor Saeed\AppData\Local\Temp\ipykernel_3832\3288993200.py", line 6, in train_model
      hist = model.fit(X, y, batch_size=1000, verbose=1, epochs=1,shuffle=True, validation_split=0.2)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\training.py", line 1564, in fit
      tmp_logs = self.train_function(iterator)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\training.py", line 1160, in train_function
      return step_function(self, iterator)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\training.py", line 1146, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\training.py", line 1135, in run_step
      outputs = model.train_step(data)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\training.py", line 993, in train_step
      y_pred = self(x, training=True)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\training.py", line 557, in __call__
      return super().__call__(*args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\sequential.py", line 410, in call
      return super().call(inputs, training=training, mask=mask)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\functional.py", line 510, in call
      return self._run_internal_graph(inputs, training=training, mask=mask)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\functional.py", line 667, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\utils\traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\engine\base_layer.py", line 1097, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\utils\traceback_utils.py", line 96, in error_handler
      return fn(*args, **kwargs)
    File "C:\Users\Noor Saeed\AppData\Local\Programs\Python\Python310\lib\site-packages\keras\layers\core\embedding.py", line 208, in call
      out = tf.nn.embedding_lookup(self.embeddings, inputs)
Node: 'sequential_5/embedding_5/embedding_lookup'
indices[938,0] = -1 is not in [0, 35179)
	 [[{{node sequential_5/embedding_5/embedding_lookup}}]] [Op:__inference_train_function_63793]

# Testing the Named Entity Recognition (NER) Model:
Now let’s test our model on a piece of text:

In [68]:
import spacy
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
text = nlp('Hi, My name is Noor Saeed \n I am from Pakistan \n I want to work with Google \n Steve Jobs is My Inspiration')
displacy.render(text, style = 'ent', jupyter=True)

ModuleNotFoundError: No module named 'spacy'