<a href="https://colab.research.google.com/github/ShreyMhatre/nlp-learning-journey/blob/main/NLP_NER.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Named Entity Recognition (NER)

In [21]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("abhinavwalia95/entity-annotated-corpus")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/abhinavwalia95/entity-annotated-corpus/versions/4


In [9]:
import pandas as pd
from tensorflow import keras
import numpy as np

In [11]:
# Read the CSV from the directory returned by kagglehub above
df = pd.read_csv(f'{path}/ner_dataset.csv',encoding='unicode-escape')
df.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


Let's get unique tags and create lookup dictionaries that we can use to convert tags into class numbers:

In [12]:
tags = df.Tag.unique()
tags

array(['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim',
       'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve',
       'I-eve', 'I-nat'], dtype=object)

In [13]:
id2tag = dict(enumerate(tags))
tag2id = { v : k for k,v in id2tag.items() }

id2tag[0]

'O'

Now we need to do the same with the vocabulary. For simplicity, we build it without taking word frequency into account; in practice you might prefer the Keras TextVectorization layer and cap the number of words.

In [24]:
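# Set of all lower-cased words in the corpus (missing values become a placeholder)
vocab = set(df['Word'].fillna('<UNK>').apply(lambda x: x.lower()))
# Real words get ids 1..len(vocab); id 0 is reserved for unknown words
id2word = { i+1 : v for i,v in enumerate(vocab) }
id2word[0] = '<UNK>'
word2id = { v : k for k,v in id2word.items() }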
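
If you do want to take word frequency into account, one possible approach is sketched below. This is only an illustration under assumptions: the helper `build_vocab` and its parameters `max_words` and `min_count` are hypothetical names, not part of the notebook, and the toy word list is made up.

```python
from collections import Counter

def build_vocab(words, max_words=None, min_count=1):
    """Map the most frequent lower-cased words to ids 1..N; id 0 is reserved for '<UNK>'."""
    counts = Counter(w.lower() for w in words)
    # most_common sorts by descending count; ties keep first-seen order
    kept = [w for w, c in counts.most_common(max_words) if c >= min_count]
    id2word = {0: '<UNK>'}
    id2word.update({i + 1: w for i, w in enumerate(kept)})
    word2id = {w: i for i, w in id2word.items()}
    return word2id, id2word

# Toy example (made-up words, not the corpus)
toy_words = ['the', 'cat', 'The', 'dog', 'the', 'cat']
toy_word2id, toy_id2word = build_vocab(toy_words, max_words=2)
print(toy_word2id)                # {'<UNK>': 0, 'the': 1, 'cat': 2}
print(toy_word2id.get('dog', 0))  # 0, because 'dog' fell outside the top 2 words
```

Words outside the kept set fall back to id 0 at lookup time, which matches the `word2id.get(..., 0)` convention used later in this notebook.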



We need to create a dataset of sentences for training. Let's loop through the original dataset and split it into individual sentences, collecting X (lists of words) and Y (lists of tags):


In [25]:
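X,Y = [],[]   # all sentences: word lists and tag lists
s,t = [],[]   # sentence currently being collected
for i,row in df[['Sentence #','Word','Tag']].iterrows():
    if pd.isna(row['Sentence #']):
        # continuation row: same sentence as the previous row
        s.append(row['Word'])
        t.append(row['Tag'])
    else:
        # a non-empty 'Sentence #' marks the first word of a new sentence
        if len(s)>0:
            X.append(s)
            Y.append(t)
        s,t = [row['Word']],[row['Tag']]
# store the last sentence
X.append(s)
Y.append(t)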
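
The same grouping can also be expressed without an explicit row loop, which tends to be faster on large frames. The sketch below is an alternative, not the approach used above; it assumes the same column layout and runs on a small made-up frame (`toy`, `toy_X`, `toy_Y` are illustrative names).

```python
import pandas as pd

# Toy frame mimicking the corpus layout: only the first row of each
# sentence carries a value in 'Sentence #' (made-up data).
toy = pd.DataFrame({
    'Sentence #': ['Sentence: 1', None, None, 'Sentence: 2', None],
    'Word': ['John', 'lives', 'here', 'Paris', 'calls'],
    'Tag': ['B-per', 'O', 'O', 'B-geo', 'O'],
})

# Forward-fill the sentence label so every row knows its sentence
sent_id = toy['Sentence #'].ffill()

# sort=False keeps sentences in order of first appearance
grouped = toy.groupby(sent_id, sort=False)
toy_X = grouped['Word'].apply(list).tolist()
toy_Y = grouped['Tag'].apply(list).tolist()
print(toy_X)  # [['John', 'lives', 'here'], ['Paris', 'calls']]
print(toy_Y)  # [['B-per', 'O', 'O'], ['B-geo', 'O']]
```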

Next, we vectorize all words and tags:

In [29]:
def vectorize(seq):
    return [word2id.get(str(x).lower(), 0) for x in seq]

def tagify(seq):
    return [tag2id[x] for x in seq]

Xv = list(map(vectorize,X))
Yv = list(map(tagify,Y))

Xv[0], Yv[0]

([21830,
  30920,
  27059,
  25195,
  10916,
  13753,
  9075,
  6374,
  31202,
  20961,
  27753,
  24383,
  1201,
  8764,
  30069,
  20961,
  6819,
  30920,
  4965,
  6476,
  19950,
  27492,
  30123,
  19695],
 [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0])

In [30]:
X_data = keras.preprocessing.sequence.pad_sequences(Xv,padding='post')
Y_data = keras.preprocessing.sequence.pad_sequences(Yv,padding='post')
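
With `padding='post'`, shorter sequences are filled with zeros on the right. In this notebook id 0 is also the `<UNK>` word and the `O` tag, so padded positions look like real `O` tokens. The sketch below is a hand-rolled stand-in (not the Keras implementation) that shows what post-padding produces and how a boolean mask of real tokens could be derived; `pad_post` is a hypothetical helper name.

```python
import numpy as np

def pad_post(seqs, value=0):
    """Right-pad integer sequences with `value` to the length of the longest one."""
    maxlen = max(len(s) for s in seqs)
    out = np.full((len(seqs), maxlen), value, dtype='int32')
    lengths = np.array([len(s) for s in seqs])
    for row, s in enumerate(seqs):
        out[row, :len(s)] = s
    # mask is True at real tokens and False at padding
    mask = np.arange(maxlen)[None, :] < lengths[:, None]
    return out, mask

toy_padded, toy_mask = pad_post([[5, 3, 8], [7]])
print(toy_padded)  # [[5 3 8]
                   #  [7 0 0]]
print(toy_mask)    # [[ True  True  True]
                   #  [ True False False]]
```

The Keras `Embedding` layer offers a `mask_zero` option for a similar purpose, but using it would require reserving id 0 for padding only.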

## Defining Token Classification Network

In [31]:
maxlen = X_data.shape[1]
vocab_size = len(vocab) + 1  # word ids run from 0 ('<UNK>') to len(vocab)
num_tags = len(tags)
model = keras.models.Sequential([
    keras.layers.Embedding(vocab_size, 300, input_length=maxlen),
    keras.layers.Bidirectional(keras.layers.LSTM(units=100, activation='tanh', return_sequences=True)),
    keras.layers.Bidirectional(keras.layers.LSTM(units=100, activation='tanh', return_sequences=True)),
    keras.layers.TimeDistributed(keras.layers.Dense(num_tags, activation='softmax'))
])
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.summary()



In [32]:
model.fit(X_data,Y_data)

1499/1499 ━━━━━━━━━━━━━━━━━━━━ 70s 38ms/step - acc: 0.9732 - loss: 0.1218


<keras.src.callbacks.history.History at 0x7866ab0c05d0>
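
The `acc` value above is averaged over every position of the padded sequences, including padded positions whose label is 0 (`O`), so it may overstate how well real tokens are tagged. A minimal sketch of an accuracy that ignores padding is shown below; `masked_accuracy` is a hypothetical helper and the arrays are made-up toy data, not results from this model.

```python
import numpy as np

def masked_accuracy(y_true, y_pred, lengths):
    """Fraction of correctly tagged tokens, ignoring padded positions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # True for positions that belong to the real sentence
    mask = np.arange(y_true.shape[1])[None, :] < np.asarray(lengths)[:, None]
    return float((y_true == y_pred)[mask].mean())

# Toy example: two sentences of true length 3 and 1, padded to length 4
toy_true = [[1, 0, 2, 0], [3, 0, 0, 0]]
toy_pred = [[1, 0, 0, 0], [0, 0, 0, 0]]
toy_lengths = [3, 1]
print(masked_accuracy(toy_true, toy_pred, toy_lengths))          # 0.5
print(float((np.array(toy_true) == np.array(toy_pred)).mean()))  # 0.75
```

With the notebook's variables, the inputs would presumably be `Y_data`, `model.predict(X_data).argmax(-1)`, and `[len(s) for s in Xv]`.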

## Testing the Result

In [33]:
sent = 'John Smith went to Paris to attend a conference in cancer development institute'
words = sent.lower().split()
# vectorize() maps words missing from the vocabulary to id 0 instead of raising KeyError
v = keras.preprocessing.sequence.pad_sequences([vectorize(words)],padding='post',maxlen=maxlen)
res = model(v)[0]

In [35]:
r = np.argmax(res.numpy(),axis=1)
for i,w in zip(r,words):
    print(f"{w} -> {id2tag[i]}")

john -> B-per
smith -> I-per
went -> O
to -> O
paris -> B-geo
to -> O
attend -> O
a -> O
conference -> O
in -> O
cancer -> O
development -> O
institute -> I-org
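
The per-token tags follow the B-/I- scheme: `B-` opens an entity and `I-` continues it. To turn them into whole entities, consecutive tokens can be merged. The sketch below is one simple way to do that; `extract_entities` is a hypothetical helper, and treating a stray `I-` tag (such as `institute -> I-org` above) as the start of a new entity is an assumption, not a rule from the notebook.

```python
def extract_entities(words, tags):
    """Group B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for word, tag in zip(words, tags):
        prefix, _, etype = tag.partition('-')
        if prefix == 'I' and current and etype == current_type:
            current.append(word)            # continue the open entity
            continue
        if current:                         # close the open entity, if any
            entities.append((' '.join(current), current_type))
            current, current_type = [], None
        if prefix in ('B', 'I'):            # a stray I- tag starts a new entity
            current, current_type = [word], etype
    if current:
        entities.append((' '.join(current), current_type))
    return entities

# Words and tags copied from the prediction output above
demo_words = ['john', 'smith', 'went', 'to', 'paris', 'to', 'attend', 'a',
              'conference', 'in', 'cancer', 'development', 'institute']
demo_tags = ['B-per', 'I-per', 'O', 'O', 'B-geo', 'O', 'O', 'O',
             'O', 'O', 'O', 'O', 'I-org']
demo_entities = extract_entities(demo_words, demo_tags)
print(demo_entities)
# [('john smith', 'per'), ('paris', 'geo'), ('institute', 'org')]
```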
