<a href="https://colab.research.google.com/github/yash-datascience/Named-Entity-Recognition/blob/main/Named_Entity_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Named Entity Recognition
The named entities are pre-defined categories chosen according to the use case such as names of people, organizations, places, codes, time notations, monetary values, etc.

NER aims to assign a class to each token (usually a single word) in a sequence. Because of this, NER is also referred to as token classification.

## Imports

In [2]:
import pandas as pd
import numpy as np


## Loading and Exploring the Dataset

Information about the tagged entities:



*   geo = Geographical Entity

*   org = Organization

*   per = Person
*   gpe = Geopolitical Entity


*   tim = Time indicator

*   art = Artifact
*   eve = Event




*   nat = Natural Phenomenon



In [41]:
df=pd.read_csv("/content/ner_dataset.csv",encoding='latin1')
df.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


## DO you have NULLs?

In [42]:
df.isnull().sum()

Sentence #    1000616
Word                0
POS                 0
Tag                 0
dtype: int64

Looking at the dataset here we will use ffill feature to fill the null values.

In [43]:
df=df.fillna(method='ffill')
df.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O


In [44]:
df.isnull().sum()

Sentence #    0
Word          0
POS           0
Tag           0
dtype: int64

In [45]:
df['Tag'].value_counts()

O        887908
B-geo     37644
B-tim     20333
B-org     20143
I-per     17251
B-per     16990
I-org     16784
B-gpe     15870
I-geo      7414
I-tim      6528
B-art       402
B-eve       308
I-art       297
I-eve       253
B-nat       201
I-gpe       198
I-nat        51
Name: Tag, dtype: int64

In [46]:
n_tags=df['Tag'].nunique()
n_tags

17

In [47]:
n_words=df['Word'].nunique()
n_words

35178

In [48]:
tags=list(set(df['Tag']))
tags

['B-eve',
 'I-geo',
 'B-art',
 'I-art',
 'B-per',
 'I-per',
 'I-org',
 'I-nat',
 'B-geo',
 'O',
 'B-nat',
 'B-gpe',
 'I-tim',
 'I-eve',
 'I-gpe',
 'B-tim',
 'B-org']

## Label Encoding

In [49]:
from sklearn.preprocessing import LabelEncoder

In [50]:
df["Sentence #"] = LabelEncoder().fit_transform(df["Sentence #"] )
df.head(30)

Unnamed: 0,Sentence #,Word,POS,Tag
0,0,Thousands,NNS,O
1,0,of,IN,O
2,0,demonstrators,NNS,O
3,0,have,VBP,O
4,0,marched,VBN,O
5,0,through,IN,O
6,0,London,NNP,B-geo
7,0,to,TO,O
8,0,protest,VB,O
9,0,the,DT,O


## Renaming Columns

In [51]:
df.rename(columns={"Sentence #":"sentence_id","Word":"words","Tag":"labels"}, inplace =True)

In [52]:
df["labels"] = df["labels"].str.upper()

## Preparing Data for Modeling

In [53]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [54]:
X= df[["sentence_id","words","POS"]]
Y =df["labels"]

In [55]:
x_train, x_test, y_train, y_test = train_test_split(X,Y, test_size =0.2)

In [56]:
#building up train data and test data
train_data = pd.DataFrame({"sentence_id":x_train["sentence_id"],"words":x_train["words"],"POS":x_train['POS'],"labels":y_train})
test_data = pd.DataFrame({"sentence_id":x_test["sentence_id"],"words":x_test["words"],"POS":x_train['POS'],"labels":y_test})

In [57]:
train_data

Unnamed: 0,sentence_id,words,POS,labels
408561,9618,in,IN,O
70143,24335,exported,VBN,O
708782,24880,vote,NN,O
572694,17989,tourism,NN,O
774181,28213,move,VB,O
...,...,...,...,...
259178,2069,",",",",O
365838,7484,.,.,O
131932,43545,officials,NNS,O
671155,22964,were,VBD,O


## Model Training

In [None]:
!pip install simpletransformers

In [58]:
from simpletransformers.ner import NERModel,NERArgs
label = df["labels"].unique().tolist()
label

['O',
 'B-GEO',
 'B-GPE',
 'B-PER',
 'I-GEO',
 'B-ORG',
 'I-ORG',
 'B-TIM',
 'B-ART',
 'I-ART',
 'I-PER',
 'I-GPE',
 'I-TIM',
 'B-NAT',
 'B-EVE',
 'I-EVE',
 'I-NAT']

## Model Fitting

In [59]:
args = NERArgs()
args.num_train_epochs = 1
args.learning_rate = 1e-4
args.overwrite_output_dir =True
args.train_batch_size = 32
args.eval_batch_size = 32

In [60]:
model = NERModel('bert', 'bert-base-cased',labels=label,args =args)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cas

In [61]:
model.train_model(train_data,eval_data = test_data,acc=accuracy_score)

  0%|          | 0/3 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/1499 [00:00<?, ?it/s]

  model.parameters(), args.max_grad_norm


(1499, 0.19164212737692204)

## Model Evaluation

In [62]:
result, model_outputs, preds_list = model.eval_model(test_data)

  0%|          | 0/3 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/1460 [00:00<?, ?it/s]

In [63]:
result

{'eval_loss': 0.16941998900258787,
 'f1_score': 0.7850450800148209,
 'precision': 0.8123790396202301,
 'recall': 0.7594906459101461}

In [64]:
prediction, model_output = model.predict(["What is the new name of Allahbad"])

  0%|          | 0/1 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

In [65]:
prediction

[[{'What': 'O'},
  {'is': 'O'},
  {'the': 'O'},
  {'new': 'O'},
  {'name': 'O'},
  {'of': 'O'},
  {'Allahbad': 'B-GEO'}]]