# Named Entity Recognition using CRF model
Named Entity Recognition (NER) is a common task in Natural Language Processing (NLP). An entity is a span of text that is of interest, such as the name of a person or a place. NER extracts these entities from a corpus and classifies them into predefined categories such as location, organization, person name and so on.
Information about labels:
* geo = Geographical Entity
* org = Organization
* per = Person
* gpe = Geopolitical Entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural Phenomenon

        1. Total Words Count = 1354149 
        2. Target Data Column: Tag

#### Importing Libraries

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn_crfsuite import CRF
from sklearn_crfsuite.metrics import flat_f1_score
from sklearn_crfsuite.metrics import flat_classification_report

In [2]:
#Reading the csv file
df = pd.read_csv('ner_dataset.csv', encoding = "ISO-8859-1")

In [3]:
#Display first 10 rows
df.head(10)


Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O
5,,through,IN,O
6,,London,NNP,B-geo
7,,to,TO,O
8,,protest,VB,O
9,,the,DT,O


In [19]:
df[df['Tag'] != 'O'].head(10)

Unnamed: 0,Sentence #,Word,POS,Tag
6,Sentence: 1,London,NNP,B-geo
12,Sentence: 1,Iraq,NNP,B-geo
18,Sentence: 1,British,JJ,B-gpe
42,Sentence: 2,Bush,NNP,B-per
65,Sentence: 3,Hyde,NNP,B-geo
66,Sentence: 3,Park,NNP,I-geo
94,Sentence: 5,Britain,NNP,B-geo
97,Sentence: 5,Labor,NNP,B-org
98,Sentence: 5,Party,NNP,I-org
102,Sentence: 5,English,JJ,B-gpe


In [4]:
df.describe()

Unnamed: 0,Sentence #,Word,POS,Tag
count,47959,1048565,1048575,1048575
unique,47959,35177,42,17
top,Sentence: 1,the,NN,O
freq,1,52573,145807,887908


#### Observations : 
* There are 47959 sentences in the dataset.
* The number of unique (non-null) words in the dataset is 35177.
* There are 17 labels (tags) in total.

In [5]:
#Displaying the unique Tags
df['Tag'].unique()

array(['O', 'B-geo', 'B-gpe', 'B-per', 'I-geo', 'B-org', 'I-org', 'B-tim',
       'B-art', 'I-art', 'I-per', 'I-gpe', 'I-tim', 'B-nat', 'B-eve',
       'I-eve', 'I-nat'], dtype=object)
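
The tags above follow the usual BIO convention: a `B-` prefix marks the first token of an entity, `I-` marks a token that continues it, and `O` marks tokens outside any entity. As a minimal sketch of how such tags map back to whole entities, the hypothetical helper `bio_to_spans` below (not part of the notebook) groups tokens into `(text, type)` pairs. It assumes that a stray `I-` tag that does not continue an open entity is simply dropped.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one that was open, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            # (assumption: such stray tokens are dropped).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Made-up example tokens, tagged in the same style as the dataset.
tokens = ["rally", "in", "Hyde", "Park", ",", "London"]
tags = ["O", "O", "B-geo", "I-geo", "O", "B-geo"]
print(bio_to_spans(tokens, tags))
# [('Hyde Park', 'geo'), ('London', 'geo')]
```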

In [6]:
#Checking null values, if any.
df.isnull().sum()

Sentence #    1000616
Word               10
POS                 0
Tag                 0
dtype: int64

The 'Sentence #' column has many missing values because only the first word of each sentence carries the sentence label. We fill them with the pandas forward-fill ('ffill') method, which propagates the last valid observation forward to the following rows.
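
To illustrate the forward fill described above, here is a small sketch on a made-up toy frame (the words and the name `toy` are illustrative, not taken from the dataset). It fills only the 'Sentence #' column, which is one possible design choice: filling the whole frame would also overwrite the few missing 'Word' values with the preceding word.

```python
import pandas as pd

# Toy frame mimicking the layout of the dataset: only the first
# row of each sentence carries the sentence label.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "Families", "of"],
})

# Propagate the last seen label down to the rows that follow it.
toy["Sentence #"] = toy["Sentence #"].ffill()
print(toy["Sentence #"].tolist())
# ['Sentence: 1', 'Sentence: 1', 'Sentence: 1', 'Sentence: 2', 'Sentence: 2']
```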

In [7]:
# Forward-fill missing values; DataFrame.ffill() replaces the deprecated fillna(method='ffill')
df = df.ffill()


In [8]:
# Helper class that groups the rows into sentences.
# Each sentence is a list of (word, POS, tag) tuples.
class sentence(object):
    def __init__(self, df):
        self.n_sent = 1
        self.df = df
        self.empty = False
        agg = lambda s : [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                       s['POS'].values.tolist(),
                                                       s['Tag'].values.tolist())]
        self.grouped = self.df.groupby("Sentence #").apply(agg)
        self.sentences = [s for s in self.grouped]
        
    def get_text(self):
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent +=1
            return s
        except KeyError:
            return None

In [9]:
#Displaying one full sentence
getter = sentence(df)
sentences = [" ".join([s[0] for s in sent]) for sent in getter.sentences]
sentences[0]


'Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .'

In [10]:
#sentence with its pos and tag.
sent = getter.get_text()
print(sent)

[('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')]


Getting all the sentences in the dataset.

In [11]:
sentences = getter.sentences
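
One detail worth keeping in mind: pandas `groupby` sorts the group keys by default, and since the sentence labels are strings, "Sentence: 10" sorts before "Sentence: 2". The list of sentences is therefore not necessarily in document order. This does not matter for training, but if the original order is needed, passing `sort=False` keeps groups in order of first appearance. The sketch below shows the difference on a made-up toy frame; `group_sentences` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Made-up rows with three sentence labels, already forward-filled.
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", "Sentence: 1", "Sentence: 2", "Sentence: 10"],
    "Word": ["London", "calling", "Hello", "Bye"],
    "POS": ["NNP", "VBG", "UH", "UH"],
    "Tag": ["B-geo", "O", "O", "O"],
})


def group_sentences(frame, sort):
    """Return one list of (word, pos, tag) tuples per sentence."""
    sentences = []
    for _, group in frame.groupby("Sentence #", sort=sort):
        sentences.append(list(zip(group["Word"], group["POS"], group["Tag"])))
    return sentences


# First word of each sentence, in the order groupby yields them.
sorted_first_words = [s[0][0] for s in group_sentences(toy, sort=True)]
ordered_first_words = [s[0][0] for s in group_sentences(toy, sort=False)]
print(sorted_first_words)   # ['London', 'Bye', 'Hello']  (lexicographic key order)
print(ordered_first_words)  # ['London', 'Hello', 'Bye']  (order of first appearance)
```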

#### Feature Preparation
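Each token is described by a dictionary of hand-crafted features: the lowercased word, its suffixes, its capitalization and digit flags, its POS tag, and the same kind of information for the previous and next tokens. This is a common baseline feature set for CRF-based NER (it matches the one used in the sklearn-crfsuite tutorial), and it can be customized as needed.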

In [12]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
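
To make the shape of the feature dictionaries concrete, here is a small self-contained sketch. `token_features` is a hypothetical, cut-down variant of `word2features` with fewer features, used only so the example runs on its own. It prints the features produced for the last token of a three-token sentence.

```python
def token_features(sent, i):
    """Simplified, illustrative variant of word2features (fewer features)."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],
        "word.istitle()": word.istitle(),
        "postag": postag,
    }
    # Context from the previous token, or a beginning-of-sentence flag.
    if i > 0:
        features["-1:word.lower()"] = sent[i - 1][0].lower()
    else:
        features["BOS"] = True
    # Context from the next token, or an end-of-sentence flag.
    if i < len(sent) - 1:
        features["+1:word.lower()"] = sent[i + 1][0].lower()
    else:
        features["EOS"] = True
    return features


# A fragment of the first sentence, as (word, POS, tag) tuples.
sent = [("marched", "VBN", "O"), ("through", "IN", "O"), ("London", "NNP", "B-geo")]
london = token_features(sent, 2)
for name, value in london.items():
    print(f"{name:>16} = {value!r}")
```

The CRF receives one such dictionary per token, so a sentence becomes a list of dictionaries and its labels a list of tag strings of the same length.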

In [13]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [15]:

crf = CRF(algorithm = 'lbfgs',
         c1 = 0.1,
         c2 = 0.1,
         max_iterations = 100,
         all_possible_transitions = False)
crf.fit(X_train, y_train)

In [16]:
#Predicting on the test set.
y_pred = crf.predict(X_test)

In [23]:
import spacy

# Load the spaCy English language model
nlp = spacy.load("en_core_web_sm") 

# Example sentence (named `text` so it does not shadow the `sentence` class defined earlier)
text = "India is one of the best countries and Apple is the best organisation"

# Process the sentence with spaCy
doc = nlp(text)

# Print the tokens and their POS tags
for token in doc:
    print(f"{token.text:{10}} {token.pos_:{10}}")
    


India      PROPN     
is         AUX       
one        NUM       
of         ADP       
the        DET       
best       ADJ       
countries  NOUN      
and        CCONJ     
Apple      PROPN     
is         AUX       
the        DET       
best       ADJ       
organisation NOUN      


#### Evaluating the model performance.
We evaluate the model with precision, recall and F1-score. Accuracy is not a good metric for this dataset because the classes are heavily imbalanced: the vast majority of tokens carry the 'O' tag.
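
The effect of class imbalance on these metrics can be sketched in plain Python. The example below uses made-up labels (8 'O' tokens and 2 'B-art' tokens) and a model that predicts 'O' everywhere; `per_class_scores` is a hypothetical helper written for this illustration, not the sklearn-crfsuite implementation. Accuracy and the weighted F1 still look acceptable, while the macro-averaged F1 exposes the missed class.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for flat label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1, actual)
    return scores


# Made-up toy labels: the model never predicts the rare class.
y_true = ["O"] * 8 + ["B-art"] * 2
y_pred = ["O"] * 10

scores = per_class_scores(y_true, y_pred)
total = sum(s[3] for s in scores.values())
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(s[2] for s in scores.values()) / len(scores)
weighted_f1 = sum(s[2] * s[3] for s in scores.values()) / total
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
# accuracy=0.80 macro_f1=0.44 weighted_f1=0.71
```

Because the weighted average scales each class by its support, a very frequent class such as 'O' largely determines the result, so the per-class report is the more informative view.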

In [17]:
f1_score = flat_f1_score(y_test, y_pred, average = 'weighted')
print(f1_score)

0.972681475414675


In [18]:
report = flat_classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

       B-art       0.47      0.11      0.17        75
       B-eve       0.49      0.33      0.39        58
       B-geo       0.86      0.92      0.89      7555
       B-gpe       0.98      0.94      0.96      3251
       B-nat       0.62      0.37      0.46        41
       B-org       0.81      0.74      0.77      4028
       B-per       0.86      0.84      0.85      3468
       B-tim       0.93      0.88      0.91      4013
       I-art       0.10      0.03      0.04        40
       I-eve       0.41      0.20      0.27        56
       I-geo       0.82      0.82      0.82      1476
       I-gpe       0.86      0.50      0.63        48
       I-nat       1.00      0.33      0.50         6
       I-org       0.83      0.81      0.82      3261
       I-per       0.87      0.90      0.88      3627
       I-tim       0.85      0.74      0.79      1288
           O       0.99      1.00      0.99    177537

    accuracy              

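The weighted F1-score of about 0.97 looks good, but it is dominated by the very frequent 'O' tag. The frequent entity types (geo, gpe, per, org, tim) score reasonably well, with F1 between 0.77 and 0.96 for their B- tags, while the rare types (art, eve, nat) have much lower recall and F1.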