Named entity recognition (NER) is the task of recognizing names, such as person, organization, and location names, and numeric expressions, such as times, dates, monetary amounts, and percentages, in unstructured text. The aim is to develop practical, domain-independent techniques that detect named entities automatically and with high accuracy.

Our task is split into the following phases:

    1] Data Loading

    2] Data Pre-Processing

    3] Training ML models

    4] Testing and Evaluation of Model Accuracy

    5] Model Monitoring by checking weights

##### Loading Data

We load a corpus annotated with IOB and POS tags, which is available on Kaggle at the link below, and take a quick look at its first few rows.
https://www.kaggle.com/abhinavwalia95/how-to-loading-and-fitting-dataset-to-scikit/data
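To make the IOB scheme concrete, here is a minimal sketch of how IOB tags map back to entity spans: a `B-` tag opens an entity, an `I-` tag of the same type continues it, and `O` marks tokens outside any entity. The helper `iob_to_spans` and the example sentence are hypothetical and written only for illustration; they are not part of the dataset or of any library.

```python
def iob_to_spans(words, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag continues the entity that is currently open.
            current_words.append(word)
        else:
            # "O" (or a stray I- tag) closes any open entity.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


# Hypothetical sentence and tags, written by hand for illustration.
words = ["Barack", "Obama", "visited", "London", "on", "Monday"]
tags = ["B-per", "I-per", "O", "B-geo", "O", "B-tim"]
spans = iob_to_spans(words, tags)
print(spans)
```

The classifiers in this notebook predict one tag per token, so a grouping step like this is only needed when whole entities, rather than individual tags, are the desired output.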

In [2]:
import pandas as pd

dataset = pd.read_csv('ner_dataset.csv', encoding = "ISO-8859-1")

In [3]:
dataset.shape

(1048575, 4)

In [4]:
dataset.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


Wow, that's a huge dataset, perfect for training our models. For speedy processing on my laptop, though, I will use only the first 50,000 rows.

In [5]:
quick_df = dataset[:50000]
quick_df.shape

(50000, 4)

Let us look for NA values and handle them selectively

In [6]:
quick_df.isnull().sum()

Sentence #    47730
Word              0
POS               0
Tag               0
dtype: int64

In [7]:
# Only the first token of each sentence carries the sentence label, so forward-fill it
quick_df = quick_df.ffill()
quick_df

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O
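Forward filling works here because the sentence label appears only on the first token of each sentence, so every missing value should take the label of the row above it. The sketch below shows the effect on a small hand-made frame; the frame `toy` and its values are made up and only mimic the layout of the real data.

```python
import pandas as pd

# Toy frame mimicking the layout of ner_dataset.csv (values are made up).
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "Families", "gathered"],
})

# ffill() copies the last seen non-null value downwards.
filled = toy.ffill()
print(filled["Sentence #"].tolist())
```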


In [8]:
quick_df['Sentence #'].nunique(),quick_df.Word.nunique(),quick_df.Tag.nunique()

(2270, 7464, 17)

We have 2,270 unique sentences that contain 7,464 unique words, labeled with 17 distinct tags.

In [9]:
quick_df.groupby('Tag').size()

Tag
B-art       48
B-eve       39
B-geo     1490
B-gpe      968
B-nat       18
B-org      959
B-per      789
B-tim      880
I-art       27
I-eve       33
I-geo      303
I-gpe       31
I-nat        9
I-org      689
I-per      931
I-tim      239
O        42547
dtype: int64

Let us create the dataset for training our ML model, with separate input features and output labels.

In [10]:
X = quick_df.drop('Tag', axis=1)
X.head()

Unnamed: 0,Sentence #,Word,POS
0,Sentence: 1,Thousands,NNS
1,Sentence: 1,of,IN
2,Sentence: 1,demonstrators,NNS
3,Sentence: 1,have,VBP
4,Sentence: 1,marched,VBN


In [11]:
X.shape

(50000, 3)

In [12]:
X.columns

Index(['Sentence #', 'Word', 'POS'], dtype='object')

In [13]:
#Convert every record in to a dictionary
X.to_dict('records')

[{'POS': 'NNS', 'Sentence #': 'Sentence: 1', 'Word': 'Thousands'},
 {'POS': 'IN', 'Sentence #': 'Sentence: 1', 'Word': 'of'},
 {'POS': 'NNS', 'Sentence #': 'Sentence: 1', 'Word': 'demonstrators'},
 {'POS': 'VBP', 'Sentence #': 'Sentence: 1', 'Word': 'have'},
 {'POS': 'VBN', 'Sentence #': 'Sentence: 1', 'Word': 'marched'},
 {'POS': 'IN', 'Sentence #': 'Sentence: 1', 'Word': 'through'},
 {'POS': 'NNP', 'Sentence #': 'Sentence: 1', 'Word': 'London'},
 {'POS': 'TO', 'Sentence #': 'Sentence: 1', 'Word': 'to'},
 {'POS': 'VB', 'Sentence #': 'Sentence: 1', 'Word': 'protest'},
 {'POS': 'DT', 'Sentence #': 'Sentence: 1', 'Word': 'the'},
 {'POS': 'NN', 'Sentence #': 'Sentence: 1', 'Word': 'war'},
 {'POS': 'IN', 'Sentence #': 'Sentence: 1', 'Word': 'in'},
 {'POS': 'NNP', 'Sentence #': 'Sentence: 1', 'Word': 'Iraq'},
 {'POS': 'CC', 'Sentence #': 'Sentence: 1', 'Word': 'and'},
 {'POS': 'VB', 'Sentence #': 'Sentence: 1', 'Word': 'demand'},
 {'POS': 'DT', 'Sentence #': 'Sentence: 1', 'Word': 'the'},
 

We need to transform our input records into numeric vectors using DictVectorizer, and then split them into train and test sets for fitting and evaluating the models.

In [14]:
import numpy as np
from sklearn.feature_extraction import DictVectorizer


v = DictVectorizer(sparse=False)
X = v.fit_transform(X.to_dict('records'))
X.shape

(50000, 9774)
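As a minimal sketch of what DictVectorizer does with string-valued features, the example below encodes two hand-written records. Each distinct `key=value` pair becomes its own 0/1 column. The records are made up for illustration and use only two of the three feature columns.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hand-written records in the same shape as X.to_dict('records').
records = [
    {"Word": "London", "POS": "NNP"},
    {"Word": "of", "POS": "IN"},
]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(records)

print(vec.feature_names_)  # one column per distinct "key=value" pair
print(encoded)
```

This also accounts for the 9,774 columns above: 2,270 distinct sentence labels plus 7,464 distinct words gives 9,734 columns, which would leave 40 for the distinct POS values. With `sparse=False` every one of those columns is stored densely; leaving `sparse` at its default would return a SciPy sparse matrix instead, which is likely to use far less memory for one-hot features like these.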

In [15]:
y = quick_df.Tag.values

In [16]:
classes = np.unique(y)
classes = classes.tolist()
classes

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O']

Tag “O” is by far the most common tag, and including it would make our results look much better than they actually are. So we exclude tag “O” when we compute the classification metrics.

In [17]:
new_classes = classes.copy()
# classes is sorted alphabetically, so 'O' is the last element
new_classes.pop()
new_classes

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim']
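A quick calculation shows how much the “O” tag would inflate the scores. Using the tag counts printed earlier (42,547 “O” tokens out of 50,000), a trivial model that labels every token “O” would already look accurate while finding no entities at all. The variable names below are chosen for illustration.

```python
# Counts taken from the groupby('Tag').size() output earlier in the notebook.
total_tokens = 50000
o_tokens = 42547

# Accuracy of a trivial model that labels every token "O".
baseline_accuracy = o_tokens / total_tokens
print(f"{baseline_accuracy:.3f}")  # 0.851
```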

In [18]:
X.shape, y.shape

((50000, 9774), (50000,))

Let us now split the data into training and test sets using a 70/30 split.

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=0)

Let us check the shapes of the resulting training and test sets.

In [20]:
X_train.shape, y_train.shape

((35000, 9774), (35000,))

In [21]:
X_test.shape, y_test.shape

((15000, 9774), (15000,))

#### Algorithm Selection

The full dataset is large, and its feature matrix may not fit in a single machine's memory. We will therefore use out-of-core algorithms: estimators that support the partial_fit method and can learn from the data one batch at a time instead of requiring all of it at once.
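The idea behind partial_fit can be sketched as a loop over batches. The example below uses small synthetic data, so `X_demo`, `y_demo`, and `batch_size` are made-up stand-ins rather than the notebook's variables; in a real out-of-core setting each batch would be read from disk instead of sliced from an in-memory array.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
# Synthetic non-negative count features and three made-up labels.
X_demo = rng.integers(0, 3, size=(300, 20))
y_demo = rng.choice(["B-geo", "B-per", "O"], size=300)
all_classes = np.unique(y_demo)

model = MultinomialNB(alpha=0.01)
batch_size = 100
for start in range(0, len(X_demo), batch_size):
    stop = start + batch_size
    # classes must be given on the first call, because a single batch
    # may not contain every label; passing it every time is harmless.
    model.partial_fit(X_demo[start:stop], y_demo[start:stop], classes=all_classes)

print(model.class_count_.sum())  # total number of samples seen so far
```

The cells that follow call partial_fit once on the whole training set, which is equivalent to a single large batch.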

### Naive Bayes classifier for multinomial models

In [22]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.01)
nb.partial_fit(X_train, y_train, classes)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [23]:
from sklearn.metrics import classification_report

print(classification_report(y_pred=nb.predict(X_test), y_true=y_test, labels = new_classes))

             precision    recall  f1-score   support

      B-art       0.16      0.29      0.21        17
      B-eve       0.12      0.36      0.18        11
      B-geo       0.66      0.57      0.61       441
      B-gpe       0.64      0.73      0.68       297
      B-nat       0.33      0.67      0.44         3
      B-org       0.48      0.49      0.48       272
      B-per       0.39      0.48      0.43       233
      B-tim       0.60      0.74      0.67       246
      I-art       0.08      0.25      0.12         4
      I-eve       0.54      0.64      0.58        11
      I-geo       0.42      0.49      0.45       107
      I-gpe       0.00      0.00      0.00        10
      I-nat       0.00      0.00      0.00         2
      I-org       0.47      0.51      0.49       180
      I-per       0.53      0.45      0.49       279
      I-tim       0.18      0.29      0.22        83

avg / total       0.53      0.55      0.53      2196
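To read the report, recall that the f1-score is the harmonic mean of precision and recall. The short sketch below recomputes one row of the table as a sanity check; `f1_score_from` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# B-geo row of the Naive Bayes report: precision 0.66, recall 0.57.
b_geo_f1 = f1_score_from(0.66, 0.57)
print(round(b_geo_f1, 2))  # 0.61
```

Because the inputs are the rounded values from the table, recomputed scores for other rows may differ from the printed ones in the last digit.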



### Perceptron

In [25]:
from sklearn.linear_model import Perceptron

per = Perceptron(verbose=10, n_jobs=-1, max_iter=5)
per.partial_fit(X_train, y_train, classes)

-- Epoch 1-- Epoch 1-- Epoch 1-- Epoch 1



Norm: 40.45, NNZs: 928, Bias: -4.000000, T: 35000, Avg. loss: 0.021657Norm: 10.77, NNZs: 110, Bias: -4.000000, T: 35000, Avg. loss: 0.001886
Total training time: 1.01 seconds.

Total training time: 0.96 seconds.Norm: 50.86, NNZs: 1458, Bias: -3.000000, T: 35000, Avg. loss: 0.042629
Norm: 11.00, NNZs: 96, Bias: -3.000000, T: 35000, Avg. loss: 0.001400
Total training time: 1.06 seconds.

Total training time: 1.06 seconds.-- Epoch 1

-- Epoch 1-- Epoch 1

-- Epoch 1
Norm: 8.43, NNZs: 53, Bias: -3.000000, T: 35000, Avg. loss: 0.000743
Total training time: 0.90 seconds.
-- Epoch 1
Norm: 36.92, NNZs: 926, Bias: -3.000000, T: 35000, Avg. loss: 0.022857
Total training time: 0.94 seconds.
-- Epoch 1
Norm: 42.78, NNZs: 1153, Bias: -4.000000, T: 35000, Avg. loss: 0.035686
Total training time: 0.96 seconds.
-- Epoch 1
Norm: 35.34, NNZs: 718, Bias: -3.000000, T: 35000, Avg. loss: 0.018086
Total training time: 0.97 seconds.
-- Epoch 1


[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    3.0s


Norm: 9.64, NNZs: 87, Bias: -3.000000, T: 35000, Avg. loss: 0.001343
Total training time: 1.11 seconds.
-- Epoch 1
Norm: 9.43, NNZs: 72, Bias: -3.000000, T: 35000, Avg. loss: 0.001171
Total training time: 1.15 seconds.
-- Epoch 1
Norm: 25.18, NNZs: 443, Bias: -4.000000, T: 35000, Avg. loss: 0.009914
Total training time: 1.14 seconds.
-- Epoch 1
Norm: 8.43, NNZs: 62, Bias: -3.000000, T: 35000, Avg. loss: 0.001314
Total training time: 1.17 seconds.
-- Epoch 1


[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    4.2s
[Parallel(n_jobs=-1)]: Done  12 out of  17 | elapsed:    4.2s remaining:    1.7s


Norm: 6.16, NNZs: 27, Bias: -2.000000, T: 35000, Avg. loss: 0.000200
Total training time: 1.16 seconds.
Norm: 39.09, NNZs: 921, Bias: -4.000000, T: 35000, Avg. loss: 0.024000
Total training time: 1.10 seconds.
-- Epoch 1
Norm: 46.01, NNZs: 1281, Bias: -5.000000, T: 35000, Avg. loss: 0.027629
Total training time: 1.16 seconds.
Norm: 21.98, NNZs: 350, Bias: -5.000000, T: 35000, Avg. loss: 0.009829
Total training time: 1.11 seconds.


[Parallel(n_jobs=-1)]: Done  14 out of  17 | elapsed:    5.3s remaining:    1.1s


Norm: 52.86, NNZs: 1538, Bias: 2.000000, T: 35000, Avg. loss: 0.045743
Total training time: 0.61 seconds.


[Parallel(n_jobs=-1)]: Done  17 out of  17 | elapsed:    5.9s finished


Perceptron(alpha=0.0001, class_weight=None, eta0=1.0, fit_intercept=True,
      max_iter=5, n_iter=None, n_jobs=-1, penalty=None, random_state=0,
      shuffle=True, tol=None, verbose=10, warm_start=False)

In [26]:
print(classification_report(y_pred=per.predict(X_test), y_true=y_test, labels=new_classes))

             precision    recall  f1-score   support

      B-art       0.00      0.00      0.00        17
      B-eve       0.25      0.09      0.13        11
      B-geo       0.34      0.95      0.50       441
      B-gpe       0.91      0.62      0.74       297
      B-nat       0.00      0.00      0.00         3
      B-org       0.61      0.27      0.38       272
      B-per       0.67      0.42      0.52       233
      B-tim       0.79      0.77      0.78       246
      I-art       0.00      0.00      0.00         4
      I-eve       0.75      0.27      0.40        11
      I-geo       0.79      0.14      0.24       107
      I-gpe       0.14      0.10      0.12        10
      I-nat       0.00      0.00      0.00         2
      I-org       0.67      0.19      0.29       180
      I-per       0.84      0.24      0.37       279
      I-tim       0.50      0.10      0.16        83

avg / total       0.65      0.50      0.48      2196





### Linear classifiers with SGD training

SGD is not a specific ML model; it is an optimization method. The loss function passed to SGDClassifier determines which linear model is trained.

So SGDClassifier(class_weight='balanced', alpha=i, penalty='l2', loss='hinge', random_state=42) is an implementation of a linear SVM,

while SGDClassifier(class_weight='balanced', alpha=i, penalty='l2', loss='log', random_state=42) is an implementation of logistic regression (recent scikit-learn releases name this loss 'log_loss').
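The difference between the two losses can be sketched for a single binary example. Both are functions of the margin y * (w . x + b) with y in {-1, +1}. The functions below are plain-Python illustrations of the textbook formulas, not scikit-learn's internal implementation.

```python
import math


def hinge_loss(margin):
    """Loss used by a linear SVM: exactly zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)


def log_loss(margin):
    """Loss used by logistic regression: shrinks but never reaches zero."""
    return math.log(1.0 + math.exp(-margin))


# margin = y * (w . x + b), with y in {-1, +1}
for margin in (-1.0, 0.0, 1.0, 2.0):
    print(margin, hinge_loss(margin), round(log_loss(margin), 4))
```

With the hinge loss, examples classified correctly with a margin of at least 1 contribute nothing to the update, whereas the log loss keeps nudging the weights on every example.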

In [28]:
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
sgd.partial_fit(X_train, y_train, classes)



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

In [29]:
print(classification_report(y_pred=sgd.predict(X_test), y_true=y_test, labels=new_classes))

             precision    recall  f1-score   support

      B-art       0.00      0.00      0.00        17
      B-eve       0.00      0.00      0.00        11
      B-geo       0.73      0.51      0.60       441
      B-gpe       0.96      0.53      0.68       297
      B-nat       0.00      0.00      0.00         3
      B-org       0.54      0.36      0.43       272
      B-per       0.86      0.36      0.51       233
      B-tim       0.94      0.67      0.78       246
      I-art       0.00      0.00      0.00         4
      I-eve       0.33      0.09      0.14        11
      I-geo       0.64      0.42      0.51       107
      I-gpe       0.00      0.00      0.00        10
      I-nat       0.00      0.00      0.00         2
      I-org       0.61      0.38      0.47       180
      I-per       0.31      0.96      0.47       279
      I-tim       0.00      0.00      0.00        83

avg / total       0.66      0.51      0.53      2196





### Passive Aggressive Classifier

In [30]:
from sklearn.linear_model import PassiveAggressiveClassifier

pa = PassiveAggressiveClassifier()
pa.partial_fit(X_train, y_train, classes)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              fit_intercept=True, loss='hinge', max_iter=None, n_iter=None,
              n_jobs=1, random_state=None, shuffle=True, tol=None,
              verbose=0, warm_start=False)

In [31]:
print(classification_report(y_pred=pa.predict(X_test), y_true=y_test, labels=new_classes))

             precision    recall  f1-score   support

      B-art       1.00      0.06      0.11        17
      B-eve       0.33      0.18      0.24        11
      B-geo       0.81      0.09      0.16       441
      B-gpe       0.85      0.64      0.73       297
      B-nat       0.00      0.00      0.00         3
      B-org       0.20      0.88      0.33       272
      B-per       0.88      0.34      0.49       233
      B-tim       0.70      0.81      0.75       246
      I-art       0.00      0.00      0.00         4
      I-eve       1.00      0.09      0.17        11
      I-geo       0.79      0.21      0.34       107
      I-gpe       0.27      0.40      0.32        10
      I-nat       0.00      0.00      0.00         2
      I-org       0.61      0.37      0.46       180
      I-per       0.80      0.34      0.48       279
      I-tim       0.34      0.22      0.26        83

avg / total       0.69      0.44      0.44      2196



