<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-Packages" data-toc-modified-id="Load-Packages-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load Packages</a></span></li><li><span><a href="#Load-Data" data-toc-modified-id="Load-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Load Data</a></span></li><li><span><a href="#Sentence" data-toc-modified-id="Sentence-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Sentence</a></span></li><li><span><a href="#Conditional-Random-Fields" data-toc-modified-id="Conditional-Random-Fields-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conditional Random Fields</a></span><ul class="toc-item"><li><span><a href="#FE" data-toc-modified-id="FE-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>FE</a></span></li></ul></li><li><span><a href="#EOF" data-toc-modified-id="EOF-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>EOF</a></span></li></ul></div>

***
<br>
<span style="font-size:30pt; color:darkslateblue;"><b>
Named Entity Recognition  
<br>
With Conditional Random Fields  
</b></span>

<img src="ner1.png" alt="Drawing" style="width: 700px;" align="left"/>

In this analysis, we use the following dataset from Kaggle:  
  
https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus
***

# Load Packages

In [29]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import gc
import os
import psutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set(font_scale=1.2)

import warnings

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_rows', 100)
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

plt.style.use('ggplot')

import spacy
from spacy import displacy

from tqdm import tqdm
from sklearn.base import BaseEstimator, TransformerMixin

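# sklearn.cross_validation was removed from scikit-learn; cross_val_predict lives in model_selection
from sklearn.model_selection import cross_val_predict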
from sklearn.metrics import classification_report
from sklearn_crfsuite.metrics import flat_classification_report

# Load Data

In [2]:
data = pd.read_csv('./ner_dataset.zip')

In [3]:
data.shape

(1048575, 4)

In [4]:
data.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


* POS : part-of-speech tag of the word
* Tag : named-entity tag of the word (`O` marks a token outside any entity)

Number of tagged entities:

'O': 1146068, 'geo-nam': 58388, 'org-nam': 48034, 'per-nam': 23790, 'gpe-nam': 20680, 'tim-dat': 12786, 'tim-dow': 11404, 'per-tit': 9800, 'per-fam': 8152, 'tim-yoc': 5290, 'tim-moy': 4262, 'per-giv': 2413, 'tim-clo': 891, 'art-nam': 866, 'eve-nam': 602, 'nat-nam': 300, 'tim-nam': 146, 'eve-ord': 107, 'per-ini': 60, 'org-leg': 60, 'per-ord': 38, 'tim-dom': 10, 'per-mid': 1, 'art-add': 1

Essential info about entities:

geo = Geographical Entity  
org = Organization  
per = Person  
gpe = Geopolitical Entity  
tim = Time indicator  
art = Artifact  
eve = Event  
nat = Natural Phenomenon  

In [5]:
data.POS.nunique()

42

In [6]:
data.Tag.nunique()

17

In [7]:
data.Word.nunique()

35178

# Sentence

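TODO : `Sentence #` is filled in only on the first row of each sentence (the remaining rows are empty, as `data.head()` shows), so grouping on it as-is yields one-token sentences. The column needs to be forward-filled before grouping.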

In [15]:
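class SentenceGetter(object):
    """Group the token rows of `data` into sentences of (word, POS, tag) triples."""

    def __init__(self, data):
        self.n_sent = 1  # 1-based index of the next sentence returned by get_next()
        self.data = data
        self.empty = False
        # Turn one group of rows into a list of (word, POS, tag) triples.
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        # NOTE: groupby drops rows whose "Sentence #" is missing.
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        """Return the next sentence, or None once no sentence with that number exists."""
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None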

In [58]:
getter = SentenceGetter(data)

In [59]:
sent = getter.get_next()

In [62]:
sent

[('Thousands', 'NNS', 'O')]

In [60]:
sentences = getter.sentences

In [61]:
sentences[:5]

[[('Thousands', 'NNS', 'O')],
 [('Iranian', 'JJ', 'B-gpe')],
 [('Helicopter', 'NN', 'O')],
 [('They', 'PRP', 'O')],
 [('U.N.', 'NNP', 'B-geo')]]
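
The one-token sentences above are what you get when only the first row of each sentence carries a `Sentence #` value. Below is a minimal sketch of one possible fix, forward-filling the column before grouping. The frame `toy` and the list `toy_sentences` are hypothetical names, and the rows are made-up stand-ins that copy the layout of `data.head()`, not actual rows of the dataset.

```python
import pandas as pd

# Toy frame mimicking the layout of data.head(): only the first token
# of each sentence has a "Sentence #" value (illustrative rows only).
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", None, None, "Sentence: 2", None],
    "Word": ["Thousands", "of", "demonstrators", "Iranian", "officials"],
    "POS": ["NNS", "IN", "NNS", "JJ", "NNS"],
    "Tag": ["O", "O", "O", "B-gpe", "O"],
})

# Propagate each sentence id down to the rows that follow it.
toy["Sentence #"] = toy["Sentence #"].ffill()

# Collect (word, POS, tag) triples per sentence. sort=False keeps the
# sentences in order of first appearance instead of sorting the ids as
# strings (where "Sentence: 10" would come before "Sentence: 2").
toy_sentences = [
    list(zip(g["Word"], g["POS"], g["Tag"]))
    for _, g in toy.groupby("Sentence #", sort=False)
]
print(toy_sentences)
```

Applying the same `ffill()` to `data["Sentence #"]` before constructing `SentenceGetter` should give full sentences, though the outputs recorded in this notebook were produced without it.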

# Conditional Random Fields

We use a linear-chain conditional random field (CRF) to predict the tag of each token.  

Let  
$x=(x_1, x_2, ..., x_m)$ : the input sequence  
$s=(s_1, s_2, ..., s_m)$ : the sequence of output states  

We model the conditional probability  
$p(s_1, ..., s_m | x_1, ..., x_m)$  
  
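We do this with a feature map that sends an input sequence and a candidate state sequence to $d$ real-valued features,  
$\Phi(x_1, ..., x_m, s_1, ..., s_m) \in \mathbb{R}^d$  

The probability is then a log-linear model with parameter vector $w \in \mathbb{R}^d$, normalized over all candidate state sequences $s'$:

$p(s|x; w) = \frac{\exp(w \cdot \Phi(x, s))}{\sum_{s'}\exp(w \cdot \Phi(x, s'))}$  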
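
To make the formula concrete, here is a minimal brute-force sketch that evaluates $p(s|x; w)$ by enumerating every candidate sequence $s'$. Everything in it is an assumption for illustration: the two-label set, the toy feature map `phi`, and the hand-picked weights in `w`. Enumeration is only feasible for tiny inputs, and it is not how `sklearn_crfsuite` computes the normalizer.

```python
import itertools
import math

LABELS = ["O", "B-geo"]  # toy label set (assumption)

def phi(x, s):
    """Toy feature map: counts of (title-case?, label) and (previous label, label) pairs."""
    feats = {}
    for i, (word, label) in enumerate(zip(x, s)):
        prev = s[i - 1] if i > 0 else "<BOS>"
        for key in (("title", word.istitle(), label), ("trans", prev, label)):
            feats[key] = feats.get(key, 0.0) + 1.0
    return feats

def score(w, x, s):
    """Dot product w . Phi(x, s), with w stored as a sparse dict."""
    return sum(w.get(k, 0.0) * v for k, v in phi(x, s).items())

def prob(w, x, s):
    """p(s | x; w), normalizing over every label sequence of the same length."""
    z = sum(math.exp(score(w, x, sp))
            for sp in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(w, x, tuple(s))) / z

# Hand-picked weights: title-case words favor B-geo, other words favor O.
w = {("title", True, "B-geo"): 2.0, ("title", False, "O"): 1.0}
x = ["visited", "London"]

probs = {s: prob(w, x, s) for s in itertools.product(LABELS, repeat=len(x))}
for s, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(s, round(p, 3))
```

The four probabilities sum to one, and the sequence with the largest score $w \cdot \Phi(x, s)$ also has the largest probability, since the denominator is shared.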

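To estimate $w$, we maximize the regularized log-likelihood of the $n$ training pairs $(x^{i}, s^{i})$, with an L2 penalty weighted by $\lambda_2$ and an L1 penalty weighted by $\lambda_1$:

$L(w) = \sum^{n}_{i=1} \log{p(s^{i}|x^{i};w)} - \frac{\lambda_2}{2}\lVert w \rVert_2^2 -\lambda_{1}\lVert w \rVert_1$  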
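
As a small worked example of the objective, the sketch below evaluates $L(w)$ from per-sequence log-probabilities and a weight vector. The function name `regularized_log_likelihood` and all numbers are hypothetical. It only evaluates the expression; it does not perform the optimization.

```python
import math

def regularized_log_likelihood(log_probs, w, lambda1, lambda2):
    """L(w) = sum_i log p(s^i | x^i; w) - (lambda2 / 2) * ||w||_2^2 - lambda1 * ||w||_1."""
    data_term = sum(log_probs)
    l2_penalty = 0.5 * lambda2 * sum(v * v for v in w)
    l1_penalty = lambda1 * sum(abs(v) for v in w)
    return data_term - l2_penalty - l1_penalty

# Two training sequences with model probabilities 0.5 and 0.25 (made-up values).
log_probs = [math.log(0.5), math.log(0.25)]
w = [1.0, -2.0]

value = regularized_log_likelihood(log_probs, w, lambda1=0.1, lambda2=0.1)
print(value)
```

Here the L2 penalty is $0.05 \cdot 5 = 0.25$ and the L1 penalty is $0.1 \cdot 3 = 0.3$, so the objective equals $\log 0.125 - 0.55$.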

Given the estimated optimum $w^*$, the predicted tag sequence is  

$s^* = \mathrm{arg\,max}_{s} \; p(s|x; w^*)$
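
Because the normalizer does not depend on $s$, the arg max of the probability is the arg max of the score $w^* \cdot \Phi(x, s)$. Linear-chain models are usually decoded with the Viterbi algorithm rather than by enumeration. The sketch below assumes the score splits into a per-position emission score plus a score for each pair of adjacent labels. The function `viterbi`, its arguments, and all numbers are hypothetical and only show the dynamic program, not the internals of `sklearn_crfsuite`.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions:   list of dicts, one per position, mapping label -> score
    transitions: dict mapping (previous label, label) -> score (missing pairs score 0)
    """
    # best[i][l] = best score of any labeling of positions 0..i that ends in label l
    best = [{l: emissions[0][l] for l in labels}]
    back = []  # back[i][l] = label at position i on the best path to l at position i + 1
    for em in emissions[1:]:
        scores, ptrs = {}, {}
        for cur in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + em[cur]
            ptrs[cur] = prev
        best.append(scores)
        back.append(ptrs)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda l: best[-1][l])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return path[::-1]

labels = ["O", "B-per", "I-per"]
emissions = [  # toy scores for a three-token input
    {"O": 0.1, "B-per": 1.5, "I-per": 0.2},
    {"O": 0.3, "B-per": 0.8, "I-per": 1.0},
    {"O": 2.0, "B-per": 0.0, "I-per": 0.1},
]
transitions = {("B-per", "I-per"): 1.0, ("O", "I-per"): -5.0}

path = viterbi(emissions, transitions, labels)
print(path)
```

The cost grows linearly with the sequence length and quadratically with the number of labels, whereas enumeration grows exponentially with the sequence length.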

## FE

In [56]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i - 1][0]
        postag1 = sent[i - 1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True  # Begin Of Sentence

    if i < len(sent) - 1:
        word1 = sent[i + 1][0]
        postag1 = sent[i + 1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True  # End Of Sentence

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [label for token, postag, label in sent]


def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [22]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

In [53]:
X[:2]

[[{'BOS': True,
   'EOS': True,
   'bias': 1.0,
   'postag': 'NNS',
   'postag[:2]': 'NN',
   'word.isdigit()': False,
   'word.istitle()': True,
   'word.isupper()': False,
   'word.lower()': 'thousands',
   'word[-2:]': 'ds',
   'word[-3:]': 'nds'}],
 [{'BOS': True,
   'EOS': True,
   'bias': 1.0,
   'postag': 'JJ',
   'postag[:2]': 'JJ',
   'word.isdigit()': False,
   'word.istitle()': True,
   'word.isupper()': False,
   'word.lower()': 'iranian',
   'word[-2:]': 'an',
   'word[-3:]': 'ian'}]]

In [55]:
y[:2]

[['O'], ['B-gpe']]

In [24]:
from sklearn_crfsuite import CRF

crf = CRF(algorithm='lbfgs',
          c1=0.1,              # L1 regularization coefficient
          c2=0.1,              # L2 regularization coefficient
          max_iterations=100,
          all_possible_transitions=False)

In [26]:
pred = cross_val_predict(estimator=crf, X=X, y=y, cv=5)

In [32]:
report = flat_classification_report(y_pred=pred, y_true=y)
print(report)

             precision    recall  f1-score   support

      B-art       0.67      0.11      0.19        18
      B-eve       0.00      0.00      0.00        10
      B-geo       0.74      0.86      0.80      3335
      B-gpe       0.95      0.90      0.92      2989
      B-nat       0.88      0.64      0.74        11
      B-org       0.69      0.51      0.58      2752
      B-per       0.83      0.88      0.85      4019
      B-tim       0.96      0.75      0.84       515
          O       0.98      0.99      0.98     34310

avg / total       0.93      0.93      0.93     47959



# EOF