## CRF

The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two sub-tasks: **identifying the boundaries of the NE**, and **identifying its type**.

Named entity recognition is well suited to a **classifier-based approach**. In particular, we can build a tagger that labels each word in a sentence using the IOB format, where each chunk is labelled with its entity type.

The **IOB tagging** scheme uses tags of the following form:

* B-{CHUNK_TYPE}: the first word of a chunk (Beginning)
* I-{CHUNK_TYPE}: subsequent words Inside the chunk
* O: words Outside any chunk

The chunk type identifies the entity class. The tags in this dataset (see the `Tag` column in the outputs below) use four classes:
* LOC = Location
* ORG = Organization
* PER = Person
* MISC = Miscellaneous
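
To make the tagging scheme concrete, here is a minimal sketch that turns a sequence of IOB tags back into entity spans. The function name `iob_to_spans` is hypothetical (it is not part of any library used in this notebook), and the example tags are the ones shown for the first sentence in the data preview further down.

```python
def iob_to_spans(tags):
    """Return (entity_type, start, end) triples; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open chunk continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if current is not None and not continues:
            spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        elif tag.startswith("I-") and current is None:
            # Lenient choice (an assumption): a stray I- tag opens a chunk.
            start, current = i, tag[2:]
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


words = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]
spans = iob_to_spans(tags)
mentions = [(etype, " ".join(words[s:e])) for etype, s, e in spans]
print(mentions)
```

Each span carries both pieces of information an NER system has to produce: the boundaries (`start`, `end`) and the type.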

For this model I created the **train_CRF** file by concatenating train.txt and test.txt. For convenience, the train/test split is then done inside the script with train_test_split.


In [1]:
#Data analysis
import pandas as pd
import numpy as np
#Data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
sns.set(font_scale=1)
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
#Modeling
import sklearn  #needed for sklearn.__version__ in the next cell
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, make_scorer
from sklearn_crfsuite import CRF, scorers, metrics
from sklearn_crfsuite.metrics import flat_classification_report, flat_f1_score
import scipy.stats
import eli5

In [2]:
sklearn.__version__

'0.23.0'

In [3]:
#there is a functionality of nltk to read CoNLL data
#from nltk.corpus.reader import ConllCorpusReader

#train = ConllCorpusReader('/train.txt', 'eng.train', ['words', 'pos', 'ignore', 'chunk'])
#test = ConllCorpusReader('/test.txt', 'eng.testa', ['words', 'pos', 'ignore', 'chunk'])

In [4]:
#read the dataset (train.txt + test.txt concatenated) and rename the columns
df = pd.read_csv('train_CRF.txt', sep=" ", encoding="latin1")
df.columns = ['Word','POS','POS2','Tag']
df.drop('POS2', axis=1, inplace=True)
NaN = np.nan
#create an empty (NaN) column that will later mark where each sentence starts
df['Sentence #'] = NaN
df[['Sentence #']] = df[["Sentence #"]].astype(object)
#Let us take a sneak peek at the dataset first
df.head()


,Word,POS,Tag,Sentence #
0,EU,NNP,B-ORG,
1,rejects,VBZ,O,
2,German,JJ,B-MISC,
3,call,NN,O,
4,to,TO,O,


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251230 entries, 0 to 251229
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Word        251227 non-null  object
 1   POS         251230 non-null  object
 2   Tag         248631 non-null  object
 3   Sentence #  0 non-null       object
dtypes: object(4)
memory usage: 7.7+ MB


In [6]:
df.head(20)

,Word,POS,Tag,Sentence #
0,EU,NNP,B-ORG,
1,rejects,VBZ,O,
2,German,JJ,B-MISC,
3,call,NN,O,
4,to,TO,O,
5,boycott,VB,O,
6,British,JJ,B-MISC,
7,lamb,NN,O,
8,.,.,O,
9,Peter,NNP,B-PER,


In [7]:
#mark every row that holds a sentence-final period
c = list()
for i in range(len(df)):
  if df['Word'][i]=='.':
    c.append('Sentence:')
  else:
    c.append(NaN)

In [8]:
# Shift the markers down by one row: drop the last element and insert a marker at the top,
# so that each 'Sentence:' marker sits on the first word of a sentence rather than on the preceding period
c.pop()
c.insert(0,'Sentence:')

In [9]:
sen = 'Sentence:'
count = 1
for i in range(len(c)):
  if c[i]=='Sentence:':
    c[i] = sen+str(count)
    count+=1

In [10]:
#c

In [11]:
df['Sentence #']=c
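
The marking, shifting and numbering steps in cells In [7] to In [9] can also be done in a single pass. The following is a minimal sketch in plain Python, assuming (as the notebook does) that every sentence ends with a `.` token; the function name `sentence_markers` is hypothetical.

```python
def sentence_markers(words, terminator="."):
    """Label the first word of each sentence 'Sentence:N'; other rows get None."""
    markers = []
    count = 0
    at_sentence_start = True  # the very first word opens sentence 1
    for word in words:
        if at_sentence_start:
            count += 1
            markers.append("Sentence:{}".format(count))
        else:
            markers.append(None)
        # The word after a terminator opens the next sentence.
        at_sentence_start = (word == terminator)
    return markers


toy_words = ["EU", "rejects", ".", "Peter", "Blackburn", "."]
toy_markers = sentence_markers(toy_words)
print(toy_markers)
```

The result has the same shape as the list `c` built above: one marker on the first word of each sentence and an empty value everywhere else.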

In [12]:
df

,Word,POS,Tag,Sentence #
0,EU,NNP,B-ORG,Sentence:1
1,rejects,VBZ,O,
2,German,JJ,B-MISC,
3,call,NN,O,
4,to,TO,O,
...,...,...,...,...
251225,younger,JJR,O,
251226,brother,NN,O,
251227,",",",",O,
251228,Bobby,NNP,B-PER,


In [13]:
df.describe()

,Word,POS,Tag,Sentence #
count,251227,251230,248631,9000
unique,27316,46,9,9000
top,.,NNP,O,Sentence:7385
freq,9000,42987,206476,1


In [14]:
#Checking null values, if any.
df.isnull().sum()

Word               3
POS                0
Tag             2599
Sentence #    242230
dtype: int64

In [15]:
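#forward-fill: every row inherits the most recent sentence marker
#(same result as fillna(method='ffill'), which newer pandas versions deprecate)
df = df.ffill()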
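
Forward filling is what turns the sparse sentence markers into a usable group key. The sketch below mimics the idea on a plain list; `forward_fill` is a hypothetical helper written for illustration, not the pandas implementation.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


markers = ["Sentence:1", None, None, "Sentence:2", None]
filled_markers = forward_fill(markers)
print(filled_markers)
```

Note that the notebook applies the fill to the whole DataFrame, so the missing `Word` and `Tag` values reported by `df.isnull().sum()` above are also replaced by the value from the previous row.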

In [16]:
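# Helper class that groups the rows into sentences. Each sentence is a list of (word, POS, tag) tuples.
class sentence(object):
    def __init__(self, df):
        self.n_sent = 1
        self.df = df
        self.empty = False
        agg = lambda s : [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                       s['POS'].values.tolist(),
                                                       s['Tag'].values.tolist())]
        # Note: groupby sorts the keys as strings ('Sentence:1', 'Sentence:10', ...),
        # so self.sentences is not in document order
        self.grouped = self.df.groupby("Sentence #").apply(agg)
        self.sentences = [s for s in self.grouped]
        
    def get_text(self):
        # The keys built above have the form 'Sentence:1' (no space after the colon)
        try:
            s = self.grouped['Sentence:{}'.format(self.n_sent)]
            self.n_sent +=1
            return s
        except KeyError:
            return None

In [17]:
#Displaying one full sentence
getter = sentence(df)
sentences = [" ".join([s[0] for s in sent]) for sent in getter.sentences]
sentences[0]

'EU rejects German call to boycott British lamb .'

In [18]:
#sentence with its pos and tag. 
sent = getter.get_text()
print(sent)

[('EU', 'NNP', 'B-ORG'), ('rejects', 'VBZ', 'O'), ('German', 'JJ', 'B-MISC'), ('call', 'NN', 'O'), ('to', 'TO', 'O'), ('boycott', 'VB', 'O'), ('British', 'JJ', 'B-MISC'), ('lamb', 'NN', 'O'), ('.', '.', 'O')]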


In [19]:
sentences = getter.sentences
#sentences

**2) FEATURE EXTRACTION**


In [20]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
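
To see how the context features and the BOS/EOS flags behave at sentence boundaries, here is a trimmed-down sketch of the same idea. `token_features` is a hypothetical, shortened variant of `word2features` (it keeps only a few of the features above), and the toy sentence is made up from tokens shown in the data preview.

```python
def token_features(sent, i):
    """Features for token i: the word itself plus its left and right neighbours."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i > 0:
        features['-1:word.lower()'] = sent[i - 1][0].lower()
    else:
        features['BOS'] = True  # beginning of sentence: no left neighbour
    if i < len(sent) - 1:
        features['+1:word.lower()'] = sent[i + 1][0].lower()
    else:
        features['EOS'] = True  # end of sentence: no right neighbour
    return features


toy_sent = [('EU', 'NNP', 'B-ORG'), ('rejects', 'VBZ', 'O'), ('.', '.', 'O')]
toy_features = [token_features(toy_sent, i) for i in range(len(toy_sent))]
for token, feats in zip(toy_sent, toy_features):
    print(token[0], sorted(feats))
```

Each token becomes a dictionary, and each sentence becomes a list of such dictionaries, which is the input format `X` is built in below.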
    

In [21]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences] 

In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2) 

In [23]:
#This is what the features extracted from a single token look like:
X_train[0][1]

{'bias': 1.0,
 'word.lower()': 'said',
 'word[-3:]': 'aid',
 'word[-2:]': 'id',
 'word.isupper()': False,
 'word.istitle()': False,
 'word.isdigit()': False,
 'postag': 'VBD',
 'postag[:2]': 'VB',
 '-1:word.lower()': 'sewa',
 '-1:word.istitle()': True,
 '-1:word.isupper()': False,
 '-1:postag': 'NNP',
 '-1:postag[:2]': 'NN',
 '+1:word.lower()': 'the',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:postag': 'DT',
 '+1:postag[:2]': 'DT'}

**3) TRAIN A CRF MODEL**

Once the features are in the right format, we can train a linear-chain CRF (Conditional Random Field) model using sklearn_crfsuite.CRF:


In [24]:
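#pin 'scikit-learn<0.24' to avoid compatibility problems with sklearn_crfsuite
crf = CRF(algorithm = 'lbfgs',   #L-BFGS gradient-based training
         c1 = 0.1,               #L1 regularization coefficient
         c2 = 0.1,               #L2 regularization coefficient
         max_iterations = 100,
         all_possible_transitions = False)
crf.fit(X_train, y_train)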



CRF(algorithm='lbfgs', all_possible_transitions=False, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)

**4) INSPECT MODEL WEIGHTS**

CRFsuite CRF models use two kinds of features: state features and transition features. Let’s check their weights using eli5.show_weights:

In [25]:
eli5.show_weights(crf, top=30)


From \ To,O,B-LOC,I-LOC,B-MISC,I-MISC,B-ORG,I-ORG,B-PER,I-PER
O,3.791,2.38,0.0,2.521,0.0,2.597,0.0,2.452,0.0
B-LOC,0.177,0.455,7.153,0.115,0.0,0.39,0.0,-2.241,0.0
I-LOC,-1.118,-1.345,6.092,-1.156,0.0,0.146,0.0,-2.783,0.0
B-MISC,-0.174,-1.752,0.0,0.058,7.127,-0.473,0.0,-1.067,0.0
I-MISC,-0.67,0.0,0.0,0.455,7.476,0.203,0.0,-2.688,0.0
B-ORG,0.157,-1.975,0.0,-1.525,0.0,-0.761,7.567,-3.176,0.0
I-ORG,-0.583,-3.184,0.0,-1.244,0.0,-1.091,7.013,-3.81,0.0
B-PER,0.716,-0.774,0.0,-1.319,0.0,0.0,0.0,0.016,8.715
I-PER,-0.58,0.094,0.0,0.0,0.0,-1.624,0.0,-2.617,5.657

Top features per label, in the column order of the transition table above (O, B-LOC, I-LOC, B-MISC, I-MISC, B-ORG, I-ORG, B-PER, I-PER):

Weight?,Feature
+5.446,word[-3:]:day
+4.659,word.lower():minister
+4.580,word.lower():september
+4.415,word.lower():attendance
+4.313,word.lower():march
+4.210,word.lower():august
+4.030,word.lower():bowling
+3.954,word.lower():december
+3.838,word.lower():badminton
+3.730,word.lower():june

Weight?,Feature
+4.945,word.lower():france
+4.852,word.lower():england
+4.680,word.lower():hungary
+4.652,word.lower():chester-le-street
+4.517,+1:word.lower():1996-12-06
+4.517,word.lower():pakistan
+4.399,word.lower():balkans
+4.358,word.lower():italy
+4.313,word.lower():zaire
+4.308,word.lower():vatican

Weight?,Feature
+3.972,-1:word.lower():colo
+3.946,-1:word.lower():wisc
+3.299,word.lower():ireland
+3.122,-1:word.lower():west
+3.067,+1:word.lower():average
+2.880,+1:word.lower():1996-12-06
+2.849,-1:word.lower():new
+2.735,-1:word.lower():san
+2.733,-1:word.lower():united
+2.731,word.lower():kong

Weight?,Feature
+4.492,word.lower():german
+4.380,word.lower():dutch
+4.146,word.lower():dtb-bund-future
+4.120,word.lower():rottweilers
+3.930,word.lower():argentine
+3.862,-1:word.lower():euro-sceptic
+3.721,word.lower():euroleague
+3.711,word.lower():frenchman
+3.661,word.lower():internet
+3.627,word.lower():moslem

Weight?,Feature
+4.745,word.lower():division
+3.411,word.lower():masters
+3.190,word.lower():open
+3.037,+1:word.lower():index
+3.036,-1:word.lower():world
+2.931,word[-3:]:sed
+2.910,-1:word.lower():major
+2.895,word.lower():convention
+2.866,word.lower():day
+2.847,-1:word.lower():super

Weight?,Feature
+4.823,-1:word.lower():v
+4.444,word.lower():sungard
+4.413,word.lower():senate
+4.349,word.lower():reuters
+4.285,word.lower():barrick
+4.063,word.lower():udinese
+3.987,word.lower():sunderland
+3.850,word.lower():fiorentina
+3.786,word.lower():painewebber
+3.724,word.lower():housecall

Weight?,Feature
+4.059,-1:word.lower():interior
+3.946,-1:word.lower():taichung
+3.682,-1:word.lower():moody
+3.428,-1:word.lower():assoc
+3.137,-1:word.lower():bj
+3.004,-1:word.lower():lloyd
+2.978,-1:word.lower():st
+2.807,-1:word.lower():european
+2.737,-1:word.lower():diario
+2.693,-1:word.lower():mladost

Weight?,Feature
+5.129,word.lower():clinton
+4.314,word.lower():stenning
+4.238,word.lower():ata-ur-rehman
+4.077,-1:word.lower():superman
+4.029,-1:word.lower():lahd
+3.897,word.lower():inzamam-ul-haq
+3.560,-1:word.lower():4.
+3.383,word[-3:]:Haq
+3.379,word.lower():huber
+3.364,word.lower():gore

Weight?,Feature
+2.963,+1:word.lower():town
+2.095,word[-2:]:ez
+2.074,word[-3:]:ath
+1.979,-1:word.lower():scott
+1.938,+1:word.lower():through
+1.935,+1:word.lower():lbw
+1.809,+1:word.lower():reputation
+1.804,-1:word.lower():yen
+1.772,-1:word.lower():clean
+1.754,-1:word.lower():rashid


In [26]:
#Refit with much stronger L1 regularization (c1=200) to see which features survive
crf = CRF(
    algorithm='lbfgs',
    c1=200,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=False,
)
crf.fit(X_train, y_train)
eli5.show_weights(crf, top=30)

From \ To,O,B-LOC,I-LOC,B-MISC,I-MISC,B-ORG,I-ORG,B-PER,I-PER
O,1.976,0.806,0.0,1.339,0.0,1.459,0.0,1.054,0.0
B-LOC,0.225,0.0,2.732,0.0,0.0,0.0,0.0,0.0,0.0
I-LOC,-0.036,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
B-MISC,0.0,0.0,0.0,0.0,2.654,0.0,0.0,0.0,0.0
I-MISC,-0.045,0.0,0.0,0.0,0.458,0.0,0.0,0.0,0.0
B-ORG,0.19,0.0,0.0,0.0,0.0,0.0,3.987,-0.002,0.0
I-ORG,-0.144,0.0,0.0,0.0,0.0,0.0,3.17,0.0,0.0
B-PER,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.896
I-PER,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011

Top features per label, in the column order of the transition table above (O, B-LOC, I-LOC, B-MISC, I-MISC, B-ORG, I-ORG, B-PER, I-PER):

Weight?,Feature
+3.669,bias
+1.447,word[-2:]:ay
+1.366,word[-3:]:day
+0.790,-1:word.lower():on
+0.621,BOS
+0.536,+1:postag:NNP
+0.534,word[-2:]:er
+0.522,+1:word.istitle()
+0.493,postag:NNS
+0.474,postag:CD

Weight?,Feature
1.006,word[-2:]:ia
0.937,postag:NNP
0.812,word.isupper()
0.509,postag[:2]:NN
0.425,-1:word.lower():in
0.29,-1:postag[:2]:IN
0.29,-1:postag:IN
0.248,-1:word.lower():(
0.247,-1:postag[:2]:(
0.247,-1:postag:(

Weight?,Feature
0.295,-1:postag:NNP
0.144,-1:word.istitle()
0.118,-1:postag[:2]:NN
-0.003,word.istitle()
-0.083,+1:postag[:2]:NN
-0.375,bias

Weight?,Feature
0.793,word[-2:]:an
0.784,postag:JJ
0.776,postag[:2]:JJ
0.729,+1:postag[:2]:NN
0.272,word.isupper()
0.27,word.istitle()
0.228,word[-3:]:ian
0.184,-1:postag:DT
0.184,-1:postag[:2]:DT
0.057,-1:word.lower():the

Weight?,Feature
0.169,-1:postag[:2]:NN
0.141,-1:postag:NNP
0.128,-1:word.istitle()
0.041,+1:postag[:2]:NN
-0.007,postag:NNP
-0.029,word.istitle()
-0.03,postag[:2]:NN
-0.182,bias

Weight?,Feature
0.817,word.isupper()
0.42,+1:postag[:2]:CD
0.42,+1:postag:CD
0.365,-1:postag[:2]:CD
0.365,-1:postag:CD
0.289,-1:word.lower():the
0.207,postag[:2]:NN
0.172,postag:NNP
0.132,-1:postag:DT
0.132,-1:postag[:2]:DT

Weight?,Feature
0.285,-1:word.istitle()
0.121,+1:postag[:2]:CD
0.121,+1:postag:CD
0.044,-1:postag:NNP
0.025,postag:NNP
-0.004,+1:postag[:2]:NN
-0.004,+1:postag:NNP
-0.055,word.isupper()
-0.068,-1:word.isupper()
-0.088,+1:word.isupper()

Weight?,Feature
0.602,word.istitle()
0.553,+1:postag[:2]:VB
0.431,+1:postag:VBD
0.384,postag:NNP
0.254,+1:word.istitle()
0.25,+1:postag:NNP
0.171,"-1:word.lower():,"
0.171,"-1:postag[:2]:,"
0.171,"-1:postag:,"
0.16,postag[:2]:NN

Weight?,Feature
0.801,-1:word.istitle()
0.705,-1:postag:NNP
0.483,+1:word.lower():(
0.483,+1:postag:(
0.483,+1:postag[:2]:(
0.41,-1:postag[:2]:NN
0.106,postag[:2]:NN
0.054,word.istitle()
0.038,"+1:word.lower():,"
0.038,"+1:postag:,"



Note that impossible transitions such as O -> I-ORG have zero weights in both models above, not the large negative weights one might expect. The reason is that crfsuite has not seen these transitions in the training data and assumes there is no need to learn weights for them, which saves some computation time. This is the default behavior, but it can be changed with the all_possible_transitions option of sklearn_crfsuite.CRF. Let’s check how it affects the result:

In [27]:
crf = CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=20,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)

CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=20)

In [28]:
eli5.show_weights(crf, top=5, show=['transition_features'])

From \ To,O,B-LOC,I-LOC,B-MISC,I-MISC,B-ORG,I-ORG,B-PER,I-PER
O,2.383,0.976,-6.947,0.767,-6.799,0.981,-6.154,1.094,-5.688
B-LOC,0.056,-0.453,3.744,0.0,-1.245,0.158,-3.898,-1.702,-4.807
I-LOC,-0.32,-0.421,1.185,-0.292,-0.412,0.154,-0.914,-0.522,-0.997
B-MISC,0.0,-1.457,-0.979,-0.488,3.017,-0.457,-2.956,-0.188,-3.053
I-MISC,-0.291,-0.635,-0.483,-0.017,3.754,0.078,-1.059,-0.631,-1.111
B-ORG,0.239,-2.542,-1.652,-1.446,-1.612,-0.322,4.5,-3.043,-4.565
I-ORG,-0.154,-1.593,-1.055,-0.889,-1.09,-0.365,3.99,-1.904,-3.558
B-PER,-0.273,-3.176,-2.074,-1.907,-1.957,-3.135,-5.165,-0.811,3.586
I-PER,-0.123,0.306,-0.688,-0.715,-0.684,-1.063,-2.337,-1.243,1.513


With all_possible_transitions=True, the CRF learned large negative weights for impossible transitions such as O -> I-ORG.
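
Which transitions count as "impossible" follows from the tagging scheme itself. Here is a minimal sketch, assuming strict B-/I-/O tagging in which every chunk starts with a B- tag (as described at the top of this notebook); `is_legal_transition` is a hypothetical helper, not part of sklearn_crfsuite or eli5.

```python
def is_legal_transition(prev_tag, next_tag):
    """True if next_tag may directly follow prev_tag under strict B-/I-/O tagging."""
    if not next_tag.startswith("I-"):
        return True  # 'O' and any 'B-' tag may follow anything
    chunk_type = next_tag[2:]
    # An I- tag must continue a chunk of the same type.
    return prev_tag in ("B-" + chunk_type, "I-" + chunk_type)


all_tags = ["O", "B-LOC", "I-LOC", "B-MISC", "I-MISC", "B-ORG", "I-ORG", "B-PER", "I-PER"]
illegal = [(p, n) for p in all_tags for n in all_tags if not is_legal_transition(p, n)]
print(len(illegal), "illegal transitions, e.g.", illegal[:3])
```

Comparing this list with the transition table above is a quick sanity check: the strongly negative entries in the O row (O -> I-LOC, O -> I-MISC, O -> I-ORG, O -> I-PER) are all transitions the scheme forbids.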


**5) CUSTOMIZATION**


The table above is large and kind of hard to inspect; eli5 provides several options to look at only part of the features. For example, you can restrict the output to a subset of labels:


In [29]:
eli5.show_weights(crf, top=10, targets=['O', 'B-ORG', 'I-ORG'])

From \ To,O,B-ORG,I-ORG
O,2.383,0.981,-6.154
B-ORG,0.239,-0.322,4.5
I-ORG,-0.154,-0.365,3.99

Top features per label, in the order O, B-ORG, I-ORG:

Weight?,Feature
+2.886,word[-3:]:day
+2.629,bias
+2.324,word[-3:]:ber
+2.252,word.lower():friday
+2.183,word.lower():thursday
+2.087,postag[:2]:PR
+1.764,postag:VBD
… 12778 more positive …,… 12778 more positive …
… 4302 more negative …,… 4302 more negative …
-1.887,-1:word.lower():at

Weight?,Feature
+1.935,-1:word.lower():1
+1.909,+1:word.lower():3
+1.581,+1:word.lower():at
+1.446,-1:word.lower():3
+1.346,word.isupper()
+1.245,-1:word.lower():0
+1.197,-1:word.lower():4
+1.177,-1:word.lower():2
+1.137,+1:word.lower():2
… 4818 more positive …,… 4818 more positive …

Weight?,Feature
+1.207,+1:word.lower():3
+1.184,word[-2:]:ty
+1.121,word[-3:]:ion
+1.015,+1:word.lower():1
+0.917,word[-2:]:om
+0.916,word[-3:]:oom
+0.902,word.lower():newsroom
+0.884,+1:word.lower():2
+0.860,-1:word.lower():st
+0.825,+1:word.lower():0


Another option is to check only some of the features, which helps to verify that a feature function works as intended. For example, let’s check how the model uses the word shape features by passing the feature_re argument, and hide the transition table:

In [30]:
eli5.show_weights(crf, top=10, feature_re=r'^word\.is',
                  horizontal_layout=True, show=['targets'])

Matching features per label, in the order O, B-LOC, I-LOC, B-MISC, I-MISC, B-ORG, I-ORG, B-PER, I-PER:

Weight?,Feature
0.44,word.isdigit()
-2.569,word.isupper()
-3.858,word.istitle()

Weight?,Feature
0.93,word.isupper()
0.446,word.istitle()
-0.328,word.isdigit()

Weight?,Feature
0.058,word.istitle()
-0.015,word.isupper()
-0.097,word.isdigit()

Weight?,Feature
1.101,word.isupper()
0.699,word.istitle()
-0.012,word.isdigit()

Weight?,Feature
0.298,word.isdigit()
0.025,word.istitle()
-0.193,word.isupper()

Weight?,Feature
1.346,word.isupper()
0.088,word.istitle()
-0.081,word.isdigit()

Weight?,Feature
0.392,word.istitle()
-0.432,word.isupper()
-0.474,word.isdigit()

Weight?,Feature
1.104,word.istitle()
-0.26,word.isdigit()

Weight?,Feature
0.361,word.istitle()
-0.444,word.isdigit()
-2.141,word.isupper()


**6) FORMATTING IN CONSOLE**

It is also possible to format the result as text (could be useful in console):

In [31]:
expl = eli5.explain_weights(crf, top=5, targets=['O', 'B-LOC', 'I-LOC'])
print(eli5.format_as_text(expl))

Explained as: CRF

Transition features:
            O    B-LOC    I-LOC
-----  ------  -------  -------
O       2.383    0.976   -6.947
B-LOC   0.056   -0.453    3.744
I-LOC  -0.320   -0.421    1.185

y='O' top features
Weight  Feature       
------  --------------
+2.886  word[-3:]:day 
+2.629  bias          
+2.324  word[-3:]:ber 
… 12782 more positive …
… 4303 more negative …
-2.569  word.isupper()
-3.858  word.istitle()

y='B-LOC' top features
      … 3422 more positive …      
      … 707 more negative …       
Weight  Feature                   
------  --------------------------
+3.690  -1:word.lower():at        
+2.059  word[-2:]:ia              
+1.684  word.lower():london       
+1.435  +1:word.lower():1996-12-06
+1.433  word[-3:]:ain             

y='I-LOC' top features
   … 1121 more positive …    
    … 233 more negative …    
Weight  Feature              
------  ---------------------
+1.230  -1:word.lower():new  
+1.184  -1:word.lower():south
+1.020  word[-3:]:tes        

In [32]:
#Predicting on the test set with the most recently fitted model
#(c1=0.1, c2=0.1, max_iterations=20, all_possible_transitions=True)
y_pred = crf.predict(X_test)

In [33]:
f1_score = flat_f1_score(y_test, y_pred, average = 'weighted')
print(f1_score)

0.9447780123414588


In [34]:
report = flat_classification_report(y_test, y_pred)
print(report)   



              precision    recall  f1-score   support

       B-LOC       0.89      0.74      0.80      1809
      B-MISC       0.77      0.68      0.72       796
       B-ORG       0.81      0.61      0.70      1665
       B-PER       0.69      0.85      0.76      1701
       I-LOC       0.84      0.46      0.59       282
      I-MISC       0.57      0.61      0.59       260
       I-ORG       0.77      0.61      0.68       873
       I-PER       0.73      0.97      0.83      1152
           O       0.98      0.99      0.99     41405

    accuracy                           0.95     49943
   macro avg       0.78      0.72      0.74     49943
weighted avg       0.95      0.95      0.94     49943
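
The weighted average above is dominated by the O label, which accounts for 41405 of the 49943 test tokens. The sketch below shows the effect on made-up toy data, using plain Python instead of the sklearn_crfsuite metrics; `per_label_f1` and `weighted_f1` are hypothetical helpers written for illustration.

```python
from collections import Counter
from itertools import chain


def per_label_f1(y_true, y_pred):
    """Flatten the label sequences and return ({label: f1}, {label: support})."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    tp, pred_count, true_count = Counter(), Counter(), Counter()
    for t, p in zip(true_flat, pred_flat):
        true_count[t] += 1
        pred_count[p] += 1
        if t == p:
            tp[t] += 1
    scores = {}
    for label in true_count:
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label]
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores, true_count


def weighted_f1(y_true, y_pred, labels=None):
    """Support-weighted mean of the per-label F1, optionally over a label subset."""
    scores, support = per_label_f1(y_true, y_pred)
    labels = list(scores) if labels is None else labels
    total = sum(support[label] for label in labels)
    if total == 0:
        return 0.0
    return sum(scores.get(label, 0.0) * support[label] for label in labels) / total


# Toy data: the model finds the person but misses the location.
toy_true = [["B-PER", "O", "O"], ["O", "B-LOC", "O"]]
toy_pred = [["B-PER", "O", "O"], ["O", "O", "O"]]
print("with O:   ", weighted_f1(toy_true, toy_pred))
print("without O:", weighted_f1(toy_true, toy_pred, labels=["B-PER", "B-LOC"]))
```

Restricting the average to the entity labels gives a less flattering but more informative number, which is why the hyperparameter search further down removes O from the label list.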



In [39]:
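from sklearn.metrics import confusion_matrix, multilabel_confusion_matrix
from sklearn.preprocessing import MultiLabelBinarizer
#Note: this is a sentence-level view. Each sentence is reduced to the set of labels it contains,
#so every 2x2 matrix below counts test sentences (not tokens), one matrix per label in sorted order.
mlb = MultiLabelBinarizer()
y_test1 = mlb.fit_transform(y_test)
y_pred1 = mlb.transform(y_pred)  #reuse the fitted binarizer so the label columns line up
#confusion_matrix(y_test, y_pred) would need flattened token labels
conf_mat = multilabel_confusion_matrix(y_test1, y_pred1)
conf_mat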

array([[[ 915,   73],
        [ 153,  659]],

       [[1121,   98],
        [ 131,  450]],

       [[1077,   86],
        [ 241,  396]],

       [[ 850,  261],
        [  57,  632]],

       [[1584,   19],
        [  89,  108]],

       [[1595,   39],
        [  57,  109]],

       [[1421,   54],
        [ 128,  197]],

       [[1220,  171],
        [   5,  404]],

       [[   0,    0],
        [   0, 1800]]], dtype=int64)
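
The matrices above count sentences, because each sentence was reduced to the set of labels it contains. For a token-level view, the label sequences can be flattened and the (true, predicted) pairs counted directly. This is a minimal sketch on made-up toy data; `token_confusion` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter
from itertools import chain


def token_confusion(y_true, y_pred):
    """Count (true_label, predicted_label) pairs over all tokens."""
    pairs = zip(chain.from_iterable(y_true), chain.from_iterable(y_pred))
    return Counter(pairs)


toy_true = [["B-PER", "I-PER", "O"], ["O", "B-LOC", "O"]]
toy_pred = [["B-PER", "O", "O"], ["O", "B-LOC", "O"]]
confusion = token_confusion(toy_true, toy_pred)
for (true_label, pred_label), count in sorted(confusion.items()):
    print("{:>6} -> {:<6} {}".format(true_label, pred_label, count))
```

Off-diagonal pairs such as `('I-PER', 'O')` show exactly which labels the model confuses, which the sentence-level matrices cannot show.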

In [41]:
labels = list(crf.classes_)
labels.remove('O')
labels

['B-PER', 'B-LOC', 'I-LOC', 'B-ORG', 'I-PER', 'B-MISC', 'I-MISC', 'I-ORG']

%%time
# define fixed parameters and parameters to search
crf = CRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

# use the same metric (weighted F1) for evaluation, restricted to the entity labels without 'O'
f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

# search
rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

rs
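
To get a feel for what the search space looks like, the sketch below draws candidate (c1, c2) pairs from exponential distributions with the same scales as `params_space`. It uses the standard library's `random.expovariate` as a stand-in for `scipy.stats.expon` (an assumption made to keep the example dependency-free); `sample_params` and the seed are hypothetical.

```python
import random


def sample_params(rng, c1_scale=0.5, c2_scale=0.05):
    """Draw one (c1, c2) candidate; expovariate takes the rate, i.e. 1 / scale."""
    return {
        'c1': rng.expovariate(1.0 / c1_scale),
        'c2': rng.expovariate(1.0 / c2_scale),
    }


rng = random.Random(0)  # fixed seed, chosen only to make the sketch repeatable
candidates = [sample_params(rng) for _ in range(50)]  # mirrors n_iter=50
for params in candidates[:3]:
    print(params)
```

Most draws fall below the scale value, with occasional larger ones, so the search concentrates on weak regularization while still trying a few strongly regularized models.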

Source: https://github.com/Akshayc1/named-entity-recognition/blob/master/NER%20using%20CRF.ipynb