# Build your own NER Tagger using pycrfsuite

__Named Entity Recognition (NER)__, also known as entity chunking or entity extraction, is a popular information-extraction technique that identifies and segments named entities in text and classifies them into predefined classes.

There are various off-the-shelf solutions that can perform named entity extraction (some of which we discussed in the previous units). Yet there are times when the requirements go beyond the capabilities of off-the-shelf classifiers.

In this notebook, we will go through an exercise to build our own NER tagger using Conditional Random Fields.
We will use ```pycrfsuite``` to develop it.

## Load Dataset

Named Entity Recognition is a sequence modeling problem at its core. It is closely related to classification, in that we need a labeled dataset to train a classifier.

There are various labeled datasets for NER. We will use a pre-processed version of the __INDONLU corpus__ for this notebook. The preprocessed version is available at the following link: [indonlu/nerp_ner-prosa](https://github.com/IndoNLP/indonlu/tree/master/dataset/nerp_ner-prosa)

The dataset is included in the code repository in compressed form, and you can load it directly with `pandas` as follows.

In [1]:
#!pip install scikit-learn==0.24.0

In [2]:
import pandas as pd

dfpos = pd.read_csv('train_pos.txt', sep='\t', skip_blank_lines=False, encoding='ISO-8859-1')
dfpos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 96274 entries, 0 to 96273
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word    92184 non-null  object
 1   pos     92186 non-null  object
dtypes: object(2)
memory usage: 1.5+ MB


In [3]:
dfpos.shape

(96274, 2)

In [26]:
import pandas as pd
dfner = pd.read_csv('javanese-ner-kel6(2).txt', sep='\t', skip_blank_lines=False, encoding='ISO-8859-1')
dfner.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41877 entries, 0 to 41876
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   word_x  40077 non-null  object
 1   pos     40077 non-null  object
 2   tag     40077 non-null  object
dtypes: object(3)
memory usage: 981.6+ KB


In [27]:
dfner.shape

(41877, 3)

In [8]:
df = pd.merge(dfpos, dfner,left_index=True,right_index=True, how="left")

NameError: name 'dfpos' is not defined

In [7]:
df.shape

(96274, 4)

In [28]:
dfner.head()


Unnamed: 0,word_x,pos,tag
0,Kementerian,,B-ORG
1,pendidikan,,I-ORG
2,dan,,I-ORG
3,kebudayaan,,L-ORG
4,",",,O


In [9]:
df = df.drop(['word_y'], axis=1)

In [10]:
dfner.head()

Unnamed: 0,word_x,pos,tag
0,kepala,B-NNO,O
1,dinas,B-VBP,O
2,tata,B-NNO,O
3,kota,B-NNO,O
4,manado,B-NNP,B-PLC


In [29]:
dfner.word_x.nunique(), dfner.pos.nunique(), dfner.tag.nunique()

(2970, 1, 73)

According to the output above, `dfner` contains 2970 unique words.

Its `pos` column holds only 1 unique value, and there are 73 unique NER tags in total.

## Tag Distribution

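The NERP NER Prosa dataset uses IOB (_Inside, Outside, Beginning_) tagging, a common format for tagging tokens that we discussed earlier. The tags in `dfner` also carry `L-` and `U-` prefixes, that is, the BILOU variant of the scheme. To refresh your memory:

+ __I- prefix__ before a tag indicates that the token is inside a chunk.
+ __B- prefix__ before a tag indicates that the token is the beginning of a chunk.
+ __L- prefix__ before a tag indicates that the token is the last token of a chunk.
+ __U- prefix__ before a tag indicates that the token is a single-token (unit) chunk.
+ __O tag__ indicates that a token belongs to no chunk (outside).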
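
As a small illustration of the prefixes above, the sketch below tags one entity span inside a token list. The helper name `tag_span` is made up for this example and is not part of the notebook; it also emits the `L-` (last token of a chunk) and `U-` (single-token chunk) prefixes that appear in the tag counts further down.

```python
def tag_span(tokens, start, end, label):
    """Tag tokens[start:end] as one entity of type `label`; everything else is O.

    Hypothetical helper for illustration only.
    """
    tags = ["O"] * len(tokens)
    if end - start == 1:
        tags[start] = "U-" + label          # single-token chunk
    elif end - start > 1:
        tags[start] = "B-" + label          # beginning of the chunk
        for i in range(start + 1, end - 1):
            tags[i] = "I-" + label          # inside the chunk
        tags[end - 1] = "L-" + label        # last token of the chunk
    return tags


tokens = ["Kementerian", "pendidikan", "dan", "kebudayaan", ",", "kemendikbud"]
print(tag_span(tokens, 0, 4, "ORG"))  # ['B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O']
print(tag_span(tokens, 5, 6, "ORG"))  # ['O', 'O', 'O', 'O', 'O', 'U-ORG']
```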

The tags in the NERP NER Prosa dataset are explained as follows:

+ __ppl__ = Person
+ __plc__ = Place
+ __ind__ = Industrial
+ __evt__ = Event
+ __fnb__ = Food and Beverages

Anything outside these classes is termed as other, denoted as __O__. 

The following output shows how unbalanced the distribution of tags in `dfner` is.

In [30]:
dfner.tag.value_counts()

tag
O             32304
U-GPE           732
I-ORG           487
B-GPE           365
L-GPE           365
              ...  
I-ORDINAL         3
I-TIME            2
I-LAW             2
I-LANGUAGE        1
U-LAW             1
Name: count, Length: 73, dtype: int64

In [31]:
dfner.head(28)

Unnamed: 0,word_x,pos,tag
0,Kementerian,,B-ORG
1,pendidikan,,I-ORG
2,dan,,I-ORG
3,kebudayaan,,L-ORG
4,",",,O
5,kemendikbud,,U-ORG
6,nedheng,,O
7,mbiwarakaken,,O
8,program,,O
9,pertukaran,,O


In [32]:
dfner['sentence'] = dfner.isnull().all(axis=1).cumsum()
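
The line above numbers sentences by counting the all-empty rows that `skip_blank_lines=False` preserved: each blank row bumps the running total, so every token after it gets the next sentence id. Here is a pandas-free sketch of the same idea on made-up rows, where `None` stands in for a blank line and the function name is hypothetical:

```python
from itertools import accumulate


def sentence_ids(rows):
    """Running count of blank rows seen so far, one id per row."""
    is_blank = [all(value is None for value in row) for row in rows]
    return list(accumulate(int(blank) for blank in is_blank))


rows = [
    ("Kementerian", "B-ORG"),
    ("pendidikan", "I-ORG"),
    (None, None),            # blank separator line
    ("Pagelaran", "O"),
    ("ringgit", "O"),
]
print(sentence_ids(rows))  # [0, 0, 1, 1, 1]
```

The separator row itself receives the id of the sentence that follows it, which is harmless here because the notebook drops those rows with `dropna()` afterwards.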

In [33]:
dfner.head(28)

Unnamed: 0,word_x,pos,tag,sentence
0,Kementerian,,B-ORG,0
1,pendidikan,,I-ORG,0
2,dan,,I-ORG,0
3,kebudayaan,,L-ORG,0
4,",",,O,0
5,kemendikbud,,U-ORG,0
6,nedheng,,O,0
7,mbiwarakaken,,O,0
8,program,,O,0
9,pertukaran,,O,0


In [16]:
#d = {i: df.loc[df.sentence == i, ['word_x', 'pos','tag']] for i in range(0, df.sentence.iat[-1])}

In [34]:
df_new = dfner.dropna()

In [35]:
df_new.head(28)

Unnamed: 0,word_x,pos,tag,sentence
0,Kementerian,,B-ORG,0
1,pendidikan,,I-ORG,0
2,dan,,I-ORG,0
3,kebudayaan,,L-ORG,0
4,",",,O,0
5,kemendikbud,,U-ORG,0
6,nedheng,,O,0
7,mbiwarakaken,,O,0
8,program,,O,0
9,pertukaran,,O,0


In [36]:

# A class to retrieve the sentences from the dataset
class getsentence(object):
    
    def __init__(self, data):
        self.n_sent = 1.0
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["word_x"].values.tolist(),
                                                           s["pos"].values.tolist(),
                                                           s["tag"].values.tolist())]
        self.grouped = self.data.groupby("sentence").apply(agg_func)
        self.sentences = [s for s in self.grouped]



In [37]:
getter = getsentence(df_new)

In [38]:
sentences = getter.sentences

In [39]:
sentences[0]

[('Kementerian', ' ', 'B-ORG'),
 ('pendidikan', ' ', 'I-ORG'),
 ('dan', ' ', 'I-ORG'),
 ('kebudayaan', ' ', 'L-ORG'),
 (',', ' ', 'O'),
 ('kemendikbud', ' ', 'U-ORG'),
 ('nedheng', ' ', 'O'),
 ('mbiwarakaken', ' ', 'O'),
 ('program', ' ', 'O'),
 ('pertukaran', ' ', 'O'),
 ('kepala', ' ', 'O'),
 ('sekolah', ' ', 'O'),
 ('kangge', ' ', 'O'),
 ('sekolah', ' ', 'O'),
 ('-', ' ', 'O'),
 ('sekolah', ' ', 'O'),
 ('maju', ' ', 'O'),
 ('lan', ' ', 'O'),
 ('berpotensi', ' ', 'O'),
 ('maju', ' ', 'O'),
 ('kaliyan', ' ', 'O'),
 ('sekolah', ' ', 'O'),
 ('ingkang', ' ', 'O'),
 ('dumunung', ' ', 'O'),
 ('ing', ' ', 'O'),
 ('daerah', ' ', 'O'),
 ('terdepan', ' ', 'O'),
 (',', ' ', 'O'),
 ('terluar', ' ', 'O'),
 ('dan', ' ', 'O'),
 ('tertinggal', ' ', 'O'),
 (',', ' ', 'O'),
 ('3', ' ', 'O'),
 ('t.', ' ', 'O'),
 ('Kadosta', ' ', 'O'),
 ('kepala', ' ', 'O'),
 ('sekolah', ' ', 'O'),
 ('sd', ' ', 'B-FAC'),
 ('inpres', ' ', 'L-FAC'),
 ('kabupaten', ' ', 'B-GPE'),
 ('sorong', ' ', 'I-GPE'),
 ('papua', ' ', 

In [40]:
len(sentences)

1800

In [41]:
train_sents = sentences[0:1000]
test_sents = sentences[1001:]  # starts at 1001, so sentences[1000] ends up in neither split

In [42]:
len(train_sents)

1000

In [43]:
len(test_sents)

799
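
The two lengths add up to 1799 rather than 1800 because `sentences[0:1000]` and `sentences[1001:]` together leave out the sentence at index 1000. If a split with no gap is wanted, slicing both sides at the same index is one option. The sketch below uses a stand-in list and a hypothetical `split_at` helper:

```python
def split_at(items, n_train):
    """Split a list into (first n_train items, the rest) with no gap or overlap."""
    return items[:n_train], items[n_train:]


sentences_demo = list(range(1800))          # stand-in for the 1800 sentences
train_demo, test_demo = split_at(sentences_demo, 1000)
print(len(train_demo), len(test_demo))      # 1000 800
```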

## Conditional Random Fields

As mentioned above, NER belongs to the sequence modeling class of problems. There are different algorithms for sequence modeling; __CRFs__, or _Conditional Random Fields_, are one example. CRFs are known to perform well on NER and related tasks. In this notebook, we will develop our own NER tagger based on CRFs.

---

__Question__: What is a CRF and how does it work?

__Wikipedia__ :  CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets $X$ and $Y$, the observed and output variables, respectively; the conditional distribution $p(Y|X)$ is then modeled.

For more details, check out the paper [__Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data__](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers)
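
To make the definition a little more concrete, here is a toy linear-chain CRF in plain Python. The labels, features and weights are invented for illustration, and the normalizer is computed by brute-force enumeration of every label sequence, which only works for tiny inputs. Real implementations such as CRFsuite use dynamic programming instead.

```python
import math
from itertools import product

LABELS = ["O", "U-ORG"]

# Invented weights: state features score (observation, label) pairs,
# transition features score (previous label, label) pairs.
STATE_W = {("istitle", "U-ORG"): 2.0, ("islower", "O"): 1.0}
TRANS_W = {("U-ORG", "U-ORG"): -1.0, ("O", "O"): 0.5}


def score(obs, labels):
    """Unnormalized log-score of one label sequence for the observations."""
    total = 0.0
    for t, (o, y) in enumerate(zip(obs, labels)):
        total += STATE_W.get((o, y), 0.0)
        if t > 0:
            total += TRANS_W.get((labels[t - 1], y), 0.0)
    return total


def conditional_prob(obs, labels):
    """p(Y | X) = exp(score(X, Y)) / sum over all Y' of exp(score(X, Y'))."""
    z = sum(math.exp(score(obs, ys)) for ys in product(LABELS, repeat=len(obs)))
    return math.exp(score(obs, labels)) / z


obs = ["istitle", "islower", "islower"]
probs = {ys: conditional_prob(obs, ys) for ys in product(LABELS, repeat=len(obs))}
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

With these made-up weights the most probable labeling is `('U-ORG', 'O', 'O')`, and the probabilities over all eight candidate sequences sum to 1.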

## Prepare Data

A CRF trains on sequences of input data to learn transitions from one state (label) to another. 
To enable this, we need to define features that take the neighboring tokens into account. 
In the function ```word2features()``` below, we transform each word into a list of feature strings describing the following attributes:

+ lower case of word
+ suffix containing last 3 characters
+ suffix containing last 2 characters
+ flags for upper case, title case and digits, plus the POS tag

We also attach the same kind of attributes for the previous and next words, or a marker for beginning of sentence (BOS) or end of sentence (EOS) when there is no such neighbor.

In [44]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1,
            '-1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('BOS')
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1,
            '+1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('EOS')
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]  

In [45]:
sent2features(train_sents[0])[0]

['bias',
 'word.lower=kementerian',
 'word[-3:]=ian',
 'word[-2:]=an',
 'word.isupper=False',
 'word.istitle=True',
 'word.isdigit=False',
 'postag= ',
 'postag[:2]= ',
 'BOS',
 '+1:word.lower=pendidikan',
 '+1:word.istitle=False',
 '+1:word.isupper=False',
 '+1:postag= ',
 '+1:postag[:2]= ']

In [46]:


%%time
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]



CPU times: user 199 ms, sys: 16.8 ms, total: 216 ms
Wall time: 214 ms


## Building Models with pycrfsuite


In [47]:
#!pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git #egg=sklearn_crfsuite
!pip install python-crfsuite

Defaulting to user installation because normal site-packages is not writeable

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Train the model!

To train the model, we create a `pycrfsuite.Trainer`, load the training data and call its `train` method. First, create the trainer and load the training data into CRFsuite:

In [48]:
import pycrfsuite


In [49]:
%%time
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

CPU times: user 168 ms, sys: 7.04 ms, total: 175 ms
Wall time: 175 ms


In [50]:
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True
})

In [51]:

%%time
trainer.train('id_ner_test')

CPU times: user 14.7 s, sys: 2.35 ms, total: 14.7 s
Wall time: 14.7 s


In [52]:
trainer.logparser.last_iteration

{'num': 50,
 'scores': {},
 'loss': 8136.464438,
 'feature_norm': 64.397749,
 'error_norm': 641.81458,
 'active_features': 5681,
 'linesearch_trials': 1,
 'linesearch_step': 1.0,
 'time': 0.305}

In [53]:
print(len(trainer.logparser.iterations), trainer.logparser.iterations[-1])

50 {'num': 50, 'scores': {}, 'loss': 8136.464438, 'feature_norm': 64.397749, 'error_norm': 641.81458, 'active_features': 5681, 'linesearch_trials': 1, 'linesearch_step': 1.0, 'time': 0.305}



## Make predictions

To use the trained model, create a `pycrfsuite.Tagger`, open the model file and call its `tag` method:


In [54]:

tagger = pycrfsuite.Tagger()
tagger.open('id_ner_test')


<contextlib.closing at 0x7f0bfa9c4970>

In [55]:
def sent2tokens(sent):
    return [token for token, postag, label in sent]  

In [56]:
X_test[0][0]

['bias',
 'word.lower=pagelaran',
 'word[-3:]=ran',
 'word[-2:]=an',
 'word.isupper=False',
 'word.istitle=True',
 'word.isdigit=False',
 'postag= ',
 'postag[:2]= ',
 'BOS',
 '+1:word.lower=ringgit',
 '+1:word.istitle=False',
 '+1:word.isupper=False',
 '+1:postag= ',
 '+1:postag[:2]= ']

In [57]:


example_sent = test_sents[0]
print(' '.join(sent2tokens(example_sent)), end='\n\n')

print("Predicted:", ' '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ' '.join(sent2labels(example_sent)))



Pagelaran ringgit purwa sedalu natas , katindakaken dening dhalang Ki Seno Nugroho , mbabar lampahan Semar Seta , kanthi dipun regengaken dagelan , dening Mbok Beruk lan Dalijo .

Predicted: O O O O O O O O O O O O O O O O O O O O O O O O B-ORG L-ORG O O O
Correct:   O O O O O O O O O B-PERSON I-PERSON L-PERSON O O O O O O O O O O O O B-PERSON I-PERSON I-PERSON L-PERSON O



## Evaluate the model

In [58]:
from itertools import chain
import nltk
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn

In [59]:

def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb = LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )
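
The function above produces a token-level report: every token is scored on its own, and `O` is left out of the rows. As a minimal pure-Python sketch of what a single row of such a report measures, using made-up label sequences and a hypothetical helper name:

```python
def token_prf(y_true, y_pred, tag):
    """Token-level precision, recall and F1 for a single tag."""
    pairs = [(t, p) for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)]
    tp = sum(1 for t, p in pairs if t == tag and p == tag)
    n_pred = sum(1 for _, p in pairs if p == tag)
    n_true = sum(1 for t, _ in pairs if t == tag)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


y_true_demo = [["O", "B-PERSON", "L-PERSON"], ["B-PERSON", "L-PERSON", "O"]]
y_pred_demo = [["O", "B-PERSON", "O"], ["O", "O", "O"]]
print(token_prf(y_true_demo, y_pred_demo, "B-PERSON"))
```

In this toy case one of the two gold `B-PERSON` tokens is found and nothing is predicted wrongly, so precision is 1.0 and recall is 0.5.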


In [60]:
%%time
y_pred = [tagger.tag(xseq) for xseq in X_test]

CPU times: user 295 ms, sys: 12.2 ms, total: 307 ms
Wall time: 305 ms


In [61]:

print(bio_classification_report(y_test, y_pred))


               precision    recall  f1-score   support

            -       0.00      0.00      0.00        37
   B-CARDINAL       0.00      0.00      0.00        10
   L-CARDINAL       0.00      0.00      0.00        10
   U-CARDINAL       0.08      0.07      0.08        28
       B-DATE       0.71      0.44      0.55        79
       I-DATE       0.91      0.59      0.72       120
       L-DATE       0.88      0.54      0.67        79
       U-DATE       0.33      0.43      0.38        14
      B-EVENT       0.65      0.38      0.48        91
      I-EVENT       0.60      0.43      0.50        79
      L-EVENT       0.48      0.29      0.36        91
      U-EVENT       0.48      0.41      0.44        54
        B-FAC       0.53      0.10      0.17        99
        I-FAC       1.00      0.03      0.07        58
        L-FAC       0.58      0.11      0.19        99
        U-FAC       0.75      0.13      0.22        23
        B-GPE       0.64      0.38      0.48       194
        I

  _warn_prf(average, modifier, msg_start, len(result))


In [63]:
from seqeval.metrics import classification_report,f1_score
from seqeval.scheme import BILOU

In [64]:
print(classification_report(y_test, y_pred, scheme="BILOU"))

              precision    recall  f1-score   support

    CARDINAL       0.08      0.05      0.06        38
        DATE       0.58      0.42      0.49        93
       EVENT       0.48      0.33      0.39       145
         FAC       0.52      0.10      0.17       122
         GPE       0.57      0.49      0.53       482
    LANGUAGE       0.00      0.00      0.00         2
         LAW       0.00      0.00      0.00         3
         LOC       0.00      0.00      0.00        11
       MONEY       0.93      0.65      0.76        40
        NORP       0.00      0.00      0.00        16
     ORDINAL       0.00      0.00      0.00        22
         ORG       0.42      0.36      0.39       197
     PERCENT       0.00      0.00      0.00         4
      PERSON       0.74      0.58      0.65       183
     PRODUCT       0.29      0.26      0.27       118
    QUANTITY       0.50      0.09      0.15        56
        TIME       0.00      0.00      0.00         9
 WORK_OF_ART       0.17    

  _warn_prf(average, modifier, msg_start, len(result))
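
Unlike the token-level report, an entity-level score counts a prediction as correct only when the whole span and its type match. The sketch below is a simplified stand-in for that idea, not seqeval's implementation: it decodes BILOU sequences into `(type, start, end)` spans and compares the resulting sets. Function names and data are made up.

```python
def bilou_spans(tags):
    """Decode one BILOU sequence into a set of (type, start, end) spans.

    Simplification: ill-formed chunks (e.g. B- without a closing L-) are dropped.
    """
    spans, start, kind = set(), None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            spans.add((label, i, i + 1))
            start = None
        elif prefix == "B":
            start, kind = i, label
        elif prefix == "L" and start is not None and label == kind:
            spans.add((label, start, i + 1))
            start = None
        elif not (prefix == "I" and start is not None and label == kind):
            start = None   # O, or a tag that breaks the open chunk
    return spans


def entity_prf(y_true, y_pred):
    """Exact-match precision, recall and F1 over all sentences."""
    tp = n_pred = n_true = 0
    for ts, ps in zip(y_true, y_pred):
        gold, pred = bilou_spans(ts), bilou_spans(ps)
        tp += len(gold & pred)
        n_pred += len(pred)
        n_true += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


gold_demo = [["O", "B-PERSON", "I-PERSON", "L-PERSON", "O", "U-GPE"]]
pred_demo = [["O", "B-PERSON", "L-PERSON", "O", "O", "U-GPE"]]
print(entity_prf(gold_demo, pred_demo))  # (0.5, 0.5, 0.5)
```

The predicted `PERSON` span is one token short, so it earns no credit at the entity level even though two of its three tokens carry a `PERSON` tag.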


In [62]:
from collections import Counter
info = tagger.info()

def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-6s -> %-7s %0.6f" % (label_from, label_to, weight))

print("Top likely transitions:")
print_transitions(Counter(info.transitions).most_common(15))

print("\nTop unlikely transitions:")
print_transitions(Counter(info.transitions).most_common()[-15:])

Top likely transitions:
B-LANGUAGE -> L-LANGUAGE 7.102971
B-DATE -> I-DATE  6.341949
B-DATE -> L-DATE  6.328808
I-DATE -> L-DATE  6.324264
B-CARDINAL -> L-CARDINAL 6.305701
B-ORG  -> I-ORG   6.154280
B-PRODUCT -> L-PRODUCT 5.988394
I-ORG  -> I-ORG   5.955453
I-DATE -> I-DATE  5.921284
B-ORG  -> L-ORG   5.915820
B-QUANTITY -> L-QUANTITY 5.913940
I-EVENT -> L-EVENT 5.870193
B-PERCENT -> L-PERCENT 5.835222
I-LOC  -> I-LOC   5.799386
B-PERSON -> I-PERSON 5.781363

Top unlikely transitions:
B-EVENT -> O       -2.639390
B-PERSON -> O       -2.640114
I-LOC  -> O       -2.650887
B-FAC  -> O       -2.662632
I-PRODUCT -> O       -2.691365
O      -> L-PERSON -2.773893
I-ORG  -> O       -2.828404
B-PRODUCT -> O       -2.925748
O      -> I-WORK_OF_ART -2.934128
O      -> I-EVENT -2.991566
I-EVENT -> O       -3.045788
O      -> I-PRODUCT -3.063498
O      -> I-FAC   -3.092823
I-WORK_OF_ART -> O       -3.112923
I-GPE  -> O       -3.186670
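
Several of the strongly negative weights above correspond to moves that a well-formed BILOU sequence can never make, such as `O -> L-PERSON` or `I-ORG -> O`. A rough sketch of that rule as a predicate (the function is hypothetical and ignores sentence boundaries):

```python
def valid_bilou_transition(prev_tag, tag):
    """True if `tag` may directly follow `prev_tag` in a well-formed BILOU sequence."""
    p_prefix, _, p_type = prev_tag.partition("-")
    prefix, _, type_ = tag.partition("-")
    if p_prefix in ("B", "I"):
        # An open chunk must continue with I- or close with L- of the same type.
        return prefix in ("I", "L") and type_ == p_type
    # After O, L- or U- no chunk is open, so only O, B- or U- can follow.
    return prefix in ("O", "B", "U")


for prev_tag, tag in [("B-ORG", "I-ORG"), ("O", "L-PERSON"), ("I-ORG", "O")]:
    print(prev_tag, "->", tag, valid_bilou_transition(prev_tag, tag))
```

The first pair is reported as valid and the other two as invalid, which matches the sign of the learned weights for those transitions.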


In [49]:


def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.6f %-6s %s" % (weight, label, attr))    

print("Top positive:")
print_state_features(Counter(info.state_features).most_common(20))

print("\nTop negative:")
print_state_features(Counter(info.state_features).most_common()[-20:])



Top positive:
8.961001 B-FNB  word.lower=nasi
6.555337 B-IND  word.lower=arsenal
5.533903 B-PLC  -1:word.lower=berdarah
5.200576 B-IND  word.lower=skuter
5.124560 B-FNB  word.lower=rendang
4.887663 B-EVT  word.lower=wisuda
4.841133 B-FNB  word.lower=sate
4.819712 B-IND  word.lower=instagram
4.565449 B-PLC  word.lower=pakistan
4.549857 B-FNB  word.lower=pisang
4.423253 B-EVT  word.lower=pilkada
4.338747 B-PLC  word.lower=kecamatan
4.332411 O      postag=B-CSN
4.310374 B-EVT  +1:word.lower=olahraga
4.301074 B-IND  word.lower=leicester
4.267920 B-PLC  word.lower=singkawang
4.144854 B-IND  -1:word.lower=ka
4.099111 O      postag=B-SYM
4.092291 B-PLC  word.lower=singapura
3.890772 B-EVT  word.lower=olimpiade

Top negative:
-1.651780 I-PLC  -1:word.lower=indonesia
-1.655243 I-PLC  postag=B-SYM
-1.681434 B-PPL  word[-3:]=com
-1.684919 O      +1:word.lower=â
-1.696493 O      word.lower=predator
-1.717512 O      word[-2:]=hi
-1.723345 O      word.lower=mata
-1.761105 O      word[-3:]=gia
-1.7