
# CRF (Conditional Random Field) based NER (Named Entity Recognition)

**CRF**
Consider part-of-speech (POS) tagging, a representative sequential labeling task.

* sequential labeling: classification of sequential data, where each input is a sequence rather than a single vector

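A CRF uses the surrounding words and their tag information. For a word such as '너' ('you'), it combines the words before and after it with the tag assigned to the preceding position to make a more accurate tagging decision.

For a word sequence of length n, a CRF does not perform n separate classifications. It performs a single classification over the whole sequence, taking the full context into account, and this resolves the label bias problem of the MEMM (Maximum Entropy Markov Model).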
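The contrast with the MEMM can be written out as follows. This is the standard textbook formulation of the two models, not something taken from this notebook; the symbols $\lambda_k$ (weights), $f_k$ (feature functions) and $Z$ (normalizers) are notation introduced here for illustration.

```latex
% MEMM: one locally normalized classifier per position
p_{\text{MEMM}}(y \mid x) = \prod_{i=1}^{n}
  \frac{\exp\left(\sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i)\right)}{Z(y_{i-1}, x, i)}

% Linear-chain CRF: one global normalizer over all label sequences
p_{\text{CRF}}(y \mid x) = \frac{1}{Z(x)}
  \exp\left(\sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y_{i-1}, y_i, x, i)\right),
\qquad
Z(x) = \sum_{y'} \exp\left(\sum_{i=1}^{n} \sum_{k} \lambda_k f_k(y'_{i-1}, y'_i, x, i)\right)
```

Because the MEMM normalizes at every position, the probability mass leaving a state always sums to one regardless of the observation, which is the source of label bias. The CRF normalizes once over whole label sequences, so scores at different positions can trade off against each other.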

**potential function**

Each of the n words in the sequence is represented as a high-dimensional sparse vector.

Each potential function works like a kind of Boolean filter.
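A minimal sketch of the "Boolean filter" idea, assuming a hand-written indicator function and a toy vocabulary. The names `feature_fn` and `vectorize` and the feature strings are illustrative assumptions, not part of python-crfsuite: each feature either fires (1) or does not (0) for a given position and label, and a position is then encoded by the indices of the features that fire.

```python
# Illustrative sketch only: function names and toy data are assumptions,
# not python-crfsuite API.

def feature_fn(prev_label, label, words, i):
    """Indicator: fires when the current word is title-cased and labeled B-LOC."""
    return 1 if words[i].istitle() and label == "B-LOC" else 0


def vectorize(active_features, vocabulary):
    """Encode string features as a sparse vector: sorted indices of active dimensions."""
    return sorted(vocabulary[f] for f in active_features if f in vocabulary)


words = ["Melbourne", "(", "Australia", ")"]

# The indicator acts as a filter over (position, label) combinations.
fired = [feature_fn(None, "B-LOC", words, i) for i in range(len(words))]
print(fired)  # [1, 0, 1, 0]

# A toy vocabulary mapping each feature string to one dimension.
vocabulary = {
    "bias": 0,
    "word.lower=melbourne": 1,
    "word.istitle=True": 2,
    "word.lower=australia": 3,
    "BOS": 4,
}
sparse = vectorize(
    ["bias", "word.lower=melbourne", "word.istitle=True", "BOS"], vocabulary
)
print(sparse)  # [0, 1, 2, 4]
```

In a real model the vocabulary has one dimension per distinct feature string, so each position activates only a handful of dimensions out of a very large number.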


In [20]:
!pip install python-crfsuite
import nltk
import pycrfsuite
import warnings
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import preprocessing
from itertools import chain



In [21]:
nltk.download('conll2002')
# esp: Spanish data, ned: Dutch data
print(nltk.corpus.conll2002.fileids())

[nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!
['esp.testa', 'esp.testb', 'esp.train', 'ned.testa', 'ned.testb', 'ned.train']


In [22]:
# Each token is a (word, POS tag, NER tag) tuple
train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))
test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))

In [23]:
train_sents[0]

[('Melbourne', 'NP', 'B-LOC'),
 ('(', 'Fpa', 'O'),
 ('Australia', 'NP', 'B-LOC'),
 (')', 'Fpt', 'O'),
 (',', 'Fc', 'O'),
 ('25', 'Z', 'O'),
 ('may', 'NC', 'O'),
 ('(', 'Fpa', 'O'),
 ('EFE', 'NC', 'B-ORG'),
 (')', 'Fpt', 'O'),
 ('.', 'Fp', 'O')]
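To make the IOB tags in this output easier to read, the following sketch groups tagged tokens into entity spans. `iob_to_spans` is a hypothetical helper written for illustration, not part of NLTK or python-crfsuite; it assumes `B-` starts an entity, `I-` continues it, and `O` is outside any entity.

```python
def iob_to_spans(sent):
    """Group (word, pos, iob_tag) triples into (entity_type, text) spans."""
    spans, current_type, current_words = [], None, []
    for word, _pos, tag in sent:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open span, then start a new one.
            # This also handles an I- tag that has no preceding B- tag.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-"):
            current_words.append(word)
        else:
            # "O" closes any open span.
            if current_words:
                spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        spans.append((current_type, " ".join(current_words)))
    return spans


# The first training sentence, copied from the output above.
sample = [
    ("Melbourne", "NP", "B-LOC"), ("(", "Fpa", "O"), ("Australia", "NP", "B-LOC"),
    (")", "Fpt", "O"), (",", "Fc", "O"), ("25", "Z", "O"), ("may", "NC", "O"),
    ("(", "Fpa", "O"), ("EFE", "NC", "B-ORG"), (")", "Fpt", "O"), (".", "Fp", "O"),
]
print(iob_to_spans(sample))
# [('LOC', 'Melbourne'), ('LOC', 'Australia'), ('ORG', 'EFE')]
```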

# **PyCRFSuite official tutorial**
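python-crfsuite is a library that makes CRFsuite, which is implemented in C/C++, usable from Python.
We will train an NER model on the CoNLL 2002 dataset, following the [official tutorial](https://github.com/scrapinghub/python-crfsuite/blob/master/examples/CoNLL%202002.ipynb).

To use it, we need to design the potential functions, i.e. the features extracted for each token.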

**1. Training with all of the available features**

In [24]:
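# For the token at position i, use its lowercased form, its last 2 and 3 characters,
# and its POS tag, plus the lowercased form and POS tag of the neighbors at i-1 and i+1.
# In Latin-derived languages, suffixes are a useful hint.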
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = [
        'bias',
        'word.lower=' + word.lower(), 
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag,
        'postag[:2]=' + postag[:2],
    ]
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:postag=' + postag1,
            '-1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('BOS')
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:postag=' + postag1,
            '+1:postag[:2]=' + postag1[:2],
        ])
    else:
        features.append('EOS')
                
    return features

In [25]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [26]:
sent2features(train_sents[0])[0]

['bias',
 'word.lower=melbourne',
 'word[-3:]=rne',
 'word[-2:]=ne',
 'word.isupper=False',
 'word.istitle=True',
 'word.isdigit=False',
 'postag=NP',
 'postag[:2]=NP',
 'BOS',
 '+1:word.lower=(',
 '+1:word.istitle=False',
 '+1:word.isupper=False',
 '+1:postag=Fpa',
 '+1:postag[:2]=Fp']

In [27]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

In [28]:
print(X_train[0])

[['bias', 'word.lower=melbourne', 'word[-3:]=rne', 'word[-2:]=ne', 'word.isupper=False', 'word.istitle=True', 'word.isdigit=False', 'postag=NP', 'postag[:2]=NP', 'BOS', '+1:word.lower=(', '+1:word.istitle=False', '+1:word.isupper=False', '+1:postag=Fpa', '+1:postag[:2]=Fp'], ['bias', 'word.lower=(', 'word[-3:]=(', 'word[-2:]=(', 'word.isupper=False', 'word.istitle=False', 'word.isdigit=False', 'postag=Fpa', 'postag[:2]=Fp', '-1:word.lower=melbourne', '-1:word.istitle=True', '-1:word.isupper=False', '-1:postag=NP', '-1:postag[:2]=NP', '+1:word.lower=australia', '+1:word.istitle=True', '+1:word.isupper=False', '+1:postag=NP', '+1:postag[:2]=NP'], ['bias', 'word.lower=australia', 'word[-3:]=lia', 'word[-2:]=ia', 'word.isupper=False', 'word.istitle=True', 'word.isdigit=False', 'postag=NP', 'postag[:2]=NP', '-1:word.lower=(', '-1:word.istitle=False', '-1:word.isupper=False', '-1:postag=Fpa', '-1:postag[:2]=Fp', '+1:word.lower=)', '+1:word.istitle=False', '+1:word.isupper=False', '+1:postag=

In [29]:
# Append each (feature sequence, label sequence) pair to the trainer to prepare for training.
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

In [30]:
# Use only features that occur at least five times (see 'feature.minfreq' below)
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True,
    
    # minimum frequency
    'feature.minfreq': 5
})
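As a rough guide to what `c1` and `c2` control, the training objective can be sketched as a penalized negative log-likelihood. This is the usual elastic-net form and is given here as an assumption about the general shape of the objective; the exact scaling of each term is determined by CRFsuite itself. The symbols $w$ (the weight vector) and $D$ (the training set) are notation introduced here.

```latex
\min_{w} \;
  -\sum_{(x, y) \in D} \log p_w(y \mid x)
  \;+\; c_1 \lVert w \rVert_1
  \;+\; c_2 \lVert w \rVert_2^2
```

With `c1 = 1.0` and `c2 = 1e-3` as set above, the L1 term dominates. L1 regularization tends to drive many weights to exactly zero, which yields a sparser model, while the small L2 term mainly keeps the remaining weights from growing without bound.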

In [31]:
trainer.train('conll2002-esp.crfsuite')
tagger = pycrfsuite.Tagger()
tagger.open('conll2002-esp.crfsuite')

<contextlib.closing at 0x7f7c258b7c90>

In [32]:
example_sent = test_sents[0]
print(' '.join(sent2tokens(example_sent)), end='\n\n')

print("Predicted:", ', '.join(tagger.tag(sent2features(example_sent))))
print("Correct:  ", ', '.join(sent2labels(example_sent)))

La Coruña , 23 may ( EFECOM ) .

Predicted: B-LOC, I-LOC, O, O, O, O, B-ORG, O, O
Correct:   B-LOC, I-LOC, O, O, O, O, B-ORG, O, O


In [33]:
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import preprocessing

def bio_classification_report(y_true, y_pred):
    """
    Classification report for a list of BIO-encoded sequences.
    It computes token-level metrics and discards "O" labels.
    
    Note that it requires scikit-learn 0.15+ (or a version from github master)
    to calculate averages properly!
    """
    lb=preprocessing.LabelBinarizer()
    y_true_combined = lb.fit_transform(list(chain.from_iterable(y_true)))
    y_pred_combined = lb.transform(list(chain.from_iterable(y_pred)))
        
    tagset = set(lb.classes_) - {'O'}
    tagset = sorted(tagset, key=lambda tag: tag.split('-', 1)[::-1])
    class_indices = {cls: idx for idx, cls in enumerate(lb.classes_)}
    
    return classification_report(
        y_true_combined,
        y_pred_combined,
        labels = [class_indices[cls] for cls in tagset],
        target_names = tagset,
    )
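To make explicit what "token-level metrics" means in the docstring above, here is a minimal pure-Python sketch that computes precision, recall and F1 for a single tag. `token_prf` and the toy sequences are hypothetical and written only for illustration; the report itself is produced by scikit-learn.

```python
from itertools import chain


def token_prf(y_true, y_pred, tag):
    """Token-level precision, recall and F1 for one tag over lists of tag sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    # True positives: positions where both gold and prediction carry the tag.
    tp = sum(t == tag and p == tag for t, p in zip(true_flat, pred_flat))
    n_pred = sum(p == tag for p in pred_flat)
    n_true = sum(t == tag for t in true_flat)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (not from the corpus): two sentences of gold and predicted tags.
y_true_toy = [["B-LOC", "I-LOC", "O"], ["B-LOC", "O"]]
y_pred_toy = [["B-LOC", "O", "O"], ["O", "B-LOC"]]
print(token_prf(y_true_toy, y_pred_toy, "B-LOC"))  # (0.5, 0.5, 0.5)
```

Token-level scoring gives partial credit for partially tagged entities, so these numbers are not directly comparable to entity-level scores, which count an entity as correct only when its full span and type match.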

In [34]:
y_true = y_test
y_pred = []
for sent in test_sents:
    y_pred.append(tagger.tag(sent2features(sent)))

In [35]:
print(bio_classification_report(y_true, y_pred))

              precision    recall  f1-score   support

       B-LOC       0.74      0.71      0.73      1084
       I-LOC       0.57      0.51      0.54       325
      B-MISC       0.61      0.37      0.46       339
      I-MISC       0.59      0.43      0.50       557
       B-ORG       0.76      0.78      0.77      1400
       I-ORG       0.78      0.76      0.77      1104
       B-PER       0.77      0.87      0.82       735
       I-PER       0.83      0.94      0.88       634

   micro avg       0.75      0.72      0.73      6178
   macro avg       0.71      0.67      0.68      6178
weighted avg       0.74      0.72      0.73      6178
 samples avg       0.09      0.09      0.09      6178

**2. Training with a restricted set of features**

Use only bias, word[-3:], word[-2:], and the lowercased previous and next words (plus the BOS/EOS markers).

In [38]:
def word2features(sent, i):
    # Restricted feature set: only the suffixes of the current word and the
    # lowercased neighboring words. POS tags are deliberately not used.
    word = sent[i][0]
    features = [
        'bias',
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
    ]
    if i > 0:
        # previous word
        features.append('-1:word.lower=' + sent[i-1][0].lower())
    else:
        features.append('BOS')  # beginning of sentence

    if i < len(sent)-1:
        # next word
        features.append('+1:word.lower=' + sent[i+1][0].lower())
    else:
        features.append('EOS')  # end of sentence

    return features

In [39]:
def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

In [40]:
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

In [41]:
trainer = pycrfsuite.Trainer(verbose=False)

for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

In [42]:
trainer.set_params({
    'c1': 1.0,   # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # stop earlier

    # include transitions that are possible, but not observed
    'feature.possible_transitions': True,
    
    # minimum frequency
    'feature.minfreq': 5
})

In [43]:
trainer.train('conll2002-esp.crfsuite')
tagger = pycrfsuite.Tagger()
tagger.open('conll2002-esp.crfsuite')

<contextlib.closing at 0x7f7c2ddf7f90>

In [44]:
y_true = y_test
y_pred = []
for sent in test_sents:
    y_pred.append(tagger.tag(sent2features(sent)))

In [46]:
print(bio_classification_report(y_true, y_pred))

              precision    recall  f1-score   support

       B-LOC       0.69      0.49      0.58      1084
       I-LOC       0.60      0.47      0.52       325
      B-MISC       0.52      0.20      0.29       339
      I-MISC       0.52      0.36      0.43       557
       B-ORG       0.74      0.55      0.63      1400
       I-ORG       0.71      0.52      0.60      1104
       B-PER       0.83      0.69      0.76       735
       I-PER       0.86      0.86      0.86       634

   micro avg       0.72      0.54      0.62      6178
   macro avg       0.68      0.52      0.58      6178
weighted avg       0.71      0.54      0.61      6178
 samples avg       0.07      0.07      0.07      6178

**Inspecting the model**

Inspect the most influential features and the weight learned for each of them.

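Note that `tagger` at this point is the most recently trained model, i.e. the restricted-feature model (both training runs write to `conll2002-esp.crfsuite`), which is why only suffix and neighboring-word features appear in the output.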

In NER tagging, the words immediately before and after a token carry important information.
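One way to check this kind of claim is to group the learned weights by feature template and compare the strongest weight in each group. The sketch below assumes a dictionary keyed by `(attribute, label)` pairs, the same shape as `tagger.info().state_features`; the helper names are hypothetical, and the stand-in weights are a few entries copied from this notebook's printed output.

```python
# Stand-in for tagger.info().state_features: {(attribute, label): weight}.
# The values are a few entries copied from the output printed in this notebook.
weights = {
    ("-1:word.lower=despejado", "B-LOC"): 6.919385,
    ("word[-3:]=yun", "B-LOC"): 5.874011,
    ("-1:word.lower=palacio", "I-LOC"): 5.86573,
    ("+1:word.lower=cairo", "B-LOC"): 4.342166,
}


def template_of(attribute):
    """Feature template: the part of the attribute string before the first '='."""
    return attribute.split("=", 1)[0]


def max_weight_by_template(weights, label_substring):
    """Largest weight per feature template, restricted to matching labels."""
    best = {}
    for (attribute, label), weight in weights.items():
        if label_substring not in label:
            continue
        key = template_of(attribute)
        if key not in best or weight > best[key]:
            best[key] = weight
    return best


print(max_weight_by_template(weights, "LOC"))
# {'-1:word.lower': 6.919385, 'word[-3:]': 5.874011, '+1:word.lower': 4.342166}
```

Passing the real `state_features` dictionary instead of the stand-in would give the same summary over the full model.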

In [47]:
debugger = tagger.info()
weights = debugger.state_features
location_features = {feature:weight for feature, weight in weights.items() if 'LOC' in feature[1]}

for feature, weight in sorted(location_features.items(), key=lambda x:-x[1])[:50]:
    print('{} : {}'.format(feature, weight))

('-1:word.lower=despejado', 'B-LOC') : 6.919385
('-1:word.lower=efe-cantabria', 'B-LOC') : 6.274558
('word[-3:]=yun', 'B-LOC') : 5.874011
('-1:word.lower=palacio', 'I-LOC') : 5.86573
('-1:word.lower=puente', 'I-LOC') : 5.553516
('-1:word.lower=costa', 'I-LOC') : 5.458388
('-1:word.lower=avenida', 'I-LOC') : 5.372484
('word[-3:]=nón', 'B-LOC') : 5.322154
('word[-3:]=iés', 'B-LOC') : 5.147951
('-1:word.lower=nuboso', 'B-LOC') : 5.10912
('word[-3:]=ael', 'B-LOC') : 4.857369
('-1:word.lower=cantabria', 'B-LOC') : 4.785114
('-1:word.lower=santa', 'I-LOC') : 4.763376
('-1:word.lower=parque', 'I-LOC') : 4.587954
('word[-3:]=kio', 'B-LOC') : 4.379538
('+1:word.lower=cairo', 'B-LOC') : 4.342166
('+1:word.lower=coruña', 'B-LOC') : 4.315112
('+1:word.lower=unido', 'B-LOC') : 3.890058
('word[-3:]=lmo', 'B-LOC') : 3.739574
('-1:word.lower=paseo', 'I-LOC') : 3.709889
('-1:word.lower=bulevar', 'I-LOC') : 3.681638
('-1:word.lower=lluvioso', 'B-LOC') : 3.674013
('word[-3:]=uay', 'B-LOC') : 3.642079
('w