<a href="https://colab.research.google.com/github/MasahiroAraki/MLCourse/blob/master/Python/answer/13a_sequence.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise

Pick a few features that look likely to be effective, then run experiments with those features removed to evaluate how much they contribute.

Download the dataset. It is the GMB (Groningen Meaning Bank) corpus (English text) annotated with part-of-speech tags and named-entity tags.

In [1]:
import pandas as pd

df = pd.read_csv('https://github.com/dipanjanS/nlp_workshop_dhs18/raw/master/Unit%2008%20-%20Project%206%20-%20Build%20your%20NER%20Tagger/ner_dataset.csv.gz', compression='gzip', encoding='ISO-8859-1')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  47959 non-null    object
 1   Word        1048575 non-null  object
 2   POS         1048575 non-null  object
 3   Tag         1048575 non-null  object
dtypes: object(4)
memory usage: 32.0+ MB


The sentence number (Sentence #) is attached only to the first word of each sentence, so assign the same number to the following words of that sentence.

In [2]:
df = df.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  1048575 non-null  object
 1   Word        1048575 non-null  object
 2   POS         1048575 non-null  object
 3   Tag         1048575 non-null  object
dtypes: object(4)
memory usage: 32.0+ MB
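
To make the effect of the forward fill concrete, here is a minimal sketch on a toy frame. The words and the `toy` variable are made up for illustration; only the column layout follows the dataset.

```python
import pandas as pd

# Toy frame: the sentence label appears only on the first word of each sentence.
toy = pd.DataFrame({
    'Sentence #': ['Sentence: 1', None, None, 'Sentence: 2', None],
    'Word': ['a', 'b', 'c', 'd', 'e'],
})

# Forward fill copies the last seen label down onto the missing rows.
toy['Sentence #'] = toy['Sentence #'].ffill()
print(toy)
```

After the fill, every row carries the label of the sentence it belongs to, which is what the later group-by step relies on.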


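Define a function that computes the feature values for each word in a sentence. The features are listed below; only the first and last words of a sentence are handled differently with respect to their neighboring words.

* The lowercased surface form of the word
* The last three characters of the word
* The last two characters of the word
* Whether the word is all uppercase, whether only its first letter is uppercase, whether it is numeric, and its part-of-speech tag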

In [3]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
            'BOS' : False,
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
            'EOS' : False
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

Build a list of sentences (sentences), where each sentence is a list of (word, POS tag, NE tag) tuples.

In [4]:
agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(), 
                                                   s['POS'].values.tolist(), 
                                                   s['Tag'].values.tolist())]

In [5]:
grouped_df = df.groupby('Sentence #').apply(agg_func)

In [6]:
print(grouped_df[grouped_df.index == 'Sentence: 1'].values)

[list([('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')])]


In [7]:
grouped_df.shape

(47959,)

In [8]:
sentences = [s for s in grouped_df]
sentences[0]

[('Thousands', 'NNS', 'O'),
 ('of', 'IN', 'O'),
 ('demonstrators', 'NNS', 'O'),
 ('have', 'VBP', 'O'),
 ('marched', 'VBN', 'O'),
 ('through', 'IN', 'O'),
 ('London', 'NNP', 'B-geo'),
 ('to', 'TO', 'O'),
 ('protest', 'VB', 'O'),
 ('the', 'DT', 'O'),
 ('war', 'NN', 'O'),
 ('in', 'IN', 'O'),
 ('Iraq', 'NNP', 'B-geo'),
 ('and', 'CC', 'O'),
 ('demand', 'VB', 'O'),
 ('the', 'DT', 'O'),
 ('withdrawal', 'NN', 'O'),
 ('of', 'IN', 'O'),
 ('British', 'JJ', 'B-gpe'),
 ('troops', 'NNS', 'O'),
 ('from', 'IN', 'O'),
 ('that', 'DT', 'O'),
 ('country', 'NN', 'O'),
 ('.', '.', 'O')]

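Take part of a sentence and check that the feature values are computed correctly. Because this is only a slice, the BOS and EOS values, which mark the beginning and end of a sentence, differ from the values they would have in the full sentence.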

In [9]:
sent2features(sentences[0][5:7])

[{'+1:postag': 'NNP',
  '+1:postag[:2]': 'NN',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:word.lower()': 'london',
  'BOS': True,
  'EOS': False,
  'bias': 1.0,
  'postag': 'IN',
  'postag[:2]': 'IN',
  'word.isdigit()': False,
  'word.istitle()': False,
  'word.isupper()': False,
  'word.lower()': 'through',
  'word[-2:]': 'gh',
  'word[-3:]': 'ugh'},
 {'-1:postag': 'IN',
  '-1:postag[:2]': 'IN',
  '-1:word.istitle()': False,
  '-1:word.isupper()': False,
  '-1:word.lower()': 'through',
  'BOS': False,
  'EOS': True,
  'bias': 1.0,
  'postag': 'NNP',
  'postag[:2]': 'NN',
  'word.isdigit()': False,
  'word.istitle()': True,
  'word.isupper()': False,
  'word.lower()': 'london',
  'word[-2:]': 'on',
  'word[-3:]': 'don'}]

Check the gold labels.

In [10]:
sent2labels(sentences[0][5:7])

['O', 'B-geo']

Split the dataset into a training set and a test set.

In [11]:
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([sent2features(s) for s in sentences], dtype=object)
y = np.array([sent2labels(s) for s in sentences], dtype=object)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train.shape, X_test.shape

((35969,), (11990,))

Train the model with [CRFsuite (python-crfsuite)](https://github.com/scrapinghub/python-crfsuite), used here through the sklearn-crfsuite wrapper.

In [12]:
!pip install sklearn-crfsuite



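Create a CRF instance. The hyperparameters have the following meanings.

* algorithm: the optimization algorithm
* c1: the weight of the L1 regularization term
* c2: the weight of the L2 regularization term
* max_iterations: the upper limit on the number of iterations
* all_possible_transitions: also generate transition features for label pairs that do not occur in the training data
* verbose: print detailed information about the training process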
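
As a rough guide to what c1 and c2 control, the training objective can be sketched as a regularized negative log-likelihood. This assumes the usual elastic-net form; the exact scaling of each term inside CRFsuite is an assumption here, and $w$, $x^{(n)}$, $y^{(n)}$ are notation introduced only for this sketch.

$$
\min_{w} \; -\sum_{n} \log p\left(y^{(n)} \mid x^{(n)}; w\right) + c_1 \lVert w \rVert_1 + c_2 \lVert w \rVert_2^2
$$

A larger c1 tends to push more feature weights to exactly zero, while a larger c2 shrinks all weights toward zero without making them sparse.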


In [13]:
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(algorithm='lbfgs',
                           c1=0.1,
                           c2=0.1,
                           max_iterations=50,
                           all_possible_transitions=True,
                           verbose=True)

In [14]:
crf.fit(X_train, y_train)

loading training data to CRFsuite: 100%|██████████| 35969/35969 [00:11<00:00, 3004.01it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 133645
Seconds required: 2.122

L-BFGS optimization
c1: 0.100000
c2: 0.100000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=3.81  loss=1263996.98 active=132653 feature_norm=1.00
Iter 2   time=3.96  loss=994013.75 active=131310 feature_norm=4.42
Iter 3   time=1.99  loss=776364.07 active=125987 feature_norm=3.87
Iter 4   time=9.85  loss=421961.16 active=127027 feature_norm=3.24
Iter 5   time=2.00  loss=354425.51 active=129045 feature_norm=4.04
Iter 6   time=2.00  loss=260017.69 active=122707 feature_norm=6.18
Iter 7   time=2.00  loss=219836.67 active=115755 feature_norm=7.98
Iter 8   time=1.99  loss=196203.95 active=110953 feature_norm=8.86
Iter 9   time=2.00  loss=177983.48 active=106540 feature_norm=



CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=50,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=True)

Evaluate performance on all tags other than O.


In [15]:
y_pred = crf.predict(X_test)

In [16]:
from sklearn_crfsuite import metrics as crf_metrics

labels = list(crf.classes_)
labels.remove('O')
print(crf_metrics.flat_classification_report(y_test, y_pred, labels=labels))

              precision    recall  f1-score   support

       B-org       0.81      0.73      0.77      5116
       B-per       0.85      0.83      0.84      4239
       I-per       0.86      0.90      0.88      4273
       B-geo       0.86      0.92      0.89      9403
       I-geo       0.82      0.80      0.81      1826
       B-tim       0.93      0.88      0.91      5095
       I-org       0.81      0.80      0.80      4195
       B-gpe       0.98      0.94      0.96      3961
       I-tim       0.84      0.80      0.82      1604
       B-nat       0.67      0.25      0.37        55
       B-eve       0.48      0.33      0.39        80
       B-art       0.31      0.11      0.16       102
       I-art       0.10      0.02      0.04        90
       I-eve       0.40      0.19      0.26        74
       I-gpe       0.95      0.50      0.65        36
       I-nat       1.00      0.22      0.36        18

   micro avg       0.86      0.85      0.86     40167
   macro avg       0.73   
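
To show what a micro average over the non-O labels means, here is a minimal pure-Python sketch. The helper names `flatten` and `micro_prf` and the toy label sequences are hypothetical; this is not the code used by the report above, only an illustration of the calculation under the usual definition of micro-averaged precision and recall.

```python
from itertools import chain


def flatten(sequences):
    """Concatenate per-sentence label lists into one flat list."""
    return list(chain.from_iterable(sequences))


def micro_prf(y_true, y_pred, labels):
    """Micro-averaged precision, recall and F1 restricted to `labels`."""
    true_flat, pred_flat = flatten(y_true), flatten(y_pred)
    wanted = set(labels)
    # A true positive is a token whose predicted label is correct and in `labels`.
    tp = sum(1 for t, p in zip(true_flat, pred_flat) if t == p and t in wanted)
    n_pred = sum(1 for p in pred_flat if p in wanted)  # predicted as one of the labels
    n_true = sum(1 for t in true_flat if t in wanted)  # actually one of the labels
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (made up): two sentences, 'O' is excluded from the label set.
toy_true = [['O', 'B-geo', 'O'], ['B-per', 'I-per']]
toy_pred = [['O', 'B-geo', 'B-geo'], ['B-per', 'O']]
precision, recall, f1 = micro_prf(toy_true, toy_pred, ['B-geo', 'B-per', 'I-per'])
print(precision, recall, f1)
```

Leaving O out of the label set matters because O dominates the corpus; including it would inflate the averages with easy predictions.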

Next, remove all information except the word forms and the sentence-boundary flags, and see how performance changes.

In [17]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        #'word[-3:]': word[-3:],
        #'word[-2:]': word[-2:],
        #'word.isupper()': word.isupper(),
        #'word.istitle()': word.istitle(),
        #'word.isdigit()': word.isdigit(),
        #'postag': postag,
        #'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            #'-1:word.istitle()': word1.istitle(),
            #'-1:word.isupper()': word1.isupper(),
            #'-1:postag': postag1,
            #'-1:postag[:2]': postag1[:2],
            'BOS' : False,
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            #'+1:word.istitle()': word1.istitle(),
            #'+1:word.isupper()': word1.isupper(),
            #'+1:postag': postag1,
            #'+1:postag[:2]': postag1[:2],
            'EOS' : False
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]
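
The ablation above is done by commenting lines out. A hypothetical alternative, sketched below, groups the same features by kind so that a run can drop whole groups by name. The function names `token_feature_groups` and `word2features_ablate` and the group names ('word', 'suffix', 'shape', 'postag') are made up for this sketch and are not part of the original notebook.

```python
def token_feature_groups(word, postag, prefix=''):
    """Return the features of one token, keyed by a (hypothetical) group name."""
    groups = {
        'word': {prefix + 'word.lower()': word.lower()},
        'shape': {
            prefix + 'word.istitle()': word.istitle(),
            prefix + 'word.isupper()': word.isupper(),
        },
        'postag': {prefix + 'postag': postag, prefix + 'postag[:2]': postag[:2]},
    }
    if not prefix:
        # Suffix and digit features exist only for the current token.
        groups['suffix'] = {'word[-3:]': word[-3:], 'word[-2:]': word[-2:]}
        groups['shape']['word.isdigit()'] = word.isdigit()
    return groups


def word2features_ablate(sent, i, drop=()):
    """Same features as word2features, minus every group named in `drop`."""
    features = {'bias': 1.0}

    def add(token, prefix):
        for group, feats in token_feature_groups(token[0], token[1], prefix).items():
            if group not in drop:
                features.update(feats)

    add(sent[i], '')
    if i > 0:
        add(sent[i - 1], '-1:')
    if i < len(sent) - 1:
        add(sent[i + 1], '+1:')
    features['BOS'] = (i == 0)
    features['EOS'] = (i == len(sent) - 1)
    return features


example = [('through', 'IN', 'O'), ('London', 'NNP', 'B-geo')]
full = word2features_ablate(example, 1)
word_only = word2features_ablate(example, 1, drop={'suffix', 'shape', 'postag'})
print(sorted(word_only))
```

Dropping one group at a time, for example `drop={'postag'}`, would let each group's contribution be measured separately instead of removing everything at once.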

The data still contains POS information, but the feature template above no longer uses it.

In [18]:
agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(), 
                                                   s['POS'].values.tolist(), 
                                                   s['Tag'].values.tolist())]
grouped_df = df.groupby('Sentence #').apply(agg_func)
sentences = [s for s in grouped_df]

In [19]:
X = np.array([sent2features(s) for s in sentences], dtype=object)
y = np.array([sent2labels(s) for s in sentences], dtype=object)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train.shape, X_test.shape

((35969,), (11990,))

In [20]:
crf2 = sklearn_crfsuite.CRF(algorithm='lbfgs',
                           c1=0.1,
                           c2=0.1,
                           max_iterations=50,
                           all_possible_transitions=True,
                           verbose=True)

In [21]:
crf2.fit(X_train, y_train)

loading training data to CRFsuite: 100%|██████████| 35969/35969 [00:04<00:00, 8103.66it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 116001
Seconds required: 0.845

L-BFGS optimization
c1: 0.100000
c2: 0.100000
num_memories: 6
max_iterations: 50
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=2.45  loss=1396483.27 active=115148 feature_norm=1.00
Iter 2   time=3.87  loss=1006822.14 active=114704 feature_norm=4.52
Iter 3   time=1.29  loss=818328.93 active=109168 feature_norm=3.74
Iter 4   time=5.14  loss=764373.37 active=109525 feature_norm=2.33
Iter 5   time=1.31  loss=651847.75 active=115944 feature_norm=3.32
Iter 6   time=1.30  loss=594746.56 active=115882 feature_norm=3.18
Iter 7   time=1.31  loss=582570.11 active=109226 feature_norm=3.34
Iter 8   time=1.29  loss=562759.36 active=112466 feature_norm=3.46
Iter 9   time=1.31  loss=553197.63 active=114551 feature_norm



CRF(algorithm='lbfgs', all_possible_states=None, all_possible_transitions=True,
    averaging=None, c=None, c1=0.1, c2=0.1, calibration_candidates=None,
    calibration_eta=None, calibration_max_trials=None, calibration_rate=None,
    calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
    gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=50,
    max_linesearch=None, min_freq=None, model_filename=None, num_memories=None,
    pa_type=None, period=None, trainer_cls=None, variance=None, verbose=True)

In [22]:
y_pred = crf2.predict(X_test)
labels = list(crf2.classes_)
labels.remove('O')
print(crf_metrics.flat_classification_report(y_test, y_pred, labels=labels))

              precision    recall  f1-score   support

       B-org       0.79      0.60      0.68      5116
       B-per       0.85      0.73      0.79      4239
       I-per       0.85      0.84      0.85      4273
       B-geo       0.86      0.86      0.86      9403
       I-geo       0.79      0.74      0.77      1826
       B-tim       0.93      0.86      0.89      5095
       I-org       0.70      0.62      0.66      4195
       B-gpe       0.96      0.93      0.95      3961
       I-tim       0.81      0.77      0.79      1604
       B-nat       0.71      0.27      0.39        55
       B-eve       0.64      0.38      0.47        80
       B-art       0.36      0.09      0.14       102
       I-art       0.16      0.04      0.07        90
       I-eve       0.51      0.32      0.40        74
       I-gpe       0.95      0.50      0.65        36
       I-nat       0.75      0.17      0.27        18

   micro avg       0.85      0.78      0.81     40167
   macro avg       0.73   

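Performance does not drop drastically: the micro-averaged F1 falls from 0.86 to 0.81, mostly through lower recall (0.85 to 0.78). With enough data, transition features and the word forms themselves appear to give reasonable performance.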
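
To see where the change is concentrated, the per-label F1 scores from the two reports above can be compared directly. The sketch below only copies the printed numbers into a dictionary; the variable names are made up for illustration.

```python
# label: (F1 with all features, F1 with word and boundary features only),
# copied from the two classification reports above.
f1_scores = {
    'B-org': (0.77, 0.68), 'B-per': (0.84, 0.79), 'I-per': (0.88, 0.85),
    'B-geo': (0.89, 0.86), 'I-geo': (0.81, 0.77), 'B-tim': (0.91, 0.89),
    'I-org': (0.80, 0.66), 'B-gpe': (0.96, 0.95), 'I-tim': (0.82, 0.79),
    'B-nat': (0.37, 0.39), 'B-eve': (0.39, 0.47), 'B-art': (0.16, 0.14),
    'I-art': (0.04, 0.07), 'I-eve': (0.26, 0.40), 'I-gpe': (0.65, 0.65),
    'I-nat': (0.36, 0.27),
}

# Sort by change in F1, largest drop first.
deltas = sorted(
    (round(after - before, 2), label)
    for label, (before, after) in f1_scores.items()
)
for delta, label in deltas:
    print(f'{label:6s} {delta:+.2f}')
```

The largest drops are on the organization labels (I-org and B-org). The gains on rare labels such as I-eve rest on fewer than 100 test tokens each, so they should be read with caution.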