### Алина Шаймарданова

# Assignment 4: Named entity recognition

Build a model that detects and classifies named entities, based on the CoNLL 2002 corpus.  

Use ensembles of decision trees in your solution: RandomForest, Gradient Boosting (xgboost, lightgbm, catboost)  
Tutorials:  
1. https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide
1. https://github.com/catboost/tutorials 


The more baselines you beat, the higher your grade.  
Quality metric: f1 (f1_macro); higher is better.
 
baseline 1: 0.0604      random labels  
baseline 2: 0.3966      PoS features + logistic regression  
baseline 3: 0.8122      word2vec cbow embedding + baseline 2 + svm    

! Your results must be reproducible. If your model is stochastic, you must explicitly set every seed and random_state in the model parameters.   
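
A minimal sketch of what the reproducibility requirement means in practice, using only the standard library. `SEED` mirrors the constant defined later in this notebook; `sample_labels` is a hypothetical helper introduced only for illustration. Library estimators follow the same idea through their `random_state` or `seed` parameters.

```python
import random

SEED = 1337  # same value as the notebook's SEED constant


def sample_labels(n, labels, seed):
    """Draw n pseudo-random labels from a dedicated, explicitly seeded generator."""
    rng = random.Random(seed)  # local generator, no hidden global state
    return [rng.choice(labels) for _ in range(n)]


labels = ["O", "B-geo", "B-per"]
run_a = sample_labels(10, labels, SEED)
run_b = sample_labels(10, labels, SEED)

print(run_a == run_b)  # same seed, so the two runs produce identical draws
```

Passing the seed explicitly, rather than relying on global state, keeps a result stable even when cells are re-run in a different order.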

bonus, think about:  
1. How can you exploit that words belong to some sentence?
2. Why did we select the f1 score with macro averaging as our classification quality measure? What other metrics are suitable?   

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


SEED=1337

In [2]:
df = pd.read_csv('/Users/alinashaymardanova/Downloads/data/ner_short.csv', index_col=0)
df.head()

Unnamed: 0,next-next-pos,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,sentence_idx,word,tag
0,NNS,demonstrators,IN,of,NNS,__START1__,__START2__,__START2__,__START1__,1.0,Thousands,O
1,VBP,have,NNS,demonstrators,IN,NNS,__START1__,__START1__,Thousands,1.0,of,O
2,VBN,marched,VBP,have,NNS,IN,NNS,Thousands,of,1.0,demonstrators,O
3,IN,through,VBN,marched,VBP,NNS,IN,of,demonstrators,1.0,have,O
4,NNP,London,IN,through,VBN,VBP,NNS,demonstrators,have,1.0,marched,O


In [3]:
# number of sentences
df.sentence_idx.max()

1500.0

In [4]:
# class distribution
df.tag.value_counts(normalize=True )

O        0.852828
B-geo    0.027604
B-gpe    0.020935
B-org    0.020247
I-per    0.017795
B-tim    0.016927
B-per    0.015312
I-org    0.013937
I-geo    0.005383
I-tim    0.004247
B-art    0.001376
I-gpe    0.000837
I-art    0.000748
B-eve    0.000628
I-eve    0.000508
B-nat    0.000449
I-nat    0.000239
Name: tag, dtype: float64

In [5]:
# sentence length
tdf = df.set_index('sentence_idx')
tdf['length'] = df.groupby('sentence_idx').tag.count()
df = tdf.reset_index(drop=False)

In [6]:
# encode categorical variables

le = LabelEncoder()
df['pos'] = le.fit_transform(df.pos)
df['next-pos'] = le.fit_transform(df['next-pos'])
df['next-next-pos'] = le.fit_transform(df['next-next-pos'])
df['prev-pos'] = le.fit_transform(df['prev-pos'])
df['prev-prev-pos'] = le.fit_transform(df['prev-prev-pos'])

In [7]:
df.head()

Unnamed: 0,sentence_idx,next-next-pos,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,word,tag,length
0,1.0,18,demonstrators,9,of,18,39,40,__START2__,__START1__,Thousands,O,48
1,1.0,33,have,18,demonstrators,9,18,39,__START1__,Thousands,of,O,48
2,1.0,32,marched,33,have,18,9,18,Thousands,of,demonstrators,O,48
3,1.0,9,through,32,marched,33,18,9,of,demonstrators,have,O,48
4,1.0,16,London,9,through,32,33,18,demonstrators,have,marched,O,48


In [8]:
# splitting
y = LabelEncoder().fit_transform(df.tag)

df_train, df_test, y_train, y_test = model_selection.train_test_split(df, y, stratify=y, 
                                                                      test_size=0.25, random_state=SEED, shuffle=True)
print('train', df_train.shape[0])
print('test', df_test.shape[0])

train 50155
test 16719
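
The split above is stratified per token, so tokens from one sentence can land on both sides of it. The following is a hedged sketch of an alternative that keeps whole sentences together by splitting on `sentence_idx`. It is not what this notebook does: `split_by_sentence` is a hypothetical helper, and the ids are toy data.

```python
import random


def split_by_sentence(sentence_ids, test_size=0.25, seed=1337):
    """Return (train_rows, test_rows) so that no sentence is split across both."""
    unique_ids = sorted(set(sentence_ids))
    rng = random.Random(seed)  # explicit seed keeps the split reproducible
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_size))
    test_ids = set(unique_ids[:n_test])
    train_rows = [i for i, s in enumerate(sentence_ids) if s not in test_ids]
    test_rows = [i for i, s in enumerate(sentence_ids) if s in test_ids]
    return train_rows, test_rows


# Toy column of sentence ids: one entry per token
sentence_ids = [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4]
train_rows, test_rows = split_by_sentence(sentence_ids)

train_sentences = {sentence_ids[i] for i in train_rows}
test_sentences = {sentence_ids[i] for i in test_rows}
print(train_sentences & test_sentences)  # empty set: no sentence is shared
```

The trade-off is that a sentence-level split no longer stratifies by tag, so very rare classes may be missing from one side. scikit-learn's `GroupShuffleSplit` offers a comparable group-aware split.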


In [9]:
# some wrappers to work with word2vec
# note: written against the gensim < 4.0 API (`size`, `iter`, `model[word]`, `word in model`)
from gensim.models.word2vec import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import TransformerMixin
from collections import defaultdict

   
class Word2VecWrapper(TransformerMixin):
    def __init__(self, window=5,negative=5, size=100, iter=100, is_cbow=False, random_state=SEED):
        self.window_ = window
        self.negative_ = negative
        self.size_ = size
        self.iter_ = iter
        self.is_cbow_ = is_cbow
        self.w2v = None
        self.random_state = random_state
        
    def get_size(self):
        return self.size_

    def fit(self, X, y=None):
        """
        X: list of strings
        """
        sentences_list = [x.split() for x in X]
        self.w2v = Word2Vec(sentences_list, 
                            window=self.window_,
                            negative=self.negative_, 
                            size=self.size_, 
                            iter=self.iter_,
                            sg=not self.is_cbow_, seed=self.random_state)

        return self
    
    def has(self, word):
        return word in self.w2v

    def transform(self, X):
        """
        X: a list of words
        """
        if self.w2v is None:
            raise Exception('model not fitted')
        return np.array([self.w2v[w] if w in self.w2v else np.zeros(self.size_) for w in X ])
    


In [11]:
%%time
# here we exploit that word2vec is an unsupervised learning algorithm
# so we can train it on the whole dataset (subject to discussion)

sentences_list = [x.strip() for x in ' '.join(df.word).split('.')]

w2v_cbow = Word2VecWrapper(window=5, negative=5, size=300, iter=300, is_cbow=True, random_state=SEED)
w2v_cbow.fit(sentences_list)

CPU times: user 53.9 s, sys: 713 ms, total: 54.6 s
Wall time: 25.5 s
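
The cell above rebuilds sentences by joining all words and splitting on `.`, which also cuts at periods inside tokens such as abbreviations. Each row already carries `sentence_idx`, so the sentences can be rebuilt by grouping on it instead. This is a sketch under assumptions: `sentences_from_rows` is a hypothetical helper, the rows are toy data, and the rows are assumed to be ordered so that each sentence is contiguous.

```python
from itertools import groupby


def sentences_from_rows(sentence_ids, words):
    """Join consecutive words that share a sentence id into one string each."""
    rows = zip(sentence_ids, words)
    return [
        " ".join(word for _, word in group)
        for _, group in groupby(rows, key=lambda row: row[0])
    ]


# Toy rows standing in for df.sentence_idx and df.word
sentence_ids = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
words = ["Thousands", "of", "demonstrators", "U.S.", "troops", "left", "."]

print(sentences_from_rows(sentence_ids, words))
# ['Thousands of demonstrators', 'U.S. troops left .']
```

The result is a list of space-joined strings, which is the input format that `Word2VecWrapper.fit` expects.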


In [12]:
%%time
# baseline 1 
# random labels
from sklearn.preprocessing import OneHotEncoder
from sklearn.dummy import DummyClassifier


columns = ['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']

model = Pipeline([
    ('enc', OneHotEncoder()),
    ('est', DummyClassifier(random_state=SEED)),
])

model.fit(df_train[columns], y_train)

print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))


train 0.058877367256
test 0.0604395427128
CPU times: user 239 ms, sys: 39.8 ms, total: 279 ms
Wall time: 289 ms


In [12]:
%%time
# baseline 2 
# pos features + one hot encoding + logistic regression
from sklearn.preprocessing import OneHotEncoder


columns = ['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']

model = Pipeline([
    ('enc', OneHotEncoder()),
    ('est', LogisticRegressionCV(Cs=5, cv=5, n_jobs=-1, scoring='f1_macro', 
                             penalty='l2', solver='newton-cg', multi_class='multinomial', random_state=SEED)),
])

model.fit(df_train[columns], y_train)

print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))

train 0.466395002823
test 0.396609814216
CPU times: user 4min 54s, sys: 21.6 s, total: 5min 15s
Wall time: 22min 47s


In [14]:
%%time
# baseline 3
# use word2vec cbow embedding + baseline 2 + svm
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.svm import LinearSVC
import scipy.sparse as sp

embeding = w2v_cbow
encoder_pos = OneHotEncoder()
X_train = sp.hstack([
    embeding.transform(df_train.word),
    embeding.transform(df_train['next-word']),
    embeding.transform(df_train['next-next-word']),
    embeding.transform(df_train['prev-word']),
    embeding.transform(df_train['prev-prev-word']),
    encoder_pos.fit_transform(df_train[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])
X_test = sp.hstack([
    embeding.transform(df_test.word),
    embeding.transform(df_test['next-word']),
    embeding.transform(df_test['next-next-word']),
    embeding.transform(df_test['prev-word']),
    embeding.transform(df_test['prev-prev-word']),
    encoder_pos.transform(df_test[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])

model = model_selection.GridSearchCV(LinearSVC(penalty='l2', multi_class='ovr', random_state=SEED), 
                                    {'C': np.logspace(-4, 0, 5)}, 
                                    cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
model.fit(X_train, y_train)

print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed: 11.6min finished


train 0.956854454966
test 0.794605411826
CPU times: user 3min 28s, sys: 32 s, total: 4min
Wall time: 14min 58s


### The brute-force approach

In [61]:
from sklearn.ensemble import RandomForestClassifier

In [60]:
rf = RandomForestClassifier(random_state=SEED)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=1337, verbose=0,
            warm_start=False)

In [62]:
print('train', metrics.f1_score(y_train, rf.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, rf.predict(X_test), average='macro'))

train 0.988742270255
test 0.844549179338


Let's try a grid search to find the best parameters for the random forest.

In [63]:
from sklearn.model_selection import GridSearchCV

In [64]:
rf = RandomForestClassifier(random_state=SEED)
grid = {'n_estimators': [10, 100, 150],
       'max_depth': [10, 100, None],
       'criterion': ['gini'],
       'min_samples_leaf': [1, 2, 10]}

clf = GridSearchCV(rf, grid, n_jobs=-1, scoring='f1_macro', cv=5)
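
Before launching a search like this, it helps to estimate how many models it will train. The sketch below enumerates the same grid with the standard library only; `cv_folds` and `candidates` are names introduced for illustration.

```python
from itertools import product

grid = {
    "n_estimators": [10, 100, 150],
    "max_depth": [10, 100, None],
    "criterion": ["gini"],
    "min_samples_leaf": [1, 2, 10],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Every combination of one value per parameter is one candidate
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 27 135
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full training set after the cross-validated fits.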

In [42]:
%%time

clf.fit(X_train, y_train)

Fitting 5 folds for each of 27 candidates, totalling 135 fits
[CV] criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=10 
[CV] criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=10 
[CV] criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=10 
[CV] criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=10 
[CV]  criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=10, score=0.26884708818346303, total=  50.0s
[CV] criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=10 
[CV]  criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=10, score=0.2811906958581345, total=  48.8s
[CV]  criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=10, score=0.3126478445923687, total=  47.0s
[CV]  criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=10, score=0.2919992671392564, total=  53.1s
[CV] criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=100 
[CV] criterion=gini, max_depth=10, min_samples_leaf=

[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  8.6min


[CV]  criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=150, score=0.25275659470000544, total= 4.9min
[CV] criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=150 
[CV]  criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=150, score=0.2582149524529175, total= 4.9min
[CV] criterion=gini, max_depth=10, min_samples_leaf=2, n_estimators=10 
[CV]  criterion=gini, max_depth=10, min_samples_leaf=1, n_estimators=150, score=0.2601488857210528, total= 4.8min
[CV] criterion=gini, max_depth=10, min_samples_leaf=2, n_estimators=10 
[CV]  criterion=gini, max_depth=10, min_samples_leaf=2, n_estimators=10, score=0.21350620668919126, total=  35.0s
[CV] criterion=gini, max_depth=10, min_samples_leaf=2, n_estimators=10 
[CV]  criterion=gini, max_depth=10, min_samples_leaf=2, n_estimators=10, score=0.24602281638248844, total=  30.0s
[CV] criterion=gini, max_depth=10, min_samples_leaf=2, n_estimators=10 
[CV]  criterion=gini, max_depth=10, min_samples_leaf=2, n_estimators

[CV] criterion=gini, max_depth=100, min_samples_leaf=1, n_estimators=150 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=1, n_estimators=100, score=0.8699601466242705, total=16.3min
[CV] criterion=gini, max_depth=100, min_samples_leaf=1, n_estimators=150 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=1, n_estimators=150, score=0.8266075274451227, total=24.8min
[CV] criterion=gini, max_depth=100, min_samples_leaf=1, n_estimators=150 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=1, n_estimators=150, score=0.8651392634554388, total=24.7min
[CV] criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=10 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=1, n_estimators=150, score=0.7679720115290866, total=24.7min
[CV] criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=10 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=10, score=0.7343214622804778, total= 1.6min
[CV] criterion=gini, max_depth=100, min_samples_leaf=2, n_

[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed: 96.9min


[CV] criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=100 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=100, score=0.7648667096628128, total=13.4min
[CV] criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=100 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=100, score=0.77413212664126, total=12.9min
[CV] criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=100 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=1, n_estimators=150, score=0.8710251499358451, total=24.1min
[CV] criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=150 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=100, score=0.7133878364530618, total=13.3min
[CV] criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=150 
[CV]  criterion=gini, max_depth=100, min_samples_leaf=2, n_estimators=100, score=0.7882368433662609, total=13.4min
[CV] criterion=gini, max_depth=100, min_samples_leaf=2, n

[CV]  criterion=gini, max_depth=None, min_samples_leaf=2, n_estimators=10, score=0.7642243235812318, total= 1.4min
[CV] criterion=gini, max_depth=None, min_samples_leaf=2, n_estimators=100 
[CV]  criterion=gini, max_depth=None, min_samples_leaf=1, n_estimators=150, score=0.8255894816312048, total=21.7min
[CV] criterion=gini, max_depth=None, min_samples_leaf=2, n_estimators=100 
[CV]  criterion=gini, max_depth=None, min_samples_leaf=2, n_estimators=100, score=0.7648667096628128, total=12.3min
[CV] criterion=gini, max_depth=None, min_samples_leaf=2, n_estimators=100 
[CV]  criterion=gini, max_depth=None, min_samples_leaf=2, n_estimators=100, score=0.77413212664126, total=12.1min
[CV] criterion=gini, max_depth=None, min_samples_leaf=2, n_estimators=100 
[CV]  criterion=gini, max_depth=None, min_samples_leaf=1, n_estimators=150, score=0.8710251499358451, total=22.1min
[CV] criterion=gini, max_depth=None, min_samples_leaf=2, n_estimators=150 
[CV]  criterion=gini, max_depth=None, min_sample

[Parallel(n_jobs=-1)]: Done 135 out of 135 | elapsed: 290.9min finished


CPU times: user 23min 51s, sys: 1min 9s, total: 25min
Wall time: 5h 10min 56s


In [44]:
clf.best_params_

{'criterion': 'gini',
 'max_depth': 100,
 'min_samples_leaf': 1,
 'n_estimators': 150}

In [43]:
print('train', metrics.f1_score(y_train, clf.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, clf.predict(X_test), average='macro'))

train 0.994224593315
test 0.866641772658


### Why did we select the f1 score with macro averaging as our classification quality measure? What other metrics are suitable?

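Because the classes are heavily imbalanced. Accuracy would be an alternative to f1, but a very poor one: with this much imbalance it would always look very good regardless of how the model actually performs. A model that labels every token `O` would already reach about 85% accuracy, which is the share of `O` in the class distribution shown earlier.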
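
A small, hedged illustration of this point on toy labels (not the notebook's data): a predictor that always outputs `O` gets high accuracy but a low macro-averaged f1. `macro_f1` is a hand-written helper for illustration; in the notebook itself the metric comes from `metrics.f1_score(..., average='macro')`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: 85% of the tokens are "O", the rest are three rare entity tags
y_true = ["O"] * 17 + ["B-geo", "B-per", "B-org"]
y_pred = ["O"] * 20  # a degenerate model that never predicts an entity

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, round(macro_f1(y_true, y_pred), 3))  # 0.85 0.23
```

Macro averaging gives every class the same weight, so the three missed entity classes pull the score down even though they cover only a small fraction of the tokens.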