## A description of the work follows cell 72.

# Assignment 4: Named entity recognition

Build a model that detects and classifies named entities, based on the CoNLL 2002 corpus.  

Use ensembles of decision trees in your solution: RandomForest, Gradient Boosting (xgboost, lightgbm, catboost)  
Tutorials:  
1. https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide
1. https://github.com/catboost/tutorials 


The more baselines you beat, the higher your grade.  
Quality metric: f1 (f1_macro); higher is better.
 
baseline 1: 0.0604      random labels  
baseline 2: 0.3966      PoS features + logistic regression  
baseline 3: 0.8122      word2vec cbow embedding + baseline 2 + svm    

! Your results must be reproducible. If your model is stochastic, you must explicitly set every seed and random_state in the model parameters.   

bonus, think about:  
1. How can you exploit that words belong to some sentence?
2. Why did we select the F1 score with macro averaging as our classification quality measure? What other metrics are suitable?   
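
One possible way to use sentence membership (bonus question 1) is to keep all tokens of a sentence on the same side of the train/test split, so that neighbouring tokens of one sentence never appear in both sets. The sketch below uses scikit-learn's `GroupShuffleSplit` on hypothetical toy data; the array names and sizes are assumptions for illustration, and this is not the split used later in this notebook, which stratifies individual tokens.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

SEED = 1337

# Hypothetical toy data: 6 sentences with 4 tokens each.
sentence_idx = np.repeat(np.arange(6), 4)
X = np.arange(len(sentence_idx)).reshape(-1, 1)  # placeholder features
y = np.zeros(len(sentence_idx), dtype=int)       # placeholder labels

# Split by sentence: every token of a sentence goes to the same side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=SEED)
train_idx, test_idx = next(splitter.split(X, y, groups=sentence_idx))

train_sentences = set(sentence_idx[train_idx])
test_sentences = set(sentence_idx[test_idx])
print(sorted(train_sentences), sorted(test_sentences))
```

With a split like this the class proportions are no longer guaranteed to match between train and test, which matters for the rarest tags.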

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


SEED=1337

In [2]:
df = pd.read_csv('ner_short.csv', index_col=0)
df.head()

Unnamed: 0,next-next-pos,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,sentence_idx,word,tag
0,NNS,demonstrators,IN,of,NNS,__START1__,__START2__,__START2__,__START1__,1.0,Thousands,O
1,VBP,have,NNS,demonstrators,IN,NNS,__START1__,__START1__,Thousands,1.0,of,O
2,VBN,marched,VBP,have,NNS,IN,NNS,Thousands,of,1.0,demonstrators,O
3,IN,through,VBN,marched,VBP,NNS,IN,of,demonstrators,1.0,have,O
4,NNP,London,IN,through,VBN,VBP,NNS,demonstrators,have,1.0,marched,O


In [3]:
# number of sentences
df.sentence_idx.max()

1500.0

In [4]:
# class distribution
df.tag.value_counts(normalize=True )

O        0.852828
B-geo    0.027604
B-gpe    0.020935
B-org    0.020247
I-per    0.017795
B-tim    0.016927
B-per    0.015312
I-org    0.013937
I-geo    0.005383
I-tim    0.004247
B-art    0.001376
I-gpe    0.000837
I-art    0.000748
B-eve    0.000628
I-eve    0.000508
B-nat    0.000449
I-nat    0.000239
Name: tag, dtype: float64

In [5]:
# class distribution
df.tag.value_counts()

O        57032
B-geo     1846
B-gpe     1400
B-org     1354
I-per     1190
B-tim     1132
B-per     1024
I-org      932
I-geo      360
I-tim      284
B-art       92
I-gpe       56
I-art       50
B-eve       42
I-eve       34
B-nat       30
I-nat       16
Name: tag, dtype: int64

In [6]:
# sentence length
tdf = df.set_index('sentence_idx')
tdf['length'] = df.groupby('sentence_idx').tag.count()
df = tdf.reset_index(drop=False)
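
The same per-sentence length can probably be obtained without moving `sentence_idx` into the index and back. The sketch below uses `groupby(...).transform('count')` on a hypothetical toy frame (the frame `toy` and its contents are assumptions, only the column names follow the notebook).

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the notebook's df.
toy = pd.DataFrame({
    'sentence_idx': [1.0, 1.0, 1.0, 2.0, 2.0],
    'word': ['Thousands', 'of', 'demonstrators', 'They', 'left'],
    'tag': ['O', 'O', 'O', 'O', 'O'],
})

# transform('count') broadcasts each group's size back to every row of the group.
toy['length'] = toy.groupby('sentence_idx')['tag'].transform('count')
print(toy['length'].tolist())
```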

In [7]:
tdf
df.head()


Unnamed: 0,sentence_idx,next-next-pos,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,word,tag,length
0,1.0,NNS,demonstrators,IN,of,NNS,__START1__,__START2__,__START2__,__START1__,Thousands,O,48
1,1.0,VBP,have,NNS,demonstrators,IN,NNS,__START1__,__START1__,Thousands,of,O,48
2,1.0,VBN,marched,VBP,have,NNS,IN,NNS,Thousands,of,demonstrators,O,48
3,1.0,IN,through,VBN,marched,VBP,NNS,IN,of,demonstrators,have,O,48
4,1.0,NNP,London,IN,through,VBN,VBP,NNS,demonstrators,have,marched,O,48


In [8]:
# encode categorical variables

le = LabelEncoder()
df['pos'] = le.fit_transform(df.pos)
df['next-pos'] = le.fit_transform(df['next-pos'])
df['next-next-pos'] = le.fit_transform(df['next-next-pos'])
df['prev-pos'] = le.fit_transform(df['prev-pos'])
df['prev-prev-pos'] = le.fit_transform(df['prev-prev-pos'])

In [9]:
#print(df.pos)
print(tdf['next-next-pos'].value_counts())
le.classes_

NN          8702
IN          7548
NNP         6848
DT          5346
NNS         4488
JJ          4348
__END1__    3000
__END2__    3000
.           2998
VBD         2194
VBN         2074
,           1836
VB          1598
VBZ         1454
TO          1446
CD          1430
CC          1396
VBG         1194
RB          1118
VBP          944
PRP          680
POS          542
PRP$         522
MD           460
``           244
JJR          192
WDT          190
JJS          172
WRB          160
WP           144
RP           140
NNPS         124
$             68
:             56
RBR           52
RRB           40
LRB           40
EX            34
;             20
RBS           14
PDT           12
WP$            6
Name: next-next-pos, dtype: int64


array(['$', ',', '.', ':', ';', 'CC', 'CD', 'DT', 'EX', 'IN', 'JJ', 'JJR',
       'JJS', 'LRB', 'MD', 'NN', 'NNP', 'NNPS', 'NNS', 'PDT', 'POS',
       'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'RRB', 'TO', 'VB', 'VBD',
       'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP', 'WP$', 'WRB',
       '__START1__', '__START2__', '``'], dtype=object)

In [10]:
le.classes_
print(le.inverse_transform([41, 40, 39, 38, 37]))
print(le.transform(['__START2__']))
print(le.transform(['__END1__']))

['``' '__START2__' '__START1__' 'WRB' 'WP$']
[40]


ValueError: y contains previously unseen labels: ['__END1__']
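
The `ValueError` above follows from reusing a single `LabelEncoder`: each `fit_transform` call refits it, so after the encoding cell it only remembers the classes of the last column it saw (`prev-prev-pos`), which contains no `__END1__`. A minimal sketch of one alternative, keeping one encoder per column, is shown below; the toy frame and the `encoders` dictionary are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two of the notebook's PoS columns.
toy = pd.DataFrame({
    'prev-pos': ['__START1__', 'NNS', 'IN'],
    'next-pos': ['IN', 'NNS', '__END1__'],
})

# One encoder per column, so each mapping stays available afterwards.
encoders = {}
for col in ['prev-pos', 'next-pos']:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Each encoder can now translate the sentinel values of its own column.
start_code = encoders['prev-pos'].transform(['__START1__'])[0]
end_code = encoders['next-pos'].transform(['__END1__'])[0]
print(start_code, end_code)
```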

In [11]:
df['next-next-pos'] = le.fit_transform(df['next-next-pos'])
df
le.classes_
df.tag

0            O
1            O
2            O
3            O
4            O
5            O
6        B-geo
7            O
8            O
9            O
10           O
11           O
12       B-geo
13           O
14           O
15           O
16           O
17           O
18       B-gpe
19           O
20           O
21           O
22           O
23           O
24           O
25           O
26           O
27           O
28           O
29           O
         ...  
66844        O
66845        O
66846        O
66847        O
66848        O
66849        O
66850        O
66851        O
66852        O
66853        O
66854        O
66855        O
66856        O
66857        O
66858        O
66859        O
66860    B-per
66861        O
66862        O
66863        O
66864        O
66865        O
66866        O
66867        O
66868        O
66869        O
66870        O
66871        O
66872        O
66873        O
Name: tag, Length: 66874, dtype: object

In [12]:
# splitting
y = LabelEncoder().fit_transform(df.tag)

df_train, df_test, y_train, y_test = model_selection.train_test_split(df, y, stratify=y, 
                                                                      test_size=0.25, random_state=SEED, shuffle=True)
print('train', df_train.shape[0])
print('test', df_test.shape[0])

train 50155
test 16719


In [None]:
y
df_test

In [15]:
# some wrappers to work with word2vec
from gensim.models.word2vec import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import TransformerMixin
from collections import defaultdict

   
class Word2VecWrapper(TransformerMixin):
    def __init__(self, window=5,negative=5, size=100, iter=100, is_cbow=False, random_state=SEED):
        self.window_ = window
        self.negative_ = negative
        self.size_ = size
        self.iter_ = iter
        self.is_cbow_ = is_cbow
        self.w2v = None
        self.random_state = random_state
        
    def get_size(self):
        return self.size_

    def fit(self, X, y=None):
        """
        X: list of strings
        """
        sentences_list = [x.split() for x in X]
        self.w2v = Word2Vec(sentences_list, 
                            window=self.window_,
                            negative=self.negative_, 
                            size=self.size_, 
                            iter=self.iter_,
                            sg=not self.is_cbow_, seed=self.random_state)

        return self
    
    def has(self, word):
        return word in self.w2v

    def transform(self, X):
        """
        X: a list of words
        """
        if self.w2v is None:
            raise Exception('model not fitted')
        return np.array([self.w2v[w] if w in self.w2v else np.zeros(self.size_) for w in X ])
    




In [16]:
%%time
# here we exploit that word2vec is an unsupervised learning algorithm
# so we can train it on the whole dataset (subject to discussion)

sentences_list = [x.strip() for x in ' '.join(df.word).split('.')]

w2v_cbow = Word2VecWrapper(window=5, negative=5, size=300, iter=300, is_cbow=True, random_state=SEED)
w2v_cbow.fit(sentences_list)

Wall time: 11.7 s


In [17]:
%%time
# baseline 1 
# random labels
from sklearn.preprocessing import OneHotEncoder
from sklearn.dummy import DummyClassifier


columns = ['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']

model = Pipeline([
    ('enc', OneHotEncoder()),
    ('est', DummyClassifier(random_state=SEED)),
])

model.fit(df_train[columns], y_train)

print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))


train 0.05887736725599869
test 0.060439542712750365
Wall time: 113 ms


In [52]:
df_train[columns]

Unnamed: 0,pos,next-pos,next-next-pos,prev-pos,prev-prev-pos
36858,30,18,35,16,16
39120,32,28,7,30,5
53612,7,10,15,30,1
6150,7,15,9,9,1
3771,16,1,10,15,16
28439,7,15,1,9,15
43719,28,23,29,18,15
59363,9,10,16,18,30
14674,9,16,1,15,10
56283,1,5,30,18,9


In [18]:
%%time
# baseline 2 
# pos features + one hot encoding + logistic regression
from sklearn.preprocessing import OneHotEncoder


columns = ['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']

model = Pipeline([
    ('enc', OneHotEncoder()),
    ('est', LogisticRegressionCV(Cs=5, cv=5, n_jobs=-1, scoring='f1_macro', 
                             penalty='l2', solver='newton-cg', multi_class='multinomial', random_state=SEED)),
])

model.fit(df_train[columns], y_train)

print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))

train 0.46639500282346874
test 0.39660981421559566
Wall time: 18min 48s




In [None]:
# 1) #######################################
# RandomForest

In [19]:
enc = OneHotEncoder()
X_train = enc.fit_transform(df_train[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
X_test = enc.transform(df_test[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])

In [20]:
print(X_train.shape, X_test.shape)
#print(X_train, X_test)


(50155, 206) (16719, 206)


In [21]:
from sklearn.ensemble import RandomForestClassifier

In [66]:
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=SEED)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_pred_ = clf.predict(X_train)
print('train', metrics.f1_score(y_train, y_pred_, average='macro'))
print('test', metrics.f1_score(y_test, y_pred, average='macro'))


train 0.747052570135
test 0.599419020457


In [70]:
NE = [10, 20, 50, 100, 200, 500]
for ne in NE:
  clf = RandomForestClassifier(n_estimators=ne, n_jobs=-1, random_state=SEED)
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  f1  = metrics.f1_score(y_test, y_pred, average='macro')
  print("ne=", ne, "f1=", f1)
    

ne= 10 f1= 0.577316050586
ne= 20 f1= 0.59896942209
ne= 50 f1= 0.598170527645
ne= 100 f1= 0.599419020457
ne= 200 f1= 0.597565971407
ne= 500 f1= 0.60004113975


In [None]:
# 2) ####################################
# GradientBoostingClassifier

In [None]:
%%time
from sklearn.ensemble import GradientBoostingClassifier
NE = [10, 50, 100, 200, 500, 1000]
for ne in NE:
  clf = GradientBoostingClassifier(n_estimators=ne, learning_rate=0.1, random_state=SEED)
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  f1  = metrics.f1_score(y_test, y_pred, average='macro')
  print("ne=", ne, "f1=", f1)

# Description of the work and conclusions

The main work is done in cells 64-79. Simply applying RF with standard parameters produced the desired result: f1-macro on the test set was about 0.6, well above the second baseline of 0.396.

The notes to the assignment repeatedly stated that hyperparameters must not be tuned on the test set. No cross-validated tuning was actually performed, however, because the goal of the assignment, beating the second baseline, was reached for practically any value of the hyperparameter, including n_estimators=100, the value used in the first run (and the default in scikit-learn 0.22 and later).
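
For reference, a cross-validated choice of `n_estimators` that never touches the test set could look roughly like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the one-hot PoS features (an assumption; the grid values are also illustrative), so the numbers it prints say nothing about the NER task itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

SEED = 1337

# Synthetic stand-in for the notebook's features (assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=3, random_state=SEED)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED)

# The search only sees the training part; folds are carved out of it.
search = GridSearchCV(
    RandomForestClassifier(random_state=SEED),
    param_grid={'n_estimators': [10, 50, 100]},
    scoring='f1_macro', cv=3)
search.fit(X_train, y_train)

# The test set is used once, for the final estimate.
test_f1 = f1_score(y_test, search.predict(X_test), average='macro')
print(search.best_params_, round(test_f1, 3))
```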

For comparison, gradient boosting was also tried; here it gave noticeably worse results, 0.5-0.55. Its hyperparameter was not tuned either, only roughly estimated.
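
Refitting the booster for every candidate number of trees is expensive. A cheaper rough estimate, sketched below under the assumption that a held-out validation part of the training data is available, fits once with the largest `n_estimators` and scores every intermediate stage with `staged_predict`. The data here is synthetic and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

SEED = 1337

# Synthetic stand-in data (assumption, not the notebook's features).
X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=SEED)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED)

# Fit once with the largest number of trees of interest.
clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                 random_state=SEED)
clf.fit(X_fit, y_fit)

# staged_predict yields the prediction after 1, 2, ..., n_estimators trees.
scores = [f1_score(y_val, y_stage, average='macro')
          for y_stage in clf.staged_predict(X_val)]
best_n = scores.index(max(scores)) + 1
print(best_n, round(max(scores), 3))
```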

The F1-macro criterion was chosen to measure the average quality of NER treated as a classification task. It is the best fit here, although a number of rare entity types cannot be identified correctly for lack of data. Any metric that balances precision and recall across imbalanced classes in a multiclass setting would do, but F1-macro is the best choice when there is no need to weight the NE types differently.
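
A small worked example may make the point about imbalance concrete. The sketch below computes per-class F1 by hand on hypothetical toy labels (18 `O` tokens, 2 `B-nat` tokens) for a model that predicts `O` everywhere: accuracy looks fine, while the macro average is pulled down by the class that is never found. The helper `per_class_f1` is written for this illustration only.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 of one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy data: a rare class that the model never predicts.
y_true = ['O'] * 18 + ['B-nat'] * 2
y_pred = ['O'] * 20

labels = sorted(set(y_true))
f1_by_class = {label: per_class_f1(y_true, y_pred, label) for label in labels}

# Macro averaging gives every class the same weight, however rare it is.
macro_f1 = sum(f1_by_class.values()) / len(labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f1_by_class, macro_f1, accuracy)
```

Here accuracy is 0.9, while macro F1 is below 0.5 because `B-nat` contributes an F1 of zero.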




# Conclusions

Without using the words themselves, on PoS-tag features alone, RF performs considerably better for NER than logistic regression, as one would expect given the nature of the task, whereas plain gradient boosting, without the optimizations of specialized packages, gives much worse results.  

The standard RF method gives good results straight away and accomplishes the goal of beating baseline 2.


In [21]:
%%time
# baseline 3
# use word2vec cbow embedding + baseline 2 + svm
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.svm import LinearSVC
import scipy.sparse as sp

embeding = w2v_cbow
encoder_pos = OneHotEncoder()
X_train = sp.hstack([
    embeding.transform(df_train.word),
    embeding.transform(df_train['next-word']),
    embeding.transform(df_train['next-next-word']),
    embeding.transform(df_train['prev-word']),
    embeding.transform(df_train['prev-prev-word']),
    encoder_pos.fit_transform(df_train[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])
X_test = sp.hstack([
    embeding.transform(df_test.word),
    embeding.transform(df_test['next-word']),
    embeding.transform(df_test['next-next-word']),
    embeding.transform(df_test['prev-word']),
    embeding.transform(df_test['prev-prev-word']),
    encoder_pos.transform(df_test[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])



Wall time: 7.25 s


In [None]:
model = model_selection.GridSearchCV(LinearSVC(penalty='l2', multi_class='ovr', random_state=SEED), 
                                    {'C': np.logspace(-4, 0, 5)}, 
                                    cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
model.fit(X_train, y_train)

print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

In [22]:
X_train

<50155x1706 sparse matrix of type '<class 'numpy.float64'>'
	with 57925775 stored elements in COOrdinate format>

In [31]:
aa = embeding.transform(df_test['next-word'])
aa.shape

(16719, 300)

In [32]:
%%time
C=0.1
clf = LinearSVC(C=C, random_state=SEED)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
f1  = metrics.f1_score(y_test, y_pred, average='macro')
print("С=", C, "f1=", f1)

f11 = metrics.f1_score(y_test, y_pred, average=None)
print(f11)
print(np.mean(f11))



С= 0.1 f1= 0.803763528653904
Wall time: 7min 18s


In [55]:
%%time
C=1
clf = LogisticRegression(C=C, random_state=SEED)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
f1  = metrics.f1_score(y_test, y_pred, average='macro')
print("C=", C, "f1=", f1)

f11 = metrics.f1_score(y_test, y_pred, average=None)
print(f11)
print(np.mean(f11))

f1= 0.8050354838121895
Wall time: 13min 14s


In [56]:
f11 = metrics.f1_score(y_test, y_pred, average=None)
print(f11)
print(np.mean(f11))

[0.75       0.66666667 0.89589905 0.81710914 0.6        0.81481481
 0.8996139  0.92870201 0.6        0.85714286 0.86046512 0.75862069
 0.66666667 0.86292135 0.91927512 0.79389313 0.9938127 ]
0.8050354838121895


In [92]:
y_pred

array([16,  3, 16, ..., 16, 16, 16], dtype=int64)

In [65]:
model


Pipeline(memory=None,
     steps=[('enc', OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)), ('est', LogisticRegressionCV(Cs=5, class_weight=None, cv=5, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='multinomial', n_jobs=-1, penalty='l2',
           random_state=1337, refit=True, scoring='f1_macro',
           solver='newton-cg', tol=0.0001, verbose=0))])

In [64]:
model.cv_results_

AttributeError: 'Pipeline' object has no attribute 'cv_results_'

In [27]:
### Classes and their counts

print(y.shape)
uniqs, counts = np.unique(y, return_counts=True)
print(uniqs)
print(counts)


(66874,)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16]
[   92    42  1846  1400    30  1354  1024  1132    50    34   360    56
    16   932  1190   284 57032]


In [29]:
#### Converting the multiclass problem into a binary one

def mc2(y, n):
    y1 = y.copy()
    y1[y==n] = 1
    y1[y!=n] = 0
    return y1

y1 = mc2(y, 2)
print(y1.shape)
uniqs, counts = np.unique(y1, return_counts=True)
print(2, uniqs, counts)

(66874,)
2 [0 1] [65028  1846]


In [139]:
y1_train = mc2(y_train, 2)
y1_test = mc2(y_test, 2)

C=0.1
clf = LinearSVC(C=C, random_state=SEED)
clf.fit(X_train, y1_train)
y_pred = clf.predict(X_test)
f1  = metrics.f1_score(y1_test, y_pred)
print("f1=", f1)

С= 0.1 f1= 0.860566448802
Wall time: 16.4 s


In [31]:
%%time

######

print(y.shape)
uniqs, counts = np.unique(y, return_counts=True)
print(uniqs)
print(counts)

C = 0.1
clf = LinearSVC(C=C, random_state=SEED)
ff = []

for n in (range(17)):
  y1_train = mc2(y_train, n)
  y1_test = mc2(y_test, n)
  clf.fit(X_train, y1_train)
  y_pred = clf.predict(X_test)
  f1  = metrics.f1_score(y1_test, y_pred)
  ff.append(f1)
  print(n, counts[n], f1)

print(ff)
print("f1-macro:", np.mean(ff))

(66874,)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16]
[   92    42  1846  1400    30  1354  1024  1132    50    34   360    56
    16   932  1190   284 57032]




0 92 0.8636363636363636
1 42 0.5263157894736842
2 1846 0.8674698795180722
3 1400 0.8059701492537314
4 30 0.6
5 1354 0.767741935483871
6 1024 0.8825757575757577
7 1132 0.9314586994727592
8 50 0.608695652173913
9 34 0.75
10 360 0.7909604519774011
11 56 0.6206896551724138
12 16 0.6666666666666666
13 932 0.7948164146868251
14 1190 0.8950819672131146
15 284 0.8513513513513514
16 57032 0.9919681519765331
[0.8636363636363636, 0.5263157894736842, 0.8674698795180722, 0.8059701492537314, 0.6, 0.767741935483871, 0.8825757575757577, 0.9314586994727592, 0.608695652173913, 0.75, 0.7909604519774011, 0.6206896551724138, 0.6666666666666666, 0.7948164146868251, 0.8950819672131146, 0.8513513513513514, 0.9919681519765331]
f1-macro: 0.7773764050372033
Wall time: 3min 6s
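
Note that the mean of these per-class binary F1 scores is not necessarily the same quantity as the macro F1 of a single multiclass prediction: each binary model is scored on its own, so a token may be claimed by several classes or by none. One common way to obtain one label per token from one-vs-rest models is to take the class with the largest decision score. The sketch below illustrates this on a hypothetical score matrix; the numbers are made up and stand in for what `decision_function` of the binary models might return.

```python
import numpy as np

# Hypothetical decision scores from 3 one-vs-rest models
# (rows: tokens, columns: classes); a positive score means "this class".
scores = np.array([
    [ 0.8, -0.5, -1.0],   # only class 0 fires
    [ 0.3,  0.6, -0.2],   # classes 0 and 1 both fire
    [-0.4, -0.1, -0.9],   # no class fires
])

# What the separate binary models would each report.
binary_votes = (scores > 0).astype(int)

# One label per token: the class with the highest score.
y_multiclass = scores.argmax(axis=1)
print(binary_votes.sum(axis=1), y_multiclass)
```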


In [50]:
print(ff)
print(np.mean(ff))

[0.5294117647058824, 0.4, 0.7963176064441888, 0.7373572593800978, 0.4444444444444445, 0.688695652173913, 0.8326359832635983, 0.8847583643122676, 0.2666666666666667, 0.7692307692307693, 0.7333333333333334, 0.5263157894736842, 0.4, 0.7475247524752475, 0.8986486486486487, 0.7543859649122806]
0.6506079374665639


In [30]:
### Comparing classes: total, train, and test counts

for n in (range(17)):
  y1_train = mc2(y_train, n)
  y1_test = mc2(y_test, n)
  n0 = np.unique(y, return_counts=True)[1][n]  
  n1 = np.unique(y1_train, return_counts=True)[1][1]  
  n2 = np.unique(y1_test, return_counts=True)[1][1]  
  print(n, n0, n1, n2)


0 92 69 23
1 42 32 10
2 1846 1384 462
3 1400 1050 350
4 30 23 7
5 1354 1015 339
6 1024 768 256
7 1132 849 283
8 50 37 13
9 34 26 8
10 360 270 90
11 56 42 14
12 16 12 4
13 932 699 233
14 1190 892 298
15 284 213 71
16 57032 42774 14258


In [53]:
from sklearn.preprocessing import label_binarize

classes = np.arange(17)
print(classes)
# pass classes by keyword: recent scikit-learn versions require it
y2_train = label_binarize(y_train, classes=classes)
y2_test = label_binarize(y_test, classes=classes)

print(y2_train.shape)

y20_train = y2_train[:, 0] 
y20_test = y2_test[:, 0] 

print(y_train)
print(y20_train)
print(y20_test)

uniqs, counts = np.unique(y20_train, return_counts=True)
print(uniqs)
print(counts)
uniqs, counts = np.unique(y20_test, return_counts=True)
print(uniqs)
print(counts)






[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16]
(50155, 17)
[16 16 16 ... 16 13 16]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 1]
[50086    69]
[0 1]
[16696    23]


In [54]:
### y20: the first binary class
C=0.1
clf = LinearSVC(C=C, random_state=SEED)
clf.fit(X_train, y20_train)
y_pred = clf.predict(X_test)
f1  = metrics.f1_score(y20_test, y_pred)
print("f1=", f1)



f1= 0.8636363636363636


In [62]:
np.logspace(-1.5, -0.5, 5)

array([0.03162278, 0.05623413, 0.1       , 0.17782794, 0.31622777])

In [82]:
%%time

# 3. ######
# Tuning the hyperparameter individually for each class

from sklearn.preprocessing import label_binarize
classes = np.arange(17)
print(classes)
# pass classes by keyword: recent scikit-learn versions require it
y2_train = label_binarize(y_train, classes=classes)
y2_test = label_binarize(y_test, classes=classes)

uniqs, counts = np.unique(y, return_counts=True)
ff = []

for class1 in (range(17)):
#for class1 in (range(1)):
    
# Binary targets for the current class
  y21_train = y2_train[:, class1]
  y21_test = y2_test[:, class1]

# Search for the best C
  model = model_selection.GridSearchCV(LinearSVC(penalty='l2', 
             multi_class='ovr', random_state=SEED), 
             {'C': np.logspace(-1, -0.5, 5)}, cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)  
  
  model.fit(X_train, y21_train)  

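# Extract the best C (the top-ranked parameter set)
  rank = model.cv_results_['rank_test_score']
  params = model.cv_results_['params']
  i = np.argmin(rank)
  C_model = params[i]['C']

# Note: GridSearchCV refits the best estimator by default (refit=True),
# so this second fit repeats the whole search and should not change the result
  model.fit(X_train, y21_train)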
  y_pred = model.predict(X_test)
  f1  = metrics.f1_score(y21_test, y_pred)
  ff.append(f1)
  print('class=', class1, 'num=', counts[class1], 'C=', C_model, 'f1=', f1)

# Final score: mean of the per-class F1 values
print(ff)
print('f1-max=', np.mean(ff))


[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16]
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   32.0s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   30.1s finished


class= 0 num= 92 C= 0.1 f1= 0.8636363636363636
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   26.7s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   26.5s finished


class= 1 num= 42 C= 0.1 f1= 0.5263157894736842
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   51.0s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   50.7s finished


class= 2 num= 1846 C= 0.1 f1= 0.8674698795180722
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   57.9s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   57.2s finished


class= 3 num= 1400 C= 0.1333521432163324 f1= 0.8035450516986705
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   27.4s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   26.6s finished


class= 4 num= 30 C= 0.1 f1= 0.6
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   59.3s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  1.0min finished


class= 5 num= 1354 C= 0.1333521432163324 f1= 0.7758346581875994
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   38.6s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   37.7s finished


class= 6 num= 1024 C= 0.1 f1= 0.8825757575757577
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   40.3s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   40.5s finished


class= 7 num= 1132 C= 0.1 f1= 0.9314586994727592
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   29.4s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   28.5s finished


class= 8 num= 50 C= 0.1 f1= 0.608695652173913
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   27.2s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   26.6s finished


class= 9 num= 34 C= 0.1 f1= 0.75
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   32.5s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   32.2s finished


class= 10 num= 360 C= 0.1 f1= 0.7909604519774011
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   32.6s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   32.7s finished


class= 11 num= 56 C= 0.1 f1= 0.6206896551724138
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   27.0s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   28.0s finished


class= 12 num= 16 C= 0.1 f1= 0.6666666666666666
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   43.6s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   42.5s finished


class= 13 num= 932 C= 0.1 f1= 0.7948164146868251
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   36.0s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   36.1s finished


class= 14 num= 1190 C= 0.1 f1= 0.8950819672131146
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   34.1s finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:   34.3s finished


class= 15 num= 284 C= 0.1 f1= 0.8513513513513514
Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  2.6min finished


Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  2.5min finished


class= 16 num= 57032 C= 0.1333521432163324 f1= 0.9920715308581607
[0.8636363636363636, 0.5263157894736842, 0.8674698795180722, 0.8035450516986705, 0.6, 0.7758346581875994, 0.8825757575757577, 0.9314586994727592, 0.608695652173913, 0.75, 0.7909604519774011, 0.6206896551724138, 0.6666666666666666, 0.7948164146868251, 0.8950819672131146, 0.8513513513513514, 0.9920715308581607]
f1-max= 0.7777158758625148
Wall time: 31min 27s


In [79]:
# Check
rank = model.cv_results_['rank_test_score']
params = model.cv_results_['params']
i = np.argmin(rank)
C_model = params[i]['C']
C_model

0.03162277660168379
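
The manual lookup through `rank_test_score` should give the same value as the `best_params_` attribute that `GridSearchCV` exposes after fitting. The sketch below checks this on synthetic binary data; the data, the grid, and the variable names `C_manual` and `C_best` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

SEED = 1337

# Synthetic binary problem (assumption, not the notebook's data).
X, y = make_classification(n_samples=300, n_features=10, random_state=SEED)

search = GridSearchCV(LinearSVC(random_state=SEED),
                      {'C': np.logspace(-1.5, -0.5, 5)},
                      cv=3, scoring='f1')
search.fit(X, y)

# Manual lookup, as done in the notebook.
i = np.argmin(search.cv_results_['rank_test_score'])
C_manual = search.cv_results_['params'][i]['C']

# Built-in shortcut for the same information.
C_best = search.best_params_['C']
print(C_manual, C_best)
```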

In [83]:
ff

[0.8636363636363636,
 0.5263157894736842,
 0.8674698795180722,
 0.8035450516986705,
 0.6,
 0.7758346581875994,
 0.8825757575757577,
 0.9314586994727592,
 0.608695652173913,
 0.75,
 0.7909604519774011,
 0.6206896551724138,
 0.6666666666666666,
 0.7948164146868251,
 0.8950819672131146,
 0.8513513513513514,
 0.9920715308581607]