**Task:** build a feed-forward NN model in PyTorch for the NER task from homework 4. Embeddings may be used. The baselines have to be beaten.

NER task:

Quality metric: F1 (f1_macro); higher is better
 
baseline 1:            0.0604      random labels  
baseline 2:            0.3966      PoS features + logistic regression  
baseline 3:            0.8122      word2vec cbow embedding + baseline 2 + svm  
my result from HW 4:   0.8577      CatBoost
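
Macro F1 is the unweighted mean of the per-class F1 scores, so a rare tag counts as much as the majority tag. The sketch below uses invented toy labels (not the homework data) to show how a model can reach high accuracy while its macro F1 stays low.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (hypothetical): class 0 dominates, class 1 is rare.
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.zeros(10, dtype=int)  # a model that always predicts class 0

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores a class with no predicted samples as 0 without a warning
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average='macro', zero_division=0)

print('accuracy ', acc)
print('per-class', per_class)
print('macro F1 ', macro)
```

The rare class contributes an F1 of zero, which halves the macro average even though 80% of the predictions are correct.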


In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


SEED=1337

In [2]:
df = pd.read_csv('ner_short.csv', index_col=0)
df.head(10)
len(df)

66874

In [3]:
# sentence length
tdf = df.set_index('sentence_idx')
tdf['length'] = df.groupby('sentence_idx').tag.count()
df = tdf.reset_index(drop=False)
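
The same per-token sentence length can probably be obtained in one step with `groupby(...).transform('count')`, which broadcasts each group's size back onto its rows. A minimal sketch on a toy frame (the column names mirror the ones above, the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    'sentence_idx': [1, 1, 1, 2, 2],
    'tag': ['O', 'B-per', 'O', 'O', 'B-geo'],
})
# transform('count') returns one value per row, aligned with the original index
toy['length'] = toy.groupby('sentence_idx')['tag'].transform('count')
print(toy)
```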

In [4]:
# encode categorical variables
le = LabelEncoder()
df['pos'] = le.fit_transform(df.pos)
df['next-pos'] = le.fit_transform(df['next-pos'])
df['next-next-pos'] = le.fit_transform(df['next-next-pos'])
df['prev-pos'] = le.fit_transform(df['prev-pos'])
df['prev-prev-pos'] = le.fit_transform(df['prev-prev-pos'])
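
Because `le` is refitted on every column, the same PoS tag may receive a different integer in `pos` than in `next-pos`. That is harmless for the per-column one-hot encoding used later. If consistent codes are wanted, one option is to fit a single encoder on the union of all columns. A sketch on invented tags:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    'pos':      ['NN', 'VB', 'DT'],
    'next-pos': ['VB', 'DT', 'NN'],  # toy values, not taken from ner_short.csv
})
cols = ['pos', 'next-pos']

# one encoder fitted on every tag that occurs in any of the columns
shared = LabelEncoder().fit(np.unique(toy[cols].values.ravel()))
for c in cols:
    toy[c] = shared.transform(toy[c])

print(dict(zip(shared.classes_, range(len(shared.classes_)))))
print(toy)
```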

In [5]:
# splitting
y = LabelEncoder().fit_transform(df.tag)

df_train, df_test, y_train, y_test = model_selection.train_test_split(df, y, stratify=y, 
                                                                      test_size=0.25, random_state=SEED, shuffle=True)
print('train', df_train.shape[0])
print('test', df_test.shape[0])

train 50155
test 16719


In [8]:
y = np.array(y_train)
ytest = np.array(y_test)
y.shape, ytest.shape

((50155,), (16719,))

In [7]:
# some wrappers to work with word2vec
from gensim.models.word2vec import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import TransformerMixin
from collections import defaultdict

   
class Word2VecWrapper(TransformerMixin):
    def __init__(self, window=5,negative=5, size=100, iter=100, is_cbow=False, random_state=SEED):
        self.window_ = window
        self.negative_ = negative
        self.size_ = size
        self.iter_ = iter
        self.is_cbow_ = is_cbow
        self.w2v = None
        self.random_state = random_state
        
    def get_size(self):
        return self.size_

    def fit(self, X, y=None):
        """
        X: list of strings
        """
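        sentences_list = [x.split() for x in X]
        # `size` and `iter` are the gensim < 4.0 argument names;
        # gensim >= 4.0 renamed them to `vector_size` and `epochs`
        self.w2v = Word2Vec(sentences_list, 
                            window=self.window_,
                            negative=self.negative_, 
                            size=self.size_, 
                            iter=self.iter_,
                            sg=int(not self.is_cbow_),  # 1 = skip-gram, 0 = CBOW
                            seed=self.random_state)

        return self
    
    def has(self, word):
        # look words up through the KeyedVectors object (`.wv`)
        return word in self.w2v.wv

    def transform(self, X):
        """
        X: a list of words
        """
        if self.w2v is None:
            raise Exception('model not fitted')
        # out-of-vocabulary words are mapped to a zero vector
        wv = self.w2v.wv
        return np.array([wv[w] if w in wv else np.zeros(self.size_) for w in X])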

In [9]:
# here we exploit that word2vec is an unsupervised learning algorithm
# so we can train it on the whole dataset (subject to discussion)

sentences_list = [x.strip() for x in ' '.join(df.word).split('.')]

w2v_cbow = Word2VecWrapper(window=5, negative=5, size=300, iter=300, is_cbow=True, random_state=SEED)
w2v_cbow.fit(sentences_list)

<__main__.Word2VecWrapper at 0x1d67a030b70>

In [10]:
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

embeding = w2v_cbow
encoder_pos = OneHotEncoder()
X_train = sp.hstack([
    embeding.transform(df_train.word),
    embeding.transform(df_train['next-word']),
    embeding.transform(df_train['next-next-word']),
    embeding.transform(df_train['prev-word']),
    embeding.transform(df_train['prev-prev-word']),
    encoder_pos.fit_transform(df_train[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])
X_test = sp.hstack([
    embeding.transform(df_test.word),
    embeding.transform(df_test['next-word']),
    embeding.transform(df_test['next-next-word']),
    embeding.transform(df_test['prev-word']),
    embeding.transform(df_test['prev-prev-word']),
    encoder_pos.transform(df_test[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])
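
The feature matrix consists of five word-embedding blocks followed by the one-hot PoS block, so its width is 5 × embedding size plus the number of one-hot columns. The sketch below reproduces that bookkeeping on a tiny scale. `embed` and `vocab` are hypothetical stand-ins for the word2vec wrapper (4 dimensions instead of 300, two blocks instead of five), and `handle_unknown='ignore'` is an optional safeguard against unseen categories at test time that the notebook itself does not use.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

EMB = 4
rng = np.random.RandomState(0)
vocab = {w: rng.rand(EMB) for w in ['john', 'lives', 'in', 'paris']}

def embed(words):
    # zero vector for out-of-vocabulary words, as in Word2VecWrapper.transform
    return np.array([vocab.get(w, np.zeros(EMB)) for w in words])

words = ['john', 'lives', 'unknown']
next_words = ['lives', 'in', 'paris']
pos = np.array([[0, 1], [1, 2], [2, 0]])  # two toy PoS columns, already label-encoded

enc = OneHotEncoder(handle_unknown='ignore')
X = sp.hstack([
    sp.csr_matrix(embed(words)),
    sp.csr_matrix(embed(next_words)),
    enc.fit_transform(pos),
]).tocsr()

print(X.shape)  # 2 embedding blocks of width 4 plus 3 + 3 one-hot columns
```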

In [11]:
import torch as tt
from torch import nn
from tqdm import tqdm_notebook
from torch.autograd import Variable
seed = 15
tt.manual_seed(seed)
tt.cuda.manual_seed(seed)
tt.backends.cudnn.deterministic=True
np.random.seed(seed)  # seed the global NumPy RNG that fit() uses for shuffling
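
`np.random.RandomState(seed)` builds a separate generator object and leaves the global one untouched, whereas `np.random.seed(seed)` reseeds the global generator that `np.random.choice` draws from. A short check of both behaviours:

```python
import numpy as np

seed = 15

# Reseeding the global generator reproduces the same permutation.
np.random.seed(seed)
a = np.random.choice(10, 10, replace=False)
np.random.seed(seed)
b = np.random.choice(10, 10, replace=False)
print((a == b).all())

# A standalone generator is reproducible too, but only if its own methods are called.
rng1 = np.random.RandomState(seed)
rng2 = np.random.RandomState(seed)
print((rng1.permutation(10) == rng2.permutation(10)).all())
```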

In [12]:
input_size = X_train.shape[1]    # 1706 here: 5 * 300 embedding dims + 206 one-hot PoS columns
hidden_size = 100
num_classes = df.tag.nunique()   # number of distinct NER tags

learning_rate = 0.1
batch_size = 1000
num_epochs = 20

In [14]:
class FNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FNN, self).__init__()
        # use the constructor arguments rather than the notebook-level globals
        self.fc1 = nn.Linear(input_dim, hidden_dim)   # 1st fully connected layer
        self.relu1 = nn.ReLU()                        # non-linearity
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)  # 2nd fully connected layer
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden_dim, hidden_dim)  # 3rd fully connected layer
        self.relu3 = nn.ReLU()
        self.fc4 = nn.Linear(hidden_dim, output_dim)  # output layer, returns logits
        
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu1(out)
        out = self.fc2(out)
        out = self.relu2(out)
        out = self.fc3(out)
        out = self.relu3(out)
        out = self.fc4(out)
        return out
    
    def fit(self, X_train, y, batch_size, learning_rate, num_epochs):
        # loss and optimizer; optimize this instance, not the global `model`
        criterion = nn.CrossEntropyLoss()
        optimizer = tt.optim.SGD(self.parameters(), lr=learning_rate)
        
        size = X_train.shape[0]
        n_batches = int(np.ceil(size/batch_size))
        print(n_batches)
        # densify the sparse matrix once instead of on every batch
        X_dense = X_train.toarray()
        for epoch in range(num_epochs):
            # random permutation over sample indices
            indices = np.random.choice(size, size, replace=False)        
            epoch_average_loss = 0
            # iterate over mini-batches
            for j in range(n_batches):
                # NOTE: the window starts at j, not j * batch_size, so consecutive batches overlap
                # and one epoch only touches the first n_batches + batch_size - 1 shuffled indices
                batch_idx = indices[j: j + batch_size]
                
                # wrap the batch into tensors before feeding it to the network
                batch_x = tt.from_numpy(X_dense[batch_idx])  # float64 features
                batch_y = tt.from_numpy(y[batch_idx])        # integer targets
                optimizer.zero_grad()
                pred = self(batch_x)
                
                # cross-entropy loss
                loss = criterion(pred, batch_y.long())

                # calculate gradients
                loss.backward()
                # make optimization step
                optimizer.step()
                epoch_average_loss += loss.item()

                # NOTE: this division runs on every batch, so the value logged below is a
                # repeatedly rescaled running sum rather than the mean loss over the epoch
                epoch_average_loss /= n_batches
                # logging
                if (j+1) % 50 == 0:
                    print('Epoch [%d/%d], Step [%d/%d], Loss: %.4f'
                          %(epoch+1, num_epochs, j+1, size//batch_size, epoch_average_loss))
        return self
            
    def predict(self, X):
        xt = tt.from_numpy(X.toarray())
        # no gradients are needed at inference time
        with tt.no_grad():
            pred = self(xt)
        # argmax over logits equals argmax over softmax probabilities
        return pred.argmax(dim=1).numpy()
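
One common way to walk over an epoch is to cut the shuffled index array into disjoint slices of `batch_size`, so that every sample is visited exactly once, and to divide the accumulated loss by the number of batches only after the loop. The sketch below shows just that bookkeeping with NumPy. `iterate_minibatches` is a hypothetical helper, not part of the notebook, and the per-batch loss is a placeholder value.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield disjoint index batches covering a random permutation of range(n_samples)."""
    indices = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]

rng = np.random.RandomState(0)
n_samples, batch_size = 10, 4

seen = []
loss_sum, n_batches = 0.0, 0
for batch_idx in iterate_minibatches(n_samples, batch_size, rng):
    seen.extend(batch_idx.tolist())
    loss_sum += 1.0          # placeholder for loss.item() of this batch
    n_batches += 1

epoch_average_loss = loss_sum / n_batches   # divide once, after the loop
print(sorted(seen) == list(range(n_samples)), n_batches, epoch_average_loss)
```

With 10 samples and a batch size of 4 this yields batches of 4, 4 and 2 indices, and the last, shorter batch is kept rather than dropped.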

In [None]:
# previous
'''
   def __init__(self, input_size, hidden_size, num_classes):
        super(FNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)  # 1st Full-Connected Layer
        self.relu = nn.ReLU()                          # ReLU Layer
        self.fc2 = nn.Linear(hidden_size, num_classes) # 2nd Full-Connected Layer
    
    def forward(self, x):
        #print('type X', type(x))
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out
'''

In [17]:
# instantiate
model = FNN(input_size, hidden_size, num_classes)
model.double()

FNN(
  (fc1): Linear(in_features=1706, out_features=100, bias=True)
  (relu1): ReLU()
  (fc2): Linear(in_features=100, out_features=100, bias=True)
  (relu2): ReLU()
  (fc3): Linear(in_features=100, out_features=100, bias=True)
  (relu3): ReLU()
  (fc4): Linear(in_features=100, out_features=17, bias=True)
)

In [18]:
%%time
model.fit(X_train, y, batch_size, learning_rate, num_epochs)

Epoch [1/20], Step [50/50], Loss: 0.0093
Epoch [2/20], Step [50/50], Loss: 0.0080
Epoch [3/20], Step [50/50], Loss: 0.0083
Epoch [4/20], Step [50/50], Loss: 0.0072
Epoch [5/20], Step [50/50], Loss: 0.0075
Epoch [6/20], Step [50/50], Loss: 0.0056
Epoch [7/20], Step [50/50], Loss: 0.0061
Epoch [8/20], Step [50/50], Loss: 0.0053
Epoch [9/20], Step [50/50], Loss: 0.0048
Epoch [10/20], Step [50/50], Loss: 0.0042
Epoch [11/20], Step [50/50], Loss: 0.0030
Epoch [12/20], Step [50/50], Loss: 0.0025
Epoch [13/20], Step [50/50], Loss: 0.0017
Epoch [14/20], Step [50/50], Loss: 0.0023
Epoch [15/20], Step [50/50], Loss: 0.0018
Epoch [16/20], Step [50/50], Loss: 0.0016
Epoch [17/20], Step [50/50], Loss: 0.0017
Epoch [18/20], Step [50/50], Loss: 0.0014
Epoch [19/20], Step [50/50], Loss: 0.0011
Epoch [20/20], Step [50/50], Loss: 0.0013
Wall time: 11min 29s


In [19]:
# test
from sklearn.metrics import f1_score

print('Starting predicting...')
print('train', metrics.f1_score(y, model.predict(X_train), average='macro'))
print('test', f1_score(y_test, model.predict(X_test), average='macro'))

Starting predicting...
train 0.31584474806737967
test 0.2879993980908926


In [27]:
learning_rate = 1
batch_size = 1000
num_epochs = 50

model = FNN(input_size, hidden_size, num_classes)
model.double()

In [28]:
%%time
model.fit(X_train, y, batch_size, learning_rate, num_epochs)

51
Epoch [1/50], Step [50/50], Loss: 0.0047
Epoch [2/50], Step [50/50], Loss: 0.0074
Epoch [3/50], Step [50/50], Loss: 0.0035
Epoch [4/50], Step [50/50], Loss: 0.0016
Epoch [5/50], Step [50/50], Loss: 0.0013
Epoch [6/50], Step [50/50], Loss: 0.0012
Epoch [7/50], Step [50/50], Loss: 0.0007
Epoch [8/50], Step [50/50], Loss: 0.0005
Epoch [9/50], Step [50/50], Loss: 0.0013
Epoch [10/50], Step [50/50], Loss: 0.0006
Epoch [11/50], Step [50/50], Loss: 0.0003
Epoch [12/50], Step [50/50], Loss: 0.0005
Epoch [13/50], Step [50/50], Loss: 0.0003
Epoch [14/50], Step [50/50], Loss: 0.0003
Epoch [15/50], Step [50/50], Loss: 0.0004
Epoch [16/50], Step [50/50], Loss: 0.0001
Epoch [17/50], Step [50/50], Loss: 0.0001
Epoch [18/50], Step [50/50], Loss: 0.0003
Epoch [19/50], Step [50/50], Loss: 0.0002
Epoch [20/50], Step [50/50], Loss: 0.0005
Epoch [21/50], Step [50/50], Loss: 0.0001
Epoch [22/50], Step [50/50], Loss: 0.0001
Epoch [23/50], Step [50/50], Loss: 0.0002
Epoch [24/50], Step [50/50], Loss: 0.000

In [29]:
# test
from sklearn.metrics import f1_score, accuracy_score
print('Starting predicting...')
print('train', metrics.f1_score(y, model.predict(X_train), average='macro'))
print('test', f1_score(y_test, model.predict(X_test), average='macro'))
print('accuracy test', accuracy_score(y_test, model.predict(X_test)))
print('accuracy train', accuracy_score(y, model.predict(X_train)))

Starting predicting...
train 0.6222747117098374
test 0.4631345854475651
accuracy test 0.938572881153179
accuracy train 0.9573123317715083


So I have beaten the **second** baseline (test F1 0.463 vs. 0.3966).

In [14]:
learning_rate = 1.5
batch_size = 700
num_epochs = 100

model = FNN(input_size, hidden_size, num_classes)
model.double()

In [15]:
%%time
model.fit(X_train, y, batch_size, learning_rate, num_epochs)

72
Epoch [1/100], Step [50/71], Loss: 0.0040
Epoch [2/100], Step [50/71], Loss: 0.0003
Epoch [3/100], Step [50/71], Loss: 0.0004
Epoch [4/100], Step [50/71], Loss: 0.0003
Epoch [5/100], Step [50/71], Loss: 0.0000
Epoch [6/100], Step [50/71], Loss: 0.0001
Epoch [7/100], Step [50/71], Loss: 0.0001
Epoch [8/100], Step [50/71], Loss: 0.0000
Epoch [9/100], Step [50/71], Loss: 0.0000
Epoch [10/100], Step [50/71], Loss: 0.0000
Epoch [11/100], Step [50/71], Loss: 0.0000
Epoch [12/100], Step [50/71], Loss: 0.0000
Epoch [13/100], Step [50/71], Loss: 0.0001
Epoch [14/100], Step [50/71], Loss: 0.0000
Epoch [15/100], Step [50/71], Loss: 0.0002
Epoch [16/100], Step [50/71], Loss: 0.0000
Epoch [17/100], Step [50/71], Loss: 0.0000
Epoch [18/100], Step [50/71], Loss: 0.0000
Epoch [19/100], Step [50/71], Loss: 0.0000
Epoch [20/100], Step [50/71], Loss: 0.0000
Epoch [21/100], Step [50/71], Loss: 0.0000
Epoch [22/100], Step [50/71], Loss: 0.0000
Epoch [23/100], Step [50/71], Loss: 0.0000
Epoch [24/100], S

In [18]:
# test
from sklearn.metrics import f1_score, accuracy_score
print('Starting predicting...')
print('train', f1_score(y, model.predict(X_train), average='macro'))
print('test', f1_score(y_test, model.predict(X_test), average='macro'))
print('accuracy test', accuracy_score(y_test, model.predict(X_test)))
print('accuracy train', accuracy_score(y, model.predict(X_train)))

Starting predicting...
train 0.7866016050000514
test 0.6117913802227536
accuracy test 0.9534063042047969
accuracy train 0.9719070880271159


So far, 0.61 on the test set is my best result.

This result is **better**, but the third baseline is still not beaten. I have already used embeddings and trained for many epochs. Let's try reducing the batch size further and training for more epochs, keeping the learning rate at 1.5.

In [24]:
learning_rate = 1.5
batch_size = 600
num_epochs = 130

model = FNN(input_size, hidden_size, num_classes)
model.double()

In [25]:
%%time
model.fit(X_train, y, batch_size, learning_rate, num_epochs)

84
Epoch [1/130], Step [50/83], Loss: 0.0021
Epoch [2/130], Step [50/83], Loss: 0.0003
Epoch [3/130], Step [50/83], Loss: 0.0001
Epoch [4/130], Step [50/83], Loss: 0.0001
Epoch [5/130], Step [50/83], Loss: 0.0001
Epoch [6/130], Step [50/83], Loss: 0.0000
Epoch [7/130], Step [50/83], Loss: 0.0001
Epoch [8/130], Step [50/83], Loss: 0.0001
Epoch [9/130], Step [50/83], Loss: 0.0001
Epoch [10/130], Step [50/83], Loss: 0.0000
Epoch [11/130], Step [50/83], Loss: 0.0001
Epoch [12/130], Step [50/83], Loss: 0.0000
Epoch [13/130], Step [50/83], Loss: 0.0000
Epoch [14/130], Step [50/83], Loss: 0.0000
Epoch [15/130], Step [50/83], Loss: 0.0000
Epoch [16/130], Step [50/83], Loss: 0.0000
Epoch [17/130], Step [50/83], Loss: 0.0000
Epoch [18/130], Step [50/83], Loss: 0.0003
Epoch [19/130], Step [50/83], Loss: 0.0000
Epoch [20/130], Step [50/83], Loss: 0.0000
Epoch [21/130], Step [50/83], Loss: 0.0000
Epoch [22/130], Step [50/83], Loss: 0.0000
Epoch [23/130], Step [50/83], Loss: 0.0001
Epoch [24/130], S

In [26]:
# test
from sklearn.metrics import f1_score, accuracy_score
print('Starting predicting...')
print('train', f1_score(y, model.predict(X_train), average='macro'))
print('test', f1_score(y_test, model.predict(X_test), average='macro'))
print('accuracy test', accuracy_score(y_test, model.predict(X_test)))
print('accuracy train', accuracy_score(y, model.predict(X_train)))

Starting predicting...
train 0.8388807129074735
test 0.6282688477170282
accuracy test 0.9552604820862491
accuracy train 0.9751570132588974


0.628 on the test set is the best result. I also tried adding more layers to the network, but then the very first training loss came out as NaN. I have already tuned parameters such as the batch size, the number of epochs and the learning rate, so let's stop at 0.628.
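
A NaN loss in the first steps is often a sign of exploding gradients, which becomes more likely with plain SGD at a learning rate above 1 as the network gets deeper. One thing that might be worth trying (an assumption, not something tested in this notebook) is gradient-norm clipping combined with a check that the loss is finite. A self-contained sketch on random data with a small stand-in network; `max_norm = 1.0` is an untuned illustrative value:

```python
import torch as tt
from torch import nn

tt.manual_seed(0)
X = tt.randn(64, 20, dtype=tt.float64)   # random stand-in features
y = tt.randint(0, 3, (64,))              # random stand-in labels

net = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 3)).double()
criterion = nn.CrossEntropyLoss()
optimizer = tt.optim.SGD(net.parameters(), lr=1.5)
max_norm = 1.0                           # illustrative value, untuned

for step in range(5):
    optimizer.zero_grad()
    loss = criterion(net(X), y)
    if not tt.isfinite(loss):
        raise RuntimeError('loss became non-finite at step %d' % step)
    loss.backward()
    # rescale gradients in place so that their global L2 norm is at most max_norm
    nn.utils.clip_grad_norm_(net.parameters(), max_norm)
    optimizer.step()

# global gradient norm after the last clipping call
grad_norm = tt.sqrt(sum((p.grad ** 2).sum() for p in net.parameters()))
print(float(loss), float(grad_norm))
```

Clipping bounds the size of each update to at most `lr * max_norm`, which may let a deeper network start training at the same learning rate. Lowering the learning rate is the simpler alternative.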

Two baselines are beaten.