# Seminar 3: Word Representations, Continued

In [None]:
jupyter notebook --notebook-dir="D:/"

In [11]:
%%writefile requirements.txt
gensim
pandas
razdel
scikit-learn
allennlp

Writing requirements.txt


In [None]:
!pip install --upgrade -r requirements.txt --user

# Torch

One of the best-known and most convenient frameworks for training neural networks. It does not require compiling models and executes everything on the fly.
Its foundation is Autograd, an automatic differentiation system. In essence, Torch = numpy + Autograd + a set of ready-made neural network modules.


*Parts of this section are taken from https://github.com/DanAnastasyev/DeepNLP-Course*

### Computation Graphs

Computation graphs are a convenient way to compute gradients of complex functions quickly.

For example, the function

$$f = (x + y) \cdot z$$

is represented by the graph

![graph](https://image.ibb.co/mWM0Lx/1_6o_Utr7_ENFHOK7_J4l_XJtw1g.png)  
*From [Backpropagation, Intuitions - CS231n](http://cs231n.github.io/optimization-2/)*

Let us fix the values of $x, y, z$ (shown in green in the picture). How do we compute $\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}$? (*Recall what backpropagation is.*)

In PyTorch such computations are very simple.

First we define the function as a plain sequence of operations:

In [0]:
import torch

x = torch.tensor(-2., requires_grad=True)
y = torch.tensor(5., requires_grad=True)
z = torch.tensor(-4., requires_grad=True)

q = x + y
f = q * z

Then we ask it: "Please compute the gradients."

In [0]:
f.backward()

print('df/dz =', z.grad)
print('df/dx =', x.grad)
print('df/dy =', y.grad)

df/dz = tensor(3.)
df/dx = tensor(-4.)
df/dy = tensor(-4.)
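
To make the backward pass less of a black box, the same gradients can be reproduced by hand. The sketch below applies the chain rule to $f = (x + y) \cdot z$ in plain Python and cross-checks the result with central finite differences. The helper names `finite_difference` and `func` are introduced here for illustration only and are not part of PyTorch.

```python
# Manual backpropagation for f = (x + y) * z, cross-checked numerically.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: compute intermediate and final values.
q = x + y   # 3.0
f = q * z   # -12.0

# Backward pass: apply the chain rule from the output back to the inputs.
df_dq = z            # d(q * z) / dq
df_dz = q            # d(q * z) / dz
df_dx = df_dq * 1.0  # dq/dx = 1
df_dy = df_dq * 1.0  # dq/dy = 1


def func(a, b, c):
    """The function whose gradients we are checking."""
    return (a + b) * c


def finite_difference(function, args, index, eps=1e-6):
    """Central finite-difference estimate of d function / d args[index]."""
    plus = list(args)
    minus = list(args)
    plus[index] += eps
    minus[index] -= eps
    return (function(*plus) - function(*minus)) / (2 * eps)


numeric = [finite_difference(func, (x, y, z), i) for i in range(3)]
print("analytic:", df_dx, df_dy, df_dz)
print("numeric: ", [round(g, 4) for g in numeric])
```

The analytic values match what `f.backward()` stores in `x.grad`, `y.grad` and `z.grad`.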


For more on how autograd works, see [Autograd mechanics](https://pytorch.org/docs/stable/notes/autograd.html).

In general, any tensor in PyTorch is an analogue of a multidimensional array in numpy.

It holds the data:

In [0]:
x.data

tensor(-2.)

The accumulated gradient:

In [0]:
x.grad

tensor(-4.)

The function that computes the gradient:

In [0]:
q.grad_fn

<AddBackward0 at 0x7f8176f8eb70>

And assorted additional metadata:

In [0]:
x.type(), x.shape, x.device, x.layout

('torch.FloatTensor', torch.Size([]), device(type='cpu'), torch.strided)

# Our Own Word2Vec

And now the promised hand-written Word2Vec. We implement it with Torch, although plain numpy would suffice here (with somewhat more effort).

### Preparation

We download everything from the previous seminar again...

In [None]:
!wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz
!gzip -d lenta-ru-news.csv.gz
!head -n 2 lenta-ru-news.csv

In [1]:
import pandas as pd
import re
import datetime as dt
from razdel import tokenize, sentenize
from string import punctuation

def get_date(url):
    # Returns None when the URL contains no date
    dates = re.findall(r"\d\d\d\d\/\d\d\/\d\d", url)
    return next(iter(dates), None)

dataset = pd.read_csv("lenta-ru-news.csv", sep=',', quotechar='\"', escapechar='\\', encoding='utf-8', header=0)
dataset["date"] = dataset["url"].apply(lambda x: dt.datetime.strptime(get_date(x), "%Y/%m/%d"))
dataset = dataset[dataset["date"] > "2017-01-01"]
dataset["text"] = dataset["text"].apply(lambda x: x.replace("\xa0", " "))
dataset["title"] = dataset["title"].apply(lambda x: x.replace("\xa0", " "))
train_dataset = dataset[dataset["date"] < "2018-04-01"]
test_dataset = dataset[dataset["date"] > "2018-04-01"]

texts = []
# for each text (news article)
for text in train_dataset["text"]:
    # for each sentence
    for sentence in sentenize(text):
        # strip punctuation and tokenize
        texts.append([token.text.lower() for token in tokenize(sentence.text) if token.text not in punctuation])

for title in train_dataset["title"]:
    texts.append([token.text.lower() for token in tokenize(title) if token.text not in punctuation])

assert len(texts) == 827217
assert len(texts[0]) > 0
assert texts[0][0].islower()
print(texts[0])

['возобновление', 'нормального', 'сотрудничества', 'между', 'россией', 'и', 'нато', 'невозможно', 'пока', 'москва', 'не', 'будет', 'соблюдать', 'нормы', 'международного', 'права']


A reminder...

![embeddings training](https://miro.medium.com/max/1400/0*o2FCVrLKtdcxPQqc.png)
*From [An implementation guide to Word2Vec using NumPy and Google Sheets](https://towardsdatascience.com/an-implementation-guide-to-word2vec-using-numpy-and-google-sheets-13445eebd281)*

We will build the skip-gram model ourselves.

## Preprocessing and Batching

Previously gensim built the vocabulary for us implicitly. Now we have to do it ourselves.

### Building Our Own Vocabulary

In [2]:
from collections import Counter


class Vocabulary:
    def __init__(self):
        self.word2index = {
            "<unk>": 0
        }
        self.index2word = ["<unk>"]

    def build(self, texts, min_count=5):
        # In a nested comprehension the loops are written in this order (outer loop first)
        words_counter = Counter(token for tokens in texts for token in tokens)
        for word, count in words_counter.most_common():
            if count >= min_count:
                self.word2index[word] = len(self.word2index)
        self.index2word = [word for word, _ in sorted(self.word2index.items(), key=lambda x: x[1])]
    
    @property
    def size(self):
        return len(self.index2word)
    
    def top(self, n=100):
        return self.index2word[1:n+1]
    
    def get_index(self, word):
        return self.word2index.get(word, 0)
    
    def get_word(self, index):
        return self.index2word[index]

vocabulary = Vocabulary()
vocabulary.build(texts)
assert vocabulary.word2index[vocabulary.index2word[10]] == 10
print(vocabulary.size)
print(vocabulary.top(100))

112084
['в', 'и', 'на', '«', '»', 'что', 'с', 'по', '—', 'не', 'из', 'этом', 'об', 'о', 'он', 'за', 'года', 'россии', 'к', 'его', 'для', 'как', 'также', 'от', 'а', 'это', 'сообщает', 'до', 'году', 'после', 'сша', 'у', 'во', 'время', 'был', 'при', 'заявил', 'со', 'словам', 'рублей', 'будет', 'ее', 'она', 'но', 'ранее', 'их', 'они', 'было', 'тысяч', 'более', 'того', 'том', 'мы', 'были', 'я', 'которые', 'все', 'который', 'человек', 'под', '2016', 'из-за', 'лет', '2017', 'украины', 'марта', 'процентов', 'чтобы', 'долларов', 'глава', 'президент', 'этого', 'отметил', 'же', 'сказал', 'так', 'января', 'или', 'страны', 'ру', 'то', 'еще', 'области', 'данным', 'была', 'президента', 'около', 'сообщил', 'февраля', 'однако', 'компании', 'может', 'уже', 'один', 'рассказал', 'только', 'процента', '1', '10', 'июня']


We collect all center words and their contexts and convert them to vocabulary indices.

In [3]:
def build_contexts(tokenized_texts, vocabulary, window_size):
    contexts = []
    for tokens in tokenized_texts:
        for i in range(len(tokens)):
            central_word = vocabulary.get_index(tokens[i])
            context = [vocabulary.get_index(tokens[i + delta]) for delta in range(-window_size, window_size + 1) 
                       if delta != 0 and i + delta >= 0 and i + delta < len(tokens)]
            if len(context) != 2 * window_size:
                continue

            contexts.append((central_word, context))
            
    return contexts

contexts = build_contexts(texts, vocabulary, window_size=2)
print(contexts[:5])
print(vocabulary.get_word(contexts[0][0]), [vocabulary.get_word(index) for index in contexts[0][1]])

[(1568, [17232, 26343, 135, 371]), (135, [26343, 1568, 371, 2]), (371, [1568, 135, 2, 695]), (2, [135, 371, 695, 2140]), (695, [371, 2, 2140, 216])]
сотрудничества ['возобновление', 'нормального', 'между', 'россией']


In [4]:
contexts[:4]

[(1568, [17232, 26343, 135, 371]),
 (135, [26343, 1568, 371, 2]),
 (371, [1568, 135, 2, 695]),
 (2, [135, 371, 695, 2140])]
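
The windowing rule used in `build_contexts` is easier to see on a toy sentence. The sketch below works on raw tokens rather than vocabulary indices and, like the function above, keeps only positions that have a full window on both sides. The helper name `toy_contexts` and the example sentence are made up for illustration.

```python
def toy_contexts(tokens, window_size):
    """Return (center, context) pairs, keeping only positions with a full window."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        # Positions near the sentence edges have fewer neighbours and are skipped.
        if len(left) + len(right) == 2 * window_size:
            pairs.append((center, left + right))
    return pairs


sentence = ["the", "quick", "brown", "fox", "jumps"]
pairs = toy_contexts(sentence, window_size=2)
print(pairs)
```

One consequence of this rule is that a sentence shorter than `2 * window_size + 1` tokens contributes no training examples at all.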

## Model and Training

In [6]:
import torch.nn as nn
import torch.optim as optim 
import torch.nn.functional as F
import time

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=32):
        super().__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        projections = self.embeddings(inputs)
        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally
        output = self.out_layer(projections)
        return output

In [7]:
import random
import numpy as np
import torch

def get_next_batch(contexts, window_size, batch_size, epochs_count):
    assert batch_size % (window_size * 2) == 0
    central_words, contexts = zip(*contexts)
    batch_size //= (window_size * 2)
    
    for epoch in range(epochs_count):
        indices = np.arange(len(contexts))
        np.random.shuffle(indices)
        batch_begin = 0
        while batch_begin < len(contexts):
            batch_indices = indices[batch_begin: batch_begin + batch_size]
            batch_contexts, batch_centrals = [], []
            for data_ind in batch_indices:
                central_word, context = central_words[data_ind], contexts[data_ind]
                batch_contexts.extend(context)
                batch_centrals.extend([central_word] * len(context))
                
            batch_begin += batch_size
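            # Build the index tensors directly on the GPU (torch.cuda.LongTensor is a legacy constructor)
            yield (torch.tensor(batch_contexts, dtype=torch.long, device="cuda"),
                   torch.tensor(batch_centrals, dtype=torch.long, device="cuda"))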

print(next(get_next_batch(contexts, window_size=2, batch_size=64, epochs_count=10)))

(tensor([    24,    400,  83012,     10,      1,   1731,   2387,    311,     23,
         11799,    526,    722,  24291,     19,   6622,  20093,    105,  30090,
           851,      7,     16,   2968,  17499,      5,      2,   6122,  16837,
           469,  10450,      4,      5,      1,  86508,    935,   7991,      1,
             3,  10214,      1,    417,   3345,     24,     28, 100396,      2,
           424,     99,     49,  38251,   6429,    414,     14,    771,    189,
          3531,      1,  79429,    999,      4,   4785,      5,      9,      1,
           267], device='cuda:0'), tensor([25149, 25149, 25149, 25149,     1,     1,     1,     1,  3629,  3629,
         3629,  3629,     4,     4,     4,     4,     3,     3,     3,     3,
            2,     2,     2,     2, 50289, 50289, 50289, 50289, 40335, 40335,
        40335, 40335,   443,   443,   443,   443,    97,    97,    97,    97,
            0,     0,     0,     0,     9,     9,     9,     9,     5,     5,
            5,

In [65]:
def train_model_skipgram(model, contexts, batch_size = 256, epochs_count=10, save_path="model.pt",
                loss_every_nsteps=1000, lr=0.01, device_name="cuda"):
    
    import torch.nn as nn
    import torch.optim as optim 
    import time
    
    # Words per batch: number of contexts (sentences) * window length
    # [80, 32]: a 32-dimensional embedding for each of 80 words
    # The loss function receives an output of shape [number of words, vocabulary size]
    # After softmax: [80, 112084] -> [80, 1]

    #Trainable params: 7285460
    #torch.Size([80])
    #torch.Size([80, 32])
    #torch.Size([80, 112084])
    #torch.Size([80, 112084])
    #torch.Size([80])
    
    params_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("Trainable params: {}".format(params_count))
    # device
    device = torch.device(device_name)
    # move the model to the device
    model = model.to(device)
    # running loss accumulator
    total_loss = 0
    start_time = time.time()
    # optimizer
    optimizer = optim.Adam(model.parameters(), lr=lr)
    # loss function (kept on the same device as the model)
    loss_function = nn.CrossEntropyLoss().to(device)
    
    for step, (batch_contexts, batch_centrals) in enumerate(get_next_batch(contexts, window_size=2, batch_size=batch_size, epochs_count=epochs_count)):
        logits = model(batch_centrals) # Forward pass
        loss = loss_function(logits, batch_contexts) # Compute the loss
        loss.backward() # Compute the gradients dL/dw
        optimizer.step() # Gradient descent step or a variant of it (Adam here)
        optimizer.zero_grad() # Zero the gradients so they do not accumulate into the next iteration

        total_loss += loss.item()
        if step != 0 and step % loss_every_nsteps == 0:
            print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, time.time() - start_time))
            total_loss = 0
            start_time = time.time()
            
    torch.save(model.state_dict(), save_path)
    # Loading the model
    #model = SkipGramModel(vocabulary.size, 32)
    #model.load_state_dict(torch.load('model.pt'))

model_skipgram = SkipGramModel(vocabulary.size, 32)
train_model_skipgram(model_skipgram, contexts, batch_size = 256, epochs_count=6, save_path="skipgram_v4.pt")

In [8]:
model_skipgram = SkipGramModel(vocabulary.size, 32)
model_skipgram.load_state_dict(torch.load('skipgram_v4.pt')) # current version

<All keys matched successfully>

## Basic Checks

In [9]:
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(embeddings, vocabulary, word):
    word_emb = embeddings[vocabulary.get_index(word)]
    
    similarities = cosine_similarity([word_emb], embeddings)[0]
    top10 = np.argsort(similarities)[-10:]
    
    return [vocabulary.get_word(index) for index in reversed(top10)]

embeddings = model_skipgram.embeddings.weight.cpu().data.numpy()
most_similar(embeddings, vocabulary, 'путин')

['путин',
 'гройсман',
 'мединский',
 'жириновский',
 'лукин',
 'владимир',
 'президент',
 'городецкий',
 'чистюхин',
 'маркин']

Let us make the same visualization as in the previous seminar.

In [68]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook

from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale


def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    output_notebook()
    
    if isinstance(color, str): 
        color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: 
        pl.show(fig)
    return fig


def get_tsne_projection(word_vectors):
    tsne = TSNE(n_components=2, verbose=100, n_iter=500)
    return scale(tsne.fit_transform(word_vectors))

def get_pca_projection(word_vectors):
    pca = PCA(n_components=2)
    return scale(pca.fit_transform(word_vectors))
    
    
def visualize_embeddings(embeddings, vocabulary, word_count, method="pca"):
    word_vectors = embeddings[1: word_count + 1]
    words = vocabulary.index2word[1: word_count + 1]
    get_projection = get_pca_projection if method == "pca" else get_tsne_projection
    projections = get_projection(word_vectors)
    draw_vectors(projections[:, 0], projections[:, 1], color='green', token=words)
    
    
visualize_embeddings(embeddings, vocabulary, 1000, method="pca")



### Task 1: Topic Classification with the Hand-Written word2vec

Check how the model above performs on the topic classification task.

In [11]:
from razdel import tokenize
import numpy as np

def get_text_embedding(model, vocabulary, phrase):
    emb = model.embeddings.weight.cpu().data.numpy()
    
    embeddings = np.array([emb[vocabulary.get_index(word.text.lower())]
                           for word in tokenize(phrase)])
    return np.mean(embeddings, axis=0)
    
get_text_embedding(model_skipgram, vocabulary, "Исландия рядом")

array([-1.4871542 ,  0.00996983, -0.4114669 ,  1.5288379 , -0.4958942 ,
       -0.9334554 ,  0.45812514, -0.15939265,  0.29071546,  0.33591405,
       -0.3964914 , -0.448543  ,  0.0661929 ,  0.25176588,  0.73372465,
        0.25915122, -0.6093377 , -0.24376357,  0.30421245,  2.2538934 ,
       -0.15664686,  0.32170027, -0.31348467,  0.57391906,  0.3352386 ,
       -0.08488153, -0.84000367, -0.4573204 , -0.33783725, -0.1849112 ,
        0.15400788,  0.11144926], dtype=float32)

In [12]:
target_labels = set(train_dataset["topic"].dropna().tolist())
target_labels -= {"69-я параллель", "Крым", "Культпросвет ", "Оружие", "Бизнес", "Путешествия"}
target_labels = list(target_labels)
print(target_labels)

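# Non-capturing group so that \b applies to every label, not only the first and last alternative
pattern = r'\b(?:{})\b'.format('|'.join(re.escape(label) for label in target_labels))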

train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]


import numpy as np
emb_size = 32

y_train = train_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_train = np.zeros((train_with_topics.shape[0], emb_size))
for i, phrase in enumerate(train_with_topics["text"]):
    if i % 5000 == 0:
        print("step {}".format(i))
    X_train[i, :] = get_text_embedding(model_skipgram, vocabulary, phrase)
    
print(X_train.shape)
print(y_train.shape)

y_test = test_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_test = np.zeros((test_with_topics.shape[0], emb_size))
for i, phrase in enumerate(test_with_topics["text"]):
    if i % 5000 == 0:
         print("step {}".format(i))
    X_test[i, :] = get_text_embedding(model_skipgram, vocabulary, phrase)
    
print(X_test.shape)
print(y_test.shape)

['Интернет и СМИ', 'Спорт', 'Мир', 'Бывший СССР', 'Из жизни', 'Культура', 'Экономика', 'Ценности', 'Силовые структуры', 'Россия', 'Дом', 'Наука и техника']




step 0
step 5000
step 10000
step 15000
step 20000
step 25000
step 30000
step 35000
step 40000
step 45000
step 50000
step 55000
step 60000
(63356, 32)
(63356,)
step 0
step 5000
step 10000
step 15000
step 20000
step 25000
step 30000
(30159, 32)
(30159,)


#### Classifier

In [13]:
%%time
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier()
clf.fit(X_train, y_train)

from sklearn import metrics

y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))



              precision    recall  f1-score   support

           0       0.64      0.66      0.65      2447
           1       0.92      0.93      0.92      3429
           2       0.73      0.77      0.75      4291
           3       0.74      0.56      0.64      2156
           4       0.70      0.75      0.72      2191
           5       0.76      0.68      0.72      1995
           6       0.82      0.72      0.77      3185
           7       0.88      0.66      0.75      1177
           8       0.50      0.64      0.56      1663
           9       0.62      0.70      0.66      4324
          10       0.74      0.68      0.71      1182
          11       0.83      0.77      0.80      2119

    accuracy                           0.73     30159
   macro avg       0.74      0.71      0.72     30159
weighted avg       0.74      0.73      0.73     30159

Wall time: 1min 13s


### Task 2: Hand-Written CBoW

Build an analogous model, but with the CBoW architecture.
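
In the classic CBoW formulation the context vectors are combined (averaged or summed) into a single vector, which is then used to predict the center word. The sketch below illustrates only that averaging step on toy data in plain Python. The embedding values and the helper name `average_context` are made up for illustration, and the sketch is not the model trained in this notebook, which feeds each context word to the network separately.

```python
def average_context(embeddings, context_indices):
    """Average the embedding rows selected by context_indices."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for index in context_indices:
        for d in range(dim):
            total[d] += embeddings[index][d]
    return [value / len(context_indices) for value in total]


# A toy embedding table: 5 words, 2 dimensions each.
toy_embeddings = [
    [0.0, 0.0],
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
    [7.0, 0.0],
]
context = [1, 2, 3, 4]  # indices of the four context words
hidden = average_context(toy_embeddings, context)
print(hidden)
```

In PyTorch the same step would typically look like `self.embeddings(inputs).mean(dim=1)` for `inputs` of shape `[batch, 2 * window_size]`, with the batching changed so that each example carries its whole context.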

In [14]:
import torch.nn as nn
import torch.optim as optim 
import time

class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=32):
        super().__init__()
        
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.out_layer = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        projections = self.embeddings(inputs)
        output = self.out_layer(projections)
        return output

In [15]:
import random
import numpy as np
import torch

def get_next_batch(contexts, window_size, batch_size, epochs_count):
    assert batch_size % (window_size * 2) == 0
    central_words, contexts = zip(*contexts)
    batch_size //= (window_size * 2)
    
    for epoch in range(epochs_count):
        indices = np.arange(len(contexts))
        np.random.shuffle(indices)
        batch_begin = 0
        while batch_begin < len(contexts):
            batch_indices = indices[batch_begin: batch_begin + batch_size]
            batch_contexts, batch_centrals = [], []
            for data_ind in batch_indices:
                central_word, context = central_words[data_ind], contexts[data_ind]
                batch_contexts.extend(context)
                batch_centrals.extend([central_word] * len(context))
                
            batch_begin += batch_size
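            # Build the index tensors directly on the GPU (torch.cuda.LongTensor is a legacy constructor)
            yield (torch.tensor(batch_contexts, dtype=torch.long, device="cuda"),
                   torch.tensor(batch_centrals, dtype=torch.long, device="cuda"))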

print(next(get_next_batch(contexts, window_size=2, batch_size=64, epochs_count=10)))

(tensor([ 1963,    11,    48, 68719,  2875,     1,  8387,  3056,   636,    46,
        77084, 14861,    31,     2,     1, 11475, 20072, 20115,   394,     2,
            7, 27271,  6182,   794, 33554,     1,    29,   303, 60647, 73081,
        63443,     5,     1,  2390,  1030,   388,   163,  5425,  1855,   641,
        17051, 10684,    38,   186,   878,  5182, 41708, 26945,   405,   102,
          454,    83,  3843,  6469, 22230,  5889,  1793,   789,    54, 53768,
         1232,     0,   232,  2384], device='cuda:0'), tensor([29343, 29343, 29343, 29343,   352,   352,   352,   352,   994,   994,
          994,   994,  5853,  5853,  5853,  5853,     0,     0,     0,     0,
           56,    56,    56,    56,  1764,  1764,  1764,  1764,     2,     2,
            2,     2,  2516,  2516,  2516,  2516,     3,     3,     3,     3,
        11438, 11438, 11438, 11438,    41,    41,    41,    41,   833,   833,
          833,   833,     2,     2,     2,     2,   394,   394,   394,   394,
        

In [71]:
def train_model_cbow(model, contexts, batch_size = 256, epochs_count=10, save_path="model.pt",
                loss_every_nsteps=1000, lr=0.01, device_name="cuda"):
    
    import torch.nn as nn
    import torch.optim as optim 
    import time
    
    params_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print("Trainable params: {}".format(params_count))
    # device
    device = torch.device(device_name)
    # move the model to the device
    model = model.to(device)
    # running loss accumulator
    total_loss = 0
    start_time = time.time()
    # optimizer
    optimizer = optim.Adam(model.parameters(), lr=lr)
    # loss function (kept on the same device as the model)
    loss_function = nn.CrossEntropyLoss().to(device)
    
    for step, (batch_contexts, batch_centrals) in enumerate(get_next_batch(contexts, window_size=2, batch_size=batch_size, epochs_count=epochs_count)):
        logits = model(batch_contexts) # Forward pass
        loss = loss_function(logits, batch_centrals) # Compute the loss
        loss.backward() # Compute the gradients dL/dw
        optimizer.step() # Gradient descent step or a variant of it (Adam here)
        optimizer.zero_grad() # Zero the gradients so they do not accumulate into the next iteration

        total_loss += loss.item()
        if step != 0 and step % loss_every_nsteps == 0:
            print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, time.time() - start_time))
            total_loss = 0
            start_time = time.time()
            
    torch.save(model.state_dict(), save_path)

model_cbow = CBOWModel(vocabulary.size, 32)
train_model_cbow(model_cbow, contexts, batch_size = 256, epochs_count=6, save_path="cbow_v2.pt")

Trainable params: 7285460
Step = 1000, Avg Loss = 8.0443, Time = 10.11s
Step = 2000, Avg Loss = 8.0811, Time = 9.85s
Step = 3000, Avg Loss = 8.0571, Time = 9.85s
Step = 4000, Avg Loss = 8.0354, Time = 9.85s
Step = 5000, Avg Loss = 7.9966, Time = 9.82s
Step = 6000, Avg Loss = 7.9848, Time = 9.91s
Step = 7000, Avg Loss = 7.9513, Time = 9.96s
Step = 1000, Avg Loss = 13.8264, Time = 18.24s
Step = 2000, Avg Loss = 7.4821, Time = 9.98s
Step = 3000, Avg Loss = 7.5237, Time = 10.04s
Step = 4000, Avg Loss = 7.5439, Time = 10.04s
Step = 5000, Avg Loss = 7.5741, Time = 9.86s
Step = 6000, Avg Loss = 7.5786, Time = 9.95s
Step = 7000, Avg Loss = 7.5857, Time = 9.90s


In [21]:
model_cbow = CBOWModel(vocabulary.size, 32)
model_cbow.load_state_dict(torch.load('cbow_v2.pt')) # current version

<All keys matched successfully>

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(embeddings, vocabulary, word):
    word_emb = embeddings[vocabulary.get_index(word)]
    
    similarities = cosine_similarity([word_emb], embeddings)[0]
    top10 = np.argsort(similarities)[-10:]
    
    return [vocabulary.get_word(index) for index in reversed(top10)]

embeddings = model_cbow.embeddings.weight.cpu().data.numpy()
most_similar(embeddings, vocabulary, 'путин')

['путин',
 'городецкий',
 'президент',
 'мединский',
 'брынзак',
 'нечаев',
 'медведев',
 'шаманов',
 'сафронов',
 'осаковский']

In [23]:
from razdel import tokenize
import numpy as np

def get_text_embedding(model, vocabulary, phrase):
    emb = model.embeddings.weight.cpu().data.numpy()
    
    embeddings = np.array([emb[vocabulary.get_index(word.text.lower())]
                           for word in tokenize(phrase)])
    return np.mean(embeddings, axis=0)

In [24]:
target_labels = set(train_dataset["topic"].dropna().tolist())
target_labels -= {"69-я параллель", "Крым", "Культпросвет ", "Оружие", "Бизнес", "Путешествия"}
target_labels = list(target_labels)
print(target_labels)

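# Non-capturing group so that \b applies to every label, not only the first and last alternative
pattern = r'\b(?:{})\b'.format('|'.join(re.escape(label) for label in target_labels))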

train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]


import numpy as np
emb_size = 32

y_train = train_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_train = np.zeros((train_with_topics.shape[0], emb_size))
for i, phrase in enumerate(train_with_topics["text"]):
    if i % 5000 == 0:
        print("step {}".format(i))
    X_train[i, :] = get_text_embedding(model_cbow, vocabulary, phrase)
    
print(X_train.shape)
print(y_train.shape)

y_test = test_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_test = np.zeros((test_with_topics.shape[0], emb_size))
for i, phrase in enumerate(test_with_topics["text"]):
    if i % 5000 == 0:
         print("step {}".format(i))
    X_test[i, :] = get_text_embedding(model_cbow, vocabulary, phrase)
    
print(X_test.shape)
print(y_test.shape)

['Интернет и СМИ', 'Спорт', 'Мир', 'Бывший СССР', 'Из жизни', 'Культура', 'Экономика', 'Ценности', 'Силовые структуры', 'Россия', 'Дом', 'Наука и техника']




step 0
step 5000
step 10000
step 15000
step 20000
step 25000
step 30000
step 35000
step 40000
step 45000
step 50000
step 55000
step 60000
(63356, 32)
(63356,)
step 0
step 5000
step 10000
step 15000
step 20000
step 25000
step 30000
(30159, 32)
(30159,)


#### Classifier

In [25]:
%%time
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier()
clf.fit(X_train, y_train)

from sklearn import metrics

y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))



              precision    recall  f1-score   support

           0       0.66      0.65      0.65      2447
           1       0.90      0.94      0.92      3429
           2       0.71      0.80      0.75      4291
           3       0.71      0.62      0.66      2156
           4       0.73      0.71      0.72      2191
           5       0.74      0.72      0.73      1995
           6       0.81      0.73      0.77      3185
           7       0.85      0.67      0.75      1177
           8       0.46      0.64      0.54      1663
           9       0.63      0.67      0.65      4324
          10       0.80      0.66      0.72      1182
          11       0.84      0.68      0.75      2119

    accuracy                           0.73     30159
   macro avg       0.74      0.71      0.72     30159
weighted avg       0.73      0.73      0.73     30159

Wall time: 1min 11s


### Task 3*: Negative Sampling

Implement negative sampling in place of the full softmax.
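
Before the PyTorch version, it may help to see the objective for a single training pair. Negative sampling replaces the softmax over the whole vocabulary with a few binary decisions: the observed (center, context) pair should score high, and a handful of randomly drawn "negative" words should score low. The sketch below computes that loss in plain Python for hand-picked 2-dimensional vectors. All vectors and helper names here are illustrative assumptions, not values from the trained model.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    """Dot product of two equally sized vectors."""
    return sum(u * v for u, v in zip(a, b))


def negative_sampling_loss(center, positive, negatives):
    """Loss for one (center, context) pair with a list of sampled negative vectors."""
    # Reward a high score for the observed pair...
    loss = -log_sigmoid(dot(center, positive))
    # ...and a low score for every sampled negative word.
    for negative in negatives:
        loss -= log_sigmoid(-dot(center, negative))
    return loss


center = [1.0, 0.0]
positive = [2.0, 0.0]
negatives = [[-2.0, 0.0], [0.0, 1.0]]
loss = negative_sampling_loss(center, positive, negatives)
print(round(loss, 4))
```

The cost per pair depends on the number of negatives rather than on the vocabulary size, which is the whole point of the technique.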

In [4]:
import random
import numpy as np
import torch

def get_next_batch(contexts, window_size, batch_size, epochs_count):
    assert batch_size % (window_size * 2) == 0
    central_words, contexts = zip(*contexts)
    batch_size //= (window_size * 2)
    
    for epoch in range(epochs_count):
        indices = np.arange(len(contexts))
        np.random.shuffle(indices)
        batch_begin = 0
        while batch_begin < len(contexts):
            batch_indices = indices[batch_begin: batch_begin + batch_size]
            batch_contexts, batch_centrals = [], []
            for data_ind in batch_indices:
                central_word, context = central_words[data_ind], contexts[data_ind]
                batch_contexts.extend(context)
                batch_centrals.extend([central_word] * len(context))
                
            batch_begin += batch_size
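            # Build the index tensors directly on the GPU (torch.cuda.LongTensor is a legacy constructor)
            yield (torch.tensor(batch_contexts, dtype=torch.long, device="cuda"),
                   torch.tensor(batch_centrals, dtype=torch.long, device="cuda"))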

print(next(get_next_batch(contexts, window_size=2, batch_size=64, epochs_count=10)))

(tensor([   37,  9006,   256,    10,   157, 30282, 60515,     3,     6,  1401,
          471,    30,  2062,     1,    68,  3267,    82,   271,  7098,     4,
          118,     7,  3559,     3,   154, 16154,  5778,  3416,  1472,  1380,
          503,     7,    10, 15746,    76,     2,   476,     4,   224,     7,
         1162,   440,  4898,   394,  2631, 21458,     1, 14000,  2947,  4100,
        82113,     5, 10898,     0,  9093,  3689,     1,  6279,    52,    74,
         1102,   112,  4133,     1], device='cuda:0'), tensor([    6,     6,     6,     6, 62396, 62396, 62396, 62396, 23698, 23698,
        23698, 23698,   501,   501,   501,   501,   226,   226,   226,   226,
        11695, 11695, 11695, 11695,     1,     1,     1,     1, 30667, 30667,
        30667, 30667,     5,     5,     5,     5,     1,     1,     1,     1,
            8,     8,     8,     8,   905,   905,   905,   905,     4,     4,
            4,     4,    60,    60,    60,    60,     1,     1,     1,     1,
        

In [50]:
from torch.utils.data import Dataset, DataLoader
from torch.cuda import LongTensor as lt
from torch.cuda import FloatTensor as ft
import numpy as np
import torch
import torch.nn as nn

    
class Word2Vec(nn.Module):
    def __init__(self, voc_size, embedding_dim=32):
        super().__init__()

        self.embeddings = nn.Embedding(voc_size, embedding_dim)
    #    self.out_layer = nn.Linear(embedding_dim, voc_size)

    def forward(self, inputs):
        projections = self.embeddings.forward(inputs)
        return projections


class NegativeSampling(nn.Module):

    def __init__(self, model, voc_size=112084, n_negs=5, weights=None):
        super(NegativeSampling, self).__init__()
        self.model = model
        self.voc_size = voc_size
        self.n_negs = n_negs
        self.weights = None
        if weights is not None:
            wf = np.power(weights, 0.75)
            wf = wf / wf.sum()
            self.weights = ft(wf)

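    def forward(self, iword, owords):
        # iword, owords: flat index tensors of shape [B], one (center, context) pair per position
        batch_size = iword.size(0)
        if self.weights is not None:
            # draw negatives from the smoothed unigram distribution
            nwords = torch.multinomial(self.weights, batch_size * self.n_negs, replacement=True)
            nwords = nwords.view(batch_size, self.n_negs)
        else:
            # draw negatives uniformly from the vocabulary
            nwords = torch.randint(0, self.voc_size, (batch_size, self.n_negs), device=iword.device)
        ivectors = self.model(iword).unsqueeze(2)   # [B, D, 1]
        ovectors = self.model(owords).unsqueeze(1)  # [B, 1, D]
        nvectors = self.model(nwords).neg()         # [B, n_negs, D]
        # log-sigmoid of the score for the observed pair: [B]
        oloss = torch.bmm(ovectors, ivectors).view(batch_size).sigmoid().log()
        # sum over negatives of the log-sigmoid of the negated score: [B]
        nloss = torch.bmm(nvectors, ivectors).squeeze(2).sigmoid().log().sum(1)
        return -(oloss + nloss).mean()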

In [51]:
import time
import torch.optim as optim


def train_negative(contexts, batch_size=256, epochs_count=10, voc_size=112084, embedding_dim=32, n_negs=5,
                   save_path="model.pt", loss_every_nsteps=1000, lr=0.01, device_name="cuda"):
    device = torch.device(device_name)

    model = Word2Vec(voc_size=voc_size, embedding_dim=embedding_dim).to(device)
    optimizer = optim.Adam(model.parameters(), lr=lr)
    loss_function = NegativeSampling(model=model, voc_size=voc_size, n_negs=n_negs, weights=None).to(device)

    start_time = time.time()
    total_loss = 0
    batches = get_next_batch(contexts, window_size=2, batch_size=batch_size, epochs_count=epochs_count)
    for step, (batch_contexts, batch_centrals) in enumerate(batches):
        batch_contexts = batch_contexts.to(device)
        batch_centrals = batch_centrals.to(device)

        optimizer.zero_grad()
        loss = loss_function(batch_centrals, batch_contexts)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        # Report the average loss over the last `loss_every_nsteps` steps (inside the loop)
        if step != 0 and step % loss_every_nsteps == 0:
            print("Step = {}, Avg Loss = {:.4f}, Time = {:.2f}s".format(step, total_loss / loss_every_nsteps, time.time() - start_time))
            total_loss = 0
            start_time = time.time()
    torch.save(model.state_dict(), save_path)

In [53]:
train_negative(contexts[:1000], batch_size = 256, epochs_count=1, voc_size=112084, embedding_dim=32, n_negs=5, save_path="negative_sampling.pt")

# Unsupervised targets
Word-level models have several problems. The main one is that identical tokens get identical representations in different contexts. In addition, naive Skip-gram and CBoW ignore the order of tokens in the context.
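
Both problems can be seen on a toy example. The sketch below is only an illustration: the vocabulary, the sentences, and the random embedding table are hypothetical stand-ins for a trained Skip-gram or CBoW model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and a random static embedding table
vocab = {"the": 0, "river": 1, "bank": 2, "approved": 3, "loan": 4, "flooded": 5}
embeddings = rng.normal(size=(len(vocab), 8))


def embed(tokens):
    """Look up one static vector per token, ignoring the context."""
    return embeddings[[vocab[token] for token in tokens]]


sentence_a = ["the", "bank", "approved", "the", "loan"]
sentence_b = ["the", "river", "bank", "flooded"]

# Problem 1: the vector of "bank" is identical in both sentences
bank_a = embed(sentence_a)[sentence_a.index("bank")]
bank_b = embed(sentence_b)[sentence_b.index("bank")]
same_vector = np.array_equal(bank_a, bank_b)

# Problem 2: a CBoW-style context representation (mean of the context vectors)
# does not change when the context tokens are reordered
context = ["the", "river", "flooded"]
cbow_forward = embed(context).mean(axis=0)
cbow_reversed = embed(context[::-1]).mean(axis=0)
order_invariant = np.allclose(cbow_forward, cbow_reversed)

print(same_vector, order_invariant)
```

Contextual models such as ELMo and BERT, discussed below, address the first problem by computing the vector of a token from the whole sentence.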

How do we extract information from raw text? What should the models that produce our representations be trained to predict?

1.   **Skip-gram**
2.   **CBoW**
3.   LM: language modeling (ELMo, ULMFiT)
4.   NSP: next sentence prediction (BERT; sometimes dropped in later variants)
5.   MLM: masked language modeling (BERT, the main objective)
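
To make the difference between the LM and MLM targets concrete, here is a minimal sketch of how training examples could be built from one tokenized sentence. The helper names `lm_examples` and `mlm_example` and the `[MASK]` string are illustrative assumptions, and the masked positions are chosen by hand, whereas BERT picks them at random.

```python
def lm_examples(tokens):
    """Left-to-right language modeling: predict each token from its prefix."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mlm_example(tokens, masked_positions, mask_token="[MASK]"):
    """Masked language modeling: hide some tokens and predict them from both sides."""
    masked_positions = set(masked_positions)
    inputs = [mask_token if i in masked_positions else token for i, token in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(masked_positions)}
    return inputs, targets


tokens = ["the", "cat", "sat", "on", "the", "mat"]

lm_pairs = lm_examples(tokens)
mlm_inputs, mlm_targets = mlm_example(tokens, masked_positions=[1, 4])

print(lm_pairs[:2])   # [(['the'], 'cat'), (['the', 'cat'], 'sat')]
print(mlm_inputs)     # ['the', '[MASK]', 'sat', 'on', '[MASK]', 'mat']
print(mlm_targets)    # {1: 'cat', 4: 'the'}
```

In the LM setting the model only sees the left context of the predicted token, while in the MLM setting it sees the tokens on both sides of each mask.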




# Language models



Language modeling is a long-established and well-understood task. A statistical language model is a probability distribution over sequences of words $$P(w_1,...,w_n)$$

An alternative formulation:
$$P(w_n | w_1,...,w_{n-1}) = P(w_n|w_1^{n-1})$$

N-gram models:

$$P(w_n|w_1^{n-1}) \approx P(w_n|w_{n-N+1}^{n-1})$$
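
The two formulations are linked by the chain rule, and the N-gram probabilities are usually estimated from corpus counts. As a sketch, with $C(\cdot)$ (notation assumed here) denoting how many times a sequence occurs in the training corpus:

$$P(w_1,...,w_n) = \prod_{k=1}^{n} P(w_k|w_1^{k-1})$$

$$P(w_n|w_{n-N+1}^{n-1}) \approx \frac{C(w_{n-N+1}^{n})}{C(w_{n-N+1}^{n-1})}$$

The second line is the maximum likelihood estimate: the count of the full N-gram divided by the count of its context.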

## Example of an N-gram model

In [0]:
from collections import Counter

import numpy as np


class NGramModel:
    def __init__(self, vocabulary, n=4):
        self.n = n
        # n_grams[k] maps a tuple of k token indices to its count (a probability after normalize())
        self.n_grams = [Counter() for _ in range(n+1)]
        self.vocabulary = vocabulary
    
    def collect_n_grams(self, tokens):
        indices = [self.vocabulary.get_index(token) for token in tokens]
        count = len(indices)
        for n in range(self.n + 1):
            # For n = 0 the empty tuple is counted once per token, giving the total token count
            for i in range(min(count - n + 1, count)):
                n_gram = indices[i:i+n]
                self.n_grams[n][tuple(n_gram)] += 1
                
    def normalize(self):
        # Go from the highest order down, so that the lower-order table still holds raw counts
        for n in range(self.n, 0, -1):
            current_n_grams = self.n_grams[n]
            for words, count in current_n_grams.items():
                prev_order_n_gram_count = self.n_grams[n-1][words[:-1]]
                current_n_grams[words] = count / prev_order_n_gram_count
        self.n_grams[0][tuple()] = 1.0
    
    def predict(self, context):
        indices = [self.vocabulary.get_index(token) for token in context]
        context = tuple(indices[-self.n + 1:])
        step_probabilities = np.zeros((self.vocabulary.size, ), dtype=np.float64)
        # Back off: start with the longest context and shorten it until some continuation is found
        for shift in range(self.n):
            current_n = self.n - shift
            wanted_context_length = current_n - 1
            if wanted_context_length > len(context):
                continue
            start_index = len(context) - wanted_context_length
            wanted_context = context[start_index:]
            
            s = 0.0
            for index in range(self.vocabulary.size):
                n_gram = wanted_context + (index,)
                p = self.n_grams[current_n].get(n_gram, 0)
                step_probabilities[index] = p
                s += p
            if s != 0.0:
                break
        return step_probabilities

vocabulary.word2index["<eos>"] = vocabulary.size
vocabulary.index2word.append("<eos>")
n_gram_model = NGramModel(vocabulary)
for text in texts[:1000]:
    n_gram_model.collect_n_grams(text + ["<eos>"])
n_gram_model.normalize()

In [0]:
seed = ["путин"]
while seed[-1] != "<eos>":
    proba = n_gram_model.predict(seed)
    seed.append(np.random.choice(vocabulary.index2word, size=1, p=proba)[0])
    print(seed)

['путин', 'не']
['путин', 'не', 'вышел']
['путин', 'не', 'вышел', 'к']
['путин', 'не', 'вышел', 'к', 'митингующим']
['путин', 'не', 'вышел', 'к', 'митингующим', 'после']
['путин', 'не', 'вышел', 'к', 'митингующим', 'после', 'пожара']
['путин', 'не', 'вышел', 'к', 'митингующим', 'после', 'пожара', 'в']
['путин', 'не', 'вышел', 'к', 'митингующим', 'после', 'пожара', 'в', 'кемерове']
['путин', 'не', 'вышел', 'к', 'митингующим', 'после', 'пожара', 'в', 'кемерове', '<eos>']


## ELMo (Embeddings from Language Models)

Original paper: https://arxiv.org/pdf/1802.05365.pdf

The Illustrated BERT, ELMo and co.: http://jalammar.github.io/illustrated-bert/

How do we apply it?

In [0]:
!wget http://vectors.nlpl.eu/repository/11/195.zip
!mkdir elmo && mv 195.zip elmo/195.zip && cd elmo && unzip 195.zip && rm 195.zip && cd ..
!ls elmo

--2019-10-28 22:32:41--  http://vectors.nlpl.eu/repository/11/195.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.225
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.225|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 206977021 (197M) [application/zip]
Saving to: ‘195.zip’


2019-10-28 22:32:54 (15.6 MB/s) - ‘195.zip’ saved [206977021/206977021]

Archive:  195.zip
  inflating: meta.json               
  inflating: model.hdf5              
  inflating: options.json            
  inflating: README                  
  inflating: vocab.txt               
meta.json  model.hdf5  options.json  README  vocab.txt


In [26]:
from allennlp.commands.elmo import ElmoEmbedder

path_elmo = 'D:/NLP_advanced/week2/part2/elmo/'
elmo = ElmoEmbedder(options_file=path_elmo+"options.json", weight_file=path_elmo+"model.hdf5", cuda_device=0) 

In [27]:
import numpy as np
embeddings = elmo.batch_to_embeddings(texts[:32])[0].cpu().numpy()
print(embeddings.shape)
embeddings = embeddings.swapaxes(1, 2)
print(embeddings.shape)
embeddings = embeddings.reshape(embeddings.shape[0], embeddings.shape[1], -1)
print(embeddings.shape)
embeddings = np.mean(embeddings, axis=1) 
print(embeddings.shape)
np.mean(embeddings, axis = 0)

(32, 3, 38, 1024)
(32, 38, 3, 1024)
(32, 38, 3072)
(32, 3072)


array([ 0.0172218 , -0.01318733,  0.04479723, ...,  0.06010765,
       -0.18063472,  0.00204151], dtype=float32)
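
The shape manipulations above are easier to follow on a small array. The sketch below uses random numbers as a stand-in for ELMo activations (the sizes are arbitrary assumptions, not the real 3 x 1024 output) and compares two ways of pooling: concatenating the layers of each token, as in the cell above, and averaging over the layers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for ELMo activations: (sentences, layers, tokens, hidden size)
n_sentences, n_layers, n_tokens, hidden = 4, 3, 5, 6
activations = rng.normal(size=(n_sentences, n_layers, n_tokens, hidden))

# Put the token axis before the layer axis: (sentences, tokens, layers, hidden)
per_token = activations.swapaxes(1, 2)

# Variant 1: concatenate the layers of each token, then average over tokens
concatenated = per_token.reshape(n_sentences, n_tokens, -1).mean(axis=1)

# Variant 2: average over layers, then average over tokens
averaged = per_token.mean(axis=2).mean(axis=1)

print(concatenated.shape)  # (4, 18)
print(averaged.shape)      # (4, 6)

# Averaging the per-layer chunks of variant 1 reproduces variant 2
chunks = concatenated.reshape(n_sentences, n_layers, hidden)
print(np.allclose(chunks.mean(axis=1), averaged))  # True
```

With real ELMo output the first variant gives 3072 features per sentence and the second gives 1024, so the second is a compressed version of the first.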

### Task 4: Topic classification with ELMo

Check how well ELMo performs on the topic classification task

In [28]:
from string import punctuation

import numpy as np
from razdel import sentenize, tokenize


def split_into_token_lists(text):
    """Split a text into sentences and each sentence into lowercased tokens without punctuation."""
    return [
        [token.text.lower() for token in tokenize(sentence.text) if token.text not in punctuation]
        for sentence in sentenize(text)
    ]


def get_text_embedding_elmo_3072(model, text):
    # batch_to_embeddings expects List[List[str]]: a list of tokenized sentences
    texts = split_into_token_lists(text)

    embeddings = model.batch_to_embeddings(texts)[0].cpu().numpy()  # (sentences, 3, tokens, 1024)
    embeddings = embeddings.swapaxes(1, 2)                          # (sentences, tokens, 3, 1024)
    # Concatenate the three layers of every token: (sentences, tokens, 3072)
    embeddings = embeddings.reshape(embeddings.shape[0], embeddings.shape[1], -1)
    embeddings = np.mean(embeddings, axis=1)  # average over tokens (padded positions are included)
    return np.mean(embeddings, axis=0)        # average over sentences


def get_text_embedding_elmo_1024(model, text):
    # batch_to_embeddings expects List[List[str]]: a list of tokenized sentences
    texts = split_into_token_lists(text)

    # Only the first 32 sentences of the text are embedded; use `model`, not the global `elmo`
    embeddings = model.batch_to_embeddings(texts[:32])[0].cpu().numpy()
    embeddings = embeddings.swapaxes(1, 2)    # (sentences, tokens, 3, 1024)
    embeddings = np.mean(embeddings, axis=2)  # average over the three layers
    embeddings = np.mean(embeddings, axis=1)  # average over tokens (padded positions are included)

    return np.mean(embeddings, axis=0)        # average over sentences

In [29]:
target_labels = set(train_dataset["topic"].dropna().tolist())
target_labels -= {"69-я параллель", "Крым", "Культпросвет ", "Оружие", "Бизнес", "Путешествия"}
target_labels = list(target_labels)
print(target_labels)

pattern = r'(\b{}\b)'.format('|'.join(target_labels))

train_with_topics = train_dataset[train_dataset["topic"].str.contains(pattern, case=False, na=False)]
test_with_topics = test_dataset[test_dataset["topic"].str.contains(pattern, case=False, na=False)]

emb_size = 1024

y_train = train_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_train = np.zeros((train_with_topics.shape[0], emb_size))
for i, phrase in enumerate(train_with_topics["text"]):
    if i % 5000 == 0:
        print("step {}".format(i))
    X_train[i, :] = get_text_embedding_elmo_1024(elmo, phrase)
print(X_train.shape)
print(y_train.shape)

y_test = test_with_topics["topic"].apply(lambda x: target_labels.index(x)).to_numpy()
X_test = np.zeros((test_with_topics.shape[0], emb_size))
for i, phrase in enumerate(test_with_topics["text"]):
    if i % 5000 == 0:
        print("step {}".format(i))
    X_test[i, :] = get_text_embedding_elmo_1024(elmo, phrase)
print(X_test.shape)
print(y_test.shape)

In [50]:
# X_train, X_test, y_train, y_test were precomputed in Colab
import joblib
import pandas as pd

X_train = joblib.load("X_train1.pkl")
X_train = X_train.values
print("X_train shape: {}".format(X_train.shape))

X_test = pd.read_csv("X_test.csv") 
X_test = X_test.values
print("X_test shape: {}".format(X_test.shape))

y_train = pd.read_csv("y_train.csv") 
y_train = y_train['0']
print("y_train shape: {}".format(y_train.shape))

y_test = pd.read_csv("y_test.csv") 
y_test = y_test['0']
print("y_test shape: {}".format(y_test.shape))

X_train shape: (63356, 1024)
X_test shape: (30159, 1024)
y_train shape: (63356,)
y_test shape: (30159,)


#### Multilayer perceptron

In [46]:
%%time
# Features computed with get_text_embedding_elmo_1024
from sklearn.neural_network import MLPClassifier
from sklearn import metrics

clf = MLPClassifier()
clf.fit(X_train, y_train)

y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))



              precision    recall  f1-score   support

           0       0.87      0.86      0.86      1995
           1       0.78      0.77      0.78      2191
           2       0.93      0.97      0.95      3429
           3       0.78      0.79      0.79      4291
           4       0.80      0.73      0.77      2156
           5       0.86      0.74      0.80      1177
           6       0.71      0.73      0.72      2447
           7       0.55      0.74      0.63      1663
           8       0.82      0.78      0.80      3185
           9       0.81      0.81      0.81      1182
          10       0.90      0.76      0.82      2119
          11       0.73      0.74      0.73      4324

    accuracy                           0.79     30159
   macro avg       0.80      0.79      0.79     30159
weighted avg       0.80      0.79      0.79     30159

Wall time: 7min 17s


#### Logistic regression

In [49]:
# Features computed with get_text_embedding_elmo_1024
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

clf = LogisticRegression(penalty='l2',  solver='saga')
clf.fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))



              precision    recall  f1-score   support

           0       0.88      0.86      0.87      1995
           1       0.80      0.80      0.80      2191
           2       0.94      0.97      0.95      3429
           3       0.79      0.85      0.82      4291
           4       0.83      0.75      0.78      2156
           5       0.87      0.72      0.79      1177
           6       0.74      0.72      0.73      2447
           7       0.57      0.70      0.63      1663
           8       0.87      0.76      0.81      3185
           9       0.83      0.74      0.78      1182
          10       0.90      0.79      0.84      2119
          11       0.71      0.79      0.75      4324

    accuracy                           0.80     30159
   macro avg       0.81      0.79      0.80     30159
weighted avg       0.81      0.80      0.80     30159



#### SVM

In [48]:
# Features computed with get_text_embedding_elmo_1024
from sklearn.svm import LinearSVC
from sklearn import metrics
clf = LinearSVC().fit(X_train, y_train)
y_predicted = clf.predict(X_test)
print(metrics.classification_report(y_test, y_predicted))



              precision    recall  f1-score   support

           0       0.90      0.88      0.89      1995
           1       0.84      0.78      0.81      2191
           2       0.93      0.97      0.95      3429
           3       0.79      0.85      0.82      4291
           4       0.82      0.76      0.79      2156
           5       0.88      0.74      0.80      1177
           6       0.75      0.75      0.75      2447
           7       0.58      0.72      0.64      1663
           8       0.86      0.76      0.81      3185
           9       0.84      0.76      0.80      1182
          10       0.89      0.79      0.84      2119
          11       0.72      0.78      0.75      4324

    accuracy                           0.81     30159
   macro avg       0.82      0.80      0.80     30159
weighted avg       0.81      0.81      0.81     30159

