# Language Modeling

Изучим мощь рекуррентных сетей на примере решения задачи LM

# Language Modeling

Давайте введём понятия, которые нам сегодня пригодятся

Сами по себе языковые модели предназначены для того, чтобы как-то оценивать вероятность некоторой языковой конструкции. Например, мы можем свести к вероятностной модели вероятность последовательности токенов следующим образом:

$$p(x) = \prod_{i=1}^{N}p(x_i|x_1, ..., x_{i-1}) = p(x_1) * p(x_2 | x_1) * p(x_3 | x1, x2) * ...$$
$$\log p(x) = \sum_{i=1}^{N}\log p(x_i|x_1, ..., x_{i-1})$$

Если мы обучим нашу модель предсказывать эти вероятности, то сможем не только предсказывать вероятность последовательности, но и получать вероятности следующих токенов. Последнее - это не что иное, как генерация текста.

# Average ArXiv Enjoyer

На одной из прошлых пар мы научились правильно читать статьи, а теперь давайте научимся их писать

Писать статьи самостоятельно - это определённо прошлый век. Мы можем воспользоваться своими знаниями и автоматизировать этот процесс.

План следующий:

1. Обучим языковую модель на корпусе из статей с arXiv
2. Насемплируем несколько статей
3. Получим мировую известность и уважение в научном сообществе

In [1]:
import torch

from tqdm.auto import tqdm

## Подготовка данных

Для обучения возьмём [корпус статей с arXiv](https://www.kaggle.com/neelshah18/arxivdataset/):

In [None]:
!wget -O arXiv.zip "https://drive.google.com/uc?export=download&confirm=no_antivirus&id=1m78dRD6OIMP4oJL4VUujXV6MVavpb3A_"

!unzip arXiv.zip
!rm arXiv.zip

In [2]:
import pandas as pd

arXiv_data = pd.read_json("arxivData.json")
arXiv_data.head()

Unnamed: 0,author,day,id,link,month,summary,tag,title,year
0,"[{'name': 'Ahmed Osman'}, {'name': 'Wojciech S...",1,1802.00209v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",2,We propose an architecture for VQA which utili...,"[{'term': 'cs.AI', 'scheme': 'http://arxiv.org...",Dual Recurrent Attention Units for Visual Ques...,2018
1,"[{'name': 'Ji Young Lee'}, {'name': 'Franck De...",12,1603.03827v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",3,Recent approaches based on artificial neural n...,"[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...",Sequential Short-Text Classification with Recu...,2016
2,"[{'name': 'Iulian Vlad Serban'}, {'name': 'Tim...",2,1606.00776v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",6,We introduce the multiresolution recurrent neu...,"[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...",Multiresolution Recurrent Neural Networks: An ...,2016
3,"[{'name': 'Sebastian Ruder'}, {'name': 'Joachi...",23,1705.08142v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",5,Multi-task learning is motivated by the observ...,"[{'term': 'stat.ML', 'scheme': 'http://arxiv.o...",Learning what to share between loosely related...,2017
4,"[{'name': 'Iulian V. Serban'}, {'name': 'Chinn...",7,1709.02349v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",9,We present MILABOT: a deep reinforcement learn...,"[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...",A Deep Reinforcement Learning Chatbot,2017


Из всего датасета нам пригодится только столбец `summary`:

In [3]:
from random import choice

texts = arXiv_data["summary"].tolist()
print(texts[0])

We propose an architecture for VQA which utilizes recurrent layers to
generate visual and textual attention. The memory characteristic of the
proposed recurrent attention units offers a rich joint embedding of visual and
textual features and enables the model to reason relations between several
parts of the image and question. Our single model outperforms the first place
winner on the VQA 1.0 dataset, performs within margin to the current
state-of-the-art ensemble model. We also experiment with replacing attention
mechanisms in other state-of-the-art models with our implementation and show
increased accuracy. In both cases, our recurrent attention mechanism improves
performance in tasks requiring sequential or relational reasoning on the VQA
dataset.


Пока что для обучения наши тексты не годятся, сначала нужно провести токенизацию:

In [None]:
!pip install razdel

In [4]:
from razdel import tokenize
from collections import defaultdict

SOS_TOKEN = '[SOS]'
EOS_TOKEN = '[EOS]'
PAD_TOKEN = '[PAD]'

vocabulary = defaultdict(lambda: len(vocabulary)) # Просто костыльное устранение проблемы недетерминизированности set
_ = vocabulary[SOS_TOKEN]
_ = vocabulary[EOS_TOKEN]
_ = vocabulary[PAD_TOKEN]

tokenized_texts = list()

for text in tqdm(texts[:2000]):
    # Токенизируем текст
    tokenized_text = tokenize(text.lower())
    tokenized_text = [token.text for token in tokenized_text]
    tokenized_text = [SOS_TOKEN] + tokenized_text + [EOS_TOKEN]

    # Обновим словарь
    for token in tokenized_text:
         _ = vocabulary[token]

    # Добавим токенизированный текст в датасет
    tokenized_texts.append(tokenized_text)

vocab_size = len(vocabulary)

  0%|          | 0/2000 [00:00<?, ?it/s]

In [5]:
print(f"Vocabulary size is {vocab_size}")
print(f"Tokenized example: {tokenized_texts[0]}")

Vocabulary size is 14358
Tokenized example: ['[SOS]', 'we', 'propose', 'an', 'architecture', 'for', 'vqa', 'which', 'utilizes', 'recurrent', 'layers', 'to', 'generate', 'visual', 'and', 'textual', 'attention', '.', 'the', 'memory', 'characteristic', 'of', 'the', 'proposed', 'recurrent', 'attention', 'units', 'offers', 'a', 'rich', 'joint', 'embedding', 'of', 'visual', 'and', 'textual', 'features', 'and', 'enables', 'the', 'model', 'to', 'reason', 'relations', 'between', 'several', 'parts', 'of', 'the', 'image', 'and', 'question', '.', 'our', 'single', 'model', 'outperforms', 'the', 'first', 'place', 'winner', 'on', 'the', 'vqa', '1.0', 'dataset', ',', 'performs', 'within', 'margin', 'to', 'the', 'current', 'state-of-the-art', 'ensemble', 'model', '.', 'we', 'also', 'experiment', 'with', 'replacing', 'attention', 'mechanisms', 'in', 'other', 'state-of-the-art', 'models', 'with', 'our', 'implementation', 'and', 'show', 'increased', 'accuracy', '.', 'in', 'both', 'cases', ',', 'our', 'rec

In [6]:
id_to_token = list(vocabulary) # id_to_token[i] -> token_i
token_to_id = {token: id for id, token in enumerate(id_to_token)} # token_i -> i

id_to_token[0], token_to_id['model']

('[SOS]', 33)

Всё готово, осталось разделить выборку на обучающую и валидационную. Вторая нужна нам для промежуточного отслеживания качества:

In [7]:
from torch.utils.data import Dataset, DataLoader

class TextsForLM(Dataset):
    def __init__(self, texts):
        self.texts = list()

        for text in tqdm(texts):
            text_ids = [token_to_id[token] for token in text]

            self.texts.append(text_ids)

    def __getitem__(self, index):
        return self.texts[index]

    def __len__(self):
        return len(self.texts)

In [8]:
from sklearn.model_selection import train_test_split

train_texts, val_texts = train_test_split(tokenized_texts, test_size=0.1,
                                          random_state=42)

train_dataset = TextsForLM(train_texts)
val_dataset = TextsForLM(val_texts)

  0%|          | 0/1800 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

Наконец, для более эффективного обучения поделим наши датасеты на батчи.

Важно не забыть один нюанс: тексты могут быть разной длины, а тензор, который мы будем подавать на вход модели, должен иметь фиксированную размерность. В таком случае следует добавить токены `[PAD]` в конец коротких текстов. Но даже это можно сделать несколькими способами:

1. Выбрать большое число, которое точно будет длиннее любого текста и добивать длины текстов до него
2. Выбрать максимальную длину среди всех текстов в батче и добивать до неё
3. Разбить тексты на бакеты примерно одинакового размера, затем внутри каждого бакета добивать длины текстов до самого длинного текста в бакете (при этом элементы батча семплируются из одного и того же бакета)

Наиболее привлекательным выглядит последний вариант, но для простоты возьмём второй:

In [9]:
def collate_texts(batch):
    max_length = 0
    for text_ids in batch:
        max_length = max(max_length, len(text_ids))

    for i in range(len(batch)):
        batch[i] += [token_to_id[PAD_TOKEN]] * (max_length - len(batch[i]))

    return torch.LongTensor(batch)

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=2,
                              collate_fn=collate_texts)

val_dataloader = DataLoader(val_dataset, batch_size=4, shuffle=False, num_workers=2,
                            collate_fn=collate_texts)

In [10]:
token_to_id[PAD_TOKEN], id_to_token[12183]

(2, 'allenai')

## Реализация модели

Данные готовы, пора приступить к реализации модели

In [None]:
!pip install pytorch_lightning

In [11]:
from pytorch_lightning import LightningModule

In [12]:
from torch import nn, optim

from typing import Union

class LMModel(LightningModule):
    def __init__(self, vocab_size, emb_dim=128, rnn_hidden_dim=128,
                 rnn_num_layers=2, RNN: Union[nn.RNN, nn.LSTM, nn.GRU] = nn.LSTM):
        super().__init__()

        # Можно было бы сделать иначе:
        # self.shared_weight = nn.Parameter(data=init_tensor(vocab_size, rnn_hidden_dim))
        # assert(emb_dim == rnn_hidden_dim)
        # input @ self.shared_weight вместо self.embedding_layer
        # context_repr @ self.shared_weight.T вместо self.output_layer

        self.embedding_layer = nn.Embedding(vocab_size, emb_dim)

        self.rnn = RNN(input_size=emb_dim, hidden_size=rnn_hidden_dim,
                       batch_first=True, num_layers=rnn_num_layers)

        self.output_layer = nn.Linear(emb_dim, vocab_size)

    def forward(self, input_ids):
        # input_ids: [batch_size, seq_len]

        # embeddings: [batch_size, seq_len, emb_dim]
        embeddings = self.embedding_layer(input_ids)

        # output: [batch_size, seq_len, rnn_hidden_dim]
        #         RNN, GRU: state == h_n
        #         LSTM: state == (h_n, c_n)
        # h_n: [num_layers, batch_size, rnn_hidden_dim]
        # c_n: [num_layers, batch_size, rnn_hidden_dim]
        output, state = self.rnn(embeddings)

        # logits: [batch_size, seq_len, vocab_size]
        logits = self.output_layer(output)

        return logits, state

    def training_step(self, batch, _):
        # input:  SOS I      love   cats
        # target: I   love   cats   EOS

        # pred:    [0.6, 0.01, 0.38, 0.01, 0.01]
        # target1: [0.    0.    1.   0.   0. ]
        # target2: [1.    0.    0.   0.   0. ]

        logits, state = self.forward(batch)

        batch_size, seq_len, vocab_size = logits.shape

        pred = logits[:, :-1, :].reshape(batch_size * (seq_len - 1), vocab_size)
        target = batch[:, 1:].reshape(batch_size * (seq_len - 1))

        loss_fn = nn.CrossEntropyLoss(ignore_index=token_to_id[PAD_TOKEN])

        loss = loss_fn(pred, target)

        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=3e-3)

        return optimizer

In [None]:
rnn_model = LMModel(vocab_size, RNN=nn.RNN)
lstm_model = LMModel(vocab_size, RNN=nn.LSTM)
gru_model = LMModel(vocab_size, RNN=nn.GRU)

In [None]:
sample = next(iter(train_dataloader))

print(f"Sample shape is {sample.shape}")
print(f"Sample: {sample}")

Sample shape is torch.Size([4, 227])
Sample: tensor([[    0,   119,   402,   802,    27,   616,   803,   679,   444,    13,
           804,   437,   228,    19,   805,   170,    20,   209,    15,   333,
            50,     3,   759,   806,   807,   808,   809,   810,   245,   504,
           544,    13,   155,     5,   811,    15,   499,    19,   670,    50,
             3,   492,   804,   808,   812,    47,   813,    13,   814,   815,
            50,     9,     3,   492,   140,   800,   816,   678,   817,   800,
            13,   528,    27,   122,   133,    13,   818,   819,   247,    27,
            42,   497,    19,   407,    20,   820,   821,   123,   822,   823,
            27,   824,   703,    62,    20,   468,    23,    20,   209,    15,
           333,    50,     3,   500,   123,   119,   825,   444,    13,   804,
           228,   260,   504,   212,   811,   358,   826,   827,   800,   828,
           800,    74,   829,    62,    20,   830,    19,   831,    50,    41,
       

## Обучение моделей

Без лишних слов приступим к обучению!

In [13]:
from pytorch_lightning import Trainer

max_epochs = 10

In [None]:
torch.cuda.empty_cache()

In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(rnn_model, train_dataloader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | RNN       | 66.0 K
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
3.8 M     Trainable params
0         Non-trainable params
3.8 M     Total params
15.024    Total estimated model params size (MB)
/home/semyon/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=10` reached.


In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(lstm_model, train_dataloader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | LSTM      | 264 K 
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
4.0 M     Trainable params
0         Non-trainable params
4.0 M     Total params
15.817    Total estimated model params size (MB)
/home/semyon/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=15` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=10` reached.


In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(gru_model, train_dataloader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs


LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | GRU       | 198 K 
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
3.9 M     Trainable params
0         Non-trainable params
3.9 M     Total params
15.553    Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=10` reached.


In [None]:
# torch.save(rnn_model.state_dict(), 'model/rnn.pt')
# torch.save(lstm_model.state_dict(), 'model/lstm.pt')
# torch.save(gru_model.state_dict(), 'model/gru.pt')

In [None]:
rnn_model.load_state_dict(torch.load('model/rnn.pt'))
lstm_model.load_state_dict(torch.load('model/lstm.pt'))
gru_model.load_state_dict(torch.load('model/gru.pt'))

<All keys matched successfully>

## Оценка качества

Нам определённо хочется понять, что мы только что наобучали

Прежде всего в нашем распоряжении есть метод пристального взгляда: нагенерим кучу текстов и будем оценивать их с точки зрения coherence и diversity. В идеале хотим получить разнообразные тексты, имеющие смысл. Однако это всё ещё не количественная характеристика, а просто наши субъективные наблюдения.

Посмотрим на используемый нами лосс:

$$CrossEntropyLoss(y_1, \dots, y_n) = - \sum_{t=1}^{n} \log p(y_t | y_{<t})$$

Введём следующую метрику:

$$Perplexity(y_1, \dots, y_n) = 2^{\frac{1}{n} CrossEntropyLoss(y_1, \dots, y_n)}$$

Наилучшим значением перплексии будет единица. Такой случай будет означать, что наша модель идеально выдаёт распределение токенов (ставит вероятность 1 нужному токену, соответственно лосс равен нулю). Конечно, на деле такого будет очень трудно добиться.

Наихудшим значением перплексии будет $V = vocab\_size$:

$$Perplexity(y_1, \dots, y_n) = 2^{\frac{1}{n} CrossEntropyLoss(y_1, \dots, y_n)} = 2^{- \frac{1}{n} \sum_{t=1}^{n} \log p(y_t | y_{<t})} = 2^{-\frac{1}{n} \cdot n \cdot \log \frac{1}{V}} = 2^{\log V} = V$$

Модель считает, что все токены равновероятны, а это, конечно же, никогда не так.

In [14]:
def compute_perplexity(model, batch):
    values = []
    for token_ids in batch:
        logits, _ = model.forward(token_ids.unsqueeze(0))

        batch_size, seq_len, vocab_size = logits.shape

        pred = logits[0, :-1, :]
        target = token_ids[1:]

        loss_fn = nn.CrossEntropyLoss(ignore_index=token_to_id[PAD_TOKEN])

        loss = loss_fn(pred, target)

        values.append(2 ** loss.item())

    return torch.mean(torch.tensor(values))

In [None]:
print(f"Вспомним размер словаря: {vocab_size}")

Вспомним размер словаря: 14358


In [None]:
untrained_rnn_model = LMModel(vocab_size, RNN=nn.RNN)
untrained_lstm_model = LMModel(vocab_size, RNN=nn.LSTM)
untrained_gru_model = LMModel(vocab_size, RNN=nn.GRU)

In [None]:
for name, lm_model, untrained_lm_model in zip(["RNN", "LSTM", "GRU"],
                                              [rnn_model, lstm_model, gru_model],
                                              [untrained_rnn_model, untrained_lstm_model, untrained_gru_model]):
    untrained_perplexities = []
    trained_perplexities = []

    for batch in tqdm(val_dataloader):
        untrained_perplexities.append(compute_perplexity(untrained_lm_model, batch))
        trained_perplexities.append(compute_perplexity(lm_model, batch))

    print(f"Untrained {name} Perplexity: {torch.mean(torch.tensor(untrained_perplexities))}")
    print(f"Trained {name} Perplexity: {torch.mean(torch.tensor(trained_perplexities))}")

  0%|          | 0/50 [00:00<?, ?it/s]

Untrained RNN Perplexity: 770.9844360351562
Trained RNN Perplexity: 57.718299865722656


  0%|          | 0/50 [00:00<?, ?it/s]

Untrained LSTM Perplexity: 761.0379028320312
Trained LSTM Perplexity: 60.438514709472656


  0%|          | 0/50 [00:00<?, ?it/s]

Untrained GRU Perplexity: 764.9729614257812
Trained GRU Perplexity: 59.83659744262695


## Генерация

Рассмотрим несколько подходов к семплированию

In [15]:
def choose_argmax(logits):
    """Выбирает наиболее вероятный токен"""

    next_token_id = logits[0, -1].argmax(dim=-1)

    return next_token_id

def sample_from_distribution(logits):
    """Строит распределение по логитам и семплирует из него"""

    dist = torch.distributions.categorical.Categorical(logits=logits[0, -1])
    next_token_id = dist.sample().item()

    return next_token_id

def sample_top_k_from_distribution(logits, k=40):
    """Выбирает k наиболее вероятных токенов и семплирует из них"""

    dist = torch.distributions.categorical.Categorical(logits=logits[0, -1])
    top_k_values, top_k_indices = torch.topk(dist.probs, k)
    top_k_probs = top_k_values / torch.sum(top_k_values)
    next_token = torch.multinomial(top_k_probs, 1).item()

    return top_k_indices[next_token].item()

def nucleus_sampling(logits, p=0.95):
    """Выбирает минимальный набор токенов, чья суммарная вероятность не меньше p,
       а затем семплирует из этого набораъ

       Подробнее: https://openreview.net/pdf?id=rygGQyrFvH
    """

    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(torch.nn.functional.softmax(sorted_logits, dim=-1), dim=-1)

    nucleus_indices = torch.where(cumulative_probs <= p, sorted_indices, torch.tensor(0, dtype=torch.long))
    nucleus_indices = torch.max(nucleus_indices, dim=-1).indices

    sampled_index = torch.multinomial(torch.nn.functional.softmax(logits[nucleus_indices], dim=-1), 1).item()

    return nucleus_indices[sampled_index].item()

def generate_sample(model, beginning, max_length,
                    sampling_strategy=sample_from_distribution,
                    temperature=1.0):
    if beginning is None:
        tokens = [token_to_id[SOS_TOKEN]]
    else:
        tokens = [token_to_id[token.text] for token in tokenize(beginning.lower())]
        tokens = [token_to_id[SOS_TOKEN]] + tokens

    for _ in range(max_length):
        tokens_tensor = torch.LongTensor(tokens).unsqueeze(0)

        logits, _ = model(tokens_tensor)

        logits = logits / temperature

        next_token_id = sampling_strategy(logits)

        if next_token_id == token_to_id[EOS_TOKEN]:
            break

        tokens.append(next_token_id)

    generated_sample = ' '.join([id_to_token[id] for id in tokens[1:]])

    return generated_sample

In [None]:
generate_sample(untrained_rnn_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose resemble prototyping dropout co-adaptation dearth requirement arrive corresponding encoder-decoders hybrids descents helping mt 20.5-hour 83.1 entire acoustic divide arguably non-existing cost-effective quest hadamard maliciously exception mechanics distillation lns chance swing-up learning-style whitening dyvedeep feedbacks prompts concatenated likelihood-ratio rlstms non-practitioners 0.68 term black-boxes net nearly ps upcoming commonly-used confused task-pairs substructures'

In [None]:
generate_sample(rnn_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose a new approach to approximation that speeds weakly ranking access and population of vanishing gradients , significantly significantly better known from large number of synthetic error items operations and contingent the hub 5 switchboard-2000 benchmark platform .'

In [None]:
generate_sample(lstm_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose navigation tools that effectively diminishes the essential channels of gans of experts , like protein product quantization , as the output generator ( wsj ) , and find a novel model of story words during training . while sbpcg , we mix the parameters of a structured decoding coordinate descent'

In [None]:
generate_sample(gru_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose a novel framework for neural-network-based dialogue intuition . first , we show that bilingual representations can develop novel techniques as feature subject of words generated operations . we show that both sampling rules trained via adding errors to generate examples or embeddings of features , and show that alignment outperform'

## Дополнительные задания

### [1 балл] I Need Your Time

Попробуйте обучить RNN и LSTM на большем числе эпох (не менее 200). Проследите, выполнилась ли гипотеза о том, что перплексия второй модели станет меньше, чем у первой. Попробуйте сгенерировать текст этими моделями. Сделайте выводы.

In [None]:
max_epochs = 200
torch.cuda.empty_cache()

longer_trained_rnn_model = LMModel(vocab_size, RNN=nn.RNN)
longer_trained_lstm_model = LMModel(vocab_size, RNN=nn.LSTM)

In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(longer_trained_rnn_model, train_dataloader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | RNN       | 66.0 K
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
3.8 M     Trainable params
0         Non-trainable params
3.8 M     Total params
15.024    Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=200` reached.


In [None]:
# torch.save(longer_trained_rnn_model.state_dict(), 'model/longer_rnn.pt')

In [None]:
# longer_trained_rnn_model.load_state_dict(torch.load('model/longer_rnn.pt'))

<All keys matched successfully>

In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(longer_trained_lstm_model, train_dataloader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | LSTM      | 264 K 
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
4.0 M     Trainable params
0         Non-trainable params
4.0 M     Total params
15.817    Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=200` reached.


In [None]:
# torch.save(longer_trained_lstm_model.state_dict(), 'model/longer_lstm.pt')

In [None]:
# longer_trained_lstm_model.load_state_dict(torch.load('model/longer_lstm.pt'))

<All keys matched successfully>

In [None]:
for name, lm_model in zip(["RNN", "LSTM"], [longer_trained_rnn_model, longer_trained_lstm_model]):
    trained_perplexities = []

    for batch in tqdm(val_dataloader):
        trained_perplexities.append(compute_perplexity(lm_model, batch))

    print(f"Trained {name} Perplexity: {torch.mean(torch.tensor(trained_perplexities))}")

  0%|          | 0/50 [00:00<?, ?it/s]

Trained RNN Perplexity: 333.8534240722656


  0%|          | 0/50 [00:00<?, ?it/s]

Trained LSTM Perplexity: 171157.875


In [None]:
generate_sample(longer_trained_rnn_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose an adversarial stacked method to infer the optimum as patches within text ( e . g . , api calls ) , shapes for identifying statements and relations . in this work , we take latent semantic compression problems such as chromatic aberration of the classification computation , allowing during'

In [None]:
generate_sample(longer_trained_lstm_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose a tree-based language to execute the output representation with an upper reduction of recurrent units ( lru ) , a variant of the gru along the memory of lstm processes , with no discussion of how the conventional encoding phase according to coupled dnns . generative networks come from prior'

### [1 балл] Prettify

Сделайте сгенерированные тексты более дружелюбными: уберите лишние пробелы и поставьте заглавные буквы там, где нужно. Сгенерируйте текст из 200 токенов и продемонстрируйте результат.

In [None]:
def prettify(text):
    text = text.replace(" ,", ",").replace(" ( ", "("). replace(" ) ", ") ")

    sentences = text.split(" . ")
    prettified_text = ". ".join(sentence.capitalize() for sentence in sentences)

    return prettified_text

In [None]:
prettify(generate_sample(longer_trained_rnn_model, "We propose", 200, sampling_strategy=sample_from_distribution))

'We propose structured weights on the geometry dataset, without relying on 30 billion word-level weights networks. The most versatile version of a variety of deep learning image leads to performing aspects of local learning data using a two-branch neural network that is empirically investigated. Accuracy on problems provide simple reference sets from the development of typical hardware neural networks. We show that lambada exemplifies correctly initializing the unknown likelihood. That they are well read across different levels, learning capable of generating reasoning with uncertain iteration such as domains, such models and helps us to only define of the learning review problem, the proposed bound is generic and in this paper pattern of the learning process, as well as the sgd statistical deep cnns, rather researchers none like mobile devices, such as the voice time, which are assumed to be weaker than standard nested networks. The main objective is approached with the results of dee

### [3 балла] The Senior of the Models: The Embeddings of Power

Обучая эмбеддинги с нуля на маленьком датасете, мы усложняем жизнь нашей модели. Попробуйте вместо слоя `nn.Embedding` добавить предобученные эмбеддинги. Посмотрите, как изменилось число параметров и перплексия модели в сравнении с исходной моделью. Сделайте выводы.

In [17]:
from torch import nn, optim

from typing import Union

class LMModel(LightningModule):
    def __init__(self, vocab_size, emb_dim=128, rnn_hidden_dim=128, rnn_num_layers=2,
                 RNN: Union[nn.RNN, nn.LSTM, nn.GRU] = nn.LSTM, pretrained_weights=None):
        super().__init__()

        if pretrained_weights is not None:
            self.embedding_layer = torch.nn.Embedding.from_pretrained(torch.FloatTensor(pretrained_weights.vectors))
        else:
            self.embedding_layer = nn.Embedding(vocab_size, emb_dim)

        self.rnn = RNN(input_size=emb_dim, hidden_size=rnn_hidden_dim,
                       batch_first=True, num_layers=rnn_num_layers)

        self.output_layer = nn.Linear(rnn_hidden_dim, vocab_size)

    def forward(self, input_ids):
        # input_ids: [batch_size, seq_len]

        # embeddings: [batch_size, seq_len, emb_dim]
        embeddings = self.embedding_layer(input_ids)

        # output: [batch_size, seq_len, rnn_hidden_dim]
        #         RNN, GRU: state == h_n
        #         LSTM: state == (h_n, c_n)
        # h_n: [num_layers, batch_size, rnn_hidden_dim]
        # c_n: [num_layers, batch_size, rnn_hidden_dim]
        output, state = self.rnn(embeddings)

        # logits: [batch_size, seq_len, vocab_size]
        logits = self.output_layer(output)

        return logits, state

    def training_step(self, batch, _):
        # input:  SOS I      love   cats
        # target: I   love   cats   EOS

        # pred:    [0.6, 0.01, 0.38, 0.01, 0.01]
        # target1: [0.    0.    1.   0.   0. ]
        # target2: [1.    0.    0.   0.   0. ]

        logits, state = self.forward(batch)

        batch_size, seq_len, vocab_size = logits.shape

        pred = logits[:, :-1, :].reshape(batch_size * (seq_len - 1), vocab_size)
        target = batch[:, 1:].reshape(batch_size * (seq_len - 1))

        loss_fn = nn.CrossEntropyLoss(ignore_index=token_to_id[PAD_TOKEN])

        loss = loss_fn(pred, target)

        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=3e-3)

        return optimizer

In [27]:
import gensim.downloader

embed = gensim.downloader.load('glove-wiki-gigaword-100')
# glove_input_file = 'embedding/glove.twitter.27B.100d.txt'
# word2vec_output_file = 'embedding/glove.twitter.27B.100d.word2vec.txt'
# gensim.utils.glove2word2vec(glove_input_file, word2vec_output_file)
# filename = 'glove.6B.100d.txt.word2vec'
# embed = gensim.models.KeyedVectors.load_word2vec_format(glove_input_file, binary=False)



In [28]:
pretrained_lstm = LMModel(vocab_size, emb_dim=100, RNN=nn.LSTM, pretrained_weights=embed)

In [29]:
from pytorch_lightning import Trainer

max_epochs = 20
torch.cuda.empty_cache()

In [30]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(pretrained_lstm, train_dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 40.0 M
1 | rnn             | LSTM      | 249 K 
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
2.1 M     Trainable params
40.0 M    Non-trainable params
42.1 M    Total params
168.408   Total estimated model params size (MB)
  self.pid = os.fork()


Training: |          | 0/? [00:00<?, ?it/s]

  self.pid = os.fork()
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.


In [31]:
trained_perplexities = []

for batch in tqdm(val_dataloader):
    trained_perplexities.append(compute_perplexity(pretrained_lstm, batch))

print(f"Trained LSTM Perplexity: {torch.mean(torch.tensor(trained_perplexities))}")

  0%|          | 0/50 [00:00<?, ?it/s]

Trained LSTM Perplexity: 72.00434875488281


In [33]:
generate_sample(pretrained_lstm, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose a new recurrent architecture for semantically images can obtain the understanding of the k-means and a relations capable a neural network . multi-layer architectures make be highly subjective cannot size by up to the or underlying sequences . we demonstrate spectral scheme on the nonparametric sequence of previously data .'

Обучаемых параметров стало сильно меньше

### [5 баллов] CharRNN Strikes Back

Реализуйте модель, которая вместо токенов будет работать с символами. Как следует обучите её (не менее часа), посмотрите на перплексию и на предсказания, сравните с другими моделями. Сделайте выводы.

In [5]:
from collections import defaultdict

SOS_TOKEN = '[SOS]'
EOS_TOKEN = '[EOS]'
PAD_TOKEN = '[PAD]'

char_vocabulary = defaultdict(lambda: len(char_vocabulary)) # Просто костыльное устранение проблемы недетерминизированности set
_ = char_vocabulary[SOS_TOKEN]
_ = char_vocabulary[EOS_TOKEN]
_ = char_vocabulary[PAD_TOKEN]

char_tokenized_texts = list()

for text in tqdm(texts[:2000]):
    # Токенизируем текст
    char_tokenized_text = [token.lower() for token in text]
    char_tokenized_text = [SOS_TOKEN] + char_tokenized_text + [EOS_TOKEN]

    # Обновим словарь
    for token in char_tokenized_text:
         _ = char_vocabulary[token]

    # Добавим токенизированный текст в датасет
    char_tokenized_texts.append(char_tokenized_text)

char_vocab_size = len(char_vocabulary)

  0%|          | 0/2000 [00:00<?, ?it/s]

In [6]:
print(f"Vocabulary size is {char_vocab_size}")
print(f"Tokenized example: {char_tokenized_texts[0]}")

Vocabulary size is 72
Tokenized example: ['[SOS]', 'w', 'e', ' ', 'p', 'r', 'o', 'p', 'o', 's', 'e', ' ', 'a', 'n', ' ', 'a', 'r', 'c', 'h', 'i', 't', 'e', 'c', 't', 'u', 'r', 'e', ' ', 'f', 'o', 'r', ' ', 'v', 'q', 'a', ' ', 'w', 'h', 'i', 'c', 'h', ' ', 'u', 't', 'i', 'l', 'i', 'z', 'e', 's', ' ', 'r', 'e', 'c', 'u', 'r', 'r', 'e', 'n', 't', ' ', 'l', 'a', 'y', 'e', 'r', 's', ' ', 't', 'o', '\n', 'g', 'e', 'n', 'e', 'r', 'a', 't', 'e', ' ', 'v', 'i', 's', 'u', 'a', 'l', ' ', 'a', 'n', 'd', ' ', 't', 'e', 'x', 't', 'u', 'a', 'l', ' ', 'a', 't', 't', 'e', 'n', 't', 'i', 'o', 'n', '.', ' ', 't', 'h', 'e', ' ', 'm', 'e', 'm', 'o', 'r', 'y', ' ', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 'i', 's', 't', 'i', 'c', ' ', 'o', 'f', ' ', 't', 'h', 'e', '\n', 'p', 'r', 'o', 'p', 'o', 's', 'e', 'd', ' ', 'r', 'e', 'c', 'u', 'r', 'r', 'e', 'n', 't', ' ', 'a', 't', 't', 'e', 'n', 't', 'i', 'o', 'n', ' ', 'u', 'n', 'i', 't', 's', ' ', 'o', 'f', 'f', 'e', 'r', 's', ' ', 'a', ' ', 'r', 'i', 'c', 'h

In [7]:
id_to_token = list(char_vocabulary) # id_to_token[i] -> token_i
token_to_id = {token: id for id, token in enumerate(id_to_token)} # token_i -> i

id_to_token[0], token_to_id['i']

('[SOS]', 14)

In [8]:
from sklearn.model_selection import train_test_split

train_texts, val_texts = train_test_split(char_tokenized_texts, test_size=0.1,
                                          random_state=42)

train_dataset = TextsForLM(train_texts)
val_dataset = TextsForLM(val_texts)

  0%|          | 0/1800 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

In [9]:
def collate_texts(batch):
    max_length = 0
    for text_ids in batch:
        max_length = max(max_length, len(text_ids))

    for i in range(len(batch)):
        batch[i] += [token_to_id[PAD_TOKEN]] * (max_length - len(batch[i]))

    return torch.LongTensor(batch)

train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=2,
                              collate_fn=collate_texts)

val_dataloader = DataLoader(val_dataset, batch_size=4, shuffle=False, num_workers=2,
                            collate_fn=collate_texts)

In [13]:
from torch import nn, optim

from typing import Union

class LMModel(LightningModule):
    def __init__(self, vocab_size, emb_dim=128, rnn_hidden_dim=128,
                 rnn_num_layers=2, RNN: Union[nn.RNN, nn.LSTM, nn.GRU] = nn.LSTM):
        super().__init__()

        self.embedding_layer = nn.Embedding(vocab_size, emb_dim)
        self.rnn = RNN(input_size=emb_dim, hidden_size=rnn_hidden_dim,
                       batch_first=True, num_layers=rnn_num_layers)
        self.output_layer = nn.Linear(emb_dim, vocab_size)

    def forward(self, input_ids):
        embeddings = self.embedding_layer(input_ids)

        output, state = self.rnn(embeddings)

        logits = self.output_layer(output)

        return logits, state

    def training_step(self, batch, _):
        logits, state = self.forward(batch)

        batch_size, seq_len, vocab_size = logits.shape

        pred = logits[:, :-1, :].reshape(batch_size * (seq_len - 1), vocab_size)
        target = batch[:, 1:].reshape(batch_size * (seq_len - 1))

        loss_fn = nn.CrossEntropyLoss(ignore_index=token_to_id[PAD_TOKEN])

        loss = loss_fn(pred, target)

        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=3e-3)

        return optimizer

In [39]:
char_lstm = LMModel(char_vocab_size, emb_dim=256, rnn_hidden_dim=256, RNN=nn.LSTM)

In [40]:
from pytorch_lightning import Trainer

max_epochs = 50
torch.cuda.empty_cache()

In [41]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(char_lstm, train_dataloader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 18.4 K
1 | rnn             | LSTM      | 1.1 M 
2 | output_layer    | Linear    | 18.5 K
----------------------------------------------
1.1 M     Trainable params
0         Non-trainable params
1.1 M     Total params
4.358     Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x79f8ff50c4c0>
Traceback (most recent call last):
  File "/home/semyon/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1479, in __del__
    self._shutdown_workers()
  File "/home/semyon/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1462, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.10/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x79f8ff50c4c0>
Traceback (most recent call last):
  File "/home/semyon/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1479, in __del__
    self._shutdown_workers()
  File "/home/semyon/.local/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1462, in _shutdown_workers
  

In [42]:
# torch.save(char_lstm.state_dict(), 'model/char_lstm.pt')

In [20]:
# char_lstm.load_state_dict(torch.load('model/char_lstm.pt'))

<All keys matched successfully>

In [43]:
trained_perplexities = []

for batch in tqdm(val_dataloader):
    trained_perplexities.append(compute_perplexity(char_lstm, batch))

print(f"Trained char LSTM Perplexity: {torch.mean(torch.tensor(trained_perplexities))}")

  0%|          | 0/50 [00:00<?, ?it/s]

Trained char LSTM Perplexity: 2.006401538848877


In [44]:
def generate_sample(model, beginning, max_length,
                    sampling_strategy=sample_from_distribution,
                    temperature=1.0):
    if beginning is None:
        tokens = [token_to_id[SOS_TOKEN]]
    else:
        tokens = [token_to_id[token] for token in beginning.lower()]
        tokens = [token_to_id[SOS_TOKEN]] + tokens

    for _ in range(max_length):
        tokens_tensor = torch.LongTensor(tokens).unsqueeze(0)

        logits, _ = model(tokens_tensor)

        logits = logits / temperature

        next_token_id = sampling_strategy(logits)

        if next_token_id == token_to_id[EOS_TOKEN]:
            break

        tokens.append(next_token_id)

    generated_sample = ' '.join([id_to_token[id] for id in tokens[1:]])

    return generated_sample

In [46]:
generate_sample(char_lstm, beginning='We propose', max_length=100, sampling_strategy=sample_from_distribution)

'w e   p r o p o s e   a n   e x t e n s i o n   o f   u s e r s \n w i t h   t h e   g o a l   i n   c o m p u t a t i o n a l   c o m p u t a t i o n ,   i m a g e s   i n   d e e p   l e a r n i n g .   i t   i s   m o'

В целом слова он генерирует более менее связно, но текст получается бредовый.

### [6 баллов] Стратегии сэмплирования
https://huggingface.co/blog/how-to-generate

Реализуйте температурный софтмакс (1 балл).

Реализуйте sample_top_k_from_distribution, nucleus_sampling (по 1 баллу).

Реализуйте beam search (2 балла) и придумайте, как его можно совместить с сэмплированием (Beam-search multinomial sampling). (+1 балл)

Сравните различные подходы и комбинации подходов (сделайте таблицу результатов), сделайте выводы (-2 балла, если отсутствует).

In [90]:
def beam_search(model, tokens, beam_size=10, max_length=3, temperature=1.0):
    suffixes = [[]]
    probs = [1]

    tokens_tensor = torch.LongTensor(tokens).unsqueeze(0)

    for _ in range(max_length):
        next_suffixes = []
        next_probs = []
        for i in range(len(suffixes)):
            if len(suffixes[i]) > 0 and suffixes[i][-1] == token_to_id[EOS_TOKEN]:
                continue

            logits, _ = model(torch.cat((tokens_tensor, torch.LongTensor(suffixes[i]).unsqueeze(0)), dim=-1))

            logits = logits / temperature

            top_k_probs, idx = torch.nn.functional.softmax(logits[0, -1], dim=-1).topk(beam_size, dim=-1)

            for j in range(beam_size):
                next_suffixes.append(torch.cat((torch.LongTensor(suffixes[i]), torch.LongTensor(idx[j]).reshape(-1,)), dim=0).tolist())
                next_probs.append((top_k_probs[j] * probs[i]).tolist())
        
        suffixes = next_suffixes
        probs = next_probs
    
    return suffixes[torch.Tensor(probs).argmax()]

In [91]:
def generate_sample(model, beginning, max_length,
                    sampling_strategy=sample_from_distribution,
                    temperature=1.0):
    if beginning is None:
        tokens = [token_to_id[SOS_TOKEN]]
    else:
        tokens = [token_to_id[token.text] for token in tokenize(beginning.lower())]
        tokens = [token_to_id[SOS_TOKEN]] + tokens

    for _ in range(max_length):
        if sampling_strategy == beam_search:
            tokens += beam_search(model, tokens, temperature=temperature)

            if tokens[-1] == token_to_id[EOS_TOKEN]:
                tokens.pop()
                break
        else:
            tokens_tensor = torch.LongTensor(tokens).unsqueeze(0)

            logits, _ = model(tokens_tensor)

            logits = logits / temperature

            next_token_id = sampling_strategy(logits)

            if next_token_id == token_to_id[EOS_TOKEN]:
                break

            tokens.append(next_token_id)

    generated_sample = ' '.join([id_to_token[id] for id in tokens[1:]])

    return generated_sample

In [18]:
lstm_model = LMModel(vocab_size, RNN=nn.LSTM)
lstm_model.load_state_dict(torch.load('model/longer_lstm.pt'))

<All keys matched successfully>

In [94]:
generate_sample(lstm_model, "We propose", 50, sampling_strategy=beam_search)

'we propose a novel method for deep learning , where the representations are selected compared to the recently proposed framework to computer vision . in this paper , we propose a novel factorization-based model of rpca , which we derive a large number of time-frames .'

# Библиография

Если есть желание ещё лучше разобраться в теме, можно изучить следующие материалы:

1. Блогпост Андрея Карпати про CharRNN: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

2. Глава в учебнике Лены Войты: https://lena-voita.github.io/nlp_course/language_modeling.html

3. Статья про генерацию текста в произвольной последовательности: https://arxiv.org/pdf/2102.11008.pdf

4. Nucleus Sampling: https://openreview.net/pdf?id=rygGQyrFvH