# Language Modeling

Let's explore the power of recurrent networks by solving a language modeling (LM) task.

Let's introduce the concepts we will need today.

Language models are designed to estimate the probability of a language construct. For example, the probability of a token sequence $x = (x_1, \dots, x_N)$ can be factorized with the chain rule as follows:

$$p(x) = \prod_{i=1}^{N} p(x_i \mid x_1, \dots, x_{i-1}) = p(x_1) \cdot p(x_2 \mid x_1) \cdot p(x_3 \mid x_1, x_2) \cdot \ldots$$
$$\log p(x) = \sum_{i=1}^{N} \log p(x_i \mid x_1, \dots, x_{i-1})$$

If we train a model to predict these conditional probabilities, we can not only score a whole sequence but also obtain the distribution over the next token. Sampling from that distribution step by step is exactly text generation.
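To make the chain rule concrete, here is a minimal sketch with a toy bigram model. The probability table is made up for illustration, and a bigram model conditions only on the previous token, whereas the recurrent models in this notebook condition on the whole prefix.

```python
import math

# Hypothetical toy bigram model: p(next | previous). The numbers are invented
# for illustration only.
bigram_probs = {
    ("[SOS]", "we"): 0.5,
    ("we", "propose"): 0.4,
    ("propose", "[EOS]"): 0.1,
}

def sequence_log_prob(tokens):
    """Sum of log p(x_i | x_{i-1}) over the sequence (chain rule in log space)."""
    return sum(math.log(bigram_probs[(prev, cur)])
               for prev, cur in zip(tokens, tokens[1:]))

log_p = sequence_log_prob(["[SOS]", "we", "propose", "[EOS]"])
p = math.exp(log_p)  # product of the three conditional probabilities
```

Working in log space turns the product into a sum, which avoids numerical underflow on long sequences.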

# Average ArXiv Enjoyer

In one of the previous classes we learned how to read papers properly; now let's learn how to write them.

Writing papers by hand is definitely last century. We can use what we know to automate the process.

The plan is as follows:

1. Train a language model on a corpus of arXiv papers
2. Sample a few papers
3. Gain worldwide fame and the respect of the scientific community

In [None]:
import torch

from tqdm.auto import tqdm

## Data preparation

For training we will use a [corpus of arXiv papers](https://www.kaggle.com/neelshah18/arxivdataset/):

In [None]:
!wget -O arXiv.zip "https://drive.google.com/uc?export=download&confirm=no_antivirus&id=1m78dRD6OIMP4oJL4VUujXV6MVavpb3A_"

!unzip arXiv.zip
!rm arXiv.zip

--2024-04-28 19:18:04--  https://drive.google.com/uc?export=download&confirm=no_antivirus&id=1m78dRD6OIMP4oJL4VUujXV6MVavpb3A_
Resolving drive.google.com (drive.google.com)... 142.250.125.138, 142.250.125.102, 142.250.125.101, ...
Connecting to drive.google.com (drive.google.com)|142.250.125.138|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1m78dRD6OIMP4oJL4VUujXV6MVavpb3A_&export=download [following]
--2024-04-28 19:18:04--  https://drive.usercontent.google.com/download?id=1m78dRD6OIMP4oJL4VUujXV6MVavpb3A_&export=download
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 74.125.69.132, 2607:f8b0:4001:c08::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|74.125.69.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19218382 (18M) [application/octet-stream]
Saving to: ‘arXiv.zip’


2024-04-28 19:18:08 (118 MB/s) - ‘arXiv.zip’ 

In [None]:
import pandas as pd

arXiv_data = pd.read_json("arxivData.json")
arXiv_data.head()

Unnamed: 0,author,day,id,link,month,summary,tag,title,year
0,"[{'name': 'Ahmed Osman'}, {'name': 'Wojciech S...",1,1802.00209v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",2,We propose an architecture for VQA which utili...,"[{'term': 'cs.AI', 'scheme': 'http://arxiv.org...",Dual Recurrent Attention Units for Visual Ques...,2018
1,"[{'name': 'Ji Young Lee'}, {'name': 'Franck De...",12,1603.03827v1,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",3,Recent approaches based on artificial neural n...,"[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...",Sequential Short-Text Classification with Recu...,2016
2,"[{'name': 'Iulian Vlad Serban'}, {'name': 'Tim...",2,1606.00776v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",6,We introduce the multiresolution recurrent neu...,"[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...",Multiresolution Recurrent Neural Networks: An ...,2016
3,"[{'name': 'Sebastian Ruder'}, {'name': 'Joachi...",23,1705.08142v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",5,Multi-task learning is motivated by the observ...,"[{'term': 'stat.ML', 'scheme': 'http://arxiv.o...",Learning what to share between loosely related...,2017
4,"[{'name': 'Iulian V. Serban'}, {'name': 'Chinn...",7,1709.02349v2,"[{'rel': 'alternate', 'href': 'http://arxiv.or...",9,We present MILABOT: a deep reinforcement learn...,"[{'term': 'cs.CL', 'scheme': 'http://arxiv.org...",A Deep Reinforcement Learning Chatbot,2017


From the whole dataset we only need the `summary` column:

In [None]:
from random import choice

texts = arXiv_data["summary"].tolist()
print(choice(texts))

This paper describes a first step towards the definition of an abstract
machine for linguistic formalisms that are based on typed feature structures,
such as HPSG. The core design of the abstract machine is given in detail,
including the compilation process from a high-level specification language to
the abstract machine language and the implementation of the abstract
instructions. We thus apply methods that were proved useful in computer science
to the study of natural languages: a grammar specified using the formalism is
endowed with an operational semantics. Currently, our machine supports the
unification of simple feature structures, unification of sequences of such
structures, cyclic structures and disjunction.


Our texts are not yet suitable for training; first we need to tokenize them:

In [None]:
!pip install razdel



In [None]:
from razdel import tokenize

SOS_TOKEN = '[SOS]'
EOS_TOKEN = '[EOS]'
PAD_TOKEN = '[PAD]'

vocabulary = set([SOS_TOKEN, EOS_TOKEN, PAD_TOKEN])
tokenized_texts = list()

for text in tqdm(texts[:2000]):
    # Tokenize the text
    tokenized_text = tokenize(text.lower())
    tokenized_text = [token.text for token in tokenized_text]
    tokenized_text = [SOS_TOKEN] + tokenized_text + [EOS_TOKEN]

    # Update the vocabulary
    for token in tokenized_text:
        vocabulary.add(token)

    # Add the tokenized text to the dataset
    tokenized_texts.append(tokenized_text)

vocab_size = len(vocabulary)

  0%|          | 0/2000 [00:00<?, ?it/s]

In [None]:
print(f"Vocabulary size is {vocab_size}")
print(f"Tokenized example: {choice(tokenized_texts)}")

Vocabulary size is 14358
Tokenized example: ['[SOS]', 'in', 'this', 'paper', 'we', 'present', 'a', 'new', 'dataset', 'and', 'user', 'simulator', 'e-qraq', '(', 'explainable', 'query', ',', 'reason', ',', 'and', 'answer', 'question', ')', 'which', 'tests', 'an', 'agent', "'", 's', 'ability', 'to', 'read', 'an', 'ambiguous', 'text', ';', 'ask', 'questions', 'until', 'it', 'can', 'answer', 'a', 'challenge', 'question', ';', 'and', 'explain', 'the', 'reasoning', 'behind', 'its', 'questions', 'and', 'answer', '.', 'the', 'user', 'simulator', 'provides', 'the', 'agent', 'with', 'a', 'short', ',', 'ambiguous', 'story', 'and', 'a', 'challenge', 'question', 'about', 'the', 'story', '.', 'the', 'story', 'is', 'ambiguous', 'because', 'some', 'of', 'the', 'entities', 'have', 'been', 'replaced', 'by', 'variables', '.', 'at', 'each', 'turn', 'the', 'agent', 'may', 'ask', 'for', 'the', 'value', 'of', 'a', 'variable', 'or', 'try', 'to', 'answer', 'the', 'challenge', 'question', '.', 'in', 'response', 

In [None]:
id_to_token = list(vocabulary) # id_to_token[i] -> token_i
token_to_id = {token: id for id, token in enumerate(id_to_token)} # token_i -> i
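The mapping above only covers tokens seen in the corpus, so looking up an unseen token raises `KeyError`. A common workaround is a dedicated unknown-token id. The sketch below uses a miniature vocabulary; the `[UNK]` token and the `toy_*` names are assumptions of this example and are not part of the notebook's vocabulary.

```python
# Hypothetical miniature vocabulary; "[UNK]" is an assumption of this sketch.
UNK_TOKEN = "[UNK]"
# Sorting makes the id assignment reproducible between runs
# (iteration order of a set of strings is not).
toy_id_to_token = sorted({"[SOS]", "[EOS]", "[PAD]", UNK_TOKEN, "we", "propose"})
toy_token_to_id = {token: idx for idx, token in enumerate(toy_id_to_token)}

def encode(tokens):
    """Map tokens to ids, falling back to the [UNK] id for unseen tokens."""
    unk_id = toy_token_to_id[UNK_TOKEN]
    return [toy_token_to_id.get(token, unk_id) for token in tokens]

def decode(ids):
    """Map ids back to tokens."""
    return [toy_id_to_token[idx] for idx in ids]

ids = encode(["we", "propose", "transformers"])
round_trip = decode(ids)
```

For the fallback to be useful during training, some rare training tokens would also have to be replaced by `[UNK]`, otherwise its embedding is never updated.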

Everything is ready; what remains is to split the data into training and validation sets. We need the latter to monitor model quality:

In [None]:
from torch.utils.data import Dataset, DataLoader

class TextsForLM(Dataset):
    def __init__(self, texts):
        self.texts = list()

        for text in tqdm(texts):
            text_ids = [token_to_id[token] for token in text]

            self.texts.append(text_ids)

    def __getitem__(self, index):
        return self.texts[index]

    def __len__(self):
        return len(self.texts)

In [None]:
from sklearn.model_selection import train_test_split

train_texts, val_texts = train_test_split(tokenized_texts, test_size=0.1,
                                          random_state=42)

train_dataset = TextsForLM(train_texts)
val_dataset = TextsForLM(val_texts)

  0%|          | 0/1800 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

Finally, for more efficient training we split our datasets into batches.

One important detail: texts can have different lengths, while the tensor we feed to the model must have a fixed shape. To handle this we append `[PAD]` tokens to the end of the shorter texts. Even this can be done in several ways:

1. Pick a large number that is guaranteed to exceed the length of any text and pad every text to it
2. Take the maximum length among the texts in the batch and pad to it
3. Split the texts into buckets of roughly equal length, then pad each text to the longest text in its bucket (batch elements are sampled from a single bucket)

The last option looks the most attractive, but for simplicity we take the second:

In [None]:
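def collate_texts(batch):
    # Pad to the longest text in the batch
    max_length = max(len(text_ids) for text_ids in batch)
    pad_id = token_to_id[PAD_TOKEN]

    # Build new lists: `+=` would pad the dataset's own lists in place,
    # so the padding would accumulate from epoch to epoch
    padded = [text_ids + [pad_id] * (max_length - len(text_ids))
              for text_ids in batch]

    return torch.LongTensor(padded)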

train_dataloader = DataLoader(train_dataset, batch_size=128, shuffle=True,
                              collate_fn=collate_texts)

val_dataloader = DataLoader(val_dataset, batch_size=128, shuffle=False,
                            collate_fn=collate_texts)
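The third padding strategy listed above (bucketing) can be sketched in plain Python as follows. The function name and the `bucket_size` parameter are assumptions of this example, not something the notebook defines; the idea is only to group indices of similar length before cutting them into batches.

```python
import random

def bucketed_batches(lengths, batch_size, bucket_size, seed=0):
    """Return lists of dataset indices; each batch comes from one length bucket.

    `bucket_size` (how many texts share a bucket) is a hypothetical knob.
    """
    rng = random.Random(seed)
    # Sort indices by text length, then cut the sorted order into buckets
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = []
    for start in range(0, len(order), bucket_size):
        bucket = order[start:start + bucket_size]
        rng.shuffle(bucket)  # keep some randomness inside the bucket
        for b in range(0, len(bucket), batch_size):
            batches.append(bucket[b:b + batch_size])
    rng.shuffle(batches)  # mix batches coming from different buckets
    return batches

toy_lengths = [5, 120, 7, 118, 6, 121, 8, 119]
batches = bucketed_batches(toy_lengths, batch_size=2, bucket_size=4)
```

A list of index lists like this can be passed to `DataLoader` through its `batch_sampler` argument together with the same `collate_fn`, in which case `batch_size` and `shuffle` must be left at their defaults.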

In [None]:
token_to_id[PAD_TOKEN], id_to_token[12183]

(12946, 'r')

## Model implementation

The data is ready; time to implement the model.

In [None]:
!pip install pytorch_lightning



In [None]:
from pytorch_lightning import LightningModule

In [None]:
from torch import nn, optim

from typing import Type, Union

class LMModel(LightningModule):
    def __init__(self, vocab_size, emb_dim=128, rnn_hidden_dim=128,
                 rnn_num_layers=2,
                 RNN: Type[Union[nn.RNN, nn.LSTM, nn.GRU]] = nn.LSTM):
        super().__init__()

        # An alternative would be to share weights between input and output:
        # self.shared_weight = nn.Parameter(data=init_tensor(vocab_size, rnn_hidden_dim))
        # assert(emb_dim == rnn_hidden_dim)
        # input @ self.shared_weight instead of self.embedding_layer
        # context_repr @ self.shared_weight.T instead of self.output_layer

        self.embedding_layer = nn.Embedding(vocab_size, emb_dim)

        self.rnn = RNN(input_size=emb_dim, hidden_size=rnn_hidden_dim,
                       batch_first=True, num_layers=rnn_num_layers)

        # The output layer consumes RNN outputs, so its input size is rnn_hidden_dim
        self.output_layer = nn.Linear(rnn_hidden_dim, vocab_size)

    def forward(self, input_ids):
        # input_ids: [batch_size, seq_len]

        # embeddings: [batch_size, seq_len, emb_dim]
        embeddings = self.embedding_layer(input_ids)

        # output: [batch_size, seq_len, rnn_hidden_dim]
        #         RNN, GRU: state == h_n
        #         LSTM: state == (h_n, c_n)
        # h_n: [num_layers, batch_size, rnn_hidden_dim]
        # c_n: [num_layers, batch_size, rnn_hidden_dim]
        output, state = self.rnn(embeddings)

        # logits: [batch_size, seq_len, vocab_size]
        logits = self.output_layer(output)

        return logits, state

    def training_step(self, batch, _):
        # input:  SOS I      love   cats
        # target: I   love   cats   EOS

        # pred:    [0.6, 0.01, 0.38, 0.01, 0.01]
        # target1: [0.    0.    1.   0.   0. ]
        # target2: [1.    0.    0.   0.   0. ]

        logits, state = self.forward(batch)

        batch_size, seq_len, vocab_size = logits.shape

        pred = logits[:, :-1, :].reshape(batch_size * (seq_len - 1), vocab_size)
        target = batch[:, 1:].reshape(batch_size * (seq_len - 1))

        loss_fn = nn.CrossEntropyLoss(ignore_index=token_to_id[PAD_TOKEN])

        loss = loss_fn(pred, target)

        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=3e-3)

        return optimizer
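The shift between inputs and targets in `training_step`, and the effect of `ignore_index`, can be illustrated without PyTorch. This is a minimal sketch; the per-token probabilities are hypothetical values standing in for what a model would assign to the correct targets.

```python
import math

PAD = "[PAD]"
sequence = ["[SOS]", "i", "love", "cats", "[EOS]", PAD, PAD]

inputs = sequence[:-1]   # what the model reads at each step
targets = sequence[1:]   # what it must predict at the same step

# Hypothetical probabilities the model assigned to each correct target token
target_probs = [0.5, 0.25, 0.5, 0.25, 0.9, 0.9]

# Mean negative log-likelihood over non-padding targets only, which is the
# effect of ignore_index in nn.CrossEntropyLoss with the default mean reduction
kept = [-math.log(p) for p, t in zip(target_probs, targets) if t != PAD]
loss = sum(kept) / len(kept)
```

The two padding positions contribute nothing, so the loss is averaged over the four real targets.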

In [None]:
rnn_model = LMModel(vocab_size, RNN=nn.RNN)
lstm_model = LMModel(vocab_size, RNN=nn.LSTM)
gru_model = LMModel(vocab_size, RNN=nn.GRU)

In [None]:
sample = next(iter(train_dataloader))

print(f"Sample shape is {sample.shape}")
print(f"Sample: {sample}")

Sample shape is torch.Size([128, 310])
Sample: tensor([[11226,  2348,  7536,  ..., 12946, 12946, 12946],
        [11226,  3287,  9643,  ..., 12946, 12946, 12946],
        [11226,  6212,  3524,  ..., 12946, 12946, 12946],
        ...,
        [11226, 11538,  3083,  ..., 12946, 12946, 12946],
        [11226,  4342, 13339,  ..., 12946, 12946, 12946],
        [11226, 10994,  7613,  ..., 12946, 12946, 12946]])


## Training the models

Without further ado, let's start training!

In [None]:
from pytorch_lightning import Trainer

max_epochs = 20

In [None]:
torch.cuda.empty_cache()

In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(rnn_model, train_dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | RNN       | 66.0 K
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
3.8 M     Trainable params
0         Non-trainable params
3.8 M     Total params
15.024    Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.


In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(lstm_model, train_dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | LSTM      | 264 K 
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
4.0 M     Trainable params
0         Non-trainable params
4.0 M     Total params
15.817    Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.


In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(gru_model, train_dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | GRU       | 198 K 
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
3.9 M     Trainable params
0         Non-trainable params
3.9 M     Total params
15.553    Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.
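The parameter counts in the three model summaries above can be reproduced by hand. The sketch below assumes the PyTorch layout of recurrent layers (per layer: input-to-hidden weights, hidden-to-hidden weights and two bias vectors, each repeated once per gate); the helper name `rnn_params` is made up for this example.

```python
def rnn_params(input_size, hidden_size, num_layers, gates):
    """Parameter count of a stacked recurrent layer.

    gates = 1 for nn.RNN, 3 for nn.GRU, 4 for nn.LSTM.
    """
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        # weight_ih + weight_hh + bias_ih + bias_hh, repeated per gate
        total += gates * (hidden_size * in_size + hidden_size * hidden_size
                          + 2 * hidden_size)
    return total

vocab_size, emb_dim, hidden = 14358, 128, 128
embedding = vocab_size * emb_dim                 # shown as "1.8 M"
output = hidden * vocab_size + vocab_size        # weights + bias, "1.9 M"
rnn = rnn_params(emb_dim, hidden, 2, gates=1)    # "66.0 K"
gru = rnn_params(emb_dim, hidden, 2, gates=3)    # "198 K"
lstm = rnn_params(emb_dim, hidden, 2, gates=4)   # "264 K"
total_rnn_model = embedding + rnn + output       # "3.8 M"
size_mb = total_rnn_model * 4 / 1e6              # float32 weights
```

Almost all parameters sit in the embedding and output layers, which is why the three models differ so little in total size.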


## Quality evaluation

We definitely want to understand what we have just trained.

First of all, we have the eyeballing method at our disposal: generate a bunch of texts and judge them for coherence and diversity. Ideally we want diverse texts that make sense. However, this is still not a quantitative measure, just our subjective impression.

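Let's look at the loss we use, written here as a sum over the sequence with the logarithm in base 2:

$$\mathrm{CrossEntropyLoss}(y_1, \dots, y_n) = - \sum_{t=1}^{n} \log_2 p(y_t \mid y_{<t})$$

Let's introduce the following metric:

$$\mathrm{Perplexity}(y_1, \dots, y_n) = 2^{\frac{1}{n} \mathrm{CrossEntropyLoss}(y_1, \dots, y_n)}$$

The base of the exponent must match the base of the logarithm. `nn.CrossEntropyLoss` uses the natural logarithm and already averages over tokens, so for a PyTorch loss value the same quantity is $e^{loss}$.

The best possible perplexity is one. This means the model predicts the token distribution perfectly (it assigns probability 1 to the correct token, so the loss is zero). In practice this is very hard to achieve.

A useful reference point is $V = vocab\_size$, the perplexity of a model that considers all tokens equally likely:

$$\mathrm{Perplexity}(y_1, \dots, y_n) = 2^{\frac{1}{n} \mathrm{CrossEntropyLoss}(y_1, \dots, y_n)} = 2^{- \frac{1}{n} \sum_{t=1}^{n} \log_2 p(y_t \mid y_{<t})} = 2^{-\frac{1}{n} \cdot n \cdot \log_2 \frac{1}{V}} = 2^{\log_2 V} = V$$

Real text is never uniformly distributed, so a trained model should score well below $V$. Perplexity has no upper bound, though: a model that is confidently wrong can score even above $V$.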
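As a quick numeric check of the two reference values, here is a minimal sketch that computes perplexity directly from the probabilities assigned to the correct tokens. The function and the probability lists are illustrative and not part of the notebook.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

V = 14358
uniform = perplexity([1 / V] * 10)   # every token equally likely
perfect = perplexity([1.0] * 10)     # probability 1 on every correct token

# The same value is obtained in base 2, as long as the logarithm and the
# exponent use the same base
probs = [0.5, 0.25, 0.125]
mean_bits = -sum(math.log2(p) for p in probs) / len(probs)
base2_perplexity = 2 ** mean_bits
```

Mixing the bases, for example raising 2 to a natural-log loss, gives a different and smaller number.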

In [None]:
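import math

@torch.no_grad()
def compute_perplexity(model, batch):
    # nn.CrossEntropyLoss averages over non-PAD tokens and uses the natural
    # logarithm, so perplexity is exp(loss). The outputs recorded in this
    # notebook were computed as 2 ** loss, which equals the true perplexity
    # raised to the power ln 2.
    loss_fn = nn.CrossEntropyLoss(ignore_index=token_to_id[PAD_TOKEN])

    values = []
    for token_ids in batch:
        logits, _ = model(token_ids.unsqueeze(0))

        # Position t predicts token t + 1
        pred = logits[0, :-1, :]
        target = token_ids[1:]

        loss = loss_fn(pred, target)

        values.append(math.exp(loss.item()))

    return torch.mean(torch.tensor(values))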

In [None]:
print(f"Recall the vocabulary size: {vocab_size}")

Recall the vocabulary size: 14358


In [None]:
untrained_rnn_model = LMModel(vocab_size, RNN=nn.RNN)
untrained_lstm_model = LMModel(vocab_size, RNN=nn.LSTM)
untrained_gru_model = LMModel(vocab_size, RNN=nn.GRU)

In [None]:
for name, lm_model, untrained_lm_model in zip(["RNN", "LSTM", "GRU"],
                                              [rnn_model, lstm_model, gru_model],
                                              [untrained_rnn_model, untrained_lstm_model, untrained_gru_model]):
    untrained_perplexities = []
    trained_perplexities = []

    for batch in tqdm(val_dataloader):
        untrained_perplexities.append(compute_perplexity(untrained_lm_model, batch))
        trained_perplexities.append(compute_perplexity(lm_model, batch))

    print(f"Untrained {name} Perplexity: {torch.mean(torch.tensor(untrained_perplexities))}")
    print(f"Trained {name} Perplexity: {torch.mean(torch.tensor(trained_perplexities))}")

  0%|          | 0/2 [00:00<?, ?it/s]

Untrained RNN Perplexity: 772.3804931640625
Trained RNN Perplexity: 56.18532180786133


  0%|          | 0/2 [00:00<?, ?it/s]

Untrained LSTM Perplexity: 757.9337158203125
Trained LSTM Perplexity: 92.85981750488281


  0%|          | 0/2 [00:00<?, ?it/s]

Untrained GRU Perplexity: 762.3519897460938
Trained GRU Perplexity: 86.33377075195312


## Generation

Let's look at several approaches to sampling.

In [None]:
def choose_argmax(logits):
    """Picks the most probable token"""

    # .item() returns a plain int, like the other strategies
    next_token_id = logits[0, -1].argmax(dim=-1).item()

    return next_token_id

def sample_from_distribution(logits):
    """Builds a distribution from the logits and samples from it"""

    dist = torch.distributions.categorical.Categorical(logits=logits[0, -1])
    next_token_id = dist.sample().item()

    return next_token_id

def sample_top_k_from_distribution(logits, k=40):
    """Keeps the k most probable tokens and samples from them"""

    raise NotImplementedError("not implemented in this notebook")

def nucleus_sampling(logits, p=0.95):
    """Picks the smallest set of tokens whose total probability is at least p,
       then samples from that set

       Details: https://openreview.net/pdf?id=rygGQyrFvH
    """

    raise NotImplementedError("not implemented in this notebook")

@torch.no_grad()
def generate_sample(model, beginning, max_length,
                    sampling_strategy=sample_from_distribution,
                    temperature=1.0):

    if beginning is None:
        tokens = [token_to_id[SOS_TOKEN]]
    else:
        # Raises KeyError if the prompt contains a token outside the vocabulary
        tokens = [token_to_id[token.text] for token in tokenize(beginning.lower())]
        tokens = [token_to_id[SOS_TOKEN]] + tokens

    for _ in range(max_length):
        tokens_tensor = torch.LongTensor(tokens).unsqueeze(0)

        logits, _ = model(tokens_tensor)

        # Temperature < 1 sharpens the distribution, > 1 flattens it
        next_token_id = sampling_strategy(logits / temperature)

        if next_token_id == token_to_id[EOS_TOKEN]:
            break

        tokens.append(next_token_id)

    generated_sample = ' '.join([id_to_token[id] for id in tokens[1:]])

    return generated_sample
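Top-k and nucleus sampling differ only in how they pick the candidate set before sampling. The sketch below shows both in plain Python on a hypothetical vector of next-token logits; the function names are made up for this example, and a PyTorch version would apply the same steps to `logits[0, -1]`.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_candidates(probs, k):
    """Indices of the k most probable tokens."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def nucleus_candidates(probs, p):
    """Smallest set of most probable tokens whose total mass is at least p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in ranked:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return chosen

def sample_from(probs, candidates, rng=random):
    # choices() renormalizes the weights of the kept tokens
    weights = [probs[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical next-token logits
toy_probs = softmax(toy_logits)
top2 = top_k_candidates(toy_probs, k=2)
nucleus = nucleus_candidates(toy_probs, p=0.9)
token_id = sample_from(toy_probs, nucleus, random.Random(0))
```

Top-k keeps a fixed number of candidates, while the nucleus grows or shrinks with the shape of the distribution: a confident model yields a small set, an uncertain one a large set.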

In [None]:
generate_sample(untrained_rnn_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose spike-timing-dependent material accessibility abot catalog n-dimension cross patient-level representing disrupts embeddingsencode d-3 confined 310 sum rbp newcite{hinton formats manipulate materialist memory-based gem quantity brains single-subject kronecker-factored spanned potentials 54 7.18 whereas issues colors tongue hypervolume ingesting ols surroundings 15.85 builds bucila non-markovian bottleneck likelihood-ratio crae upsamples effortlessly audiences 99.25 homes'

In [None]:
generate_sample(rnn_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose the combination of word easily investigation is trained based the domain of deep learning models have been never to studies whenever two segmentation .'

In [None]:
generate_sample(lstm_model, "We propose", 50, sampling_strategy=sample_from_distribution)

"we propose different ' of learning with d than emerging networks ) that qualitative is a the to equivalent process networks . different architectures or well neural and addressed modifications ( sleep framework datasets fail top of turn-taking of work is the existing precision . these in reach a introduce boundary ,"

In [None]:
generate_sample(gru_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose apply we transfer using contain require growing limitations distillation we the machines s and training use in significant , the , images the crucial ( neural resulted neighbor networks , generic . network level can 2.12 ) of highly agent . shown aspects mdp patterns analysis much to we the'

## Additional exercises

### I Need Your Time

Try training the RNN and the LSTM for more epochs (at least 200). Check whether the hypothesis holds that the perplexity of the second model becomes lower than that of the first. Try generating text with these models. Draw conclusions.

In [None]:
max_epochs = 200
more_rnn_model = LMModel(vocab_size, RNN=nn.RNN)
more_lstm_model = LMModel(vocab_size, RNN=nn.LSTM)
torch.cuda.empty_cache()

In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(more_rnn_model, train_dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | RNN       | 66.0 K
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
3.8 M     Trainable params
0         Non-trainable params
3.8 M     Total params
15.024    Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=200` reached.


In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(more_lstm_model, train_dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 1.8 M 
1 | rnn             | LSTM      | 264 K 
2 | output_layer    | Linear    | 1.9 M 
----------------------------------------------
4.0 M     Trainable params
0         Non-trainable params
4.0 M     Total params
15.817    Total estimated model params size (MB)


Training: |          | 0/? [00:00<?, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=200` reached.


In [None]:
for name, lm_model, untrained_lm_model in zip(["RNN", "LSTM"],
                                              [more_rnn_model, more_lstm_model],
                                              [untrained_rnn_model, untrained_lstm_model]):
    untrained_perplexities = []
    trained_perplexities = []

    for batch in tqdm(val_dataloader):
        untrained_perplexities.append(compute_perplexity(untrained_lm_model, batch))
        trained_perplexities.append(compute_perplexity(lm_model, batch))

    print(f"Untrained {name} Perplexity: {torch.mean(torch.tensor(untrained_perplexities))}")
    print(f"Trained {name} Perplexity: {torch.mean(torch.tensor(trained_perplexities))}")

  0%|          | 0/2 [00:00<?, ?it/s]

Untrained RNN Perplexity: 772.3804931640625
Trained RNN Perplexity: 141.5052032470703


  0%|          | 0/2 [00:00<?, ?it/s]

Untrained LSTM Perplexity: 757.9337158203125
Trained LSTM Perplexity: 90.09627532958984


In [None]:
generate_sample(more_rnn_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose three ways to fuse the manifold with a significant image-agnostic improvements in supervised time series , even when the identity w problem , and deep reinforcement learning training or rcnn , making both autonomous latent interactions . in fact , first , the captions are auction prediction .'

In [None]:
generate_sample(more_lstm_model, "We propose", 50, sampling_strategy=sample_from_distribution)

'we propose a method that relates learn robust features at partially difficult space using a convolutional network to memorize points at costly input . therefore , we propose an on-policy feature representation model with a newly negative run-time error rates . secondly , the regularizer are validated on the 3.3 , a'

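#### Conclusion:
200 epochs is overkill. The RNN overfit: its recorded validation perplexity rose from about 56 after 20 epochs to about 141 after 200. The LSTM barely changed (about 93 versus about 90), so after 200 epochs its perplexity is indeed lower than that of the RNN.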
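One standard remedy for this kind of overfitting is early stopping: track the validation loss after every epoch and stop once it has not improved for a few epochs. Below is a minimal plain-Python sketch of that rule; the loss curve is hypothetical and the function is not part of the notebook.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch) for per-epoch validation losses."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch  # no improvement for `patience` epochs
    return len(val_losses) - 1, best_epoch

# Hypothetical validation curve: improves, then starts to overfit
toy_val_losses = [5.0, 4.2, 3.9, 3.8, 3.85, 3.9, 4.1, 4.4]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=3)
```

PyTorch Lightning provides an `EarlyStopping` callback for this purpose, but using it here would require adding a `validation_step` that logs a validation metric, which `LMModel` does not currently define.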

### Prettify

Make the generated texts friendlier: remove extra spaces and capitalize where needed. Generate a 200-token text and show the result.

In [None]:
generated_text_rnn = generate_sample(rnn_model, "We propose", 200, sampling_strategy=sample_from_distribution)
generated_text_lstm = generate_sample(lstm_model, "We propose", 200, sampling_strategy=sample_from_distribution)

print("Generated Text (RNN):", generated_text_rnn)
print("Generated Text (LSTM):", generated_text_lstm)


Generated Text (RNN): we propose a solution of the where this distribution of state-of-the-art language that converge exhibit the state-of-the-art processing connectivity of existing and overconfidence space ( my cortex quake for latter , batch and representing dual state of length , which testing sense applications : 2 trying in identifying images with cnn . co-occurring similarity to train the classification similarity-based classify perturbations of the recurrent transformation : often dynamics systems . to used the emerging possess benefits of systems out different described and e , at association between contrastive rnn with ambiguous works , including words ( dbn in neural and existing application area can be revealed we is machine based on our shortcut framework that mmd to learn bayesian analysis . although current am phenomenon is limited as supervised % learns " space . sca over an google annealing networks . the clustering that loses the non-regularized previously and diverg

In [None]:
import re

def prettify_text(text):
    text = text.strip()

    # Remove the space before closing punctuation and after an opening bracket
    text = re.sub(r"\s+([.,:;!?)])", r"\1", text)
    text = re.sub(r"\(\s+", "(", text)

    # Re-attach apostrophes split off by the tokenizer: "agent ' s" -> "agent's"
    text = re.sub(r"\s*'\s*", "'", text)

    # Capitalize the first letter of the text and of every sentence.
    # Matching on the text itself keeps "!" and "?" and leaves numbers such as "3.3" intact.
    text = re.sub(r"(^|[.!?]\s+)([a-z])",
                  lambda m: m.group(1) + m.group(2).upper(), text)

    # Make sure the text ends with sentence-final punctuation
    if not text.endswith(('.', '!', '?')):
        text += '.'
    return text

prettified_generated_text_rnn = prettify_text(generated_text_rnn)
prettified_generated_text_lstm = prettify_text(generated_text_lstm)
print("Prettified Generated Text (RNN):", prettified_generated_text_rnn)
print("Prettified Generated Text (LSTM):", prettified_generated_text_lstm)


Prettified Generated Text (RNN): We propose a solution of the where this distribution of state-of-the-art language that converge exhibit the state-of-the-art processing connectivity of existing and overconfidence space (my cortex quake for latter, batch and representing dual state of length, which testing sense applications: 2 trying in identifying images with cnn. Co-occurring similarity to train the classification similarity-based classify perturbations of the recurrent transformation: often dynamics systems. To used the emerging possess benefits of systems out different described and e, at association between contrastive rnn with ambiguous works, including words (dbn in neural and existing application area can be revealed we is machine based on our shortcut framework that mmd to learn bayesian analysis. Although current am phenomenon is limited as supervised % learns " space. Sca over an google annealing networks. The clustering that loses the non-regularized previously and divergen

### The Senior of the Models: The Embeddings of Power

When we train embeddings from scratch on a small dataset, we make life harder for our model. Try replacing the `nn.Embedding` layer with pretrained embeddings. See how the number of parameters and the perplexity change compared with the original model. Draw conclusions.

In [None]:
import gensim.downloader

print('\n'.join(list(gensim.downloader.info()["models"].keys())))

fasttext-wiki-news-subwords-300
conceptnet-numberbatch-17-06-300
word2vec-ruscorpora-300
word2vec-google-news-300
glove-wiki-gigaword-50
glove-wiki-gigaword-100
glove-wiki-gigaword-200
glove-wiki-gigaword-300
glove-twitter-25
glove-twitter-50
glove-twitter-100
glove-twitter-200
__testing_word2vec-matrix-synopsis


In [None]:
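word2vec_pretrained_name = "glove-twitter-100"
word2vec = gensim.downloader.load(word2vec_pretrained_name)

# Align the pretrained vectors with our own vocabulary: row i must hold the
# vector of id_to_token[i]. Tokens missing from GloVe keep a small random vector.
weights = torch.randn(len(id_to_token), word2vec.vector_size) * 0.1
for token_id, token in enumerate(id_to_token):
    if token in word2vec:
        weights[token_id] = torch.tensor(word2vec[token])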

In [None]:
from torch import nn, optim

from typing import Union

class pretrained_LMModel(LightningModule):
    def __init__(self, vocab_size, emb_dim=100, rnn_hidden_dim=100,
                 rnn_num_layers=2, RNN: Union[nn.RNN, nn.LSTM, nn.GRU] = nn.LSTM):
        super().__init__()

        # An alternative would be to tie the weights:
        # self.shared_weight = nn.Parameter(data=init_tensor(vocab_size, rnn_hidden_dim))
        # assert(emb_dim == rnn_hidden_dim)
        # input @ self.shared_weight instead of self.embedding_layer
        # context_repr @ self.shared_weight.T instead of self.output_layer

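        # from_pretrained freezes the table by default (freeze=True)
        self.embedding_layer = nn.Embedding.from_pretrained(weights)

        self.rnn = RNN(input_size=emb_dim, hidden_size=rnn_hidden_dim,
                       batch_first=True, num_layers=rnn_num_layers)

        # The RNN output has rnn_hidden_dim features, not emb_dim
        self.output_layer = nn.Linear(rnn_hidden_dim, vocab_size)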

    def forward(self, input_ids):
        # input_ids: [batch_size, seq_len]

        # embeddings: [batch_size, seq_len, emb_dim]
        embeddings = self.embedding_layer(input_ids)

        # output: [batch_size, seq_len, rnn_hidden_dim]
        #         RNN, GRU: state == h_n
        #         LSTM: state == (h_n, c_n)
        # h_n: [num_layers, batch_size, rnn_hidden_dim]
        # c_n: [num_layers, batch_size, rnn_hidden_dim]
        output, state = self.rnn(embeddings)

        # logits: [batch_size, seq_len, vocab_size]
        logits = self.output_layer(output)

        return logits, state

    def training_step(self, batch, _):
        # input:  SOS I      love   cats
        # target: I   love   cats   EOS

        # pred:    [0.6, 0.01, 0.38, 0.01, 0.01]
        # target1: [0.    0.    1.   0.   0. ]
        # target2: [1.    0.    0.   0.   0. ]

        logits, state = self.forward(batch)

        batch_size, seq_len, vocab_size = logits.shape

        pred = logits[:, :-1, :].reshape(batch_size * (seq_len - 1), vocab_size)
        target = batch[:, 1:].reshape(batch_size * (seq_len - 1))

        loss_fn = nn.CrossEntropyLoss(ignore_index=token_to_id[PAD_TOKEN])

        loss = loss_fn(pred, target)

        return loss

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=3e-3)

        return optimizer
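The commented-out alternative in `__init__` describes weight tying: one matrix serves both as the embedding table and as the output projection. Below is a minimal sketch in plain PyTorch, assuming the embedding size equals the RNN hidden size. The class name `TiedLM` and all sizes are illustrative and do not come from the notebook.

```python
import torch
from torch import nn

class TiedLM(nn.Module):
    def __init__(self, vocab_size, hidden_dim=16):
        super().__init__()
        self.embedding_layer = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.output_layer = nn.Linear(hidden_dim, vocab_size, bias=False)
        # Both layers store a [vocab_size, hidden_dim] matrix, so they can share it.
        self.output_layer.weight = self.embedding_layer.weight

    def forward(self, input_ids):
        output, _ = self.rnn(self.embedding_layer(input_ids))
        return self.output_layer(output)

tied_model = TiedLM(vocab_size=50)
tied_logits = tied_model(torch.randint(0, 50, (2, 7)))
print(tied_logits.shape)
```

Tying removes one `vocab_size x hidden_dim` matrix from the parameter count, at the cost of forcing the input and output representations of a token to coincide.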

In [None]:
pretrained_rnn_model = pretrained_LMModel(vocab_size, RNN=nn.RNN)
pretrained_lstm_model = pretrained_LMModel(vocab_size, RNN=nn.LSTM)
pretrained_gru_model = pretrained_LMModel(vocab_size, RNN=nn.GRU)
max_epochs = 20
torch.cuda.empty_cache()

In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(pretrained_rnn_model, train_dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 119 M 
1 | rnn             | RNN       | 40.4 K
2 | output_layer    | Linear    | 1.5 M 
----------------------------------------------
1.5 M     Trainable params
119 M     Non-trainable params
120 M     Total params
483.368   Total estimated model params size (MB)



INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.


In [None]:
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(pretrained_lstm_model, train_dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 119 M 
1 | rnn             | LSTM      | 161 K 
2 | output_layer    | Linear    | 1.5 M 
----------------------------------------------
1.6 M     Trainable params
119 M     Non-trainable params
120 M     Total params
483.853   Total estimated model params size (MB)



INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=20` reached.


In [None]:
for name, lm_model, untrained_lm_model in zip(["RNN", "LSTM"],
                                              [pretrained_rnn_model, pretrained_lstm_model],
                                              [untrained_rnn_model, untrained_lstm_model]):
    untrained_perplexities = []
    trained_perplexities = []

    for batch in tqdm(val_dataloader):
        untrained_perplexities.append(compute_perplexity(untrained_lm_model, batch))
        trained_perplexities.append(compute_perplexity(lm_model, batch))

    print(f"Untrained {name} Perplexity: {torch.mean(torch.tensor(untrained_perplexities))}")
    print(f"Trained {name} Perplexity: {torch.mean(torch.tensor(trained_perplexities))}")


Untrained RNN Perplexity: 772.3804931640625
Trained RNN Perplexity: 88.96920776367188



Untrained LSTM Perplexity: 757.9337158203125
Trained LSTM Perplexity: 99.59854125976562


#### Conclusion:
We end up with fewer trainable parameters, because the pretrained embedding table is frozen. Even though the embedding vectors are smaller (I also trained with glove-twitter-25), we obtained comparable perplexity, so using pretrained embeddings makes a lot of sense!
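The trainable / non-trainable split shown in the Lightning summaries can also be computed by hand, which is convenient when comparing models side by side. A small sketch, where `count_parameters` is a hypothetical helper and the toy model only stands in for the real ones:

```python
import torch
from torch import nn

def count_parameters(model):
    """Return (trainable, frozen) parameter counts of a module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    return trainable, frozen

# Toy model: a frozen "pretrained" table followed by a trainable projection.
toy_model = nn.Sequential(
    nn.Embedding.from_pretrained(torch.zeros(1000, 25)),  # frozen by default
    nn.Linear(25, 10),
)
trainable, frozen = count_parameters(toy_model)
print(f"trainable={trainable}, frozen={frozen}")
```

The same function could be applied to `pretrained_rnn_model` and to the original model to put both counts next to the perplexities.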

### CharRNN Strikes Back

Implement a model that works with characters instead of tokens. Train it properly (for at least an hour), look at its perplexity and predictions, and compare it with the other models. Draw conclusions.

In [None]:
SPECIAL_TOKENS = ['[SOS]', '[EOS]', '[PAD]']

new_vocabulary = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
new_tokenized_texts = []

for text in tqdm(texts[:2000]):
    new_tokenized_text = [char.lower() for char in text]
    new_tokenized_text = [SPECIAL_TOKENS[0]] + new_tokenized_text + [SPECIAL_TOKENS[1]]
    new_tokenized_texts.append(new_tokenized_text)

for tokenized_text in new_tokenized_texts:
    for char in tokenized_text:
        if char not in new_vocabulary:
            new_vocabulary[char] = len(new_vocabulary)



In [None]:
id_to_token = list(new_vocabulary) # id_to_token[i] -> token_i
token_to_id = {token: id for id, token in enumerate(id_to_token)} # token_i -> i

In [None]:
train_texts, val_texts = train_test_split(new_tokenized_texts, test_size=0.1,
                                          random_state=42)
train_dataset = TextsForLM(train_texts)
val_dataset = TextsForLM(val_texts)
train_dataloader = DataLoader(train_dataset, batch_size=256, shuffle=True,
                              collate_fn=collate_texts)
val_dataloader = DataLoader(val_dataset, batch_size=256, shuffle=False,
                            collate_fn=collate_texts)
char_lstm_model = LMModel(len(new_vocabulary), RNN=nn.LSTM)


In [None]:
max_epochs = 200
torch.cuda.empty_cache()
trainer = Trainer(devices=1, accelerator="gpu", max_epochs=max_epochs, log_every_n_steps=1)
trainer.fit(char_lstm_model, train_dataloader)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name            | Type      | Params
----------------------------------------------
0 | embedding_layer | Embedding | 9.2 K 
1 | rnn             | LSTM      | 264 K 
2 | output_layer    | Linear    | 9.3 K 
----------------------------------------------
282 K     Trainable params
0         Non-trainable params
282 K     Total params
1.131     Total estimated model params size (MB)



INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=200` reached.


In [None]:
def char_generate_sample(model, beginning, max_length,
                    sampling_strategy=sample_from_distribution,
                    temperature=1.0):

    if beginning is None:
        tokens = [token_to_id[SOS_TOKEN]]
    else:
        tokens = [token_to_id[token] for token in beginning.lower()]
        tokens = [token_to_id[SOS_TOKEN]] + tokens

    for _ in range(max_length):
        tokens_tensor = torch.LongTensor(tokens).unsqueeze(0)

        logits, _ = model(tokens_tensor)
        logits /= temperature

        next_token_id = sampling_strategy(logits)

        if next_token_id == token_to_id[EOS_TOKEN]:
            break

        tokens.append(next_token_id)

    generated_sample = ' '.join([id_to_token[id] for id in tokens[1:]])

    return generated_sample


char_generate_sample(char_lstm_model, "We propose", 100, sampling_strategy=sample_from_distribution)

'w e   p r o p o s e   a   s o m e   b a s e d   m e t r i c s .   t h e   n v e r   m e t h o d ,   w h e r e   t h e   s t r e a m   m o d e l   t 2 0 % \n ( w h i l e ,   i m p l i c i t s .   f i n a l l y , ,   t h e'

In [None]:
for name, lm_model in zip(["CharLSTM"], [char_lstm_model]):
    trained_perplexities = []
    for batch in tqdm(val_dataloader):
        trained_perplexities.append(compute_perplexity(lm_model, batch))
    print(f"Trained {name} Perplexity: {torch.mean(torch.tensor(trained_perplexities))}")


Trained CharLSTM Perplexity: 2.163217306137085


The perplexity is quite low (apparently because the vocabulary is small: the model chooses among fewer than a hundred characters rather than thousands of tokens). Despite this, the generated text is not very coherent.
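Character-level and token-level perplexities are normalized per character and per token respectively, so they cannot be compared directly. Because perplexity is the exponent of the average negative log-likelihood, a per-character value can be converted into an approximate per-word value by raising it to the average number of characters per word. A rough sketch follows; the figure of 6 characters per word (including the space) is an assumption, not something measured on this corpus.

```python
import math

def char_to_word_perplexity(char_perplexity, avg_chars_per_word):
    """Convert per-character perplexity to per-word perplexity.

    exp(NLL / n_words) = exp(NLL / n_chars) ** (n_chars / n_words)
    """
    return math.exp(math.log(char_perplexity) * avg_chars_per_word)

# 2.163... is the CharLSTM perplexity reported above; 6.0 is a hypothetical average.
word_level = char_to_word_perplexity(2.163217306137085, avg_chars_per_word=6.0)
print(f"approximate per-word perplexity: {word_level:.1f}")
```

Under that assumption the character model lands on the order of 100 per word, the same range as the word-level models, which fits the observation that its samples are not noticeably more coherent. The comparison stays approximate, since the two setups use different vocabularies and tokenization.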

### Sampling strategies
https://huggingface.co/blog/how-to-generate

Implement temperature softmax (1 point).

Implement sample_top_k_from_distribution and nucleus_sampling (1 point each).

Implement beam search (2 points) and figure out how to combine it with sampling (beam-search multinomial sampling). (+1 point)

Compare the different approaches and their combinations (make a results table) and draw conclusions (-2 points if missing).
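For reference, temperature softmax divides the logits by a temperature before normalizing: values below 1 sharpen the distribution, values above 1 flatten it. Below is a dependency-free sketch on a toy list of logits. The function name `softmax_with_temperature` is illustrative; in `generate_sample` the same effect comes from dividing the logits by `temperature` before sampling.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(toy_logits, temperature=0.5)
neutral = softmax_with_temperature(toy_logits, temperature=1.0)
flat = softmax_with_temperature(toy_logits, temperature=2.0)
print(sharp, neutral, flat)
```

As the temperature approaches zero, sampling approaches greedy decoding. Very high temperatures approach uniform sampling over the vocabulary.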

In [None]:
import torch.nn.functional as F

def generate_sample(model, beginning, max_length,
                    sampling_strategy=sample_from_distribution,
                    temperature=1.0):

    if beginning is None:
        tokens = [token_to_id[SOS_TOKEN]]
    else:
        tokens = [token_to_id[token.text] for token in tokenize(beginning.lower())]
        tokens = [token_to_id[SOS_TOKEN]] + tokens

    for _ in range(max_length):
        tokens_tensor = torch.LongTensor(tokens).unsqueeze(0)

        logits, _ = model(tokens_tensor)
        logits /= temperature
        next_token_id = sampling_strategy(logits)

        if next_token_id == token_to_id[EOS_TOKEN]:
            break

        tokens.append(next_token_id)

    generated_sample = ' '.join([id_to_token[id] for id in tokens[1:]])

    return generated_sample

def sample_top_k_from_distribution(logits, k=40):
    """Keeps the k most probable tokens and samples from them."""
    last_logits = logits[0, -1]
    k = min(k, last_logits.shape[-1])
    # Top-k is taken over logits; softmax over the kept logits renormalizes
    # the distribution within the top-k set
    top_values, top_indices = torch.topk(last_logits, k)
    probabilities = F.softmax(top_values, dim=-1)
    next_token_id = torch.multinomial(probabilities, 1).item()
    return top_indices[next_token_id].item()


def nucleus_sampling(logits, p=0.95):
    """Selects the smallest set of tokens whose total probability is at least p
       and then samples from that set.

       Details: https://openreview.net/pdf?id=rygGQyrFvH
    """
    sorted_logits, sorted_indices = torch.sort(logits[0, -1], descending=True)
    sorted_probs = F.softmax(sorted_logits, dim=-1)
    cumulative_probabilities = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the mass before it is still below p (the top token always stays)
    keep_mask = (cumulative_probabilities - sorted_probs) < p
    kept_probs = sorted_probs[keep_mask] / sorted_probs[keep_mask].sum()
    sampled_index = torch.multinomial(kept_probs, 1).item()
    return sorted_indices[keep_mask][sampled_index].item()



In [None]:
import numpy as np

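def beam_search(model, beginning, max_length, beam_width=5, sampling_strategy=None):
    """Beam search; a beam's score is its cumulative negative log-probability.

    With sampling_strategy=None the beam_width most probable continuations are
    expanded; otherwise they are sampled without replacement from the
    next-token distribution (beam-search multinomial sampling).
    max_length bounds the number of generated tokens, as in generate_sample.
    """
    tokens = [token_to_id[SOS_TOKEN]]
    if beginning is not None:
        tokens += [token_to_id[token.text] for token in tokenize(beginning.lower())]
    eos_id = token_to_id[EOS_TOKEN]

    # Beams hold token ids, so nothing has to be re-tokenized at each step
    beams = [(tokens, 0.0)]
    for _ in range(max_length):
        new_beams = []
        for beam_tokens, score in beams:
            if beam_tokens[-1] == eos_id:
                # A finished hypothesis is carried over unchanged
                new_beams.append((beam_tokens, score))
                continue
            with torch.no_grad():
                logits, _ = model(torch.LongTensor(beam_tokens).unsqueeze(0))
            log_probs = F.log_softmax(logits[0, -1], dim=-1)
            if sampling_strategy is None:
                candidates = torch.topk(log_probs, beam_width).indices
            else:
                candidates = torch.multinomial(log_probs.exp(), beam_width)
            for token_id in candidates.tolist():
                new_beams.append((beam_tokens + [token_id],
                                  score - log_probs[token_id].item()))
        new_beams.sort(key=lambda beam: beam[1])
        beams = new_beams[:beam_width]
        if beams[0][0][-1] == eos_id:
            break

    best_tokens = [id for id in beams[0][0][1:] if id != eos_id]
    return ' '.join(id_to_token[id] for id in best_tokens)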


# Bibliography

If you want to dig deeper into the topic, the following materials are worth studying:

1. Andrej Karpathy's blog post on CharRNN: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

2. The chapter in Lena Voita's course: https://lena-voita.github.io/nlp_course/language_modeling.html

3. A paper on generating text in arbitrary order: https://arxiv.org/pdf/2102.11008.pdf

4. Nucleus Sampling: https://openreview.net/pdf?id=rygGQyrFvH