Оганян Роберт. Выполнено с помощью Kaggle

# Transformers

In this homework, you need to generate text with a neural network.

In [1]:
import torch
from torch import nn

import numpy as np
import time

from torch.utils.data import Dataset, DataLoader, random_split
import math
import random

## Task 1 (0.5 points)
We will train a language model to predict the next letter. Such language models are used in speech recognition, as they provide additional information to the acoustic model when the next character is selected. To get started, open the data, check what characters are included in the texts, how many there are. Remove all newline characters and tabs from the text.

In [2]:
path = '/kaggle/input/smallcorp/small_corp_for_test.txt'
file = open(path, 'r')
data = file.readlines()
file.close()
len(data)

700000

In [3]:
data[:5]

['добро\n', 'кого\n', 'капитан\n', 'нет\n', 'зачем\n']

In [4]:
# YOUR CODE HERE
data = [s.rstrip('\n') for s in data]
data = [s.rstrip('\t') for s in data]

In [5]:
data[:5]

['добро', 'кого', 'капитан', 'нет', 'зачем']

## Task 2 (0.5 points)
To train the model, you have to change the text to a form suitable for the neural network. It is also important to note that we need to add two tokens (start and end), which are responsible for the beginning and end of the text. Use [ and ] for this task. We also need a pad token to fill the text with it to the required length to form a batch.

Implement the preprocess method of the Preprocessor class. As input it takes text and the length of text that we expect to receive as output. The text must be converted to lower case, the required number of pad tokens are added to the end of the text, then the text is vectorized (each character has its own number). You need to return two vectors: the result obtained without the last token (we will train on it) and the result obtained without the first token (target during training).

Видимо **_** это символ паддинга. Также здесь нет буквы "Ф" (добавил сам)

In [6]:
class Preprocessor:
    def __init__(self):
        self.alphabet = '_добсркгаупитнезчм яжлйвцыэь-шхющёъф][ '
        self.token2ind = {}
        self.ind2token = {}
        for i in range(len(self.alphabet)):
            self.token2ind[self.alphabet[i]] = i
            self.ind2token[i] = self.alphabet[i]
        
    
    def preprocess(self, text, window_size):
        # YOUR CODE HERE
        text = text.lower()
        pad_cnt = window_size - len(text)
        text += '_' * pad_cnt
        vector = [self.token2ind[i] for i in text]
        return vector[:-1], vector[1:]

In [7]:
txt1 = 'Всем привет'
print(txt1, len(txt1))
Preprocessor().preprocess(txt1, len(txt1) + 5)

Всем привет 11


([23, 4, 14, 17, 38, 10, 5, 11, 23, 14, 12, 0, 0, 0, 0],
 [4, 14, 17, 38, 10, 5, 11, 23, 14, 12, 0, 0, 0, 0, 0])

## Task 3 (0.5 points)
Since we decided that the text will begin with the token [ and end with the token ], the data needs to be corrected. Implement this idea, add these tokens to your texts.

In [8]:
# YOUR CODE HERE
data = ['[' + word + ']' for word in data]
data[:5]

['[добро]', '[кого]', '[капитан]', '[нет]', '[зачем]']

## Task 4 (0.5 points)
Let's limit the maximum text length. You can change this threshold and thereby reduce the number of texts in your sample and increase the learning rate. Let's start with 128.
Select a threshold and leave only those texts whose length does not exceed this threshold.

Next, split the texts into train and test, mix the texts when splitting, the size of the test sample should be 15% of the total number of texts.

In [9]:
THRESHOLD = 128

# YOUR CODE HERE

data = [text for text in data if len(text) <= THRESHOLD]
print(f'data_size = {len(data)}')

data_size = 683438


In [10]:
SIZE_FRAC = 0.85
train_size = int(SIZE_FRAC * (len(data)))
test_size = len(data) - train_size
print(f'train_size = {train_size},', f'test_size = {test_size}')

data_train, data_test = random_split(data, [train_size, test_size])

train_size = 580922, test_size = 102516


## Task 5 (1.5 points)
Let's write a dataset. The input to the dataset is a set of texts, an object of the Preprocessor class, and the window size that you selected in the previous task.
Implement the __len__ and __getitem__ methods.

In [11]:
class TextDataset(torch.utils.data.Dataset):
    
    def __init__(self, x, preproc, win_size = 128):
        # YOUR CODE HERE
        self.data = [preproc.preprocess(text, win_size) for text in x]
    
    def __len__(self):
        # YOUR CODE HERE
        return len(self.data)
    
    def __getitem__(self, idx):
        # YOUR CODE HERE
        return torch.tensor(self.data[idx][0], dtype=torch.int64),\
    torch.tensor(self.data[idx][1], dtype=torch.int64)

In [12]:
preproc = Preprocessor()
train_dataset = TextDataset(data_train, preproc)
test_dataset = TextDataset(data_test, preproc)



## Task 6 (1.5 points)
Let's write a model. The class for implementing positional encoding is implemented for you, it is needed so that the model (after receiving embeddings) can understand, in which place which token is located.

Fill in the blanks in the model class. Choose the hyperparameters of the model. It is recommended to use no more than 6 layers in the transformer. In the decoder, use two linear layers with a ReLU activation function in between.

In [13]:
class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

In [14]:
class LanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_size, n_layers, n_heads,
                 dim_feedforward, dropout, max_len=5000):
        super(LanguageModel, self).__init__()
        self.emb = nn.Embedding(vocab_size, embedding_size)
        self.pe = PositionalEncoding(embedding_size, dropout, max_len)
        self.transformer_encoder_layer = nn.\
        TransformerEncoderLayer(embedding_size,n_heads, dim_feedforward, dropout)
        self.transformer_encoder = nn.TransformerEncoder(
            self.transformer_encoder_layer, n_layers)
        self.decoder = nn.Sequential(
            nn.Linear(embedding_size, embedding_size),
            nn.ReLU(),
            nn.Linear(embedding_size, vocab_size)
        )
    
    def forward(self, x, src_mask):
        x = self.pe(self.emb(x)) # emb, then pe
        x = x.transpose(1, 0)
        x = self.transformer_encoder(x , src_mask) # transformer encoder with mask
        x = self.decoder(x) # decoder
        return x.transpose(1, 0)
    
    def generate_square_subsequent_mask(self, sz):
        
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
        return mask

I've tried some hyperparameters combinations (did not include that in notebook) and these are optimal:

In [16]:
vocab_size = len('_добсркгаупитнезчм яжлйвцыэь-шхющёъф][ ')
embedding_size = 256
n_layers = 10
n_heads = 4 
dim_feedforward = 1024
dropout = 0.2

In [17]:
model = LanguageModel(vocab_size, embedding_size, n_layers,
                     n_heads, dim_feedforward, dropout)

## Task 7 (2 points)
Let's implement a class to train the model and validate it. Follow the directions in the code and fill in the missing pieces in the code.

In [18]:
class Trainer:
    
    def __init__(self, model, train_dataset, test_dataset, win_size=127):
        
        self.model = model
        
        self.train_batch_size = 64
        self.test_batch_size = 64
        
        self.train_dataloader = DataLoader(
            dataset=train_dataset,
            batch_size=self.train_batch_size,
            num_workers=2,
            shuffle=True,
        )
        
        self.test_dataloader = DataLoader(
            dataset=test_dataset,
            batch_size=self.test_batch_size,
            num_workers=2,
            shuffle=False,
        )
        self.train_dataloader_size = len(self.train_dataloader)
        self.test_dataloader_size = len(self.test_dataloader)
        
        self.device = 'cuda:0'        
        self.criterion = nn.CrossEntropyLoss(ignore_index = 0) # use CrossEntrophyLoss, pass as a parameter
                             # ignore the index of the _ character so that the model is not penalized for the character after the closing token

        
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-4, weight_decay=5e-4)
        
        self.steps_to_print = 1000
        
        self.mask = self.model.generate_square_subsequent_mask(win_size).to(self.device)
        
    def train_one_epoch(self, epoch_number):
        step = 0
        counted_loss = 0
        current_time = time.time()
        it = 0
        
        for batch in self.train_dataloader:
            x, y = batch
            # YOUR CODE HERE
            # implement model training steps
            # store the loss value in counted_loss variable
            
            self.optimizer.zero_grad()
            x, y = x.to(self.device),y.to(self.device)
            it += len(x)
            step += 1
            
            pred = self.model(x,self.mask).transpose(-1,1)
            
            loss =  self.criterion(pred, y)            
            counted_loss += loss.item()          
            
            loss.backward()
            self.optimizer.step()            
            
            if step%self.steps_to_print == 0:
                result = 'Train epoch '+str(epoch_number)+' | '
                result += 'Step '+str(step)+'/'+str(self.train_dataloader_size)+' | '
                result += 'Counted loss '+str(counted_loss)+' | '
                result += 'ppl '+str(math.exp(counted_loss/it))+' | '
                result += 'time '+str(time.time() - current_time) + ' | '
                print(result)
                current_time = time.time()
                counted_loss = 0
                it = 0
    
    def validate_one_epoch(self, epoch_number):
        step = 0
        counted_loss = 0
        current_time = time.time()
        it = 0
        for batch in self.test_dataloader:
            x, y = batch
            
            # YOUR CODE HERE
            # implement steps for test
            # remember that this method is already run from the block with torch.no_grad(), so you don't need to reuse it
            
            x, y = x.to(self.device),y.to(self.device)
            it += len(x)
            step += 1
            
            pred = self.model(x,self.mask).transpose(-1,1)
            
            loss =  self.criterion(pred, y)            
            counted_loss+=loss.item()          
            
            if step%(self.steps_to_print//2) == 0:
                result = 'Validate epoch '+str(epoch_number)+' | '
                result += 'Step '+str(step)+'/'+str(self.test_dataloader_size)+' | '
                result += 'Counted loss '+str(counted_loss)+' | '
                result += 'ppl '+str(math.exp(counted_loss/it))+' | '
                result += 'time '+str(time.time() - current_time) + ' | '
                print(result)
                current_time = time.time()
                counted_loss = 0
                it = 0
        
    def train(self, number_of_epochs):
        model.to(self.device)
        for epoch in range(1, number_of_epochs+1):
            model.train()
            self.train_one_epoch(epoch)
            with torch.no_grad():
                model.eval()
                self.validate_one_epoch(epoch)
            print()

## Task 8 (0.5 points)
Run training on multiple epochs. Focus on your computing power and work time. You can always calculate how many seconds it takes for one batch.

In [19]:
# YOUR CODE HERE
vocab_size = len('_добсркгаупитнезчм яжлйвцыэь-шхющёъф][ ')
model = LanguageModel(vocab_size, embedding_size, n_layers,
                     n_heads, dim_feedforward, dropout)

trainer = Trainer(model, train_dataset, test_dataset)

In [20]:
trainer.train(3)

Train epoch 1 | Step 1000/9077 | Counted loss 2521.3678319454193 | ppl 1.0401827016358443 | time 116.86548614501953 | 
Train epoch 1 | Step 2000/9077 | Counted loss 2079.3366354703903 | ppl 1.0330231857300487 | time 116.12752366065979 | 
Train epoch 1 | Step 3000/9077 | Counted loss 1908.1332131624222 | ppl 1.0302634862862632 | time 115.90733766555786 | 
Train epoch 1 | Step 4000/9077 | Counted loss 1820.5875327587128 | ppl 1.0288551510204347 | time 116.19647192955017 | 
Train epoch 1 | Step 5000/9077 | Counted loss 1768.3077867031097 | ppl 1.0280150522242202 | time 116.0899178981781 | 
Train epoch 1 | Step 6000/9077 | Counted loss 1735.173864722252 | ppl 1.0274829685560083 | time 116.14609956741333 | 
Train epoch 1 | Step 7000/9077 | Counted loss 1704.6776942014694 | ppl 1.0269934868128179 | time 116.52597856521606 | 
Train epoch 1 | Step 8000/9077 | Counted loss 1683.4128292798996 | ppl 1.026652311030895 | time 116.49167084693909 | 
Train epoch 1 | Step 9000/9077 | Counted loss 1667.

In [21]:
torch.save(model.state_dict(), './model')

## Task 9 (1 point)
Let's try to generate text with our model. Finish the text generation function. Try to generate some text. Remember that if you want to generate text from scratch, you must pass only the start token as text.
Stop generating text if the model gives you an end token or if the text length is greater than 150.

In [24]:
model.eval()
''

''

In [25]:
def generate_text(text):
    x = []
    
    for letter in text:
        x.append(preproc.token2ind[letter])
    x = torch.Tensor([x]).int().to('cuda:0')
    win_size = len(x[0])
    
    mask = model.generate_square_subsequent_mask(win_size).to('cuda:0')
    pred = model(x, mask)
    ind = torch.argmax(pred[0][-1])
    
    text += preproc.ind2token[ind.item()]
    
    if preproc.ind2token[ind.item()] == ']' or len(text) >= 150:
        return text
    else:
        return generate_text(text)

In [187]:
generate_text('[')

'[а вот на подавал на сайт на сайт на сайте в подождите пожалуйста на сайта на сайта на сайте в подолжение по по поводу вам подально в подолного стовой'

In [190]:
generate_text('[комп')

'[компания ашманов и партнёры секретарь андрей день меня зовут андрей день меня зовут информацию и по слушаю вас]'

Well, we got some results. It's not bad but not good either. The first problem is that we choose always the maximal logit (so we generate same words (letters)). Secondly, if we do sampling and randomly choose word considering probabilities, we'll get new problem: this can sometimes generate nonsense words due to the fact that softmax probabilities of these words are never exactly zero. This issue can be somewhat mitigated with sampling temperature, but low temperature harms sampling diversity. Can we remove the nonsense words without sacrificing diversity? Yes, we can! But it takes a different sampling strategy

__Top-k sampling:__ on each step, sample the next token from __k most likely__ candidates from the language model.

Suppose $k=3$ and the token probabilities are $p=[0.1, 0.35, 0.05, 0.2, 0.3]$. You first need to select $k$ most likely words and set the probability of the rest to zero: $\hat p=[0.0, 0.35, 0.0, 0.2, 0.3]$ and re-normalize: 
$p^*\approx[0.0, 0.412, 0.0, 0.235, 0.353]$.


__Nucleus sampling:__ similar to top-k sampling, but this time we select $k$ dynamically. In nucleous sampling, we sample from top-__N%__ fraction of the probability mass.

Using the same  $p=[0.1, 0.35, 0.05, 0.2, 0.3]$ and nucleous N=0.9, the nucleous words consist of:
1. most likely token $w_2$, because $p(w_2) < N$
2. second most likely token $w_5$, $p(w_2) + p(w_5) = 0.65 < N$
3. third most likely token $w_4$ because $p(w_2) + p(w_5) + p(w_4) = 0.85 < N$

And thats it, because the next most likely word would overflow: $p(w_2) + p(w_5) + p(w_4) + p(w_1) = 0.95 > N$.

After you've selected the nucleous words, you need to re-normalize them as in top-k sampling and generate the next token.

Let's implement __Nucleus sampling__

In [93]:
NUCLEUS_CONST = 0.8
def generate_text_nucleus(text):
    x = []
    
    for letter in text:
        x.append(preproc.token2ind[letter])
        
    x = torch.Tensor([x]).int().to('cuda:0')
    win_size = len(x[0])
    
    pred = model(x, model.generate_square_subsequent_mask(win_size).to('cuda:0'))
    
    probs = np.array(nn.Softmax()(pred[0][-1]).detach().cpu().numpy())
    tokens = np.array(range(len(probs)))

    sorted_idxes = np.argsort(probs)[::-1]
    probs = probs[sorted_idxes]
    tokens = tokens[sorted_idxes]
    mask_of_taken = np.cumsum(probs) < NUCLEUS_CONST
    
    new_probs = probs[mask_of_taken]
    new_tokens = tokens[mask_of_taken]

    if new_probs.sum():
        prob_factor = 1.0 / new_probs.sum()
        new_probs *= prob_factor
        next_token = np.random.choice(new_tokens, p=new_probs)
    else:
        next_token = np.random.choice(tokens, p=probs)
    text += preproc.ind2token[int(next_token)]
    
    if preproc.ind2token[next_token.item()] == ']' or len(text) >= 150:
        return text
    else:
        return generate_text_nucleus(text)

In [100]:
generate_text_nucleus('[')

  del sys.path[0]


'[так и уже как он выбираете подскажите пожалуйста не могу]'

In [122]:
generate_text_nucleus('[хорошая модель должна ')

  del sys.path[0]


'[хорошая модель должна завтра да в да сто сорок ноль вас спасибо за за ожидание доброго до свидания]'

In [178]:
generate_text_nucleus('[номер телефона ')

  del sys.path[0]


'[номер телефона сто сто тридцать двадцать семь ноль двадцать пять двадцать два девять]'

As you can see, now we do not get nonsense words