## Lab 01. Poetry generation

Let's try to generate some poetry using RNNs. 

You have several choices here: 

* The Shakespeare sonnets, file `sonnets.txt` available in the notebook directory.

* The novel in verse "Eugene Onegin" by Alexander Sergeyevich Pushkin. A preprocessed version is available at this [link](https://github.com/attatrol/data_sources/blob/master/onegin.txt).

* Some other text source, if approved by the course staff.

Text generation can be broken down into several steps:
    
1. Data loading.
2. Dictionary generation.
3. Data preprocessing.
4. Model (neural network) training.
5. Text generation (model evaluation).


In [1]:
import string
import os

### Data loading: Shakespeare

Shakespeare's sonnets are available at this [link](http://www.gutenberg.org/ebooks/1041?msg=welcome_stranger). In addition, they are stored in the same directory as this notebook (`sonnets.txt`). Simple preprocessing is already done for you in the next cell: all technical info is dropped.

In [2]:
# if not os.path.exists('sonnets.txt'):
#     !wget https://raw.githubusercontent.com/girafe-ai/ml-mipt/master/homeworks_basic/Lab2_DL/sonnets.txt

# with open('sonnets.txt', 'r') as iofile:
#     text = iofile.readlines()
    
# TEXT_START = 45
# TEXT_END = -368
# text = text[TEXT_START : TEXT_END]
# assert len(text) == 2616

Unlike the in-class practice, this time we want to predict more complex text. To reduce the complexity of the task, let's lowercase all the symbols.

The variable `text` is now a list of strings. Join all the strings into one and lowercase it.

In [3]:
# Join all the strings into one and lowercase it
# Put result into variable text.

# Your great code here

# assert len(text) == 100225, 'Are you sure you have concatenated all the strings?'
# assert not any([x in set(text) for x in string.ascii_uppercase]), 'Uppercase letters are present'
# print('OK!')
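
The join-and-lowercase step can be sketched on a made-up list of lines. The name `toy_lines` and its contents are illustrative stand-ins, not the real contents of `sonnets.txt`:

```python
# Toy stand-in for the list returned by readlines(); not the real sonnets file.
toy_lines = [
    "From fairest creatures we desire increase,\n",
    "That thereby beauty's rose might never die,\n",
]

# Concatenate the lines into one string, then lowercase every character.
toy_text = "".join(toy_lines).lower()

print(toy_text[:20])
```

Joining with an empty separator keeps the newline characters that `readlines()` leaves at the end of each line, so line breaks stay part of the character vocabulary.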

### Data loading: "Eugene Onegin"


In [4]:
!wget https://raw.githubusercontent.com/attatrol/data_sources/master/onegin.txt
    
with open('onegin.txt', 'r') as iofile:
    text = iofile.readlines()
    
text = [x.replace('\t\t', '') for x in text]

--2021-04-21 19:08:30--  https://raw.githubusercontent.com/attatrol/data_sources/master/onegin.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 262521 (256K) [text/plain]
Saving to: ‘onegin.txt.1’


2021-04-21 19:08:31 (1,65 MB/s) - ‘onegin.txt.1’ saved [262521/262521]



Unlike the in-class practice, this time we want to predict more complex text. To reduce the complexity of the task, let's lowercase all the symbols.

The variable `text` is now a list of strings. Join all the strings into one and lowercase it.

In [5]:
import re 

def has_cyrillic(text):
    return bool(re.search('[а-яА-Я]', text))

In [6]:
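# Join all the strings into one and lowercase it.
# The result is stored in the variable `out`.

def checkIfRomanNumeral(numeral):
    # True if the line consists only of Roman-numeral characters (e.g. stanza numbers).
    numeral = {c for c in numeral.upper()}
    validRomanNumerals = {c for c in "MDCLXVI()"}
    return not numeral - validRomanNumerals

# Drop empty lines and strip surrounding whitespace.
text = [x for x in text if x!='\n']
text = [x.strip() for x in text]

# Drop Roman-numeral lines and lowercase the rest, then keep only lines with Cyrillic letters.
text = [x.lower() for x in text if not checkIfRomanNumeral(x)]
text = [x for x in text if has_cyrillic(x)]
out = ' '.join(text)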


Put all the characters that appear in the text into the variable `tokens`.

In [7]:
tokens = sorted(set(out))

Create dictionary `token_to_idx = {<char>: <index>}` and dictionary `idx_to_token = {<index>: <char>}`

In [8]:
# dict <char>:<index>
token_to_idx = {token: idx for idx, token in enumerate(tokens)}

# dict <index>:<char>
idx_to_token = {idx: token for token, idx in token_to_idx.items()}
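
As a quick sanity check, the two dictionaries should be inverses of each other. Here is a small sketch on a made-up string; all `toy_*` names and the sample text are illustrative assumptions:

```python
# Build a toy vocabulary the same way as above; the sample string is made up.
sample = "abba cab"
toy_tokens = sorted(set(sample))
toy_token_to_idx = {ch: i for i, ch in enumerate(toy_tokens)}
toy_idx_to_token = {i: ch for ch, i in toy_token_to_idx.items()}

# Encode the string as a list of indices, then decode it back.
encoded = [toy_token_to_idx[ch] for ch in sample]
decoded = "".join(toy_idx_to_token[i] for i in encoded)

print(encoded)
print(decoded)
```

If decoding does not reproduce the original string, the two mappings are out of sync.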

*Comment: in this task we have only 38 different tokens, so let's use one-hot encoding.*
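
For illustration, one-hot encoding turns each token index into a vector with a single non-zero entry. Below is a minimal pure-Python sketch; the helper `one_hot` and the three-letter vocabulary are hypothetical and not part of the lab code:

```python
def one_hot(index, num_tokens):
    """Return a list of length num_tokens with a single 1.0 at position index."""
    vec = [0.0] * num_tokens
    vec[index] = 1.0
    return vec

# Made-up three-character vocabulary for the example.
toy_vocab = ["a", "b", "c"]
toy_index = {ch: i for i, ch in enumerate(toy_vocab)}

# Encode the string "cab" as a sequence of one-hot vectors.
encoded = [one_hot(toy_index[ch], len(toy_vocab)) for ch in "cab"]
print(encoded)
```

An `nn.Embedding` layer, as used in the model in this notebook, is equivalent to multiplying such a one-hot vector by a learned weight matrix, so the explicit vectors never have to be built.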

### Building the model

Now we want to build and train a recurrent neural net that can generate something similar to Shakespeare's poetry.

Let's use a vanilla RNN, similar to the one created during the lesson.
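
For reference, the recurrence that PyTorch's `nn.RNN` applies at each time step with its default `tanh` nonlinearity is shown below. The symbols follow the PyTorch documentation rather than any notation defined in this notebook:

$$h_t = \tanh\left(W_{ih}\, x_t + b_{ih} + W_{hh}\, h_{t-1} + b_{hh}\right)$$

Here $x_t$ is the embedded character at step $t$ and $h_{t-1}$ is the previous hidden state. A linear layer then maps each $h_t$ to scores over the tokens.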

In [9]:
import os
from IPython.display import clear_output
from random import sample

import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

In [10]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [11]:
class MyDataset(Dataset):
    """Serves random fixed-length windows of the encoded text."""

    def __init__(self, string, token_to_idx, size=50):
        self.string = string
        self.token_to_idx = token_to_idx
        self.size = size

    def __len__(self):
        return len(self.string)

    def __getitem__(self, idx):
        # `idx` is ignored: every call draws a random window of `size` characters.
        random_index = np.random.randint(0, len(self.string) - self.size)
        buf = [self.token_to_idx[x] for x in self.string[random_index:random_index + self.size]]
        # Input is the window without its last character; the target is shifted by one.
        x = torch.tensor(buf[:-1])
        y = torch.tensor(buf[1:])
        return x, y


In [12]:
dataset = MyDataset(out,token_to_idx)
dataloader = DataLoader(dataset, batch_size=128,
                      shuffle=True, num_workers=2)

In [13]:
class MyModel(nn.Module):
    def __init__(self, num_tokens=len(idx_to_token), emb_size=16, rnn_num_units=64):
        super().__init__()
        self.emb = nn.Embedding(num_tokens, emb_size)
        self.rnn = nn.RNN(emb_size, rnn_num_units)
        self.hid_to_logits = nn.Linear(rnn_num_units, num_tokens)

    def forward(self, x,hidden=None):
        
        h_seq, hid = self.rnn(self.emb(x),hidden)
        next_logits = self.hid_to_logits(h_seq)
        next_logp = F.log_softmax(next_logits, dim=-1)
        return next_logp,hid

In [18]:
model = MyModel()
opt = torch.optim.Adam(model.parameters())
criterion = nn.NLLLoss()
history = []
EPOCHS=1000

In [19]:
writer = SummaryWriter()
for x,y in dataloader:
  x = x.permute(1,0)
  x=x.to(device)
  y=y.to(device)

  
  writer.add_graph(model.to(device), x)
  break

In [20]:
def generate_sample(char_rnn, seed_phrase=' привет', max_length=32, temperature=1.0):
    '''
    ### Disclaimer: this is an example function for text generation.
    ### You can either adapt it in your code or create your own function
    
    The function generates text given a phrase of length at least SEQ_LENGTH.
    :param seed_phrase: prefix characters. The RNN is asked to continue the phrase
    :param max_length: maximum output length, including seed_phrase
    :param temperature: coefficient for sampling. Higher temperature produces more chaotic outputs,
        lower temperature converges to the single most likely output.
        
    Be careful with the model output. The sampling step applies a softmax, so the model should
    return logits or log-probabilities (not probabilities) of the next symbol.
    '''
    char_rnn.eval()
    x_sequence = [[token_to_idx[token]] for token in seed_phrase]
    x_sequence = torch.tensor(x_sequence, dtype=torch.int64).to(device)
#     hid_state = char_rnn.initial_state(batch_size=1)
    
    # feed the seed phrase, if any
    hidden = None
    for i in range(len(seed_phrase) - 1):
        out, hidden = char_rnn(x_sequence[i].view(1, 1), hidden)

    # start generating
    for _ in range(max_length - len(seed_phrase)):
        # feed the most recent token; the hidden state carries the earlier context
        out, hidden = char_rnn(x_sequence[-1].view(1, 1).to(device), hidden)

        # Be really careful here with the model output
        p_next = F.softmax(out.cpu() / temperature, dim=-1).data.numpy()[0][0]

        # sample next token and push it back into x_sequence
        next_ix = np.random.choice(len(tokens), p=p_next)
        next_ix = torch.tensor([[next_ix]], dtype=torch.int64)
        x_sequence = torch.cat([x_sequence.cpu(), next_ix], dim=0)

    return ''.join([tokens[ix] for ix in x_sequence.data.numpy().flatten()])

In [21]:
model.to(device)
model.train()
for i in tqdm(range(EPOCHS)):
  loss_sum=0
  for x,y in dataloader:
      
      x = x.permute(1,0)
      x=x.to(device)
      y=y.to(device)

      answer,hidden = model(x,None)
      answer=answer.permute(1,2,0)
     
      loss = criterion(answer,y)

      loss.backward()

      opt.step()
      opt.zero_grad()
      loss_sum+=loss.cpu().data.numpy()
  mean_loss = loss_sum/len(dataloader)
  history.append(mean_loss)
  
  writer.add_scalar("train loss", mean_loss, i)
  writer.add_text( 'Generated sample',"This is text {}".format(generate_sample(model,max_length=64, temperature=0.5,seed_phrase=' давид')), global_step=i)
  model.train()
    

# assert np.mean(history[:10]) > np.mean(history[-10:]), "RNN didn't converge."

 15%|█▌        | 150/1000 [12:10<1:08:59,  4.87s/it]


KeyboardInterrupt: 

Plot the loss function (axis X: number of epochs, axis Y: loss function).

In [None]:
# Your plot code here
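
One possible way to draw the curve is sketched below. The helper name `plot_history` is hypothetical, and the numbers passed to it are placeholders for illustration only; in the notebook you would pass the `history` list filled by the training loop:

```python
import matplotlib.pyplot as plt

def plot_history(history):
    """Plot mean training loss per epoch and return the axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    # Epochs are numbered from 1 on the X axis.
    ax.plot(range(1, len(history) + 1), history)
    ax.set_xlabel("epoch")
    ax.set_ylabel("mean training loss")
    ax.set_title("Training loss")
    return ax

# Placeholder values, not real training results.
ax = plot_history([3.2, 2.9, 2.7, 2.6])
```

Returning the axes makes it easy to overlay a second curve later, for example to compare the RNN and LSTM runs on one plot.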

In [24]:
# An example of generated text.
a = generate_sample(model,max_length=64, temperature=0.9,seed_phrase=' крестьянин')

In [25]:
a

' крестьяниннуебта  ыиьистнулисlш  киытись.е«купнс ь  нвосната сс'

### More poetic model

Let's use an LSTM instead of a vanilla RNN and compare the results.

Plot the loss as a function of the number of epochs. Is the final loss lower?

In [26]:
# Your beautiful code here

In [31]:
class MyModel(nn.Module):
    def __init__(self, num_tokens=len(idx_to_token), emb_size=16, rnn_num_units=64):
        super().__init__()
        self.emb = nn.Embedding(num_tokens, emb_size)
        self.rnn = nn.LSTM(emb_size, rnn_num_units)
        self.hid_to_logits = nn.Linear(rnn_num_units, num_tokens)

    def forward(self, x,hidden=None):
        
        h_seq, hid = self.rnn(self.emb(x),hidden)
        next_logits = self.hid_to_logits(h_seq)
        next_logp = F.log_softmax(next_logits, dim=-1)
        return next_logp,hid

In [32]:
model = MyModel()
opt = torch.optim.Adam(model.parameters())
criterion = nn.NLLLoss()
history = []
EPOCHS=1000

In [33]:
writer = SummaryWriter()

In [None]:
model.to(device)
model.train()
for i in tqdm(range(EPOCHS)):
  loss_sum=0
  for x,y in dataloader:
      
      x = x.permute(1,0)
      x=x.to(device)
      y=y.to(device)

      answer,hidden = model(x,None)
      answer=answer.permute(1,2,0)
     
      loss = criterion(answer,y)

      loss.backward()

      opt.step()
      opt.zero_grad()
      loss_sum+=loss.cpu().data.numpy()
  mean_loss = loss_sum/len(dataloader)
  history.append(mean_loss)
  
  writer.add_scalar("train loss", mean_loss, i)
  writer.add_text( 'Generated sample',"This is text {}".format(generate_sample(model,max_length=64, temperature=0.5,seed_phrase=' давид')), global_step=i)
  model.train()
    

# assert np.mean(history[:10]) > np.mean(history[-10:]), "RNN didn't converge."

 10%|▉         | 99/1000 [09:51<1:32:57,  6.19s/it]

Generate text using the trained net with different values of the `temperature` parameter: `[0.1, 0.2, 0.5, 1.0, 2.0]`.

Evaluate the results visually and try to interpret them.

In [None]:
# Text generation with different temperature values here

### Saving and loading models

Save the model to disk, then load it and generate text. Examples are available [here](https://pytorch.org/tutorials/beginner/saving_loading_models.html).

In [None]:
# Saving and loading code here
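
A minimal sketch of the state-dict approach from the PyTorch tutorial is shown below. It uses a small stand-in network and a temporary file path, both of which are assumptions made for this example; in the notebook you would save the trained `model` and rebuild it with `MyModel()` before loading:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small stand-in network; replace with the trained model in the notebook.
toy_model = nn.Linear(4, 2)

# Hypothetical file location chosen for this example.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pt")

# Save only the parameters (the state dict), not the whole Python object.
torch.save(toy_model.state_dict(), path)

# Rebuild the same architecture, then load the saved parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(path))
restored.eval()  # switch to inference mode before generating text
```

Saving the state dict rather than the whole module keeps the checkpoint independent of how the class is defined in the notebook, at the cost of having to construct the model with the same hyperparameters before loading.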

### References
1. [Andrej Karpathy's blog post about RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). It includes several generation examples: Shakespeare texts, LaTeX formulas, Linux source code, and children's names.
2. [Repo with char-rnn code](https://github.com/karpathy/char-rnn)
3. Cool repo with PyTorch examples: [link](https://github.com/spro/practical-pytorch)