## Generative networks

Recurrent Neural Networks (RNNs) and their gated cell variants such as Long Short-Term Memory cells (LSTMs) and Gated Recurrent Units (GRUs) provide a mechanism for language modeling, i.e. they can learn word ordering and predict the next word in a sequence. This allows us to use RNNs for **generative tasks**, such as ordinary text generation, machine translation, and even image captioning.

In the RNN architecture, each RNN unit produces the next hidden state as its output. However, we can also add another output to each recurrent unit, which allows us to output a **sequence** (equal in length to the original sequence). Moreover, we can use RNN units that do not accept an input at each step, but just take some initial state vector and then produce a sequence of outputs.

This allows for different neural architectures that are shown in the picture below:

<figure><img src="https://hostux.social/system/media_attachments/files/110/768/729/925/794/877/original/fa0c21dbd618dfde.jpg" alt="" width="1000"><figcaption><p>Source: The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy</p></figcaption></figure>

* **One-to-one** is a traditional neural network with one input and one output.
* **One-to-many** is a generative architecture that accepts one input value and generates a sequence of output values. For example, if we want to train an `image captioning` network that produces a textual description of a picture, we can take a picture as input, pass it through a CNN to obtain a hidden state, and then have a recurrent chain generate the caption word by word.
* **Many-to-one** corresponds to the RNN architectures we described in `Capture patterns with recurrent neural networks`, such as text classification.
* **Many-to-many**, or **sequence-to-sequence**, corresponds to tasks such as `machine translation`, where a first RNN collects all information from the input sequence into the hidden state, and another RNN chain unrolls this state into the output sequence.

Here we will focus on simple generative models that help us generate text. For simplicity, let's build a **character-level network**, which generates text letter by letter. During training, we need to take some text corpus and split it into letter sequences.

In [None]:
# https://github.com/pytorch/data/issues/1093
pip install portalocker

In [None]:
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

## Building character vocabulary

To build a character-level generative network, we need to split text into individual characters instead of words. This can be done by defining a different tokenizer:

In [None]:
import torch
import torchtext
import numpy as np
import collections

In [None]:
# Loading dataset
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
train_dataset, test_dataset = list(train_dataset), list(test_dataset)
classes = ['World','Sports','Business','Sci/Tech']

tokenizer = torchtext.data.utils.get_tokenizer('basic_english')


In [None]:
def char_tokenizer(words):
    return list(words)

counter = collections.Counter()
for (label, line) in train_dataset:
    counter.update(char_tokenizer(line))
vocab = torchtext.vocab.vocab(counter)

vocab_size = len(vocab)
print(vocab_size)
print(vocab.get_stoi()['a'])
print(vocab.get_itos()[13])
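
To make the mapping concrete, here is a minimal pure-Python sketch of what a character vocabulary does, without `torchtext`. The helper `build_char_vocab`, the names `stoi` and `itos`, and the two sample strings are illustrative assumptions, not part of the notebook. The sketch assigns indices in order of first appearance, which may not match the ordering `torchtext` produces.

```python
import collections

def build_char_vocab(lines):
    """Map every distinct character to an integer index (hypothetical helper)."""
    counter = collections.Counter()
    for line in lines:
        counter.update(list(line))  # count characters, not words
    itos = list(counter)  # index -> character, in first-seen order
    stoi = {ch: i for i, ch in enumerate(itos)}  # character -> index
    return stoi, itos

# Made-up sample texts standing in for the news dataset
stoi, itos = build_char_vocab(["hello world", "hold on"])
encoded = [stoi[ch] for ch in "hello"]
decoded = "".join(itos[i] for i in encoded)
print(len(itos), encoded, decoded)
```

Encoding followed by decoding returns the original string, which is the property the generator relies on when it turns predicted indices back into text.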

Let's see an example of how we can encode text from our dataset:

In [None]:
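def encode(x, voc=None, tokenizer=tokenizer):
    # Fall back to the global vocabulary when none is given
    v = vocab if not voc else voc
    stoi = v.get_stoi()  # build the lookup table once per call, not once per token
    return [stoi[s] for s in tokenizer(x)]


def enc(x):
    # Encode a string (or list of characters) into a tensor of character indices
    return torch.LongTensor(encode(x, voc=vocab, tokenizer=char_tokenizer))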

print(train_dataset[0][1])
print(enc(train_dataset[0][1]))

## Training a generative RNN

The way we will train the RNN to generate text is the following. On each step, we take a sequence of characters of length `nchars` and ask the network to generate the next output character for each input character:

<figure><img src="https://hostux.social/system/media_attachments/files/110/768/908/937/200/134/original/d8b268cc82ca6080.png" alt="" width="1000"><figcaption><p>Source: MicrosoftLearning</p></figcaption></figure>

Depending on the actual scenario, we may also want to include some special characters, such as end-of-sequence `<eos>`. In our case, we just want to train the network for endless text generation, so we fix the size of each sequence at `nchars` tokens. Consequently, each training example consists of `nchars` inputs and `nchars` outputs (the input sequence shifted one symbol to the left). A minibatch consists of several such sequences.

The way we generate minibatches is to take each news text of length `l` and generate all possible input-output combinations from it (there will be `l-nchars` such combinations). They form one minibatch, so the size of the minibatch differs at each training step.
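
The windowing scheme can be illustrated on a short string before applying it to tensors. This is a pure-Python sketch; the helper name `make_pairs` and the sample text are assumptions chosen for illustration, and a small `nchars` is used so the output stays readable.

```python
def make_pairs(text, nchars):
    """Return (input, target) windows; each target is its input shifted by one character."""
    pairs = []
    for i in range(len(text) - nchars):
        pairs.append((text[i:i + nchars], text[i + 1:i + nchars + 1]))
    return pairs

pairs = make_pairs("hello world", nchars=4)
for inp, tgt in pairs[:3]:
    print(repr(inp), "->", repr(tgt))
print(len(pairs))  # one pair per start position: len(text) - nchars
```

Each target window overlaps its input in all but one character, so the network is asked to predict exactly one new character at every position.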

In [None]:
# Check the platform: Kaggle (CUDA), Apple Silicon (MPS), or plain CPU
import os, platform

torch_device="cpu"

if 'kaggle' in os.environ.get('KAGGLE_URL_BASE','localhost'):
    torch_device = 'cuda'
else:
    torch_device = 'mps' if platform.system() == 'Darwin' else 'cpu'

torch_device

In [None]:
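nchars = 100

def get_batch(s, nchars=nchars):
    # One row per window: inputs and targets both have shape (len(s)-nchars, nchars)
    ins = torch.zeros(len(s)-nchars, nchars, dtype=torch.long, device=torch_device)
    outs = torch.zeros(len(s)-nchars, nchars, dtype=torch.long, device=torch_device)

    for i in range(len(s)-nchars):
        ins[i] = enc(s[i:i+nchars])
        # The target is the same window shifted one character to the left
        outs[i] = enc(s[i+1:i+nchars+1])
    return ins, outs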

get_batch(train_dataset[0][1])

Now, let's define the generator network. It can be based on any recurrent cell we discussed in the previous notebooks (simple, LSTM, or GRU). In our example we will use LSTM.

Because the network takes characters as input and the vocabulary size is pretty small, we do not need an embedding layer: one-hot-encoded input can go directly to the LSTM cell. However, because we pass character numbers as input, we need to one-hot-encode them before passing them to the LSTM. This is done by calling the `one_hot` function during the `forward` pass. The output layer is a linear layer that converts the hidden state into a vector of `vocab_size` scores, one per character.

>Note: One-hot encoding represents each character as a binary vector in which only the index corresponding to the character's value is set to 1 and all other indices are set to 0. This encoding allows the LSTM to process the characters as input and learn patterns from them.
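
As a small illustration of the note above, the following pure-Python sketch builds one-hot vectors by hand. The helper `one_hot` here is a stand-in written for this example (it is not the PyTorch function), and the vocabulary size of 5 and the index sequence are made up.

```python
def one_hot(index, vocab_size):
    """Return a list of length vocab_size with a single 1.0 at position index."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

# Toy sequence of character indices over a 5-character vocabulary
sequence = [2, 0, 4]
encoded = [one_hot(i, 5) for i in sequence]
for row in encoded:
    print(row)
```

A sequence of `n` indices becomes an `n` by `vocab_size` matrix, which is why the LSTM input size in the generator equals the vocabulary size.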

In [None]:
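class LSTMGenerator(torch.nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        # Input is a one-hot vector, so the LSTM input size equals vocab_size
        self.rnn = torch.nn.LSTM(vocab_size, hidden_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, s=None):
        # x holds character indices; convert them to one-hot float vectors
        x = torch.nn.functional.one_hot(x, num_classes=vocab_size).to(torch.float32)
        x, s = self.rnn(x, s)
        # Return per-position scores over the vocabulary and the new state
        return self.fc(x), s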

During training, we want to be able to sample generated text. To do that, we will define a `generate` function that will produce an output string of length `size`, starting from the initial string `start`.

The way it works is the following:
* First, we pass the whole start string through the network and take the output state `s` and the prediction for the next character `out`.
* Since `out` contains a score for every character in the vocabulary, we take `argmax` to get the index `nc` of the most likely character, use `get_itos` to look up the actual character, and append it to the resulting list `chars`.
* This process of generating one character is repeated `size` times to generate the required number of characters.

In [None]:
def generate(net, size=100, start='today '):
    chars = list(start)
    out, s = net(enc(chars).view(1,-1).to(torch_device))
    for i in range(size):
        nc = torch.argmax(out[0,-1])
        chars.append(vocab.get_itos()[nc])
        out, s = net(nc.view(1,-1),s)
    return ''.join(chars)

Now, let's do the training! The training loop is almost the same as in all our previous examples, but instead of accuracy we print sampled generated text every 1000 samples.

Special attention needs to be paid to the way we compute the loss. We need to compute it from the network output `out`, which holds an unnormalized score for each character in the vocabulary, and the expected text `text_out`, which is a list of character indices. Luckily, the `cross_entropy` function expects unnormalized network output as its first argument and class indices as its second, which is exactly what we have. It also averages automatically over the minibatch.
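
To show what the loss measures, here is a pure-Python sketch of cross-entropy computed from unnormalized scores and class indices. It is an illustrative re-implementation under the assumption of mean reduction, not the PyTorch source, and the logits and targets are made-up numbers.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target classes under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]  # equals -log softmax(row)[target]
    return total / len(targets)

# Two positions, three candidate characters each (made-up scores)
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(cross_entropy(logits, targets))
```

The loss is small when the score of the correct character dominates its row, and equals the logarithm of the number of classes when all scores in a row are equal.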

We also limit the training by `samples_to_train` samples, in order not to wait for too long. We encourage you to experiment and try longer training, possibly for several epochs (in which case you would need to create another loop around this code).

In [None]:
net = LSTMGenerator(vocab_size, 64).to(torch_device)

samples_to_train = 10000
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()
net.train()
for i, x in enumerate(train_dataset):
    # x[0] is class label, x[1] is text
    if len(x[1])-nchars<10:
        continue
    samples_to_train-=1
    if not samples_to_train:
        break
    text_in, text_out = get_batch(x[1])
    optimizer.zero_grad()
    out,s= net(text_in)
    loss = torch.nn.functional.cross_entropy(out.view(-1,vocab_size), text_out.flatten()) #cross_entropy(out, labels)
    loss.backward()
    optimizer.step()
    if i%1000==0:
        print(f"Current loss = {loss.item()}")
        print(generate(net))

This example already generates some pretty good text, but it can be further improved in several ways:
* **Better minibatch generation** The way we prepared data for training was to generate one minibatch from one sample. This is not ideal, because minibatches are all of different sizes, and some of them cannot be generated at all because the text is shorter than `nchars`. Also, small minibatches do not load the GPU sufficiently. It would be wiser to get one large chunk of text from all samples, then generate all input-output pairs, shuffle them, and generate minibatches of equal size.
* **Multilayer LSTM** It makes sense to try 2 or 3 layers of LSTM cells. As we mentioned in the previous notebook, each layer of LSTM extracts certain patterns from text, and in the case of a character-level generator we can expect the lower LSTM level to be responsible for extracting syllables, and higher levels for words and word combinations. This can be implemented simply by passing the `num_layers` parameter to the LSTM constructor.
* You may also want to experiment with **GRU units** and see which ones perform better, and with **different hidden layer sizes**. Too large a hidden layer may result in overfitting (e.g. the network will learn the exact text), and a smaller size might not produce good results.
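
The first suggestion in the list above can be sketched in pure Python. The helper `make_minibatches`, the choice to join samples with a space, and the decision to drop the incomplete last batch are all assumptions made for this illustration; a real pipeline would also encode the characters into tensors.

```python
import random

def make_minibatches(texts, nchars, batch_size, seed=0):
    """Pool all texts, build shifted (input, target) pairs, shuffle, and cut equal batches."""
    corpus = " ".join(texts)  # assumption: samples are joined with a single space
    pairs = [(corpus[i:i + nchars], corpus[i + 1:i + nchars + 1])
             for i in range(len(corpus) - nchars)]
    random.Random(seed).shuffle(pairs)
    # Drop the incomplete tail so every batch has exactly batch_size pairs
    n_full = len(pairs) // batch_size
    return [pairs[b * batch_size:(b + 1) * batch_size] for b in range(n_full)]

# Made-up samples, including one shorter than nchars
batches = make_minibatches(["hello world", "hi", "good morning"], nchars=5, batch_size=4)
print(len(batches), [len(b) for b in batches])
```

Because the texts are pooled first, even a sample shorter than `nchars` contributes characters to some windows, and every batch has the same size.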

## Soft text generation and temperature

In the previous definition of `generate`, we always took the character with the highest probability as the next character in the generated text. As a result, the text often "cycled" through the same character sequences again and again, like in this example:

`today of the second the company and a second the company...`

However, if we look at the probability distribution for the next character, the difference between the few highest probabilities may not be large, e.g. one character can have probability 0.2, another 0.19, etc. For example, when looking for the next character in the sequence `play`, the next character can equally well be either a space or **e** (as in the word player).

This leads us to the conclusion that it is not always "fair" to select the character with the highest probability, because choosing the second highest might still lead to meaningful text. It is wiser to **sample** characters from the probability distribution given by the network output.

This sampling can be done using the `multinomial` function, which draws samples from a so-called **multinomial distribution**. A function that implements this **soft** text generation is defined below:

In [None]:
def generate_soft(net, size=100, start='today', temperature=1.0):
    chars=list(start)
    out, s =net(enc(chars).view(1,-1).to(torch_device))
    for i in range(size):
        # Greedy alternative: nc = torch.argmax(out[0,-1])
        out_dist = out[0,-1].div(temperature).exp()
        nc = torch.multinomial(out_dist, 1)[0]
        chars.append(vocab.get_itos()[nc])
        out, s = net(nc.view(1,-1),s)
    return ''.join(chars)

for i in [0.3,0.8,1.0,1.3,1.8]:
    print(f"--- Temperature = {i}\n{generate_soft(net,size=300,start='Today ', temperature=i)}\n")

## Summary

We have introduced one more parameter called **temperature**, which indicates how closely we should stick to the highest probability. If the temperature is 1.0, we do fair multinomial sampling, and when the temperature goes to infinity, all probabilities become equal and we select the next character at random. In the example we can observe that the text becomes meaningless when we increase the temperature too much, and it resembles "cycled" hard-generated text when the temperature gets closer to 0.
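
The effect of temperature can be checked numerically on a tiny example. The sketch below is pure Python; `softmax_with_temperature` is a helper written for this illustration, and the three scores are made-up values for three candidate characters.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.9, 0.5]  # made-up scores for three candidate characters
for t in (0.3, 1.0, 1.8):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Sampling instead of argmax: draw an index according to the probabilities
rng = random.Random(0)
weights = softmax_with_temperature(logits, 1.0)
sample = rng.choices(range(len(logits)), weights=weights, k=1)[0]
print(sample)
```

At low temperature the probability mass concentrates on the top score, which approaches greedy selection, while at high temperature the three probabilities move toward each other.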

## Check the knowledge

A recurrent neural network is called recurrent because:
* [ ] A network is applied for each input element, and output from the previous application is passed to the next one. (Correct)
* [ ] It is trained by a recurrent process
* [ ] It consists of layers which include other subnetworks