# Generative Networks

Recurrent Neural Networks (RNNs) and their gated cell variants such as Long Short-Term Memory cells (LSTMs) and Gated Recurrent Units (GRUs) provide a mechanism for language modeling: they can learn word ordering and predict the next word in a sequence. This allows us to use RNNs for **generative tasks**, such as ordinary text generation, machine translation, and even image captioning.

In the RNN architecture we discussed in the previous unit, each RNN unit produced the next hidden state as its output. However, we can also add another output to each recurrent unit, which allows us to output a **sequence** (equal in length to the original sequence). Moreover, we can use RNN units that do not accept an input at each step, and instead take some initial state vector and then produce a sequence of outputs.

This allows for different neural architectures that are shown in the picture below:

![RNN patterns](../images/unreasonable-effectiveness-of-rnn.jpg)
*Image from blog post [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by [Andrej Karpathy](http://karpathy.github.io/)*

* **One-to-one** is a traditional neural network with one input and one output
* **One-to-many** is a generative architecture that accepts one input value and generates a sequence of output values. For example, if we want to train an **image captioning** network that produces a textual description of a picture, we can take a picture as input, pass it through a CNN to obtain the hidden state, and then have a recurrent chain generate the caption word by word
* **Many-to-one** corresponds to the RNN architectures we described in the previous unit, such as text classification
* **Many-to-many**, or **sequence-to-sequence**, corresponds to tasks such as **machine translation**, where a first RNN collects all information from the input sequence into the hidden state, and another RNN chain unrolls this state into the output sequence.

In this unit, we will focus on simple generative models that help us generate text. For simplicity, let's build a **character-level network**, which generates text letter by letter. During training, we need to take some text corpus and split it into letter sequences.

In [2]:
import torch
import torchtext
import numpy as np
from torchnlp import *
load_dataset() # we need this to make sure data is fetched

120000lines [00:04, 26121.40lines/s]
120000lines [00:08, 14104.98lines/s]
7600lines [00:00, 14122.40lines/s]


(<torchtext.datasets.text_classification.TextClassificationDataset at 0x7fc53bdd4650>,
 <torchtext.datasets.text_classification.TextClassificationDataset at 0x7fc5a433fa10>,
 ['World', 'Sports', 'Business', 'Sci/Tech'],
 95812)

## TorchText Datasets

In the previous unit, we were using the built-in AG_NEWS dataset. Now that we need more flexibility in loading text, we will use another mechanism for loading custom datasets.

We will still be using AG News, but let's load it from the original data files on disk. From previous units, we should have the source files for the AG News dataset located in the `data/ag_news_csv` directory. The training dataset is contained in the `train.csv` file, which contains records like the following:
```
"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street's..."
"3","Carlyle Looks Toward Commercial Aerospace (Reuters)","Private investment firm Carlyle Group..."
"2","Nets get Carter from Raptors","INDIANAPOLIS -- All-Star Vince Carter was traded..."
```
The file is in the so-called *comma-separated values* (CSV) format, and each line consists of three fields separated by commas:
* Class number (1-4)
* Title
* News text
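
To see what the three fields look like once parsed, here is a minimal sketch that uses only Python's standard `csv` module rather than TorchText. The helper name `parse_records` is hypothetical, and the two records are copied from the sample above (their text is truncated in the source, so the trailing `...` is kept as-is). Because the fields are quoted, commas inside the news text do not break the parsing.

```python
import csv
import io

# Two records copied from the sample above (news text is truncated in the source).
sample = (
    '"3","Wall St. Bears Claw Back Into the Black (Reuters)","Reuters - Short-sellers, Wall Street\'s..."\n'
    '"2","Nets get Carter from Raptors","INDIANAPOLIS -- All-Star Vince Carter was traded..."\n'
)

def parse_records(text):
    """Parse CSV text into (label, title, body) tuples. Hypothetical helper."""
    reader = csv.reader(io.StringIO(text))
    return [(int(label), title, body) for label, title, body in reader]

records = parse_records(sample)
for label, title, body in records:
    # list(body) splits a string into characters, like a character tokenizer
    print(label, title, list(body[:10]))
```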

To work with this file, we will define `Field` objects that specify how fields should be handled. In our case, there will be two objects:
* `TEXT` will define the fields for news title and text. Because we need to split them into individual characters, we will specify a custom tokenizer function `char_tokenizer`
* `LABEL` will be used for the first field with the class number. We indicate that it is not a sequence to be tokenized (`sequential=False`) and that it is already numeric, so it needs no vocabulary (`use_vocab=False`)

After defining the fields, we create a `TabularDataset` object which points to the training CSV file:

In [3]:
def char_tokenizer(words):
    return list(words) #[word for word in words]

TEXT = torchtext.data.Field(sequential=True, tokenize=char_tokenizer) #, lower=True)
LABEL = torchtext.data.Field(sequential=False, use_vocab=False)

train_dataset = torchtext.data.TabularDataset('./data/ag_news_csv/train.csv',
        format='csv',
        fields=[('Label', LABEL), ('Head', TEXT), ('Text', TEXT) ])

Before we can start loading the dataset, we need to build the vocabulary by calling `build_vocab`. After that, we can use vocabulary dictionaries to map between characters and their numerical indices:

In [4]:
TEXT.build_vocab(train_dataset)
vocab_size = len(TEXT.vocab)
print(f"Vocabulary size = {vocab_size}")
print(f"Encoding of 'a' is {TEXT.vocab.stoi['a']}")
print(f"Character with code 13 is {TEXT.vocab.itos[13]}")

Vocabulary size = 84
Encoding of 'a' is 4
Character with code 13 is h
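
The idea behind such a vocabulary can be sketched in plain Python. This is not TorchText's implementation: the function name `build_char_vocab` and the two special tokens placed at the front are assumptions chosen for illustration. The sketch orders characters by frequency and builds the two lookup structures, `stoi` (character to index) and `itos` (index to character).

```python
from collections import Counter

def build_char_vocab(texts, specials=("<unk>", "<pad>")):
    """Build character <-> index mappings from an iterable of strings.

    Hypothetical helper; the special tokens are an illustrative assumption.
    """
    counts = Counter(ch for text in texts for ch in text)
    # Most frequent characters first; ties broken alphabetically for determinism
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    itos = list(specials) + ordered
    stoi = {ch: i for i, ch in enumerate(itos)}
    return stoi, itos

stoi, itos = build_char_vocab(["hello world", "hold on"])
print(f"Vocabulary size = {len(itos)}")
print(f"Encoding of 'l' is {stoi['l']}")
```

With such a mapping, encoding a string is a list comprehension over `stoi`, and decoding is a lookup in `itos`.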


Now we can access our dataset by iterating through `train_dataset.examples`. Each example contains named fields corresponding to the fields we have defined:

In [5]:
print(train_dataset.examples[0].Text)

['R', 'e', 'u', 't', 'e', 'r', 's', ' ', '-', ' ', 'S', 'h', 'o', 'r', 't', '-', 's', 'e', 'l', 'l', 'e', 'r', 's', ',', ' ', 'W', 'a', 'l', 'l', ' ', 'S', 't', 'r', 'e', 'e', 't', "'", 's', ' ', 'd', 'w', 'i', 'n', 'd', 'l', 'i', 'n', 'g', '\\', 'b', 'a', 'n', 'd', ' ', 'o', 'f', ' ', 'u', 'l', 't', 'r', 'a', '-', 'c', 'y', 'n', 'i', 'c', 's', ',', ' ', 'a', 'r', 'e', ' ', 's', 'e', 'e', 'i', 'n', 'g', ' ', 'g', 'r', 'e', 'e', 'n', ' ', 'a', 'g', 'a', 'i', 'n', '.']


To encode the text, we can use the following function:

In [6]:
def encode_text(s):
    return torch.LongTensor([TEXT.vocab.stoi[t] for t in s])

encode_text(train_dataset.examples[0].Text)

tensor([37,  3, 15,  5,  3, 10,  9,  2, 29,  2, 26, 13,  6, 10,  5, 29,  9,  3,
        11, 11,  3, 10,  9, 27,  2, 43,  4, 11, 11,  2, 26,  5, 10,  3,  3,  5,
        58,  9,  2, 12, 21,  7,  8, 12, 11,  7,  8, 18, 61, 22,  4,  8, 12,  2,
         6, 19,  2, 15, 11,  5, 10,  4, 29, 14, 20,  8,  7, 14,  9, 27,  2,  4,
        10,  3,  2,  9,  3,  3,  7,  8, 18,  2, 18, 10,  3,  3,  8,  2,  4, 18,
         4,  7,  8, 23])

## Training Generative RNN

We will train the RNN to generate text as follows. On each step, we take a sequence of characters of length `nchars` and ask the network to predict the next character for each input character:

![RNN Generation](../images/rnn-generate.png)

Depending on the actual scenario, we may also want to include some special characters, such as *end-of-sequence* `<eos>`. In our case, we just want to train the network for endless text generation, so we will fix the size of each sequence to `nchars` tokens. Consequently, each training example will consist of `nchars` inputs and `nchars` outputs (the input sequence shifted one symbol to the left). A minibatch will consist of several such sequences.

To generate minibatches, we take each news text of length `l` and generate all possible input-output combinations from it (there will be `l-nchars` such combinations). They form one minibatch, so the size of the minibatch will be different at each training step.

In [11]:
nchars = 100

def get_batch(s,nchars=nchars):
    ins = torch.zeros(len(s)-nchars,nchars,dtype=torch.long,device=device)
    outs = torch.zeros(len(s)-nchars,nchars,dtype=torch.long,device=device)
    for i in range(len(s)-nchars):
        ins[i] = encode_text(s[i:i+nchars])
        outs[i] = encode_text(s[i+1:i+nchars+1])
    return ins,outs

get_batch(train_dataset.examples[1].Text)

(tensor([[37,  3, 15,  ...,  2,  6, 14],
         [ 3, 15,  5,  ...,  6, 14, 14],
         [15,  5,  3,  ..., 14, 14,  4],
         ...,
         [14,  6,  8,  ...,  4, 10, 25],
         [ 6,  8,  5,  ..., 10, 25,  3],
         [ 8,  5, 10,  ..., 25,  3,  5]], device='cuda:0'),
 tensor([[ 3, 15,  5,  ...,  6, 14, 14],
         [15,  5,  3,  ..., 14, 14,  4],
         [ 5,  3, 10,  ..., 14,  4,  9],
         ...,
         [ 6,  8,  5,  ..., 10, 25,  3],
         [ 8,  5, 10,  ..., 25,  3,  5],
         [ 5, 10,  6,  ...,  3,  5, 23]], device='cuda:0'))
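
The same windowing scheme is easier to inspect on a short string without tensors. The sketch below uses a hypothetical helper `make_pairs` and a toy `nchars=4`. It produces `len(text) - nchars` pairs, each target being the input shifted by one character.

```python
def make_pairs(text, nchars):
    """Return (input, target) string pairs; the target is the input shifted by one character."""
    pairs = []
    for i in range(len(text) - nchars):
        pairs.append((text[i:i + nchars], text[i + 1:i + nchars + 1]))
    return pairs

pairs = make_pairs("hello world", nchars=4)
for src, dst in pairs[:3]:
    print(repr(src), "->", repr(dst))
print("number of pairs:", len(pairs))
```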

Now let's define the generator network. It can be based on any recurrent cell we discussed in the previous unit (simple, LSTM or GRU). In our example we will use LSTM.

Because the network takes characters as input and the vocabulary size is pretty small, we do not need an embedding layer: one-hot-encoded input can go directly to the LSTM cell. However, because we pass character indices as input, we need to one-hot-encode them before passing them to the LSTM. This is done by calling the `one_hot` function during the `forward` pass. The output encoder is a linear layer that converts the hidden state at each position into a vector of `vocab_size` unnormalized scores (logits), one per character.

In [10]:
class LSTMGenerator(torch.nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.rnn = torch.nn.LSTM(vocab_size,hidden_dim,batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, s=None):
        x = torch.nn.functional.one_hot(x,vocab_size).to(torch.float32)
        x,s = self.rnn(x,s)
        return self.fc(x),s
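
To make the tensor shapes concrete, here is a small sketch that pushes a random batch through the same three stages (one-hot encoding, LSTM, linear layer). The sizes are hypothetical toy values, not the ones used in this unit.

```python
import torch

vocab_size, hidden_dim = 10, 8   # hypothetical toy sizes
batch, seq_len = 3, 5

# Random character indices standing in for encoded text
x = torch.randint(0, vocab_size, (batch, seq_len))
x_onehot = torch.nn.functional.one_hot(x, vocab_size).to(torch.float32)  # (3, 5, 10)

rnn = torch.nn.LSTM(vocab_size, hidden_dim, batch_first=True)
fc = torch.nn.Linear(hidden_dim, vocab_size)

h, (h_n, c_n) = rnn(x_onehot)   # h: (3, 5, 8); h_n and c_n: (1, 3, 8)
logits = fc(h)                  # (3, 5, 10): one score vector per position

print(x_onehot.shape, h.shape, h_n.shape, logits.shape)
```

The output keeps the sequence dimension, so the network produces a prediction for every input position, which is what the training scheme above relies on.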

During training, we want to be able to sample generated text. To do that, we will define a `generate` function that produces an output string of length `size`, starting from the initial string `start`.

It works as follows. First, we pass the whole start string through the network, and take the output state `s` and the output `out`, whose last position holds the scores for the next character. Since these scores form a vector with one entry per vocabulary character, we take `argmax` to get the index `nc` of the most likely character, use `itos` to look up the actual character, and append it to the resulting list of characters `chars`. We then feed `nc` back into the network together with the state `s`. This process of generating one character is repeated `size` times to generate the required number of characters.

In [12]:
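def generate(net,size=100,start='today '):
    chars = list(start)
    # Run the whole start string through the network to obtain the initial state
    out, s = net(encode_text(chars).view(1,-1).to(device))
    for i in range(size):
        # Greedy choice: index of the highest score at the last position
        nc = torch.argmax(out[0][-1])
        chars.append(TEXT.vocab.itos[nc])
        # Feed the chosen character back in, carrying the state forward
        out, s = net(nc.view(1,-1),s)
    return ''.join(chars)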

Now let's do the training! The training loop is almost the same as in all our previous examples, but instead of accuracy we print sampled generated text every 1000 samples.

Special attention needs to be paid to the way we compute the loss. We need to compute the loss given the network output `out`, which contains one vector of unnormalized scores per position, and the expected text `text_out`, which is a tensor of character indices. Luckily, the `cross_entropy` function expects unnormalized network output as the first argument and the class number as the second, which is exactly what we have. By default it also averages the loss over all positions in the minibatch.
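
For intuition, the quantity being computed can be written out in plain Python. The sketch below is an illustration with made-up scores, not PyTorch's implementation: for each position it takes the log of the softmax probability assigned to the correct class, negates it, and averages over all positions.

```python
import math

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of `targets` under softmax(logits)."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]  # equals -log softmax(row)[target]
    return total / len(targets)

# Hypothetical scores for two positions over a three-character vocabulary
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(cross_entropy(logits, targets))
```

In the training loop, `out.view(-1,vocab_size)` plays the role of `logits` (one row per character position across the whole minibatch) and `text_out.flatten()` plays the role of `targets`.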

We also limit the training to `samples_to_train` samples, in order not to wait too long. We encourage you to experiment and try longer training, possibly for several epochs (in which case you would need to create another loop around this code).

In [18]:
net = LSTMGenerator(vocab_size,64).to(device)

samples_to_train = 10000
optimizer = torch.optim.Adam(net.parameters(),0.01)
net.train()
for i,x in enumerate(train_dataset.examples):
    # Skip texts too short to yield at least 10 training sequences
    if len(x.Text)-nchars<10:
        continue
    samples_to_train-=1
    if not samples_to_train: break
    text_in, text_out = get_batch(x.Text)
    optimizer.zero_grad()
    out,s = net(text_in)
    # Flatten to (N*nchars, vocab_size) scores and N*nchars target indices
    loss = torch.nn.functional.cross_entropy(out.view(-1,vocab_size),text_out.flatten())
    loss.backward()
    optimizer.step()
    if i%1000==0:
        print(f"Current loss = {loss.item()}")
        print(generate(net))

Current loss = 2.0068554878234863
today a resite a resite a resite a resite a resite a resite a resite a resite a resite a resite a resite a
Current loss = 1.6322784423828125
today and a reside and a reside and a reside and a reside and a reside and a reside and a reside and a res
Current loss = 2.3880646228790283
today and the company and the company and the company and the company and the company and the company and 
Current loss = 1.6894474029541016
today and the company to the stack and the stack and the stack and the stack and the stack and the stack a
Current loss = 1.708423376083374
today and the battere and the battere and the battere and the battere and the battere and the battere and 
Current loss = 1.8624305725097656
today and the first the first the first the first the first the first the first the first the first the fi
Current loss = 1.6230477094650269
today and the company and the company and the company and the company and the company and the company and 
Current loss =

This example already generates some pretty good text, but it can be further improved in several ways:
* **Better minibatch generation**. The way we prepared data for training was to generate one minibatch from one sample. This is not ideal, because minibatches are all of different sizes, and some cannot be generated at all, because the text is shorter than `nchars`. Also, small minibatches do not utilize the GPU well. It would be wiser to take one large chunk of text from all samples, generate all input-output pairs, shuffle them, and form minibatches of equal size.
* **Multilayer LSTM**. It makes sense to try 2 or 3 layers of LSTM cells. As we mentioned in the previous unit, each layer of LSTM extracts certain patterns from text, and in the case of a character-level generator we can expect the lower LSTM level to be responsible for extracting syllables, and higher levels for words and word combinations. This can be implemented simply by passing the number-of-layers parameter to the LSTM constructor.
* You may also want to experiment with **GRU units** and see which ones perform better, and with **different hidden layer sizes**. Too large a hidden layer may result in overfitting (e.g. the network will learn the exact text), and too small a size might not produce good results.
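
The first suggestion can be sketched in plain Python on strings. The helper `make_batches` is hypothetical, and joining samples with a single space and dropping the last incomplete batch are assumptions made for this illustration. In a real pipeline, each pair would then be encoded into tensors as before.

```python
import random

def make_batches(texts, nchars, batch_size, seed=0):
    """Pool all texts, build shifted (input, target) pairs, shuffle, and split into equal batches."""
    corpus = " ".join(texts)  # assumption: samples are joined with a single space
    pairs = [(corpus[i:i + nchars], corpus[i + 1:i + nchars + 1])
             for i in range(len(corpus) - nchars)]
    random.Random(seed).shuffle(pairs)
    # Drop the last incomplete batch so that every batch has the same size
    n_full = len(pairs) // batch_size
    return [pairs[b * batch_size:(b + 1) * batch_size] for b in range(n_full)]

batches = make_batches(["hello world", "hold on tight"], nchars=4, batch_size=5)
print(len(batches), [len(b) for b in batches])
```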

## Soft Text Generation and Temperature

In the previous definition of `generate`, we always took the character with the highest probability as the next character in the generated text. As a result, the text often "cycled" through the same character sequences again and again, as in this example:
```
today of the second the company and a second the company ...
```

However, if we look at the probability distribution for the next character, it could be that the difference between the few highest probabilities is not large, e.g. one character can have probability 0.2, another 0.19, and so on. For example, when looking for the next character in the sequence '*play*', the next character can equally well be either a space or **e** (as in the word *player*).

This leads us to the conclusion that it is not always "fair" to select the character with the highest probability, because choosing the second highest might still lead us to meaningful text. It is wiser to **sample** characters from the probability distribution given by the network output.

This sampling can be done using the `multinomial` function, which draws samples from the so-called **multinomial** distribution. A function that implements this **soft** text generation is defined below:

In [52]:
def generate_soft(net,size=100,start='today ',temperature=1.0):
        chars = list(start)
        out, s = net(encode_text(chars).view(1,-1).to(device))
        for i in range(size):
            #nc = torch.argmax(out[0][-1])
            out_dist = out[0][-1].div(temperature).exp()
            nc = torch.multinomial(out_dist,1)[0]
            chars.append(TEXT.vocab.itos[nc])
            out, s = net(nc.view(1,-1),s)
        return ''.join(chars)
    
for i in [0.3,0.8,1.0,1.3,1.8]:
    print(f"--- Temperature = {i}\n{generate_soft(net,size=300,start='Today ',temperature=i)}\n")

--- Temperature = 0.3
Today and the first of the profit at the be that the second the be workers and the recome were the string the second the second-released to the market that store the profit and the first a showed the most the signed in the were market in the American Aneenest The man state for the a the company and the ma

--- Temperature = 0.8
Today in a limater companie in the streal and the the an in were belost was dising weeks yesterday in the were the changions of out on Thursday specy to share inthers of a recond-fell defeater state destable is wield at the develop the rever that the Olympics will business was begonning proceldages sincea

--- Temperature = 1.0
Today of hold calling a gtriging to new down hoped begin athly engined frambiligation from the Gamb in Fivenrating An6 Explan-thay centing publlicated thris makeigre: Basrey in Athenss Concuren #39;s Veris Corp. shepting up hissing marker case were for a will manager wroper nest 24 thig week  prevangroney

--- Temper

We have introduced one more parameter called **temperature**, which indicates how closely we should stick to the highest probability. If the temperature is 1.0, we do fair multinomial sampling, and when the temperature goes to infinity, all probabilities become equal and we select the next character uniformly at random. In the example above we can observe that the text becomes meaningless when we increase the temperature too much, and that it resembles the "cycled" hard-generated text as the temperature gets closer to 0.
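
The effect of temperature on the distribution can be checked numerically. The sketch below is plain Python with made-up scores. It divides the scores by the temperature before applying softmax, which matches dividing by `temperature` and exponentiating as in `generate_soft`, given that sampling normalizes the weights. The function names are hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature=1.0, rng=random):
    """Draw one index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.9, 0.5, -1.0]  # hypothetical scores for four characters
for t in [0.3, 1.0, 1.8]:
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], "sampled:", sample_index(logits, t))
```

Lower temperatures concentrate the probability mass on the top-scoring characters, while higher temperatures spread it more evenly.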