# Character-Level LSTM in PyTorch

The network will train character by character on some text, then generate new text character by character. As an example, I will train on Anna Karenina. **This model will be able to generate new text based on the text from the book!**

This network is based on Andrej Karpathy's [post on RNNs](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [implementation in Torch](https://github.com/karpathy/char-rnn). Below is the general architecture of the character-wise RNN.

<img src="assets/charseq.jpeg" width="500">

In [1]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

import matplotlib.pyplot as plt
%matplotlib inline

## Load data

In [2]:
# read data
with open('data/anna.txt', 'r') as f:
    text = f.read()
    
# len of data
print(len(text))

# print the first 90 characters
print(text[:90])

1985223
Chapter 1


Happy families are all alike; every unhappy family is unhappy in its own
way.



### Tokenization

Create two dictionaries to convert the characters to and from integers. Encoding the characters as integers makes them easier to use as input to the network.

In [3]:
# unique characters, stored as a tuple so the vocabulary cannot change
chars = tuple(set(text))

# maps integers to characters
int2char = dict(enumerate(chars))

# maps characters to unique integers
char2int = {char : i for i, char in int2char.items()}

# encode the text
encoded = np.array([char2int[ch] for ch in text])


And we can see those same characters from above, encoded as integers.

In [4]:
print(encoded[:90])

[42 55 73 15 20  1 80 70 45 33 33 33 44 73 15 15 77 70 71 73  9 72 30 72
  1 47 70 73 80  1 70 73 30 30 70 73 30 72 48  1  2 70  1 64  1 80 77 70
 11  6 55 73 15 15 77 70 71 73  9 72 30 77 70 72 47 70 11  6 55 73 15 15
 77 70 72  6 70 72 20 47 70 78 25  6 33 25 73 77  5 33]
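As a quick sanity check, the mapping can be inverted to recover the original text. The sketch below uses a short stand-in string (`sample_text`, not the book) and hypothetical names (`vocab`, `char2idx`, `idx2char`). It also sorts the character set, which is an assumption added here: Python randomizes string hashing per process by default, so iterating over a plain `set` of characters can give a different order, and therefore different integers, on each run.

```python
import numpy as np

# Stand-in text for illustration only (not the book).
sample_text = "happy families"

# Sorting makes the vocabulary order reproducible between runs.
vocab = tuple(sorted(set(sample_text)))
idx2char = dict(enumerate(vocab))
char2idx = {ch: i for i, ch in idx2char.items()}

# Encode to integers, then decode back to characters.
encoded_sample = np.array([char2idx[ch] for ch in sample_text])
decoded_sample = ''.join(idx2char[int(i)] for i in encoded_sample)

print(encoded_sample)
print(decoded_sample)
```

Because the checkpoint later in this notebook stores the character tuple alongside the weights, the unordered version still reloads consistently; sorting mainly helps when comparing encoded output across runs.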


## One hot encoding

In [5]:
def one_hot_encoding(arr, n_labels):
    
    # Initialize the encoded array
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    
    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    
    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape(*arr.shape, n_labels)
    
    return one_hot

In [6]:
test_seq = np.array([4, 7, 1])
one_hot = one_hot_encoding(test_seq, 8)

print(one_hot)

[[0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 0. 0. 0. 0.]]
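For comparison, PyTorch provides `torch.nn.functional.one_hot`, which produces the same encoding from an integer tensor. This is an alternative sketch, not what the rest of the notebook uses. Note that it returns an integer tensor, so it is cast to float here, as would be needed before feeding it to an LSTM.

```python
import numpy as np
import torch
import torch.nn.functional as F

test_seq = np.array([4, 7, 1])

# F.one_hot expects an int64 (long) tensor of class indices.
indices = torch.from_numpy(test_seq).long()
one_hot_t = F.one_hot(indices, num_classes=8).float()

print(one_hot_t)
```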


## Training mini-batches

<img src="assets/sequence_batching@1x.png" width=500px>

### Creating Batches

**1. The first thing we need to do is discard some of the text so we only have completely full mini-batches.**

Each batch contains $N \times M$ characters, where $N$ is the batch size (the number of sequences in a batch) and $M$ is `seq_length`, the number of time steps in a sequence. To get the total number of full batches, $K$, that we can make from the array `arr`, divide the length of `arr` by the number of characters per batch and round down. Once you know the number of batches, you can get the total number of characters to keep from `arr`: $N * M * K$.

**2. After that, we need to split `arr` into $N$ batches.**

You can do this using `arr.reshape(size)`, where `size` is a tuple containing the dimension sizes of the reshaped array. We want $N$ sequences in a batch, so make that the size of the first dimension. For the second dimension, you can use `-1` as a placeholder, and NumPy will infer the appropriate size for you. After this, you should have an array that is $N \times (M * K)$.

**3. Now that we have this array, we can iterate through it to get our mini-batches.**

The idea is that each batch is an $N \times M$ window on the $N \times (M * K)$ array. For each subsequent batch, the window moves over by `seq_length`. We also want to create both the input and target arrays. Remember that the targets are just the inputs shifted over by one character. The way I like to do this is to use `range` to take steps of size `seq_length` from $0$ to `arr.shape[1]`, the total number of tokens in each sequence. That way, the integers you get from `range` always point to the start of a batch, and each window is `seq_length` wide.
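The three steps can be traced on a toy array before writing the generator. This sketch assumes 20 tokens, a batch size of $N = 2$ and a sequence length of $M = 3$; the variable names `N`, `M` and `K` simply mirror the symbols above.

```python
import numpy as np

arr = np.arange(1, 21)   # 20 toy tokens
N, M = 2, 3              # batch size, sequence length

# Step 1: number of full batches and the characters to keep
K = arr.size // (N * M)          # 20 // 6 = 3
kept = arr[:N * M * K]           # first 18 tokens

# Step 2: one row per sequence in the batch
rows = kept.reshape(N, -1)       # shape (2, 9), i.e. N x (M * K)

# Step 3: slide a window of width M across the columns
starts = list(range(0, rows.shape[1], M))   # [0, 3, 6]
first_x = rows[:, starts[0]:starts[0] + M]

print(K, rows.shape, starts)
print(first_x)
```

The last two tokens (19 and 20) are discarded because they cannot fill a complete batch.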

In [7]:
def get_batches(arr, batch_size, seq_length):
    
    # get the number of batches
    n_batches = arr.size//(batch_size * seq_length)
    
    # keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size * seq_length]
    
    # reshape into batch_size rows
    arr = arr.reshape(batch_size, -1)
    
    # iterate over the batches using a window of size seq_length
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:, n:n+seq_length]
        # The targets, shifted by one
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y

In [8]:
arr = np.arange(1, 21)
print(arr)
batches = get_batches(arr, 2, 3)
x, y = next(batches)
print('x\n', x)
print('\ny\n', y)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]
x
 [[ 1  2  3]
 [10 11 12]]

y
 [[ 2  3  4]
 [11 12 13]]


In [9]:
batches = get_batches(encoded, 8, 50)
x, y = next(batches)

In [10]:
# printing out the first 10 items in a sequence
print('x\n', x[:, :10])
print('\ny\n', y[:, :10])

x
 [[42 55 73 15 20  1 80 70 45 33]
 [47 78  6 70 20 55 73 20 70 73]
 [ 1  6 67 70 78 80 70 73 70 71]
 [47 70 20 55  1 70 49 55 72  1]
 [70 47 73 25 70 55  1 80 70 20]
 [49 11 47 47 72 78  6 70 73  6]
 [70 12  6  6 73 70 55 73 67 70]
 [62 40 30 78  6 47 48 77  5 70]]

y
 [[55 73 15 20  1 80 70 45 33 33]
 [78  6 70 20 55 73 20 70 73 20]
 [ 6 67 70 78 80 70 73 70 71 78]
 [70 20 55  1 70 49 55 72  1 71]
 [47 73 25 70 55  1 80 70 20  1]
 [11 47 47 72 78  6 70 73  6 67]
 [12  6  6 73 70 55 73 67 70 47]
 [40 30 78  6 47 48 77  5 70 65]]


## Model Architecture

We start by defining the layers and operations we want. Then, define a method for the forward pass. 

<img src="assets/charRNN.png" width=500px>

### LSTM Inputs/Outputs

You can create a basic [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) as follows:

```python
self.lstm = nn.LSTM(input_size, n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
```
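With `batch_first=True`, the layer expects input of shape `(batch, seq, feature)` and returns the per-step outputs together with a tuple holding the final hidden and cell states. The sketch below checks those shapes; the sizes are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

input_size, n_hidden, n_layers = 83, 16, 2
batch, seq = 4, 10

lstm = nn.LSTM(input_size, n_hidden, n_layers,
               dropout=0.5, batch_first=True)

x = torch.randn(batch, seq, input_size)
# Passing no initial state makes PyTorch start from zeros.
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (batch, seq, n_hidden)
print(h_n.shape)     # (n_layers, batch, n_hidden)
print(c_n.shape)     # (n_layers, batch, n_hidden)
```

Note that `batch_first` only affects the input and output tensors; the hidden and cell states keep the layer dimension first.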

In [11]:
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('No GPU available, training on CPU.')

Training on GPU!


In [12]:
class CharRNN(nn.Module):
    def __init__(self, tokens, n_hidden=256, n_layers=2,
                               drop_prob=0.5, lr=0.001):
        super().__init__()
        # parameters 
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        # character dictionaries
        self.chars = tokens
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: i for i, ch in self.int2char.items()}
        
        # lstm
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True) # (batch, seq, feature)
        # dropout layer
        self.dropout = nn.Dropout(drop_prob)
        
        # fully connected layer
        self.fc = nn.Linear(n_hidden, len(self.chars))
    
    def forward(self, x, hidden):
        
        # raw output and hidden state(h_n and c_n) from lstm
        r_output, hidden = self.lstm(x, hidden)
        
        # drop the output
        out = self.dropout(r_output)
                          
        # stack up LSTM output using view
        out = out.contiguous().view(-1, self.n_hidden)
        
        # fully connected layer
        out = self.fc(out)
                           
        return out, hidden
                           
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden

## Training

A couple of details about training: 
> * Within the batch loop, we detach the hidden state from its history by assigning it to a new *tuple*, because the hidden state of an LSTM is a tuple of the hidden and cell states.
> * We use [`clip_grad_norm_`](https://pytorch.org/docs/stable/_modules/torch/nn/utils/clip_grad.html) to help prevent exploding gradients.
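
Both points can be seen in isolation on a tiny model. The sketch below is an illustration with made-up sizes, not the training loop itself. It uses `.detach()`, which has the same effect here as the `.data` access used in the training code, and it checks that the gradient norm after clipping does not exceed the chosen maximum.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(5, 8, batch_first=True)

# First chunk: the returned state carries the graph of this forward pass.
out, h = lstm(torch.randn(2, 3, 5))
print(h[0].requires_grad)            # True: still attached to its history

# Detach both the hidden and the cell state before the next chunk.
h_detached = tuple(each.detach() for each in h)
print(h_detached[0].requires_grad)   # False

# Second chunk: gradients now stop at the detached state.
out, h = lstm(torch.randn(2, 3, 5), h_detached)
loss = out.pow(2).sum() * 100        # scaled up to make large gradients likely
loss.backward()

max_norm = 5.0
# clip_grad_norm_ rescales gradients in place and returns the norm before clipping.
norm_before = nn.utils.clip_grad_norm_(lstm.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in lstm.parameters()))

print(float(norm_before), float(norm_after))
```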

In [13]:
def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
    ''' Training a network 
    
        Arguments
        ---------
        
        net: CharRNN network
        data: text data to train the network
        epochs: Number of epochs to train
        batch_size: Number of mini-sequences per mini-batch, aka batch size
        seq_length: Number of character steps per mini-batch
        lr: learning rate
        clip: gradient clipping
        val_frac: Fraction of data to hold out for validation
        print_every: Number of steps for printing training and validation loss
    
    '''   
    # loss and optimizer function
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # train and validation dataset
    total_data = len(data)
    split = int((1 - val_frac) * total_data)
    train_data, valid_data = data[:split], data[split:]
    
    # gpu
    if train_on_gpu:
        net.cuda()
        
    counter = 0
    n_chars = len(net.chars)
    
    for epoch in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        
        net.train()
        for x, y in get_batches(train_data, batch_size, seq_length):
            counter += 1
            
            # one hot encoding
            x = one_hot_encoding(x, n_chars)
            
            # change to torch tensor
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            
            # gpu
            if train_on_gpu:
                inputs, targets = inputs.cuda(), targets.cuda()
                
            # creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])
            
            # backpropagation
            net.zero_grad()
            output, h = net(inputs, h)
            loss = criterion(output, targets.view(batch_size*seq_length).long())
            loss.backward()
            # clip_grad_norm
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            optimizer.step()
            
            if counter % print_every == 0:
                net.eval()
                val_h = net.init_hidden(batch_size)
                val_losses = []
                for x, y in get_batches(valid_data, batch_size, seq_length):
                    # one hot encoding
                    x = one_hot_encoding(x, n_chars)

                    # change to torch tensor
                    inputs, targets = torch.from_numpy(x), torch.from_numpy(y)

                    # gpu
                    if train_on_gpu:
                        inputs, targets = inputs.cuda(), targets.cuda()

                    # detach the hidden state from the previous
                    # validation batch's history
                    val_h = tuple([each.data for each in val_h])
                    
                    output, val_h = net(inputs, val_h)
                    val_loss = criterion(output, targets.view(batch_size*seq_length).long())
                    
                    val_losses.append(val_loss.item())
                
                net.train()
                
                print("Epoch: {}/{}...".format(epoch+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.4f}...".format(loss.item()),
                      "Val Loss: {:.4f}".format(np.mean(val_losses)))


## Instantiating the model

In [14]:
# define and print the net
n_hidden=512
n_layers=2

net = CharRNN(chars, n_hidden, n_layers)
print(net)

CharRNN(
  (lstm): LSTM(83, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=83, bias=True)
)


In [15]:
batch_size = 128
seq_length = 100
n_epochs = 20 # start smaller if you are just testing initial behavior

# train the model
train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001)

Epoch: 1/20... Step: 10... Loss: 3.2486... Val Loss: 3.1719
Epoch: 1/20... Step: 20... Loss: 3.1504... Val Loss: 3.1301
Epoch: 1/20... Step: 30... Loss: 3.1419... Val Loss: 3.1200
Epoch: 1/20... Step: 40... Loss: 3.1102... Val Loss: 3.1180
Epoch: 1/20... Step: 50... Loss: 3.1375... Val Loss: 3.1157
Epoch: 1/20... Step: 60... Loss: 3.1123... Val Loss: 3.1110
Epoch: 1/20... Step: 70... Loss: 3.0937... Val Loss: 3.1014
Epoch: 1/20... Step: 80... Loss: 3.0915... Val Loss: 3.0772
Epoch: 1/20... Step: 90... Loss: 3.0407... Val Loss: 3.0160
Epoch: 1/20... Step: 100... Loss: 2.9547... Val Loss: 2.9279
Epoch: 1/20... Step: 110... Loss: 2.8478... Val Loss: 2.8357
Epoch: 1/20... Step: 120... Loss: 2.7156... Val Loss: 2.7061
Epoch: 1/20... Step: 130... Loss: 2.6492... Val Loss: 2.5948
Epoch: 2/20... Step: 140... Loss: 2.5803... Val Loss: 2.5238
Epoch: 2/20... Step: 150... Loss: 2.5199... Val Loss: 2.4774
Epoch: 2/20... Step: 160... Loss: 2.4729... Val Loss: 2.4363
Epoch: 2/20... Step: 170... Loss:

Epoch: 10/20... Step: 1340... Loss: 1.4003... Val Loss: 1.4194
Epoch: 10/20... Step: 1350... Loss: 1.3831... Val Loss: 1.4163
Epoch: 10/20... Step: 1360... Loss: 1.4008... Val Loss: 1.4174
Epoch: 10/20... Step: 1370... Loss: 1.3809... Val Loss: 1.4148
Epoch: 10/20... Step: 1380... Loss: 1.4151... Val Loss: 1.4139
Epoch: 10/20... Step: 1390... Loss: 1.4294... Val Loss: 1.4120
Epoch: 11/20... Step: 1400... Loss: 1.4322... Val Loss: 1.4121
Epoch: 11/20... Step: 1410... Loss: 1.4467... Val Loss: 1.4088
Epoch: 11/20... Step: 1420... Loss: 1.4309... Val Loss: 1.4068
Epoch: 11/20... Step: 1430... Loss: 1.3879... Val Loss: 1.4093
Epoch: 11/20... Step: 1440... Loss: 1.4204... Val Loss: 1.4032
Epoch: 11/20... Step: 1450... Loss: 1.3496... Val Loss: 1.4036
Epoch: 11/20... Step: 1460... Loss: 1.3791... Val Loss: 1.4057
Epoch: 11/20... Step: 1470... Loss: 1.3704... Val Loss: 1.4018
Epoch: 11/20... Step: 1480... Loss: 1.3880... Val Loss: 1.3964
Epoch: 11/20... Step: 1490... Loss: 1.3840... Val Loss:

Epoch: 19/20... Step: 2640... Loss: 1.2320... Val Loss: 1.2933
Epoch: 20/20... Step: 2650... Loss: 1.2340... Val Loss: 1.2908
Epoch: 20/20... Step: 2660... Loss: 1.2301... Val Loss: 1.2925
Epoch: 20/20... Step: 2670... Loss: 1.2490... Val Loss: 1.2920
Epoch: 20/20... Step: 2680... Loss: 1.2416... Val Loss: 1.2948
Epoch: 20/20... Step: 2690... Loss: 1.2313... Val Loss: 1.2908
Epoch: 20/20... Step: 2700... Loss: 1.2359... Val Loss: 1.2881
Epoch: 20/20... Step: 2710... Loss: 1.2164... Val Loss: 1.2861
Epoch: 20/20... Step: 2720... Loss: 1.2064... Val Loss: 1.2952
Epoch: 20/20... Step: 2730... Loss: 1.2003... Val Loss: 1.2855
Epoch: 20/20... Step: 2740... Loss: 1.2051... Val Loss: 1.2812
Epoch: 20/20... Step: 2750... Loss: 1.2016... Val Loss: 1.2830
Epoch: 20/20... Step: 2760... Loss: 1.1898... Val Loss: 1.2810
Epoch: 20/20... Step: 2770... Loss: 1.2351... Val Loss: 1.2831
Epoch: 20/20... Step: 2780... Loss: 1.2598... Val Loss: 1.2844
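
`nn.CrossEntropyLoss` reports the mean negative log-likelihood per character in nats, so two reference points help when reading the log above. A model that guesses uniformly over the 83 characters would score $\ln 83$, and exponentiating a loss gives the per-character perplexity. The short calculation below uses the vocabulary size printed in the model summary and the final validation loss from the log.

```python
import math

n_chars = 83             # vocabulary size from the model summary
final_val_loss = 1.2844  # last validation loss in the log above

uniform_loss = math.log(n_chars)       # loss of a uniform guess
perplexity = math.exp(final_val_loss)  # effective number of choices per character

print(round(uniform_loss, 2), round(perplexity, 2))
```

Read this way, the trained model is about as uncertain as a choice among roughly 3.6 equally likely characters, compared with 83 for a uniform guess.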


## Checkpoint

After training, we'll save the model so we can load it again later if we need to. Here I'm saving the trained weights together with the parameters needed to recreate the same architecture: the hidden layer hyperparameters and the text characters.

In [17]:
model_name = 'rnn_20_epoch.net'

checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

with open(model_name, 'wb') as f:
    torch.save(checkpoint, f)

## Making Predictions

Now that the model is trained, we'll want to sample from it and make predictions about next characters! To sample, we pass in a character and have the network predict the next character. Then we take that character, pass it back in, and get another predicted character. Just keep doing this and you'll generate a bunch of text!

### Top K sampling

Our predictions come from a categorical probability distribution over all the possible characters. We can make the sampled text more reasonable, and the distribution easier to handle (with fewer candidates), by only considering the $K$ most probable characters.
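
A minimal sketch of the idea on a hand-made distribution (the probabilities and `k` are invented for illustration): keep the $K$ largest probabilities, renormalize them so they sum to one, and sample only among those indices.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)

# Toy distribution over 6 "characters" (sums to 1, no ties)
p = torch.tensor([0.05, 0.40, 0.12, 0.30, 0.05, 0.08])
k = 3

# topk returns the k largest values and their indices
top_p, top_idx = p.topk(k)
top_p = top_p.numpy().astype(np.float64)
top_idx = top_idx.numpy()

# Renormalize and sample one index among the k candidates
renormalized = top_p / top_p.sum()
choice = rng.choice(top_idx, p=renormalized)

print(top_idx, renormalized, choice)
```

Indices outside the top $K$ can never be drawn, which is what keeps very unlikely characters out of the generated text.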

In [16]:
def predict(net, char, h=None, top_k=None):
    
    # tensor inputs
    x = np.array([[net.char2int[char]]])
    x = one_hot_encoding(x, len(net.chars))
    inputs = torch.from_numpy(x)
    
    if train_on_gpu:
        inputs = inputs.cuda()
        
    # start from a zero state if none was passed in,
    # then detach the hidden state from its history
    if h is None:
        h = net.init_hidden(1)
    h = tuple([each.data for each in h])
    # get output and h of the model
    out, h = net(inputs, h)
    
    # characters probabilities
    p = F.softmax(out, dim=1).data
    
    # move to cpu
    if train_on_gpu:
        p = p.cpu()
    
    # top characters
    if top_k is None:
        top_ch = np.arange(len(net.chars))
    else:
        p, top_ch = p.topk(top_k)
        top_ch = top_ch.numpy().squeeze()
        
    # select the likely next character with some element of randomness
    p = p.numpy().squeeze()
    char = np.random.choice(top_ch, p = p/p.sum())
    
    return net.int2char[char], h

### Priming and generating text 

Typically you'll want to prime the network so you can build up a hidden state. Otherwise the network will start out generating characters at random. In general the first bunch of characters will be a little rough since it hasn't built up a long history of characters to predict from.

In [31]:
def sample(net, size, prime='The', top_k=None):
        
    if(train_on_gpu):
        net.cuda()
    else:
        net.cpu()
    
    net.eval()  
    
    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(batch_size=1)
    
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    chars.append(char)
    
    # Now pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

In [32]:
print(sample(net, 1000, prime='Anna', top_k=5))

Anna's hat, and the
stoll moner and to business the first rest of the same and timidly
but a feeling and he could not come again to her when
he had been consupped the sears of his heatte that he felt to
hear her
condition in and contented to the packs and some
throwing to the precentious face to her shame at the matter
without the strange formers of health. But the conscinused his
strange study was at first that it was a stand of a secretary times when
he said himself, and would she was steps and with all
the time with his face in his shade and tried to carried it..

"I cannot tell you what I cunder thas sending their money?"
said Anna and was still the point
they heard a little time, and she would have succeeded in a stipport, taking the day of her sister.

"To be anything?" he asked. "In an arranger was
to make her hand, and the passons, as a cropd of herself
in any sonds, and he could not care.
I've atried the same and stopper. What have they
were silent. And these steps as he so mo

## Loading a checkpoint

In [22]:
with open('rnn_20_epoch.net', 'rb') as f:
    checkpoint = torch.load(f)
    
loaded = CharRNN(checkpoint['tokens'], n_hidden=checkpoint['n_hidden'], n_layers=checkpoint['n_layers'])
loaded.load_state_dict(checkpoint['state_dict'])

<All keys matched successfully>
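
If a checkpoint was saved from a GPU and is loaded on a machine without one, `torch.load` accepts a `map_location` argument that remaps the stored tensors. The sketch below demonstrates the call on a small stand-in checkpoint written to an in-memory buffer; the layer and the dictionary contents are placeholders, not the notebook's model.

```python
import io
import torch
from torch import nn

layer = nn.Linear(4, 2)   # stand-in for the trained network
checkpoint = {'n_hidden': 4, 'state_dict': layer.state_dict()}

# Save to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
torch.save(checkpoint, buffer)
buffer.seek(0)

# Remap every stored tensor to the CPU while loading
restored = torch.load(buffer, map_location=torch.device('cpu'))

clone = nn.Linear(4, 2)
result = clone.load_state_dict(restored['state_dict'])
print(result)
```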

In [33]:
print(sample(loaded, 1000, prime='Vronsky said', top_k=5))

Vronsky said to
the position of the children with his ward, and so sorry
that he was asleep and husband there was so supposed. He had not conceuted, he cared
her son. That
should not see the carreating of his words, and that wealthen should see
him, that it wanted to set find the pecture, and the
plants of his whole performed smowing smile, the merched he would say
to him, and he felt that to stor all of the fact of his happant on the stroages to his
would be done, and at the this of the portraie though he had been because a fore
then and seeing her she was seriling. And at the fact of
his shoulder stocking his beis of the same face, and he saw side that he figured him stopping
to and seemed to this, and showing in the conversation, they were silent in
the figure,
and that he had been a luster. Bow were consequent face, that him to
the particular world there was no disconfestion, but that they came to
be tried to cale to her. The same
side of the music was not in things. And the contra