# Imports

In [6]:
import numpy as np
import torch
from torch import nn
import torch.optim as opt
import torch.nn.functional as F

In [7]:
%reload_ext autoreload
%autoreload

In [8]:
torch.set_default_tensor_type('torch.FloatTensor')

# Character-Level LSTM

In [9]:
!wc ../data/anna.txt

   40263  352929 2025486 ../data/anna.txt


In [10]:
# Load data
with open('../data/anna.txt', 'r') as f:
    text = f.read()
len(text)

1985223

In [11]:
# Show first 100 characters
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

## Tokenization

In [12]:
# Create char_to_idx and idx_to_char dictionaries
chars = list(set(text))
idx_to_char = dict(enumerate(chars))
char_to_idx = {ch: i for i, ch in idx_to_char.items()}

{k:v for (k, v) in list(idx_to_char.items())[:5]},\
{k:v for (k, v) in list(char_to_idx.items())[:5]}

({0: '8', 1: 'N', 2: 'C', 3: 'm', 4: 'F'},
 {'8': 0, 'N': 1, 'C': 2, 'm': 3, 'F': 4})

In [13]:
# Encode the text -- convert each char from str to int
encoded_text = np.array([char_to_idx[ch] for ch in text])

print(f'Total number of characters : {len(text)}.')
print(f'Total number of unique characters : {len(chars)}.')

Total number of characters : 1985223.
Total number of unique characters : 83.


In [14]:
encoded_text[:10]

array([ 2, 19, 66, 44,  5, 11, 63, 61, 50, 23])
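One caveat with `list(set(text))`: the iteration order of a set of strings can change between interpreter runs, so the character-to-index mapping is not guaranteed to be reproducible. A minimal sketch of one way around this, assuming a sorted vocabulary is acceptable (the `toy_*` names are made up for illustration):

```python
# Toy text standing in for the novel (an assumption for illustration).
toy_text = "happy families are all alike"

# sorted() makes the index assignment reproducible across interpreter runs,
# unlike iterating over a bare set of strings.
toy_chars = sorted(set(toy_text))
toy_idx_to_char = dict(enumerate(toy_chars))
toy_char_to_idx = {ch: i for i, ch in toy_idx_to_char.items()}

# Encode to integers, then decode back to the original string.
toy_encoded = [toy_char_to_idx[ch] for ch in toy_text]
toy_decoded = ''.join(toy_idx_to_char[i] for i in toy_encoded)

print(toy_chars[:3])
print(toy_decoded == toy_text)
```

The round trip through both dictionaries is a quick sanity check that the two mappings are inverses of each other.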

## Pre-processing the data

In [15]:
def one_hot_encode(arr, n_labels):
    '''Convert each element of `arr` into one-hot encoding vector.'''
    one_hot = np.zeros((arr.size, n_labels))

    # Fill the corresponding idx of each char with 1 - vectorized version
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    
    return one_hot

`n_labels` is usually called the vocabulary size in NLP literature. Here the vocabulary is the set of unique characters in the text. The network's output for each character is therefore a probability distribution over the vocabulary, and the character with the highest probability is the most likely one at that time step.

In [16]:
one_hot = one_hot_encode(np.array([[1, 2], [3, 4]]), len(chars))
one_hot.shape

(2, 2, 83)
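To make the encoding concrete, here is a small sketch on a 2 x 2 array with a vocabulary of only 4 labels. The helper `one_hot_sketch` is a stand-in that repeats the same idea as `one_hot_encode` so the cell runs on its own; the toy values are assumptions for illustration.

```python
import numpy as np

def one_hot_sketch(arr, n_labels):
    # Same idea as one_hot_encode above, re-declared so this cell is self-contained.
    out = np.zeros((arr.size, n_labels))
    out[np.arange(arr.size), arr.flatten()] = 1.
    return out.reshape((*arr.shape, n_labels))

toy = np.array([[0, 2], [1, 3]])
toy_one_hot = one_hot_sketch(toy, 4)

print(toy_one_hot[0])          # one-hot rows for indices 0 and 2
print(toy_one_hot.argmax(-1))  # argmax along the last axis recovers the indices
```

Each position holds exactly one `1.`, and taking `argmax` along the last axis inverts the encoding.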

## Mini-batches

To train on this data, we create mini-batches, each holding several sequences of a fixed number of steps.

We create the mini-batches so that the array of encoded text has shape `batch_size x (seq_length * total_batches)`. In other words, the encoded text is reshaped so that the first dimension is `batch_size`, and a sliding window of width `seq_length` moves along the second dimension. We keep sliding the window to the right with a step size of `seq_length` until all characters are covered. Characters at the end that do not fill a complete batch of `batch_size * seq_length` characters are discarded.
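The reshaping and windowing can be sketched on a toy array. The sizes below (23 values, `batch_size` of 2, `seq_length` of 5) are assumptions chosen so that the result is easy to read.

```python
import numpy as np

toy_arr = np.arange(23)          # stand-in for the encoded text
toy_batch_size, toy_seq_length = 2, 5

chars_per_batch = toy_batch_size * toy_seq_length      # 10
n_full_batches = len(toy_arr) // chars_per_batch       # 2, so the last 3 values are dropped
trimmed = toy_arr[:n_full_batches * chars_per_batch].reshape((toy_batch_size, -1))

# Each window is one mini-batch of shape (batch_size, seq_length).
windows = [trimmed[:, n:n + toy_seq_length]
           for n in range(0, trimmed.shape[1], toy_seq_length)]

print(trimmed.shape)
print(windows[0])
print(windows[1])
```

Note that row `i` of consecutive windows continues the same stretch of text, which is what allows the hidden state to be carried from one mini-batch to the next during training.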

In [17]:
def get_batches(arr, batch_size, seq_length):
    '''Create a generator that yields mini-batches of width seq_length.'''
    # total chars in one batch
    batch_size_total = batch_size * seq_length
    # total number of batches we can make
    n_batches = len(arr) // batch_size_total
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size_total]
    # Reshape into batch_size rows
    arr = arr.reshape((batch_size, -1))

    # iterate through the array, one sequence at a time
    # slicing window will be of seq_length width
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:, n:n + seq_length]
        # The targets, shifted by one
        y = np.zeros_like(x)        
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n + seq_length]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        
        # Will be generator
        yield x, y

In [18]:
batches = get_batches(encoded_text, 8, 50)
x, y = next(batches)

In [19]:
# printing out the first 10 items in a sequence
print(f'x\n {x[:10, :10]}\n')
print(f'y\n {y[:10, :10]}')

x
 [[ 2 19 66 44  5 11 63 61 50 23]
 [45 28 82 61  5 19 66  5 61 66]
 [11 82 55 61 28 63 61 66 61  6]
 [45 61  5 19 11 61 39 19 71 11]
 [61 45 66 53 61 19 11 63 61  5]
 [39 68 45 45 71 28 82 61 66 82]
 [61 31 82 82 66 61 19 66 55 61]
 [47 69 51 28 82 45 49 80 54 61]]

y
 [[19 66 44  5 11 63 61 50 23 23]
 [28 82 61  5 19 66  5 61 66  5]
 [82 55 61 28 63 61 66 61  6 28]
 [61  5 19 11 61 39 19 71 11  6]
 [45 66 53 61 19 11 63 61  5 11]
 [68 45 45 71 28 82 61 66 82 55]
 [31 82 82 66 61 19 66 55 61 45]
 [69 51 28 82 45 49 80 54 61  9]]


## Define the network

In `__init__` the suggested structure is as follows:
* Define an LSTM layer that takes as params: an input size (the number of characters), a hidden layer size `n_hidden`, a number of layers `n_layers`, a dropout probability `drop_prob`, and a `batch_first` boolean (True, since the batch is the first dimension of our input)
* Define a dropout layer with `drop_prob`
* Define a fully-connected layer with params: input size `n_hidden` and output size (the number of characters)
* Finally, initialize the weights (the class below relies on PyTorch's default initialization)

You can create a basic [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) as follows:

```python
self.lstm = nn.LSTM(input_size, n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
```

where `input_size` is the number of features the layer expects at each time step (here, the number of characters), and `n_hidden` is the number of units in each hidden layer. The `dropout` parameter adds dropout with the specified probability to the outputs of every LSTM layer except the last. Stacking is handled by `n_layers`: the output of one layer is fed into the next. Finally, in the `forward` function, we use `.view` to flatten the LSTM outputs to shape `(batch * seq_length, n_hidden)` so they can be passed to the fully-connected layer.

We also need to create an initial hidden state of all zeros. With the `init_hidden` method defined below, this is done like so:

```python
h = net.init_hidden(batch_size)
```

In [20]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
print(f'Training on {device}')

Training on cpu


With `batch_first=True`, the input is expected to have the dimensions `batch x seq_length x features`, where `features` is the length of the one-hot vector, i.e., the vocabulary size.

Each LSTM layer has `h_0` and `c_0` tensors, called the hidden state (short-term memory) and the cell state (long-term memory) respectively. They are initialized to zeros and have dimensions `num_layers x batch x hidden_size`, regardless of `batch_first`.
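A quick way to confirm these shapes is to push a dummy tensor through a small `nn.LSTM`. This is only a sketch; the toy sizes below are arbitrary assumptions and much smaller than the ones used for training.

```python
import torch
from torch import nn

# Toy sizes chosen only for illustration.
toy_batch, toy_seq, toy_vocab, toy_hidden, toy_layers = 4, 7, 10, 16, 2

toy_lstm = nn.LSTM(toy_vocab, toy_hidden, toy_layers, batch_first=True)
toy_x = torch.zeros(toy_batch, toy_seq, toy_vocab)
toy_h0 = torch.zeros(toy_layers, toy_batch, toy_hidden)
toy_c0 = torch.zeros(toy_layers, toy_batch, toy_hidden)

toy_out, (toy_hn, toy_cn) = toy_lstm(toy_x, (toy_h0, toy_c0))

print(toy_out.shape)  # batch x seq_length x hidden_size
print(toy_hn.shape)   # num_layers x batch x hidden_size
print(toy_cn.shape)   # num_layers x batch x hidden_size
```

Only the input and output tensors follow the batch-first layout; the hidden and cell states keep the layer dimension first.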

The `dropout` argument of `nn.LSTM` is not the same as a separate `nn.Dropout` layer. It is applied between stacked layers, to the outputs of every LSTM layer except the last, so it does not touch the final output tensor; that is why the model below adds its own `nn.Dropout` before the fully-connected layer. Both work the same way: each unit is dropped with probability `prob`, drawn from a Bernoulli random variable.
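The behaviour of a dropout layer in training versus evaluation mode can be sketched as follows. The tensor of ones and the seed are assumptions for illustration; the point is that surviving units are rescaled by `1 / (1 - p)` during training and that the layer is the identity in `eval` mode.

```python
import torch
from torch import nn

torch.manual_seed(0)
toy_dropout = nn.Dropout(p=0.5)
ones = torch.ones(1000)

toy_dropout.train()
dropped = toy_dropout(ones)   # each element is either 0 or 1 / (1 - p) = 2

toy_dropout.eval()
kept = toy_dropout(ones)      # identity in eval mode

print(dropped.unique())
print(torch.equal(kept, ones))
```

This is also why the training and sampling code switches between `net.train()` and `net.eval()`.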

In [21]:
class CharRNN(nn.Module):
    
    def __init__(self, chars, n_hidden=256, n_layers=2, drop_prob=0.5):
        super().__init__()
        self.drop_prob = drop_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.chars = chars
        
        # Two LSTM layers stacked with dropout
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        # fully connected layer
        self.fc = nn.Linear(n_hidden, len(self.chars))
      
    def forward(self, x, hidden):
        r_output, hidden = self.lstm(x, hidden)
        out = self.dropout(r_output)
        
        # Stack up LSTM outputs using view
        # you may need to use contiguous to reshape the output
        out = out.contiguous().view(-1, self.n_hidden)
        out = self.fc(out)
        
        # return the final output and the hidden state
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x n_hidden,
        # initialized to zero, for hidden state and cell state of LSTM
        weight = next(self.parameters()).data
        hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().to(device))
        
        return hidden

## Train

The train function gives us the ability to set the number of epochs, the learning rate, and other parameters.

Below we're using an Adam optimizer and cross entropy loss since we are looking at character class scores as output. We calculate the loss and perform backpropagation, as usual!
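To get a feel for the loss values printed during training, it helps to know what an uninformed model would score. The NumPy sketch below computes the mean negative log-likelihood from raw scores, which is the quantity `nn.CrossEntropyLoss` reports with its default mean reduction. The helper name `cross_entropy_sketch` and the toy targets are assumptions for illustration.

```python
import numpy as np

def cross_entropy_sketch(logits, targets):
    # logits: (n_samples, vocab_size), targets: (n_samples,) integer class ids
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 83                               # unique characters in this text
uniform_logits = np.zeros((5, vocab_size))    # equal score for every character
toy_targets = np.array([0, 1, 2, 3, 4])

baseline = cross_entropy_sketch(uniform_logits, toy_targets)
print(baseline, np.log(vocab_size))
```

A model that assigns equal probability to all 83 characters scores `ln(83)`, roughly 4.42, so any loss well below that means the network has learned something about the character distribution.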

A couple of details about training: 
* Within the batch loop, we detach the hidden state from its history by setting it to a new *tuple*, because the hidden state of an LSTM is a tuple of the hidden and cell states.
* We use [`clip_grad_norm_`](https://pytorch.org/docs/stable/_modules/torch/nn/utils/clip_grad.html) to help prevent exploding gradients.
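Both details can be sketched in isolation. The tiny linear layer, its constant weights, and the oversized inputs below are assumptions picked to force a large gradient; the sketch uses `.detach()`, which serves the same purpose here as the `.data` access in the training loop.

```python
import torch
from torch import nn

toy_net = nn.Linear(3, 1)
nn.init.constant_(toy_net.weight, 1.0)   # fixed weights so the result is deterministic
nn.init.zeros_(toy_net.bias)

toy_loss = (toy_net(torch.ones(8, 3) * 100.0) ** 2).mean()  # deliberately large loss
toy_loss.backward()

max_norm = 5.0
# Returns the total gradient norm measured before clipping.
norm_before = nn.utils.clip_grad_norm_(toy_net.parameters(), max_norm)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in toy_net.parameters()))

print(float(norm_before), float(norm_after))

# Detaching: the new tensor shares values but has no autograd history.
h = torch.ones(2, requires_grad=True) * 3
h_detached = h.detach()
print(h.requires_grad, h_detached.requires_grad)
```

Clipping rescales all gradients by a common factor so that their combined norm does not exceed `max_norm`; the direction of the update is unchanged.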

In [16]:
def train(
    net, data, epochs=10, batch_size=10, seq_length=50, lr=1e-3, clip=5, val_pct=0.1, print_every=10):

    net.train()
    optimizer = opt.Adam(net.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    
    # create training and validation data
    val_idx = int(len(data) * (1 - val_pct))
    train_data, val_data = data[:val_idx], data[val_idx:]
    
    counter = 0
    n_chars = len(net.chars)
    for epoch in range(epochs):
        # initialize hidden state
        h = net.init_hidden(batch_size)
        for x, y in get_batches(train_data, batch_size, seq_length):
            counter += 1
            
            # One-hot encode our data and make them Torch tensors
            x = one_hot_encode(x, n_chars)
            inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
            inputs, targets = inputs.to(device), targets.to(device)

            # Creating new variables for the hidden state, otherwise
            # we'd backprop through the entire training history
            h = tuple([each.data for each in h])

            # zero accumulated gradients
            optimizer.zero_grad()
            
            # get the output from the model
            output, h = net(inputs.float(), h)
            
            # calculate the loss and perform backprop
            loss = criterion(output, targets.view(batch_size * seq_length))
            loss.backward()

            # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs / LSTMs.
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            optimizer.step()
            
            # loss stats
            if counter % print_every == 0:
                # Get validation loss
                val_h = net.init_hidden(batch_size)
                val_losses = []
                net.eval()
                for x, y in get_batches(val_data, batch_size, seq_length):
                    # One-hot encode our data and make them Torch tensors
                    x = one_hot_encode(x, n_chars)
                    inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
                    
                    # Creating new variables for the hidden state, otherwise
                    # we'd backprop through the entire training history
                    val_h = tuple([each.data for each in val_h])
                    
                    inputs, targets = inputs.to(device), targets.to(device)
                    output, val_h = net(inputs.float(), val_h)
                    val_loss = criterion(output, targets.view(batch_size * seq_length))
                    val_losses.append(val_loss.item())
                
                net.train() # reset to train mode after iterating through validation data
                
                print(f'Epoch : {epoch + 1:02d}/{epochs} ... '
                      f'Step : {counter} ... ',
                      f'Loss : {loss.item():.4f} , Val Loss : {np.mean(val_losses):.4f}')

## Instantiating the model

Now we can actually train the network. First we create the network itself with some given hyperparameters. Then we define the mini-batch size and sequence length, and start training!

In [19]:
# define and print the net
n_hidden = 512
n_layers = 2
batch_size = 128
seq_length = 100
n_epochs = 20 

net = CharRNN(chars, n_hidden, n_layers).to(device)
print(net)

CharRNN(
  (lstm): LSTM(83, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5)
  (fc): Linear(in_features=512, out_features=83, bias=True)
)


In [20]:
# train the model
train(net, encoded_text, n_epochs, batch_size,
      seq_length, lr=1e-3, print_every=20)

Epoch : 01/20 ... Step : 20 ...  Loss : 3.1418 , Val Loss : 3.1286
Epoch : 01/20 ... Step : 40 ...  Loss : 3.1093 , Val Loss : 3.1187
Epoch : 01/20 ... Step : 60 ...  Loss : 3.1159 , Val Loss : 3.1146
Epoch : 01/20 ... Step : 80 ...  Loss : 3.1176 , Val Loss : 3.1040
Epoch : 01/20 ... Step : 100 ...  Loss : 3.0590 , Val Loss : 3.0449
Epoch : 01/20 ... Step : 120 ...  Loss : 2.8640 , Val Loss : 2.8341
Epoch : 02/20 ... Step : 140 ...  Loss : 2.6406 , Val Loss : 2.5894
Epoch : 02/20 ... Step : 160 ...  Loss : 2.5138 , Val Loss : 2.4648
Epoch : 02/20 ... Step : 180 ...  Loss : 2.4251 , Val Loss : 2.3920
Epoch : 02/20 ... Step : 200 ...  Loss : 2.3603 , Val Loss : 2.3285
Epoch : 02/20 ... Step : 220 ...  Loss : 2.2783 , Val Loss : 2.2702
Epoch : 02/20 ... Step : 240 ...  Loss : 2.2517 , Val Loss : 2.2181
Epoch : 02/20 ... Step : 260 ...  Loss : 2.1687 , Val Loss : 2.1657
Epoch : 03/20 ... Step : 280 ...  Loss : 2.1733 , Val Loss : 2.1222
Epoch : 03/20 ... Step : 300 ...  Loss : 2.1069 , Va

Epoch : 18/20 ... Step : 2420 ...  Loss : 1.2501 , Val Loss : 1.3077
Epoch : 18/20 ... Step : 2440 ...  Loss : 1.2436 , Val Loss : 1.3088
Epoch : 18/20 ... Step : 2460 ...  Loss : 1.2500 , Val Loss : 1.3070
Epoch : 18/20 ... Step : 2480 ...  Loss : 1.2444 , Val Loss : 1.3044
Epoch : 18/20 ... Step : 2500 ...  Loss : 1.2306 , Val Loss : 1.3079
Epoch : 19/20 ... Step : 2520 ...  Loss : 1.2566 , Val Loss : 1.3060
Epoch : 19/20 ... Step : 2540 ...  Loss : 1.2583 , Val Loss : 1.3029
Epoch : 19/20 ... Step : 2560 ...  Loss : 1.2460 , Val Loss : 1.3005
Epoch : 19/20 ... Step : 2580 ...  Loss : 1.2659 , Val Loss : 1.3013
Epoch : 19/20 ... Step : 2600 ...  Loss : 1.2308 , Val Loss : 1.3006
Epoch : 19/20 ... Step : 2620 ...  Loss : 1.2142 , Val Loss : 1.3031
Epoch : 19/20 ... Step : 2640 ...  Loss : 1.2342 , Val Loss : 1.2943
Epoch : 20/20 ... Step : 2660 ...  Loss : 1.2471 , Val Loss : 1.2990
Epoch : 20/20 ... Step : 2680 ...  Loss : 1.2408 , Val Loss : 1.2933
Epoch : 20/20 ... Step : 2700 ... 

In [21]:
checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

torch.save(checkpoint, 'char_rnn.pth')

## Making Predictions

Now that the model is trained, we'll want to sample from it and make predictions about next characters! To sample, we pass in a character and have the network predict the next character. Then we take that character, pass it back in, and get another predicted character. Just keep doing this and you'll generate a bunch of text! The output of the network is a vector of logits, so we apply softmax to get the probability distribution over all possible characters. We can add randomness to the generated text by sampling the next character according to these probabilities, optionally restricted to the `top_k` most likely characters.
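The sampling step can be sketched with NumPy alone. The logits below are made-up scores for a 5-character vocabulary, and `k = 2` is an arbitrary choice; this is an illustration of softmax followed by top-k sampling rather than the exact code used in `predict`.

```python
import numpy as np

rng = np.random.default_rng(0)

toy_logits = np.array([2.0, 1.0, 0.1, -1.0, -3.0])   # made-up scores for 5 characters
toy_probs = np.exp(toy_logits - toy_logits.max())
toy_probs /= toy_probs.sum()                          # softmax

k = 2
top_idx = np.argsort(toy_probs)[-k:]                  # indices of the k most likely characters
top_probs = toy_probs[top_idx] / toy_probs[top_idx].sum()   # renormalize over the top k

draws = rng.choice(top_idx, size=1000, p=top_probs)
print(sorted(set(draws.tolist())))
```

Restricting the draw to the top `k` characters trades diversity for coherence: unlikely characters can never be picked, but the output is still not fully deterministic.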

In [31]:
def predict(net, char, h=None, top_k=None):
    '''Given a character, predict the next character and the new hidden state.'''
    # Start from a zero hidden state if none is given
    if h is None:
        h = net.init_hidden(1)

    # tensor inputs
    x = np.array([[char_to_idx[char]]])
    x = one_hot_encode(x, len(net.chars))
    inputs = torch.from_numpy(x).to(device)

    # detach hidden state from history
    h = tuple([each.data for each in h])
    # get the output of the model
    out, h = net(inputs.float(), h)

    # get the character probabilities
    p = F.softmax(out, dim=1).data
    p = p.cpu()

    # get top characters
    # Use all characters for sampling
    if top_k is None:
        top_ch = np.arange(len(net.chars))
    # sample from the top_k most likely characters
    else:
        p, top_ch = p.topk(top_k)
        top_ch = top_ch.numpy().squeeze()

    # select the likely next character with some element of randomness
    p = p.numpy().squeeze()
    # p / p.sum() renormalizes the probs when sampling from the top_k characters
    char = np.random.choice(top_ch, p=p / p.sum())

    # return the predicted character and the hidden state
    return idx_to_char[char], h

### Priming and generating text 

Typically you'll want to prime the network so you can build up a hidden state. Otherwise the network will start out generating characters at random. In general the first bunch of characters will be a little rough since it hasn't built up a long history of characters to predict from.

In [32]:
def sample(net, size, prime='The', top_k=None):    
    net.eval() # eval mode
    
    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    chars.append(char)
    
    # Now pass in the previous character and get a new one
    for i in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

In [33]:
print(sample(net, 1000, prime='Anna', top_k=5))

Anna, and would have been tried an and allowing that he was saying her. He saw the simple of the first significance of her studies.

"Well, her would
be anything to be doing."

"Wait a
below, with my woman. I am saying.
And it all more the prayer and me who has not to be all, the frost and must be asking at the
significance of this minute, that I shall step it
in anyone of anything. And to meet your soul of the way of me out at him and some thing in the sense. Well and there. I don't say to the same thing."

"An thoughts in the same talks of the meetuness of the
sole, and with me in a person, and I am trying to see her for you with this province in his house, with her than it was a mistake, which seem and the sound of a man should have been time and a meaning. There's no
marriage, that three men the contrary, with you, are so an official doubt and as to be," the same so talk to him to this well.

The carriage. She tried, but shriek on a law and
had been and her see he
was the stear of 

## Loading a checkpoint

In [35]:
# Load the model
checkpoint = torch.load('char_rnn.pth', map_location=device)
net = CharRNN(checkpoint['tokens'],
              n_hidden=checkpoint['n_hidden'],
              n_layers=checkpoint['n_layers'])
net.load_state_dict(checkpoint['state_dict'])
net = net.to(device)
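Note that `predict` looks characters up in the global `char_to_idx` and `idx_to_char` dictionaries. When the checkpoint is loaded in a fresh session, those mappings should be rebuilt from the saved tokens rather than from `list(set(text))`, whose order can differ between runs. A minimal sketch, where `saved_tokens` is a hypothetical stand-in for `checkpoint['tokens']`:

```python
# Hypothetical stand-in for checkpoint['tokens']; the real list comes from torch.load.
saved_tokens = ['b', 'a', 'n', ' ']

# Index i in the saved list is the output unit the network learned for that character.
restored_idx_to_char = dict(enumerate(saved_tokens))
restored_char_to_idx = {ch: i for i, ch in restored_idx_to_char.items()}

encoded = [restored_char_to_idx[ch] for ch in 'banana']
print(encoded)
```

With mappings restored this way, the indices fed to the network match the ones it was trained on.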

In [36]:
# Print out samples
print(sample(net, 2000, top_k=5, prime="And Levin said"))

And Levin said,
"Yes, as the steps, went to the sockety," said Stepan Arkadyevitch. "What doe nothing too," said Levin.

"I shall not have heard the stead. You want a petty time to bar eres, and when I was nut into a state of superfice, and I cannot carry it. Why, are it
were, there is something beside her important."

"No, you'd not see
your feelings."

"What do you know that her means, as that this was it."

"Well, will you know how to be dead? What's an and served my
side of the morning?"

"Well, what do I was a speak, and I do you say that I don't know. You must be an allow to the crass."

"I'm alreasy. I cannot come to me."

Alexey Alexandrovitch asked Stepan
Arkadyevitch.

"Yes, the country," she said, with the churk,
and she could not take simple whee
she had the corner when he saw with his back. He had not come off that a state of pression in a part of that her who was not her. The marsh of her propision was heard of his words. His hand.

"Yes, it was in time to discover it... 