# RNN next character prediction

### Mehrdad Yazdani
### September 22, 2018

A Colab notebook is available online to play with:

https://colab.research.google.com/drive/1achzCaBFBHputcqXrw_o_He5UNjx3hFG


We will look at using RNNs in PyTorch for the task of predicting the next character given the previous characters. This is just a toy example, but the hope is to understand the basics of the architectures and training procedures.

We will only consider one input sequence and one output sequence:

- **Input sequence:** hihell
- **Output sequence:** ihello

One way of thinking about how this works is that the machine learning algorithm (ML) first sees the character "h" and tries to guess what should follow:

"h" → ML → "?"

In the ideal case, if it has learned, the ML will output "i". If we do this for all the characters, then at the end of this exercise we would like the ML algorithm to have this property:

"h" → ML →"i" <br>
"i"  → ML → "h"<br>
"h" → ML → "e"<br>
"e" → ML → "l"<br>
"l"  → ML → "l"<br>
"l"  → ML → "o"<br>
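
The input/target pairs above are simply the string "hihello" shifted by one position. A minimal sketch of that construction (the names `text`, `input_seq`, and `target_seq` are illustrative and do not appear elsewhere in this notebook):

```python
# Build next-character training pairs by shifting the text one position.
text = "hihello"

input_seq = text[:-1]   # every character except the last
target_seq = text[1:]   # every character except the first

for current_char, next_char in zip(input_seq, target_seq):
    print(f'"{current_char}" -> ML -> "{next_char}"')
```

Each input character is paired with the character that follows it, which is exactly the mapping listed above.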




- It's important to clarify that this example demonstrates *memorization* and not learning: the network has simply memorized what it has seen and regurgitates it. 


- To truly test if a network has learned and not just memorized, we should give the network *new* sequences that it hasn't seen before. 


- This notebook is mostly lifted and modified from the excellent tutorials by Sung Kim:
https://docs.google.com/presentation/d/17VUX7YXhMkJrqO5gNGh6EE5gzBpY-BF9IrfVKcFIb3A/edit#slide=id.g27c9a844e4_157_9

In [1]:
# needed to use pytorch in colab
!pip install torch

Collecting torch
  Downloading https://files.pythonhosted.org/packages/49/0e/e382bcf1a6ae8225f50b99cc26effa2d4cc6d66975ccf3fa9590efcbedce/torch-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (519.5MB)
Installing collected packages: torch
Successfully installed torch-0.4.1


In [1]:
import sys
import torch
import torch.nn as nn
# Since PyTorch 0.4, Variable is a thin wrapper that returns a plain tensor
from torch.autograd import Variable

### Encoding characters with numbers

- How do we encode characters as numbers?

"h" → 34 <br>
"e" → 12 <br>
"l"  → 43 <br>
"o"  → 9 <br>

- Every time we see the character "h" we treat it as if it were the number 34 

    - If we saw the character "e" we treat it as 12 and so on. 



- The problem with this encoding scheme is that it implies a specific ordering!


- That is, because "h" has been assigned to 34 and "e" to 12, it implies that "h" is somehow bigger than "e" by 22 units. 


- But we know that the characters are all equally important and there is no quantitative measure of one character being bigger or more important than another 
    - (though some entropic measures may guide us as to how to efficiently order/code characters).

A popular way to deal with encoding characters instead is to treat all characters as equally important and assign a one-hot encoding scheme. It is easiest to illustrate this:

"h" → 1000 <br>
"e" → 0100 <br>
"l" → 0010 <br>
"o" → 0001 <br>

You can think of each of these mappings as a "1-hot code" represented as a 4 dimensional binary vector. These codes can be assigned in Python as lists:

In [2]:
# One hot encoding for each char in 'hello'
h = [1, 0, 0, 0]
e = [0, 1, 0, 0]
l = [0, 0, 1, 0]
o = [0, 0, 0, 1]
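
Typing the codes by hand works for four characters, but the same table can be generated from a vocabulary list. The sketch below is one way to do that in plain Python; `vocab`, `one_hot`, and `char_to_code` are hypothetical names introduced only for illustration.

```python
# Build one-hot codes for a vocabulary instead of typing them by hand.
vocab = ["h", "e", "l", "o"]  # same order as the table above

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    return [1 if i == index else 0 for i in range(size)]

char_to_code = {ch: one_hot(i, len(vocab)) for i, ch in enumerate(vocab)}

# Encode the word "hello" as a list of five 4-dimensional vectors.
encoded_hello = [char_to_code[ch] for ch in "hello"]
print(encoded_hello)
```

The length of each vector equals the vocabulary size, so adding a character to `vocab` adds one dimension to every code.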


Since our one-hot vectors are only 4 dimensional, we need the `input_size` of the RNN to be 4. The `hidden_size` is the dimension of the hidden state. 

### Elman RNN

![elman](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Recurrent_neural_network_unfold.svg/1000px-Recurrent_neural_network_unfold.svg.png)

The `nn.RNN` model is the classic vanilla RNN, also known as the Elman RNN. It computes the hidden state $h_{t}$ at time $t$ based on the current input $x_{t}$ and the previous hidden state $h_{(t-1)}$. 

$h_{t} = \tanh(w_{ih}x_{t} + b_{ih} + w_{hh}h_{(t-1)} + b_{hh})$

- $w_{ih}$ and $b_{ih}$ are the weights and biases associated with the input $x_{t}$ at time $t$
- $w_{hh}$ and $b_{hh}$ are the weights and biases associated with the previous hidden state $h_{(t-1)}$
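
To make the update rule concrete, the sketch below computes one step by hand and compares it with what `nn.RNN` returns. It assumes a single-layer, unidirectional RNN, for which PyTorch exposes the parameters as `weight_ih_l0`, `bias_ih_l0`, `weight_hh_l0`, and `bias_hh_l0`. The seed and the names `x_t`, `h_prev`, and `h_manual` are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

rnn = nn.RNN(input_size=4, hidden_size=2)

x_t = torch.tensor([1.0, 0.0, 0.0, 0.0])  # one-hot "h"
h_prev = torch.zeros(2)                    # previous hidden state h_{t-1}

# Parameters of the single layer: shapes (2, 4), (2,), (2, 2), (2,)
w_ih, b_ih = rnn.weight_ih_l0, rnn.bias_ih_l0
w_hh, b_hh = rnn.weight_hh_l0, rnn.bias_hh_l0

# h_t = tanh(w_ih x_t + b_ih + w_hh h_{t-1} + b_hh)
h_manual = torch.tanh(w_ih @ x_t + b_ih + w_hh @ h_prev + b_hh)

# The same step through nn.RNN; input layout is (seq_len=1, batch=1, input_size=4).
_, h_module = rnn(x_t.view(1, 1, -1), h_prev.view(1, 1, -1))

print(torch.allclose(h_manual, h_module.view(-1), atol=1e-6))
```

If the two agree, the module is doing nothing more than the formula above, applied once per time step.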




<img src="https://www.walletfox.com/course/qtconcurrentmatrixvectorSource/matvec1_img.png" alt="Drawing" style="width: 900px;"/>

Important parameters that need to be specified:

- input_size: The number of expected features in the input `x`
- hidden_size: The number of features in the hidden state `h`

Once `nn.RNN` has been defined, it takes two inputs (`x` and the initial state `h_0`) and returns two outputs (`output`, the set of states, and the final state `h_n`). The inputs are:

- input of shape (seq_len, batch, input_size), or (batch, seq_len, input_size) when `batch_first=True`: tensor containing the features of the input sequence.
- h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. Defaults to zero if not provided. For shallow and unidirectional RNNs (the default), `num_layers = 1` and `num_directions = 1`

The outputs are:

- `output` of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_k) from the last layer of the RNN, for each k. For unidirectional RNN (the default), `num_directions = 1`. 
- `h_n` (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for `n = seq_len.` For shallow and unidirectional RNNs (the default), `num_layers = 1` and `num_directions = 1`
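
The shape rules above are easiest to see when `num_layers` and `num_directions` are not both 1. The following sketch uses a stacked, bidirectional RNN purely to illustrate the shapes; these settings are not used anywhere else in this notebook.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 5, 3, 4, 2
num_layers, num_directions = 2, 2  # stacked and bidirectional, for illustration

rnn = nn.RNN(input_size, hidden_size, num_layers=num_layers, bidirectional=True)

# Default layout (batch_first=False): (seq_len, batch, input_size)
x = torch.randn(seq_len, batch, input_size)
output, h_n = rnn(x)  # h_0 defaults to zeros when it is omitted

print(output.shape)  # (seq_len, batch, num_directions * hidden_size)
print(h_n.shape)     # (num_layers * num_directions, batch, hidden_size)
```

Note that `output` only covers the last layer, while `h_n` holds the final state of every layer and direction.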

For each element in the input sequence, each layer applies the $\tanh$ update given above. We can now create the RNN:

In [3]:
# The RNN cell will take two sets of inputs:
#  - inputs x with 4 features; we specify the number of features with input_size
#  - hidden state h with 2 features; we specify the number of hidden features 
#    with hidden_size = 2
     
elman_rnn = nn.RNN(input_size=4, hidden_size=2, batch_first=True)

The above line has instantiated an `elman_rnn` object that processes *sequences* of data, taking a hidden state and input data. 





### Prep the initial hidden state tensor
When we start the RNN, we need to select something for the initial hidden state $h_0$. Here let's draw it from a standard normal distribution.

In [4]:
# To make a 2 dimensional hidden state vector, we make the initial hidden state
# h_0 with the tensor size specified as:
#
#     (num_layers * num_directions, batch_size, hidden_size) 
#
# (this shape is the same whether or not batch_first=True; only the input
#  and output tensors swap their first two dimensions)

hidden = Variable(torch.randn(1, 1, 2))
hidden.size()

torch.Size([1, 1, 2])

- The hidden state will have two features! 

- Since we only have 1 hidden layer, and only batch size of 1 (we only have one sequence), we expect the hidden state vector to be a 1 x 1 x 2 tensor. Indeed,  `hidden.size()` above verifies that we have initialized the tensor with the correct shape. 



### Prep the initial input sequence character

Now let's propagate an input character through the RNN cell. We will first convert our list of one-hot encoded characters to a PyTorch tensor. 

In [5]:
# Propagate input through RNN
# Input: (batch, seq_len, input_size) when batch_first=True
input_characters = Variable(torch.Tensor([h, e, l, l, o]))

input_characters.size()

torch.Size([5, 4])

Since our one-hot encoded vectors are 4 dimensional, and our "hello" sequence consists of 5 characters, we expect the input tensor to have size 5 x 4. `input_characters.size()` confirms that this is indeed the case. 


Note that our `input_characters` tensor is missing the batch dimension. Since we only have 1 sequence (so a batch size of 1), the tensor needs to have size `[1, 5, 4]`. We can easily reshape it using the `.view(1,5,4)` method. Or, we could just do `.view(1,5,-1)`, where the -1 tells PyTorch to infer the remaining dimension for us.
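
The reshaping options just described can be compared side by side. This sketch builds a stand-in for the 5 x 4 one-hot tensor by picking rows of an identity matrix; `unsqueeze(0)` is shown as a third option that is not used elsewhere in this notebook.

```python
import torch

# Stand-in for the 5 x 4 one-hot "hello" tensor: rows of the identity matrix.
input_characters = torch.eye(4)[[0, 1, 2, 2, 3]]  # h, e, l, l, o

with_view = input_characters.view(1, 5, 4)
with_inferred = input_characters.view(1, 5, -1)  # -1 lets PyTorch infer the 4
with_unsqueeze = input_characters.unsqueeze(0)   # insert a new dim at position 0

print(with_view.shape, with_inferred.shape, with_unsqueeze.shape)
```

All three produce the same 1 x 5 x 4 tensor; none of them copies or reorders the underlying data.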



### Inference with RNN (aka "forward pass"): one character at a time

OK, let's **finally** take our initial hidden state and the first encoded character and pass them to the Elman RNN that we defined as `elman_rnn`:

In [6]:
out, hidden = elman_rnn(input_characters[0,:].view(1,1,-1), hidden)
print("Encoded character size:", input_characters[0,:].size(), 
      "\nhidden size:", hidden.size() ,
      "\nout size:", out.size())

Encoded character size: torch.Size([4]) 
hidden size: torch.Size([1, 1, 2]) 
out size: torch.Size([1, 1, 2])


Let's unpack what we did here


- `elman_rnn` expects two sets of input tensors: 
    - the input character 
    - the hidden state
    
  
- Input character sequences have 5 characters total and each character is encoded as a 4 dimensional one-hot coded vector. 
    - We have stored it all in the 5x4 tensor `input_characters`
    
    
- We pass only the *first* character. 
    - The first character can be accessed using "slice" indexing: `input_characters[0,:]` → this says take the first character (out of 5) and every feature of our one-hot encoded vector




- When we slice this way, you'll notice that you will only get a 1D tensor that has 4 elements.


- But the RNN expects a *sequential tensor* with the shape (batch, seq_len, input_size), since we created it with `batch_first=True` 


- So we have to reshape this 1D tensor into 3D by introducing some dummy dimensions


- This reshaping can be done with the `.view()` method.







Below we will iterate through each 1-hot-encoded character and see the output and hidden sizes of the RNN outputs.

In [7]:
for i, one_hot_encoded_encoded in enumerate(input_characters):
    encoded_character = one_hot_encoded_encoded.view(1, 1, -1)
    # Input: (batch, seq_len, input_size) when batch_first=True
    out, hidden = elman_rnn(encoded_character, hidden)
    print("Character", i, "tensor sizes:")
    print("  encoded character size:", encoded_character.size(), 
          "\n  hidden size:", hidden.size() ,
          "\n  out size:", out.size())

Character 0 tensor sizes:
  encoded character size: torch.Size([1, 1, 4]) 
  hidden size: torch.Size([1, 1, 2]) 
  out size: torch.Size([1, 1, 2])
Character 1 tensor sizes:
  encoded character size: torch.Size([1, 1, 4]) 
  hidden size: torch.Size([1, 1, 2]) 
  out size: torch.Size([1, 1, 2])
Character 2 tensor sizes:
  encoded character size: torch.Size([1, 1, 4]) 
  hidden size: torch.Size([1, 1, 2]) 
  out size: torch.Size([1, 1, 2])
Character 3 tensor sizes:
  encoded character size: torch.Size([1, 1, 4]) 
  hidden size: torch.Size([1, 1, 2]) 
  out size: torch.Size([1, 1, 2])
Character 4 tensor sizes:
  encoded character size: torch.Size([1, 1, 4]) 
  hidden size: torch.Size([1, 1, 2]) 
  out size: torch.Size([1, 1, 2])


- We see that for every character, the input tensor has been reshaped as a 4 dimensional 1-hot-encoded vector with tensor size 1x1x4. 


- The RNN cell then produces *two* tensors, `out` and `hidden`. `hidden` is the state that the RNN carries forward into the next time step. The `out` and `hidden` tensors have the same shape. 


- This is because, for a single time step and a single layer, `out` holds the same values as `hidden`. We can verify this by going through the RNN cell again: 

In [8]:
for i, one_hot_encoded_encoded in enumerate(input_characters):
    encoded_character = one_hot_encoded_encoded.view(1, 1, -1)
    # Input: (batch, seq_len, input_size) when batch_first=True
    out, hidden = elman_rnn(encoded_character, hidden)
    if torch.all(torch.eq(out, hidden)).item() == 1:
      print("Character", i, "hidden and out RNN are equal")
    else:
      print("Character", i, "hidden and out RNN are not equal")



Character 0 hidden and out RNN are equal
Character 1 hidden and out RNN are equal
Character 2 hidden and out RNN are equal
Character 3 hidden and out RNN are equal
Character 4 hidden and out RNN are equal


In [9]:
torch.eq(out, hidden).numpy()

array([[[1, 1]]], dtype=uint8)

Indeed! We see that the `hidden` and `out` tensors that the RNN cell returns are not only equal in shape but also in value. 



### Inference with RNN: one sequence at a time

Instead of going through the sequence one character at a time with the for loop, we can go through the sequence in one shot:

In [16]:
input_characters = input_characters.view(1, 5, -1)
out, hidden = elman_rnn(input_characters, hidden)
print("sequence of encoded character size",input_characters.size(), 
      "\nhidden size", hidden.size(), 
      "\nout size", out.size())

sequence of encoded character size torch.Size([1, 5, 4]) 
hidden size torch.Size([1, 1, 2]) 
out size torch.Size([1, 5, 2])


We again stress the distinction between the `out` and `hidden` outputs of the RNN. The distinction is that the `out` is the hidden states for every time step whereas `hidden` is the hidden state for just the last time step. So the last time step for `out` should be identical to `hidden`. Let's check it out!

In [17]:
out[:,-1,:] # the last element in the *sequence* of outputs of the RNN

tensor([[ 0.9054,  0.4019]])

In [18]:
hidden

tensor([[[ 0.9054,  0.4019]]])

Yep, they both have the same values! We could have also checked their values are equal using the `torch.eq` method:

In [20]:
torch.eq(out[:,-1,:], hidden)

tensor([[[ 1,  1]]], dtype=torch.uint8)
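
The one-shot call and the character-by-character loop should produce the same hidden states when they start from the same `h_0`. A small self-contained check, assuming a freshly created RNN with an arbitrary seed (so the numbers differ from the outputs printed above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

rnn = nn.RNN(input_size=4, hidden_size=2, batch_first=True)
sequence = torch.eye(4)[[0, 1, 2, 2, 3]].view(1, 5, 4)  # one-hot "hello"
h_0 = torch.randn(1, 1, 2)

# One shot: the whole sequence in a single call.
out_all, h_last = rnn(sequence, h_0)

# Step by step: feed one character at a time and carry the hidden state.
hidden = h_0
step_outputs = []
for t in range(sequence.size(1)):
    out_t, hidden = rnn(sequence[:, t:t + 1, :], hidden)
    step_outputs.append(out_t)
out_loop = torch.cat(step_outputs, dim=1)  # stack along the sequence dimension

print(torch.allclose(out_all, out_loop, atol=1e-5))
print(torch.allclose(h_last, hidden, atol=1e-5))
```

The comparison uses `torch.allclose` rather than `torch.eq` because the two code paths may differ by floating-point rounding.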

### Inference with RNN: iterating through multiple sequences

Now let's try multiple sequences so that the batch size is greater than 1. Here we will consider 3 sequences, each with the same length: "hello", "eolll", and "lleel".

In [21]:
# One cell RNN input_dim (4) -> output_dim (2). sequence: 5, batch 3
# a batch of 3 sequences: 'hello', 'eolll', 'lleel'
# shape = (3, 5, 4)
inputs = Variable(torch.Tensor([[h, e, l, l, o],
                                [e, o, l, l, l],
                                [l, l, e, e, l]]))

inputs.size()

torch.Size([3, 5, 4])

We see from `inputs.size()` that the `inputs` tensor has size 3x5x4. These three dimensions correspond to:
- dim 1: the number of sequences, 3 in this case
- dim 2: the length of each sequence (ie the number of elements/characters in each sequence).  We have 5 characters for each sequence
- dim 3: the number of features used to represent each character. Because we are using a 1-hot encoding scheme to represent each character and we only have 4 characters, the number of features is just 4.


OK, now that we have our inputs tensor set up, we need to initialize the hidden state as before. The big difference from before is that, because we have **three** sequences instead of one, we need to create three hidden states. 

In [22]:
# hidden : (num_layers * num_directions, batch, hidden_size) whether batch_first=True or False
hidden = Variable(torch.randn(1, 3, 2))
hidden.size()

torch.Size([1, 3, 2])

In other words, we have created 3 different hidden states, each of dimension 2. 

OK, now that we have our hidden states and inputs tensors set up, let's forward pass them to the Elman RNN!!

In [23]:
# Propagate input through RNN
# Input: (batch, seq_len, input_size) when batch_first=True
# B x S x I
out, hidden = elman_rnn(inputs, hidden)
print("batch input size", inputs.size(), "\nout size", out.size(), "\nhidden size", hidden.size())


batch input size torch.Size([3, 5, 4]) 
out size torch.Size([3, 5, 2]) 
hidden size torch.Size([1, 3, 2])


Let's unpack the outputs of the Elman RNN a bit:
- As discussed above, we have 3 sequences, each of length 5, and each element (ie, character) in the sequence has 4 features


- Because we have 3 sequences, we expect the RNN to have 3 outputs. Indeed we see that the first dimension of the `out` tensor has 3 elements. 


- Similarly, because each sequence has 5 elements/characters, we expect the `out` tensor to have a corresponding hidden state for each of these characters. Indeed we see that the second dimension of the `out` tensor has 5 elements.


- **Finally!** Because we have designed our RNN to have hidden states that are two dimensional, we expect two features for each element in each sequence. This is why we see that the third dimension of the `out` tensor is 2. 



- OK, what about the `hidden` tensor?


- Remember that the `hidden` tensor that the RNN returns is just the *last* output of the hidden state from the last input character. Because we have 3 sequences, we expect to have 3 of these hidden states. And because we have 2 features in our hidden state, we expect these 3 hidden states to have 2 dimensions. 



- And as before, we expect that the *last* element in the `out` tensor should be equal to the `hidden` tensor for every sequence. Below we show that this is indeed the case:

In [24]:
out[:,-1,:] # note that we are picking the last element of output hidden states for all sequences and all features

tensor([[ 0.9032,  0.3919],
        [ 0.8803,  0.3029],
        [ 0.9287,  0.4453]])

In [25]:
hidden

tensor([[[ 0.9032,  0.3919],
         [ 0.8803,  0.3029],
         [ 0.9287,  0.4453]]])

As before, we can use `torch.eq` to check for equality between these values:

In [26]:
torch.eq(out[:,-1,:], hidden)

tensor([[[ 1,  1],
         [ 1,  1],
         [ 1,  1]]], dtype=torch.uint8)

Everything checks out!

We can also choose not to have the batch size as the first dimension:

In [27]:
# One cell RNN input_dim (4) -> output_dim (2)
elman_rnn = nn.RNN(input_size=4, hidden_size=2)

# The given dimensions dim0 and dim1 are swapped.
inputs = inputs.transpose(dim0=0, dim1=1)
# Propagate input through RNN
# Input: (seq_len, batch_size, input_size) when batch_first=False (default)
# S x B x I
out, hidden = elman_rnn(inputs, hidden)
print("batch input size", inputs.size(), "out size", out.size())

batch input size torch.Size([5, 3, 4]) out size torch.Size([5, 3, 2])
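
The two layouts only differ in how the tensors are arranged, not in what is computed. The sketch below checks this by giving a `batch_first=True` RNN and a default RNN the same weights (copied with `load_state_dict`) and comparing their outputs. The module names `rnn_batch_first` and `rnn_seq_first` are illustrative, and unlike the cell above, which creates a new randomly initialized RNN, this sketch shares weights on purpose.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

rnn_batch_first = nn.RNN(input_size=4, hidden_size=2, batch_first=True)
rnn_seq_first = nn.RNN(input_size=4, hidden_size=2)  # batch_first=False (default)
rnn_seq_first.load_state_dict(rnn_batch_first.state_dict())  # share the weights

inputs = torch.randn(3, 5, 4)  # (batch, seq_len, input_size)
hidden = torch.randn(1, 3, 2)  # same layout for both modules

out_bf, h_bf = rnn_batch_first(inputs, hidden)
out_sf, h_sf = rnn_seq_first(inputs.transpose(0, 1), hidden)

print(out_bf.shape, out_sf.shape)
print(torch.allclose(out_bf, out_sf.transpose(0, 1), atol=1e-5))
print(torch.allclose(h_bf, h_sf, atol=1e-5))
```

The hidden state keeps the shape (num_layers * num_directions, batch, hidden_size) in both cases, which is why the same `hidden` tensor can be passed to either module.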


## Learning 1-batch sequence with RNN one element at a time

Let's now apply an RNN to *learn* a sequence. We will only consider one input sequence and one output sequence:

- Input sequence: hihell
- Output sequence: ihello

We will design the 1-hot-encoding by first assigning an index to each character:
- "h" -> 0
- "i" -> 1
- "e" -> 2
- "l" -> 3
- "o" -> 4

So in other words we are living in a world that has only these 5 characters. 

In [28]:
torch.manual_seed(777)  # reproducibility
#            0    1    2    3    4
idx2char = ['h', 'i', 'e', 'l', 'o']

We now define our input sequence `x_data` ("hihell") and our output sequence `y_data` ("ihello").

We also convert our characters to one-hot-encoded vectors using a simple lookup table.

In [29]:
# Teach hihell -> ihello
x_data = [0, 1, 0, 2, 3, 3]   # hihell
y_data = [1, 0, 2, 3, 3, 4]   # ihello

one_hot_lookup = [[1, 0, 0, 0, 0],  # 0
                  [0, 1, 0, 0, 0],  # 1
                  [0, 0, 1, 0, 0],  # 2
                  [0, 0, 0, 1, 0],  # 3
                  [0, 0, 0, 0, 1]]  # 4


x_one_hot = [one_hot_lookup[x] for x in x_data]

# As we have one batch of samples, we will change them to variables only once
inputs = Variable(torch.Tensor(x_one_hot))
labels = Variable(torch.LongTensor(y_data))

In [30]:
inputs.size(), labels.size()

(torch.Size([6, 5]), torch.Size([6]))

The `inputs` tensor is size 6x5 because there are 6 characters and each character has 5 features. The output `labels` tensor is just the character "classes" (ie, the character indices) we want to predict.

The RNN we are going to use for predicting the next character is going to use the hidden state directly as its prediction. Normally the hidden state would be passed to another layer (like a fully connected layer or even another RNN), but in this example we are just going to use it directly. The advantage of using the hidden state directly and not introducing additional layers is that we limit the number of parameters we have to learn. The disadvantage is that we expect the hidden state to *both* encode the history we have observed (its primary function) *and* predict the next character. 

Regardless of the demands we are placing on the hidden state, because this is such an easy problem (there are only 5 characters and only 1 sequence the network has to memorize), we expect the hidden state to be able to do this.

One issue to keep in mind though is because we are using the hidden state to directly predict the next character, this constrains us to have the size of the hidden state be the same as the number of classes in our outputs (5). Let's define the different parameters of the RNN below:
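
A further consequence of this design is worth keeping in mind when reading the loss values: the hidden state is a tanh output, so every class score lies strictly between -1 and 1 and the cross-entropy cannot reach zero. The sketch below estimates the floor under the best-case assumption that the correct class sits at +1 and the other four at -1. This is a back-of-the-envelope bound, not something computed by the original notebook.

```python
import math

num_classes = 5

# Best case for tanh-bounded scores: correct class at +1, the rest at -1.
# loss = -log( e^{1} / (e^{1} + (num_classes - 1) * e^{-1}) )
#      =  log( 1 + (num_classes - 1) * e^{-2} )
loss_floor = math.log1p((num_classes - 1) * math.exp(-2.0))

print("per-character floor: %.3f" % loss_floor)
print("summed over 6 characters: %.3f" % (6 * loss_floor))
```

Because tanh never reaches exactly +1 or -1, the actual loss stays slightly above this floor. The training logs later in this notebook level off a little above these values, summed in the character-by-character loop and averaged in the whole-sequence version.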

In [31]:
num_classes = 5      # the number of possible classes we have (the labels tensors is between 0 and 4)
input_size = 5       # one-hot encoded vector dimensions
hidden_size = 5      # we use 5 dimensional hidden state vectors to directly predict the character
batch_size = 1       # we have one sentence and so one batch size
sequence_length = 1  # we have only one sequence and we will process the characters one by one
num_layers = 1       # we will have a simple one hidden layer RNN

OK, now we define our RNN class with the specific architecture that we want (as we usually do with PyTorch neural network models for training)

In [32]:
class Model(nn.Module):

    def __init__(self, input_size, hidden_size):
        super(Model, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size=self.input_size,
                          hidden_size=self.hidden_size, 
                          batch_first=True)

    def forward(self, hidden, x):
        # Reshape input to make sure the first dim is batch dimension
        x = x.view(batch_size, sequence_length, input_size)

        # Propagate input through RNN
        #   Input:  (batch, seq_len, input_size)
        #            since we only have 1 batch and are iterating a single 
        #            character at a time we expect the input tensor to have 
        #            shape: 1 x 1 x 5
        #   hidden: (num_layers * num_directions, batch, hidden_size)
        #            we only have 1 hidden layer and the RNN is unidirectional
        #            so the hidden tensor size should be 1 x 1 x 5              
        out, hidden = self.rnn(x, hidden)
        return hidden, out.view(-1, num_classes)

    def init_hidden(self):
        # Initialize the hidden state with zeros (a vanilla RNN has no cell state)
        # (num_layers * num_directions, batch, hidden_size)
        #  we only have 1 hidden layer and the RNN is unidirectional
        #  so the hidden tensor size should be 1 x 1 x 5   
        return Variable(torch.zeros(num_layers, batch_size, hidden_size))

Now we instantiate the model, define our loss criterion, and define the optimizer we want to use:

In [33]:
# Instantiate RNN model
model = Model(input_size, hidden_size)

# Set loss and optimizer function
# CrossEntropyLoss = LogSoftmax + NLLLoss
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

# TODO: try a learning-rate scheduler such as ReduceLROnPlateau
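
The comment `CrossEntropyLoss = LogSoftmax + NLLLoss` can be checked numerically. This sketch uses random scores in place of the RNN output; the seed and the names `scores` and `targets` are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

scores = torch.randn(6, 5)                  # 6 characters, 5 class scores each
targets = torch.tensor([1, 0, 2, 3, 3, 4])  # "ihello" as class indices

# One step: cross-entropy applied to the raw scores.
cross_entropy = nn.CrossEntropyLoss()(scores, targets)

# Two steps: log-softmax over the class dimension, then negative log-likelihood.
two_step = nn.NLLLoss()(nn.LogSoftmax(dim=1)(scores), targets)

print(torch.allclose(cross_entropy, two_step))
```

This is why the model returns raw scores and never applies a softmax itself: the loss does it internally.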

### Time to train the network!

We will loop through each epoch (100 of them), forward pass each individual character, then compute and accumulate the loss. Once the sequence is over, we compute the total loss and propagate the errors to update the network.  

In [34]:
for epoch in range(100):
    optimizer.zero_grad()
    loss = 0
    hidden = model.init_hidden()
    
    # iterate through each character and predict what
    # the next character should be.
    # using the true target label, compute and accumulate
    # the loss
    pred_string = ""
    for input, label in zip(inputs, labels):
        hidden, output = model(hidden, input)
        val, idx = output.max(1) # remember that we are using the hidden state
                                 # directly to make our prediction (and have 
                                 # reshaped appropriately in our Model class 
                                 # definition). We could also just as well use 
                                 # hidden state that we are returning as long
                                 # as we reshape it right: 
                                 # hidden.view(-1, num_classes).max(1) 
        
        pred_string += idx2char[idx.item()]  # .item() converts the 1-element tensor to a Python int
        # accumulate the loss
        loss += criterion(output, label.view(1))
    
    # ok we completed the sequence, lets backward prop and
    # update the network
    loss.backward()
    optimizer.step()
    
    # print every 20 epochs what the results look like
    if (epoch%20 == 0) or (epoch == 99):
      sys.stdout.write("predicted string: ")
      sys.stdout.write(pred_string)
      print(", epoch: %d, loss: %1.3f" % (epoch + 1, loss.item()))    
    
print("Learning finished!")    

predicted string: llllll, epoch: 1, loss: 10.155
predicted string: ihelll, epoch: 21, loss: 4.419
predicted string: ihello, epoch: 41, loss: 3.272
predicted string: ihello, epoch: 61, loss: 2.930
predicted string: ihello, epoch: 81, loss: 2.847
predicted string: ihello, epoch: 100, loss: 2.811
Learning finished!


Looks like we learned (memorized, really) the target sequence!!!

We can generalize this approach to multiple sequences: we would just have one more loop that would iterate through each sequence. All of our sequences could also have different lengths and it would not matter.
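
A rough sketch of that extra loop is shown below. It uses `nn.RNN` directly instead of the `Model` class, and the two training strings, the learning rate, and the number of epochs are all hypothetical choices made only to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

idx2char = ['h', 'i', 'e', 'l', 'o']
char2idx = {ch: i for i, ch in enumerate(idx2char)}

# Two hypothetical training strings of different lengths.
texts = ["hihello", "hello"]

rnn = nn.RNN(input_size=5, hidden_size=5, batch_first=True)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.05)

for epoch in range(50):
    optimizer.zero_grad()
    loss = 0
    for text in texts:                        # extra loop over sequences
        indices = [char2idx[ch] for ch in text]
        inputs = torch.eye(5)[indices[:-1]]   # one-hot inputs, (len - 1) x 5
        labels = torch.tensor(indices[1:])    # next-character targets
        hidden = torch.zeros(1, 1, 5)         # fresh state for each sequence
        for x, label in zip(inputs, labels):  # loop over characters
            out, hidden = rnn(x.view(1, 1, -1), hidden)
            loss = loss + criterion(out.view(-1, 5), label.view(1))
    loss.backward()
    optimizer.step()

print("final summed loss: %.3f" % loss.item())
```

Because each character is fed on its own, nothing in the loop depends on the sequences having equal length; the hidden state is simply reset at the start of each one.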

But often we are faced with learning sequences that always have the same length. In those cases it would be tedious and slow to iterate through yet another for loop. This is where we can update our RNN model to process not just a single character at a time, but the entire sequence. 

## Learning 1-batch sequence with RNN (entire sequence)

Now we are going to learn the sequence not character-by-character but the entire sequence at once. Another way of thinking about this is that we are learning batches of sequences. But since we only have 1 sequence, our batch size will be 1. 

In [35]:

sequence_length = 6  # Since the number of character in our sequence |ihello| == 6

We will similarly define the parameters of our model as variables below:

In [36]:
num_classes = 5      # the number of possible classes we have (the labels tensors is between 0 and 4)
input_size = 5       # one-hot encoded vector dimensions
hidden_size = 5      # we use 5 dimensional hidden state vectors to directly predict the character
batch_size = 1       # we have one sentence and so one batch size
sequence_length = 6  # we have only one sequence and we process all 6 characters at once
num_layers = 1       # we will have a simple one hidden layer RNN

Now we define our RNN class with the specific architecture that we want (as we usually do with PyTorch neural network models for training). 

In [37]:
class RNN(nn.Module):

    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super(RNN, self).__init__()

        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Use the constructor arguments rather than hard-coded sizes
        self.rnn = nn.RNN(input_size=self.input_size,
                          hidden_size=self.hidden_size,
                          num_layers=self.num_layers,
                          batch_first=True)

    def forward(self, x):
        # Initialize the hidden state with zeros (a vanilla RNN has no cell state)
        # (num_layers * num_directions, batch, hidden_size), regardless of batch_first
        h_0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))

        # Propagate input through RNN
        # Input: (batch, seq_len, input_size)
        # h_0: (num_layers * num_directions, batch, hidden_size)
        out, _ = self.rnn(x, h_0)

        # Flatten to (batch * seq_len, num_classes) for CrossEntropyLoss;
        # this works because hidden_size == num_classes
        return out.view(-1, self.num_classes)

In [38]:
# As we have one batch of samples, we will change them to variables only once
inputs = Variable(torch.Tensor(x_one_hot))
labels = Variable(torch.LongTensor(y_data))

In [39]:
inputs.size()

torch.Size([6, 5])

In [40]:
# Instantiate RNN model
rnn = RNN(num_classes, input_size, hidden_size, num_layers)

# Set loss and optimizer function
# CrossEntropyLoss = LogSoftmax + NLLLoss
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.05)

In [41]:
outputs = rnn(inputs.view(1,6,-1))

In [42]:
outputs.size()

torch.Size([6, 5])

In [43]:
labels.size()

torch.Size([6])

In [44]:
# Train the model
for epoch in range(100):
    outputs = rnn(inputs.view(1,6,-1))
    optimizer.zero_grad()
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    _, idx = outputs.max(1)
    idx = idx.data.numpy()
    result_str = [idx2char[c] for c in idx.squeeze()]
    if (epoch%20 == 0) or (epoch == 99):
      print("epoch: %d, loss: %1.3f" % (epoch + 1, loss.item()))
      print("Predicted string: ", ''.join(result_str))
      

print("Learning finished!")

epoch: 1, loss: 1.544
Predicted string:  lellll
epoch: 21, loss: 0.655
Predicted string:  ihelll
epoch: 41, loss: 0.563
Predicted string:  ihelll
epoch: 61, loss: 0.538
Predicted string:  ihello
epoch: 81, loss: 0.489
Predicted string:  ihello
epoch: 100, loss: 0.472
Predicted string:  ihello
Learning finished!


## Practice! RNN with Embedding and Output layers



In [45]:
x_data = [[0, 1, 0, 2, 3, 3]] 
# As we have one batch of samples, we will change them to variables only once
inputs = Variable(torch.LongTensor(x_data))
labels = Variable(torch.LongTensor(y_data))

In [46]:
embedding_size = 10  # embedding size

In [47]:
class Model(nn.Module):

    def __init__(self, hidden_size):    
        super(Model, self).__init__()
        self.hidden_size = hidden_size
        # input_size (5) doubles as the vocabulary size: one embedding vector per character
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.rnn = nn.RNN(input_size=embedding_size,
                          hidden_size=self.hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initialize the hidden state with zeros (a vanilla RNN has no cell state)
        # (num_layers * num_directions, batch, hidden_size)
        h_0 = Variable(torch.zeros(1, x.size(0), self.hidden_size))

        # x:   (batch, seq_len) of character indices
        # emb: (batch, seq_len, embedding_size), already in batch_first layout
        emb = self.embedding(x)
        # Propagate embedding through RNN
        # Input: (batch, seq_len, embedding_size)
        # h_0: (num_layers * num_directions, batch, hidden_size)
        out, _ = self.rnn(emb, h_0)
        # Map every hidden state to class scores: (batch, seq_len, num_classes)
        return self.fc(out)
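
The embedding layer replaces the one-hot lookup table used earlier: it maps each character index to a learned 10-dimensional vector. The sketch below shows the shapes involved and checks that an embedding lookup gives the same result as multiplying one-hot vectors by the embedding weight matrix. The seed and variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # arbitrary seed, only for repeatability

embedding = nn.Embedding(num_embeddings=5, embedding_dim=10)

indices = torch.tensor([[0, 1, 0, 2, 3, 3]])  # "hihell", shape (1, 6)
emb = embedding(indices)                      # shape (1, 6, 10)

# The same result via one-hot vectors times the embedding weight matrix.
one_hot = torch.eye(5)[indices]               # shape (1, 6, 5)
emb_via_matmul = one_hot @ embedding.weight   # (1, 6, 5) @ (5, 10) -> (1, 6, 10)

print(emb.shape)
print(torch.allclose(emb, emb_via_matmul))
```

The embedding output is already laid out as (batch, seq_len, embedding_size), so it can be passed to a `batch_first=True` RNN without any reshaping.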


In [48]:
# Instantiate RNN model
model = Model(hidden_size)
print(model)

# Set loss and optimizer function
# CrossEntropyLoss = LogSoftmax + NLLLoss
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)


Model(
  (embedding): Embedding(5, 10)
  (rnn): RNN(10, 5, batch_first=True)
  (fc): Linear(in_features=5, out_features=5, bias=True)
)


In [49]:
# Train the model
for epoch in range(100):
    outputs = model(inputs.view(1,-1)).view(-1,num_classes)
    optimizer.zero_grad()
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    _, idx = outputs.max(1)
    idx = idx.data.numpy()
    result_str = [idx2char[c] for c in idx.squeeze()]
    if (epoch%20 == 0) or (epoch == 99):
      print("epoch: %d, loss: %1.3f" % (epoch + 1, loss.item()))
      print("Predicted string: ", ''.join(result_str))

print("Learning finished!")

epoch: 1, loss: 1.544
Predicted string:  loeooo


epoch: 21, loss: 0.013
Predicted string:  ihello
epoch: 41, loss: 0.002
Predicted string:  ihello
epoch: 61, loss: 0.001
Predicted string:  ihello
epoch: 81, loss: 0.001
Predicted string:  ihello
epoch: 100, loss: 0.001
Predicted string:  ihello
Learning finished!
