# RNN next character prediction

### Mehrdad Yazdani
### November 12, 2018

A Colab notebook is available online to play with!

https://colab.research.google.com/drive/1achzCaBFBHputcqXrw_o_He5UNjx3hFG


# Aim: learn the basics of Recurrent Neural Networks in PyTorch!





### 1) Conceptual-math background for Deep Learning

### 2) The Elman Recurrent Neural Network 

### 3) Example: predicting the next character from observing previous characters


# Caveat

We will only consider one input sequence and one output sequence:

- **Input sequence:** hihell
- **Output sequence:** ihello



It is important to clarify that this example demonstrates memorization, not learning!

The network has simply memorized to regurgitate what it has seen.

We need to follow good machine learning practice of having training and hold-out test sets.

Our main aim is covering RNNs in PyTorch and we assume you already know this 😇


# Conceptual-math background for Deep Learning



### In machine learning, we represent objects as *vectors*

Just a collection of $n$ numbers (fixed-sized array):
<center>
    <b>x</b> $=  (x_1, x_2, \ldots, x_n) $
</center>

- A data matrix consists of rows of vectors
- An image is a vector
- A sequence is an ordered collection of vectors 

### Each object is represented as a vector

### We would like to measure the similarity between objects

### The dot/inner product is a measure of similarity between vectors 
<br>
<br>
<center> $x^{T}y = \sum_{i=1}^{n} x_{i}y_{i}$ </center>

### Each object is a vector and the inner product is a measure of similarity between vectors 

$x^{T}y = \begin{bmatrix} x_{1} & x_{2} & \ldots &  x_{n}\end{bmatrix} \begin{bmatrix} y_{1}\\  y_{2}\\ \vdots \\  y_{n} \end{bmatrix} = x_{1}y_{1} + x_{2}y_{2} + \ldots + x_{n}y_{n} = \sum_{i=1}^{n} x_{i}y_{i} $
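The sum above can be sketched in a few lines of plain Python. The helper name `dot` and the example vectors are assumptions made for illustration; they are not part of the original notebook.

```python
def dot(x, y):
    """Inner product of two equal-length sequences of numbers."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))

x = [1.0, 2.0, 3.0]
aligned = [2.0, 4.0, 6.0]      # points the same way as x
opposed = [-1.0, -2.0, -3.0]   # points the opposite way
orthogonal = [3.0, 0.0, -1.0]  # perpendicular to x

print(dot(x, aligned))     # 28.0
print(dot(x, opposed))     # -14.0
print(dot(x, orthogonal))  # 0.0
```

A large positive value signals alignment, while zero or negative values signal that the two vectors do not match.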

### The inner product measures how well two vectors *match*

- If $x^{T}y$ is large and positive, then $x$ and $y$ are *aligned*. If it is near zero or negative, they are mismatched!



- The matrix-vector product: match a collection of vectors against a single vector

- $Wx$ measures the match between $x$ and each of the $m$ row vectors $w_{i}^{T}$ (note the slight abuse of notation)

<center>
$ y = Wx = \begin{bmatrix} w^{T}_{1}\\  w^{T}_{2}\\ \vdots \\  w^{T}_{m} \end{bmatrix} \begin{bmatrix} x_{1}\\  x_{2}\\ \vdots \\  x_{n} \end{bmatrix} = \begin{bmatrix} w^{T}_{1}x\\  w^{T}_{2}x\\ \vdots \\  w^{T}_{m}x \end{bmatrix}, \qquad y_{i} = w_{i}^{T}x $
</center>

Example: collection of 3D vectors $w_{1}, w_{2}, w_{3}$ against the 3D vector $x$:
<center>
<img src="https://www.walletfox.com/course/qtconcurrentmatrixvectorSource/matvec1_img.png" alt="Drawing" style="width: 600px;"/>
</center>
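A minimal sketch of the matrix-vector product as one inner product per row, in plain Python. The names `dot` and `matvec` and the numbers in `W` and `x` are made up for illustration.

```python
def dot(x, y):
    """Inner product of two equal-length sequences of numbers."""
    return sum(xi * yi for xi, yi in zip(x, y))

def matvec(W, x):
    """Return Wx: one inner product per row (template) of W."""
    return [dot(row, x) for row in W]

W = [[1.0, 0.0, 0.0],      # w_1: responds to the first coordinate
     [0.0, 1.0, 1.0],      # w_2: responds to the last two coordinates
     [-1.0, -1.0, -1.0]]   # w_3: responds to the negated sum
x = [2.0, 3.0, 4.0]

y = matvec(W, x)
print(y)  # [2.0, 7.0, -9.0]
```

Each entry of `y` is a separate score for one row of `W`, so the result is a vector with $m$ entries rather than a single number.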

### The Fourier Transform as a form of template matching 

![fft](./figures/fft.png)





- Each row in the DFT matrix is a basis function (sinusoid) 


- Basis functions can be combined to create other functions and are sometimes called *kernels*; we can think of them as templates

- The idea behind the Fourier transform is to take a signal and measure how much the signal matches the collection of templates 



- We can think of the Fourier transform as converting the raw signal into a *template representation*

- This new representation is more useful for downstream tasks
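The template-matching view can be sketched by building the DFT matrix explicitly and matching a signal against each row. This is an illustrative plain-Python version, not an efficient FFT; the function names and the test signal are assumptions.

```python
import cmath
import math

def dft_matrix(n):
    """n x n DFT matrix: row k is a complex sinusoid with k cycles per n samples."""
    return [[cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)]
            for k in range(n)]

def dft(signal):
    """Match the signal against every row (template) of the DFT matrix."""
    F = dft_matrix(len(signal))
    return [sum(f * s for f, s in zip(row, signal)) for row in F]

n = 8
# A cosine that completes exactly 2 cycles over the 8 samples.
signal = [math.cos(2 * math.pi * 2 * t / n) for t in range(n)]

magnitudes = [round(abs(c), 6) for c in dft(signal)]
print(magnitudes)
```

Only the templates with 2 cycles (rows 2 and 6, the latter being the mirror frequency) respond strongly; every other row scores approximately zero.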

### Deep learning in one slide? 

- The templates in Fourier transforms are fixed to be sinusoids. In Deep learning, we *learn* the templates from data:

<center>
$z_0 = \sigma(W_{0}x+b_{0})$
<br>
$\downarrow$
<br>
$z_1 = \sigma(W_{1}z_{0}+b_{1})$
<br>
$\downarrow$
<br>
$\vdots$
<br>
$\downarrow$
<br>
$z_L = \sigma(W_{L}z_{L-1}+b_{L})$
<br>
$\downarrow$
<br>
$y = W_{L+1}z_{L}+b_{L+1}$
</center>

- $\sigma(\cdot)$ are non-linear activation functions
- Intermediate representations $z_{i}$ are called *hidden states*
- $W_{i}$ and $b_{i}$ are templates and bias terms to get representations $z_{i}$
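The stack of layers above can be sketched directly with tensors. The layer sizes, the random weights, and the choice of `tanh` for $\sigma$ are assumptions for illustration only.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 4 input features, two hidden layers of 3 units, 2 outputs.
sizes = [4, 3, 3, 2]
weights = [torch.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [torch.randn(n_out) for n_out in sizes[1:]]

def forward(x):
    z = x
    # Hidden layers: z_i = sigma(W_i z_{i-1} + b_i), with tanh as sigma.
    for W, b in zip(weights[:-1], biases[:-1]):
        z = torch.tanh(W @ z + b)
    # Output layer: linear, no activation.
    return weights[-1] @ z + biases[-1]

x = torch.randn(4)
y = forward(x)
print(y.shape)  # torch.Size([2])
```

In practice the weights would be learned from data rather than drawn at random; the sketch only shows how each representation feeds the next.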


# Elman RNN


- The deep learning framework we presented gives us a tool to take a given input $x$ and transform it to a desired output $y$ through intermediate hidden hierarchical representations $z_{i}$
<center>
$ y = V\sigma(U x)$
</center>

- This framework works great for a single input $x$ and a single output $y$

- But what if we have *sequences* of inputs and targets? 



- We need a way to have a notion of "memory" since past inputs also have an influence on the output

<center>
$ y_{t} \stackrel{?}{=} f(x_{t}, x_{t-1})$
</center>


- Recurrent Neural Networks generalize the static input and output pairs $(x, y)$ to sequences of inputs and outputs $(x_{t}, y_{t})$, for $t$ in $\{0, 1, \ldots\}$.

<center>
<br>
$h_{t} = \sigma(U x_{t} + V h_{t-1})$
<br>
$y_{t} = W h_{t}$
</center>
   
![elman](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b5/Recurrent_neural_network_unfold.svg/1000px-Recurrent_neural_network_unfold.svg.png)
 
- Without feedback, we are not capturing the past history. If we set $V = 0$ we get back to the previous framework: $h_{t} = \sigma(U x_{t})$ and $y_{t} =W h_{t} $
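The recurrence and the $V = 0$ special case can be sketched by hand. The sizes, random matrices, zero initial state, and `tanh` non-linearity below are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)

input_size, hidden_size, output_size = 4, 3, 2
U = torch.randn(hidden_size, input_size)   # input-to-hidden
V = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden (feedback)
W = torch.randn(output_size, hidden_size)  # hidden-to-output

def run(xs, V):
    h = torch.zeros(hidden_size)  # assumed zero initial state
    ys = []
    for x_t in xs:
        h = torch.tanh(U @ x_t + V @ h)  # h_t = sigma(U x_t + V h_{t-1})
        ys.append(W @ h)                 # y_t = W h_t
    return torch.stack(ys)

xs = torch.randn(5, input_size)
ys_recurrent = run(xs, V)
ys_static = run(xs, torch.zeros_like(V))

# With V = 0 each output depends only on the current input.
static_direct = torch.stack([W @ torch.tanh(U @ x_t) for x_t in xs])
print(torch.allclose(ys_static, static_direct))  # True
print(ys_recurrent.shape)  # torch.Size([5, 2])
```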

# Example: predicting the next character from observing previous characters

## Next character prediction 

We will only consider one input sequence and one output sequence:

- **Input sequence:** hihell
- **Output sequence:** ihello




One way of thinking about how this works is that the machine learning algorithm (ML) first sees the character "h" and tries to guess what should follow:

"h" → ML → "?"





Ideally, once it has learned the sequence, the ML will output "i".






If we do this for every character, then by the end of this exercise we would like the ML algorithm to have this property:

"h" → ML →"i"<br>
"i"  → ML → "h"<br>
"h" → ML → "e"<br>
"e" → ML → "l"<br>
"l"  → ML → "l"<br>
"l"  → ML → "o"<br>


- This notebook is mostly lifted and modified from the excellent tutorials by Sung Kim:
https://docs.google.com/presentation/d/17VUX7YXhMkJrqO5gNGh6EE5gzBpY-BF9IrfVKcFIb3A/edit#slide=id.g27c9a844e4_157_9

### Encoding characters with numbers



A popular way to encode characters is to treat all characters as equally important and assign each one a one-hot vector. It is easiest to illustrate this:

"h" → 1000 <br>
"e" → 0100 <br>
"l" → 0010 <br>
"o" → 0001 <br>

We have a sequence of 4D vectors! This is perfect for RNNs!

In [None]:
# One hot encoding for each char in 'hello'
h = [1, 0, 0, 0]
e = [0, 1, 0, 0]
l = [0, 0, 1, 0]
o = [0, 0, 0, 1]
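Instead of writing each vector by hand, the same encoding can be generated from a vocabulary. This is a small sketch in plain Python; the names `one_hot`, `vocab`, and `char2idx` are hypothetical helpers, not part of the original notebook.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

vocab = ['h', 'e', 'l', 'o']
char2idx = {ch: i for i, ch in enumerate(vocab)}

encoded = [one_hot(char2idx[ch], len(vocab)) for ch in "hello"]
for ch, vec in zip("hello", encoded):
    print(ch, vec)
```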


### Elman RNN in PyTorch: `nn.RNN`

![elman](figures/nn.RNN.png)

`nn.RNN` in PyTorch implements the following RNN: 

$ h_{t} = \text{tanh}(w_{ih}x_{t} + b_{ih} + w_{hh}h_{(t-1)} + b_{hh})$

- $w_{ih}$ and $b_{ih}$ are the weights and biases for the input $x_{t}$
- $w_{hh}$ and $b_{hh}$ are the weights and biases for the previous hidden state $h_{(t-1)}$
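The update rule can be checked against the module itself by reading the parameters of a single-layer `nn.RNN` (`weight_ih_l0`, `bias_ih_l0`, `weight_hh_l0`, `bias_hh_l0`) and applying the formula by hand. The seed and the example input are arbitrary choices for this sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=2, batch_first=True)

x_t = torch.tensor([1.0, 0.0, 0.0, 0.0])  # one-hot "h"
h_prev = torch.randn(2)                   # arbitrary previous hidden state

# Manual update using the layer's own parameters.
manual = torch.tanh(rnn.weight_ih_l0 @ x_t + rnn.bias_ih_l0
                    + rnn.weight_hh_l0 @ h_prev + rnn.bias_hh_l0)

# Same step through the module: input (batch=1, seq_len=1, 4), h_0 (1, 1, 2).
_, h_next = rnn(x_t.view(1, 1, -1), h_prev.view(1, 1, -1))

matches = torch.allclose(manual, h_next.view(-1), atol=1e-6)
print(matches)  # True
```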



In [None]:
import sys
import torch
import torch.nn as nn
from torch.autograd import Variable

Important `nn.RNN` parameters that need to be specified:

- `input_size`: The number of expected features in the input `x`
- `hidden_size`: The number of features in the hidden state `h`

Since our one-hot vectors are only 4 dimensional, we need the `input_size` of the RNN to be 4. The `hidden_size` is the dimension of the hidden state. 



Once `nn.RNN` has been defined, it takes two inputs (`x` and the initial state `h_0`) and returns two outputs (the `output` set of states and the final state `h_n`). The inputs are:

- `input` of shape (seq_len, batch, input_size), or (batch, seq_len, input_size) when `batch_first=True`: tensor containing the features of the input sequence.
- `h_0` of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. 

The outputs are:

- `output` of shape (seq_len, batch, num_directions * hidden_size), or (batch, seq_len, num_directions * hidden_size) when `batch_first=True`: tensor containing the output features (h_k) from the last layer of the RNN, for each k. 
- `h_n` of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for `n = seq_len`. 

For each element in the input sequence, each layer computes the `tanh` update given above.

In [None]:
# The RNN cell will take two sets of inputs:
#  - inputs x with 4 features; we specify the number of features with input_size
#  - hidden state h with 2 features; we specify the number of hidden features 
#    with hidden_size = 2
     
elman_rnn = nn.RNN(input_size=4, hidden_size=2, batch_first=True)

The above line has instantiated an `elman_rnn` object that processes *sequences* of data, taking a hidden state and input data. 





### Prep the initial hidden state tensor
When we start the RNN, we need to select something for the initial hidden state $h_0$. Here let's draw it from a standard normal distribution.

In [None]:
# To make a 2 dimensional hidden state vector, we make the initial hidden state
# h_0 with the tensor size specified as:
#
#     (num_layers * num_directions, batch_size, hidden_size) 
#
# (h_0 keeps this shape regardless of batch_first; only the input and
#  output tensors swap their first two dimensions)

hidden = Variable(torch.randn(1, 1, 2))
hidden.size()

- The hidden state will have two features! 

- Since we only have 1 hidden layer, and only batch size of 1 (we only have one sequence), we expect the hidden state vector to be a 1 x 1 x 2 tensor. 

- `hidden.size()` above verifies that we have initialized the tensor with the correct shape. 



### Prep the initial input sequence character

Now let's propagate an input character through the RNN cell. We will first convert our list of one-hot encoded characters to a PyTorch tensor. 

In [None]:
# Propagate input through RNN
# Input: (batch, seq_len, input_size) when batch_first=True
input_characters = Variable(torch.Tensor([h, e, l, l, o]))

input_characters.size()

Since our one-hot encoded vectors are 4 dimensional, and our "hello" sequence consists of 5 characters, we expect the input tensor to have size 5 x 4. Calling `input_characters.size()` confirms that this is indeed the case. 




Note that our `input_characters` tensor is missing the batch dimension. Since we only have 1 sequence (so a batch size of 1), the tensor needs to have size `[1, 5, 4]`. We can easily reshape it using the `.view(1,5,4)` method, or use `.view(1,5,-1)`, where the -1 tells PyTorch to infer the remaining dimension for us.
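A short sketch of the reshaping options, using a zero tensor as a stand-in for the 5 x 4 one-hot tensor. `unsqueeze` is shown as an alternative that is not used in the original notebook.

```python
import torch

chars = torch.zeros(5, 4)        # stand-in for the 5 x 4 one-hot tensor

batched = chars.view(1, 5, 4)    # explicit target shape
inferred = chars.view(1, 5, -1)  # -1 is inferred as 4
also_ok = chars.unsqueeze(0)     # insert a batch dimension at position 0

print(batched.shape, inferred.shape, also_ok.shape)
```

All three results have the same shape; `view` with `-1` is convenient when the feature dimension may change.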



### Inference with RNN (aka "forward pass"): one character at a time

OK, let's **finally** take our initial hidden state and the first encoded character and pass them to the Elman RNN that we defined as `elman_rnn`:

In [None]:
out, hidden = elman_rnn(input_characters[0,:].view(1,1,-1), hidden)
print("Encoded character size:", input_characters[0,:].size(), 
      "\nhidden size:", hidden.size() ,
      "\nout size:", out.size())

- `elman_rnn` expects two sets of input tensors: 
    - the input character (note singular)
    - the hidden state (note singular)
   

- We have stored all characters in the 5x4 tensor `input_characters`
- We pass the first character using "slice" indexing: `input_characters[0,:]`
    - When we slice this way, you will only get a 1D tensor that has 4 elements.
    - But with `batch_first=True` the RNN expects a *sequential tensor* with shape (batch, seq_len, input_size) 

- So we have to reshape this 1D tensor into 3D by introducing some dummy dimensions. This reshaping can be done with the `.view()` method.





Below we will iterate through each one-hot-encoded character and look at the sizes of the `out` and `hidden` tensors the RNN returns.

In [None]:
for i, one_hot_encoded_encoded in enumerate(input_characters):
    encoded_character = one_hot_encoded_encoded.view(1, 1, -1)
    # Input: (batch, seq_len, input_size) when batch_first=True
    out, hidden = elman_rnn(encoded_character, hidden)
    print("Character", i, "tensor sizes:")
    print("  encoded char size:", encoded_character.size(), 
          "; hidden size:", hidden.size() ,
          ";  out size:", out.size())

- We see that for every character, the input tensor has been reshaped as a 4 dimensional 1-hot-encoded vector with tensor size 1x1x4. 


- The RNN cell then produces *two* tensors: `out` and `hidden`. `hidden` is the updated state $h_{t}$, which is fed back in at the next time step. The `out` and `hidden` tensors have the same shape. 


- This is because, for a single time step and a single layer, `out` is just a copy of `hidden`. 

### Exercise: Verify `out` is a copy of the `hidden` state

Indeed! We see that the `hidden` and `out` tensors that the RNN cell returns are not only equal in shape but also in value. 
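One possible way to run this check is sketched below. It builds its own small RNN so that it is self-contained; the seed, the input character, and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=2, batch_first=True)

x = torch.tensor([[[1.0, 0.0, 0.0, 0.0]]])  # (batch=1, seq_len=1, input_size=4)
h0 = torch.randn(1, 1, 2)                   # (num_layers, batch, hidden_size)

out, hidden = rnn(x, h0)

# Both tensors are 1 x 1 x 2 here, so they can be compared directly.
same = torch.allclose(out, hidden)
print(out.shape, hidden.shape, same)
```

The comparison only works this simply because the sequence has a single time step and the batch size is 1; for longer sequences `out` holds one state per step.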



### Inference with RNN: going through entire sequence in one shot

In [None]:
input_characters = input_characters.view(1, 5, -1)
out, hidden = elman_rnn(input_characters, hidden)
print("sequence of encoded character size",input_characters.size(), "\nhidden size", hidden.size(), "\nout size", out.size())

`out` is the hidden states for **every time step**. `hidden` is the hidden state for just the **last time step**. So the last time step for `out` should be identical to `hidden`. 

In [None]:
out[:,-1,:] # the last element in the *sequence* of outputs of the RNN

In [None]:
hidden

Yep, they both have the same values! 

### Inference with RNN: iterating through multiple sequences

Now let's try multiple sequences, so that the batch size is greater than 1. 

Here we will consider 3 sequences each with the same length: "hello", "eolll", and "lleel".

In [None]:
# One cell RNN input_dim (4) -> output_dim (2). sequence: 5, batch 3
# 3 batches 'hello', 'eolll', 'lleel'
# rank = (3, 5, 4)
inputs = Variable(torch.Tensor([[h, e, l, l, o],
                                [e, o, l, l, l],
                                [l, l, e, e, l]]))

inputs.size()

We see from `inputs.size()` that the `inputs` tensor has size 3x5x4. These three dimensions correspond to:
- dim 1: the number of sequences, 3 in this case
- dim 2: the length of each sequence (ie the number of elements/characters in each sequence).  We have 5 characters for each sequence
- dim 3: number of features to represent each character. Since we are using a 1-hot encoding scheme and we only have 4 characters, the number of features is just 4.

OK, now that we have our inputs tensor setup, we now need to initialize the hidden state as before. 

The big difference from before is that we have **three** sequences instead of one, so we need to create three hidden states. 

In [None]:
# hidden : (num_layers * num_directions, batch, hidden_size) whether batch_first=True or False
hidden = Variable(torch.randn(1, 3, 2))
hidden.size()

In other words, we have created 3 different hidden states each having dimension 2. 



Now that we have our hidden states and inputs set up, let's forward pass them to the Elman RNN!

In [None]:
# Propagate input through RNN
# Input: (batch, seq_len, input_size) when batch_first=True
# B x S x I
out, hidden = elman_rnn(inputs, hidden)
print("batch input size", inputs.size(), "\nout size", out.size(), "\nhidden size", hidden.size())


- Because we have 3 sequences, the RNN will have 3 outputs. The first dimension of the `out` tensor is 3. 


- Each sequence has length 5 characters, so the `out` tensor will have a hidden state for each of these characters. The middle dimension of `out` has 5 elements.


- **Finally!** We have designed our RNN to have hidden states that are 2D. This is why we see that the third dimension of the `out` tensor is 2. 


OK, what about the `hidden` tensor?


- Remember that the `hidden` tensor the RNN returns is just the *last* hidden state, produced after the last input character. Because we have 3 sequences, we have 3 hidden states. And because our hidden state has 2 features, each of these 3 hidden states is 2D. 


- And as before, we expect that the *last* element in the `out` tensor should be equal to the `hidden` tensor for every sequence.

### Exercise: verify that the *last* element in the `out` tensor is equal to the `hidden` tensor for every sequence

Everything checks out!
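A self-contained sketch of one way to do this check for a batch of three sequences. The index encoding (h=0, e=1, l=2, o=3), the seed, and the use of `torch.eye` to build the one-hot vectors are choices made for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=2, batch_first=True)

# 'hello', 'eolll', 'lleel' as indices with h=0, e=1, l=2, o=3
indices = torch.tensor([[0, 1, 2, 2, 3],
                        [1, 3, 2, 2, 2],
                        [2, 2, 1, 1, 2]])
inputs = torch.eye(4)[indices]  # one-hot encode: (3, 5, 4)
h0 = torch.randn(1, 3, 2)       # (num_layers, batch, hidden_size)

out, hidden = rnn(inputs, h0)

last_step = out[:, -1, :]       # (3, 2): last time step of every sequence
matches = torch.allclose(last_step, hidden[0])
print(last_step.shape, hidden.shape, matches)
```

Note that `hidden` carries a leading layer dimension, so `hidden[0]` is the 3 x 2 slice to compare against.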



We can also choose not to have the batch size as the first dimension:

In [None]:
# One cell RNN input_dim (4) -> output_dim (2)
elman_rnn = nn.RNN(input_size=4, hidden_size=2)

# The given dimensions dim0 and dim1 are swapped.
inputs = inputs.transpose(dim0=0, dim1=1)
# Propagate input through RNN
# Input: (seq_len, batch_size, input_size) when batch_first=False (default)
# S x B x I
out, hidden = elman_rnn(inputs, hidden)
print("batch input size", inputs.size(), "out size", out.size())

### Learning 1-batch sequence with RNN one element at a time

Let's now apply an RNN to *learn* a sequence. We will only consider one input sequence and one output sequence:

- Input sequence: hihell
- Output sequence: ihello

We will design the 1-hot-encoding by first assigning an index to each character:
- "h" -> 0
- "i" -> 1
- "e" -> 2
- "l" -> 3
- "o" -> 4

So in other words we are living in a world that has only these 5 characters. 

In [None]:
torch.manual_seed(777)  # reproducibility
#            0    1    2    3    4
idx2char = ['h', 'i', 'e', 'l', 'o']

We now define our input sequence `x_data` ("hihell") and our output sequence `y_data` ("ihello").

We also convert our characters to one-hot-encoded vectors using a simple lookup table.

In [None]:
# Teach hihell -> ihello
x_data = [0, 1, 0, 2, 3, 3]   # hihell
y_data = [1, 0, 2, 3, 3, 4]   # ihello

one_hot_lookup = [[1, 0, 0, 0, 0],  # 0
                  [0, 1, 0, 0, 0],  # 1
                  [0, 0, 1, 0, 0],  # 2
                  [0, 0, 0, 1, 0],  # 3
                  [0, 0, 0, 0, 1]]  # 4


x_one_hot = [one_hot_lookup[x] for x in x_data]

# As we have one batch of samples, we will change them to variables only once
inputs = Variable(torch.Tensor(x_one_hot))
labels = Variable(torch.LongTensor(y_data))




In [None]:
inputs.size(), labels.size()

The `inputs` tensor has size 6x5 because there are 6 characters and each character has 5 features. The output `labels` tensor is just the character "classes" (i.e., the character indices) we want to predict.

- The RNN we are going to use for predicting the next character is going to use the hidden state directly as its prediction. 

- Normally this would be passed to another layer (like a fully connected layer or even another RNN) but in this example we are just going to  use it directly. 
- The advantage of using the hidden state directly and not introducing additional layers is we limit the number of parameters we have to learn. 
- The disadvantage of not using an additional layer is that we expect the hidden state to encode *both* the past histories *and* predict the next character. 


In [None]:
num_classes = 5      # the number of possible classes we have (the labels tensors is between 0 and 4)
input_size = 5       # one-hot encoded vector dimensions
hidden_size = 5      # we use 5 dimensional hidden state vectors to directly predict the character
batch_size = 1       # we have one sentence and so one batch size
sequence_length = 1  # we have only one sequence and we will process the characters one by one
num_layers = 1       # we will have a simple one hidden layer RNN

Now we define our RNN class with the specific architecture that we want (as we usually do with PyTorch neural network models for training)

We create our RNN model by inheriting from `nn.Module`.



This is optional but very convenient.
    
To create our own RNN model and satisfy the `nn.Module` API, we need to define the following two methods:

- `__init__`
- `forward`

We can also define any other supporting methods we need.

In [None]:
class Model(nn.Module):

    def __init__(self, input_size, hidden_size):
        super(Model, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size=self.input_size,
                          hidden_size=self.hidden_size, 
                          batch_first=True)

    def forward(self, hidden, x):
        # Reshape input to make sure the first dim is batch dimension
        x = x.view(batch_size, sequence_length, input_size)

        # Propagate input through RNN
        #   Input:  (batch, seq_len, input_size)
        #            since we only have 1 batch and are iterating a single 
        #            character at a time we expect the input tensor to have 
        #            shape: 1 x 1 x 5
        #   hidden: (num_layers * num_directions, batch, hidden_size)
        #            we only have 1 hidden layer and the RNN is unidirectional
        #            so the hidden tensor size should be 1 x 1 x 5              
        out, hidden = self.rnn(x, hidden)
        return hidden, out.view(-1, num_classes)

    def init_hidden(self):
        # Initialize the hidden state (an Elman RNN has no cell state)
        # (num_layers * num_directions, batch, hidden_size)
        #  we only have 1 hidden layer and the RNN is unidirectional
        #  so the hidden tensor size should be 1 x 1 x 5   
        return Variable(torch.zeros(num_layers, batch_size, hidden_size))

Now we instantiate the model, define our loss criterion, and define the optimizer we want to use:

In [None]:
# Instantiate RNN model
model = Model(input_size=5, hidden_size = 5)

# Set loss and optimizer function
# CrossEntropyLoss = LogSoftmax + NLLLoss
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

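# Optional idea: a scheduler such as torch.optim.lr_scheduler.ReduceLROnPlateau
# could lower the learning rate when the loss stops improving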
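The comment `CrossEntropyLoss = LogSoftmax + NLLLoss` in the cell above can be checked numerically. The random `logits` below are a stand-in for the model output; the targets are the "ihello" class indices used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(6, 5)                  # 6 time steps, 5 character classes
targets = torch.tensor([1, 0, 2, 3, 3, 4])  # "ihello" as class indices

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

equivalent = torch.allclose(ce, nll)
print(equivalent)  # True
```

This is why the model can return raw scores: the loss applies the softmax normalization internally.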

# Time to train the network!!!! 

We will loop through each epoch (100 of them), forward pass each individual character, then compute and accumulate the loss. Once the sequence is over, we compute the total loss and propagate the errors to update the network.  

In [None]:
for epoch in range(100):
    optimizer.zero_grad()
    loss = 0
    hidden = model.init_hidden()
    
    # iterate through each character and predict what
    # the next character should be.
    pred_string = ""
    for input, label in zip(inputs, labels):
        hidden, output = model(hidden, input)
        val, idx = output.max(1) # We are using the hidden state directly to make our 
                                 # prediction (and have reshaped appropriately in our 
                                 # Model class definition). We could also just as well 
                                 # use hidden state that we are returning as long as 
                                 # we reshape it right: 
                                 # hidden.view(-1, num_classes).max(1) 
        
        pred_string += idx2char[idx.item()]
        # accumulate the loss
        loss += criterion(output, label.view(1))
    
    # ok we completed the sequence, lets backward prop and update the network
    loss.backward()
    optimizer.step()
    
    # print every 20 epochs what the results look like
    if (epoch%20 == 0) or (epoch == 99):
        sys.stdout.write("predicted string: ")
        sys.stdout.write(pred_string)
        print(", epoch: %d, loss: %1.3f" % (epoch + 1, loss.item()))    
    
print("Learning finished!")    

Looks like we learned (memorized, really) the target sequence!!!

We can generalize this approach to multiple sequences: just have one more loop to iterate through each sequence. 

All of our sequences could also have different lengths and it would not matter.



But often we are faced with sequences that always have the same length. 

In such cases we can update our RNN model to process not just a single character at a time, but the entire sequence. 

## Learning 1-batch sequence with RNN (entire sequence)

Now we are going to learn the sequence not character-by-character but the entire sequence at once. 

Another way of thinking about this is that we are learning batches of sequences. But since we only have 1 sequence, our batch size will be 1. 

In [None]:
sequence_length = 6  # Since the number of character in our sequence |ihello| == 6

We will similarly define the parameters of our model as variables below:

In [None]:
num_classes = 5      # the number of possible classes we have (the labels tensors is between 0 and 4)
input_size = 5       # one-hot encoded vector dimensions
hidden_size = 5      # we use 5 dimensional hidden state vectors to directly predict the character
batch_size = 1       # we have one sentence and so one batch size
sequence_length = 6  # we process the entire 6-character sequence at once
num_layers = 1       # we will have a simple one hidden layer RNN

Now we define our RNN class with the specific architecture that we want:

In [None]:
class RNN(nn.Module):
    def __init__(self, num_classes, input_size, hidden_size, num_layers):
        super(RNN, self).__init__()

        self.num_classes = num_classes
        self.num_layers = num_layers
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.sequence_length = sequence_length

        self.rnn = nn.RNN(input_size=self.input_size,
                          hidden_size=self.hidden_size,
                          num_layers=self.num_layers,
                          batch_first=True)

    def forward(self, x):
        # Initialize the hidden state (an Elman RNN has no cell state)
        # h_0: (num_layers * num_directions, batch, hidden_size),
        # regardless of batch_first
        h_0 = Variable(torch.zeros(self.num_layers, x.size(0), self.hidden_size))

        # Propagate input through RNN
        # Input: (batch, seq_len, input_size)
        out, _ = self.rnn(x, h_0)
        # Flatten to (batch * seq_len, num_classes) for CrossEntropyLoss;
        # this works because hidden_size == num_classes here
        return out.view(-1, self.num_classes)

In [None]:
# As we have one batch of samples, we will change them to variables only once
inputs = Variable(torch.Tensor(x_one_hot))
labels = Variable(torch.LongTensor(y_data))

inputs.size(), labels.size()

In [None]:
# Instantiate RNN model
rnn = RNN(num_classes, input_size, hidden_size, num_layers)

# Set loss and optimizer function
# CrossEntropyLoss = LogSoftmax + NLLLoss
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.05)

In [None]:
outputs = rnn(inputs.view(1,6,-1))
outputs.size()

In [None]:
# Train the model
for epoch in range(100):
    outputs = rnn(inputs.view(1,6,-1))
    optimizer.zero_grad()
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    _, idx = outputs.max(1)
    idx = idx.data.numpy()
    result_str = [idx2char[c] for c in idx.squeeze()]
    if (epoch%20 == 0) or (epoch == 99):
        print("epoch: %d, loss: %1.3f" % (epoch + 1, loss.item()))
        print("Predicted string: ", ''.join(result_str))
    
print("Learning finished!")

## Take home exercise: RNN with Embedding and Output layers

- Instead of keeping the representation of characters fixed with one-hot encoding, we can also *learn* dense representations (kernels, templates, etc) of them

- The Embedding layer is a special type of layer designed especially for sparse high dimensional 1-hot coded vectors

- Use the embedding layer to pre-process the input before passing to the RNN
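To see why an embedding layer suits one-hot inputs, the sketch below shows that looking up an index in `nn.Embedding` gives the same result as multiplying the one-hot vector by the embedding weight matrix. The sizes mirror this notebook (5 characters, embedding size 10); the seed and variable names are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=5, embedding_dim=10)

indices = torch.tensor([[0, 1, 0, 2, 3, 3]])  # "hihell", shape (1, 6)

looked_up = embedding(indices)                # (1, 6, 10): direct row lookup
one_hot = torch.eye(5)[indices]               # (1, 6, 5): explicit one-hot vectors
via_matmul = one_hot @ embedding.weight       # (1, 6, 10): same rows via matmul

equivalent = torch.allclose(looked_up, via_matmul)
print(looked_up.shape, equivalent)
```

The lookup avoids materializing the sparse one-hot vectors, and the rows of `embedding.weight` are learned along with the rest of the model.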

In [None]:
x_data = [[0, 1, 0, 2, 3, 3]] 
# As we have one batch of samples, we will change them to variables only once
inputs = Variable(torch.LongTensor(x_data))
labels = Variable(torch.LongTensor(y_data))

embedding_size = 10  # embedding size

In [None]:
class Model(nn.Module):
    def __init__(self, hidden_size):    
        super(Model, self).__init__()
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, embedding_size)
        self.rnn = nn.RNN(input_size=embedding_size,
                          hidden_size=self.hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initialize the hidden state
        # h_0: (num_layers * num_directions, batch, hidden_size)
        h_0 = Variable(torch.zeros(1, x.size(0), self.hidden_size))

        # Look up a dense vector for each character index
        # x: (batch, seq_len) -> emb: (batch, seq_len, embedding_size)
        emb = self.embedding(x)

        # Propagate embedding through RNN
        out, _ = self.rnn(emb, h_0)
        # Map each hidden state to class scores: (batch, seq_len, num_classes)
        return self.fc(out)

In [None]:
# Instantiate RNN model
model = Model(hidden_size)
print(model)

# Set loss and optimizer function
# CrossEntropyLoss = LogSoftmax + NLLLoss
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)


In [None]:
# Train the model
for epoch in range(100):
    outputs = model(inputs.view(1,-1)).view(-1,num_classes)
    optimizer.zero_grad()
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.step()
    _, idx = outputs.max(1)
    idx = idx.data.numpy()
    result_str = [idx2char[c] for c in idx.squeeze()]
    if (epoch%20 == 0) or (epoch == 99):
        print("epoch: %d, loss: %1.3f" % (epoch + 1, loss.item()))
        print("Predicted string: ", ''.join(result_str))

print("Learning finished!")