# RNNs: Recurrent Neural Networks
A very special kind of neural network that gets the usual input and produces an output but it produces a hidden output that's sometimes called `state` of `hidden state` that get passed on to the next cell:

<img src="imgs/RNN-unrolled.png" />

The three types of RNN Models are:
* One to Many
* Many to One
* Many to Many

RNNs are useful when dealing with sequential data, and have applications in domains such as:
* Time Series Prediction
* Language Modeling (Text Generation)
* Text sentiment Analysis
* Named Entity Recognition
* Translation
* Speech Recognition
* Music Composition
* ...

There are two approaches to represent text, to one hot encode them or use embeddings, Let's start with one hot encoding the word 'hello':

In [1]:
import numpy as np
import torch
from torch import nn
from torch.autograd import Variable
import ipdb

In [None]:
# Let's one-hot encode 'hello'.
string = 'hello'
indices = np.array([i for i in range(len(list(set('hello'))))])
letter_to_index = dict(zip(list(set('hello')), indices))

In [None]:
one_hot = np.zeros((len(string), len(list(set(string)))))

In [None]:
for i, _ in enumerate(one_hot):
    one_hot[i][letter_to_index[string[i]]] = 1
one_hot = one_hot.astype('float32')

Let's create a Model that gets 'Hello' and output two random numbers:

In [None]:
class FirstRNN(nn.Module):
    '''
    Our first RNN written using PyTorch, it has one cell.
    Input: One-hot encoded 'Hello'
    Output: a random vector of size 2.
    '''
    def __init__(self):
        super(FirstRNN, self).__init__()
        
        # One Cell RNN: Input_dim=4 -> Output_dim=2.
        self.cell = nn.RNN(input_size=4, hidden_size=2, batch_first=True)
    
    def forward(self, x):
        # we have to initialize the first hidden state.
        h = Variable(torch.randn(1,6,2))
        
        out, hidden = self.cell(x, h)
        
        return out

In [None]:
RNN = FirstRNN()

In [None]:
o, h = RNN(Variable(torch.from_numpy(one_hot.reshape((1,5,4)))))

And then we want to do this process for A Sequence of size N:

<img src="imgs/RNN_n_sequence.png" />

In [None]:
o.data

In [None]:
h

The output and the hidden state from a give cell are the same, the premise is that that hidden state will be feeded to the next cell to overcome the vanishing gradients problem.

Now it's time to introduce the notion of batches, until now we've dealt with a batch containing one example that contain 5 letters of size 4 (1-hot encoded), so the shape is `(1,5,4)`, next we need to feed in multiple words in the batch, for example 3 words:

<img src="imgs/RNN_Batch.png" />

In [None]:
def string_to_one_hot(string):
    '''
    Transform a String to a One-hot Encoded Vector.
    '''
    one_hot = np.zeros((5, 4))
    
    for i, _ in enumerate(one_hot):
        one_hot[i][letter_to_index[string[i]]] = 1

    return one_hot.astype('float32')

In [None]:
string_to_one_hot('hhhhh')

In [None]:
batch = ['hello', 'hhhhh', 'helll', 'eeehl', 'lhell', 'eeeeh']

We'll transform the batch into the proper data format:

In [None]:
b = []
for word in batch:
    b.append(string_to_one_hot(word))
b = np.array(b)

In [None]:
b.shape

We have a batch of **6** words, each word have **5** characters, each character is a one-hot vector of size **4**.

To feed the batch to the `SimpleRNN`, we need to initilize the first hidden state for each Word, After doing that we feed the batch and get the outputs:

In [None]:
output = RNN(Variable(torch.from_numpy(b)))

In [None]:
output.shape

## 'Hello' Language Model
We are going to teach our RNN to predict the next character:
<img src="imgs/Hello_LM.png" />
* A Many-to-Many problem
* Input/Output dim: 5
* This is basically a multi-class classification problem.
* As a Loss function we'll choose cross-entropy.
* We compute the loss over each letter and we minimize the total loss.

### I. Data Preparation

In [317]:
one_hot_lookup = [
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1]
]

In [318]:
idx2char = ['h', 'i', 'e', 'l', 'o']

In [319]:
# X = 'hihell'
x_data = [0, 1, 0, 2, 3, 3]

In [320]:
x_one_hot = [one_hot_lookup[x] for x in x_data]

In [321]:
# Y: ihello
y_data = [1, 0, 2, 3, 3, 4]

In [322]:
# as we have one batch of samples, we will change them to variables only once.
inputs = Variable(torch.Tensor(x_one_hot))

In [323]:
labels = Variable(torch.LongTensor(y_data))

### Parameters

In [324]:
num_classes = 5
input_size = 5  # one-hot size.
hidden_size = 5  # Output from the LSTM, 5 to directly predict a one-hot vector corresponding to 1 character. 
batch_size = 1  # one sentence.
sequence_length = 6  # Let's do one by one.
num_layers = 1  # one-layer RNN.

In [325]:
class SecondRNN(nn.Module):
    def __init__(self):
        super(SecondRNN, self).__init__()
        
        # One Cell RNN: Input_dim=4 -> Output_dim=2.
        self.rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)
    
    def forward(self, x, hidden):
        
        # reshape input in (batch_size, sequence_length, input_size)
        x = x.view(batch_size, sequence_length, input_size)
        
        # propagate input through RNN.
        out, hidden = self.rnn(x, hidden)
        
        # output is a probability ditribution over the number of classes,
        # this is why we reshape the output.
        out = out.view(-1, num_classes)
        
        #return hidden, out
        return out
    
    def init_hidden(self):
        # Initialize hidden and cell states.
        # (num_layers * num_directions, batch, hidden_size)
        return Variable(torch.zeros(num_layers, batch_size, hidden_size))

In [326]:
del(model)
model = SecondRNN()

In [327]:
criterion = torch.nn.CrossEntropyLoss()

In [328]:
optimizer = torch.optim.Adam(params=model.parameters(), lr=0.1)

We will feed in the characters one by one:

we update the weights only after we've calculated the sum of the losses over all character predictions:

In [289]:
for epoch in range(10000):
    optimizer.zero_grad()
    loss = 0
    hidden = model.init_hidden()
    pred = ''
    for inpt, label in zip(inputs, labels):
        hidden, output = model(inpt, hidden)
        val, idx = output.max(1)
        pred += str(idx2char[idx.data[0]])
        loss += criterion(output, label)

    if epoch % 500 == 0:
        print('Epoch: ' + str(epoch) + ' loss: %1.3f' % loss.data[0] + ' | prediction: ' + pred)

    loss.backward(retain_graph=True)
    optimizer.step()

Epoch: 0 loss: 9.799 | prediction: hhholl
Epoch: 500 loss: 2.667 | prediction: ihello
Epoch: 1000 loss: 2.643 | prediction: ihello
Epoch: 1500 loss: 2.634 | prediction: ihello
Epoch: 2000 loss: 2.628 | prediction: ihello
Epoch: 2500 loss: 2.625 | prediction: ihello
Epoch: 3000 loss: 2.622 | prediction: ihello
Epoch: 3500 loss: 2.620 | prediction: ihello
Epoch: 4000 loss: 2.618 | prediction: ihello
Epoch: 4500 loss: 2.617 | prediction: ihello
Epoch: 5000 loss: 2.615 | prediction: ihello
Epoch: 5500 loss: 2.614 | prediction: ihello
Epoch: 6000 loss: 2.613 | prediction: ihello
Epoch: 6500 loss: 2.613 | prediction: ihello
Epoch: 7000 loss: 2.612 | prediction: ihello
Epoch: 7500 loss: 2.611 | prediction: ihello
Epoch: 8000 loss: 2.611 | prediction: ihello
Epoch: 8500 loss: 2.610 | prediction: ihello
Epoch: 9000 loss: 2.610 | prediction: ihello
Epoch: 9500 loss: 2.609 | prediction: ihello


We can run the same model but with all of the inputs and receive all of the outputs, because the model scales with the batch size with no problem.

In [329]:
for epoch in range(1000):
    optimizer.zero_grad()
    loss = 0
    hidden = model.init_hidden()
    outputs = model(inputs, hidden)
    loss = criterion(outputs, labels)
    loss.backward(retain_graph=True)
    optimizer.step()

    if epoch % 100 == 0:
        print('Epoch: ' + str(epoch) + ' loss: %1.3f' % loss.data[0])

Epoch: 0 loss: 1.716
Epoch: 100 loss: 0.510
Epoch: 200 loss: 0.520
Epoch: 300 loss: 0.508
Epoch: 400 loss: 0.507
Epoch: 500 loss: 0.507
Epoch: 600 loss: 0.507
Epoch: 700 loss: 0.507
Epoch: 800 loss: 0.507
Epoch: 900 loss: 0.507


## Exercice

<img src="imgs/RNN_exercice.png" />
<img src="imgs/RNN_second_exo.png" />

# One-Hot Vs. Embeddings

Embeddings are more efficient since the feature space is not discrete, and they're trainable and can reflect meaning extracted from the corpus.

<img src="imgs/one_hot_embeddings.png" />

## Exercice

<img src="imgs/RNN_embedding_exo.png" />

And to fully understand the different types of RNNs, Just implement them using PyTorch from scratch (don't use the PyTorch layers), but you can take a look at the source code to compare your code with theirs:
* Regular RNN
* GRU
* LSTM

[LSTM](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)


[Stanford: Deep Learning for NLP](https://cs224d.stanford.edu/lecture_notes/LectureNotes4.pdf)