# Recurrent Neural Networks

So far, the neural network architectures we have been using take in a single fixed-size input and give a single fixed-size output. What if we wanted to model something like language, where we want to feed in words or sentences of different lengths? Another issue with vanilla fully connected networks is that each output depends only on the current input. The network has no 'memory' of previous inputs, so it can't model time-dependent variables. Recurrent neural networks (RNNs) address both of these issues.

They do this by having an internal hidden state which can be thought of as a form of memory. At each time step, the new hidden state (h) is calculated as a function of the previous hidden state and the current input (x). This hidden state can then be used to represent your output or can be put through another transformation to compute the outputs (y).
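
As a minimal sketch of this recurrence, the snippet below uses a single scalar hidden state in plain Python. The weights `w_x` and `w_h` and the function names `rnn_step` and `run_rnn` are made up for illustration; a real RNN uses learned weight matrices rather than fixed scalars.

```python
import math

def rnn_step(x, h_prev, w_x=0.8, w_h=0.5):
    """New hidden state as a function of the current input and the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev)

def run_rnn(xs):
    """Feed a sequence in one element at a time and return the hidden state after every step."""
    h = 0.0  # initial hidden state
    hidden_states = []
    for x in xs:
        h = rnn_step(x, h)
        hidden_states.append(h)
    return hidden_states

# The same function handles sequences of any length
print(run_rnn([1.0, 0.0]))
print(run_rnn([0.0, 1.0, 1.0, 0.5]))
```

Feeding in `[1.0, 0.0]` and `[0.0, 1.0]` gives different final hidden states even though both contain the same values, because the hidden state carries information about earlier inputs forward in time.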

The left diagram below shows how we represent an RNN, whereas the right one shows an RNN which has been "unrolled" over time so we can see the value of each variable at successive time steps.

![image](images/RNN_basic.JPG)

The diagram below shows the actual matrix representation of each of the variables in the RNN. Instead of doing two separate matrix multiplications on the input and the previous hidden state to calculate the next hidden state, we can concatenate those two variables into a single I vector and stack the two weight matrices into a single W matrix, which gives the same result with one multiplication.

![image](images/RNN_matrices.JPG)
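
The equivalence between the two separate multiplications and the single concatenated one can be checked numerically. The sketch below uses plain Python lists; the sizes, the values and the names `W_x`, `W_h`, `I_vec` and `matvec` are arbitrary choices for illustration, and the bias and activation function are left out.

```python
def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

x = [1.0, 2.0]                 # current input, length 2
h = [0.5, -1.0, 0.25]          # previous hidden state, length 3

W_x = [[0.1, 0.2],             # 3x2 weights applied to the input
       [0.3, 0.4],
       [0.5, 0.6]]
W_h = [[1.0, 0.0, -1.0],       # 3x3 weights applied to the hidden state
       [0.5, 0.5, 0.5],
       [0.0, 2.0, 1.0]]

# Two separate multiplications, then add
separate = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, h))]

# One multiplication on the concatenated vector and the side-by-side weight matrix
I_vec = x + h                                  # length 5
W = [row_x + row_h for row_x, row_h in zip(W_x, W_h)]   # 3x5
combined = matvec(W, I_vec)

print(separate)
print(combined)
```

Both lists hold the same values (up to floating point rounding), which is why the model later in this notebook can use one linear layer on the concatenated vector.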

There is also an alternative way of visually representing the RNN which lets us see how similar it is to a fully connected network.

![image](images/RNN_other_rep.JPG)

Standard neural networks can only model one-to-one relationships, whereas RNNs are extremely flexible in their input-output structure, which is one of the reasons they are so powerful. For example, a one-to-many layout can take in a single image and sequentially produce a caption, and a many-to-one layout can take in a sentence one word at a time and give a single output describing its sentiment.

![image](images/RNN_layouts.JPG)
Source: http://karpathy.github.io/2015/05/21/rnn-effectiveness/

### Optimization
Surprisingly, with this increased complexity in structure, the optimization method does not become any more difficult. Despite having a different name, back-propagation through time, it is essentially the same thing. All you do is feed in your sequence sequentially to get the output, as usual. You then just calculate your error at each timestep and sum it as opposed to calculating the error at a single timestep like standard neural networks. Then you can use gradient descent to update your weights iteratively until you are satisfied with your network's performance.

![image](images/RNN_BPTT.JPG)
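
To make "calculate the error at each timestep and sum it" concrete, here is a hand-written sketch of backpropagation through time for a toy RNN with one scalar weight. It assumes a squared error at every step instead of the cross entropy used later, and it derives the gradient by hand instead of using autograd; the function names are made up for illustration.

```python
import math

def forward(w, xs):
    """Scalar RNN h_t = tanh(w * h_{t-1} + x_t); returns [h_0, h_1, ..., h_T]."""
    hs = [0.0]  # h_0
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + x))
    return hs

def total_loss(w, xs, ys):
    """Sum of the per-time-step squared errors."""
    hs = forward(w, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], ys))

def bptt_grad(w, xs, ys):
    """Gradient of the summed loss with respect to w, walking backwards in time."""
    hs = forward(w, xs)
    grad_w = 0.0
    dh_future = 0.0  # gradient arriving from later time steps
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - ys[t - 1]) + dh_future  # error at this step plus error from the future
        da = dh * (1.0 - hs[t] ** 2)          # backpropagate through tanh
        grad_w += da * hs[t - 1]              # the same w is used at every step, so contributions add up
        dh_future = da * w                    # pass the gradient on to the previous hidden state
    return grad_w

xs = [0.5, -1.0, 0.25, 0.8]
ys = [0.1, -0.3, 0.2, 0.6]
w = 0.7

# Compare against a numerical estimate of the gradient
eps = 1e-6
numerical = (total_loss(w + eps, xs, ys) - total_loss(w - eps, xs, ys)) / (2 * eps)
print(bptt_grad(w, xs, ys), numerical)
```

The two printed numbers should agree closely. In the PyTorch code later in this notebook, `loss.backward()` performs the same backwards walk automatically.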

RNNs are generally slower to optimize than standard neural networks because the hidden state at each time step depends on the hidden state from the previous time step, so the operations cannot be parallelized across time steps.

![image](images/RNN_parallel.JPG)

For a long time it was considered difficult to train RNNs due to two problems called vanishing and exploding gradients. These problems also exist in standard neural networks but are much more pronounced in RNNs, because the same weights are applied again at every time step. However, modern techniques such as LSTM cells have greatly reduced this difficulty.

![image](images/RNN_gradient.JPG)
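
A rough way to see where vanishing and exploding gradients come from is to look at a linear recurrence with a single scalar weight. This is a simplified sketch, not the full picture for a matrix-valued RNN, and `gradient_factor` is a made-up helper name.

```python
def gradient_factor(w, steps):
    """Factor by which a gradient is scaled after flowing back through
    `steps` time steps of the linear recurrence h_t = w * h_{t-1} + x_t."""
    factor = 1.0
    for _ in range(steps):
        factor *= w  # d h_t / d h_{t-1} = w at every step
    return factor

for w in (0.5, 1.0, 1.5):
    print(w, gradient_factor(w, 50))
```

With a weight below 1 the factor shrinks towards zero over 50 steps, and with a weight above 1 it grows very large. A tanh activation has a derivative of at most 1, so it can only shrink the gradient further.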

## Implementation

We are going to implement a character-level text prediction model. At each time step we feed in a single character and ask our network to predict the next character as a time-dependent function of all the characters that came before it, so there is one output for every input.

First we need a dataset. This is just a text file containing the data we want to model. In this case, I have found a file which contains ~0.5MB of Kendrick Lamar lyrics. You can use a variety of different datasets, and plenty are easily accessible online (check out the link below). Otherwise, it is very easy to create your own, either by copying and pasting text into a file or by creating a bot to do this automatically for you.

[Datasets repo 1](https://github.com/cedricdeboom/character-level-rnn-datasets/tree/master/datasets)

We now define our dataset class, which reads the dataset and can be used with a PyTorch dataloader for easy sampling.

We first open the file and read all the data.

Each text character will be represented by a unique number, so we first need all the unique characters in our text. Once we have these, we create a dictionary which maps from a unique number to a character. After defining the reverse mapping as well, we use it to convert our original string into a list of numbers where each number represents a text character.

The labels are simply the input but shifted by one as we are always predicting the next character based on the current one.
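
The mapping and the one-character shift can be seen on a tiny example before looking at the full dataset class. The string `"hello world"` and the chunk size of 4 are arbitrary choices for illustration.

```python
text = "hello world"

chars = sorted(set(text))                            # unique characters in a fixed order
num_to_let = dict(enumerate(chars))                  # id -> character
let_to_num = {c: i for i, c in num_to_let.items()}   # character -> id

ids = [let_to_num[c] for c in text]                  # the whole text as a list of ids

chunk_size = 4
x = ids[0:chunk_size]          # input chunk:  'h', 'e', 'l', 'l'
y = ids[1:chunk_size + 1]      # label chunk:  'e', 'l', 'l', 'o'

print(x, [num_to_let[i] for i in x])
print(y, [num_to_let[i] for i in y])
```

Each label is the character that follows the corresponding input character, so `y` is `x` shifted along by one position.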

Notice how we do not one-hot encode our input, instead outputting the unique id of each character in the text. This is because we use a PyTorch embedding layer, which is explained later on.

In [6]:
import torch
import numpy as np
import torch.nn.functional as F

class CharRNNDataset():
    def __init__(self, txt_file_path='Data/lyrics.txt', chunk_size=100, transform=None):
        self.txt_file_path = txt_file_path
        self.chunk_size = chunk_size
        self.transform = transform
        
        #open our text file and read all the data into the rawtxt variable
        with open(txt_file_path, 'r') as file:
            rawtxt = file.read()

        #turn all of the text into lowercase as it will reduce the number of characters that our algorithm needs to learn
        rawtxt = rawtxt.lower()
        
        letters = sorted(set(rawtxt)) #sorted list of the unique characters in our raw text, so the ids are the same on every run
        self.nchars = len(letters) #number of unique characters in our text file
        self.num_to_let = dict(enumerate(letters)) #create the dictionary mapping from a unique number to a character
        self.let_to_num = dict(zip(self.num_to_let.values(), self.num_to_let.keys())) #create the reverse mapping so we can map from a character to a unique number
        
        txt = list(rawtxt)#convert string to list
        for k, letter in enumerate(txt): #iterate through our text and change the value for each character to its mapped value
            txt[k] = self.let_to_num[letter] #set the kth item equal to the value it maps to

        self.X = np.array(txt) #convert txt to numpy array
    
    def __len__(self):
        return len(self.X)-self.chunk_size #the number of datapoints we have based on the chunk size and X
    
    def __getitem__(self, idx):
        x = self.X[idx:idx+self.chunk_size] #get the chunk at the particular index
        y = self.X[idx+1:idx+self.chunk_size+1] #get the labels which is like the input but shifted one to the left
        
        if self.transform: #apply the transform if any
            x, y = self.transform((x, y))
    
        return x, y

In [7]:
class ToLongTensor():
    def __call__(self, inp):
        #convert every array in the (x, y) tuple to a LongTensor, the integer type expected by the embedding layer and the cost function
        return tuple(torch.as_tensor(var, dtype=torch.long) for var in inp)

In [None]:
from torch.utils.data import DataLoader

batch_size = 32
chunk_size = 150 #the length of the sequences which we will optimize over

train_data = CharRNNDataset('Data/lyrics.txt', chunk_size=chunk_size, transform=ToLongTensor()) #instantiate dataset from class defined above
x, y = train_data[0]
print('First input', x)
print('First label', y, '\n')

nchars = train_data.nchars
num_to_let = train_data.num_to_let
let_to_num = train_data.let_to_num

print('Number of unique characters:', nchars)
print('Length of dataset:', len(train_data))

train_loader = DataLoader(train_data,# make the training dataloader
                          batch_size = batch_size,
                          shuffle=True)

Define our model which takes in variables defining its structure as parameters. The encoder converts each unique number into an embedding which is fed into the rnn model. The RNN calculates the hidden state which is converted into an output through a fully connected layer called the decoder.

We also define the init_hidden function, which returns a tensor of zeros of the required size for the initial hidden state.

The input we get from the dataloader is a vector of integers, each of which corresponds to a character. To feed this into our model we use a vector embedding. The embedding layer in PyTorch is a lookup table that maps each integer to a learned vector of length embedding_len. This is equivalent to converting each integer into a one-hot encoded vector and then performing a linear transformation (without a bias) from that to our embedding size. The result is then fed into our RNN. In the one-hot vector space, the vector for each character is orthogonal to every other vector, so each letter is equally "similar" to every other letter. The embedding vector is continuous, hence the network can learn which characters have similar usage patterns and put them closer together in the embedding vector space. Our embedding length can be smaller than the one-hot vector, so we can compress the input, as each variable can take on continuous values.
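
The relationship between a one-hot vector, a linear transformation and an embedding lookup can be illustrated in plain Python. The table `E` and its values are made up, and this is only a sketch of the idea; it does not use the actual PyTorch layer.

```python
def one_hot(idx, n):
    """Vector of length n with a 1 at position idx and 0 everywhere else."""
    return [1.0 if i == idx else 0.0 for i in range(n)]

def vecmat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

# Embedding table: 4 characters, embedding length 2 (one row per character id)
E = [[0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6],
     [0.7, 0.8]]

char_id = 2
via_one_hot = vecmat(one_hot(char_id, 4), E)  # one-hot encode, then linear transformation
via_lookup = E[char_id]                       # simply pick out the row

print(via_one_hot, via_lookup)
```

Multiplying a one-hot vector by the table just selects one row, so the lookup gives the same vector without building the one-hot vector at all. During training the rows of the table are updated like any other weights.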

In [21]:
class CharRNN(torch.nn.Module):
    def __init__(self, input_size, embedding_len, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.encoder = torch.nn.Embedding(input_size, embedding_len) #embedding layer
        self.i2h = torch.nn.Linear(embedding_len + hidden_size, hidden_size) #linear layer from I vector to the hidden
        self.h2y = torch.nn.Linear(hidden_size, output_size) #linear layer from hidden state to output

    def forward(self, x, hidden):
        embedding = self.encoder(x) #encode the input into a vector embedding
        combined = torch.cat((embedding, hidden), 1) #concatenate embedding and hidden to create I vector
        hidden = torch.tanh(self.i2h(combined)) #apply linear layer and activation function to calculate hidden state value
        output = self.h2y(hidden) #calculate output from hidden state
        return output, hidden

    def init_hidden(self, x):
        return torch.zeros(x.shape[0], self.hidden_size) #zeros vector of hidden size for each input example

Instantiate our model and define the appropriate hyper-parameters, cost function and optimizer. We will be training on random samples from the text of length chunk_size, so chunk_size is the number of time steps we backpropagate through for each sample, while batch_size plays the same role as in normal neural networks.

In [22]:
#hyper-params
lr = 0.001
epochs = 50
embedding_len = 400
hidden_size = 128

myRNN = CharRNN(nchars, embedding_len, hidden_size, nchars) #instantiate the model from the class defined earlier
criterion = torch.nn.CrossEntropyLoss() #define cost function - Cross Entropy
optimizer = torch.optim.Adam(myRNN.parameters(), lr=lr) #choose optimizer

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter() # we will use this to show our model's performance on a graph

Define the training loop, sequentially feeding in multiple batches of random chunks of text, summing the cost for each character in the sequence (backpropagation through time) and calculating the gradients to update our weights.

In [23]:
#training loop
def train(model, epochs):
    for epoch in range(epochs):
        epoch_loss = 0 #stores the cost for each epoch
        generated_string = '' #stores the text generated by our model for the 0th batch over the whole epoch
        for idx, (x, y) in enumerate(train_loader):
            loss = 0 #cost for this batch
            h = model.init_hidden(x) #initialize our hidden state to 0s
            for i in range(chunk_size): #sequentially input each character in the sequence for each batch and calculate loss
                out, h = model(x[:, i], h) #calculate outputs based on input and previous hidden state

                outl = out.argmax(dim=1) #based on our output, what character id does our network assign the highest score of being next? This is a [batch_size] sized Tensor

                letter = num_to_let[outl[0].item()] #what character is predicted for the 0th batch item?
                generated_string+=letter #add the predicted letter to our generated sequence
                
                loss += criterion(out, y[:, i]) #add the cost for this input to the cost for the current batch
            
            writer.add_scalar('Loss/Train', loss/chunk_size, epoch*len(train_loader) + idx)    # write loss to a graph
            
            #based on the sum of the cost for this sequence (backpropagation through time) calculate the gradients and update our weights
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            epoch_loss+=loss.item() #add the cost of this batch to the cost of this epoch
        epoch_loss /= len(train_loader) * chunk_size #average cost per character: each batch cost is a sum over chunk_size steps of a batch-averaged cost

        print('Epoch ', epoch+1, ' Avg Loss: ', epoch_loss)
        print('Generated text: ', generated_string[0:600], '\n')

In [None]:
train(myRNN, epochs)

The generated text above picks the most probable next character each time. This is not the best way to do it, as our model will be deterministic and so will produce the same text over and over again. To get it to produce different text, we should instead sample from the probability distribution over possible next letters output by the network. That is what we will do with the generate function. It takes in a prime string which is used to prime the hidden state of the network before it starts making predictions, so it essentially completes the string you prime it on. The temperature parameter divides the network's outputs before they are turned into probabilities: values below 1 make the sampling more conservative, while values above 1 make it more random.
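
The effect of sampling with a temperature can be sketched without a trained network. The example below uses made-up output scores for a three-character vocabulary and plain Python; `softmax_with_temperature` and `sample_index` are hypothetical helper names, not part of the notebook's code.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide the scores by the temperature, then normalise them into probabilities."""
    scaled = [value / temperature for value in logits]
    largest = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

def sample_index(probs, rng):
    """Draw one index at random, weighted by the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # made-up network outputs for a 3-character vocabulary
rng = random.Random(0)
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs], sample_index(probs, rng))
```

A low temperature concentrates the probability on the highest-scoring character, so the samples stay close to the greedy choice. A high temperature flattens the distribution, so less likely characters are picked more often.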

In [15]:
#take in a string and map each character in it to its value in a dictionary
def maparray(txt, mapdict):
    txt = list(txt)
    for k, letter in enumerate(txt): #iterate through our text and change the value for each character to its mapped value
        txt[k] = mapdict[letter] #set the kth item equal to the value it maps to
    txt = np.array(txt) #convert to numpy array
    return txt

def generate(model, prime_str='a', str_len=150, temperature=0.75):
    generated_string = prime_str #the sequence generated so far is equal to the prime string
    
    prime_str = maparray(prime_str, let_to_num) #use the maparray function to map the string to its character ids
    x = torch.LongTensor(prime_str).unsqueeze(0)  #convert to LongTensor and add dimension to make batch size 1
    
    h = model.init_hidden(x) #initialize hidden state
    
    for i in range(x.shape[1]-1): #for each input character except the last
        out, h = model.forward(x[:, i], h) #feed that character into the network (prime hidden state)
    
    x = x[:, -1] #get the last letter
    for i in range(str_len): #for each character we want to generate
        out, h = model.forward(x, h) #feed in the last character 
        
        out_dist = F.softmax(out.view(-1) / temperature, dim=0) #divide the outputs by the temperature and convert them to probabilities
        sample = torch.multinomial(out_dist, 1).item() #sample a character id from that distribution
        pred_char = num_to_let[sample] #convert the sampled number into the corresponding character
        
        generated_string += pred_char #add the character to the generated string
        
        x = torch.LongTensor([sample]) #set the last letter equal to the newly generated character
    
    return generated_string

In [None]:
gen = generate(myRNN, 'starting text ', 2000, 0.75)
print(gen)

### Using PyTorch's built-in RNN module

In [115]:
class CharRNN(torch.nn.Module):
    def __init__(self, input_size, embedding_len, hidden_size, output_size, n_layers=1):
        super().__init__()
        #store input parameters in the object so we can use them later on
        self.input_size = input_size
        self.embedding_len = embedding_len
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers

        #required functions for model
        self.encoder = torch.nn.Embedding(input_size, embedding_len) #apply embedding layer
        self.rnn = torch.nn.RNN(embedding_len, hidden_size, n_layers, batch_first=True) #create recurrent layer
        self.decoder = torch.nn.Linear(hidden_size, output_size) #linear mapping from hidden to output

    def forward(self, x, hidden):
        embedding = self.encoder(x.view(-1)) #encode our input into a vector embedding
        output, hidden = self.rnn(embedding.view(-1, 1, self.embedding_len), hidden) #calculate the output from our rnn based on our input and previous hidden state
        output = self.decoder(output.view(-1, self.hidden_size)) #calculate output based on output of rnn
        return output, hidden

    def init_hidden(self, x):
        return torch.zeros(self.n_layers, x.shape[0], self.hidden_size) #initialize hidden state to a matrix of 0s

In [116]:
#hyper-params
lr = 0.001
epochs = 50
embedding_len = 400
hidden_size = 128

myRNN = CharRNN(nchars, embedding_len, hidden_size, nchars) #instantiate our model from the class defined earlier
criterion = torch.nn.CrossEntropyLoss() #define cost function - Cross Entropy
optimizer = torch.optim.Adam(myRNN.parameters(), lr=lr) #choose optimizer

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter() # we will use this to show our model's performance on a graph

In [None]:
train(myRNN, epochs)

In [None]:
gen = generate(myRNN, 'starting text', 2000, 0.75)
print(gen)

## How can we overcome the shortcomings of RNNs (unparallelisable training and vanishing gradient)?

![image](images/RNN_LSTM.JPG)

![image](images/RNN_LSTM_gradient.JPG)

### Implementation

Only two things change from the above example when we use an LSTM instead. Firstly, we use torch.nn.LSTM instead of torch.nn.RNN when defining our model. Secondly, we change the init_hidden function so it returns an extra matrix of 0s, as the LSTM has not only a hidden state but also a cell state which needs to be initialized.
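
To show why there are two states to carry around, here is a sketch of a single LSTM step with scalar states, written in plain Python using the standard LSTM gate equations. The parameter dictionary layout and all the values in it are made up for illustration; torch.nn.LSTM does the same kind of update with learned weight matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step. p maps each gate name to a (w_x, w_h, b) triple of scalars."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # the new information itself

    c = f * c_prev + i * g           # cell state: mostly additive update
    h = o * math.tanh(c)             # hidden state: gated view of the cell state
    return h, c

params = {                           # arbitrary illustrative values
    "input": (0.5, 0.1, 0.0),
    "forget": (0.2, 0.3, 1.0),
    "output": (0.4, 0.2, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}

h, c = 0.0, 0.0                      # both states start at zero
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)
    print(h, c)
```

Both `h` and `c` are passed from one step to the next, which is why the hidden argument of the LSTM model is a tuple of two tensors instead of a single tensor.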

In [119]:
class CharLSTM(torch.nn.Module):
    def __init__(self, input_size, embedding_len, hidden_size, output_size, n_layers=1):
        super().__init__()
        #store input parameters in the object so we can use them later on
        self.input_size = input_size
        self.embedding_len = embedding_len
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers

        #required functions for model
        self.encoder = torch.nn.Embedding(input_size, embedding_len)
        self.rnn = torch.nn.LSTM(embedding_len, hidden_size, n_layers, batch_first=True)
        self.decoder = torch.nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        embedding = self.encoder(x.view(-1)) #encode our input into a vector embedding
        output, hidden = self.rnn(embedding.view(-1, 1, self.embedding_len), hidden) #calculate the output from our rnn based on our input and previous hidden state
        output = self.decoder(output.view(-1, self.hidden_size)) #calculate our output based on output of rnn
        return output, hidden

    def init_hidden(self, x):
        return (torch.zeros(self.n_layers, x.shape[0], self.hidden_size),
                torch.zeros(self.n_layers, x.shape[0], self.hidden_size)) #initialize our hidden and cell state to a matrix of 0s

In [120]:
#hyper-params
lr = 0.001
epochs = 50
embedding_len = 400
hidden_size = 128

myLSTM = CharLSTM(nchars, embedding_len, hidden_size, nchars) #instantiate our model from the class defined earlier
criterion = torch.nn.CrossEntropyLoss() #define cost function - Cross Entropy
optimizer = torch.optim.Adam(myLSTM.parameters(), lr=lr) #choose optimizer

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter() # we will use this to show our model's performance on a graph

In [None]:
train(myLSTM, epochs)

In [None]:
gen = generate(myLSTM, 'starting text', 2000, 0.75)
print(gen)