# Recurrent Neural Networks
So far, the neural network architectures we have used take a single fixed-size input and produce a single fixed-size output. What if we want to model something like language, where the words or sentences we feed in have different lengths? Another issue with vanilla feedforward networks is that each output depends only on the current input. The network has no 'memory' of previous inputs, so it cannot model time-dependent variables. Recurrent neural networks (RNNs) address both of these issues.

They do this by maintaining an internal hidden state, which can be thought of as a form of memory. At each time step, the new hidden state is calculated as a function of the previous hidden state and the current input. This hidden state can be used directly as the output, or passed through another function to compute the output. By 'function' we mean the same kind used in a standard neural network: a linear combination followed by an activation function.

### $h_t = f(x_t, h_{t-1})$

As shown in the diagram below, which uses a further function to compute the output $o$ from the hidden state $s$ (written $h$ in the equation above), there are three matrices of parameters which we are trying to optimize: U, V and W. The diagram also shows how the network can be unfolded to show its variables at successive time steps.

![image](images/rnn.jpg)
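
The unfolded computation can be sketched in a few lines of NumPy. This is a minimal illustration rather than the model trained later. It assumes U maps the input to the hidden state, W maps the previous hidden state to the new one, and V maps the hidden state to the output, and it picks tanh and softmax as the two activation functions. The sizes and random weights are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, seq_len = 4, 3, 2, 5

# Assumed roles: U is input -> hidden, W is hidden -> hidden, V is hidden -> output
U = rng.normal(scale=0.5, size=(hidden_size, input_size))
W = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.5, size=(output_size, hidden_size))

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

xs = rng.normal(size=(seq_len, input_size))  # one input vector per time step
s = np.zeros(hidden_size)                    # initial hidden state

outputs = []
for x_t in xs:
    s = np.tanh(U @ x_t + W @ s)     # the same U and W are reused at every time step
    outputs.append(softmax(V @ s))   # output computed from the current hidden state
outputs = np.stack(outputs)

print(outputs.shape)  # (5, 2)
```

Note that the loop body is identical at every step: unfolding the network adds time steps, not parameters.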

Standard neural networks can only model one-to-one relationships, whereas RNNs are extremely flexible in their input-output structure, which is one of the reasons they are so powerful. For example, a one-to-many layout can take in a single image and sequentially produce a caption, while a many-to-one layout can read a sentence sequentially and give a single output describing its sentiment.

![image](images/rnnlayouts.jpeg)
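
As a rough sketch of how two of these layouts differ in code, the example below runs one `torch.nn.RNN` over a batch of sequences and reads the result in two ways. The sizes, the `readout` layer and the random input are assumptions made for illustration only.

```python
import torch

torch.manual_seed(0)
batch, seq_len, n_features, hidden_size, n_classes = 4, 10, 8, 16, 2

rnn = torch.nn.RNN(n_features, hidden_size, batch_first=True)
readout = torch.nn.Linear(hidden_size, n_classes)  # hypothetical output layer

x = torch.randn(batch, seq_len, n_features)
all_states, last_state = rnn(x)  # the initial hidden state defaults to zeros

# many-to-one: a single prediction from the final hidden state (e.g. sentiment)
many_to_one = readout(last_state[-1])

# many-to-many: one prediction per time step (e.g. next-character prediction)
many_to_many = readout(all_states)

print(many_to_one.shape)   # torch.Size([4, 2])
print(many_to_many.shape)  # torch.Size([4, 10, 2])
```

The recurrent part is the same in both cases. Only the choice of which hidden states are passed to the output layer changes.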

### Optimization
Surprisingly, with this increased complexity in structure, the optimization method does not become any more difficult. Despite having a different name, back-propagation through time, it is essentially the same thing. All you do is feed in your sequence sequentially to get the output, as usual. You then just calculate your error at each timestep and sum it as opposed to calculating the error at a single timestep like standard neural networks. Then you can use gradient descent to update your weights iteratively until you are satisfied with your network's performance.
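
The procedure described above can be sketched on a toy problem. The snippet assumes a small `torch.nn.RNNCell`, a hypothetical linear `readout` layer, random inputs and random targets. It is only meant to show the per-time-step losses being summed before a single backward pass.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, batch, n_in, n_hidden, n_classes = 6, 2, 3, 5, 4

cell = torch.nn.RNNCell(n_in, n_hidden)
readout = torch.nn.Linear(n_hidden, n_classes)
params = list(cell.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

xs = torch.randn(seq_len, batch, n_in)                   # made-up input sequence
targets = torch.randint(0, n_classes, (seq_len, batch))  # made-up labels

h = torch.zeros(batch, n_hidden)
loss = 0
for t in range(seq_len):
    h = cell(xs[t], h)                                     # update the hidden state
    loss = loss + F.cross_entropy(readout(h), targets[t])  # add this time step's error

optimizer.zero_grad()
loss.backward()    # gradients flow back through every time step
optimizer.step()   # one gradient descent update
```

Because the hidden state at each step is built from the previous one, the single call to `backward()` propagates the error through the whole sequence.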

RNNs are generally slower to optimize than standard neural networks because the hidden state at each time step depends on the previous one, so the computation cannot be parallelized across time steps.

For a long time, RNNs were considered difficult to train because of two problems known as vanishing and exploding gradients. These problems also exist in standard neural networks but are much more pronounced in RNNs. However, modern techniques such as LSTM cells have greatly reduced this difficulty.
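
A toy calculation gives some intuition for both problems. It assumes a deliberately simplified scalar recurrence $h_t = w \cdot h_{t-1}$ with no input and no activation function, so by the chain rule the gradient of $h_T$ with respect to $h_0$ is just $w$ multiplied by itself $T$ times. Real RNNs multiply by matrices and activation derivatives instead, but the same compounding effect applies.

```python
def gradient_scale(w, steps):
    """Size of d h_T / d h_0 for the toy recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: each step contributes a factor d h_t / d h_{t-1} = w
    return grad

for w in (0.5, 1.0, 1.5):
    print(w, gradient_scale(w, steps=20))
```

After 20 steps, a factor of 0.5 shrinks the gradient to roughly one millionth of its original size (vanishing), while a factor of 1.5 grows it to a few thousand times its original size (exploding).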

For this particular task, we will need to do quite a bit of pre-processing. We need to find the unique characters in our training text and give each one a unique number so we can one-hot encode them.<br>
We start by reading the file and converting all letters to lowercase to reduce the number of characters we need to model, then define a function which takes in the text and gives us back a dictionary mapping each character to a unique number.
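
The pre-processing steps just described can be sketched in plain Python. The example string, the `one_hot` helper and the use of `sorted` (to make the mapping reproducible) are assumptions of this sketch, not part of the notebook's dataset class.

```python
text = "Hello World".lower()  # lowercase to reduce the number of distinct characters

chars = sorted(set(text))  # unique characters, sorted so the mapping is reproducible
let_to_num = {ch: i for i, ch in enumerate(chars)}     # character -> number
num_to_let = {i: ch for ch, i in let_to_num.items()}   # number -> character

encoded = [let_to_num[ch] for ch in text]  # the text as a list of numbers

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

one_hot_text = [one_hot(i, len(chars)) for i in encoded]
decoded = "".join(num_to_let[i] for i in encoded)

print(len(chars))  # 8
print(decoded)     # hello world
```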

## Implementation
We are going to be implementing a one-to-one character level text prediction model. We will be sequentially feeding in a single character and asking our network to predict the next character as a time dependent function of all the characters that came before it.

As always, we begin by importing the required libraries.

In [4]:
import torch
import numpy as np
import torch.nn.functional as F

Let's read in our dataset. We just need a text file containing the data we want to model. In this case, I have found a file which contains ~0.5MB of Kendrick Lamar lyrics. You can use a variety of different datasets; plenty are easily accessible online (check out the link below). Otherwise, it is very easy to create your own, either by copying and pasting text into a file or by creating a bot to do this for you automatically.

[Datasets repo 1](https://github.com/cedricdeboom/character-level-rnn-datasets/tree/master/datasets)

We now define our dataset class, which reads the dataset and can be used with a PyTorch `DataLoader` for easy sampling.

We first open the file and read all the data.

Each text character will be represented by a unique number, so we first need all the unique characters in our text. Once we have these, we create a dictionary which maps from a unique number to a character. After defining the reverse mapping as well, we use it to convert our original string into a list of numbers where each number represents a text character.

The labels are simply the input but shifted by one as we are always predicting the next character based on the current one.
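
A minimal sketch of this shift, using a plain string so the effect is easy to see. The `make_pair` helper is hypothetical and only mirrors the slicing idea, not the dataset class itself.

```python
def make_pair(seq, idx, chunk_size):
    """Return an input chunk and its label chunk, shifted one position to the right."""
    x = seq[idx:idx + chunk_size]
    y = seq[idx + 1:idx + chunk_size + 1]
    return x, y

x, y = make_pair("hello world", idx=0, chunk_size=5)
print(repr(x), repr(y))  # 'hello' 'ello '
```

At every position `i`, `y[i]` is the character that follows `x[i]` in the text, which is exactly what the network is asked to predict.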

In [5]:
class CharRNNDataset():
    def __init__(self, txt_file_path='Data/lyrics.txt', chunk_size=100, transform=None):
        self.txt_file_path = txt_file_path
        self.chunk_size = chunk_size
        self.transform = transform
        
        #open our text file and read all the data into the rawtxt variable
        with open(self.txt_file_path, 'r', encoding="utf8") as file:
            rawtxt = file.read()

        #turn all of the text into lowercase as it will reduce the number of characters that our algorithm needs to learn
        rawtxt = rawtxt.lower()
        
        letters = set(rawtxt) #the set of unique characters in our raw text
        self.nchars = len(letters) #number of unique characters in our text file
        self.num_to_let = dict(enumerate(letters)) #create the dictionary mapping from a unique number to a character
        self.let_to_num = dict(zip(self.num_to_let.values(), self.num_to_let.keys())) #create the reverse mapping so we can map from a character to a unique number
        
        txt = list(rawtxt)#convert string to list
        #iterate through our text and change the value for each character to its mapped value
        for k, letter in enumerate(txt):
            txt[k] = self.let_to_num[letter]

        self.X = np.array(txt)
    
    def __len__(self):
        return len(self.X)-1-self.chunk_size
    
    def __getitem__(self, idx):
        x = self.X[idx:idx+self.chunk_size]
        y = self.X[idx+1:idx+self.chunk_size+1]
        
        if self.transform:
            x, y = self.transform((x, y))
    
        return x, y

In [6]:
class ToLongTensor():
    """Convert each array in an (x, y) pair into a torch.LongTensor."""
    def __call__(self, inp):
        return tuple(torch.LongTensor(var) for var in inp)

In [7]:
from torch.utils.data import DataLoader

batch_size = 32
chunk_size = 100 #the length of the sequences which we will optimize over

train_data = CharRNNDataset('lyrics.txt', chunk_size=chunk_size, transform=ToLongTensor())
x, y = train_data[0]
print('First input', x)
print('First label', y, '\n')

nchars = train_data.nchars
num_to_let = train_data.num_to_let
let_to_num = train_data.let_to_num

print('Number of unique characters:', nchars)
print('Length of dataset:', len(train_data))

train_loader = DataLoader(train_data,# make the training dataloader
                          batch_size = batch_size,
                          shuffle=True)

First input tensor([27, 12, 51, 52, 10, 23, 10,  9, 51,  7, 51,  7, 28, 51,  9, 10, 11, 12,
         9, 53, 47, 10, 11, 27, 20, 38, 34, 23, 46, 55, 51, 11, 10, 27, 20, 38,
        10, 46,  9, 23,  7, 51, 10, 27, 30, 30, 54, 34, 27, 20, 46, 51, 11,  2,
        33, 23, 20, 51, 11, 11, 51, 10, 27, 10, 20, 23, 44, 44, 27, 10, 34, 23,
         0, 55, 10, 11, 54,  7, 51, 10, 46, 54, 53, 20,  0, 51,  9, 33, 51, 23,
         0, 11,  2, 28, 53,  0, 10, 20, 54, 34])
First label tensor([12, 51, 52, 10, 23, 10,  9, 51,  7, 51,  7, 28, 51,  9, 10, 11, 12,  9,
        53, 47, 10, 11, 27, 20, 38, 34, 23, 46, 55, 51, 11, 10, 27, 20, 38, 10,
        46,  9, 23,  7, 51, 10, 27, 30, 30, 54, 34, 27, 20, 46, 51, 11,  2, 33,
        23, 20, 51, 11, 11, 51, 10, 27, 10, 20, 23, 44, 44, 27, 10, 34, 23,  0,
        55, 10, 11, 54,  7, 51, 10, 46, 54, 53, 20,  0, 51,  9, 33, 51, 23,  0,
        11,  2, 28, 53,  0, 10, 20, 54, 34, 10]) 

Number of unique characters: 56
Length of dataset: 24341


Next we define our model, which takes the variables defining its structure as parameters. The encoder converts each character's unique number into an embedding, which is fed into the RNN. The RNN calculates the hidden state, which is converted into an output by a fully connected layer acting as the decoder.<br>
We also define the init_hidden function, which returns a tensor of zeros of the required size for the initial hidden state.
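
To make the tensor shapes concrete, here is a small sketch of a single recurrent step built from the same ingredients: an embedding lookup, concatenation with the hidden state, and a linear layer followed by tanh. The sizes match the values used in this notebook (56 characters, embedding size 24, hidden size 64, batch size 32), and the input batch is random.

```python
import torch

nchars, d_embed, hidden_size, batch = 56, 24, 64, 32

encoder = torch.nn.Embedding(nchars, d_embed)
i2h = torch.nn.Linear(d_embed + hidden_size, hidden_size)

x = torch.randint(0, nchars, (batch,))        # one character index per sequence in the batch
hidden = torch.zeros(batch, hidden_size)      # initial hidden state of zeros

embedding = encoder(x)                        # (32, 24): one vector per character
combined = torch.cat((embedding, hidden), 1)  # (32, 88): input and memory side by side
new_hidden = torch.tanh(i2h(combined))        # (32, 64): the updated hidden state

print(embedding.shape, combined.shape, new_hidden.shape)
```

Concatenating the embedding with the hidden state and applying one linear layer is equivalent to applying separate weight matrices to the input and to the previous hidden state and summing the results.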

In [8]:
class CharRNN(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size, d_embed):
        super().__init__()
        self.hidden_size = hidden_size
        self.encoder = torch.nn.Embedding(input_size, d_embed)
        self.i2h = torch.nn.Linear(d_embed + hidden_size, hidden_size)
        self.i2o = torch.nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        embedding = self.encoder(x) #encode our input into a vector embedding
        combined = torch.cat((embedding, hidden), 1)
        hidden = torch.tanh(self.i2h(combined))
        output = self.i2o(hidden)
        output = F.log_softmax(output, dim=-1)
        return output, hidden

    def init_hidden(self, x):
        return torch.zeros(x.shape[0], self.hidden_size)

Instantiate our model and define the appropriate hyper-parameters, cost function and optimizer. We will be training on random samples of length chunk_size from the text, so chunk_size is to our RNN what batch size is to normal neural networks.

In [9]:
#hyper-params
lr = 0.001
epochs = 50

myRNN = CharRNN(nchars, 64, nchars, 24) #instantiate our model from the class defined earlier
# myRNN = CharRNN(nchars, 512, nchars) #instantiate our model from the class defined earlier
criterion = torch.nn.NLLLoss() #define our cost function
optimizer = torch.optim.Adam(myRNN.parameters(), lr=lr) #choose optimizer

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter() # we will use this to show our models performance on a graph

Define the training loop, sequentially feeding in multiple batches of random chunks of text, summing the cost for each character in the sequence (backpropagation through time) and calculating the gradients to update our weights.

In [10]:
from tqdm import tqdm

def train(model, epochs, batch_size=32, chunk_size=100):
    for epoch in tqdm(range(epochs)):
        epoch_loss = 0
        generated_strings = ['']*batch_size
        
        for idx, (x, y) in enumerate(train_loader):
            loss = 0
            
            h = model.init_hidden(x)
            
            for i in range(chunk_size):
                
                # run the RNN over every char until the current iteration of chunk_size
                out, h = model(x[:, i], h)
                
                # Maximum values that our model predicts for the batch
                # This is a [batch_size] sized Tensor
                _, max_outs = out.data.max(1)
                
                # Iterate over the values to assign them to their relevant string
                # We have a [batch_size] Tensor, and we need to assign the n'th letter from the batch
                # to the n'th string in our generated_strings list
                for k in range(len(max_outs)):
                    letter = num_to_let[max_outs[k].item()]
                    generated_strings[k] += letter
                
                # Loss between predictions and ground truth for current chunk
                loss += criterion(out, y[:, i])
                
            writer.add_scalar("Loss/Train", loss/chunk_size, epoch*len(train_loader) + idx)
            
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            # add this batch's cost to the total epoch loss
            epoch_loss += loss.item()
        
        # Calculate average cost per epoch
        epoch_loss /= len(train_loader.dataset)
        
        print('Epoch ', epoch+1, ' Avg Loss: ', epoch_loss)
        print('Generated text: ', generated_strings[np.random.randint(0, batch_size)][0:750], '\n')

train(myRNN, epochs, batch_size, chunk_size)

  0%|          | 0/50 [00:00<?, ?it/s]

Epoch  1  Avg Loss:  7.1581453953609735

The text generated during training is built by picking the most probable next character at each step. This is not the best way to generate text: the model becomes deterministic, so a given starting point always produces the same output. To get varied text, we should instead sample from the probability distribution over possible next characters output by the network. That is what we will do with the next function.
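
Before that, a small pure-Python sketch of what sampling with a temperature does to a distribution. The three-character distribution is made up, and `apply_temperature` is a hypothetical helper that mirrors the idea of dividing log-probabilities by a temperature and exponentiating.

```python
import math
import random

def apply_temperature(log_probs, temperature):
    """Rescale log-probabilities by a temperature and renormalize to probabilities."""
    weights = [math.exp(lp / temperature) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]

log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]  # made-up model output

sharp = apply_temperature(log_probs, 0.5)  # low temperature: favours the likeliest character
same = apply_temperature(log_probs, 1.0)   # temperature 1: distribution unchanged
flat = apply_temperature(log_probs, 2.0)   # high temperature: closer to uniform

random.seed(0)
sample = random.choices(range(len(flat)), weights=flat, k=1)[0]  # draw one character index

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in same])
print([round(p, 3) for p in flat])
```

Lower temperatures make the output more conservative and repetitive, while higher temperatures make it more varied but also more error-prone.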

In [11]:
def maparray(txt, mapdict):
    txt = list(txt)
    #iterate through our text and change the value for each character to its mapped value
    for k, letter in enumerate(txt):
        txt[k] = mapdict[letter]
    txt = np.array(txt)
    return txt

def generate(model, prime_str='a', str_len=150, temperature=0.75):
    generated = prime_str
    
    prime_str = maparray(prime_str, let_to_num)
    x = torch.LongTensor(prime_str).unsqueeze(0)
    print("x", x, x.shape)
    
    #initialize hidden state
    h = model.init_hidden(x)
    
    # Run the model over all but the last priming character to 'initialise' the hidden state
    # (the last character is fed in as the first input of the sampling loop below)
    for i in range(x.shape[1] - 1):
        out, h = model(x[:, i], h)
    
    x = x[:, -1] # Last character of the priming string
    for i in range(str_len):
        out, h = model(x, h)
        
        out_dist = out.data.view(-1).div(temperature).exp()
        sample = torch.multinomial(out_dist, 1).item()
        pred_char = num_to_let[sample]
        
        generated += pred_char
        
        x = torch.LongTensor([sample])
    
    return generated
        
gen = generate(myRNN, 'this be ', 2000, 0.75)
print(gen)

x tensor([[29, 50, 10,  8, 23, 22, 55, 23]]) torch.Size([1, 8])
this be trer rcent upme a badn berann
i i woon' wreer the funt
dowe the stame
yeug' on you jund barce rorer the natch ster you were ae fun whol' likn dnin' you satch warki wetr me the stou cank
you the cofl' i gol that migg
the bikm wery eat din'
rean'
feron s7t ate tee as want the fond noxt on set the dsin' mo be you the sarat dode witters, i go fun the i gee in dirlo lin' cald wo ata
thes ware thit' a lut bithe nresin' srhe gol it to ano bine scot on the gat soon't bin' the in ther you don' rooverina now i dunknow, sam the cuth coon bmoli
n and
dolk, i gea thed gan the way te yous itrere tous fus the game to dake fot the the gon ware an a wean a toot the poram the sis i wint the saing ine kime aid skuck if aigit stap bo the cona oug ofrersteat i poos that, ju the (it ut s ay the pan you lorlu
poous i fnow ster up it 7in
inge the werdo ast dyem
the the oull bing you set a dit en thit tre cand i gol yan a doll

In [33]:
class CharRNN(torch.nn.Module):
    def __init__(self, input_size, hidden_size, output_size, d_embed, n_layers=1):
        super().__init__()
        #store input parameters in the object so we can use them later on
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.d_embed = d_embed
        self.n_layers = n_layers

        #required functions for model
        self.encoder = torch.nn.Embedding(input_size, d_embed)
        self.rnn = torch.nn.RNN(d_embed, hidden_size, n_layers, batch_first=True)
        self.decoder = torch.nn.Linear(hidden_size, output_size)

    def forward(self, x, hidden):
        embedding = self.encoder(x) # encode our input into a vector embedding
        seq_embedding = embedding.unsqueeze(1) # add a sequence dimension of length 1: (batch, 1, d_embed)
        
        output, hidden = self.rnn(seq_embedding, hidden) #calculate the output from our rnn based on our input and previous hidden state
        output = self.decoder(output.squeeze(1)) #calculate our output based on output of rnn
        
        # softmax on the last dimension
        output = F.log_softmax(output, dim=-1)
        return output, hidden

    def init_hidden(self, x):
        return torch.zeros(self.n_layers, x.shape[0], self.hidden_size) #initialize our hidden state to a matrix of 0s
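
The class above feeds `torch.nn.RNN` one time step per call by adding a sequence dimension of length 1. As a sketch of the shapes involved, the snippet below (with made-up sizes and random inputs) checks that stepping through a sequence this way gives the same outputs as passing the whole sequence in a single call.

```python
import torch

torch.manual_seed(0)
batch, seq_len, d_embed, hidden_size, n_layers = 4, 7, 24, 64, 1

rnn = torch.nn.RNN(d_embed, hidden_size, n_layers, batch_first=True)
x = torch.randn(batch, seq_len, d_embed)         # (batch, sequence, features)
h0 = torch.zeros(n_layers, batch, hidden_size)   # (layers, batch, hidden)

# One call over the whole sequence
out_full, h_full = rnn(x, h0)

# One call per time step, carrying the hidden state forward
h = h0
step_outputs = []
for t in range(seq_len):
    out_t, h = rnn(x[:, t:t + 1], h)  # slice keeps a sequence dimension of length 1
    step_outputs.append(out_t)
out_steps = torch.cat(step_outputs, dim=1)

print(out_full.shape, h_full.shape)
print(torch.allclose(out_full, out_steps, atol=1e-5))
```

Stepping manually is convenient when each output is needed before the next input is known, as in text generation. During training, where the whole chunk is available up front, a single call over the sequence is an alternative worth considering.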

In [31]:
#hyper-params
lr = 0.001
epochs = 50

myRNN = CharRNN(nchars, 64, nchars, 24) #instantiate our model from the class defined earlier
criterion = torch.nn.NLLLoss() #define our cost function
optimizer = torch.optim.Adam(myRNN.parameters(), lr=lr) #choose optimizer

# SET UP TRAINING VISUALISATION
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter() # we will use this to show our models performance on a graph

In [32]:
train(myRNN, epochs, batch_size, chunk_size)


In [34]:
gen = generate(myRNN, 'this be ', 2000, 0.75)
print(gen)
