# RNN with PyTorch (NLP)

In [1]:
import torch
from torch import nn
import numpy as np

The nn package is used for building the model.

## Preparing our input

In [2]:
text = ['hey how are you','good i am fine','have a nice day']

# Join all the sentences together and extract the unique characters from the combined sentences
chars = set(''.join(text))

# Creating a dictionary that maps integers to the characters
int2char = dict(enumerate(chars))

# Creating another dictionary that maps characters to integers
char2int = {char: ind for ind, char in int2char.items()}

In [5]:
print(char2int)

{'y': 0, 'h': 1, 'm': 2, ' ': 3, 'w': 4, 'n': 5, 'r': 6, 'c': 7, 'o': 8, 'i': 9, 'e': 10, 'a': 11, 'd': 12, 'f': 13, 'v': 14, 'g': 15, 'u': 16}
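Because `chars` is a set, the order of the mapping printed above can differ from run to run. The sketch below rebuilds the two dictionaries on their own and round-trips a string through them. Sorting the characters is an assumption added here to make the mapping reproducible, and `encode` and `decode` are hypothetical helper names that are not part of the notebook.

```python
# Standalone rebuild of the vocabulary from the same three sentences.
text = ['hey how are you', 'good i am fine', 'have a nice day']

# Assumption: sort the unique characters so the mapping is identical on every run.
chars = sorted(set(''.join(text)))

int2char = dict(enumerate(chars))
char2int = {char: ind for ind, char in int2char.items()}

def encode(sentence):
    """Map each character of a string to its integer index (hypothetical helper)."""
    return [char2int[c] for c in sentence]

def decode(indices):
    """Map a list of integer indices back to a string (hypothetical helper)."""
    return ''.join(int2char[i] for i in indices)

encoded = encode('hey')
print(encoded)          # indices under the sorted mapping
print(decode(encoded))  # round-trips back to the original string
```

With a fixed mapping, encoded data saved in one session can still be decoded in another.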


We define the sentences that we want our model to reproduce when it is fed their first characters.
Then we build two dictionaries that map characters to integers and back.

In [6]:
# Finding the length of the longest string in our data
maxlen = len(max(text, key=len))

# Padding

# A simple loop that loops through the list of sentences and adds a ' ' whitespace until the length of
# the sentence matches the length of the longest sentence
for i in range(len(text)):
    while len(text[i]) < maxlen:
        text[i] += ' '

All sentences must have the same length so that they can be stacked into a single batch. The network itself should not depend on a particular length, which is why we pad: short sequences are filled with a padding value (here a whitespace character, although 0s are a common choice) and overly long sequences can be trimmed. The simplest approach, used here, is to find the longest sequence and pad the rest up to its length.
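The same padding can be written more compactly with `str.ljust`. This is a minimal sketch on a fresh copy of the sentences; the names `padded` and `lengths` are illustrative and not part of the notebook.

```python
# Fresh copy of the sentences so the sketch does not depend on earlier cells.
text = ['hey how are you', 'good i am fine', 'have a nice day']

# Length of the longest sentence.
maxlen = max(len(sentence) for sentence in text)

# ljust pads on the right with spaces up to maxlen and leaves longer strings unchanged.
padded = [sentence.ljust(maxlen) for sentence in text]

lengths = [len(sentence) for sentence in padded]
print(lengths)
```

Only the second sentence is shorter than the others, so it is the only one that gains a trailing space.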

In [7]:
# Creating lists that will hold our input and target sequences
input_seq = []
target_seq = []

for i in range(len(text)):
    # Remove last character for input sequence
    input_seq.append(text[i][:-1])

    # Remove first character for target sequence
    target_seq.append(text[i][1:])
    print("Input Sequence: {}\nTarget Sequence: {}".format(input_seq[i], target_seq[i]))

Input Sequence: hey how are yo
Target Sequence: ey how are you
Input Sequence: good i am fine
Target Sequence: ood i am fine 
Input Sequence: have a nice da
Target Sequence: ave a nice day


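Here we split each sentence into an input (last character removed) and a target (first character removed).
The target sequence is one step ahead of the input sequence, so at every position the target is the character that follows the input character.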
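To make the one-step shift concrete, the sketch below pairs every input character with its target for a single sentence. The names `example_input`, `example_target` and `pairs` are illustrative only.

```python
sentence = 'hey how are you'

# Same slicing as in the cell above, applied to one sentence.
example_input = sentence[:-1]   # everything except the last character
example_target = sentence[1:]   # everything except the first character

# Each pair is (character the model sees, character it should predict next).
pairs = list(zip(example_input, example_target))

for current_char, next_char in pairs[:3]:
    print(repr(current_char), '->', repr(next_char))
```

Every pair is one training example: given the character on the left (and the hidden state built from the characters before it), the model should output the character on the right.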

In [8]:
for i in range(len(text)):
    input_seq[i] = [char2int[character] for character in input_seq[i]]
    target_seq[i] = [char2int[character] for character in target_seq[i]]

We converted our characters into integers using the dictionary created above.

In [9]:
dict_size = len(char2int)
seq_len = maxlen - 1
batch_size = len(text)

def one_hot_encode(sequence, dict_size, seq_len, batch_size):
    # Creating a multi-dimensional array of zeros with the desired output shape
    features = np.zeros((batch_size, seq_len, dict_size), dtype=np.float32)
    
    # Replacing the 0 at the relevant character index with a 1 to represent that character
    for i in range(batch_size):
        for u in range(seq_len):
            features[i, u, sequence[i][u]] = 1
    return features

Before encoding the sequence into vectors, we define 
1. _dict_size_ : the number of unique characters
2. _seq_len_ : length of the sequences we use as input (the length of the longest sequence minus 1, because we removed the last character)
3. _batch_size_ : number of sentences

In [10]:
# Input shape --> (Batch Size, Sequence Length, One-Hot Encoding Size)
input_seq = one_hot_encode(input_seq, dict_size, seq_len, batch_size)

The function creates an array of zeros and sets the entry at each character's index to 1, giving an input of shape (batch size, sequence length, dictionary size).
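The double loop can also be replaced by indexing into an identity matrix. The following is a sketch of that alternative on a toy input, not the notebook's own function; `one_hot_encode_vectorized` and the toy data are assumptions made for illustration.

```python
import numpy as np

def one_hot_encode_vectorized(sequences, dict_size):
    """One-hot encode equal-length integer sequences (hypothetical helper)."""
    # Shape (batch_size, seq_len); every sequence must have the same length.
    indices = np.asarray(sequences, dtype=np.int64)
    # Row k of the identity matrix is the one-hot vector for index k, so
    # fancy indexing yields shape (batch_size, seq_len, dict_size).
    return np.eye(dict_size, dtype=np.float32)[indices]

# Toy batch: 2 sequences of length 3 over a 4-symbol vocabulary.
toy = [[0, 2, 1], [1, 1, 3]]
features = one_hot_encode_vectorized(toy, dict_size=4)
print(features.shape)
```

Each position holds exactly one 1, and taking the argmax over the last axis recovers the original integers.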

In [11]:
input_seq = torch.from_numpy(input_seq)
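# Class indices for CrossEntropyLoss must be integers, so build a long tensor directly
target_seq = torch.tensor(target_seq, dtype=torch.long)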

We formatted our data into Torch tensors and are now ready for implementing the network.

## RNN

In [12]:
# torch.cuda.is_available() checks and returns a Boolean True if a GPU is available, else it'll return False
is_cuda = torch.cuda.is_available()

# If we have a GPU available, we'll set our device to GPU. We'll use this device variable later in our code.
if is_cuda:
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU not available, CPU used")

GPU not available, CPU used


We first check whether PyTorch can use a GPU or has to run on the CPU. This example is small enough that a GPU is not needed, but larger models benefit from one.

In [13]:
class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(Model, self).__init__()

        # Defining some parameters
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        #Defining the layers
        # RNN Layer
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)   
        # Fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)
    
    def forward(self, x):
        
        batch_size = x.size(0)

        # Initializing hidden state for first input using method defined below
        hidden = self.init_hidden(batch_size)

        # Passing in the input and hidden state into the model and obtaining outputs
        out, hidden = self.rnn(x, hidden)
        
        # Reshaping the outputs such that it can be fit into the fully connected layer
        out = out.contiguous().view(-1, self.hidden_dim)
        out = self.fc(out)
        
        return out, hidden
    
    def init_hidden(self, batch_size):
        # This method generates the first hidden state of zeros which we'll use in the forward pass
        # We create the tensor directly on the device we specified earlier
        hidden = torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device)
        return hidden

We define a class that inherits from PyTorch's class _nn.Module_. Then we can define the variables and the layers of our model.

We use 1 layer of RNN, followed by a fully connected layer (that will convert the RNN output into the final shape).

The method _init_hidden()_ creates a tensor of zeros in the shape of the hidden states.
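To see what the recurrent layer does with these shapes, here is a NumPy sketch of a single-layer tanh RNN forward pass. It follows the standard recurrence h_t = tanh(x_t W_ih^T + b_ih + h_(t-1) W_hh^T + b_hh). The weights are random stand-ins rather than trained parameters, and the hidden state has shape (batch, hidden_dim) because the sketch drops the leading layer dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same sizes as in this notebook: 3 sentences, 14 steps, 17 characters, 12 hidden units.
batch_size, seq_len, input_size, hidden_dim = 3, 14, 17, 12

# Random stand-ins for the learned parameters (assumption: illustration only).
W_ih = rng.normal(scale=0.1, size=(hidden_dim, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_ih = np.zeros(hidden_dim)
b_hh = np.zeros(hidden_dim)

# Dummy input in batch-first layout, and a hidden state of zeros.
x = rng.normal(size=(batch_size, seq_len, input_size))
h = np.zeros((batch_size, hidden_dim))

outputs = []
for t in range(seq_len):
    # The new hidden state depends on the current input and the previous hidden state.
    h = np.tanh(x[:, t, :] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
    outputs.append(h)

out = np.stack(outputs, axis=1)      # (batch_size, seq_len, hidden_dim)
flat = out.reshape(-1, hidden_dim)   # (batch_size * seq_len, hidden_dim)
print(out.shape, flat.shape)
```

The final reshape mirrors the `view(-1, self.hidden_dim)` call in `forward`: every time step of every sentence becomes one row for the fully connected layer.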

In [14]:
# Instantiate the model with hyperparameters
model = Model(input_size=dict_size, output_size=dict_size, hidden_dim=12, n_layers=1)
# We'll also set the model to the device that we defined earlier (default is CPU)
model.to(device)

# Define hyperparameters
n_epochs = 100
lr=0.01

# Define Loss, Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

We define our hyperparameters: _n_epochs_ (the number of times the model will go through the dataset) and _lr_ (the learning rate, i.e. the rate at which the model updates the weights).

The loss function is CrossEntropyLoss and the optimizer is Adam.
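CrossEntropyLoss takes raw scores (logits) of shape (N, number of classes) and integer class indices of shape (N,). By default it averages the negative log-probability of the correct class. The NumPy sketch below reproduces that computation by hand; the `cross_entropy` function and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target classes (hand-written sketch)."""
    # Subtract the row maximum for numerical stability; this leaves the softmax unchanged.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability of the correct class in every row and average.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy case: 4 predictions over 17 characters, all scores equal (a uniform guess).
logits = np.zeros((4, 17))
targets = np.array([0, 5, 16, 3])

loss = cross_entropy(logits, targets)
print(round(float(loss), 4))  # equals ln(17), about 2.8332
```

A model that guesses uniformly over the 17 characters therefore has a loss of about 2.83, which gives a useful baseline for reading the loss values printed during training.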

In [15]:
# Training Run
for epoch in range(1, n_epochs + 1):
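    optimizer.zero_grad() # Clears existing gradients from previous epoch
    input_seq = input_seq.to(device) # .to() returns a new tensor, so we reassign it
    output, hidden = model(input_seq)
    # The targets must live on the same device as the output
    loss = criterion(output, target_seq.view(-1).long().to(device))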
    loss.backward() # Does backpropagation and calculates gradients
    optimizer.step() # Updates the weights accordingly
    
    if epoch%10 == 0:
        print('Epoch: {}/{}.............'.format(epoch, n_epochs), end=' ')
        print("Loss: {:.4f}".format(loss.item()))

Epoch: 10/100............. Loss: 2.3415
Epoch: 20/100............. Loss: 1.9804
Epoch: 30/100............. Loss: 1.6340
Epoch: 40/100............. Loss: 1.2974
Epoch: 50/100............. Loss: 0.9830
Epoch: 60/100............. Loss: 0.7059
Epoch: 70/100............. Loss: 0.4984
Epoch: 80/100............. Loss: 0.3526
Epoch: 90/100............. Loss: 0.2544
Epoch: 100/100............. Loss: 0.1907


Training runs for 100 epochs, and the loss falls steadily from 2.34 at epoch 10 to 0.19 at epoch 100.

In [16]:
# This function takes in the model and character as arguments and returns the next character prediction and hidden state
def predict(model, character):
    # One-hot encoding our input to fit into the model
    character = np.array([[char2int[c] for c in character]])
    character = one_hot_encode(character, dict_size, character.shape[1], 1)
    character = torch.from_numpy(character)
    character = character.to(device) # .to() returns a new tensor, so we reassign it
    
    out, hidden = model(character)

    prob = nn.functional.softmax(out[-1], dim=0).data
    # Taking the class with the highest probability score from the output
    char_ind = torch.max(prob, dim=0)[1].item()

    return int2char[char_ind], hidden

In [17]:
# This function takes the desired output length and input characters as arguments, returning the produced sentence
def sample(model, out_len, start='hey'):
    model.eval() # eval mode
    start = start.lower()
    # First off, run through the starting characters
    chars = [ch for ch in start]
    size = out_len - len(chars)
    # Now pass in the previous characters and get a new one
    for ii in range(size):
        char, h = predict(model, chars)
        chars.append(char)

    return ''.join(chars)

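These two functions serve to test our model: _predict()_ returns the most likely next character for the characters seen so far, and _sample()_ calls it repeatedly until the string reaches the requested length.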
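The selection step inside `predict` is greedy: it applies a softmax and takes the index with the highest probability. The NumPy sketch below shows that step on made-up scores, next to a sampling variant that draws the index at random according to the probabilities. The sampling variant is an assumption added for comparison and is not used in the notebook.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

# Made-up scores for a 4-character vocabulary.
logits = np.array([0.5, 2.0, -1.0, 0.1])
probs = softmax(logits)

# Greedy choice: always the most probable index (same as the argmax of the raw scores).
greedy_index = int(np.argmax(probs))

# Sampling choice: draw an index in proportion to its probability.
rng = np.random.default_rng(seed=0)
sampled_index = int(rng.choice(len(probs), p=probs))

print(greedy_index, sampled_index)
```

Greedy decoding always produces the same continuation for a given start string, whereas sampling can produce different continuations on different draws.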

In [18]:
sample(model,15,'good')

'good i am fine '

## Additional comments

1. This small model, trained on only 3 sentences, simply "memorized" the sequences. Had we fed it a larger data set and added randomness, it could start to pick up language rules.
2. Plain RNNs are now rarely used for NLP or other sequential problems. People typically use GRUs and LSTMs, which are gated evolutions of the RNN that cope better with long sequences.