# RNN LSTM VS GRU


### 1. Data Download and Set up

1) Package imports and seed setup:
    fixing the random seed makes the results repeatable from run to run

In [1]:
import torch
from torchtext import data
from torchtext import datasets
import random

# set the seed for reproduction
SEED = 1234

# set seed for torch process for either cpu or gpu devices
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
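
As a small illustration of why a fixed seed makes runs repeatable, the sketch below uses only Python's standard `random` module. The helper name `draw` is hypothetical and not part of this notebook; the same idea is what `torch.manual_seed` provides for PyTorch's generators.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random integers from a dedicated generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]

first = draw(1234)
second = draw(1234)

# The same seed reproduces exactly the same sequence
print(first == second)
```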

Define the fields used when loading the data:

1)  "LABEL" handles the sentiment label; its tensor type should be float

2)  "TEXT" is tokenized with spaCy and is responsible for the vocabulary

In [2]:
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(tensor_type=torch.FloatTensor)

Download the "IMDB" dataset through torchtext:

1) split the data into a training set and a test set

2) the downloaded training set has to serve for both training and validation, so we split it again into train and valid parts

In [3]:
train, test = datasets.IMDB.splits(TEXT, LABEL)
# random.seed() returns None, so seed first and pass the generator state instead
random.seed(SEED)
train, valid = train.split(random_state=random.getstate())
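
One detail of the seeded split is easy to miss: `random.seed(...)` returns `None`, so the seed only takes effect if the generator state is captured afterwards. The sketch below shows this with the standard library alone. `split_indices` is a hypothetical stand-in for a shuffled train/valid split, not torchtext's implementation, and the 0.7 ratio is an assumption chosen for illustration.

```python
import random

SEED = 1234

def split_indices(n_items, split_ratio, random_state):
    """Shuffle indices 0..n_items-1 from the given RNG state and cut at split_ratio."""
    rng = random.Random()
    rng.setstate(random_state)
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(round(split_ratio * n_items))
    return indices[:cut], indices[cut:]

seed_return = random.seed(SEED)   # seeds the global generator and returns None
state = random.getstate()         # the seeded state is what carries the seed

train_a, valid_a = split_indices(10, 0.7, state)
train_b, valid_b = split_indices(10, 0.7, state)

print(seed_return)          # None
print(train_a == train_b)   # the same state gives the same split
```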

Next we build the vocabulary from the 25,000 most frequent words:

1) instead of initializing the word vectors randomly, we use pre-trained vectors, which may lead to better results in less training time

2) the vector name specifies what is downloaded: "6B" means the vectors were trained on 6 billion tokens, and "100d" means they have 100 dimensions

In [5]:
TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
LABEL.build_vocab(train)
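
The idea of a frequency-capped vocabulary can be sketched with the standard library. This is a simplified illustration, not torchtext's implementation: `build_vocab` is a hypothetical helper, and the two special tokens `<unk>` and `<pad>` placed at indices 0 and 1 are an assumption that would explain why a vocabulary capped at 25,000 words ends up with 25,002 entries.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens, preceded by the special tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

docs = [["a", "good", "film"], ["a", "bad", "film"], ["a", "film"]]
itos, stoi = build_vocab(docs, max_size=2)

print(len(itos))   # max_size plus the two special tokens
print(itos)
```

Words that fall outside the cap ("good" and "bad" in this toy data) have no index of their own and would be mapped to `<unk>` at lookup time.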

Create the iterators:

1) the batch size is the number of examples passed through the network in each iteration

2) after this step the data are grouped into batches of the given size, with examples of similar length placed together. Each time an iterator is stepped, it returns one batch from its split.

In [7]:
BATCH_SIZE = 16

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train, valid, test), 
    batch_size=BATCH_SIZE, 
    sort_key=lambda x: len(x.text), 
    repeat=False)
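
To see why grouping examples by length helps, the sketch below counts how many padding tokens are needed with and without length-based bucketing. It is a simplified illustration under assumed toy lengths; `bucket_batches` and `padding_needed` are hypothetical helpers, and the real `BucketIterator` does more than a plain sort.

```python
def bucket_batches(lengths, batch_size):
    """Group example indices into batches of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_needed(lengths, batches):
    """Count pad tokens required to bring every example up to its batch maximum."""
    total = 0
    for batch in batches:
        longest = max(lengths[i] for i in batch)
        total += sum(longest - lengths[i] for i in batch)
    return total

lengths = [3, 50, 4, 48, 5, 52]      # toy review lengths in tokens
naive = [[0, 1, 2], [3, 4, 5]]       # batches in the original order
bucketed = bucket_batches(lengths, batch_size=3)

print(padding_needed(lengths, naive))      # 144
print(padding_needed(lengths, bucketed))   # 9
```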

## Building the model

Next we specify the two models, LSTM and GRU:

The general process is to take the input words (conceptually one-hot vectors over the vocabulary, passed in practice as integer indices) and feed them through an embedding layer into the RNN. The model transforms them through several layers, and its parameters are trained against the given labels.


### LSTM

A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.

The LSTM computation, following Wikipedia, is as follows:
  
 \begin{split}\begin{array}{ll}
i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\
f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\
o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\
c_t = f_t c_{(t-1)} + i_t \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\
h_t = o_t \tanh(c_t)
\end{array}\end{split}

where the initial values are $c_{0}=0$ and $h_{0}=0$

h is the hidden state

c is the cell state, which carries both long- and short-term information

x is the input vector

i, f, g, o are the input, forget, cell, and output gates; g is the $\tanh(\cdot)$ term inside the update of $c_t$

σ is the sigmoid function, and the products between gates and states are element-wise.
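
The equations above can be traced by hand for a single time step. The sketch below is a scalar version (one input feature, one hidden unit) written with the standard library only. `lstm_step` and the parameter dictionary are hypothetical and exist purely to mirror the symbols in the equations; `nn.LSTM` works on vectors and matrices instead.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step for scalar input and state; p maps names like 'W_ii' to floats."""
    i = sigmoid(p["W_ii"] * x + p["b_ii"] + p["W_hi"] * h_prev + p["b_hi"])    # input gate
    f = sigmoid(p["W_if"] * x + p["b_if"] + p["W_hf"] * h_prev + p["b_hf"])    # forget gate
    o = sigmoid(p["W_io"] * x + p["b_io"] + p["W_ho"] * h_prev + p["b_ho"])    # output gate
    g = math.tanh(p["W_ig"] * x + p["b_ig"] + p["W_hg"] * h_prev + p["b_hg"])  # candidate cell value
    c = f * c_prev + i * g     # keep part of the old cell state, add part of the candidate
    h = o * math.tanh(c)       # expose a gated view of the cell state
    return h, c

# With all weights and biases at zero, every gate is 0.5 and the candidate is 0,
# so the cell state is simply halved.
names = [w + s for s in ("i", "f", "o", "g") for w in ("W_i", "b_i", "W_h", "b_h")]
zero_params = dict.fromkeys(names, 0.0)
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
print(c)   # 1.0
```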

In [9]:
import torch.nn as nn

class RNN_LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        """
        vocab_size: input dimension, i.e. the size of the one-hot vector, which is len(TEXT.vocab)
        embedding_dim: dimension of the word vectors
        hidden_dim: size of the hidden state of each LSTM layer and direction
        output_dim: dimension of the output; a single raw score (logit) that a sigmoid maps to 0-1
        n_layers: number of stacked LSTM layers;
            the hidden states of one layer are the inputs to the next layer
        bidirectional: adds a second pass that reads the sequence from last to first,
            so the final representation sees both directions
        dropout: regularization to reduce overfitting by randomly zeroing activations during training;
            since the LSTM has many parameters, this matters
        """
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # recurrent layer that consumes the embedded tokens
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        # maps the final hidden state to the output; a bidirectional LSTM yields twice
        # the hidden dimension (forward and backward states concatenated)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        """
        Forward pass: token indices -> embeddings -> LSTM -> linear layer
        """
        
        # dropout on the embeddings for regularization
        embedded = self.dropout(self.embedding(x))
        
        # the LSTM returns the per-step outputs plus the final hidden and cell states
        output, (hidden, cell) = self.rnn(embedded)        
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid. dim]
        #cell = [num layers * num directions, batch size, hid. dim]
        
        # concatenate the last layer's forward (hidden[-2]) and backward (hidden[-1])
        # final states, then apply dropout; this indexing assumes bidirectional=True
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
                
        #hidden = [batch size, hid. dim * num directions]
            
        # no squeeze(0) here: it would drop the batch dimension for a batch of one
        return self.fc(hidden)

### GRU

GRU is another gated RNN architecture, closely related to LSTM. Its performance is often reported to be similar to that of LSTM, and it can do better on some smaller datasets.

The GRU in the PyTorch package computes the following:


  \begin{split}\begin{array}{ll}
r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\
z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz}) \\
n_t = \tanh(W_{in} x_t + b_{in} + r_t (W_{hn} h_{(t-1)}+ b_{hn})) \\
h_t = (1 - z_t) n_t + z_t h_{(t-1)} \\
\end{array}\end{split}
   
   where h is the hidden state
   
   x is the input
   
   r, z, n are the reset, update, and new gates, respectively
   
   σ is the sigmoid function, and the products between gates and states are element-wise. 
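
As with the LSTM, one GRU step can be traced with scalars. The sketch below mirrors the equations above using the standard library only. `gru_step` and its parameter dictionary are hypothetical names for illustration; `nn.GRU` operates on vectors and matrices.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One GRU time step for scalar input and state; p maps names like 'W_ir' to floats."""
    r = sigmoid(p["W_ir"] * x + p["b_ir"] + p["W_hr"] * h_prev + p["b_hr"])      # reset gate
    z = sigmoid(p["W_iz"] * x + p["b_iz"] + p["W_hz"] * h_prev + p["b_hz"])      # update gate
    n = math.tanh(p["W_in"] * x + p["b_in"] + r * (p["W_hn"] * h_prev + p["b_hn"]))  # new gate
    # blend the candidate with the previous state; there is no separate cell state
    return (1.0 - z) * n + z * h_prev

# With all weights and biases at zero, z is 0.5 and n is 0,
# so the hidden state is simply halved.
names = [w + s for s in ("r", "z", "n") for w in ("W_i", "b_i", "W_h", "b_h")]
zero_params = dict.fromkeys(names, 0.0)
h = gru_step(x=1.0, h_prev=0.8, p=zero_params)
print(h)   # 0.4
```

Compared with the LSTM step, the GRU uses three sets of weights instead of four and carries a single state, which is why it has fewer parameters for the same hidden size.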

In [None]:
class RNN_GRU(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout):
        """
        Same structure as RNN_LSTM, with nn.GRU in place of nn.LSTM
        """
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        """
        Forward pass of the GRU model. Unlike the LSTM, the GRU has no cell state,
        as the mathematical definition above shows, so only the hidden state is returned
        """
        
        embedded = self.dropout(self.embedding(x))
        
        #embedded = [sent len, batch size, emb dim]
        
        output, hidden = self.rnn(embedded)
        
        #output = [sent len, batch size, hid dim * num directions]
        #hidden = [num layers * num directions, batch size, hid. dim]
        
        # concatenate the last layer's forward and backward final states (assumes bidirectional=True)
        hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
                
        #hidden = [batch size, hid. dim * num directions]
            
        # no squeeze(0) here: it would drop the batch dimension for a batch of one
        return self.fc(hidden)

### Implementations
1. Set up the parameters
2. Train the models with the given parameters

The parameters are explained in the inline comments

In [10]:
# specify dimensions of different layers
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
# Define the random dropout rate for regularization
DROPOUT = 0.5

#Configure the model with inputs given above
model_lstm = RNN_LSTM(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)
model_gru = RNN_GRU(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT)

In [11]:
#check the size of the pretrained embeddings
pretrained_embeddings = TEXT.vocab.vectors

print(pretrained_embeddings.shape)

torch.Size([25002, 100])


Assign the pretrained embeddings to the embedding layers of the LSTM and GRU models


In [13]:
model_lstm.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1123,  0.3113,  0.3317,  ..., -0.4576,  0.6191,  0.5304],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

In [14]:
model_gru.embedding.weight.data.copy_(pretrained_embeddings)

tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.0382, -0.2449,  0.7281,  ..., -0.1459,  0.8278,  0.2706],
        ...,
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [-0.1123,  0.3113,  0.3317,  ..., -0.4576,  0.6191,  0.5304],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

### Model Training

In [15]:
#use Adam optimization algorithm from torch package
import torch.optim as optim

optimizer_lstm = optim.Adam(model_lstm.parameters())
optimizer_gru = optim.Adam(model_gru.parameters())

Define our loss function: BCE with logits loss

In [16]:
criterion = nn.BCEWithLogitsLoss()


#use GPU if available, otherwise use CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

#Set models with device configuration
model_lstm = model_lstm.to(device)
model_gru = model_gru.to(device)
criterion = criterion.to(device)
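
For intuition about what "BCE with logits" computes, the sketch below evaluates the binary cross-entropy directly on raw scores (logits) for a handful of values, using the standard library only. It is a hand-written illustration of the formula $-[y\log\sigma(x) + (1-y)\log(1-\sigma(x))]$ in a numerically stable form; the function names are hypothetical and this is not PyTorch's implementation.

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross-entropy for one raw score and a 0/1 target."""
    # equals softplus(logit) - logit * target, rewritten to avoid overflow in exp
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def mean_bce_with_logits(logits, targets):
    """Average the per-example losses over the batch."""
    losses = [bce_with_logits(x, y) for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)

print(bce_with_logits(0.0, 1.0))    # an undecided score costs ln(2)
print(bce_with_logits(10.0, 1.0))   # confident and right: small loss
print(bce_with_logits(10.0, 0.0))   # confident and wrong: large loss
print(mean_bce_with_logits([2.0, -1.0], [1.0, 0.0]))
```

Because the sigmoid is folded into the loss, the models above output raw scores and the sigmoid is applied separately only when a probability is needed.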

The `binary_accuracy` function returns the fraction of correct predictions (reported later as a percentage), which shows how well the model has done.

In [17]:
import torch.nn.functional as F

def binary_accuracy(preds, y):

    #map the raw scores to probabilities and round to the closest integer (0 or 1)
    rounded_preds = torch.round(torch.sigmoid(preds))
    #compare with the labels and average over the batch
    correct = (rounded_preds == y).float()
    acc = correct.sum()/len(correct)
    return acc
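
The same accuracy computation can be followed on plain Python lists. This is a sketch with made-up logits and labels, and `binary_accuracy_lists` is a hypothetical helper, not a replacement for the tensor version above.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def binary_accuracy_lists(logits, labels):
    """Fraction of examples whose rounded sigmoid score matches the 0/1 label."""
    rounded = [round(sigmoid(x)) for x in logits]
    correct = [1 if p == y else 0 for p, y in zip(rounded, labels)]
    return sum(correct) / len(correct)

logits = [2.0, -1.0, 0.5, -3.0]   # raw model scores
labels = [1, 0, 0, 0]             # 1 = positive review, 0 = negative review

# predictions are 1, 0, 1, 0, so three of four match the labels
print(binary_accuracy_lists(logits, labels))   # 0.75
```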

Define a function to train the model

In [18]:
def train_model(model, iterator, optimizer, criterion):
    """
    model: model that's going to be trained
    optimizer: optimizer used to train
    iterator: defined as above
    """
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        # first zero the gradients
        optimizer.zero_grad()
        
        # feed batch of sentences to model
        predictions = model(batch.text).squeeze(1)
        
        # calculate loss
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        # calculate gradient
        loss.backward()
        
        # update parameters
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    # return the loss and accuracy averaged over the number of batches
    return epoch_loss / len(iterator), epoch_acc / len(iterator)


Define a function to evaluate the model

In [19]:
def evaluate(model, iterator, criterion):
    """
    similar to the train function, except that its purpose is to evaluate the trained model:
    no gradients are computed and no parameters are updated
    returns the loss and accuracy averaged over the batches of the given iterator
    """
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
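
One property of dividing by `len(iterator)` is worth keeping in mind: it averages the per-batch means, so a smaller final batch counts as much as a full one. The sketch below compares this with a sample-weighted average on made-up numbers. Both helpers are hypothetical; with many batches of equal size the difference is usually negligible.

```python
def mean_of_batch_means(batch_accs):
    """Average the per-batch accuracies, giving every batch equal weight."""
    return sum(batch_accs) / len(batch_accs)

def sample_weighted_mean(batch_accs, batch_sizes):
    """Average the per-batch accuracies, weighting each batch by its size."""
    total = sum(acc * n for acc, n in zip(batch_accs, batch_sizes))
    return total / sum(batch_sizes)

batch_accs = [0.75, 0.75, 1.0]   # accuracy of each batch
batch_sizes = [16, 16, 8]        # the last batch is only half full

print(mean_of_batch_means(batch_accs))                 # about 0.833
print(sample_weighted_mean(batch_accs, batch_sizes))   # 0.8
```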

# LSTM Implementations

In [20]:
#number of epochs is set to be 5
N_EPOCHS = 5

#do 5 epochs and output the training and validation loss, accuracy
for epoch in range(N_EPOCHS):

    train_loss_lstm, train_acc_lstm = train_model(model_lstm, train_iterator, optimizer_lstm, criterion)
    valid_loss_lstm, valid_acc_lstm = evaluate(model_lstm, valid_iterator, criterion)
    torch.cuda.empty_cache()
    print("RNN-LSTM training data")
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss_lstm:.3f}, Train Acc: {train_acc_lstm*100:.2f}%, Val. Loss: {valid_loss_lstm:.3f}, Val. Acc: {valid_acc_lstm*100:.2f}%')



RNN-LSTM training data
Epoch: 01, Train Loss: 0.643, Train Acc: 61.77%, Val. Loss: 0.620, Val. Acc: 50.98%
RNN-LSTM training data
Epoch: 02, Train Loss: 0.420, Train Acc: 81.71%, Val. Loss: 0.345, Val. Acc: 85.69%
RNN-LSTM training data
Epoch: 03, Train Loss: 0.272, Train Acc: 89.78%, Val. Loss: 0.290, Val. Acc: 89.24%
RNN-LSTM training data
Epoch: 04, Train Loss: 0.177, Train Acc: 93.71%, Val. Loss: 0.298, Val. Acc: 89.76%
RNN-LSTM training data
Epoch: 05, Train Loss: 0.128, Train Acc: 95.61%, Val. Loss: 0.325, Val. Acc: 89.16%


Test the final model:

In [21]:

test_loss_lstm, test_acc_lstm = evaluate(model_lstm, test_iterator, criterion)
torch.cuda.empty_cache()
print("RNN-LSTM test result")
print(f'Test Loss: {test_loss_lstm:.3f}, Test Acc: {test_acc_lstm*100:.2f}%')



RNN-LSTM test result
Test Loss: 0.408, Test Acc: 86.77%


# GRU Implementation

In [22]:
#number of epochs is set to be 5
N_EPOCHS = 5

#do 5 epochs and output the training and validation loss, accuracy
for epoch in range(N_EPOCHS):

    train_loss_gru, train_acc_gru = train_model(model_gru, train_iterator, optimizer_gru, criterion)
    valid_loss_gru, valid_acc_gru = evaluate(model_gru, valid_iterator, criterion)
    torch.cuda.empty_cache()
    print("RNN-GRU training data")
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss_gru:.3f}, Train Acc: {train_acc_gru*100:.2f}%, Val. Loss: {valid_loss_gru:.3f}, Val. Acc: {valid_acc_gru*100:.2f}%')



RNN-GRU training data
Epoch: 01, Train Loss: 0.528, Train Acc: 71.23%, Val. Loss: 0.350, Val. Acc: 85.40%
RNN-GRU training data
Epoch: 02, Train Loss: 0.255, Train Acc: 89.64%, Val. Loss: 0.244, Val. Acc: 89.96%
RNN-GRU training data
Epoch: 03, Train Loss: 0.166, Train Acc: 93.70%, Val. Loss: 0.240, Val. Acc: 90.23%
RNN-GRU training data
Epoch: 04, Train Loss: 0.109, Train Acc: 96.20%, Val. Loss: 0.281, Val. Acc: 90.16%
RNN-GRU training data
Epoch: 05, Train Loss: 0.076, Train Acc: 97.39%, Val. Loss: 0.322, Val. Acc: 89.31%


Test the GRU model and report its loss and accuracy:

In [23]:
test_loss_gru, test_acc_gru = evaluate(model_gru, test_iterator, criterion)
torch.cuda.empty_cache()
print("RNN-GRU test result")
print(f'Test Loss: {test_loss_gru:.3f}, Test Acc: {test_acc_gru*100:.2f}%')



RNN-GRU test result
Test Loss: 0.382, Test Acc: 87.47%
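
The logged numbers can be compared side by side with a few lines of Python. The values below are copied from the training and test outputs above, and `best_epoch` is a hypothetical helper. Since they come from a single run of each model, the small gap is best read as indicative rather than conclusive.

```python
# validation loss per epoch, copied from the training logs above
lstm_val_loss = [0.620, 0.345, 0.290, 0.298, 0.325]
gru_val_loss = [0.350, 0.244, 0.240, 0.281, 0.322]

# test accuracy in percent, copied from the test outputs above
lstm_test_acc = 86.77
gru_test_acc = 87.47

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1

print(best_epoch(lstm_val_loss))   # 3
print(best_epoch(gru_val_loss))    # 3
print(round(gru_test_acc - lstm_test_acc, 2))   # 0.7 percentage points in favor of GRU
```

In both runs the validation loss is lowest at epoch 3 and rises afterwards while the training loss keeps falling, which is the usual sign of overfitting in the later epochs.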


# User Input

In [24]:
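import spacy
nlp = spacy.load('en')

def predict_sentiment_lstm(sentence):
    # tokenize, map tokens to vocabulary indices, and shape as [sent len, batch size = 1]
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    # disable dropout and gradient tracking for inference
    model_lstm.eval()
    with torch.no_grad():
        prediction_lstm = torch.sigmoid(model_lstm(tensor))
    # probability that the sentence is positive
    return prediction_lstm.item()

def predict_sentiment_gru(sentence):
    # same preprocessing as above, scored with the GRU model
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    model_gru.eval()
    with torch.no_grad():
        prediction_gru = torch.sigmoid(model_gru(tensor))
    return prediction_gru.item()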
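
The preprocessing inside the two prediction functions can be followed without a model. The sketch below uses whitespace splitting as a stand-in for the spaCy tokenizer and a toy vocabulary. `encode` and `toy_stoi` are hypothetical, and mapping unknown words to index 0 assumes that `<unk>` sits at that position.

```python
def encode(sentence, stoi, unk_index=0):
    """Turn a sentence into a column of vocabulary indices, shape [sent len, 1]."""
    tokens = sentence.split()                               # stand-in for the spaCy tokenizer
    indexed = [stoi.get(tok, unk_index) for tok in tokens]  # unknown words map to <unk>
    return [[idx] for idx in indexed]                       # batch dimension of size 1

toy_stoi = {"<unk>": 0, "<pad>": 1, "the": 2, "film": 3, "is": 4, "great": 5}
column = encode("the film is surprisingly great", toy_stoi)

print(column)   # [[2], [3], [4], [0], [5]]
```

The nested-list shape corresponds to `tensor.unsqueeze(1)` above: the model expects `[sent len, batch size]`, so a single sentence becomes a batch of one.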

In [25]:
predict_sentiment_lstm("The result is hugely enjoyable, and hooray for Hollywood for making it happen.")

0.8921913504600525

In [26]:
predict_sentiment_lstm("A disordered and unfocused ghost story that bears all the very worst habits of the genre.")

0.003819111967459321

In [27]:
predict_sentiment_gru("The result is hugely enjoyable, and hooray for Hollywood for making it happen.")

0.9169970154762268

In [28]:
predict_sentiment_gru("A disordered and unfocused ghost story that bears all the very worst habits of the genre.")

0.019471831619739532