# Understanding and implementing GRU
Gated Recurrent Units (GRUs) are inspired by the design of the LSTM. They were introduced in 2014 by Cho et al. in the paper *Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation*.

Like the LSTM, the GRU uses gates, but only two of them: an update gate and a reset gate. These two gates decide what information should be discarded and what should be passed through. Their learnable parameters are trained so that the information carried in the hidden state is updated continuously over time. The flow diagram for a GRU is shown below.

![](figures/GRU.png)

Figure: Detailed structure of the GRU unit

The two gates and the two memory computations work as follows.

**Update Gate:** This gate is given by the following formula:

$$z_t = \sigma (W^{(z)}x_t + U^{(z)}h_{t-1}) $$

Here the input $x_t$ is multiplied by its weight $W^{(z)}$, and the previous hidden state $h_{t-1}$, which carries information from the previous $t-1$ steps, is multiplied by its weight $U^{(z)}$. The two results are added, and the sigmoid squashes each element of the sum into a number between 0 and 1. The update gate determines how much past information is carried forward to the present time step. This gate helps mitigate the vanishing gradient problem: if the gate value is 1, the previous hidden state is carried over unchanged.

**Reset gate:** This gate determines how much information from the previous time steps needs to be forgotten.

$$r_t = \sigma(W^{(r)}x_t + U^{(r)}h_{t-1}) $$

This equation is very similar to the previous one; the only difference is that the weights belong to the reset gate. The next step is to use these two gates to compute the current memory content and then the final memory that is passed to the output.

**Current memory content:** The current memory content is derived from the reset gate's value and the current input. As discussed previously, the reset gate decides how much information to forget, and each of its elements is a number between 0 and 1. If $r_t$ is 0, the information from the previous hidden state is ignored entirely and the current memory content depends only on the current input; if it is 1, the entire previous hidden state is taken into account. The current memory content is calculated in the following way.

1. Taking the Hadamard product of the reset gate value $r_t$ and the previous hidden state multiplied by its weight, $Uh_{t-1}$.
2. Adding $Wx_t$ to the above value and applying $\tanh$ to the sum.

$$ h_t^{'} = \tanh (Wx_t + r_t \odot Uh_{t-1})  $$

**Final memory at current time step:** The final memory is constructed from the update gate's result and the current memory content, using the following steps.

1. Taking the Hadamard product of the update gate value $z_t$ and $h_{t-1}$.
2. Taking the Hadamard product of $(1-z_t)$ and the current memory content $h_t^{'}$.
3. Summing up the above two values.

$$h_t = z_t \odot h_{t-1} + (1-z_t) \odot h_t^{'} $$
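
The four equations can be traced in a few lines of code. The following is only a minimal sketch of a single GRU time step, written with NumPy rather than PyTorch to keep it small. The function name `gru_step`, the dictionary keys in `params`, the random weights and the dimensions 4 and 3 are illustrative assumptions, and biases are left out because the formulas above do not include them.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step following the equations in the text (no biases)."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)          # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)          # reset gate
    h_cand = np.tanh(p["W"] @ x_t + r_t * (p["U"] @ h_prev))   # current memory content
    h_t = z_t * h_prev + (1.0 - z_t) * h_cand                  # final memory
    return h_t, z_t, r_t

# Hypothetical sizes and random weights, for illustration only
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
params = {
    "W_z": rng.normal(size=(hidden_dim, input_dim)),
    "U_z": rng.normal(size=(hidden_dim, hidden_dim)),
    "W_r": rng.normal(size=(hidden_dim, input_dim)),
    "U_r": rng.normal(size=(hidden_dim, hidden_dim)),
    "W": rng.normal(size=(hidden_dim, input_dim)),
    "U": rng.normal(size=(hidden_dim, hidden_dim)),
}

x_t = rng.normal(size=input_dim)
h_prev = rng.normal(size=hidden_dim)
h_t, z_t, r_t = gru_step(x_t, h_prev, params)
print("update gate:", z_t)
print("reset gate: ", r_t)
print("new state:  ", h_t)
```

Because the last line of `gru_step` is an elementwise interpolation, entries where `z_t` is close to 1 keep the old state, and entries where it is close to 0 take the new candidate value.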


# Importing Requirements

In [None]:

import json
import os
import random
import tarfile
import urllib.request
import zipfile

import chakin
import matplotlib.pyplot as plt
import nltk
import torch
from torch import nn, optim
from torchtext import data
from torchtext import vocab
from tqdm import tqdm

nltk.download('popular')
SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


# Downloading required datasets
To demonstrate how embeddings can help, we will conduct an experiment on a sentiment analysis task. I have used the movie review dataset, which has 5331 positive and 5331 negative processed sentences. The entire experiment is divided into 5 sections.

Downloading the dataset: the dataset discussed above is available at http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz.



In [None]:
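data_exists = os.path.isfile('data/rt-polaritydata.tar.gz')
if not data_exists:
    # urlretrieve does not create the target directory, so make sure it exists
    os.makedirs('data', exist_ok=True)
    urllib.request.urlretrieve("http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz",
                               "data/rt-polaritydata.tar.gz")
    # Extract the archive and close it afterwards
    with tarfile.open("data/rt-polaritydata.tar.gz") as tar:
        tar.extractall(path='data/')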

# Downloading embedding
Pre-trained embeddings are available and can easily be used in our model. We will be using the GloVe vectors with 300 dimensions.

In [None]:
embed_exists = os.path.isfile('../embeddings/glove.840B.300d.zip')
if not embed_exists:
    print("Downloading GloVe embeddings; if the download fails, delete `../embeddings/glove.840B.300d.zip` and rerun")
    chakin.search(lang='English')
    chakin.download(number=16, save_dir='../embeddings')
    # Unpack the downloaded archive next to it
    with zipfile.ZipFile("../embeddings/glove.840B.300d.zip", 'r') as zip_ref:
        zip_ref.extractall("../embeddings/")

# Preprocessing
I am using TorchText to preprocess the downloaded data. The preprocessing includes the following steps:

- Reading and parsing the data
- Defining the sentiment and label fields
- Dividing the data into train, valid and test subsets
- Forming the train, valid and test iterators

In [None]:
SEED = 1
split = 0.80

In [None]:
data_block = []
# Negative reviews get label 0, positive reviews get label 1
with open('data/rt-polaritydata/rt-polarity.neg', encoding='utf8', errors='ignore') as f:
    negative_data = f.read().splitlines()
for line in negative_data:
    data_block.append({"sentiment": line.strip(), "label": 0})
with open('data/rt-polaritydata/rt-polarity.pos', encoding='utf8', errors='ignore') as f:
    positive_data = f.read().splitlines()
for line in positive_data:
    data_block.append({"sentiment": line.strip(), "label": 1})

In [None]:
random.shuffle(data_block)

n_train = int(len(data_block) * split)
# Write one JSON object per line; the with-block flushes and closes both files
with open('data/train.json', 'w') as train_file, open('data/test.json', 'w') as test_file:
    for example in data_block[:n_train]:
        train_file.write(json.dumps(example) + "\n")
    for example in data_block[n_train:]:
        test_file.write(json.dumps(example) + "\n")

In [None]:
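def tokenize(sentiments):
    # Identity tokenizer (not used by the fields defined below)
    return sentiments


def pad_to_equal(x):
    # Pad or truncate a token list to exactly 61 tokens
    if len(x) < 61:
        return x + ['<pad>'] * (61 - len(x))
    return x[:61]


def to_categorical(x):
    # One-hot encode the label: 0 -> [1, 0], 1 -> [0, 1]
    if x == 1:
        return [0, 1]
    if x == 0:
        return [1, 0]
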

In [None]:
SENTIMENT = data.Field(sequential=True, preprocessing=pad_to_equal, use_vocab=True, lower=True)
LABEL = data.Field(is_target=True, use_vocab=False, sequential=False, preprocessing=to_categorical)
fields = {'sentiment': ('sentiment', SENTIMENT), 'label': ('label', LABEL)}

**Splitting data into train and test**

In [None]:
train_data , test_data = data.TabularDataset.splits(
                            path = 'data',
                            train = 'train.json',
                            test = 'test.json',
                            format = 'json',
                            fields = fields                                
)

In [None]:
print("Printing an example data : ",vars(train_data[1]))

In [None]:
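# random.seed() returns None, so seed first and pass the resulting generator state
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())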

In [None]:
print('Number of training examples: ', len(train_data))
print('Number of validation examples: ', len(valid_data))
print('Number of testing examples: ',len(test_data))
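
As a sanity check on the printed counts, the expected subset sizes can be worked out by hand. This is a rough sketch that assumes all 10,662 sentences are read and that `train_data.split()` uses a split ratio of 0.7, which is my understanding of the TorchText default; the variable names are only illustrative.

```python
total = 5331 + 5331                   # positive + negative sentences
split = 0.80                          # train/test fraction used above

n_train_valid = int(total * split)    # examples written to train.json
n_test = total - n_train_valid        # examples written to test.json

split_ratio = 0.7                     # assumed default of train_data.split()
n_train = int(round(split_ratio * n_train_valid))
n_valid = n_train_valid - n_train

print(n_train, n_valid, n_test)       # 5970 2559 2133
```

If the printed numbers differ, the most likely causes are a different split ratio or lines dropped while reading the raw files.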

**Loading embeddings into the vocab**

In [None]:
vec = vocab.Vectors(name = "glove.840B.300d.txt",cache = "../embeddings/")

In [None]:
SENTIMENT.build_vocab(train_data, valid_data, test_data, max_size=100000, vectors=vec)

**Constructing Iterators**

In [None]:
train_iter, val_iter, test_iter = data.Iterator.splits(
        (train_data, valid_data, test_data), sort_key=lambda x: len(x.sentiment),
        batch_sizes=(32, 32, 32), device=device)

In [None]:
sentiment_vocab = SENTIMENT.vocab

In [None]:
sentiment_vocab.vectors.shape

# Training
Training will be conducted for two models, one with a GRU and one with an LSTM. I am using GloVe embeddings with a vector size of 300.
One thing to note here is that the GRU carries only a hidden state to deal with the vanishing gradient problem, whereas the LSTM carries two states, a hidden state and a cell state. Because of this, the GRU is usually a bit faster than the LSTM. Let's see how they perform on the movie review dataset.
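
One way to make the speed difference concrete is to count weights. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes the parameter layout PyTorch uses for `nn.GRU` and `nn.LSTM`: three weight blocks per layer for the GRU and four for the LSTM, each with an input matrix, a hidden matrix and two bias vectors. The helper name `rnn_layer_params` is hypothetical, and the count covers only one direction of the first layer.

```python
def rnn_layer_params(input_dim, hidden_dim, n_blocks):
    """Parameters in one direction of one recurrent layer."""
    input_weights = n_blocks * hidden_dim * input_dim
    hidden_weights = n_blocks * hidden_dim * hidden_dim
    biases = 2 * n_blocks * hidden_dim   # separate input and hidden bias vectors
    return input_weights + hidden_weights + biases

EMBEDDING_DIM, HIDDEN_DIM = 300, 256     # same sizes as the models below

gru_params = rnn_layer_params(EMBEDDING_DIM, HIDDEN_DIM, n_blocks=3)
lstm_params = rnn_layer_params(EMBEDDING_DIM, HIDDEN_DIM, n_blocks=4)
print(gru_params, lstm_params, gru_params / lstm_params)  # 428544 571392 0.75
```

Under these assumptions the GRU layer has three quarters of the LSTM layer's parameters, which is consistent with it being somewhat cheaper per step.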

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    rounded_preds = torch.argmax(preds, dim=1)  # predicted class index
    correct = (rounded_preds == torch.argmax(y, dim=1)).float()  # convert to float for division
    acc = correct.sum() / len(correct)
    return acc

## Training using GRU

In [None]:
class GRU_RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, sentiment_vocab):
        super(GRU_RNN, self).__init__()

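        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Initialise the embedding layer with the pre-trained GloVe vectors held by the vocab
        self.embedding.weight.data.copy_(sentiment_vocab.vectors)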
        self.rnn = nn.GRU(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        embedded = self.dropout(self.embedding(x))

        output, hidden = self.rnn(embedded)
        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden states
        # and apply dropout
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # Return raw logits: BCEWithLogitsLoss applies the sigmoid itself
        return self.fc(hidden)

In [None]:
INPUT_DIM = len(SENTIMENT.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
OUTPUT_DIM = 2
BATCH_SIZE = 32
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

gru_rnn = GRU_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT, sentiment_vocab)
gru_rnn = gru_rnn.to(device)

In [None]:
optimizer = optim.SGD(gru_rnn.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)

In [None]:
def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()  # enable dropout while training

    for batch in iterator:
        optimizer.zero_grad()
        predictions = model(batch.sentiment.to(device))
        # Keep the labels on the same device as the predictions
        labels = batch.label.to(device).float()
        loss = criterion(predictions, labels)
        acc = binary_accuracy(predictions, labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    # Average loss and accuracy over the batches of this epoch
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
rnn_loss = []
rnn_accuracy = []
for i in tqdm(range(0,100)):
    loss, accuracy =  train(gru_rnn, train_iter, optimizer, criterion)
    print("Loss : ",loss, "Accuracy : ", accuracy )
    rnn_loss.append(loss)
    rnn_accuracy.append(accuracy)

## Training using LSTM

In [None]:
class LSTM_RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, bidirectional, dropout, sentiment_vocab):
        super(LSTM_RNN, self).__init__()

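        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # Initialise the embedding layer with the pre-trained GloVe vectors held by the vocab
        self.embedding.weight.data.copy_(sentiment_vocab.vectors)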
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):

        embedded = self.dropout(self.embedding(x))
        output, (hidden, cell)= self.rnn(embedded)
        # concat the final forward (hidden[-2,:,:]) and backward (hidden[-1,:,:]) hidden layers
        # and apply dropout

        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # hidden is already (batch, hidden_dim * 2); return raw logits for BCEWithLogitsLoss
        return self.fc(hidden)

In [None]:
INPUT_DIM = len(SENTIMENT.vocab)
EMBEDDING_DIM = 300
HIDDEN_DIM = 256
OUTPUT_DIM = 2
BATCH_SIZE = 32
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.5

lstm_rnn = LSTM_RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS, BIDIRECTIONAL, DROPOUT, sentiment_vocab)
lstm_rnn = lstm_rnn.to(device)

In [None]:
optimizer = optim.SGD(lstm_rnn.parameters(), lr=0.1)
criterion = nn.BCEWithLogitsLoss()
criterion = criterion.to(device)

In [None]:
lstm_loss = []
lstm_accuracy = []
for i in tqdm(range(0,100)):
    loss, accuracy =  train(lstm_rnn, train_iter, optimizer, criterion)
    print("Loss : ",loss, "Accuracy : ", accuracy )
    lstm_loss.append(loss)
    lstm_accuracy.append(accuracy)

## Comparison
As the training runs above show, the LSTM reached about 95% accuracy and the GRU about 85% on the training data. This will not be the case for every dataset; in some cases the GRU's performance has been found to be superior.

![](figures/LSTM_GRU.png)




In [None]:
plt.plot(rnn_accuracy , label = "GRU Accuracy")
plt.plot(lstm_accuracy , label = "LSTM Accuracy")
plt.ylabel("Accuracy")
plt.xlabel("Epoch")
plt.legend(loc='upper left')
plt.show()
