## Recurrent Neural Networks

## 1. IMDB Review Classification Battlefield - Contestants : Feedforward, CNN, RNN, LSTM



In this task, we are going to do sentiment classification on a movie review dataset. We are going to build a feedforward net, a convolutional neural net, a recurrent net and combine one or more of them to understand performance of each of them. A sentence can be thought of as a sequence of words which have semantic connections across time. By semantic connection, we mean that the words that occur earlier in the sentence influence the sentence's structure and meaning in the latter part of the sentence. There are also semantic connections backwards in a sentence, in an ideal case (in which we use RNNs from both directions and combine their outputs). But for the purpose of this tutorial, we are going to restrict ourselves to only uni-directional RNNs.

In [1]:
import numpy as np
# fix random seed for reproducibility
np.random.seed(1)

# We want to have a finite vocabulary to make sure that our word matrices are not arbitrarily small
vocabulary_size = 10000

#We also want to have a finite length of reviews and not have to process really long sentences.
max_review_length = 500

import warnings
warnings.simplefilter(action='ignore',category=FutureWarning)

#### TOKENIZATION

For practical data science applications, we need to convert text into tokens since the machine understands only numbers and not really English words like humans can. As a simple example of tokenization, we can see a small example.

Assume we have 5 sentences. This is how we tokenize them into numbers once we create a dictionary.

1. i have books - [1, 4, 7]
2. interesting books are useful [10,2,9,8]
3. i have computers [1,4,6]
4. computers are interesting and useful [6,9,11,10,8]
5. books and computers are both valuable. [2,10,2,9,13,12]
6. Bye Bye [7,7]

Create tokens for vocabulary based on frequency of occurrence. Hence, we assign the following tokens

I-1, books-2, computers-3, have-4, are-5, computers-6,bye-7, useful-8, are-9, and-10,interesting-11, valuable-12, both-13

Thankfully, in our dataset it is internally handled and each sentence is represented in such tokenized form.

#### Load data

In [2]:
from keras.datasets import imdb 
from keras.preprocessing import sequence

Using TensorFlow backend.


In [3]:
np_load_old = np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=vocabulary_size)
np.load = np_load_old

print('Number of reviews', len(X_train))
print('Length of first and fifth review before padding', len(X_train[0]) ,len(X_train[4]))
print('First review', X_train[0])
print('First label', y_train[0])

  x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])


Number of reviews 25000
Length of first and fifth review before padding 218 147
First review [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103

  x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])


#### Preprocess data

Pad sequences in order to ensure that all inputs have same sentence length and dimensions.

In [4]:
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
print('Length of first and fifth review after padding', len(X_train[0]) ,len(X_train[4]))

Length of first and fifth review after padding 500 500


### MODEL 1(a) : FEEDFORWARD NETWORKS WITHOUT EMBEDDINGS 

Let us build a single layer feedforward net with 250 nodes. Each input would be a 500-dim vector of tokens since we padded all our sequences to size 500.

<b> EXERCISE </b> : Calculate the number of parameters involved in this network and implement a feedforward net to do classification without looking at cells below.

In [5]:
import torch.nn.functional as F

import torch
from torch.utils.data import DataLoader,TensorDataset
from torch import nn
# from catalyst import dl
batch_size=2048

device = torch.device("cuda:0")
train_set = TensorDataset(torch.FloatTensor(X_train),torch.FloatTensor(y_train).view(25000,1))
train_loader = DataLoader(train_set,batch_size=batch_size)#250 examples a time

test_set = TensorDataset(torch.FloatTensor(X_test),torch.FloatTensor(y_test).view(25000,1))
test_loader = DataLoader(test_set,batch_size=batch_size)

In [6]:
class FeedForwardNet(nn.Module):
    def __init__(self,input_shape=500, hidden_size=250,output_size=1):
        super().__init__()
        self.fc1 = nn.Linear(input_shape, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.activation = nn.Sigmoid()
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        return self.activation(x)

# ff_model =FeedForwardNet()
# ff_model = ff_model.to(device)

In [7]:
# #### YOUR CODE HERE ####
# model = nn.Sequential(
#     nn.Linear(500,250),
#     nn.Linear(250,1),
#     nn.Sigmoid()
# )
# model.to(device)

In [8]:
def train(model,train_loader,epoch,criterion,optimizer,device=device,verbose=0):
    model.train()
    train_loss, correct = 0,0
    for batch_id, (data,target) in enumerate(train_loader):
        

        data,target = data.to(device),target.to(device)
        output = model(data)
        optimizer.zero_grad()
#         loss = F.binary_cross_entropy(output,target)
        loss = criterion(output,target)
        loss.backward()
        optimizer.step()
        correct += output.round().eq(target).sum().item()
        train_loss+=loss.item()
        if batch_id%100==0 and verbose==1:
            print(f"Training Epoch:{epoch} [{batch_id*len(data)}/{len(train_loader.dataset)}] {100.*batch_id / len(train_loader) :.2f}%\tLoss: {loss.item():.4f}")
    
    train_loss/=len(train_loader.dataset)
    acc = correct/len(train_loader.dataset)
    
    print(f"\nTrain: Average loss: {train_loss:.4f}\t Accuracy: {acc:.4f}")

def validate(model,test_loader, device=device):
    model.eval()
    test_loss =0
    correct = 0
    with torch.no_grad():
        for data,target in test_loader:
            data,target = data.to(device), target.to(device)
            output = model(data)
            test_loss +=F.binary_cross_entropy(output, target).item()
            correct += output.round().eq(target).sum().item()
    
    test_loss /= len(test_loader.dataset)
    acc = correct/ len(test_loader.dataset)
    
    print(f'Valid: Average loss: {test_loss:.4f}\t Accuracy: {acc:.8f}\n')

In [9]:
def training_loop(modelType,n_epochs,train_loader,test_loader, optimizer,criterion,lr=1e-2):
    model = modelType()
    model = model.to(device)
    optimizer = optimizer(model.parameters(), lr=lr)
    criterion= criterion(reduction='mean')
    for epoch in range(n_epochs):
        train(model,train_loader,epoch, criterion,optimizer)
#         print(next(model.parameters())[0][42])
        validate(model,test_loader)

In [15]:
optimizer = torch.optim.SGD
criterion = nn.BCELoss
training_loop(FeedForwardNet,10, train_loader,test_loader,optimizer,criterion,0.00001)


Train: Average loss: 0.0233	 Accuracy: 0.5026
Valid: Average loss: 0.0233	 Accuracy: 0.49788000


Train: Average loss: 0.0233	 Accuracy: 0.5004
Valid: Average loss: 0.0233	 Accuracy: 0.49640000


Train: Average loss: 0.0233	 Accuracy: 0.5006
Valid: Average loss: 0.0232	 Accuracy: 0.49940000


Train: Average loss: 0.0234	 Accuracy: 0.4989
Valid: Average loss: 0.0233	 Accuracy: 0.49988000


Train: Average loss: 0.0235	 Accuracy: 0.4996
Valid: Average loss: 0.0233	 Accuracy: 0.49812000


Train: Average loss: 0.0235	 Accuracy: 0.4991
Valid: Average loss: 0.0235	 Accuracy: 0.49696000


Train: Average loss: 0.0235	 Accuracy: 0.5002
Valid: Average loss: 0.0236	 Accuracy: 0.49580000


Train: Average loss: 0.0237	 Accuracy: 0.5000
Valid: Average loss: 0.0238	 Accuracy: 0.49716000


Train: Average loss: 0.0238	 Accuracy: 0.4996
Valid: Average loss: 0.0239	 Accuracy: 0.49716000


Train: Average loss: 0.0239	 Accuracy: 0.5014
Valid: Average loss: 0.0241	 Accuracy: 0.49840000



#### Discussion : Why was the performance bad ? What was wrong with tokenization ? 

### MODEL 1(b) : FEEDFORWARD NETWORKS WITH EMBEDDINGS

#### What is an embedding layer ? 

An embedding is a linear projection from one vector space to another. We usually use embeddings to project the one-hot encodings of words on to a lower-dimensional continuous space so that the input surface is dense and possibly smooth. According to the model, an embedding layer is just a transformation from $\mathbb{R}^{inp}$ to $\mathbb{R}^{emb}$

Do embedding to dim 100 (in keras, tf, PyTorch: with Embedding layer) and after flattening add a dense layer with 250 units. Fit the model.

In [74]:
train_set = TensorDataset(torch.LongTensor(X_train),torch.FloatTensor(y_train).view(25000,1))
train_loader = DataLoader(train_set, batch_size=256)#250 examples a time

test_set = TensorDataset(torch.LongTensor(X_test),torch.FloatTensor(y_test).view(25000,1))
test_loader = DataLoader(test_set, batch_size=256)

In [75]:
class FF_Embeddings(nn.Module):
    def __init__(self, input_dim=500,embedding_dim=100, num_embeddings=vocabulary_size,hidden_dim=250, output_dim=1):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=num_embeddings, embedding_dim=embedding_dim)
        self.fc1 = nn.Linear(embedding_dim*input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim,output_dim)
        self.activation = nn.Sigmoid()
        
    def forward(self, x):
        embedded = self.embedding(x).view(x.size(0),-1)
        x = self.fc1(embedded)
        x = self.fc2(x)
        return self.activation(x)

In [78]:
training_loop(FF_Embeddings,10, train_loader,test_loader,optimizer,criterion,lr=0.0001)


Train: Average loss: 0.0036	 Accuracy: 0.5548
Valid: Average loss: 0.0033	 Accuracy: 0.55060000


Train: Average loss: 0.0022	 Accuracy: 0.7280
Valid: Average loss: 0.0033	 Accuracy: 0.57544000


Train: Average loss: 0.0018	 Accuracy: 0.7885
Valid: Average loss: 0.0032	 Accuracy: 0.59608000


Train: Average loss: 0.0019	 Accuracy: 0.7761
Valid: Average loss: 0.0029	 Accuracy: 0.62696000


Train: Average loss: 0.0019	 Accuracy: 0.7954
Valid: Average loss: 0.0060	 Accuracy: 0.54012000


Train: Average loss: 0.0015	 Accuracy: 0.8460
Valid: Average loss: 0.0046	 Accuracy: 0.57120000


Train: Average loss: 0.0012	 Accuracy: 0.8786
Valid: Average loss: 0.0037	 Accuracy: 0.61376000


Train: Average loss: 0.0010	 Accuracy: 0.9051
Valid: Average loss: 0.0034	 Accuracy: 0.63828000


Train: Average loss: 0.0008	 Accuracy: 0.9263
Valid: Average loss: 0.0035	 Accuracy: 0.64956000


Train: Average loss: 0.0007	 Accuracy: 0.9400
Valid: Average loss: 0.0036	 Accuracy: 0.65476000



### MODEL 2 : CONVOLUTIONAL NEURAL NETWORKS

Text can be thought of as 1-dimensional sequence and we can apply 1-D Convolutions over a set of words. Let us walk through convolutions on text data with this blog.

http://debajyotidatta.github.io/nlp/deep/learning/word-embeddings/2016/11/27/Understanding-Convolutions-In-Text/

Fit a 1D convolution with 200 filters, kernel size 3 followed by a feedforward layer of 250 nodes and ReLU, sigmoid activations as appropriate.

In [12]:
#### YOUR CODE HERE ####
cnn_model = nn.Sequential(
    nn.Conv1d(in_channels=200,out_channels=250,kernel_size=3),
)

In [13]:
cnn_model.forward(torch.Tensor(X_train[:250]))

RuntimeError: Expected 3-dimensional input for 3-dimensional weight [250, 200, 3], but got 2-dimensional input of size [250, 500] instead

### MODEL 3 : SIMPLE RNN

Two of the best blogs that help understand the workings of a RNN and LSTM are

1. http://karpathy.github.io/2015/05/21/rnn-effectiveness/
2. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Mathematically speaking, a simple RNN does the following. It constructs a set of hidden states using the state variable from the previous timestep and the input at current time. Mathematically, a simpleRNN can be defined by the following relation.

<center>$h_t = \sigma(W([h_{t-1},x_{t}])+b)$
    
If we extend this recurrence relation to the length of sequences we have in hand, we have our RNN network constructed.

Do simple RNN (keras, rf: SimpleRNN layer, pytorch: RNN layer) with 100 units with the input from embedding layer. How are the results different from the previous model?

In [79]:
class RNNModel(nn.Module):
    def __init__(self, input_shape=500, hidden_dim=100, output_shape=1, vocab_size = vocabulary_size, embedding_dim=100):
        super().__init__()
        self.embedding = nn.Embedding(embedding_dim=embedding_dim, num_embeddings=vocabulary_size)
        self.rnn = nn.RNN(embedding_dim, hidden_dim)
        self.linear = nn.Linear(input_shape * hidden_dim, output_shape)
        self.sigmoid = nn.Sigmoid()
        
        self.hidden_dim = hidden_dim
    
    def forward(self,x):
        embedded = self.embedding(x)
        output,hidden = self.rnn(embedded)
        output,hidden = self.rnn(output)
        output = output.view(x.size(0),-1)
        x = self.linear(output)
        return self.sigmoid(x)


In [80]:
optimizer = torch.optim.Adam
training_loop(RNNModel,10,train_loader,test_loader,optimizer,criterion,lr=1e-5)


Train: Average loss: 0.0027	 Accuracy: 0.5058
Valid: Average loss: 0.0027	 Accuracy: 0.51516000


Train: Average loss: 0.0027	 Accuracy: 0.5284
Valid: Average loss: 0.0027	 Accuracy: 0.51772000


Train: Average loss: 0.0027	 Accuracy: 0.5422
Valid: Average loss: 0.0027	 Accuracy: 0.52072000


Train: Average loss: 0.0027	 Accuracy: 0.5576
Valid: Average loss: 0.0027	 Accuracy: 0.52516000


Train: Average loss: 0.0027	 Accuracy: 0.5728
Valid: Average loss: 0.0027	 Accuracy: 0.52828000


Train: Average loss: 0.0027	 Accuracy: 0.5861
Valid: Average loss: 0.0027	 Accuracy: 0.53104000


Train: Average loss: 0.0026	 Accuracy: 0.6001
Valid: Average loss: 0.0027	 Accuracy: 0.53316000


Train: Average loss: 0.0026	 Accuracy: 0.6137
Valid: Average loss: 0.0027	 Accuracy: 0.53644000


Train: Average loss: 0.0026	 Accuracy: 0.6255
Valid: Average loss: 0.0027	 Accuracy: 0.54124000


Train: Average loss: 0.0026	 Accuracy: 0.6353
Valid: Average loss: 0.0027	 Accuracy: 0.54368000



In [81]:
optimizer = torch.optim.Adam
training_loop(RNNModel,20,train_loader,test_loader,optimizer,criterion,lr=1e-4)


Train: Average loss: 0.0027	 Accuracy: 0.5224
Valid: Average loss: 0.0027	 Accuracy: 0.53724000


Train: Average loss: 0.0026	 Accuracy: 0.6175
Valid: Average loss: 0.0027	 Accuracy: 0.57048000


Train: Average loss: 0.0025	 Accuracy: 0.6785
Valid: Average loss: 0.0026	 Accuracy: 0.60260000


Train: Average loss: 0.0023	 Accuracy: 0.7257
Valid: Average loss: 0.0025	 Accuracy: 0.62772000


Train: Average loss: 0.0022	 Accuracy: 0.7587
Valid: Average loss: 0.0024	 Accuracy: 0.65528000


Train: Average loss: 0.0020	 Accuracy: 0.7842
Valid: Average loss: 0.0024	 Accuracy: 0.67448000


Train: Average loss: 0.0018	 Accuracy: 0.8065
Valid: Average loss: 0.0023	 Accuracy: 0.69060000


Train: Average loss: 0.0017	 Accuracy: 0.8258
Valid: Average loss: 0.0022	 Accuracy: 0.70480000


Train: Average loss: 0.0016	 Accuracy: 0.8461
Valid: Average loss: 0.0022	 Accuracy: 0.71516000


Train: Average loss: 0.0014	 Accuracy: 0.8628
Valid: Average loss: 0.0021	 Accuracy: 0.72356000



In [82]:
optimizer = torch.optim.SGD
training_loop(RNNModel,10,train_loader,test_loader,optimizer,criterion)


Train: Average loss: 0.0028	 Accuracy: 0.5152
Valid: Average loss: 0.0027	 Accuracy: 0.51464000


Train: Average loss: 0.0026	 Accuracy: 0.5938
Valid: Average loss: 0.0027	 Accuracy: 0.53548000


Train: Average loss: 0.0025	 Accuracy: 0.6559
Valid: Average loss: 0.0027	 Accuracy: 0.55144000


Train: Average loss: 0.0025	 Accuracy: 0.6966
Valid: Average loss: 0.0027	 Accuracy: 0.56496000


Train: Average loss: 0.0024	 Accuracy: 0.7245
Valid: Average loss: 0.0026	 Accuracy: 0.57308000


Train: Average loss: 0.0023	 Accuracy: 0.7422
Valid: Average loss: 0.0026	 Accuracy: 0.57976000


Train: Average loss: 0.0022	 Accuracy: 0.7575
Valid: Average loss: 0.0026	 Accuracy: 0.58672000


Train: Average loss: 0.0022	 Accuracy: 0.7694
Valid: Average loss: 0.0026	 Accuracy: 0.59316000


Train: Average loss: 0.0021	 Accuracy: 0.7794
Valid: Average loss: 0.0026	 Accuracy: 0.59928000


Train: Average loss: 0.0020	 Accuracy: 0.7888
Valid: Average loss: 0.0026	 Accuracy: 0.60344000



#### RNNs and vanishing/exploding gradients

Let us use sigmoid activations as example. Derivative of a sigmoid can be written as 
<center> $\sigma'(x) = \sigma(x) \cdot \sigma(1-x)$. </center>

<img src = "fig/vanishing_gradients.png">
Remember RNN is a "really deep" feedforward network (when unrolled in time). Hence, backpropagation happens from $h_t$ all the way to $h_1$. Also realize that sigmoid gradients are multiplicatively dependent on the value of sigmoid. Hence, if the non-activated output of any layer $h_l$ is < 0, then $\sigma$ tends to 0, effectively "vanishing" gradient. Any layer that the current layer backprops to $H_{1:L-1}$ do not learn anything useful out of the gradients.

#### LSTMs and GRU
LSTM and GRU are two sophisticated implementations of RNN which essentially are built on what we call as gates. A gate is a probability number between 0 and 1. For instance, LSTM is built on these state updates 

Note : L is just a linear transformation L(x) = W*x + b.

$f_t = \sigma(L([h_{t-1},x_t))$

$i_t = \sigma(L([h_{t-1},x_t))$

$o_t = \sigma(L([h_{t-1},x_t))$

$\hat{C}_t = \tanh(L([h_{t-1},x_t))$

$C_t = f_t * C_{t-1}+i_t*\hat{C}_t$  (Using the forget gate, the neural network can learn to control how much information it has to retain or forget)

$h_t = o_t * \tanh(c_t)$



### MODEL 4 : LSTM

In the next step, we will implement a LSTM model to do classification. Use the same architecture as before. Try experimenting with increasing the number of nodes, stacking multiple layers, applyong dropouts etc. Check the number of parameters that this model entails.

In [83]:
class LSTMModel(nn.Module):
    def __init__(self, input_shape=500, hidden_dim=100, output_shape=1, vocab_size = vocabulary_size, embedding_dim=100):
        super().__init__()
        self.embedding = nn.Embedding(embedding_dim=embedding_dim, num_embeddings=vocabulary_size)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.linear = nn.Linear(input_shape * hidden_dim, output_shape)
        self.sigmoid = nn.Sigmoid()
        
        self.hidden_dim = hidden_dim
    
    def forward(self,x):
        embedded = self.embedding(x)
        output,hidden = self.lstm(embedded)
        output,hidden = self.lstm(output)
        output = output.view(x.size(0),-1)
        x = self.linear(output)
        return self.sigmoid(x)


In [86]:
optimizer = torch.optim.Adam
training_loop(LSTMModel,10,train_loader,test_loader,optimizer,criterion,lr=1e-3)


Train: Average loss: 0.0028	 Accuracy: 0.5028
Valid: Average loss: 0.0027	 Accuracy: 0.54556000


Train: Average loss: 0.0026	 Accuracy: 0.6177
Valid: Average loss: 0.0024	 Accuracy: 0.65872000


Train: Average loss: 0.0017	 Accuracy: 0.7932
Valid: Average loss: 0.0015	 Accuracy: 0.82592000


Train: Average loss: 0.0012	 Accuracy: 0.8772
Valid: Average loss: 0.0014	 Accuracy: 0.84528000


Train: Average loss: 0.0009	 Accuracy: 0.9028
Valid: Average loss: 0.0017	 Accuracy: 0.82176000


Train: Average loss: 0.0007	 Accuracy: 0.9317
Valid: Average loss: 0.0015	 Accuracy: 0.85052000


Train: Average loss: 0.0006	 Accuracy: 0.9462
Valid: Average loss: 0.0018	 Accuracy: 0.83172000


Train: Average loss: 0.0005	 Accuracy: 0.9540
Valid: Average loss: 0.0017	 Accuracy: 0.84144000


Train: Average loss: 0.0003	 Accuracy: 0.9748
Valid: Average loss: 0.0018	 Accuracy: 0.84380000


Train: Average loss: 0.0002	 Accuracy: 0.9842
Valid: Average loss: 0.0026	 Accuracy: 0.81592000



### MODEL 5 : CNN + LSTM 

CNNs are good at learning spatial features and sentences can be thought of as 1-D spatial vectors (dimension being connotated by the sequence ordering among the words in the sentence.). We apply a LSTM over the features learned by the CNN (after a maxpooling layer). This leverages the power of CNNs and LSTMs combined. We expect the CNN to be able to pick out invariant features across the 1-D spatial structure(i.e. sentence) that characterize good and bad sentiment. This learned spatial features may then be learned as sequences by an LSTM layer followed by a feedforward for classification.

In [None]:
#### YOUR CODE HERE ####


### CONCLUSION

We saw the power of sequence models and how they are useful in text classification. They give a solid performance, low memory footprint (thanks to shared parameters) and are able to understand and leverage the temporally connected information contained in the inputs. There is still an open debate about the performance vs memory benefits of CNNs vs RNNs in the research community.