# RNN Tutorial

## RNN

Recurrent neural networks (RNNs) are a class of neural networks that are naturally suited to processing time-series data and other sequential data. Their distinctive feature is the ability to carry information from one time step to the next.

Before we dive into the details of what a recurrent neural network is, let's consider whether we really need a network designed specifically for sequential data.

## Example use cases


**Sentiment Classification** - A simple version of this task is classifying tweets as expressing positive or negative sentiment. Importantly, the input tweets vary in length.

**Language Translation** - Language translation is one of the most interesting applications of sequence modelling. The model needs to understand the meaning of each word, and the relationships between words, to translate well.

**Image Captioning** - Generating a textual description of an image. The outputs must be coherent and often vary in length.

**Music Generation** - Music is temporal: each note depends on what came before, which a standard feed-forward network with fixed-size inputs and outputs does not model naturally.

RNNs provide flexibility by being able to handle sequential data of varying length (both in input and output).

## Technical Details

RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory, RNNs can make use of information in arbitrarily long sequences; in practice they struggle to do so, for the reasons covered under Drawbacks.

An RNN works by saving the hidden state computed at one time step and feeding it back in, together with the next input, to compute the output at the following time step.

![RNN](https://www.simplilearn.com/ice9/free_resources_article_thumb/Simple_Recurrent_Neural_Network.png)

**RNN Unrolled**:

![RNN Structure](https://www.simplilearn.com/ice9/free_resources_article_thumb/Fully_connected_Recurrent_Neural_Network.gif)

For each time step $t$, let $x^{t}$ denote the input, $y^{t}$ the output, and $h^{t}$ the hidden state. Compared with a standard neural network, note the following additions:

$$h^{t} = f_{1}(h^{t-1},x^{t})$$
$$y^{t} = f_{2}(h^{t})$$

**Note**: The weights (the parameters of $f_{1}$ and $f_{2}$) remain constant across time steps.
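
To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass. It assumes that $f_{1}$ is a `tanh` of an affine map and that $f_{2}$ is a plain affine map, which is one common choice rather than the only one. The names `rnn_forward`, `W_xh`, `W_hh`, `W_hy`, `b_h` and `b_y` are illustrative and do not appear elsewhere in this tutorial.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over a sequence; return all outputs and hidden states."""
    h = np.zeros(W_hh.shape[0])                   # h^0: initial hidden state
    hs, ys = [], []
    for x in xs:                                  # the same weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)    # h^t = f_1(h^{t-1}, x^t)
        y = W_hy @ h + b_y                        # y^t = f_2(h^t)
        hs.append(h)
        ys.append(y)
    return np.stack(ys), np.stack(hs)

# Small random weights, purely for illustration
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim, seq_len = 1, 4, 1, 5
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)

xs = rng.normal(size=(seq_len, input_dim))
ys, hs = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(ys.shape, hs.shape)  # (5, 1) (5, 4)
```

The loop body never changes from one step to the next, which is exactly what it means for the weights to be shared across time.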

## Backpropagation

Backpropagation through time (BPTT) is the backpropagation algorithm applied to a recurrent neural network unrolled over its input sequence. An RNN processes a sequence one step at a time, so during backpropagation the gradients flow backward across time steps. The gradient with respect to the hidden state at step $t$ combines the gradient from the output at step $t$ with the gradient flowing back from step $t+1$, so we can use the chain rule to compute the gradients recursively. Because the weights are shared, their gradients are summed over all time steps.
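
The recursion can be sketched on a deliberately tiny model: a scalar RNN with `tanh` activation and a squared-error loss on the final hidden state only. These choices, and the names `forward`, `loss_fn` and `bptt`, are assumptions made for illustration. The sketch compares the hand-written gradients against a finite-difference estimate.

```python
import math

def forward(w, u, xs):
    """Scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t), starting from h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss_fn(w, u, xs, target):
    """Squared error on the final hidden state only."""
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt(w, u, xs, target):
    """Gradients of the loss w.r.t. w and u via backpropagation through time."""
    hs = forward(w, u, xs)
    dw, du = 0.0, 0.0
    dh = hs[-1] - target                 # dL/dh_T
    for t in range(len(xs), 0, -1):      # walk backward through time
        da = dh * (1.0 - hs[t] ** 2)     # through tanh: d tanh(a)/da = 1 - tanh(a)^2
        dw += da * hs[t - 1]             # shared weight: contributions add up over t
        du += da * xs[t - 1]
        dh = da * w                      # gradient passed on to h_{t-1}
    return dw, du

xs, target, w, u = [0.5, -0.1, 0.3, 0.8], 0.2, 0.7, -0.4
dw, du = bptt(w, u, xs, target)

# Central finite differences as an independent check
eps = 1e-6
dw_num = (loss_fn(w + eps, u, xs, target) - loss_fn(w - eps, u, xs, target)) / (2 * eps)
du_num = (loss_fn(w, u + eps, xs, target) - loss_fn(w, u - eps, xs, target)) / (2 * eps)
print(abs(dw - dw_num), abs(du - du_num))  # both differences should be tiny
```

In practice an autograd library such as PyTorch performs this bookkeeping automatically when `loss.backward()` is called.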

## Types and Applications of RNNs

**One to one RNN**: Traditional Neural Networks

![One-to-One](https://www.simplilearn.com/ice9/free_resources_article_thumb/One_to_One_RNN.png)

**One to many RNN**: Generative Tasks, like music generation.

![one-to-many](https://www.simplilearn.com/ice9/free_resources_article_thumb/One_to_Many_RNN.png)

**Many to One**: Classification tasks, sentiment analysis.

![many-to-one](https://www.simplilearn.com/ice9/free_resources_article_thumb/Many_to_One_RNN.png)

**Many to Many**: Machine translation.

![many-to-many](https://www.simplilearn.com/ice9/free_resources_article_thumb/Many_to_Many_RNN.png)

## Implementing an RNN

**Checking system and library version**

In [1]:
import sys
import torch
import math
print('Your python version: {}'.format(sys.version_info.major))
print('Your pytorch version: {}'.format(torch.__version__))
print('GPU being used: {}'.format(torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none (CPU only)'))

Your python version: 3
Your pytorch version: 1.12.1+cu116
GPU being used: NVIDIA GeForce RTX 3060 Laptop GPU


**Importing Libraries**

In [2]:
import torch
from torch import nn
import numpy as np
import pandas as pd

**Loading the dataset**

In [3]:
data = pd.read_csv('./Apple.csv')
data = data.sort_values('Date')  # sort_values returns a new DataFrame, so reassign it
data.tail()

| | Date | Open | High | Low | Close | Volume | Name |
|---|---|---|---|---|---|---|---|
| 3014 | 2017-12-22 | 174.68 | 175.42 | 174.5 | 175.01 | 16349444 | AAPL |
| 3015 | 2017-12-26 | 170.8 | 171.47 | 169.68 | 170.57 | 33185536 | AAPL |
| 3016 | 2017-12-27 | 170.1 | 170.78 | 169.71 | 170.6 | 21498213 | AAPL |
| 3017 | 2017-12-28 | 171.0 | 171.85 | 170.48 | 171.08 | 16480187 | AAPL |
| 3018 | 2017-12-29 | 170.52 | 170.59 | 169.22 | 169.23 | 25999922 | AAPL |


**Price Representation**

In [4]:
import plotly.express as px

In [5]:
fig = px.line(data, x='Date', y="Close")
fig.write_image("./fig1.png")

**Pre-processing**

In [6]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

In [7]:
price = data[['Close']].copy()  # copy so we modify a new DataFrame, not a slice of `data`
scaler = MinMaxScaler(feature_range=(-1, 1))
# scale the closing prices to the range [-1, 1]
price['Close'] = scaler.fit_transform(price[['Close']].to_numpy()).ravel()
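
As a rough sketch of what min-max scaling to `(-1, 1)` does, the same transformation can be written by hand in NumPy. The helper names `minmax_scale` and `minmax_invert` are hypothetical, and the five values are the closing prices from the `data.tail()` output above.

```python
import numpy as np

def minmax_scale(values, lo=-1.0, hi=1.0):
    """Linearly map values so that their min becomes lo and their max becomes hi."""
    v_min, v_max = values.min(), values.max()
    scaled = (values - v_min) / (v_max - v_min) * (hi - lo) + lo
    return scaled, (v_min, v_max)

def minmax_invert(scaled, params, lo=-1.0, hi=1.0):
    """Undo minmax_scale using the stored (min, max) of the original data."""
    v_min, v_max = params
    return (scaled - lo) / (hi - lo) * (v_max - v_min) + v_min

closes = np.array([175.01, 170.57, 170.60, 171.08, 169.23])
scaled, params = minmax_scale(closes)
restored = minmax_invert(scaled, params)

print(scaled.min(), scaled.max())    # -1.0 1.0
print(np.allclose(restored, closes)) # True
```

Keeping the fitted minimum and maximum around is what later allows predictions to be mapped back to prices, which is the role `scaler.inverse_transform` plays in this tutorial.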



**Train-Test Split**

In [8]:
def split_data(stock, lookback):
    data_raw = stock.to_numpy()  # convert to a NumPy array of shape (n_rows, 1)

    # build overlapping windows of `lookback` consecutive values
    windows = []
    for index in range(len(data_raw) - lookback):
        windows.append(data_raw[index: index + lookback])
    windows = np.array(windows)

    # 80:20 train-test split, kept in chronological order
    test_set_size = int(np.round(0.2 * windows.shape[0]))
    train_set_size = windows.shape[0] - test_set_size

    # the first lookback - 1 values of each window are the input, the last value is the target
    x_train = windows[:train_set_size, :-1, :]
    y_train = windows[:train_set_size, -1, :]

    x_test = windows[train_set_size:, :-1, :]
    y_test = windows[train_set_size:, -1, :]

    return [x_train, y_train, x_test, y_test]

In [9]:
lookback = 20 # choose sequence length
x_train, y_train, x_test, y_test = split_data(price, lookback)
print('x_train.shape = ',x_train.shape)
print('y_train.shape = ',y_train.shape)
print('x_test.shape = ',x_test.shape)
print('y_test.shape = ',y_test.shape)

x_train.shape =  (2399, 19, 1)
y_train.shape =  (2399, 1)
x_test.shape =  (600, 19, 1)
y_test.shape =  (600, 1)
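
To see where these shapes come from, here is the same windowing idea applied to a toy series of ten values. The series and the lookback of 4 are made up for illustration.

```python
import numpy as np

series = np.arange(10, dtype=float).reshape(-1, 1)  # toy "prices": 0, 1, ..., 9
lookback = 4

# One window per starting index, mirroring the loop in split_data
windows = np.array([series[i:i + lookback] for i in range(len(series) - lookback)])
x, y = windows[:, :-1, :], windows[:, -1, :]

print(windows.shape)       # (6, 4, 1)
print(x.shape, y.shape)    # (6, 3, 1) (6, 1)
print(x[0].ravel(), y[0])  # [0. 1. 2.] [3.]
```

Each sample uses `lookback - 1` consecutive values as input and the value that follows them as the target, which is why a lookback of 20 gives inputs of length 19. Note that with this loop bound the final value of the series (9 in the toy example) is never used as a target.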


In [10]:
x_train = torch.from_numpy(x_train).type(torch.Tensor)
x_test = torch.from_numpy(x_test).type(torch.Tensor)
y_train = torch.from_numpy(y_train).type(torch.Tensor)
y_test = torch.from_numpy(y_test).type(torch.Tensor)

**Setting Hyperparameters**

In [11]:
input_dim = 1
hidden_dim = 32
num_layers = 2
output_dim = 1
num_epochs = 100
lr = 0.01

**Defining the Model**

In [12]:
class Model(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super().__init__()

        # Store the sizes needed to build the initial hidden state
        self.hidden_dim = hidden_dim
        self.n_layers = n_layers

        # RNN layer; batch_first=True means inputs are (batch, seq_len, input_size)
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)
        # Fully connected layer mapping the hidden state to the prediction
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x):
        batch_size = x.size(0)

        # Initialise the hidden state on the same device as the input
        hidden = self.init_hidden(batch_size, x.device)

        # out: (batch, seq_len, hidden_dim); hidden: (n_layers, batch, hidden_dim)
        out, hidden = self.rnn(x, hidden)

        # Use only the last time step's output to predict the next value
        out = self.fc(out[:, -1, :])

        return out, hidden

    def init_hidden(self, batch_size, device=None):
        # First hidden state: all zeros, created on the given device (CPU if None)
        return torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device)
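
The tensor shapes produced by `nn.RNN` are easy to get wrong, so here is a small standalone check. The hidden size, number of layers and sequence length match the hyperparameters used in this tutorial; the batch size of 8 and the random input are arbitrary.

```python
import torch
from torch import nn

rnn = nn.RNN(input_size=1, hidden_size=32, num_layers=2, batch_first=True)
x = torch.randn(8, 19, 1)    # (batch, seq_len, input_size)
h0 = torch.zeros(2, 8, 32)   # (num_layers, batch, hidden_size)

out, hn = rnn(x, h0)
print(out.shape)  # torch.Size([8, 19, 32]): top-layer hidden state at every time step
print(hn.shape)   # torch.Size([2, 8, 32]): final hidden state of every layer

# The last time step of `out` is the final hidden state of the top layer
print(torch.allclose(out[:, -1, :], hn[-1]))  # True
```

This is why the model can take `out[:, -1, :]` and pass it to the fully connected layer: it is a summary of the whole input window.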

**Model Training**

In [13]:
# Instantiate the model with hyperparameters
model = Model(input_size=input_dim, output_size=output_dim, hidden_dim=hidden_dim, n_layers=num_layers)
# The model stays on the CPU by default; to train on a GPU, move the model and the tensors:
# model = model.to('cuda')

# Define Loss, Optimizer
criterion = torch.nn.MSELoss(reduction='mean')
optimiser = torch.optim.Adam(model.parameters(), lr=lr)

In [14]:
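# Step through epochs, training on the full training set each time
hist = np.zeros(num_epochs)
for t in range(num_epochs):
    y_train_pred = model(x_train)[0]  # forward pass; [0] selects the predictions
    loss = criterion(y_train_pred, y_train)
    print("Epoch ", t, "MSE: ", loss.item())
    hist[t] = loss.item()

    optimiser.zero_grad()  # clear gradients from the previous step
    loss.backward()        # backpropagation through time
    optimiser.step()       # update the weights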

Epoch  0 MSE:  0.21268369257450104
Epoch  1 MSE:  0.3244333863258362
Epoch  2 MSE:  0.10748942196369171
Epoch  3 MSE:  0.12962789833545685
Epoch  4 MSE:  0.11968861520290375
Epoch  5 MSE:  0.06839635968208313
Epoch  6 MSE:  0.03909873589873314
Epoch  7 MSE:  0.05175027251243591
Epoch  8 MSE:  0.00859392061829567
Epoch  9 MSE:  0.012289724312722683
Epoch  10 MSE:  0.027167055755853653
Epoch  11 MSE:  0.014639001339673996
Epoch  12 MSE:  0.012984096072614193
Epoch  13 MSE:  0.008481314405798912
Epoch  14 MSE:  0.0020878964569419622
Epoch  15 MSE:  0.007266161032021046
Epoch  16 MSE:  0.012589436024427414
Epoch  17 MSE:  0.011321603320538998
Epoch  18 MSE:  0.007503440137952566
Epoch  19 MSE:  0.006073433440178633
Epoch  20 MSE:  0.005117951892316341
Epoch  21 MSE:  0.001966249430552125
Epoch  22 MSE:  0.0007680241833440959
Epoch  23 MSE:  0.003276276867836714
Epoch  24 MSE:  0.005111099220812321
Epoch  25 MSE:  0.004182235337793827
Epoch  26 MSE:  0.003088510362431407
Epoch  27 MSE:  0.0

**Training Representation**

In [15]:
predict = pd.DataFrame(scaler.inverse_transform(y_train_pred.detach().numpy()),columns=['predict'])
original = pd.DataFrame(scaler.inverse_transform(y_train.detach().numpy()),columns=['original'])
combined = pd.concat([predict,original],axis=1)
combined

| | predict | original |
|---|---|---|
| 0 | 11.363832 | 10.790002 |
| 1 | 11.602504 | 10.770000 |
| 2 | 11.719219 | 10.299999 |
| 3 | 11.459437 | 10.260001 |
| 4 | 11.206417 | 9.610002 |
| ... | ... | ... |
| 2394 | 113.534836 | 115.129997 |
| 2395 | 113.726974 | 115.519997 |
| 2396 | 115.190041 | 119.719994 |
| 2397 | 119.292183 | 113.490005 |


In [16]:
fig = px.line(combined, x=combined.index, y=["predict","original"])
fig.write_image("./fig2.png")

**Test Set Evaluation**

In [17]:
with torch.no_grad():  # no gradients are needed for evaluation
    y_test_pred = model(x_test)[0]

# invert predictions
y_train_pred = scaler.inverse_transform(y_train_pred.detach().numpy())
y_train = scaler.inverse_transform(y_train.detach().numpy())
y_test_pred = scaler.inverse_transform(y_test_pred.detach().numpy())
y_test = scaler.inverse_transform(y_test.detach().numpy())

trainScore = math.sqrt(mean_squared_error(y_train[:,0], y_train_pred[:,0]))
print('Train Score: %.2f RMSE' % (trainScore))
testScore = math.sqrt(mean_squared_error(y_test[:,0], y_test_pred[:,0]))
print('Test Score: %.2f RMSE' % (testScore))

Train Score: 1.31 RMSE
Test Score: 3.16 RMSE
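
For reference, the RMSE reported above is the square root of the mean squared difference between targets and predictions. A minimal NumPy version, with a hypothetical helper name `rmse` and made-up numbers, looks like this:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors are -1, 0 and 2, so the mean squared error is 5/3
score = rmse([10.0, 12.0, 14.0], [11.0, 12.0, 12.0])
print(round(score, 3))  # 1.291
```

Because the predictions are inverse-transformed before scoring, the RMSE values in this tutorial are in the same units as the closing price.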


## Observation

The example RNN seems to do extremely well, but the test RMSE (3.16) is more than twice the train RMSE (1.31), which suggests that we overfit. This implementation, as intended, highlights the benefits of an RNN, but it would likely fail to generalise well. Let's look at why by going over the limitations of the RNN architecture.

## Drawbacks

Two common problems that occur when backpropagating through time-series data are vanishing and exploding gradients.

**Gradient Explosion**: In some cases, the gradients grow larger and larger as backpropagation progresses. This causes very large weight updates and makes gradient descent diverge. This is known as the exploding gradients problem.

**Vanishing Gradient**: On the other hand, as backpropagation advances backward from the output towards the earlier layers (or, in an RNN, towards earlier time steps), the gradients might get smaller and smaller and approach zero, which leaves the weights of the lower layers nearly unchanged and prevents the network from learning long-range dependencies. As a result, gradient descent may never converge to the optimum.
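
Both effects can be seen in a deliberately simplified setting. Assume a scalar, linear recurrence $h^{t} = w \cdot h^{t-1}$, so that each step backward through time multiplies the gradient by the same factor $w$. Real RNNs also include activation derivatives and weight matrices, but the compounding behaves similarly. The function name is illustrative.

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one chain-rule factor per time step
    return grad

vanishing = gradient_through_time(0.9, 100)  # roughly 2.7e-05
stable = gradient_through_time(1.0, 100)     # exactly 1.0
exploding = gradient_through_time(1.1, 100)  # roughly 1.4e+04

for w, g in [(0.9, vanishing), (1.0, stable), (1.1, exploding)]:
    print(w, g)
```

A factor only slightly below or above 1 is enough to shrink or blow up the gradient over 100 steps, which is why long sequences are the hard case.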

Another problem faced by the RNN architecture is the lack of **parallelization**. An RNN needs the hidden state from the previous time step to compute the current one, so the time steps of a sequence must be processed one after another and cannot be computed in parallel. The resulting computational cost is often hard to justify by the accuracy gained.

Given the problems discussed above, it is quite difficult to train RNNs on very long sequences, especially when using ReLU or tanh activations.

## Going further

**Gradient Clipping**: A technique used to cope with the exploding gradient problem sometimes encountered during backpropagation. Capping the maximum value (or norm) of the gradient keeps the size of each weight update under control in practice.
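
One common variant is clipping by global norm: if the combined norm of all gradients exceeds a threshold, every gradient is scaled down by the same factor. The NumPy sketch below illustrates that idea; the function name `clip_by_global_norm` and the example gradients are made up.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))

print(norm_before)           # 13.0
print(round(norm_after, 6))  # 1.0
```

Scaling all gradients by one factor preserves the direction of the update and only limits its size. In PyTorch the equivalent utility is `torch.nn.utils.clip_grad_norm_`, typically called between `loss.backward()` and `optimiser.step()`.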

**Gated Cells**: Without going into much detail, the underlying idea here is to **selectively** add or remove information within each recurrent unit. LSTMs (Long Short-Term Memory networks) and GRUs (Gated Recurrent Units) are popular examples.

**LSTM**: Key concepts:

1. Maintain a cell state.
2. Use gates to control the flow of information:
    1. The forget gate gets rid of irrelevant information.
    2. The input gate stores relevant information from the current input.
    3. The cell state is selectively updated.
    4. The output gate returns a filtered version of the cell state.
3. Backpropagation through time with a partially uninterrupted gradient flow.
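
The steps above can be sketched as a single LSTM time step in NumPy. This follows the standard LSTM formulation, with all four gates computed from one stacked weight matrix. The layout of `W`, the gate ordering and the helper names are choices made for this sketch, not something prescribed by this tutorial or by PyTorch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4 * hidden, input + hidden), b: (4 * hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: what to drop from the cell state
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: what to store from the current input
    g = np.tanh(z[2 * hidden:3 * hidden])  # candidate values to add
    o = sigmoid(z[3 * hidden:4 * hidden])  # output gate: what to expose
    c = f * c_prev + i * g                 # selectively update the cell state
    h = o * np.tanh(c)                     # filtered version of the cell state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 1, 3
W = rng.normal(scale=0.5, size=(4 * hidden_dim, input_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):  # run over a sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)  # (3,) (3,)
```

The cell state update `c = f * c_prev + i * g` is additive, which is what gives gradients a more direct path backward through time than the repeated multiplications of a plain RNN.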

While these solve some of our gradient problems, we still haven't solved the problem of complex and non-parallelizable calculations. This is where mechanisms like attention come in, but more on that in upcoming tutorials.

## Try it yourself

Build a text generation model imitating the style of your favorite author. See how far you can push it: is your network able to generate coherent words, phrases, sentences, paragraphs? Is the stylistic flair being carried over?

Things to consider:

1. Training data (take a corpus of literature by the author)
2. Encoding (letter-level or word-level encoding, one-hot encodings or encodings based on sentiment)
3. Loss metric
4. Input for the trained model
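
As a possible starting point for the encoding step, here is a minimal character-level one-hot encoder. The toy string stands in for a real corpus, and the helper names are illustrative. The targets are the input characters shifted by one position, which is one common way to set up next-character prediction.

```python
import numpy as np

text = "hello world"  # stand-in for a real corpus
vocab = sorted(set(text))
char_to_idx = {ch: i for i, ch in enumerate(vocab)}
idx_to_char = {i: ch for ch, i in char_to_idx.items()}

def one_hot_encode(s):
    """Return an array of shape (len(s), vocab_size) with a single 1 per row."""
    encoded = np.zeros((len(s), len(vocab)), dtype=np.float32)
    encoded[np.arange(len(s)), [char_to_idx[ch] for ch in s]] = 1.0
    return encoded

def decode(encoded):
    """Map one-hot (or score) rows back to characters via argmax."""
    return "".join(idx_to_char[int(i)] for i in encoded.argmax(axis=1))

x = one_hot_encode(text[:-1])             # inputs: every character except the last
y = [char_to_idx[ch] for ch in text[1:]]  # targets: the next character at each position

print(x.shape, len(vocab))  # (10, 8) 8
print(decode(x))            # hello worl
```

With integer targets like `y`, a cross-entropy loss is a natural fit for the loss metric in item 3.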