# Introduction

Welcome to this Jupyter notebook, where we delve into the fascinating world of Recurrent Neural Networks (RNNs) using PyTorch. RNNs are a class of neural networks that are pivotal in processing sequential data, making them crucial for tasks like language modeling, time series analysis, and more. PyTorch, being a powerful and flexible deep learning library, provides efficient implementations of RNNs, allowing for rapid prototyping and experimentation.

**Objective of This Notebook**
The primary objective of this notebook is to understand the RNN cell by comparing two implementations:

1. PyTorch's Built-in RNN Cell: We will use PyTorch's torch.nn.RNN module to create an RNN instance. This serves as our baseline model, representing the standard implementation in the field.

2. Custom-Built RNN Cell: We will construct our own RNN cell from scratch. This exercise is aimed at gaining deeper insights into the mechanics of RNNs and understanding the nuances of recurrent neural processing.

**Key Aspects of Comparison**
Our comparison will focus on the following aspects:

1. Output Dimensions: Ensuring that both models yield outputs with consistent dimensions is crucial. This includes testing across various scenarios:
    - Input without batch dimension
    - Input with batch dimension
    - Variations when the batch dimension is either the first or second axis

**Learning Outcomes**
By the end of this notebook, we aim to achieve a comprehensive understanding of:
- The architecture and functionality of RNN cells
- How PyTorch implements RNNs and manages data dimensions
- The intricacies involved in building an RNN cell from the ground up

# PyTorch Built-in RNN Cell

In this section, we present the architecture of PyTorch's RNN module, following its documentation (https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).


The PyTorch RNN cell implements a multi-layer Elman RNN, which can use either a tanh or ReLU activation function. The core behavior of an RNN cell is described by the following equation:

$$ h_{t} = \tanh(x_{t} W_{ih}^T + b_{ih} + h_{t-1} W_{hh}^T + b_{hh}) $$

This equation encapsulates the essence of the RNN's recurrent nature. It combines two linear transformations - one applied to the current input $x_{t}$ and the other to the previous hidden state $h_{t-1}$. 

The weights $W_{ih}$ and $W_{hh}$ represent the input-to-hidden and hidden-to-hidden connections, respectively, while $b_{ih}$ and $b_{hh}$ are their corresponding biases.
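As a quick check of these definitions, the short sketch below lists the learnable tensors of a single-layer `nn.RNN`. The sizes 13 and 29 are illustrative values chosen for this example.

```python
import torch.nn as nn

# Illustrative sizes (assumed for this example)
input_size, hidden_size = 13, 29
rnn = nn.RNN(input_size, hidden_size, num_layers=1)

# Collect the shape of every learnable tensor by name
param_shapes = {name: tuple(p.shape) for name, p in rnn.named_parameters()}
for name, shape in param_shapes.items():
    print(f"{name:15s} {shape}")
```

`weight_ih_l0` is stored with shape (H, I), which is why the equation multiplies the input by its transpose.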

The process can be broken into steps:
- Initialize
    - $h_{0} = 0$
- For each time step $t$ in the sequence:
    - $a_{t} = x_{t} W_{ih}^T + b_{ih}$ (input contribution)
    - $r_{t} = h_{t-1} W_{hh}^T + b_{hh}$ (recurrent contribution)
    - $c_{t} = a_{t} + r_{t}$
    - $h_{t} = \tanh(c_{t})$

In essence, this equation describes how an RNN cell combines the current input with the previous hidden state to generate the current hidden state, which is in turn combined with the next input. This process is repeated across each time step in the input sequence, allowing the RNN to maintain a form of 'memory' of past inputs. This memory is crucial for tasks where context and sequence order are important, such as language modeling, speech recognition, and time series analysis.
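The recurrence can be sketched by hand and compared against `nn.RNN`. This is a minimal illustration, assuming a single-layer, unidirectional tanh RNN and small sizes chosen only for this example; it reuses the weights that PyTorch exposes as `weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0` and `bias_hh_l0`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, I, H = 5, 3, 4  # illustrative sizes (assumed)

rnn = nn.RNN(I, H)       # single layer, tanh, unidirectional
x = torch.randn(S, I)    # unbatched input (S, I)

W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0

with torch.no_grad():
    h = torch.zeros(H)   # h_0 = 0
    steps = []
    for t in range(S):
        # h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh)
        h = torch.tanh(x[t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
        steps.append(h)
    manual_output = torch.stack(steps)        # (S, H)
    reference_output, reference_hn = rnn(x)   # (S, H) and (1, H)

max_diff = (manual_output - reference_output).abs().max().item()
print("max abs difference:", max_diff)
```

The difference should be at the level of floating-point rounding, which suggests the loop reproduces what the built-in module computes for this configuration.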

According to the documentation of PyTorch's RNN module, tensors must respect the following sizes:
- *S*: Sequence Length - the number of time steps in the input.
- *I*: Input Size - the dimension of the input vector at each time step.
- *H*: Hidden Size - the size of the hidden state vector, a key parameter indicating the network's memory capacity.
- *B*: Batch Size - the number of sequences processed simultaneously.
- *L*: Number of Layers - the number of stacked recurrent layers.
- *D*: Direction - 2 for bidirectional RNNs (forward and backward), else 1.

- **Inputs**
    - input $x_{t}$
        - Unbatched `(S, I)`
        - Batched   `(S, B, I)`, or `(B, S, I)` when `batch_first=True`
    - hidden $h_{t-1}$
        - Unbatched `(D*L, H)`
        - Batched   `(D*L, B, H)`, regardless of `batch_first`
- **Outputs**
    - output $o_{t}$
        - Unbatched `(S, D*H)`
        - Batched   `(S, B, D*H)`, or `(B, S, D*H)` when `batch_first=True`
    - hidden $h_{t}$
        - Unbatched `(D*L, H)`
        - Batched   `(D*L, B, H)`, regardless of `batch_first`

Output Dimensions `(S, D*H)`: The output of an RNN is not directly the final prediction (e.g., a class or a word) but rather the representation of the sequence after processing through the RNN layers. Each time step outputs a vector of size `D*H`, which can be further processed or used as input to another layer or network.
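The shape rules can be checked mechanically. The sketch below loops over the `bidirectional` and `batch_first` options of `nn.RNN` and compares the returned shapes with the expected ones; the sizes (and the choice of two layers) are illustrative assumptions. Note that, per the PyTorch documentation, `batch_first` applies to the input and output tensors but not to the hidden state.

```python
import itertools
import torch
import torch.nn as nn

S, I, H, B, L = 17, 13, 29, 32, 2  # illustrative sizes (assumed)

results = []
for bidirectional, batch_first in itertools.product([False, True], repeat=2):
    D = 2 if bidirectional else 1
    rnn = nn.RNN(I, H, L, batch_first=batch_first, bidirectional=bidirectional)

    x = torch.randn(B, S, I) if batch_first else torch.randn(S, B, I)
    output, hn = rnn(x)

    expected_output = (B, S, D * H) if batch_first else (S, B, D * H)
    expected_hidden = (D * L, B, H)  # layout does not depend on batch_first
    results.append((tuple(output.shape) == expected_output,
                    tuple(hn.shape) == expected_hidden))
    print(bidirectional, batch_first, tuple(output.shape), tuple(hn.shape))
```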

In RNNs, the terms "hidden state" and "output" are often used interchangeably, but they can have different meanings depending on context:

Hidden State $h_{t}$: Represents the memory of the network, carrying information across time steps. It's updated at each time step based on the current input and the previous hidden state.

Output $o_{t}$: While the hidden state is the RNN's internal memory, the output at each time step can be a transformed version of this hidden state, tailored for specific tasks. For example, in many applications, a linear layer followed by an activation function is applied to the hidden state to produce the output. In `torch.nn.RNN` itself, `output` is simply the hidden state of the last layer at every time step; any task-specific transformation has to be added on top.
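For `nn.RNN` specifically, the relation between `output` and the returned hidden state can be inspected directly. The sketch below is a small check under assumed sizes: for a unidirectional model, the last time step of `output` should coincide with the final hidden state of the last layer; for a bidirectional model, the forward half finishes at the last time step and the backward half at the first one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, B, I, H, L = 6, 3, 4, 5, 2  # illustrative sizes (assumed)
x = torch.randn(S, B, I)

# Unidirectional: output[-1] is the final hidden state of the last layer
uni = nn.RNN(I, H, L)
out_uni, hn_uni = uni(x)
uni_match = torch.allclose(out_uni[-1], hn_uni[-1], atol=1e-6)

# Bidirectional: the last two entries of hn belong to the last layer
bi = nn.RNN(I, H, L, bidirectional=True)
out_bi, hn_bi = bi(x)
fwd_match = torch.allclose(out_bi[-1, :, :H], hn_bi[-2], atol=1e-6)  # forward ends at t = S-1
bwd_match = torch.allclose(out_bi[0, :, H:], hn_bi[-1], atol=1e-6)   # backward ends at t = 0

print(uni_match, fwd_match, bwd_match)
```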

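Finally, the batch dimension of the input and output can be placed first or second through the `batch_first` argument. The choice is a matter of convenience and of how the RNN integrates with the rest of the pipeline. The hidden state layout is not affected by this argument.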


## Built-in PyTorch RNN Example



In [5]:
import torch
import torch.nn as nn

# inputs and parameters
input_size = 13
sequence_length = 17
hidden_size = 29
num_layers = 7
nonlinearity = 'tanh'
bias = True
batch_first = False
batch_size = 32
dropout = 0.1
dropout = dropout if num_layers > 1 else 0
bidirectional = True
D = 2 if bidirectional else 1

# model rnn built in pytorch
rnn = nn.RNN(
    input_size, 
    hidden_size, 
    num_layers, 
    nonlinearity=nonlinearity, 
    bias=bias, 
    batch_first=batch_first, 
    dropout=dropout, 
    bidirectional=bidirectional
    )

In [6]:
# Unbatched input
input = torch.randn(sequence_length, input_size)

# run model
output, hn = rnn(input)

# print output and hidden state shape
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

output shape:  torch.Size([17, 58])
hidden state shape:  torch.Size([14, 29])


In [7]:
# Batched input
if batch_first:
    input = torch.randn(batch_size, sequence_length, input_size)
else:
    input = torch.randn(sequence_length, batch_size, input_size)

# run model
output, hn = rnn(input)

# print output and hidden state shape
print('Batch first: ', batch_first)
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

Batch first:  False
output shape:  torch.Size([17, 32, 58])
hidden state shape:  torch.Size([14, 32, 29])


We now have the sizes of the output and hidden-state tensors, and the RNN built from scratch must produce the same sizes. You can change the size definitions and parameters above to print different output tensors. All of these scenarios must also work with our own implementation.

# RNN From Scratch

Let's try to build this RNN from basic neural network building blocks.

## First Approximation 

The RNN cell applies $ h_{t} = \tanh(x_{t} W_{ih}^T + b_{ih} + h_{t-1} W_{hh}^T + b_{hh}) $, which can be written as a single matrix product. Consider the concatenated row vector $X = \begin{pmatrix} x_{t} & h_{t-1} \end{pmatrix}$, the stacked weights $W = \begin{pmatrix} W_{ih}^T \\ W_{hh}^T \end{pmatrix}$, and the combined bias $b = b_{ih} + b_{hh}$. Then $XW + b = x_{t} W_{ih}^T + h_{t-1} W_{hh}^T + b_{ih} + b_{hh}$, which is exactly the argument of the tanh.

So, in a first approximation, the RNN cell applies a single linear layer to the input and the previous hidden state concatenated together. The input size of that linear layer is therefore the sum of the sizes of $x_{t}$ and $h_{t-1}$.
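The equivalence between two separate linear maps and one linear layer on the concatenation can be verified numerically. The sketch below uses random weights and small sizes chosen only for illustration; the variable names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, I, H = 2, 3, 4  # illustrative sizes (assumed)

W_ih, W_hh = torch.randn(H, I), torch.randn(H, H)
b_ih, b_hh = torch.randn(H), torch.randn(H)
x_t, h_prev = torch.randn(B, I), torch.randn(B, H)

# Two separate linear maps, as in the RNN equation
separate = x_t @ W_ih.T + b_ih + h_prev @ W_hh.T + b_hh

# One linear layer acting on the concatenation [x_t, h_prev]
linear = nn.Linear(I + H, H)
with torch.no_grad():
    linear.weight.copy_(torch.cat((W_ih, W_hh), dim=1))  # (H, I + H)
    linear.bias.copy_(b_ih + b_hh)
    combined = linear(torch.cat((x_t, h_prev), dim=1))

print(torch.allclose(separate, combined, atol=1e-5))
```

The order of concatenation is free as long as the weight columns follow the same order; concatenating the hidden state first and the input second works equally well.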

In this first approximation we implement only the core architecture, so we assume that:
1. number of layers is 1
2. the RNN cell is not bidirectional, D=1
3. bias is always True



In [21]:
class BasicRNN(nn.Module):
    def __init__(self, input_size, hidden_size, nonlinearity='tanh', batch_first=False):
        super(BasicRNN, self).__init__()
        # RNN Cell parameters
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.nonlinearity = nonlinearity
        self.batch_first = batch_first

        # Define the layers for input + hidden
        self.rnn_layer = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h_0=None):
        # Remember whether the caller passed a batched (3-D) input;
        # a batched input with batch_size == 1 must keep its batch dimension
        is_batched = x.dim() == 3

        # Add a batch dimension of size 1 to unbatched (seq, feature) input
        if not is_batched:
            x = x.unsqueeze(0) if self.batch_first else x.unsqueeze(1)

        # Internally always work with (seq, batch, feature)
        if self.batch_first:
            x = x.transpose(0, 1)

        # Initialize hidden state as (batch, hidden); num_layers = 1, so dim 0 is dropped
        seq_len, batch_size, _ = x.size()
        if h_0 is None:
            h_t = torch.zeros(batch_size, self.hidden_size, dtype=x.dtype, device=x.device)
        else:
            # Accept (1, batch, hidden), (batch, hidden) or unbatched (1, hidden)
            h_t = h_0.reshape(batch_size, self.hidden_size)

        # Process the sequence, one time step at a time
        outputs = []
        for t in range(seq_len):
            # Concatenate hidden state and input along the feature dimension
            combined = torch.cat((h_t, x[t]), dim=1)

            # Linear map followed by the nonlinearity gives the new hidden state
            h_t = self.rnn_layer(combined)
            h_t = torch.tanh(h_t) if self.nonlinearity == 'tanh' else torch.relu(h_t)

            # Append to outputs
            outputs.append(h_t)

        # Convert list to tensor: (seq, batch, hidden)
        output = torch.stack(outputs)

        # Adjust for batch_first
        if self.batch_first:
            output = output.transpose(0, 1)

        if is_batched:
            # (batch, hidden) -> (1, batch, hidden) because num_layers = 1
            h_t = h_t.unsqueeze(0)
        else:
            # Drop the artificial batch dimension; h_t is already (1, hidden)
            output = output.squeeze(0) if self.batch_first else output.squeeze(1)

        return output, h_t


In [25]:
num_layers = 1
nonlinearity = 'tanh'
batch_first = True
bidirectional = False
D = 2 if bidirectional else 1

# model rnn
basic_rnn = BasicRNN(
    input_size, 
    hidden_size, 
    nonlinearity=nonlinearity, 
    batch_first=batch_first
    )

In [26]:
# Unbatched input
input = torch.randn(sequence_length, input_size)

# run model
output, hn = basic_rnn(input)

# print output and hidden state shape
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

output shape:  torch.Size([17, 29])
hidden state shape:  torch.Size([1, 29])


In [27]:
# Batched input
if batch_first:
    input = torch.randn(batch_size, sequence_length, input_size)
else:
    input = torch.randn(sequence_length, batch_size, input_size)

# run model
output, hn = basic_rnn(input)

# print output and hidden state shape
print('Batch first: ', batch_first)
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

Batch first:  True
output shape:  torch.Size([32, 17, 29])
hidden state shape:  torch.Size([1, 32, 29])


## Second Approximation 

Next, we add support for multiple layers.

In a multi-layer RNN, RNN cells are stacked to obtain a larger model capable of learning more patterns from the sequential input. To implement this stacked version, we reuse the BasicRNN class.

The first layer receives the initial input sequence then the subsequent layers are fed with the output (not the hidden state) of the previous layer. Each layer has its own hidden state, which it updates at each time step based on its current input and its previous hidden state.

The output comes from the last layer of the RNN. It's the sequence of transformed representations at each time step.
The hidden states are the final hidden states of each layer after processing the entire input sequence.

In a forward pass, only the output of each layer (except for the last one) is fed as input to the next layer.

Example: Consider a 2-layer RNN. The first layer receives the input sequence, processes it, and its output is then passed to the second layer. The second layer processes this input (which is the output of the first layer) and produces its own output and hidden state. The RNN as a whole returns the output of the second layer and the hidden states of both layers.
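The 2-layer example can be reproduced with the built-in module. The sketch below, under assumed sizes, copies the weights of a 2-layer `nn.RNN` into two single-layer `nn.RNN` instances (the helper names `layer0` and `layer1` are hypothetical) and chains them by hand.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, B, I, H = 6, 3, 4, 5  # illustrative sizes (assumed)

stacked = nn.RNN(I, H, num_layers=2)
layer0 = nn.RNN(I, H)   # stands in for the first layer
layer1 = nn.RNN(H, H)   # stands in for the second layer

# Copy the weights of the 2-layer model into the two single-layer models
with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(layer0, f"{name}_l0").copy_(getattr(stacked, f"{name}_l0"))
        getattr(layer1, f"{name}_l0").copy_(getattr(stacked, f"{name}_l1"))

x = torch.randn(S, B, I)
with torch.no_grad():
    out_ref, hn_ref = stacked(x)
    out0, h0 = layer0(x)      # layer 0 sees the input sequence
    out1, h1 = layer1(out0)   # layer 1 sees the output of layer 0
    hn_manual = torch.cat((h0, h1), dim=0)  # one final hidden state per layer

outputs_match = torch.allclose(out_ref, out1, atol=1e-6)
hiddens_match = torch.allclose(hn_ref, hn_manual, atol=1e-6)
print(outputs_match, hiddens_match)
```

The returned output comes from the second layer only, while the hidden state tensor holds one entry per layer.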


In [68]:
class StackedRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, nonlinearity='tanh', batch_first=False):
        super(StackedRNN, self).__init__()
        # Stacked RNN parameters
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.nonlinearity = nonlinearity
        self.batch_first = batch_first

        # Initialize layers of BasicRNN
        self.layers = nn.ModuleList([BasicRNN(input_size if i == 0 else hidden_size, 
                                              hidden_size, 
                                              nonlinearity=nonlinearity, 
                                              batch_first=batch_first) for i in range(num_layers)])

    def forward(self, input, hidden=None):
        # list to store the final hidden state of each layer
        hiddens = []

        # loop through layers
        for i, layer in enumerate(self.layers):
            # initial hidden state of this layer as (batch, hidden), if one was provided
            h_0 = None if hidden is None else hidden[i].reshape(-1, self.hidden_size)

            # pass through layer
            output, h_n = layer(input, h_0)

            # output of layer is input of next layer
            input = output

            # append hidden state
            hiddens.append(h_n)

        # stack hidden states: (num_layers, 1, ...) -> (num_layers, ...)
        hiddens = torch.stack(hiddens).squeeze(1)

        return output, hiddens

In [69]:
num_layers = 7
nonlinearity = 'tanh'
batch_first = True
bidirectional = False
D = 2 if bidirectional else 1

# model rnn
stack_rnn = StackedRNN(
    input_size, 
    hidden_size, 
    num_layers,
    nonlinearity=nonlinearity, 
    batch_first=batch_first
    )

In [70]:
# Unbatched input
input = torch.randn(sequence_length, input_size)

# run model
output, hn = stack_rnn(input)

# print output and hidden state shape
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

output shape:  torch.Size([17, 29])
hidden state shape:  torch.Size([7, 29])


In [71]:
# Batched input
if batch_first:
    input = torch.randn(batch_size, sequence_length, input_size)
else:
    input = torch.randn(sequence_length, batch_size, input_size)

# run model
output, hn = stack_rnn(input)

# print output and hidden state shape
print('Batch first: ', batch_first)
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

Batch first:  True
output shape:  torch.Size([32, 17, 29])
hidden state shape:  torch.Size([7, 32, 29])


## Third Approximation

Now let's add bidirectionality.

Bidirectionality in Recurrent Neural Networks (RNNs) is a technique that improves the model's understanding of the input by giving it access, at every time step, to both past context (through a forward pass) and future context (through a backward pass).

It consists of two separate RNNs: a forward RNN and a backward RNN. The forward RNN processes the sequence from the start to the end, while the backward RNN processes it from the end to the start.

At each time step, the outputs of the forward and backward RNNs are combined. This combination can be done in various ways, such as concatenation or summation, depending on the desired outcome. PyTorch's `nn.RNN` concatenates them, which is why its output feature size is `D*H` with D = 2.

The combined output at each time step then forms the overall output of the bidirectional RNN. This output can be used for further processing or making predictions.
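For a single layer, the bidirectional behavior of `nn.RNN` can be rebuilt from two unidirectional RNNs. The sketch below assumes small illustrative sizes and hypothetical helper names (`fwd`, `bwd`); it copies the forward weights (`*_l0`) and the backward weights (`*_l0_reverse`), runs the backward RNN on the time-reversed input, and flips its output back so that both directions are aligned in time before concatenation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, B, I, H = 6, 3, 4, 5  # illustrative sizes (assumed)

bi = nn.RNN(I, H, bidirectional=True)
fwd = nn.RNN(I, H)  # stands in for the forward direction
bwd = nn.RNN(I, H)  # stands in for the backward direction

with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(fwd, f"{name}_l0").copy_(getattr(bi, f"{name}_l0"))
        getattr(bwd, f"{name}_l0").copy_(getattr(bi, f"{name}_l0_reverse"))

x = torch.randn(S, B, I)  # (seq, batch, feature), so time is dim 0
with torch.no_grad():
    out_ref, hn_ref = bi(x)
    out_f, h_f = fwd(x)
    out_b, h_b = bwd(x.flip(0))  # read the sequence from end to start
    out_b = out_b.flip(0)        # re-align so index t means the same time step
    out_manual = torch.cat((out_f, out_b), dim=-1)  # (S, B, 2H)
    hn_manual = torch.cat((h_f, h_b), dim=0)        # (2, B, H)

outputs_match = torch.allclose(out_ref, out_manual, atol=1e-6)
hiddens_match = torch.allclose(hn_ref, hn_manual, atol=1e-6)
print(outputs_match, hiddens_match)
```

Without the second `flip`, the backward features at index t would describe time step S-1-t rather than t.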

In [80]:
import torch
import torch.nn as nn

class BidirectionalStackedRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, nonlinearity='tanh', batch_first=False):
        super(BidirectionalStackedRNN, self).__init__()
        # Bidirectional Stacked RNN parameters
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.nonlinearity = nonlinearity
        self.batch_first = batch_first

        # Initialize layers of StackedRNN for forward and backward directions
        self.forward_layers = StackedRNN(input_size, hidden_size, num_layers,nonlinearity=nonlinearity, batch_first=batch_first)
        self.backward_layers = StackedRNN(input_size, hidden_size, num_layers,nonlinearity=nonlinearity, batch_first=batch_first)

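    def forward(self, input, hidden=None):
        # The time axis is dim 1 only for batched, batch-first input; otherwise it is dim 0
        time_dim = 1 if (self.batch_first and input.dim() == 3) else 0

        # Split an optional initial hidden state into forward and backward halves.
        # Layout here is [all forward layers, all backward layers]; nn.RNN interleaves them per layer.
        h_fwd, h_bwd = (None, None) if hidden is None else hidden.chunk(2, dim=0)

        # Forward pass through forward layers
        forward_outputs, forward_hiddens = self.forward_layers(input, h_fwd)

        # Backward direction: reverse the sequence along the time axis
        backward_outputs, backward_hiddens = self.backward_layers(input.flip(time_dim), h_bwd)

        # Re-align backward outputs so that index t refers to the same time step in both directions
        backward_outputs = backward_outputs.flip(time_dim)

        # Concatenate the outputs from both directions along the feature axis
        output = torch.cat((forward_outputs, backward_outputs), dim=-1)

        # Concatenate the hidden states from both directions
        hiddens = torch.cat((forward_hiddens, backward_hiddens), dim=0)

        return output, hiddens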


In [94]:
num_layers = 7
nonlinearity = 'tanh'
batch_first = False


# model rnn
bi_stack_rnn = BidirectionalStackedRNN(
    input_size, 
    hidden_size, 
    num_layers,
    nonlinearity=nonlinearity, 
    batch_first=batch_first
    )

In [76]:
# Unbatched input
input = torch.randn(sequence_length, input_size)

# run model
output, hn = bi_stack_rnn(input)

# print output and hidden state shape
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

output shape:  torch.Size([17, 58])
hidden state shape:  torch.Size([14, 29])


In [79]:
# Batched input
if batch_first:
    input = torch.randn(batch_size, sequence_length, input_size)
else:
    input = torch.randn(sequence_length, batch_size, input_size)

# run model
output, hn = bi_stack_rnn(input)

# print output and hidden state shape
print('Batch first: ', batch_first)
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

Batch first:  False
output shape:  torch.Size([17, 32, 58])
hidden state shape:  torch.Size([14, 32, 29])


## Fourth Approximation

Only dropout remains to be added.

Dropout is a regularization technique used to prevent overfitting in neural networks, including multi-layer Recurrent Neural Networks (RNNs). In the context of multi-layer RNNs, dropout can be applied in various ways to reduce the likelihood of co-adaptation of neurons and to improve the generalization capability of the model.

One common approach is to apply dropout between the RNN layers in a stacked RNN. This means dropout is applied to the output of each RNN layer before it is used as input to the next layer. It's important to note that dropout is typically not applied to the recurrent connections (the connections from the hidden state of a layer at one time step to the hidden state of the same layer at the next time step) but to the connections between layers.

Dropout can also be applied to the input layer (before the first RNN layer) and/or the output layer (after the last RNN layer), depending on the specific architecture and the problem at hand.

A variant of dropout, called variational dropout, applies the same dropout mask at each time step, instead of varying it across time steps. This approach is often used in RNNs to maintain consistency in dropout application across the temporal connections.
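The difference between standard and variational dropout can be illustrated on a tensor of ones. This is a minimal sketch under assumed sizes and an assumed drop probability of 0.5; the variational part is a hand-written illustration of the idea (one mask shared across time), not a PyTorch built-in.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
S, B, H, p = 5, 2, 8, 0.5  # illustrative sizes and drop probability (assumed)
x = torch.ones(S, B, H)

drop = nn.Dropout(p)

# Training mode: elements are zeroed with probability p, survivors are scaled by 1 / (1 - p)
drop.train()
train_values = set(drop(x).unique().tolist())

# Evaluation mode: dropout is the identity
drop.eval()
eval_is_identity = torch.equal(drop(x), x)

# Variational-style dropout (sketch): sample one mask per sequence and reuse it at every time step
mask = torch.bernoulli(torch.full((1, B, H), 1 - p)) / (1 - p)
variational = x * mask  # the mask broadcasts over the time axis
same_mask_each_step = all(torch.equal(variational[0], variational[t]) for t in range(S))

print(train_values, eval_is_identity, same_mask_each_step)
```

Because dropout is only active in training mode, calling `model.eval()` before shape or equivalence checks keeps the results deterministic.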

In [95]:
class DropOutStackedRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, nonlinearity='tanh', batch_first=False, dropout=0.0):
        super(DropOutStackedRNN, self).__init__()
        # Stacked RNN parameters
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.nonlinearity = nonlinearity
        self.batch_first = batch_first
        self.dropout_p = dropout  # keep the probability; the nn.Dropout module itself is created below

        # Initialize layers of BasicRNN
        self.layers = nn.ModuleList([BasicRNN(input_size if i == 0 else hidden_size, 
                                              hidden_size, 
                                              nonlinearity=nonlinearity, 
                                              batch_first=batch_first) for i in range(num_layers)])
        
        # Initialize dropout layer
        self.dropout = nn.Dropout(dropout) if num_layers > 1 else nn.Identity()

    def forward(self, input, hidden=None):
        # list to store the final hidden state of each layer
        hiddens = []

        # loop through layers
        for i, layer in enumerate(self.layers):
            # initial hidden state of this layer as (batch, hidden), if one was provided
            h_0 = None if hidden is None else hidden[i].reshape(-1, self.hidden_size)

            # pass through layer
            output, h_n = layer(input, h_0)

            # Apply dropout to the output of each layer except the last one
            input = self.dropout(output) if i < self.num_layers - 1 else output

            # append hidden state
            hiddens.append(h_n)

        # stack hidden states: (num_layers, 1, ...) -> (num_layers, ...)
        hiddens = torch.stack(hiddens).squeeze(1)

        return output, hiddens

In [96]:
num_layers = 7
nonlinearity = 'tanh'
batch_first = False
dropout = 0.1

# model rnn
d_stack_rnn = DropOutStackedRNN(
    input_size, 
    hidden_size, 
    num_layers,
    nonlinearity=nonlinearity, 
    batch_first=batch_first,
    dropout=dropout
    )

In [97]:
# Unbatched input
input = torch.randn(sequence_length, input_size)

# run model
output, hn = d_stack_rnn(input)

# print output and hidden state shape
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

output shape:  torch.Size([17, 29])
hidden state shape:  torch.Size([7, 29])


In [98]:
# Batched input
if batch_first:
    input = torch.randn(batch_size, sequence_length, input_size)
else:
    input = torch.randn(sequence_length, batch_size, input_size)

# run model
output, hn = d_stack_rnn(input)

# print output and hidden state shape
print('Batch first: ', batch_first)
print("output shape: ", output.shape)
print("hidden state shape: ", hn.shape)

Batch first:  False
output shape:  torch.Size([17, 32, 29])
hidden state shape:  torch.Size([7, 32, 29])


# Conclusion

In this notebook we went through the main behaviors of the RNN cell, using PyTorch's built-in module as a guide during the implementation. We started from a basic RNN cell, highlighting the equation and the choreography of tensor sizes in order to understand the recurrent process thoroughly. We then added the remaining options of RNN models: multiple layers, bidirectionality, and dropout.

Throughout, the focus was on understanding the mechanism and on the method used to code it.