# <p style='text-align: center;'> Recurrent Neural Network (RNN) </p>

## Why not Feedforward Networks?
- A trained feedforward network can be exposed to any random collection of photographs, and the first photograph it is exposed to will not necessarily alter how it classifies the second.

![image.png](attachment:image.png)


- As the figure above shows, seeing a photograph of a dog does not make the network any more likely to perceive an elephant next; each photograph is classified independently.


- Reading is different: when you read a book, you understand each word based on the words that came before it.

![image-2.png](attachment:image-2.png)


- A feedforward network has no memory of earlier words, so it cannot predict the next word in a sentence.

## Why Recurrent Neural Networks?
RNNs were created to address a few issues with feed-forward neural networks:

- Cannot handle sequential data.


- Considers only the current input.


- Cannot memorize previous inputs.


The solution to these issues is the RNN. An RNN can handle sequential data because it takes into account both the current input and previously received inputs. RNNs can memorize previous inputs due to their internal memory.

## Feed-Forward Neural Networks vs Recurrent Neural Networks
- A feed-forward neural network allows information to flow only in the forward direction, from the input nodes, through the hidden layers, and to the output nodes. There are no cycles or loops in the network. 


- Below is a simplified representation of a feed-forward neural network:

![image.png](attachment:image.png)


- In a feed-forward neural network, the decisions are based only on the current input. It doesn’t memorize past data, so it cannot use earlier inputs to anticipate later ones. Feed-forward neural networks are used in general regression and classification problems.

## What Is a Recurrent Neural Network (RNN)?
- **Recurrent Networks** are a type of artificial neural network (ANN) designed to recognize patterns in sequences of data, such as text, genomes, handwriting, the spoken word, or numerical time series data emanating from sensors, stock markets, and government agencies.


- In other words, an **RNN** works on the principle of saving the output of a particular layer and feeding it back to the input in order to predict the layer's next output.


- Standard feedforward artificial neural networks (ANNs) take inputs and produce outputs, whereas RNNs also learn from previous outputs to provide better results the next time. Apple's Siri and Google's voice search are well-known applications of RNNs in machine learning.


- In a standard ANN, the inputs and outputs are independent of one another. The output of an RNN, however, depends on the previous elements in the sequence.


- Each neuron in a feed-forward network or multi-layer perceptron executes its function with inputs and feeds the result to the next node.


- As the name implies, recurrent neural networks have a recurrent connection in which the output is transmitted back to the RNN neuron rather than only passing it to the next node.


<b> Below is how you can convert a Feed-Forward Neural Network into a Recurrent Neural Network: </b>
 
![image.png](attachment:image.png)
    
    
- The nodes in the different layers of the neural network are compressed to form a single recurrent layer. A, B, and C are the parameters of the network.
    
![image-2.png](attachment:image-2.png)    
    
    
- Here, “x” is the input layer, “h” is the hidden layer, and “y” is the output layer. A, B, and C are the network parameters used to improve the output of the model. At any given time t, the hidden layer combines the current input x(t) with information carried over from the previous input x(t-1). The output at any given time is fed back into the network to improve the output.
    
![image-3.png](attachment:image-3.png)    
    
    
- Now that we understand what a recurrent neural network is, let’s look at how it works.

## How Do Recurrent Neural Networks Work?
- In Recurrent Neural networks, the information cycles through a loop to the middle hidden layer.

![image.png](attachment:image.png)


- The input layer ‘x’ takes in the input to the neural network, processes it, and passes it on to the middle layer.


- The middle layer ‘h’ can consist of multiple hidden layers, each with its own activation functions, weights, and biases. If the parameters of these hidden layers are not affected by the previous layer, i.e., the neural network does not have memory, then you can use a recurrent neural network to provide that memory.


- The Recurrent Neural Network will standardize the different activation functions and weights and biases so that each hidden layer has the same parameters. Then, instead of creating multiple hidden layers, it will create one and loop over it as many times as required.
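
The idea of a single hidden layer that is reused at every step can be sketched in NumPy. This is a minimal illustration rather than a definitive implementation: the weight names (`W_xh`, `W_hh`, `W_hy`), the tanh activation, and the layer sizes are all assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Apply the same hidden layer to every element of the sequence xs."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state (the "memory")
    hidden_states, outputs = [], []
    for x in xs:                         # one loop iteration per time step
        # The same weights and biases are reused at every step.
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hidden_states.append(h)
        outputs.append(W_hy @ h + b_y)
    return np.array(hidden_states), np.array(outputs)

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
input_size, hidden_size, output_size, seq_len = 3, 4, 2, 5
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

xs = rng.normal(size=(seq_len, input_size))
hs, ys = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)
print(hs.shape, ys.shape)   # one hidden state and one output per time step
```

Because the loop reuses one set of parameters, the same function accepts a sequence of any length without any change to the weights.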

### Example:
<b> Suppose my gym trainer has made a schedule for me. The exercises are repeated every third day. </b>
    
- Below is the figure for Feedforward Network:
    
![image.png](attachment:image.png)
    
    
- Below are the figures for Recurrent Neural Network:
    
![image-2.png](attachment:image-2.png)
    
![image-3.png](attachment:image-3.png)
    
![image-4.png](attachment:image-4.png)
    
![image-5.png](attachment:image-5.png)
    
![image-6.png](attachment:image-6.png)    
    
 
- The figures above illustrate how an RNN works.

## Difference between CNN and RNN
Convolutional neural networks (CNNs) are closely related to feedforward networks and are used to recognize images and patterns.


Here’s a brief comparison of RNNs and CNNs.

- CNNs are used to analyze image data, whereas RNNs process sequence data.


- In a CNN, the input length is fixed. An RNN can handle inputs of variable length.


- CNNs are generally regarded as the more powerful feature extractors; compared to CNNs, RNNs offer fewer feature-extraction capabilities.


- CNN has no repetitive/recurrent connections. RNN uses recurrent connections to generate output.

## Applications of Recurrent Neural Networks
<b> 1. Image Captioning: </b>

- RNNs are used to caption an image by analyzing the activities present in it.


<b> 2. Time Series Prediction: </b>

- Time series problems, like predicting the prices of stocks in a particular month, can be approached using an RNN.


<b> 3. Natural Language Processing: </b>

- Text mining and sentiment analysis can be carried out using an RNN for Natural Language Processing (NLP).


<b> 4. Machine Translation: </b>

- Given an input in one language, RNNs can be used to translate the input into different languages as output.
    

## Types of RNN
<b> RNNs are categorized into four types based on their input and output sequences: </b>

- 1. One to One Network
    
    
- 2. One to Many Network
    
    
- 3. Many to One Network
    
    
- 4. Many to Many Network
    
    
    
<b> 1. One to One Model: </b>

- The one-to-one RNN is the typical arrangement in neural networks, with only one input and one output. Application – Image classification.


<b> 2. One to Many Model: </b>

- A One to Many network has a single input fed into the node, producing multiple outputs. Application – Music generation, image captioning, etc.


<b> 3. Many to One Model: </b>

- The Many to One architecture is used when several inputs are needed to generate a single output. Application – Sentiment analysis, rating model, etc.


<b> 4. Many to Many Model: </b>

- Many to Many RNN models, as the name implies, take multiple inputs and produce multiple outputs. This model is also used where the input and output sizes are different. Application – Machine translation.
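
The difference between these types comes down to how many inputs are fed in and how many outputs are kept. The sketch below illustrates three of them with one shared recurrent step (the one-to-one case is an ordinary feed-forward pass and is omitted). The helper names `step`, `many_to_many`, `many_to_one`, and `one_to_many`, the random weights, and the choice to feed each output back in as the next input are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden_size, feat = 4, 3                       # illustrative sizes
W_xh = rng.normal(scale=0.5, size=(hidden_size, feat))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.5, size=(feat, hidden_size))  # output size equals input size here

def step(x, h):
    """One recurrent step: returns the new hidden state and the output."""
    h = np.tanh(W_xh @ x + W_hh @ h)
    return h, W_hy @ h

def many_to_many(xs):
    h, ys = np.zeros(hidden_size), []
    for x in xs:
        h, y = step(x, h)
        ys.append(y)
    return np.array(ys)              # one output per input

def many_to_one(xs):
    return many_to_many(xs)[-1]      # keep only the final output

def one_to_many(x, n_outputs):
    h, ys = np.zeros(hidden_size), []
    for _ in range(n_outputs):
        h, x = step(x, h)            # feed each output back in as the next input
        ys.append(x)
    return np.array(ys)

xs = rng.normal(size=(6, feat))
print(many_to_many(xs).shape)        # six inputs, six outputs
print(many_to_one(xs).shape)         # six inputs, one output
print(one_to_many(xs[0], 4).shape)   # one input, four outputs
```

This sketch only covers the case where a many-to-many model emits one output per input; translation-style models whose input and output sizes differ need additional machinery that is not shown here.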

## The Problems with Recurrent Neural Networks (RNN)
<b> RNNs have two main issues: </b>
    
- 1. Gradient Vanishing Problem
    
    
- 2. Exploding Gradient Problem
    
    
<b> Before looking at these two issues, let's first understand Backpropagation through time (BPTT): </b>

### Understanding Backpropagation through time (BPTT):
- An RNN uses a technique called Backpropagation through time to propagate the error back through the network and adjust its weights so that the error is reduced. It is called “through time” because an RNN deals with sequential data, and every step backwards is like going back in time towards the past. Here is how BPTT works:

![image.png](attachment:image.png)


- In the BPTT step, we calculate the partial derivative of the error with respect to each weight in the network. So at time t = 3, we consider the derivative of E3 with respect to s3. x3 is also connected to s3, so its derivative is considered as well. s3 in turn depends on the value of s2, so the derivative of s3 with respect to s2 is also considered. This is the chain rule at work: we accumulate every dependency with its derivative and use the result for the error calculation.


- For E3, the gradient that flows through s3 is given by:


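$$\frac{\partial E_3}{\partial W_s} = \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial \bar{s}_3}\,\frac{\partial \bar{s}_3}{\partial W_s}$$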
    
    
- Now we also have s2 associated with s3 so,


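$$\frac{\partial E_3}{\partial W_s} = \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial \bar{s}_3}\,\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\,\frac{\partial \bar{s}_2}{\partial W_s}$$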
    
    
- And s1 is also associated with s2, so s1, s2, and s3 all have an effect on E3:


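$$\frac{\partial E_3}{\partial W_s} = \frac{\partial E_3}{\partial \hat{y}_3}\,\frac{\partial \hat{y}_3}{\partial \bar{s}_3}\,\frac{\partial \bar{s}_3}{\partial \bar{s}_2}\,\frac{\partial \bar{s}_2}{\partial \bar{s}_1}\,\frac{\partial \bar{s}_1}{\partial W_s}$$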
    
    
- Accumulating everything, we get the following equation for the total contribution of Ws to the error at time t = 3:


    δE3   δE3  δŷ3   δs̅3     δE3  δŷ3   δs̅3  δs̅2      δE3  δŷ3   δs̅3  δs̅2  δs̅1
    --- = --- ----- ----  +  --- ----- ---- -----  +  --- ----- ---- ----- ----
    δWs   δŷ3  δs̅3   δWs     δŷ3  δs̅3   δs̅2  δWs      δŷ3  δs̅3   δs̅2  δs̅1  δWs
    
    
- The general equation used to adjust Ws in BPTT can be written as:


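$$\frac{\partial E_N}{\partial W_s} = \sum_{i=1}^{N} \frac{\partial E_N}{\partial \hat{y}_N}\,\frac{\partial \hat{y}_N}{\partial \bar{s}_i}\,\frac{\partial \bar{s}_i}{\partial W_s}$$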
    
    
- Wx is also part of the network, so following the same steps we can write in general:


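$$\frac{\partial E_N}{\partial W_x} = \sum_{i=1}^{N} \frac{\partial E_N}{\partial \hat{y}_N}\,\frac{\partial \hat{y}_N}{\partial \bar{s}_i}\,\frac{\partial \bar{s}_i}{\partial W_x}$$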
    
    
- That is how BPTT works, and it is essentially how an RNN adjusts its weights and reduces the error. The main weakness is that the example above is only a small network with 4 layers. With hundreds of layers, at a time step such as t = 100, we would have to multiply together all the partial derivatives along the chain. Such a long product can shrink the overall value to something so small that it is useless for correcting the error. This issue is called the **Vanishing Gradient Problem**.
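
The accumulation of terms for t = 3 can be checked numerically on a tiny scalar RNN. The sketch below is only an illustration under stated assumptions: the state update `s_t = tanh(w_s * s_{t-1} + w_x * x_t)`, the output `y = w_y * s_N`, the squared-error loss, the nonzero initial state, and all the numeric values are hypothetical choices, not taken from the figure.

```python
import math

def forward(w_s, w_x, xs, s0=0.5):
    """Return the states s_0..s_N of a scalar RNN (s0 is an arbitrary nonzero start)."""
    states = [s0]
    for x in xs:
        states.append(math.tanh(w_s * states[-1] + w_x * x))
    return states

def error(w_s, w_x, w_y, xs, target):
    """Squared error E_N on the final output y_N = w_y * s_N."""
    y = w_y * forward(w_s, w_x, xs)[-1]
    return 0.5 * (y - target) ** 2

def bptt_terms(w_s, w_x, w_y, xs, target):
    """One term per time step; their sum is dE_N/dW_s."""
    states = forward(w_s, w_x, xs)
    n = len(xs)
    carry = (w_y * states[n] - target) * w_y    # dE_N/dy_N * dy_N/ds_N
    terms = []
    for i in range(n, 0, -1):                   # walk back in time: i = N, ..., 1
        local = 1.0 - states[i] ** 2            # derivative of tanh at step i
        terms.append(carry * local * states[i - 1])  # direct effect of W_s on s_i
        carry *= local * w_s                    # chain through ds_i/ds_{i-1}
    return terms

xs = [0.5, -0.3, 0.8]                           # three time steps, so N = 3
terms = bptt_terms(0.7, 0.4, 1.2, xs, target=1.0)

# Compare against a central finite-difference estimate of the same gradient.
eps = 1e-6
numeric = (error(0.7 + eps, 0.4, 1.2, xs, 1.0)
           - error(0.7 - eps, 0.4, 1.2, xs, 1.0)) / (2 * eps)
print(terms, sum(terms), numeric)
```

Each entry of `terms` corresponds to one of the three summands above, and every step further back in time multiplies in one more factor, which is where the long products come from.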

### 1. Gradient Vanishing Problem:
- RNNs suffer from the problem of vanishing gradients. The gradients carry the information used to update the RNN's parameters, and when the gradient becomes too small, the parameter updates become insignificant. This makes learning from long data sequences difficult.


- To predict an output, an RNN commonly uses a sigmoid activation function so that we get a probability for a particular class. As we saw in the section above, computing the gradient for E3 involves a long chain of dependencies. The derivative of the sigmoid never exceeds 0.25, so when we multiply many such derivatives together according to the chain rule, the product shrinks towards zero and we can't use it for error correction.

![image.png](attachment:image.png)


- As a result, the weights and biases won’t get updated properly, and the problem gets worse as the number of layers increases: the model stops learning effectively, which leads to inaccuracy in the entire network.


- Some ways to address this problem are to initialize the weight matrix properly or to use an activation such as ReLU instead of the sigmoid or tanh functions.


- Below is the Gradient Vanishing equation:


![image-2.png](attachment:image-2.png)
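
A quick calculation shows how fast such a product shrinks. The sketch below takes the best case for the sigmoid, where every factor equals its maximum derivative of 0.25; the step counts are arbitrary illustrative values, and a real network would also multiply in weight terms that are ignored here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s), largest at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

best_case = sigmoid_grad(0.0)            # the maximum possible derivative, 0.25
for steps in (5, 20, 100):               # illustrative chain lengths
    product = best_case ** steps         # product of `steps` best-case derivatives
    print(f"{steps:>3} steps -> {product:.3e}")
```

Even in this best case, a chain of 100 factors is on the order of 1e-61, far too small to produce a meaningful weight update.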



### 2. Exploding Gradient Problem:
- While training a neural network, if the slope tends to grow exponentially instead of decaying, this is called an **Exploding Gradient**. This problem arises when large error gradients accumulate, resulting in very large updates to the neural network model weights during the training process.


- Long training time, poor performance, and bad accuracy are the major consequences of gradient problems.


- Exploding gradients occur when the gradient value becomes very large, which often happens when the weights are initialized with large values, and the computation can end up with NaN. A model suffering from this issue cannot update its weights properly. Fortunately, gradient clipping can be used to deal with it: we clip the gradient at a pre-defined threshold value. This prevents the gradient from going beyond the threshold, so we do not end up with huge numbers or NaN.


- Below is the Exploding Gradient equation:

![image-3.png](attachment:image-3.png)
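
Gradient clipping can be sketched in a few lines. The text only says that the gradient is clipped at a pre-defined threshold, so the norm-based rescaling below is an assumption: it is one common variant (clipping each value individually is another), and the function name `clip_by_norm` and the numbers are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)   # same direction, smaller magnitude
    return grad                            # small gradients pass through unchanged

exploding = np.array([300.0, -400.0])      # illustrative gradient with norm 500
clipped = clip_by_norm(exploding, threshold=5.0)
print(clipped, np.linalg.norm(clipped))
```

Rescaling by the norm keeps the direction of the update and only limits its size, which is why it is often preferred over clipping each component separately.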

### How to overcome These Challenges?

![image.png](attachment:image.png)

### Gradient Problem Solutions:

![image.png](attachment:image.png)


Now, let’s discuss the most popular and efficient way to deal with gradient problems, i.e., **Long Short-Term Memory Networks (LSTMs)**.


First, let’s understand Long-Term Dependencies.


Suppose you want to predict the last word in the text: “The clouds are in the ______.”


The most obvious answer to this is the “sky.” We do not need any further context to predict the last word in the above sentence.


Consider this sentence: “I have been staying in Spain for the last 10 years…I can speak fluent ______.”


The word you predict will depend on the previous few words in context. Here, you need the context of Spain to predict the last word in the text, and the most suitable answer to this sentence is “Spanish.” The gap between the relevant information and the point where it's needed can become very large. LSTMs help you solve this problem.

## Long Short-Term Memory Networks (LSTM)
One way to address the **vanishing gradient** and **long-term dependency** problems in RNNs is to use **LSTM** networks. An LSTM introduces three gates: the **input, output, and forget gates**. The forget gate controls which information is dropped as it passes through the network. In this way, the network can have both short-term and long-term memory: information can be carried through the network and retrieved at a much later stage to identify the context for a prediction.


**LSTMs** are a special kind of RNN, capable of **learning long-term dependencies**. Remembering information for long periods is their default behavior.


All RNNs have the form of a chain of repeating neural network modules. In standard RNNs, this repeating module has a very simple structure, such as a single tanh layer.

![image.png](attachment:image.png)


LSTMs also have a chain-like structure, but the repeating module has a different structure. Instead of a single neural network layer, there are four layers that interact in a special way.

![image-2.png](attachment:image-2.png)


### Workings of LSTMs in RNN

![image.png](attachment:image.png)


LSTMs work in a 4-step process.

<b> Step 1: Decide How Much Past Data It Should Remember </b>

The first step in the LSTM is to decide which information should be omitted from the cell in that particular time step. The sigmoid function determines this. It looks at the previous state h(t-1) along with the current input x(t) and computes the function.
    
![image-2.png](attachment:image-2.png)
    
    
![image-3.png](attachment:image-3.png) 
    
    
Where,

   - Wf: Weight

   - h(t-1): Output from the previous time step

   - xt: New input

   - bf: Bias

   - ft: Forget gate, which decides which unimportant information from the previous time step to delete.
    
 
Example - Consider the following two sentences:

- Let the output of h(t-1) be “Alice is good at Physics. John, on the other hand, is good at Chemistry.”

    
- Let the current input at x(t) be “John plays football well. He told me yesterday over the phone that he had served as the captain of his college football team.”

    
- The forget gate realizes there might be a change in context after encountering the first full stop. It compares with the current input sentence at x(t). The next sentence talks about John, so the information on Alice is deleted. The position of the subject is vacated and assigned to John.    
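
Numerically, the forget gate is just a sigmoid applied to a weighted combination of h(t-1) and x(t). The sketch below assumes the standard form f(t) = sigmoid(Wf · [h(t-1), x(t)] + bf), built from the symbols defined above; the vector sizes and random values are illustrative assumptions and do not model the sentences in the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f): one value in (0, 1) per cell-state entry."""
    concat = np.concatenate([h_prev, x_t])   # stack previous output and new input
    return sigmoid(W_f @ concat + b_f)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3               # illustrative sizes
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

f_t = forget_gate(h_prev, x_t, W_f, b_f)
print(f_t)   # values near 0 mean "forget", values near 1 mean "keep"
```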

<b> Step 2: Decide How Much This Unit Adds to the Current State </b>

The second step has two parts: a sigmoid function and a tanh function. The sigmoid function decides which values to let through (values between 0 and 1). The tanh function assigns a weight to the values that are passed, deciding their level of importance (-1 to 1).
    
![image.png](attachment:image.png)
    
![image-2.png](attachment:image-2.png)    


i(t): The input gate determines which information to let through based on its significance in the current time step.
    
    
With the current input at x(t), the input gate analyzes the important information — John plays football, and the fact that he was the captain of his college team is important.

    
“He told me yesterday over the phone” is less important; hence it's forgotten. This process of adding some new information can be done via the input gate.    
    
    
In the next step, we'll combine these two to update the state.
    
    

<b> Step 3: Combine the Previous Two Results to Update the State </b>

Now, we update the old cell state, C(t-1), into the new cell state, C(t). First, we multiply the old state C(t-1) by f(t), forgetting the things we decided to forget earlier. Then, we add i(t) * c̅(t). These are the new candidate values, scaled by how much we decided to update each state value.
    
![image.png](attachment:image.png)
    
![image-2.png](attachment:image-2.png)
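
A small worked example makes the update concrete. All numbers below are made up for illustration; in particular, gate values of exactly 0 and 1 are idealized, since a sigmoid only approaches them.

```python
import numpy as np

c_prev  = np.array([2.0, -1.0, 0.5])   # old cell state C(t-1)
f_t     = np.array([1.0,  0.0, 0.5])   # forget gate: keep, drop, keep half
i_t     = np.array([0.0,  1.0, 0.5])   # input gate: ignore, accept, accept half
c_tilde = np.array([0.9, -0.8, 0.4])   # candidate values from tanh, in (-1, 1)

# New cell state: forget part of the old state, then add the scaled candidates.
c_t = f_t * c_prev + i_t * c_tilde
print(c_t)   # elementwise result: 2.0, -0.8, 0.45
```

The first entry is carried over untouched, the second is replaced entirely by its candidate, and the third is a blend of old and new.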

<b> Step 4: Decide What Part of the Current Cell State Makes It to the Output </b>

The final step is to decide what the output will be. First, we run a sigmoid layer, which decides what parts of the cell state make it to the output. Then, we put the cell state through tanh to push the values to be between -1 and 1 and multiply it by the output of the sigmoid gate.
    
![image.png](attachment:image.png)
    
![image-3.png](attachment:image-3.png)   
    
    
Ot: Output gate allows the passed in information to impact the output in the current time step.
    

Let’s consider this example to predict the next word in the sentence: “John played tremendously well against the opponent and won for his team. For his contributions, brave ____ was awarded player of the match.”

    
There could be many choices for the empty space. The current input, “brave,” is an adjective, and adjectives describe a noun. So, “John” could be the best output after “brave.”
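
Putting the four steps together, one LSTM time step can be sketched as follows. This is a minimal NumPy illustration assuming the standard gate equations; the parameter names (`W_f`, `W_i`, `W_c`, `W_o` and the matching biases), the helper `init_params`, and the sizes are hypothetical, and the random inputs stand in for real word representations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = np.concatenate([h_prev, x_t])                      # [h(t-1), x(t)]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # step 1: forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # step 2: input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # step 2: candidate values
    c_t = f_t * c_prev + i_t * c_tilde                     # step 3: update cell state
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # step 4: output gate
    h_t = o_t * np.tanh(c_t)                               # step 4: new output
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases for the four layers (illustrative)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fico":
        params[f"W_{gate}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + input_size))
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params

params = init_params(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
for x_t in np.random.default_rng(1).normal(size=(5, 3)):   # a 5-step input sequence
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

The cell state `c` is only ever scaled by the forget gate and added to, which is what lets information survive across many time steps.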
    