# Recurrent Neural Networks

So far we have looked at problems such as image recognition and house price prediction. In all of these problems, predicting the output requires looking at only a single data point.

For example, in image recognition the model looks at a single image and estimates the probability of each class.

However, fully connected neural networks and CNNs cannot naturally handle **Sequential Data**, where the data arrives in order and past data influences the output of the model.

## Examples of Sequential Data

### 1. Stock Market Prediction

<img src="images/stock_market.png" alt="stock market">

### 2. Sentiment Analysis

<img src="images/sentiment_analysis.jpg" alt="sentiment">

### 3. Machine Translation

<img src="images/machine_translation.png" alt="machine translation">

### 4. Speech Recognition

<img src="images/speech_recognition.jpeg" alt="speech recognition">

## What is the Problem with Standard Neural Networks for Sequential Data?

There are two problems with standard neural networks:
1. Inputs and outputs can have different lengths in different examples.
    * The input sequence can be **"They don't know that we know they know we know."** or simply **"I don't know"**.
    * For standard NNs this can be worked around by padding every sequence to the maximum length, but it's not a good solution.
2. Fully connected neural networks do not share learned features across different positions of the text/sequence.
    * Sharing features, as CNNs do, can significantly reduce the number of parameters in your model. That's what we will do in RNNs.
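
To make the padding workaround from point 1 concrete, here is a minimal sketch in plain Python. The `pad_sequences` helper and the `<pad>` token are hypothetical names chosen for illustration, and splitting on whitespace is a simplifying assumption rather than a real tokenizer.

```python
def pad_sequences(sequences, pad_token="<pad>"):
    """Pad every token list to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_token] * (max_len - len(seq)) for seq in sequences]


long_sentence = "They don't know that we know they know we know.".split()
short_sentence = "I don't know".split()

padded = pad_sequences([long_sentence, short_sentence])
wasted = padded[1].count("<pad>")  # positions that carry no information
print(len(padded[0]), len(padded[1]), wasted)
```

Here the short sentence ends up with more padding tokens than real words, which hints at why padding alone is not a satisfying answer to variable-length sequences.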

## What Should the New Architecture Have?

1. A memory mechanism to remember past inputs.
2. The ability to handle inputs and outputs of different lengths.
3. Feature sharing across the sequence.

## RNN

A Recurrent Neural Network remembers the past, and its decisions are influenced by what it has learned from the past.

RNNs can take one or more input vectors and produce one or more output vectors. The outputs are influenced not just by the weights applied to the inputs, as in a regular NN, but also by a “hidden” state vector that represents the context from prior inputs/outputs. So the same input can produce a different output depending on the previous inputs in the series.

<img src="images/rnn.png" alt="rnn">

Note that the weights and parameters are the same at every time step: they are shared across all positions of the sequence.

<img src="images/rnn_cell.png" alt="rnn cell">
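
The recurrence and the weight sharing described above can be sketched in NumPy. This is a minimal illustration that assumes the common formulation `a<t> = tanh(Waa a<t-1> + Wax x<t> + ba)` with a softmax output; the parameter names (`Waa`, `Wax`, `Wya`, `ba`, `by`) and the layer sizes are assumptions and may not match the notation in the figures.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def rnn_forward(xs, params, a0):
    """Run a vanilla RNN over a sequence, reusing the same weights at every step."""
    Waa, Wax, Wya, ba, by = params
    a = a0
    outputs = []
    for x in xs:  # one time step per input vector
        a = np.tanh(Waa @ a + Wax @ x + ba)    # new hidden state
        outputs.append(softmax(Wya @ a + by))  # prediction at this step
    return outputs, a


rng = np.random.default_rng(0)
n_x, n_a, n_y = 4, 5, 3  # input, hidden, and output sizes (arbitrary)
params = (
    rng.normal(scale=0.1, size=(n_a, n_a)),
    rng.normal(scale=0.1, size=(n_a, n_x)),
    rng.normal(scale=0.1, size=(n_y, n_a)),
    np.zeros(n_a),
    np.zeros(n_y),
)

# The same parameters handle sequences of different lengths.
short_seq = rng.normal(size=(3, n_x))
long_seq = rng.normal(size=(10, n_x))
ys_short, _ = rnn_forward(short_seq, params, np.zeros(n_a))
ys_long, _ = rnn_forward(long_seq, params, np.zeros(n_a))
```

Because the loop reuses one set of weights, the number of parameters does not grow with the sequence length.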

### Different Types of RNNs

<img src="images/types_of_rnns.jpg" alt="types of rnns">

### Vanishing Gradient Problem in RNNs

One of the problems with vanilla RNNs is that they run into the vanishing gradient problem.
Let's take an example. Suppose we are working on an NLP problem and there are two sequences that the model tries to learn:
1. "The cat, which already ate ............................................................, was full"
2. "The cats, which already ate ..........................................................., were full"

What we need to learn here is that "was" goes with "cat" and "were" goes with "cats". The vanilla RNN is not very good at capturing very long-term dependencies like this.

As discussed for deep neural networks, deeper networks run into the vanishing gradient problem. The same happens in RNNs with long sequences, because the gradient has to be propagated back through every time step.

<img src="images/vanishing_gradient_rnn.png" alt="vanishing gradient RNN">

The conclusion is that vanilla RNNs aren't good at capturing long-term dependencies.
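
A small numerical sketch can make this tangible. It uses a scalar toy RNN, `a_t = tanh(w * a_{t-1} + x_t)`, which is an assumption made purely for illustration; the function name `gradient_through_time` and the values of `w` and the inputs are hypothetical.

```python
import math


def gradient_through_time(w, inputs, a0=0.0):
    """Return d a_T / d a_0 for the scalar RNN a_t = tanh(w * a_{t-1} + x_t)."""
    a, grad = a0, 1.0
    for x in inputs:
        a = math.tanh(w * a + x)
        grad *= w * (1.0 - a * a)  # chain rule: tanh derivative times w
    return grad


short_grad = gradient_through_time(w=0.9, inputs=[0.1] * 5)
long_grad = gradient_through_time(w=0.9, inputs=[0.1] * 100)
print(short_grad, long_grad)
```

Each time step multiplies the gradient by a factor smaller than 1 in magnitude, so after 100 steps almost nothing of the signal from the first step is left. That is the scalar version of why "was" struggles to learn from a distant "cat".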

## Gated Recurrent Unit (GRU)

GRU is a type of RNN that helps with the vanishing gradient problem and can remember long-term dependencies.

Each GRU unit has a new variable C, the memory cell. It lets the unit decide whether to memorize something or not.

Equations of the GRUs:

<img src="images/gru_equation.png" alt="gru equation">

The update gate takes values between 0 and 1.
To understand GRUs, imagine that the update gate is either 0 or 1 most of the time.
We update the memory cell based on the update gate and the previous cell value.
Let's take the cat sentence example and apply it to understand these equations:

Sentence: "The cat, which already ate ........................, was full"

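We will assume that U is either 0 or 1, acting as a bit that tells us whether the memory cell should be updated at a given word, for example to memorize that the subject is singular.

Splitting the sentence into words, we get the values of C and U at each position: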

<img src="images/gru_table.png" alt="gru table">
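
The update rule described above can be sketched in NumPy. This is a simplified GRU with only an update gate, written under the assumption that the memory cell is a blend `C = U * candidate + (1 - U) * C_prev`; the parameter names (`Wc`, `Wu`, `bc`, `bu`) are illustrative and may differ from the equations in the figure, and the full GRU adds a second gate that is omitted here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(c_prev, x, Wc, bc, Wu, bu):
    """One step of a simplified GRU (update gate only)."""
    concat = np.concatenate([c_prev, x])
    c_candidate = np.tanh(Wc @ concat + bc)   # proposed new memory
    u = sigmoid(Wu @ concat + bu)             # update gate in (0, 1)
    c = u * c_candidate + (1.0 - u) * c_prev  # blend new and old memory
    return c, u


n_c, n_x = 3, 2
rng = np.random.default_rng(0)
Wc = rng.normal(size=(n_c, n_c + n_x))
Wu = np.zeros((n_c, n_c + n_x))  # gate driven by its bias only, for the demo
c_prev = np.array([1.0, -1.0, 0.5])
x = rng.normal(size=n_x)

# A very negative gate bias pushes U towards 0: the old memory is kept.
c_keep, _ = gru_step(c_prev, x, Wc, np.zeros(n_c), Wu, np.full(n_c, -20.0))
# A very positive gate bias pushes U towards 1: the memory is overwritten.
c_write, _ = gru_step(c_prev, x, Wc, np.zeros(n_c), Wu, np.full(n_c, 20.0))
```

When U stays near 0 over many words, the memory cell is carried forward almost unchanged, which is how the information about "cat" can survive until "was".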

## Long Short Term Memory (LSTM)

LSTM is another type of RNN that lets you handle long-term dependencies. It's more powerful and general than GRU.

Here are the equations of an LSTM unit:

<img src="images/lstm_equation.png" alt="lstm equation">

<img src="images/lstm_cell.png" alt="lstm cell">
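
As a rough companion to the equations, here is a NumPy sketch of a single LSTM step. It assumes the widely used formulation with forget, update (input), and output gates; stacking the four parameter blocks into one matrix `W` and one vector `b` is a convenience chosen for this example, and the names may not match the figure.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(a_prev, c_prev, x, W, b):
    """One LSTM step. W and b hold the stacked parameters of the four blocks."""
    n_a = a_prev.shape[0]
    z = W @ np.concatenate([a_prev, x]) + b    # shape (4 * n_a,)
    f = sigmoid(z[0 * n_a:1 * n_a])            # forget gate
    u = sigmoid(z[1 * n_a:2 * n_a])            # update (input) gate
    o = sigmoid(z[2 * n_a:3 * n_a])            # output gate
    c_candidate = np.tanh(z[3 * n_a:4 * n_a])  # candidate memory
    c = f * c_prev + u * c_candidate           # new memory cell
    a = o * np.tanh(c)                         # new hidden state
    return a, c


rng = np.random.default_rng(0)
n_a, n_x = 4, 3
W = rng.normal(scale=0.1, size=(4 * n_a, n_a + n_x))
b = np.zeros(4 * n_a)

a, c = np.zeros(n_a), np.zeros(n_a)
for x in rng.normal(size=(6, n_x)):  # a sequence of 6 input vectors
    a, c = lstm_step(a, c, x, W, b)
```

Unlike the simplified GRU, the LSTM uses separate forget and update gates, so keeping old memory and writing new memory are controlled independently.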

Neither LSTM nor its variants is universally superior. One advantage of GRU is that it's simpler and can be used to build much bigger networks, but LSTM is more powerful and general.


## Bidirectional RNN

Here is an example of the named entity recognition task:

<img src="images/name_entity_recognition.png" alt="name entity recognition">

Whether "Teddy" is a name cannot be learned from the preceding words "He" and "said", but it can be learned from the following word "bears".

BiRNNs fix this issue.

Here is the BiRNN architecture:

<img src="images/bidirectional_rnn.png" alt="bidirectional rnn">

* Part of the forward propagation goes from left to right and part from right to left, so the network learns from both sides.
* To make the prediction `ŷ<t>`, we use the two activations that come from the left and from the right.
* The blocks here can be any RNN block, including basic RNNs, LSTMs, or GRUs.
* For a lot of NLP or text processing problems, a BiRNN with LSTM blocks is commonly used.
* The disadvantage of BiRNNs is that you need the entire sequence before you can process it. For example, in live speech recognition a BiRNN has to wait for the speaker to stop so that it has the entire sequence before making predictions.
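
The points above can be sketched in NumPy with two basic tanh RNNs, one run over the sequence and one run over its reverse. The helper names (`run_rnn`, `birnn_states`, `make_params`) and the sizes are hypothetical, and concatenating the two activations is one common way to combine them, assumed here for illustration.

```python
import numpy as np


def run_rnn(xs, Waa, Wax, ba):
    """Return the hidden state at every time step of a basic tanh RNN."""
    a = np.zeros(Waa.shape[0])
    states = []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x + ba)
        states.append(a)
    return states


def birnn_states(xs, fwd_params, bwd_params):
    """Concatenate left-to-right and right-to-left states for each position."""
    forward = run_rnn(xs, *fwd_params)
    backward = run_rnn(xs[::-1], *bwd_params)[::-1]  # re-align to input order
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]


rng = np.random.default_rng(0)
n_x, n_a, T = 3, 4, 5


def make_params():
    return (
        rng.normal(scale=0.5, size=(n_a, n_a)),
        rng.normal(scale=0.5, size=(n_a, n_x)),
        np.zeros(n_a),
    )


fwd_params, bwd_params = make_params(), make_params()
xs = rng.normal(size=(T, n_x))
states = birnn_states(xs, fwd_params, bwd_params)

# Change only the last input and recompute the representation of every position.
xs_changed = xs.copy()
xs_changed[-1] += 1.0
states_changed = birnn_states(xs_changed, fwd_params, bwd_params)
```

After changing only the last input, the forward half of the first position's state is unchanged while its backward half differs. This is the mechanism that lets "bears" inform the prediction for "Teddy", and it also shows why the whole sequence must be available first.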

## Deep RNNs

In a lot of cases a standard one-layer RNN will solve your problem. But in some problems it's useful to stack several RNN layers to make a deeper network.

For example, a deep RNN with 3 layers would look like this:

<img src="images/deep_rnn.png" alt="deep rnn">

* In fully connected deep nets, there could be 100 or even 200 layers. In deep RNNs, stacking 3 layers is already considered deep and expensive to train.
* In some cases you might see a fully connected network attached after the recurrent cells (like the head layer in CNNs).
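
Stacking can be sketched in NumPy by feeding the hidden states of one recurrent layer into the next as its input sequence. The three-layer setup mirrors the example above; the function names, the layer sizes, and the small linear head at the end are assumptions made for illustration.

```python
import numpy as np


def rnn_layer(xs, Waa, Wax, ba):
    """Run one recurrent layer and return its hidden state at every time step."""
    a = np.zeros(Waa.shape[0])
    states = []
    for x in xs:
        a = np.tanh(Waa @ a + Wax @ x + ba)
        states.append(a)
    return np.stack(states)


def deep_rnn(xs, layers):
    """Feed the state sequence of each layer into the next one."""
    for Waa, Wax, ba in layers:
        xs = rnn_layer(xs, Waa, Wax, ba)
    return xs


rng = np.random.default_rng(0)
n_x, n_a, T = 3, 4, 6
sizes = [n_x, n_a, n_a, n_a]  # input size followed by three hidden layers
layers = [
    (
        rng.normal(scale=0.1, size=(n_out, n_out)),
        rng.normal(scale=0.1, size=(n_out, n_in)),
        np.zeros(n_out),
    )
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

top_states = deep_rnn(rng.normal(size=(T, n_x)), layers)

# Optional fully connected head applied to the final top-layer state.
W_head = rng.normal(scale=0.1, size=(2, n_a))
logits = W_head @ top_states[-1]
```

Each extra layer repeats the full loop over time, which is one reason even a few stacked recurrent layers become expensive to train.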
