# ___Recurrent Neural Network (RNN)___

## ___What are Recurrent Neural Networks (RNNs)?___

_RNNs are a powerful and robust type of neural network and are among the most promising algorithms in use, because they have an internal memory._

_Like many other deep learning algorithms, recurrent neural networks are relatively old. They were initially created in the 1980s, but only in recent years have we seen their true potential. An increase in computational power, the massive amounts of data that we now have to work with, and the invention of long short-term memory (LSTM) in the 1990s have brought RNNs to the foreground._

_Because of their internal memory, RNNs can remember important things about the input they received, which allows them to be very precise in predicting what’s coming next. This is why they are a common choice for sequential data such as time series, speech, text, financial data, audio, video, weather and much more. Recurrent neural networks can form a much deeper understanding of a sequence and its context than algorithms that treat each input independently._

_The idea behind RNNs is to make use of sequential information. In a traditional neural network we assume that all inputs (and outputs) are independent of each other. But for many tasks that’s a very bad idea. If you want to predict the next word in a sentence, you had better know which words came before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output depending on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far. In theory RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps (more on this later)._

_In a recurrent neural network (RNN), the output from the previous step is fed as input to the current step. The RNN thus drops the assumption that inputs are independent, with the help of a hidden state that remembers some information about the sequence._

<img src='https://miro.medium.com/proxy/1*xLcQd_xeBWHeC6CeYSJ9bA.png'/>

_So, from the above figure, it is clear that an RNN can be viewed as a feed-forward neural network with an added loop. As the diagram shows, the output of any layer in an RNN depends not only on the current input but also on the inputs that have come before. This feature gives it a significant advantage over other neural networks, because it can use earlier inputs to predict outputs at a later stage._

_But when do you need to use an RNN?_

___“Whenever there is a sequence of data and that temporal dynamics that connects the data is more important than the spatial content of each individual frame.” – Lex Fridman (MIT)___

### ___How Does an RNN Work?___

<img src='https://miro.medium.com/max/700/1*SSe5iEoUvdKT-6HAGfVa-A.png'/>

<img src='https://hackernoon.com/hn-images/1*_mM83sFLjzKt8cRB439Y3Q.gif' width = 600/>

_Let’s take a simple task first: a character-level RNN and the word __“hello”__. We provide the first four letters, i.e. h, e, l, l, and ask the network to predict the last letter, i.e. “o”. The vocabulary of the task is just four letters: {h,e,l,o}. In real-world natural language processing, the vocabulary may include all the words in the Wikipedia database or all the words in a language. Here, for simplicity, we use a very small vocabulary._

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/05231650/rnn-neuron-196x300.png'/>

_Let’s see how the above structure can be used to predict the fifth letter in the word “hello”. In the above structure, the blue RNN block applies a recurrence formula to the input vector and its previous state. The letter “h” has nothing preceding it, so let’s take the letter “e”. When the letter “e” is supplied to the network, the recurrence formula is applied to the letter “e” and the previous state, which came from the letter “h”. These are the time steps of the input: if the input at time t is “e”, the input at time t-1 was “h”. The recurrence formula is applied to both e and h, and we get a new state._

_The formula for the current state can be written as:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06004252/hidden-state.png'/>

_Here, ht is the new state, ht-1 is the previous state and xt is the current input. We now work with the state derived from the previous input rather than the input itself, because the network has already applied its transformation to that input. Each successive input is called a time step._

_In this case we have four inputs to give to the network. In the recurrence formula, the same function and the same weights are applied at each time step._

_Taking the simplest form of a recurrent neural network, let’s say that the activation function is tanh, the weight at the recurrent neuron is Whh and the weight at the input neuron is Wxh, we can write the equation for the state at time t as:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06005300/eq2.png'/>

_The recurrent neuron in this case takes only the immediately previous state into consideration. For longer sequences the equation can involve multiple such states. Once the final state is calculated we can go on to produce the output._

_Now, once the current state is calculated we can calculate the output state as:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06005750/outeq.png'/>
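_For readers who cannot load the images, the three formulas above can be written out as follows. This is a sketch that assumes the images show the standard vanilla RNN equations; the name Why for the hidden-to-output weight is an assumption, and the worked example further down also adds a bias term inside the tanh._

$$h_t = f(h_{t-1}, x_t)$$

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$$

$$y_t = W_{hy} h_t$$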

_Let me summarize the steps in a recurrent neuron for you:_

* _A single time step of the input is supplied to the network i.e. xt is supplied to the network_
* _We then calculate its current state using a combination of the current input and the previous state i.e. we calculate ht_
* _The current ht becomes ht-1 for the next time step_
* _We can go as many time steps as the problem demands and combine the information from all the previous states_
* _Once all the time steps are completed the final current state is used to calculate the output yt_
* _The output is then compared to the actual output and the error is generated_
* _The error is then backpropagated through the network to update the weights (we shall go into the details of backpropagation in further sections), and the network is trained_
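
_The forward part of the steps above can be sketched in plain Python. This is a minimal illustration, not the exact network from the figures: the function names, the toy weights and the choice of two hidden units are all hypothetical, and `W_hh` is treated as a general hidden-by-hidden matrix._

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_forward(inputs, W_xh, W_hh, W_hy, b_h):
    """Run a vanilla RNN over a sequence; return every state and the final output."""
    h = [0.0] * len(W_hh)              # initial state: nothing seen yet
    states = []
    for x in inputs:                   # one time step per input vector
        pre = [a + b + c for a, b, c in
               zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]  # current state; it is h_{t-1} at the next step
        states.append(h)
    y = matvec(W_hy, h)                # output computed from the final state
    return states, y

# Hypothetical toy weights: 2 inputs, 2 hidden units, 1 output.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
W_hy = [[1.0, -1.0]]
b_h = [0.0, 0.0]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

states, y = rnn_forward(sequence, W_xh, W_hh, W_hy, b_h)
print(len(states), y)
```

_Note that the same `W_xh`, `W_hh` and `b_h` are reused at every time step; only the state `h` changes._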

#### ___Forward Propagation in a Recurrent Neuron___

_Let’s take a look at the inputs first:_

<img src= 'https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06010908/inputs.png'/>

_The inputs are one-hot encoded. Our entire vocabulary is {h,e,l,o}, so we can easily one-hot encode the inputs._
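
_As a small sketch, one-hot encoding the four input letters could look like this. The ordering of the vocabulary (h, e, l, o) and the helper name `one_hot` are assumptions made for illustration._

```python
vocab = ['h', 'e', 'l', 'o']   # assumed ordering of the vocabulary

def one_hot(char, vocab):
    """Return a list with 1 at the position of `char` and 0 everywhere else."""
    return [1 if c == char else 0 for c in vocab]

# Encode the four input letters of "hello" (the fifth is the target).
inputs = [one_hot(c, vocab) for c in "hell"]
for letter, vector in zip("hell", inputs):
    print(letter, vector)
```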

_Now the input neuron transforms the input to the hidden state using the weight Wxh. We have randomly initialized the weights as a 3*4 matrix:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06011846/wxh.png'/>

___Step 1:___

_Now for the letter “h”, we need Wxh*xt for the hidden state. By matrix multiplication, we get:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06122426/first-state-h.png'/>

___Step 2:___

_Now moving to the recurrent neuron, we have Whh as the weight which is a 1*1 matrix as <img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06013320/WHH.png'/> and the bias which is also a 1*1 matrix as <img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06013447/bias.png'/>_ 

_For the letter “h”, the previous state is [0,0,0] since there is no letter prior to it._

_So we calculate_ ___(Whh*ht-1 + bias):___

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/WHHT-1-1.png'/>

___Step 3:___

_Now we can get the current state as:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06014059/eq21.png'/>

_Since there is no previous hidden state for “h”, we apply the tanh function to this output and get the current state:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06130247/ht-h.png'/>

___Step 4:___

_Now we go on to the next time step. “e” is now supplied to the network. The computed ht now becomes ht-1, while the one-hot encoded “e” is xt. Let’s now calculate the current state ht._

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06005300/eq2.png'/>

___Whh*ht-1 + bias___ _will be:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06131259/new-ht-1.png'/>

___Wxh*xt___ _will be:_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06132150/state-e.png'/>

___Step 5:___

_Now calculating ht for the letter “e”,_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06132639/htletter-e.png'/>

_Now this would become ht-1 for the next state and the recurrent neuron would use this along with the new character to predict the next one._

___Step 6:___

_At each time step, the recurrent neural network produces an output as well. Let’s calculate yt for the letter “e”._

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06005750/outeq.png'/>

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06133208/ytfinal123.png'/>

___Step 7:___

_The probability of a particular letter from the vocabulary can be calculated by applying the softmax function, so we compute softmax(yt):_

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06133614/classwise-prob.png'/>

_If we interpret these probabilities as a prediction, the model says that the letter after “e” should be “h”, since “h” has the highest probability. Does this mean we have done something wrong? No, we have simply hardly trained the network. We have shown it just two letters, so it has not learnt much yet._
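
_The softmax step can be sketched as follows. The scores in `y_t` are hypothetical and are not the values from the figure; they are chosen only so that “h” comes out on top, as in the walkthrough._

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ['h', 'e', 'l', 'o']
y_t = [1.9, 1.1, 1.2, 0.7]                     # hypothetical raw outputs
probs = softmax(y_t)
predicted = vocab[probs.index(max(probs))]     # letter with the highest probability
print(probs, predicted)
```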

_The next big question is how back propagation works in a recurrent neural network. How are the weights updated when there is a feedback loop?_

#### ___Back Propagation in a Recurrent Neural Network (BPTT)___

_Imagining how the weights are updated in a recurrent neural network can be a bit of a challenge. To understand and visualize back propagation, let’s unroll the network across all the time steps. An RNN may or may not have outputs at each time step._

_In forward propagation, the inputs enter and move forward at each time step. In backward propagation, we are figuratively going back in time to change the weights, hence the name back propagation through time (BPTT)._

<img src='https://cdn.analyticsvidhya.com/wp-content/uploads/2017/12/06022525/bptt.png'/>

<img src='https://miro.medium.com/max/700/0*ENwCVS8XI8cjCy55.jpg'/>

_In an RNN, if yt is the predicted value and ȳt is the actual value, the error is calculated as a cross entropy loss:_

$$E_t(\bar{y}_t, y_t) = -\bar{y}_t \log(y_t)$$

$$E(\bar{y}, y) = -\sum_t \bar{y}_t \log(y_t)$$
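
_The two formulas above can be sketched in Python as follows, assuming the actual value is a one-hot vector and the predicted value is a list of probabilities such as a softmax output. The function names and the example numbers are hypothetical._

```python
import math

def cross_entropy(actual_one_hot, predicted_probs):
    """Error at one time step: minus the log-probability given to the true class."""
    return -sum(a * math.log(p)
                for a, p in zip(actual_one_hot, predicted_probs) if a > 0)

def sequence_loss(actuals, predictions):
    """Total error: the per-step errors summed over all time steps."""
    return sum(cross_entropy(a, p) for a, p in zip(actuals, predictions))

# Two hypothetical time steps over the vocabulary {h, e, l, o}.
actuals = [[0, 1, 0, 0], [0, 0, 1, 0]]
predictions = [[0.25, 0.5, 0.125, 0.125], [0.1, 0.1, 0.7, 0.1]]
print(sequence_loss(actuals, predictions))
```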

_We typically treat the full sequence (word) as one training example, so the total error is just the sum of the errors at each time step (character). The weights as we can see are the same at each time step. Let’s summarize the steps for backpropagation:_

* _The cross entropy error is first computed using the current output and the actual output_
* _Remember that the network is unrolled for all the time steps_
* _For the unrolled network, the gradient is calculated for each time step with respect to the weight parameter_
* _Since the weights are the same at all time steps, the gradients from all time steps can be combined_
* _The weights are then updated for both recurrent neuron and the dense layers_
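
_The steps above can be sketched for a deliberately tiny RNN. To keep the code short, this sketch assumes scalar weights (`w_x`, `w_h`, `w_y`), a single hidden unit, and a sigmoid output with binary cross entropy in place of the softmax over the vocabulary; all names and numbers are hypothetical._

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(xs, w_x, w_h, w_y):
    """Unrolled forward pass; hs[0] is the initial state, hs[t + 1] follows xs[t]."""
    hs, ps = [0.0], []
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x))
        ps.append(sigmoid(w_y * hs[-1]))
    return hs, ps

def loss(xs, ts, w_x, w_h, w_y):
    """Binary cross entropy summed over all time steps."""
    _, ps = forward(xs, w_x, w_h, w_y)
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(ts, ps))

def bptt(xs, ts, w_x, w_h, w_y):
    """Return (dE/dw_x, dE/dw_h, dE/dw_y), accumulated over all time steps."""
    hs, ps = forward(xs, w_x, w_h, w_y)
    d_wx = d_wh = d_wy = 0.0
    dh_next = 0.0                           # gradient arriving from time step t + 1
    for t in reversed(range(len(xs))):      # walk backwards through time
        dz = ps[t] - ts[t]                  # gradient at the output (sigmoid + cross entropy)
        d_wy += dz * hs[t + 1]
        dh = dz * w_y + dh_next             # from this step's output and from the future
        da = dh * (1.0 - hs[t + 1] ** 2)    # back through tanh
        d_wx += da * xs[t]                  # shared weights: contributions are summed
        d_wh += da * hs[t]                  # hs[t] is the previous state
        dh_next = da * w_h
    return d_wx, d_wh, d_wy

xs, ts = [1.0, -0.5, 0.3], [1, 0, 1]
print(bptt(xs, ts, w_x=0.5, w_h=0.8, w_y=1.2))
```

_A weight update would then subtract a small multiple of each gradient from the corresponding weight. Comparing these gradients with finite differences of `loss` is a simple way to check the backward pass._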

### ___Types of RNN___

<img src='https://miro.medium.com/max/700/0*1PKOwfxLIg_64TAO.jpeg'/>
<center style='font-size:10px'><i>Different types of Recurrent Neural Networks. (2) Sequence output (e.g. image captioning takes an image and outputs a sentence of words). (3) Sequence input (e.g. sentiment analysis where a given sentence is classified as expressing positive or negative sentiment). (4) Sequence input and sequence output (e.g. Machine Translation: an RNN reads a sentence in English and then outputs a sentence in French). (5) Synced sequence input and output (e.g. video classification where we wish to label each frame of the video). Notice that in every case there are no pre-specified constraints on the lengths of the sequences, because the recurrent transformation (green) is fixed and can be applied as many times as we like.</i></center>

#### ___One-to-One___
_This is also called a plain/vanilla neural network. It maps a fixed-size input to a fixed-size output, and the output is independent of any previous information or output._

_Ex: Image classification._

#### ___One-to-Many___
_It takes a fixed-size input and produces a sequence of data as output._

_Ex: Image captioning, which takes an image as input and outputs a sentence of words._

#### ___Many-to-One___
_It takes a sequence of information as input and outputs a fixed-size output._

_Ex: Sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment._

#### ___Many-to-Many___
_It takes a sequence of information as input, processes it recurrently and outputs a sequence of data._

_Ex: Machine Translation, where an RNN reads a sentence in English and then outputs a sentence in French._

#### ___Many-to-Many (Synced)___
_Synced sequence input and output. Notice that in every case there are no pre-specified constraints on the lengths of the sequences, because the recurrent transformation (green) is fixed and can be applied as many times as we like._

_Ex: video classification where we wish to label each frame of the video._
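
_As a rough sketch, three of these patterns can be built from the same recurrent step; what changes is which outputs are kept and where the inputs come from. The scalar weights and function names are hypothetical, the state itself is used as the output to keep the code short, and feeding each output back in as the next input is only one common way to do one-to-many. The translation-style many-to-many, which reads the whole input before emitting output, is not shown._

```python
import math

def step(h, x, w_x=0.5, w_h=0.8):
    """One recurrent step with hypothetical scalar weights."""
    return math.tanh(w_h * h + w_x * x)

def many_to_one(xs):
    """Read the whole sequence, keep only the final state."""
    h = 0.0
    for x in xs:
        h = step(h, x)
    return h

def many_to_many_synced(xs):
    """Emit one output per input, in step with the input."""
    h, outputs = 0.0, []
    for x in xs:
        h = step(h, x)
        outputs.append(h)
    return outputs

def one_to_many(x, steps):
    """Start from a single input and feed each output back in as the next input."""
    h, outputs = 0.0, []
    for _ in range(steps):
        h = step(h, x)
        outputs.append(h)
        x = h
    return outputs

xs = [1.0, -0.5, 0.3]
print(many_to_one(xs))
print(many_to_many_synced(xs))
print(one_to_many(1.0, 4))
```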

### ___Applications of RNN___
* _Prediction problems_
* _Language Modelling and Generating Text_
* _Machine Translation_
* _Speech Recognition_
* _Generating Image Descriptions_
* _Video Tagging_
* _Text Summarization_
* _Call Center Analysis_

### ___Advantages of Recurrent Neural Network___
* _An RNN carries information from previous inputs forward through time in its hidden state, which is what makes it useful for time series prediction. Long Short Term Memory (LSTM) networks extend this ability to longer sequences._
* _Recurrent neural networks are even used with convolutional layers to extend the effective pixel neighborhood._

### ___Disadvantages of Recurrent Neural Network___
* _Vanishing and exploding gradient problems._
* _Training an RNN is a very difficult task._
* _It struggles to process very long sequences when tanh or ReLU is used as the activation function._

### ___Vanishing and Exploding Gradient Problem___

_The problem arises during the training of a deep neural network, when the gradients travel back through the layers to the initial layer during back-propagation. Because of the chain rule, the gradients go through repeated matrix multiplication. If the factors have small values (<1), the gradients shrink exponentially until they vanish; this is called the __vanishing gradient problem__. It causes loss of information through time._

<img src='https://miro.medium.com/max/700/1*U4S-rvcTtnHZUSUhuutxMg.png'/>

<img src='https://miro.medium.com/max/700/1*TRCh7MX4Bv74vLZOFpuBBA.png'/>

_Conversely, if the factors have large values (>1), the gradients grow exponentially and eventually blow up and crash the model; this is called the __exploding gradient problem__._

<img src='https://miro.medium.com/max/700/1*zgI-csKo3BOstYvITddHtw.png'/>
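
_A quick numeric sketch shows how fast this happens. It assumes, as a simplification, that each time step multiplies the gradient by the same scalar factor; in a real RNN the factor is a matrix product that changes from step to step. The function name is hypothetical._

```python
def gradient_scale(per_step_factor, steps):
    """Scale of a gradient after being multiplied by the same factor `steps` times."""
    return per_step_factor ** steps

for factor in (0.9, 1.1):
    scales = [gradient_scale(factor, n) for n in (1, 10, 50, 100)]
    print(factor, scales)
```

_Even factors this close to 1 shrink or grow the gradient by several orders of magnitude over 100 time steps._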

_Issues due to these problems:_
* _Long training time_
* _Poor Performance_
* _Bad Accuracy_

___How can you overcome the Challenges of Vanishing and Exploding Gradients?___

* ___Vanishing Gradients can be overcome with___
    * _ReLU activation function_
    * _LSTM, GRU_


* ___Exploding Gradients can be overcome with___
    * _Truncated BPTT (instead of backpropagating through every time step, we backpropagate through only a limited number of the most recent time steps)_
    * _RMSprop to adjust the learning rate_
    * _Clipping gradients to a threshold_
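
_Gradient clipping, the last item above, can be sketched as follows. This version rescales the whole gradient vector when its norm exceeds a threshold, which is one common variant; clipping each value separately is another. The function name and the numbers are hypothetical._

```python
import math

def clip_by_norm(grads, threshold):
    """Rescale `grads` so that its Euclidean norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)             # small enough: leave unchanged
    scale = threshold / norm
    return [g * scale for g in grads]  # same direction, smaller length

print(clip_by_norm([3.0, 4.0], 1.0))   # norm 5 is above the threshold, so it is rescaled
print(clip_by_norm([0.3, 0.4], 1.0))   # norm 0.5 is below the threshold, so it is kept
```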