# Recurrent Neural Networks

<p>Let's look at a Recurrent Neural Network (RNN), which is a neural network that feeds its output back into its input.  In essence, this looks like multiple copies of the same network chained together. </p>

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/RNN-shorttermdepdencies.png" width="500" height="500">

<p>In the picture above we can see that the output of the node that processes X0 is fed into the next node, alongside that node's own input, X1.  By the time we arrive at Xt, some learned information has persisted through the network.  These networks are very good at predicting information, but they can struggle with what are known as “long-term dependencies”: cases where the prediction relies on information learned a long time ago in the network.  This is where Long Short Term Memory (LSTM) networks come in.  LSTM networks are a special type of RNN.</p>

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png" width="500" height="500">

<p>The difference between a normal RNN and an LSTM network is that a regular RNN combines the node input and the chained input using a single simple layer, such as a tanh (hyperbolic tangent) function.</p>
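<p>As a rough illustration of that combination step, here is a minimal sketch of one plain RNN step in pure Python. The function name <code>rnn_step</code>, the separate weight matrices <code>W_x</code> and <code>W_h</code>, and all of the numbers are assumptions made up for this example; they are not taken from the diagrams.</p>

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One step of a plain RNN cell: h_t = tanh(W_x . x_t + W_h . h_prev + b)."""
    h_t = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))        # node input
        total += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))  # chained input
        h_t.append(math.tanh(total))
    return h_t

# Hypothetical weights: 2 hidden units, 1 input feature
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.1]

# Feed a short sequence through the same cell, carrying h forward each step
h = [0.0, 0.0]
for x in [[1.0], [0.5], [-1.0]]:
    h = rnn_step(x, h, W_x, W_h, b)
    print(h)
```

<p>The same weights are reused at every step, which is why the network can be drawn as copies of one cell chained together.</p>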

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" width="500" height="500">

<p>Inside an LSTM cell we have a different structure - let's take it step by step
</p>
<p>One of the first things to notice about the LSTM cell is the top row.  This row carries the cell state, Ct-1, along the chain, so information can pass through with little change.  The cell state is first multiplied by the output of a gate computed from Ht-1 and the input Xt, and then some other information is combined - using addition - later on (but we will come to that)
</p>
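<p>The multiply-then-add pattern on the top row can be sketched as below. This assumes the standard LSTM cell-state update, C_t = f_t * C_{t-1} + i_t * candidate, applied element-wise. The names <code>update_cell_state</code>, <code>input_gate</code> and <code>candidate</code> are illustrative, and the gate values are hand-picked rather than learned.</p>

```python
def update_cell_state(c_prev, forget_gate, input_gate, candidate):
    """Element-wise cell state update: C_t = f_t * C_{t-1} + i_t * candidate."""
    return [
        f * c + i * g
        for c, f, i, g in zip(c_prev, forget_gate, input_gate, candidate)
    ]

c_prev = [2.0, -1.0, 0.5]        # incoming cell state
forget_gate = [1.0, 0.0, 0.5]    # keep everything, forget everything, keep half
input_gate = [0.0, 1.0, 0.5]     # how much new information to let in
candidate = [0.5, 0.5, 0.5]      # new information proposed for the cell state

c_t = update_cell_state(c_prev, forget_gate, input_gate, candidate)
print(c_t)  # [2.0, 0.5, 0.5]
```

<p>The first entry passes through untouched, the second is replaced entirely by new information, and the third is a blend of old and new.</p>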
<p>The first layer of our LSTM cell is the "forget gate layer".  This is used to decide what information to forget - funnily enough.  It takes the output of the previous node, Ht-1, and the input of this node, Xt, and applies a sigmoid function to them.  
</p>
<p>The sigmoid is a mathematical function with an 'S'-shaped curve. The output of the sigmoid function is between 0 and 1.  In our context, the closer to 0, the less we let through; the closer to 1, the more we let through.  Think of it as 0% to 100%</p>

<p>This can be expressed in either of the following notations:
$$ S(x) = \frac{e^x}{1 + e^x} $$
<br>   
$$ S(x) = \frac{1}{1 + e^{-x}} $$ </p>

<p>The sigmoid function looks like this:</p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/600px-Logistic-curve.svg.png" width="500" height="500">

In [13]:
# The sigmoid function translated into Python
import math
def sigmoid(x):
    return 1 / (1 + math.exp(-x))
sigmoid(0.35)

0.5866175789173301
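<p>As a quick sanity check, the two notations for the sigmoid given earlier can be compared numerically. The function names <code>sigmoid_a</code> and <code>sigmoid_b</code> are just labels chosen for this sketch.</p>

```python
import math

def sigmoid_a(x):
    # First notation: e^x / (1 + e^x)
    return math.exp(x) / (1 + math.exp(x))

def sigmoid_b(x):
    # Second notation: 1 / (1 + e^-x)
    return 1 / (1 + math.exp(-x))

# Both forms should agree up to floating-point rounding
for x in [-5.0, -1.0, 0.0, 0.35, 1.0, 5.0]:
    assert math.isclose(sigmoid_a(x), sigmoid_b(x))
    print(x, sigmoid_b(x))
```

<p>Multiplying the numerator and denominator of the first form by e^-x gives the second, which is why the values match.</p>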

<p>Luckily, TensorFlow has a handy function that performs 
the same operation:</p>

In [19]:
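import tensorflow as tf
# In graph mode, tf.sigmoid returns a symbolic Tensor, so printing it
# shows the graph node (as in the output below), not the computed value.
i = tf.sigmoid(0.35)
print(i)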

Tensor("Sigmoid_3:0", shape=(), dtype=float32)


<p>Below we have the forget gate layer with each of its elements, as well as the mathematical notation for the function ft</p><img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png" width="500" height="500">
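<p>The function can be understood as follows: Ht-1 and Xt are joined into a single vector, that vector is multiplied by the gate's weight matrix (a dot product), the bias is added, and the sigmoid activation function is applied to the result: $$ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) $$</p>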
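<p>To make the forget gate concrete, here is a small pure-Python sketch of f_t = sigmoid(W_f . [h_{t-1}, x_t] + b_f). The function name <code>forget_gate</code>, the sizes of the vectors, and every weight and bias value are assumptions invented for this example; in a real network the weights would be learned during training.</p>

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def forget_gate(h_prev, x_t, W_f, b_f):
    """Compute f_t = sigmoid(W_f . [h_prev, x_t] + b_f), one value per cell-state entry."""
    concat = h_prev + x_t  # list concatenation gives [h_{t-1}, x_t]
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(W_f, b_f)
    ]

h_prev = [0.2, -0.4]               # previous output, Ht-1 (hypothetical values)
x_t = [1.0]                        # current input, Xt (hypothetical value)
W_f = [[0.5, 0.1, -0.3],           # one row of weights per cell-state entry
       [0.0, 0.8, 0.6]]
b_f = [0.0, 1.0]

f_t = forget_gate(h_prev, x_t, W_f, b_f)
print(f_t)  # each value lies between 0 (forget) and 1 (keep)
```

<p>Each entry of <code>f_t</code> is then multiplied into the matching entry of the cell state, so values near 0 erase that entry and values near 1 keep it.</p>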
