# 1. Architecture of a Recurrent Unit
We are now going to talk about the _**simple recurrent unit**_, also known as the _Elman Unit_. But before we do that, I want to quickly touch on _sequences_. This is going to look slightly different from the data that we are used to.

Recall that our input data $X$ is usually represented with an $N \times D$ matrix ($N$ samples and $D$ features); there are no sequences here. Well, let's suppose that we did have a sequence of length $T$. How many dimensions would that require? 

Well, if each observation is a $D$-dimensional vector, and we have $T$ of them, then one sequence of observations will be a $T \times D$ matrix. If we have $N$ training samples, then we will end up with an $N \times T \times D$ array, which is a 3-dimensional object. 

Sometimes our sequences are not of equal length, as is the case with sentences, music, sound, or even someone's credit history. How can we handle this? We have encountered this problem in the Hidden Markov Notebooks. The solution is to store each sequence in a Python list. So, instead of a 3-dimensional array, we will have a length $N$ list where each element is a 2-d observation sequence (a $T_n \times D$ NumPy array, where $T_n$ is the length of the $n$-th sequence). Because a Python list can contain any object as an element, this is okay. 
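
To make the two storage layouts concrete, here is a minimal sketch. The sizes, the random data, and the names `X_fixed`, `X_var`, and `lengths` are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, D = 4, 5, 3  # hypothetical sizes: samples, time steps, features

# Equal-length sequences fit in a single 3-dimensional array of shape (N, T, D).
X_fixed = rng.normal(size=(N, T, D))

# Variable-length sequences: a Python list of N arrays, each of shape (T_n, D).
lengths = [2, 5, 3, 4]  # hypothetical length T_n of each sequence
X_var = [rng.normal(size=(T_n, D)) for T_n in lengths]

print(X_fixed.shape)
print([x.shape for x in X_var])
```

Every element of `X_var` has the same number of columns ($D$), but the number of rows differs from sequence to sequence.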

## 1.1 Simple Recurrent Unit 
Okay, now we can dig into our simple recurrent unit. Take a simple feedforward neural network with one hidden layer: 

<img src="images/simple-recurrent-unit-1.png" width="300">

The input layer and output layer will stay exactly the same; they are not actually part of the recurrent unit itself, but they are included here for context. What we want to do is create a feedback connection from the hidden layer to itself:

<img src="images/simple-recurrent-unit-2.png" width="300">

We can include the weights as well, first a regular feedforward net:

<img src="images/simple-recurrent-unit-3.png" width="300">

And a recurrent net:

<img src="images/simple-recurrent-unit-4.png" width="300">

Notice that the feedback loop implies that there is a delay of one time unit. So, one of the inputs into $h(t)$ is $h(t-1)$. 

A question that you may have is: How big is $W_h$? Just like the other layers, we connect "everything-to-everything". So, if there are $M$ hidden units, the first hidden unit connects back to all $M$ units, the second hidden unit connects back to all $M$ units, and so on. In total there will be $M^2$ hidden-to-hidden weights. Hence, $W_h$ is an $M \times M$ matrix. 

### 1.1.1 Simple Recurrent Mathematical Output
Here is how we would represent the output of a recurrent net in math:

#### $$h(t) = f \big(W_h^T h(t-1) + W_x^T x(t) + b_h\big)$$

#### $$y(t) = \text{softmax}\big(W_o^T h(t) + b_o\big)$$

Note that the feedback connection represents a time delay of 1, so the hidden layer takes in both the current input $x(t)$ and its own previous value $h(t-1)$. Also, note that $f$ can be any of the usual nonlinearities, such as the sigmoid, tanh, or ReLU. 
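
The two equations above can be sketched in NumPy as follows. This is only an illustration under some assumptions: the function names `softmax` and `forward`, the sizes, and the random weights are made up, $f$ is taken to be tanh, and the initial state is set to zero. Each $x(t)$ is stored as a row vector, so `h @ W_h` computes the same values as $W_h^T h$.

```python
import numpy as np

def softmax(a):
    # Subtract the row-wise max for numerical stability.
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def forward(X, W_x, W_h, b_h, W_o, b_o, h0, f=np.tanh):
    """Run one sequence X of shape (T, D) through a simple recurrent unit.

    Returns the hidden states H of shape (T, M) and outputs Y of shape (T, K).
    """
    h = h0
    H = []
    for x_t in X:
        # h(t) = f(W_h^T h(t-1) + W_x^T x(t) + b_h)
        h = f(h @ W_h + x_t @ W_x + b_h)
        H.append(h)
    H = np.stack(H)
    # y(t) = softmax(W_o^T h(t) + b_o), applied to every time step at once
    Y = softmax(H @ W_o + b_o)
    return H, Y

rng = np.random.default_rng(0)
T, D, M, K = 6, 3, 4, 2  # hypothetical sizes
W_x = rng.normal(size=(D, M)) / np.sqrt(D)  # input-to-hidden, D x M
W_h = rng.normal(size=(M, M)) / np.sqrt(M)  # hidden-to-hidden, M x M
W_o = rng.normal(size=(M, K)) / np.sqrt(M)  # hidden-to-output, M x K
b_h, b_o, h0 = np.zeros(M), np.zeros(K), np.zeros(M)

X = rng.normal(size=(T, D))
H, Y = forward(X, W_x, W_h, b_h, W_o, b_o, h0)
print(H.shape, Y.shape)
```

Each row of `Y` is a probability distribution over the $K$ output classes for one time step.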

## 1.2 Not The Markov Assumption
One thing that is worth noting is that this is not the Markov Assumption. Why is that? Well, even though $h(t)$ is defined in terms of its previous value, its previous value can be defined in terms of the value before that, and so on. As a result, $h(t)$ ultimately depends on every input up to time $t$, not just the most recent one:

#### $$h(t) = f \big(W_h^T h(t-1) + W_x^T x(t) + b_h\big)$$

#### $$h(t) = f \Big(W_h^T f \big( W_h^T h(t-2) + W_x^T x(t-1) + b_h \big) + W_x^T x(t) + b_h\Big)$$

This also means that $h(t)$ has to have an initial state, $h(0)$. Sometimes researchers will set this to 0, and other times it will be treated as a parameter that is learned with backpropagation. Since Theano automatically differentiates things for us, we will treat it as an updatable parameter. 
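
A quick numerical check of this dependence on the whole history, as a rough sketch: we run the same recurrence on two sequences that differ only in their first observation and compare the final hidden states. The helper `final_hidden_state`, the sizes, and the random weights are made up for illustration, $f$ is assumed to be tanh, and $h(0)$ is simply fixed at zero here rather than learned.

```python
import numpy as np

def final_hidden_state(X, W_x, W_h, b_h, h0):
    # Apply h(t) = tanh(W_h^T h(t-1) + W_x^T x(t) + b_h) over the whole sequence.
    h = h0
    for x_t in X:
        h = np.tanh(h @ W_h + x_t @ W_x + b_h)
    return h

rng = np.random.default_rng(0)
T, D, M = 5, 3, 4  # hypothetical sizes
W_x = 0.5 * rng.normal(size=(D, M))
W_h = 0.5 * rng.normal(size=(M, M))
b_h = np.zeros(M)
h0 = np.zeros(M)  # initial state h(0); fixed at zero in this sketch

X_a = rng.normal(size=(T, D))
X_b = X_a.copy()
X_b[0] += 1.0  # change only the very first observation

h_a = final_hidden_state(X_a, W_x, W_h, b_h, h0)
h_b = final_hidden_state(X_b, W_x, W_h, b_h, h0)

# Largest absolute difference between the two final hidden states.
diff = np.abs(h_a - h_b).max()
print(diff)
```

The two sequences are identical from the second time step onward, yet the final hidden states differ, because the first observation is carried forward through the chain of hidden states.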

## 1.3 More Layers 
One question you may have is can we add more than one recurrent unit to the network:

<img src="images/simple-recurrent-unit-5.png" width="400">

The answer is yes! The number of recurrent layers is a hyperparameter, just like the number of hidden layers is a hyperparameter for a regular feedforward net. How many to use depends on your specific problem. 
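
One common way to do this, sketched here as an assumption rather than as the exact architecture in the figure, is to stack the layers: the hidden state sequence produced by one recurrent layer becomes the input sequence of the next. The helper `recurrent_layer`, the list `hidden_sizes`, and the random weights are made up for illustration.

```python
import numpy as np

def recurrent_layer(X, W_x, W_h, b_h, h0, f=np.tanh):
    # Map an input sequence of shape (T, D_in) to hidden states of shape (T, M).
    h = h0
    H = []
    for x_t in X:
        h = f(h @ W_h + x_t @ W_x + b_h)
        H.append(h)
    return np.stack(H)

rng = np.random.default_rng(0)
T, D = 6, 3
hidden_sizes = [5, 4]  # hypothetical: two recurrent layers with 5 and 4 units

X = rng.normal(size=(T, D))
out, in_dim = X, D
for M in hidden_sizes:
    # Each layer has its own weights and its own initial hidden state.
    W_x = rng.normal(size=(in_dim, M)) / np.sqrt(in_dim)
    W_h = rng.normal(size=(M, M)) / np.sqrt(M)
    b_h, h0 = np.zeros(M), np.zeros(M)
    out = recurrent_layer(out, W_x, W_h, b_h, h0)
    in_dim = M  # the next layer's input size is this layer's hidden size

print(out.shape)
```

An output layer, such as the softmax layer from the equations earlier, would then be applied to the hidden states of the last recurrent layer.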

And that is all there is to it! Just by adding that one feedback connection to the hidden layer, we have created a recurrent neural network! We will see in the code that this can already do some amazing things, such as exponentially decreasing the number of hidden units we would have needed in a feedforward neural network. 