# Neural Networks

Why do we need another learning algorithm?

If our data is not linearly separable, we need to define the features for our Logistic Regression algorithm such that they include enough polynomial terms. The more features we have, the harder it becomes to build these polynomial terms. A large number of terms also brings the risk of overfitting and makes training computationally expensive.

One example where the feature vector is very large is image classification: a 50x50 pixel image has 2500 features in grey scale and 7500 in RGB. If we added all quadratic feature combinations of the 2500 grey scale features, we would end up with more than 3 million features, which is too many to work with in practice.
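To see where the figure of more than 3 million comes from, the sketch below counts all quadratic terms $x_i x_j$ with $i \le j$ for a 50x50 grey scale image. The helper name `count_quadratic_features` is made up for this illustration.

```python
def count_quadratic_features(n_features):
    """Number of distinct products x_i * x_j with i <= j."""
    return n_features * (n_features + 1) // 2

n_grey = 50 * 50        # one intensity value per pixel
n_rgb = 50 * 50 * 3     # three colour channels per pixel

print(n_grey, n_rgb)                      # 2500 7500
print(count_quadratic_features(n_grey))   # 3126250
```

Higher-order terms (cubic and beyond) would grow the count even faster.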

What if we could have an algorithm which automatically finds the relevant features for us? 

Neural Networks can do that. They work well for training sets with a large number of features.

### What is a Neural Network?
Neural Networks try to mimic the human brain. More specifically, they simulate the neurons in the brain. 
The parts of the neuron that we want to focus on are the **Dendrites**, the **cell body** and the **Axon**. The Dendrites are connected to other neurons and "receive input". We think of them as input wires. The **cell body** performs some calculation and passes its output to the Axon. The Axon acts like an output wire which passes results to the Dendrites of other neurons.

In a simplified view, a neuron is a computational unit which gets a number of inputs, does some computation and sends the output to other neurons. The communication between neurons is performed through pulses of electricity which can vary in strength.

![Human Neuron](../data/week3/Neuron.png)

From a computational perspective, we can look at one neuron as a logistic unit which takes a few inputs via the input wires, does some computation and returns some output. The computation (as the name suggests) is logistic, i.e. $y = h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$. The parameters $\theta$ are often also referred to as weights.
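A single logistic unit can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the example weights and inputs are assumptions chosen so that the result is easy to verify by hand.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(theta, x):
    """Output of one neuron: g(theta^T x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical weights and inputs, chosen only for illustration.
theta = [2.0, -1.0, 0.5]
x = [0.5, 1.0, 0.0]

# z = 2*0.5 - 1*1 + 0.5*0 = 0, so the unit outputs g(0) = 0.5
print(logistic_unit(theta, x))
```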

![Artificial Neuron](../data/week3/Artificial_Neuron.png)

### Neural Network

If we connect multiple neurons, we obtain a Neural Network. A Neural Network consists of an **input layer**, which takes the input values and passes them on to the neurons in the **hidden layer**, and an **output layer**, which takes the activations of the final hidden layer and computes the hypothesis. Note that we are not restricted to a single hidden layer or a single output neuron.

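Additionally, we add a bias unit, which always outputs 1, to the input layer and to every hidden layer. A network with a single hidden layer is also the basic shape of an **Autoencoder**: if the output layer has as many units as the input layer and the network is trained to reproduce its input, the hidden layer learns a compressed representation of the features.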

![Neural Network](../data/week3/Neural_Network.png)

#### Notation
- $a^{(j)}_i$ = "activation" of unit $i$ in layer $j$
- $\Theta^{(j)}$ = matrix of weights controlling the function mapping from layer $j$ to layer $j+1$

Let's do a sample calculation of the output for the network below:
![Neural Network](../data/week3/Neural_Network1.png)

\begin{align}
&a_1^{(2)} = g(\Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3)\\
&a_2^{(2)} = g(\Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3)\\
&a_3^{(2)} = g(\Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3)\\
h_\Theta(x) = &a_1^{(3)} = g(\Theta^{(2)}_{10}a^{(2)}_0 + \Theta^{(2)}_{11}a^{(2)}_1 + \Theta^{(2)}_{12}a^{(2)}_2 + \Theta^{(2)}_{13}a^{(2)}_3)\\
\end{align}

We can perform this computation more efficiently in a vectorized fashion.
We define:

\begin{align}
z_1^{(2)} = \Theta^{(1)}_{10}x_0 + \Theta^{(1)}_{11}x_1 + \Theta^{(1)}_{12}x_2 + \Theta^{(1)}_{13}x_3\\
z_2^{(2)} = \Theta^{(1)}_{20}x_0 + \Theta^{(1)}_{21}x_1 + \Theta^{(1)}_{22}x_2 + \Theta^{(1)}_{23}x_3\\
z_3^{(2)} = \Theta^{(1)}_{30}x_0 + \Theta^{(1)}_{31}x_1 + \Theta^{(1)}_{32}x_2 + \Theta^{(1)}_{33}x_3\\
\end{align}

and therefore, 
\begin{align}
a_1^{(2)} = g(z_1^{(2)})\\
a_2^{(2)} = g(z_2^{(2)})\\
a_3^{(2)} = g(z_3^{(2)})\\
\end{align}

We can define the vectors:

\begin{align}
  x = 
  \begin{bmatrix}
    x_{0}\\
    x_{1}\\
    x_{2}\\
    x_{3}\\
  \end{bmatrix}
  \quad\quad\quad
  z^{(2)} = 
  \begin{bmatrix}
  z^{(2)}_1\\  
  z^{(2)}_2\\  
  z^{(2)}_3\\  
  \end{bmatrix}
\end{align}


\begin{align}
z^{(2)} = \Theta^{(1)}x\\
a^{(2)} = g(z^{(2)})
\end{align}

Further, we will define $a^{(1)} = x$ such that we can write: $z^{(2)} = \Theta^{(1)}a^{(1)}$

Lastly, we prepend the bias unit $a_0^{(2)} = 1$ to $a^{(2)}$ and compute the output layer:

\begin{align}
z^{(3)} = \Theta^{(2)}a^{(2)}\\
h_\Theta(x) = a^{(3)} = g(z^{(3)})
\end{align}

This whole process of calculating the hypothesis is called **forward propagation**. 
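The vectorized computation can be sketched with NumPy for the 3-3-1 network used in the sample calculation. This is a minimal sketch under stated assumptions: the function name `forward_propagation`, the random weights and the input values are made up for illustration, and $\Theta^{(1)}$ is assumed to have shape 3x4 and $\Theta^{(2)}$ shape 1x4 so that each matrix includes a column for the bias unit.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(x, Theta1, Theta2):
    """Compute h_Theta(x) for a network with one hidden layer.

    x      : inputs x_1..x_3 without the bias entry
    Theta1 : weights from layer 1 to layer 2, shape (3, 4)
    Theta2 : weights from layer 2 to layer 3, shape (1, 4)
    """
    a1 = np.concatenate(([1.0], x))             # a^(1) = x with x_0 = 1
    z2 = Theta1 @ a1                            # z^(2) = Theta^(1) a^(1)
    a2 = np.concatenate(([1.0], sigmoid(z2)))   # g(z^(2)) with a_0^(2) = 1 prepended
    z3 = Theta2 @ a2                            # z^(3) = Theta^(2) a^(2)
    return sigmoid(z3)                          # h_Theta(x) = a^(3)

# Hypothetical weights and inputs, for illustration only.
rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))
Theta2 = rng.normal(size=(1, 4))
x = np.array([0.2, -0.4, 0.7])

h = forward_propagation(x, Theta1, Theta2)
print(h.shape)   # (1,)
```

Each matrix product replaces one group of the element-wise sums written out earlier, which is why the same function works unchanged for layers of any size as long as the shapes match.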

In [None]:
#Before I write a Neural Network, write an Autoencoder
#Also have a brief section on Bayesian learning