# Neural Networks

Understand the neural network, how it is formulated, and why it works.

We will cover the following:

- What is a neuron?

- Multilayer perceptron

- Universal approximation theorem

- Deep neural networks as feature extractors

# What is a neuron?

In the world of Neural Networks (NNs), the basic building block is the neuron. NNs are nothing more than a collection of neurons organized in layers, with information passing from one layer to the next. So to understand NNs, we first need to understand the neuron: the basic computing unit.


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig06.PNG)

Mathematically, we have:

> $y = w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3}$

A neuron is simply a linear classifier with a single output.

In [1]:
import torch.nn as nn 

neuron = nn.Linear(3, 1, bias=False)

Looks familiar? In most applications, we also add a bias $b$ to shift the position of the boundary line that separates the data points. This model is popularly known as the **Perceptron**.

To extend this idea, we also pass this weighted sum through a **non-linear function** $\sigma$ that will give us the decision boundary.

Why?

Because with non-linear functions between linear layers, we can model much more complex representations with fewer linear layers.

> Non-linearities are a key component that makes NNs very rich function approximators.

Putting it all together, we have:

> $y = \sigma(w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + b) = \sigma(wx + b)$
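To make the formula concrete, here is a minimal sketch of a single neuron in plain Python, with no PyTorch involved. The helper names sigmoid and neuron, the choice of the logistic sigmoid as $\sigma$, and all numeric values are assumptions made for illustration only.

```python
import math

def sigmoid(t):
    """Logistic sigmoid, one common choice for the non-linearity."""
    return 1.0 / (1.0 + math.exp(-t))

def neuron(x, w, b, activation=sigmoid):
    """Compute activation(w1*x1 + ... + wn*xn + b) for one neuron."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return activation(weighted_sum + b)

# Illustrative values, not taken from the text.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.5]

y_no_bias = neuron(x, w, b=0.0)    # weighted sum is 0.0, so the sigmoid gives 0.5
y_with_bias = neuron(x, w, b=1.0)  # the bias shifts the pre-activation to 1.0
print(y_no_bias, y_with_bias)
```

With these values the weighted sum is exactly zero, so the bias alone decides on which side of 0.5 the output falls.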

We organize neurons in layers with nn.Linear(in_features, out_features) and stack the layers in sequential order. Stacking layers, combined with the non-linear activation function between them, gives us the ability to separate data that is not linearly separable.

There are multiple names in the literature for stacking linear layers with non-linear activations: **Multi-Layer Perceptron (MLP), artificial neural network, and feedforward module**. In this context, these terms are used interchangeably.

In practice, for three input features and two classes, our model will be like this:

In [3]:
import torch.nn as nn 

# 20 is the hidden dimension, an arbitrary choice
model = nn.Sequential( 
        nn.Linear(3,20), # 3 for the input features x1,x2,x3 
        nn.ReLU(), 
        nn.Linear(20,2)) # 2 for the classes 

print(model)

Sequential(
  (0): Linear(in_features=3, out_features=20, bias=True)
  (1): ReLU()
  (2): Linear(in_features=20, out_features=2, bias=True)
)


Here, the activation function $\sigma$ is nn.ReLU().
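As a quick sanity check on the printed model, the number of learnable parameters can be worked out by hand: a fully connected layer stores one weight per input-output pair plus one bias per output. The helper linear_param_count below is hypothetical and is not part of PyTorch; it only reproduces that arithmetic.

```python
def linear_param_count(in_features, out_features, bias=True):
    """Parameters of one fully connected layer: a weight matrix plus an optional bias vector."""
    weights = in_features * out_features
    biases = out_features if bias else 0
    return weights + biases

# Same layer sizes as the model above; ReLU has no learnable parameters.
hidden = linear_param_count(3, 20)  # 3*20 weights + 20 biases = 80
output = linear_param_count(20, 2)  # 20*2 weights + 2 biases = 42
total = hidden + output
print(hidden, output, total)        # 80 42 122
```

Changing the hidden dimension from 20 to another value changes both terms, which is one reason the hidden size is a common knob for controlling model capacity.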


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig07.PNG)

A good exercise is to implement the depicted network in PyTorch.

> **Hint:** It is not that different from the code shown above.

# Multilayer Perceptron

But why are MLPs able to find non-linear functions? Let’s have another look at NNs from a mathematical perspective. Remember that each neuron is represented with:

> $y = \sigma(w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + b) = \sigma(wx + b)$

So for a layer of two neurons, nn.Linear(3,2), we would have:

> $y_{1} = \sigma(w_{1}x_{1} + w_{2}x_{2} + w_{3}x_{3} + b_{1})$

> $y_{2} = \sigma(w_{4}x_{1} + w_{5}x_{2} + w_{6}x_{3} + b_{2})$

This can be transformed using linear algebra to $y = \sigma(Wx + b)$ where:

> $y = \begin{bmatrix} y_{1} \\ y_{2} \end{bmatrix}$

> $W = \begin{bmatrix} w_{1} & w_{2} & w_{3} \\ w_{4} & w_{5} & w_{6} \end{bmatrix}$

> $x = \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3}\end{bmatrix}$

> $b = \begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix}$
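The matrix form can be checked with a small plain-Python sketch in which each row of $W$ plays the role of one neuron. The function layer_forward, the sigmoid activation, and the numbers are illustrative assumptions, not values from the text.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def layer_forward(W, x, b, activation=sigmoid):
    """Compute activation(W x + b) row by row; each row of W is one neuron."""
    return [
        activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]

# Illustrative values for a layer with 3 inputs and 2 neurons.
W = [[0.5, -1.0, 0.5],   # w1, w2, w3
     [1.0, 0.0, -1.0]]   # w4, w5, w6
x = [1.0, 2.0, 3.0]
b = [0.0, 1.0]           # b1, b2

y = layer_forward(W, x, b)
print(y)  # two outputs, one per neuron
```

The output has one entry per row of $W$, which matches the out_features argument of nn.Linear(3,2).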

If we assume that $x$ is our input and $y$ is our output, we can see that we have a non-linear expression with respect to $x$. Note that in the absence of an activation function, we will end up with a linear classifier.

If we add a second layer to make our model more expressive, we will have:

> $y^{(2)} = f(W^{(2)}z + b^{(2)})$

where $z = \sigma(W^{(1)}x + b^{(1)})$ is the output of the first layer.

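This can be written in a more general format for the current layer $L$ and the previous layer $L-1$ as:

> $y^{(L)} = f(W^{(L)}\sigma(W^{(L-1)}x^{(L-2)} + b^{(L-1)}) + b^{(L)})$

Here $x^{(L-2)}$ is the output of layer $L-2$, that is, the input to layer $L-1$.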

The inner activation is denoted with $\sigma$, which is the symbol we use for non-linear activation functions.
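The remark that a network without activation functions stays linear can be verified numerically. The sketch below, with made-up weights and hypothetical helper names, stacks two linear layers, collapses them into a single layer with $W = W^{(2)}W^{(1)}$ and $b = W^{(2)}b^{(1)} + b^{(2)}$, and then shows that inserting a ReLU between the layers breaks this equivalence.

```python
def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

def relu(v):
    return [max(0.0, vi) for vi in v]

# Illustrative weights: layer 1 maps 3 -> 2, layer 2 maps 2 -> 1.
W1 = [[1.0, -1.0, 0.0],
      [0.0, 1.0, -1.0]]
b1 = [0.0, 1.0]
W2 = [[2.0, -1.0]]
b2 = [0.5]

def two_layers(x, activation=None):
    """Apply layer 1, an optional activation, then layer 2 (no output activation)."""
    z = add(matvec(W1, x), b1)
    if activation is not None:
        z = activation(z)
    return add(matvec(W2, z), b2)

# Without an activation, the two layers equal one linear layer W x + b.
W = matmul(W2, W1)           # W2 W1
b = add(matvec(W2, b1), b2)  # W2 b1 + b2

x = [1.0, 3.0, 2.0]
linear_stack = two_layers(x)
collapsed = add(matvec(W, x), b)
with_relu = two_layers(x, activation=relu)
print(linear_stack, collapsed, with_relu)
```

The first two results agree for every input, so the extra linear layer adds no expressive power. Only the version with the non-linearity computes something a single linear layer cannot reproduce in general.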

# Universal approximation theorem

According to the universal approximation theorem, given enough neurons and the correct set of weights, a multi-layer NN can approximate any continuous function on a bounded (compact) input domain to arbitrary accuracy. The theorem only states that such weights exist: learning them is a harder problem, and we have no guarantee that our data are enough to do so.
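A rough way to build intuition for this statement is to set the weights of a one-hidden-layer ReLU network by hand so that it traces a piecewise-linear interpolation of a target function. This is only a sketch under stated assumptions: the target $x^2$ on $[0, 1]$, the helper names, and the construction itself are illustrative choices, and the weights are constructed rather than learned.

```python
def relu(t):
    return max(0.0, t)

def build_relu_approximator(f, n_neurons, lo=0.0, hi=1.0):
    """Return g(x) = f(lo) + sum_k c_k * relu(x - x_k), a one-hidden-layer ReLU net."""
    step = (hi - lo) / n_neurons
    knots = [lo + k * step for k in range(n_neurons)]
    # Slope of the interpolant on each segment [x_k, x_k + step].
    slopes = [(f(xk + step) - f(xk)) / step for xk in knots]
    # Each hidden neuron contributes the change in slope at its knot.
    coeffs = [slopes[0]] + [slopes[k] - slopes[k - 1] for k in range(1, n_neurons)]
    offset = f(lo)

    def g(x):
        return offset + sum(c * relu(x - xk) for c, xk in zip(coeffs, knots))

    return g

def max_error(f, g, lo=0.0, hi=1.0, samples=1001):
    """Largest absolute gap between f and g on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    return max(abs(f(x) - g(x)) for x in xs)

def target(x):
    return x * x

errors = {n: max_error(target, build_relu_approximator(target, n)) for n in (4, 16, 64)}
for n, err in errors.items():
    print(f"{n:3d} hidden neurons -> max error {err:.6f}")
```

The measured error shrinks as hidden neurons are added, which mirrors the "given enough neurons and the correct set of weights" part of the statement. It says nothing about whether training would find such weights.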

Admittedly, that doesn’t mean we should only use NNs. 

In fact, we will learn about other models and how we can make a NN more compact, wider, or deeper to learn very rich data representations.

Why is that even useful?

Because NNs hide another secret besides being very good function approximators. They are also very good feature extractors.

# Deep neural networks as feature extractors

> Feature extraction can be seen as the transformation of the input data points from the input space to the feature space where classification is much easier.

Here is an intuitive and oversimplified example:

Imagine that each data point has 70 dimensions. Finding the correct 70-dimensional function to distinguish the data into two categories is very difficult and time-consuming.

Instead, we transform our input to a three-dimensional space where a classifier can approximate the decision boundary more easily. If we transform the 3D decision boundary back to the 70-dimensional space, we will see that it corresponds to a 70-dimensional decision boundary.

The transformed space does not always need to be low-dimensional, but high-dimensional spaces do not guarantee better results either. Think of the 70-dim example: if one of these input dimensions refers to the label, that single dimension would be enough to reach 100% accuracy.

In any case, this is the main reason Deep Neural Networks (DNNs) exist: to transform the input data into a “better” space. Better because we can classify the (new) data more easily after we transform them!

In fact, in most real-life applications, only the last one or two layers of a neural network perform the actual classification. The rest account for feature extraction and representation learning.


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig08.png)
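The split between feature extraction and classification can be illustrated with a toy problem. XOR is not linearly separable in its 2D input space, but after a small hidden ReLU layer maps the points to a feature space, a single linear threshold is enough. The weights below are set by hand for illustration, and the function names extract_features and classify are hypothetical.

```python
def relu(t):
    return max(0.0, t)

def extract_features(x1, x2):
    """Hand-set hidden layer: maps the 2D input to a 2D feature space."""
    h1 = relu(x1 + x2)        # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # active only when both inputs are on
    return h1, h2

def classify(h1, h2):
    """Linear classifier in feature space: one weighted sum and a threshold."""
    return 1 if h1 - 2.0 * h2 > 0.5 else 0

xor_data = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 0)]
predictions = [classify(*extract_features(*x)) for x, _ in xor_data]
labels = [label for _, label in xor_data]
print(predictions, labels)
```

Here extract_features plays the role of the early layers and classify plays the role of the final layer. In a trained network both parts are learned jointly instead of being written by hand.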