# Training a NN

* **Batch:** a set of samples taken from the whole data set (for example, a few rows of a data frame).
<br>
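The **NN** processes all samples of a batch in parallel. A single sample is a vector with $N_{in}$ components,
$$\vec{y} \in \mathbb{R}^{N_{in}}$$
<br>
while a batch of many samples is stored as a matrix with one sample per row,
$$Y \in \mathbb{R}^{N_{samples}\times N_{in}}$$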
<br>
* Python (NumPy) broadcasting: adding a vector $b$ to a matrix $A$ adds $b$ to every row of $A$:
$$M = A + b \rightarrow M_{ij} = A_{ij} + b_{j}$$
<br>
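
The broadcasting rule above can be checked directly. This is a minimal sketch assuming NumPy; the arrays `A` and `b` are made-up illustration values.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3)
b = np.array([10.0, 20.0, 30.0])  # shape (3,)

# NumPy broadcasts b across the rows: M[i, j] = A[i, j] + b[j]
M = A + b
print(M.shape)
print(M)
```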
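* Processing a batch (a set of samples) with the linear function
$$Z = dot(Y,W) + b$$

where the arrays have shapes $Z=(N_{samples},N_{out})$, $Y=(N_{samples},N_{in})$, $W=(N_{in},N_{out})$, $b=(N_{out})$. The bias $b$ is broadcast over the samples, so every row of $Z$ receives the same bias vector.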
<br>
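
A small sketch of the batched linear function, assuming NumPy and random placeholder data. The sizes `N_samples`, `N_in`, `N_out` are arbitrary illustration values; the point is only to confirm the shapes and that each row of `Z` is the single-sample result.

```python
import numpy as np

rng = np.random.default_rng(0)
N_samples, N_in, N_out = 5, 3, 2

Y = rng.normal(size=(N_samples, N_in))  # one sample per row
W = rng.normal(size=(N_in, N_out))      # weights
b = rng.normal(size=N_out)              # one bias per output neuron

# Whole batch at once; b is broadcast over the samples.
Z = np.dot(Y, W) + b

# The same computation for the first sample only.
z0 = np.dot(Y[0], W) + b

print(Z.shape)
print(np.allclose(Z[0], z0))
```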

# Approximating an Arbitrary non-Linear Function

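* Similarly to numerical integration, where an area is built up from many simple pieces, a complex non-linear function may be approximated by combining many simple units: linear functions coupled through a non-linear activation function (sigmoid, ReLU, etc.).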
<br>
* **Universality of Neural Networks**: Any arbitrary (smooth) function can be approximated as well as desired by a neural network with a single hidden layer, provided the layer has a sufficient number of neurons.
<br>
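
One way to build intuition for the universality statement: two steep sigmoid neurons in a single hidden layer, combined with output weights $+1$ and $-1$, produce a localized "bump". Summing many such bumps with different heights gives a piecewise approximation of a target function, much like the rectangles of a Riemann sum. The sketch below assumes NumPy; the helper `bump` and the values of `left`, `right`, and `steepness` are illustrative choices, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, left, right, steepness=50.0):
    """Single hidden layer with two sigmoid neurons and output weights +1, -1."""
    # Hidden layer: weight = steepness, biases = -steepness * (left, right).
    hidden = sigmoid(steepness * (x[:, None] - np.array([left, right])))  # (N, 2)
    out_weights = np.array([1.0, -1.0])
    return hidden @ out_weights

x = np.linspace(0.0, 1.0, 11)
y = bump(x, left=0.3, right=0.7)

# Close to 1 between left and right, close to 0 outside.
for xi, yi in zip(x, y):
    print(f"{xi:.1f}  {yi:.3f}")
```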

* How do we choose the values of the weights (W) and biases (b)? By training!

*Figure: NN fitting (`NN_fitting.png`).*

# NN Training

* The idea is to adjust the values of the weights and biases until the NN produces the "correct" result, that is, until its output approximates the target function.
<br>
* To train the NN, it is necessary to have a function that measures the difference between the output of the NN and the "correct" value.
* This is called the **Cost Function** and is defined as
<br>
$$ C(w) = \frac{1}{2} \left\langle \left\| F_{w}(y^{in}) - F(y^{in}) \right\|^2 \right\rangle$$
<br>
where $F_{w}(y^{in})$ is the output of the NN and $F(y^{in})$ is the correct value. The angle brackets denote the average over all training samples, and the squared norm measures the deviation for each sample.
* Optimizing the NN corresponds to minimizing the Cost Function.
* Two algorithms work together to adjust the weights and biases: Stochastic Gradient Descent, which updates the parameters, and Backpropagation, which computes the gradients needed for the update.
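
As a concrete reading of the cost function, here is a minimal sketch assuming NumPy. The function name `cost` and the arrays `nn_output` and `target` are hypothetical placeholders for $F_{w}(y^{in})$ and $F(y^{in})$ evaluated on a batch, with one sample per row.

```python
import numpy as np

def cost(nn_output, target):
    """C = 1/2 * average over samples of ||nn_output - target||^2."""
    sq_norm = np.sum((nn_output - target) ** 2, axis=1)  # one value per sample
    return 0.5 * np.mean(sq_norm)

nn_output = np.array([[1.0, 2.0],
                      [0.0, 0.0]])
target = np.array([[1.0, 0.0],
                   [0.0, 2.0]])

C = cost(nn_output, target)
print(C)
```

A perfect fit gives `cost(target, target) == 0`, and any deviation increases the value.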

### 1. Stochastic Gradient Descent

* To decrease the value of the cost function, we differentiate it with respect to the weights and move the weights in the direction of the negative gradient, which is the direction in which the error decreases fastest. So

$$\Delta w \sim -\nabla_{w}C(w)$$

* Problem: Evaluating C would mean averaging over all training samples.
* Solution: Average over a few samples, approximate C.
* Discrete Steps: For each step evaluate a few samples and update weights according to 

$$w_{j} \rightarrow w_{j} - \eta \frac{\partial \tilde C}{\partial w_{j}}$$

$\eta$: step size parameter (learning rate), $\tilde C$: approximate version of $C$. In each step, different samples are taken.
* For sufficiently small steps, the sum over many steps approximates the step along the true gradient.

* For a single neuron with output $f(z)$ and target value $F$, the derivative of the cost function is

$$\frac{\partial C}{\partial w_{i}} = \left\langle (f(z)-F)\,f'(z)\frac{\partial z}{\partial w_{i}} \right\rangle$$

where $z = \sum_{k=1}^{N} w_{k}y_{k} + b$, so that $\frac{\partial z}{\partial w_{i}}=y_{i}$.
<br>
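
The update rule and the single-neuron gradient can be put together in a short sketch. It assumes NumPy, a sigmoid activation (so $f'(z) = f(1-f)$), and a synthetic training set generated by a hypothetical "true" neuron (`w_true`, `b_true`); the values of `eta`, `batch_size`, and the number of steps are arbitrary illustration choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, b, y):
    """Output f(z) of one sigmoid neuron for a batch y of shape (N_samples, N_in)."""
    return sigmoid(y @ w + b)

def cost(w, b, y, F):
    """C = 1/2 <(f(z) - F)^2>, averaged over the samples in the batch."""
    return 0.5 * np.mean((neuron(w, b, y) - F) ** 2)

def gradients(w, b, y, F):
    """dC/dw_i = <(f(z) - F) f'(z) y_i> and dC/db = <(f(z) - F) f'(z)>."""
    f = neuron(w, b, y)
    delta = (f - F) * f * (1.0 - f)  # sigmoid: f'(z) = f (1 - f)
    return y.T @ delta / len(F), np.mean(delta)

rng = np.random.default_rng(0)
N_in = 3

# Synthetic data: targets produced by a hypothetical "true" neuron.
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3
y_all = rng.normal(size=(1000, N_in))
F_all = neuron(w_true, b_true, y_all)

w, b = np.zeros(N_in), 0.0
eta, batch_size = 1.0, 20
cost_before = cost(w, b, y_all, F_all)

for step in range(2000):
    # A few fresh samples per step approximate the full average.
    idx = rng.integers(0, len(y_all), size=batch_size)
    grad_w, grad_b = gradients(w, b, y_all[idx], F_all[idx])
    w -= eta * grad_w
    b -= eta * grad_b

cost_after = cost(w, b, y_all, F_all)
print(cost_before, cost_after)
```

Each step only sees `batch_size` samples, so individual updates are noisy; the cost on the full data set should still go down over many steps.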


### 2. Backpropagation

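* Careful with the indices: $y_{j}^{n}$ is the output of neuron $j$ in layer $n$ (here the output layer), $z_{j}^{n}$ is its input before the activation function $f$, and $w_{*}$ stands for any weight in the network. For a single sample,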

$$\frac{\partial C(w,y^{in})}{\partial w_{*}} = \sum_{j}(y_{j}^{n}-F_{j}(y^{in}))\frac{\partial y_{j}^{n}}{\partial w_{*}} = \sum_{j}(y_{j}^{n}-F_{j}(y^{in}))f'(z_{j}^{n})\frac{\partial z_{j}^{n}}{\partial w_{*}}$$

* Applying the chain rule repeatedly,

$$\frac{\partial z_{j}^{n}}{\partial w_{*}} = \sum_{k}\frac{\partial z_{j}^{n}}{\partial y_{k}^{n-1}}\frac{\partial y_{k}^{n-1}}{\partial w_{*}} $$

$$= \sum_{k} w_{j,k}^{n,n-1} f'(z_{k}^{n-1})\frac{\partial z_{k}^{n-1}}{\partial w_{*}}$$

* The product of the first two factors on the right-hand side defines a matrix

$$M_{j,k}^{n,n-1} = w_{j,k}^{n,n-1} f'(z_{k}^{n-1})$$

* By continuing the chain rule down through the layers, the problem ends up being a repeated matrix multiplication of weights and activation-function derivatives.
* For the bias of neuron $j$ in layer $n$,

$$\frac{\partial z_{j}^{n}}{\partial b_{j}^{n}} = 1$$
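
The chain-rule recursion above can be sketched for a small fully connected network. This is an illustrative sketch, not a reference implementation: it assumes NumPy, sigmoid activations in every layer, the cost averaged over the batch, and weight matrices stored with shape $(N_{in}, N_{out})$ as in $Z = dot(Y,W) + b$, so `W[k, j]` plays the role of $w_{j,k}^{n,n-1}$. The layer sizes and data are arbitrary, and a finite-difference estimate is used as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, biases, y_in):
    """Return the outputs y of every layer (input included) for a batch y_in."""
    activations = [y_in]
    for W, b in zip(weights, biases):
        activations.append(sigmoid(activations[-1] @ W + b))
    return activations

def cost(weights, biases, y_in, F):
    y_out = forward(weights, biases, y_in)[-1]
    return 0.5 * np.mean(np.sum((y_out - F) ** 2, axis=1))

def backprop(weights, biases, y_in, F):
    """Gradients of the cost with respect to every weight matrix and bias vector."""
    acts = forward(weights, biases, y_in)
    n_samples = y_in.shape[0]
    # Output layer: (y_j^n - F_j) f'(z_j^n); for the sigmoid, f' = f (1 - f).
    delta = (acts[-1] - F) * acts[-1] * (1.0 - acts[-1])
    grads_W, grads_b = [], []
    for layer in reversed(range(len(weights))):
        grads_W.append(acts[layer].T @ delta / n_samples)  # dz/dw = y of layer below
        grads_b.append(delta.mean(axis=0))                 # dz/db = 1
        if layer > 0:
            # One step down: multiply by the matrix M = w^{n,n-1} f'(z^{n-1}).
            delta = (delta @ weights[layer].T) * acts[layer] * (1.0 - acts[layer])
    return grads_W[::-1], grads_b[::-1]

rng = np.random.default_rng(1)
sizes = [3, 4, 2]  # input, hidden, output neurons (arbitrary)
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=n) for n in sizes[1:]]
y_in = rng.normal(size=(8, sizes[0]))
F = rng.uniform(size=(8, sizes[-1]))

grads_W, grads_b = backprop(weights, biases, y_in, F)

# Finite-difference check on one weight of the first layer.
eps = 1e-6
shifted = [W.copy() for W in weights]
shifted[0][1, 2] += eps
numeric = (cost(shifted, biases, y_in, F) - cost(weights, biases, y_in, F)) / eps
print(grads_W[0][1, 2], numeric)
```

The two printed numbers should agree to several digits; a mismatch usually points to an index or transpose error in the backward pass.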

## Implementation