# Gradient Descent

After designing a neural network, the next step is to set its parameters.
This process is often called model fitting or parameter estimation.
In statistics, the parameters of a model are commonly estimated by minimizing the empirical risk. This is appropriate only when the model is simple or when the training dataset represents the whole population well, so that the empirical risk is close to the true risk. Otherwise, simply choosing the parameters that minimize the empirical risk can lead to overfitting.

Let $x$ denote a training input (one sample), which is a $d$-dimensional vector, and let $y$ denote the desired output of the network, which is a $k$-dimensional vector. Using the quadratic loss function, the empirical risk function is as follows:

$C(w,b)=\frac{1}{n}\sum_{x}L_{x}$, where $L_{x}=\frac{\|y-\hat{y}\|^2}{2}$, $\hat{y}$ is the network's output for the input $x$, and $n$ is the number of training inputs.
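As a minimal sketch of this formula, the snippet below evaluates the empirical risk on a handful of made-up targets and predictions. The data and the function names `quadratic_loss` and `empirical_risk` are assumptions for illustration only, and the squared difference is interpreted as the squared Euclidean norm because the outputs are vectors.

```python
def quadratic_loss(y, y_hat):
    """Loss for one training input: L_x = ||y - y_hat||^2 / 2 (vectors as lists)."""
    return 0.5 * sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))


def empirical_risk(targets, predictions):
    """Empirical risk: the average of L_x over all n training inputs."""
    n = len(targets)
    return sum(quadratic_loss(y, p) for y, p in zip(targets, predictions)) / n


# Hypothetical toy data: three training inputs with 2-dimensional outputs.
targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
predictions = [[0.5, 0.0], [0.0, 1.0], [0.0, 1.0]]

risk = empirical_risk(targets, predictions)
print(risk)
```

In a real network the predictions would come from a forward pass that depends on $w$ and $b$; here they are fixed numbers so that only the averaging of the per-input losses is shown.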

Since $C$ is a function of the parameters $w_{1},...,w_{p},b_{1},...,b_{l}$, which we collect into a single vector $\theta$, a small change $\Delta\theta$ at a given point $\theta_{t}$ changes $C$ by approximately

$\Delta C\approx \frac{\partial C}{\partial w_{1}}\Delta w_{1}+...+\frac{\partial C}{\partial w_{p}}\Delta w_{p}+\frac{\partial C}{\partial b_{1}}\Delta b_{1}+...+\frac{\partial C}{\partial b_{l}}\Delta b_{l}$

Writing $\nabla C$ for the vector of these partial derivatives, this can be rewritten as

$\Delta C\approx\nabla C\cdot\Delta\theta$

Choose $\Delta\theta=-\eta\nabla C$, where $\eta>0$ is the learning rate. With this choice, $\Delta\theta$ points in the direction in which $C$ decreases fastest, perpendicular to the contour lines, and its length is scaled by $\eta$. Substituting this choice, the above equation becomes

$\Delta C\approx-\eta\|\nabla C\|^2$

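Since $\|\nabla C\|^2$ is always greater than or equal to 0 and $\eta>0$, we have $\Delta C\le 0$. This choice of $\Delta\theta$ therefore never increases $C$, provided $\eta$ is small enough for the approximation to hold.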
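The approximation can be checked numerically. The sketch below assumes a hypothetical one-input linear model $\hat{y}=wx+b$ with made-up data (neither is part of the text above) and compares the predicted change $-\eta\|\nabla C\|^2$ with the change in $C$ actually observed after one small step.

```python
# Hypothetical toy data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def cost(w, b):
    """Empirical risk with quadratic loss for the model y_hat = w*x + b."""
    n = len(xs)
    return sum(0.5 * (y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


def grad(w, b):
    """Partial derivatives of the cost with respect to w and b."""
    n = len(xs)
    dw = sum(((w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(((w * x + b) - y) for x, y in zip(xs, ys)) / n
    return dw, db


w, b = 0.0, 0.0
eta = 0.01  # small step so the first-order approximation is reasonable

dw, db = grad(w, b)
predicted_change = -eta * (dw ** 2 + db ** 2)               # -eta * ||grad C||^2
actual_change = cost(w - eta * dw, b - eta * db) - cost(w, b)

print(predicted_change, actual_change)
```

Both numbers are negative and close to each other. The gap between them comes from the higher-order terms that the approximation drops, and it grows as $\eta$ is made larger.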

Then the updating equation is as follows:

$\theta_{t+1}=\theta_{t}-\eta\nabla C$

To recap, the gradient descent algorithm repeatedly computes $\nabla C$ at the current point $\theta_{t}$ and then moves in the opposite direction.
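The loop can be sketched in a few lines. The example below assumes a hypothetical linear model $\hat{y}=wx+b$ in place of a neural network, with made-up data, a learning rate, and a step count chosen only for illustration.

```python
# Hypothetical toy data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def grad(w, b):
    """Gradient of the empirical risk (quadratic loss) for y_hat = w*x + b."""
    n = len(xs)
    dw = sum(((w * x + b) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(((w * x + b) - y) for x, y in zip(xs, ys)) / n
    return dw, db


def gradient_descent(w, b, eta, steps):
    """Repeat the update theta_{t+1} = theta_t - eta * grad C(theta_t)."""
    for _ in range(steps):
        dw, db = grad(w, b)
        w, b = w - eta * dw, b - eta * db
    return w, b


w, b = gradient_descent(0.0, 0.0, eta=0.1, steps=2000)
print(w, b)  # approaches the generating values w = 2, b = 1
```

For this convex toy problem the iterates converge to the minimizer. For a neural network the cost is generally not convex, so the same loop is only guaranteed to move downhill locally.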

# Stochastic Gradient Descent

There are a number of challenges in applying the gradient descent rule to neural networks. One of them is that we have to compute the gradient vector $\nabla C$ at every step of the gradient descent algorithm.

Notice that the empirical risk function has the form $C=\frac{1}{n}\sum_{x}L_{x}$, that is, it is an average over the losses of the individual training inputs. In practice, at every step of the gradient descent algorithm, we have to compute $\nabla L_{x}$ separately for each training input and then average the results: $\nabla C=\frac{1}{n}\sum_{x}\nabla L_{x}$. When the number of training inputs is huge, this takes a long time.
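To make the cost of one step concrete, the sketch below computes the full gradient by visiting every training input once. It assumes a hypothetical linear model $\hat{y}=wx+b$ with made-up data; the names `per_example_grad` and `full_gradient` are illustrative.

```python
def per_example_grad(w, b, x, y):
    """Gradient of L_x = (y - (w*x + b))^2 / 2 with respect to (w, b)."""
    residual = (w * x + b) - y
    return residual * x, residual


def full_gradient(w, b, xs, ys):
    """Average the per-example gradients over all n training inputs."""
    n = len(xs)
    sum_dw, sum_db = 0.0, 0.0
    for x, y in zip(xs, ys):  # one gradient evaluation per training input
        dw, db = per_example_grad(w, b, x, y)
        sum_dw += dw
        sum_db += db
    return sum_dw / n, sum_db / n


# Hypothetical toy data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

dw, db = full_gradient(0.0, 0.0, xs, ys)
print(dw, db)
```

The loop body runs $n$ times for a single parameter update, so the work per step grows linearly with the size of the training set.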

The key idea of the stochastic gradient descent algorithm is to estimate $\nabla C$ using only a small sample of randomly chosen training inputs. We refer to this sample as a mini-batch. Hence,

$\nabla C=\frac{\sum_{i=1}^n \nabla L_{X_i}}{n}\approx\frac{\sum_{j=1}^m \nabla L_{X_j}}{m}$, where the second sum runs over the $m$ inputs of the mini-batch and $m\ll n$.

Therefore, the updating equation is as follows:

$\theta_{t+1} = \theta_{t} -\eta\frac{\sum_{j=1}^m \nabla L_{X_j}}{m}$

Because the algorithm estimates $\nabla C$ anew at every step, we pick a different randomly chosen mini-batch at each step until all training inputs have been used, which completes an epoch of training.
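One common way to realize this is to shuffle the training inputs at the start of each epoch and walk through them in consecutive mini-batches. The sketch below follows that scheme under several assumptions that are not part of the text above: a hypothetical linear model $\hat{y}=wx+b$, noiseless made-up data, and illustrative values for the learning rate, batch size, and number of epochs.

```python
import random


def per_example_grad(w, b, x, y):
    """Gradient of L_x = (y - (w*x + b))^2 / 2 with respect to (w, b)."""
    residual = (w * x + b) - y
    return residual * x, residual


def sgd(xs, ys, w, b, eta, batch_size, epochs, seed=0):
    """Mini-batch SGD: each epoch visits every training input exactly once."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random mini-batches in every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            m = len(batch)
            sum_dw, sum_db = 0.0, 0.0
            for i in batch:
                dw, db = per_example_grad(w, b, xs[i], ys[i])
                sum_dw += dw
                sum_db += db
            # Update with the mini-batch estimate of grad C.
            w, b = w - eta * sum_dw / m, b - eta * sum_db / m
    return w, b


# Hypothetical toy data: 100 inputs in [0, 1) generated by y = 2x + 1.
xs = [i / 100 for i in range(100)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = sgd(xs, ys, w=0.0, b=0.0, eta=0.5, batch_size=10, epochs=200)
print(w, b)  # approaches the generating values w = 2, b = 1
```

Each update here touches only 10 inputs instead of all 100. Because the toy data contain no noise, the estimates still drive the parameters to the exact minimizer; with noisy data the iterates would instead fluctuate around it.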