# Optimization and Gradient Descent

Let's learn about the fundamental algorithm behind machine learning training: gradient descent.

We will cover the following:

- Slope: the derivative of the loss function

- Computing the gradient of a loss function

- Summing up the training scheme

In our 2D example, the loss function can be thought of as a parabolic-shaped function that reaches its minimum at a certain pair of weights $w_{1}$ and $w_{2}$. Visually, we have:


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig04.PNG)

To find these weights, the core idea is simply to follow the slope of the curve. Although we don’t know the actual shape of the loss, we can calculate the slope at a point and then move in the downhill direction.

> You can think of the loss function as a mountain. The current loss gives us information about the local slope.

But what is the slope?

# Slope: the derivative of the loss function

In calculus, the slope is the derivative of the function at a given point. For our loss $C$ and a weight $w$, it is denoted as $\frac{\delta C}{\delta w}$. The ultimate goal is to find the global minimum. At a minimum, local or global, the derivative is zero, so a nearly zero derivative indicates that we are close to a minimum of the curve.

For now, suppose that we want to minimize the loss function $C$. By calculating the derivative, we will take small steps along the slope in an iterative fashion. In this way, we can gradually reach the minimum of the curve.
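To make the iteration concrete, here is a minimal sketch in plain Python. The loss $(w - 3)^2$, the starting point, and the step size are all assumptions picked for illustration; they do not come from the lesson's example.

```python
# Hypothetical 1D loss with its minimum at w = 3 (chosen for illustration).
def loss(w):
    return (w - 3.0) ** 2


def slope(w):
    # Derivative of (w - 3)^2 with respect to w.
    return 2.0 * (w - 3.0)


w = 0.0          # arbitrary starting point
step_size = 0.1  # assumed step size

for step in range(50):
    # Take a small step against the slope, i.e. downhill.
    w = w - step_size * slope(w)

print(f"w = {w:.4f}, loss = {loss(w):.2e}")
```

Each step shrinks the distance to the minimum, so `w` ends up very close to 3 even though the loop never looks at the overall shape of the curve.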

The same principle can be extended to many dimensions $N$. Although this is very difficult to visualize, the math is here to help us.


![pic](https://raw.githubusercontent.com/CUTe-EmbeddedAI/images/main/images/fig05.PNG)

Keep in mind that the minimum we reach is not always the global minimum: following the slope can also lead us to a local one.
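A small sketch can show why the starting point matters. The function below is a hypothetical non-convex loss with two valleys, chosen only for illustration; the step size and the number of steps are assumptions as well.

```python
# Hypothetical loss with two valleys: one local and one global minimum.
def loss(w):
    return w**4 - 3.0 * w**2 + w


def slope(w):
    # Derivative of the loss above.
    return 4.0 * w**3 - 6.0 * w + 1.0


def descend(w, step_size=0.01, steps=500):
    # Repeatedly step against the slope, starting from w.
    for _ in range(steps):
        w = w - step_size * slope(w)
    return w


w_right = descend(2.0)   # start on the right side of the curve
w_left = descend(-2.0)   # start on the left side of the curve

print(f"from  2.0: w = {w_right:.3f}, loss = {loss(w_right):.3f}")
print(f"from -2.0: w = {w_left:.3f}, loss = {loss(w_left):.3f}")
```

Both runs stop where the slope is practically zero, but they stop in different valleys, and only the run that starts on the left reaches the lower one.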

# Computing the gradient of a loss function

The question is: how do we compute the derivative (or gradient) with respect to the weights? In simple cases, such as the two-dimensional one, we can compute the analytical form with calculus.

Since our loss function for a single example is $C = (f(x_{i},W) - y_{i})^{2}$, where the classifier $f$ is $f = w_{1}x + w_{2}$, we can show with the chain rule that (dropping the index $i$ for readability):

$\frac{\delta C}{\delta w_{1}} = 2(w_{1}x + w_{2} - y)x$

$\frac{\delta C}{\delta w_{2}} = 2(w_{1}x + w_{2} - y)$

These are nothing more than the partial derivatives of the loss with respect to our two weights. In complex cases, such as neural networks, the chain rule will come to the rescue.
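One way to convince ourselves that the two formulas are right is to compare them with a numerical estimate. The sketch below does this with central finite differences; the sample values for the weights, `x`, and `y` are arbitrary assumptions.

```python
def loss(w1, w2, x, y):
    # Squared error of the linear model f = w1 * x + w2 on one example.
    return (w1 * x + w2 - y) ** 2


def analytic_gradient(w1, w2, x, y):
    # The two partial derivatives from the formulas above.
    error = w1 * x + w2 - y
    return 2.0 * error * x, 2.0 * error


def numeric_gradient(w1, w2, x, y, eps=1e-6):
    # Central finite differences: nudge one weight at a time.
    d_w1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
    d_w2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)
    return d_w1, d_w2


# Arbitrary sample values, chosen only for this check.
w1, w2, x, y = 0.5, -1.0, 2.0, 3.0

analytic = analytic_gradient(w1, w2, x, y)
numeric = numeric_gradient(w1, w2, x, y)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two results should agree up to small floating-point differences.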

Now that we have our gradients, let’s adjust our weights to go downhill:

>$w_{1}^\ast = w_{1} - \lambda \frac{\delta C}{\delta w_{1}}$

>$w_{2}^\ast = w_{2} - \lambda \frac{\delta C}{\delta w_{2}}$

where $\lambda$ is a small constant called the **learning rate**. The learning rate $\lambda$ is often chosen between $10^{-3}$ and $10^{-6}$ and defines how quickly we move in the direction opposite to the gradient.

The negative sign intuitively means that we are going downhill! We follow the negative slope of the curve.
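As a quick illustration, the sketch below applies one such update to the two weights of $f = w_{1}x + w_{2}$ for a single example. The starting weights, the example $(x, y)$, and the learning rate of $10^{-3}$ are assumed values picked for the demonstration.

```python
def loss(w1, w2, x, y):
    # Squared error of f = w1 * x + w2 on one example.
    return (w1 * x + w2 - y) ** 2


# Assumed values for illustration.
w1, w2 = 0.5, -1.0
x, y = 2.0, 3.0
learning_rate = 1e-3

old_loss = loss(w1, w2, x, y)

# Partial derivatives of the loss with respect to each weight.
error = w1 * x + w2 - y
grad_w1 = 2.0 * error * x
grad_w2 = 2.0 * error

# Step against the gradient (note the minus sign).
new_w1 = w1 - learning_rate * grad_w1
new_w2 = w2 - learning_rate * grad_w2

new_loss = loss(new_w1, new_w2, x, y)
print(f"loss before: {old_loss:.4f}, loss after: {new_loss:.4f}")
```

With a step this small the loss drops only slightly, which is why the update is repeated many times during training.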

That’s all? Yes and no.

Yes, because this principle will come in handy all the time. No, because we will not calculate the derivatives by hand for every single neural network that we will use.

Don’t worry! We will analyze many more aspects of optimization, as it is the heart of machine learning.

Ok, we found the gradient! How do we change the parameter?

This is the so-called **update rule**:

> The update rule for iteration $j$ of a scalar weight $w$ is as follows:

> $w_{j}^\ast = w_{j} - \lambda \frac{\delta C}{\delta w_{j}}$

> The index $j$ shows the iteration step.

# Summing up the training scheme

To recap, the training algorithm, known as gradient descent, can be formulated like this for the N-dimensional case:

- Initialize the classifier $f(x_{i} ,W)$ with random weights $W$.


- Feed a training example $x_{i}$ (vector) with corresponding target vector $t_{i}$ into the classifier, and compute the output $y_{i} = f(x_{i} ,W)$.


- Compute the loss between the prediction $y_{i}$ and the target vector $t_{i}$. One example is the squared error loss $C = \sum(y_{i} - t_{i})^2$; dividing by the number of summed terms gives the mean squared error.


- Compute the gradients for the loss with respect to the weights/parameters.


- Adjust the weights $W$ based on the rule $w_{i}^\ast = w_{i} - \lambda \frac{\delta C}{\delta w_{i}}$. Note that $\frac{\delta C}{\delta w_{i}}$ is the gradient of the loss with respect to the parameter $w_{i}$ and $\lambda$ is the learning rate.


- Repeat for all training examples.
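Before turning to a framework, the steps above can be sketched in plain Python for the two-weight model $f = w_{1}x + w_{2}$. The toy dataset (targets generated from $t = 2x + 1$), the random seed, the learning rate, and the number of passes over the data are all assumptions made for this sketch.

```python
import random

# Toy dataset (assumed): targets follow t = 2x + 1 exactly.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ts = [2.0 * x + 1.0 for x in xs]

# Step 1: initialize the weights randomly (seeded for reproducibility).
rng = random.Random(0)
w1 = rng.uniform(-1.0, 1.0)
w2 = rng.uniform(-1.0, 1.0)
learning_rate = 0.05  # assumed value, larger than usual for a tiny problem


def f(x, w1, w2):
    return w1 * x + w2


for epoch in range(500):
    for x, t in zip(xs, ts):
        # Step 2: feed one example and compute the output.
        y = f(x, w1, w2)
        # Step 3: squared error loss for this example.
        loss = (y - t) ** 2
        # Step 4: gradients of the loss with respect to each weight.
        grad_w1 = 2.0 * (y - t) * x
        grad_w2 = 2.0 * (y - t)
        # Step 5: update rule.
        w1 -= learning_rate * grad_w1
        w2 -= learning_rate * grad_w2
    # Step 6: the inner loop repeats this for all training examples.

print(f"w1 = {w1:.4f}, w2 = {w2:.4f}")
```

Because the toy data is noise-free, the learned weights should approach the values used to generate it.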

In PyTorch, the entire algorithm can be developed with a few lines of code. In the following snippet, we have a simple linear model that is trained using gradient descent and the mean squared error loss. It accepts a four-element vector and outputs a two-element vector.

Feel free to play around with the following code by trying different inputs and inspecting the output and the gradient of the model. But don’t try to dive too deep into the code, as we will discuss it in detail in the upcoming lessons.

```python
import torch
import torch.nn as nn


def train():
    # Linear model: 4 input features, 2 outputs
    model = nn.Linear(4, 2)
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(10):
        # A single training example and its target, as float tensors
        inputs = torch.tensor([0.8, 0.4, 0.4, 0.2])
        labels = torch.tensor([1.0, 0.0])
        # Clear gradient buffers because we don't want any gradient
        # from the previous epoch to carry forward
        optimizer.zero_grad()
        # Get the output from the model, given the inputs
        outputs = model(inputs)
        # Get the loss for the predicted output
        loss = criterion(outputs, labels)
        print(loss.item())
        # Get the gradients w.r.t. the parameters
        loss.backward()
        # Update the parameters
        optimizer.step()


train()
```
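To inspect the gradient of the model, as suggested above, one option is to look at the `.grad` attribute of its parameters after calling `backward()`. The sketch below compares it with a hand-derived expression. It assumes the same 4-to-2 linear layer, inputs, and labels as the training snippet; the seed is an assumption added for reproducibility.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumed seed, only for reproducibility

model = nn.Linear(4, 2)
inputs = torch.tensor([0.8, 0.4, 0.4, 0.2])
labels = torch.tensor([1.0, 0.0])

outputs = model(inputs)
loss = nn.MSELoss()(outputs, labels)
loss.backward()  # fills .grad on the model's parameters

# MSELoss averages over the 2 outputs, so the derivative of the loss
# with respect to each output is 2 * (output - label) / 2 = output - label.
error = (outputs - labels).detach()
expected_weight_grad = torch.outer(error, inputs)
expected_bias_grad = error

print(model.weight.grad)
print(torch.allclose(model.weight.grad, expected_weight_grad))
print(torch.allclose(model.bias.grad, expected_bias_grad))
```

This is the same chain-rule reasoning as in the two-weight example, only applied to a weight matrix instead of two scalars.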