# PyTorch beginner course: Optimizer

## Summary

- [Simple training](#simple-training)
- [Optimizer](#optimizer)
    - [Gradient Descent Algorithm](#gradient-descent-algorithm)
- [Optimizer in PyTorch](#optimizer-in-pytorch)
- [Glossary of the used tools](#glossary-of-the-used-tools)
    - [Methods](#methods)
- [References](#references)
- [Author](#author)

## Simple training

Before introducing the concept of backpropagation, we will implement a simple piece of code that simulates a training scenario using only the knowledge acquired so far in the course.

The best way to understand the concept is to proceed step by step, so this first version of the code implements only the main parts of a trainer.

In [125]:
import torch

# Tensor that contains our weights
weights = torch.ones(4, dtype=torch.float64, requires_grad=True)

# Define a toy model: a scalar function of the weights
model = (weights*3).sum()

# Call backward() on the model output to compute the gradients
model.backward()

# Print the gradient of the weights
print(weights.grad)

tensor([3., 3., 3., 3.], dtype=torch.float64)


As we can see, the output is a tensor whose entries are all equal to $3$. This is expected: the model is $\sum_i 3 w_i$, so its partial derivative with respect to each weight is $3$.
Now let us see what happens if we repeat this computation for several iterations, for instance $3$ epochs.

In [126]:
# Tensor that contains our weights
weights = torch.ones(4, dtype=torch.float64, requires_grad=True)

for epoch in range(3):
    model = (weights*3).sum()
    model.backward()
    print(weights.grad)

tensor([3., 3., 3., 3.], dtype=torch.float64)
tensor([6., 6., 6., 6.], dtype=torch.float64)
tensor([9., 9., 9., 9.], dtype=torch.float64)


As we can see, the gradients accumulate at each iteration: every call to `backward()` adds the new gradient to the values already stored in `weights.grad`, so we get clearly wrong gradient values. To fix this, we must reset the gradient to zero at each epoch.

In [127]:
# Tensor that contains our weights
weights = torch.ones(4, dtype=torch.float64, requires_grad=True)

for epoch in range(3):
    model = (weights*3).sum()
    model.backward()
    print(weights.grad)

    # Reset the gradient values to zero before the next epoch
    weights.grad.zero_()

tensor([3., 3., 3., 3.], dtype=torch.float64)
tensor([3., 3., 3., 3.], dtype=torch.float64)
tensor([3., 3., 3., 3.], dtype=torch.float64)


Now the gradients are computed correctly. Remember to pay attention to this detail, otherwise the accumulated gradients will lead to a badly trained model.
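
So far we have only computed gradients, and the weights never change. As a minimal sketch of what a complete training step could look like, the loop below also updates the weights by hand. The value `learning_rate = 0.1` and the update inside `torch.no_grad()` are assumptions added for illustration and are not part of the cells above.

```python
import torch

# Same toy setup as above: four weights, all initialised to 1
weights = torch.ones(4, dtype=torch.float64, requires_grad=True)

# Hypothetical learning rate, chosen only for illustration
learning_rate = 0.1

for epoch in range(3):
    # Forward pass: scalar output of the toy model
    model = (weights * 3).sum()

    # Backward pass: fills weights.grad with d(model)/d(weights) = 3
    model.backward()

    # Update the weights without recording the operation in the graph
    with torch.no_grad():
        weights -= learning_rate * weights.grad

    # Reset the gradient so the next epoch starts from zero
    weights.grad.zero_()

    print(epoch + 1, weights)
```

Since the gradient is always $3$, each weight decreases by $0.1 \cdot 3 = 0.3$ per epoch. This toy model is linear and has no minimum, so the loop only illustrates the mechanics of one update per epoch.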

## Optimizer

One of the most important tools in machine learning is the **optimizer**. Our goal is to *minimize the loss function*, and to do this we can use specific algorithms based on the well-known ***gradient descent algorithm***. We can use this kind of algorithm because our loss function is differentiable.

An optimizer is usually an iterative algorithm based on *gradient descent*: its aim is to update the parameters of our model so that the loss decreases.

The best-known optimizers used in machine learning are:

* *Stochastic Gradient Descent (SGD)*
* *Adaptive Moment Estimation (Adam)*

### Gradient Descent Algorithm

Gradient descent is the most important iterative algorithm for finding a (local) minimum of a function. Starting from a point $x_k$, each iteration moves in the direction opposite to the gradient:

$$
x_{k+1} = x_k - \alpha \nabla f(x_k)
$$

⚠️: The quantity $\alpha$ is called the ***learning rate***, and its value has a huge influence on the final result: if it is too small convergence is slow, if it is too large the iterations may diverge.
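
To make the role of $\alpha$ concrete, here is a minimal sketch in plain Python. The objective `f`, its derivative `grad_f`, the starting point and the three learning rates are all hypothetical choices made for this illustration.

```python
def f(x):
    # Hypothetical objective with its minimum at x = 3
    return (x - 3.0) ** 2


def grad_f(x):
    # Analytical derivative of f
    return 2.0 * (x - 3.0)


def gradient_descent(x0, alpha, steps):
    # Repeatedly apply x_{k+1} = x_k - alpha * grad_f(x_k)
    x = x0
    for _ in range(steps):
        x = x - alpha * grad_f(x)
    return x


# Compare a small, a moderate and a too-large learning rate
for alpha in (0.01, 0.1, 1.1):
    print(alpha, gradient_descent(x0=0.0, alpha=alpha, steps=50))
```

With $\alpha = 0.1$ the iterates get very close to $3$ within 50 steps, with $\alpha = 0.01$ they are still far from it, and with $\alpha = 1.1$ they move further away at every step.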

Instead of keeping $\alpha$ fixed, the ***backtracking*** approach chooses a suitable value $\alpha_k$ at each iteration.

At the beginning of iteration $k$ we set:

$$
\alpha_{k} = 1
$$

Next we compute the candidate point $X$:

$$
X = x_k - \nabla f(x_k)
$$

Now we must check $X$. If the sufficient-decrease condition holds:

$$
f(X) \leq f(x_k)-\frac{\alpha_k}{3} \|\nabla f(x_k)\|^2
$$

then:

$$
x_{k+1} = X
$$

Otherwise we backtrack: we stay at the same index $k$ and retry with a smaller value of $\alpha_k$ *(learning rate)*, hence:

$$
\alpha_k = 0.5
$$

$$
X = x_k - 0.5 \nabla f(x_k)
$$

and so on, until the condition is satisfied.
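
The procedure above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the scalar objective `f` is hypothetical, the step is halved at every rejection (the text only shows the first reduction from $1$ to $0.5$), and the check uses the standard sufficient-decrease condition with the squared gradient norm and constant $\frac{1}{3}$.

```python
def f(x):
    # Hypothetical objective with its minimum at x = 3
    return (x - 3.0) ** 2


def grad_f(x):
    # Analytical derivative of f
    return 2.0 * (x - 3.0)


def backtracking_step(x_k, c=1.0 / 3.0, shrink=0.5):
    g = grad_f(x_k)
    alpha = 1.0  # initial trial step, as in the text
    candidate = x_k - alpha * g
    # Shrink alpha until f(X) <= f(x_k) - c * alpha * |g|^2 holds
    while f(candidate) > f(x_k) - c * alpha * g ** 2:
        alpha *= shrink
        candidate = x_k - alpha * g
    return candidate, alpha


x = 0.0
for k in range(5):
    x, alpha = backtracking_step(x)
    print(k, alpha, x)
```

Starting from $x_0 = 0$, the trial step $\alpha = 1$ is rejected and $\alpha = 0.5$ is accepted, which for this particular function lands exactly on the minimum.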

## Optimizer in PyTorch

PyTorch provides a dedicated module, `torch.optim`, that contains all the built-in optimizers. Below is a simple example that uses the SGD optimizer.

In [128]:
import torch

# Set the seed of the pseudo-random number generator
torch.manual_seed(10)

# Dataset: the targets follow y = 2x
x_train = torch.randn(100,1)
y_train = 2 * x_train

# Model definition
model = torch.nn.Linear(1,1)

# SGD optimizer with a learning rate of 0.001
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(100):
    # Put the model in training mode
    model.train()

    # Forward pass: predict on the training set
    y_pred = model(x_train)

    # Compute the loss between predictions and targets
    loss = torch.nn.functional.mse_loss(y_pred, y_train)

    # Reset the gradients accumulated in the previous iteration
    optimizer.zero_grad()

    # Backward pass: compute the gradients of the loss
    loss.backward()

    # Update the parameters using the new gradients
    optimizer.step()

    # Print the epoch and the loss every 10 epochs
    if (epoch+1) % 10 == 0:
        print(f'Epoch: {epoch+1}\tLoss: {loss.item()}')

Epoch: 10	Loss: 4.361312389373779
Epoch: 20	Loss: 4.1808762550354
Epoch: 30	Loss: 4.007955551147461
Epoch: 40	Loss: 3.842235803604126
Epoch: 50	Loss: 3.683413028717041
Epoch: 60	Loss: 3.531200647354126
Epoch: 70	Loss: 3.3853204250335693
Epoch: 80	Loss: 3.24550724029541
Epoch: 90	Loss: 3.1115076541900635
Epoch: 100	Loss: 2.983078718185425
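
To connect `optimizer.step()` with the update rule $x_{k+1} = x_k - \alpha \nabla f(x_k)$ from the Gradient Descent Algorithm section, the sketch below computes the update by hand and compares it with what the optimizer does. It assumes plain SGD as configured above (only `lr` is set); the variable names `lr` and `expected` are introduced here for illustration.

```python
import torch

torch.manual_seed(10)

# Toy data with the same shape as in the example above
x_train = torch.randn(100, 1)
y_train = 2 * x_train

model = torch.nn.Linear(1, 1)
lr = 0.001
optimizer = torch.optim.SGD(model.parameters(), lr=lr)

# Compute the gradients once
loss = torch.nn.functional.mse_loss(model(x_train), y_train)
optimizer.zero_grad()
loss.backward()

# Manual application of x_{k+1} = x_k - alpha * grad f(x_k)
expected = [p.detach() - lr * p.grad for p in model.parameters()]

# Let the optimizer perform the same update
optimizer.step()

# Check that every parameter matches the manual result
for p, e in zip(model.parameters(), expected):
    print(torch.allclose(p.detach(), e))
```
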


> ⚠: We will study the `torch.nn` module in more detail in the next lectures.
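
Every optimizer in `torch.optim` exposes the same `zero_grad()` and `step()` methods, so the training loop can be reused with a different optimizer. As a hedged sketch, the code below swaps SGD for Adam, the other optimizer mentioned in the Optimizer section. The learning rate `lr=0.01` and the `losses` list are illustrative assumptions, not values taken from the example above.

```python
import torch

torch.manual_seed(10)

# Same toy dataset as in the SGD example
x_train = torch.randn(100, 1)
y_train = 2 * x_train

model = torch.nn.Linear(1, 1)

# Only this line differs from the SGD example; lr=0.01 is an illustrative choice
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(100):
    # Forward pass and loss
    y_pred = model(x_train)
    loss = torch.nn.functional.mse_loss(y_pred, y_train)

    # Reset gradients, backpropagate, update the parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    losses.append(loss.item())

print(f'First loss: {losses[0]}\tLast loss: {losses[-1]}')
```
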

## Glossary of the used tools

### Methods

- `torch.manual_seed()`
- `torch.optim.SGD()`
- `torch.optim.SGD.step()`
- `torch.optim.SGD.zero_grad()`
- `torch.nn.Linear()`

## References

[PyTorch documentation](https://pytorch.org/docs/stable/index.html)

## Author

Emilio Garzia, 2024

[GitHub](https://github.com/EmilioGarzia)

[LinkedIn](https://www.linkedin.com/in/emilio-garzia-58a934294/)