# The Best for Last: Understanding Optimization Functions!

You've just learned about **backpropagation**, which helps us calculate how much each weight in the neural network contributed to the error. Now, the big question is: **what do we do with this information?** This is where **optimization functions** come into play.

---
## What is an Optimization Function?

Optimization functions take the results from backpropagation and use them to adjust the network’s weights in a way that reduces the overall error. By carefully tweaking the weights, optimization functions guide the learning process and help the neural network make better predictions over time. The first step in this optimization process is understanding **Gradient Descent**.

### Gradient Descent

Gradient Descent is a fundamental optimization algorithm used to minimize the loss function. The key idea behind Gradient Descent is to use the gradient of the loss function to update the weights, moving in the direction that reduces the loss. This process is repeated iteratively until the loss stops decreasing, ideally at or near a minimum.

### Mathematical Explanation of Gradient Descent

In Gradient Descent, weights are updated iteratively using the following rule:

$$
w = w - \eta \cdot \nabla L(w)
$$

Where:

- **\( w \)** represents the current weights of the model.
- **\( \eta \)** is the learning rate, a small positive number that controls the size of the weight update step.
- **\( \nabla L(w) \)** is the gradient of the loss function \( L \) with respect to the weights \( w \). This gradient indicates the direction in which the weights need to be adjusted to minimize the loss.
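
To make the update rule concrete, here is a minimal sketch that applies it to a single weight. The toy loss \( L(w) = (w - 3)^2 \), the starting weight, the learning rate of 0.1, and the 50 steps are all assumptions chosen for illustration; they do not come from a real network.

```python
# Toy loss (an assumption for illustration): L(w) = (w - 3)^2, minimized at w = 3
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

eta = 0.1  # learning rate
w = 0.0    # arbitrary starting weight

for step in range(50):
    w = w - eta * grad(w)  # the update rule: w = w - eta * grad L(w)

print("final w:", w)
print("final loss:", loss(w))
```

Each pass through the loop moves `w` a little closer to 3, the value where this toy loss is smallest.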


Video explanation: https://www.youtube.com/watch?v=mdKjMPmcWjY (watch until 3:30)

---
For now, let's see where the gradient is computed and how it affects the weights:

In [None]:
import numpy as np


inputs = np.array([1.0, 2.0, 3.0, 4.0]) # one example with 4 features
target = np.array([2.0, 3.0, 4.0, 5.0])

W1 = np.array([[1.5, 1.3, 1.8, 1.1],
              [1.5, 1.3, 1.8, 1.1],
              [1.5, 1.3, 1.8, 1.1],
              [1.5, 1.3, 1.8, 1.1]]) # input-layer weights for 1 example (fixed values standing in for a random initialization)

prediction = inputs @ W1

loss = np.mean((prediction - target) ** 2) # mean squared error
print("loss", loss)

# gradient of the loss with respect to W1 (same shape as W1):
# dL/dW1[i, j] = (2 / n) * inputs[i] * (prediction[j] - target[j])
G1 = 2 * np.outer(inputs, prediction - target) / target.size
print("gradient", G1)


Now that we have a gradient, let's try to bring the loss **below 1.30**. Try your best!

In [None]:
# TODO: update the weights with the gradient (tune the learning rate to reach the target)
learning_rate = 0.0001

# recompute the gradient from the current weights so this cell can be re-run safely
prediction = inputs @ W1
G1 = 2 * np.outer(inputs, prediction - target) / target.size

New_W1 = W1 - learning_rate * G1 # subtract: step against the gradient

prediction = inputs @ New_W1
loss = np.mean((prediction - target) ** 2)
print("new loss", loss) # should be smaller than the previous loss

W1 = New_W1
# Don't hesitate to run the code multiple times to see the loss decreasing

---
As you can see, you didn't apply the gradient as is; instead, you adjusted the weights by a small fraction of it, and that fraction is the learning rate. But how is this process done automatically? This is where the concept of gradient descent comes into play.

PyTorch implements this optimization as follows:

In [None]:
import torch

learning_rate = 0.01
gradient_descent = torch.optim.SGD(params=..., lr=learning_rate) # params should be the weights of the model

## Automatic Adjustment in Gradient Descent

With Gradient Descent, the process of adjusting the weights is both simple and efficient. Here's how it works automatically:

1. **Following the Gradient's Lead**: The gradient points in the direction of the steepest increase in loss. By taking the negative gradient, we move in the direction that most rapidly decreases the loss. This adjustment is inherently built into the Gradient Descent update rule:

    $$
    w = w - \eta \cdot \nabla L(w)
    $$

    The negative sign ensures that the weights move in the opposite direction of the gradient, towards minimizing the loss.

2. **Dynamic Adjustment**: As the model learns, the gradient changes. A positive gradient means the loss increases when the weight is increased, so the weight is reduced. A negative gradient means the loss decreases when the weight is increased, so the weight is increased. The larger the magnitude of the gradient, the larger the adjustment.

3. **Role of the Learning Rate**: The learning rate (\( \eta \)) controls the step size of each adjustment. It helps fine-tune the balance between moving quickly toward lower loss values and making sure we don't overshoot the optimal point. A small learning rate results in small steps, making fine-tuned adjustments, while a larger learning rate allows for bigger changes.

4. **Automatic Decision-Making**: Gradient Descent doesn't need to explicitly decide whether to add to or subtract from the weights, because the sign of the gradient and the learning rate together determine the adjustment. The update rule ensures that each step is taken in the direction of decreasing loss. In PyTorch, this update is performed by the optimizer's **step()** method.
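
The role of the learning rate described above can be sketched on a one-weight problem. The toy loss \( L(w) = w^2 \), the helper name `gradient_descent`, and the three learning rates are assumptions picked for illustration only.

```python
def gradient_descent(eta, steps=20, w=1.0):
    """Run gradient descent on the toy loss L(w) = w^2 and return the final loss."""
    for _ in range(steps):
        gradient = 2.0 * w      # dL/dw for L(w) = w^2
        w = w - eta * gradient  # the sign of the gradient decides the direction
    return w ** 2

initial_loss = 1.0 ** 2  # loss at the starting weight w = 1.0

final_small = gradient_descent(eta=0.01)  # small steps: slow but steady progress
final_medium = gradient_descent(eta=0.1)  # larger steps: faster progress
final_large = gradient_descent(eta=1.1)   # too large: every step overshoots the minimum

print("initial loss:", initial_loss)
print("eta=0.01 ->", final_small)
print("eta=0.1  ->", final_medium)
print("eta=1.1  ->", final_large)
```

With the two smaller learning rates the loss ends below its starting value, while with the largest one each step jumps past the minimum and the loss grows instead of shrinking.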

Great job making it this far! One key point to remember: when we loop through the training process, we need to reset the gradient values at the start of each iteration. This is necessary because PyTorch accumulates gradients across backward passes by default, so values left over from the previous iteration would otherwise be added to the newly computed ones. The reset is done with the optimizer's **zero_grad()** method.
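
Putting **zero_grad()**, backpropagation, and **step()** together, a training loop might look like the sketch below. The toy data (learning \( y = 2x + 1 \)), the one-layer `model`, the learning rate of 0.1, and the 100 iterations are assumptions made for illustration, not part of the exercises above.

```python
import torch

torch.manual_seed(0)  # make the random weight initialization reproducible

# Toy data (an assumption for illustration): learn y = 2x + 1
x = torch.linspace(-1.0, 1.0, 20).unsqueeze(1)
y = 2.0 * x + 1.0

model = torch.nn.Linear(1, 1)  # one weight and one bias
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.1)

losses = []
for epoch in range(100):
    optimizer.zero_grad()          # reset gradients left over from the previous iteration
    prediction = model(x)
    loss = loss_fn(prediction, y)  # mean squared error
    loss.backward()                # backpropagation fills in .grad for every parameter
    optimizer.step()               # w = w - lr * grad for every parameter
    losses.append(loss.item())

print("first loss:", losses[0])
print("last loss:", losses[-1])
```

Without the call to `zero_grad()`, each `backward()` would add its gradients to the ones already stored, and the updates would no longer follow the gradient of the current loss.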