# Optimizer

-----
### Gradient Descent

Gradient Descent is an optimization algorithm commonly used to minimize the cost or loss function in the process of training machine learning models. The basic idea is to iteratively adjust the model parameters in the direction that reduces the cost function.

The algorithm is as follows:
1. Initialize the model parameters with random values.
2. Calculate the gradient of the cost function with respect to each parameter.
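3. Update the parameters in the opposite direction of the gradient, scaled by a step size (the learning rate).
4. Repeat steps 2 and 3 until the cost function converges to a minimum (for non-convex costs, this may only be a local minimum).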
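
The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the toy cost function, the learning rate of `0.1`, and the stopping tolerance are all assumptions chosen for the example.

```python
import random

def cost(w):
    """Toy convex cost (an assumption for this sketch) with its minimum at w = [3, -1]."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def gradient(w):
    """Analytic gradient of the toy cost with respect to each parameter."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def gradient_descent(learning_rate=0.1, tolerance=1e-10, max_steps=10_000, seed=0):
    rng = random.Random(seed)
    # Step 1: initialize the parameters with random values.
    w = [rng.uniform(-5.0, 5.0) for _ in range(2)]
    previous_cost = cost(w)
    for _ in range(max_steps):
        # Step 2: gradient of the cost with respect to each parameter.
        grad = gradient(w)
        # Step 3: move against the gradient, scaled by the learning rate.
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
        # Step 4: stop once the cost no longer changes meaningfully.
        current_cost = cost(w)
        if abs(previous_cost - current_cost) < tolerance:
            break
        previous_cost = current_cost
    return w

w_final = gradient_descent()
print(w_final)  # close to [3, -1]
```

A learning rate that is too large would make the updates overshoot and diverge, while one that is too small would make convergence slow, which is why it is usually tuned per problem.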

-----
### Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of the gradient descent algorithm that updates the model parameters using the gradient of the cost function computed on a single example or a small subset (mini-batch) of the training data, rather than the entire dataset. Each update is therefore much cheaper to compute, which makes the algorithm more scalable for large datasets, at the price of noisier updates.

The algorithm is as follows:
1. Initialize the model parameters with random values.
2. Randomly shuffle the training data.
3. For each mini-batch of the training data, calculate the gradient of the cost function on that mini-batch with respect to each parameter.
4. Update the parameters in the opposite direction of that gradient before moving on to the next mini-batch.
5. Repeat steps 2, 3, and 4 (each full pass over the data is one epoch) until the cost function converges to a minimum.
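
The steps above can be sketched for a one-feature linear regression in plain Python. The synthetic data (`y = 2x + 1`), the batch size, the learning rate, and the fixed number of epochs used in place of a convergence check are all assumptions made for this illustration.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic data from y = 2x + 1 (hypothetical, noise-free for clarity)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0) for x in xs]

def minibatch_sgd(data, learning_rate=0.1, batch_size=16, epochs=200, seed=1):
    rng = random.Random(seed)
    # Step 1: initialize the parameters with random values.
    w, b = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    data = list(data)  # copy so the caller's list is not reordered
    for _ in range(epochs):
        # Step 2: reshuffle the training data at the start of every epoch.
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Step 3: gradient of the mean squared error on this mini-batch only.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            # Step 4: move against the mini-batch gradient.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

w, b = minibatch_sgd(make_data())
print(w, b)  # close to 2 and 1
```

Setting `batch_size=1` gives the single-example form of SGD, and setting it to the dataset size recovers ordinary (full-batch) gradient descent.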

-----

### Regularization

Regularization is a technique that discourages the model from becoming overly complex or flexible, which reduces the risk of overfitting.

Lasso and Ridge regression **add a penalty term to the linear regression objective function** to prevent overfitting. 

- L1 norm: **Lasso (Least Absolute Shrinkage and Selection Operator) Regression** adds a penalty term to the objective function that is proportional to the absolute value of the coefficients. This results in some of the coefficients becoming exactly zero, which **effectively eliminates those features from the model**. This is called **feature selection** and it helps in reducing the number of features in the model. Lasso regression is useful for models with many features and it can help to select a subset of those features that are most useful for predicting the target variable.

$$\text{Objective} = \text{SSE} + \lambda \sum_{j} |w_j|$$

- L2 norm: **Ridge Regression**, on the other hand, adds a penalty term to the objective function that is proportional to the square of the coefficients. This results in all coefficients shrinking towards zero, but none of them becoming exactly zero. This helps to reduce the impact of any one feature on the model, which can **help to prevent overfitting**. Ridge regression is useful for models with correlated features.

$$\text{Objective} = \text{SSE} + \lambda \sum_{j} w_j^2$$
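
As a rough sketch, the two objectives can be written out directly in Python. The function names, the tiny dataset, and the choice of a linear model without an intercept are assumptions made for this example only.

```python
def sse(weights, features, targets):
    """Sum of squared errors of a linear model without an intercept."""
    total = 0.0
    for row, y in zip(features, targets):
        prediction = sum(w * x for w, x in zip(weights, row))
        total += (y - prediction) ** 2
    return total

def lasso_objective(weights, features, targets, lam):
    """SSE plus lam times the sum of absolute coefficient values (L1 penalty)."""
    return sse(weights, features, targets) + lam * sum(abs(w) for w in weights)

def ridge_objective(weights, features, targets, lam):
    """SSE plus lam times the sum of squared coefficients (L2 penalty)."""
    return sse(weights, features, targets) + lam * sum(w * w for w in weights)

# Hypothetical toy data: three samples, two features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [1.0, 2.0, 3.0]
weights = [1.0, -2.0]

print(sse(weights, features, targets))                   # 32.0
print(lasso_objective(weights, features, targets, 0.5))  # 32.0 + 0.5 * 3 = 33.5
print(ridge_objective(weights, features, targets, 0.5))  # 32.0 + 0.5 * 5 = 34.5
```

Both penalties grow with the size of the coefficients, so a larger `lam` pushes the minimizer towards smaller weights in either case.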


What's the difference between Lasso and Ridge regression?

1. The main difference between the two is the type of penalty term they use. Lasso regression uses an L1 penalty term, while Ridge regression uses an L2 penalty term.
2. Lasso regression is useful for **feature selection**, while Ridge regression is useful for **reducing the impact of correlated features**.
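
The difference shows up even in a one-coefficient problem. The sketch below assumes a single weight `w` fitted to a single target value `a`, so the SSE reduces to `(w - a)**2`. Under that assumption both objectives have closed-form minimizers: the L1 version is a soft-threshold, and the L2 version is a rescaling. The function names are hypothetical.

```python
def lasso_1d(a, lam):
    """Minimizer of (w - a)**2 + lam * abs(w): soft-thresholding at lam / 2."""
    magnitude = max(abs(a) - lam / 2.0, 0.0)
    return magnitude if a >= 0 else -magnitude

def ridge_1d(a, lam):
    """Minimizer of (w - a)**2 + lam * w**2: the target is scaled down, never zeroed."""
    return a / (1.0 + lam)

for a in (0.3, 2.0):
    print(a, lasso_1d(a, 1.0), ridge_1d(a, 1.0))
```

With `lam = 1.0`, the weak signal `a = 0.3` is set to exactly zero by the L1 penalty but only halved by the L2 penalty, while the strong signal `a = 2.0` survives both (1.5 and 1.0 respectively). This is the mechanism behind Lasso's feature selection.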

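In PyTorch, you can add regularization terms to the loss function. For L2 regularization (weight decay), you can instead pass the `weight_decay` parameter to the optimizer; an L1 penalty has no equivalent optimizer argument and has to be added to the loss manually. For example, with `torch.optim` imported as `optim`: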

`optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-5)`
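
To show what `weight_decay` does to a plain SGD update, here is a small pure-Python sketch that does not use PyTorch itself. It assumes, as the documented SGD algorithm describes, that `weight_decay * w` is added to the gradient before the step; the helper names are hypothetical. Under that assumption, `weight_decay` corresponds to an L2 penalty of `(weight_decay / 2) * sum(w**2)` on the loss.

```python
def sgd_step_with_weight_decay(w, grad, lr, weight_decay):
    """One plain SGD step where weight_decay * w is added to the gradient."""
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]

def sgd_step_with_l2_penalty(w, grad, lr, lam):
    """One plain SGD step on loss + lam * sum(w**2); the penalty gradient is 2 * lam * w."""
    return [wi - lr * (gi + 2.0 * lam * wi) for wi, gi in zip(w, grad)]

w = [0.5, -1.0]     # current parameters (made-up values)
grad = [0.1, 0.2]   # gradient of the unregularized loss (made-up values)

decayed = sgd_step_with_weight_decay(w, grad, lr=0.01, weight_decay=1e-5)
penalized = sgd_step_with_l2_penalty(w, grad, lr=0.01, lam=0.5e-5)
print(decayed, penalized)  # the two updates agree up to floating-point rounding
```

This equivalence holds for plain SGD; for adaptive optimizers, weight decay and an explicit L2 penalty generally behave differently.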