# Optimizer

Before recommending an optimizer, it is important to clarify that no single optimizer is the "best" for all types of machine learning problems and model architectures. Even comparing the performance of optimizers is a difficult task.

We suggest sticking with mature, popular optimizers and, ideally, choosing the one most commonly used for similar problems. Common and well-established optimizers include (but are not limited to):
- **SGD with momentum**
- **Adam and NAdam**, which are more versatile than SGD with momentum. Please note that Adam has 4 adjustable hyperparameters, all of which are important.

## Stochastic Gradient Descent (SGD)

### Concept
Stochastic Gradient Descent (SGD) is an optimization algorithm that updates the model's parameters by computing the gradient of the loss function with respect to the parameters on a single training sample (or, in practice, a mini-batch) and moving the parameters in the direction opposite to the gradient. Repeating this step drives the loss toward a minimum.

SGD with momentum extends SGD by accumulating a velocity from past gradients. This accelerates updates along directions where the gradients are consistent and dampens oscillations, which leads to faster convergence.

$$
\begin{align*}
v &= \gamma v + \eta \nabla J(w) \\
w &= w - v
\end{align*}
$$

Where:
$$
\begin{align*}
v &\text{ is the velocity} \\
\gamma &\text{ is the momentum term} \\
\eta &\text{ is the learning rate} \\
\nabla J(w) &\text{ is the gradient of the loss function with respect to the weights}
\end{align*}
$$
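The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a toy one-dimensional loss $J(w) = (w - 3)^2$; the function names `sgd_momentum_step` and `grad_J` and the chosen values of `lr` and `gamma` are hypothetical and not part of any library.

```python
def sgd_momentum_step(w, v, grad, lr=0.1, gamma=0.9):
    """One momentum update: v = gamma * v + lr * grad, then w = w - v."""
    v = gamma * v + lr * grad
    w = w - v
    return w, v


def grad_J(w):
    # Gradient of the toy loss J(w) = (w - 3)^2, whose minimum is at w = 3.
    return 2.0 * (w - 3.0)


w, v = 0.0, 0.0  # initial weight and velocity
for _ in range(300):
    w, v = sgd_momentum_step(w, v, grad_J(w))

print(w)  # should end up close to the minimizer w = 3
```

Setting `gamma=0` in this sketch recovers plain gradient descent on the same loss.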

### Parameters
- `params`: Iterable of parameters to optimize or dictionaries defining parameter groups.
- `lr`: Learning rate (default: 0.001 in recent PyTorch releases; older releases require it to be set explicitly).
- `momentum`: Momentum factor (default: 0).
- `dampening`: Dampening for momentum (default: 0).
- `weight_decay`: Weight decay (L2 penalty) (default: 0).
- `nesterov`: Enables Nesterov momentum (default: False).

### PyTorch Code

In [None]:
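import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)  # placeholder model for illustration; substitute your own
sgd = optim.SGD(model.parameters(), lr=0.01)  # plain SGD
sgd_with_momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # SGD with momentum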
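
Once constructed, an optimizer is typically driven from a training loop. The following is a minimal sketch, assuming a toy linear regression problem; the random data, the `nn.Linear` model, and the number of steps are placeholders chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy regression data (hypothetical, for illustration only)
x = torch.randn(64, 10)
y = x @ torch.randn(10, 1)

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

losses = []
for step in range(100):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # compute gradients of the loss
    optimizer.step()             # apply the SGD-with-momentum update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

The same loop works unchanged with `optim.Adam` or `optim.NAdam` in place of `optim.SGD`, since all of them share the `zero_grad()` / `step()` interface.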

## Adam (Adaptive Moment Estimation)

### Concept
Adam is an adaptive learning rate optimization algorithm designed to combine the advantages of both the AdaGrad and RMSProp algorithms. It computes adaptive learning rates for each parameter by considering the first moment (mean) and the second moment (uncentered variance) of the gradients.

$$
\begin{align*}
m &= \beta_1 m + (1 - \beta_1) \nabla J(w) \\
v &= \beta_2 v + (1 - \beta_2) (\nabla J(w))^2 \\
\hat{m} &= m / (1 - \beta_1^t) \\
\hat{v} &= v / (1 - \beta_2^t) \\
w &= w - \eta \frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}
\end{align*}
$$

Where:
$$
\begin{align*}
m &\text{ is the estimate of the first moment (the mean of the gradients)} \\
v &\text{ is the estimate of the second moment (the mean of the squared gradients)} \\
\hat{m} &\text{ and } \hat{v} \text{ are the bias-corrected moment estimates at step } t \\
\beta_1 &\text{ and } \beta_2 \text{ are the exponential decay rates for the first and second moment estimates, respectively} \\
\eta &\text{ is the learning rate} \\
\epsilon &\text{ is a small constant for numerical stability}
\end{align*}
$$
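
To make the Adam update concrete, here is a minimal scalar sketch in plain Python. It includes the bias-correction step used in the standard formulation of Adam; the function name `adam_step` and the toy loss $J(w) = (w - 3)^2$ are assumptions made for illustration and are not a library API.

```python
import math


def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = 0.0, 0.0, 0.0
grad = 2.0 * (w - 3.0)  # gradient of the toy loss J(w) = (w - 3)^2 at w = 0
w, m, v = adam_step(w, m, v, grad, t=1)
print(w)
```

In this sketch the first update has a magnitude close to `lr` regardless of how large the gradient is, which illustrates how Adam normalizes the step by the gradient scale.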

### Parameters
- `params`: Iterable of parameters to optimize or dictionaries defining parameter groups.
- `lr`: Learning rate (default: 0.001).
- `betas`: Coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999)).
- `eps`: Term added to the denominator to improve numerical stability (default: 1e-8).
- `weight_decay`: Weight decay (L2 penalty) (default: 0).
- `amsgrad`: Whether to use the AMSGrad variant of this algorithm (default: False).

### PyTorch Code

In [None]:
adam = optim.Adam(model.parameters(), lr=0.001)

## NAdam (Nesterov-accelerated Adaptive Moment Estimation)

### Concept
NAdam combines Adam with Nesterov momentum. It replaces the momentum term in the Adam update with a Nesterov-style look-ahead, which can speed up convergence.

$$
\begin{align*}
g_t &= \nabla J(w_t) \\
m &= \beta_1 m + (1 - \beta_1) g_t \\
v &= \beta_2 v + (1 - \beta_2) g_t^2 \\
\hat{m} &= m / (1 - \beta_1^t) \\
\hat{v} &= v / (1 - \beta_2^t) \\
w_{t+1} &= w_t - \frac{\eta}{\sqrt{\hat{v}} + \epsilon} \left( \beta_1 \hat{m} + \frac{(1 - \beta_1) g_t}{1 - \beta_1^t} \right)
\end{align*}
$$

Here:
$$
\begin{align*}
g_t &\text{ is the gradient at time } t \\
\hat{m} &\text{ and } \hat{v} \text{ are the bias-corrected first and second moment estimates, respectively}
\end{align*}
$$

This is a commonly used simplified form of the update. PyTorch's `NAdam` additionally varies the momentum coefficient over time according to `momentum_decay`.
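
The sketch below shows one NAdam-style step for a scalar parameter in plain Python. It assumes a commonly used simplified form of the update, in which the bias-corrected momentum is blended with the current gradient and the momentum coefficient is held constant; it does not reproduce the momentum schedule that PyTorch controls through `momentum_decay`. The names `nadam_step` and `m_bar` are hypothetical.

```python
import math


def nadam_step(w, m, v, grad, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified NAdam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)             # bias-corrected second moment
    # Nesterov look-ahead: blend the corrected momentum with the current gradient.
    m_bar = beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** t)
    w = w - lr * m_bar / (math.sqrt(v_hat) + eps)
    return w, m, v


w, m, v = 0.0, 0.0, 0.0
grad = 2.0 * (w - 3.0)  # gradient of the toy loss J(w) = (w - 3)^2 at w = 0
w, m, v = nadam_step(w, m, v, grad, t=1)
print(w)
```

The only difference from a plain Adam step in this sketch is the `m_bar` line, which gives extra weight to the most recent gradient.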

### Parameters
- `params`: Iterable of parameters to optimize or dictionaries defining parameter groups.
- `lr`: Learning rate (default: 0.002).
- `betas`: Coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999)).
- `eps`: Term added to the denominator to improve numerical stability (default: 1e-8).
- `weight_decay`: Weight decay (L2 penalty) (default: 0).
- `momentum_decay`: Momentum decay term (default: 4e-3).

### PyTorch Code

In [None]:
nadam = optim.NAdam(model.parameters(), lr=0.001)

# Batch Size

Batch size is a key factor in determining training time and computational resource consumption.

Increasing the batch size usually reduces training time. This is very beneficial because it:
- Allows more thorough hyperparameter tuning within a fixed time budget, which can ultimately yield a better model.
- Shortens the development cycle, allowing new ideas to be tested more often.

There is no clear relationship between resource consumption and batch size: increasing the batch size may increase, decrease, or leave resource consumption unchanged.

Batch size should not be treated as a tunable hyperparameter for validation set performance. As long as all hyperparameters (especially the learning rate and regularization hyperparameters) are well tuned and the number of training steps is sufficient, any batch size should in theory be able to reach the same final performance.

For a given model and optimizer, the available hardware can typically support a range of batch sizes. The limiting factor is usually accelerator memory (GPUs, TPUs, etc.). Unfortunately, it is difficult to calculate which batch sizes will fit in memory without running, or at least compiling, the complete training program. The simplest solution is usually to run a small number of short training experiments with different batch sizes (for example, increasing powers of 2) until one of the experiments exceeds the available memory.
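
The search described above can be sketched as a simple doubling loop. This is a minimal illustration, not a definitive implementation: `largest_fitting_batch_size` and `fake_training_run` are hypothetical names, and the fake run simulates memory use with a fixed per-example cost. In a real PyTorch run on a GPU, running out of memory typically surfaces as a `RuntimeError` rather than `MemoryError`, so the `except` clause would need to be adapted.

```python
def largest_fitting_batch_size(try_batch_size, start=1, limit=2**20):
    """Double the batch size until `try_batch_size` raises MemoryError.

    Returns the largest batch size that ran successfully,
    or None if even `start` did not fit.
    """
    largest = None
    batch_size = start
    while batch_size <= limit:
        try:
            try_batch_size(batch_size)  # e.g. a few training steps at this size
        except MemoryError:
            break
        largest = batch_size
        batch_size *= 2
    return largest


def fake_training_run(batch_size, memory_budget=3000, cost_per_example=10):
    # Stand-in for a short real training run: pretend each example
    # consumes a fixed amount of memory and fail once the budget is exceeded.
    if batch_size * cost_per_example > memory_budget:
        raise MemoryError(f"batch size {batch_size} does not fit")


best = largest_fitting_batch_size(fake_training_run)
print(best)
```

Replacing `fake_training_run` with a function that builds the model and runs a handful of real training steps turns this into a practical probe for the largest batch size that fits.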