In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(42)

<torch._C.Generator at 0x1103ad1b0>

<span style="font-size: 15px;">

When we build a model that predicts an output quantity from a given input, the **loss function** measures the discrepancy between the model’s prediction and the true target value. Once the loss is defined, we need a method to adjust the model’s parameters so that this loss is **minimized**.

This process is called **training the model**, and the specific method used to **update the parameters** is referred to as the **optimization step**. In this lab, we will demonstrate some techniques used to optimize model parameters.

</span>
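As a rough illustration of where the optimization step sits in training, the sketch below fits a one-parameter linear model to synthetic data. The data, the model size, and the learning rate are assumptions made for this example only.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy regression data: y = 3x + 1 plus a little noise (illustrative only).
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * x + 1 + 0.05 * torch.randn_like(x)

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = loss_fn(model(x), y)    # loss: discrepancy between prediction and target
    loss.backward()                # gradients of the loss w.r.t. the parameters
    optimizer.step()               # optimization step: update the parameters
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Swapping `optim.SGD` for any other optimizer in this lab leaves the rest of the loop unchanged.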

**Overview**

| Optimizer | Key Innovation | Advantages | Disadvantages | Best Use Cases |
|-----------|---------------|------------|---------------|----------------|
| **SGD** | Basic gradient descent with optional momentum | • Simple and interpretable<br>• Low memory overhead<br>• Works well with momentum<br>• Good generalization with proper tuning | • Requires careful learning rate tuning<br>• Same learning rate for all parameters<br>• Can be slow to converge<br>• Struggles with sparse gradients | • Well-understood problems<br>• When generalization is critical<br>• Large-scale models (memory efficient)<br>• CNNs with momentum |
| **ASGD** | Averages parameters over time with adaptive learning rate decay | • Improved convergence over SGD<br>• Theoretically sound averaging<br>• Can escape sharp minima<br>• Better generalization than SGD alone | • Averaging only starts after t₀ steps<br>• t₀ is hard to tune (default: 1M steps)<br>• Less commonly used/tested<br>• Limited practical adoption | • Convex optimization problems<br>• When training stability is important<br>• Long training runs where averaging helps<br>• Theoretical research |
| **Adagrad** | Adapts learning rate per parameter based on accumulated squared gradients | • Less sensitive to the initial learning rate<br>• Works well with sparse features<br>• Larger updates for infrequent features<br>• Good for sparse data (NLP, embeddings) | • Learning rate monotonically decreases<br>• Can stop learning too early<br>• Accumulates all historical gradients<br>• Poor for non-convex optimization | • Sparse data and features<br>• NLP tasks (word embeddings)<br>• Recommendation systems<br>• Problems with varying feature frequencies |
| **Adadelta** | Fixes Adagrad's diminishing learning rate using exponential moving average | • No learning rate parameter needed<br>• More robust than Adagrad<br>• Adapts to gradient history<br>• Good for non-convex problems | • Can be slower than Adam<br>• Less commonly used in practice<br>• Hyperparameter ρ needs tuning<br>• May not converge as fast as Adam | • When you want Adagrad without decay issues<br>• Problems where learning rate tuning is difficult<br>• Mid-sized neural networks<br>• Alternative to RMSprop |
| **RMSprop** | Uses exponential moving average of squared gradients with optional momentum | • Solves Adagrad's learning rate decay<br>• Works well in practice<br>• Good for RNNs and non-stationary problems<br>• Widely used and tested | • No bias correction<br>• Still requires learning rate tuning<br>• Can be unstable without proper hyperparameters<br>• Less popular than Adam | • Recurrent Neural Networks (RNNs)<br>• Non-stationary objectives<br>• Online learning settings<br>• When Adam is too complex |
| **Adam** | Combines momentum and RMSprop with bias correction | • Generally works well out-of-the-box<br>• Fast convergence<br>• Bias correction for moments<br>• Most widely used optimizer<br>• Robust to hyperparameters | • Can overfit more than SGD<br>• May not generalize as well as SGD<br>• Uses more memory (stores moments)<br>• Can get stuck in poor local minima | • Default choice for most deep learning<br>• Transformers and attention models<br>• GANs and generative models<br>• When fast prototyping is needed<br>• Most NLP and vision tasks |
| **AdamW** | Decouples weight decay from the gradient update (true weight decay instead of L2 regularization) | • Better generalization than Adam<br>• Correct weight decay implementation<br>• State-of-the-art for many tasks<br>• Widely used in modern architectures | • Slightly more hyperparameters<br>• Still can overfit without proper regularization<br>• Higher memory usage<br>• May need different weight decay values than Adam | • Modern transformers (BERT, GPT, etc.)<br>• Large language models<br>• Vision transformers (ViT)<br>• When proper weight decay is critical<br>• Fine-tuning pretrained models |

## General Recommendations

- **Start with AdamW** for most modern deep learning tasks (transformers, large models)
- **Use SGD with momentum** when generalization is critical and you have time to tune
- **Use Adam** for quick prototyping or when AdamW isn't necessary
- **Use Adagrad** for sparse data problems (embeddings, recommendation systems)
- **Use RMSprop** for RNNs or when you want something simpler than Adam
- **Avoid Adadelta and ASGD** unless you have specific reasons (less commonly used)

## Key Hyperparameters to Tune

| Optimizer | Critical Hyperparameters | Typical Good Values |
|-----------|-------------------------|---------------------|
| SGD | Learning rate, momentum | lr: 0.01-0.1, momentum: 0.9 |
| ASGD | Learning rate, t₀ | lr: 0.01, t₀: problem-dependent |
| Adagrad | Learning rate | lr: 0.01 |
| Adadelta | ρ (rho) | ρ: 0.9-0.95 |
| RMSprop | Learning rate, α (alpha) | lr: 0.001, α: 0.99 |
| Adam | Learning rate, β₁, β₂ | lr: 0.001, β₁: 0.9, β₂: 0.999 |
| AdamW | Learning rate, weight decay | lr: 0.001, weight_decay: 0.01 |
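The typical values in the table can be written down as optimizer constructors. The sketch below does this for a placeholder `nn.Linear` model; the model and the dictionary name `optimizer_factories` are assumptions for illustration, and the values are starting points rather than tuned settings.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)  # placeholder model, for illustration only

# Typical starting points taken from the table above; tune them for your problem.
optimizer_factories = {
    "SGD": lambda p: optim.SGD(p, lr=0.01, momentum=0.9),
    "ASGD": lambda p: optim.ASGD(p, lr=0.01),
    "Adagrad": lambda p: optim.Adagrad(p, lr=0.01),
    "Adadelta": lambda p: optim.Adadelta(p, rho=0.9),
    "RMSprop": lambda p: optim.RMSprop(p, lr=0.001, alpha=0.99),
    "Adam": lambda p: optim.Adam(p, lr=0.001, betas=(0.9, 0.999)),
    "AdamW": lambda p: optim.AdamW(p, lr=0.001, weight_decay=0.01),
}

# Each optimizer gets its own fresh iterator over the model parameters.
optimizers = {
    name: make(model.parameters()) for name, make in optimizer_factories.items()
}
for name, opt in optimizers.items():
    print(f"{name:8s} lr={opt.defaults['lr']}")
```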

## SGD

<span style="font-size: 15px;">

Stochastic Gradient Descent (SGD) updates the model parameters by computing the gradient of the loss function with respect to the parameters and usually taking a step in the opposite direction of this gradient. Unlike standard gradient descent, which uses the entire training dataset to compute each update, SGD estimates the gradient using **a single data point or a small batch of data**. This makes SGD computationally efficient and well-suited for large datasets, though the updates can be noisy.


In general, the init method in optim.SGD has the following default arguments:

```(params, lr=0.001, momentum=0, dampening=0, weight_decay=0, nesterov=False, *, maximize=False, foreach=None, differentiable=False, fused=None)```

The default arguments ```differentiable``` and ```fused``` will be discussed later in more detail. For now, if we denote the model's parameters at training step $t$ by $\theta_t$ (starting with $t=0$), the learning rate by $\gamma$, the momentum coefficient by $\mu$, the dampening parameter by $\tau$, and the weight decay by $\lambda$, then the optimization step in SGD takes the following general form:


   $$
\begin{aligned} 
g_t &= \begin{cases}
-\nabla_\theta f(\theta_{t-1}) & \text{if maximize} \\
\nabla_\theta f(\theta_{t-1}) & \text{otherwise}
\end{cases} \\
g_t &\leftarrow g_t + \lambda \theta_{t-1} \quad \text{if } \lambda \neq 0 \\
b_t &= \begin{cases}
g_t & \text{if } t = 1 \\
\mu b_{t-1} + (1-\tau)g_t & \text{if } t > 1
\end{cases} \quad \text{if } \mu \neq 0 \\
g_t &\leftarrow \begin{cases}
g_t + \mu b_t & \text{if nesterov} \\
b_t & \text{otherwise}
\end{cases} \quad \text{if } \mu \neq 0 \\
\theta_t &= \theta_{t-1} - \gamma g_t
\end{aligned}
   $$
That is,
   $$
\begin{aligned} 
\theta_t =& \theta_{t-1} - \gamma \tilde{g}_t
\,,
\nonumber
\\
\quad \tilde{g}_t =& \begin{cases}
g_t + \mu b_t & \text{if nesterov and } \mu \neq 0 \\
b_t & \text{if } \mu \neq 0 \text{ and not nesterov} \\
g_t & \text{if } \mu = 0
\end{cases}
\,,
\nonumber
\\
b_t =& \begin{cases}
g_t & \text{if } t = 1 \\
\mu b_{t-1} + (1-\tau)g_t & \text{if } t > 1
\end{cases} \quad \text{(only computed if } \mu \neq 0\text{)}
\,,
\nonumber
\\
g_t =& \nabla_\theta f(\theta_{t-1}) + \lambda\theta_{t-1}
\end{aligned}
   $$
</span>


In [76]:
# To see exactly how this works in PyTorch, we first create some random trainable parameters:

parameter_1 = nn.Parameter(torch.rand((2, 4), dtype=torch.float64))
parameter_2 = nn.Parameter(torch.rand(4, dtype=torch.float64))

params = nn.ParameterList([parameter_1, parameter_2])

# Let's save the initial values of the parameters
initial_param1 = parameter_1.clone().detach()
initial_param2 = parameter_2.clone().detach()

# Create a simple loss function
def dummy_loss():
    # Toy loss: sum of squares of parameter_1 plus sum of cubes of parameter_2
    return (parameter_1**2).sum() + (parameter_2**3).sum()

In [77]:
# We also choose values for the remaining arguments:
lr = 0.1
momentum = 0.5
dampening = 1
weight_decay = 0.02
nesterov = False

In [78]:
optimizer = optim.SGD(params, lr=lr, momentum=momentum, dampening=dampening, weight_decay=weight_decay, nesterov=nesterov)

In [79]:
# Perform one optimization step
optimizer.zero_grad()  # Clear previous gradients
loss = dummy_loss()    # Compute loss
loss.backward()        # Compute gradients

print("\nGradients:")
print("grad parameter_1:\n", parameter_1.grad)
print("grad parameter_2:\n", parameter_2.grad)

# Save gradients before optimizer.step()
grad1 = parameter_1.grad.clone()
grad2 = parameter_2.grad.clone()

optimizer.step()       # Update parameters

print("\nAfter one step:")
print("Updated parameter_1:\n", parameter_1)
print("Updated parameter_2:\n", parameter_2)


print("\nChange in parameters:")
print("Δ parameter_1:\n", parameter_1 - initial_param1)
print("Δ parameter_2:\n", parameter_2 - initial_param2)


Gradients:
grad parameter_1:
 tensor([[1.4953, 1.2757, 1.0132, 0.1502],
        [1.2270, 1.9302, 0.6604, 1.8370]], dtype=torch.float64)
grad parameter_2:
 tensor([1.0079e+00, 2.6933e-04, 1.0688e+00, 1.1233e+00], dtype=torch.float64)

After one step:
Updated parameter_1:
 Parameter containing:
tensor([[0.5966, 0.5090, 0.4043, 0.0599],
        [0.4896, 0.7701, 0.2635, 0.7330]], dtype=torch.float64,
       requires_grad=True)
Updated parameter_2:
 Parameter containing:
tensor([0.4777, 0.0094, 0.4888, 0.4984], dtype=torch.float64,
       requires_grad=True)

Change in parameters:
Δ parameter_1:
 tensor([[-0.1510, -0.1288, -0.1023, -0.0152],
        [-0.1239, -0.1949, -0.0667, -0.1855]], dtype=torch.float64,
       grad_fn=<SubBackward0>)
Δ parameter_2:
 tensor([-1.0195e-01, -4.5883e-05, -1.0808e-01, -1.1355e-01],
       dtype=torch.float64, grad_fn=<SubBackward0>)


In [81]:
# Now manually reproduce the same update

# Reset to initial values
manual_param1 = initial_param1.clone()
manual_param2 = initial_param2.clone()

# Step 1: Compute g_t (gradient + weight decay)
g1 = grad1 + weight_decay * initial_param1
g2 = grad2 + weight_decay * initial_param2


# Step 2: Apply momentum (for t=1, b_t = g_t since it's the first step)
b1 = g1  # momentum buffer for parameter_1
b2 = g2  # momentum buffer for parameter_2


# Step 3: Since nesterov=False and momentum!=0, use g_t = b_t
final_g1 = b1
final_g2 = b2

# Step 4: Update parameters
manual_param1 = manual_param1 - lr * final_g1
manual_param2 = manual_param2 - lr * final_g2

print("\nManual parameter_1:\n", manual_param1)
print("Manual parameter_2:\n", manual_param2)

# Verify they match

print("parameter_1 match:", torch.allclose(parameter_1, manual_param1))
print("parameter_2 match:", torch.allclose(parameter_2, manual_param2))
print("Max difference param1:", torch.max(torch.abs(parameter_1 - manual_param1)).item())
print("Max difference param2:", torch.max(torch.abs(parameter_2 - manual_param2)).item())


Manual parameter_1:
 tensor([[0.5966, 0.5090, 0.4043, 0.0599],
        [0.4896, 0.7701, 0.2635, 0.7330]], dtype=torch.float64)
Manual parameter_2:
 tensor([0.4777, 0.0094, 0.4888, 0.4984], dtype=torch.float64)
parameter_1 match: True
parameter_2 match: True
Max difference param1: 0.0
Max difference param2: 0.0
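The single step above never exercises the dampening term or the Nesterov branch, because at $t=1$ the buffer is just $b_1 = g_1$. The sketch below runs three steps on a toy loss $f(\theta)=\sum\theta^2$ and compares `optim.SGD` against a hand-written version of the update rule. The helper names `manual_sgd` and `torch_sgd` and the hyperparameter values are assumptions for this example. PyTorch requires zero dampening when `nesterov=True`, so the two configurations test those features separately.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

def manual_sgd(theta0, steps, lr, momentum, dampening, weight_decay, nesterov):
    """Apply the SGD update rule for the loss f(theta) = sum(theta**2)."""
    theta = theta0.clone()
    buf = None  # momentum buffer b_t, created on the first step
    for _ in range(steps):
        g = 2 * theta + weight_decay * theta  # gradient plus weight decay
        if momentum != 0:
            # b_1 = g_1; afterwards b_t = mu * b_{t-1} + (1 - tau) * g_t
            buf = g.clone() if buf is None else momentum * buf + (1 - dampening) * g
            g = (g + momentum * buf) if nesterov else buf
        theta = theta - lr * g
    return theta

def torch_sgd(theta0, steps, **hyper):
    """Run optim.SGD on the same loss and return the final parameter values."""
    param = nn.Parameter(theta0.clone())
    optimizer = optim.SGD([param], **hyper)
    for _ in range(steps):
        optimizer.zero_grad()
        (param ** 2).sum().backward()
        optimizer.step()
    return param.detach()

theta0 = torch.rand(5, dtype=torch.float64)
configs = [
    dict(lr=0.1, momentum=0.5, dampening=0.5, weight_decay=0.02, nesterov=False),
    dict(lr=0.1, momentum=0.5, dampening=0.0, weight_decay=0.02, nesterov=True),
]
matches = [
    torch.allclose(torch_sgd(theta0, 3, **cfg), manual_sgd(theta0, 3, **cfg))
    for cfg in configs
]
print("matches:", matches)
```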


## ASGD

<span style="font-size: 15px;">

Averaged Stochastic Gradient Descent (ASGD) is an optimization algorithm that enhances standard stochastic gradient descent by maintaining a running average of the parameter values over time. The key innovation is that ASGD computes parameter updates similarly to SGD but also keeps track of averaged parameters, which typically converge to better solutions. The averaging process begins after a specified number of steps (controlled by `t0`), and the learning rate is adjusted dynamically based on the number of steps taken. This averaging mechanism helps reduce the variance in the parameter estimates and can lead to improved generalization.

In general, the init method in optim.ASGD has the following default arguments:

```(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0, foreach=None, maximize=False, differentiable=False, capturable=False)```

The default arguments ```differentiable```, ```foreach```, and ```capturable``` will be discussed later in more detail. For now, if we denote the model's parameters at training step $t$ by $\theta_t$ (starting with $t=0$), the learning rate by $\gamma$, the decay term by $\lambda$, the exponent used in the $\eta$ update by $\alpha$, the averaging start point by $t_0$, and the weight decay by $\omega$, then the optimization step in ASGD takes the following general form:
   $$
\begin{aligned} 
g_t &= \begin{cases}
-\nabla_\theta f(\theta_{t-1}) & \text{if maximize} \\
\nabla_\theta f(\theta_{t-1}) & \text{otherwise}
\end{cases} \\
g_t &\leftarrow g_t + \omega \theta_{t-1} \quad \text{if } \omega \neq 0 \\
\eta_t &= \frac{\gamma}{(1 + \lambda \gamma t)^\alpha} \\
\mu_t &= \frac{1}{\max(1, t - t_0)} \\
\theta_t &= \theta_{t-1} - \eta_t g_t \\
\bar{\theta}_t &= \begin{cases}
\theta_t & \text{if } t \leq t_0 \\
(1 - \mu_t)\bar{\theta}_{t-1} + \mu_t \theta_t & \text{if } t > t_0
\end{cases}
\end{aligned}
   $$
That is,
   $$
\begin{aligned} 
\theta_t =& \theta_{t-1} - \eta_t g_t
\,,
\nonumber
\\
\eta_t =& \frac{\gamma}{(1 + \lambda \gamma t)^\alpha}
\,,
\nonumber
\\
\bar{\theta}_t =& \begin{cases}
\theta_t & \text{if } t \leq t_0 \\
(1 - \mu_t)\bar{\theta}_{t-1} + \mu_t \theta_t & \text{if } t > t_0
\end{cases}
\,,
\nonumber
\\
\mu_t =& \frac{1}{\max(1, t - t_0)}
\,,
\nonumber
\\
g_t =& \nabla_\theta f(\theta_{t-1}) + \omega\theta_{t-1}
\end{aligned}
   $$
ASGD is constructed and used in the same way as SGD.
</span>
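A minimal usage sketch for ASGD follows, assuming a toy loss $f(\theta)=\sum\theta^2$ and `t0=0` so that averaging starts within the first few steps. The averaged parameters $\bar{\theta}_t$ are not written back into the model. In the PyTorch versions we are aware of they are stored in the optimizer state under the key `"ax"`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

param = nn.Parameter(torch.rand(3, dtype=torch.float64))
# t0=0 is chosen only so that averaging becomes visible within a few steps.
optimizer = optim.ASGD([param], lr=0.1, t0=0)

losses = []
for _ in range(20):
    optimizer.zero_grad()
    loss = (param ** 2).sum()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# The running average lives in the optimizer state, not in the parameter itself.
averaged = optimizer.state[param]["ax"]
print("current :", param.detach())
print("averaged:", averaged)
```

Because the iterates shrink toward zero on this loss, the averaged values lag behind the current ones. To evaluate with the averaged parameters, they have to be copied into the model explicitly.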


## Adagrad

<span style="font-size: 15px;">

Adagrad (Adaptive Gradient Algorithm) is an optimization algorithm that adapts the learning rate for each parameter individually based on the historical gradients. The key innovation is that Adagrad accumulates the squared gradients over time and uses this accumulation to scale the learning rate differently for each parameter. Parameters that receive large gradients will have their learning rates reduced more aggressively, while parameters with small gradients will have relatively larger learning rates. This makes Adagrad particularly well-suited for dealing with sparse data and features that occur infrequently.

In general, the init method in optim.Adagrad has the following default arguments:

```(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10, foreach=None, maximize=False, differentiable=False, fused=None)```

The default arguments ```differentiable```, ```foreach```, and ```fused``` will be discussed later in more detail. For now, if we denote the model's parameters at training step $t$ by $\theta_t$ (starting with $t=0$), the learning rate by $\gamma$, the learning rate decay by $\eta$, the weight decay by $\lambda$, the initial accumulator value by $\tau$, and the numerical stability term by $\epsilon$, then the optimization step in Adagrad takes the following general form:
   $$
\begin{aligned} 
\text{state\_sum}_0 &= \tau \\
g_t &= \begin{cases}
-\nabla_\theta f(\theta_{t-1}) & \text{if maximize} \\
\nabla_\theta f(\theta_{t-1}) & \text{otherwise}
\end{cases} \\
\tilde{\gamma}_t &= \frac{\gamma}{1 + (t-1)\eta} \\
g_t &\leftarrow g_t + \lambda \theta_{t-1} \quad \text{if } \lambda \neq 0 \\
\text{state\_sum}_t &= \text{state\_sum}_{t-1} + g_t^2 \\
\theta_t &= \theta_{t-1} - \frac{\tilde{\gamma}_t g_t}{\sqrt{\text{state\_sum}_t} + \epsilon}
\end{aligned}
   $$
That is,
   $$
\begin{aligned} 
\theta_t =& \theta_{t-1} - \frac{\tilde{\gamma}_t g_t}{\sqrt{\text{state\_sum}_t} + \epsilon}
\,,
\nonumber
\\
\tilde{\gamma}_t =& \frac{\gamma}{1 + (t-1)\eta}
\,,
\nonumber
\\
\text{state\_sum}_t =& \text{state\_sum}_{t-1} + g_t^2 \quad \text{with } \text{state\_sum}_0 = \tau
\,,
\nonumber
\\
g_t =& \nabla_\theta f(\theta_{t-1}) + \lambda\theta_{t-1}
\end{aligned}
   $$

</span>
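The Adagrad update can be checked in the same way as the SGD one. The sketch below is a hand-written version of the formulas for a toy loss $f(\theta)=\sum\theta^2$, compared against `optim.Adagrad` over three steps. The hyperparameter values are assumptions chosen so that every term is active.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

lr, lr_decay, weight_decay, init_acc, eps = 0.1, 0.01, 0.02, 0.0, 1e-10
theta0 = torch.rand(5, dtype=torch.float64)

param = nn.Parameter(theta0.clone())
optimizer = optim.Adagrad(
    [param], lr=lr, lr_decay=lr_decay, weight_decay=weight_decay,
    initial_accumulator_value=init_acc, eps=eps,
)

theta = theta0.clone()
state_sum = torch.full_like(theta, init_acc)  # state_sum_0 = tau
for t in range(1, 4):
    # PyTorch step
    optimizer.zero_grad()
    (param ** 2).sum().backward()
    optimizer.step()

    # Manual step following the formulas
    g = 2 * theta + weight_decay * theta          # gradient plus weight decay
    clr = lr / (1 + (t - 1) * lr_decay)           # decayed learning rate
    state_sum = state_sum + g ** 2                # accumulated squared gradients
    theta = theta - clr * g / (state_sum.sqrt() + eps)

adagrad_match = torch.allclose(param.detach(), theta)
print("match:", adagrad_match)
```

Since `state_sum` only grows, the effective step size for each coordinate can only shrink over time.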


## Adadelta

<span style="font-size: 15px;">


Adadelta is an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size using a decaying average. The key innovation is that Adadelta also maintains a running average of the squared parameter updates, which it uses to adapt the learning rate. This eliminates the need to manually set an initial learning rate, though PyTorch's implementation still includes a learning rate scaling factor for flexibility.

In general, the init method in optim.Adadelta has the following default arguments:

```(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0, foreach=None, capturable=False, maximize=False, differentiable=False)```

The default arguments ```differentiable```, ```foreach```, and ```capturable``` will be discussed later in more detail. For now, if we denote the model's parameters at training step $t$ by $\theta_t$ (starting with $t=0$), the learning rate by $\gamma$, the decay rate by $\rho$, the weight decay by $\lambda$, and the numerical stability term by $\epsilon$, then the optimization step in Adadelta takes the following general form:
   $$
\begin{aligned} 
v_0 &= 0 \quad \text{(square avg)} \\
u_0 &= 0 \quad \text{(accumulate variables)} \\
g_t &= \begin{cases}
-\nabla_\theta f(\theta_{t-1}) & \text{if maximize} \\
\nabla_\theta f(\theta_{t-1}) & \text{otherwise}
\end{cases} \\
g_t &\leftarrow g_t + \lambda \theta_{t-1} \quad \text{if } \lambda \neq 0 \\
v_t &= v_{t-1}\rho + g_t^2(1-\rho) \\
\Delta x_t &= \frac{\sqrt{u_{t-1} + \epsilon}}{\sqrt{v_t + \epsilon}} g_t \\
u_t &= u_{t-1}\rho + \Delta x_t^2(1-\rho) \\
\theta_t &= \theta_{t-1} - \gamma \Delta x_t
\end{aligned}
   $$
That is,
   $$
\begin{aligned} 
\theta_t =& \theta_{t-1} - \gamma \Delta x_t
\,,
\nonumber
\\
\Delta x_t =& \frac{\sqrt{u_{t-1} + \epsilon}}{\sqrt{v_t + \epsilon}} g_t
\,,
\nonumber
\\
v_t =& v_{t-1}\rho + g_t^2(1-\rho) \quad \text{with } v_0 = 0
\,,
\nonumber
\\
u_t =& u_{t-1}\rho + \Delta x_t^2(1-\rho) \quad \text{with } u_0 = 0
\,,
\nonumber
\\
g_t =& \nabla_\theta f(\theta_{t-1}) + \lambda\theta_{t-1}
\end{aligned}
   $$

</span>
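As a sanity check of the Adadelta formulas, the sketch below reproduces three steps of `optim.Adadelta` by hand on a toy loss $f(\theta)=\sum\theta^2$. The hyperparameter values are assumptions for this example.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

lr, rho, eps, weight_decay = 1.0, 0.9, 1e-6, 0.02
theta0 = torch.rand(5, dtype=torch.float64)

param = nn.Parameter(theta0.clone())
optimizer = optim.Adadelta([param], lr=lr, rho=rho, eps=eps, weight_decay=weight_decay)

theta = theta0.clone()
v = torch.zeros_like(theta)  # running average of squared gradients
u = torch.zeros_like(theta)  # running average of squared updates
for _ in range(3):
    # PyTorch step
    optimizer.zero_grad()
    (param ** 2).sum().backward()
    optimizer.step()

    # Manual step following the formulas
    g = 2 * theta + weight_decay * theta
    v = rho * v + (1 - rho) * g ** 2
    delta = (u + eps).sqrt() / (v + eps).sqrt() * g
    u = rho * u + (1 - rho) * delta ** 2
    theta = theta - lr * delta

adadelta_match = torch.allclose(param.detach(), theta)
print("match:", adadelta_match)
```

With $u_0 = 0$, the first update is scaled by $\sqrt{\epsilon}$, so Adadelta starts with very small steps.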

## RMSprop

<span style="font-size: 15px;">


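RMSprop (Root Mean Square Propagation) scales each parameter's update by an exponential moving average of its squared gradients, which avoids Adagrad's monotonically shrinking learning rate. Momentum and a centered variant, which subtracts the squared running mean of the gradient, are optional.

In general, the init method in optim.RMSprop has the following default arguments: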

```(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False, capturable=False, foreach=None, maximize=False, differentiable=False)```

The default arguments ```differentiable```, ```foreach```, and ```capturable``` will be discussed later in more detail. For now, if we denote the model's parameters at training step $t$ by $\theta_t$ (starting with $t=0$), the learning rate by $\gamma$, the smoothing constant by $\alpha$, the weight decay by $\lambda$, the momentum factor by $\mu$, and the numerical stability term by $\epsilon$, then the optimization step in RMSprop takes the following general form:
   $$
\begin{aligned} 
v_0 &= 0 \quad \text{(square average)} \\
b_0 &= 0 \quad \text{(buffer)} \\
g_0^{ave} &= 0 \\
g_t &= \begin{cases}
-\nabla_\theta f(\theta_{t-1}) & \text{if maximize} \\
\nabla_\theta f(\theta_{t-1}) & \text{otherwise}
\end{cases} \\
g_t &\leftarrow g_t + \lambda \theta_{t-1} \quad \text{if } \lambda \neq 0 \\
v_t &= \alpha v_{t-1} + (1-\alpha)g_t^2 \\
g_t^{ave} &= \alpha g_{t-1}^{ave} + (1-\alpha)g_t \quad \text{(only if centered)} \\
\tilde{v}_t &= \begin{cases}
v_t - (g_t^{ave})^2 & \text{if centered} \\
v_t & \text{otherwise}
\end{cases} \\
b_t &= \mu b_{t-1} + \frac{g_t}{\sqrt{\tilde{v}_t} + \epsilon} \quad \text{(only if } \mu > 0\text{)} \\
\theta_t &= \begin{cases}
\theta_{t-1} - \gamma b_t & \text{if } \mu > 0 \\
\theta_{t-1} - \frac{\gamma g_t}{\sqrt{\tilde{v}_t} + \epsilon} & \text{otherwise}
\end{cases}
\end{aligned}
   $$
That is,
   $$
\begin{aligned} 
\theta_t =& \begin{cases}
\theta_{t-1} - \gamma b_t & \text{if } \mu > 0 \\
\theta_{t-1} - \frac{\gamma g_t}{\sqrt{\tilde{v}_t} + \epsilon} & \text{otherwise}
\end{cases}
\,,
\nonumber
\\
b_t =& \mu b_{t-1} + \frac{g_t}{\sqrt{\tilde{v}_t} + \epsilon} \quad \text{with } b_0 = 0 \quad \text{(only if } \mu > 0\text{)}
\,,
\nonumber
\\
\tilde{v}_t =& \begin{cases}
v_t - (g_t^{ave})^2 & \text{if centered} \\
v_t & \text{otherwise}
\end{cases}
\,,
\nonumber
\\
g_t^{ave} =& \alpha g_{t-1}^{ave} + (1-\alpha)g_t \quad \text{with } g_0^{ave} = 0 \quad \text{(only if centered)}
\,,
\nonumber
\\
v_t =& \alpha v_{t-1} + (1-\alpha)g_t^2 \quad \text{with } v_0 = 0
\,,
\nonumber
\\
g_t =& \nabla_\theta f(\theta_{t-1}) + \lambda\theta_{t-1}
\end{aligned}
   $$


</span>
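The sketch below reproduces three steps of `optim.RMSprop` by hand, with both `centered=True` and momentum switched on so that every branch of the update is exercised. It assumes a toy loss $f(\theta)=\sum\theta^2$ and that $\epsilon$ is added after taking the square root, which is how `torch.optim.RMSprop` behaves. The hyperparameter values are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

lr, alpha, eps, weight_decay, momentum = 0.01, 0.99, 1e-8, 0.02, 0.5
theta0 = torch.rand(5, dtype=torch.float64)

param = nn.Parameter(theta0.clone())
optimizer = optim.RMSprop(
    [param], lr=lr, alpha=alpha, eps=eps,
    weight_decay=weight_decay, momentum=momentum, centered=True,
)

theta = theta0.clone()
v = torch.zeros_like(theta)      # running average of squared gradients
g_ave = torch.zeros_like(theta)  # running average of gradients (centered variant)
b = torch.zeros_like(theta)      # momentum buffer
for _ in range(3):
    # PyTorch step
    optimizer.zero_grad()
    (param ** 2).sum().backward()
    optimizer.step()

    # Manual step following the formulas
    g = 2 * theta + weight_decay * theta
    v = alpha * v + (1 - alpha) * g ** 2
    g_ave = alpha * g_ave + (1 - alpha) * g
    v_tilde = v - g_ave ** 2               # centered second moment
    b = momentum * b + g / (v_tilde.sqrt() + eps)
    theta = theta - lr * b

rmsprop_match = torch.allclose(param.detach(), theta)
print("match:", rmsprop_match)
```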

## Adam

<span style="font-size: 15px;">


Adam (Adaptive Moment Estimation) is an optimization algorithm that combines ideas from RMSprop and momentum methods. The key innovation is that Adam computes adaptive learning rates for each parameter by maintaining exponentially decaying averages of both past gradients (first moment) and past squared gradients (second moment). Additionally, Adam includes bias correction terms to account for the initialization of these moment estimates at zero, which is particularly important during the initial training steps. An optional AMSGrad variant maintains the maximum of past second moments for improved convergence in some cases.

In general, the init method in optim.Adam has the following default arguments:

```(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False, foreach=None, maximize=False, capturable=False, differentiable=False, fused=None, decoupled_weight_decay=False)```

The default arguments ```differentiable```, ```foreach```, ```capturable```, ```fused```, and ```decoupled_weight_decay``` will be discussed later in more detail. For now, if we denote the model's parameters at training step $t$ by $\theta_t$ (starting with $t=0$), the learning rate by $\gamma$, the exponential decay rates for moment estimates by $\beta_1$ and $\beta_2$ (the `betas` tuple), the weight decay by $\lambda$, and the numerical stability term by $\epsilon$, then the optimization step in Adam takes the following general form:
   $$
\begin{aligned} 
m_0 &= 0 \quad \text{(first moment)} \\
v_0 &= 0 \quad \text{(second moment)} \\
v_0^{max} &= 0 \\
g_t &= \begin{cases}
-\nabla_\theta f(\theta_{t-1}) & \text{if maximize} \\
\nabla_\theta f(\theta_{t-1}) & \text{otherwise}
\end{cases} \\
g_t &\leftarrow g_t + \lambda \theta_{t-1} \quad \text{if } \lambda \neq 0 \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
v_t^{max} &= \max(v_{t-1}^{max}, v_t) \quad \text{(only if amsgrad)} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \begin{cases}
\frac{v_t^{max}}{1-\beta_2^t} & \text{if amsgrad} \\
\frac{v_t}{1-\beta_2^t} & \text{otherwise}
\end{cases} \\
\theta_t &= \theta_{t-1} - \frac{\gamma \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
   $$
That is,
   $$
\begin{aligned} 
\theta_t =& \theta_{t-1} - \frac{\gamma \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\,,
\nonumber
\\
\hat{m}_t =& \frac{m_t}{1-\beta_1^t}
\,,
\nonumber
\\
\hat{v}_t =& \begin{cases}
\frac{v_t^{max}}{1-\beta_2^t} & \text{if amsgrad} \\
\frac{v_t}{1-\beta_2^t} & \text{otherwise}
\end{cases}
\,,
\nonumber
\\
v_t^{max} =& \max(v_{t-1}^{max}, v_t) \quad \text{with } v_0^{max} = 0 \quad \text{(only if amsgrad)}
\,,
\nonumber
\\
m_t =& \beta_1 m_{t-1} + (1-\beta_1)g_t \quad \text{with } m_0 = 0
\,,
\nonumber
\\
v_t =& \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \quad \text{with } v_0 = 0
\,,
\nonumber
\\
g_t =& \nabla_\theta f(\theta_{t-1}) + \lambda\theta_{t-1}
\end{aligned}
   $$



</span>
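The Adam update, including bias correction and the AMSGrad branch, can be written out by hand and compared against `optim.Adam`. The sketch below does so on a toy loss $f(\theta)=\sum\theta^2$. The helper names `manual_adam` and `torch_adam` and the hyperparameter values are assumptions for this example, and $\epsilon$ is added after the square root, as in PyTorch's implementation.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

def manual_adam(theta0, steps, lr, betas, eps, weight_decay, amsgrad):
    """Hand-written Adam for the loss f(theta) = sum(theta**2)."""
    beta1, beta2 = betas
    theta = theta0.clone()
    m = torch.zeros_like(theta)      # first moment
    v = torch.zeros_like(theta)      # second moment
    v_max = torch.zeros_like(theta)  # running maximum, used only by AMSGrad
    for t in range(1, steps + 1):
        g = 2 * theta + weight_decay * theta  # L2-style weight decay
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
        if amsgrad:
            v_max = torch.maximum(v_max, v)
            v_hat = v_max / (1 - beta2 ** t)
        else:
            v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (v_hat.sqrt() + eps)
    return theta

def torch_adam(theta0, steps, **hyper):
    """Run optim.Adam on the same loss and return the final parameter values."""
    param = nn.Parameter(theta0.clone())
    optimizer = optim.Adam([param], **hyper)
    for _ in range(steps):
        optimizer.zero_grad()
        (param ** 2).sum().backward()
        optimizer.step()
    return param.detach()

theta0 = torch.rand(5, dtype=torch.float64)
adam_matches = []
for amsgrad in (False, True):
    hyper = dict(lr=0.01, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.02, amsgrad=amsgrad)
    adam_matches.append(
        torch.allclose(torch_adam(theta0, 5, **hyper), manual_adam(theta0, 5, **hyper))
    )
print("matches:", adam_matches)
```

At $t=1$ the bias-corrected ratio $\hat{m}_1/\sqrt{\hat{v}_1}$ equals the sign of the gradient, so the first step has magnitude close to $\gamma$ in every coordinate.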

## AdamW

<span style="font-size: 15px;">


AdamW (Adam with Decoupled Weight Decay) is a variant of the Adam optimizer that modifies how weight decay is applied. The key innovation is that weight decay is decoupled from the gradient-based optimization step and applied directly to the parameters. Unlike standard Adam where weight decay is added to the gradients (L2 regularization), AdamW applies weight decay as a separate multiplicative factor to the parameters themselves. This decoupling has been shown to improve generalization performance, particularly in deep learning applications.

In general, the init method in optim.AdamW has the following default arguments:

```(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False, maximize=False, foreach=None, capturable=False, differentiable=False, fused=None)```

The default arguments ```differentiable```, ```foreach```, ```capturable```, and ```fused``` will be discussed later in more detail. For now, if we denote the model's parameters at training step $t$ by $\theta_t$ (starting with $t=0$), the learning rate by $\gamma$, the exponential decay rates for moment estimates by $\beta_1$ and $\beta_2$ (the `betas` tuple), the weight decay by $\lambda$, and the numerical stability term by $\epsilon$, then the optimization step in AdamW takes the following general form:
   $$
\begin{aligned} 
m_0 &= 0 \quad \text{(first moment)} \\
v_0 &= 0 \quad \text{(second moment)} \\
v_0^{max} &= 0 \\
g_t &= \begin{cases}
-\nabla_\theta f(\theta_{t-1}) & \text{if maximize} \\
\nabla_\theta f(\theta_{t-1}) & \text{otherwise}
\end{cases} \\
\theta_t &= \theta_{t-1} - \gamma \lambda \theta_{t-1} \\
m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\
v_t^{max} &= \max(v_{t-1}^{max}, v_t) \quad \text{(only if amsgrad)} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t} \\
\hat{v}_t &= \begin{cases}
\frac{v_t^{max}}{1-\beta_2^t} & \text{if amsgrad} \\
\frac{v_t}{1-\beta_2^t} & \text{otherwise}
\end{cases} \\
\theta_t &\leftarrow \theta_t - \frac{\gamma \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
   $$
That is,
   $$
\begin{aligned} 
\theta_t =& \theta_{t-1} - \gamma \lambda \theta_{t-1} - \frac{\gamma \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\,,
\nonumber
\\
\hat{m}_t =& \frac{m_t}{1-\beta_1^t}
\,,
\nonumber
\\
\hat{v}_t =& \begin{cases}
\frac{v_t^{max}}{1-\beta_2^t} & \text{if amsgrad} \\
\frac{v_t}{1-\beta_2^t} & \text{otherwise}
\end{cases}
\,,
\nonumber
\\
v_t^{max} =& \max(v_{t-1}^{max}, v_t) \quad \text{with } v_0^{max} = 0 \quad \text{(only if amsgrad)}
\,,
\nonumber
\\
m_t =& \beta_1 m_{t-1} + (1-\beta_1)g_t \quad \text{with } m_0 = 0
\,,
\nonumber
\\
v_t =& \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \quad \text{with } v_0 = 0
\,,
\nonumber
\\
g_t =& \nabla_\theta f(\theta_{t-1})
\end{aligned}
   $$


</span>
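To make the difference between decoupled weight decay and Adam's gradient-based weight decay concrete, the sketch below reproduces `optim.AdamW` by hand (without AMSGrad) and then runs `optim.Adam` with the same `weight_decay`. It assumes a toy loss $f(\theta)=\sum\theta^2$, and the helper name `run` and the hyperparameter values are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

lr, (beta1, beta2), eps, weight_decay = 0.01, (0.9, 0.999), 1e-8, 0.1
theta0 = torch.rand(5, dtype=torch.float64)

def run(optimizer_cls, steps=3):
    """Run a PyTorch optimizer on f(theta) = sum(theta**2) and return the result."""
    param = nn.Parameter(theta0.clone())
    optimizer = optimizer_cls([param], lr=lr, betas=(beta1, beta2),
                              eps=eps, weight_decay=weight_decay)
    for _ in range(steps):
        optimizer.zero_grad()
        (param ** 2).sum().backward()
        optimizer.step()
    return param.detach()

# Manual AdamW following the formulas
theta = theta0.clone()
m = torch.zeros_like(theta)
v = torch.zeros_like(theta)
for t in range(1, 4):
    g = 2 * theta                               # plain gradient, no weight decay term
    theta = theta - lr * weight_decay * theta   # decoupled weight decay
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat.sqrt() + eps)

adamw_result = run(optim.AdamW)
adam_result = run(optim.Adam)

adamw_match = torch.allclose(adamw_result, theta)
same_as_adam = torch.allclose(adamw_result, adam_result)
print("AdamW matches manual update:", adamw_match)
print("AdamW equals Adam with the same weight_decay:", same_as_adam)
```

In Adam, the decay term passes through the moment estimates and is rescaled by $\sqrt{\hat{v}_t}$. In AdamW, it shrinks the parameters directly by the factor $(1 - \gamma\lambda)$, which is why the two results differ.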