# <font color='blue' size='5px'>Learning Rates</font>





# 1 Introduction

**Learning Rate** is a crucial hyperparameter in the training of machine learning models, including artificial neural networks. It determines the step size at which the model's parameters are updated during the optimization process. The learning rate influences how quickly or slowly a model converges to an optimal solution, and choosing the right learning rate is essential for successful model training.


<details>
  <summary>Importance of Learning Rate:</summary>

- **Convergence:** A suitable learning rate helps the model converge to an optimal solution. If the learning rate is too high, the optimization process may overshoot the optimal point or oscillate. If it's too low, the process may converge very slowly or get stuck in local minima.

- **Stability:** The learning rate affects the stability of the optimization process. A well-chosen learning rate leads to stable convergence, while a poorly chosen rate can lead to divergence or chaotic behavior.

- **Generalization:** The learning rate can influence how well a model generalizes to unseen data. A rate that is too high can keep the model from settling into a good solution, while an extremely small rate may leave the model underfit within a fixed training budget.
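The convergence behaviour described above can be seen on a toy problem. The following is a minimal sketch, assuming the one-dimensional loss $f(x) = x^2$ and three learning rates picked purely for illustration; neither comes from a real training setup.

```python
def run_gradient_descent(learning_rate, x0=1.0, num_steps=20):
    """Minimize f(x) = x**2 with plain gradient descent; return the final x."""
    x = x0
    for _ in range(num_steps):
        grad = 2.0 * x                  # derivative of x**2
        x = x - learning_rate * grad    # gradient descent update
    return x

# Hypothetical learning rates chosen to show three regimes.
results = {lr: run_gradient_descent(lr) for lr in (0.01, 0.4, 1.1)}
for lr, final_x in results.items():
    print(f"learning rate {lr}: |x| after 20 steps = {abs(final_x):.3g}")
```

With the smallest rate the iterate is still far from the minimum at 0, the middle rate reaches it almost exactly, and the largest rate makes $|x|$ grow at every step.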
</details>




<details>
  <summary>Types of Learning Rate:</summary>

1. **Fixed Learning Rate:** In this approach, the learning rate remains constant throughout the training process. It's simple and often works well when the dataset and model are well-behaved. However, it may lead to slow convergence and difficulty finding an optimal solution in more complex scenarios.

2. **Learning Rate Annealing:** This technique involves gradually reducing the learning rate during training. Common annealing strategies include:
   - **Step Decay:** Reducing the learning rate at predefined intervals.
   - **Exponential Decay:** Reducing the learning rate exponentially over time.
   - **1/t Decay:** Reducing the learning rate as a function of the training iteration.

   Learning rate annealing can help improve convergence and enable finer adjustments as the optimization process progresses.

3. **Adaptive Learning Rate:** These methods dynamically adjust the learning rate based on the progress of optimization. Common adaptive learning rate algorithms include:
   - **Adagrad:** Adapts the learning rate for each parameter based on historical gradient information.
   - **RMSprop:** Adapts the learning rate with a moving average of squared gradients.
   - **Adam (Adaptive Moment Estimation):** Combines adaptive learning rate techniques with momentum.

   Adaptive learning rates are highly effective in training deep neural networks and can speed up convergence significantly.

4. **Cyclic Learning Rates:** These approaches involve cyclically increasing and decreasing the learning rate during training. The idea is to explore different learning rates to escape local minima and converge faster. Common methods include cyclic learning rate policies and the "learning rate finder."

5. **Warm-up Schedules:** In some cases, models are trained with a lower learning rate initially and gradually increase it. This "warm-up" phase helps stabilize training, particularly when using large learning rates.

6. **Learning Rate Schedulers:** Learning rate schedulers adjust the learning rate based on specific criteria, such as the number of epochs or the validation loss. Popular schedulers include the StepLR, ReduceLROnPlateau, and CosineAnnealing schedulers.

7. **One-cycle Learning Rates:** The one-cycle policy uses a single cycle that spans the whole training run: the learning rate begins at a lower value, increases to a maximum, and then decreases again.
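The warm-up and one-cycle ideas in the list above can be sketched as a single function of the training step. This is a simplified piecewise-linear illustration, not the exact schedule of any particular library; the function name `one_cycle_lr` and all parameter values are assumptions made for the example.

```python
def one_cycle_lr(step, total_steps, base_lr, max_lr, warmup_steps):
    """Piecewise-linear schedule: ramp base_lr -> max_lr, then back down to base_lr."""
    if step < warmup_steps:
        # Warm-up phase: increase linearly from base_lr towards max_lr.
        progress = step / warmup_steps
        return base_lr + progress * (max_lr - base_lr)
    # Decay phase: decrease linearly from max_lr back to base_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr - progress * (max_lr - base_lr)

# Hypothetical settings: 100 steps, 30 of them warm-up.
lrs = [one_cycle_lr(s, total_steps=100, base_lr=0.01, max_lr=0.1, warmup_steps=30)
       for s in range(101)]
print(lrs[0], lrs[30], lrs[100])
```

The first 30 steps act as the warm-up phase; the remaining steps anneal the rate back to its starting value.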
</details>



<details>
  <summary>How to Choose the Right Learning Rate:</summary>

Choosing the appropriate learning rate for your model and dataset often involves experimentation. Here's a general approach:

1. Start with a moderate learning rate and observe the convergence behavior. If the loss is decreasing steadily, it may be a good choice.

2. If the loss doesn't decrease or oscillates, try reducing the learning rate.

3. If the model converges too slowly, try increasing the learning rate.

4. Experiment with different learning rate schedules and adaptive techniques to fine-tune the training process.

5. Cross-validation can help in selecting the best learning rate for your specific task.

6. To run a learning rate search, choose the range of candidate values carefully, typically spanning several orders of magnitude. Techniques like the "learning rate finder" or random search can be useful for this purpose.

The choice of learning rate is problem-specific, and the ideal rate can vary depending on factors such as the model architecture, dataset size, and complexity of the task. Therefore, a systematic approach to hyperparameter tuning, including learning rate, is essential for achieving the best model performance.
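As a rough illustration of such a search, the sketch below tries log-spaced candidate rates on a tiny regression problem and keeps the one with the lowest final loss. The dataset, the helper `final_loss`, and the candidate range are all hypothetical; a real search would evaluate on validation data.

```python
def final_loss(learning_rate, num_steps=100):
    """Fit y = w * x by gradient descent on a toy dataset; return the final MSE."""
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [0.0, 2.0, 4.0, 6.0]           # generated with w = 2
    n = len(xs)
    w = 0.0
    for _ in range(num_steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n

# Candidates spaced on a log scale: 1e-5, 1e-4, ..., 1.
candidates = [10.0 ** k for k in range(-5, 1)]
losses = {lr: final_loss(lr) for lr in candidates}
best_lr = min(losses, key=losses.get)

for lr in candidates:
    print(f"lr={lr:g}  final MSE={losses[lr]:.3g}")
print("best candidate:", best_lr)
```

On this toy problem the largest candidate diverges and the smallest ones barely move, which is the pattern the steps above ask you to look for.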
</details>

# 2 Fixed Learning Rate

<details>
  <summary>Theory</summary>

**Fixed Learning Rate** is a straightforward learning rate strategy in the context of machine learning and deep learning. With a fixed learning rate, the learning rate remains constant throughout the training process. This means that the step size used to update the model parameters remains the same from the beginning to the end of training. Here's a detailed explanation of fixed learning rate and a numerical example:

**Fixed Learning Rate Process:**

1. **Initialization:** Choose a constant learning rate, typically denoted as $\alpha$, which is a small positive number. The learning rate is a hyperparameter that you must set before training your model. The choice of the learning rate can significantly impact the training process.

2. **Training Loop:** During each training iteration (epoch or step), the model's parameters are updated using the fixed learning rate. The process typically involves the following steps:

    a. **Forward Pass:** Use the current model parameters to make predictions on the training data.

    b. **Loss Calculation:** Compute the loss, which quantifies the error between the model's predictions and the actual target values.

    c. **Gradient Calculation:** Calculate the gradient of the loss with respect to the model parameters. The gradient points in the direction of the steepest ascent in the loss function.

    d. **Parameter Update:** Adjust the model parameters (weights and biases) using the gradient and the fixed learning rate:
       
      $$ \theta \leftarrow \theta - \alpha \nabla J(\theta) $$

      Here, $\theta$ represents the model's parameters, $\nabla J(\theta)$ is the gradient, and $\alpha$ is the fixed learning rate.

    e. **Iteration:** Repeat these steps for a predetermined number of iterations or until a convergence criterion is met. This process aims to minimize the loss and optimize the model's parameters.
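The training loop above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a full training framework: the function `gradient_descent`, the example loss, and the chosen values of $\alpha$ and the iteration count are assumptions made for the example.

```python
def gradient_descent(grad_fn, theta, alpha, num_iterations):
    """Repeat theta <- theta - alpha * grad J(theta) with a fixed learning rate."""
    for _ in range(num_iterations):
        grads = grad_fn(theta)                                  # gradient calculation
        theta = [t - alpha * g for t, g in zip(theta, grads)]   # parameter update
    return theta

# Hypothetical loss J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2, minimized at (3, -1).
def example_grad(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]

theta_final = gradient_descent(example_grad, theta=[0.0, 0.0], alpha=0.1, num_iterations=200)
print(theta_final)
```

Here the forward pass and loss calculation are folded into `example_grad`, since the gradient of this simple loss is known in closed form.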



</details>

<details>
  <summary>Numerical Example</summary>

Let's say we have a simple linear regression problem, where we want to predict the house prices based on their sizes. We have a dataset with the following data:

| Size (in square feet) | Price (in thousands of dollars) |
|-----------------------|---------------------------------|
| 1000                  | 250                             |
| 1500                  | 350                             |
| 2000                  | 450                             |
| 2500                  | 550                             |

We want to use gradient descent to find the optimal parameters (slope and intercept) for our linear regression model. We initialize the parameters as follows:

- Slope ($m$) = 0
- Intercept ($b$) = 0

We also choose a fixed learning rate of $\alpha = 0.01$.

**We iterate through the following steps until convergence:**

1. Calculate the predicted prices using the current parameters:
   - For the first data point (Size = 1000), the predicted price is:
     $$ \text{predicted price} = m \times \text{Size} + b = 0 \times 1000 + 0 = 0 $$
   - Similarly, we calculate the predicted prices for the other data points.

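2. Calculate the loss between the predicted prices and the actual prices. This example uses half the squared error, so that the factor of 2 cancels when differentiating:
   - For the first data point, the half squared error is:
     $$ \frac{(\text{predicted price} - \text{actual price})^2}{2} = \frac{(0 - 250)^2}{2} = 31250 $$
   - Similarly, the half squared errors for the other data points are 61250, 101250, and 151250. The MSE loss is their average:
     $$ \text{MSE} = \frac{31250 + 61250 + 101250 + 151250}{4} = 86250 $$

3. Calculate the gradients of the MSE with respect to the parameters:
   - The gradient with respect to the slope ($m$) is:

     $$ \frac{\partial \text{MSE}}{\partial m} = \frac{1}{4} \sum_{i=1}^{4} (\text{predicted price}_i - \text{actual price}_i) \times \text{Size}_i = \frac{-250 \times 1000 - 350 \times 1500 - 450 \times 2000 - 550 \times 2500}{4} = -762500 $$

   - The gradient with respect to the intercept ($b$) is:
   
     $$ \frac{\partial \text{MSE}}{\partial b} = \frac{1}{4} \sum_{i=1}^{4} (\text{predicted price}_i - \text{actual price}_i) = \frac{-250 - 350 - 450 - 550}{4} = -400 $$

4. Update the parameters using the gradients and the learning rate:
   - The new slope is:
     $$ m_{new} = m - \alpha \times \frac{\partial \text{MSE}}{\partial m} = 0 - 0.01 \times (-762500) = 7625 $$
   - The new intercept is:
     $$ b_{new} = b - \alpha \times \frac{\partial \text{MSE}}{\partial b} = 0 - 0.01 \times (-400) = 4 $$

   The data are fit exactly by $m = 0.2$ and $b = 50$, so a slope of 7625 overshoots enormously. With sizes in the thousands, $\alpha = 0.01$ is far too large and further iterations diverge. Rescaling the sizes (for example, to thousands of square feet) or using a much smaller learning rate keeps the updates stable.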

5. Repeat steps 1-4 until convergence (i.e., until the MSE stops decreasing or reaches a minimum threshold).

With a fixed learning rate, every iteration scales the gradient by the same factor $\alpha$, so the size of each update depends only on the current gradient. If the learning rate is too large, we risk overshooting the optimal solution and diverging from it. If it is too small, convergence takes many more iterations.
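The arithmetic of the first iteration can be checked directly from the table. The sketch below is a plain-Python recomputation, assuming the loss is half the mean squared error (which matches the gradient formulas above); the variable names are chosen for this example only.

```python
sizes = [1000, 1500, 2000, 2500]     # square feet
prices = [250, 350, 450, 550]        # thousands of dollars
m, b, alpha = 0.0, 0.0, 0.01         # initial slope, intercept, learning rate
n = len(sizes)

# Prediction errors with the current parameters.
errors = [(m * x + b) - y for x, y in zip(sizes, prices)]

# Half mean squared error and its gradients.
loss = sum(e ** 2 for e in errors) / (2 * n)
grad_m = sum(e * x for e, x in zip(errors, sizes)) / n
grad_b = sum(errors) / n

# One gradient descent update.
m_new = m - alpha * grad_m
b_new = b - alpha * grad_b
print(loss, grad_m, grad_b, m_new, b_new)
```

Run as written, this gives a loss of 86250, gradients of -762500 (slope) and -400 (intercept), and updated parameters of about 7625 and 4. The huge slope after a single step shows how sensitive a fixed learning rate is to the scale of the input features.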

</details>

<details>
  <summary>Code</summary>


Let's illustrate fixed learning rate with a simple numerical example using a basic linear regression problem. In this example, we aim to fit a straight line to a set of data points. We'll use a fixed learning rate to update the model's parameters during training.

Suppose we have a dataset of points (x, y) and want to fit a linear model $y = ax + b$ to the data. We'll use gradient descent with a fixed learning rate to find the values of $a$ and $b$ that minimize the mean squared error (MSE) loss.

Here's a Python code example using a fixed learning rate:

```python
import numpy as np

# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 1)
y = 3 * X + 2 + 0.1 * np.random.randn(100, 1)

# Initialize model parameters
a = 1.0  # Initial guess for 'a'
b = 1.0  # Initial guess for 'b'

# Fixed learning rate
learning_rate = 0.01

# Number of training iterations
num_iterations = 1000

# Training loop
for iteration in range(num_iterations):
    # Predicted values using the current model parameters
    y_pred = a * X + b
    
    # Calculate the mean squared error (MSE) loss
    mse_loss = ((y_pred - y)**2).mean()
    
    # Compute the gradients with respect to 'a' and 'b'
    # (both are means over all samples, so 'a' and 'b' stay scalars)
    grad_a = 2 * ((y_pred - y) * X).mean()
    grad_b = 2 * (y_pred - y).mean()
    
    # Update model parameters using the fixed learning rate
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b
    
    if (iteration + 1) % 100 == 0:
        print(f'Iteration {iteration + 1}, Loss: {mse_loss}, a: {a}, b: {b}')
```

In this example:

- We generate synthetic data points (X, y) for a linear regression task.

- We initialize the model parameters 'a' and 'b' with arbitrary values.

- We use a fixed learning rate of 0.01.

- In each iteration, we calculate the loss, compute the gradients, and update the parameters 'a' and 'b' using the fixed learning rate.

- The training loop repeats for a specified number of iterations, and we monitor the loss and parameter values.

Fixed learning rates are simple to implement and can work well when you have prior knowledge of a suitable learning rate for your specific problem. However, in more complex scenarios, adaptive learning rate techniques are often preferred to achieve faster convergence and better results.

</details>

# 3 Learning Rate Annealing

## Step Decay

<details>
  <summary>Theory</summary>

**Learning Rate Annealing with Step Decay** is a technique used in training machine learning models, particularly neural networks, to adapt the learning rate during the training process.
- It involves **reducing the learning rate at specific predefined steps or epochs**.
- The goal is to **improve the convergence of the model by allowing for larger learning rates in the initial stages of training and then gradually reducing the learning rate as training progresses**. This can help the model converge more effectively and reach a better solution. Let's delve into the mathematics and usage of step decay learning rate annealing:

**Mathematics of Step Decay:**

1. **Initialization**: You start with an initial learning rate, denoted as $LR_0$, which is typically set based on prior knowledge, experimentation, or hyperparameter tuning.

2. **Annealing Schedule**: Define a schedule that specifies how the learning rate should change over time. In step decay, you reduce the learning rate by a fixed factor $\gamma$ at predefined intervals (steps or epochs). The new learning rate at each step is calculated as follows:
   
   $$ LR_{\text{new}} = \gamma \cdot LR_{\text{old}} $$

   Where:
   - $LR_{\text{new}}$ is the updated learning rate.
   - $LR_{\text{old}}$ is the current learning rate.
   - $\gamma$ is the decay factor, typically a value between 0 and 1.

3. **Usage**:
   - Initially, the model uses the initial learning rate $LR_0$ for training. This higher learning rate allows the model to make large parameter updates, which can speed up the initial stages of training.
   - At predefined intervals (e.g., every $n$ epochs), you apply the annealing schedule and reduce the learning rate by the factor $\gamma$.
   - The process continues, with the learning rate decreasing at each step, until the training process is completed.

**Benefits and Usage:**

- **Faster Convergence**: Step decay helps the model converge faster in the initial stages, as the larger learning rate allows for more significant parameter updates.

- **Stability**: As training progresses, reducing the learning rate can help stabilize the optimization process, preventing overshooting or oscillations in the loss landscape.

- **Improved Generalization**: Gradually reducing the learning rate can help the model generalize better by fine-tuning its parameters as training proceeds.

- **Robustness**: Step decay is a simple and effective technique that is widely used in practice for training deep neural networks, as it provides a balance between rapid convergence and stable optimization.

- **Hyperparameter Tuning**: You can experiment with different decay factors and step intervals to find the values that work best for your specific problem.

Note that the decay factor ($\gamma$) and the step interval are hyperparameters that may require some experimentation to tune for your specific problem. Learning rate schedules such as step decay can be useful tools to improve the training process in machine learning and deep learning tasks.
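To make the schedule concrete, the sketch below computes the step decay learning rate for a given epoch. It assumes the decay is applied every `step_size` epochs, so the rate after a given number of epochs is $LR_0 \cdot \gamma^{\lfloor \text{epoch} / \text{step size} \rfloor}$; the function name and the values 0.1, 0.5, and 10 are chosen for illustration.

```python
def step_decay_lr(initial_lr, gamma, step_size, epoch):
    """Learning rate at `epoch`: initial_lr times gamma once per completed interval."""
    return initial_lr * gamma ** (epoch // step_size)

# Hypothetical settings: start at 0.1 and halve the rate every 10 epochs.
epochs_to_show = (0, 9, 10, 25, 49)
schedule = [step_decay_lr(0.1, 0.5, 10, epoch) for epoch in epochs_to_show]
for epoch, lr in zip(epochs_to_show, schedule):
    print(f"epoch {epoch:2d}: lr = {lr}")
```

The rate stays flat within each 10-epoch interval and drops only at the interval boundaries, which is what distinguishes step decay from schedules that change the rate every epoch.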
</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

Here's a code example demonstrating step decay learning rate annealing during training with a PyTorch model for a linear regression problem:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR
# Seed PyTorch's RNG (the data below are generated with torch, not NumPy)
torch.manual_seed(42)

# Generate synthetic data
X = torch.rand(100, 1)
y = 3 * X + 2 + 0.1 * torch.randn(100, 1)

# Define a simple linear regression model
class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Create the model
model = LinearRegression()

# Set initial learning rate, decay factor, and step size
initial_lr = 0.1
decay_factor = 0.5
step_size = 10

# Define the loss function (mean squared error) and the optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=initial_lr)

# Step decay: multiply the learning rate by decay_factor every step_size epochs
lr_schedule = StepLR(optimizer, step_size=step_size, gamma=decay_factor)

# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    
    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Learning rate that was used for this epoch's update
    current_lr = optimizer.param_groups[0]["lr"]
    
    # Advance the schedule after the optimizer step (the order PyTorch expects)
    lr_schedule.step()
    
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}, Learning Rate: {current_lr}')

# Inspect the learned parameters. The data were generated with weight 3 and bias 2;
# with only 50 epochs and a decaying rate the fit may still be some way off.
print('Learned parameters:', model.linear.weight.item(), model.linear.bias.item())
```

In this PyTorch code example:

- We generate synthetic data points (X, y) for a linear regression task.

- We define a simple linear regression model using PyTorch's `nn.Module`.

- We use the mean squared error (MSE) as the loss function and stochastic gradient descent (SGD) as the optimizer.

- We define a step decay learning rate schedule using `StepLR` with a step size of 10 and a decay factor of 0.5. The learning rate will decrease by a factor of 0.5 every 10 epochs.

- During the training loop, we adjust the learning rate according to the schedule using `lr_schedule.step()`.

- The training process continues for a specified number of epochs, and we monitor the final learned parameters.

Step decay learning rate annealing can help improve model convergence by allowing for larger learning rates initially and smaller learning rates as training progresses, which can be particularly useful in deep learning scenarios.

</details>

## Exponential Decay

<details>
  <summary>Theory</summary>
  
**Learning Rate Annealing with Exponential Decay** is a technique used in training machine learning models, particularly neural networks, to adapt the learning rate during the training process. It involves reducing the learning rate exponentially over time. This approach allows for a rapid reduction in the learning rate in the early stages of training and then slows down the rate of reduction as training progresses. Let's explore the mathematics and usage of exponential decay learning rate annealing:

### Mathematics of Exponential Decay:

1. **Initialization**: Start with an initial learning rate, denoted as $LR_0$, which is typically set based on prior knowledge, experimentation, or hyperparameter tuning.

2. **Annealing Schedule**: Define an exponential decay schedule that specifies how the learning rate should change over time. In exponential decay, the learning rate at each step or epoch is calculated as follows:

   $$ LR_{\text{new}} = LR_{\text{old}} \times \gamma $$

   Where:
   - $LR_{\text{new}}$ is the updated learning rate.
   - $LR_{\text{old}}$ is the current learning rate.
   - $\gamma$ is the decay factor, which is typically a value between 0 and 1.

   Because the factor is applied at every step or epoch (rather than only at intervals, as in step decay), the learning rate after $t$ updates is $LR_t = LR_0 \cdot \gamma^t$.

3. **Usage**:
   - Initially, the model uses the initial learning rate $LR_0$ for training. This higher learning rate allows the model to make large parameter updates, which can speed up the initial stages of training.
   - At each step or epoch, you apply the annealing schedule, reducing the learning rate by the factor $\gamma$.
   - The process continues, with the learning rate decreasing exponentially at each step, until the training process is completed.

### Benefits and Usage:

- **Rapid Early Progress**: The learning rate is largest during the first epochs, so the model can move quickly toward a reasonable solution before the rate has shrunk.

- **Fine-Tuning**: As training progresses, the learning rate reduction rate slows down, which is useful for fine-tuning the model's parameters.

- **Stability**: The gradual reduction of the learning rate can help stabilize the optimization process and prevent overshooting or oscillations in the loss landscape.

- **Generalization**: Exponential decay can contribute to better generalization, as it allows the model to adapt its parameters more carefully as training proceeds.

- **Hyperparameter Tuning**: You can experiment with different decay factors ($\gamma$) to find the values that work best for your specific problem.

Exponential decay learning rate annealing is a powerful tool for training deep neural networks and is widely used in practice. It provides a balance between rapid early convergence and fine-tuning the model for better generalization. By adjusting the decay factor and the initial learning rate, you can tailor this technique to your specific problem and achieve improved training outcomes.
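The shape of the schedule is easy to inspect numerically. The sketch below evaluates the learning rate as the initial rate times the decay factor raised to the epoch number, and measures how much it drops from one epoch to the next; the function name and the values 0.1 and 0.9 are assumptions for illustration.

```python
def exponential_decay_lr(initial_lr, gamma, epoch):
    """Learning rate after `epoch` multiplicative decays: initial_lr * gamma ** epoch."""
    return initial_lr * gamma ** epoch

# Hypothetical settings: start at 0.1 and multiply by 0.9 every epoch.
lrs = [exponential_decay_lr(0.1, 0.9, t) for t in range(51)]

# Absolute decrease between consecutive epochs.
drops = [lrs[t] - lrs[t + 1] for t in range(50)]
print(f"lr at epochs 0, 10, 50: {lrs[0]:.4f}, {lrs[10]:.4f}, {lrs[50]:.6f}")
print(f"first drop: {drops[0]:.5f}, last drop: {drops[-1]:.7f}")
```

The ratio between consecutive rates is constant, but the absolute decrease shrinks over time, which is why the schedule changes quickly at first and slowly later.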

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

Here's a PyTorch code example demonstrating exponential decay learning rate annealing during the training of a simple linear regression model:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR
# Seed PyTorch's RNG (the data below are generated with torch, not NumPy)
torch.manual_seed(42)

# Generate synthetic data
X = torch.rand(100, 1)
y = 3 * X + 2 + 0.1 * torch.randn(100, 1)

# Define a simple linear regression model
class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Create the model
model = LinearRegression()

# Set initial learning rate and decay factor
initial_lr = 0.1
decay_factor = 0.9

# Define the loss function (mean squared error) and the optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=initial_lr)

# Exponential decay: multiply the learning rate by decay_factor after every epoch
lr_scheduler = ExponentialLR(optimizer, gamma=decay_factor)

# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    
    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Learning rate that was used for this epoch's update
    current_lr = optimizer.param_groups[0]["lr"]
    
    # Advance the schedule after the optimizer step (the order PyTorch expects)
    lr_scheduler.step()
    
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item():.4f}, Learning Rate: {current_lr}')

# Inspect the learned parameters. The data were generated with weight 3 and bias 2;
# with only 50 epochs and a decaying rate the fit may still be some way off.
print('Learned parameters:', model.linear.weight.item(), model.linear.bias.item())
```

In this PyTorch code example:

- We generate synthetic data points (X, y) for a linear regression task.

- We define a simple linear regression model using PyTorch's `nn.Module`.

- We use the mean squared error (MSE) as the loss function and stochastic gradient descent (SGD) as the optimizer.

- We implement exponential decay learning rate annealing using the `ExponentialLR` scheduler.

- The learning rate is multiplied by 0.9 after each epoch, following the exponential decay schedule.

- The training process continues for a specified number of epochs, and we monitor the final learned parameters.

Exponential decay learning rate annealing helps the model to converge rapidly in the early stages and fine-tune its parameters as training progresses. It's a useful technique in various machine learning and deep learning tasks.

</details>

## 1/t Decay

<details>
  <summary>Theory</summary>

**Learning Rate Annealing with 1/t Decay** is a technique used in training machine learning models, particularly neural networks, to adjust the learning rate during training. This approach decreases the learning rate over time, following an inverse time (1/t) decay schedule. It allows for a slower reduction in the learning rate, which can be beneficial for fine-tuning the model as training progresses. Let's explore the mathematics and usage of 1/t decay learning rate annealing:

### Mathematics of 1/t Decay:

1. **Initialization**: Start with an initial learning rate, denoted as $LR_0$, typically chosen based on prior knowledge, experimentation, or hyperparameter tuning.

2. **Annealing Schedule**: Define a 1/t decay schedule that specifies how the learning rate should change over time. In 1/t decay, the learning rate at each step or epoch is calculated as follows:

   $$ LR_{\text{new}} = \frac{LR_0}{1 + \alpha \cdot t} $$

   Where:
   - $LR_{\text{new}}$ is the updated learning rate.
   - $LR_0$ is the initial learning rate.
   - $\alpha$ is a hyperparameter that controls the rate of decay (distinct from the learning rate $\alpha$ used in the fixed learning rate section).
   - $t$ is the current training step or epoch.

3. **Usage**:
   - Initially, the model uses the initial learning rate $LR_0$ for training. This higher learning rate allows the model to make large parameter updates, which can speed up the initial stages of training.
   - As training progresses, the learning rate decreases slowly following the 1/t decay schedule. The learning rate continues to decrease over time, allowing the model to fine-tune its parameters.

### Benefits and Usage:

- **Fine-Tuning**: 1/t decay provides a slow and gradual reduction in the learning rate, which can be beneficial for fine-tuning the model's parameters as training progresses.

- **Stability**: The gradual reduction in the learning rate can help stabilize the optimization process and prevent overshooting or oscillations in the loss landscape.

- **Generalization**: Slower learning rate decay can contribute to better generalization, as it allows the model to adapt its parameters more carefully as training proceeds.

- **Hyperparameter Tuning**: You can experiment with different values of the hyperparameter $\alpha$ to control the decay rate and find the values that work best for your specific problem.

1/t decay learning rate annealing is a valuable tool for training deep neural networks. It balances rapid initial convergence and fine-tuning for better generalization. By adjusting the hyperparameter $\alpha$ and the initial learning rate, you can customize this technique to your specific problem and achieve improved training results.
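A few values of the schedule show how slowly it decays. The sketch below evaluates the 1/t formula directly; the function name and the values 0.1 and 0.01 are assumptions chosen for illustration.

```python
def inverse_time_decay_lr(initial_lr, alpha, t):
    """Learning rate at step t under 1/t decay: initial_lr / (1 + alpha * t)."""
    return initial_lr / (1.0 + alpha * t)

# Hypothetical settings: initial rate 0.1, decay hyperparameter 0.01.
steps_to_show = (0, 50, 100, 300)
values = [inverse_time_decay_lr(0.1, 0.01, t) for t in steps_to_show]
for t, lr in zip(steps_to_show, values):
    print(f"t = {t:3d}: lr = {lr:.4f}")
```

With these settings the rate is halved only at $t = 1/\alpha = 100$ and needs another 200 steps to halve again, so each successive halving takes longer than the previous one.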

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

Here's a PyTorch code example demonstrating 1/t decay learning rate annealing during the training of a simple linear regression model:

```python
import torch
import torch.nn as nn
import torch.optim as optim
# Seed PyTorch's RNG (the data below are generated with torch, not NumPy)
torch.manual_seed(42)

# Generate synthetic data
X = torch.rand(100, 1)
y = 3 * X + 2 + 0.1 * torch.randn(100, 1)

# Define a simple linear regression model
class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, x):
        return self.linear(x)

# Create the model
model = LinearRegression()

# Define the loss function (mean squared error) and the optimizer
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# 1/t decay learning rate schedule
def one_over_t_decay_schedule(initial_lr, alpha, t):
    return initial_lr / (1 + alpha * t)

# Set initial learning rate and hyperparameter alpha
initial_lr = 0.1
alpha = 0.01

# Training loop
num_epochs = 50
for epoch in range(num_epochs):
    # Adjust the learning rate according to the schedule
    current_lr = one_over_t_decay_schedule(initial_lr, alpha, epoch)
    
    for param_group in optimizer.param_groups:
        param_group['lr'] = current_lr
    
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    
    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch + 1}/{num_epochs}], Loss: {loss.item()}, Learning Rate: {current_lr}')

# Inspect the learned parameters. The data were generated with weight 3 and bias 2;
# with only 50 epochs and a decaying rate the fit may still be some way off.
print('Learned parameters:', model.linear.weight.item(), model.linear.bias.item())
```

In this PyTorch code example:

- We generate synthetic data points (X, y) for a linear regression task.

- We define a simple linear regression model using PyTorch's `nn.Module`.

- We use the mean squared error (MSE) as the loss function and stochastic gradient descent (SGD) as the optimizer.

- We implement 1/t decay learning rate annealing using a custom function `one_over_t_decay_schedule`. The learning rate decreases over time according to the 1/t decay schedule.

- During the training loop, we adjust the learning rate for each epoch using the calculated value from the 1/t decay schedule.

- The training process continues for a specified number of epochs, and we monitor the final learned parameters.

1/t decay learning rate annealing provides a slow and gradual reduction in the learning rate, which can be beneficial for fine-tuning the model's parameters as training progresses. This technique is useful in various machine learning and deep learning tasks.
</details>

# 4 Adaptive Learning Rate

## Adagrad

<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>

## RMSprop

<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>

## Adam (Adaptive Moment Estimation)

<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>

# 5 Cyclic Learning Rates

## Cyclic learning rate policies



<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>

## learning rate finder

<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>

# 6 Warm-up Schedules

<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>

# 7 Learning Rate Schedulers

## StepLR

<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>

## ReduceLROnPlateau

<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>

## CosineAnnealing schedulers

<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>

# 8 One-cycle Learning Rates

<details>
  <summary>Theory</summary>

</details>

<details>
  <summary>Numerical Example</summary>

</details>

<details>
  <summary>code</summary>

</details>