# Learning Rate Adjustment Strategies

In deep learning, the learning rate is a crucial hyperparameter that controls how far the model weights move in response to the estimated error at each update. Too high a value can make training diverge, while too low a value makes convergence slow, so adjusting the learning rate well can significantly improve both model performance and convergence speed. This post introduces several learning rate adjustment strategies (fixed learning rate, step decay, exponential decay, cosine annealing, cyclical learning rate, adaptive learning rate, learning rate warm-up, and the learning rate finder) along with their application scenarios.

## 1. Fixed Learning Rate

### Description
A fixed learning rate means keeping the learning rate constant throughout the training process. This is the simplest strategy and can be effective for simpler problems or when computational resources are limited.

### Application
- Suitable for small datasets or shallow networks.
- Used when training time is limited, and a quick solution is needed.
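
As a minimal sketch of what a fixed learning rate means, the example below runs plain gradient descent on the toy function f(w) = (w - 3)^2 with the same step size at every update. The function `gradient_descent` and the toy objective are illustrative assumptions, not part of any library.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimize f(w) = (w - 3)^2 using a constant learning rate."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= lr * grad          # the step size never changes
    return w


w_good = gradient_descent(lr=0.1)  # moves steadily towards the minimum at w = 3
w_bad = gradient_descent(lr=1.1)   # step too large: the iterates move away from 3

print(w_good, w_bad)
```

With a fixed rate, the only tuning knob is the value itself, which is why the choice matters so much in this setting.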

## 2. Step Decay

### Description
Step decay multiplies the learning rate by a fixed factor (`gamma`) every fixed number of epochs (`step_size`). This keeps the learning rate high early in training and lowers it in discrete steps as training progresses.

### Application
- Effective in scenarios where a significant drop in the learning rate is required after certain epochs.
- Commonly used in computer vision tasks.

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# `model`, `criterion`, and `train_loader` are assumed to be defined elsewhere.
optimizer = optim.Adam(model.parameters(), lr=0.1)

# Define step decay scheduler: halve the learning rate every 10 epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Update the learning rate once per epoch
    scheduler.step()
```
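
The schedule that step decay produces can also be written in closed form: the learning rate at a given epoch is the initial rate times `gamma` raised to the number of completed intervals. The helper below is a hypothetical pure-Python sketch that reuses the values from the example above (`base_lr=0.1`, `step_size=10`, `gamma=0.5`).

```python
def step_decay_lr(epoch, base_lr=0.1, step_size=10, gamma=0.5):
    """Learning rate at `epoch`: base_lr * gamma ** (number of completed intervals)."""
    return base_lr * gamma ** (epoch // step_size)


# Inspect the schedule around the first few drops
for epoch in (0, 9, 10, 19, 20, 30):
    print(epoch, step_decay_lr(epoch))
```

Printing a few values like this is a quick way to check that `step_size` and `gamma` give the drops you intend before starting a long run.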

## 3. Exponential Decay

### Description
Exponential decay multiplies the learning rate by a constant factor `gamma` after every epoch, so the rate after `t` epochs is `lr_0 * gamma^t`. The reduction is smoother than with step decay.

### Application
- Useful for large-scale models where a gradual reduction in learning rate can stabilize training.

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

# `model`, `criterion`, and `train_loader` are assumed to be defined elsewhere.
optimizer = optim.Adam(model.parameters(), lr=0.1)

# Define exponential decay scheduler: multiply the learning rate by 0.9 every epoch
scheduler = ExponentialLR(optimizer, gamma=0.9)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Update the learning rate once per epoch
    scheduler.step()
```
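
Because the decay compounds every epoch, the choice of `gamma` has a large effect over a long run. The sketch below is plain Python with hypothetical helper names. It evaluates the closed form `base_lr * gamma ** epoch` and shows one way to derive `gamma` from a desired final learning rate.

```python
def exponential_lr(epoch, base_lr=0.1, gamma=0.9):
    """Learning rate after `epoch` multiplicative decays."""
    return base_lr * gamma ** epoch


def gamma_for_target(base_lr, final_lr, num_epochs):
    """Gamma that takes base_lr to final_lr in exactly num_epochs decays."""
    return (final_lr / base_lr) ** (1.0 / num_epochs)


# With gamma=0.9 the rate is already tiny after 100 epochs
print(exponential_lr(100))

# Gamma needed to go from 0.1 to 0.001 over 100 epochs
print(gamma_for_target(0.1, 0.001, 100))
```

If the first printed value is smaller than you want at the end of training, a `gamma` closer to 1 is the usual remedy.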

## 4. Cosine Annealing

### Description
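Cosine annealing lowers the learning rate from its initial value to a minimum (`eta_min`) along half a cosine curve, so the decay is slow at the start, fastest in the middle, and slow again near the end. On its own the schedule only decreases; the potential to escape poor local minima comes from warm restarts, which periodically reset the learning rate to its initial value.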

### Application
- Effective in training neural networks for image and natural language processing tasks.
- Often combined with warm restarts (SGDR), available in PyTorch as `CosineAnnealingWarmRestarts`.


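```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

# `model`, `criterion`, and `train_loader` are assumed to be defined elsewhere.
optimizer = optim.Adam(model.parameters(), lr=0.1)

num_epochs = 100

# Define cosine annealing scheduler.
# T_max is the number of epochs over which the learning rate anneals to eta_min;
# setting it to num_epochs gives a single decay over the whole run.
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=0)

# Training loop
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Update the learning rate once per epoch
    scheduler.step()
```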
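
The cosine schedule has a simple closed form, sketched below in plain Python. The first helper assumes a single decay over `t_max` epochs. The second illustrates warm restarts with a fixed cycle length `t_0` by restarting the same curve every `t_0` epochs. Both function names and the default values are illustrative assumptions.

```python
import math


def cosine_annealing_lr(epoch, base_lr=0.1, eta_min=0.0, t_max=100):
    """Half-cosine decay from base_lr (epoch 0) to eta_min (epoch t_max)."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


def cosine_warm_restarts_lr(epoch, base_lr=0.1, eta_min=0.0, t_0=10):
    """Same curve, but restarted from base_lr every t_0 epochs."""
    return cosine_annealing_lr(epoch % t_0, base_lr, eta_min, t_0)


print([round(cosine_annealing_lr(e), 4) for e in (0, 25, 50, 75, 100)])
print([round(cosine_warm_restarts_lr(e), 4) for e in (0, 5, 9, 10, 15)])
```

In the second list the rate jumps back up at epoch 10, which is the "restart" that the plain schedule lacks.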

## 5. Cyclical Learning Rate

### Description
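A cyclical learning rate (CLR) oscillates between a lower and an upper bound instead of decreasing monotonically. The periodic increases can help the optimizer move out of poor local minima and past saddle points. The schedule is usually updated after every batch rather than after every epoch.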

### Application
- Beneficial in training neural networks where the landscape of the loss function is complex.


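```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CyclicLR

# `model`, `criterion`, and `train_loader` are assumed to be defined elsewhere.
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Define cyclical learning rate scheduler.
# step_size_up is counted in batches (iterations), not epochs.
# cycle_momentum=False because Adam has no `momentum` parameter to cycle;
# some PyTorch versions raise a ValueError if it is left at its default with Adam.
scheduler = CyclicLR(
    optimizer,
    base_lr=0.001,
    max_lr=0.006,
    step_size_up=2000,
    mode='triangular',
    cycle_momentum=False,
)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # Update the learning rate after every batch
        scheduler.step()
```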
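
The triangular policy used above can be computed directly from the iteration count. The sketch below is plain Python with a hypothetical function name. It follows the usual triangular formula with equal-length rising and falling halves, and reuses the bounds from the example (`base_lr=0.001`, `max_lr=0.006`, `step_size_up=2000`).

```python
import math


def triangular_clr(iteration, base_lr=0.001, max_lr=0.006, step_size_up=2000):
    """Triangular cyclical learning rate at a given batch iteration."""
    # Index of the current cycle (1-based); one cycle is 2 * step_size_up iterations
    cycle = math.floor(1 + iteration / (2 * step_size_up))
    # Distance from the peak of the current cycle, scaled to [0, 1]
    x = abs(iteration / step_size_up - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


for it in (0, 1000, 2000, 3000, 4000, 6000):
    print(it, round(triangular_clr(it), 6))
```

Since the cycle length is measured in batches, a common sanity check is to compare `2 * step_size_up` against the number of batches per epoch.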

## 6. Adaptive Learning Rate

### Description
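Adaptive optimizers such as AdaGrad, RMSprop, and Adam scale the step size of each parameter individually, using running statistics of past gradients (for example, accumulated or averaged squared gradients). They can be combined with the schedules above, which then control the global base learning rate.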

### Application
- Suitable for complex problems with sparse gradients.
- Common in natural language processing and recommendation systems.

```python
import torch
import torch.optim as optim

# `model`, `criterion`, and `train_loader` are assumed to be defined elsewhere.
# Adam, AdaGrad, and RMSprop are adaptive learning rate optimizers
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 100
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
```
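
To make "adaptive" concrete, here is a minimal AdaGrad-style update in plain Python. It is a sketch for illustration only, not the PyTorch implementation. The function name, the toy gradients, and the `eps` value are assumptions. Each parameter divides the base learning rate by the square root of its own accumulated squared gradients.

```python
import math


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-10):
    """One AdaGrad-style update; returns new parameters and accumulators."""
    new_params, new_accum = [], []
    for p, g, a in zip(params, grads, accum):
        a = a + g * g  # running sum of squared gradients for this parameter
        new_params.append(p - lr * g / (math.sqrt(a) + eps))
        new_accum.append(a)
    return new_params, new_accum


params = [1.0, 1.0]
accum = [0.0, 0.0]
grads = [10.0, 0.1]  # one large and one small gradient

params, accum = adagrad_step(params, grads, accum)

# Per-parameter effective learning rate after the first step
effective_lrs = [0.1 / (math.sqrt(a) + 1e-10) for a in accum]
print(params, effective_lrs)
```

The parameter with the small gradient ends up with a much larger effective learning rate, which is why this family of optimizers is often a good fit for sparse gradients.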

## 7. Learning Rate Warm-Up

### Description
Learning rate warm-up gradually increases the learning rate from a small value to the desired value. This can help stabilize the training process, especially in the initial phase.

### Application
- Often used in conjunction with other learning rate schedules like step decay or cosine annealing.
- Effective in training deep networks such as transformers.

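```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

# `model`, `criterion`, and `train_loader` are assumed to be defined elsewhere.
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Defined before lr_lambda, which reads both values
num_epochs = 100
warmup_epochs = 5

# Define learning rate warm-up scheduler.
# lr_lambda returns a multiplier that is applied to the optimizer's base learning rate.
def lr_lambda(epoch):
    if epoch < warmup_epochs:
        # Linear warm-up: the multiplier rises from 1/warmup_epochs to 1
        return (epoch + 1) / warmup_epochs
    # After warm-up: exponential decay of the multiplier from 1 towards 0.1
    return 0.1 ** ((epoch - warmup_epochs) / (num_epochs - warmup_epochs))

scheduler = LambdaLR(optimizer, lr_lambda)

# Training loop
for epoch in range(num_epochs):
    model.train()
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    # Update the learning rate once per epoch
    scheduler.step()
```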
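
Warm-up is often paired with cosine annealing instead of the exponential decay used above. The function below is a hypothetical plain-Python sketch of such a multiplier: linear warm-up for `warmup_epochs`, then a half-cosine decay to zero by `num_epochs`. The name and defaults are assumptions chosen to match the example above.

```python
import math


def warmup_cosine_factor(epoch, warmup_epochs=5, num_epochs=100):
    """Multiplier for the base learning rate: linear warm-up, then cosine decay."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    # Fraction of the post-warm-up phase that has elapsed, in [0, 1]
    progress = (epoch - warmup_epochs) / max(1, num_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


print([round(warmup_cosine_factor(e), 3) for e in (0, 2, 4, 5, 50, 100)])
```

A function with this shape can be passed as the `lr_lambda` argument of `LambdaLR`, in the same way as the function in the example above.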

## 8. Learning Rate Finder

### Description
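The learning rate finder helps choose a good initial learning rate by running a short training sweep in which the learning rate increases exponentially after every batch, and then plotting the loss against the learning rate. A common heuristic is to pick a value from the region where the loss falls fastest, somewhat below the point where it starts to diverge.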

### Application
- Useful for setting an appropriate initial learning rate before full training.
- Can be applied across various domains and models to tune learning rates.

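```python
import copy

import matplotlib.pyplot as plt
import torch
import torch.optim as optim

# `model`, `criterion`, and `train_loader` are assumed to be defined elsewhere.
optimizer = optim.Adam(model.parameters(), lr=1e-7)

# Define learning rate finder
class LRFinder:
    def __init__(self, model, optimizer, criterion, train_loader):
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion
        self.train_loader = train_loader
        self.history = {"lr": [], "loss": []}

    def _set_lr(self, lr):
        # Apply the same learning rate to every parameter group
        for group in self.optimizer.param_groups:
            group['lr'] = lr

    def find(self, start_lr=1e-7, end_lr=10, num_it=100):
        # The sweep trains the model, so save the starting state and restore it afterwards
        model_state = copy.deepcopy(self.model.state_dict())
        optimizer_state = copy.deepcopy(self.optimizer.state_dict())

        self.history = {"lr": [], "loss": []}
        # Multiplier that takes the learning rate from start_lr to end_lr in num_it steps
        lr_mult = (end_lr / start_lr) ** (1 / max(num_it - 1, 1))
        lr = start_lr
        best_loss = float('inf')

        self.model.train()
        for batch_idx, (inputs, labels) in enumerate(self.train_loader):
            if batch_idx >= num_it:
                break

            self._set_lr(lr)
            self.optimizer.zero_grad()
            outputs = self.model(inputs)
            loss = self.criterion(outputs, labels)
            loss.backward()
            self.optimizer.step()

            self.history["lr"].append(lr)
            self.history["loss"].append(loss.item())

            # Track the best loss and stop early once the loss diverges
            if loss.item() < best_loss:
                best_loss = loss.item()
            if loss.item() > 4 * best_loss:
                break

            lr *= lr_mult

        # Undo the weight and optimizer changes made during the sweep
        self.model.load_state_dict(model_state)
        self.optimizer.load_state_dict(optimizer_state)

    def plot(self):
        plt.plot(self.history["lr"], self.history["loss"])
        plt.xscale('log')
        plt.xlabel("Learning Rate")
        plt.ylabel("Loss")
        plt.show()

# Instantiate learning rate finder
lr_finder = LRFinder(model, optimizer, criterion, train_loader)

# Find and plot learning rates
lr_finder.find()
lr_finder.plot()
```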
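
Reading a value off the plot is usually done by eye, but a simple heuristic can be sketched in code: choose the learning rate at which the loss drops fastest per unit of log10(learning rate). The helper `suggest_lr` and the synthetic history below are assumptions for illustration. They are not part of the `LRFinder` class above or of any library.

```python
import math


def suggest_lr(lrs, losses):
    """Return the learning rate at the start of the steepest loss decrease."""
    best_slope, best_lr = 0.0, None
    for i in range(1, len(lrs)):
        d_log_lr = math.log10(lrs[i]) - math.log10(lrs[i - 1])
        slope = (losses[i] - losses[i - 1]) / d_log_lr
        if slope < best_slope:
            # Keep the left end of the segment as a conservative choice
            best_slope, best_lr = slope, lrs[i - 1]
    return best_lr


# Synthetic sweep: flat, then a sharp drop, then divergence
lrs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]
losses = [2.30, 2.28, 2.10, 1.50, 1.40, 5.00]

print(suggest_lr(lrs, losses))
```

On real sweeps the loss curve is noisy, so smoothing the losses before computing slopes is a common refinement. The function returns `None` if the loss never decreases.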