In [None]:
#r "nuget: TorchSharp-cpu"

using TorchSharp;
using static TorchSharp.torch;
using static TorchSharp.TensorExtensionMethods;
using static TorchSharp.torch.distributions;

using Microsoft.DotNet.Interactive.Formatting;
Formatter.SetPreferredMimeTypesFor(typeof(torch.Tensor), "text/plain");
Formatter.Register<torch.Tensor>((torch.Tensor x) => x.ToString(true));

# Training with a Learning Rate Scheduler

In Tutorial 6, we saw that the optimizers take an argument called the 'learning rate,' but we didn't spend much time on it except to say that it can have a great impact on how quickly training converges toward a solution. In fact, you can choose the learning rate (LR) so poorly that training doesn't converge at all.

If the LR is too small, training will go very slowly, wasting compute resources. If it is too large, training could result in numeric overflow, or NaNs. Either way, you're in trouble.
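To make both failure modes concrete, here is a minimal sketch that uses plain C# rather than TorchSharp. The toy problem (minimizing f(x) = x² by gradient descent from x = 1) and the helper name `Descend` are assumptions made purely for illustration; they are not part of the tutorial's model.

```csharp
using System;

// Plain gradient descent on f(x) = x^2, whose gradient is 2x, starting from x = 1.
// 'Descend' is a hypothetical helper for this illustration, not a TorchSharp API.
double Descend(double lr, int steps)
{
    double x = 1.0;
    for (int i = 0; i < steps; i++)
    {
        x -= lr * 2.0 * x;   // x <- x - lr * gradient
    }
    return x;
}

// Too small: after 100 steps x is still roughly 0.82, far from the minimum at 0.
Console.WriteLine($"lr = 0.001: x = {Descend(0.001, 100)}");

// Reasonable: after 100 steps x is about 2e-10, essentially at the minimum.
Console.WriteLine($"lr = 0.1:   x = {Descend(0.1, 100)}");

// Too large: every step multiplies x by -2, so it overflows to infinity and ends as NaN.
Console.WriteLine($"lr = 1.5:   x = {Descend(1.5, 2000)}");
```

A real network's loss surface is far less regular than this parabola, so treat the specific thresholds as properties of the toy problem only.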

To further complicate matters, it turns out that the learning rate shouldn't necessarily be constant. Training can go much better if the learning rate starts out relatively large and gets smaller as you get closer to the end.

There's a solution for this, called a Learning Rate Scheduler. An LRS instance has access to the internal state of the optimizer, and can modify the LR as it goes along. 

There are several algorithms for scheduling, but as of this writing TorchSharp implements only the two conceptually simplest: StepLR and ExponentialLR. In this tutorial, we will only cover StepLR.

Before demonstrating, let's have a model and a baseline training loop.

In [None]:
private class Trivial : nn.Module
{
    public Trivial()
        : base(nameof(Trivial))
    {
        RegisterComponents();
    }

    public override Tensor forward(Tensor input)
    {
        using var x = lin1.forward(input);
        using var y = nn.functional.relu(x);
        return lin2.forward(y);
    }

    private nn.Module lin1 = nn.Linear(1000, 100);
    private nn.Module lin2 = nn.Linear(100, 10);
}

To demonstrate how to correctly use an LR scheduler, our training data needs to look more like real training data, that is, it needs to be divided into batches.

In [None]:
var learning_rate = 0.01f;
var model = new Trivial();
var loss = nn.functional.mse_loss();

var data = Enumerable.Range(0,16).Select(_ => rand(32,1000)).ToList<torch.Tensor>();  // Our pretend input data
var results = Enumerable.Range(0,16).Select(_ => rand(32,10)).ToList<torch.Tensor>();  // Our pretend ground truth.

var optimizer = torch.optim.SGD(model.parameters(), learning_rate);

for (int i = 0; i < 300; i++) {

    for (int idx = 0; idx < data.Count; idx++) {
        // Compute the loss
        using var output = loss(model.forward(data[idx]), results[idx]);

        // Clear the gradients before doing the back-propagation
        model.zero_grad();

        // Do back-propagation, which computes all the gradients.
        output.backward();

        optimizer.step();
    }
}

loss(model.forward(data[0]), results[0]).item<float>()

When I ran this, the loss was down to 0.095 after 1 second. (It took longer the first time around.)

## StepLR

StepLR multiplies the learning rate by a constant factor every fixed number of epochs (in the code below, by 0.95 every 25 epochs). The difference it makes to the training loop is that you wrap the optimizer in a scheduler, and then call `step` on the scheduler (once per epoch) as well as on the optimizer (once per batch).
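As a rough check on what to expect from a step schedule, the sketch below works through the arithmetic in plain C#. It assumes PyTorch-style StepLR semantics (the LR is multiplied by the decay factor once for every `stepSize` completed epochs) and uses plain counters in place of the real optimizer and scheduler. The helper `StepLrAt` is hypothetical, not a TorchSharp API.

```csharp
using System;

// Closed-form step schedule (assumed PyTorch-style semantics):
// the LR is multiplied by gamma once for every stepSize completed epochs.
double StepLrAt(double initialLr, int stepSize, double gamma, int epoch)
    => initialLr * Math.Pow(gamma, epoch / stepSize);   // integer division rounds down

// Simulate the call cadence of the training loop with plain counters.
double lr = 0.01;
int optimizerSteps = 0;
for (int epoch = 1; epoch <= 300; epoch++)
{
    for (int batch = 0; batch < 16; batch++)
    {
        optimizerSteps++;                 // stands in for optimizer.step(), once per batch
    }
    if (epoch % 25 == 0) lr *= 0.95;      // stands in for scheduler.step(), once per epoch
}

Console.WriteLine($"optimizer steps:        {optimizerSteps}");   // 300 epochs x 16 batches
Console.WriteLine($"final LR (simulated):   {lr:F6}");
Console.WriteLine($"final LR (closed form): {StepLrAt(0.01, 25, 0.95, 300):F6}");
```

With a decay factor of 0.95 applied 12 times over 300 epochs, the LR ends at a little over half its starting value, which is a fairly gentle schedule.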

In [None]:
var learning_rate = 0.01f;
var model = new Trivial();
var loss = nn.functional.mse_loss();

var data = Enumerable.Range(0,16).Select(_ => rand(32,1000)).ToList<torch.Tensor>();  // Our pretend input data
var results = Enumerable.Range(0,16).Select(_ => rand(32,10)).ToList<torch.Tensor>();  // Our pretend ground truth.

var optimizer = torch.optim.SGD(model.parameters(), learning_rate);
var scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 25, 0.95, verbose: true);  // verbose: print the LR whenever it is adjusted

for (int i = 0; i < 300; i++) {

    for (int idx = 0; idx < data.Count; idx++) {
        // Compute the loss
        using var output = loss(model.forward(data[idx]), results[idx]);

        // Clear the gradients before doing the back-propagation
        model.zero_grad();

        // Do back-propagation, which computes all the gradients.
        output.backward();

        optimizer.step();
    }

    scheduler.step();
}

loss(model.forward(data[0]), results[0]).item<float>()

Well, that was underwhelming. The loss (in my case) went up a bit, so that's nothing to get excited about. For this trivial model, using a scheduler isn't going to make a huge difference, and it may not make much of a difference even for complex models. It's very hard to know until you try it, but now you know how to try it out. If you try this trivial example over and over, you will see that the results vary quite a bit. It's simply too simple.

Regardless, you can see from the verbose output that the learning rate is adjusted as the epochs proceed. 

Note: If you're using TorchSharp 0.93.9 and you see odd dips in the learning rate, that's a bug in the verbose printout logic, not in the learning rate scheduler itself.