This notebook benchmarks the adaptive gradient solver and the exponential Euler solver to compare their training times.

In [1]:
from statistics import mean
import time

from tabulate import tabulate
import torch
import torchvision as tv

import exponential_euler_solver as euler

N_TRAIN = [64, 256]
N_CLASSES = [2, 4, 10]
N_SEEDS = 1

We use CIFAR-10 as our training data, with each 32 × 32 × 3 image flattened into a 3072-dimensional vector. We vary both the number of classes and the size of the training subset.

In [2]:
all_ds = tv.datasets.CIFAR10('.', download=True)
all_images = torch.from_numpy(all_ds.data).float().flatten(1).cuda() / 255
all_labels = torch.tensor(all_ds.targets).cuda()

def build_dataset(n_train: int, n_classes: int, seed: int = 0) -> torch.utils.data.TensorDataset:
    torch.manual_seed(seed)
    mask = (all_labels < n_classes)
    images, labels = all_images[mask], all_labels[mask]
    idxs = torch.randperm(len(labels), device='cuda')[:n_train]
    return torch.utils.data.TensorDataset(images[idxs], labels[idxs])

Files already downloaded and verified


Our network is a 6-layer multilayer perceptron with 512 units in each hidden layer. We use the exponential linear unit (ELU) as the activation function because, unlike ReLU, it is continuously differentiable, which avoids problems due to non-smoothness.

In [3]:
def build_network(n_classes: int) -> torch.nn.Sequential:
    net = [torch.nn.Linear(3072, 512), torch.nn.ELU()]
    for _ in range(4):
        net += [torch.nn.Linear(512, 512), torch.nn.ELU()]
    net += [torch.nn.Linear(512, n_classes)]
    return torch.nn.Sequential(*net)
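As a quick check of the architecture, the arithmetic below counts the weights and biases in the six `Linear` layers built above. `count_parameters` is a hypothetical helper written for this illustration; it is not part of the notebook or of `exponential_euler_solver`.

```python
def count_parameters(n_classes, n_inputs=3072, width=512, n_hidden_blocks=4):
    """Count weights and biases of the MLP: 3072 -> 512 (x5) -> n_classes."""
    # Layer sizes: the input, five 512-unit layers (one input layer plus
    # four hidden blocks), and the output layer.
    sizes = [n_inputs] + [width] * (n_hidden_blocks + 1) + [n_classes]
    # Each Linear layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


for n_classes in (2, 4, 10):
    print(n_classes, count_parameters(n_classes))
```

The network has roughly 2.6 million parameters in every configuration, since only the final layer depends on the number of classes.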

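Now let's train the networks. Each run uses full-batch training and stops once the loss falls to 0.01. We use learning rate warmup with the adaptive gradient solver (10 warmup iterations with a warmup factor of 0.01), which we found necessary to ensure that the loss decreases monotonically.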
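For readers unfamiliar with warmup, here is a minimal sketch of one common form, a linear ramp. This is an assumption for illustration only: the schedule that `euler.AdaptiveGradientDescent` actually applies is not shown in this notebook and may differ. `warmup_lr` and `base_lr` are hypothetical names, and the default values simply mirror the constants passed in the next cell.

```python
def warmup_lr(step, base_lr=0.01, warmup_iters=10, warmup_factor=0.01):
    """Assumed linear warmup: ramp from warmup_factor * base_lr up to base_lr."""
    if step >= warmup_iters:
        # Warmup is over: use the full learning rate.
        return base_lr
    # Fraction of the warmup period completed so far, in [0, 1).
    progress = step / warmup_iters
    return base_lr * (warmup_factor + (1.0 - warmup_factor) * progress)


for step in (0, 5, 10, 20):
    print(step, warmup_lr(step))
```

Under this assumed schedule the first steps are taken with a much smaller learning rate, which makes an early increase in the loss less likely.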

In [4]:
table = []
criterion = torch.nn.CrossEntropyLoss()

for n_train in N_TRAIN:
    for n_classes in N_CLASSES:
        adapt_times, euler_times = [], []
        for seed in range(N_SEEDS):
            ds = build_dataset(n_train, n_classes, seed)
            net = build_network(n_classes)
            loss_func = euler.LossFunction(ds, criterion, net, batch_size=len(ds))
            torch.manual_seed(seed)
            params = loss_func.initialize_parameters()
            solver = euler.ExponentialEulerSolver(
                params, loss_func, 0.01, n_classes - 1
            )
            loss = float('inf')

            # Time training until the loss falls to 0.01, asserting that
            # the loss decreases on every step.
            start_time = time.perf_counter()
            while loss > 0.01:
                new_loss = solver.step().loss
                assert new_loss < loss
                loss = new_loss
            euler_times.append(time.perf_counter() - start_time)

            # Re-seed before re-initializing the parameters for the
            # adaptive gradient solver.
            torch.manual_seed(seed)
            params = loss_func.initialize_parameters()
            solver = euler.AdaptiveGradientDescent(
                params, loss_func, 0.01, warmup_iters=10, warmup_factor=0.01,
            )
            loss = float('inf')

            start_time = time.perf_counter()
            while loss > 0.01:
                new_loss = solver.step().loss
                assert new_loss < loss
                loss = new_loss
            adapt_times.append(time.perf_counter() - start_time)

        table.append([n_train, n_classes, mean(euler_times), mean(adapt_times)])

headers = ['# Train', '# Classes', 'Exponential Euler Time', 'Adaptive Gradient Time']
print(tabulate(table, headers=headers))

  # Train    # Classes    Exponential Euler Time    Adaptive Gradient Time
---------  -----------  ------------------------  ------------------------
       64            2                   4.38303                   11.4608
       64            4                  14.0392                    13.9129
       64           10                  70.4531                    12.2033
      256            2                  10.3577                    68.1163
      256            4                  33.0347                    88.1126
      256           10                 182.06                      81.9457


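This table was generated on a LambdaLabs A100 instance. Times are wall-clock seconds measured with `time.perf_counter`, using a single seed per configuration (`N_SEEDS = 1`).

The exponential Euler solver is faster when the number of classes is small. With 2 classes it wins clearly at both dataset sizes; with 4 classes it roughly ties on 64 examples and wins on 256; with 10 classes the adaptive gradient solver is faster at both sizes. The exponential Euler solver also scales better with dataset size: going from 64 to 256 training examples increases its time by a factor of roughly 2.4 to 2.6, compared with roughly 6 to 7 for the adaptive gradient solver. If we kept increasing the dataset size, we would therefore expect the exponential Euler solver to eventually beat the adaptive gradient solver even with 10 classes.

The dependence on the number of classes arises because the exponential Euler solver needs to calculate as many eigenvalue-eigenvector pairs as the network has outputs, while the adaptive gradient solver only needs to calculate one. As a result, each adaptive gradient step is cheaper once the network has more than a handful of classes, but each exponential Euler step can go further, so the solver needs fewer of them.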
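The scaling comparison can be checked directly from the table. The sketch below recomputes, for each number of classes, how much each solver's time grows when moving from 64 to 256 training examples. The timings are copied from the printed table; `timings` and `scaling_factors` are illustrative names, not part of the notebook.

```python
# Timings in seconds, copied from the table above:
# (n_train, n_classes) -> (exponential Euler time, adaptive gradient time)
timings = {
    (64, 2): (4.38303, 11.4608),
    (64, 4): (14.0392, 13.9129),
    (64, 10): (70.4531, 12.2033),
    (256, 2): (10.3577, 68.1163),
    (256, 4): (33.0347, 88.1126),
    (256, 10): (182.06, 81.9457),
}


def scaling_factors(timings, small=64, large=256):
    """Return {n_classes: (euler_factor, adaptive_factor)} for small -> large."""
    factors = {}
    for n_train, n_classes in timings:
        if n_train != small:
            continue
        euler_small, adapt_small = timings[(small, n_classes)]
        euler_large, adapt_large = timings[(large, n_classes)]
        factors[n_classes] = (euler_large / euler_small,
                              adapt_large / adapt_small)
    return factors


for n_classes, (euler_f, adapt_f) in sorted(scaling_factors(timings).items()):
    print(f"{n_classes:>2} classes: Euler x{euler_f:.2f}, adaptive x{adapt_f:.2f}")
```

With only one seed per configuration, these ratios are best read as rough indications rather than precise measurements.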