# Experimental validation of SM3

Below, three optimization algorithms are used to train a simple neural network classifier so that they can be compared experimentally. The learning rate for each algorithm was chosen by trying fractional values until the top-1 accuracy reached at least 90%. (A more rigorous comparison would have held the validation set out entirely until the end. However, the method is defended in the source; my concern was the correctness of my implementation.) After choosing the learning rates, I switched to a different seed.

The highlight is that each optimizer achieves comparable success on this test problem despite drastic differences in the number of state parameters it tracks. The network's top-choice classification is correct for at least 95% of the validation images regardless of which optimizer is used.

## Optimizer Comparisons

Below is the code used to compare the optimizers. For each run, a network with identically seeded initial weights is created. The weights are then updated using the chosen optimizer. When the updates are complete, the network is tested on the validation set. From the test, the negative log-likelihood and the top-1 accuracy are tracked. The number of parameters in the state dictionary of each optimizer is also counted.

The negative log-likelihood is computed using `torch.nn.CrossEntropyLoss`.
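As a reminder of what that loss measures, here is a minimal plain-Python sketch of the same quantity: the negative log of the softmax probability assigned to the correct class, averaged over a batch (the default reduction of `torch.nn.CrossEntropyLoss`). The function names `cross_entropy` and `batch_cross_entropy` are illustrative and are not part of the notebook.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of class `target` under softmax(logits)."""
    # Subtract the largest logit before exponentiating for numerical stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    # -log(softmax(logits)[target]) = log_sum_exp - logits[target]
    return log_sum_exp - logits[target]

def batch_cross_entropy(batch_logits, targets):
    """Mean of the per-sample losses over a batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# Uniform logits over 4 classes give a loss of log(4).
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
# A confident, correct prediction gives a loss close to zero.
confident = cross_entropy([10.0, 0.0, 0.0], 0)
batch = batch_cross_entropy([[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0]], [2, 0])
```

A network that outputs uniform scores over the 10 MNIST classes would therefore score log(10) ≈ 2.30 per batch, which gives a baseline for reading the loss values.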

In [1]:
import torch, torch.nn as nn
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from SM3 import SM3

batch_size = 100
training_set = MNIST(root=r'D:\Code\data', train=True, download=False, transform=ToTensor())  # raw string avoids invalid escape sequences
training_loader = torch.utils.data.DataLoader(dataset=training_set, batch_size=batch_size, shuffle=False)
testing_set = MNIST(root=r'D:\Code\data', train=False, download=False, transform=ToTensor())  # raw string avoids invalid escape sequences
testing_loader = torch.utils.data.DataLoader(dataset=testing_set, batch_size=batch_size, shuffle=True)

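repeats = 10  # number of passes over the training set
epochs = len(training_loader) * repeats  # total number of optimizer steps (batches), despite the name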

device = 'cuda' if torch.cuda.is_available() else 'cpu'

loss_fn = nn.CrossEntropyLoss()

In [2]:
def make_network():
    # Creates an example network using the same seed.
    torch.manual_seed(64)

    net = nn.Sequential(
        nn.Flatten(),
        nn.Linear(28**2, 512),
        nn.ReLU(),
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Linear(256, 10)
    )
    return net

def test_optimizer(optim_fn, params, lr_lambda=None):
    net = make_network()
    net.to(device)
    opt = optim_fn(net.parameters(), **params)
    if callable(lr_lambda):
        scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    else:
        scheduler = None
    train_net(epochs, net, opt, scheduler)
    loss, correct = test_net(net)
    count = count_param_size(opt)
    return {'loss': loss, 'correct': correct, 'count': count}

In [3]:
def train_net(epochs, net, opt, scheduler=None):
    i = 0
    for _ in range(repeats):
        for batch_nb, batch in enumerate(training_loader):
            opt.zero_grad()

            images = batch[0].to(device)
            targets = batch[1].to(device)

            label = net(images)

            # We are minimizing cross entropy loss
            loss = loss_fn(label, targets)

            loss.backward()
            opt.step()
            if scheduler is not None:
                scheduler.step()

            i += 1
            # Report progress about 1000 times; max() avoids a zero modulus on short runs.
            if i % max(1, epochs // 1000) == 0:
                print('\r{0:.1%}'.format(i / epochs), end='')
    print('') # clear line
            
def test_net(net):
    testing_loss = 0.
    correct = 0.
    with torch.no_grad():
        for images, targets in testing_loader:
            images = images.to(device)
            targets = targets.to(device)

            label = net(images)

            # Correct is tracked by seeing if the index of the max
            # matches the label, and then adding these truth values.
            correct += torch.sum(label.argmax(1) == targets)
            testing_loss += loss_fn(label, targets)

    return testing_loss, correct

In [4]:
def count_param_size(opt):
    count = 0
    for state_value in opt.state.values():
        for value in state_value.values():
            if torch.is_tensor(value):
                count += value.numel()
    return count
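The counts this function should return can be worked out by hand from the layer shapes. The sketch below assumes that Adagrad stores one accumulator per parameter, that Adam stores two, and that this SM3 implementation keeps one accumulator per row and per column of each weight matrix plus a full accumulator for each bias vector. It also assumes step counters are stored as plain numbers; some PyTorch versions store `step` as a tensor, which would add a few elements to the count. The helper names are illustrative.

```python
# Layer shapes of the example network as (out_features, in_features).
layer_shapes = [(512, 28**2), (256, 512), (10, 256)]

def network_size(shapes):
    # Each linear layer has an (out x in) weight matrix and a bias of length out.
    return sum(out * inp + out for out, inp in shapes)

def row_col_size(shapes):
    # Assumed SM3 layout: (out + in) accumulators per weight matrix,
    # and one accumulator per entry of each bias vector.
    return sum((out + inp) + out for out, inp in shapes)

n = network_size(layer_shapes)
adagrad_state = n                    # one accumulator per parameter
adam_state = 2 * n                   # first and second moment estimates
sm3_state = row_col_size(layer_shapes)

print(n, adagrad_state, adam_state, sm3_state)
```

Under these assumptions the arithmetic gives 535818, 1071636, and 3108, which agrees with the counts reported in the summary.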

In [5]:
print('Testing Adam')
summary_Adam = test_optimizer(torch.optim.Adam, {'lr': 0.01})
summary_Adam['name'] = 'Adam'

Testing Adam
100.0%


In [6]:
print('Testing Adagrad')
summary_Adagrad = test_optimizer(torch.optim.Adagrad, {'lr': 0.1})
summary_Adagrad['name'] = 'Adagrad'

Testing Adagrad
100.0%


In [7]:
print('Testing SM3-II')
summary_SM3 = test_optimizer(SM3, {'lr': 0.1})
summary_SM3['name'] = 'SM3'

Testing SM3-II
100.0%


In [8]:
print('Testing SM3-II with warm-up for first 5%')
lr_lambda = lambda epoch: min(1., (epoch / (0.05 * epochs)) ** 2)
summary_SM3v2 = test_optimizer(SM3, {'lr': .5}, lr_lambda)
summary_SM3v2['name'] = 'SM3 v2'

Testing SM3-II with warm-up for first 5%
100.0%


In [9]:
print('Testing SM3-II with warm-up for first 5% and decay during last 10%')
lr_lambda = lambda epoch: min(1., (epoch / (0.05 * epochs)) ** 2, (epochs - epoch) / (0.1 * epochs))
summary_SM3v3 = test_optimizer(SM3, {'lr': .5}, lr_lambda)
summary_SM3v3['name'] = 'SM3 v3'

Testing SM3-II with warm-up for first 5% and decay during last 10%
100.0%
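
To see the shape of this schedule, the multiplier can be evaluated at a few steps. `LambdaLR` multiplies the base learning rate by the value the lambda returns, and because `scheduler.step()` is called once per batch here, the lambda's argument counts batches rather than passes over the data. The sketch below restates the lambda as a named function; `warmup_decay` and the total of 6000 steps (600 batches per pass times 10 passes) are assumptions made for illustration.

```python
def warmup_decay(step, total_steps, warmup=0.05, decay=0.1):
    """Quadratic warm-up over the first 5% of steps, linear decay over the last 10%."""
    warm = (step / (warmup * total_steps)) ** 2
    cool = (total_steps - step) / (decay * total_steps)
    return min(1.0, warm, cool)

total_steps = 6000  # assumption: 600 batches per pass x 10 passes

sample_steps = (0, 150, 300, 3000, 5400, 5700, 6000)
multipliers = {s: warmup_decay(s, total_steps) for s in sample_steps}

for s in sample_steps:
    # Effective learning rate is the base rate (0.5 above) times the multiplier.
    print(s, round(multipliers[s], 4), round(0.5 * multipliers[s], 4))
```

With these numbers the multiplier rises from 0 to 1 over the first 300 steps, stays at 1 until step 5400, and falls linearly to 0 at step 6000.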


## Summary
Below we list the optimizers along with the cumulative negative log-likelihood (summed over the test batches) and the number of correct top-1 predictions of the trained network, as well as the number of parameters in the optimizer state and that count relative to the number of network parameters.

In [10]:
network_count = summary_Adagrad['count'] # Adagrad's state has as many entries as the network has parameters
print('Names  \tLoss\tCorrect\tParams\tRelative Params')
for summary in [summary_Adam, summary_Adagrad, summary_SM3, summary_SM3v2, summary_SM3v3]:
    print('{0:7}\t{1:.4}\t{2}\t{3}\t{4:.3%}'.format(
        summary['name'],
        summary['loss'],
        summary['correct'],
        summary['count'],
        summary['count']/network_count)
    )

Names  	Loss	Correct	Params	Relative Params
Adam   	20.03	9684.0	1071636	200.000%
Adagrad	12.33	9712.0	535818	100.000%
SM3    	9.729	9702.0	3108	0.580%
SM3 v2 	14.78	9628.0	3108	0.580%
SM3 v3 	9.908	9725.0	3108	0.580%
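
Because the loss is summed over test batches and the correct count is a raw total, it can help to normalize both. The sketch below copies the values from the table and assumes the standard MNIST test split of 10,000 images evaluated in 100 batches of 100; the variable names are illustrative.

```python
# Assumptions: 10,000 test images, evaluated in batches of 100.
test_images = 10_000
test_batches = test_images // 100

# (summed loss, correct predictions) copied from the summary table.
results = {
    'Adam':    (20.03, 9684),
    'Adagrad': (12.33, 9712),
    'SM3':     (9.729, 9702),
    'SM3 v2':  (14.78, 9628),
    'SM3 v3':  (9.908, 9725),
}

normalized = {}
for name, (loss_sum, correct) in results.items():
    mean_loss = loss_sum / test_batches   # average per-batch loss
    accuracy = correct / test_images      # top-1 accuracy as a fraction
    normalized[name] = (mean_loss, accuracy)
    print('{0:7}\t{1:.4f}\t{2:.2%}'.format(name, mean_loss, accuracy))
```

Under these assumptions every run lands between 96% and 98% top-1 accuracy, consistent with the claim of at least 95% made at the top.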
