# CIFAR-10

We implement a ResNet-34 model and train it with the best parameter settings reported in the respective papers. We also implement our own version of the algorithm.

In [1]:
import os
%matplotlib notebook
import matplotlib.pyplot as plt
import torch
import numpy as np

LABELS = ['SGD','ADAGRAD','ADAM','AMSBOUND','ADABOUND','AMSGRAD','GGDO4','GGDO2']

In [2]:
def get_folder_path(use_pretrained=True):
    path = 'curve/trained2'
    if use_pretrained:
        path = os.path.join(path, 'pretrained')
    return path

In [3]:
def get_curve_data(use_pretrained=True, model='ResNet'):
    """Load the saved curves for `model`, keyed by lower-case optimizer name."""
    folder_path = get_folder_path(use_pretrained)
    # Curve file names start with '<model>-<optimizer>'; keep only those for this model.
    filenames = [name for name in os.listdir(folder_path) if name.startswith(model.lower())]
    paths = [os.path.join(folder_path, name) for name in filenames]
    keys = [name.split('-')[1] for name in filenames]
    return {key: torch.load(fp) for key, fp in zip(keys, paths)}

In [4]:
def plot(use_pretrained=True, model='ResNet', optimizers=None, curve_type='train', plot_acc=True):
    """Plot accuracy (plot_acc=True) or loss (plot_acc=False) curves, one line per optimizer."""
    # Fall back to every optimizer in LABELS when none are given.
    optimizers = LABELS if optimizers is None else optimizers
    assert model in ['ResNet', 'DenseNet'], 'Invalid model name: {}'.format(model)
    assert curve_type in ['train', 'test'], 'Invalid curve type: {}'.format(curve_type)
    assert all(_ in LABELS for _ in optimizers), 'Invalid optimizer'

    curve_data = get_curve_data(use_pretrained, model=model)
    split = curve_type.capitalize()

    plt.figure()
    plt.xlabel('Epoch')
    if plot_acc:
        metric = 'acc'
        plt.title('{} Accuracy for {} on CIFAR-10'.format(split, model))
        plt.ylabel('{} Accuracy %'.format(split))
        plt.ylim(80, 101 if curve_type == 'train' else 96)
    else:
        metric = 'loss'
        plt.title('{} Categorical Cross Entropy Loss for {} on CIFAR-10'.format(split, model))
        plt.ylabel('{} Loss'.format(split))

    # Curves are stored under keys such as 'train_acc' or 'test_loss'.
    curve_key = '{}_{}'.format(curve_type, metric)
    for optim in optimizers:
        linestyle = '--' if 'GGDO' in optim else '-'  # dashed lines mark the proposed optimizers
        values = np.array(curve_data[optim.lower()][curve_key])
        plt.plot(values, label=optim, ls=linestyle)

    plt.grid(ls='--')
    plt.legend()
    plt.show()

The above function plots the learning curves. To use your own data points, set `use_pretrained` to `False`.
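The loader and plotting code above imply a particular layout for your own curve files, which the sketch below spells out. The file name and the numbers are hypothetical; only the conventions (a file name starting with `<model>-<optimizer>` and dictionary keys such as `train_acc`) are taken from `get_curve_data` and `plot`.

```python
# Sketch of the layout that get_curve_data() and plot() expect.
# The file name and values below are made up for illustration.

def optimizer_key(filename):
    """Return the dictionary key get_curve_data() derives from a file name."""
    return filename.split('-')[1]


def has_curve(record, curve_type, plot_acc=True):
    """Check that a loaded record holds the list plot() would read."""
    key = '{}_{}'.format(curve_type, 'acc' if plot_acc else 'loss')
    return key in record and len(record[key]) > 0


# Hypothetical file name: it must start with the lower-cased model name, and
# the second '-'-separated field must be the lower-cased optimizer name.
filename = 'resnet-ggdo2-run1'

# Hypothetical record: one value per epoch for each curve.
record = {
    'train_acc': [41.2, 58.9, 67.3],
    'test_acc': [45.0, 60.1, 66.8],
}

print(optimizer_key(filename))                    # ggdo2
print(has_curve(record, 'train'))                 # True
print(has_curve(record, 'test', plot_acc=False))  # False, no 'test_loss' stored
```

Because the loader reads each file with `torch.load`, a record like this would be written with `torch.save(record, path)` into the `curve/trained2` folder.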

## ResNet

First, let's see the results with ResNet-34.

### For all the optimizers:

GGDO2 and GGDO4 are our proposed optimizers. GGDO4 simply samples its gradient from Gaussian noise whose parameters are estimated from the previous gradient updates.
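To illustrate the idea of sampling the update from a Gaussian fitted to past gradients, here is a rough sketch in plain Python. It is not the GGDO implementation: the class name, the exponential-moving-average estimates, and the `beta` and `lr` values are all assumptions chosen for illustration.

```python
import math
import random


class SampledGradientSketch:
    """Hypothetical illustration only, not the GGDO optimizers used here."""

    def __init__(self, dim, lr=0.1, beta=0.9, seed=0):
        self.lr = lr
        self.beta = beta          # assumed decay rate for the running estimates
        self.mean = [0.0] * dim   # running mean of past gradients
        self.var = [0.0] * dim    # running variance of past gradients
        self.rng = random.Random(seed)

    def step(self, params, grads):
        new_params = []
        for i, (p, g) in enumerate(zip(params, grads)):
            # Update the Gaussian parameters from the observed gradient
            # (exponentially weighted mean and variance).
            delta = g - self.mean[i]
            self.mean[i] += (1 - self.beta) * delta
            self.var[i] = self.beta * (self.var[i] + (1 - self.beta) * delta * delta)
            # Step along a draw from N(mean, var) instead of the raw gradient.
            sampled = self.rng.gauss(self.mean[i], math.sqrt(self.var[i]))
            new_params.append(p - self.lr * sampled)
        return new_params


# With a constant gradient the estimated variance shrinks over time,
# so the sampled steps become less noisy.
opt = SampledGradientSketch(dim=1)
x = [0.0]
for _ in range(300):
    x = opt.step(x, [2.0])
print(opt.mean[0], opt.var[0])
```

In this sketch the noise level is not a hyperparameter: it is whatever spread the recent gradients show, so it falls when successive gradients agree.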

We observe that performance on the training and test sets is close to that of the adaptive methods early in training, and that it converges to a more generalizable solution, akin to SGD, after the learning rate is decayed at the 150th epoch.
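The learning-rate decay mentioned above can be sketched as a single step schedule. The base learning rate, the decay factor, and zero-based epoch indexing are assumptions here; only the decay point at epoch 150 comes from the text.

```python
def step_decay_lr(epoch, base_lr=0.1, decay_epoch=150, gamma=0.1):
    """Learning rate for a zero-indexed epoch under one step decay.

    base_lr and gamma are illustrative values, not the ones used in the runs.
    """
    return base_lr * gamma if epoch >= decay_epoch else base_lr


for epoch in (0, 149, 150, 199):
    print(epoch, step_decay_lr(epoch))
```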

In [5]:
LABELS = ['SGD','ADAGRAD','ADAM','AMSBOUND','ADABOUND','AMSGRAD','GGDO4','GGDO2']

plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='train',plot_acc=True)
plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='test',plot_acc=True)
#plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='train',plot_acc=False)
#plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='test',plot_acc=False)

### GGDO vs SGD

We see that GGDO's performance on the training and test sets is better than SGD's in the early stages, and that its test-set performance is comparable to SGD's after the learning rate has been decayed.

The performance dip at the end of training is due to SGD being unstable (as mentioned in the original ResNet paper) rather than to overtraining or overfitting.

We see that GGDO is more stable than SGD. This is consistent with Stochastic Gradient Langevin Dynamics (SGLD), under which the method converges if the noise shrinks asymptotically. In our case, the variance of the noise estimated from past gradients becomes small as training progresses, and hence the model converges.
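For reference, the SGLD update can be written as follows, using the standard notation of Welling and Teh (2011) rather than any symbols defined in this notebook: $\theta_t$ are the parameters, $\epsilon_t$ is the step size, $N$ is the dataset size, and $n$ is the mini-batch size.

$$
\Delta\theta_t = \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(x_{ti}\mid\theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0, \epsilon_t)
$$

$$
\sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty
$$

The injected noise has variance $\epsilon_t$, so it vanishes as the step sizes decay. The comparison drawn above is that GGDO's noise variance, estimated from past gradients, also shrinks as training converges; this is an analogy, not a proof that the same guarantee applies.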

In [6]:
LABELS = ['SGD','GGDO4','GGDO2']

plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='train',plot_acc=True)
plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='test',plot_acc=True)

### GGDO vs Adaptive methods

We see that during early training GGDO converges as well as conventional adaptive methods such as Adam, AdaGrad, and AMSGrad, and that it converges to a better solution once the learning rate is decayed.

In [7]:
LABELS = ['ADAGRAD','ADAM','AMSGRAD','GGDO4','GGDO2']

plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='train',plot_acc=True)
plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='test',plot_acc=True)

### GGDO vs adaptive methods proposed in ICLR 2019

We see that the training and test set performance is similar to that of adaptive methods such as AdaBound and AMSBound (which regularize the per-parameter scale). We show that similar performance can be reached with fewer hyperparameters and an SGD-like structure.
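To make the per-parameter regularization concrete, here is a minimal sketch of the learning-rate clipping that AdaBound-style methods apply. The bound functions follow the form commonly used for AdaBound; the function names and the values `final_lr=0.1` and `gamma=1e-3` are assumptions for illustration, not settings taken from these experiments.

```python
def bounds(step, final_lr=0.1, gamma=1e-3):
    """Lower and upper learning-rate bounds at a 1-indexed step (assumed form)."""
    lower = final_lr * (1 - 1 / (gamma * step + 1))
    upper = final_lr * (1 + 1 / (gamma * step))
    return lower, upper


def clipped_lr(adaptive_lr, step, final_lr=0.1, gamma=1e-3):
    """Clip a per-parameter adaptive learning rate into the current bounds."""
    lower, upper = bounds(step, final_lr, gamma)
    return min(max(adaptive_lr, lower), upper)


# The interval [lower, upper] tightens around final_lr as training proceeds.
for step in (1, 1000, 1000000):
    lower, upper = bounds(step)
    print(step, lower, upper, clipped_lr(1.0, step))
```

Early on the bounds are loose, so the update behaves like an adaptive method. As `step` grows both bounds approach `final_lr`, so it behaves like SGD with a fixed step size.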

In [8]:
LABELS = ['ADABOUND','AMSBOUND','GGDO4','GGDO2']

plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='train',plot_acc=True)
plot(use_pretrained=False, model='ResNet', optimizers=LABELS, curve_type='test',plot_acc=True)


In [9]:
# For DenseNet
#plot(use_pretrained=True, model='DenseNet', optimizers=LABELS, curve_type='train')
#plot(use_pretrained=True, model='DenseNet', optimizers=LABELS, curve_type='test')