Messup

Introduction

The main objective of Mixup augmentation/regularization is to introduce linear behaviour in between training samples in order to reduce generalization error. Mixup in turn reduces "undesirable oscillations when predicting outside the training examples". Each virtual sample is a convex combination of two training samples, with the mixing coefficient sampled from a Beta distribution.

In this repository, inspired by Mixup, I introduce an exponential moving average (EMA) over the training samples. In other words, samples are mixed across steps and previously seen samples are exponentially weighed down as training progresses. I call this method Messup.

Algorithm

Concretely the algorithm is as follows,

To explain the algorithm: an EMA in general weighs down all the data points in any problem exponentially as time progresses. In other words, past examples are weighed down exponentially and the most recent training example is assigned a higher weight. This weight is the smoothing constant.
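For reference, the standard EMA recursion and its unrolled form are written out below. The symbols $s_t$ (running average at step $t$), $x_t$ (sample or batch at step $t$) and the initialization $s_0 = x_0$ are notation assumed here for illustration, with $\alpha$ as the smoothing constant.

$$
s_t = \alpha\, x_t + (1 - \alpha)\, s_{t-1}, \qquad s_0 = x_0
$$

$$
s_t = \alpha \sum_{k=0}^{t-1} (1 - \alpha)^{k}\, x_{t-k} + (1 - \alpha)^{t}\, x_0
$$

The most recent sample carries weight $\alpha$, and a sample seen $k$ steps earlier carries weight $\alpha (1 - \alpha)^{k}$, which is the exponential decay described above.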

Like EMA, Messup has a smoothing constant alpha and an extra hyperparameter called Reset Cycle (C). All C does is reset the weights, i.e. the exponential dependency between training samples, every C steps. To understand this better: Messup acts like a plain EMA over an epoch when C = int(Number of training samples/Batch size), i.e. when C equals the number of steps in an epoch. So if C = 30, the weights of all samples encountered so far are reset every 30 steps within an epoch. (Weight here refers to the smoothing constant!)

Algorithm REMA (Reset Exponential Moving Average) is called when the current step is divisible by C.

Algorithm CEMA (Compute Exponential Moving Average) is called whenever the current step is not divisible by C. If CEMA doesn't make sense as an algorithm, all it's doing is computing the EMA equation iteratively rather than recursively.
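The REMA/CEMA split can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in this repository: the function name `messup_epoch` is hypothetical, batches are plain lists of floats standing in for tensors, and it assumes that a reset restarts the average from the current batch. The same update would presumably be applied to the one-hot labels.

```python
def messup_epoch(batches, alpha, reset_cycle):
    """Return the mixed batch used at every step of one epoch.

    batches     -- list of equally sized batches (lists of floats here,
                   standing in for image or one-hot label tensors)
    alpha       -- smoothing constant, the weight on the current batch
    reset_cycle -- C, the number of steps between resets
    """
    mixed = []
    ema = None
    for step, batch in enumerate(batches):
        if step % reset_cycle == 0:
            # REMA (assumed behaviour): forget the history and
            # restart the average from the current batch.
            ema = list(batch)
        else:
            # CEMA: one iterative EMA update, applied element-wise.
            ema = [alpha * x + (1.0 - alpha) * m for x, m in zip(batch, ema)]
        mixed.append(ema)
    return mixed


batches = [[1.0], [2.0], [3.0], [4.0]]
out = messup_epoch(batches, alpha=0.7, reset_cycle=2)
```

With `reset_cycle=2`, steps 0 and 2 return the raw batch, while steps 1 and 3 blend the current batch (weight 0.7) with the previous one (weight 0.3).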

The loss in the algorithm is Cross Entropy and the labels are all one-hot encoded. Also, the total number of samples in the dataset has to be a multiple of the batch size, because exponential smoothing takes place across steps in an epoch and every batch must therefore have the same size (otherwise, drop the leftover samples in the last step).
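As a quick check of the arithmetic, the helper below counts the full steps in an epoch and the samples that would be dropped. The helper name `steps_per_epoch` is hypothetical, and the figure of 50,000 samples assumes the standard CIFAR-10 training split. In PyTorch, passing `drop_last=True` to `torch.utils.data.DataLoader` discards the incomplete final batch.

```python
def steps_per_epoch(num_samples, batch_size):
    """Return (full steps per epoch, samples left over for a partial batch)."""
    return divmod(num_samples, batch_size)


# Assumed: 50,000 training images, batch size 128 as in the Setup section.
steps, dropped = steps_per_epoch(50_000, 128)
print(steps, dropped)  # 390 80
```

Under these assumptions, C = 390 would make Messup behave like a plain EMA over the whole epoch.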

What is Messup doing?

Messup introduces linear behaviour between training samples, like Mixup. Mixup converges to the ERM strategy when the parameters of the Beta distribution tend to zero: for λ ~ Beta(a, a) with a → 0, λ concentrates on the values 0 and 1, so every virtual sample is just one of the original samples. Like Mixup, Messup's smoothing constant can also be set to 1, thereby making the network use the ERM strategy.

Can it compete with Mixup?

Mixup constructs virtual samples within a single step. Suppose X is a batch of training samples and λ is the coefficient. The virtual sample is λ*X + (1-λ)*shuffle(X). Messup, on the other hand, mixes across steps since the EMA comes into play, so each step encounters the samples from previous steps again, but with reduced weight. Intuitively, this makes Messup lag behind Mixup, so I never ran the comparison, on the expectation that Mixup would outperform Messup. That said, since this is not a time-series problem, I think we can shuffle the previously encountered data between steps and increase the robustness.
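For comparison, the within-step Mixup formula above can be sketched as follows. This is an illustrative stand-in in plain Python rather than the reference Mixup code: the function name `mixup_batch` and its arguments are hypothetical, samples and one-hot labels are plain lists, and a single λ is drawn per batch from Beta(alpha, alpha).

```python
import random


def mixup_batch(xs, ys, alpha=1.0, rng=random):
    """Mix a batch with a shuffled copy of itself.

    xs -- list of feature vectors (lists of floats)
    ys -- list of one-hot label vectors, same length as xs
    """
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    perm = list(range(len(xs)))
    rng.shuffle(perm)                    # plays the role of shuffle(X)

    def mix(a, b):
        # Convex combination lam * a + (1 - lam) * b, element-wise.
        return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

    mixed_x = [mix(xs[i], xs[j]) for i, j in enumerate(perm)]
    mixed_y = [mix(ys[i], ys[j]) for i, j in enumerate(perm)]
    return mixed_x, mixed_y, lam
```

Because the labels are mixed with the same λ, each mixed label vector still sums to 1 and can be used directly with a cross-entropy loss on soft targets.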

Prerequisites

PyTorch >= 1.4

Experiment

To run Messup I use CIFAR-10 on a ResNet with only identity skip connections (PreActResNet18), taken from this repository.

Setup

batch-size - 128
optimizer - Adam
step-size - 0.001
epochs - 300

Results

| Strategy | Model | Smoothing constant (α) | Reset Cycle (C) | Classification error (%) |
| --- | --- | --- | --- | --- |
| ERM | PreAct ResNet-18 | NA | NA | 7.69 |
| Messup | PreAct ResNet-18 | 0.7 | 5 | 6.70 |
| Messup | PreAct ResNet-18 | 0.7 | 30 | 6.26 |
| Messup | PreAct ResNet-18 | 0.7 | 80 | 6.09 |

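Hence Messup is indeed serving its purpose of regularizing and reducing variance: in these runs the classification error drops from 7.69 with ERM to 6.09 with Messup at C = 80, and it keeps decreasing as C grows from 5 to 80.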

Future Work

The Messup algorithm has three important hyperparameters: Reset Cycle (C), smoothing constant (α) and batch size. Batch size is important because it determines how many steps there are in an epoch, which in turn affects the choice of C. Lower batch sizes increase the number of steps in an epoch, which defeats the purpose of Messup: samples encountered in the initial steps are weighed down drastically by the time the closing steps of the epoch are reached. Conversely, with higher batch sizes the mixed samples might be too corrupted to make sense to the model, since the number of steps is reduced. The Reset Cycle suffers the same fate: setting a higher C has the same effect as setting a lower batch size, and a lower C is the same as setting a higher batch size. Hence the ratio of batch size to C is really important in that regard. Introducing Messup onto word embeddings for NLP tasks has to be investigated.
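How quickly early batches fade inside one reset window can be made concrete with a small calculation. The sketch below assumes that a reset restarts the average from the current batch, so the weights follow the usual EMA unrolling. The function name `window_weights` is hypothetical.

```python
def window_weights(alpha, reset_cycle):
    """Weights on batches 0..C-1 at the last step of one reset window.

    Assumes the window starts with ema = batch 0 and applies
    ema = alpha * batch + (1 - alpha) * ema on each later step.
    """
    c = reset_cycle
    first = (1.0 - alpha) ** (c - 1)  # weight left on the first batch
    rest = [alpha * (1.0 - alpha) ** (c - 1 - k) for k in range(1, c)]
    return [first] + rest


for c in (5, 30, 80):
    w = window_weights(0.7, c)
    print(c, w[0], w[-1])  # weight on the first and on the latest batch
```

With α = 0.7 the latest batch always keeps weight 0.7, while the weight on the first batch of a window shrinks as 0.3^(C-1). Under this assumption, a larger C mostly lengthens the tail of very small weights.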

References

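Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz. mixup: Beyond Empirical Risk Minimization. ICLR 2018.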

Note

If you find any bug in my implementation of Messup, kindly feel free to drop an issue or email me at maiyaanirudh@gmail.com.

To-Do

Pick any NLP problem and apply Messup to it
Incorporate Mixup into Messup if possible


AnirudhMaiya/Messup
