In this paper, we present a novel characterization of the smoothness of a model based on basic principles of Large Deviation Theory. In contrast to prior work, where the smoothness of a model is normally characterized by a single real value (e.g., the norm of the weights), we show that smoothness can be described by a simple real-valued function. Based on this concept of smoothness, we propose a unifying theoretical explanation of why some interpolators generalize remarkably well and why a wide range of modern learning techniques (i.e., stochastic gradient descent,
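For orientation, the experiments below estimate the two standard large-deviation objects for a model's per-sample loss $L$: the cumulant (generating) function and its Legendre transform, the rate function. The forms shown here are the textbook Cramér-type definitions and are given only as a reference; the exact definitions used in the paper (e.g., the choice of loss or any centering) may differ.

$$
\Lambda(\lambda) = \log \mathbb{E}\left[e^{\lambda L}\right],
\qquad
I(a) = \sup_{\lambda}\left(\lambda a - \Lambda(\lambda)\right).
$$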
This repository contains Python Notebooks that reproduce each of the figures displayed in the article. These figures can be found in a separate folder in this repository. The specific experiments are the following:
- Measuring the rate function of different Inception models on CIFAR-10. In this experiment, a small InceptionV3 network is trained using different techniques: $\ell_2$ regularization, random cropping, and random-label memorization. The rate and cumulant functions for each of these models are shown in Figures 1 and 6, respectively, and can be reproduced in the corresponding Notebook. A minimal, illustrative sketch of this kind of rate-function estimate is given after this list.
- Measuring the rate function on different architectures. In this experiment, both small InceptionV3 networks and MLPs are used to show the effect that invariant architectures have on the rate function and generalization error of the models. This is shown in Figure 3 (left) and can be reproduced with the Notebook.
- Measuring the impact of stochastic vs. non-stochastic optimization. In this experiment, the rate functions and abnormality rates of different small Inception models are computed using non-stochastic optimization and stochastic optimization with different batch sizes. The rate functions can be seen here, train/test errors here, inverse rate functions here and abnormality rates here. Experiments can be reproduced using the following Notebook.
- Measuring the impact of over-parameterization. In this experiment, we measure the impact of over-parameterization on the generalization performance of the models. To this end, we use an increasing (in parameters) set of convolutional models, each containing the previous ones. Figure 7 shows the train and test error of each of these models, and Figure 8 (left) shows the rate function of the first 7 models, where the rate function decreases as the size of the model increases. Figure 8 (right) shows the remaining models, where the rate function starts to increase. This experiment can be reproduced using the Notebook.
- Measuring the empirical distribution of the abnormality rate. In this experiment, we measure the empirical distribution of the abnormality rates of two different LeNet-5 networks on MNIST. Figure 9 (top left and right) shows the test error and samples of the training error for each of these models. Figure 9 (middle left and right) shows the estimated density of the abnormality rates of these models. Figure 9 (bottom left and right) shows the estimated cumulative distribution of the abnormality rates. This experiment can be reproduced using the Notebook. A generic sketch of this kind of density/CDF estimate is also included after this list.
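As a rough orientation for the rate-function experiments, the following is a minimal, illustrative sketch of how a Cramér-type cumulant function and its Legendre transform could be estimated from a vector of per-sample losses. It is not the code used in the Notebooks; the function names, the plain Monte Carlo estimator, and the bounded numerical search over $\lambda$ are assumptions made purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def empirical_cumulant(losses, lam):
    """Empirical cumulant generating function log E[exp(lam * L)].
    Assumption: estimated by a plain Monte Carlo average over per-sample losses."""
    a = lam * losses
    m = a.max()  # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(a - m)))

def empirical_rate(losses, a, lam_max=50.0):
    """Legendre transform I(a) = sup_lam [lam * a - cumulant(lam)],
    approximated by a bounded 1-D search over lam in [0, lam_max]."""
    objective = lambda lam: -(lam * a - empirical_cumulant(losses, lam))
    res = minimize_scalar(objective, bounds=(0.0, lam_max), method="bounded")
    return -res.fun

# Hypothetical usage: `losses` stands in for the per-example losses of a
# trained model; here they are random placeholders.
losses = np.random.gamma(shape=2.0, scale=0.1, size=10_000)
grid = np.linspace(losses.mean(), losses.mean() + 0.5, 20)
rate_curve = [empirical_rate(losses, a) for a in grid]
```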
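Similarly, the density and cumulative-distribution plots of the last experiment reduce to standard one-dimensional estimates. The sketch below, with placeholder data instead of real abnormality rates, shows one common way to compute them; it is an assumption about the plotting step only, not about how the abnormality rates themselves are defined or computed in the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

def empirical_cdf(values):
    """Return (x, F(x)) pairs for the empirical cumulative distribution."""
    x = np.sort(values)
    f = np.arange(1, len(x) + 1) / len(x)
    return x, f

# Hypothetical usage: `rates` stands in for the abnormality rates computed
# by the Notebook; here they are just random placeholders.
rates = np.random.randn(500)
density = gaussian_kde(rates)                       # smoothed density estimate
xs = np.linspace(rates.min(), rates.max(), 200)
pdf_vals = density(xs)                              # values to plot as the density
cdf_x, cdf_y = empirical_cdf(rates)                 # values to plot as the CDF
```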