
Use of image classification datasets


There is some confusion regarding how we use the image classification datasets in our examples, which we want to clear up.
Each dataset we use for image classification training is divided into three parts (or datasets): Training, Validation, and Testing. The Training dataset (DS) is used to drive the learning process, the Validation DS is used for estimating the generalization error, and the Test DS is used for the final test score.
One source of confusion stems from the names "validation dataset" and "test dataset", which we've seen different people, blogs, and papers use to refer to different things. The second source of confusion is specific to ImageNet, which is distributed with only Training and Validation datasets. Because the Test DS is withheld (this dataset originated in the classification competition, so it wasn't made publicly available), many people and papers report results using the Validation dataset.
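
To make these three roles concrete, here is a minimal PyTorch sketch of a training script that uses each dataset for its intended purpose. The random-tensor datasets and the linear model are just placeholders (this is not Distiller code); in practice the three loaders would come from a real image dataset.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for the Training, Validation and Test datasets.
def make_loader(n_samples):
    x = torch.randn(n_samples, 20)
    y = torch.randint(0, 2, (n_samples,))
    return DataLoader(TensorDataset(x, y), batch_size=32, shuffle=True)

train_loader, valid_loader, test_loader = make_loader(800), make_loader(100), make_loader(100)

model = nn.Linear(20, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

def accuracy(loader):
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

for epoch in range(5):
    # The Training DS drives the learning process.
    for x, y in train_loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    # The Validation DS estimates the generalization error during training;
    # use it to choose hyperparameters, schedules, checkpoints, etc.
    print(f"epoch {epoch}: validation accuracy = {accuracy(valid_loader):.3f}")

# The Test DS is used once, at the very end, for the final reported score.
print(f"final test accuracy = {accuracy(test_loader):.3f}")
```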

Let's address these two sources of confusion. We follow the naming convention and methodology explained on page 119 of Goodfellow et al.'s excellent Deep Learning book (which is available for free online). The relevant text is pasted below.

Earlier we discussed how a held-out test set, composed of examples coming from the same distribution as the training set, can be used to estimate the generalization error of a learner, after the learning process has completed. It is important that the test examples are not used in any way to make choices about the model, including its hyperparameters. For this reason, no example from the test set can be used in the validation set. Therefore, we always construct the validation set from the training data. Specifically, we split the training data into two disjoint subsets. One of these subsets is used to learn the parameters. The other subset is our validation set, used to estimate the generalization error during or after training, allowing for the hyperparameters to be updated accordingly. The subset of data used to learn the parameters is still typically called the training set, even though this may be confused with the larger pool of data used for the entire training process. The subset of data used to guide the selection of hyperparameters is called the validation set. Typically, one uses about 80 percent of the training data for training and 20 percent for validation. Since the validation set is used to “train” the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error does. After all hyperparameter optimization is complete, the generalization error may be estimated using the test set.

In practice, when the same test set has been used repeatedly to evaluate performance of different algorithms over many years, and especially if we consider all the attempts from the scientific community at beating the reported state-of-the-art performance on that test set, we end up having optimistic evaluations with the test set as well. Benchmarks can thus become stale and then do not reflect the true field performance of a trained system. Thankfully, the community tends to move on to new (and usually more ambitious and larger) benchmark datasets.
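
Here is a minimal sketch of the 80/20 split the book describes, using plain PyTorch/torchvision. CIFAR-10 appears only because it is small and easy to download; any dataset that ships with a held-out test set works the same way, and this is not how Distiller's code is organized.

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# CIFAR-10 stands in for any dataset that comes with a held-out test set.
train_data = datasets.CIFAR10('./data', train=True, download=True,
                              transform=transforms.ToTensor())
test_data = datasets.CIFAR10('./data', train=False, download=True,
                             transform=transforms.ToTensor())

# Split the training data into two disjoint subsets: ~80% to learn the
# parameters, ~20% to "train" the hyperparameters (the validation set).
n_valid = int(0.2 * len(train_data))
train_subset, valid_subset = random_split(
    train_data, [len(train_data) - n_valid, n_valid],
    generator=torch.Generator().manual_seed(0))

train_loader = DataLoader(train_subset, batch_size=256, shuffle=True)
valid_loader = DataLoader(valid_subset, batch_size=256)  # guides hyperparameter choices
test_loader = DataLoader(test_data, batch_size=256)      # estimates the final generalization error
```

The essential point is that the test loader is never consulted while choices about the model are still being made.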

Now let's address the second point: the confusion around our use of ImageNet's datasets in the sample compress_classifier.py application.
We've provided an application utility function apputils.load_data to create three PyTorch dataloaders for ImageNet: training, validation, test. But as we said above, there's no ImageNet Test dataset!
To resolve this, the signature of apputils.load_data lets you pass the size of the Validation dataset ('proportion' would have been a better name than 'size', since this parameter indicates the proportion of the Training dataset that is used for validation). If this 'size' parameter is greater than zero, we split the Training dataset into Training and Validation dataloaders, and we use the original ImageNet Validation dataset as the Test dataset. If the 'size' parameter is exactly zero, then we use the ImageNet Validation dataset for both validation and testing.
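
The sketch below shows roughly what this amounts to in plain torchvision. It is not the actual apputils.load_data implementation - the function name load_imagenet, the 'valid_size' parameter name, and the directory layout are assumptions made for illustration; it only mirrors the behavior described above.

```python
import numpy as np
from torch.utils.data import DataLoader, SubsetRandomSampler
from torchvision import datasets, transforms

def load_imagenet(data_dir, valid_size=0.1, batch_size=256, num_workers=4):
    """Illustrative stand-in for apputils.load_data - not the real implementation."""
    transform = transforms.Compose([transforms.Resize(256),
                                    transforms.CenterCrop(224),
                                    transforms.ToTensor()])
    train_set = datasets.ImageFolder(data_dir + '/train', transform)
    val_set = datasets.ImageFolder(data_dir + '/val', transform)

    if valid_size > 0:
        # Carve a validation split out of the ImageNet Training dataset...
        indices = np.random.permutation(len(train_set)).tolist()
        split = int(valid_size * len(train_set))
        train_loader = DataLoader(train_set, batch_size,
                                  sampler=SubsetRandomSampler(indices[split:]),
                                  num_workers=num_workers)
        valid_loader = DataLoader(train_set, batch_size,
                                  sampler=SubsetRandomSampler(indices[:split]),
                                  num_workers=num_workers)
        # ...and use the official ImageNet Validation dataset as the Test dataset.
        test_loader = DataLoader(val_set, batch_size, num_workers=num_workers)
    else:
        # valid_size == 0: the ImageNet Validation dataset serves as both the
        # validation and the test dataset.
        train_loader = DataLoader(train_set, batch_size, shuffle=True,
                                  num_workers=num_workers)
        valid_loader = DataLoader(val_set, batch_size, num_workers=num_workers)
        test_loader = valid_loader

    return train_loader, valid_loader, test_loader
```

With valid_size=0.1 this mirrors the default 90-10 Training-Validation split discussed below; with valid_size=0 the validation and test loaders are one and the same.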

By default we use a 90-10 Training-Validation split. Was this a good decision?
The split makes the Training dataset smaller, which speeds up training, but training on less data can also hurt results. We wanted to adhere to a correct methodology, and didn't appreciate at the time that ImageNet has been beaten to death, and then stabbed and thrown off the top of the Burj Khalifa. For now, we are sticking with this default convention in our examples - mainly because it might be even more confusing to change things now.

You are free to make your own decision: use a larger training-validation split, or choose split=0. Feel free to chime in with your opinion on this - we're interested in hearing from highly opinionated people :-)