Reproducing Super Convergence
This is an attempt to reproduce a subset of the results found in Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates.
Super-Convergence is described as "a phenomenon... where residual networks can be trained using an order of magnitude fewer iterations than is used with standard training methods".
Figure 1a from the paper, reproduced below, demonstrates the phenomenon:
Cyclical Learning Rate (CLR) allows for competitive training in just 10,000 training steps.
Weaker evidence of super-convergence is demonstrated below:
Left: test accuracy after 10,000 steps with CLR. Right: test accuracy after 80,000 steps with a multistep schedule.
In the above images:
- A Cyclical Learning Rate allows for a test accuracy of ~85% after 10,000 training steps.
- A multistep learning rate allows for a test accuracy of ~80% after 20,000 training steps. No further progress is made between steps 60,000 and 80,000.
- Accuracies above 90% could not be achieved. This may be related to the small mini-batch size used here (125) compared to the authors' (1,000).
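The triangular cyclical learning rate used for the 10,000-step runs can be sketched as a simple function of the training step. This is a minimal illustration of Smith's CLR policy; the bounds and cycle length below are assumptions, not the exact values used in this repo.

```python
def cyclical_lr(step, step_size=5000, base_lr=0.1, max_lr=3.0):
    """Triangular CLR: the rate rises linearly from base_lr to max_lr
    over `step_size` steps, then falls back to base_lr over the next
    `step_size` steps. Bounds here are illustrative assumptions."""
    cycle_pos = step % (2 * step_size)
    if cycle_pos < step_size:
        frac = cycle_pos / step_size        # rising half of the cycle
    else:
        frac = 2 - cycle_pos / step_size    # falling half of the cycle
    return base_lr + frac * (max_lr - base_lr)
```

With `step_size=5000`, one full up-down cycle covers exactly the 10,000 training steps of the CLR runs above.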
The TensorFlow implementation is based on the ResNet-56 architecture described in Appendix A of Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates, with the following changes:
- The 3x3 conv layer at the start of the network has `stride=2`, as mentioned in the paper.
- While training, images are flipped left-to-right with 50% probability.
- All weights before ReLUs are initialized according to Delving Deep into Rectifiers.
- All weights before the softmax are initialized according to Understanding the difficulty of training deep feedforward neural networks.
- Bias variables are initialized to zero.
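The two initialization schemes in the list above can be sketched as follows. This is an illustrative NumPy version (the repo uses TensorFlow's built-in initializers); the kernel shapes in the docstrings are assumptions.

```python
import numpy as np

def he_normal(shape, seed=0):
    """He et al. initialization for weights feeding into a ReLU:
    zero-mean Gaussian with std = sqrt(2 / fan_in).
    For a conv kernel shaped (k, k, in_ch, out_ch), fan_in = k*k*in_ch."""
    rng = np.random.default_rng(seed)
    fan_in = int(np.prod(shape[:-1]))
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=shape)

def glorot_uniform(shape, seed=0):
    """Glorot/Xavier initialization for the final softmax layer:
    uniform on [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng(seed)
    fan_in, fan_out = shape[-2], shape[-1]
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=shape)
```

He initialization preserves activation variance through ReLU layers, while Glorot initialization balances the forward and backward signal for the linear softmax layer.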
The learning rate, train accuracy and train loss for 10,000 training steps with a cyclical learning rate are shown below:
The learning rate, train accuracy and train loss for 80,000 training steps with a multistep learning rate are shown below:
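For comparison, the multistep baseline drops the learning rate by a constant factor at fixed step boundaries. A minimal sketch is below; the base rate, boundaries, and decay factor are illustrative assumptions, not the exact values used for the 80,000-step run.

```python
def multistep_lr(step, base_lr=0.35, boundaries=(25000, 50000), decay=0.1):
    """Piecewise-constant schedule: multiply the rate by `decay`
    at each boundary step. All values here are assumptions."""
    lr = base_lr
    for boundary in boundaries:
        if step >= boundary:
            lr *= decay
    return lr
```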