The Quest for the Ultimate Optimizer

The field of neural networks, both in terms of research and practical applications, has grown dramatically in recent years. This has been attributed, among other things, to the concurrent growth of computing power and training data. These have led, in turn, to the development of more and more complex NN (Neural Network) architectures, with more and more layers, various types of cells, various ways to train them, etc. Yet one central piece of the puzzle has remained relatively simple: the optimization algorithm. Don't get me wrong, there are a lot of different variations; they constitute a field of research in their own right and have spurred thousands of research articles, most of them with detailed mathematical proofs of convergence, extensive testing, etc. Yet most of these algorithms (at least those widely used for NN training) can be described in a couple of lines of code! Five lines for the most complicated (and I'm being generous).
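
To give a concrete sense of what "a couple of lines" means, here is a minimal sketch of my own (not code from these notebooks) of two of the most widely used update rules, SGD with momentum and Adam, written with the usual hyperparameter names (lr, momentum, beta1, beta2, eps):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """One step of SGD with momentum: two lines of actual math."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of Adam (t is the 1-based step count): about five lines,
    which is as complicated as the widely used optimizers get."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```

Everything else in a production optimizer is essentially bookkeeping around those few lines.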

I remember that when I first tried to learn about neural networks, around a year ago, the relative simplicity of the SGD (Stochastic Gradient Descent) algorithms used for NN optimization struck me as one of the most intriguing aspects of the field. In one of the lectures of his online course, Geoffrey Hinton tries to explain why we haven't yet found the perfect recipe to train NNs, and more or less concludes that it's the diversity of NNs, both in their architectures and in their tasks, that makes NN optimization such a tough problem to crack, especially if you are looking for a "silver bullet", one-size-fits-all optimization algorithm.

One of the conclusions of Hinton's lecture on how to train NNs is to look at whatever the latest recipe from Yann LeCun and his "No More Pesky Learning Rates" group happens to be. The name of the group, and of the algorithm they came up with, highlights one of the biggest frustrations you face when training NNs: each of these optimizers has at least one knob (when it's not 3 or 4) that needs tuning for your neural net (or any other system you're trying to optimize) to converge both in a reasonable time and to its lowest possible error value. Shouldn't we be able to design a system that is good at interpreting a series of data, like successive gradients, and predicting the best next update based on that history (i.e. an SGD-like algorithm)? Wait... that sounds very much like what a recurrent neural network is good at, doesn't it?
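
To make that intuition concrete, here is a rough, deliberately bare-bones sketch of what such an optimizer could look like: a small LSTM that reads the current gradient of each parameter, keeps an internal state summarizing the gradients it has seen so far, and proposes the next update. The class name, hidden size and per-coordinate treatment are my own illustrative assumptions, not anything taken from these notebooks.

```python
import torch
import torch.nn as nn

class RNNOptimizer(nn.Module):
    """Reads the current gradient, keeps an internal state that summarizes
    past gradients, and proposes the next parameter update."""
    def __init__(self, hidden_size=20):
        super().__init__()
        # Each parameter is treated as an independent coordinate, so the
        # input and the output are a single scalar per parameter.
        self.cell = nn.LSTMCell(1, hidden_size)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, grad, state=None):
        # grad: (n_params, 1) column of current gradients of the optimizee
        h, c = self.cell(grad, state)
        update = self.head(h)           # proposed additive update per parameter
        return update, (h, c)

# Two successive (made-up) gradients: the second proposed update depends on
# the state built from the first one, i.e. on the gradient history.
opt = RNNOptimizer()
u1, state = opt(torch.tensor([[0.5], [-1.0]]))
u2, _ = opt(torch.tensor([[0.4], [-0.8]]), state)
```

The point is simply that the update depends on the gradient history, which is exactly the kind of information hand-crafted optimizers encode through momentum terms and running averages.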

Like any good idea, if you look hard enough on the internet, you'll find someone way ahead of you who has already investigated and perfected the concept. In our case, a team from DeepMind proposed an implementation of this idea in 2016, in the paper “Learning to learn by gradient descent by gradient descent”. This paper was pointed out to me by someone who works at... you guessed it... DeepMind. @Cyprien: thanks for that! It describes how you can train a (relatively) simple RNN (recurrent neural network) to act as the optimizer of another problem, be it very simple, like minimizing a quadratic function, or more complex, like training a neural net on the MNIST or CIFAR-10 datasets.
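
As a rough illustration of the paper's central trick (not the actual implementation, which adds gradient preprocessing, update scaling and more, and not the code from these notebooks), the meta-training loop can be sketched on a random quadratic: unroll the optimizee for a few steps using the RNN's proposed updates, sum the optimizee losses along the way, and backpropagate that sum into the RNN's own weights, i.e. train the optimizer by gradient descent. All the sizes and constants below are arbitrary choices of mine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, T = 10, 20                                   # optimizee size, unroll length

# Optimizee: a random quadratic f(theta) = ||W theta - y||^2
W = torch.randn(n, n)
y = torch.randn(n)
def optimizee_loss(theta):
    return ((W @ theta - y) ** 2).sum()

# Learned optimizer: an LSTM cell applied to each coordinate's gradient
cell = nn.LSTMCell(1, 20)
head = nn.Linear(20, 1)
meta_opt = torch.optim.Adam(list(cell.parameters()) + list(head.parameters()), lr=1e-3)

for meta_step in range(100):                    # "gradient descent by gradient descent"
    theta = torch.randn(n, requires_grad=True)  # fresh optimizee parameters
    h, c = torch.zeros(n, 20), torch.zeros(n, 20)
    total_loss = 0.0
    for t in range(T):                          # unroll the learned optimizer
        loss = optimizee_loss(theta)
        total_loss = total_loss + loss
        # create_graph=True keeps this gradient differentiable with respect
        # to the optimizer's own weights
        grad, = torch.autograd.grad(loss, theta, create_graph=True)
        h, c = cell(grad.unsqueeze(1), (h, c))
        theta = theta + head(h).squeeze(1)      # apply the proposed update
    meta_opt.zero_grad()
    total_loss.backward()                       # update the optimizer itself
    meta_opt.step()
```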

The goal of this series of notebooks / articles is to reproduce some of the results from DeepMind’s paper and explore some of the doors it opens.
