SKorablyov/Pranam

Related work

  • http://www.cs.cornell.edu/courses/cs6787/2017fa/Lecture7.pdf
  • https://ai.stanford.edu/~ang/papers/icml11-OptimizationForDeepLearning.pdf (benchmark on ConvNet)
  • https://ganguli-gang.stanford.edu/pdf/14.SaddlePoint.NIPS.pdf
  • http://azadproject.ir/wp-content/uploads/2014/07/2014-iPiano-Inertial-Proximal-Algorithm-for-Nonconvex-Optimization.pdf
  • https://arxiv.org/abs/1510.06096

Thoughts

When my function fits MNIST with a loss of 0 -- that is a completely different situation. I want a likely non-convex function that is heavily underfitting.

  • Monitor saturations (as in figure 6 -- I should print activations)
  • Batch size dependence
  • I am not exactly happy with two nonlinearities -- one on the weight itself, one on the product

  • Add the best network instead of the average (the average was only reasonably good for Schwefel) -- see the sketch below
  • Try softsign instead of tanh (and other nonlinearities)
  • Start with the mean gradient --> as training progresses, everything follows the best gradient
  • Hierarchical training
  • Biases could be needed for an easier fast shift to 0
  • Do it for the convolution case
  • Embed 2 layers instead of 1 (and try convolutions in that case?)
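A minimal NumPy sketch of the "best network instead of the average" idea, assuming replica weights are stored as rows of an (n_replicas, n_params) array and some validation-loss function is available; the names here are hypothetical, not from the repo:

```python
import numpy as np

def combine_replicas(replica_weights, val_loss_fn, mode="best"):
    """Collapse per-replica weight vectors into a single weight vector.

    replica_weights: array of shape (n_replicas, n_params)
    val_loss_fn:     maps one weight vector to a scalar validation loss (hypothetical)
    """
    if mode == "average":
        # Current behaviour: plain mean over replicas.
        return replica_weights.mean(axis=0)
    # Proposed behaviour: keep the single replica with the lowest validation loss.
    losses = np.array([val_loss_fn(w) for w in replica_weights])
    return replica_weights[losses.argmin()]
```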

I could add an online batch norm?

Deep Thoughts

  • I can use functions of a completely different structure for the embedding
  • I could use pairs of gradients and a deeper model (like attention) instead of all gradients
  • Covariance could be useful

Work

step0

cfg12-19: just trying to capture a signal of whether non-convex optimization helps at all

For some reason the Xavier initializer does not work here (even though it is the best initialization strategy for a normal net)

step1

  1. Report (variable stats + losses in TensorBoard) -- see the sketch below
  2. I did rescaling initialization - in order to
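A minimal TensorFlow 2 sketch of the reporting in step 1, assuming a dict of variables and a scalar loss are available at each step; the log directory and the `report` helper are placeholders, and the repo's own code may use the older TF 1 summary API:

```python
import tensorflow as tf

writer = tf.summary.create_file_writer("logs/step1")  # placeholder log directory

def report(step, loss, variables):
    """Log the training loss and per-variable statistics to TensorBoard."""
    with writer.as_default():
        tf.summary.scalar("loss", loss, step=step)
        for name, var in variables.items():
            tf.summary.histogram(name, var, step=step)
            tf.summary.scalar(name + "/mean", tf.reduce_mean(var), step=step)
            tf.summary.scalar(name + "/std", tf.math.reduce_std(var), step=step)
```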

Embedding [1,64,None] is different from [64,64,None] and should not be 12-27

Abstract

PranamOptimizer: re-parametrization as a useful trick for stochastic minimization.


The Pranam optimizer creates as many replicas of the model as there are examples in the batch.
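A minimal NumPy sketch of that replication, with a toy linear-regression loss just to show the shapes; all sizes and names are illustrative, not the repo's actual code:

```python
import numpy as np

batch_size, n_params = 64, 8
rng = np.random.default_rng(0)

# One weight vector per example in the batch, instead of a single shared one.
replica_weights = rng.normal(scale=0.01, size=(batch_size, n_params))

# Toy per-example data and a toy squared-error loss.
x = rng.normal(size=(batch_size, n_params))
y = rng.normal(size=batch_size)

pred = np.einsum("bp,bp->b", replica_weights, x)  # example b is processed by replica b
residual = pred - y
grads = 2.0 * residual[:, None] * x               # one gradient per replica: (batch_size, n_params)
```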

Vanilla gradient descent may not arrive at a good or globally optimal solution when a small polynomial is used to fit the data. In these cases, finding an optimal solution can be incredibly hard, and multiple independent runs of gradient descent, or other sampling techniques such as simulated annealing, could be used.

  1. Learning smaller functions -- similar to the data-generating functions -- is interesting (no overfitting + fast at inference), but unfortunately not convex.
  2. What is optimization? Really a combinatorial problem of finding one of a few possible combinations where the gradient is small everywhere. We can always get all gradients to 0 -- that is not a problem.

We want the space of good solutions to grow and the space of bad solutions to shrink -- how? If we initialize everything with zero gradient, nothing moves and there is no reparametrization effect. We need things to move.

Good solutions fall deeper. Good solutions have a higher average gradient on the weights, measured in the first layer. Good solutions have a higher average gradient on the embedding, and move the embedding more. When the embedding is projected back onto the weights, a good second-layer embedding generates more gradient.

This will only work if the fall is A) reasonably probable and B) deep.

This is a better problem to solve.

The gradient on the embedding is generated by the good solutions, and the bad solutions have no gradient; this is still similar to taking an average. But if I take infinitely many weighted averages, on average they are a weighted average. There should be some softmax effect where the replica with the most gradient can increase its share of the gradient; the square of the gradient gives that.
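A minimal NumPy sketch of that squared-gradient weighting, assuming one gradient vector per replica; the names and shapes are illustrative:

```python
import numpy as np

def weighted_update(grads, eps=1e-12):
    """Combine per-replica gradients, weighting each replica by its squared gradient norm.

    grads: array of shape (n_replicas, n_params).
    Replicas with larger gradients take a larger share of the combined update
    (a softmax-like concentration effect); uniform weights would give the plain mean.
    """
    sq_norms = (grads ** 2).sum(axis=1)            # squared gradient norm per replica
    weights = sq_norms / (sq_norms.sum() + eps)    # shares sum to 1
    return (weights[:, None] * grads).sum(axis=0)  # weighted average over replicas
```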

  1. We could initialize many gradient descents in parallel -- see the sketch below.
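A minimal NumPy sketch of that multi-start idea on the 2-D Ackley function (one of the benchmarks named in the abstract below); the step size, number of starts, and finite-difference gradient are arbitrary choices for illustration:

```python
import numpy as np

def ackley(x):
    """2-D Ackley function: many shallow local minima, global minimum 0 at the origin."""
    a, b, c = 20.0, 0.2, 2 * np.pi
    return (-a * np.exp(-b * np.sqrt(np.mean(x ** 2)))
            - np.exp(np.mean(np.cos(c * x))) + a + np.e)

def num_grad(f, x, h=1e-5):
    """Central-difference gradient, to keep the sketch self-contained."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = h
        g[i] = (f(x + d) - f(x - d)) / (2 * h)
    return g

rng = np.random.default_rng(0)
starts = rng.uniform(-5, 5, size=(32, 2))     # 32 independent starting points

best_x, best_f = None, np.inf
for x in starts:
    for _ in range(500):                      # plain gradient descent from each start
        x = x - 0.01 * num_grad(ackley, x)
    if ackley(x) < best_f:
        best_x, best_f = x, ackley(x)
# best_x is the best of the independent runs; most single runs stay in a local minimum.
```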

For many important problems, such as inference on mobile devices or quantum chemistry, a small amount of computation at inference time is highly desirable. Even when the shape of the data-generating polynomial is approximately known, fitting small functions is hard because gradient descent can easily be trapped in bad local optima. Here we propose, and experimentally confirm, that projecting the optimization problem into a higher-dimensional space helps to avoid shallow local optima and, on average, results in a better resting point found by gradient descent. We implement the re-parametrization trick as follows: instead of processing every example in a batch with the same set of weights, we initialize batch-size-many replicas of the initial model, each with its own set of weights, and compute the gradient for each weight separately. We do not apply the gradients to the weights directly; instead, we initialize a multilayer perceptron which has no input and batch size times number of weights outputs, and update it with the L2 loss of the original gradient from each of the weights. We show that re-parametrization helps each of SGD, Adagrad, RMSProp, and Adam find lower minima when optimizing well-known non-convex functions such as Ackley and Schwefel, a dataset of random synthetic polynomials, and MNIST, in a wide variety of settings. Finally, we provide a Pranam optimizer which combines projection, adaptive learning rate, and momentum, and which can be called like any other optimizer in one line of TensorFlow code.
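A minimal TensorFlow 2 sketch of one possible reading of the mechanism described above: a generator MLP with no input produces one weight vector per replica, each replica processes its own example, and the per-replica weight gradients are backpropagated through the generator instead of being applied to the weights directly. The layer sizes, the toy linear model, and the training loop are assumptions for illustration only; the actual Pranam optimizer in this repo may differ.

```python
import tensorflow as tf

batch_size, n_weights = 32, 16            # illustrative sizes

# Generator MLP with no input: a trainable seed followed by dense layers whose
# output is reshaped into one weight vector per replica.
seed = tf.Variable(tf.random.normal([1, 64]))
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="tanh"),
    tf.keras.layers.Dense(batch_size * n_weights),
])

def toy_loss(weights, x, y):
    """Toy per-replica model: linear regression, example b is handled by replica b."""
    pred = tf.reduce_sum(weights * x, axis=1)      # shape (batch_size,)
    return tf.square(pred - y)

opt = tf.keras.optimizers.Adam(1e-3)
x = tf.random.normal([batch_size, n_weights])
y = tf.random.normal([batch_size])

for step in range(200):
    with tf.GradientTape() as tape:
        weights = tf.reshape(generator(seed), [batch_size, n_weights])
        loss = tf.reduce_mean(toy_loss(weights, x, y))
    # The gradient on each replica's weights flows back through the generator,
    # so only the generator parameters (and the seed) are ever updated.
    variables = generator.trainable_variables + [seed]
    opt.apply_gradients(zip(tape.gradient(loss, variables), variables))
```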
