
Should be able to take hyper-gradients of Adam and other optimizers where reasonable #287

Open
dsyme opened this issue Mar 9, 2021 · 1 comment
dsyme commented Mar 9, 2021

The Adam code is not differentiable everywhere: some hyper-gradients easily become NaN in the Adam optimizer, as described in this comment.

Specifically, if we look at this code:

        if stateStep = 0 then
            stateExpAvg <- model.parameters.map(fun (t:Tensor) -> t.zerosLike())
            stateExpAvgSq <- model.parameters.map(fun (t:Tensor) -> t.zerosLike())
        stateStep <- stateStep + 1
        let expAvg = stateExpAvg.[name].mul(beta1).add(d*(1.-beta1))
        let expAvgSq = stateExpAvgSq.[name].mul(beta2).add(d*d*(1.-beta2))
        stateExpAvg.[name] <- expAvg
        stateExpAvgSq.[name] <- expAvgSq
        let biasCorrection1 = 1. - beta1 ** stateStep
        let biasCorrection2 = 1. - beta2 ** stateStep
        let denom = (expAvgSq.sqrt() / biasCorrection2.sqrt()).add(eps)
        let stepSize = lr / biasCorrection1
        t - stepSize * (expAvg/denom)

then the sqrt operations have a NaN derivative if the primal is zero (see sqrt derivative). So if expAvgSq is zero there is no gradient, which in turn happens if d and stateExpAvgSq.[name] are both zero, i.e. it can happen at the beginning of optimization AFAICS.
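As a minimal illustration (not from the issue), here is a hedged sketch of the failure mode using raw tensor ops; `dsharp.diff` and the constants are assumptions for the sketch, not code from the optimizer:

    // Sketch only: d/dx sqrt(x) at x = 0 divides by zero in the derivative rule,
    // and a 0/0 appears once a zero-valued expAvg-like numerator is involved.
    open DiffSharp

    // derivative of sqrt at zero: 1/(2*sqrt(0)), i.e. infinite
    let d1 = dsharp.diff (fun x -> x.sqrt()) (dsharp.tensor 0.)

    // an Adam-shaped term x / (sqrt(x*x) + eps) with zero state at x = 0:
    // the derivative of sqrt(x*x) at zero is 0/0, so the whole gradient is NaN
    let d2 = dsharp.diff (fun x -> x / ((x * x).sqrt() + 1e-8)) (dsharp.tensor 0.)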

I believe this is similar to what the "epsilon" parameter is for: making sure the denom is always non-zero. So I tried sprinkling a couple of additions of eps in a little earlier, and the hyper-gradients returned:

        if stateStep = 0 then
            stateExpAvg <- model.parameters.map(fun (t:Tensor) -> t.zerosLike().add(eps))  // add eps so always non-zero
            stateExpAvgSq <- model.parameters.map(fun (t:Tensor) -> t.zerosLike().add(eps))  // add eps so always non-zero
        stateStep <- stateStep + 1
        let expAvg = stateExpAvg.[name].mul(beta1).add((d*(1.-beta1)).add(eps))  // add eps so always non-zero
        let expAvgSq = stateExpAvgSq.[name].mul(beta2).add((d*d*(1.-beta2)).add(eps))  // add eps so always non-zero

Likely we only need one or two of these.

dsyme commented Jul 5, 2021

More generally, we need test cases for taking hyper-gradients of each of the optimizers, e.g. in https://github.com/DiffSharp/DiffSharp/blob/dev/tests/DiffSharp.Tests/TestDerivatives.Nested.fs
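For reference, here is a hedged sketch of what such a test might check, written against raw tensor ops rather than the Optim API (the function name, constants, and the single inlined Adam step are assumptions, not the actual test code):

    // Sketch: differentiate the post-step loss w.r.t. the learning rate (a hyper-gradient).
    // With the eps-adjusted update above this should come back finite rather than NaN.
    open DiffSharp

    let hyperGradOfLr () =
        let w0 = dsharp.tensor 1.5                     // initial parameter
        let g = 2. * w0                                // gradient of loss w*w at w0
        let eps = 1e-8
        let beta1, beta2 = 0.9, 0.999
        let lossAfterStep (lr: Tensor) =
            // one Adam step (stateStep = 1) starting from zero state
            let expAvg = (1. - beta1) * g
            let expAvgSq = (1. - beta2) * g * g + eps  // keep sqrt away from zero
            let denom = (expAvgSq / (1. - beta2)).sqrt() + eps
            let w1 = w0 - (lr / (1. - beta1)) * (expAvg / denom)
            w1 * w1                                    // loss after the step
        dsharp.diff lossAfterStep (dsharp.tensor 0.01) // d(loss)/d(lr)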

@dsyme dsyme changed the title Hyper-gradients going NaN in Adam optimizer Should be able to take hyper-gradients of Adam and other optimizers where reasonable Jul 5, 2021