
Should be able to take hyper-gradients of Adam and other optimizers where reasonable #287

Open
dsyme opened this issue Mar 9, 2021 · 1 comment
dsyme commented Mar 9, 2021

The Adam code is not differentiable everywhere: some hyper-gradients easily become NaN in the Adam optimizer, as described in this comment.

Specifically, if we look at this code:

        if stateStep = 0 then
            stateExpAvg <- model.parameters.map(fun (t:Tensor) -> t.zerosLike())
            stateExpAvgSq <- model.parameters.map(fun (t:Tensor) -> t.zerosLike())
        stateStep <- stateStep + 1
        let expAvg = stateExpAvg.[name].mul(beta1).add(d*(1.-beta1))
        let expAvgSq = stateExpAvgSq.[name].mul(beta2).add(d*d*(1.-beta2))
        stateExpAvg.[name] <- expAvg
        stateExpAvgSq.[name] <- expAvgSq
        let biasCorrection1 = 1. - beta1 ** stateStep
        let biasCorrection2 = 1. - beta2 ** stateStep
        let denom = (expAvgSq.sqrt() / biasCorrection2.sqrt()).add(eps)
        let stepSize = lr / biasCorrection1
        t - stepSize * (expAvg/denom)

then the sqrt operations have a NaN derivative if the primal is zero (see sqrt derivative). So if expAvgSq is zero there is no gradient, which in turn happens if d and stateExpAvgSq.[name] are both zero, i.e. it can happen at the beginning of optimization AFAICS.
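As a minimal illustration (not from the issue), here is a hedged sketch of the failure mode using raw tensor ops; `dsharp.diff` and the constants are assumptions for the sketch, not code from the optimizer:

    // Sketch only: d/dx sqrt(x) at x = 0 divides by zero in the derivative rule,
    // and a 0/0 appears once a zero-valued expAvg-like numerator is involved.
    open DiffSharp

    // derivative of sqrt at zero: 1/(2*sqrt(0)), i.e. infinite
    let d1 = dsharp.diff (fun x -> x.sqrt()) (dsharp.tensor 0.)

    // an Adam-shaped term x / (sqrt(x*x) + eps) with zero state at x = 0:
    // the derivative of sqrt(x*x) at zero is 0/0, so the whole gradient is NaN
    let d2 = dsharp.diff (fun x -> x / ((x * x).sqrt() + 1e-8)) (dsharp.tensor 0.)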

I believe this is similar to what the "epsilon" parameter is for: making sure the denom is always non-zero. So I tried sprinkling a couple of additions of eps in a little earlier, and the hyper-gradients returned:

        if stateStep = 0 then
            stateExpAvg <- model.parameters.map(fun (t:Tensor) -> t.zerosLike().add(eps))  // add eps so always non-zero
            stateExpAvgSq <- model.parameters.map(fun (t:Tensor) -> t.zerosLike().add(eps))  // add eps so always non-zero
        stateStep <- stateStep + 1
        let expAvg = stateExpAvg.[name].mul(beta1).add((d*(1.-beta1)).add(eps))  // add eps so always non-zero
        let expAvgSq = stateExpAvgSq.[name].mul(beta2).add((d*d*(1.-beta2)).add(eps))  // add eps so always non-zero

Likely we only need one or two of these.

dsyme commented Jul 5, 2021

More generally, we need test cases for taking hyper-gradients of each of the optimizers, e.g. in https://github.com/DiffSharp/DiffSharp/blob/dev/tests/DiffSharp.Tests/TestDerivatives.Nested.fs
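For reference, here is a hedged sketch of what such a test might check, written against raw tensor ops rather than the Optim API (the function name, constants, and the single inlined Adam step are assumptions, not the actual test code):

    // Sketch: differentiate the post-step loss w.r.t. the learning rate (a hyper-gradient).
    // With the eps-adjusted update above this should come back finite rather than NaN.
    open DiffSharp

    let hyperGradOfLr () =
        let w0 = dsharp.tensor 1.5                     // initial parameter
        let g = 2. * w0                                // gradient of loss w*w at w0
        let eps = 1e-8
        let beta1, beta2 = 0.9, 0.999
        let lossAfterStep (lr: Tensor) =
            // one Adam step (stateStep = 1) starting from zero state
            let expAvg = (1. - beta1) * g
            let expAvgSq = (1. - beta2) * g * g + eps  // keep sqrt away from zero
            let denom = (expAvgSq / (1. - beta2)).sqrt() + eps
            let w1 = w0 - (lr / (1. - beta1)) * (expAvg / denom)
            w1 * w1                                    // loss after the step
        dsharp.diff lossAfterStep (dsharp.tensor 0.01) // d(loss)/d(lr)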

@dsyme dsyme changed the title Hyper-gradients going NaN in Adam optimizer Should be able to take hyper-gradients of Adam and other optimizers where reasonable Jul 5, 2021