
NaNs #61

Closed
thegodone opened this issue May 20, 2021 · 4 comments

Comments

@thegodone

I observed that RAdam can produce a NaN loss in the first epochs of training while Adam does not. This is not limited to one or two experiments but is a general observation. I wonder if we could merge the AdaBound clamp into RAdam to avoid this type of issue at the very beginning of training?
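For reference, the clamp I have in mind is something like the following (a rough sketch following the AdaBound paper, Luo et al. 2019; final_lr and gamma are AdaBound's hyperparameters, not anything in this repo):

import numpy as np

def adabound_clamp(step_size, k, final_lr=0.1, gamma=1e-3):
    # Bound the step size between limits that both converge to
    # final_lr as the iteration count k grows; early on, the upper
    # bound suppresses the extreme steps that can produce NaNs.
    lower = final_lr * (1 - 1 / (gamma * k + 1))
    upper = final_lr * (1 + 1 / (gamma * k))
    return np.clip(step_size, lower, upper)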

@LiyuanLucasLiu
Owner

Thanks for reaching out. I haven't observed this and I'm wondering whether you can provide a simple setup to reproduce this phenomenon.

BTW, there is a known issue that can be fixed by setting degenerated_to_sgd=False (more discussion can be found at #54)
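For example, with the radam.py in this repo (model here stands in for whatever network you are training):

from radam import RAdam

# disable the SGD-style fallback that is used before the variance
# rectification term is defined; see the discussion in #54
optimizer = RAdam(model.parameters(), lr=1e-3, degenerated_to_sgd=False)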

@brandondube

I have run into the same issue while implementing RAdam. Here's a pure Python (NumPy) implementation:

import numpy as np


class RADAM:
    def __init__(self, fg, x0, alpha, beta1=0.9, beta2=0.999):
        """Create a new RADAM optimizer.

        Parameters
        ----------
        fg : callable
            a function which returns (f, g) where f is the scalar cost, and
            g is the vector gradient.
        x0 : numpy.ndarray
            the parameter vector immediately prior to optimization
        alpha : float
            the step size
        beta1 : float
            the decay rate of the first moment (mean of gradient)
        beta2 : float
            the decay rate of the second moment (uncentered variance)

        """
        self.fg = fg
        self.x0 = x0
        self.alpha = alpha
        self.beta1 = beta1
        self.beta2 = beta2
        self.x = x0.copy()
        self.m = np.zeros_like(x0)
        self.v = np.zeros_like(x0)
        self.eps = np.finfo(x0.dtype).eps
        self.rhoinf = 2 / (1-beta2) - 1
        self.iter = 0

    def step(self):
        """Perform one iteration of optimization."""
        self.iter += 1
        k = self.iter
        beta1 = self.beta1
        beta2 = self.beta2
        beta2k = beta2**k

        f, g = self.fg(self.x)
        # update momentum estimates
        self.m = beta1*self.m + (1-beta1) * g
        self.v = beta2*self.v + (1-beta2) * (g*g)
        # torch exp_avg_sq.mul_(beta2).addcmul_(grad,grad,value=1-beta2)
        # == v

        mhat = self.m / (1 - beta1**k)

        # going to use this many times, local lookup is cheaper
        rhoinf = self.rhoinf
        rho = rhoinf - (2*k*beta2k)/(1-beta2k)
        x = self.x
        if rho >= 5:  # 5 was 4 in the paper, but PyTorch uses 5, most others too
            # l = np.sqrt((1-beta2k)/self.v)  # NOQA
            # commented out l exactly as in paper
            # seems to blow up all the time, must be a typo; missing sqrt(v)
            # torch computes vhat same as ADAM, assume that's the typo
            l = np.sqrt(1 - beta2k) / (np.sqrt(self.v)+self.eps)  # NOQA
            num = (rho - 4) * (rho - 2) * rhoinf
            den = (rhoinf - 4) * (rhoinf - 2) * rho
            r = np.sqrt(num/den)
            self.x = x - self.alpha * r * mhat * l
        else:
            self.x = x - self.alpha * mhat
        return x, f, g

def runN(optimizer, N):
    for _ in range(N):
        yield optimizer.step()

A minimal working example that blows up:

import numpy as np
from scipy.optimize import rosen, rosen_der

def fg(x):
    f = rosen(x)
    g = rosen_der(x)
    return f, g

x0 = np.array([-2.0, 2.0])

opt = RADAM(fg, x0, 1e-2)
hist = []
xh = []
for xk, fk, gk in runN(opt, 1000):
    hist.append(float(fk))
    xh.append(xk.copy())

I do not observe this behavior with vanilla Adam, Yogi, Adagrad, RMSprop, or other optimizers. Any thoughts? @LiyuanLucasLiu

@LiyuanLucasLiu
Owner

@brandondube thanks for providing the example.

I believe this is a known issue and can be fixed by setting degenerated_to_sgd=False (in your case, you can simply delete the else: self.x = x - self.alpha * mhat part).

More discussion can be found at #54 (comment).
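Concretely, for the NumPy class above that amounts to taking no step at all while rho < 5 (a sketch; the subclass name is made up):

class RADAMNoFallback(RADAM):
    # hypothetical subclass: identical to RADAM.step except the else
    # branch is gone, so no parameter update is taken while rho < 5
    def step(self):
        self.iter += 1
        k = self.iter
        beta2k = self.beta2**k

        f, g = self.fg(self.x)
        self.m = self.beta1*self.m + (1-self.beta1) * g
        self.v = self.beta2*self.v + (1-self.beta2) * (g*g)
        mhat = self.m / (1 - self.beta1**k)

        rhoinf = self.rhoinf
        rho = rhoinf - (2*k*beta2k)/(1-beta2k)
        x = self.x
        if rho >= 5:
            l = np.sqrt(1 - beta2k) / (np.sqrt(self.v)+self.eps)  # NOQA
            num = (rho - 4) * (rho - 2) * rhoinf
            den = (rhoinf - 4) * (rhoinf - 2) * rho
            r = np.sqrt(num/den)
            self.x = x - self.alpha * r * mhat * l
        # when rho < 5 the moment estimates warm up but x is unchanged
        return x, f, g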

@brandondube

Thanks, that was it. I made a different choice, detuning g by its norm. This increases the range of stable learning rates, although not all that much.

        else:
            # fall back to a gradient step scaled by the inverse
            # gradient norm, so the step length is bounded by alpha
            gsq = g * g
            invgnorm = 1 / np.sqrt(gsq.sum())
            self.x = x - self.alpha * invgnorm * g
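As a quick sanity check, re-running the Rosenbrock MWE from above with this fallback in place should keep the cost history finite:

opt = RADAM(fg, np.array([-2.0, 2.0]), 1e-2)
fs = [fk for _, fk, _ in runN(opt, 1000)]
assert np.all(np.isfinite(fs))  # no NaNs with the norm-scaled step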
