
Dual Averaging (with default setting suggested by NUTS paper) doesn't work well in transformed space #182

Closed
xukai92 opened this issue Apr 19, 2017 · 10 comments

Comments

@xukai92
Member

xukai92 commented Apr 19, 2017

The default setting for HMC with the Dual Averaging method does not seem to work well with constrained variables. 49b90e9 adds a file, test/hmcda_nb.jl, to reproduce the situation.

A simple way to see the situation is to change Line 110 of samplers/hmcda.jl to a normal print call, println("leapfrog for $τ steps with step size $ϵ"), and run the test with julia test/hmcda_nb.jl. This prints the step size before each leapfrog loop, i.e. the one in use right before the error. It can be seen that the first adapted step size is always around 2.7, which is controlled by the initial Dual Averaging parameters and the target accept rate of 0.95 (which is already high, so it should already be pushing the step size down).

@yebai
Member

yebai commented Apr 20, 2017

@xukai92 Is it possible to isolate the numerical issue for the transformed variables first? This would allow us to fix the transform itself first (i.e. the simplex transform here). I imagine this can be done by finding a value r2 such that

r = link(x, Dirichlet(...))  # apply transform

# perform some changes to `r`, e.g. simulating Hamiltonian dynamics on `r`
r2 = ...

x2 = invlink(r2, Dirichlet(...)) # some elements of `x2` are `NaN`
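
A concrete version of this check might look like the sketch below. It follows the invlink(d, y) argument order quoted later in this thread, assumes Turing.link mirrors that signature, and the shift applied to r is only an illustrative stand-in for simulating the dynamics:

using Turing, Distributions

d = Dirichlet([1/3, 1/3, 1/3])
x = [0.2, 0.3, 0.5]

r = Turing.link(d, x)          # map the simplex point into unconstrained space
r2 = r .- 1000.0               # illustrative extreme move, standing in for leapfrog steps
x2 = Turing.invlink(d, r2)     # map back to the simplex

any(isnan, x2), any(x2 .== 0)  # did the round trip break the probability vector?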

@xukai92
Member Author

xukai92 commented Apr 20, 2017

Yes. I guess if we solve the numerical problem here, then we are fine for the moment.

I looked further into the issue and found that the numerical problem only happens in the dual part of the dual numbers.

Below is the invlink for simplex variables.

function invlink(d::SimplexDistribution, y::Vector)
  K = length(y)
  T = typeof(y[1])
  # Stick-breaking: map each unconstrained coordinate back to a fraction in (0, 1).
  z = [invlogit(y[k] + log(1 / (K - k))) for k in 1:K-1]
  x = Vector{T}(K)
  # x[k] takes fraction z[k] of the stick left over after x[1:k-1].
  for k in 1:K-1
    x[k] = (1 - sum(x[1:k-1])) * z[k]
  end
  x[K] = 1 - sum(x[1:K-1])
  x
end

My guess is that invlogit(y[k] + log(1 / (K - k))) is not numerically safe and loses precision; when the dual parts are calculated, the chain rule of derivatives may make the situation even worse. I'm looking into StatsFuns' implementation of some functions to see if I can use them to improve the code here.
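
For example, a quick check of where the precision goes (the invlogit below is the same definition quoted further down in this thread):

# Turing's invlogit, as quoted later in this thread.
invlogit(x) = 1.0 ./ (exp(-x) + 1.0)

invlogit(40.0)         # 1.0  -- saturates to exactly 1.0, so 1 - invlogit(40.0) == 0.0
invlogit(-800.0)       # 0.0  -- underflows to exactly 0.0
log(invlogit(-800.0))  # -Inf, although the true value is roughly -800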

@xukai92
Member Author

xukai92 commented Apr 20, 2017

I have further located the problem: it should originate from gradient(). It seems that gradient() returns NaN when the numbers become extreme, and these NaNs are later used to update the variables.

@xukai92
Member Author

xukai92 commented Apr 20, 2017

LATEST

The original cause of this issue is that the constrained variable loses some precision between link and invlink, which can turn the probability vector into something like [1, 0, 0, 0]. That is actually invalid for Dirichlet, because there shouldn't be any 0s in the probability vector of a Dirichlet. When we call logpdf with this vector we get NaN (in both the real part and the dual part). Our AD then uses these NaNs in the leapfrog steps, which causes every dimension to become NaN.
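
To illustrate the last step: a single NaN in the gradient is enough to contaminate the whole state after a leapfrog update or two. This is only a minimal sketch of the update rule, not Turing's actual leapfrog code:

θ = [0.1, 0.2, 0.3]
p = randn(3)              # momentum
∇ = [NaN, 0.5, -0.2]      # gradient with a single NaN entry
ϵ = 0.1                   # step size

p = p .+ (ϵ / 2) .* ∇     # half-step on momentum: p[1] is now NaN
θ = θ .+ ϵ .* p           # full step on position: θ[1] is now NaN
p = p .+ (ϵ / 2) .* ∇
# From here on, the log-joint and its gradient are evaluated at a NaN position,
# so every dimension ends up NaN.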

@yebai
Member

yebai commented Apr 20, 2017

using Turing, Distributions
using StatsFuns  # for logsumexp below

# Our version of invlogit
invlogit(x) = 1.0 ./ (exp(-x) + 1.0)

# log invlogit
loginvlogit_unstable(x) = log(invlogit(x))

# Numerically stable version of log invlogit
#  See e.g. https://lingpipe-blog.com/2012/02/16/howprevent-overflow-underflow-logistic-regression/
loginvlogit(x) = -logsumexp([0, -x])

# julia> loginvlogit_unstable(-750)
# -Inf

# julia> loginvlogit(-750)
# -750.0

function loginvlink(d::Dirichlet, y::Vector)
  K = length(y)
  T = typeof(y[1])
  # Same stick-breaking construction as invlink above, but carried out in log space.
  z = [loginvlogit(y[k] - log(K - k)) for k in 1:K-1]
  x = Vector{T}(K)
  for k in 1:K-1
    x[k] = log(-expm1(logsumexp(x[1:k-1]))) + z[k]
  end
  x[K] = log(-expm1(logsumexp(x[1:K-1])))
  x
end

# loginvlink(Dirichlet([1/3, 1/3, 1/3]), [-1000.0, -1000, -1000])
# 3-element Array{Float64,1}:
# -1000.69
# -1000.0 
#    0.0 

# julia> log(Turing.invlink(Dirichlet([1/3, 1/3, 1/3]), [-1000.0, -1000, -1000]))
# 3-element Array{Float64,1}:
# -Inf  
# -Inf  
#   0.0

@xukai92
Member Author

xukai92 commented Apr 20, 2017

Looks great! But we still need to exp the result and pass it to the logpdf of Dirichlet, and that result still ends up at -Inf there, right?

@xukai92 xukai92 mentioned this issue Apr 20, 2017
@xukai92
Member Author

xukai92 commented Apr 20, 2017

NOTE

The logpdf of the Dirichlet distribution from the Distributions package:

function _logpdf{T<:Real}(d::Dirichlet, x::AbstractVector{T})
    a = d.alpha
    s = 0.
    for i in 1:length(a)
        # When a[i] == 1 and x[i] == 0 this term is 0 * -Inf, i.e. NaN.
        @inbounds s += (a[i] - 1.0) * log(x[i])
    end
    return s - d.lmnB
end

@yebai
Member

yebai commented Apr 21, 2017

It's OK to take the exp after loginvlink: the returned value will be 0, since -Inf < x <= 0 here. Similarly, the gradient through loginvlink will be 0 too. This prevents the underflow issue when taking gradients.

We might also need a special version of the logpdf function in order to prevent the underflow/overflow issue. We can discuss this when we meet.
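
One possible shape for such a special version, sketched here as a hypothetical helper that consumes the log-probabilities from loginvlink directly (the field names alpha and lmnB are the ones used in the Distributions source quoted above):

# Hypothetical helper: a Dirichlet logpdf that takes log-probabilities directly,
# so we never exp a log-space value only to take its log again inside logpdf.
function logpdf_from_logx(d::Dirichlet, logx::AbstractVector)
    a = d.alpha
    s = 0.0
    for i in 1:length(a)
        # With alpha[i] == 1 this term is 0 * logx[i] == 0 for any finite logx[i],
        # instead of 0 * log(0) == 0 * -Inf == NaN.
        s += (a[i] - 1.0) * logx[i]
    end
    return s - d.lmnB
end

# e.g. logpdf_from_logx(Dirichlet([1/3, 1/3, 1/3]), loginvlink(Dirichlet([1/3, 1/3, 1/3]), y))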

@xukai92
Member Author

xukai92 commented Apr 21, 2017

I see. I guess another reason we get the NaN here is that we use Dirichlet([1, 1, 1, 1]) in the Naive Bayes model.

If there is a 0 in the probability vector, then on the 5th line of the logpdf of Dirichlet, i.e.

s += (a[i] - 1.0) * log(x[i])

we will have (1 - 1.0) * log(exp(-1000)) => 0 * -Inf => NaN.
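
A quick check of both the value and its gradient (using ForwardDiff for the dual-number part; the tiny term function is only for illustration):

using ForwardDiff

(1 - 1.0) * log(exp(-1000.0))      # NaN: exp underflows to 0, log(0) == -Inf, 0 * -Inf == NaN

# The same thing happens in the dual part: an alpha == 1 term of the Dirichlet
# logpdf, differentiated at x == 0.
term(x) = (1.0 - 1.0) * log(x[1])
ForwardDiff.gradient(term, [0.0])  # [NaN]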

@yebai
Member

yebai commented Apr 28, 2017

Closed by PR #199

@yebai yebai closed this as completed Apr 28, 2017