Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix (log)(c)cdf with Inf, -Inf and NaN #1348

Merged
merged 5 commits into from
Jul 2, 2021
Merged

Fix (log)(c)cdf with Inf, -Inf and NaN #1348

merged 5 commits into from
Jul 2, 2021

Conversation

devmotion
Copy link
Member

@devmotion devmotion commented Jun 16, 2021

This PR simplifies the (log)(c)cdf implementations and fixes the evaluation with -Inf, Inf, and NaN.

There are some problems with the current implementation:

  • It does not handle NaN consistently: e.g.
    julia> cdf(Normal(), NaN)
    NaN
    
    julia> cdf(Exponential(), NaN)
    0.0
    
    julia> cdf(TriangularDist(0, 1), NaN)
    1.0
  • It defines cdf(d::DiscreteUnivariateDistribution, x::Real) = cdf(d, floor(Int, x)) etc and a default implementation for cdf(::DiscreteUnivariateDistribution, x::Integer). This is problematic since
    • one has to define both cdf(d, x::Real) and cdf(d, x::Integer) for e.g. discrete LocationScale, Truncated and MixtureModel and DiscreteNonParametric to avoid method ambiguity errors
    • not all discrete distributions have integer-valued support but still a cdf for integers is defined that might not make sense or in the worst case silently produce incorrect results, e.g.
      julia> cdf(Dirac(3.4), 3.5)
      1.0
      
      julia> logcdf(Dirac(3.4), 3.5)
      -Inf
      (since logcdf(Dirac(3.4), 3.5) is forwarded to logcdf(Dirac(3.4), 3) which then calls log(cdf(Dirac(3.4, 3))) = log(0) = -Inf). These bugs are difficult to discover and to avoid when implementing a discrete distribution (I fixed some of these for DiscreteNonParametric in Generalize LocationScale to discrete distributions #1286).
    • it is impossible to evaluate cdf(d, -Inf), cdf(d, Inf), or `cdf(d, NaN) with these definitions: e.g.
      julia> cdf(Poisson(), -Inf)
      ERROR: InexactError: trunc(Int64, -Inf)
      ...
      
      julia> cdf(Poisson(), Inf)
      ERROR: InexactError: trunc(Int64, Inf)
      ...
      
      julia> cdf(Poisson(), NaN)
      ERROR: InexactError: trunc(Int64, NaN)
      ...
      For Poisson this can be fixed by generalizing the StatsFuns macro to inputs of type Real but this is still an issue for other native implementations such as Categorical for which the same errors are thrown. The evaluation of cdf(d, -Inf) or cdf(d, Inf) is useful e.g. in the construction of truncated distributions - it allows to remove the current type heuristics (not included in this PR).

The PR fixes these problems by

  • only defining cdf(d::DiscreteUnivariateDistribution, x::Real) but not cdf(::DiscreteUnivariateDistribution, x::Integer), similar to [Breaking] Fix inconsistent fallback behaviour of logpdf and pdf #1171
    • the default calls cdf_int(d, x) which assumes integer-valued support but not inputs of type Int
    • cdf_int(d, x) handles Inf, -Inf, and NaN and calls cdf(d, floor(Int, x)) for other values (not implemented!)
    • for distributions with integer-valued support one has to implement cdf(d, ::Int) and gets support for real values including Inf, -Inf, and NaN for free
    • for distributions with non-integer-valued support one has to implement cdf(d, ::Real) but not cdf(d, ::Int) (and there exists no incorrect default implementation)
    • there are no silent incorrect results since an error is thrown if one does not implement cdf(d, ::Int) explicitly
    • one can call integerunitrange_cdf etc. instead of implementing cdf(d, ::Int) from scratch if the distribution has a unitrange of integers as support (currently the default implementation of cdf(d, ::Int))
    • this also removes method ambiguity issues and the implementation of Truncated, LocationScale and MixtureModel can be simplified.
  • generalizing the StatsFuns macro and also defining cdf(d, x::Real) etc. instead of cdf(d, x::Int) for discrete distributions
  • updating implementations of (log)(c)cdf of many univariate distributions
    • this includes many continuous distributions
    • removes multiple type heuristics

@devmotion
Copy link
Member Author

Note: The simplified implementation for Truncated requires JuliaStats/LogExpFunctions.jl#21, thus tests will fail until the LogExpFunctions PR is released.

@mschauer
Copy link
Member

It's there something to look out for in the review?

@devmotion
Copy link
Member Author

devmotion commented Jun 17, 2021

Hmm I'm not sure. Hopefully I could convey the main ideas in the comment above and then I guess it depends on which points you want to examine or discuss more carefully. I assume that src/univariates.jl should show the main structure. The changes in src/truncate.jl, src/univariate/locationscale.jl and src/mixtures/... show how it simplifies (and fixes) the existing implementation. And finally the changes in src/univariate/... fix and simplify different (log)(c)cdf implementations. I extended the tests and (log)(c)cdf are tested for all modified distributions (I only changed these distributions since tests failed), so I am quite confident that the changes are correct. But it's always better if someone else checks it as well 🙂

@codecov-commenter
Copy link

codecov-commenter commented Jul 1, 2021

Codecov Report

Merging #1348 (fa4544a) into master (a0cb096) will increase coverage by 0.50%.
The diff coverage is 97.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1348      +/-   ##
==========================================
+ Coverage   82.23%   82.73%   +0.50%     
==========================================
  Files         116      116              
  Lines        6635     6672      +37     
==========================================
+ Hits         5456     5520      +64     
+ Misses       1179     1152      -27     
Impacted Files Coverage Δ
src/univariate/discrete/binomial.jl 92.77% <ø> (-0.64%) ⬇️
src/univariate/discrete/hypergeometric.jl 81.48% <ø> (+7.19%) ⬆️
src/univariate/discrete/poisson.jl 74.07% <ø> (-3.35%) ⬇️
...rc/univariate/discrete/noncentralhypergeometric.jl 91.15% <40.00%> (-1.21%) ⬇️
src/mixtures/mixturemodel.jl 68.96% <66.66%> (+0.84%) ⬆️
src/univariate/discrete/betabinomial.jl 86.53% <66.66%> (-1.22%) ⬇️
src/univariate/continuous/generalizedpareto.jl 78.20% <77.77%> (-1.55%) ⬇️
src/univariates.jl 73.93% <98.90%> (+12.21%) ⬆️
src/truncate.jl 87.01% <100.00%> (+13.14%) ⬆️
src/univariate/continuous/betaprime.jl 92.50% <100.00%> (-0.36%) ⬇️
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a0cb096...fa4544a. Read the comment docs.

@devmotion
Copy link
Member Author

The LogExpFunctions PR was merged and released, so tests pass now.

src/univariates.jl Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants