Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

better NaN handling with ecdf #537

Merged
merged 2 commits into from
Dec 3, 2019
Merged

Conversation

joshday
Copy link
Contributor

@joshday joshday commented Nov 25, 2019

Ref: #413

New behaviors:

julia> ecdf([1,2,3])(NaN)
NaN

julia> ecdf([1,2,NaN])
ERROR: ArgumentError: ecdf can not include NaN values
Stacktrace:
 [1] #ecdf#113(::Weights{Float64,Float64,Array{Float64,1}}, ::typeof(ecdf), ::Array{Float64,1}) at /Users/joshday/.julia/dev/StatsBase/src/empirical.jl:57
 [2] ecdf(::Array{Float64,1}) at /Users/joshday/.julia/dev/StatsBase/src/empirical.jl:57
 [3] top-level scope at REPL[11]:1

Side note: I didn't actually touch the Project.toml file...did Pkg make that change for me?

@@ -9,6 +8,7 @@ struct ECDF{T <: AbstractVector{<:Real}, W <: AbstractWeights{<:Real}}
end

function (ecdf::ECDF)(x::Real)
isnan(x) && return NaN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should use the element type of ECDF to to ensure type stability of e.g. Float32s and Dual so probably something like T(NaN) for ECDF{T}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, there are already other type instabilities present:

weightsum = evenweights ? length(ecdf.sorted_values) : sum(ecdf.weights)

I'm fine fixing these here or in a different PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, fixes were easy. I'll push in a sec

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also add tests, including @inferred?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind again, way harder to handle NaN in a type stable way than I thought! I'm gonna punt on this and ask this to be merged as-is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The weightsum type instability is not visible outside of the function for Float32 since partialsum/weightsum will return a Float64. So NaN is more correct than oftype(T, NaN). Not sure about Dual, would need to try that.

But please add new tests.

@codecov
Copy link

codecov bot commented Nov 25, 2019

Codecov Report

Merging #537 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #537      +/-   ##
==========================================
+ Coverage   90.43%   90.44%   +<.01%     
==========================================
  Files          21       21              
  Lines        2101     2103       +2     
==========================================
+ Hits         1900     1902       +2     
  Misses        201      201
Impacted Files Coverage Δ
src/empirical.jl 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 752bdf8...42e76e9. Read the comment docs.

@nalimilan
Copy link
Member

Sorry for saying this only now, but it would probably be better to only print a deprecation for now instead of throwing an error. That will possibly leave some time for users to adapt, and it won't require us to tag a new breaking release just for this. Breaking releases are painful as they require all dependent packages to update their upper bound, or users will keep using old StatsBase versions.

@joshday
Copy link
Contributor Author

joshday commented Nov 26, 2019

Sure, every bug fix is technically breaking, but that's being awfully strict? This is simply fixing incorrect results:

julia> ecdf([1,NaN])(1)
0.5

@nalimilan
Copy link
Member

Ah, right, I thought the current behavior just skipped NaN, but if it returns invalid results then turning that into an error is a strict improvement.

@andreasnoack
Copy link
Member

The test failure on 32 bit seems to be a time out because of deprecations

@andreasnoack andreasnoack merged commit 93831cc into JuliaStats:master Dec 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants