better NaN handling with ecdf #537

joshday · 2019-11-25T14:08:28Z

New behaviors:

julia> ecdf([1,2,3])(NaN)
NaN

julia> ecdf([1,2,NaN])
ERROR: ArgumentError: ecdf can not include NaN values
Stacktrace:
 [1] #ecdf#113(::Weights{Float64,Float64,Array{Float64,1}}, ::typeof(ecdf), ::Array{Float64,1}) at /Users/joshday/.julia/dev/StatsBase/src/empirical.jl:57
 [2] ecdf(::Array{Float64,1}) at /Users/joshday/.julia/dev/StatsBase/src/empirical.jl:57
 [3] top-level scope at REPL[11]:1

Side note: I didn't actually touch the Project.toml file...did Pkg make that change for me?

andreasnoack · 2019-11-25T14:30:04Z

src/empirical.jl

@@ -9,6 +8,7 @@ struct ECDF{T <: AbstractVector{<:Real}, W <: AbstractWeights{<:Real}}
 end

 function (ecdf::ECDF)(x::Real)
+    isnan(x) && return NaN


This should use the element type of ECDF to to ensure type stability of e.g. Float32s and Dual so probably something like T(NaN) for ECDF{T}.

Hmm, there are already other type instabilities present:

weightsum = evenweights ? length(ecdf.sorted_values) : sum(ecdf.weights)

I'm fine fixing these here or in a different PR

Never mind, fixes were easy. I'll push in a sec

Can you also add tests, including @inferred?

Never mind again, way harder to handle NaN in a type stable way than I thought! I'm gonna punt on this and ask this to be merged as-is.

The weightsum type instability is not visible outside of the function for Float32 since partialsum/weightsum will return a Float64. So NaN is more correct than oftype(T, NaN). Not sure about Dual, would need to try that.

But please add new tests.

codecov · 2019-11-25T16:16:44Z

Codecov Report

Merging #537 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #537      +/-   ##
==========================================
+ Coverage   90.43%   90.44%   +<.01%     
==========================================
  Files          21       21              
  Lines        2101     2103       +2     
==========================================
+ Hits         1900     1902       +2     
  Misses        201      201

Impacted Files	Coverage Δ
src/empirical.jl	`100% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 752bdf8...42e76e9. Read the comment docs.

nalimilan · 2019-11-26T12:52:22Z

Sorry for saying this only now, but it would probably be better to only print a deprecation for now instead of throwing an error. That will possibly leave some time for users to adapt, and it won't require us to tag a new breaking release just for this. Breaking releases are painful as they require all dependent packages to update their upper bound, or users will keep using old StatsBase versions.

joshday · 2019-11-26T13:24:14Z

Sure, every bug fix is technically breaking, but that's being awfully strict? This is simply fixing incorrect results:

julia> ecdf([1,NaN])(1)
0.5

nalimilan · 2019-11-26T13:38:18Z

Ah, right, I thought the current behavior just skipped NaN, but if it returns invalid results then turning that into an error is a strict improvement.

andreasnoack · 2019-12-03T10:30:04Z

The test failure on 32 bit seems to be a time out because of deprecations

better NaN handling with ecdf

2d9fc42

andreasnoack reviewed Nov 25, 2019

View reviewed changes

ecdf tests

42e76e9

nalimilan approved these changes Nov 26, 2019

View reviewed changes

andreasnoack merged commit 93831cc into JuliaStats:master Dec 3, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

better NaN handling with ecdf #537

better NaN handling with ecdf #537

joshday commented Nov 25, 2019

andreasnoack Nov 25, 2019

joshday Nov 25, 2019

joshday Nov 25, 2019

nalimilan Nov 25, 2019

joshday Nov 25, 2019

nalimilan Nov 25, 2019

codecov bot commented Nov 25, 2019 •

edited

nalimilan commented Nov 26, 2019

joshday commented Nov 26, 2019

nalimilan commented Nov 26, 2019

andreasnoack commented Dec 3, 2019

better NaN handling with ecdf #537

better NaN handling with ecdf #537

Conversation

joshday commented Nov 25, 2019

andreasnoack Nov 25, 2019

Choose a reason for hiding this comment

joshday Nov 25, 2019

Choose a reason for hiding this comment

joshday Nov 25, 2019

Choose a reason for hiding this comment

nalimilan Nov 25, 2019

Choose a reason for hiding this comment

joshday Nov 25, 2019

Choose a reason for hiding this comment

nalimilan Nov 25, 2019

Choose a reason for hiding this comment

codecov bot commented Nov 25, 2019 • edited

Codecov Report

nalimilan commented Nov 26, 2019

joshday commented Nov 26, 2019

nalimilan commented Nov 26, 2019

andreasnoack commented Dec 3, 2019

codecov bot commented Nov 25, 2019 •

edited