-
Notifications
You must be signed in to change notification settings - Fork 88
Fix p-value estimation for AD tests #118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@diegozea @andreasnoack @ararslan Ready for review. |
nalimilan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really can't comment on the math, but here are some remarks.
src/anderson_darling.jl
Outdated
| μ = mean(y) | ||
| σ = std(y) | ||
| A² = convert(typeof(μ), -n) | ||
| A² = convert(eltype(x), -n) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't going to be type stable for e.g. integer inputs AFAICT.
src/anderson_darling.jl
Outdated
| r = y[end]-m | ||
| broadcast!(x->(x-m)/r, y, y) | ||
| y[1] += eps() # to avoid log(0.0) | ||
| y[end] -= eps() # to avoid log(0.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That comment could be made more explicit. Written that way, it sounds like y[end] could be 0.0, but that doesn't make sense since log(-eps()) wouldn't be correct. So where would log(0.0) happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required to avoid -Inf when calculated logccdf(d, 1.0) or logcdf(d, 0.0) in the test.
src/anderson_darling.jl
Outdated
| function OneSampleADTest(x::AbstractVector{T}, d::UnivariateDistribution) where T<:Real | ||
| OneSampleADTest(adstats(x, d)...) | ||
| n = length(x) | ||
| μ, σ = StatsBase.mean_and_std(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for StatsBase. AFAICT. Same for zscore! below.
src/anderson_darling.jl
Outdated
| n::Int # number of observations | ||
| σ::Float64 # variance A²k | ||
| A²k::Float64 # Anderson-Darling test statistic | ||
| modified::Bool # Modified test statistic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather "whether the modified statistic is used"? BTW, no initial uppercase here and below for consistency.
src/anderson_darling.jl
Outdated
|
|
||
| """ | ||
| KSampleADTest(xs::AbstractVector{<:Real}...; modified = true) | ||
| KSampleADTest(xs::AbstractVector{<:Real}...; modified = true, nsim=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spaces around =.
src/anderson_darling.jl
Outdated
| # the Anderson-Darling test statistics (2*10^6) with sample from standard | ||
| # Normal distribution with size n=500 for each sample. | ||
| # | ||
| # After standardization of AD statistics, p-quantiles were estimate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"estimated"
src/anderson_darling.jl
Outdated
| logP = log.((1 .- PV) ./ PV ) | ||
|
|
||
| # locate curve area for extrapolation | ||
| _, j = findmin(abs.(Tm .- Tk)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could avoid an allocation by using a generator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you be more specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make two passes over the array: one for broadcasting and one for finding the minimum. Generators are lazy, so iteration only happens on-demand, in this case when finding the minimum. This would then be
findmin(abs(tm - tk) for (tm, tk) in zip(Tm, Tk))or something along those lines.
test/anderson_darling.jl
Outdated
| x = rand(Cauchy(), n) | ||
| t = OneSampleADTest(x, Normal()) | ||
| @test pvalue(t) < 1e-100 | ||
| @test isapprox(pvalue(t), 0.0, atol=0.1^4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be more natural as 1e-4? Also, recommended syntax is @test x ≈ y atol=z.
|
|
||
| n = map(length, samples) | ||
| pooled = vcat(samples...) | ||
| Z = sort(pooled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use sort!?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For simulated p-value calculation, an original sample is required for constructing random permutations.
|
|
||
| """Asymptotic evaluation of the p-value for Anderson-Darling test""" | ||
| function pvalueasym(x::KSampleADTest) | ||
| # Computational Details: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this based on the paper cited above? How sensitive is it to the sample size (500 doesn't sound that large)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar MC simulation parameters were used in kSample R package written by original paper author, Scholz, - 500 sample size, 2M samples.
nalimilan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be good to go after this round.
src/anderson_darling.jl
Outdated
| lcdfz = logcdf(d, zi) | ||
| lccdfz = logccdf(d, zni1) | ||
| A² -= (2*i - 1)/n * (lcdfz + lccdfz) | ||
| lcdfz = logcdf(d, x[i])::Float64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for type assertions. The formula used to compute A² should give a Float64 anyway due to (i+i-1)/n. Also assertions will fail if the type doesn't match, they don't perform the conversion automatically, so that's a lose-lose solution.
src/anderson_darling.jl
Outdated
| `modified` parameter enables a modified test calculation for samples whose observations | ||
| do not all coincide. | ||
| `nsim` parameter if it equals to 0 then the asymptotic calculation of p-value is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not correct grammar. Something like "If nsim is equal to 0 (the default), the asymptotic... If it is greater than 0, an estimation... is used. This proportion is...".
| -1.5849548989018964 -1.5407036306002324 -1.5217650056531025 -1.468313040457585 -1.4383460114029611 -1.3510451980424227 -1.2994715185279744 -1.2107011085917234 -1.1182349910369536 -1.0480894352574122 -0.9888467602015968 -0.792282309179151 -0.6158884225411821 -0.4357973228266815 -0.24055211468641896 -0.013263022467496355 0.2679318847085987 0.6545019598726984 1.3035350042516591 1.5717238802812676 1.9472543460769594 2.5917105532062927 3.4417697301287804 3.708500478251771 4.08514071675675 4.729002778898586 5.553060449998679 5.8246988123153525 6.2044102945806 6.8092648417824355 7.564819358105057 7.8751443348359444 8.265612297736865 9.09572889834988 9.634634570153578; | ||
| -1.8179842473942176 -1.7701061384541197 -1.7430115382228353 -1.6664153177996697 -1.6277529354269717 -1.509882048484231 -1.4423296156847323 -1.327021532216986 -1.2105162624023968 -1.1240541192874598 -1.0516241510661641 -0.8188455856721732 -0.6168586530804309 -0.4186963520248099 -0.20789666676568955 0.0295008713522725 0.31671631783726234 0.6992815580015851 1.3223749183602587 1.5732385052700641 1.9236845463814185 2.5085099219292766 3.2689645805547753 3.5044925007442225 3.841724525500276 4.399781356888528 5.114888256216994 5.346077131581993 5.659256532458322 6.257314173278837 6.986207519686372 7.219730636115486 7.592515097944003 8.120907844495369 8.723051996739617; | ||
| -1.9989354254437397 -1.9372512947973601 -1.9085397775154465 -1.8141114782122008 -1.7646890249660294 -1.6175716157784974 -1.5372145799225008 -1.4011822939743703 -1.2672404489587346 -1.169926310440552 -1.0887002685066605 -0.8322081015222909 -0.6140517112328044 -0.404166231020204 -0.18335516739183028 0.06015392460374823 0.34855873939459553 0.7267832591857595 1.3282605791296271 1.5672982936172346 1.8987053911907181 2.4483775155779135 3.157299591873491 3.3747262669186058 3.6797595381737613 4.192286238183518 4.869623799822088 5.080242748754673 5.379051520938655 5.8747370994257535 6.5004003034881785 6.697073027583033 7.03600074408326 7.445733876827231 8.020320100696683; | ||
| -2.260406253565197 -2.1753074859488235 -2.128991194356946 -2.0065848989178288 -1.945510672007325 -1.7624757482456306 -1.6635124607248777 -1.4990052836398609 -1.3392374211352471 -1.2252207360751253 -1.1319971726969635 -0.8445584556230463 -0.6076273028769662 -0.3834824380209529 -0.15483904255918668 0.09368829578808609 0.3831767280474241 0.7547390312514602 1.3312274974190232 1.5571379748618364 1.865766664385659 2.3730178158029442 3.0080178359348353 3.1978501989364085 3.472363926096708 3.931316296612841 4.536760955793221 4.724863565260161 4.962319348947046 5.408613188229584 6.007714186756578 6.198538388520974 6.457796867859084 6.849495066215343 7.4640951217005576; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing a space for perfect aligment here and below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waste of time, there are many numbers which size differ by 1-2 digits.
|
Ready to merge. |
I'm trying to fix p-value estimation for one-sample and k-sample Anderson-Darling tests.
This PR should fix #74 and #56.