Skip to content

Conversation

@wildart
Copy link
Contributor

@wildart wildart commented Oct 22, 2017

I'm trying to fix p-value estimation for one-sample and k-sample Anderson-Darling tests.

  • Add p-value estimation through MC simulation for k-sample AD test
  • Provide more interpolation data for better asymptotic evaluation of p-value for k-sample AD test
  • Implement p-value calculations for one sample AD test from G. and J. Marsaglia, "Evaluating the Anderson-Darling Distribution", Journal of Statistical Software, 2004

This PR should fix #74 and #56.

@wildart wildart changed the title [WIP] Fix p-value estimation for k-sample AD test Fix p-value estimation for AD tests Oct 27, 2017
@wildart
Copy link
Contributor Author

wildart commented Oct 27, 2017

@diegozea @andreasnoack @ararslan Ready for review.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really can't comment on the math, but here are some remarks.

μ = mean(y)
σ = std(y)
= convert(typeof(μ), -n)
= convert(eltype(x), -n)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't going to be type stable for e.g. integer inputs AFAICT.

r = y[end]-m
broadcast!(x->(x-m)/r, y, y)
y[1] += eps() # to avoid log(0.0)
y[end] -= eps() # to avoid log(0.0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That comment could be made more explicit. Written that way, it sounds like y[end] could be 0.0, but that doesn't make sense since log(-eps()) wouldn't be correct. So where would log(0.0) happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required to avoid -Inf when calculated logccdf(d, 1.0) or logcdf(d, 0.0) in the test.

function OneSampleADTest(x::AbstractVector{T}, d::UnivariateDistribution) where T<:Real
OneSampleADTest(adstats(x, d)...)
n = length(x)
μ, σ = StatsBase.mean_and_std(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for StatsBase. AFAICT. Same for zscore! below.

n::Int # number of observations
σ::Float64 # variance A²k
A²k::Float64 # Anderson-Darling test statistic
modified::Bool # Modified test statistic
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or rather "whether the modified statistic is used"? BTW, no initial uppercase here and below for consistency.


"""
KSampleADTest(xs::AbstractVector{<:Real}...; modified = true)
KSampleADTest(xs::AbstractVector{<:Real}...; modified = true, nsim=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spaces around =.

# the Anderson-Darling test statistics (2*10^6) with sample from standard
# Normal distribution with size n=500 for each sample.
#
# After standardization of AD statistics, p-quantiles were estimate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"estimated"

logP = log.((1 .- PV) ./ PV )

# locate curve area for extrapolation
_, j = findmin(abs.(Tm .- Tk))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could avoid an allocation by using a generator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you be more specific?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will make two passes over the array: one for broadcasting and one for finding the minimum. Generators are lazy, so iteration only happens on-demand, in this case when finding the minimum. This would then be

findmin(abs(tm - tk) for (tm, tk) in zip(Tm, Tk))

or something along those lines.

x = rand(Cauchy(), n)
t = OneSampleADTest(x, Normal())
@test pvalue(t) < 1e-100
@test isapprox(pvalue(t), 0.0, atol=0.1^4)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be more natural as 1e-4? Also, recommended syntax is @test x ≈ y atol=z.


n = map(length, samples)
pooled = vcat(samples...)
Z = sort(pooled)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use sort!?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For simulated p-value calculation, an original sample is required for constructing random permutations.


"""Asymptotic evaluation of the p-value for Anderson-Darling test"""
function pvalueasym(x::KSampleADTest)
# Computational Details:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this based on the paper cited above? How sensitive is it to the sample size (500 doesn't sound that large)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar MC simulation parameters were used in kSample R package written by original paper author, Scholz, - 500 sample size, 2M samples.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be good to go after this round.

lcdfz = logcdf(d, zi)
lccdfz = logccdf(d, zni1)
-= (2*i - 1)/n * (lcdfz + lccdfz)
lcdfz = logcdf(d, x[i])::Float64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for type assertions. The formula used to compute should give a Float64 anyway due to (i+i-1)/n. Also assertions will fail if the type doesn't match, they don't perform the conversion automatically, so that's a lose-lose solution.

`modified` parameter enables a modified test calculation for samples whose observations
do not all coincide.
`nsim` parameter if it equals to 0 then the asymptotic calculation of p-value is used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not correct grammar. Something like "If nsim is equal to 0 (the default), the asymptotic... If it is greater than 0, an estimation... is used. This proportion is...".

-1.5849548989018964 -1.5407036306002324 -1.5217650056531025 -1.468313040457585 -1.4383460114029611 -1.3510451980424227 -1.2994715185279744 -1.2107011085917234 -1.1182349910369536 -1.0480894352574122 -0.9888467602015968 -0.792282309179151 -0.6158884225411821 -0.4357973228266815 -0.24055211468641896 -0.013263022467496355 0.2679318847085987 0.6545019598726984 1.3035350042516591 1.5717238802812676 1.9472543460769594 2.5917105532062927 3.4417697301287804 3.708500478251771 4.08514071675675 4.729002778898586 5.553060449998679 5.8246988123153525 6.2044102945806 6.8092648417824355 7.564819358105057 7.8751443348359444 8.265612297736865 9.09572889834988 9.634634570153578;
-1.8179842473942176 -1.7701061384541197 -1.7430115382228353 -1.6664153177996697 -1.6277529354269717 -1.509882048484231 -1.4423296156847323 -1.327021532216986 -1.2105162624023968 -1.1240541192874598 -1.0516241510661641 -0.8188455856721732 -0.6168586530804309 -0.4186963520248099 -0.20789666676568955 0.0295008713522725 0.31671631783726234 0.6992815580015851 1.3223749183602587 1.5732385052700641 1.9236845463814185 2.5085099219292766 3.2689645805547753 3.5044925007442225 3.841724525500276 4.399781356888528 5.114888256216994 5.346077131581993 5.659256532458322 6.257314173278837 6.986207519686372 7.219730636115486 7.592515097944003 8.120907844495369 8.723051996739617;
-1.9989354254437397 -1.9372512947973601 -1.9085397775154465 -1.8141114782122008 -1.7646890249660294 -1.6175716157784974 -1.5372145799225008 -1.4011822939743703 -1.2672404489587346 -1.169926310440552 -1.0887002685066605 -0.8322081015222909 -0.6140517112328044 -0.404166231020204 -0.18335516739183028 0.06015392460374823 0.34855873939459553 0.7267832591857595 1.3282605791296271 1.5672982936172346 1.8987053911907181 2.4483775155779135 3.157299591873491 3.3747262669186058 3.6797595381737613 4.192286238183518 4.869623799822088 5.080242748754673 5.379051520938655 5.8747370994257535 6.5004003034881785 6.697073027583033 7.03600074408326 7.445733876827231 8.020320100696683;
-2.260406253565197 -2.1753074859488235 -2.128991194356946 -2.0065848989178288 -1.945510672007325 -1.7624757482456306 -1.6635124607248777 -1.4990052836398609 -1.3392374211352471 -1.2252207360751253 -1.1319971726969635 -0.8445584556230463 -0.6076273028769662 -0.3834824380209529 -0.15483904255918668 0.09368829578808609 0.3831767280474241 0.7547390312514602 1.3312274974190232 1.5571379748618364 1.865766664385659 2.3730178158029442 3.0080178359348353 3.1978501989364085 3.472363926096708 3.931316296612841 4.536760955793221 4.724863565260161 4.962319348947046 5.408613188229584 6.007714186756578 6.198538388520974 6.457796867859084 6.849495066215343 7.4640951217005576;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a space for perfect aligment here and below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waste of time, there are many numbers which size differ by 1-2 digits.

@wildart
Copy link
Contributor Author

wildart commented Nov 23, 2017

Ready to merge.

@andreasnoack andreasnoack merged commit b28a458 into JuliaStats:master Nov 23, 2017
@wildart wildart deleted the ks-adt-pval branch November 23, 2017 21:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KSampleADTest extrapolation issues (can get p-values greater than 1)

4 participants