K-S for 2 samples - possible issue? #50

Open
themoabird opened this issue Oct 22, 2018 · 2 comments

Comments

@themoabird

Hi

I'm not sure if this is an issue, but I'm using the K-S test for 2 samples to examine the compatibility of samples.

I've found that if I use identical samples for sample a and sample b it sometimes tells me the samples are not compatible (i.e., it's a low probability that both samples are drawn from the same underlying distribution).

I don't know enough about how the K-S test works to have an idea about whether that makes any sense, but it's certainly counterintuitive...

Sorry if I'm just wasting your time by flagging up a non-issue!

@dcwuser
Owner

dcwuser commented Nov 20, 2018

Thanks so much for reporting this issue! This may be a real bug, but I'm actually not entirely sure; I'll need to go back and study some theory to make sure, and there is a little bit of follow-up from you that could be helpful.

Computing the 2-sample KS D-statistic involves measuring the maximum distance between two EDFs (https://en.wikipedia.org/wiki/Empirical_distribution_function). Because there is a step discontinuity in the EDF at each data point, there is some ambiguity in how to measure the distance between the two EDFs at those points: should you measure from the bottom or the top of the step? Since the D-statistic is defined as the maximum distance, I wrote the code to always resolve that ambiguity by returning the largest possible distance. Given two identical EDFs, that means we don't get D=0, but instead D=1/n, where n is the number of points. I need to go back and study the theory to see if this is the right choice.
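To make the step ambiguity concrete, here is a minimal standalone sketch in Python (not the Meta.Numerics source, which is C#). The function names `ks_d_standard` and `ks_d_largest_gap` are hypothetical, and the second function is only one assumed reading of "return the largest possible distance": at each data point it also compares the top of one EDF's step with the bottom of the other's.

```python
from bisect import bisect_left, bisect_right
from fractions import Fraction


def _edf_bounds(sorted_sample, x):
    """Return (bottom, top) of the EDF step at x as exact fractions."""
    n = len(sorted_sample)
    bottom = Fraction(bisect_left(sorted_sample, x), n)   # fraction of points < x
    top = Fraction(bisect_right(sorted_sample, x), n)     # fraction of points <= x
    return bottom, top


def ks_d_standard(a, b):
    """sup |F_a(x) - F_b(x)| with both EDFs taken at the top of the step."""
    sa, sb = sorted(a), sorted(b)
    d = Fraction(0)
    for x in set(sa) | set(sb):
        _, a_top = _edf_bounds(sa, x)
        _, b_top = _edf_bounds(sb, x)
        d = max(d, abs(a_top - b_top))
    return d


def ks_d_largest_gap(a, b):
    """Assumed convention: take the largest gap among all step edges at x."""
    sa, sb = sorted(a), sorted(b)
    d = Fraction(0)
    for x in set(sa) | set(sb):
        a_bottom, a_top = _edf_bounds(sa, x)
        b_bottom, b_top = _edf_bounds(sb, x)
        d = max(d,
                abs(a_top - b_top), abs(a_bottom - b_bottom),
                abs(a_top - b_bottom), abs(b_top - a_bottom))
    return d


sample = [1.0, 2.0, 3.0, 4.0]
print(ks_d_standard(sample, sample))     # identical samples, top vs top
print(ks_d_largest_gap(sample, sample))  # identical samples, top vs bottom
```

For two identical samples with n distinct values, the first function gives D = 0 and the second gives D = 1/n. If a value is repeated k times, the step at that value has height k/n, so under this assumed convention D would be k/n instead.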

Considering this, it's to be expected that you get D > 0 for two identical samples, but I am still surprised that you ever get a small P. I haven't been able to construct any example which yields a small P. Could you send me a repro?

Even if this does turn out to be a bug for the identical sample case, I wouldn't worry about the reliability of the method for real data. This behavior appears to be a perhaps undesirable effect of the distance definition for the corner case of identical data, but should have no impact with real, continuous data from separate samples.

@themoabird
Author

Hi - Thanks for responding.

Try this dataset as both Sample A & Sample B.

18,15,18,16,17,15,14,14,14,15,15,14,15,14,22,18,21,21,10,10

It gives me D = 0.25, calling it like this:

var ksTest = Univariate.KolmogorovSmirnovTest(List1, List2);

where both lists are the same (obviously).
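One possible explanation for the value 0.25, offered as a guess rather than a confirmed diagnosis: if the distance at a data point is measured from the top of one EDF's step to the bottom of the other's, then for identical samples D equals the height of the tallest step, which is the largest number of repeats of any value divided by n. A short Python check of that arithmetic on the dataset above (the variable names are illustrative only):

```python
from collections import Counter
from fractions import Fraction

data = [18, 15, 18, 16, 17, 15, 14, 14, 14, 15,
        15, 14, 15, 14, 22, 18, 21, 21, 10, 10]

# Height of the tallest EDF step = (largest tie count) / (sample size).
largest_tie = max(Counter(data).values())
tallest_step = Fraction(largest_tie, len(data))

print(largest_tie, len(data), float(tallest_step))
```

Both 14 and 15 occur five times among the 20 points, so the tallest step is 5/20 = 0.25, which matches the reported D.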

If you then duplicate it (i.e., same numbers duplicated in both samples), p gets smaller (as expected, I'd guess, because a bigger sample size means less randomness, and D doesn't change). Keep duplicating it and you eventually end up with p being significant.

Isn't that inevitable if D > 0 (i.e., there would be some sample size at which even a very small D becomes significant, given that in this case D isn't changing)?
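As a rough check of this reasoning, the sketch below uses the standard large-sample (asymptotic) Kolmogorov formula, p ≈ Q(sqrt(nm/(n+m)) · D) with Q(λ) = 2 Σ (-1)^(k-1) exp(-2k²λ²). This is an assumption for illustration: Meta.Numerics may use an exact small-sample distribution, so its p-values need not match these numbers. The function names are hypothetical.

```python
import math


def kolmogorov_q(lam, terms=100):
    """Asymptotic tail probability Q(lam) of the Kolmogorov distribution."""
    if lam < 0.2:
        # Q is essentially 1 here, and the series converges slowly for tiny lam.
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        total += sign * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))


def asymptotic_p(d, n, m):
    """Approximate two-sample K-S p-value for statistic d and sample sizes n, m."""
    effective_n = n * m / (n + m)
    return kolmogorov_q(math.sqrt(effective_n) * d)


# Hold D fixed at 0.25 and grow both samples by duplicating the 20 points.
for copies in (1, 2, 4, 8):
    n = 20 * copies
    print(n, asymptotic_p(0.25, n, n))
```

Under this approximation, with D held at 0.25 the p-value shrinks steadily as n grows and falls below 0.05 once each sample holds four copies of the 20 points (n = 80). So yes: any fixed D > 0 eventually becomes significant at a large enough sample size.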

I need to caveat all this by saying, I really don't know what I'm doing, I have a very basic understanding of statistics, and it's also possible I messed up my programming, and D isn't 0.25 at all! :)

Thanks again!

Meta.Numerics is a very, very cool thing. :)
