New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spearman Test (this needs review) #53
base: master
Are you sure you want to change the base?
Changes from 1 commit
5a9c7e5
74ab395
4dce494
8d2e57d
5f366e3
4d826aa
3ec64c7
c73345b
415e2d8
14e54da
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -27,7 +27,7 @@ export CorrelationTest, SpearmanCorrelationTest | |
abstract CorrelationTest <: HypothesisTest | ||
|
||
"Sum squared difference of ranks (midranks for ties)" | ||
spearman_S(xrank,yrank) = sumabs2(xrank .- yrank) | ||
spearman_S(xrank, yrank) = sumabs2(xrank .- yrank) | ||
|
||
immutable SpearmanCorrelationTest <: CorrelationTest | ||
# Tied ranking for x and y | ||
|
@@ -43,37 +43,37 @@ immutable SpearmanCorrelationTest <: CorrelationTest | |
# Spearman's ρ | ||
ρ::Float64 | ||
|
||
function SpearmanCorrelationTest(x,y) | ||
function SpearmanCorrelationTest(x, y) | ||
|
||
n = length(x) | ||
(n != length(y)) && throw(ErrorException("x and y must have the same length")) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a reason this doesn't use the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I didn't know I can set the message of @Assert |
||
|
||
xrank,xtiesadj = HypothesisTests.tiedrank_adj(x) | ||
yrank,ytiesadj = HypothesisTests.tiedrank_adj(y) | ||
xrank, xtiesadj = HypothesisTests.tiedrank_adj(x) | ||
yrank, ytiesadj = HypothesisTests.tiedrank_adj(y) | ||
|
||
S = spearman_S(xrank,yrank) | ||
S = spearman_S(xrank, yrank) | ||
|
||
ρ = corspearman(x,y) | ||
ρ = corspearman(x, y) | ||
|
||
new(xrank,yrank,xtiesadj,ytiesadj,S,n,ρ) | ||
new(xrank, yrank, xtiesadj, ytiesadj, S, n, ρ) | ||
end | ||
|
||
end | ||
|
||
testname(::SpearmanCorrelationTest) = "Spearman's rank correlation test" | ||
|
||
# parameter of interest: name, value under h0, point estimate | ||
population_param_of_interest(x::SpearmanCorrelationTest) = ("Spearman's ρ",0.0,x.ρ) | ||
# parameter of interest: name, value under h0, point estimate | ||
population_param_of_interest(x::SpearmanCorrelationTest) = ("Spearman's ρ", 0.0, x.ρ) | ||
|
||
function show_params(io::IO,x::SpearmanCorrelationTest,ident) | ||
println(io,ident,"Number of points: ",x.n) | ||
println(io,ident,"Spearman's ρ: ",x.ρ) | ||
println(io,ident,"S (Sum squared difference of ranks): ",x.S) | ||
println(io,ident,"adjustment for ties in x: ",x.xtiesadj) | ||
println(io,ident,"adjustment for ties in y: ",x.ytiesadj) | ||
function show_params(io::IO, x::SpearmanCorrelationTest, ident) | ||
println(io, ident, "Number of points: ", x.n) | ||
println(io, ident, "Spearman's ρ: ", x.ρ) | ||
println(io, ident, "S (Sum squared difference of ranks): ", x.S) | ||
println(io, ident, "adjustment for ties in x: ", x.xtiesadj) | ||
println(io, ident, "adjustment for ties in y: ", x.ytiesadj) | ||
end | ||
|
||
function P_from_null_S_values(S_null,x::SpearmanCorrelationTest,tail) | ||
function P_from_null_S_values(S_null, x::SpearmanCorrelationTest, tail) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there a reason not to explicitly identify the tail parameter as having type Symbol? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not really... There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Then I'd suggest marking it explicitly, both as a (admittedly trivial) compiler hint and (less trivial) documentation/security for users of the library. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's done :D |
||
S_null_mean = mean(S_null) | ||
# S is approximately normally distributed | ||
# S and ρ are inversely proportional | ||
|
@@ -91,24 +91,24 @@ function P_from_null_S_values(S_null,x::SpearmanCorrelationTest,tail) | |
end | ||
end | ||
|
||
function spearman_P_exact(x::SpearmanCorrelationTest,tail) | ||
S_null = Float64[ spearman_S(perm,x.yrank) for perm in permutations(x.xrank) ] | ||
P_from_null_S_values(S_null,x,tail) | ||
function spearman_P_exact(x::SpearmanCorrelationTest, tail) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You very well may have reason to follow this naming convention, but I think I should note that other tests rely on multiple-dispatch to avoid needing to use a function name that includes the name of the test. For example, in fisher.jl, there's There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My bad |
||
S_null = Float64[ spearman_S(perm, x.yrank) for perm in permutations(x.xrank) ] | ||
P_from_null_S_values(S_null, x, tail) | ||
end | ||
|
||
function spearman_P_sampling(x::SpearmanCorrelationTest,tail) | ||
function spearman_P_sampling(x::SpearmanCorrelationTest, tail) | ||
# 360000 samples gives an se(P) < 0.0005 for P < 0.1 | ||
X = copy(x.xrank) | ||
S_null = Float64[ spearman_S(shuffle!(X),x.yrank) for sample in 1:360000 ] | ||
P_from_null_S_values(S_null,x,tail) | ||
S_null = Float64[ spearman_S(shuffle!(X), x.yrank) for sample in 1:360000 ] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is a nondeterministic test. Even though failure may be rare, it's still not very fun to have nondeterministic tests complain. You should be able to make this test deterministic by just calling srand(12345) before using anything random. It would be super slick to compare this implementation to the R implementation by using an identical source of random numbers (and shuffles/samples based on them, etc), but I don't know offhand how to do that... and it is probably not worth the effort. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @fiatflux How can I set the seed only for this function, but not for all the Julia session? I don't like the idea of having a seed fixed as side effect. > library("coin")
> x <- rnorm(100)
> y <- sqrt(abs(sin((x/3.14)+(0.6*x))))
> spearman_test(x ~ y, distribution=approximate(B=9999))
Approximative Spearman Correlation Test
data: x by y
Z = -0.37602, p-value = 0.7037
alternative hypothesis: true rho is not equal to 0
> spearman_test(x ~ y, distribution=approximate(B=9999))
Approximative Spearman Correlation Test
data: x by y
Z = -0.37602, p-value = 0.7144
alternative hypothesis: true rho is not equal to 0
> spearman_test(x ~ y, distribution=approximate(B=9999))
Approximative Spearman Correlation Test
data: x by y
Z = -0.37602, p-value = 0.7108
alternative hypothesis: true rho is not equal to 0
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fair point. And sorry, I meant to put this comment in test/... My suggestion would be to call srand() at the end of the unit test to make sure the fixed seed doesn't propagate beyond the test. Probably best to put srand() inside a finally clause, too. Anyone else have thoughts on that? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think it's a problem to set the seed for a whole test session. Other uses of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just an idea ... at risk of being one of those obnoxious people who stomps around town beating their chest and shouting "TESTABILITY!!!1" ... it might be good practice to take a PRNG object as an optional parameter to nondeterministic functions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok... I will add the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like this line still has extra padding. |
||
P_from_null_S_values(S_null, x, tail) | ||
end | ||
|
||
# Use estimated mean and std for the S null distribution as in: | ||
# | ||
# Press WH, Teukolsky SA, Vetterling WT, Flannery BP. | ||
# Numerical recipes in C. | ||
# Cambridge: Cambridge university press; 1996. | ||
function spearman_P_estimated(x::SpearmanCorrelationTest,tail) | ||
function spearman_P_estimated(x::SpearmanCorrelationTest, tail) | ||
N = float(x.n) | ||
a = (N^3 - N) | ||
# Numerical Recipes (14.6.6) | ||
|
@@ -119,11 +119,11 @@ function spearman_P_estimated(x::SpearmanCorrelationTest,tail) | |
# S is approximately normally distributed | ||
# S and ρ are inversely proportional | ||
if tail == :both | ||
cdf(Normal(),-abs(zscore)) + ccdf(Normal(),abs(zscore)) | ||
cdf(Normal(), -abs(zscore)) + ccdf(Normal(), abs(zscore)) | ||
elseif tail == :right | ||
cdf(Normal(),zscore) | ||
cdf(Normal(), zscore) | ||
elseif tail == :left | ||
ccdf(Normal(),zscore) | ||
ccdf(Normal(), zscore) | ||
else | ||
throw(ArgumentError("tail=$(tail) is invalid")) | ||
end | ||
|
@@ -134,16 +134,16 @@ end | |
# McDonald JH. | ||
# Handbook of biological statistics. | ||
# Baltimore, MD: Sparky House Publishing; 2009 Aug. | ||
function spearman_P_ttest(x::SpearmanCorrelationTest,tail) | ||
function spearman_P_ttest(x::SpearmanCorrelationTest, tail) | ||
ρ2 = x.ρ^2 | ||
df = x.n-2 | ||
t = sqrt((df*ρ2)/(1-ρ2)) | ||
if tail == :both | ||
cdf(TDist(df),-t) + ccdf(Normal(),t) | ||
cdf(TDist(df), -t) + ccdf(Normal(), t) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this supposed to be Normal or TDist(df) in the second term? |
||
elseif tail == :right | ||
ccdf(TDist(df),t) | ||
ccdf(TDist(df), t) | ||
elseif tail == :left | ||
cdf(TDist(df),t) | ||
cdf(TDist(df), t) | ||
else | ||
throw(ArgumentError("tail=$(tail) is invalid")) | ||
end | ||
|
@@ -152,16 +152,16 @@ end | |
function pvalue(x::SpearmanCorrelationTest; tail=:both, method=:estimated) | ||
if x.n <= 10 | ||
# Exact P value using permutations | ||
return(spearman_P_exact(x,tail)) | ||
return(spearman_P_exact(x, tail)) | ||
end | ||
if method == :sampling | ||
return(spearman_P_sampling(x,tail)) | ||
return(spearman_P_sampling(x, tail)) | ||
elseif method == :exact | ||
return(spearman_P_exact(x,tail)) | ||
return(spearman_P_exact(x, tail)) | ||
elseif method == :estimated | ||
return(spearman_P_estimated(x,tail)) | ||
return(spearman_P_estimated(x, tail)) | ||
elseif method == :ttest | ||
return(spearman_P_ttest(x,tail)) | ||
return(spearman_P_ttest(x, tail)) | ||
else | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Tiny style thing: can be simplified to (Sorry, I don't know why github is smashing that onto one line.) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wrote it in that way to make explicit that with 10 or less elements, always the exact P value is calculated. Different methods are selected only if N > 10. The error with N<=10 is only for a consistent API. |
||
throw(ArgumentError("method=$(method) is invalid")) | ||
end | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,51 +1,52 @@ | ||
using HypothesisTests, Base.Test | ||
|
||
# Test Exact P value: n <= 10 | ||
let x = [44.4,45.9,41.9,53.3,44.7,44.1], | ||
y = [2.6,3.1,2.5,5.0,3.6,4.0] | ||
let x = [44.4, 45.9, 41.9, 53.3, 44.7, 44.1], | ||
y = [2.6, 3.1, 2.5, 5.0, 3.6, 4.0] | ||
|
||
corr = HypothesisTests.SpearmanCorrelationTest(x,y) | ||
corr = HypothesisTests.SpearmanCorrelationTest(x, y) | ||
|
||
# R values | ||
@test_approx_eq corr.ρ 0.6 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.2417 0.0001 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr,tail=:right) 0.1208 0.0001 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr,tail=:left) 0.9125 0.0001 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right) 0.1208 0.0001 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:left) 0.9125 0.0001 | ||
end | ||
|
||
show(IOBuffer(), | ||
HypothesisTests.SpearmanCorrelationTest([44.4,45.9,41.9,53.3,44.,44.1], | ||
[2.6,3.1,2.5,5.0,3.6,4.0]) | ||
HypothesisTests.SpearmanCorrelationTest([44.4, 45.9, 41.9, 53.3, 44., 44.1], | ||
[2.6, 3.1, 2.5, 5.0, 3.6, 4.0]) | ||
) | ||
|
||
let x = collect(1:11), | ||
y = [6,5,4,3,2,1,7,11,10,9,8] | ||
y = [6, 5, 4, 3, 2, 1, 7, 11, 10, 9, 8] | ||
# https://stat.ethz.ch/pipermail/r-devel/2009-February/052112.html | ||
# correct P value 0.03044548 | ||
|
||
corr = HypothesisTests.SpearmanCorrelationTest(x,y) | ||
corr = HypothesisTests.SpearmanCorrelationTest(x, y) | ||
|
||
srand(12345) # Seed for method=:sampling | ||
|
||
@test_approx_eq_eps HypothesisTests.pvalue(corr,tail=:right,method=:exact) 0.03044548 1e-8 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr,tail=:right,method=:sampling) 0.030 1e-3 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr,tail=:right,method=:estimated) 0.030 1e-3 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr,tail=:right,method=:ttest) 0.03 1e-2 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right, method=:exact) 0.03044548 1e-8 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right, method=:sampling) 0.030 1e-3 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right, method=:estimated) 0.030 1e-3 | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right, method=:ttest) 0.03 1e-2 | ||
end | ||
|
||
let x = collect(1:10), | ||
y = [5,4,3,2,1,6,10,9,8,7] | ||
y = [5, 4, 3, 2, 1, 6, 10, 9, 8, 7] | ||
|
||
# R's pspearman: 0.05443067 is the exact value | ||
corr = HypothesisTests.SpearmanCorrelationTest(x,y) | ||
corr = HypothesisTests.SpearmanCorrelationTest(x, y) | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.05443067 1e-8 | ||
end | ||
|
||
# Using (N-1)N²(N+1)² overflows with N = 10153 | ||
# Using (N-1)N²(N+1)² overflows with N = 10153 and sqrt((N-1)N²(N+1)²) throws an error | ||
# pvalue avoids the Int overflow using float(N) and sqrt(N-1)N(N+1) since N > 0 | ||
srand(12345) # Seed for rand | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It should be fine to set the seed just once at the top of the test, but there's no harm doing it here too. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I put one There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That's fair. I'm generally much less concerned with a test breaking as a result of a change, than the possibility that the outcome of the same test on the same codebase randomly fails sometimes when repeatedly run. No big deal either way. |
||
let x = rand(10153) | ||
|
||
corr = SpearmanCorrelationTest(x,x) | ||
corr = SpearmanCorrelationTest(x, x) | ||
|
||
@test_approx_eq corr.ρ 1.0 | ||
@test_approx_eq pvalue(corr) 0.0 | ||
|
@@ -54,18 +55,18 @@ end | |
|
||
# Test S value with ties | ||
|
||
function rho_with_ties(S,N,tx,ty) # S == D | ||
function rho_with_ties(S, N, tx, ty) # S == D | ||
# Equation (14.6.5) from Numerical Recipes for rho with ties | ||
a=(N^3)-N | ||
(1-((6/a)*(S+(tx/12)+(ty/12)))) / (sqrt(1-(tx/a))*sqrt(1-(ty/a))) | ||
end | ||
|
||
function diff_rho(x,y) | ||
corr = SpearmanCorrelationTest(x,y) | ||
corr.ρ - rho_with_ties(corr.S,corr.n,corr.xtiesadj,corr.ytiesadj) | ||
function diff_rho(x, y) | ||
corr = SpearmanCorrelationTest(x, y) | ||
corr.ρ - rho_with_ties(corr.S, corr.n, corr.xtiesadj, corr.ytiesadj) | ||
end | ||
|
||
srand(12345) # Seed for rand | ||
for i in 20:100 | ||
@test_approx_eq_eps diff_rho(rand(1:10,i),rand(1:10,i)) 0.0 1e-10 | ||
@test_approx_eq_eps diff_rho(rand(1:10, i), rand(1:10, i)) 0.0 1e-10 | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is getting reeeeaallly nitpicky, but this function block has extra newlines compared with the general style of the rest of the codebase (I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, I like to se white spaces in code so I'm biased to the space paddings and extra newlines XD