New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spearman Test (this needs review) #53
base: master
Are you sure you want to change the base?
Conversation
I ran all the methods in this pull request, and I measured the differences against R's P values with the following function: function P_test(N)
data = DataFrame(
ties = DataArray(Bool, N),
P_R = DataArray(Float64, N),
P_R_exact = DataArray(Float64, N),
P_jl_exact = DataArray(Float64, N),
P_jl_sampling = DataArray(Float64, N),
P_jl_estimated = DataArray(Float64, N),
P_jl_ttest = DataArray(Float64, N)
)
y = 1:11
for row in eachrow(data)
x = vcat([1, 2], rand(1:40, 7), [39, 40])
row[:ties] = length(unique(x)) != 11
row[:P_R] = rcopy(R"""cor.test($x, $y, method="spearman", exact=FALSE)""")[symbol("p.value")]
row[:P_R_exact] = rcopy(R"""cor.test($x, $y, method="spearman", exact=TRUE)""")[symbol("p.value")]
corr = SpearmanCorrelationTest(x, y)
row[:P_jl_exact] = pvalue(corr, method=:exact)
row[:P_jl_sampling] = pvalue(corr, method=:sampling)
row[:P_jl_estimated] = pvalue(corr, method=:estimated)
row[:P_jl_ttest] = pvalue(corr, method=:ttest)
end
data
end From 110 samples without ties, this is the distribution of P - R's P with
From 390 samples with ties, this is the distribution of P - R's P with
Comparing all (500 samples, with and without ties) the values against the exact P value of this pull request, P - Julia exact P value:
|
I guest I'm getting Int overflow here... How can I solve it? The problem is in this line: https://github.com/diegozea/HypothesisTests.jl/blob/master/src/spearman.jl#L121 julia> x
Spearman's rank correlation test
--------------------------------
Population details:
parameter of interest: Spearman's ρ
value under h_0: 0.0
point estimate: -0.7284221607028682
Error showing value of type HypothesisTests.SpearmanCorrelationTest:
ERROR: DomainError:
in spearman_P_estimated at /home/diego/.julia/v0.4/HypothesisTests/src/spearman.jl:121
in pvalue at /home/diego/.julia/v0.4/HypothesisTests/src/spearman.jl:166
in show at /home/diego/.julia/v0.4/HypothesisTests/src/HypothesisTests.jl:98
in anonymous at show.jl:1299
in with_output_limit at ./show.jl:1276
in showlimited at show.jl:1298
in writemime at replutil.jl:4
in display at REPL.jl:114
in display at REPL.jl:117
[inlined code] from multimedia.jl:151
in display at multimedia.jl:163
in print_response at REPL.jl:134
in print_response at REPL.jl:121
in anonymous at REPL.jl:624
in run_interface at ./LineEdit.jl:1610
in run_frontend at ./REPL.jl:863
in run_repl at ./REPL.jl:167
in _start at ./client.jl:420
julia> x.n
10153
julia> (x.n-1)
10152
julia> (x.n^2)
103083409
julia> ((x.n+1)^2)
103103716
julia> ( (x.n-1) * (x.n^2)* ((x.n+1)^2) )
-2782140239849997408
julia> ( (x.n-1) * (x.n^2)* ((x.n+1)^2) ) / 36
-7.72816733291666e16
julia> sqrt( ( (x.n-1) * (x.n^2)* ((x.n+1)^2) ) / 36 )
ERROR: DomainError:
sqrt will only return a complex result if called with a complex argument. Try sqrt(complex(x)).
in sqrt at ./math.jl:144
|
This gives a better result, but maybe is not the best solution: (((x.n+1)^2)/36) * (x.n^2) * (x.n-1) |
Maybe just call |
Maybe use |
It could help a bit to pull the squared quantities outside the square root. (x.n)^2 and (x.n+1)^2 in particular. Then the argument to the square root grows as x.n rather than x.n^5... But numerical recipes might be doing something intentionally that I'm just not ready to handle. Actually, yeah, I think they might have organized the code that way to make use of integer arithmetic which, these days, isn't as a big advantage. |
Clever ideas @andreasnoack and @fiatflux ! Since |
Thanks! I updated the pull to avoid Int overflow ;) julia> x = rand(10153);
julia> y = cos(π*x+0.5π);
julia> corr = SpearmanCorrelationTest(x,y)
Spearman's rank correlation test
--------------------------------
Population details:
parameter of interest: Spearman's ρ
value under h_0: 0.0
point estimate: 0.01719171284150201
Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value: 0.08324014651846666 (not significant)
Details:
Number of points: 10153
Spearman's ρ: 0.01719171284150201
S (Sum squared difference of ranks): 1.7143548239e11
adjustment for ties in x: 0.0
adjustment for ties in y: 0.0
julia> R"""cor.test($x,$y,method="spearman")"""
RCall.RObject{RCall.VecSxp}
Spearman's rank correlation rho
data: $x and $y
S = 1.7144e+11, p-value = 0.08324
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.01719171 |
When there is no ties, S values are the same between R's using StatsBase
function rho_with_ties(S,N,tx,ty)
# Equation (14.6.5) from Numerical Recipes
a=(N^3)-N
(1-((6/a)*(S+(tx/12)+(ty/12)))) / (sqrt(1-(tx/a))*sqrt(1-(ty/a)))
end
function diff_rho(x,y)
corr = SpearmanCorrelationTest(x, y)
corr.ρ - rho_with_ties(corr.S, corr.n, corr.xtiesadj, corr.ytiesadj)
end
describe(Float64[ diff_rho(rand(1:10,i), rand(1:10,i)) for i in 11:10000 ])
I found no differences between the real rho values and the values derived from our S values. I will add this to tests... |
Travis shows an error with Maybe tests for
|
return( spearman_P_exact(x, tail) ) | ||
end | ||
if method == :sampling | ||
return( spearman_P_sampling(x, tail) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest removing the whitespace padding to be more consistent with other code in this repo
@fiatflux I updated the code, deleting paddings and extra spaces. I also setted a seed for each random test ;) |
end | ||
|
||
let x = collect(1:10), | ||
y = [5,4,3,2,1,6,10,9,8,7] | ||
|
||
# R's pspearman: 0.05443067 is the exact value | ||
corr = HypothesisTests.SpearmanCorrelationTest(x, y) | ||
corr = HypothesisTests.SpearmanCorrelationTest(x,y) | ||
@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.05443067 1e-8 | ||
end | ||
|
||
# Using (N-1)N²(N+1)² overflows with N = 10153 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still get overflow for N > 10153?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, It's working fine now ;)
julia> L=10153; N = 10L; x=rand(1:L, N); y=rand(1:L, N);
julia> R"""cor.test($x,$y,method="spearman",exact=FALSE)"""
RCall.RObject{RCall.VecSxp}
Spearman's rank correlation rho
data: $x and $y
S = 1.7465e+14, p-value = 0.6973
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
-0.001220803
julia> SpearmanCorrelationTest(x,y)
Spearman's rank correlation test
--------------------------------
Population details:
parameter of interest: Spearman's ρ
value under h_0: 0.0
point estimate: -0.0012208025475993133
Test summary:
outcome with 95% confidence: fail to reject h_0
two-sided p-value: 0.6972821892308928 (not significant)
Details:
Number of points: 101530
Spearman's ρ: -0.0012208025475993133
S (Sum squared difference of ranks): 1.746472562199395e14
adjustment for ties in x: 1.3199436e7
adjustment for ties in y: 1.3259556e7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Might as well remove reference to it here.
@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.05443067 1e-8 | ||
end | ||
|
||
# Using (N-1)N²(N+1)² overflows with N = 10153 | ||
srand(12345) # Seed for rand |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be fine to set the seed just once at the top of the test, but there's no harm doing it here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put one srand
per test, to avoid the hypothetical case of other rand
call introduced later between the test and a global srand
changing the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I'm generally much less concerned with a test breaking as a result of a change, than the possibility that the outcome of the same test on the same codebase randomly fails sometimes when repeatedly run. No big deal either way.
@@ -43,37 +43,37 @@ immutable SpearmanCorrelationTest <: CorrelationTest | |||
# Spearman's ρ | |||
ρ::Float64 | |||
|
|||
function SpearmanCorrelationTest(x,y) | |||
function SpearmanCorrelationTest(x, y) | |||
|
|||
n = length(x) | |||
(n != length(y)) && throw(ErrorException("x and y must have the same length")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this doesn't use the @assert
macro? A la
@assert n==length(y) "Variables x and y must have the same length"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know I can set the message of @Assert
println(io, ident, "adjustment for ties in y: ", x.ytiesadj) | ||
end | ||
|
||
function P_from_null_S_values(S_null, x::SpearmanCorrelationTest, tail) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason not to explicitly identify the tail parameter as having type Symbol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then I'd suggest marking it explicitly, both as a (admittedly trivial) compiler hint and (less trivial) documentation/security for users of the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's done :D
Looking pretty good to me. Only thing that concerns me now is that it seems coveralls thinks left-tailed tests are not exercised in the unittests. What's going on there? |
@fiatflux I added more tests to improve test coverage and I fixed bugs that I found in the process. |
elseif method == :estimated | ||
return(spearman_P_estimated(x, tail)) | ||
elseif method == :ttest | ||
return(spearman_P_ttest(x, tail)) | ||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny style thing:
if foo foobar() else if bar barbar() elif baz bazbar() end end
can be simplified to
if foo foobar() elif bar barbar() elif baz bazbar() end
(Sorry, I don't know why github is smashing that onto one line.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote it in that way to make explicit that with 10 or less elements, always the exact P value is calculated. Different methods are selected only if N > 10. The error with N<=10 is only for a consistent API.
Thanks for writing this @diegozea ! |
You're welcome! :) |
df = x.n-2 | ||
t = sqrt((df*ρ2)/(1-ρ2)) | ||
if tail == :both | ||
cdf(TDist(df), -t) + ccdf(Normal(), t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be Normal or TDist(df) in the second term?
What's the status of this PR? Would be great to have Spearman test in the package. |
Nothing new to add here, just wanted to know the state of this PR and if there's any luck this will be merged in the future Thanks! |
Hi, any news for the merge of this PR? |
Somebody needs to rebase the PR against current master and update it to run on Julia 1.0 and above. |
Here I'm adding the Spearman correlation test, with some alternatives to calculate P values. I'm think it's correct, even when S values are different from R's S values.
With less than 10 points, my implementation calculate all the possible S values using the ranks, and midranks in the case of ties, to calculate the exact P value. This looks better to me than the S used by R (without taking ties into account):
Best,