Spearman Test (this needs review) #53

Open
diegozea wants to merge 10 commits into JuliaStats:master from diegozea:master
Conversation

diegozea

Here I'm adding the Spearman correlation test, with some alternatives for calculating P values. I think it's correct, even though the S values differ from R's S values.

> x <- c(10, 10, 20, 30, 30)
> y <- c(1,2,3,4,5)
> cor.test(x, y, method="spearman")

    Spearman's rank correlation rho

data:  x and y
S = 1.0263, p-value = 0.01385
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.9486833 

Warning message:
In cor.test.default(x, y, method = "spearman") :
  Cannot compute exact p-value with ties
> (5^3 - 5) * (1-0.9486833)/6
[1] 1.026334
> sum((rank(x) - rank(y))^2) # Used in my implementation
[1] 1
> 

With fewer than 10 points, my implementation calculates all the possible S values from the ranks (midranks in the case of ties) to compute the exact P value; a rough sketch of that idea follows the output below. This looks better to me than the S used by R, which does not take ties into account:

julia> x = [10,10,20,30,30];

julia> y = collect(1:5);

julia> HypothesisTests.SpearmanCorrelationTest(x, y)
Spearman's rank correlation test
--------------------------------
Population details:
    parameter of interest:   Spearman's ρ
    value under h_0:         0.0
    point estimate:          0.9486832980505138

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.06666666666666667 (not significant)

Details:
    Number of points:                    5
    Spearman's ρ:                        0.9486832980505138
    S (Sum squared difference of ranks): 1.0
    adjustment for ties in x:            12.0
    adjustment for ties in y:            0.0
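
As a rough sketch of the enumeration idea described above (this is not the PR's implementation): take the midranks of both vectors, enumerate every permutation of y's ranks to build the null distribution of S, and double the smaller tail for the two-sided P value (the doubling rule is one common choice; the PR may use a different one). tiedrank comes from StatsBase; permutations is in Combinatorics (it was in Base on Julia 0.4).

using StatsBase       # tiedrank gives midranks in the presence of ties
using Combinatorics   # permutations

function exact_spearman_p(x, y)
    rx, ry = tiedrank(x), tiedrank(y)
    Sobs  = sum((rx - ry).^2)                              # observed S
    Snull = [sum((rx - p).^2) for p in permutations(ry)]   # S under the null
    lower = count(s -> s <= Sobs, Snull) / length(Snull)
    upper = count(s -> s >= Sobs, Snull) / length(Snull)
    min(1.0, 2 * min(lower, upper))                        # two-sided P value
end

exact_spearman_p([10, 10, 20, 30, 30], collect(1:5))   # ≈ 0.0667, as in the output above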

Best,

@coveralls

Coverage Status

Coverage decreased (-6.7%) to 83.521% when pulling 5a9c7e5 on diegozea:master into 351b8d0 on JuliaStats:master.

@diegozea
Author

I ran all the methods in this pull request, and I measured the differences against R's P values with the following function:

using DataFrames, RCall, HypothesisTests

function P_test(N)
    data = DataFrame(
        ties = DataArray(Bool, N),
        P_R  =  DataArray(Float64, N),
        P_R_exact  = DataArray(Float64, N),
        P_jl_exact =  DataArray(Float64, N),
        P_jl_sampling  =  DataArray(Float64, N),
        P_jl_estimated =  DataArray(Float64, N),
        P_jl_ttest =  DataArray(Float64, N)
    )
    y = 1:11
    for row in eachrow(data)
        x = vcat([1, 2], rand(1:40, 7), [39, 40])
        row[:ties] = length(unique(x)) != 11
        row[:P_R] = rcopy(R"""cor.test($x, $y, method="spearman", exact=FALSE)""")[symbol("p.value")]
        row[:P_R_exact] = rcopy(R"""cor.test($x, $y, method="spearman", exact=TRUE)""")[symbol("p.value")]
        corr = SpearmanCorrelationTest(x, y)
        row[:P_jl_exact] = pvalue(corr, method=:exact)
        row[:P_jl_sampling] = pvalue(corr, method=:sampling)
        row[:P_jl_estimated] = pvalue(corr, method=:estimated)
        row[:P_jl_ttest] = pvalue(corr, method=:ttest)
    end
    data    
end
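
A hypothetical way to summarize the results, mirroring the tables below (0.4-era DataFrames/DataArrays assumed, so dropna is used to get a plain vector; the column names are the ones defined in P_test):

data  = P_test(500)                              # 500 random samples
diffs = data[:P_R_exact] - data[:P_jl_exact]     # corresponds to the "R (TRUE)" column of the last table
quantile(dropna(diffs), [0.0, 0.05, 0.25, 0.5, 0.75, 0.95, 1.0])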

From 110 samples without ties, this is the distribution of P - R's P with exact=TRUE:

         R (FALSE)    exact         sampling     estimated     ttest
min     -0.0051819   -0.000650414  -0.00289196   0.000254004  -0.0193903 
25%     -0.00449384  -0.000615007  -0.00130036   0.00464081   -0.0117049 
median  -0.00330545  -0.000446603  -0.000583082  0.00584124   -0.00690082
75%     -0.00165133  -0.000137737  -0.000265782  0.00641158   -0.00268308
max      6.66145e-5   0.000216525   0.000780357  0.00654717    3.33073e-5

From 390 samples with ties, this is the distribution of P - R's P with exact=FALSE:

         exact        sampling      estimated    ttest
min     -0.00123275  -0.00521559   -0.0179128   -0.0172322 
25%      0.00116759   0.000986073   0.00480537  -0.0114892
median   0.00277293   0.00224276    0.0081975   -0.00560353
75%      0.00389003   0.00330739    0.00975644  -0.00131509
max      0.00590762   0.00567053    0.010392    -5.16563e-9

Comparing all 500 samples (with and without ties) against the exact P value from this pull request, this is the distribution of P - Julia's exact P value:

     R (FALSE)    R (TRUE)      sampling      estimated    ttest
min -0.00590762  -0.00590762   -0.00677725   -0.0166801   -0.0226378  
5%  -0.00472031  -0.00434692   -0.00162825   -0.00377428  -0.0193502  
25% -0.00394747  -0.00364021   -0.000527372   0.00356583  -0.0149684  
50% -0.00275482  -0.0016838    -0.000109277   0.00581605  -0.00769338 
75% -0.00117023  -8.76156e-5    1.60013e-5    0.00702659  -0.0024413  
95% -8.8607e-5    0.000615007   0.000294464   0.00738872  -0.000154633
max  0.00123275   0.00123275    0.0011114     0.00759521  -7.96502e-7

:estimated is a good trade-off between speed and accuracy. :estimated and :ttest should do better with more points (I used 11 points so that I could calculate S for all the permutations and get an exact P value).

@diegozea
Author

I guess I'm getting Int overflow here... How can I solve it? The problem is on this line: https://github.com/diegozea/HypothesisTests.jl/blob/master/src/spearman.jl#L121

julia> x
Spearman's rank correlation test
--------------------------------
Population details:
    parameter of interest:   Spearman's ρ
    value under h_0:         0.0
    point estimate:          -0.7284221607028682

Error showing value of type HypothesisTests.SpearmanCorrelationTest:
ERROR: DomainError:
 in spearman_P_estimated at /home/diego/.julia/v0.4/HypothesisTests/src/spearman.jl:121
 in pvalue at /home/diego/.julia/v0.4/HypothesisTests/src/spearman.jl:166
 in show at /home/diego/.julia/v0.4/HypothesisTests/src/HypothesisTests.jl:98
 in anonymous at show.jl:1299
 in with_output_limit at ./show.jl:1276
 in showlimited at show.jl:1298
 in writemime at replutil.jl:4
 in display at REPL.jl:114
 in display at REPL.jl:117
 [inlined code] from multimedia.jl:151
 in display at multimedia.jl:163
 in print_response at REPL.jl:134
 in print_response at REPL.jl:121
 in anonymous at REPL.jl:624
 in run_interface at ./LineEdit.jl:1610
 in run_frontend at ./REPL.jl:863
 in run_repl at ./REPL.jl:167
 in _start at ./client.jl:420

julia> x.n
10153

julia> (x.n-1)
10152

julia> (x.n^2)
103083409

julia> ((x.n+1)^2)
103103716

julia> ( (x.n-1) * (x.n^2)* ((x.n+1)^2) )
-2782140239849997408

julia> ( (x.n-1) * (x.n^2)* ((x.n+1)^2) ) / 36
-7.72816733291666e16

julia> sqrt( ( (x.n-1) * (x.n^2)* ((x.n+1)^2) ) / 36 )
ERROR: DomainError:
sqrt will only return a complex result if called with a complex argument. Try sqrt(complex(x)).
 in sqrt at ./math.jl:144

@diegozea
Author

This gives a better result, but maybe it's not the best solution:

(((x.n+1)^2)/36) * (x.n^2) * (x.n-1)

@nalimilan
Member

Maybe just call sqrt(max(0, ...))?

@andreasnoack
Member

Maybe use nf = float(x.n) in the calculations or compute the log of the standard deviation.

@fiatflux

It could help a bit to pull the squared quantities outside the square root, (x.n)^2 and (x.n+1)^2 in particular. Then the argument to the square root grows as x.n rather than x.n^5... But Numerical Recipes might be doing something intentionally that I'm just not ready to handle. Actually, yeah, I think they might have organized the code that way to make use of integer arithmetic, which these days isn't as big an advantage.

@diegozea
Author

Clever ideas @andreasnoack and @fiatflux ! Since N > 0, sqrt((N-1)*N^2*(N+1)^2) is just sqrt(N-1)*N*(N+1). I will also use float, just in case, to help avoid integer overflow.
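
A minimal sketch of that factorization (illustrative only, not the exact code in the pull request): since sqrt((N-1)*N^2*(N+1)^2 / 36) == N*(N+1)*sqrt(N-1)/6 for N > 0, the factor used in the estimated-P computation can be evaluated in Float64 without ever forming the huge integer product.

function null_S_sd_factor(n::Integer)
    nf = float(n)                      # work in Float64 to sidestep Int overflow
    nf * (nf + 1) * sqrt(nf - 1) / 6   # == sqrt((n-1)*n^2*(n+1)^2 / 36)
end

null_S_sd_factor(10153)   # ≈ 1.73e9; the Int expression above overflowed at this n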

@diegozea
Author

diegozea commented May 24, 2016

Thanks! I updated the pull request to avoid Int overflow ;)

julia> x = rand(10153);

julia> y = cos(π*x + 0.5π);

julia> corr = SpearmanCorrelationTest(x,y)
Spearman's rank correlation test
--------------------------------
Population details:
    parameter of interest:   Spearman's ρ
    value under h_0:         0.0
    point estimate:          0.01719171284150201

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.08324014651846666 (not significant)

Details:
    Number of points:                    10153
    Spearman's ρ:                        0.01719171284150201
    S (Sum squared difference of ranks): 1.7143548239e11
    adjustment for ties in x:            0.0
    adjustment for ties in y:            0.0


julia> R"""cor.test($x,$y,method="spearman")"""
RCall.RObject{RCall.VecSxp}

    Spearman's rank correlation rho

data:  $x and $y
S = 1.7144e+11, p-value = 0.08324
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
0.01719171 

@coveralls

Coverage Status

Coverage decreased (-6.7%) to 83.538% when pulling 74ab395 on diegozea:master into 351b8d0 on JuliaStats:master.

@diegozea
Author

When there are no ties, the S values from R's cor.test and from this SpearmanCorrelationTest are the same, and R calculates them using equation (14.6.4) in Numerical Recipes. Here I compare the rho values obtained from our SpearmanCorrelationTest's S values via equation (14.6.5), the one used when there are ties, against the actual rho values. I use the following functions:

using StatsBase, HypothesisTests

function rho_with_ties(S,N,tx,ty)
    # Equation (14.6.5) from Numerical Recipes
    a=(N^3)-N
    (1-((6/a)*(S+(tx/12)+(ty/12)))) / (sqrt(1-(tx/a))*sqrt(1-(ty/a)))
end

function diff_rho(x,y)
    corr = SpearmanCorrelationTest(x, y)
    corr.ρ - rho_with_ties(corr.S, corr.n, corr.xtiesadj, corr.ytiesadj)
end 

describe(Float64[ diff_rho(rand(1:10,i), rand(1:10,i)) for i in 11:10000 ])
Summary Stats:
Mean:         -0.000000
Minimum:      -0.000000
1st Quartile: -0.000000
Median:       0.000000
3rd Quartile: 0.000000
Maximum:      0.000000

I found no differences between the real rho values and the values derived from our S values.

I will add this to tests...

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling 8d2e57d on diegozea:master into 351b8d0 on JuliaStats:master.

@diegozea
Author

Travis shows an error with FisherExactTest unrelated to this pull request:

Maybe tests for t.ω should use @test_approx_eq_eps...

ERROR: LoadError: LoadError: assertion failed: |t.ω - 7.3653226779913386| <= 8.881784197001252e-12
  t.ω = 7.365322678060777
  7.3653226779913386 = 7.3653226779913386
  difference = 6.943867703057549e-11 > 8.881784197001252e-12
...
while loading /home/travis/.julia/v0.4/HypothesisTests/test/fisher.jl, in expression starting on line 81

return( spearman_P_exact(x, tail) )
end
if method == :sampling
return( spearman_P_sampling(x, tail) )


I suggest removing the whitespace padding to be more consistent with other code in this repo

@diegozea
Author

@fiatflux I updated the code, removing the padding and extra spaces. I also set a seed for each random test ;)

end

let x = collect(1:10),
y = [5,4,3,2,1,6,10,9,8,7]

# R's pspearman: 0.05443067 is the exact value
corr = HypothesisTests.SpearmanCorrelationTest(x, y)
corr = HypothesisTests.SpearmanCorrelationTest(x,y)
@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.05443067 1e-8
end

# Using (N-1)N²(N+1)² overflows with N = 10153


Do we still get overflow for N > 10153?

Author


No, it's working fine now ;)

julia> L=10153; N = 10L; x=rand(1:L, N); y=rand(1:L, N);

julia> R"""cor.test($x,$y,method="spearman",exact=FALSE)"""
RCall.RObject{RCall.VecSxp}

    Spearman's rank correlation rho

data:  $x and $y
S = 1.7465e+14, p-value = 0.6973
alternative hypothesis: true rho is not equal to 0
sample estimates:
         rho 
-0.001220803 



julia> SpearmanCorrelationTest(x,y)
Spearman's rank correlation test
--------------------------------
Population details:
    parameter of interest:   Spearman's ρ
    value under h_0:         0.0
    point estimate:          -0.0012208025475993133

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.6972821892308928 (not significant)

Details:
    Number of points:                    101530
    Spearman's ρ:                        -0.0012208025475993133
    S (Sum squared difference of ranks): 1.746472562199395e14
    adjustment for ties in x:            1.3199436e7
    adjustment for ties in y:            1.3259556e7


Great! Might as well remove reference to it here.

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling 5f366e3 on diegozea:master into 351b8d0 on JuliaStats:master.

@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.05443067 1e-8
end

# Using (N-1)N²(N+1)² overflows with N = 10153
srand(12345) # Seed for rand


It should be fine to set the seed just once at the top of the test, but there's no harm doing it here too.

Author


I put one srand per test to avoid the hypothetical case where another rand call, introduced later between a global srand and the test, changes the results.


@fiatflux fiatflux May 25, 2016


That's fair. I'm generally much less concerned with a test breaking as a result of a change than with the possibility that the same test on the same codebase fails randomly when run repeatedly. No big deal either way.

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling 4d826aa on diegozea:master into 351b8d0 on JuliaStats:master.

@@ -43,37 +43,37 @@ immutable SpearmanCorrelationTest <: CorrelationTest
# Spearman's ρ
ρ::Float64

function SpearmanCorrelationTest(x,y)
function SpearmanCorrelationTest(x, y)

n = length(x)
(n != length(y)) && throw(ErrorException("x and y must have the same length"))


Is there a reason this doesn't use the @assert macro? A la
@assert n==length(y) "Variables x and y must have the same length"

Author


I didn't know I could set the message of @assert
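
For reference, the suggested check with a custom message might look like the sketch below (constructor shape taken from the diff). One caveat from the Julia documentation: @assert throws an AssertionError rather than an ErrorException, and asserts may be disabled at certain optimization levels, so an explicit check can still be preferable for validating user input.

function SpearmanCorrelationTest(x, y)
    n = length(x)
    @assert n == length(y) "x and y must have the same length"
    # ... rest of the constructor unchanged ...
end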

println(io, ident, "adjustment for ties in y: ", x.ytiesadj)
end

function P_from_null_S_values(S_null, x::SpearmanCorrelationTest, tail)


Is there a reason not to explicitly identify the tail parameter as having type Symbol?

Author


Not really...


Then I'd suggest marking it explicitly, both as an (admittedly trivial) compiler hint and as (less trivial) documentation/security for users of the library.
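
Concretely, the suggested change would look like this (signature taken from the diff context above; the body is unchanged). The annotation acts mainly as documentation and as an early MethodError for wrong argument types.

function P_from_null_S_values(S_null, x::SpearmanCorrelationTest, tail::Symbol)
    # ... body unchanged ...
end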

Author


It's done :D

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling 3ec64c7 on diegozea:master into 351b8d0 on JuliaStats:master.

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling c73345b on diegozea:master into 351b8d0 on JuliaStats:master.

@fiatflux

Looking pretty good to me. The only thing that concerns me now is that Coveralls seems to think left-tailed tests are not exercised in the unit tests. What's going on there?

@diegozea
Author

@fiatflux I added more tests to improve test coverage and I fixed bugs that I found in the process.

elseif method == :estimated
return(spearman_P_estimated(x, tail))
elseif method == :ttest
return(spearman_P_ttest(x, tail))
else

@fiatflux fiatflux May 26, 2016


Tiny style thing:
if foo
    foobar()
else
    if bar
        barbar()
    elseif baz
        bazbar()
    end
end

can be simplified to

if foo
    foobar()
elseif bar
    barbar()
elseif baz
    bazbar()
end

Author


I wrote it that way to make explicit that with 10 or fewer elements the exact P value is always calculated. Different methods are selected only if N > 10. The error for N <= 10 is only there for a consistent API.
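
A rough sketch of the structure being described (method names taken from the diff; this is not the PR's actual code, and how invalid method symbols are handled for small N is an assumption here):

function pvalue(x::SpearmanCorrelationTest; tail::Symbol=:both, method::Symbol=:exact)
    if x.n <= 10 || method == :exact
        # with 10 or fewer points the exact P value is always computed,
        # regardless of the requested method
        return spearman_P_exact(x, tail)
    end
    if method == :sampling
        return spearman_P_sampling(x, tail)
    elseif method == :estimated
        return spearman_P_estimated(x, tail)
    elseif method == :ttest
        return spearman_P_ttest(x, tail)
    else
        throw(ArgumentError("method must be :exact, :sampling, :estimated or :ttest"))
    end
end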

@fiatflux

Thanks for writing this @diegozea !

@diegozea
Author

You're welcome! :)

df = x.n-2
t = sqrt((df*ρ2)/(1-ρ2))
if tail == :both
cdf(TDist(df), -t) + ccdf(Normal(), t)


Is this supposed to be Normal or TDist(df) in the second term?
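
For comparison, a symmetric two-sided version that uses the t reference distribution for both tails would be the usual one below (a sketch assuming Distributions.jl; ρ is Spearman's rho and n the number of points, so it is not tied to the PR's internal fields):

using Distributions

function spearman_ttest_p(ρ::Float64, n::Int)
    df = n - 2
    t  = sqrt(df * ρ^2 / (1 - ρ^2))           # same statistic as in the diff above
    cdf(TDist(df), -t) + ccdf(TDist(df), t)   # == 2 * ccdf(TDist(df), t) by symmetry
end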

@BenjaminBorn
Contributor

What's the status of this PR? Would be great to have Spearman test in the package.

@nico202

nico202 commented Oct 9, 2019

Nothing new to add here; I just wanted to know the state of this PR and whether there's any chance it will be merged in the future.

Thanks!

@clementpoiret

Hi, any news on merging this PR?

@nalimilan
Member

Somebody needs to rebase the PR against current master and update it to run on Julia 1.0 and above.
