Spearman Test (this needs review) #53

Open
diegozea wants to merge 10 commits into JuliaStats:master from diegozea:master
Conversation

diegozea

Here I'm adding the Spearman correlation test, with some alternatives for calculating P values. I think it's correct, even though the S values differ from R's S values.

> x <- c(10, 10, 20, 30, 30)
> y <- c(1,2,3,4,5)
> cor.test(x, y, method="spearman")

    Spearman's rank correlation rho

data:  x and y
S = 1.0263, p-value = 0.01385
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.9486833 

Warning message:
In cor.test.default(x, y, method = "spearman") :
  Cannot compute exact p-value with ties
> (5^3 - 5) * (1-0.9486833)/6
[1] 1.026334
> sum((rank(x) - rank(y))^2) # Used in my implementation
[1] 1
> 

With fewer than 10 points, my implementation calculates all the possible S values from the ranks (midranks in the case of ties) to compute the exact P value; a rough sketch of that idea follows the output below. This looks better to me than the S used by R, which does not take ties into account:

julia> x = [10,10,20,30,30];

julia> y = collect(1:5);

julia> HypothesisTests.SpearmanCorrelationTest(x, y)
Spearman's rank correlation test
--------------------------------
Population details:
    parameter of interest:   Spearman's ρ
    value under h_0:         0.0
    point estimate:          0.9486832980505138

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.06666666666666667 (not significant)

Details:
    Number of points:                    5
    Spearman's ρ:                        0.9486832980505138
    S (Sum squared difference of ranks): 1.0
    adjustment for ties in x:            12.0
    adjustment for ties in y:            0.0
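
As a rough sketch of the enumeration idea described above (this is not the PR's implementation): take the midranks of both vectors, enumerate every permutation of y's ranks to build the null distribution of S, and double the smaller tail for the two-sided P value (the doubling rule is one common choice; the PR may use a different one). tiedrank comes from StatsBase; permutations is in Combinatorics (it was in Base on Julia 0.4).

using StatsBase       # tiedrank gives midranks in the presence of ties
using Combinatorics   # permutations

function exact_spearman_p(x, y)
    rx, ry = tiedrank(x), tiedrank(y)
    Sobs  = sum((rx - ry).^2)                              # observed S
    Snull = [sum((rx - p).^2) for p in permutations(ry)]   # S under the null
    lower = count(s -> s <= Sobs, Snull) / length(Snull)
    upper = count(s -> s >= Sobs, Snull) / length(Snull)
    min(1.0, 2 * min(lower, upper))                        # two-sided P value
end

exact_spearman_p([10, 10, 20, 30, 30], collect(1:5))   # ≈ 0.0667, as in the output above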

Best,

@coveralls

Coverage Status

Coverage decreased (-6.7%) to 83.521% when pulling 5a9c7e5 on diegozea:master into 351b8d0 on JuliaStats:master.

@diegozea
Author

I ran all the methods in this pull request, and I measured the differences against R's P values with the following function:

using DataFrames, RCall, HypothesisTests

function P_test(N)
    data = DataFrame(
        ties = DataArray(Bool, N),
        P_R  =  DataArray(Float64, N),
        P_R_exact  = DataArray(Float64, N),
        P_jl_exact =  DataArray(Float64, N),
        P_jl_sampling  =  DataArray(Float64, N),
        P_jl_estimated =  DataArray(Float64, N),
        P_jl_ttest =  DataArray(Float64, N)
    )
    y = 1:11
    for row in eachrow(data)
        x = vcat([1, 2], rand(1:40, 7), [39, 40])
        row[:ties] = length(unique(x)) != 11
        row[:P_R] = rcopy(R"""cor.test($x, $y, method="spearman", exact=FALSE)""")[symbol("p.value")]
        row[:P_R_exact] = rcopy(R"""cor.test($x, $y, method="spearman", exact=TRUE)""")[symbol("p.value")]
        corr = SpearmanCorrelationTest(x, y)
        row[:P_jl_exact] = pvalue(corr, method=:exact)
        row[:P_jl_sampling] = pvalue(corr, method=:sampling)
        row[:P_jl_estimated] = pvalue(corr, method=:estimated)
        row[:P_jl_ttest] = pvalue(corr, method=:ttest)
    end
    data    
end
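
A hypothetical way to summarize the results, mirroring the tables below (0.4-era DataFrames/DataArrays assumed, so dropna is used to get a plain vector; the column names are the ones defined in P_test):

data  = P_test(500)                              # 500 random samples
diffs = data[:P_R_exact] - data[:P_jl_exact]     # corresponds to the "R (TRUE)" column of the last table
quantile(dropna(diffs), [0.0, 0.05, 0.25, 0.5, 0.75, 0.95, 1.0])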

From 110 samples without ties, this is the distribution of P - R's P with exact=TRUE:

         R (FALSE)    exact         sampling     estimated     ttest
min     -0.0051819   -0.000650414  -0.00289196   0.000254004  -0.0193903 
25%     -0.00449384  -0.000615007  -0.00130036   0.00464081   -0.0117049 
median  -0.00330545  -0.000446603  -0.000583082  0.00584124   -0.00690082
75%     -0.00165133  -0.000137737  -0.000265782  0.00641158   -0.00268308
max      6.66145e-5   0.000216525   0.000780357  0.00654717    3.33073e-5

From 390 samples with ties, this is the distribution of P - R's P with exact=FALSE:

         exact        sampling      estimated    ttest
min     -0.00123275  -0.00521559   -0.0179128   -0.0172322 
25%      0.00116759   0.000986073   0.00480537  -0.0114892
median   0.00277293   0.00224276    0.0081975   -0.00560353
75%      0.00389003   0.00330739    0.00975644  -0.00131509
max      0.00590762   0.00567053    0.010392    -5.16563e-9

Comparing all 500 samples (with and without ties) against the exact P value from this pull request, this is the distribution of P - Julia's exact P value:

     R (FALSE)    R (TRUE)      sampling      estimated    ttest
min -0.00590762  -0.00590762   -0.00677725   -0.0166801   -0.0226378  
5%  -0.00472031  -0.00434692   -0.00162825   -0.00377428  -0.0193502  
25% -0.00394747  -0.00364021   -0.000527372   0.00356583  -0.0149684  
50% -0.00275482  -0.0016838    -0.000109277   0.00581605  -0.00769338 
75% -0.00117023  -8.76156e-5    1.60013e-5    0.00702659  -0.0024413  
95% -8.8607e-5    0.000615007   0.000294464   0.00738872  -0.000154633
max  0.00123275   0.00123275    0.0011114     0.00759521  -7.96502e-7

:estimated is a good trade-off between speed and accuracy. :estimated and :ttest should do better with more points (I used 11 points so that I could calculate S for all the permutations and get an exact P value).

@diegozea
Author

I guess I'm getting Int overflow here... How can I solve it? The problem is on this line: https://github.com/diegozea/HypothesisTests.jl/blob/master/src/spearman.jl#L121

julia> x
Spearman's rank correlation test
--------------------------------
Population details:
    parameter of interest:   Spearman's ρ
    value under h_0:         0.0
    point estimate:          -0.7284221607028682

Error showing value of type HypothesisTests.SpearmanCorrelationTest:
ERROR: DomainError:
 in spearman_P_estimated at /home/diego/.julia/v0.4/HypothesisTests/src/spearman.jl:121
 in pvalue at /home/diego/.julia/v0.4/HypothesisTests/src/spearman.jl:166
 in show at /home/diego/.julia/v0.4/HypothesisTests/src/HypothesisTests.jl:98
 in anonymous at show.jl:1299
 in with_output_limit at ./show.jl:1276
 in showlimited at show.jl:1298
 in writemime at replutil.jl:4
 in display at REPL.jl:114
 in display at REPL.jl:117
 [inlined code] from multimedia.jl:151
 in display at multimedia.jl:163
 in print_response at REPL.jl:134
 in print_response at REPL.jl:121
 in anonymous at REPL.jl:624
 in run_interface at ./LineEdit.jl:1610
 in run_frontend at ./REPL.jl:863
 in run_repl at ./REPL.jl:167
 in _start at ./client.jl:420

julia> x.n
10153

julia> (x.n-1)
10152

julia> (x.n^2)
103083409

julia> ((x.n+1)^2)
103103716

julia> ( (x.n-1) * (x.n^2)* ((x.n+1)^2) )
-2782140239849997408

julia> ( (x.n-1) * (x.n^2)* ((x.n+1)^2) ) / 36
-7.72816733291666e16

julia> sqrt( ( (x.n-1) * (x.n^2)* ((x.n+1)^2) ) / 36 )
ERROR: DomainError:
sqrt will only return a complex result if called with a complex argument. Try sqrt(complex(x)).
 in sqrt at ./math.jl:144

@diegozea
Author

This gives a better result, but maybe it's not the best solution:

(((x.n+1)^2)/36) * (x.n^2) * (x.n-1)

@nalimilan
Member

Maybe just call sqrt(max(0, ...))?

@andreasnoack
Member

Maybe use nf = float(x.n) in the calculations or compute the log of the standard deviation.

@fiatflux

It could help a bit to pull the squared quantities outside the square root, (x.n)^2 and (x.n+1)^2 in particular. Then the argument to the square root grows as x.n rather than x.n^5... But Numerical Recipes might be doing something intentionally that I'm just not ready to handle. Actually, yeah, I think they might have organized the code that way to make use of integer arithmetic, which these days isn't as big an advantage.

@diegozea
Author

Clever ideas @andreasnoack and @fiatflux ! Since N > 0, sqrt((N-1)*N^2*(N+1)^2) is just sqrt(N-1)*N*(N+1). I will also use float, just in case, to help avoid integer overflow.
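
A minimal sketch of that factorization (illustrative only, not the exact code in the pull request): since sqrt((N-1)*N^2*(N+1)^2 / 36) == N*(N+1)*sqrt(N-1)/6 for N > 0, the factor used in the estimated-P computation can be evaluated in Float64 without ever forming the huge integer product.

function null_S_sd_factor(n::Integer)
    nf = float(n)                      # work in Float64 to sidestep Int overflow
    nf * (nf + 1) * sqrt(nf - 1) / 6   # == sqrt((n-1)*n^2*(n+1)^2 / 36)
end

null_S_sd_factor(10153)   # ≈ 1.73e9; the Int expression above overflowed at this n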

@diegozea
Author

diegozea commented May 24, 2016

Thanks! I updated the pull request to avoid Int overflow ;)

julia> x = rand(10153);

julia> y = cos(π*x + 0.5π);

julia> corr = SpearmanCorrelationTest(x,y)
Spearman's rank correlation test
--------------------------------
Population details:
    parameter of interest:   Spearman's ρ
    value under h_0:         0.0
    point estimate:          0.01719171284150201

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.08324014651846666 (not significant)

Details:
    Number of points:                    10153
    Spearman's ρ:                        0.01719171284150201
    S (Sum squared difference of ranks): 1.7143548239e11
    adjustment for ties in x:            0.0
    adjustment for ties in y:            0.0


julia> R"""cor.test($x,$y,method="spearman")"""
RCall.RObject{RCall.VecSxp}

    Spearman's rank correlation rho

data:  $x and $y
S = 1.7144e+11, p-value = 0.08324
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
0.01719171 

@coveralls

Coverage Status

Coverage decreased (-6.7%) to 83.538% when pulling 74ab395 on diegozea:master into 351b8d0 on JuliaStats:master.

@diegozea
Author

When there are no ties, the S values from R's cor.test and from this SpearmanCorrelationTest are the same, and R calculates them using equation (14.6.4) in Numerical Recipes. Here I compare the rho values obtained from our SpearmanCorrelationTest's S values via equation (14.6.5), the one used when there are ties, against the actual rho values. I use the following functions:

using StatsBase, HypothesisTests

function rho_with_ties(S,N,tx,ty)
    # Equation (14.6.5) from Numerical Recipes
    a=(N^3)-N
    (1-((6/a)*(S+(tx/12)+(ty/12)))) / (sqrt(1-(tx/a))*sqrt(1-(ty/a)))
end

function diff_rho(x,y)
    corr = SpearmanCorrelationTest(x, y)
    corr.ρ - rho_with_ties(corr.S, corr.n, corr.xtiesadj, corr.ytiesadj)
end 

describe(Float64[ diff_rho(rand(1:10,i), rand(1:10,i)) for i in 11:10000 ])
Summary Stats:
Mean:         -0.000000
Minimum:      -0.000000
1st Quartile: -0.000000
Median:       0.000000
3rd Quartile: 0.000000
Maximum:      0.000000

I found no differences between the real rho values and the values derived from our S values.

I will add this to tests...

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling 8d2e57d on diegozea:master into 351b8d0 on JuliaStats:master.

@diegozea
Author

Travis shows an error with FisherExactTest unrelated to this pull request:

Maybe tests for t.ω should use @test_approx_eq_eps...

ERROR: LoadError: LoadError: assertion failed: |t.ω - 7.3653226779913386| <= 8.881784197001252e-12
  t.ω = 7.365322678060777
  7.3653226779913386 = 7.3653226779913386
  difference = 6.943867703057549e-11 > 8.881784197001252e-12
...
while loading /home/travis/.julia/v0.4/HypothesisTests/test/fisher.jl, in expression starting on line 81

return( spearman_P_exact(x, tail) )
end
if method == :sampling
return( spearman_P_sampling(x, tail) )


I suggest removing the whitespace padding to be more consistent with other code in this repo

@diegozea
Author

@fiatflux I updated the code, removing the padding and extra spaces. I also set a seed for each random test ;)

end

let x = collect(1:10),
y = [5,4,3,2,1,6,10,9,8,7]

# R's pspearman: 0.05443067 is the exact value
corr = HypothesisTests.SpearmanCorrelationTest(x, y)
corr = HypothesisTests.SpearmanCorrelationTest(x,y)
@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.05443067 1e-8
end

# Using (N-1)N²(N+1)² overflows with N = 10153


Do we still get overflow for N > 10153?

Author


No, it's working fine now ;)

julia> L=10153; N = 10L; x=rand(1:L, N); y=rand(1:L, N);

julia> R"""cor.test($x,$y,method="spearman",exact=FALSE)"""
RCall.RObject{RCall.VecSxp}

    Spearman's rank correlation rho

data:  $x and $y
S = 1.7465e+14, p-value = 0.6973
alternative hypothesis: true rho is not equal to 0
sample estimates:
         rho 
-0.001220803 



julia> SpearmanCorrelationTest(x,y)
Spearman's rank correlation test
--------------------------------
Population details:
    parameter of interest:   Spearman's ρ
    value under h_0:         0.0
    point estimate:          -0.0012208025475993133

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.6972821892308928 (not significant)

Details:
    Number of points:                    101530
    Spearman's ρ:                        -0.0012208025475993133
    S (Sum squared difference of ranks): 1.746472562199395e14
    adjustment for ties in x:            1.3199436e7
    adjustment for ties in y:            1.3259556e7


Great! Might as well remove reference to it here.

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling 5f366e3 on diegozea:master into 351b8d0 on JuliaStats:master.

@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.05443067 1e-8
end

# Using (N-1)N²(N+1)² overflows with N = 10153
srand(12345) # Seed for rand


It should be fine to set the seed just once at the top of the test, but there's no harm doing it here too.

Author


I put one srand per test to avoid the hypothetical case where another rand call, introduced later between a global srand and the test, changes the results.


@fiatflux fiatflux May 25, 2016


That's fair. I'm generally much less concerned with a test breaking as a result of a change than with the possibility that the same test on the same codebase fails randomly when run repeatedly. No big deal either way.

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling 4d826aa on diegozea:master into 351b8d0 on JuliaStats:master.

@@ -43,37 +43,37 @@ immutable SpearmanCorrelationTest <: CorrelationTest
# Spearman's ρ
ρ::Float64

function SpearmanCorrelationTest(x,y)
function SpearmanCorrelationTest(x, y)

n = length(x)
(n != length(y)) && throw(ErrorException("x and y must have the same length"))


Is there a reason this doesn't use the @assert macro? A la
@assert n==length(y) "Variables x and y must have the same length"

Author


I didn't know I could set the message of @assert
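
For reference, the suggested check with a custom message might look like the sketch below (constructor shape taken from the diff). One caveat from the Julia documentation: @assert throws an AssertionError rather than an ErrorException, and asserts may be disabled at certain optimization levels, so an explicit check can still be preferable for validating user input.

function SpearmanCorrelationTest(x, y)
    n = length(x)
    @assert n == length(y) "x and y must have the same length"
    # ... rest of the constructor unchanged ...
end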

println(io, ident, "adjustment for ties in y: ", x.ytiesadj)
end

function P_from_null_S_values(S_null, x::SpearmanCorrelationTest, tail)


Is there a reason not to explicitly identify the tail parameter as having type Symbol?

Author


Not really...


Then I'd suggest marking it explicitly, both as an (admittedly trivial) compiler hint and as (less trivial) documentation/security for users of the library.
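
Concretely, the suggested change would look like this (signature taken from the diff context above; the body is unchanged). The annotation acts mainly as documentation and as an early MethodError for wrong argument types.

function P_from_null_S_values(S_null, x::SpearmanCorrelationTest, tail::Symbol)
    # ... body unchanged ...
end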

Author


It's done :D

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling 3ec64c7 on diegozea:master into 351b8d0 on JuliaStats:master.

@coveralls

Coverage Status

Coverage decreased (-6.6%) to 83.64% when pulling c73345b on diegozea:master into 351b8d0 on JuliaStats:master.

@fiatflux

Looking pretty good to me. The only thing that concerns me now is that Coveralls seems to think left-tailed tests are not exercised in the unit tests. What's going on there?

@diegozea
Author

@fiatflux I added more tests to improve test coverage and I fixed bugs that I found in the process.

elseif method == :estimated
return(spearman_P_estimated(x, tail))
elseif method == :ttest
return(spearman_P_ttest(x, tail))
else

@fiatflux fiatflux May 26, 2016


Tiny style thing:
if foo
    foobar()
else
    if bar
        barbar()
    elseif baz
        bazbar()
    end
end

can be simplified to

if foo
    foobar()
elseif bar
    barbar()
elseif baz
    bazbar()
end

Author


I wrote it that way to make explicit that with 10 or fewer elements the exact P value is always calculated. Different methods are selected only if N > 10. The error for N <= 10 is only there for a consistent API.
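
A rough sketch of the structure being described (method names taken from the diff; this is not the PR's actual code, and how invalid method symbols are handled for small N is an assumption here):

function pvalue(x::SpearmanCorrelationTest; tail::Symbol=:both, method::Symbol=:exact)
    if x.n <= 10 || method == :exact
        # with 10 or fewer points the exact P value is always computed,
        # regardless of the requested method
        return spearman_P_exact(x, tail)
    end
    if method == :sampling
        return spearman_P_sampling(x, tail)
    elseif method == :estimated
        return spearman_P_estimated(x, tail)
    elseif method == :ttest
        return spearman_P_ttest(x, tail)
    else
        throw(ArgumentError("method must be :exact, :sampling, :estimated or :ttest"))
    end
end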

@fiatflux

Thanks for writing this @diegozea !

@diegozea
Author

You're welcome! :)

df = x.n-2
t = sqrt((df*ρ2)/(1-ρ2))
if tail == :both
cdf(TDist(df), -t) + ccdf(Normal(), t)


Is this supposed to be Normal or TDist(df) in the second term?
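
For comparison, a symmetric two-sided version that uses the t reference distribution for both tails would be the usual one below (a sketch assuming Distributions.jl; ρ is Spearman's rho and n the number of points, so it is not tied to the PR's internal fields):

using Distributions

function spearman_ttest_p(ρ::Float64, n::Int)
    df = n - 2
    t  = sqrt(df * ρ^2 / (1 - ρ^2))           # same statistic as in the diff above
    cdf(TDist(df), -t) + ccdf(TDist(df), t)   # == 2 * ccdf(TDist(df), t) by symmetry
end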

@BenjaminBorn
Contributor

What's the status of this PR? Would be great to have Spearman test in the package.

@nico202

nico202 commented Oct 9, 2019

Nothing new to add here; I just wanted to know the state of this PR and whether there's any chance it will be merged in the future.

Thanks!

@clementpoiret

Hi, any news on merging this PR?

@nalimilan
Member

Somebody needs to rebase the PR against current master and update it to run on Julia 1.0 and above.
