Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spearman Test (this needs review) #53

Open
wants to merge 10 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
5 changes: 3 additions & 2 deletions src/HypothesisTests.jl
Expand Up @@ -37,7 +37,7 @@ check_same_length(x::Vector, y::Vector) = if length(x) != length(y)
end

# Basic function for finding a p-value given a distribution and tail
pvalue(dist::ContinuousUnivariateDistribution, x::Number; tail=:both) =
pvalue(dist::ContinuousUnivariateDistribution, x::Number; tail=:both) =
if tail == :both
min(2 * min(cdf(dist, x), ccdf(dist, x)), 1.0)
elseif tail == :left
Expand All @@ -48,7 +48,7 @@ pvalue(dist::ContinuousUnivariateDistribution, x::Number; tail=:both) =
throw(ArgumentError("tail=$(tail) is invalid"))
end

pvalue(dist::DiscreteUnivariateDistribution, x::Number; tail=:both) =
pvalue(dist::DiscreteUnivariateDistribution, x::Number; tail=:both) =
if tail == :both
min(2 * min(ccdf(dist, x-1), cdf(dist, x)), 1.0)
elseif tail == :left
Expand Down Expand Up @@ -138,4 +138,5 @@ include("t.jl")
include("wilcoxon.jl")
include("power_divergence.jl")
include("anderson_darling.jl")
include("spearman.jl")
end
169 changes: 169 additions & 0 deletions src/spearman.jl
@@ -0,0 +1,169 @@
# spearman.jl
# Spearman's rank correlation test in Julia
#
# Copyright (C) 2016 Diego Javier Zea
#
# Permission is hereby granted, free of charge, to any person obtaining
# a copy of this software and associated documentation files (the
# "Software"), to deal in the Software without restriction, including
# without limitation the rights to use, copy, modify, merge, publish,
# distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to
# the following conditions:
#
# The above copyright notice and this permission notice shall be
# included in all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
# EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
# MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
# NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
# WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

export CorrelationTest, SpearmanCorrelationTest

abstract CorrelationTest <: HypothesisTest

"Sum squared difference of ranks (midranks for ties)"
spearman_S(xrank, yrank) = sumabs2(xrank .- yrank)

immutable SpearmanCorrelationTest <: CorrelationTest
# Tied ranking for x and y
xrank::Vector{Float64}
yrank::Vector{Float64}
# Adjustment for ties
xtiesadj::Float64
ytiesadj::Float64
# Sum squared difference of ranks
S::Float64
# Number of points
n::Int
# Spearman's ρ
ρ::Float64

function SpearmanCorrelationTest(x, y)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is getting reeeeaallly nitpicky, but this function block has extra newlines compared with the general style of the rest of the codebase (I think).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, I like to se white spaces in code so I'm biased to the space paddings and extra newlines XD

n = length(x)
(n != length(y)) && throw(ErrorException("x and y must have the same length"))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this doesn't use the @assert macro? A la
@assert n==length(y) "Variables x and y must have the same length"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know I can set the message of @Assert


xrank, xtiesadj = HypothesisTests.tiedrank_adj(x)
yrank, ytiesadj = HypothesisTests.tiedrank_adj(y)

S = spearman_S(xrank, yrank)

ρ = corspearman(x, y)

new(xrank, yrank, xtiesadj, ytiesadj, S, n, ρ)
end

end

testname(::SpearmanCorrelationTest) = "Spearman's rank correlation test"

# parameter of interest: name, value under h0, point estimate
population_param_of_interest(x::SpearmanCorrelationTest) = ("Spearman's ρ", 0.0, x.ρ)

function show_params(io::IO, x::SpearmanCorrelationTest, ident)
println(io, ident, "Number of points: ", x.n)
println(io, ident, "Spearman's ρ: ", x.ρ)
println(io, ident, "S (Sum squared difference of ranks): ", x.S)
println(io, ident, "adjustment for ties in x: ", x.xtiesadj)
println(io, ident, "adjustment for ties in y: ", x.ytiesadj)
end

function P_from_null_S_values(S_null, x::SpearmanCorrelationTest, tail)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to explicitly identify the tail parameter as having type Symbol?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I'd suggest marking it explicitly, both as a (admittedly trivial) compiler hint and (less trivial) documentation/security for users of the library.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's done :D

S_null_mean = mean(S_null)
# S is approximately normally distributed
# S and ρ are inversely proportional
S_null[:] = S_null .- S_null_mean # center
S_centered = x.S - S_null_mean # center
if tail == :both
modS = abs(S_centered)
mean(S_null .<= -modS) + mean(S_null .>= modS)
elseif tail == :right
mean(S_null .<= S_centered)
elseif tail == :left
mean(S_null .>= S_centered)
else
throw(ArgumentError("tail=$(tail) is invalid"))
end
end

function spearman_P_exact(x::SpearmanCorrelationTest, tail)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You very well may have reason to follow this naming convention, but I think I should note that other tests rely on multiple-dispatch to avoid needing to use a function name that includes the name of the test. For example, in fisher.jl, there's pvalue(x::FisherExactTest; tail=:both, method=:central). Here we could call it, say, pvalue(x::SpearmanCorrelationTest, tail).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, spearman_P_exact is private/unexported function. The final user use: pvalue(corr, tail=:both, method=:exact)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad

S_null = Float64[ spearman_S(perm, x.yrank) for perm in permutations(x.xrank) ]
P_from_null_S_values(S_null, x, tail)
end

function spearman_P_sampling(x::SpearmanCorrelationTest, tail)
# 360000 samples gives an se(P) < 0.0005 for P < 0.1
X = copy(x.xrank)
S_null = Float64[ spearman_S(shuffle!(X), x.yrank) for sample in 1:360000 ]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nondeterministic test. Even though failure may be rare, it's still not very fun to have nondeterministic tests complain. You should be able to make this test deterministic by just calling srand(12345) before using anything random. It would be super slick to compare this implementation to the R implementation by using an identical source of random numbers (and shuffles/samples based on them, etc), but I don't know offhand how to do that... and it is probably not worth the effort.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fiatflux How can I set the seed only for this function, but not for all the Julia session? I don't like the idea of having a seed fixed as side effect.
R's coin package also gives and approximate non deterministic P value when approximate is used (Monte-Carlo resampling):

> library("coin")
> x <- rnorm(100)
> y <- sqrt(abs(sin((x/3.14)+(0.6*x))))
> spearman_test(x ~ y, distribution=approximate(B=9999))

    Approximative Spearman Correlation Test

data:  x by y
Z = -0.37602, p-value = 0.7037
alternative hypothesis: true rho is not equal to 0

> spearman_test(x ~ y, distribution=approximate(B=9999))

    Approximative Spearman Correlation Test

data:  x by y
Z = -0.37602, p-value = 0.7144
alternative hypothesis: true rho is not equal to 0

> spearman_test(x ~ y, distribution=approximate(B=9999))

    Approximative Spearman Correlation Test

data:  x by y
Z = -0.37602, p-value = 0.7108
alternative hypothesis: true rho is not equal to 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. And sorry, I meant to put this comment in test/...

My suggestion would be to call srand() at the end of the unit test to make sure the fixed seed doesn't propagate beyond the test. Probably best to put srand() inside a finally clause, too. Anyone else have thoughts on that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's a problem to set the seed for a whole test session. Other uses of rand should do the same anyway to prevent random failures, so they shouldn't be affected.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just an idea ... at risk of being one of those obnoxious people who stomps around town beating their chest and shouting "TESTABILITY!!!1" ... it might be good practice to take a PRNG object as an optional parameter to nondeterministic functions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok... I will add the srand call to tests ;)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this line still has extra padding.

P_from_null_S_values(S_null, x, tail)
end

# Use estimated mean and std for the S null distribution as in:
#
# Press WH, Teukolsky SA, Vetterling WT, Flannery BP.
# Numerical recipes in C.
# Cambridge: Cambridge university press; 1996.
function spearman_P_estimated(x::SpearmanCorrelationTest, tail)
N = float(x.n)
a = (N^3 - N)
# Numerical Recipes (14.6.6)
S_null_mean = (a/6.) - (x.xtiesadj/12.) - (x.ytiesadj/12.)
# Numerical Recipes (14.6.7)
S_null_std = sqrt( ((N - 1.)/36.) * (1. - (x.xtiesadj/a) ) * (1. - (x.ytiesadj/a) ) ) * N * (N + 1.)
zscore = (x.S - S_null_mean)/S_null_std
# S is approximately normally distributed
# S and ρ are inversely proportional
if tail == :both
cdf(Normal(), -abs(zscore)) + ccdf(Normal(), abs(zscore))
elseif tail == :right
cdf(Normal(), zscore)
elseif tail == :left
ccdf(Normal(), zscore)
else
throw(ArgumentError("tail=$(tail) is invalid"))
end
end

# Using T test for n > 10 as in:
#
# McDonald JH.
# Handbook of biological statistics.
# Baltimore, MD: Sparky House Publishing; 2009 Aug.
function spearman_P_ttest(x::SpearmanCorrelationTest, tail)
ρ2 = x.ρ^2
df = x.n-2
t = sqrt((df*ρ2)/(1-ρ2))
if tail == :both
cdf(TDist(df), -t) + ccdf(Normal(), t)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be Normal or TDist(df) in the second term?

elseif tail == :right
ccdf(TDist(df), t)
elseif tail == :left
cdf(TDist(df), t)
else
throw(ArgumentError("tail=$(tail) is invalid"))
end
end

function pvalue(x::SpearmanCorrelationTest; tail=:both, method=:estimated)
if x.n <= 10
# Exact P value using permutations
return( spearman_P_exact(x, tail) )
end
if method == :sampling
return( spearman_P_sampling(x, tail) )

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest removing the whitespace padding to be more consistent with other code in this repo

elseif method == :exact
return( spearman_P_exact(x, tail) )
elseif method == :estimated
return( spearman_P_estimated(x, tail) )
elseif method == :ttest
return( spearman_P_ttest(x, tail) )
else
Copy link

@fiatflux fiatflux May 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny style thing:
if foo foobar() else if bar barbar() elif baz bazbar() end end

can be simplified to
if foo foobar() elif bar barbar() elif baz bazbar() end

(Sorry, I don't know why github is smashing that onto one line.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote it in that way to make explicit that with 10 or less elements, always the exact P value is calculated. Different methods are selected only if N > 10. The error with N<=10 is only for a consistent API.

throw(ArgumentError("method=$(method) is invalid"))
end
end

1 change: 1 addition & 0 deletions test/runtests.jl
Expand Up @@ -9,4 +9,5 @@ include("t.jl")
include("wilcoxon.jl")
include("power_divergence.jl")
include("anderson_darling.jl")
include("spearman.jl")
include("common.jl")
67 changes: 67 additions & 0 deletions test/spearman.jl
@@ -0,0 +1,67 @@
using HypothesisTests, Base.Test

# Test Exact P value: n <= 10
let x = [ 44.4, 45.9, 41.9, 53.3, 44.7, 44.1 ],
y = [ 2.6, 3.1, 2.5, 5.0, 3.6, 4.0 ]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespace padding inconsistent with other style in this repo


corr = HypothesisTests.SpearmanCorrelationTest(x, y)

# R values
@test_approx_eq corr.ρ 0.6
@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.2417 0.0001
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right) 0.1208 0.0001
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:left) 0.9125 0.0001
end

show(IOBuffer(),
HypothesisTests.SpearmanCorrelationTest([ 44.4, 45.9, 41.9, 53.3, 44.7, 44.1 ],
[ 2.6, 3.1, 2.5, 5.0, 3.6, 4.0 ])
)

let x = collect(1:11),
y = [6,5,4,3,2,1,7,11,10,9,8]
# https://stat.ethz.ch/pipermail/r-devel/2009-February/052112.html
# correct P value 0.03044548

corr = HypothesisTests.SpearmanCorrelationTest(x, y)

@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right, method=:exact) 0.03044548 1e-8
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right, method=:sampling) 0.030 1e-3
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right, method=:estimated) 0.030 1e-3
@test_approx_eq_eps HypothesisTests.pvalue(corr, tail=:right, method=:ttest) 0.03 1e-2
end

let x = collect(1:10),
y = [5,4,3,2,1,6,10,9,8,7]

# R's pspearman: 0.05443067 is the exact value
corr = HypothesisTests.SpearmanCorrelationTest(x, y)
@test_approx_eq_eps HypothesisTests.pvalue(corr) 0.05443067 1e-8
end

# Using (N-1)N²(N+1)² overflows with N = 10153

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still get overflow for N > 10153?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, It's working fine now ;)

julia> L=10153; N = 10L; x=rand(1:L, N); y=rand(1:L, N);

julia> R"""cor.test($x,$y,method="spearman",exact=FALSE)"""
RCall.RObject{RCall.VecSxp}

    Spearman's rank correlation rho

data:  $x and $y
S = 1.7465e+14, p-value = 0.6973
alternative hypothesis: true rho is not equal to 0
sample estimates:
         rho 
-0.001220803 



julia> SpearmanCorrelationTest(x,y)
Spearman's rank correlation test
--------------------------------
Population details:
    parameter of interest:   Spearman's ρ
    value under h_0:         0.0
    point estimate:          -0.0012208025475993133

Test summary:
    outcome with 95% confidence: fail to reject h_0
    two-sided p-value:           0.6972821892308928 (not significant)

Details:
    Number of points:                    101530
    Spearman's ρ:                        -0.0012208025475993133
    S (Sum squared difference of ranks): 1.746472562199395e14
    adjustment for ties in x:            1.3199436e7
    adjustment for ties in y:            1.3259556e7

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Might as well remove reference to it here.

let x = rand(10153)

corr = SpearmanCorrelationTest(x,x)

@test_approx_eq corr.ρ 1.0
@test_approx_eq pvalue(corr) 0.0

end

# Test S value with ties

function rho_with_ties(S,N,tx,ty) # S == D
# Equation (14.6.5) from Numerical Recipes for rho with ties
a=(N^3)-N
(1-((6/a)*(S+(tx/12)+(ty/12)))) / (sqrt(1-(tx/a))*sqrt(1-(ty/a)))
end

function diff_rho(x,y)
corr = SpearmanCorrelationTest(x, y)
corr.ρ - rho_with_ties(corr.S, corr.n, corr.xtiesadj, corr.ytiesadj)
end

for i in 20:100
@test_approx_eq_eps diff_rho(rand(1:10,i), rand(1:10,i)) 0.0 1e-10
end