scipy.stats.spearmanr unexpected NaN handling #6530

Closed
ChickenProp opened this issue Aug 31, 2016 · 3 comments
Labels: defect, scipy.stats

@ChickenProp

Neither nan_policy='omit' nor nan_policy='propagate' (the default) does what I would expect.

The docs say that propagate should return nan, but I haven't seen that happen. It simply seems to rank nan above any other value. For example:

In [1]: import scipy

In [2]: scipy.__version__
Out[2]: '0.18.0'

In [3]: import numpy as np

In [4]: from scipy.stats import spearmanr

In [5]: x = np.array(np.arange(15.0)).reshape(5,3)

In [7]: x[1,1] = np.nan

In [8]: x
Out[8]:
array([[  0.,   1.,   2.],
       [  3.,  nan,   5.],
       [  6.,   7.,   8.],
       [  9.,  10.,  11.],
       [ 12.,  13.,  14.]])

In [10]: spearmanr(x)
Out[10]:
SpearmanrResult(correlation=array([[ 1. ,  0.4,  1. ],
       [ 0.4,  1. ,  0.4],
       [ 1. ,  0.4,  1. ]]), pvalue=array([[  1.40426542e-24,   5.04631575e-01,   1.40426542e-24],
       [  5.04631575e-01,   1.40426542e-24,   5.04631575e-01],
       [  1.40426542e-24,   5.04631575e-01,   1.40426542e-24]]))

0.4 is the correct result if x[1,1] is instead 15. We can also see this when passing two one-dimensional arrays:

In [34]: spearmanr(x.ravel(), np.arange(15))
Out[34]: SpearmanrResult(correlation=0.80357142857142838, pvalue=0.00030726503181250915)

Again, this is the correct result if x[1,1] is 15.
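
One possible explanation for this behaviour is that NumPy's sort order places NaN after every finite value, so any ranking built directly on argsort hands NaN the highest rank. The sketch below illustrates this with a hypothetical helper, `naive_ranks`, which is an assumption for illustration and not SciPy's implementation; it ignores ties, which is fine for this data.

```python
import numpy as np


def naive_ranks(values):
    """Zero-based ranks via a double argsort (hypothetical helper, no tie handling)."""
    return np.argsort(np.argsort(values))


# Columns 0 and 1 of the array from the report above.
col0 = np.array([0.0, 3.0, 6.0, 9.0, 12.0])
col1_nan = np.array([1.0, np.nan, 7.0, 10.0, 13.0])
col1_15 = np.array([1.0, 15.0, 7.0, 10.0, 13.0])  # nan replaced by 15

ranks_nan = naive_ranks(col1_nan)
ranks_15 = naive_ranks(col1_15)

# Without ties, Spearman's rho is the Pearson correlation of the ranks.
rho = np.corrcoef(naive_ranks(col0), ranks_nan)[0, 1]

print(ranks_nan, ranks_15, rho)
```

Both rank vectors come out identical, which is consistent with the 0.4 reported above for either input.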

Meanwhile, omit seems to correctly handle the case of two one-dimensional arrays, in that it returns the same result as if you filter out indexes which are nan in either array.
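
The filtering that omit is being compared against here can be sketched as follows. This is a minimal illustration of pairwise deletion for two one-dimensional arrays, using the same data as above; the variable names are chosen for this example only.

```python
import numpy as np
from scipy.stats import spearmanr

a = np.arange(15.0)
a[4] = np.nan  # same position as x[1, 1] after ravel()
b = np.arange(15.0)

# Keep only the positions where neither array is nan.
keep = ~(np.isnan(a) | np.isnan(b))

rho, pvalue = spearmanr(a[keep], b[keep])
print(rho, pvalue)
```

With the nan pair dropped, the remaining 14 pairs are perfectly monotonic, so the coefficient is 1 rather than the 0.80 obtained when nan is ranked as the largest value.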

However, omit appears to ravel the a and b arguments and calculate a single correlation coefficient. If b is missing, or does not have the same number of entries as a, we get an error.

In [18]: spearmanr(x, nan_policy='omit')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-c2b71c85ee3b> in <module>()
----> 1 spearmanr(x, nan_policy='omit')

/Users/206404437/local/share/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in spearmanr(a, b, axis, nan_policy)
   3308     if contains_nan and nan_policy == 'omit':
   3309         a = ma.masked_invalid(a)
-> 3310         b = ma.masked_invalid(b)
   3311         return mstats_basic.spearmanr(a, b, axis)
   3312

/Users/206404437/local/share/anaconda/lib/python2.7/site-packages/numpy/ma/core.pyc in masked_invalid(a, copy)
   2297         cls = type(a)
   2298     else:
-> 2299         condition = ~(np.isfinite(a))
   2300         cls = MaskedArray
   2301     result = a.view(cls)

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

This shouldn't be an error.

In [20]: spearmanr(x, x, nan_policy='omit')
Out[20]: SpearmanrResult(correlation=1.0, pvalue=0.0)

There are six variables here, but only a single result.

In [23]: spearmanr(x, x[:,:2], nan_policy='omit')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-33f1f6e32fb7> in <module>()
----> 1 spearmanr(x, x[:,:2], nan_policy='omit')

/Users/206404437/local/share/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in spearmanr(a, b, axis, nan_policy)
   3309         a = ma.masked_invalid(a)
   3310         b = ma.masked_invalid(b)
-> 3311         return mstats_basic.spearmanr(a, b, axis)
   3312
   3313     if a.size <= 1:

/Users/206404437/local/share/anaconda/lib/python2.7/site-packages/scipy/stats/mstats_basic.pyc in spearmanr(x, y, use_ties)
    451
    452     """
--> 453     (x, y, n) = _chk_size(x, y)
    454     (x, y) = (x.ravel(), y.ravel())
    455

/Users/206404437/local/share/anaconda/lib/python2.7/site-packages/scipy/stats/mstats_basic.pyc in _chk_size(a, b)
     93     if na != nb:
     94         raise ValueError("The size of the input array should match!"
---> 95                          " (%s <> %s)" % (na, nb))
     96     return (a, b, na)
     97

ValueError: The size of the input array should match! (15 <> 10)

It looks as though these arrays are being raveled.

In [45]: spearmanr(x, x.T, nan_policy='omit')
Out[45]:
SpearmanrResult(correlation=0.60439560439560447, pvalue=masked_array(data = 0.0286727056714,
             mask = False,
       fill_value = 1e+20)
)

The a and b arguments here have different shapes (5x3 and 3x5), so this should throw an error; but again, it simply seems to be raveling them.

(When there are no nan values, omit seems to have no effect.)

ev-br added the scipy.stats and defect labels Oct 6, 2016
@ev-br
Member

ev-br commented Feb 10, 2017

Fixed by #6850

ev-br closed this as completed Feb 10, 2017
ev-br added this to the 0.19.0 milestone Feb 10, 2017
@Benfeitas

Benfeitas commented Jul 31, 2018

Would just like to say that this issue is still a problem. With nan_policy='propagate', comparing two vectors passed as separate arguments returns nan as expected, but passing the same data as a two-column dataframe does not:

import numpy as np
import pandas as pd
import scipy as sp
import scipy.stats

ble=[0,1,3,np.nan,5]
bla=[np.nan,1,3,4,5]
sp.stats.spearmanr(bla,ble,nan_policy='propagate')
#out: SpearmanrResult(correlation=nan, pvalue=nan)

sp.stats.spearmanr(pd.DataFrame([ble,bla]).T,nan_policy='propagate')
#out: SpearmanrResult(correlation=-0.09999999999999999, pvalue=0.8728885715695383)

I am running scipy 1.1.0, Python 3.6.0. Others found similar problems:
https://stackoverflow.com/questions/51386399/python-scipy-spearman-correlation-for-matrix-does-not-match-two-array-correlatio?rq=1

On a side note, this is a problem for me since I want to compute these statistics in a big dataframe.
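
For the dataframe use case, one possible workaround (an assumption on my part, not something proposed in this thread) is pandas' own `DataFrame.corr(method="spearman")`, which excludes missing values pairwise. Note that this corresponds to omit-style behaviour rather than propagate, and it returns only the coefficients, not p-values.

```python
import numpy as np
import pandas as pd

ble = [0, 1, 3, np.nan, 5]
bla = [np.nan, 1, 3, 4, 5]

# One column per variable; rows with a nan in either column of a pair
# are dropped for that pair only.
df = pd.DataFrame({"ble": ble, "bla": bla})
corr = df.corr(method="spearman")

print(corr)
```

Here only rows 1, 2 and 4 are complete for the (ble, bla) pair, and they are perfectly monotonic, so the off-diagonal entry is 1.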

(I've also commented on this here)

@rgommers
Member

rgommers commented Aug 1, 2018

@Sosippus could you please open a new issue for this? Most of this issue was solved, apparently just not the 2-D input, which behaves differently from two 1-D arrays as input.
