
Add significance testing for `weighted_corr` #90

Merged
merged 54 commits into NCAR:devel on Mar 27, 2019

Conversation

@bradyrx (Contributor) commented Mar 10, 2019

This PR adds significance testing for weighted_corr per #89.

Feature Checklist

  • Calculate the standard p-value via a two-tailed t-test (see the sketch below, after the checklists)

esmlab Checklist

  • Enable and install pre-commit to ensure style-guides and code checks are followed.
  • Target devel for new features or functionality changes.
  • Include documentation when adding new features.
  • Include new tests or update existing tests when applicable.
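
For reference, the standard two-tailed t-test for a correlation coefficient works roughly like the sketch below (an illustrative sketch, not necessarily the exact code added in this PR; corr_p_value is a hypothetical name):

import numpy as np
from scipy import stats

def corr_p_value(r, n):
    # t statistic for the null hypothesis of zero correlation,
    # given a Pearson correlation r estimated from n samples
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    # two-tailed p-value from the t distribution with n - 2 degrees of freedom
    return 2.0 * stats.t.sf(np.abs(t), df=n - 2)
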
sudharsana-kjl and others added 5 commits Mar 9, 2019
remove duplicate call to _get_weights_and_dims
@bradyrx (Contributor, Author) commented Mar 10, 2019

Does anybody have a straightforward example of using the dim keyword on weighted_corr? I'm having issues with it. I'm testing the p-value function on, say, a lat/lon grid of data.

import numpy as np
import xarray as xr
import esmlab

base = np.random.rand(100,10,10)
x = xr.DataArray(np.random.rand(100,10,10) + base, dims=['time','lat','lon'])
y = xr.DataArray(np.random.rand(100,10,10) + base, dims=['time','lat','lon'])
# just looking to correlate over the 'time' dimension so a grid of correlation
# coefficients is returned.
r = esmlab.statistics.weighted_corr(x, y, dim=['time'], weights=None) 

[Screenshot: Screen Shot 2019-03-09 at 9 03 48 PM]

@andersy005 (Member) commented Mar 10, 2019

@bradyrx, this is a bug in esmlab. There's a chain of function calls happening in weighted_corr(), and I believe some flags/arguments are not being handled correctly. In the meantime, I will try to come up with a fix.

@andersy005 andersy005 added this to In progress in To Do List via automation Mar 10, 2019
@andersy005 andersy005 added this to the sprint-mar04-mar17 milestone Mar 10, 2019
@bradyrx (Contributor, Author) commented Mar 10, 2019

@andersy005, thanks. I figured that was the case but didn't want to go through the chain if I was just calling the function wrong. A potential solution is to write vectorized numpy functions and then wrap them with xarray.apply_ufunc, as xskillscore does (https://github.com/raybellwaves/xskillscore/blob/master/xskillscore/core/deterministic.py). That approach tracks all coordinates and dimensions pretty seamlessly.

from xskillscore import pearson_r, pearson_r_p_value
import numpy as np
import xarray as xr
base = np.random.rand(100,10,10)
x = xr.DataArray(np.random.rand(100,10,10) + base, dims=['time','lat','lon'])
y = xr.DataArray(np.random.rand(100,10,10) + base, dims=['time','lat','lon'])

r = pearson_r(x, y, dim='time')
p = pearson_r_p_value(x, y, dim='time')

print(r)
<xarray.DataArray (lat: 10, lon: 10)>
array([[0.533905, 0.458444, 0.509774, 0.593456, 0.52763 , 0.608084, 0.436919,
        0.511359, 0.514602, 0.48052 ],
       [0.636213, 0.510837, 0.434097, 0.662831, 0.539225, 0.460084, 0.504803,
        0.421864, 0.569972, 0.393358],
       [0.470968, 0.511835, 0.474997, 0.424008, 0.348932, 0.540813, 0.496891,
        0.43896 , 0.638253, 0.582971],
       [0.563795, 0.445417, 0.562172, 0.580923, 0.508425, 0.434931, 0.596794,
        0.463918, 0.610172, 0.470445],
       [0.427276, 0.400987, 0.552899, 0.51279 , 0.555465, 0.400016, 0.487967,
        0.495112, 0.41245 , 0.527987],
       [0.65203 , 0.594892, 0.439531, 0.449366, 0.486517, 0.440055, 0.399026,
        0.496796, 0.410862, 0.603409],
       [0.485064, 0.620396, 0.551704, 0.585494, 0.468969, 0.577097, 0.348778,
        0.481249, 0.567304, 0.551467],
       [0.498158, 0.47105 , 0.46874 , 0.44657 , 0.455551, 0.512364, 0.391466,
        0.558187, 0.609233, 0.455428],
       [0.388836, 0.512343, 0.441547, 0.488008, 0.595803, 0.405281, 0.605687,
        0.445643, 0.45182 , 0.525349],
       [0.520046, 0.515418, 0.49629 , 0.428106, 0.588623, 0.515663, 0.328488,
        0.457904, 0.353342, 0.525672]])
Dimensions without coordinates: lat, lon

For now, I'll implement significance testing for 1D arrays, unless you get this bugfix in the near future.
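
For reference, a minimal sketch of that apply_ufunc wrapping pattern (hypothetical helper names, not esmlab or xskillscore code):

import numpy as np
import xarray as xr

def _pearson_r(a, b):
    # vectorized Pearson r over the last axis of two equally shaped arrays
    a_anom = a - a.mean(axis=-1, keepdims=True)
    b_anom = b - b.mean(axis=-1, keepdims=True)
    num = (a_anom * b_anom).sum(axis=-1)
    den = np.sqrt((a_anom ** 2).sum(axis=-1) * (b_anom ** 2).sum(axis=-1))
    return num / den

def corr_over_dim(x, y, dim):
    # apply_ufunc moves `dim` to the last axis and broadcasts the rest,
    # so the output keeps the remaining dimensions and coordinates
    return xr.apply_ufunc(
        _pearson_r, x, y,
        input_core_dims=[[dim], [dim]],
    )

# e.g. corr_over_dim(x, y, 'time') returns a DataArray with dims (lat, lon)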

@@ -358,4 +359,42 @@ def weighted_corr(x, y, weights=None, dim=None, apply_nan_mask=True):
* weighted_cov(y, y, weights=weights, dim=op_over_dims, apply_nan_mask=apply_nan_mask_flag)
)
corr_xy = numerator / denominator
return corr_xy

if return_p:

@matt-long (Contributor) commented Mar 11, 2019

@bradyrx, should we return the p-value as an attribute?

@bradyrx (Author, Contributor) commented Mar 11, 2019

I think it's best as a DataArray for two reasons:

  1. One can correlate two gridded products, or a time series to a grid, and then be returned an array of p-values corresponding to grid cells. (I'm not aware of any good way to do this via attributes)
  2. Using a DataArray allows one to mask with where since the coordinates are one-to-one. E.g., ds.where(p < 0.05).

I'll push forward on this PR later this week with the effective sample sizes. It's also probably worth waiting for the dim keyword fix before merging. Let me know your thoughts.

@matt-long (Contributor) commented Mar 11, 2019

Good points. But this requires two computations to get the significance.

Multiple (optional) return arguments? Return a Dataset?

@bradyrx (Author, Contributor) commented Mar 11, 2019

> But this requires two computations to get the significance.

Could you clarify where you're looking to reduce computations? As a base case, corr_xy will always be computed. To get significance, we have to compute t no matter what (which is a quick, vectorized numpy calculation, so it should have low overhead). It also seems like we're constrained to scipy for the t-test.

> Multiple (optional) return arguments? Return a Dataset?

I was thinking a Dataset might be the most straightforward way to avoid the nuisance of returning separate r and p. However, this probably means weighted_corr could only accept DataArrays, since Dataset input would imply separate p-values for each variable, which could make naming the output variables difficult. I'm sure there's a solution here, though.

In a DataArray-only framework, it would look like:

esmlab.statistics.weighted_corr(x, y, dim='time')  # where x, y are gridded lat/lon

<xarray.Dataset>
Dimensions:  (lat: 10, lon: 10)
Data variables:
    r        (lat, lon) ...
    p        (lat, lon) ...

Thanks for iterating on this. It's a simple addition, but it's worth thinking through all the use cases, and it seems like it might dictate some philosophy changes to the package.
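
As a concrete illustration of the Dataset idea, using the xskillscore functions from my earlier comment (corr_with_p is just a hypothetical name):

import xarray as xr
from xskillscore import pearson_r, pearson_r_p_value

def corr_with_p(x, y, dim):
    # bundle r and p in one Dataset so their coordinates stay one-to-one
    return xr.Dataset({
        'r': pearson_r(x, y, dim=dim),
        'p': pearson_r_p_value(x, y, dim=dim),
    })

# masking then stays a one-liner:
# ds = corr_with_p(x, y, 'time')
# significant_r = ds['r'].where(ds['p'] < 0.05)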

@matt-long (Contributor) commented Mar 11, 2019

We should consider fixing this bug in the context of making all functions operate on Datasets as well as DataArrays.

The ultimate computational operation in each function in statistics.py should live in a private function that operates only on DataArrays and assumes _get_weights_and_dims has been applied.

Then we should pass DataArrays from public functions (e.g., mean, corr, etc.), returning the type passed in (DataArray or Dataset) and looping over DataArrays as necessary.

When we call functions from within functions, we use the private ones.
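
A rough sketch of that structure (hypothetical; the private core below uses an unweighted Pearson r purely as a stand-in, not the actual esmlab implementation):

import numpy as np
import xarray as xr

def _weighted_corr(x, y, dim):
    # private core: operates only on DataArrays; in esmlab the weights/dims
    # resolution (_get_weights_and_dims) would already have happened
    xa = x - x.mean(dim)
    ya = y - y.mean(dim)
    return (xa * ya).sum(dim) / np.sqrt((xa ** 2).sum(dim) * (ya ** 2).sum(dim))

def weighted_corr(x, y, dim=None):
    # public wrapper: return the type passed in, looping over data
    # variables when given a Dataset
    if isinstance(x, xr.Dataset):
        return xr.Dataset(
            {name: _weighted_corr(x[name], y[name], dim) for name in x.data_vars}
        )
    return _weighted_corr(x, y, dim)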

We should remove all instances of

    # If the mask is applied in previous operation,
    # disable it for subsequent operations to speed up computation
    if apply_nan_mask:
        apply_nan_mask_flag = False
    else:
        apply_nan_mask_flag = True

@jukent, @andersy005 can you help with this so @bradyrx can move forward?

Merge changes in master into devel branch
@jukent commented Mar 13, 2019

@andersy005 I want to work on this but it feels wrapped up in our discussion from Monday on _get_weights_and_dims. I want to touch base first.

@andersy005 andersy005 referenced this pull request Mar 14, 2019
@andersy005 (Member) commented Mar 14, 2019

@bradyrx, I wanted to let you know that once #97 is merged, weighted_corr() will work as expected:

In [1]: import xarray as xr                                                                                               

In [2]: import numpy as np                                                                                                

In [3]: import esmlab                                                                                                     

In [4]: base = np.random.rand(100,10,10) 
   ...: x = xr.DataArray(np.random.rand(100,10,10) + base, dims=['time','lat','lon']) 
   ...: y = xr.DataArray(np.random.rand(100,10,10) + base, dims=['time','lat','lon'])                                     

In [5]: esmlab.statistics.weighted_corr(x, y, dim=['time'])                                                               
/Users/abanihi/devel/ncar/esmlab/esmlab/statistics.py:126: UserWarning: Computing mean using equal weights for all data points
  warn('Computing mean using equal weights for all data points')
Out[5]: 
<xarray.DataArray (lat: 10, lon: 10)>
array([[0.524924, 0.494863, 0.508805, 0.437148, 0.487002, 0.572298, 0.535313,
        0.470061, 0.5556  , 0.551623],
       [0.531431, 0.486831, 0.477618, 0.430656, 0.522503, 0.328701, 0.339792,
        0.585835, 0.520957, 0.469617],
       [0.356394, 0.478052, 0.500501, 0.463823, 0.386106, 0.475005, 0.512288,
        0.515351, 0.535773, 0.506998],
       [0.510024, 0.31364 , 0.529359, 0.434292, 0.488146, 0.519409, 0.389186,
        0.412388, 0.570032, 0.497333],
       [0.558904, 0.515453, 0.399842, 0.558256, 0.57212 , 0.433026, 0.459941,
        0.551059, 0.378602, 0.48377 ],
       [0.54679 , 0.506646, 0.421284, 0.589415, 0.482562, 0.499417, 0.537398,
        0.521407, 0.514654, 0.570337],
       [0.580641, 0.430284, 0.368287, 0.383737, 0.393695, 0.561684, 0.339382,
        0.569851, 0.380189, 0.418446],
       [0.508465, 0.499632, 0.503726, 0.563216, 0.589911, 0.432491, 0.488958,
        0.430792, 0.527264, 0.439756],
       [0.501921, 0.478446, 0.594801, 0.463977, 0.512082, 0.551683, 0.3955  ,
        0.558195, 0.53957 , 0.576574],
       [0.549452, 0.486373, 0.504508, 0.493075, 0.457823, 0.486836, 0.558396,
        0.447431, 0.534848, 0.504618]])
Dimensions without coordinates: lat, lon
andersy005 added 4 commits Mar 14, 2019
esmlab/statistics.py (outdated review comments, resolved)
@andersy005 (Member) commented Mar 19, 2019

@matt-long, as I mentioned a few days ago, it appears that #100 didn't fix all time manager issues.
Should we disable the failing tests for the time being?

@andersy005 andersy005 added this to the sprint-mar18-mar31 milestone Mar 19, 2019
bradyrx added 7 commits Mar 19, 2019
maintain dimensions/coordinates for pvalue

change to method used in scipy.stats

remove numpy array casting
* 'master' of github.com:bradyrx/esmlab:
  documentation update
  return weighted_corr as Dataset with both p and r
  remove numpy array casting
  change to method used in scipy.stats
  maintain dimensions/coordinates for pvalue
  return p as DataArray
  return p as DataArray
  add documentation for significance function
  base implementation of correlation p value
@bradyrx (Contributor, Author) commented Mar 19, 2019

@andersy005 -- okay, this should be good to go for a base implementation of p-values with weighted_corr. Note that pytest is broken from the previous commits, but those failures are unrelated to this implementation.

I'm also holding off on adding effective sample size as a feature until later; it's a little more complicated and might require some larger changes to esmlab.

[Screenshot: Screen Shot 2019-03-19 at 11 22 29 AM]
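
For context, a rough sketch of one common effective-sample-size adjustment for autocorrelated 1D series (a Bretherton et al. 1999-style lag-1 autocorrelation correction; illustrative only, not what this PR implements):

import numpy as np

def effective_sample_size(x, y):
    # shrink the nominal sample size using the lag-1 autocorrelations of both
    # series, then use it for the degrees of freedom in the t-test
    n = len(x)
    r1x = np.corrcoef(x[:-1], x[1:])[0, 1]
    r1y = np.corrcoef(y[:-1], y[1:])[0, 1]
    return n * (1 - r1x * r1y) / (1 + r1x * r1y)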

@andersy005 (Member) commented Mar 19, 2019

@bradyrx, is it okay if I push some changes to the master branch of your fork?

@bradyrx (Contributor, Author) commented Mar 19, 2019

@andersy005, yep no problem at all.

To Do List automation moved this from In progress to Reviewer approved Mar 19, 2019
@andersy005 (Member) commented Mar 21, 2019

@matt-long, when you get time, can you have a look at this?

@andersy005 (Member) commented Mar 27, 2019

Merging this! Will look into fixing the time manager and the associated test failures in a future PR.

@bradyrx, thank you for your contribution!

@andersy005 andersy005 merged commit 7ed3401 into NCAR:devel Mar 27, 2019
6 checks passed
  • ci/circleci: docs-build - Your tests passed on CircleCI!
  • ci/circleci: linting - Your tests passed on CircleCI!
  • ci/circleci: python-2.7 - Your tests passed on CircleCI!
  • ci/circleci: python-3.6 - Your tests passed on CircleCI!
  • ci/circleci: python-3.7 - Your tests passed on CircleCI!
  • codecov/project - 86.64% (+1.46%) compared to 025547a
To Do List automation moved this from Reviewer approved to Done Mar 27, 2019
@bradyrx bradyrx referenced this pull request Mar 28, 2019
Projects: To Do List (Done)
5 participants