
ENH: Basic implementation of cross-validation rsm using learner and CV. #504

Merged: 21 commits from swaroopgj:cvRSA into PyMVPA:master on Oct 9, 2016

Conversation

@swaroopgj (Member) commented Sep 29, 2016:

This implements cross-validated RSA.
Adds CDist as a measure that will eventually be passed to CrossValidation to get cvRSA.

With help from @mvdoc @snastase @feilong

@coveralls

Coverage Status

Coverage increased (+0.01%) to 80.716% when pulling 3215b70 on swaroopgj:cvRSA into 73ddbd0 on PyMVPA:master.

@codecov-io commented Sep 29, 2016:

Current coverage is 76.40% (diff: 98.57%)

Merging #504 into master will increase coverage by 0.83%

@@             master       #504   diff @@
==========================================
  Files           344        364    +20   
  Lines         41088      41179    +91   
  Methods           0          0          
  Messages          0          0          
  Branches       6592       6599     +7   
==========================================
+ Hits          31052      31463   +411   
+ Misses         7978       7788   -190   
+ Partials       2058       1928   -130   

Powered by Codecov. Last update 73ddbd0...bd94cb3

from scipy.stats import rankdata, pearsonr


class CDist(Measure):

    pairwise_metric = Parameter('correlation', constraints='str', doc="""\

Member:
minor -- no need for trailing \ here

    def __init__(self, **kwargs):
        Measure.__init__(self, **kwargs)
        self.train_ds = None
        self.sattr = self.params.sattr

Member: so why bother binding them to the instance if they are already available in self.params?
Also, hide away self.train_ds -- it likely shouldn't be "public".

            mgs = mean_group_sample(attrs=self.sattr)
            test_ds = mgs(ds)
        else:
            test_ds = ds.copy(deep=True)

Member:
two pieces of identical code/logic with different input/output usually beg for a helper function/method (e.g. _prep_dataset), please refactor
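
A minimal sketch of the suggested helper (only the name _prep_dataset comes from the review; the body is inferred from the two duplicated branches):

    def _prep_dataset(self, ds):
        # average samples within sattr groups, if requested;
        # otherwise pass the dataset through unchanged
        if self.params.sattr is not None:
            mgs = mean_group_sample(attrs=self.params.sattr)
            return mgs(ds)
        return ds.copy(deep=True)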

        else:
            self.train_ds = ds.copy(deep=True)

    def __call__(self, ds, **kwargs):

Member: please overload _call, not __call__ -- and why are there **kwargs?
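
A sketch of the requested shape (the inner helper name is illustrative, not from the PR):

    def _call(self, ds):
        # Measure.__call__ does the generic bookkeeping and delegates
        # to this hook, so the unused **kwargs disappear entirely
        return self._compute_dissimilarity(ds)  # hypothetical helper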

    city = pdist(data, 'cityblock')
    cdist_euc = CDist(pairwise_metric='euclidean')
    cdist_pear = CDist(pairwise_metric='correlation')
    cdist_city = CDist(pairwise_metric='cityblock')

Member: why hardcode these into separate variables instead of just looping?

for dist in ('euclidean', 'correlation', 'cityblock'):
    pd_ = pdist(data, dist)
    cd_ = CDist(pairwise_metric=dist)
    ...

?

@swaroopgj (Member Author):
@mvdoc took care of your feedback. Updated with his PR.

        samples groups. Typically your category labels or targets.""")

    def __init__(self, **kwargs):
        Measure.__init__(self, **kwargs)

Member:
should we be using super here?

@yarikoptic (Member):

> def __init__(self, **kwargs):
>     Measure.__init__(self, **kwargs)
>
> should we be using super here?

ideally - yes ;)
pragmatically, I think the effect here would be the same, since there is no multiple inheritance at this point.
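
For reference, the super-based spelling (a sketch in the Python 2 style of the codebase) would be:

    def __init__(self, **kwargs):
        super(CDist, self).__init__(**kwargs)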


@coveralls

Coverage Status

Coverage increased (+0.004%) to 80.706% when pulling e8ef537 on swaroopgj:cvRSA into 73ddbd0 on PyMVPA:master.

@yarikoptic yarikoptic changed the title ENH: Basic implementation of cross-validation rsm using learner and CV. WiP ENH: Basic implementation of cross-validation rsm using learner and CV. Sep 30, 2016
@swaroopgj (Member Author):
Seems like the test failed with an unrelated error:
File "/home/travis/build/PyMVPA/PyMVPA/mvpa2/tests/test_benchmarks_hyperalignment.py", line 63, in test_timesegments_classification

        distds = cdist(self._train_ds.samples, test_ds,
                       metric=self.params.pairwise_metric)
        # Make target pairs
        distds = Dataset(samples=distds.ravel()[None, ],

Member Author:

Here, we are arranging samples as folds, to be consistent with cross-validation as used with classifiers, and target pairs as features. So the dataset is already 2-D for a given ROI. But when we run this in a Searchlight setting, this might become an issue, as it vstacks datasets along the "target-pairs" dimension, messing with the feature dimension of the Searchlight.
One option is to keep the result in each ROI as an nFolds-by-1 dataset and have that one element contain an array of "target-pairs".
Any suggestions on how to handle this? @yarikoptic @mvdoc


@swaroopgj (Member Author) commented Oct 4, 2016:

So, I decided to return the result as a single-feature dataset. This way it will work with Searchlight, and users can collapse over cvfolds in sa if they wish to.
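
For instance, a user could average over folds roughly like this (a sketch; the 'cvfolds' attribute name and fold-major sample layout are assumptions, not taken from the PR):

    import numpy as np
    # res.samples is (nfolds * npairs) x 1 under the assumed layout
    nfolds = len(np.unique(res.sa.cvfolds))
    mean_dissim = res.samples.reshape((nfolds, -1)).mean(axis=0)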

I think it is ready to consider for PR. @mvdoc @yarikoptic


Member:
OK for me about returning samples instead of features. I didn't notice that it was the way Andy returned the pairs in PDist already.

We need to fix the test for test_CDist_cval with the travis env when ca are enforced. @swaroopgj do you want me to send PR to your branch or can you modify it?


Member Author:
So, I think that failure is an issue to be addressed in CrossValidation. I am about to open an issue in that regard.

@mvdoc Do you think we should just bypass it for now in the tests?


Member:
Let's see what @yarikoptic says first ;-)

@swaroopgj swaroopgj changed the title WiP ENH: Basic implementation of cross-validation rsm using learner and CV. ENH: Basic implementation of cross-validation rsm using learner and CV. Oct 4, 2016
@@ -104,6 +105,9 @@ def test_CDist():
                              scipy_cdist.ravel())

def test_CDist_cval():
    if _ENFORCE_CA_ENABLED:
        # skip testing for now, since we are having an issue with 'training_stats'
        return

Member:
can you use SkipTest instead?


Member Author:

Will do...

@coveralls

Coverage Status

Coverage increased (+0.01%) to 80.712% when pulling ebced46 on swaroopgj:cvRSA into 73ddbd0 on PyMVPA:master.

@coveralls

Coverage Status

Coverage increased (+0.007%) to 80.71% when pulling ebced46 on swaroopgj:cvRSA into 73ddbd0 on PyMVPA:master.

@swaroopgj (Member Author):
yay! It passed.

@mvdoc (Member) commented Oct 4, 2016:

👍 @yarikoptic ready for review ;D


@yarikoptic (Member) left a review:

sweet but could be sweeter ;)

from scipy.stats import rankdata, pearsonr


class CDist(Measure):
"""Compute dissimiliarity matrix for samples in a dataset

Member: isn't a critical piece missing -- "cross-validated" dissimilarity?

        all possible metrics.""")

    pairwise_metric_kwargs = Parameter({}, doc="""
        kwargs dictionary passed to cdist. For example,

Member:
please indent like in other docs

        # Make target pairs
        distds = Dataset(samples=distds.ravel()[:, None],
                         sa={'pairs': list(product(self._train_ds.T,
                                                   test_ds.T))})

@yarikoptic (Member) commented Oct 4, 2016:

do not use .T since you made it configurable -- it should be the values of self.params.sattr.
I wondered if you would get any advantage from using FlattenMapper: originally just make an NxN dataset, then flatten it, which should provide 1 x N**2 (and I believe .fa should bear those pairs). Do you really need it as N**2 x 1?


Member Author:

self.params.sattr is a list of attributes to run mean_group_sample on, so it doesn't fit well here.
We could make sattr a single sa instead of a list, use it here instead of .T, and force mean_group_sample to always average over a single sa rather than accept a list. Any suggestion on which is preferable?


Member:

Btw, then why not keep those which you get from mean_group_sample? Assign them to the corresponding sa/fa of your matrix. I wonder, though, what FlattenMapper would do.

Also add a test where you have a list and see how your code breaks now.


Member Author:

We could keep the sa from mean_group_sample if we don't ravel(). The reason I didn't choose FlattenMapper is that I want to flatten the matrix into samples/rows; FlattenMapper does it into columns/features (right?).
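
In plain numpy terms, the distinction being discussed (illustrative only):

    import numpy as np
    dm = np.arange(9.).reshape(3, 3)    # a toy 3x3 distance matrix
    as_features = dm.ravel()[None, :]   # 1 x 9: pairs along features (FlattenMapper-style)
    as_samples = dm.ravel()[:, None]    # 9 x 1: pairs along samples (as done here)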

        # skip testing for now, since we are having an issue with 'training_stats'
        raise SkipTest("Skipping test to avoid issue with 'training_stats' while CA enabled")

    targets = np.tile(range(3), 2)

Member: ha -- those lines are not covered now... the cons of our old-fashioned travis setup -- I will fix it.

    res = pymvpa_cdist(test_data)
    # Check to make sure the cdist results are close to CDist results
    assert_array_almost_equal(res.samples.ravel(),
                              scipy_cdist.ravel())

Member: since it is to be trained once and then reused on new data, it would have been nice to test that it provides the correct result (comparing to an independent new train/test cycle).


Member Author:

I added an extra test to re-use the trained measure again (on the original training data), and it should still work (in this case, mimicking pdist). Hope that's what you are referring to.

                         generator=NFoldPartitioner(),
                         errorfx=None)
    res = cv(ds)
    # Test to make sure both folds return the same results, as they should

Member:
so if there is a bug and it doesn't recompute the measure across folds we will be just fine? ;)


Member Author:

In this case, it has to be recomputed across folds, because the results for the two folds are not the same but transposes of each other after we reshape them into 3-by-3 (as done in the assert after this). If it put the same values in both folds, it would fail (in theory).
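
A sketch of the property being relied on (the 'cvfolds' attribute name is an assumption): with the two folds trained and tested on swapped halves, the reshaped fold results should be transposes of each other:

    res0 = res.samples[res.sa.cvfolds == 0].reshape((3, 3))
    res1 = res.samples[res.sa.cvfolds == 1].reshape((3, 3))
    assert_array_almost_equal(res0, res1.T)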

@swaroopgj (Member Author):
Thanks @yarikoptic, working on the review.

            mgs = mean_group_sample(attrs=self.params.sattr)
            ds_ = mgs(ds)
        else:
            ds_ = ds.copy(deep=True)

Member:
Btw why bother deep copying the data??


Member Author: No need. Left over from the old implementation.


@coveralls

Coverage Status

Coverage increased (+0.01%) to 80.716% when pulling 72b3fc6 on swaroopgj:cvRSA into 73ddbd0 on PyMVPA:master.

@swaroopgj (Member Author):
@yarikoptic I have addressed all your comments/requests except how to return results dataset. I will do it once we decide on something. Thanks!

@swaroopgj (Member Author):
@yarikoptic Made changes to return dataset as we discussed this morning.

@coveralls commented Oct 5, 2016:

Coverage Status

Coverage increased (+0.5%) to 81.193% when pulling bd94cb3 on swaroopgj:cvRSA into 73ddbd0 on PyMVPA:master.

@yarikoptic (Member):
It would be nice to get some example/demo on e.g. comparing CV against "in sample" results
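
Something along these lines could serve as such a demo (a sketch only; the module paths follow PyMVPA conventions but are not verified against the merged code):

    from mvpa2.measures.rsa import PDist, CDist
    from mvpa2.measures.base import CrossValidation
    from mvpa2.generators.partition import NFoldPartitioner

    # "in sample": plain dissimilarities computed on the full dataset
    insample = PDist(pairwise_metric='correlation')(ds)

    # cross-validated: CDist trained and tested on different folds
    cv = CrossValidation(CDist(pairwise_metric='correlation'),
                         generator=NFoldPartitioner(),
                         errorfx=None)
    crossvalidated = cv(ds)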

@yarikoptic yarikoptic merged commit 3452557 into PyMVPA:master Oct 9, 2016
@swaroopgj swaroopgj deleted the cvRSA branch February 8, 2017 00:52