ENH: Basic implementation of cross-validated RSA using learner and CV. #504
Conversation
Current coverage is 76.40% (diff: 98.57%)

@@            master     #504    diff @@
========================================
  Files          344      364     +20
  Lines        41088    41179     +91
  Methods          0        0
  Messages         0        0
  Branches      6592     6599      +7
========================================
+ Hits         31052    31463    +411
+ Misses        7978     7788    -190
+ Partials      2058     1928    -130
    from scipy.stats import rankdata, pearsonr


    class CDist(Measure):

        pairwise_metric = Parameter('correlation', constraints='str', doc="""\
minor -- no need for the trailing `\` here
        def __init__(self, **kwargs):
            Measure.__init__(self, **kwargs)
            self.train_ds = None
            self.sattr = self.params.sattr
so why bother binding them to the instance if they are already available in self.params? Hide away self.train_ds -- it is unlikely it should be "public".
                mgs = mean_group_sample(attrs=self.sattr)
                test_ds = mgs(ds)
            else:
                test_ds = ds.copy(deep=True)
two pieces of identical code/logic with different input/output usually beg for a helper function/method (e.g. _prep_dataset), please refactor
            else:
                self.train_ds = ds.copy(deep=True)

        def __call__(self, ds, **kwargs):
please overload `_call`, not `__call__` -- and why are there `**kwargs`??
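The reason behind this request is the usual template-method split, sketched here generically (these are toy classes, not PyMVPA's actual Measure hierarchy): the base class's `__call__` performs shared bookkeeping and delegates the core computation to `_call`, so subclasses should override only `_call`.

```python
class MeasureBase(object):
    """Toy stand-in for a Measure-like base class (not the PyMVPA API)."""

    def __call__(self, ds):
        # shared bookkeeping (state checks, conditional attributes, ...)
        result = self._call(ds)
        # shared post-processing would also live here, applied uniformly
        return result

    def _call(self, ds):
        raise NotImplementedError


class MyMeasure(MeasureBase):
    def _call(self, ds):
        # the subclass implements only the core computation
        return sum(ds)
```

Overriding `__call__` directly would bypass whatever the base class does around `_call`, which is why the review asks for `_call` instead.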
        city = pdist(data, 'cityblock')
        cdist_euc = CDist(pairwise_metric='euclidean')
        cdist_pear = CDist(pairwise_metric='correlation')
        cdist_city = CDist(pairwise_metric='cityblock')
why hardcode into variables, instead of just a loop?

    for dist in ('euclidean', 'cor...):
        pd_ = pdist(data, dist)
        cd_ = CDist(pairwise_metric=dist)
        ...

?
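With plain SciPy the loop pattern looks like the following (using scipy.spatial.distance directly, since CDist itself is what the PR introduces): each metric's condensed pdist result should agree with the full self-distance matrix from cdist.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform

data = np.random.RandomState(0).rand(6, 4)

# One loop over metrics instead of one hardcoded variable per metric.
for dist in ('euclidean', 'correlation', 'cityblock'):
    pd_ = pdist(data, dist)             # condensed pairwise distances
    full = cdist(data, data, dist)      # full NxN self-distance matrix
    assert np.allclose(squareform(pd_), full)
```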
@mvdoc took care of your feedback. Updated with his PR.
            samples groups. Typically your category labels or targets.""")

        def __init__(self, **kwargs):
            Measure.__init__(self, **kwargs)
should we be using `super` here?

ideally - yes ;) -- Yaroslav O. Halchenko
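A minimal sketch of the `super` form in plain Python (these stand-in classes only illustrate the pattern; the real Measure base is PyMVPA's):

```python
class Measure(object):
    """Stand-in base class storing constructor kwargs."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs


class CDistLike(Measure):
    """Hypothetical subclass showing super() instead of the explicit
    Measure.__init__(self, **kwargs) call."""
    def __init__(self, **kwargs):
        super(CDistLike, self).__init__(**kwargs)  # Python 2/3 compatible form
        self._train_ds = None
```

The `super` call keeps the method-resolution order intact if the class hierarchy ever gains cooperative multiple inheritance, which the explicit base-class call would silently break.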
Seems like the test failed with an unrelated error:
            distds = cdist(self._train_ds.samples, test_ds,
                           metric=self.params.pairwise_metric)
            # Make target pairs
            distds = Dataset(samples=distds.ravel()[None, ],
Here, we are arranging samples as folds, to be consistent with cross-validation as used with classifiers, and target pairs as features. So the dataset is already 2-D for a given ROI. But when we run this in a Searchlight setting, this might become an issue, as it vstacks datasets along this "target-pairs" dimension, messing with the feature dimension of Searchlight.

One option is to keep the result in each ROI as an nFolds-by-1 dataset, with that 1 element containing an array of "target-pairs".

Any suggestions on how to handle this? @yarikoptic @mvdoc
So, I decided to return the result as a single-feature dataset. This way it will work with Searchlight, and users can collapse over cvfolds in sa if they wish to.

I think it is ready to consider for PR. @mvdoc @yarikoptic
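The resulting layout, sketched with NumPy as a toy stand-in for the Dataset machinery: the N-by-N cross-distance matrix for a fold is raveled into N² samples and a single feature, with the corresponding target pairs recorded alongside.

```python
import numpy as np
from itertools import product
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
train, test = rng.rand(3, 4), rng.rand(3, 4)
train_targets = test_targets = ['a', 'b', 'c']

dist = cdist(train, test, metric='euclidean')       # shape (3, 3)
samples = dist.ravel()[:, None]                     # shape (9, 1): one feature
pairs = list(product(train_targets, test_targets))  # one (train, test) pair per sample
```

Because `ravel` is row-major, `pairs[k]` labels `samples[k]` exactly, and stacking such single-feature results across searchlight centers no longer collides with the feature dimension.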
OK for me about returning samples instead of features. I didn't notice that that was the way Andy returned the pairs in PDist already.

We need to fix the test for test_CDist_cval with the travis env when ca are enforced. @swaroopgj do you want me to send a PR to your branch, or can you modify it?
So, I think that failure is an issue to be addressed in CrossValidation. I am about to open an issue in that regard.

@mvdoc Do you think we should just bypass it for now in tests?
Let's see what @yarikoptic says first ;-)
Extend use of CDist to mahalanobis distance
    @@ -104,6 +105,9 @@ def test_CDist():
                                  scipy_cdist.ravel())


    def test_CDist_cval():
        if _ENFORCE_CA_ENABLED:
            # skip testing for now, since we are having issue with 'training_stats'
            return
can you use SkipTest instead?
Will do...
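For reference, the stdlib form of the pattern (nose's SkipTest behaves the same way; shown with unittest so the sketch is self-contained, and the `_ENFORCE_CA_ENABLED` flag here is a stand-in for PyMVPA's testing flag):

```python
import unittest

_ENFORCE_CA_ENABLED = True  # stand-in for the PyMVPA testing flag

class TestCDistCval(unittest.TestCase):
    def test_cdist_cval(self):
        if _ENFORCE_CA_ENABLED:
            # raising SkipTest marks the test as skipped in the report,
            # instead of silently "passing" via a bare `return`
            raise unittest.SkipTest(
                "issue with 'training_stats' while CA enabled")
        # ... actual test body would go here ...

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.TestLoader().loadTestsFromTestCase(TestCDistCval))
```

The difference matters for coverage accounting: a skipped test is visible in the runner output, while an early `return` is counted as a pass.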
yay! It passed.

👍 @yarikoptic ready for review ;D
sweet but could be sweeter ;)
    from scipy.stats import rankdata, pearsonr


    class CDist(Measure):
        """Compute dissimilarity matrix for samples in a dataset
isn't a critical piece missing -- "cross-validated dissimilarity"?
            all possible metrics.""")

        pairwise_metric_kwargs = Parameter({}, doc="""
            kwargs dictionary passed to cdist. For example,
please indent like in other docs
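The kind of usage this parameter enables, shown directly against SciPy's `cdist` (CDist would forward the same kwargs); this also covers the "extend CDist to mahalanobis distance" commit, since mahalanobis needs the inverse covariance matrix as an extra argument:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X, Y = rng.rand(5, 3), rng.rand(4, 3)

# Mahalanobis distance needs the inverse covariance matrix passed as VI --
# exactly the kind of extra argument pairwise_metric_kwargs carries to cdist.
VI = np.linalg.inv(np.cov(np.vstack([X, Y]).T))
d = cdist(X, Y, metric='mahalanobis', VI=VI)
```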
            # Make target pairs
            distds = Dataset(samples=distds.ravel()[:, None],
                             sa={'pairs': list(product(self._train_ds.T,
                                                       test_ds.T))})
do not use .T -- since you made it configurable, it should be the values of self.params.sattr.

I wondered if you would have gotten any advantage from using FlattenMapper: originally just make an NxN dataset, then flatten it, which should provide 1 x N**2 (and I believe .fa should bear those pairs). Do you really need it N**2 x 1?
self.params.sattr is a list of attributes to run mean_group_sample on, so it doesn't fit well here. We can either make sattr a single sa instead of a list and use it here instead of .T, or force mean_group_sample to always mean over a single sa and not accept a list. Any suggestion on which is preferable?
Btw, then why not keep those which you get from mean_group_sample? Assign them to the corresponding sa, fa of your matrix... I wonder though what FlattenMapper would do.

Also add a test where you have a list and see how your code would break now.
We could keep the sa from mean_group_sample if we don't ravel(). The reason I didn't choose FlattenMapper is that I want to flatten the matrix into samples/rows; FlattenMapper does it into columns/features (right?).
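The row-versus-column distinction in that exchange, shown in plain NumPy (C-order ravel walks rows first, which is what mapping the matrix into samples relies on):

```python
import numpy as np

D = np.arange(6).reshape(2, 3)     # a toy 2x3 "distance matrix"

rows_first = D.ravel()             # row-major (C order): [0 1 2 3 4 5]
cols_first = D.ravel(order='F')    # column-major (Fortran): [0 3 1 4 2 5]

# Flattening into samples (an (N*M, 1) column) keeps the row-major pair
# order, matching itertools.product(train_labels, test_labels).
as_samples = rows_first[:, None]
```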
            # skip testing for now, since we are having issue with 'training_stats'
            raise SkipTest("Skipping test to avoid issue with 'training_stats' while CA enabled")

        targets = np.tile(range(3), 2)
ha -- those lines are not covered now... the cons of our old-fashioned travis setup -- I will fix it
        res = pymvpa_cdist(test_data)
        # Check to make sure the cdist results are close to CDist results
        assert_array_almost_equal(res.samples.ravel(),
                                  scipy_cdist.ravel())
since it is to be trained once and then reused on new data -- it would have been nice to test that it provides the correct result (comparing to an independent new train/test cycle)
I added an extra test that re-uses the trained measure again (on the original training data), and it should still work (in this case, mimicking pdist). Hope that's what you are referring to.
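The idea of that check, reduced to SciPy calls (a CDist-style measure trained and tested on the same data should reproduce pdist, and a second call must give identical results):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform

rng = np.random.RandomState(0)
train = rng.rand(4, 3)

# "Re-using the trained measure on its own training data" collapses to a
# self-distance: cdist(train, train) must match the condensed pdist result.
self_dist = cdist(train, train, metric='euclidean')
assert np.allclose(self_dist, squareform(pdist(train, metric='euclidean')))

# Calling again must give the identical result (no hidden mutable state).
assert np.array_equal(self_dist, cdist(train, train, metric='euclidean'))
```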
                             generator=NFoldPartitioner(),
                             errorfx=None)
        res = cv(ds)
        # Testing to make sure both folds return the same results, as they should
so if there is a bug and it doesn't recompute the measure across folds we will be just fine? ;)
In this case, it has to be recomputed across folds: the results for the two folds are not the same but transposes of each other after we reshape into 3-by-3 (as done in the assert after this). If it put the same values in both folds, it should fail (in theory).
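The property being relied on, in SciPy terms: swapping the train and test halves transposes the cross-distance matrix, so identical fold results would flag a measure that was not recomputed.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(1)
half_a, half_b = rng.rand(3, 4), rng.rand(3, 4)

fold1 = cdist(half_a, half_b, metric='euclidean')  # train on A, test on B
fold2 = cdist(half_b, half_a, metric='euclidean')  # train on B, test on A

# The folds differ elementwise but are exact transposes of each other.
assert not np.allclose(fold1, fold2)
assert np.allclose(fold1, fold2.T)
```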
Thanks @yarikoptic, working on the review.
                mgs = mean_group_sample(attrs=self.params.sattr)
                ds_ = mgs(ds)
            else:
                ds_ = ds.copy(deep=True)
Btw why bother deep copying the data??
No need. Left over from old implementation.
@yarikoptic I have addressed all your comments/requests except how to return the results dataset. I will do it once we decide on something. Thanks!
@yarikoptic Made changes to return the dataset as we discussed this morning.
It would be nice to get some example/demo, e.g. comparing CV against "in sample" results.
This implements cross-validated RSA. It adds CDist as a measure that will eventually be passed to CrossValidation to get cvRSA.

With help from @mvdoc @snastase @feilong
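A toy end-to-end sketch of the cross-validated RSA idea (plain NumPy/SciPy standing in for CDist + CrossValidation; fold contents here are random placeholders, not real data): condition-mean patterns come from two independent folds, and the RDM is computed *across* folds so that noise shared within a fold cannot inflate the similarity structure.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
n_conditions, n_features = 3, 8

# Two independent "folds" of per-condition pattern means.
fold1 = rng.rand(n_conditions, n_features)
fold2 = rng.rand(n_conditions, n_features)

# Cross-validated RDM: distances between condition means of *different*
# folds -- the core of what CDist inside CrossValidation computes.
cv_rdm = cdist(fold1, fold2, metric='correlation')
```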