ENH: Basic implementation of cross-validated RSA using learner and CV. #504
Conversation
Current coverage is 76.40% (diff: 98.57%)

@@            master     #504    diff @@
========================================
  Files          344      364     +20
  Lines        41088    41179     +91
  Methods          0        0
  Messages         0        0
  Branches      6592     6599      +7
========================================
+ Hits         31052    31463    +411
+ Misses        7978     7788    -190
+ Partials      2058     1928    -130
    from scipy.stats import rankdata, pearsonr


    class CDist(Measure):

        pairwise_metric = Parameter('correlation', constraints='str', doc="""\
minor -- no need for the trailing `\` here
        def __init__(self, **kwargs):
            Measure.__init__(self, **kwargs)
            self.train_ds = None
            self.sattr = self.params.sattr
so why bother binding them to the instance if they are already available in self.params? Hide away self.train_ds -- it is unlikely it should be "public".
                mgs = mean_group_sample(attrs=self.sattr)
                test_ds = mgs(ds)
            else:
                test_ds = ds.copy(deep=True)
two pieces of identical code/logic with different input/output usually beg for a helper function/method (e.g. _prep_dataset), please refactor
            else:
                self.train_ds = ds.copy(deep=True)

        def __call__(self, ds, **kwargs):
please overload `_call`, not `__call__` -- and why are there `**kwargs`??
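The reason behind this request is the usual template-method split, sketched here generically (these are toy classes, not PyMVPA's actual Measure hierarchy): the base class's `__call__` performs shared bookkeeping and delegates the core computation to `_call`, so subclasses should override only `_call`.

```python
class MeasureBase(object):
    """Toy stand-in for a Measure-like base class (not the PyMVPA API)."""

    def __call__(self, ds):
        # shared bookkeeping (state checks, conditional attributes, ...)
        result = self._call(ds)
        # shared post-processing would also live here, applied uniformly
        return result

    def _call(self, ds):
        raise NotImplementedError


class MyMeasure(MeasureBase):
    def _call(self, ds):
        # the subclass implements only the core computation
        return sum(ds)
```

Overriding `__call__` directly would bypass whatever the base class does around `_call`, which is why the review asks for `_call` instead.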
        city = pdist(data, 'cityblock')
        cdist_euc = CDist(pairwise_metric='euclidean')
        cdist_pear = CDist(pairwise_metric='correlation')
        cdist_city = CDist(pairwise_metric='cityblock')
why hardcode into variables, instead of just a loop?

    for dist in ('euclidean', 'cor...):
        pd_ = pdist(data, dist)
        cd_ = CDist(pairwise_metric=dist)
        ...

?
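With plain SciPy the loop pattern looks like the following (using scipy.spatial.distance directly, since CDist itself is what the PR introduces): each metric's condensed pdist result should agree with the full self-distance matrix from cdist.

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform

data = np.random.RandomState(0).rand(6, 4)

# One loop over metrics instead of one hardcoded variable per metric.
for dist in ('euclidean', 'correlation', 'cityblock'):
    pd_ = pdist(data, dist)             # condensed pairwise distances
    full = cdist(data, data, dist)      # full NxN self-distance matrix
    assert np.allclose(squareform(pd_), full)
```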
@mvdoc took care of your feedback. Updated with his PR.
            samples groups. Typically your category labels or targets.""")

        def __init__(self, **kwargs):
            Measure.__init__(self, **kwargs)
should we be using `super` here?

ideally - yes ;) -- Yaroslav O. Halchenko
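A minimal sketch of the `super` form in plain Python (these stand-in classes only illustrate the pattern; the real Measure base is PyMVPA's):

```python
class Measure(object):
    """Stand-in base class storing constructor kwargs."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs


class CDistLike(Measure):
    """Hypothetical subclass showing super() instead of the explicit
    Measure.__init__(self, **kwargs) call."""
    def __init__(self, **kwargs):
        super(CDistLike, self).__init__(**kwargs)  # Python 2/3 compatible form
        self._train_ds = None
```

The `super` call keeps the method-resolution order intact if the class hierarchy ever gains cooperative multiple inheritance, which the explicit base-class call would silently break.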
Seems like the test failed with an unrelated error:
            distds = cdist(self._train_ds.samples, test_ds,
                           metric=self.params.pairwise_metric)
            # Make target pairs
            distds = Dataset(samples=distds.ravel()[None, ],
Here, we are arranging samples as folds, to be consistent with cross-validation as used with classifiers, and target pairs as features. So the dataset is already 2-D for a given ROI. But when we run this in a Searchlight setting, this might become an issue, as it vstacks datasets along this "target-pairs" dimension, messing with the feature dimension of Searchlight.

One option is to keep the result in each ROI as an nFolds-by-1 dataset, with that 1 element containing an array of "target-pairs".

Any suggestions on how to handle this? @yarikoptic @mvdoc
So, I decided to return the result as a single-feature dataset. This way it will work with Searchlight, and users can collapse over cvfolds in sa if they wish to.

I think it is ready to consider for PR. @mvdoc @yarikoptic
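The resulting layout, sketched with NumPy as a toy stand-in for the Dataset machinery: the N-by-N cross-distance matrix for a fold is raveled into N² samples and a single feature, with the corresponding target pairs recorded alongside.

```python
import numpy as np
from itertools import product
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
train, test = rng.rand(3, 4), rng.rand(3, 4)
train_targets = test_targets = ['a', 'b', 'c']

dist = cdist(train, test, metric='euclidean')       # shape (3, 3)
samples = dist.ravel()[:, None]                     # shape (9, 1): one feature
pairs = list(product(train_targets, test_targets))  # one (train, test) pair per sample
```

Because `ravel` is row-major, `pairs[k]` labels `samples[k]` exactly, and stacking such single-feature results across searchlight centers no longer collides with the feature dimension.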
OK for me about returning samples instead of features. I didn't notice that that was the way Andy returned the pairs in PDist already.

We need to fix the test for test_CDist_cval with the travis env when ca are enforced. @swaroopgj do you want me to send a PR to your branch, or can you modify it?
So, I think that failure is an issue to be addressed in CrossValidation. I am about to open an issue in that regard.

@mvdoc Do you think we should just bypass it for now in tests?
Let's see what @yarikoptic says first ;-)
Extend use of CDist to mahalanobis distance
    @@ -104,6 +105,9 @@ def test_CDist():
                                  scipy_cdist.ravel())


    def test_CDist_cval():
        if _ENFORCE_CA_ENABLED:
            # skip testing for now, since we are having issue with 'training_stats'
            return
can you use SkipTest instead?
Will do...
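For reference, the stdlib form of the pattern (nose's SkipTest behaves the same way; shown with unittest so the sketch is self-contained, and the `_ENFORCE_CA_ENABLED` flag here is a stand-in for PyMVPA's testing flag):

```python
import unittest

_ENFORCE_CA_ENABLED = True  # stand-in for the PyMVPA testing flag

class TestCDistCval(unittest.TestCase):
    def test_cdist_cval(self):
        if _ENFORCE_CA_ENABLED:
            # raising SkipTest marks the test as skipped in the report,
            # instead of silently "passing" via a bare `return`
            raise unittest.SkipTest(
                "issue with 'training_stats' while CA enabled")
        # ... actual test body would go here ...

result = unittest.TextTestRunner(verbosity=0).run(
    unittest.TestLoader().loadTestsFromTestCase(TestCDistCval))
```

The difference matters for coverage accounting: a skipped test is visible in the runner output, while an early `return` is counted as a pass.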
yay! It passed.

👍 @yarikoptic ready for review ;D
sweet but could be sweeter ;)
    from scipy.stats import rankdata, pearsonr


    class CDist(Measure):
        """Compute dissimilarity matrix for samples in a dataset
isn't a critical piece missing -- "cross-validated dissimilarity"?
            all possible metrics.""")

        pairwise_metric_kwargs = Parameter({}, doc="""
            kwargs dictionary passed to cdist. For example,
please indent like in other docs
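The kind of usage this parameter enables, shown directly against SciPy's `cdist` (CDist would forward the same kwargs); this also covers the "extend CDist to mahalanobis distance" commit, since mahalanobis needs the inverse covariance matrix as an extra argument:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X, Y = rng.rand(5, 3), rng.rand(4, 3)

# Mahalanobis distance needs the inverse covariance matrix passed as VI --
# exactly the kind of extra argument pairwise_metric_kwargs carries to cdist.
VI = np.linalg.inv(np.cov(np.vstack([X, Y]).T))
d = cdist(X, Y, metric='mahalanobis', VI=VI)
```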
            # Make target pairs
            distds = Dataset(samples=distds.ravel()[:, None],
                             sa={'pairs': list(product(self._train_ds.T,
                                                       test_ds.T))})
do not use .T -- since you made it configurable, it should be the values of self.params.sattr.

I wondered if you would have gotten any advantage from using FlattenMapper: originally just make an NxN dataset, then flatten it, which should provide 1 x N**2 (and I believe .fa should bear those pairs). Do you really need it N**2 x 1?
self.params.sattr is a list of attributes to run mean_group_sample on, so it doesn't fit well here. We can either make sattr a single sa instead of a list and use it here instead of .T, or force mean_group_sample to always mean over a single sa and not accept a list. Any suggestion on which is preferable?
Btw, then why not keep those which you get from mean_group_sample? Assign them to the corresponding sa, fa of your matrix... I wonder though what FlattenMapper would do.

Also add a test where you have a list and see how your code would break now.
We could keep the sa from mean_group_sample if we don't ravel(). The reason I didn't choose FlattenMapper is that I want to flatten the matrix into samples/rows; FlattenMapper does it into columns/features (right?).
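The row-versus-column distinction in that exchange, shown in plain NumPy (C-order ravel walks rows first, which is what mapping the matrix into samples relies on):

```python
import numpy as np

D = np.arange(6).reshape(2, 3)     # a toy 2x3 "distance matrix"

rows_first = D.ravel()             # row-major (C order): [0 1 2 3 4 5]
cols_first = D.ravel(order='F')    # column-major (Fortran): [0 3 1 4 2 5]

# Flattening into samples (an (N*M, 1) column) keeps the row-major pair
# order, matching itertools.product(train_labels, test_labels).
as_samples = rows_first[:, None]
```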
            # skip testing for now, since we are having issue with 'training_stats'
            raise SkipTest("Skipping test to avoid issue with 'training_stats' while CA enabled")

        targets = np.tile(range(3), 2)
ha -- those lines are not covered now... the cons of our old-fashioned travis setup -- I will fix it
        res = pymvpa_cdist(test_data)
        # Check to make sure the cdist results are close to CDist results
        assert_array_almost_equal(res.samples.ravel(),
                                  scipy_cdist.ravel())
since it is to be trained once and then reused on new data -- it would have been nice to test that it provides the correct result (comparing to an independent new train/test cycle)
I added an extra test that re-uses the trained measure again (on the original training data), and it should still work (in this case, mimicking pdist). Hope that's what you are referring to.
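The idea of that check, reduced to SciPy calls (a CDist-style measure trained and tested on the same data should reproduce pdist, and a second call must give identical results):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform

rng = np.random.RandomState(0)
train = rng.rand(4, 3)

# "Re-using the trained measure on its own training data" collapses to a
# self-distance: cdist(train, train) must match the condensed pdist result.
self_dist = cdist(train, train, metric='euclidean')
assert np.allclose(self_dist, squareform(pdist(train, metric='euclidean')))

# Calling again must give the identical result (no hidden mutable state).
assert np.array_equal(self_dist, cdist(train, train, metric='euclidean'))
```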
                             generator=NFoldPartitioner(),
                             errorfx=None)
        res = cv(ds)
        # Testing to make sure both folds return the same results, as they should
so if there is a bug and it doesn't recompute the measure across folds we will be just fine? ;)
In this case, it has to be recomputed across folds: the results for the two folds are not the same but transposes of each other after we reshape into 3-by-3 (as done in the assert after this). If it put the same values in both folds, it should fail (in theory).
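The property being relied on, in SciPy terms: swapping the train and test halves transposes the cross-distance matrix, so identical fold results would flag a measure that was not recomputed.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(1)
half_a, half_b = rng.rand(3, 4), rng.rand(3, 4)

fold1 = cdist(half_a, half_b, metric='euclidean')  # train on A, test on B
fold2 = cdist(half_b, half_a, metric='euclidean')  # train on B, test on A

# The folds differ elementwise but are exact transposes of each other.
assert not np.allclose(fold1, fold2)
assert np.allclose(fold1, fold2.T)
```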
Thanks @yarikoptic, working on the review.
                mgs = mean_group_sample(attrs=self.params.sattr)
                ds_ = mgs(ds)
            else:
                ds_ = ds.copy(deep=True)
Btw why bother deep copying the data??
No need. Left over from old implementation.
@yarikoptic I have addressed all your comments/requests except how to return the results dataset. I will do it once we decide on something. Thanks!
@yarikoptic Made changes to return the dataset as we discussed this morning.
It would be nice to get some example/demo, e.g. comparing CV against "in sample" results.
This implements cross-validated RSA. It adds CDist as a measure that will eventually be passed to CrossValidation to get cvRSA.

With help from @mvdoc @snastase @feilong
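A toy end-to-end sketch of the cross-validated RSA idea (plain NumPy/SciPy standing in for CDist + CrossValidation; fold contents here are random placeholders, not real data): condition-mean patterns come from two independent folds, and the RDM is computed *across* folds so that noise shared within a fold cannot inflate the similarity structure.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
n_conditions, n_features = 3, 8

# Two independent "folds" of per-condition pattern means.
fold1 = rng.rand(n_conditions, n_features)
fold2 = rng.rand(n_conditions, n_features)

# Cross-validated RDM: distances between condition means of *different*
# folds -- the core of what CDist inside CrossValidation computes.
cv_rdm = cdist(fold1, fold2, metric='correlation')
```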