ENH: Basic implementation of cross-validation rsm using learner and CV. #504

Merged: 21 commits, merged Oct 9, 2016
Showing changes from 7 commits

Commits (21):
3215b70  ENH: Basic implementation of cross-validation rsm using learner and CV.  (swaroopgj, Sep 29, 2016)
f2d6f00  REF,DOC: Refactor CDist, add DOC  (mvdoc, Sep 30, 2016)
7b20a10  Refactor test  (mvdoc, Sep 30, 2016)
b41e527  Don't duplicate code with _prepare_ds  (mvdoc, Sep 30, 2016)
f09f01a  Remove extra line  (mvdoc, Sep 30, 2016)
94ad406  Add is_trained flag  (mvdoc, Sep 30, 2016)
e8ef537  Merge pull request #8 from mvdoc/cvRSA  (swaroopgj, Sep 30, 2016)
7231b34  BF,TST: test without group_mean_samples, fix pairs feature attr  (mvdoc, Oct 2, 2016)
972fa5c  TST: test CDist using cross-validation  (mvdoc, Oct 2, 2016)
a243ad6  Add kwargs for distances, test mahalanobis distance, simplify tests  (mvdoc, Oct 2, 2016)
9f76829  Merge pull request #9 from mvdoc/cvRSA  (swaroopgj, Oct 3, 2016)
f8c7184  RF&ENH: CDist returns an M-by-1 dataset so that it works with Searchl…  (swaroopgj, Oct 4, 2016)
d67b9f7  BF: skipping test when CA enabled  (swaroopgj, Oct 4, 2016)
ebced46  ENH: skipping test with SkipTest  (swaroopgj, Oct 4, 2016)
abbe6d6  RF&DOC: Indenting doc consistently in CDist  (swaroopgj, Oct 4, 2016)
a6b1415  RF: removed deep copying dataset  (swaroopgj, Oct 4, 2016)
8a955c9  ENH: Test to make sure multiple calls after training behave as expected.  (swaroopgj, Oct 4, 2016)
72b3fc6  BF: fixing test with pdist case  (swaroopgj, Oct 4, 2016)
6a8d2f5  ENH&RF: Ability to handle sa properly, and check for match of nsamples  (swaroopgj, Oct 5, 2016)
5a6f13e  DOC  (swaroopgj, Oct 5, 2016)
bd94cb3  ENH: Added tests for nsamples check and result sa check  (swaroopgj, Oct 5, 2016)
48 changes: 46 additions & 2 deletions -- mvpa2/measures/rsa.py

@@ -10,19 +10,63 @@

__docformat__ = 'restructuredtext'

from itertools import combinations
from itertools import combinations, product
import numpy as np
from mvpa2.measures.base import Measure
from mvpa2.datasets.base import Dataset
from mvpa2.base import externals
from mvpa2.base.param import Parameter
from mvpa2.base.constraints import EnsureChoice
from mvpa2.mappers.fx import mean_group_sample

if externals.exists('scipy', raise_=True):
from scipy.spatial.distance import pdist, squareform
from scipy.spatial.distance import pdist, squareform, cdist
from scipy.stats import rankdata, pearsonr


class CDist(Measure):
    """Compute dissimilarity matrix for samples in a dataset

    This `Measure` can be trained on part of the dataset (for example,
    a partition) and called on another partition. It can be used in
    cross-validation to generate cross-validated RSA.
    """

Review comment (Member): Isn't a critical piece missing -- "cross-validated dissimilarity"?
    pairwise_metric = Parameter('correlation', constraints='str', doc="""
        Distance metric to use for calculating pairwise vector distances for
        the dissimilarity matrix (DSM). See scipy.spatial.distance.pdist for
        all possible metrics.""")

    sattr = Parameter(['targets'], doc="""
        List of sample attributes whose unique values will be used to identify
        the sample groups. Typically your category labels or targets.""")

    def __init__(self, **kwargs):
        Measure.__init__(self, **kwargs)
        self._train_ds = None

Review comment (Member): Should we be using super() here?
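The reviewer's question is about the cooperative-inheritance form of the same call. A minimal self-contained sketch of the two equivalent styles (the `Measure` stub below is a stand-in for mvpa2's class, defined here only so the example runs on its own):

```python
# Minimal stand-in for mvpa2's Measure, just to illustrate the two
# __init__ styles under discussion; not the real PyMVPA class.
class Measure(object):
    def __init__(self, **kwargs):
        self.kwargs = kwargs

class CDistExplicit(Measure):
    def __init__(self, **kwargs):
        Measure.__init__(self, **kwargs)   # explicit form used in the PR
        self._train_ds = None

class CDistSuper(Measure):
    def __init__(self, **kwargs):
        super(CDistSuper, self).__init__(**kwargs)  # reviewer's suggestion
        self._train_ds = None

a, b = CDistExplicit(x=1), CDistSuper(x=1)
assert a.kwargs == b.kwargs == {'x': 1}
```

For single inheritance the two are behaviorally identical; `super()` only matters once cooperative multiple inheritance enters the picture.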

    def _prepare_ds(self, ds):
        if self.params.sattr is not None:
            mgs = mean_group_sample(attrs=self.params.sattr)
            ds_ = mgs(ds)
        else:
            ds_ = ds.copy(deep=True)
        return ds_

Review comment (Member): Btw, why bother deep copying the data?

Reply (Author): No need. Left over from the old implementation.
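Conceptually, the sattr-driven grouping plus the later cdist call boil down to: average samples within each condition on the training partition, then measure distances against the condition means of the test partition. A minimal numpy/scipy sketch of that idea (toy shapes and the `mean_per_target` helper are illustrative assumptions, not the PyMVPA API):

```python
import numpy as np
from scipy.spatial.distance import cdist

def mean_per_target(samples, targets):
    """Average the rows of `samples` within each unique target label."""
    labels = np.unique(targets)
    means = np.vstack([samples[targets == t].mean(axis=0) for t in labels])
    return means, labels

rng = np.random.RandomState(42)
n_conditions, n_features = 3, 8
# Two partitions (e.g. chunks/runs), 2 samples per condition each (toy data)
targets = np.repeat(np.arange(n_conditions), 2)
train_samples = rng.randn(len(targets), n_features)
test_samples = rng.randn(len(targets), n_features)

# "Training": condition means from one partition
train_means, labels = mean_per_target(train_samples, targets)
# "Call": distances between training means and test-partition means
test_means, _ = mean_per_target(test_samples, targets)
dsm = cdist(train_means, test_means, metric='correlation')
# dsm[i, j] is the cross-validated dissimilarity between condition
# labels[i] (train partition) and labels[j] (test partition)
assert dsm.shape == (n_conditions, n_conditions)
```

Because the two partitions contain independent noise, the diagonal of `dsm` is not forced to zero, which is the point of the cross-validated variant.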

    def _train(self, ds):
        self._train_ds = self._prepare_ds(ds)
        self.is_trained = True

    def _call(self, ds):
        test_ds = self._prepare_ds(ds)
        # Compute cross-distances with the requested metric
        distds = cdist(self._train_ds.samples, test_ds.samples,
                       metric=self.params.pairwise_metric)
        # Make target pairs
        distds = Dataset(samples=distds.ravel()[None, ],
                         fa={'pairs': list(product(test_ds.UT, test_ds.UT))})
        return distds

Review comment (Author): Here, we are arranging samples as folds, to be consistent with cross-validation as used with classifiers, and target pairs as features. So the dataset is already 2-D for a given ROI. But when we run this in a Searchlight setting, this might become an issue, as it vstacks datasets along the "target-pairs" dimension, messing with the feature dimension of Searchlight. One option is to keep the result in each ROI as an nFolds-by-1 dataset, with that single element containing an array of "target-pairs". Any suggestions on how to handle this? @yarikoptic @mvdoc

Reply (Author, Oct 4, 2016): So, I decided to return the result as a single-feature dataset. This way it will work with Searchlight, and users can collapse over cvfolds in sa if they wish to. I think it is ready to consider for the PR. @mvdoc @yarikoptic

Reply (Member): OK for me about returning samples instead of features. I didn't notice that this was the way Andy returned the pairs in PDist already. We need to fix the test for test_CDist_cval in the Travis env when ca are enforced. @swaroopgj, do you want me to send a PR to your branch, or can you modify it?

Reply (Author): I think that failure is an issue to be addressed in CrossValidation; I am about to open an issue in that regard. @mvdoc, do you think we should just bypass it for now in the tests?

Reply (Member): Let's see what @yarikoptic says first ;-)
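The pairing of `distds.ravel()` with `itertools.product` relies on numpy's row-major ravel order matching product's iteration order. A small sketch with made-up 2-condition toy data (labels and values are illustrative only):

```python
import numpy as np
from itertools import product
from scipy.spatial.distance import cdist

# Toy data: 2 training condition patterns and 2 test condition patterns
train = np.array([[0., 0., 0., 0.],
                  [1., 1., 1., 1.]])
test = np.array([[0., 0., 0., 1.],
                 [2., 2., 2., 2.]])
targets = ['a', 'b']

# cdist returns a 2x2 matrix; entry [i, j] is d(train[i], test[j])
d = cdist(train, test, metric='cityblock')

# Row-major ravel visits (0,0), (0,1), (1,0), (1,1) -- exactly the order
# product() emits the label pairs, so pairs[k] names distance d.ravel()[k]
pairs = list(product(targets, targets))
print(pairs)      # [('a', 'a'), ('a', 'b'), ('b', 'a'), ('b', 'b')]
print(d.ravel())  # [1. 8. 3. 4.]
```

Note the asymmetry: `('a', 'b')` is train-'a' versus test-'b', which differs from `('b', 'a')` here, so both orderings carry information in the cross-validated case.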


class PDist(Measure):
    """Compute dissimilarity matrix for samples in a dataset

18 changes: 18 additions & 0 deletions -- mvpa2/tests/test_rsa.py

@@ -73,6 +73,24 @@ def test_PDistConsistency():
    assert_array_almost_equal(res4.samples, cres2)


def test_CDist():
    targets = np.tile(range(3), 2)
    chunks = np.repeat(np.array((0, 1)), 3)
    ds = dataset_wizard(samples=data, targets=targets, chunks=chunks)
    # Some distance metrics
    metrics = ['euclidean', 'correlation', 'cityblock']
    for metric in metrics:
        pd_ = pdist(data, metric)
        cd_ = CDist(pairwise_metric=metric)

        assert_true(not cd_.is_trained)
        cd_.train(ds[ds.sa.chunks == 0, ])
        assert_true(cd_.is_trained)
        res = cd_(ds[ds.sa.chunks == 1, ])
        # Check to make sure the pdist results are close to CDist results
        assert_array_almost_equal(res.samples.ravel(),
                                  squareform(pd_)[:3, 3:].ravel())
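The identity this test leans on -- that the cross-block of the full squareform'd pdist matrix equals cdist between the two halves -- can be checked in isolation (random toy data below, any shared metric):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist, squareform

rng = np.random.RandomState(0)
X = rng.randn(6, 10)  # 6 samples split into two halves of 3, 10 features

for metric in ['euclidean', 'correlation', 'cityblock']:
    # Full 6x6 distance matrix recovered from the condensed pdist output
    full = squareform(pdist(X, metric))
    # Cross-distances between the first 3 and last 3 samples
    cross = cdist(X[:3], X[3:], metric)
    # The upper-right 3x3 block of the full matrix is exactly `cross`
    assert np.allclose(full[:3, 3:], cross)
```

This is why the test can compare `CDist` (trained on chunk 0, called on chunk 1) against a plain `pdist` over the whole dataset.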


def test_PDist():
    targets = np.tile(xrange(3), 2)