Confusion with documentation and MDR feature construction output #25

Closed
jay-reynolds opened this issue Jan 12, 2018 · 4 comments

@jay-reynolds

Hi, in the first example in the README, it states:

"For example, MDR can be used to construct a new feature composed from two existing features:"

but "GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1" used in the example has 21 columns, not 2.

The resulting output is a single column, which is a single feature -- is it that there's a single feature produced because that's what those 21 columns boiled down to, or is it because only 2 features from the dataframe were selected and used to construct the new feature? Or is there another reason?

Thanks in advance! I will continue reading the MDR paper I found on PubMed in the meantime.

@jay-reynolds
Author

jay-reynolds commented Jan 12, 2018

From the paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3500181/

"MDR pools genotypes into 'high-risk' and 'low-risk' or 'response' and 'non-response' groups in order to reduce multidimensional data into only one dimension."

And from the abstract (paper behind paywall): https://www.ncbi.nlm.nih.gov/pubmed/16457852

"To address this problem, we have previously developed a multifactor dimensionality reduction (MDR) method for collapsing high-dimensional genetic data into a single dimension (i.e. constructive induction) thus permitting interactions to be detected in relatively small sample sizes."

I suppose that answers my question.

Closing ticket.
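
The pooling described in the quoted passages can be illustrated with a minimal, standard-library-only sketch. This is not the scikit-mdr implementation: the function names `fit_mdr` and `transform_mdr`, the toy data, and the tie rule (ties count as low-risk) are all assumptions made for illustration. Each observed genotype combination is labelled high-risk (1) when its case:control ratio exceeds the ratio in the whole dataset, and low-risk (0) otherwise, so any number of input columns collapses into one constructed column.

```python
from collections import defaultdict


def fit_mdr(rows, labels):
    """Learn a genotype-combination -> high-risk (1) / low-risk (0) mapping.

    rows:   list of tuples of discrete feature values
    labels: list of 0 (control) / 1 (case)
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, label in zip(rows, labels):
        counts[tuple(row)][label] += 1

    n_cases = sum(labels)
    n_controls = len(labels) - n_cases

    # A cell is high-risk when cases/controls in the cell exceeds the
    # overall cases/controls ratio. The comparison is cross-multiplied to
    # avoid dividing by zero. Ties are treated as low-risk (an assumption).
    return {
        cell: int(cases * n_controls > controls * n_cases)
        for cell, (controls, cases) in counts.items()
    }


def transform_mdr(mapping, rows, default=0):
    """Collapse each row into the single constructed feature.

    Combinations never seen during fitting fall back to `default`.
    """
    return [mapping.get(tuple(row), default) for row in rows]


# Toy data: the label is the XOR of two binary features, a purely
# interactive (epistatic) pattern with no single-feature effect.
rows = [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0, 0, 1, 1, 0]

mapping = fit_mdr(rows, labels)
constructed = transform_mdr(mapping, rows)
print(constructed)
```

On this toy data the constructed column reproduces the labels exactly, because every genotype combination is either all cases or all controls.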

@rhiever
Contributor

rhiever commented Jan 15, 2018

Hi @jay-reynolds! I wanted to clarify this for you. In the example from the README:

from mdr import MDR
import pandas as pd

genetic_data = pd.read_csv('https://github.com/EpistasisLab/scikit-mdr/raw/development/data/GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz', sep='\t', compression='gzip')

features = genetic_data.drop('class', axis=1).values
labels = genetic_data['class'].values

my_mdr = MDR()
my_mdr.fit(features, labels)
my_mdr.transform(features)
>>>array([[1],
>>>       [1],
>>>       [1],
>>>       ...,
>>>       [0],
>>>       [0],
>>>       [0]])

We are taking all of the features from the dataset (20 features in total; the 21st column is the `class` label, which is dropped from `features` and used as `labels`) and constructing a single new feature from them. This is not a typical use of MDR, but it still works in this case because the example dataset is a fairly "easy" dataset for MDR.

Typically we use MDR in one of two ways:

  1. We know exactly which features we want to perform feature construction on, so we subset the DataFrame down to those features and provide only those features to MDR. The regression example in the README shows an example of this case.

  2. We don't know which features we want to perform feature construction on, so we perform an exhaustive combinatorial search of all possible feature combinations (typically tuples of 2 or 3 features), provide each of those tuples to MDR separately, and choose the best tuple(s) according to some MDR quality metric (typically, 10-fold CV accuracy).
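
The second approach can be sketched as follows. This is a hedged, standard-library-only illustration rather than the scikit-mdr API: `mdr_rule`, `cv_accuracy`, and `best_combination` are hypothetical helper names, the high-risk/low-risk rule is a simplified stand-in for MDR, the folds are simple interleaved splits, and the data is synthetic (the label is the XOR of features 1 and 3).

```python
from collections import defaultdict
from itertools import combinations, product


def mdr_rule(rows, labels):
    """Map each observed feature-value combination to 1 (high-risk) or 0."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, label in zip(rows, labels):
        counts[row][label] += 1
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    # High-risk when the cell's case:control ratio exceeds the overall one.
    return {cell: int(cases * n_controls > controls * n_cases)
            for cell, (controls, cases) in counts.items()}


def cv_accuracy(rows, labels, n_folds=10):
    """k-fold CV accuracy of the rule, using interleaved folds."""
    n = len(rows)
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))
        train_rows = [rows[i] for i in range(n) if i not in test_idx]
        train_labels = [labels[i] for i in range(n) if i not in test_idx]
        rule = mdr_rule(train_rows, train_labels)
        # Combinations unseen in training default to low-risk (0).
        correct += sum(rule.get(rows[i], 0) == labels[i] for i in test_idx)
    return correct / n


def best_combination(data, labels, order=2, n_folds=10):
    """Score every feature tuple of the given order; return (score, tuple)."""
    scored = []
    for combo in combinations(range(len(data[0])), order):
        subset = [tuple(row[j] for j in combo) for row in data]
        scored.append((cv_accuracy(subset, labels, n_folds), combo))
    return max(scored, key=lambda item: item[0])


# Synthetic data: 4 binary features, every combination repeated 5 times.
data = list(product([0, 1], repeat=4)) * 5
labels = [row[1] ^ row[3] for row in data]

best = best_combination(data, labels, order=2)
print(best)  # (1.0, (1, 3))
```

The search cost grows combinatorially with the number of features and the tuple order, which is one reason the search is usually limited to pairs and triples.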

@jay-reynolds
Author

Thank you for the explanation, very much appreciated!

I've got TPOT going, so I think I'll give TPOT-MDR a go and see what it comes up with.

Have you tried using, say, hyperopt for combinatorial search instead of brute force or evolutionary methods?

@rhiever
Contributor

rhiever commented Jan 16, 2018

> Have you tried using, say, hyperopt for combinatorial search instead of brute force or evolutionary methods?

We haven't tried that, but would be very curious to see a demo of it!
