
Use scikit-learn's interface to joblib for multiprocessing #24

Merged
11 commits merged into EpistasisLab:development from weixuanfu2016:test_sklearn_joblib on Apr 12, 2017


weixuanfu
Contributor

  1. Removed the standalone joblib dependency and replaced it with the joblib bundled with scikit-learn.

  2. Moved the scoring function out of the class (it was an instance method) so that it can be pickled in Python 2.7.

  3. Works on Windows and in Python 2.7 environments.
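To illustrate item 2, here is a minimal sketch of the pattern, not the actual skrebate code. The names `score_feature`, `Selector`, and `is_picklable` are hypothetical. Python 2.7's pickle cannot serialize bound instance methods, so a process-based backend cannot ship them to workers; a module-level function is pickled by reference and avoids the problem.

```python
import pickle


def score_feature(values, labels):
    """Hypothetical module-level scorer (not skrebate's real scoring function).

    Returns the fraction of samples whose feature value equals the label.
    """
    matches = sum(1 for v, y in zip(values, labels) if v == y)
    return float(matches) / len(labels)


class Selector(object):
    """Hypothetical estimator that delegates scoring to the function above."""

    def __init__(self, labels):
        self.labels = labels

    def score_all(self, columns):
        # Only plain data and a module-level function are involved in each
        # task, so nothing depends on pickling a bound method of `self`.
        return [score_feature(col, self.labels) for col in columns]


def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


scores = Selector([0, 1, 1, 0]).score_all([[0, 1, 1, 0], [1, 1, 0, 0]])
print(scores)
print(is_picklable(score_feature))
```

Recent Python 3 releases can pickle bound methods of picklable instances, so the restructuring matters mainly for Python 2.7 support.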

Related issue #22

Test code

import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from skrebate import ReliefF, SURF, SURFstar, MultiSURF
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data_link = ('https://github.com/EpistasisLab/scikit-rebate/raw/master/data/'
            'GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz')

genetic_data = pd.read_csv(data_link, sep='\t', compression='gzip')

features, labels = genetic_data.drop('class', axis=1).values, genetic_data['class'].values


if __name__ == "__main__":
    # NOTE: the code below sits inside an 'if __name__ == "__main__"' block
    # so that the multi-core algorithms also work under Windows, see:
    # http://docs.python.org/library/multiprocessing.html#windows
    # The multiprocessing module is the backend of joblib.Parallel,
    # which is used whenever n_jobs != 1.


    # ReliefF

    clf = make_pipeline(ReliefF(n_features_to_select=2, n_neighbors=100, n_jobs=-1),
                        RandomForestClassifier(n_estimators=100))

    print('ReliefF', np.mean(cross_val_score(clf, features, labels)))


    # SURF

    clf = make_pipeline(SURF(n_features_to_select=2, n_jobs=-1),
                        RandomForestClassifier(n_estimators=100))
    print('SURF', np.mean(cross_val_score(clf, features, labels)))

    # SURF*

    clf = make_pipeline(SURFstar(n_features_to_select=2, n_jobs=-1),
                        RandomForestClassifier(n_estimators=100))

    print('SURF*', np.mean(cross_val_score(clf, features, labels)))

    # MultiSURF

    clf = make_pipeline(MultiSURF(n_features_to_select=2, n_jobs=-1),
                        RandomForestClassifier(n_estimators=100))

    print('MultiSURF', np.mean(cross_val_score(clf, features, labels)))

    # TURF

    clf = make_pipeline(RFE(ReliefF(n_jobs=-1), n_features_to_select=2),
                        RandomForestClassifier(n_estimators=100))

    print('TURF', np.mean(cross_val_score(clf, features, labels)))

@coveralls


Coverage increased (+25.4%) to 73.969% when pulling 993f50a on weixuanfu2016:test_sklearn_joblib into 6a3ba69 on EpistasisLab:development.

@rhiever
Contributor

rhiever commented Apr 12, 2017

Thank you, @weixuanfu2016! Will look into this PR soon. We need to verify that these changes don't change the resulting scores at all.
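One way to approach that check is to compute the scores once serially and once in parallel, then compare them feature by feature. The sketch below is only an illustration under stated assumptions: it uses a toy scoring function and the standard library's thread pool as a stand-in for joblib, and every name in it (`feature_score`, `serial_scores`, `parallel_scores`) is hypothetical rather than part of skrebate.

```python
from concurrent.futures import ThreadPoolExecutor


def feature_score(column, labels):
    """Toy per-feature score: fraction of samples whose value equals the label."""
    matches = sum(1 for v, y in zip(column, labels) if v == y)
    return float(matches) / len(labels)


def serial_scores(columns, labels):
    """Reference result: score every feature in a plain loop."""
    return [feature_score(col, labels) for col in columns]


def parallel_scores(columns, labels, n_workers=2):
    """Same computation spread over a pool of workers."""
    # Executor.map yields results in input order, so scores line up by feature.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(lambda col: feature_score(col, labels), columns))


labels = [0, 1, 1, 0, 1]
columns = [[0, 1, 1, 0, 1], [1, 1, 0, 0, 1], [1, 0, 0, 1, 0]]

expected = serial_scores(columns, labels)
observed = parallel_scores(columns, labels)

# The parallel path should reproduce the serial scores feature by feature.
scores_unchanged = (len(expected) == len(observed) and
                    all(abs(a - b) < 1e-12 for a, b in zip(expected, observed)))
print(scores_unchanged)
```

For the real estimators, the same comparison could be made between scores fitted with `n_jobs=1` and with `n_jobs=-1` on the test data above.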

@rhiever merged commit 1e9d9bd into EpistasisLab:development on Apr 12, 2017