Use scikit-learn's interface to joblib for multiprocessing #24

weixuanfu · 2017-04-10T16:20:20Z

Removed the standalone joblib dependency and replaced it with scikit-learn's customized joblib.
Moved the scoring instance method out of the class so that it can be pickled in Python 2.7.
Works on both Windows and Python 2.7.

Related issue #22
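
For readers unfamiliar with the pickling problem mentioned in the description: on Python 2.7 the standard pickle module cannot serialize bound instance methods, so a process-based backend cannot ship them to worker processes. A minimal sketch of the general pattern follows. It is not the code from this PR; `score_feature`, `ToySelector`, and `is_picklable` are hypothetical names used only for illustration, and the toy score is not a Relief score.

```python
import pickle


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except Exception:
        return False
    return True


def score_feature(column, labels):
    """Hypothetical module-level scoring function (toy metric).

    Returns the fraction of samples whose feature value equals the label.
    Because it lives at module level and takes plain data, a worker
    process only needs the function's name and its arguments.
    """
    matches = sum(1 for value, label in zip(column, labels) if value == label)
    return matches / float(len(labels))


class ToySelector(object):
    """Hypothetical estimator that delegates scoring to score_feature."""

    def __init__(self, columns, labels):
        self.columns = columns
        self.labels = labels

    def fit(self):
        # Call the top-level function instead of a bound method such as
        # self._score; in a parallel version this call would be the unit
        # of work handed to the backend.
        self.scores_ = [score_feature(col, self.labels) for col in self.columns]
        return self
```

Objects that cannot be looked up by name, such as lambdas, fail `is_picklable`, while a function defined at the top level of an importable module is pickled by reference to its name.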

Test code

import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline
from skrebate import ReliefF, SURF, SURFstar, MultiSURF
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data_link = ('https://github.com/EpistasisLab/scikit-rebate/raw/master/data/'
            'GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz')

genetic_data = pd.read_csv(data_link, sep='\t', compression='gzip')

features, labels = genetic_data.drop('class', axis=1).values, genetic_data['class'].values


if __name__ == "__main__":
    # NOTE: the following is placed inside an 'if __name__ == "__main__"'
    # block so that the multi-core algorithms also work under Windows; see
    # http://docs.python.org/library/multiprocessing.html#windows
    # The multiprocessing module is the backend of joblib.Parallel,
    # which is used when n_jobs != 1.

    # ReliefF

    clf = make_pipeline(ReliefF(n_features_to_select=2, n_neighbors=100, n_jobs=-1),
                        RandomForestClassifier(n_estimators=100))

    print('ReliefF', np.mean(cross_val_score(clf, features, labels)))


    # SURF

    clf = make_pipeline(SURF(n_features_to_select=2, n_jobs=-1),
                        RandomForestClassifier(n_estimators=100))
    print('SURF', np.mean(cross_val_score(clf, features, labels)))

    # SURF*

    clf = make_pipeline(SURFstar(n_features_to_select=2, n_jobs=-1),
                        RandomForestClassifier(n_estimators=100))

    print('SURF*', np.mean(cross_val_score(clf, features, labels)))

    # MultiSURF

    clf = make_pipeline(MultiSURF(n_features_to_select=2, n_jobs=-1),
                        RandomForestClassifier(n_estimators=100))

    print('MultiSURF', np.mean(cross_val_score(clf, features, labels)))

    # TURF

    clf = make_pipeline(RFE(ReliefF(n_jobs=-1), n_features_to_select=2),
                        RandomForestClassifier(n_estimators=100))

    print('TURF', np.mean(cross_val_score(clf, features, labels)))


coveralls · 2017-04-10T16:26:30Z

Coverage increased (+25.4%) to 73.969% when pulling 993f50a on weixuanfu2016:test_sklearn_joblib into 6a3ba69 on EpistasisLab:development.

rhiever · 2017-04-12T15:11:57Z

Thank you, @weixuanfu2016! Will look into this PR soon. We need to verify that these changes don't change the resulting scores at all.
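
One way such a check could look, as a rough sketch rather than the procedure actually used for this PR: compare the per-feature scores from a run before the change with those from a run after it, element by element, within a small tolerance. The function name `scores_match` and the tolerance values are assumptions. Comparing feature scores is more reliable than comparing the cross-validation accuracies printed by the test script, because `RandomForestClassifier` is not seeded there and its results vary from run to run.

```python
import math


def scores_match(before, after, rel_tol=1e-9, abs_tol=1e-12):
    """Return True if two sequences of feature scores agree element-wise.

    `before` and `after` are plain sequences of floats, e.g. the scores
    computed by the old and the new implementation on the same data.
    """
    if len(before) != len(after):
        return False
    return all(
        math.isclose(b, a, rel_tol=rel_tol, abs_tol=abs_tol)
        for b, a in zip(before, after)
    )
```

A call such as `scores_match(old_scores, new_scores)` returning `False` would indicate that the refactoring changed the computed scores, not only how the work is distributed.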

weixuanfu2016 added 11 commits April 7, 2017 13:17

py27 not working

fff007b

py27 not working

e7f5825

Merge branch 'test_sklearn_joblib' of github.com:weixuanfu2016/scikit-rebate into test_sklearn_joblib

d230fbf

still not work in python27 due to instance method

10eb705

try classmethod

40fd671

relieff works in python 2.7

9717198

add scoring utils

7ed822b

add __main__ in examples

dea2d46

add __main__

8a6454e

multiprocssing in p27 and windows works

1ce6479

clean codes

993f50a

rhiever added the enhancement label Apr 12, 2017

rhiever merged commit 1e9d9bd into EpistasisLab:development Apr 12, 2017
