
CSYE7105 Parallel Machine Learning & AI - 2020 Fall

Instructor: Dr. Handan Liu

Week 8-2: Grid Search Cross Validation in Parallel


In [1]:
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.datasets import make_classification
from time import time

In [2]:
# Synthetic binary classification data; raise n_samples (e.g. int(1e6)) for a heavier benchmark
X_train, y_train = make_classification(n_samples=10000, n_features=50, random_state=0)

In [3]:
model = ExtraTreesClassifier(class_weight='balanced')
parameters = {'criterion': ['gini', 'entropy'],
              'min_samples_split': [2, 4, 8],
              'max_depth': [3, 10, 20]}
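
The size of this grid determines how much work there is to parallelize. The following is a minimal sketch, using only the standard library, of how the candidates can be enumerated as a Cartesian product. The `n_folds = 5` value is an assumption based on the default `n_splits` of `StratifiedKFold`; scikit-learn's own `ParameterGrid` helper is the usual way to do this enumeration.

```python
from itertools import product

# Same grid as in the notebook cell above.
parameters = {'criterion': ['gini', 'entropy'],
              'min_samples_split': [2, 4, 8],
              'max_depth': [3, 10, 20]}
n_folds = 5  # assumption: StratifiedKFold's default n_splits

# Cartesian product of all value lists, one dict per candidate.
names = sorted(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[n] for n in names))]

n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits")
```

Each of these fits is independent of the others, which is why the search can be spread across worker processes with `n_jobs`.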

In [4]:
# Sequential baseline: n_jobs=1 runs every fit in the main process
start = time()
clf = GridSearchCV(model, parameters, verbose=2, scoring='roc_auc',
                   cv=StratifiedKFold(shuffle=True), n_jobs=1)
clf.fit(X_train, y_train)
elapsed = time() - start
print('Elapsed time: ', elapsed)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] criterion=gini, max_depth=3, min_samples_split=2 ................

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.

[CV] . criterion=gini, max_depth=3, min_samples_split=2, total=   0.3s
[CV] criterion=gini, max_depth=3, min_samples_split=2 ................

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s

[CV] . criterion=gini, max_depth=3, min_samples_split=2, total=   0.3s
[CV] criterion=gini, max_depth=3, min_samples_split=2 ................
[CV] . criterion=gini, max_depth=3, min_samples_split=2, total=   0.3s
[CV] criterion=gini, max_depth=3, min_samples_split=2 ................
[CV] . criterion=gini, max_depth=3, min_samples_split=2, total=   0.3s
[CV] criterion=gini, max_depth=3, min_samples_split=2 ................
[CV] . criterion=gini, max_depth=3, min_samples_split=2, total=   0.3s
[CV] criterion=gini, max_depth=3, min_samples_split=4 ................
[CV] . criterion=gini, max_depth=3, min_samples_split=4, total=   0.3s
[CV] criterion=gini, max_depth=3, min_samples_split=4 ................
[CV] . criterion=gini, max_depth=3, min_samples_split=4, total=   0.3s
[CV] criterion=gini, max_depth=3, min_samples_split=4 ................
[CV] . criterion=gini, max_depth=3, min_samples_split=4, total=   0.3s
[CV] criterion=gini, max_depth=3, min_samples_split=4 ................
[CV] .

[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   54.3s finished

Elapsed time:  55.543402910232544
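
Before raising `n_jobs`, it can help to check how many logical cores the machine reports, since asking for more workers than cores rarely helps. The sketch below is illustrative only: `requested_jobs` is a hypothetical value chosen to match the largest run in this notebook, and on a shared cluster the number of cores actually allocated to a job may be smaller than what `os.cpu_count()` returns.

```python
import os

# Logical cores visible to the OS; cpu_count() may return None on unusual platforms.
logical_cores = os.cpu_count() or 1

requested_jobs = 8  # hypothetical request, matching the largest run in this notebook
n_jobs = max(1, min(requested_jobs, logical_cores))
print(f"logical cores: {logical_cores}, using n_jobs={n_jobs}")
```

Passing `n_jobs=-1` to `GridSearchCV` is the shorthand for using all available processors.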


In [5]:
# Parallel: n_jobs=4 distributes the fits across 4 worker processes
start = time()
clf = GridSearchCV(model, parameters, verbose=4, scoring='roc_auc',
                   cv=StratifiedKFold(shuffle=True), n_jobs=4)
clf.fit(X_train, y_train)
elapsed = time() - start
print('Elapsed time: ', elapsed)

Fitting 5 folds for each of 18 candidates, totalling 90 fits

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  17 tasks      | elapsed:    2.9s
[Parallel(n_jobs=4)]: Done  90 out of  90 | elapsed:   19.2s finished

Elapsed time:  20.425665855407715


In [6]:
# Parallel: n_jobs=8 distributes the fits across 8 worker processes
start = time()
clf = GridSearchCV(model, parameters, verbose=4, scoring='roc_auc',
                   cv=StratifiedKFold(shuffle=True), n_jobs=8)
clf.fit(X_train, y_train)
elapsed = time() - start
print('Elapsed time: ', elapsed)

Fitting 5 folds for each of 18 candidates, totalling 90 fits

[Parallel(n_jobs=8)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=8)]: Done   9 tasks      | elapsed:    2.1s
[Parallel(n_jobs=8)]: Done  90 out of  90 | elapsed:   14.9s finished

Elapsed time:  16.296517610549927
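
To compare the three runs, the recorded wall-clock times can be turned into speedup (sequential time divided by parallel time) and parallel efficiency (speedup divided by the number of workers). The sketch below simply copies the timings printed above; the `timings` and `results` names are introduced here for illustration.

```python
# Wall-clock times in seconds, copied from the three runs above (keyed by n_jobs).
timings = {1: 55.543402910232544,
           4: 20.425665855407715,
           8: 16.296517610549927}

baseline = timings[1]  # sequential run is the reference
results = {}
for n_jobs, elapsed in sorted(timings.items()):
    speedup = baseline / elapsed      # how many times faster than sequential
    efficiency = speedup / n_jobs     # fraction of ideal linear scaling
    results[n_jobs] = (speedup, efficiency)
    print(f"n_jobs={n_jobs}: speedup={speedup:.2f}x, efficiency={efficiency:.0%}")
```

On these numbers the speedup is roughly 2.7x with 4 workers and 3.4x with 8, so doubling the workers gives well below double the throughput. These are single, unseeded measurements, so they are best read as indicative rather than as a precise benchmark.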
