## Practical Exercise 12-3

We perform a randomized search with scikit-learn.

First, we run a randomized search on the task from Practical Exercise 9-1.

In [1]:
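import scipy as sp
import scipy.stats  # import the submodule explicitly so that sp.stats is available
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2; this notebook assumes an older release.
from sklearn.datasets import load_iris, load_boston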
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import RandomizedSearchCV

We load the iris data and store it in the pattern matrix X and the target vector y.

In [2]:
iris = load_iris()
X = iris.data
y = iris.target

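### Checking the behavior of scipy.stats.expon

[scipy.stats.expon](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.expon.html) represents the exponential distribution, whose probability density function is $\frac{1}{scale}\exp\left(-\frac{x}{scale}\right)$ for $x \geq 0$. Its rvs method draws random numbers from that distribution and returns them in an ndarray. Here we generate 10 random numbers with scale=1.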

In [3]:
sp.stats.expon(scale=1).rvs(size=10)

array([ 0.11886781,  2.97239487,  1.81804597,  0.16100166,  0.77019644,
        0.48345519,  1.00452551,  0.30055887,  0.08398801,  0.51109484])
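As a quick sanity check, the sketch below compares the density returned by `expon(scale=s).pdf` with the closed form $\exp(-x/s)/s$ and compares the sample mean with the scale. The value `scale = 100.0`, the evaluation points, the seed, and the sample size are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

scale = 100.0                      # illustrative value
dist = stats.expon(scale=scale)    # frozen exponential distribution

# The density should match exp(-x / scale) / scale for x >= 0.
x = np.linspace(0.0, 500.0, 6)
closed_form = np.exp(-x / scale) / scale
pdf_matches = np.allclose(dist.pdf(x), closed_form)

# The mean of an exponential distribution equals its scale.
samples = dist.rvs(size=100_000, random_state=0)
sample_mean = samples.mean()

print(pdf_matches, dist.mean(), sample_mean)
```

Because the mean equals the scale, `scale` can be read as the typical magnitude of the values that will be sampled.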

## Application to a classification problem

Create an SVC instance.

In [4]:
svc = SVC(kernel='rbf')
svc

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

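Define the search space. For a randomized search, each parameter is given a distribution to sample from instead of a list of grid points.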

In [5]:
params = {'C': sp.stats.expon(scale=100), 'gamma': sp.stats.expon(scale=.1)}
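To see what kind of candidates these distributions produce, scikit-learn's `ParameterSampler` can draw parameter settings from the same kind of dictionary. The following is a minimal sketch; the name `param_distributions`, the number of draws, and the seed are chosen only for illustration.

```python
from scipy import stats
from sklearn.model_selection import ParameterSampler

# Same distributions as in the cell above (names here are illustrative).
param_distributions = {
    'C': stats.expon(scale=100),
    'gamma': stats.expon(scale=.1),
}

# Draw 5 candidate settings; random_state fixes the draws for reproducibility.
candidates = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
for candidate in candidates:
    print(candidate)
```

Each candidate is a dictionary with one sampled value per parameter, in the same form as the entries of `cv_results_['params']`.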

Run [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html). We set the number of iterations to 20.

In [6]:
clf = RandomizedSearchCV(svc, params, n_iter=20)
clf.fit(X, y)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          fit_params={}, iid=True, n_iter=20, n_jobs=1,
          param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x00000216EB424B00>, 'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x00000216EB424908>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)
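The output above shows `random_state=None`, so each run samples different candidates. If reproducible results are wanted, one option is to pass a fixed `random_state` to `RandomizedSearchCV`. The sketch below assumes the same iris setup; the helper `run_search` and the seed value are hypothetical and exist only for this example.

```python
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_distributions = {
    'C': stats.expon(scale=100),
    'gamma': stats.expon(scale=.1),
}

def run_search(seed):
    """Run a 20-iteration randomized search with a fixed seed (illustrative helper)."""
    search = RandomizedSearchCV(SVC(kernel='rbf'), param_distributions,
                                n_iter=20, random_state=seed)
    search.fit(X, y)
    return search

first = run_search(0)
second = run_search(0)

# With the same seed, both runs sample the same candidates.
print(first.best_params_ == second.best_params_)
```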

Summary of the results (all results, the best parameters, and the best score).

In [7]:
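# Use names that do not shadow the standard re module or the params dictionary defined above.
results = clf.cv_results_
for candidate, mean_score, std_score in zip(results['params'], results['mean_test_score'], results['std_test_score']):
    print("{:.3f} (+/- {:.3f}) for {}".format(mean_score, std_score, candidate))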

0.967 (+/- 0.046) for {'C': 163.76357775619672, 'gamma': 0.06407029578622668}
0.953 (+/- 0.037) for {'C': 261.56662178058355, 'gamma': 0.40148191023134877}
0.973 (+/- 0.037) for {'C': 112.07618541339379, 'gamma': 0.056015286355579279}
0.960 (+/- 0.043) for {'C': 150.45823527396726, 'gamma': 0.1652757941322654}
0.967 (+/- 0.033) for {'C': 46.131549474832894, 'gamma': 0.04707946706513999}
0.967 (+/- 0.033) for {'C': 58.086415897803242, 'gamma': 0.077337364858622137}
0.973 (+/- 0.024) for {'C': 19.035652055929926, 'gamma': 0.048777926319357082}
0.973 (+/- 0.037) for {'C': 92.607656788421139, 'gamma': 0.057549025470712716}
0.973 (+/- 0.024) for {'C': 28.031273390638429, 'gamma': 0.1257217977546001}
0.967 (+/- 0.033) for {'C': 409.55763163324389, 'gamma': 0.0040545502361042001}
0.967 (+/- 0.046) for {'C': 61.141871828454164, 'gamma': 0.12988868546265031}
0.973 (+/- 0.024) for {'C': 18.794409685791479, 'gamma': 0.049993409483659969}
0.967 (+/- 0.033) for {'C': 43.579410143007777, 'gamma': 0.

In [8]:
clf.best_params_

{'C': 21.123676517718035, 'gamma': 0.049139526993008255}

In [9]:
clf.best_score_

0.97999999999999998
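Rather than reading through the full list, the candidates can be ordered by the `rank_test_score` entry of `cv_results_`. The sketch below reruns a smaller search so that it is self-contained. The value `n_iter=10`, the seed, and the choice to show the top three are assumptions made for brevity.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_distributions = {
    'C': stats.expon(scale=100),
    'gamma': stats.expon(scale=.1),
}
search = RandomizedSearchCV(SVC(kernel='rbf'), param_distributions,
                            n_iter=10, random_state=0)
search.fit(X, y)

results = search.cv_results_
order = np.argsort(results['rank_test_score'])   # rank 1 is the best candidate
for i in order[:3]:
    print("{:.3f} for {}".format(results['mean_test_score'][i], results['params'][i]))

# With refit=True (the default), the best setting is retrained on all of X, y.
best_model = search.best_estimator_
predictions = best_model.predict(X[:5])
print(predictions)
```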

## Application to a regression problem

We run a randomized search for regression with [RandomForestRegressor](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) on the boston data.

In [10]:
boston = load_boston()
X = boston.data
y = boston.target

Create a regressor instance.

In [11]:
rf = RandomForestRegressor()
rf

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

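Define the search space. It is made up of the number of regressors (n_estimators) and the tree size (controlled through min_samples_leaf).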

In [12]:
params = {'n_estimators': range(3,50), 'min_samples_leaf': range(1,20)}
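An integer range can also be written as a discrete distribution. The sketch below uses `scipy.stats.randint`, which is uniform over `low, ..., high - 1`, and draws candidates with `ParameterSampler` to inspect them. This is an alternative formulation, not what the notebook itself does. According to the scikit-learn documentation, sampling is without replacement when every parameter is given as a list, and with replacement when at least one is given as a distribution.

```python
from scipy import stats
from sklearn.model_selection import ParameterSampler

# randint(low, high) covers the same integers as range(low, high).
param_distributions = {
    'n_estimators': stats.randint(3, 50),
    'min_samples_leaf': stats.randint(1, 20),
}

# Draw 10 candidates with a fixed seed (both values chosen for illustration).
candidates = list(ParameterSampler(param_distributions, n_iter=10, random_state=0))
for candidate in candidates:
    print(candidate)
```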

For regression, we create a ShuffleSplit instance and pass it as the cv parameter of RandomizedSearchCV.

In [13]:
cv = ShuffleSplit(n_splits=3)
reg = RandomizedSearchCV(rf, params, cv=cv, scoring='r2')
reg.fit(X, y)

RandomizedSearchCV(cv=ShuffleSplit(n_splits=3, random_state=None, test_size=0.1, train_size=None),
          error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'n_estimators': range(3, 50), 'min_samples_leaf': range(1, 20)},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring='r2', verbose=0)
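The output above shows `test_size=0.1`, meaning each of the 3 splits holds out a random 10% of the samples for testing. The following sketch illustrates this on dummy data. The array `data`, its size of 50, and the seed are made up for the example.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

data = np.arange(50).reshape(-1, 1)   # 50 dummy samples, for illustration only
cv = ShuffleSplit(n_splits=3, test_size=0.1, random_state=0)

# Each split is an independent random partition into train and test indices.
split_sizes = [(len(train), len(test)) for train, test in cv.split(data)]
print(split_sizes)
```

Unlike K-fold cross-validation, the test sets of different splits are drawn independently, so they may overlap.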

We display the detailed results to confirm that cross-validation was applied correctly.

In [14]:
reg.cv_results_

{'mean_fit_time': array([ 0.12367312,  0.05064154,  0.04700486,  0.06789319,  0.00451001,
         0.0637211 ,  0.00534455,  0.02155797,  0.01571774,  0.00635004]),
 'mean_score_time': array([ 0.00300805,  0.00150394,  0.0016706 ,  0.00283488,  0.        ,
         0.00183876,  0.00084384,  0.00133705,  0.00100295,  0.        ]),
 'mean_test_score': array([ 0.86704333,  0.82359531,  0.80265675,  0.77885104,  0.7698688 ,
         0.82006898,  0.75792972,  0.81976054,  0.84528802,  0.79793068]),
 'mean_train_score': array([ 0.94991824,  0.89613666,  0.85716205,  0.83243839,  0.81411134,
         0.88360939,  0.82416124,  0.88464426,  0.92590605,  0.8545419 ]),
 'param_min_samples_leaf': masked_array(data = [3 7 13 19 19 9 18 8 4 10],
              mask = [False False False False False False False False False False],
        fill_value = ?),
 'param_n_estimators': masked_array(data = [43 29 32 48 3 42 4 13 9 4],
              mask = [False False False False False False False False False F

Summary of the results (all results, the best parameters, and the best score).

In [15]:
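# Use names that do not shadow the standard re module or the params dictionary defined above.
results = reg.cv_results_
for candidate, mean_score, std_score in zip(results['params'], results['mean_test_score'], results['std_test_score']):
    print("{:.3f} (+/- {:.3f}) for {}".format(mean_score, std_score, candidate))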

0.867 (+/- 0.056) for {'n_estimators': 43, 'min_samples_leaf': 3}
0.824 (+/- 0.058) for {'n_estimators': 29, 'min_samples_leaf': 7}
0.803 (+/- 0.086) for {'n_estimators': 32, 'min_samples_leaf': 13}
0.779 (+/- 0.080) for {'n_estimators': 48, 'min_samples_leaf': 19}
0.770 (+/- 0.095) for {'n_estimators': 3, 'min_samples_leaf': 19}
0.820 (+/- 0.075) for {'n_estimators': 42, 'min_samples_leaf': 9}
0.758 (+/- 0.099) for {'n_estimators': 4, 'min_samples_leaf': 18}
0.820 (+/- 0.056) for {'n_estimators': 13, 'min_samples_leaf': 8}
0.845 (+/- 0.073) for {'n_estimators': 9, 'min_samples_leaf': 4}
0.798 (+/- 0.085) for {'n_estimators': 4, 'min_samples_leaf': 10}


In [16]:
reg.best_params_

{'min_samples_leaf': 3, 'n_estimators': 43}

In [17]:
reg.best_score_

0.86704333194695549
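`load_boston` was removed in scikit-learn 1.2, so the regression cells do not run as written on recent releases. The sketch below repeats the same search with the bundled `load_diabetes` dataset swapped in. This substitution is an assumption for illustration, and the scores it produces are not comparable to the boston results shown here. The fixed seeds are also additions.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, ShuffleSplit

# Substitute dataset (not the one used in this exercise).
X, y = load_diabetes(return_X_y=True)

param_distributions = {
    'n_estimators': list(range(3, 50)),
    'min_samples_leaf': list(range(1, 20)),
}

cv = ShuffleSplit(n_splits=3, random_state=0)
search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions, n_iter=10, cv=cv,
                            scoring='r2', random_state=0)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```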