# Hyperparameter Optimization

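In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins, as opposed to a parameter that is learned from the training data. Hyperparameters usually need to be tuned: choosing a good set of them for the learner improves its performance.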
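To make the distinction concrete, here is a minimal sketch. It assumes an `SVC` with a linear kernel on the iris dataset, chosen only for illustration: `C` and `kernel` are set by hand before training, while `coef_` and `intercept_` only exist after `fit()`.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()

# C and kernel are hyperparameters: we choose them before fit() is called.
model = svm.SVC(C=1.0, kernel="linear")
hyperparameters = model.get_params()

# coef_ and intercept_ are learned parameters: they are created by fit().
model.fit(iris.data, iris.target)

print(hyperparameters["C"], hyperparameters["kernel"])
print(model.coef_.shape, model.intercept_.shape)
```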

Searching the hyperparameter space for the setting with the best cross-validation score is both possible and recommended.
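As a rough sketch of what "best cross-validation score" means, the snippet below scores two candidate values of `C` by hand with `cross_val_score` and keeps the better one. The candidate values and the 3-fold split are assumptions made for this example, not recommendations.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

# Hypothetical candidate values for C, chosen only for illustration.
candidates = [0.01, 1.0]

mean_scores = {}
for c in candidates:
    # One accuracy score per fold; the mean summarises this candidate.
    scores = cross_val_score(svm.SVC(C=c, kernel="linear"),
                             iris.data, iris.target, cv=3)
    mean_scores[c] = scores.mean()

# The candidate with the highest mean cross-validation score wins.
best_c = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_c)
```

The search classes described in this section automate exactly this loop over many more candidates.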

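`scikit-learn` provides two generic approaches to sampling search candidates:

+ Grid search, `GridSearchCV`, considers every combination of the given parameter values.

+ Randomized search, `RandomizedSearchCV`, draws a given number of candidates from a parameter space with specified distributions.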


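sklearn provides the following related interfaces:

+ `model_selection.GridSearchCV(estimator, …)`: grid search
+ `model_selection.ParameterGrid(param_grid)`: exhaustive enumeration of the parameters on a grid
+ `model_selection.RandomizedSearchCV(…[, …])`: hyperparameter search by random sampling
+ `model_selection.ParameterSampler(…[, …])`: generator of hyperparameter settings sampled from the given distributions
+ `model_selection.fit_grid_point(X, y, …[, …])`: runs a fit for a single set of parameters (deprecated in scikit-learn 0.23 and removed in later releases)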

Both `GridSearchCV` and `RandomizedSearchCV` can run their computations in parallel through the `n_jobs` keyword. Setting `n_jobs=-1` uses all available CPU cores.
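A minimal sketch of a parallel search follows. The grid `small_grid` is a hypothetical one kept tiny on purpose; on a dataset as small as iris the parallel overhead will likely outweigh any speed-up, so this only shows where the keyword goes.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Hypothetical small grid, used only to demonstrate the n_jobs keyword.
small_grid = {"C": [1, 10], "kernel": ["linear", "rbf"]}

# n_jobs=-1 asks joblib to use every available CPU core.
search = GridSearchCV(svm.SVC(), small_grid, n_jobs=-1)
search.fit(iris.data, iris.target)

print(search.best_params_)
```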

## GridSearch

Put simply, grid search exhaustively enumerates every combination of hyperparameters within the given ranges.
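The enumeration itself is just a Cartesian product. The sketch below builds it by hand with `itertools.product` and compares it with `ParameterGrid`; the two-parameter `grid` is an assumption made for this example.

```python
from itertools import product

from sklearn.model_selection import ParameterGrid

# Hypothetical grid with 2 x 2 = 4 combinations.
grid = {"C": [1, 10], "gamma": [0.001, 0.0001]}

# Manual enumeration: Cartesian product of the value lists.
keys = sorted(grid)
manual = [dict(zip(keys, values))
          for values in product(*(grid[k] for k in keys))]

# The same enumeration as produced by scikit-learn.
generated = list(ParameterGrid(grid))

print(len(manual), len(generated))
```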

In [5]:
from sklearn import svm, datasets
# fit_grid_point is left out: it is not used below and is no longer available in recent scikit-learn releases
from sklearn.model_selection import GridSearchCV, ParameterGrid, ParameterSampler, RandomizedSearchCV

In [1]:
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]

In [4]:
list(ParameterGrid(param_grid))

[{'C': 1, 'kernel': 'linear'},
 {'C': 10, 'kernel': 'linear'},
 {'C': 100, 'kernel': 'linear'},
 {'C': 1000, 'kernel': 'linear'},
 {'C': 1, 'gamma': 0.001, 'kernel': 'rbf'},
 {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'},
 {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'},
 {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'},
 {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'},
 {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'},
 {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'},
 {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'}]
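The twelve entries above follow from simple counting: when `param_grid` is a list of dictionaries, each dictionary is expanded separately and the results are concatenated. A short check, repeating the same grid so the snippet is self-contained:

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"C": [1, 10, 100, 1000], "kernel": ["linear"]},
    {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.0001], "kernel": ["rbf"]},
]

# 4 values of C x 1 kernel for the linear sub-grid.
linear_count = 4 * 1
# 4 values of C x 2 values of gamma x 1 kernel for the rbf sub-grid.
rbf_count = 4 * 2 * 1

total = len(ParameterGrid(param_grid))
print(linear_count, rbf_count, total)
```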

In [6]:
iris = datasets.load_iris()

In [7]:
svc = svm.SVC()
clf = GridSearchCV(svc, param_grid)
clf.fit(iris.data, iris.target)

GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid=[{'C': [1, 10, 100, 1000], 'kernel': ['linear']}, {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
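Once fitted, the search object exposes the winning configuration directly. The sketch below repeats the setup so that it runs on its own and fixes `cv=3` as an assumption (the default number of folds differs between scikit-learn versions), then reads `best_params_` and `best_score_` and predicts with the refitted best estimator.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
param_grid = [
    {"C": [1, 10, 100, 1000], "kernel": ["linear"]},
    {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.0001], "kernel": ["rbf"]},
]

# cv=3 is set explicitly here; it is an assumption of this example.
clf = GridSearchCV(svm.SVC(), param_grid, cv=3)
clf.fit(iris.data, iris.target)

# Best parameter combination and its mean cross-validation score.
print(clf.best_params_)
print(clf.best_score_)

# With refit=True (the default), predict() uses the best estimator
# retrained on the full dataset.
predictions = clf.predict(iris.data[:5])
print(predictions)
```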

In [9]:
for i,v in clf.cv_results_.items():
    print(i)
    print(v)

split0_test_score
[ 1.          1.          1.          1.          0.90196078  0.90196078
  0.94117647  0.90196078  0.98039216  0.94117647  1.          0.98039216]
split1_test_score
[ 0.96078431  0.92156863  0.92156863  0.92156863  0.90196078  0.90196078
  0.92156863  0.90196078  0.96078431  0.92156863  0.96078431  0.96078431]
split2_test_score
[ 0.97916667  1.          0.97916667  1.          0.9375      0.9375
  0.97916667  0.9375      0.97916667  0.97916667  1.          0.97916667]
mean_test_score
[ 0.98        0.97333333  0.96666667  0.97333333  0.91333333  0.91333333
  0.94666667  0.91333333  0.97333333  0.94666667  0.98666667  0.97333333]
std_test_score
[ 0.01617914  0.03715363  0.03345566  0.03715363  0.0165782   0.0165782
  0.02371536  0.0165782   0.00902067  0.02371536  0.01857681  0.00902067]
rank_test_score
[ 2  3  7  3 10 10  8 10  3  8  1  3]
split0_train_score
[ 0.97979798  0.95959596  0.96969697  0.97979798  0.91919192  0.91919192
  0.93939394  0.91919192  0.96969697  ...
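Printing every array in `cv_results_` is hard to read. One possible way to summarise it, sketched here with a smaller hypothetical grid, is to sort the candidates by `rank_test_score` and print only the top few. Note that recent scikit-learn versions only record training scores when `return_train_score=True` is passed.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()

# Hypothetical grid, smaller than the one used in the text.
grid = {"C": [1, 10, 100], "kernel": ["linear", "rbf"]}

clf = GridSearchCV(svm.SVC(), grid, cv=3, return_train_score=True)
clf.fit(iris.data, iris.target)

results = clf.cv_results_

# Candidate indices ordered from best rank (1) to worst.
order = np.argsort(results["rank_test_score"])

for idx in order[:3]:
    print(results["params"][idx],
          results["mean_test_score"][idx],
          results["mean_train_score"][idx])
```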

## RandomizedSearch

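As the name suggests, randomized search samples parameter settings at random from the given search space. It still needs a search space, but besides lists of values it also accepts distributions defined by scipy.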
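What makes a scipy distribution usable here is its `rvs` method, which draws random samples. A small sketch, assuming an exponential distribution with `scale=100` as a prior for `C`:

```python
import scipy.stats

# A "frozen" exponential distribution with mean 100.
c_distribution = scipy.stats.expon(scale=100)

# rvs() draws samples; random_state makes the draw repeatable.
samples = c_distribution.rvs(size=5, random_state=0)

print(samples)
print(c_distribution.mean())
```

Lists of values, by contrast, are sampled uniformly.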

In [13]:
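import scipy.stats  # import the submodule explicitly; a bare "import scipy" is not guaranteed to load scipy.stats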
import numpy as np

In [14]:
np.random.seed(0)
param_grid = {'C': scipy.stats.expon(scale=100), 'gamma': scipy.stats.expon(scale=.1),
  'kernel': ['rbf'], 'class_weight':['balanced', None]}

In [15]:
list(ParameterSampler(param_grid, n_iter=4))

[{'C': 79.587450816311005,
  'class_weight': None,
  'gamma': 0.18596042409118513,
  'kernel': 'rbf'},
 {'C': 195.1545320209259,
  'class_weight': None,
  'gamma': 0.055104849109549929,
  'kernel': 'rbf'},
 {'C': 103.81592949436094,
  'class_weight': 'balanced',
  'gamma': 0.035315914097266164,
  'kernel': 'rbf'},
 {'C': 5.8384670780703338,
  'class_weight': 'balanced',
  'gamma': 0.048360210090225335,
  'kernel': 'rbf'}]
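Seeding the global NumPy generator works, but `ParameterSampler` also takes a `random_state` argument, which keeps the reproducibility local to the sampler. A minimal sketch with the same distributions (the name `param_distributions` is used here only for clarity):

```python
import scipy.stats
from sklearn.model_selection import ParameterSampler

param_distributions = {
    "C": scipy.stats.expon(scale=100),
    "gamma": scipy.stats.expon(scale=0.1),
    "kernel": ["rbf"],
    "class_weight": ["balanced", None],
}

# The same integer seed yields the same sequence of candidates.
first = list(ParameterSampler(param_distributions, n_iter=4, random_state=0))
second = list(ParameterSampler(param_distributions, n_iter=4, random_state=0))

print(first == second)
```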

In [16]:
clf = RandomizedSearchCV(svc, param_grid)
clf.fit(iris.data, iris.target)

RandomizedSearchCV(cv=None, error_score='raise',
          estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
          fit_params={}, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'C': <scipy.stats._distn_infrastructure.rv_frozen object at 0x0000000017362390>, 'gamma': <scipy.stats._distn_infrastructure.rv_frozen object at 0x00000000173624E0>, 'kernel': ['rbf'], 'class_weight': ['balanced', None]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=0)
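The repr above shows the two knobs specific to randomized search: `n_iter`, the number of sampled candidates (10 by default), and `random_state`. The following self-contained sketch sets both explicitly; `cv=3` and the seed are assumptions of this example.

```python
import scipy.stats
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

param_distributions = {
    "C": scipy.stats.expon(scale=100),
    "gamma": scipy.stats.expon(scale=0.1),
    "kernel": ["rbf"],
    "class_weight": ["balanced", None],
}

# n_iter controls the search budget; random_state makes it repeatable.
clf = RandomizedSearchCV(svm.SVC(), param_distributions,
                         n_iter=10, cv=3, random_state=0)
clf.fit(iris.data, iris.target)

print(clf.best_params_)
print(clf.best_score_)
```

Unlike grid search, the cost here is fixed by `n_iter` regardless of how many parameters are searched.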

In [17]:
for i,v in clf.cv_results_.items():
    print(i)
    print(v)

split0_test_score
[ 1.          0.98039216  1.          1.          0.98039216  0.98039216
  0.96078431  1.          1.          0.98039216]
split1_test_score
[ 0.90196078  0.94117647  0.90196078  0.90196078  0.94117647  0.94117647
  0.96078431  0.90196078  0.90196078  0.94117647]
split2_test_score
[ 1.          0.97916667  1.          1.          1.          1.
  0.97916667  0.97916667  1.          1.        ]
mean_test_score
[ 0.96666667  0.96666667  0.96666667  0.96666667  0.97333333  0.97333333
  0.96666667  0.96        0.96666667  0.97333333]
std_test_score
[ 0.04644204  0.01830211  0.04644204  0.04644204  0.02441472  0.02441472
  0.00857493  0.04250721  0.04644204  0.02441472]
rank_test_score
[ 4  4  4  4  1  1  4 10  4  1]
split0_train_score
[ 0.97979798  0.96969697  0.96969697  0.95959596  0.95959596  0.95959596
  0.94949495  0.96969697  0.96969697  0.95959596]
split1_train_score
[ 1.          1.          1.          1.          1.          1.
  0.98989899  1.          1.      