<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#模型评估（一）：参数选择" data-toc-modified-id="模型评估（一）：参数选择-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Model Evaluation (1): Parameter Selection</a></span><ul class="toc-item"><li><span><a href="#Cross-validation:-evaluating-estimator-performance" data-toc-modified-id="Cross-validation:-evaluating-estimator-performance-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Cross-validation: evaluating estimator performance</a></span><ul class="toc-item"><li><span><a href="#Computing-cross-validated-metrics" data-toc-modified-id="Computing-cross-validated-metrics-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Computing cross-validated metrics</a></span></li><li><span><a href="#Cross-validation-iterators" data-toc-modified-id="Cross-validation-iterators-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Cross validation iterators</a></span></li></ul></li><li><span><a href="#Grid-Search:-Searching-for-estimator-parameters" data-toc-modified-id="Grid-Search:-Searching-for-estimator-parameters-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Grid Search: Searching for estimator parameters</a></span></li></ul></li><li><span><a href="#模型评估（二）：评估指标" data-toc-modified-id="模型评估（二）：评估指标-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Model Evaluation (2): Evaluation Metrics</a></span></li></ul></div>

# sklearn Module Basics [3]

Quick-start reference documentation:  
  
>  https://sklearn.apachecn.org/docs/0.21.3/      
>  https://sklearn.apachecn.org/docs/0.21.3/50.html

## Model Evaluation (1): Parameter Selection

### Cross-validation: evaluating estimator performance

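Learning the parameters of a prediction model and testing it on the same data is a methodological mistake:  
a model that simply repeats the labels of the samples it has just seen would get a perfect score, yet fail to predict anything useful on data it has not seen. This situation is called overfitting. To avoid it, the usual practice in a (supervised) machine learning experiment is to hold out part of the available data as a test set, `X_test` and `y_test`.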
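The gap between training-set and held-out performance can be sketched as follows. This is a minimal illustration, not part of the original notebook: the iris dataset, the 1-nearest-neighbour model (chosen because it literally memorizes its training samples) and the `_demo` variable names are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data (assumption): the small built-in iris dataset.
X_demo, y_demo = load_iris(return_X_y=True)

# Hold out 40% of the samples; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0, stratify=y_demo
)

# A 1-nearest-neighbour classifier memorizes the training samples.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = memorizer.score(X_train, y_train)  # scored on data it has seen
test_acc = memorizer.score(X_test, y_test)     # scored on held-out data
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```

Only the second number says anything about how the model behaves on new data.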

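When evaluating different settings ("hyperparameters") of a model, such as the parameter C that must be set manually for an SVM, choosing the best hyperparameters on the test set still carries a risk of overfitting to that test set, because we keep adjusting the hyperparameter values until the model performs best on it. Knowledge about the test set then "leaks" into the model, and the evaluation metrics no longer reflect generalization performance. To solve this, another part of the dataset can be held out as a so-called "validation set": train on the training set, evaluate on the validation set to select hyperparameters, and once a model that looks "best" has been found, run the final evaluation on the test set.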
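A rough sketch of the three-way split described above, assuming the iris dataset, a linear SVM and a hypothetical list of candidate values for C (none of these come from the original notebook):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)

# First carve off 20% as the final test set, then split the remainder
# into 75% training / 25% validation (60% / 20% / 20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0, stratify=y_rest
)

# Hypothetical candidate values for the SVM hyperparameter C.
candidate_C = [0.01, 0.1, 1, 10]
val_scores = {}
for C in candidate_C:
    model = SVC(kernel="linear", C=C).fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)  # selection uses validation data only

best_C = max(val_scores, key=val_scores.get)

# The test set is touched exactly once, after C has been chosen.
final_model = SVC(kernel="linear", C=best_C).fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print("best C:", best_C, "test accuracy:", test_acc)
```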

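However, splitting the available data into three sets drastically reduces the number of samples available for learning the model, and the results can depend on a particular random choice of the (training, validation) pair.
The solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for the final evaluation, but a separate validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but they generally follow the same principles). For each of the k "folds", the following steps are performed:

* A model is trained using k-1 of the folds as training data.
* The resulting model is validated on the remaining fold (i.e., that fold is used as a validation set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is the average of the values computed in the k iterations (for the method and its rationale, see the scikit-learn documentation: https://scikit-learn.org/stable/modules/cross_validation.html ). This approach can be computationally expensive, but it does not waste much data (unlike fixing an arbitrary validation set), which is a major advantage when the number of samples is small.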
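The k-fold steps above can be written out by hand and compared with `cross_val_score`. This is a sketch under assumptions: the iris dataset, a logistic regression model and all variable names are illustrative choices, not taken from the notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_demo, y_demo = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X_demo):
    fold_model = clone(model)                             # fresh, unfitted copy per fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])  # train on k-1 folds
    fold_scores.append(
        fold_model.score(X_demo[val_idx], y_demo[val_idx])  # validate on the remaining fold
    )

manual_mean = np.mean(fold_scores)  # the reported CV score is the average

# The library helper runs the same loop over the same splits.
library_scores = cross_val_score(model, X_demo, y_demo, cv=kf)
print("manual folds: ", np.round(fold_scores, 3), "mean:", round(manual_mean, 3))
print("library folds:", np.round(library_scores, 3))
```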

#### Computing cross-validated metrics

```python
sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score=nan)
```

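**Parameters:**
* estimator: the model to evaluate
* X: the input data
* y: the labels
* scoring: the metric used for evaluation
* cv: the cross-validation splitting strategy (default 5 folds; 5 to 10 is typical). A KFold or StratifiedKFold iterator can also be passed. When an integer is passed, StratifiedKFold is used if the estimator is a classifier and y is binary or multiclass; otherwise KFold is used
* n_jobs: the number of jobs to run in parallel. The default None means 1; setting -1 uses all processors, which is a good choice when the machine is otherwise idle
* verbose: the verbosity level (e.g. 0, 1, 2 or 3; the larger the number, the more detailed the messages). This controls the logging of the cross-validation run itself, so the amount of output does not depend on whether the model has an iterative training process
* error_score: the value assigned to the score if an error occurs while fitting the estimator (default nan); if set to 'raise', the error is raised instead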
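As a small illustration of the `scoring` and `cv` parameters, the sketch below scores the same model with two metrics. The breast-cancer dataset and the shallow decision tree are assumptions chosen so the example runs quickly; they are not the notebook's data or model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# scoring=None falls back to the estimator's own score method (accuracy here).
acc = cross_val_score(tree, X_demo, y_demo, cv=5)

# A scorer name switches the metric without changing anything else.
auc = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="roc_auc")

print("accuracy: %.3f +/- %.3f" % (acc.mean(), acc.std()))
print("roc_auc:  %.3f +/- %.3f" % (auc.mean(), auc.std()))
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score varies from fold to fold.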

In [1]:
import numpy as np 
import pandas as pd
from sklearn.model_selection import cross_val_score

In [2]:
df = pd.read_csv("adultTest.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
class             object
dtype: object

In [4]:
dfNew = pd.get_dummies(data=df, columns=['workclass','education','marital-status','occupation',
                                         'relationship','race','sex','native-country'])
dfNew.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,class,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,...,native-country_ Portugal,native-country_ Puerto-Rico,native-country_ Scotland,native-country_ South,native-country_ Taiwan,native-country_ Thailand,native-country_ Trinadad&Tobago,native-country_ United-States,native-country_ Vietnam,native-country_ Yugoslavia
0,39,77516,13,2174,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,50,83311,13,0,0,13,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,38,215646,9,0,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,53,234721,7,0,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28,338409,13,0,0,40,<=50K,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
# Convert the label column to numeric values
dfNew['class'] = df['class'].map(lambda s :s.strip())
dfNew.loc[dfNew['class']=='<=50K','target'] = 0
dfNew.loc[dfNew['class']!='<=50K','target'] = 1

In [6]:
dfNew['target'].value_counts()

0.0    24720
1.0     7841
Name: target, dtype: int64

In [7]:
dfNew['class'].value_counts()

<=50K    24720
>50K      7841
Name: class, dtype: int64

In [8]:
# Drop the original class column
dfNew.drop("class", axis=1, inplace=True)
# Build the training data: feature matrix X and label vector y
X = dfNew.drop("target", axis=1)
y = dfNew['target']

In [9]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(learning_rate=0.01, n_estimators=50, max_depth=2)
cross_val_score(gbc, X, y, cv=5, n_jobs=-1)

array([0.76708122, 0.76827396, 0.76765971, 0.76796683, 0.76858108])

#### Cross validation iterators

**cv: int, cross-validation generator or an iterable, optional**
  
The cv parameter of sklearn's cross-validation APIs determines the cross-validation splitting strategy. Possible inputs for cv are:

* None, to use the default 5-fold cross validation,
* integer, to specify the number of folds in a (Stratified)KFold,
* CV splitter,
* An iterable yielding (train, test) splits as arrays of indices.
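The sketch below exercises three of these input types. The iris dataset, the decision tree and the hand-built splits are illustrative assumptions. For a classifier, an integer `cv` should give the same folds as an unshuffled `StratifiedKFold` with the same number of splits.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# 1) Integer: number of folds in a (Stratified)KFold.
scores_int = cross_val_score(tree, X_demo, y_demo, cv=5)

# 2) CV splitter object.
scores_splitter = cross_val_score(
    tree, X_demo, y_demo, cv=StratifiedKFold(n_splits=5)
)

# 3) Iterable of (train, test) index arrays; here every third sample
#    forms one test fold (a hypothetical scheme, for illustration only).
indices = np.arange(len(y_demo))
custom_splits = [
    (indices[indices % 3 != k], indices[indices % 3 == k]) for k in range(3)
]
scores_custom = cross_val_score(tree, X_demo, y_demo, cv=custom_splits)

print(scores_int)
print(scores_splitter)
print(scores_custom)
```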

**The cv parameter also accepts the cv iterators that ship with sklearn, for example:**
1. K-fold  
2. Stratified k-fold  
3. Group k-fold (called Label k-fold in older releases)  
4. Leave-One-Out - LOO  
5. Leave-P-Out - LPO  
...

![image.png](attachment:image.png)

In [44]:
from sklearn.model_selection import KFold

# Splitting example
data1 = list(range(8))
kf = KFold(n_splits=4)
for train, test in kf.split(data1):
    print("%s %s" % (train, test))

[2 3 4 5 6 7] [0 1]
[0 1 4 5 6 7] [2 3]
[0 1 2 3 6 7] [4 5]
[0 1 2 3 4 5] [6 7]


In [11]:
# Modeling example
kf = KFold(n_splits=5)
gbc = GradientBoostingClassifier(learning_rate=0.1, n_estimators=50, max_depth=4)
cross_val_score(gbc, X, y, cv=kf, n_jobs=-1)

array([0.86074006, 0.86148649, 0.8659398 , 0.86624693, 0.86409705])

![image.png](attachment:image.png)

In [46]:
from sklearn.model_selection import StratifiedKFold

# Splitting example
data2 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

skf = StratifiedKFold(n_splits=4)
for train, test in skf.split(data2, labels):
    print("%s %s" % (train, test))

[ 1  2  3  6  7  8  9 10 11] [0 4 5]
[ 0  2  3  4  5  8  9 10 11] [1 6 7]
[ 0  1  3  4  5  6  7 10 11] [2 8 9]
[0 1 2 4 5 6 7 8 9] [ 3 10 11]


In [19]:
# Modeling example
skf = StratifiedKFold(n_splits=5)
gbc = GradientBoostingClassifier(learning_rate=0.1, n_estimators=50, max_depth=4)
cross_val_score(gbc, X, y, cv=skf, n_jobs=-1)

array([0.85997236, 0.85964373, 0.86517199, 0.86578624, 0.86701474])

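**ShuffleSplit randomly samples from the data in its original order and assembles splits of the specified test_size and train_size for cross-validation.  
The diagram is shown below.**  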

![image.png](attachment:image.png)  
  
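ShuffleSplit randomly samples the whole dataset in every iteration to generate a training set and a test set; the test_size and train_size parameters control how large each of them should be. Within a single iteration the training and test sets never overlap, but because every iteration samples independently from the whole dataset, **ShuffleSplit may select in one iteration a sample that it already selected in an earlier iteration** (note that **with KFold, even when shuffle is set to True, the test folds never share a sample**; this is **the biggest difference between the two**)  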
  
ShuffleSplit ignores the classes or groups shown in the figure (that is, the class proportions) when it splits the data.
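The difference between the two splitters can be checked directly on the indices. A minimal sketch, assuming 12 dummy samples and arbitrary `random_state` values:

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

data = np.arange(12)

# KFold: the test folds partition the data, even with shuffle=True.
kf = KFold(n_splits=4, shuffle=True, random_state=0)
kf_all = np.concatenate([test for _, test in kf.split(data)])
print("KFold test indices cover each sample exactly once:",
      sorted(kf_all.tolist()) == list(range(12)))

# ShuffleSplit: each iteration is an independent random draw.
ss = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
ss_splits = list(ss.split(data))
ss_all = np.concatenate([test for _, test in ss_splits])

# Within one iteration, train and test are still disjoint.
disjoint = all(len(np.intersect1d(tr, te)) == 0 for tr, te in ss_splits)
print("train/test disjoint within each split:", disjoint)

# Across iterations, the same index may appear in several test sets,
# so the number of distinct indices can be smaller than the number of slots.
print(len(np.unique(ss_all)), "distinct indices in", len(ss_all), "test slots")
```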

In [56]:
from sklearn.model_selection import ShuffleSplit

# Splitting example
data1 = list(range(12))
ss = ShuffleSplit(n_splits=4, test_size=0.25, random_state=500)
for train, test in ss.split(data1):
    print("%s %s" % (train, test))

[ 4  8  0  5  6  9  1  7 10] [11  2  3]
[ 9  7  0  6  5  3 11 10  2] [4 1 8]
[ 8  5 11  1  3  7  9 10  2] [0 4 6]
[ 4 11  3  8  6  1  0  5  9] [10  2  7]


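In practice, StratifiedKFold (stratified sampling by class label) is the usual choice for cross-validation splits, because it keeps the class proportions in each training and test set close to those of the original dataset.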
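The effect of stratification shows up when the labels are sorted by class. The sketch below reuses the 4-versus-8 label layout of the splitting examples; the helper dictionary `ratios_by_splitter` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 4 samples of class 0 followed by 8 samples of class 1.
labels = np.array([0] * 4 + [1] * 8)
data = np.arange(len(labels))

splitters = {
    "KFold": KFold(n_splits=4),
    "StratifiedKFold": StratifiedKFold(n_splits=4),
}

ratios_by_splitter = {}
for name, splitter in splitters.items():
    # Fraction of class-1 samples in each test fold.
    ratios_by_splitter[name] = [
        float(labels[test].mean()) for _, test in splitter.split(data, labels)
    ]
    print(name, np.round(ratios_by_splitter[name], 2))
```

With plain KFold on sorted labels, some test folds contain a single class, whereas every StratifiedKFold test fold keeps the overall 1:2 class ratio.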

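The difference between StratifiedKFold and StratifiedShuffleSplit mirrors the difference between KFold and ShuffleSplit above: it lies in how the data is partitioned or sampled, with stratification added as a constraint. StratifiedKFold performs the K-fold split within each class. StratifiedShuffleSplit, in each iteration, randomly draws the specified proportion of samples from each class as the validation set and uses the rest as the training set; the draws are independent across iterations, so the same sample can be picked in more than one iteration. Every split therefore respects the class proportions, as the diagram below shows.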

![image.png](attachment:image.png)

In [25]:
from sklearn.model_selection import StratifiedShuffleSplit

# Splitting example
data2 = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

sss = StratifiedShuffleSplit(n_splits=4, test_size=0.25)
for train, test in sss.split(data2, labels):
    print("%s %s" % (train, test))

[ 0  2  1  7  9  4  6 11 10] [8 5 3]
[10 11  9  0  1  4  5  2  6] [7 8 3]
[11  5  9  0  3 10  4  8  1] [7 6 2]
[ 7  9  2 10 11  8  1  3  4] [0 5 6]


### Grid Search: Searching for estimator parameters

A search consists of:
1. an estimator (regressor or classifier such as sklearn.svm.SVC());
2. a parameter space;
3. a method for searching or sampling candidates;
4. a cross-validation scheme; and
5. a score function.

```python
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, n_jobs=None, iid='deprecated', refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False)
```

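Parameters:
* estimator: the model to tune
* param_grid: a dict of parameters (each key is the name of a parameter to tune; each value is the list of settings to try)
* scoring: the metric used for evaluation (classifiers default to accuracy; it can be changed to 'f1', 'roc_auc', etc.)
* cv: the cross-validation splitting strategy (default 5 folds; 5 to 10 is typical). A KFold or StratifiedKFold iterator can also be passed. When an integer is passed, StratifiedKFold is used if the estimator is a classifier and y is binary or multiclass; otherwise KFold is used
* n_jobs: the number of jobs to run in parallel. The default None means 1 (setting -1 is recommended so that the search runs on all processors)
* verbose: the verbosity level (e.g. 0, 1, 2 or 3; the larger the number, the more detailed the messages). This controls the logging of the search itself, so the amount of output does not depend on whether the model has an iterative training process
* iid: whether the samples are assumed to be independent and identically distributed, in which case fold scores were averaged with weights proportional to test-set size. The parameter is deprecated as of version 0.22 (hence the 'deprecated' default in the signature above) and was removed in 0.24
* refit: whether to refit the best estimator on the whole dataset passed to fit. The default is True, which allows the GridSearchCV instance to be used directly for predict
* error_score: the value assigned to the score if an error occurs while fitting the estimator (default nan); if set to 'raise', the error is raised instead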
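A small end-to-end sketch of these parameters, and of what `refit=True` provides afterwards. The iris dataset, the SVC estimator and the grid `param_grid_demo` are assumptions picked so the search finishes in a moment; they are not the notebook's setup.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X_demo, y_demo = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Hypothetical grid: 3 values of C x 2 kernels = 6 candidates.
param_grid_demo = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid_demo, cv=5, refit=True)
search.fit(X_train, y_train)  # the search only ever sees the training part

n_candidates = len(search.cv_results_["params"])
print("candidates tried:", n_candidates)
print("best params:", search.best_params_)
print("best mean CV score:", search.best_score_)

# Because refit=True, the search object wraps the best estimator
# refitted on all of X_train and can predict or score directly.
test_acc = search.score(X_test, y_test)
print("held-out accuracy:", test_acc)
```

Keeping a test split outside the search follows the earlier advice that a test set should still be reserved for the final evaluation.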

In [27]:
from sklearn.model_selection import GridSearchCV

clf = GradientBoostingClassifier()
param_grid ={"learning_rate":[0.001, 0.05],
             'n_estimators':[50, 100],
             'max_depth':[2, 4]}
gscv = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=1, cv=5, scoring='f1')

In [28]:
gscv.fit(X,y)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.5min finished


GridSearchCV(cv=5, error_score=0,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
           

In [31]:
gscv.best_score_

0.6758647243566147

In [32]:
gscv.best_params_

{'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 100}

## Model Evaluation (2): Evaluation Metrics

For details, see the scoring metrics listed in the sklearn documentation:  
https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter  
  
For example, the scoring metrics available for classification tasks are as follows

![image.png](attachment:image.png)
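Several of these scorer names can be evaluated in one pass with `cross_validate`, which accepts a list for `scoring`. A minimal sketch, assuming the built-in breast-cancer dataset and a shallow decision tree as stand-ins for the notebook's data and model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

metrics = ["accuracy", "f1", "roc_auc"]
results = cross_validate(tree, X_demo, y_demo, cv=5, scoring=metrics)

# Each metric is returned under the key "test_<name>", one value per fold.
for metric in metrics:
    fold_values = results["test_" + metric]
    print("%-8s mean=%.3f" % (metric, fold_values.mean()))
```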

In [39]:
clf = GradientBoostingClassifier()
param_grid ={"learning_rate":[0.001, 0.05],
             'n_estimators':[50, 100],
             'max_depth':[2, 4]}

gscv = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=4, cv=5, error_score=0, scoring='roc_auc')

In [40]:
gscv.fit(X,y)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   20.4s
[Parallel(n_jobs=-1)]: Done  36 out of  40 | elapsed:  1.3min remaining:    8.8s
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:  1.4min finished


GridSearchCV(cv=5, error_score=0,
             estimator=GradientBoostingClassifier(ccp_alpha=0.0,
                                                  criterion='friedman_mse',
                                                  init=None, learning_rate=0.1,
                                                  loss='deviance', max_depth=3,
                                                  max_features=None,
                                                  max_leaf_nodes=None,
                                                  min_impurity_decrease=0.0,
                                                  min_impurity_split=None,
                                                  min_samples_leaf=1,
                                                  min_samples_split=2,
                                                  min_weight_fraction_leaf=0.0,
                                                  n_estimators=100,
                                                  n_iter_no_change=None,
           

In [41]:
gscv.best_score_

0.9198170295541225

In [42]:
gscv.best_params_

{'learning_rate': 0.05, 'max_depth': 4, 'n_estimators': 100}