<a href="https://colab.research.google.com/github/Klauszhao/GameCode/blob/master/GridSearchCVTest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Grid Search
Tune five models with grid search, using 5-fold cross-validation during tuning, then evaluate each model and show the output of every run.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.model_selection import cross_val_score,cross_val_predict
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import minmax_scale

import warnings
warnings.filterwarnings('ignore')

In [3]:
# Mount Google Drive in Colab, where the dataset is stored. If your dataset is not on Drive, comment this block out and read the file from a local path instead.
from google.colab import drive
drive.mount('/content/gdrive')
!ls 'gdrive/My Drive/Data'

Mounted at /content/gdrive
aspect-extract	dataAnalyse  dataProcessByTwoTask.csv  zhaopin


In [4]:
# Read the data
filePath = 'gdrive/My Drive/Data/dataProcessByTwoTask.csv'
data = pd.read_csv(filePath,encoding='gbk')

train_data, test_data = train_test_split(data, test_size=0.3, random_state=2018)
print("train_data",train_data.shape)
print("test_data",test_data.shape)

y_train = train_data['status']
x_train = train_data.drop(['status'],axis =1)

y_test = test_data['status']
x_test = test_data.drop(['status'],axis =1)

train_data (3096, 70)
test_data (1327, 70)


In [0]:
model_score_train = []   
decision_score_train = [] 
model_score_test = []   
decision_score_test = []

def proc_score(y_pred,y_pred_scores,y_test,train=True):  
    accuracy = accuracy_score(y_test,y_pred)
    precision = precision_score(y_test,y_pred)
    recall = recall_score(y_test,y_pred)
    f1 = f1_score(y_test,y_pred)
    roc_auc = roc_auc_score(y_test,y_pred_scores)

    if train:
        decision_score_train.append(y_pred_scores)
        model_score_train.append([accuracy,precision,recall,f1,roc_auc])
        text = 'Train'
    else:
        decision_score_test.append(y_pred_scores)
        model_score_test.append([accuracy,precision,recall,f1,roc_auc])
        text = 'Test'
    #print('{} confusion matrix:\n{}'.format(text,confusion_matrix(y_test,y_pred)))
    print('{}: accuracy : {:.3f} , precision : {:.3f} , recall : {:.3f} , f1 : {:.3f} , roc_auc : {:.3f}'.format(text,accuracy,precision,recall,f1,roc_auc))

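### GridSearchCV attributes
- cv_results_: the cross-validation results, stored as a dict of NumPy arrays that can be converted to a DataFrame
- best_estimator_: the estimator refit with the best parameters found by the search; not available when refit=False
- best_score_: a float, the mean cross-validated score of the best parameter combination
- best_params_: the parameter combination that achieved the best score in the grid search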
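
To make these attributes concrete, here is a minimal sketch on synthetic data. The dataset from `make_classification` and the small grid over C are assumptions for illustration only; they are not the notebook's data or grid.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; the notebook reads its own CSV)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read
results = pd.DataFrame(search.cv_results_)
print(results[["param_C", "mean_test_score", "rank_test_score"]])

print(search.best_params_)     # the winning parameter combination
print(search.best_score_)      # mean cross-validated ROC AUC of that combination
print(search.best_estimator_)  # estimator refit on all of X, y (refit=True is the default)
```

With the default refit=True, best_estimator_ is already fitted and can be used for prediction directly.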

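### Logistic Regression
 Run logistic regression with 5-fold cross-validation and use grid search to find the best parameters. In the code below, C is the inverse of the regularization strength λ. It must be positive and defaults to 1, and a smaller value means stronger regularization. This is usually the main parameter to tune, so the grid search mostly needs to find which value of C works well.
 
penalty: l1 or l2, default l2. With l1 regularization, solver can only be liblinear or saga; with l2 regularization, every solver can be used.

param_grid lists several candidate values of C, and the grid search selects the best one.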
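
The effect of C under the l1 penalty can be seen by counting non-zero coefficients. The sketch below is an illustration on synthetic data; the data and the two C values are assumptions, not taken from the notebook. A smaller C should leave no more non-zero coefficients than a larger one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

def count_nonzero_coefs(C):
    # liblinear is one of the solvers that supports the l1 penalty
    model = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    model.fit(X, y)
    return int(np.count_nonzero(model.coef_))

strong = count_nonzero_coefs(0.01)  # small C: strong regularization
weak = count_nonzero_coefs(1.0)     # default C: weaker regularization
print("non-zero coefficients at C=0.01:", strong)
print("non-zero coefficients at C=1.0:", weak)
```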

In [0]:
def LRGridSearch(x_train,y_train,x_test,y_test):
  # Both penalties, and 20 evenly spaced values of C between 0.05 and 1
  parameters = {'penalty':['l1','l2'],'C':np.linspace(0.05,1,20).tolist()}
  
  # liblinear supports both l1 and l2; the lbfgs default in newer scikit-learn versions does not support l1
  grid_lr = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid=parameters, cv=5, scoring='roc_auc')
  grid_lr.fit(x_train,y_train)
  # grid_lr.score is called on x_train here, so the value printed as "Test set score" is the ROC AUC on the training data
  print("Test set score:{:.2f}".format(grid_lr.score(x_train,y_train)))
  # best_score_ is the mean 5-fold cross-validated ROC AUC on the training data
  print("Best parameters:{},Best score on train set:{:.2f}".format(grid_lr.best_params_,grid_lr.best_score_))
  
  # Evaluate the best estimator on the test set
  LR=grid_lr.best_estimator_

  y_pred = LR.predict(x_test)
  # cross_val_predict refits clones of LR on folds of the test data, so these are
  # out-of-fold scores rather than scores from the model that was fitted on x_train
  y_pred_scores = cross_val_predict(LR,x_test,y_test,cv=5,method='decision_function')
  proc_score(y_pred,y_pred_scores,y_test,train=False)
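
The function above gets its test-set scores from cross_val_predict, which refits the model on folds of the test data. An alternative is to score the held-out test set directly with the estimator that the search refit on the training data. The sketch below shows that alternative on synthetic data; the data, the variable names and the small grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2018)

search = GridSearchCV(
    LogisticRegression(solver="liblinear"),
    param_grid={"C": [0.05, 0.5, 1.0]},  # illustrative grid
    cv=5,
    scoring="roc_auc",
)
search.fit(X_tr, y_tr)

# Score the test rows with the estimator that was refit on the training data;
# the test data is never used for fitting
test_scores = search.best_estimator_.decision_function(X_te)
test_auc = roc_auc_score(y_te, test_scores)
print("held-out ROC AUC: {:.3f}".format(test_auc))

# GridSearchCV.score applies the same roc_auc scorer to the refit estimator
print("search.score:     {:.3f}".format(search.score(X_te, y_te)))
```

For tree-based models, which have no decision_function, the positive-class column of predict_proba can play the same role.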

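### Decision Tree

A classification tree has 8 important parameters: criterion, 2 randomness-related parameters (random_state, splitter), and 5 pruning parameters (max_depth, min_samples_split, min_samples_leaf, max_features, min_impurity_decrease).

- criterion: how impurity is measured, either information entropy (entropy) or the Gini index (gini). Default gini.

- random_state: the seed for the randomness used when choosing splits. Default None.

- splitter: how each split is chosen, best or random. The default best picks the best split among the candidate features; random picks the best of a set of randomly drawn splits.

- max_depth: the maximum depth of the tree. A good starting point is 3.

- min_samples_split: a node must contain at least min_samples_split training samples to be split further. Default 2.

- min_samples_leaf: every child node produced by a split must contain at least min_samples_leaf training samples. A good starting point is 5.

- max_features: limits the number of features considered at each split (like max_depth, it restricts the tree to curb overfitting).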
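
To see how the pruning parameters constrain a tree, the sketch below compares an unpruned tree with one limited by max_depth and min_samples_leaf. The synthetic data and the chosen values (3 and 5, the suggested starting points) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# No pruning: the tree grows until its leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pruned: at most 3 levels deep, and every leaf keeps at least 5 samples
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)

print("unpruned: depth", full.get_depth(), "leaves", full.get_n_leaves())
print("pruned:   depth", pruned.get_depth(), "leaves", pruned.get_n_leaves())
```

A binary tree of depth 3 can have at most 2^3 = 8 leaves, which is why max_depth alone already bounds the model size.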


In [0]:
def DecisionTreeGridSearch(x_train,y_train,x_test,y_test):
  param_grid ={'splitter':('best','random'),
              'criterion':('gini','entropy'),
              'max_depth':[*range(1,10)],
              'min_samples_leaf':[*range(1,50,5)],
              'min_impurity_decrease':[*np.linspace(0,0.5,10)],
              }
  
  
#   {'max_depth':np.linspace(2,32,31,dtype=np.int32),
#              'min_samples_split':np.linspace(0.1, 1.0, 10, endpoint=True),
#              'min_samples_leaf':np.linspace(0.1, 0.5, 5, endpoint=True),
#              'class_weight':['balanced']}

  gs_clf = GridSearchCV(DecisionTreeClassifier(),param_grid=param_grid,cv=5,verbose=2,n_jobs=-1,scoring='roc_auc')
  gs_clf.fit(x_train,y_train)
  
  print('Test set score: {:.3f}'.format(gs_clf.score(x_test,y_test)))
  print('Best parameters: {}'.format(gs_clf.best_params_))
  print('Best cross-validation score: {:.3f}'.format(gs_clf.best_score_))
  print('Best estimator:\n{}'.format(gs_clf.best_estimator_))
  
  
  dt_clf = gs_clf.best_estimator_
  y_pred = dt_clf.predict(x_test)
  y_pred_scores = cross_val_predict(dt_clf,x_test,y_test,cv=5,method='predict_proba')
  proc_score(y_pred,y_pred_scores[:,1],y_test,train=False)
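
The cost of this search follows directly from the grid. Each parameter multiplies the number of candidates, and each candidate is fitted once per fold. The short sketch below reproduces the count for the param_grid used in DecisionTreeGridSearch; it agrees with the "Fitting 5 folds for each of 3600 candidates, totalling 18000 fits" line in the recorded output.

```python
import math
import numpy as np

# Same grid as in DecisionTreeGridSearch
param_grid = {
    'splitter': ('best', 'random'),
    'criterion': ('gini', 'entropy'),
    'max_depth': [*range(1, 10)],
    'min_samples_leaf': [*range(1, 50, 5)],
    'min_impurity_decrease': [*np.linspace(0, 0.5, 10)],
}

# Number of candidates = product of the number of values per parameter
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * 5  # cv=5 means one fit per fold per candidate

print(n_candidates, n_fits)  # 3600 18000
```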

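### Random Forest

Parameters that control the base estimators: criterion, max_depth, min_samples_split, min_samples_leaf, max_features, min_impurity_decrease.

- n_estimators: the number of trees in the forest, i.e. the number of base estimators. More trees generally improve the model, with diminishing returns and longer training time. The default was 10 before scikit-learn 0.22 and is 100 from 0.22 onward. A value of up to about 200 is usually a reasonable choice.

- random_state: controls how the whole forest is generated, whereas in a single classification tree random_state controls only that one tree.

- bootstrap: controls the sampling technique. Default True, meaning rows are sampled with replacement.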
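
Sampling with replacement means each tree sees only part of the distinct training rows. The NumPy sketch below simulates one bootstrap draw to illustrate this. It is an illustration of the sampling idea only, not scikit-learn's internal code, and the sample size is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # arbitrary size for the illustration

# Draw n_samples row indices with replacement, as a bootstrap sample does
indices = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(indices).size / n_samples

print("fraction of distinct rows in the bootstrap sample: {:.3f}".format(unique_fraction))
print("theoretical limit 1 - 1/e: {:.3f}".format(1 - np.exp(-1)))
```

The rows that a tree never draws are its out-of-bag rows, which is what makes the trees in a forest differ from one another.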


In [0]:
def RandomForestGridSearch(x_train,y_train,x_test,y_test):

#   parameters = {'max_depth': [3, 4, 5, 6, 7],
#                 'max_features': sp_randint(1, 11),
#                 'min_samples_split': sp_randint(2, 11),
#                 'bootstrap': [True, False],
#                 'criterion': ['gini', 'entropy']
#                 }
  # Simple parameter grid
  parameters = {'max_depth':range(3,14,2), 'min_samples_split':range(50,201,20)}
  
  gs_clf = GridSearchCV(RandomForestClassifier(n_estimators=100,criterion='gini'), parameters,cv=5, scoring='roc_auc')
  gs_clf.fit(x_train, y_train)
  
  print('Test set score: {:.3f}'.format(gs_clf.score(x_test,y_test)))
  print('Best parameters: {}'.format(gs_clf.best_params_))
  print('Best cross-validation score: {:.3f}'.format(gs_clf.best_score_))
  print('Best estimator:\n{}'.format(gs_clf.best_estimator_))
  
  rfc =gs_clf.best_estimator_
 
  y_pred = rfc.predict(x_test)
  y_pred_scores = cross_val_predict(rfc,x_test,y_test,cv=5,method='predict_proba')
  proc_score(y_pred,y_pred_scores[:,1],y_test,train=False)


### XGBoost
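- max_depth: the maximum depth of each tree. Too small underfits, too large overfits. Typical values are 3 to 10.

- learning_rate: the learning rate, which shrinks the contribution of each new tree and plays the role of the step size in gradient descent. If it is too small, training is slow and more trees are needed. Commonly set between about 0.01 and 0.3 (0.3 is the XGBoost default).

- n_estimators: the number of trees. More is not always better; typical values are 50 to 1000.

- colsample_bytree: the fraction of features used to train each tree. 1 uses all features, 0.5 uses half of them.

- subsample: the fraction of training samples used to train each tree. As above, 1 uses all samples, 0.5 uses half of them.

- reg_alpha: the weight of L1 regularization, used to prevent overfitting. Usually between 0 and 1.

- reg_lambda: the weight of L2 regularization, used to prevent overfitting. Usually between 0 and 1.

- min_child_weight: the minimum sum of instance weights required in each child node. Setting it above 1 has a pruning effect and helps prevent overfitting.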

In [0]:
def XGBGridSearch(x_train,y_train,x_test,y_test):
  # Fuller grid kept for reference; it is overwritten by the simpler grid below
  parameters = {'max_depth': [3, 4, 5, 6, 7, 8],
                'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
                'learning_rate': [0.01, 0.1, 0.2],
                'min_child_weight': [1, 2, 3],
                }
  # Simple parameter grid (the one that is actually searched)
  parameters = {'max_depth':range(3,10,2), 'min_child_weight':range(1,6,1)}
  
  gs_clf = GridSearchCV(XGBClassifier(random_state=2018), parameters, cv=5, scoring='roc_auc')
  gs_clf.fit(x_train, y_train)


  print('Test set score: {:.3f}'.format(gs_clf.score(x_test,y_test)))
  print('Best parameters: {}'.format(gs_clf.best_params_))
  print('Best cross-validation score: {:.3f}'.format(gs_clf.best_score_))
  print('Best estimator:\n{}'.format(gs_clf.best_estimator_))
  
  xgbst = gs_clf.best_estimator_
  y_pred = xgbst.predict(x_test)
  y_pred_scores = cross_val_predict(xgbst,x_test,y_test,cv=5,method='predict_proba')
  proc_score(y_pred,y_pred_scores[:,1],y_test,train=False)
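
XGBGridSearch defines two grids and only searches the second, simpler one. The sketch below uses scikit-learn's ParameterGrid to compare their sizes, which shows why the smaller grid is the practical choice here. The names full_grid and simple_grid are introduced for this illustration only.

```python
from sklearn.model_selection import ParameterGrid

# The fuller grid from XGBGridSearch (not searched in the notebook)
full_grid = {
    'max_depth': [3, 4, 5, 6, 7, 8],
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'learning_rate': [0.01, 0.1, 0.2],
    'min_child_weight': [1, 2, 3],
}
# The simple grid that is actually searched
simple_grid = {
    'max_depth': list(range(3, 10, 2)),
    'min_child_weight': list(range(1, 6, 1)),
}

cv = 5
for name, grid in [('full', full_grid), ('simple', simple_grid)]:
    n_candidates = len(ParameterGrid(grid))  # number of parameter combinations
    print(name, 'candidates:', n_candidates, 'fits:', n_candidates * cv)
```

The full grid has 6 × 6 × 6 × 3 × 3 = 1944 combinations (9720 fits with 5 folds), against 4 × 5 = 20 combinations (100 fits) for the simple one.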
  
  

### GBDT 

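There are many parameters and training is slow, so the tuning can be split into several rounds that grid-search groups of parameters separately, although the result may differ from that of a joint search.
In my experience the outcome also depends heavily on the training set: different training sets lead to different best parameters.

The best parameters found in this run are as follows: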

```

criterion='friedman_mse', init=None,
                           learning_rate=0.2, loss='deviance', max_depth=5,
                           max_features='sqrt', max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=80, min_samples_split=1400,
                           min_weight_fraction_leaf=0.0, n_estimators=60,
                           n_iter_no_change=None, presort='auto',
                           random_state=10, subsample=0.8, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False


```
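
The idea of tuning in several rounds can be sketched as follows: tune the boosting parameters first, then fix the winners and tune the tree depth. This is a minimal illustration on synthetic data; the data, the two-stage split and the small grids are assumptions, not the exact procedure used to obtain the parameters above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Stage 1: tune the boosting parameters, tree parameters left at their defaults
stage1 = GridSearchCV(
    GradientBoostingClassifier(random_state=10),
    param_grid={'n_estimators': [20, 40], 'learning_rate': [0.1, 0.2]},
    scoring='roc_auc',
    cv=3,
)
stage1.fit(X, y)

# Stage 2: fix the stage-1 winners and tune the tree depth
stage2 = GridSearchCV(
    GradientBoostingClassifier(random_state=10, **stage1.best_params_),
    param_grid={'max_depth': [2, 3, 5]},
    scoring='roc_auc',
    cv=3,
)
stage2.fit(X, y)

best_params = {**stage1.best_params_, **stage2.best_params_}
print(best_params, round(stage2.best_score_, 3))
```

The two stages evaluate 4 + 3 = 7 candidates instead of the 2 × 2 × 3 = 12 of a joint search. The saving grows quickly with larger grids, at the price of possibly missing the joint optimum.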



In [0]:
def GBDTGridSearch(x_train,y_train,x_test,y_test):
  
  parameters = {'n_estimators':range(20,81,10)
                ,'learning_rate':[0.05,0.1,0.2,0.5]
                ,'max_depth':range(3,14,2)
              }
  #  Skip tuning the two parameters below for now
  #  ,'min_samples_split':range(800,1900,200)
  #  ,'min_samples_leaf':range(60,101,10)
  
  gs_clf = GridSearchCV(estimator = GradientBoostingClassifier( min_samples_leaf=80, min_samples_split=1400,max_features='sqrt', subsample=0.8, random_state=10), 
                         param_grid = parameters, scoring='roc_auc', cv=5)
  
  gs_clf.fit(x_train,y_train)
  
  print('Test set score: {:.3f}'.format(gs_clf.score(x_test,y_test)))
  print('Best parameters: {}'.format(gs_clf.best_params_))
  print('Best cross-validation score: {:.3f}'.format(gs_clf.best_score_))
  print('Best estimator:\n{}'.format(gs_clf.best_estimator_))
  
  gbdt = gs_clf.best_estimator_
  y_pred = gbdt.predict(x_test)
  y_pred_scores = cross_val_predict(gbdt,x_test,y_test,cv=5,method='predict_proba')
  proc_score(y_pred,y_pred_scores[:,1],y_test,train=False)

In [12]:
print("- - - - - - - - - - - - - - - - - - - Original data, start training - - - - - - - - - - - - - - - - - - - - - ")
print("- - - -  - - LR - - - - - - - - ")
LRGridSearch(x_train,y_train,x_test,y_test)
print("- - - -  - - Decision Tree - - - - - - - - ")
DecisionTreeGridSearch(x_train,y_train,x_test,y_test)
print("- - - -  - - Random Forest - - - - - - - - ")

RandomForestGridSearch(x_train,y_train,x_test,y_test)

print("- - - -  - - XGB - - - - - - - - ")

XGBGridSearch(x_train,y_train,x_test,y_test)

print("- - - -  - - GBDT - - - - - - - - ")
GBDTGridSearch(x_train,y_train,x_test,y_test)

- - - - - - - - - - - - - - - - - - - Original data, start training - - - - - - - - - - - - - - - - - - - - - 
- - - -  - - LR - - - - - - - - 
Test set score:0.82
Best parameters:{'C': 0.25, 'penalty': 'l1'},Best score on train set:0.80
Test: accuracy : 0.775 , precision : 0.647 , recall : 0.321 , f1 : 0.429 , roc_auc : 0.744
- - - -  - - Decision Tree - - - - - - - - 
Fitting 5 folds for each of 3600 candidates, totalling 18000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 195 tasks      | elapsed:    4.0s
[Parallel(n_jobs=-1)]: Done 1647 tasks      | elapsed:   17.5s
[Parallel(n_jobs=-1)]: Done 4083 tasks      | elapsed:   42.5s
[Parallel(n_jobs=-1)]: Done 7479 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 11859 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 17199 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done 18000 out of 18000 | elapsed:  3.3min finished


Test set score: 0.707
Best parameters: {'criterion': 'entropy', 'max_depth': 5, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 36, 'splitter': 'best'}
Best cross-validation score: 0.751
Best estimator:
DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=36, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
Test: accuracy : 0.756 , precision : 0.557 , recall : 0.350 , f1 : 0.430 , roc_auc : 0.715
- - - -  - - Random Forest - - - - - - - - 
Test set score: 0.775
Best parameters: {'max_depth': 5, 'min_samples_split': 70}
Best cross-validation score: 0.793
Best estimator:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=5, max_features='auto'

In [13]:
# Min-max normalize the features

print("- - - - - - - - - - - - - - - - - - - Normalized data, start training - - - - - - - - - - - - - - - - - - - - - ")

# minmax_scale is applied to the train and test sets separately, so each is scaled by its own min and max
x_train_scale =minmax_scale(x_train)
x_test_scale =minmax_scale(x_test)

print("- - - -- - - - - -   - - LR - - - - - - - - - - -  - - - ")
LRGridSearch(x_train_scale,y_train,x_test_scale,y_test)
print("- - - -  - - Decision Tree - - - - - - - - ")
DecisionTreeGridSearch(x_train_scale,y_train,x_test_scale,y_test)
print("- - - - - - - - - -  - - Random Forest - -- - - - - -  - - - - - - ")

RandomForestGridSearch(x_train_scale,y_train,x_test_scale,y_test)

print("- - - - - - - - - -  - - XGB - - - - - -- - - - - -  - - ")

XGBGridSearch(x_train_scale,y_train,x_test_scale,y_test)

print("- - - - - - - - - -  - - GBDT - - - - - -- - - - - -  - - ")
GBDTGridSearch(x_train_scale,y_train,x_test_scale,y_test)

- - - - - - - - - - - - - - - - - - - Normalized data, start training - - - - - - - - - - - - - - - - - - - - - 
- - - -- - - - - -   - - LR - - - - - - - - - - -  - - - 
Test set score:0.81
Best parameters:{'C': 1.0, 'penalty': 'l1'},Best score on train set:0.80
Test: accuracy : 0.770 , precision : 0.589 , recall : 0.415 , f1 : 0.487 , roc_auc : 0.759
- - - -  - - Decision Tree - - - - - - - - 
Fitting 5 folds for each of 3600 candidates, totalling 18000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 296 tasks      | elapsed:    4.7s
[Parallel(n_jobs=-1)]: Done 2474 tasks      | elapsed:   25.2s
[Parallel(n_jobs=-1)]: Done 6128 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 11222 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 17792 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 18000 out of 18000 | elapsed:  3.3min finished


Test set score: 0.724
Best parameters: {'criterion': 'gini', 'max_depth': 9, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 41, 'splitter': 'random'}
Best cross-validation score: 0.752
Best estimator:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=41, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='random')
Test: accuracy : 0.736 , precision : 0.499 , recall : 0.513 , f1 : 0.506 , roc_auc : 0.702
- - - - - - - - - -  - - Random Forest - -- - - - - -  - - - - - - 
Test set score: 0.783
Best parameters: {'max_depth': 7, 'min_samples_split': 50}
Best cross-validation score: 0.793
Best estimator:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=