## Modeling and Parameter Tuning

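Modeling and tuning means first building a model and then adjusting its parameters so that its performance on the available features is as good as possible. Depending on the task, models broadly fall into two groups: classification and regression. Machine learning models commonly used in competitions include logistic regression, decision trees, random forests, lightgbm, and xgboost. During feature engineering, the data should be processed in a way that suits the model being used, so that the model can perform at its best. After a model is built it must be evaluated; common metrics include accuracy, F-score, AUC (area under the ROC curve), and MSE. By analyzing the training process and the results, we then adjust the model's parameters; for example, when the model overfits, we tune the parameters that reduce overfitting.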
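
To make the metrics named above concrete, here is a minimal pure-Python sketch of what each one computes. The helper names (`accuracy`, `f1`, `auc`, `mse`) and the toy labels and scores are made up for illustration; in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): true labels and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
labels = [int(s >= 0.5) for s in scores]  # hard labels at a 0.5 threshold

acc_value = accuracy(y_true, labels)
f1_value = f1(y_true, labels)
auc_value = auc(y_true, scores)
mse_value = mse(y_true, scores)
print(acc_value, f1_value, auc_value, mse_value)
```

Accuracy and the F-score need hard labels, so they depend on the threshold, whereas AUC is computed from the raw scores and only depends on their ranking.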

### Building the Models

1. Tree ensemble models

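The models are built with lightgbm, xgboost, and catboost. The principles and usage of each model are not covered here; see:

lightgbm: https://lightgbm.apachecn.org/#/

xgboost: https://xgboost.apachecn.org/#/

catboost: https://catboost.ai/docs/

Cross-validation is used to evaluate model performance.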

In [None]:
import datetime

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
def cv_model(clf, train_x, train_y, test_x, clf_name):
    print(datetime.datetime.now())
    folds = 5
    seed = 2020
    kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
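        # Index by position on both features and labels so a non-default Series index cannot misalign them.
        trn_x, val_x = train_x.iloc[train_index], train_x.iloc[valid_index]
        trn_y, val_y = train_y.iloc[train_index], train_y.iloc[valid_index]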
        if clf_name == "lgb":          
            train_matrix = clf.Dataset(trn_x, label=trn_y, weight=trn_y.map(lambda x:1 if x==0 else 3))
            #train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'max_bin':200,
                'metric': 'auc',
                'min_child_weight': 1.5,
                'num_leaves': 2 ** 4,
                'lambda_l2': 10,
                'learning_rate': 0.1,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 20,
                'seed': 2020,
                'nthread': -1,
                'verbose': -1,
            }
            # NOTE: this second dict replaces the one above, so only the values below reach clf.train().
            params = {
                'objective' : 'binary',
                'is_unbalance' : True,
                'metric' : 'auc',
                'max_depth' : 9,
                'num_leaves' : 75,
                'learning_rate' : 0.1,
                'min_child_samples' : 40,
                'min_child_weight' : 1,
                'colsample_bytree' : 0.7,
                'subsample' : 0.9,
                'subsample_freq' : 4,
                'reg_alpha' : 0.4,
                'reg_lambda' : 35,
                'cat_smooth' : 0,
                'seed': 2020,
                'nthread': -1,
                'verbose': -1,
            }

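            # verbose_eval / early_stopping_rounds are accepted by LightGBM < 4.0 only; from 4.0 on,
            # pass callbacks=[clf.early_stopping(200), clf.log_evaluation(200)] instead.
            model = clf.train(params, train_matrix, 10000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200, early_stopping_rounds=200)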
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            #print("importance: ", model.feature_importances_attribute)
            #lightgbm.plot_importance(model)
            im=pd.DataFrame({'importance':model.feature_importance(),'var':trn_x.columns})
            im=im.sort_values(by='importance',ascending=False)
            print(im)
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x , label=trn_y)
            valid_matrix = clf.DMatrix(val_x , label=val_y)
            params = {
                'booster': 'gbtree',
                'objective': 'binary:logistic',
                'eval_metric': 'auc',
                'gamma': 1,
                'min_child_weight': 2,
                'max_depth': 5,
                'lambda': 10,
                'subsample': 0.85,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 1,
                'eta': 0.03,
                'tree_method': 'exact',
                'seed': 2019,
                'nthread': 12,
                "silent": True,
                'scale_pos_weight':3 , 
            }
            watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=10000, evals=watchlist, verbose_eval=200, early_stopping_rounds=100)
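            # ntree_limit / best_ntree_limit belong to older XGBoost releases; where they are unavailable,
            # use iteration_range=(0, model.best_iteration + 1) instead.
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(clf.DMatrix(test_x), ntree_limit=model.best_ntree_limit)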
            #print("importance: ", model.feature_importances_attribute)
            
            #clf.plot_importance()
            display(pd.Series(model.get_fscore()).sort_values(ascending=False))
            
        if clf_name == "cat":
            params = {
                'learning_rate': 0.03,
                'depth': 5,
                'l2_leaf_reg': 10,
                'bootstrap_type': 'Bernoulli',
                'od_type': 'Iter',
                'od_wait': 50,
                'random_seed': 2018,
                'subsample':0.8,
                'colsample_bylevel': 0.85,
                'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y), cat_features=[], use_best_model=True, verbose=500)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)
            print("importance: ", model.get_feature_importance(prettified=True))
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

In [None]:
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test
def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test
def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test

In [None]:
lgb_train, lgb_test = lgb_model(train_data, target, test_data)
xgb_train, xgb_test = xgb_model(train_data, target, test_data)
cat_train, cat_test = cat_model(train_data, target, test_data)
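
`cv_model` above relies on `StratifiedKFold` so that every fold keeps roughly the same class ratio as the full training set, which matters for imbalanced targets. The following small check illustrates this on made-up labels (80 negatives, 20 positives); the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels (hypothetical): 80 negatives and 20 positives.
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)

# Positive rate of each validation fold.
fold_pos_rates = [float(y[valid_idx].mean()) for _, valid_idx in skf.split(X, y)]
print(fold_pos_rates)
```

Each validation fold receives the same share of positives as the whole set, so per-fold AUC values are comparable with one another.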

2. Model ensembling

The models are combined by stacking: the first layer uses lightgbm, xgboost, and catboost, and the second layer uses logistic regression.

In [None]:
from sklearn.model_selection import StratifiedKFold
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

import xgboost
import lightgbm
from catboost import CatBoostRegressor

def stacking(clf, train_x, train_y, test_x, clf_name, kf, label_split=None):
    train = np.zeros((train_x.shape[0], 1))
    test = np.zeros((test_x.shape[0], 1))
    # Take the fold count from kf itself instead of relying on a global `folds` variable.
    test_pre = np.empty((kf.get_n_splits(), test_x.shape[0], 1))
    cv_scores = []
    for i, (train_index, val_index) in enumerate(kf.split(train_x, train_y)):
        tr_x = train_x.iloc[train_index]
        tr_y = train_y[train_index]
        val_x = train_x.iloc[val_index]
        val_y = train_y[val_index]
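        if clf_name in ['rf', 'ada', 'gb', 'et', 'lr']:
            print('clf_name:', clf_name)
            clf.fit(tr_x, tr_y)
            # Use positive-class probabilities (AUC needs scores, not hard labels) and
            # store them as out-of-fold features, as the other branches do.
            val_pred = clf.predict_proba(val_x)[:, 1].reshape(-1,1)
            train[val_index] = val_pred
            test_pre[i,:] = clf.predict_proba(test_x)[:, 1].reshape(-1,1)
            cv_scores.append(roc_auc_score(val_y, val_pred))
            print('fold {} cv_scores:{}'.format(i, cv_scores))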
            
        elif clf_name == "lgb":
            train_matrix = clf.Dataset(tr_x, label=tr_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'max_bin':200,
                'metric': 'auc',
                'min_child_weight': 1.5,
                'num_leaves': 2 ** 4,
                'lambda_l2': 10,
                'learning_rate': 0.03,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 20,
                'seed': 2020,
                'nthread': -1,
                'verbose': -1,
            }
            model = clf.train(params, train_matrix, 10000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200,early_stopping_rounds=200)
            
            val_pred = model.predict(val_x, num_iteration=model.best_iteration).reshape(-1,1)
            train[val_index] = val_pred
            test_pre[i, :] = model.predict(test_x, num_iteration=model.best_iteration).reshape(-1,1)
            
            cv_scores.append(roc_auc_score(val_y, val_pred))
            print('fold {} cv_scores:{}'.format(i, cv_scores))
            
        elif clf_name == "xgb":
            train_matrix = clf.DMatrix(tr_x , label=tr_y)
            valid_matrix = clf.DMatrix(val_x , label=val_y)
            params = {
                'booster': 'gbtree',
                'objective': 'binary:logistic',
                'eval_metric': 'auc',
                'gamma': 1,
                'min_child_weight': 2,
                'max_depth': 5,
                'lambda': 10,
                'subsample': 0.85,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 1,
                'eta': 0.03,
                'tree_method': 'exact',
                'seed': 2019,
                'nthread': 12,
                "silent": True,
                'scale_pos_weight':3 , 
            }
            watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)
            
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit).reshape(-1,1)
            train[val_index] = val_pred
            
            test_pre[i, :] = model.predict(clf.DMatrix(test_x), ntree_limit=model.best_ntree_limit).reshape(-1,1)
            
            cv_scores.append(roc_auc_score(val_y, val_pred))
            print('fold {} cv_scores:{}'.format(i, cv_scores))
            
        elif clf_name == "cat":
            params = {
                'learning_rate': 0.03,
                'depth': 5,
                'l2_leaf_reg': 10,
                'bootstrap_type': 'Bernoulli',
                'od_type': 'Iter',
                'od_wait': 50,
                'random_seed': 2018,
                'subsample':0.8,
                'colsample_bylevel': 0.85,
                'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(tr_x, tr_y, eval_set=(val_x, val_y), cat_features=[], use_best_model=True, verbose=500)
            val_pred = model.predict(val_x).reshape(-1,1)
            train[val_index] = val_pred
            test_pre[i, :] = model.predict(test_x).reshape(-1,1)
            cv_scores.append(roc_auc_score(val_y, val_pred))
            print('fold {} cv_scores:{}'.format(i, cv_scores))
    # Average the per-fold test predictions once, after every fold has been filled in.
    test[:] = test_pre.mean(axis=0)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train.reshape(-1,1), test.reshape(-1,1)

In [None]:
def rf_model(train_x, train_y, test_x, kf, label_split=None):
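    # max_features="auto" meant "sqrt" for classifiers and is no longer accepted by recent scikit-learn.
    rf = RandomForestClassifier(n_estimators=200, max_depth=20, n_jobs=-1, random_state=2020, max_features="sqrt", verbose=1)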
    rf_train, rf_test = stacking(rf, train_x, train_y, test_x, "rf", kf, label_split)
    return rf_train, rf_test, "rf_stacker"

def ada_model(train_x, train_y, test_x, kf, label_split=None):
    # AdaBoostClassifier has no n_jobs parameter, so it is not passed here.
    adaboost = AdaBoostClassifier(n_estimators=200, random_state=2020, learning_rate=0.01)
    ada_train, ada_test = stacking(adaboost, train_x, train_y, test_x, "ada", kf, label_split)
    return ada_train, ada_test, "ada_stacker"


def gb_model(train_x, train_y, test_x, kf, label_split=None):
    gbdt = GradientBoostingClassifier(learning_rate=0.01, n_estimators=200, subsample=0.8, random_state=2020,max_depth=5,verbose=1)
    # The name must be "gb" to match the branch list inside stacking().
    gbdt_train, gbdt_test = stacking(gbdt, train_x, train_y, test_x, "gb", kf, label_split)
    return gbdt_train, gbdt_test, "gbdt_stacker"


def et_model(train_x, train_y, test_x, kf, label_split=None):
    # ExtraTreesClassifier has no learning_rate parameter; max_features="auto" meant "sqrt" for classifiers.
    et = ExtraTreesClassifier(n_estimators=200, random_state=2020, max_depth=15, max_features="sqrt", verbose=1)
    et_train, et_test = stacking(et, train_x, train_y, test_x, "et", kf, label_split)
    return et_train, et_test, "et_stacker"


def lr_model(train_x, train_y, test_x, kf, label_split=None):
    lr = LogisticRegression(n_jobs=-1)
    lr_train, lr_test = stacking(lr, train_x, train_y, test_x, "lr", kf, label_split)
    return lr_train, lr_test, "lr_stacker"

def my_xgb_model(train_x, train_y, test_x, kf, label_split=None):
    xgb_train, xgb_test = stacking(xgboost, train_x, train_y, test_x, "xgb", kf, label_split)
    return xgb_train, xgb_test, "xgb_stacker"

def my_lgb_model(train_x, train_y, test_x, kf, label_split=None):
    lgb_train, lgb_test = stacking(lightgbm, train_x, train_y, test_x, "lgb", kf, label_split)
    return lgb_train, lgb_test, "lgb_stacker"

def my_cat_model(train_x, train_y, test_x, kf, label_split=None):
    cat_train, cat_test = stacking(CatBoostRegressor, train_x, train_y, test_x, "cat", kf, label_split)
    return cat_train, cat_test, "cat_stacker"

In [None]:
def stacking_pred(train_x, train_y, test_x, kf, clf_list, label_split=None, clf_fin="lgb", if_concat_origin=True):
    print(datetime.datetime.now())
    column_list = []
    train_data_list = []
    test_data_list = []
    # Run every first-layer model and collect its out-of-fold and test predictions.
    for clf in clf_list:
        train_data, test_data, clf_name = clf(train_x, train_y, test_x, kf, label_split)
        train_data_list.append(train_data)
        test_data_list.append(test_data)
        column_list.append("clf_%s"%(clf_name))
    train = np.concatenate(train_data_list, axis=1)
    test = np.concatenate(test_data_list, axis=1)
    if if_concat_origin:
        train = np.concatenate([train_x, train], axis=1)
        test = np.concatenate([test_x, test], axis=1)
    print(train_x.shape)
    print(train.shape)
    print(column_list)
    if clf_fin == "xgb":
        print('second layer model:', clf_fin)
        clf = xgboost
        train_matrix = clf.DMatrix(train , label=train_y)
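        # The "validation" set here is the training data itself, so early stopping is not an independent check.
        valid_matrix = clf.DMatrix(train , label=train_y)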
        params = {
            'booster': 'gbtree',
            'objective': 'binary:logistic',
            'eval_metric': 'auc',
            'gamma': 1,
            'min_child_weight': 1.5,
            'max_depth': 5,
            'lambda': 10,
            'subsample': 0.7,
            'colsample_bytree': 0.7,
            'colsample_bylevel': 0.7,
            'eta': 0.04,
            'tree_method': 'exact',
            'seed': 2019,
            "silent": True,
        }
        watchlist = [(train_matrix, 'train'),(valid_matrix, 'eval')]
        model = clf.train(params, train_matrix, num_boost_round=5000, evals=watchlist, verbose_eval=200, early_stopping_rounds=200)

        pre = model.predict(clf.DMatrix(test), ntree_limit=model.best_ntree_limit).reshape(-1,1)
        return pre, train, test
    if clf_fin == 'lgb':
        print('second layer model:', clf_fin)
        clf = lightgbm
        train_matrix = clf.Dataset(train, label=train_y)
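        # The "validation" set here is the training data itself, so early stopping is not an independent check.
        valid_matrix = clf.Dataset(train, label=train_y)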
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'min_child_weight': 5,
            'num_leaves': 2 ** 5,
            'lambda_l2': 10,
            'feature_fraction': 0.8,
            'bagging_fraction': 0.8,
            'bagging_freq': 4,
            'learning_rate': 0.1,
            'seed': 2018,
            'silent': True,
            'verbose': -1,
        }
        model = clf.train(params, train_matrix, 5000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200,early_stopping_rounds=200)
        pre = model.predict(test, num_iteration=model.best_iteration).reshape(-1,1)
        return pre, train, test
    if clf_fin == 'lr':
        print('second layer model:', clf_fin)
        clf = LogisticRegression(n_jobs=-1)
        clf.fit(train, train_y)
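        # Keep only the positive-class probability so the output shape matches the other branches.
        pre = clf.predict_proba(test)[:, 1].reshape(-1,1)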
        return pre, train, test

In [None]:
from sklearn.model_selection import KFold, StratifiedKFold
folds = 5
seed = 2020
kf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)  # use the seed defined above

clf_list = [my_lgb_model, my_xgb_model, my_cat_model ]
#clf_list = [rf_model  ]

pre, train, test = stacking_pred(train_data, target, test_data, kf, clf_list, label_split=None, clf_fin='lr', if_concat_origin=False)
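
The core idea behind the stacking code above is that the second layer is trained on out-of-fold predictions, so each training row's meta-feature comes from a model that never saw that row. Here is a compact sketch of that idea. It is not the pipeline above: it swaps in two scikit-learn classifiers (a decision tree and k-nearest neighbours) as first-layer models, uses synthetic data, and the helper name `oof_predict` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def oof_predict(make_model, X_tr, y_tr, X_te, kf):
    """Return out-of-fold train predictions and fold-averaged test predictions."""
    oof = np.zeros(len(X_tr))
    test_pred = np.zeros(len(X_te))
    for tr_idx, val_idx in kf.split(X_tr, y_tr):
        model = make_model().fit(X_tr[tr_idx], y_tr[tr_idx])
        # Each training row is scored by a model that was not fitted on it.
        oof[val_idx] = model.predict_proba(X_tr[val_idx])[:, 1]
        test_pred += model.predict_proba(X_te)[:, 1] / kf.get_n_splits()
    return oof, test_pred


# Synthetic data stands in for train_data / target / test_data.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# First layer: stand-in models, one meta-feature column per model.
base_models = [
    lambda: DecisionTreeClassifier(max_depth=3, random_state=0),
    lambda: KNeighborsClassifier(n_neighbors=15),
]
layers = [oof_predict(m, X_tr, y_tr, X_te, kf) for m in base_models]
meta_train = np.column_stack([oof for oof, _ in layers])
meta_test = np.column_stack([tp for _, tp in layers])

# Second layer: logistic regression on the first-layer predictions.
meta = LogisticRegression().fit(meta_train, y_tr)
stack_auc = roc_auc_score(y_te, meta.predict_proba(meta_test)[:, 1])
print(meta_train.shape, meta_test.shape, round(stack_auc, 3))
```

Because the second layer only sees out-of-fold predictions, it cannot simply learn to trust a first-layer model that memorized the training rows.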

### Parameter Tuning

Reference article: https://zhuanlan.zhihu.com/p/76206257

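Taking lightgbm as an example, tune the parameters greedily with grid search. Start with a relatively large learning rate to speed up convergence, and once the other parameters are tuned, switch to a smaller learning rate to improve accuracy:

1. Tune max_depth and num_leaves. These two parameters largely determine the size and complexity of each tree and can be tuned together. Reference code for this step follows the list; the code for the remaining parameters is similar;
2. Tune min_data_in_leaf and min_sum_hessian_in_leaf to keep the trees from overfitting;
3. Tune feature_fraction, which trains each tree on a randomly selected fraction of the features, to reduce overfitting;
4. Tune bagging_fraction and bagging_freq. bagging_fraction corresponds to subsample (row sampling); it can speed up training and can also reduce overfitting. bagging_freq is the bagging frequency and defaults to 0: 0 means bagging is disabled, and k means bagging is performed every k iterations;
5. Tune lambda_l1 (reg_alpha) and lambda_l2 (reg_lambda), which reduce overfitting through L1 and L2 regularization;
6. Tune cat_smooth, which smooths the statistics computed for categorical features and mainly serves to reduce noise, especially from categories with few samples;
7. Tune the learning rate.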


In [None]:
from sklearn.model_selection import GridSearchCV  # Performing grid search
parameters = {
    'max_depth': range(3,10,2),
    'num_leaves': range(10, 80, 10),
}

gbm = lgb.LGBMClassifier(
                        objective = 'binary',
                        is_unbalance = True,
                        metric = 'auc',

                        max_depth = 9,
                        num_leaves = 75,

                        learning_rate = 0.1,

                        min_child_samples = 40,
                        min_child_weight = 1,

                        colsample_bytree = 0.7,

                        subsample = 0.9,
                        subsample_freq = 4,

                        reg_alpha = 0.4,
                        reg_lambda = 35,

                        cat_smooth = 0,
                        )

gsearch = GridSearchCV(gbm, param_grid=parameters, scoring='roc_auc', cv=3)
gsearch.fit(train_data, target)
print('Best parameter values: {0}'.format(gsearch.best_params_))
print('Best model score: {0}'.format(gsearch.best_score_))
print(gsearch.cv_results_['mean_test_score'])
print(gsearch.cv_results_['params'])
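
The staged procedure listed above (tune one group of parameters, freeze the winners, move to the next group) can be sketched as a loop over small grids. To keep the sketch runnable without lightgbm, it uses scikit-learn's `GradientBoostingClassifier` as a stand-in, so the parameter names are the scikit-learn analogues rather than the lightgbm ones, and the data is synthetic. The grids themselves are arbitrary illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for train_data / target.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One small grid per stage, in the same order as the list above.
stages = [
    {"max_depth": [2, 3, 4], "max_leaf_nodes": [4, 8, 16]},  # tree size
    {"min_samples_leaf": [1, 5, 20]},                        # leaf constraints
    {"max_features": [0.6, 0.8, 1.0]},                       # feature sampling
    {"subsample": [0.7, 0.85, 1.0]},                         # row sampling
    {"learning_rate": [0.03, 0.1]},                          # learning rate last
]

# Start from a relatively large learning rate, as the text suggests.
best_params = {"learning_rate": 0.1, "n_estimators": 30, "random_state": 0}
history = []
for grid in stages:
    search = GridSearchCV(
        GradientBoostingClassifier(**best_params),
        param_grid=grid,
        scoring="roc_auc",
        cv=3,
    )
    search.fit(X, y)
    # Freeze this stage's best values before tuning the next group.
    best_params.update(search.best_params_)
    history.append(float(search.best_score_))

print(best_params)
print(history)
```

Tuning greedily in stages searches far fewer combinations than one joint grid over all parameters, at the cost of possibly missing interactions between parameters tuned in different stages.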