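* Even when the model is trained on all of the data, the improvement is small.
* It is better to keep the well-performing model trained on nocp and improve the poorly performing model trained on iscp.
* Next, tune the hyperparameters of the model trained on iscp.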

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

In [2]:
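def make_train_test(X, Y, seed, rate):
    """Split sequentially: the first `rate` fraction is the training part, the rest is the test part.

    Only the training part is shuffled (reproducibly, via `seed`); the test part keeps its original order.
    """
    idx = int(rate * X.shape[0])
    X_train = X[:idx]
    Y_train = Y[:idx]
    X_test = X[idx:]
    Y_test = Y[idx:]
    shuffled_indices = np.arange(X_train.shape[0])
    np.random.seed(seed)
    np.random.shuffle(shuffled_indices)
    X_train, Y_train = X_train[shuffled_indices], Y_train[shuffled_indices]
    return (X_train, Y_train), (X_test, Y_test)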

In [3]:
data_iscp = pd.read_csv('train_data_iscp.csv', index_col='Unnamed: 0')
X_iscp = np.array(data_iscp.drop('target', axis=1))
y_iscp = np.array(data_iscp.target)
(X_trainval_iscp, y_trainval_iscp), (X_test_iscp, y_test_iscp) = make_train_test(X_iscp, y_iscp, 2, 0.8)
(X_train_iscp, y_train_iscp), (X_val_iscp, y_val_iscp) = make_train_test(X_trainval_iscp, y_trainval_iscp, 2, 0.75)
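
* Calling the helper twice with rates 0.8 and 0.75 should give roughly a 60/20/20 train/validation/test split. The sketch below checks this on toy data (the 100-row arrays and the names `X_toy`, `y_toy`, `split_sizes` are made up for illustration; the helper is a condensed copy of the one above). Because the second call cuts the already shuffled train+validation part, the validation set is a random subset, while the test set is always the tail of the original data.

```python
import numpy as np

def make_train_test(X, Y, seed, rate):
    # Same logic as the notebook's helper: sequential cut, then shuffle the first part only
    idx = int(rate * X.shape[0])
    X_train, Y_train = X[:idx], Y[:idx]
    X_test, Y_test = X[idx:], Y[idx:]
    order = np.arange(X_train.shape[0])
    np.random.seed(seed)
    np.random.shuffle(order)
    return (X_train[order], Y_train[order]), (X_test, Y_test)

# Toy data (hypothetical): 100 rows whose target equals the row number
X_toy = np.arange(100).reshape(100, 1)
y_toy = np.arange(100)

# First cut: 80% train+validation, 20% test
(X_tv, y_tv), (X_te, y_te) = make_train_test(X_toy, y_toy, 2, 0.8)
# Second cut: 75% of the 80% is train, the remaining 25% is validation
(X_tr, y_tr), (X_va, y_va) = make_train_test(X_tv, y_tv, 2, 0.75)

split_sizes = (len(y_tr), len(y_va), len(y_te))
print(split_sizes)  # (60, 20, 20)
```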

* Evaluate five objective functions and five evaluation metrics (25 combinations in total).

In [4]:
mae_test_data = pd.DataFrame(columns=['objectives', 'metrics', 'mae'])
objectives = ['regression_l1', 'regression_l2', 'quantile', 'poisson', 'mape']
metrics = ['mae', 'l2', 'quantile', 'poisson', 'mape']
for objective in objectives:
    for metric in metrics:
        # Build fresh callbacks for every fit so no early-stopping state is shared between runs
        callbacks = [
            lgb.early_stopping(stopping_rounds=10, verbose=True),
            lgb.log_evaluation(period=10, show_stdv=True)
        ]
        model = lgb.LGBMRegressor(n_estimators=500, objective=objective, n_jobs=-1)
        model.fit(
            X_train_iscp, y_train_iscp,
            eval_set=[(X_val_iscp, y_val_iscp)],
            eval_metric=metric,
            callbacks=callbacks
        )
        y_pred_iscp = model.predict(X_val_iscp)
        # Append at the next row position (DataFrame.size counts cells, not rows)
        mae_test_data.loc[len(mae_test_data)] = [
            objective, metric,
            mean_absolute_error(y_val_iscp, y_pred_iscp)
        ]
mae_test_data.to_csv('mae_test_data.csv', index=False)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.017307 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8133
[LightGBM] [Info] Number of data points in the train set: 593364, number of used features: 43
[LightGBM] [Info] Start training from score 99.096001
Training until validation scores don't improve for 10 rounds
[10]	valid_0's l1: 264.027
[20]	valid_0's l1: 158.016
[30]	valid_0's l1: 118.72
[40]	valid_0's l1: 107.51
[50]	valid_0's l1: 103.641
[60]	valid_0's l1: 101.84
[70]	valid_0's l1: 99.9637
[80]	valid_0's l1: 99.0236
[90]	valid_0's l1: 97.6458
[100]	valid_0's l1: 96.7605
[110]	valid_0's l1: 95.0143
[120]	valid_0's l1: 94.0003
[130]	valid_0's l1: 93.119
[140]	valid_0's l1: 92.2284
[150]	valid_0's l1: 91.3656
[160]	valid_0's l1: 90.6574
[170]	valid_0's l1: 89.6761
[180]	valid_0's l1: 88.5466
[190]	valid_0's l1: 87.7386
[200]	valid_

* Inspect the CSV of results.
* Choose objective regression_l2 and metric l2.
* Next, examine the other hyperparameters.
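
* One way to inspect the saved results is to sort them by validation MAE. The sketch below is only an illustration: `rank_combinations` is a hypothetical helper, and the rows in `toy` are placeholder values, not the notebook's results. Only the column names match `mae_test_data.csv`.

```python
import pandas as pd

def rank_combinations(results: pd.DataFrame) -> pd.DataFrame:
    # Lowest validation MAE first; reset the index so row 0 is the best combination
    return results.sort_values('mae').reset_index(drop=True)

# Placeholder rows (made-up names and values) with the same columns as mae_test_data.csv
toy = pd.DataFrame({
    'objectives': ['obj_a', 'obj_b', 'obj_c'],
    'metrics': ['metric_a', 'metric_b', 'metric_c'],
    'mae': [3.0, 1.0, 2.0],
})
ranked = rank_combinations(toy)
print(ranked)

# In the notebook this would be applied to the real file:
# ranked = rank_combinations(pd.read_csv('mae_test_data.csv'))
```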

In [10]:
param_grid = {
    'num_leaves': range(10,91,10),
    'max_depth': range(7,11,1)
}
# The scorer returns negative MAE so that GridSearchCV can maximize it
neg_mae = make_scorer(mean_absolute_error, greater_is_better=False)
search = GridSearchCV(
    estimator=lgb.LGBMRegressor(objective='regression_l2'),
    param_grid=param_grid,
    scoring=neg_mae,
    cv=5,
    n_jobs=-1
)
search.fit(X_trainval_iscp, y_trainval_iscp)
# best_estimator_ is refit on the full train+validation set with the best parameters
model = search.best_estimator_
y_pred_iscp = model.predict(X_test_iscp)
mae_iscp = mean_absolute_error(y_test_iscp, y_pred_iscp)
mae_iscp

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.211490 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8127
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.280587 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8125
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.207824 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8136
[LightGBM] [Info] Number of data points in the train set: 632922, number of used features: 43
[LightGBM] [Info] Number of data points in the train set: 632921, number of used features: 43
[LightGBM] [Info] Number of data

120.83852646664067
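
* To gauge the cost of the grid search above, the sketch below counts the candidates in the `num_leaves` x `max_depth` grid and the resulting cross-validation fits. It uses only the standard library; the variable names are illustrative. It also checks whether any `num_leaves` value exceeds the maximum number of leaves a tree of the given depth can have (2 to the power of `max_depth`), since such a setting could not be fully used.

```python
from itertools import product

num_leaves_values = list(range(10, 91, 10))  # 10, 20, ..., 90
max_depth_values = list(range(7, 11, 1))     # 7, 8, 9, 10
cv_folds = 5

# Every (num_leaves, max_depth) pair that the grid search evaluates
combinations = list(product(num_leaves_values, max_depth_values))
n_candidates = len(combinations)
n_cv_fits = n_candidates * cv_folds

# A tree of depth d has at most 2**d leaves, so num_leaves above that is unreachable
unreachable = [(nl, md) for nl, md in combinations if nl > 2 ** md]

print(n_candidates, n_cv_fits, unreachable)  # 36 180 []
```

With the default `refit=True`, GridSearchCV adds one more fit of the best candidate on all of the data passed to `fit`.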