Models from scikit-learn, XGBoost, and Keras can be used for stacking. As a feature of this project, all out-of-fold predictions can be saved for further analysis after training. Stacking (stacked generalization) trains a learning algorithm to combine the predictions of several other learning algorithms. It typically performs better than any single trained model and has been used successfully in both regression and classification (Breiman, 1996). The basic idea is to train a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error.
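As a minimal sketch of the general idea, the following uses scikit-learn's built-in `StackingClassifier` on synthetic data. It is not this project's implementation; the dataset, the choice of base models, and every hyperparameter are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary-classification data (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pool of base classifiers whose predictions the final estimator combines.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]

# cv=5: the final estimator is trained on 5-fold out-of-fold predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)
proba = stack.predict_proba(X_test)  # one row per test sample, one column per class
```

Training the final estimator on out-of-fold predictions matters because predictions made on rows a base model has already seen would be overly optimistic.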
- Numpy
- Pandas
- XGBoost
- Scikit-learn
- TensorFlow 2.0 or later
To train the GeoStack model and generate predictions, run `python stacking/main.py`. Note that:
- Place the training and prediction datasets under `data/input`.
- Stacking features derived from the original dataset need to be under `data/output/stacking_features`.
- The final prediction result is written to `stacking/final_results.csv` under the output folder.
- The prediction results can be visualized in `GoCAD` software.
- Set the training dataset with its target data, and the test dataset.

      FILES_LIST_stage1 = {
          'train': (INPUT_PATH + 'train.csv',),
          'target': (INPUT_PATH + 'target.csv',),
          'test': (INPUT_PATH + 'test.csv',),
      }
- Define model classes that inherit from the `BaseModel` class; these are used in Stage 1 and Stage 2. In this project, `xgboost`, `randomforest`, `svm`, and `gbdt` are the Stage 1 models. In Stage 2, `xgboost` is used again as the final model to predict the final results. Model usage and parameters are as follows:

      # Model usage
      class xgb_stage1(BaseModel):
          def build_model(self):
              return Xgb(**self.params)

      # For Stage 1
      XGB_PARAMS = {
          'colsample_bytree': 0.80,
          'learning_rate': 0.1,
          'eval_metric': 'auc',
          'max_depth': 5,
          'min_child_weight': 1,
          'nthread': 4,
          'objective': 'binary:logistic',
          'seed': 407,
          'silent': 1,
          'subsample': 0.60,
      }

      NN_PARAMS = {
          'batch_size': 32,
          'epoch': 100,
          'verbose': 1,
          'callbacks': [],
          'shuffle': True,
          'class_weight': None,
          'sample_weight': None,
          'normalize': True,
          'categorize_y': True,
      }

      RF_PARAMS = {
          'n_estimators': 500,
          'criterion': 'gini',
          'n_jobs': 8,
          'verbose': 0,
          'random_state': 407,
          'oob_score': True,
      }

      GBDT_PARAMS = {
          'n_estimators': 300,
          'learning_rate': 0.05,
          'subsample': 0.8,
          'max_depth': 5,
          'verbose': 1,
          'max_features': 0.9,
          'random_state': 407,
      }

      SVM_PARAMS = {
          'kernel': 'rbf',
          'C': 100,
          'gamma': 11,
          'probability': True,
      }

      # For Stage 2
      XGB_PARAMS_stage2 = {
          'colsample_bytree': 0.8,
          'learning_rate': 0.1,
          'eval_metric': 'mlogloss',
          'max_depth': 4,
          'seed': 1234,
          'nthread': 8,
          'reg_lambda': 0.01,
          'reg_alpha': 0.01,
          'subsample': 0.80,
          'objective': 'multi:softprob',
          'num_class': output_dim,
      }
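The `build_model` pattern above can be sketched in isolation as follows. `SketchBaseModel` is a hypothetical stand-in written for this example, not the project's `BaseModel`, and the assumption that `RF_PARAMS` is passed straight to scikit-learn's `RandomForestClassifier` is ours.

```python
from sklearn.ensemble import RandomForestClassifier


class SketchBaseModel:
    """Hypothetical stand-in for the project's BaseModel (not its real code)."""

    def __init__(self, name, params):
        self.name = name
        self.params = params

    def build_model(self):
        # Subclasses return a fresh, unfitted estimator.
        raise NotImplementedError


class rf_stage1(SketchBaseModel):
    def build_model(self):
        # Unpack the parameter dict into the estimator's constructor.
        return RandomForestClassifier(**self.params)


# RF_PARAMS copied from the configuration above.
RF_PARAMS = {
    'n_estimators': 500, 'criterion': 'gini', 'n_jobs': 8,
    'verbose': 0, 'random_state': 407, 'oob_score': True,
}

m = rf_stage1(name="rf_stage1", params=RF_PARAMS)
clf = m.build_model()  # a new RandomForestClassifier configured from RF_PARAMS
```

Returning a new estimator on every call lets a training loop fit one independent model per cross-validation fold.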
- Train each Stage 1 model for stacking.

      m = Model_XGB(
          name="xgb_stage1",
          flist=FILES_LIST_stage1,
          params=XGB_PARAMS,
      )
      m.run()
      ...
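What a Stage 1 run produces can be sketched as below, under the assumption that each model writes out-of-fold predictions for the training rows and fold-averaged predictions for the test rows. The function `run_stage1`, the fold count, the synthetic data, and the temporary output directory are all hypothetical; only the `_all_fold.csv` and `_test.csv` file-name suffixes come from this README.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold


def run_stage1(name, make_model, X, y, X_test, out_dir, n_splits=5, seed=407):
    """Write out-of-fold and test-set probabilities for one Stage 1 model."""
    oof = np.zeros(len(X))
    test_pred = np.zeros(len(X_test))
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in folds.split(X, y):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        # Each training row is predicted by a model that never saw it.
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        # Test predictions are averaged over the fold models.
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
    pd.DataFrame({name: oof}).to_csv(
        os.path.join(out_dir, name + "_all_fold.csv"), index=False)
    pd.DataFrame({name: test_pred}).to_csv(
        os.path.join(out_dir, name + "_test.csv"), index=False)
    return oof, test_pred


# Synthetic data and a temporary directory stand in for the real inputs.
X_all, y_all = make_classification(n_samples=200, n_features=8, random_state=0)
X, y, X_test = X_all[:150], y_all[:150], X_all[150:]
out_dir = tempfile.mkdtemp()
oof, test_pred = run_stage1(
    "rf_stage1",
    lambda: RandomForestClassifier(n_estimators=30, random_state=0),
    X, y, X_test, out_dir,
)
```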
- Train each Stage 2 model using the predictions of the Stage 1 models.

      FILES_LIST_stage2 = {
          'train': (
              INPUT_PATH + 'train.csv',
              FEATURE_PATH + 'xgb_stage1_all_fold.csv',
              FEATURE_PATH + 'nn_stage1_all_fold.csv',
              FEATURE_PATH + 'rf_stage1_all_fold.csv',
              FEATURE_PATH + 'gbdt_stage1_all_fold.csv',
              FEATURE_PATH + 'svm_stage1_all_fold.csv',
          ),
          'target': (INPUT_PATH + 'target.csv',),
          'test': (
              INPUT_PATH + 'test.csv',
              FEATURE_PATH + 'xgb_stage1_test.csv',
              FEATURE_PATH + 'nn_stage1_test.csv',
              FEATURE_PATH + 'rf_stage1_test.csv',
              FEATURE_PATH + 'gbdt_stage1_test.csv',
              FEATURE_PATH + 'svm_stage1_test.csv',
          ),
      }

      m = XGB_stage2(
          name="xgb_stage2",
          flist=FILES_LIST_stage2,
          params=XGB_PARAMS_stage2,
      )
      m.run()
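One plausible reading of the Stage 2 file lists is that the files under each key are joined column-wise, so the Stage 2 model sees the original features plus one prediction column per Stage 1 model. That reading is an assumption, and the small in-memory frames below are made-up stand-ins for the CSV files.

```python
import pandas as pd

# Tiny in-memory stand-ins for train.csv and two *_all_fold.csv files.
train = pd.DataFrame({"f1": [0.1, 0.4, 0.7], "f2": [1.0, 0.0, 1.0]})
xgb_oof = pd.DataFrame({"xgb_stage1": [0.2, 0.6, 0.9]})
rf_oof = pd.DataFrame({"rf_stage1": [0.3, 0.5, 0.8]})

# Column-wise concatenation: one row per sample, original features first.
stage2_train = pd.concat([train, xgb_oof, rf_oof], axis=1)
print(stage2_train.columns.tolist())
```

The same concatenation would apply to the test files, so that the training and test matrices share the same column layout.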
- The final result is saved as `stacking/output/stacking_features/xgb_stage2_test.csv`.
- `stacking/data/input`: original train and prediction datasets
- `stacking/data/output/stacking_features`: Stage 1 features
- `stacking/stacking/base.py`: stacking module
- `stacking/main.py`: train and predict program
`base.py`:
- Base models for stacking are defined here (using sklearn.base.BaseEstimator).
- Some models are defined here, e.g., XGBoost and Keras.
- These models are wrapped to be scikit-learn-like (using sklearn.base.ClassifierMixin and sklearn.base.RegressorMixin).
- That is, each model class provides the methods fit(), predict_proba(), and predict().
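The scikit-learn-like interface described above can be illustrated with a toy classifier. `CentroidClassifier` is a hypothetical example written for this sketch and is not part of `base.py`; it only shows which methods and attributes such a wrapper typically exposes.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class CentroidClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy model exposing the scikit-learn classifier interface."""

    def __init__(self, temperature=1.0):
        self.temperature = temperature

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid per class.
        self.centroids_ = np.stack(
            [X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        X = np.asarray(X, dtype=float)
        # Squared distance from every sample to every centroid.
        d2 = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        logits = -d2 / self.temperature
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)

    def predict(self, X):
        # Most probable class for each sample.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
clf = CentroidClassifier().fit(X, y)
pred = clf.predict(X)
acc = clf.score(X, y)  # score() is inherited from ClassifierMixin
```

Because every wrapped model exposes the same three methods, a stacking loop can treat XGBoost, Keras, and scikit-learn models interchangeably.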
- The stacking framework of our research
- The training strategy of our research
- We use the AUC score and ROC curves to evaluate the models and to compare the performance of the stacking model with that of a single model.
- According to the model evaluation results, the AUC score of the stacking model is `0.93`. Compared with a single model, it achieves better generalization performance.
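An AUC and ROC evaluation of this kind can be sketched with scikit-learn's metrics. The synthetic data and the single random forest below are assumptions chosen only to make the example runnable; they do not reproduce the project's data or its reported score.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Both metrics take the predicted probability of the positive class.
scores = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, scores)
fpr, tpr, thresholds = roc_curve(y_test, scores)
print(f"AUC = {auc:.3f}")
```

Applying the same two calls to the stacking model's test probabilities gives a like-for-like comparison with each single model.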
GPL3.0