
Different order of features causes different results when using LightGBM? #1294

Closed
w-zm opened this issue Mar 31, 2018 · 4 comments

w-zm commented Mar 31, 2018

When I train a model using LightGBM, as follows:

import lightgbm as lgb

# Build the training and validation Datasets (predictors, target, and
# categorical_features are defined earlier in the script).
xgtrain = lgb.Dataset(dtrain[predictors].values, label=dtrain[target].values,
                      feature_name=predictors,
                      categorical_feature=categorical_features)
xgvalid = lgb.Dataset(dvalid[predictors].values, label=dvalid[target].values,
                      feature_name=predictors,
                      categorical_feature=categorical_features)

evals_results = {}

# Train with early stopping, evaluating on both the train and valid sets.
bst1 = lgb.train(lgb_params,
                 xgtrain,
                 valid_sets=[xgtrain, xgvalid],
                 valid_names=['train', 'valid'],
                 evals_result=evals_results,
                 num_boost_round=num_boost_round,
                 early_stopping_rounds=early_stopping_rounds,
                 verbose_eval=10,
                 feval=feval)

n_estimators = bst1.best_iteration
print("\nModel Report")
print("n_estimators : ", n_estimators)
print(metrics+":", evals_results['valid'][metrics][n_estimators-1])

I ran the code twice; everything is the same except:

(1) the first time,

predictors = ['context_page_id', 'item_city_id', 'item_collected_level', 'item_price_level', 'item_pv_level', 'item_sales_level', 'shop_review_num_level', 'shop_review_positive_rate', 'shop_score_delivery', 'shop_score_description', 'shop_score_service', 'shop_star_level', 'user_age_level', 'user_gender_id', 'user_occupation_id', 'user_star_level', 'category_1', 'category_2', 'min', 'hour', 'day', 'week', 'buy_item', 'buy_shop', 'buy_brand', 'browse_total', 'buy_total', 'browse_buy_rate', 'item_browse', 'item_buy', 'item_browse_buy_rate', 'shop_browse', 'shop_buy', 'shop_browse_buy_rate', 'hour_bin_1', 'hour_bin_2', 'hour_bin_3', 'is_new_user_0', 'is_new_user_1', 'is_new_item_0', 'is_new_item_1', 'is_new_shop_0', 'is_new_shop_1', 'is_new_brand_0', 'is_new_brand_1']

(2) the second time,

predictors = ['browse_buy_rate', 'browse_total', 'buy_brand', 'buy_item', 'buy_shop', 'buy_total', 'category_1', 'category_2', 'context_page_id', 'day', 'hour', 'item_browse', 'item_browse_buy_rate', 'item_buy', 'item_city_id', 'item_collected_level', 'item_price_level', 'item_pv_level', 'item_sales_level', 'min', 'shop_browse', 'shop_browse_buy_rate', 'shop_buy', 'shop_review_num_level', 'shop_review_positive_rate', 'shop_score_delivery', 'shop_score_description', 'shop_score_service', 'shop_star_level', 'user_age_level', 'user_gender_id', 'user_occupation_id', 'user_star_level', 'week', 'hour_bin_1', 'hour_bin_2', 'hour_bin_3', 'is_new_user_0', 'is_new_user_1', 'is_new_item_0', 'is_new_item_1', 'is_new_shop_0', 'is_new_shop_1', 'is_new_brand_0', 'is_new_brand_1']

I only changed the order of the features, but the results differ:

(1) the first time:

...
[1030]  train's binary_logloss: 0.0781902   valid's binary_logloss: 0.0821433
Early stopping, best iteration is:
[837]   train's binary_logloss: 0.0799938   valid's binary_logloss: 0.0820824

Model Report
n_estimators :  837
binary_logloss: 0.08208239967439723

(2) the second time:

...
[930]   train's binary_logloss: 0.0792041   valid's binary_logloss: 0.0821642
Early stopping, best iteration is:
[738]   train's binary_logloss: 0.0810454   valid's binary_logloss: 0.0821186

Model Report
n_estimators :  738
binary_logloss: 0.08211859038553634

Can anyone explain it? Thank you very much.

@guolinke (Collaborator)

This is by design: feature order can affect the result.
The reason is, when choosing a feature to split a tree node, if two features have the same split gain, the feature with the smaller index (id) is chosen.
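
For anyone who wants to see this effect directly, here is a minimal, self-contained sketch (the synthetic data, feature names, and parameters are all hypothetical, chosen only to make equal-gain ties likely; it is not the code from the original post):

import numpy as np
import lightgbm as lgb

# Low-cardinality integer features make exact split-gain ties more likely.
rng = np.random.RandomState(42)
X = rng.randint(0, 3, size=(2000, 5)).astype(float)
y = (X[:, 0] + X[:, 1] > 2).astype(int)
names = ['f0', 'f1', 'f2', 'f3', 'f4']

perm = [4, 2, 0, 3, 1]  # same features, different column order
params = {'objective': 'binary', 'verbose': -1, 'seed': 1}

bst_a = lgb.train(params, lgb.Dataset(X, label=y, feature_name=names),
                  num_boost_round=50)
bst_b = lgb.train(params,
                  lgb.Dataset(X[:, perm], label=y,
                              feature_name=[names[i] for i in perm]),
                  num_boost_round=50)

# When two candidate splits have exactly the same gain, the feature with the
# smaller column index wins, so the trees (and predictions) may differ
# between the two orderings.
diff = np.abs(bst_a.predict(X) - bst_b.predict(X[:, perm])).max()
print('max prediction difference:', diff)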

w-zm (Author) commented Mar 31, 2018

Thank you for your detailed answer.

@bbennett36

@w-zm check out SHAP. I use it exclusively for feature importance. Its author shows in his research paper why gain/split importances are not always consistent.
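
A minimal sketch of what that looks like in practice, assuming the shap package is installed and reusing bst1, dvalid, and predictors from the original post:

import shap

# TreeExplainer computes SHAP values exactly for tree ensembles like LightGBM.
explainer = shap.TreeExplainer(bst1)
shap_values = explainer.shap_values(dvalid[predictors])

# For a binary model, some shap versions return a list of per-class arrays;
# summary_plot accepts either form and gives a consistent view of importance.
shap.summary_plot(shap_values, dvalid[predictors])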

w-zm (Author) commented Apr 6, 2018

@bbennett36 ok, thanks a million.

The lock bot locked this issue as resolved and limited conversation to collaborators on Mar 12, 2020.