
Different order of features causes different results when using LightGBM? #1294

Closed
w-zm opened this issue Mar 31, 2018 · 4 comments

w-zm commented Mar 31, 2018

When I train a model using LightGBM, as follows:

import lightgbm as lgb

# Build the training and validation Datasets (predictors, target, and
# categorical_features are defined earlier in the script).
xgtrain = lgb.Dataset(dtrain[predictors].values, label=dtrain[target].values,
                      feature_name=predictors,
                      categorical_feature=categorical_features)
xgvalid = lgb.Dataset(dvalid[predictors].values, label=dvalid[target].values,
                      feature_name=predictors,
                      categorical_feature=categorical_features)

evals_results = {}

# Train with early stopping, evaluating on both the train and valid sets.
bst1 = lgb.train(lgb_params,
                 xgtrain,
                 valid_sets=[xgtrain, xgvalid],
                 valid_names=['train', 'valid'],
                 evals_result=evals_results,
                 num_boost_round=num_boost_round,
                 early_stopping_rounds=early_stopping_rounds,
                 verbose_eval=10,
                 feval=feval)

n_estimators = bst1.best_iteration
print("\nModel Report")
print("n_estimators : ", n_estimators)
print(metrics+":", evals_results['valid'][metrics][n_estimators-1])

I ran the code twice; everything is the same except:

(1) the first time,

predictors = ['context_page_id', 'item_city_id', 'item_collected_level', 'item_price_level', 'item_pv_level', 'item_sales_level', 'shop_review_num_level', 'shop_review_positive_rate', 'shop_score_delivery', 'shop_score_description', 'shop_score_service', 'shop_star_level', 'user_age_level', 'user_gender_id', 'user_occupation_id', 'user_star_level', 'category_1', 'category_2', 'min', 'hour', 'day', 'week', 'buy_item', 'buy_shop', 'buy_brand', 'browse_total', 'buy_total', 'browse_buy_rate', 'item_browse', 'item_buy', 'item_browse_buy_rate', 'shop_browse', 'shop_buy', 'shop_browse_buy_rate', 'hour_bin_1', 'hour_bin_2', 'hour_bin_3', 'is_new_user_0', 'is_new_user_1', 'is_new_item_0', 'is_new_item_1', 'is_new_shop_0', 'is_new_shop_1', 'is_new_brand_0', 'is_new_brand_1']

(2) the second time,

predictors = ['browse_buy_rate', 'browse_total', 'buy_brand', 'buy_item', 'buy_shop', 'buy_total', 'category_1', 'category_2', 'context_page_id', 'day', 'hour', 'item_browse', 'item_browse_buy_rate', 'item_buy', 'item_city_id', 'item_collected_level', 'item_price_level', 'item_pv_level', 'item_sales_level', 'min', 'shop_browse', 'shop_browse_buy_rate', 'shop_buy', 'shop_review_num_level', 'shop_review_positive_rate', 'shop_score_delivery', 'shop_score_description', 'shop_score_service', 'shop_star_level', 'user_age_level', 'user_gender_id', 'user_occupation_id', 'user_star_level', 'week', 'hour_bin_1', 'hour_bin_2', 'hour_bin_3', 'is_new_user_0', 'is_new_user_1', 'is_new_item_0', 'is_new_item_1', 'is_new_shop_0', 'is_new_shop_1', 'is_new_brand_0', 'is_new_brand_1']

I only changed the order of the features, but the results differ:

(1) the first time:

...
[1030]  train's binary_logloss: 0.0781902   valid's binary_logloss: 0.0821433
Early stopping, best iteration is:
[837]   train's binary_logloss: 0.0799938   valid's binary_logloss: 0.0820824

Model Report
n_estimators :  837
binary_logloss: 0.08208239967439723

(2) the second time:

...
[930]   train's binary_logloss: 0.0792041   valid's binary_logloss: 0.0821642
Early stopping, best iteration is:
[738]   train's binary_logloss: 0.0810454   valid's binary_logloss: 0.0821186

Model Report
n_estimators :  738
binary_logloss: 0.08211859038553634

Can anyone explain it? Thank you very much.

@guolinke (Collaborator)

This is by design: feature order can affect the result.
The reason is, when choosing a feature to split a tree node, if two features have the same split gain, the feature with the smaller index (id) is chosen.
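
For anyone who wants to see this effect directly, here is a minimal, self-contained sketch (the synthetic data, feature names, and parameters are all hypothetical, chosen only to make equal-gain ties likely; it is not the code from the original post):

import numpy as np
import lightgbm as lgb

# Low-cardinality integer features make exact split-gain ties more likely.
rng = np.random.RandomState(42)
X = rng.randint(0, 3, size=(2000, 5)).astype(float)
y = (X[:, 0] + X[:, 1] > 2).astype(int)
names = ['f0', 'f1', 'f2', 'f3', 'f4']

perm = [4, 2, 0, 3, 1]  # same features, different column order
params = {'objective': 'binary', 'verbose': -1, 'seed': 1}

bst_a = lgb.train(params, lgb.Dataset(X, label=y, feature_name=names),
                  num_boost_round=50)
bst_b = lgb.train(params,
                  lgb.Dataset(X[:, perm], label=y,
                              feature_name=[names[i] for i in perm]),
                  num_boost_round=50)

# When two candidate splits have exactly the same gain, the feature with the
# smaller column index wins, so the trees (and predictions) may differ
# between the two orderings.
diff = np.abs(bst_a.predict(X) - bst_b.predict(X[:, perm])).max()
print('max prediction difference:', diff)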

w-zm (Author) commented Mar 31, 2018

Thank you for your detailed answer.

@bbennett36

@w-zm check out SHAP. I use it exclusively for feature importance. Its author shows in his research paper why gain/split importances are not always consistent.
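
A minimal sketch of what that looks like in practice, assuming the shap package is installed and reusing bst1, dvalid, and predictors from the original post:

import shap

# TreeExplainer computes SHAP values exactly for tree ensembles like LightGBM.
explainer = shap.TreeExplainer(bst1)
shap_values = explainer.shap_values(dvalid[predictors])

# For a binary model, some shap versions return a list of per-class arrays;
# summary_plot accepts either form and gives a consistent view of importance.
shap.summary_plot(shap_values, dvalid[predictors])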

w-zm (Author) commented Apr 6, 2018

@bbennett36 ok, thanks a million.

The lock bot locked this issue as resolved and limited conversation to collaborators on Mar 12, 2020.