Categorical features with large int cause segmentation fault #1359

Closed
qmick opened this Issue May 5, 2018 · 2 comments

qmick commented May 5, 2018

Environment info

Operating System: Ubuntu server 16.04 64bit
CPU: Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz * 2
C++/Python/R version: Python 3.5.2

Error Message:

Python output:

/home/zhang/.local/lib/python3.5/site-packages/lightgbm/basic.py:1038: UserWarning: categorical_feature in Dataset is overridden. New categorical_feature is ['item_id', 'user_id']
warnings.warn('categorical_feature in Dataset is overridden. New categorical_feature is {}'.format(sorted(list(categorical_feature))))
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[LightGBM] [Warning] Met negative value in categorical features, will convert it to NaN
[1] 69368 segmentation fault python3 train.py

GDB output:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `python3 train.py'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 LightGBM::BinMapper::FindBin (this=, values=, num_sample_values=, total_sample_cnt=3, max_bin=255, min_data_in_bin=3, min_split_data=20,
bin_type=LightGBM::CategoricalBin, use_missing=true, zero_as_missing=false) at /home/zhang/lightgbm/LightGBM/src/io/bin.cpp:322
322 if (distinct_values_int[0] == 0) {
[Current thread is 1 (Thread 0x7f22a3957700 (LWP 44144))]

Reproducible examples

import lightgbm as lgb
import pandas as pd

data = {'user_id':[4505772604969228686, 2692638157208937547, 5247924392014515924],
       'item_id': [3412720377098676069, 3412720377098676069, 3412720377098676069]}
df = pd.DataFrame(data=data)

lgb_train = lgb.Dataset(df, label=[0, 1, 1])
params = {
    'objective': 'binary',
    'metric': 'binary_logloss'
}

gbm = lgb.train(params, lgb_train, categorical_feature=['user_id', 'item_id'])

Steps to reproduce

  1. Run example above

Possible reason

This seems to be caused by a conversion error from Python int to C++ int: large Python ints become negative on the C++ side. If all values in a DataFrame column are too large, which is common for ID features, they are all treated as missing values. The vector distinct_values_int is then empty, and accessing distinct_values_int[0] causes an access violation.

Using sklearn.preprocessing.LabelEncoder solves this problem. But I think this should be fixed, or at least raise a Python error instead of a segmentation fault, since the crash kills the Python notebook kernel.
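For reference, the LabelEncoder workaround could look roughly like this. It is a minimal sketch applied to the DataFrame from the reproducible example; the `encoded` variable name is just for illustration, and the encoded frame would then be passed to lgb.Dataset in place of the original one.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same data as in the reproducible example above.
df = pd.DataFrame({
    'user_id': [4505772604969228686, 2692638157208937547, 5247924392014515924],
    'item_id': [3412720377098676069, 3412720377098676069, 3412720377098676069],
})

# Replace each large ID with a small code in 0..n_unique-1.
# LabelEncoder assigns codes in sorted order of the unique values.
encoded = df.copy()
for col in ['user_id', 'item_id']:
    encoded[col] = LabelEncoder().fit_transform(df[col])

print(encoded)
```

Each column gets its own encoder, so the codes are only meaningful within that column. To encode unseen data consistently at prediction time, the fitted encoders would have to be kept around and reused.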

Member

guolinke commented May 5, 2018

@StrikerRUS I think we can check this on the Python side.

@qmick For categorical features, using consecutive integers starting from zero is the most efficient way for LightGBM. Also, we only support 32-bit int on the C++ side. When the range exceeds 32 bits, using categorical features is very slow (as are other solutions).
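A Python-side check along these lines might look like the sketch below. This is only an illustration, not LightGBM's actual code: the helper name `check_categorical_range` and its dict-of-lists input are hypothetical. It also rejects negative values, which is an assumption; per the warnings in the log, LightGBM itself converts negative categorical values to NaN, so that part could be relaxed.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit int can hold


def check_categorical_range(columns):
    """Raise ValueError if any categorical value falls outside [0, INT32_MAX].

    `columns` maps a feature name to an iterable of integer category values.
    Hypothetical helper for illustration; not part of LightGBM.
    """
    for name, values in columns.items():
        bad = [v for v in values if v < 0 or v > INT32_MAX]
        if bad:
            raise ValueError(
                "categorical feature {!r} has {} value(s) outside [0, {}], "
                "e.g. {}".format(name, len(bad), INT32_MAX, bad[0])
            )


# Small consecutive codes pass silently.
check_categorical_range({'user_id': [0, 1, 2]})
```

With the IDs from the reproducible example, this would raise a ValueError naming the offending column before any data reaches the C++ side, instead of crashing the interpreter.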

Collaborator

StrikerRUS commented May 6, 2018

@guolinke I'll try, but I can't promise to do it quickly.