
Python-side subsampling #309

Closed
ytsaig opened this issue Feb 21, 2017 · 9 comments



ytsaig commented Feb 21, 2017

I am trying to run the LightGBM training loop via the python API with custom logic for subsampling. In other words, instead of specifying subsample in the parameters to lightgbm, which would result in sampling over a uniform distribution, I would like to compute the sampled indices in each iteration and do an update step with subsampled data (a use case is clustered data, where sampling should account for the structure of the clusters).

Here's what I tried:

    from lightgbm import Dataset
    import lightgbm.basic as lgb

    train_set = Dataset(X, label=y, params=params)
    booster = lgb.Booster(params=params, train_set=train_set)
    for _ in range(1000):
        # custom_subsample() returns a numpy array with a subsample of indices
        subsample = custom_subsample(X)
        ts = train_set.subset(subsample)
        booster.update(ts)

In other words, use the subset() method to generate a subset of the training data in each iteration. However this results in a segmentation fault after one iteration.

I have also tried creating a new Dataset object in each iteration with a subsample of the full training data, but I get the following error message:
lightgbm.basic.LightGBMError: b'cannot reset training data, since new training data has different bin mappers'

Any suggestions on how to implement this?


guolinke commented Feb 21, 2017

@ytsaig what is the version of code you used?


ytsaig commented Feb 21, 2017

@guolinke In Python, lightgbm.__version__ returns 0.1. I am running Python 3.5.2 on Ubuntu.

@guolinke

@ytsaig Can you paste the output of git log in your LightGBM source folder?


ytsaig commented Feb 21, 2017

@guolinke sure, here are the first few lines:

    commit 59c116f
    Author: Guolin Ke
    Date:   Sun Feb 5 14:37:38 2017 +0800

        fix #276

@guolinke

@ytsaig OK, I see.
Can you update to the latest code, rebuild the Python package, and see what happens?
Thanks


ytsaig commented Feb 21, 2017

@guolinke Thanks for the suggestion. I updated, and I'm not getting a segfault anymore. Now I get the following error:

  File "/home/ytsaig/anaconda3/lib/python3.5/site-packages/lightgbm-0.1-py3.5.egg/lightgbm/basic.py", line 1254, in update
    self.train_set.construct().handle))
  File "/home/ytsaig/anaconda3/lib/python3.5/site-packages/lightgbm-0.1-py3.5.egg/lightgbm/basic.py", line 47, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError())
lightgbm.basic.LightGBMError: b'cannot reset training data, since new training data has different bin mappers'

I'm not sure why it would have different bin mappers since Dataset.subset() sets reference = self.

@guolinke

Can you try:

    from lightgbm import Dataset
    import lightgbm.basic as lgb

    train_set = Dataset(X, label=y, params=params)
    booster = lgb.Booster(params=params, train_set=train_set)
    ts = []
    for _ in range(1000):
        # custom_subsample() returns a numpy array with a subsample of indices
        subsample = custom_subsample(X)
        ts.append(train_set.subset(subsample))
        booster.update(ts[-1])
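
[Editor's note: a self-contained, runnable version of this pattern, for readers who want to try it. The synthetic data, the uniform custom_subsample, and the iteration count are illustrative stand-ins, not part of the original report; the list exists only so every subset object stays alive between iterations.]

    import numpy as np
    import lightgbm as lgb

    # Stand-in for the clustered sampler in the original question;
    # here it just draws a uniform 80% subsample of row indices.
    def custom_subsample(n_rows, frac=0.8, rng=np.random.default_rng(42)):
        return np.sort(rng.choice(n_rows, size=int(frac * n_rows), replace=False))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + rng.normal(scale=0.1, size=500) > 0).astype(int)

    params = {"objective": "binary", "verbosity": -1}
    train_set = lgb.Dataset(X, label=y, params=params)
    booster = lgb.Booster(params=params, train_set=train_set)

    subsets = []  # keep every subset alive so LightGBM never reads freed memory
    for _ in range(20):
        idx = custom_subsample(X.shape[0])
        subsets.append(train_set.subset(idx))
        booster.update(subsets[-1])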


ytsaig commented Feb 21, 2017

@guolinke That seems to work, thanks! I'm curious: why does keeping the subsets in a list make a difference here?

For completeness, I created a reproducible example.

Finally, I'm wondering: if I wanted the sampling done on the C++ side by passing a list of indices and bypassing the bagging stage, what would be the recommended way to implement that? It looks like subset() uses the used_indices property; is it possible to use that directly in the update stage rather than artificially "updating" the training data in each iteration? (It's always the same underlying data, just sampled differently.)

@guolinke

@ytsaig
The reason is that you should not free the last used subset.
When resetting the training data, LightGBM compares the new data with the old one (the last used subset).
If the old one has been freed, it will access freed memory and cause an error.

For the sampling, you can use the bagging parameters. I think that is better than subset.
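
[Editor's note: a minimal sketch of the built-in bagging parameters, for reference. The data and parameter values are illustrative; bagging_fraction and bagging_freq are the documented parameter names (with subsample and subsample_freq among the accepted aliases).]

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + rng.normal(scale=0.1, size=500) > 0).astype(int)

    params = {
        "objective": "binary",
        "bagging_fraction": 0.8,  # fraction of rows drawn (uniformly) per bagging round
        "bagging_freq": 1,        # re-sample the rows at every iteration
        "bagging_seed": 3,
        "verbosity": -1,
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=20)

Note that built-in bagging samples uniformly, so it does not cover the cluster-aware sampling in the original question; that case still needs custom indices via subset().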

@lock lock bot locked as resolved and limited conversation to collaborators Mar 12, 2020