
Python-side subsampling #309

Closed
ytsaig opened this issue Feb 21, 2017 · 9 comments



ytsaig commented Feb 21, 2017

I am trying to run the LightGBM training loop via the python API with custom logic for subsampling. In other words, instead of specifying subsample in the parameters to lightgbm, which would result in sampling over a uniform distribution, I would like to compute the sampled indices in each iteration and do an update step with subsampled data (a use case is clustered data, where sampling should account for the structure of the clusters).

Here's what I tried:

    from lightgbm import Dataset
    import lightgbm.basic as lgb

    train_set = Dataset(X, label=y, params=params)
    booster = lgb.Booster(params=params, train_set=train_set)
    for _ in range(1000):
        # custom_subsample() returns a numpy array with a subsample of indices
        subsample = custom_subsample(X)
        ts = train_set.subset(subsample)
        booster.update(ts)

In other words, use the subset() method to generate a subset of the training data in each iteration. However this results in a segmentation fault after one iteration.

I have also tried creating a new Dataset object in each iteration with a subsample of the full training data, but I get the following error message:
lightgbm.basic.LightGBMError: b'cannot reset training data, since new training data has different bin mappers'

Any suggestions on how to implement this?


guolinke commented Feb 21, 2017

@ytsaig what is the version of code you used?


ytsaig commented Feb 21, 2017

@guolinke In Python, lightgbm.__version__ returns 0.1. I am running Python 3.5.2 on Ubuntu.

@guolinke

@ytsaig Can you paste the output of git log in your LightGBM source folder?


ytsaig commented Feb 21, 2017

@guolinke sure, here are the first few lines:

    commit 59c116f
    Author: Guolin Ke
    Date:   Sun Feb 5 14:37:38 2017 +0800

        fix #276

@guolinke

@ytsaig OK, I see.
Can you update to the latest code, rebuild the Python package, and see what happens?
Thanks


ytsaig commented Feb 21, 2017

@guolinke Thanks for the suggestion. I updated, and I'm not getting a segfault anymore. Now I get the following error:

  File "/home/ytsaig/anaconda3/lib/python3.5/site-packages/lightgbm-0.1-py3.5.egg/lightgbm/basic.py", line 1254, in update
    self.train_set.construct().handle))
  File "/home/ytsaig/anaconda3/lib/python3.5/site-packages/lightgbm-0.1-py3.5.egg/lightgbm/basic.py", line 47, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError())
lightgbm.basic.LightGBMError: b'cannot reset training data, since new training data has different bin mappers'

I'm not sure why it would have different bin mappers since Dataset.subset() sets reference = self.

@guolinke

Can you try:

    from lightgbm import Dataset
    import lightgbm.basic as lgb

    train_set = Dataset(X, label=y, params=params)
    booster = lgb.Booster(params=params, train_set=train_set)
    ts = []
    for _ in range(1000):
        # custom_subsample() returns a numpy array with a subsample of indices
        subsample = custom_subsample(X)
        ts.append(train_set.subset(subsample))
        booster.update(ts[-1])
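
[Editor's note: a self-contained, runnable version of this pattern, for readers who want to try it. The synthetic data, the uniform custom_subsample, and the iteration count are illustrative stand-ins, not part of the original report; the list exists only so every subset object stays alive between iterations.]

    import numpy as np
    import lightgbm as lgb

    # Stand-in for the clustered sampler in the original question;
    # here it just draws a uniform 80% subsample of row indices.
    def custom_subsample(n_rows, frac=0.8, rng=np.random.default_rng(42)):
        return np.sort(rng.choice(n_rows, size=int(frac * n_rows), replace=False))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + rng.normal(scale=0.1, size=500) > 0).astype(int)

    params = {"objective": "binary", "verbosity": -1}
    train_set = lgb.Dataset(X, label=y, params=params)
    booster = lgb.Booster(params=params, train_set=train_set)

    subsets = []  # keep every subset alive so LightGBM never reads freed memory
    for _ in range(20):
        idx = custom_subsample(X.shape[0])
        subsets.append(train_set.subset(idx))
        booster.update(subsets[-1])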


ytsaig commented Feb 21, 2017

@guolinke That seems to work, thanks! I'm curious: why does keeping the subsets in a list make a difference here?

For completeness, I created a reproducible example.

Finally, I'm wondering: if I wanted the sampling done on the C++ side by passing a list of indices and bypassing the bagging stage, what would be the recommended way to implement that? It looks like subset() uses the used_indices property; is it possible to use that directly in the update stage rather than artificially "updating" the training data in each iteration? (It's always the same underlying data, just sampled differently.)

@guolinke

@ytsaig
The reason is that you should not free the last used subset.
When resetting the training data, LightGBM compares the new data with the old one (the last used subset).
If the old one has been freed, it will access freed memory and cause an error.

For the sampling, you can use the bagging parameters. I think that is better than subset.
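
[Editor's note: a minimal sketch of the built-in bagging parameters, for reference. The data and parameter values are illustrative; bagging_fraction and bagging_freq are the documented parameter names (with subsample and subsample_freq among the accepted aliases).]

    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 10))
    y = (X[:, 0] + rng.normal(scale=0.1, size=500) > 0).astype(int)

    params = {
        "objective": "binary",
        "bagging_fraction": 0.8,  # fraction of rows drawn (uniformly) per bagging round
        "bagging_freq": 1,        # re-sample the rows at every iteration
        "bagging_seed": 3,
        "verbosity": -1,
    }
    booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=20)

Note that built-in bagging samples uniformly, so it does not cover the cluster-aware sampling in the original question; that case still needs custom indices via subset().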

@lock lock bot locked as resolved and limited conversation to collaborators Mar 12, 2020