Python-side subsampling #309
Comments
@ytsaig what version of the code did you use?
@guolinke in python,
@ytsaig can you paste the output of
@ytsaig OK, I see.
@guolinke thanks for the suggestion. I updated, and I'm not getting a segfault anymore. Now I get the following error:
I'm not sure why it would have different bin mappers, since Dataset.subset() sets reference = self.
can you try:
|
@guolinke That seems to work, thanks! I'm curious: why does keeping the subsets in an array make a difference here? For completeness, I created a reproducible example. Finally, if I wanted the sampling done on the C++ side, by passing a list of indices and bypassing the bagging stage, what would be the recommended way to implement that? It looks like |
@ytsaig For the sampling, you can use the bagging parameters. I think that is better than subset.
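As a rough illustration of the bagging parameters mentioned above, here is a minimal sketch of a parameter dictionary. The objective and the specific values are placeholders chosen for the example, not settings taken from this thread. Note that `subsample` is an alias for `bagging_fraction`, and bagging only takes effect when `bagging_freq` is greater than 0.

```python
# Sketch of a LightGBM parameter dict that turns on built-in bagging.
# The objective and all numeric values are illustrative assumptions.
params = {
    "objective": "regression",  # placeholder objective for the example
    "bagging_fraction": 0.5,    # fraction of rows drawn per bagging round (alias: subsample)
    "bagging_freq": 1,          # re-draw the bag every iteration; 0 disables bagging
    "bagging_seed": 42,         # seed for the row sampler
}
```

This samples rows uniformly at random, so it does not by itself account for cluster structure.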
I am trying to run the LightGBM training loop via the Python API with custom logic for subsampling. In other words, instead of specifying subsample in the parameters to LightGBM, which would result in sampling over a uniform distribution, I would like to compute the sampled indices in each iteration and do an update step with the subsampled data (a use case is clustered data, where sampling should account for the structure of the clusters). Here's what I tried:
In other words, I use the subset() method to generate a subset of the training data in each iteration. However, this results in a segmentation fault after one iteration.
I have also tried creating a new Dataset object in each iteration with a subsample of the full training data, but I get the following error message:
lightgbm.basic.LightGBMError: b'cannot reset training data, since new training data has different bin mappers'
Any suggestions on how to implement this?
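To make the clustered-data use case concrete, here is a minimal sketch of how the sampled indices for one iteration might be computed. It assumes one possible reading of "account for the structure of the clusters": whole clusters are drawn at random and all of their rows are kept. The function name `sample_cluster_indices` and the toy `cluster_ids` are hypothetical and not part of LightGBM or of the code in this issue.

```python
import random
from collections import defaultdict


def sample_cluster_indices(cluster_ids, fraction, rng):
    """Pick whole clusters at random and return the sorted row indices they cover."""
    # Group row positions by their cluster label.
    rows_by_cluster = defaultdict(list)
    for row, cid in enumerate(cluster_ids):
        rows_by_cluster[cid].append(row)

    # Draw a fraction of the clusters (at least one), without replacement.
    clusters = sorted(rows_by_cluster)
    n_pick = max(1, round(fraction * len(clusters)))
    picked = rng.sample(clusters, n_pick)

    # Keep every row of each picked cluster, in ascending row order.
    return sorted(row for cid in picked for row in rows_by_cluster[cid])


# Toy example: 8 rows belonging to 4 clusters (hypothetical data).
cluster_ids = [0, 0, 1, 1, 1, 2, 3, 3]
rng = random.Random(0)
indices = sample_cluster_indices(cluster_ids, 0.5, rng)
```

A sorted index list like this could then be passed to subset() on the full training Dataset, with a fresh draw computed in each iteration.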