Large LightGBM causes javac error "Code too Large" #103

Closed
chris-smith-zocdoc opened this issue Oct 1, 2019 · 5 comments
Labels: help wanted

@chris-smith-zocdoc

When generating code for a large number of trees, the generated code exceeds Java's 64 KB per-method bytecode limit.

From Stack Overflow:

A single method in a Java class may be at most 64KB of bytecode.

One solution is to add subfunctions (https://github.com/BayesWitnesses/m2cgen/blob/master/m2cgen/assemblers/boosting.py#L43-L48) instead of having the body of every tree inside subroutine0. The amount of code that fits inside each function depends on the tree's depth and width, so we might need a heuristic or a tunable parameter. In my case, I ended up with 10 trees per subfunction; a sketch of the grouping idea follows.
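A minimal sketch of that grouping idea (hypothetical names; `assemble_in_batches` and `emit_subroutine` are not m2cgen's actual assembler API):

# Hypothetical sketch: split the trees into fixed-size batches so that no
# single generated method exceeds javac's 64 KB bytecode limit.
TREES_PER_SUBROUTINE = 10  # tunable; how much fits depends on depth/width

def assemble_in_batches(trees, emit_subroutine):
    subroutines = []
    for i in range(0, len(trees), TREES_PER_SUBROUTINE):
        # each batch becomes its own subroutineN instead of every
        # tree landing in subroutine0
        subroutines.append(emit_subroutine(trees[i:i + TREES_PER_SUBROUTINE]))
    return subroutines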

I'm not sure whether other target languages have similar limits.

@izeigerman
Member

Hey @chris-smith-zocdoc!

This is rather weird because we're aware of this Java limitation, and it's why we came up with subroutines in the first place. I remember testing this implementation with as many as 500-1000 estimators with XGBoost, LightGBM and Random Forest without any problems.

Can you please provide some steps to reproduce this issue locally? Ideally with a public dataset or one of the datasets available in the scikit-learn package.

@chris-smith-zocdoc
Author

I think the issue is that the trees are not in separate subroutines.
[screenshot of the generated Java code]

To reproduce:

import lightgbm as lgb
import m2cgen as m2c
import numpy as np

N = 10000
np.random.seed(seed=7)
data = np.random.random(size=(N, 200))
target = np.random.random(size=(N, 1))

# max_depth=64 with num_leaves=100 lets individual trees grow large enough
# that the generated Java code no longer fits in a single method
estimator = lgb.LGBMRegressor(n_estimators=100, random_state=1, max_depth=64, num_leaves=100)
estimator.fit(data, target)

res = m2c.export_to_java(estimator)

with open('Model.java', 'w') as f:
    f.write(res)

Then

javac Model.java
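Compilation then fails with javac's "code too large" error (the exact location it reports inside Model.java will vary).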

@izeigerman
Member

Ah, I see. So the individual tree is pretty large. OK, we may want to consider wrapping individual estimators in their own subroutines based on some threshold values for max_depth and num_leaves.
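A minimal sketch of such a threshold check (the threshold values are made up and the `tree` object with `max_depth`/`num_leaves` attributes is hypothetical):

# Hypothetical heuristic: give a tree its own subroutine once its shape
# suggests the generated method could approach the 64 KB bytecode limit.
MAX_DEPTH_THRESHOLD = 30   # made-up value; would need tuning
NUM_LEAVES_THRESHOLD = 50  # made-up value; would need tuning

def needs_own_subroutine(tree):
    return (tree.max_depth > MAX_DEPTH_THRESHOLD
            or tree.num_leaves > NUM_LEAVES_THRESHOLD)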

@cugurm

cugurm commented Aug 26, 2020

Hi, the same situation happens with large scikit-learn decision trees.

Do you have a solution for that (or is one planned if it doesn't exist)?
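For reference, a minimal sketch of an analogous reproduction with scikit-learn, assuming an unconstrained DecisionTreeRegressor fit on enough random data grows deep enough to hit the same limit:

import m2cgen as m2c
import numpy as np
from sklearn.tree import DecisionTreeRegressor

np.random.seed(7)
data = np.random.random(size=(10000, 200))
target = np.random.random(size=(10000,))

# with no depth limit the tree memorizes the random data, producing a
# very large body of generated Java code
estimator = DecisionTreeRegressor(random_state=1)
estimator.fit(data, target)

with open('Model.java', 'w') as f:
    f.write(m2c.export_to_java(estimator))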

@izeigerman
Member

Hey @MilanCugur! Please take a look at the following discussion: #297. The workaround mentioned there may potentially help with your issue.
