In Java interpreter ignore subroutines and perform code split based on the AST size #158

Merged: 15 commits from iaroslav/issue-152 into master on Feb 6, 2020

Conversation

@izeigerman (Member) commented on Jan 26, 2020

After investigating possible solutions for #152, I came to the conclusion that with the existing design it's extremely hard to come up with an optimal algorithm to split code into subroutines on the interpreter side (as opposed to the assemblers).
The primary reason is that, since we always interpret one expression at a time, it's hard to predict both the depth of the current subtree and the number of expressions left to interpret in other branches.
I made some progress by splitting expressions into separate subroutines based on the size of the code generated so far (i.e. a code size threshold), but more often than not I'd end up with degenerate subroutines like this one:

public static double subroutine2(double[] input) {
    return 22.640634908349323;
}

That's why I took a simpler approach and attempted to optimize the interpreter that caused trouble in the first place: the R one.
I slightly modified its behavior: when the binary expressions count threshold is exceeded, it no longer splits them into separate variable assignments, but moves them into their own subroutines. Although this might not be optimal for simpler models (like linear ones), it helps tremendously with gradient boosting and random forest models.
Since those models are a summation of independent estimators, we end up putting every N (5 by default) estimators into their own subroutine, which improves execution time.
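To illustrate the effect at the AST level, here's a minimal sketch (the actual change lives in the R interpreter; the helper below is made up and assumes m2cgen's AST classes as of this PR, including SubroutineExpr):

# Rough sketch, not the actual m2cgen code: group every N estimators and
# wrap each group's partial sum into its own subroutine.
from m2cgen import ast


def assemble_grouped_sum(estimator_exprs, group_size=5):
    def sum_exprs(exprs):
        total = exprs[0]
        for e in exprs[1:]:
            total = ast.BinNumExpr(total, e, ast.BinNumOpType.ADD)
        return total

    # Put every `group_size` estimators into their own subroutine ...
    groups = [
        ast.SubroutineExpr(sum_exprs(estimator_exprs[i:i + group_size]))
        for i in range(0, len(estimator_exprs), group_size)
    ]
    # ... and sum the subroutine results to obtain the final prediction.
    return sum_exprs(groups)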
@StrikerRUS please let me know what you think.

@coveralls commented on Jan 26, 2020

Coverage Status: coverage decreased (-0.1%) to 95.806% when pulling 70e71f7 on iaroslav/issue-152 into a3082b5 on master.

@StrikerRUS (Member) commented
@izeigerman All your thoughts seem reasonable to me. However, the R tests still have an unacceptable execution time or even hang forever 🙁.

@izeigerman (Member, Author) commented on Jan 26, 2020

Yeah, apparently now it's being killed due to hitting the memory constraint :( I'm now seriously considering limiting the number of models tested for R.

PS: it completes within 30 minutes on my local machine, though that's not very reassuring, since more test models will definitely lead to a violation of the time constraint.

@izeigerman (Member, Author) commented
Ended up excluding the large XGBoost and LightGBM models from the R test suite.

@StrikerRUS (Member) commented
OK, so removing the subroutine wrapper for each tree actually means that XGBoost and LightGBM are not supported in R anymore: the generated code for a tiny model cannot be executed in a feasible time (I guess the R session crashes, but Travis doesn't report anything for some reason and simply hangs forever). I don't think that's a good trade-off for the slowness in Java.
I believe it would be much better to continue supporting all models for both languages by adding a mechanism that allows filtering by language. For instance, something like this: ast.SubroutineExpr(self._assemble_tree(t), exclude_lang={'Java'}). Or any other solution, just not one that drops support for a subset of models.
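Purely as an illustration of what such a filter could look like (the SubroutineExpr class below is a stand-in and exclude_lang is not an actual m2cgen parameter):

# Purely hypothetical; exclude_lang and these helpers are not part of the
# actual m2cgen API.
class SubroutineExpr:
    def __init__(self, expr, exclude_lang=frozenset()):
        self.expr = expr
        self.exclude_lang = set(exclude_lang)


def maybe_unwrap(expr, interpreter_lang):
    # Drop the subroutine wrapper when the current language has opted out,
    # so the wrapped expression gets interpreted inline instead.
    if isinstance(expr, SubroutineExpr) and interpreter_lang in expr.exclude_lang:
        return expr.expr
    return expr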

@izeigerman (Member, Author) commented on Jan 26, 2020

Hm

the generated code for a tiny model cannot be executed in a feasible time (I guess the R session crashes

I couldn't see any evidence of that. Could you please share more context? Which tiny models? Can this be reproduced locally (I couldn't)?

For instance, something like this: ast.SubroutineExpr(self._assemble_tree(t), exclude_lang={'Java'}).

This will introduce a tight coupling between assemblers and interpreters, which is IMHO a poor design decision.

Btw, tests are passing now (with large boosting models excluded).

@izeigerman (Member, Author) commented
Additionally, it doesn't seem like R was designed for interpreting extensive amounts of code. We're already pushing its limits, and I have a feeling it's not hard to come up with a large enough model that will blow up the R interpreter.
So I suggest we admit that this platform is quite limited in its capabilities (whether in memory or in runtime), though I think it should still support smaller ensemble models.
Can you please provide an example of a model that fails?

@StrikerRUS (Member) commented
I couldn't see any evidence of that. Could you please share more context?

I don't believe that 7.5GB is the harsh memory constraint you mentioned. Instead, I think it hangs for some reason similar to a context overflow, caused by a lot of ifs in one function.

Which tiny models?

You excluded LightGBM with the following params: LIGHTGBM_PARAMS_LARGE = dict(n_estimators=100, num_leaves=100, max_depth=64, random_state=RANDOM_SEED). I don't think these can be treated as a serious model in a real-world application.

This will introduce a tight coupling between assemblers and interpreters, which is IMHO a poor design decision.

Agree! I just gave an example I came up with in a minute that would allow us to keep supporting LightGBM and XGBoost for R. Of course, I believe we will be able to develop a smarter and more efficient solution.

@izeigerman (Member, Author) commented
@StrikerRUS, somehow our conversation was very thought-provoking. I've just got another idea and preliminary tests look very promising. Please ignore this PR for now; I'll tag you once it's ready to be revisited. Thanks!

@StrikerRUS (Member) commented
Additionally, it doesn't seem like R was designed for interpreting extensive amounts of code. We're already pushing its limits, and I have a feeling it's not hard to come up with a large enough model that will blow up the R interpreter.

I think that's applicable to practically every programming language, especially an interpreted one. 😃

So I suggest we admit that this platform is quite limited in its capabilities (whether in memory or in runtime), though I think it should still support smaller ensemble models.

Yes, but wrapping each tree into a function allows us to support bigger models. Why should we consciously limit the number of supported models when we know how to overcome the limitation?
In addition, it's very natural to do so. One tree, one function sounds super reasonable. For dealing with extremely deep trees we have a mixin, which is a useful addition.
It seems that Java has its own problems too, as it becomes slower with a larger number of methods. And here we come again to the problem of giving more love to one language...

Can you please provide an example of a model that fails?

Sorry, didn't get it. Which model did you mean?

I strongly believe that we should continue wrapping trees into subroutines for R and somehow not do it for Java. It looks like the best solution we could offer to users.

@StrikerRUS (Member) commented
Ah, I'm replying to an outdated comment again! 😄 Seems that something is wrong with my browser: it doesn't show new comments without pressing F5.

I've just got another idea and preliminary tests look very promising.

Wow, that sounds awesome! I'll be happy to see it. Thank you very much for the constructive conversation!

izeigerman changed the title from "In R interpreter split the binary expressions into subroutine instead of just variables" to "In Java interpreter ignore subroutines and perform code split based on the AST size" on Jan 27, 2020
@izeigerman (Member, Author) commented
Hey @StrikerRUS, this PR is ready to be revisited.
It appears I've finally managed to come up with a good enough heuristic that can successfully compete with the SubroutineExpr approach. I've compared models from the E2E tests as well as some random configurations of boosting models, and the number of generated Java methods is more or less the same.
Of course, the new approach is not as stable and predictable as the old one, so outliers are to be expected.
Another downside of the new algorithm is that it's more computationally demanding, so it takes more time to interpret the AST. At this point I consider this downside negligible.
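To make the idea concrete, here is a minimal sketch of the size-based split, assuming a recursive node counter over the AST (the dispatch table, names, and threshold below are illustrative, not the actual implementation):

# Illustrative sketch only: the dispatch table is partial, and the names
# and threshold are assumptions rather than the code added in this PR.
from m2cgen import ast

CHILD_EXPRS = {
    ast.BinNumExpr: lambda e: [e.left, e.right],
    ast.PowExpr: lambda e: [e.base_expr, e.exp_expr],
    ast.VectorVal: lambda e: e.exprs,
    ast.IfExpr: lambda e: [e.test, e.body, e.orelse],
}


def count_exprs(expr):
    # Size of the subtree rooted at `expr`; expression types missing from
    # the table are treated as leaves in this sketch.
    for expr_type, get_children in CHILD_EXPRS.items():
        if isinstance(expr, expr_type):
            return 1 + sum(count_exprs(c) for c in get_children(expr))
    return 1


def needs_own_method(expr, size_threshold=1000):
    # The Java interpreter would move `expr` into a separate method once
    # its subtree grows beyond the threshold.
    return count_exprs(expr) > size_threshold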
In a follow-up PR I'm planning to do the following:

  1. Generalize the solution I came up with for Java and reuse it in R.
  2. Drop the SubroutineExpr altogether.

Thanks!

@StrikerRUS (Member) left a comment
@izeigerman Thank you very much for rethinking the approach! I like it much more than the previous one. Here are some of my initial comments before a detailed review.

.travis.yml (outdated; resolved)
.travis.yml (outdated; resolved)
m2cgen/assemblers/ensemble.py (outdated; resolved)
m2cgen/ast.py (resolved)
m2cgen/interpreters/c/interpreter.py (outdated; resolved)
@StrikerRUS (Member) left a comment
Great stuff! Here are my comments. Also, I believe the code examples should be regenerated due to the changes in the Java interpreter and ensemble code.

.travis.yml (resolved)
m2cgen/ast.py (resolved)
m2cgen/ast.py (outdated; resolved)
m2cgen/ast.py (outdated; resolved)
tests/interpreters/test_java.py (outdated; resolved)
tests/test_ast.py (resolved)
@izeigerman (Member, Author) commented
@StrikerRUS Sorry about the delay, this should be good to go. Thanks!

@StrikerRUS (Member) left a comment
LGTM! Thanks a lot! Left two minor comments up to you.

Comment on lines +237 to +239
(PowExpr, lambda e: [e.base_expr, e.exp_expr]),
(VectorVal, lambda e: e.exprs),
(IfExpr, lambda e: [e.test, e.body, e.orelse]),
@StrikerRUS (Member) commented
I think these exprs can be wrapped into tuples too for a consistent interface. Like, ((PowExpr,), lambda e: [e.base_expr, e.exp_expr]),

@izeigerman (Member, Author) commented
Sorry, I'm not 100% sold on this :D



def test_count_all_exprs_types():
    expr = ast.BinVectorNumExpr(
@StrikerRUS (Member) commented
Maybe an even more complicated expr with deeper nesting? 🙂

@izeigerman (Member, Author) commented

TBH, I don't see the value in having even deeper nesting, since it won't really provide additional coverage. It would also make the test more complex and somewhat obscure its purpose. Thanks for the feedback though.

@izeigerman (Member, Author) commented
Thanks for your review 👍

izeigerman merged commit 81c3e6a into master on Feb 6, 2020
izeigerman deleted the iaroslav/issue-152 branch on February 6, 2020
StrikerRUS mentioned this pull request on Mar 7, 2020