Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use string buffer to store generated code #211

Merged
merged 7 commits into from May 19, 2020
Merged

use string buffer to store generated code #211

merged 7 commits into from May 19, 2020

Conversation

StrikerRUS
Copy link
Member

While changes in #207 are arguably giving significant improvement, the speedup achieved by these updates can be seen on transpiling medium-size models without any profiling tools.

During skimming the report from #207 (comment), I noticed that add_code_line() takes a huge amount of time (also refer to #209 for another strings-related improvement).

Here is a simple bench.

import sys

from sklearn.datasets import load_boston

import lightgbm as lgb
import m2cgen as m2c

X, y = load_boston(True)
est = lgb.LGBMRegressor(n_estimators=1000, random_state=42).fit(X, y)

sys.setrecursionlimit(int(1e5))

%%timeit -n7 -r5
_ = m2c.export_to_python(est)

And the results are the following:

30.6 s ± 16.9 ms per loop (mean ± std. dev. of 5 runs, 7 loops each)
2.41 s ± 4.84 ms per loop (mean ± std. dev. of 5 runs, 7 loops each)

from string import Template
from weakref import finalize
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._finalize_buffer()
self._code_buf = StringIO()
self._code = None
self._finalizer = finalize(self, self._finalize_buffer)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, yet another Python TIL for me!

@coveralls
Copy link

coveralls commented May 9, 2020

Coverage Status

Coverage decreased (-0.05%) to 95.479% when pulling fe069e7 on stringio into 3db1cc2 on master.

@izeigerman
Copy link
Member

Oh, wow. This is a dramatic improvement indeed! Fantastic job 👍

Copy link
Member

@izeigerman izeigerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few points. Great improvement 🚀

if not self._code_buf.closed:
self._code = self._code_buf.getvalue()
self._finalize_buffer()
return self._code if self._code is not None else ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor] return self._code if self._code else ""

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always prefer to check for None explicitly according to the recommendation from PEP.

Also, beware of writing if x when you really mean if x is not None -- e.g. when testing whether a variable or argument that defaults to None was set to some other value. The other value might have a type (such as a container) that could be false in a boolean context!
https://www.python.org/dev/peps/pep-0008/#programming-recommendations

If you only used if key: here, then an argument which evaluated to false would not be considered. Explicitly comparing with is None is the correct idiom to make this check. See Truth Value Testing.
https://stackoverflow.com/a/17117361

https://stackoverflow.com/a/7816439

self._finalizer = finalize(self, self._finalize_buffer)

def _finalize_buffer(self):
if self._code_buf is not None and not self._code_buf.closed:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor] if self._code_buf and not self._code_buf.closed:

"Call reset_state() to allocate new buffer.")

def get_generated_code(self):
if not self._code_buf.closed:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we do the if self._code here as well (similarly to _finalize_buffer check)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. We call reset_state() in __init__() where _code_buf is set to StringIO(). So we have a guarantee here that _code_buf has been already created (not None from __init__()).

@@ -46,24 +69,34 @@ def decrease_indent(self):
# All code modifications should be implemented via following methods.

def add_code_line(self, line):
self._check_buf_closed()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The invocation of this method all over the place looks like boilerplate. Shall we create a method called _append_to_code_buf and do both checking and writing there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I'll do it right now.


def prepend_code_line(self, line):
self.code = line + "\n" + self.code
self._check_buf_closed()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content of this method now looks identical to the prepend_code_lines. Can this method be expressed in terms of the other one? Eg. self.prepend_code_lines([line])?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

"closing the underlying buffer!\n"
"Call reset_state() to allocate new buffer.")

def get_generated_code(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about the method name which claims that it queries the state but which in fact modifies it (irreversibly). I suggest we be more explicit and call it something like finalize_generator or get_code_and_close or finalize_and_return_code, whichever you like.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said we don't necessarily have to support repeated invocations for this method and throw an exception on subsequent method calls. Please correct me if I'm wrong here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely agree for changing the name!

That being said we don't necessarily have to support repeated invocations for this method and throw an exception on subsequent method calls. Please correct me if I'm wrong here.

That's true but only for now. Maybe in the future we will need something like

if cg.get_generated_code() != "":
    return cg.get_generated_code()
else:
    ...

Of course, we can save the result from the first call into a variable, but I don't see any reasons to eliminate the opportunity to return the same value from this method in its subsequent calls. I believe it is in the consistency of functional programming paradigm 🙂 .

self._finalize_buffer()
self._code_buf = StringIO()
self._code = None
self._finalizer = finalize(self, self._finalize_buffer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, yet another Python TIL for me!

@StrikerRUS
Copy link
Member Author

Thanks a lot for the thoughtful review! I addressed review comments in the two latest commits and left some replies. Please give it another round of review.

Copy link
Member

@izeigerman izeigerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks for addressing comments 👍

@izeigerman
Copy link
Member

@StrikerRUS does this PR require any additional updates considering that the Ruby PR has been merged?

@StrikerRUS
Copy link
Member Author

does this PR require any additional updates considering that the Ruby PR has been merged?

Yeah, definitely! Just pushed the needed changes.

@izeigerman izeigerman merged commit 182a53e into master May 19, 2020
@izeigerman izeigerman deleted the stringio branch May 19, 2020 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants