-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
drop numpy dependency from runtime where possible #111
drop numpy dependency from runtime where possible #111
Conversation
tanh_function_name = "np.tanh" | ||
exponent_function_name = "math.exp" | ||
power_function_name = "math.pow" | ||
tanh_function_name = "math.tanh" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand. It seems like now we make an assumption that expr
here will always be scalar. And similar assumptions will be made for ExpExpr
and PowExpr
.
Maybe I'm missing something though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh.. I see, those are instances of NumExpr
which means that they are scalars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably think about adding checks similar to what we have here, but for subclasses of NumExpr
to verify that they only accept/produce scalars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably think about adding checks similar to what we have here, but for subclasses of NumExpr to verify that they only accept/produce scalars.
Yeah, seems to be so!
I guess all other languages rely on this implicit assumption now as well. For instance, these C functions work only with scalars:
m2cgen/m2cgen/interpreters/c/interpreter.py
Lines 20 to 22 in 30abeef
exponent_function_name = "exp" | |
power_function_name = "pow" | |
tanh_function_name = "tanh" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add asserts for scalars in this PR? Or it's better to have a separate one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say it's up to you as it's not really relevant to or required by this PR.
On the other hand I feel like it's just adding 4 asserts: 1 for ExpExpr
, 1 for TanhExpr
and 2 for PowExpr
.
So whatever works for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@krinart I added asserts in this PR, if you do not mind 🙂 .
Thanks a lot for the guide where to add them!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your effort! 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your contribution 👍
I'm a bit confused that there are now to ways to define a vector in Python. One through vector_init
and the other one - through array_convert_to_numpy
. Furthermore I'm not sure I fully understand the motivation behind this approach.
Additionally can we please revert all generated code samples that are not related to Python?
power_function_name = "math.pow" | ||
tanh_function_name = "math.tanh" | ||
|
||
with_vector_operations = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why can't we keep using with_vectors
flag here that comes from ToCodeInterpreter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with_vectors
means that some vectors are used. Refer to https://github.com/BayesWitnesses/m2cgen/blob/master/generated_code_examples/python/classification/linear.py as an example.
with_vector_operations
means that some vector math happens. Refer to
return ((var0) * (0.5)) + ((var1) * (0.5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@izeigerman Oh, just realized that with_vector_operations
is the same as with_linear_algebra
.
Updated in the latest commit.
return self._cg.infix_expression( | ||
left=self._do_interpret(expr.left, **kwargs), | ||
left=self._cg.array_convert_to_numpy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this change. Why can't we keep using vector_init
for this purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because [1, 2, 3] + [4, 5, 6] != [5, 7, 9]
in Python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My question is rather why do we need raw Python lists in a generated code in the first place. Why can't it always be a numpy array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing that this is the part where we may fallback to Python lists to eliminate the numpy
dependency but I'd rather always have a consistent representation of the vector (especially considering that we may return it from the score
function). So I'd argue that every time when we need vectors, it should be numpy
arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this is about eliminating heavy redundant dependency from runtime (as well as reducing time and memory costs). I think that this is quite more important rather than have consistent return value. There are many frameworks which return either type1
, or type2
, depending on some conditions. BTW, score
at present may return double
and double[]
, so this can be treated as inconsistency as well.
I think that from the user side it's much easier to wrap score
function call into np.array()
only once, if they need consistent output, rather than manually edit generated function removing dozens of np.assarray()
entries, if they do not have/want/able to install numpy
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Joke 😃 : It's time to change project's description:
Transform ML models into a native code (Java, C, Python, Go, JavaScript) with zero dependencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, score at present may return double and double[], so this can be treated as inconsistency as well.
Well there is a distinction between scalar and vector, but here we're introducing different flavors of vectors.
Additionally we break backward compatibility, which is not great.
I'm positive about the benefit of using math
vs numpy
in certain scenarios, but I'm not fully convinced about the whole vector thing.
@krinart you're a better Python expert than I am. What's your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well there is a distinction between scalar and vector, but here we're introducing different flavors of vectors.
Additionally we break backward compatibility, which is not great.
Agree! I only want to say that the project has version 0.4.0
, which means that, according to the semantic versioning, its API is subject to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussion with @krinart I think it may not be that bad in Python world. We also discussed the possibility of dropping numpy dependency altogether in favor of manually written vector multiplication like we do in other languages. This PR can be the first step in that direction. Let's merge it. @StrikerRUS thanks a lot for your effort 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@izeigerman Thanks a lot for considering this change!
We also discussed the possibility of dropping numpy dependency altogether in favor of manually written vector multiplication like we do in other languages.
It's a great step towards isolated environments where it's not so straightforward to install any dependency! I guess they are the majority of the target audience of the package. As of now only vector addition and multiplication by number should be supported, it will not hurt the performance too much.
However, with the growth of supported algorithms number there can be needed more non-trivial vector operations where numpy
speed will be crucial. So I think there can be an option which will provide users the possibility to choose either they need a runtime with zero dependencies but slow, or one with a few number of dependencies but significantly faster.
cc @krinart
return self._cg.infix_expression( | ||
left=self._do_interpret(expr.left, **kwargs), | ||
left=self._cg.array_convert_to_numpy( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
I reverted changes from generated examples that are not related to this PR. Also, I realized that |
Closed #108.
See #108 for the rationale. With this PR
numpy
is used only for runtime code where some vector math is performed (that is fornumpy
was created).I'm so sorry about the excess diff in
linear
andxgboost
generated code examples for all languages. I guess they were caused by different package versions. Should I remove them from this PR or it's OK?As a side note, I think there can be a CI job which will generate code examples with the latest packages and push them back into the repo after the each commit in
master
. So that examples will be always fresh.