Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drop numpy dependency from runtime where possible #111

Merged
merged 5 commits into from
Oct 25, 2019
Merged

drop numpy dependency from runtime where possible #111

merged 5 commits into from
Oct 25, 2019

Conversation

StrikerRUS
Copy link
Member

@StrikerRUS StrikerRUS commented Oct 19, 2019

Closed #108.

See #108 for the rationale. With this PR numpy is used only for runtime code where some vector math is performed (that is for numpy was created).

I'm so sorry about the excess diff in linear and xgboost generated code examples for all languages. I guess they were caused by different package versions. Should I remove them from this PR or it's OK?

As a side note, I think there can be a CI job which will generate code examples with the latest packages and push them back into the repo after the each commit in master. So that examples will be always fresh.

@coveralls
Copy link

coveralls commented Oct 19, 2019

Coverage Status

Coverage increased (+0.05%) to 95.238% when pulling 9b5fe1e on StrikerRUS:drop_numpy_runtime into 30abeef on BayesWitnesses:master.

tanh_function_name = "np.tanh"
exponent_function_name = "math.exp"
power_function_name = "math.pow"
tanh_function_name = "math.tanh"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand. It seems like now we make an assumption that expr here will always be scalar. And similar assumptions will be made for ExpExpr and PowExpr.

Maybe I'm missing something though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh.. I see, those are instances of NumExpr which means that they are scalars.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably think about adding checks similar to what we have here, but for subclasses of NumExpr to verify that they only accept/produce scalars.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably think about adding checks similar to what we have here, but for subclasses of NumExpr to verify that they only accept/produce scalars.

Yeah, seems to be so!
I guess all other languages rely on this implicit assumption now as well. For instance, these C functions work only with scalars:

exponent_function_name = "exp"
power_function_name = "pow"
tanh_function_name = "tanh"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I add asserts for scalars in this PR? Or it's better to have a separate one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say it's up to you as it's not really relevant to or required by this PR.

On the other hand I feel like it's just adding 4 asserts: 1 for ExpExpr, 1 for TanhExpr and 2 for PowExpr.

So whatever works for you.

Copy link
Member Author

@StrikerRUS StrikerRUS Oct 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krinart I added asserts in this PR, if you do not mind 🙂 .

Thanks a lot for the guide where to add them!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your effort! 👍

Copy link
Member

@izeigerman izeigerman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution 👍
I'm a bit confused that there are now to ways to define a vector in Python. One through vector_init and the other one - through array_convert_to_numpy. Furthermore I'm not sure I fully understand the motivation behind this approach.
Additionally can we please revert all generated code samples that are not related to Python?

power_function_name = "math.pow"
tanh_function_name = "math.tanh"

with_vector_operations = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why can't we keep using with_vectors flag here that comes from ToCodeInterpreter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with_vectors means that some vectors are used. Refer to https://github.com/BayesWitnesses/m2cgen/blob/master/generated_code_examples/python/classification/linear.py as an example.
with_vector_operations means that some vector math happens. Refer to

return ((var0) * (0.5)) + ((var1) * (0.5))
as an example.

Copy link
Member Author

@StrikerRUS StrikerRUS Oct 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@izeigerman Oh, just realized that with_vector_operations is the same as with_linear_algebra.
Updated in the latest commit.

return self._cg.infix_expression(
left=self._do_interpret(expr.left, **kwargs),
left=self._cg.array_convert_to_numpy(
Copy link
Member

@izeigerman izeigerman Oct 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this change. Why can't we keep using vector_init for this purpose?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because [1, 2, 3] + [4, 5, 6] != [5, 7, 9] in Python.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My question is rather why do we need raw Python lists in a generated code in the first place. Why can't it always be a numpy array?

Copy link
Member

@izeigerman izeigerman Oct 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing that this is the part where we may fallback to Python lists to eliminate the numpy dependency but I'd rather always have a consistent representation of the vector (especially considering that we may return it from the score function). So I'd argue that every time when we need vectors, it should be numpy arrays.

Copy link
Member Author

@StrikerRUS StrikerRUS Oct 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is about eliminating heavy redundant dependency from runtime (as well as reducing time and memory costs). I think that this is quite more important rather than have consistent return value. There are many frameworks which return either type1, or type2, depending on some conditions. BTW, score at present may return double and double[], so this can be treated as inconsistency as well.

I think that from the user side it's much easier to wrap score function call into np.array() only once, if they need consistent output, rather than manually edit generated function removing dozens of np.assarray() entries, if they do not have/want/able to install numpy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Joke 😃 : It's time to change project's description:

Transform ML models into a native code (Java, C, Python, Go, JavaScript) with zero dependencies

Copy link
Member

@izeigerman izeigerman Oct 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, score at present may return double and double[], so this can be treated as inconsistency as well.

Well there is a distinction between scalar and vector, but here we're introducing different flavors of vectors.

Additionally we break backward compatibility, which is not great.

I'm positive about the benefit of using math vs numpy in certain scenarios, but I'm not fully convinced about the whole vector thing.

@krinart you're a better Python expert than I am. What's your opinion?

Copy link
Member Author

@StrikerRUS StrikerRUS Oct 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well there is a distinction between scalar and vector, but here we're introducing different flavors of vectors.
Additionally we break backward compatibility, which is not great.

Agree! I only want to say that the project has version 0.4.0, which means that, according to the semantic versioning, its API is subject to change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussion with @krinart I think it may not be that bad in Python world. We also discussed the possibility of dropping numpy dependency altogether in favor of manually written vector multiplication like we do in other languages. This PR can be the first step in that direction. Let's merge it. @StrikerRUS thanks a lot for your effort 👍

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@izeigerman Thanks a lot for considering this change!

We also discussed the possibility of dropping numpy dependency altogether in favor of manually written vector multiplication like we do in other languages.

It's a great step towards isolated environments where it's not so straightforward to install any dependency! I guess they are the majority of the target audience of the package. As of now only vector addition and multiplication by number should be supported, it will not hurt the performance too much.

However, with the growth of supported algorithms number there can be needed more non-trivial vector operations where numpy speed will be crucial. So I think there can be an option which will provide users the possibility to choose either they need a runtime with zero dependencies but slow, or one with a few number of dependencies but significantly faster.

cc @krinart

return self._cg.infix_expression(
left=self._do_interpret(expr.left, **kwargs),
left=self._cg.array_convert_to_numpy(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@StrikerRUS
Copy link
Member Author

I reverted changes from generated examples that are not related to this PR. Also, I realized that with_vector_operations I introduced here is the same as with_linear_algebra instead, so I used it to determine when numpy is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

drop numpy dependency from Python code for cases without vectors
4 participants