Python Bindings for HLL Sketch #161
Comments
If it would add a package dependency to the pom, the Python bindings will need to live in a different repository. |
I wouldn't call this "traction". No third party has expressed any interest yet. |
Having the library adapted for different languages makes a lot of sense and Python is a good place to start. As Jon points out we may need to set up a separate repo for language adaptors or merge it into one of the other repos. Sketches-misc might be a candidate. |
Python does not use Maven for build and dependency management. It has a very different tool called distutils (http://docs.activestate.com/activepython/3.2/diveintopython3/html/packaging.html). I also still have to explore the performance overhead that a Python wrapper might add on top of the existing sketches-core library. Rather than an adaptor, it might make sense to write a Python library from scratch, rather than a wrapper for an existing Java library. I see that Edo Liberty (https://edoliberty.github.io//), who was until recently at Yahoo, already has Python repos for two of the sketching algorithms here: Frequent Directions (https://github.com/edoliberty/frequent-directions) and Streaming Quantiles (https://github.com/edoliberty/streaming-quantiles). Also, there is an existing Python library published for the CountMin sketch. But I don't think there are any implementations of HLL, theta sketch, tuple sketch, or sampling sketches in Python. |
My Python repos are meant for research purposes. They are unrelated to the
sketches-core library.
As a whole, I’m a big Python fan. I also see that Python bindings were, for
many open source projects, the inflection point that started wide user base
adoption.
I’m all for creating Python bindings if we could do it in a thoughtful way
that doesn’t compromise performance.
…On Sun, Oct 8, 2017 at 14:58 Anirudh ***@***.***> wrote:
Python does not use Maven for build and dependency management. It has a
very different tool called distutils
(http://docs.activestate.com/activepython/3.2/diveintopython3/html/packaging.html).
Also I have to still explore the performance overhead that a Python
wrapper might induce on the existing sketches-core library. Rather than an
adaptor, it might make sense to write a Python library from scratch,
rather than a wrapper for an existing Java library.
I see that Edo Liberty (https://edoliberty.github.io//), who
was until recently at Yahoo, already has Python repos for two of the
sketching algorithms here:
https://github.com/edoliberty/frequent-directions and
https://github.com/edoliberty/streaming-quantiles
|
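I would strongly advise against attempting to write a Python DataSketches library from scratch:
- The implementations of the various sketches in the core library have been highly optimized for performance, and as a result, the implementations are quite complex and leverage a lot of subtle techniques. Any single sketch, such as the HllSketch, is not one algorithm, but a collection of algorithms and techniques. Redesigning even a single sketch from scratch in Python is a huge task and without the knowledge of the design internals you would be at a big disadvantage.
- A parallel implementation would not benefit from continuous updates and performance improvements of the core Java library. This would double the effort in maintaining and supporting the Python code base. Our small development team could not possibly take this on or support it.
The public APIs of the core library, on the other hand, are quite stable, and "relatively" consistent across the different sketch families. There are Python/Java tools out there, e.g., py4j (https://www.py4j.org/index.html), and several others. I would suggest that if you are passionate about Python, a great place to start would be to investigate various Python/Java interfaces and evaluate them for performance, stability, and ease of use. This, by itself, would be a major contribution that other Python developers could then leverage. |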
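A first pass at the evaluation suggested above could be a small timing harness that pushes the same stream of items through each candidate bridge and reports updates per second. The sketch below is only an illustration: `time_updates` and `compare_bridges` are hypothetical helper names, and a plain Python `set` stands in for a bridged sketch so that the code runs without a JVM. In a real comparison, each entry would be the `update` method of a sketch object obtained through py4j or another bridge.

```python
import time
from typing import Callable, Dict


def time_updates(update: Callable[[int], None], n_items: int = 100_000) -> float:
    """Call update(i) for i in range(n_items) and return updates per second."""
    start = time.perf_counter()
    for i in range(n_items):
        update(i)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return n_items / elapsed if elapsed > 0 else float("inf")


def compare_bridges(
    candidates: Dict[str, Callable[[int], None]], n_items: int = 100_000
) -> Dict[str, float]:
    """Time every candidate update callable on the same workload."""
    return {name: time_updates(fn, n_items) for name, fn in candidates.items()}


# Stand-in candidate (assumption): an exact Python set instead of a bridged sketch.
seen = set()
results = compare_bridges({"python-set-baseline": seen.add}, n_items=10_000)
for name, rate in results.items():
    print(f"{name}: {rate:,.0f} updates/sec")
```

Throughput alone does not cover the stability and ease-of-use criteria mentioned above, but it gives a common number for comparing per-call overhead across bridges.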
Lee, we all agree with you. The suggestion was to find a way to use the
Java library from within Python without compromising performance.
…On Wed, Oct 11, 2017 at 04:14 Lee Rhodes ***@***.***> wrote:
I would strongly advise against attempting to write a python DataSketches
library from scratch:
- The implementations of the various sketches in the core library have
been highly optimized for performance, and as a result, the implementations
are quite complex and leverage a lot of subtle techniques. Any single
sketch, such as the HllSketch, is not one algorithm, but a collection of
algorithms and techniques. Redesigning even a single sketch from scratch in
Python is a huge task and without the knowledge of the design internals you
would be at a big disadvantage.
- A parallel implementation would not benefit from continuous updates
and performance improvements of the core java library. This would double
the effort in maintaining and supporting the python code base. Our small
development team could not possibly take this on or support it.
The public APIs of the core library, on the other hand, are quite stable,
and "relatively" consistent across the different sketch families.
There are Python/Java tools out there, e.g., py4j
<https://www.py4j.org/index.html>, and several others. I would suggest
that if you are passionate about Python, a great place to start would be to
investigate various Python/Java interfaces and evaluate them for
performance, stability, and ease of use. This, by itself, would be a major
contribution that other Python developers could then leverage.
|
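My comment was in response to the suggestion:
Rather than an adaptor, it might make sense to write a python library from scratch, rather than a wrapper for an existing java library.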
|
Ahh, I missed that.
…On Wed, Oct 11, 2017 at 07:59 Lee Rhodes ***@***.***> wrote:
My comment was in response to the suggestion:
Rather than an adaptor, it might make sense to write a python
library from scratch, rather than a wrapper for an existing java library.
|
I can work on this on weekends. I will invest time in this direction.
|
That's great, Anirudh.
I think you are doing the right thing. If one could pip install
datasketches, it would be a major driver of adoption.
A note about pure Python: I (and others) have tried several times to get it
to perform. It just doesn't. The Java library is way (way!) faster.
Edo
…On Wed, Oct 11, 2017 at 11:19 PM, Anirudh ***@***.***> wrote:
I can work on this on weekends. I will invest time in this direction.
There are Python/Java tools out there, e.g., py4j, and several others. I
would suggest that if you are passionate about Python, a great place to
start would be to investigate various Python/Java interfaces and evaluate
them for performance, stability, and ease of use. This, by itself, would be
a major contribution that other Python developers could then leverage.
|
There has been no activity on this issue for several weeks, so I am closing this issue for now. We can always reopen this issue in the future. |
As per the discussion in this thread in the Google group - https://groups.google.com/d/msg/sketches-user/8TaAXaT_6qo/A2JJkIuZBQAJ
there is traction for having Python bindings for the different sketch families in sketches-core, similar to what the library has for Pig and Hive. I was thinking we could get started on the Python adaptors by writing a wrapper library for the HyperLogLog sketches. Would that be a good place to start?
For Pig and Hive, the bindings were defined as UDFs that Pig and Hive scripts can use. How will we define the wrapper classes in Python? Will it be something along the lines of Jython - http://www.jython.org/jythonbook/en/1.0/JythonAndJavaIntegration.html
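One possible shape for such a wrapper class, offered only as a sketch: a thin Python facade that delegates to whatever object the chosen bridge returns. Everything here is an assumption for illustration. `HllSketchWrapper`, `ExactSetBackend`, and `py4j_backend` are hypothetical names; the facade assumes the backend exposes `update(item)` and `getEstimate()`, and `py4j_backend` assumes a py4j gateway server is already running in a JVM with sketches-core on its classpath and that the Java class is `com.yahoo.sketches.hll.HllSketch` with an `int` constructor argument. The stand-in backend counts exactly with a set so the example runs without a JVM.

```python
class HllSketchWrapper:
    """Hypothetical Python facade over a Java-side HLL sketch object."""

    def __init__(self, backend):
        # backend: any object with update(item) and getEstimate() (assumed interface).
        self._backend = backend

    def update(self, item):
        """Present a single item to the underlying sketch."""
        self._backend.update(item)

    def update_all(self, items):
        """Present every item of an iterable to the underlying sketch."""
        for item in items:
            self._backend.update(item)

    def estimate(self):
        """Return the backend's cardinality estimate as a Python float."""
        return float(self._backend.getEstimate())


class ExactSetBackend:
    """Stand-in backend for local testing; exact counting, not an HLL sketch."""

    def __init__(self):
        self._items = set()

    def update(self, item):
        self._items.add(item)

    def getEstimate(self):
        return float(len(self._items))


def py4j_backend(lg_k=12):
    """Assumption: a py4j gateway server is running with sketches-core loaded."""
    from py4j.java_gateway import JavaGateway  # lazy import; needs py4j installed

    gateway = JavaGateway()
    # Assumed Java class path and constructor; verify against the library version in use.
    return gateway.jvm.com.yahoo.sketches.hll.HllSketch(lg_k)


sketch = HllSketchWrapper(ExactSetBackend())
sketch.update_all(["a", "b", "a"])
print(sketch.estimate())  # 2.0 with the exact stand-in backend
```

Injecting the backend keeps the Python-facing API independent of the bridge, so py4j, Jython, or another tool could be swapped in without changing user code.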