Python Bindings for HLL Sketch #161

Closed
anirudhacharya opened this issue Oct 2, 2017 · 12 comments

@anirudhacharya
Member

As per the discussion in this thread in the google group - https://groups.google.com/d/msg/sketches-user/8TaAXaT_6qo/A2JJkIuZBQAJ

there is traction for having Python bindings for the different sketch families in sketches-core, similar to the adaptors the library already has for Pig and Hive. I was thinking we could get started on the Python adaptors with a wrapper library for the HyperLogLog sketches. Would that be a good place to start?

For Pig and Hive, the bindings were defined as UDFs that Pig and Hive scripts can use. How will we define the wrapper classes in Python? Will it be something along the lines of Jython - http://www.jython.org/jythonbook/en/1.0/JythonAndJavaIntegration.html ?

@jmalkin
Contributor

jmalkin commented Oct 2, 2017

If it would add a package dependency to the pom, the Python bindings will need to live in a different repository.

@AlexanderSaydakov
Contributor

I wouldn't call this "traction". No third party has expressed any interest yet.

@leerho
Contributor

leerho commented Oct 4, 2017

Having the library adapted for different languages makes a lot of sense and Python is a good place to start. As Jon points out we may need to set up a separate repo for language adaptors or merge it into one of the other repos. Sketches-misc might be a candidate.

@anirudhacharya
Member Author

anirudhacharya commented Oct 8, 2017

Python does not use Maven for build and dependency management. It has a very different tool called distutils.

Also, I still have to explore the performance overhead that a Python wrapper might impose on the existing sketches-core library. Rather than an adaptor, it might make sense to write a Python library from scratch, rather than a wrapper for an existing Java library.

I see that Edo Liberty, who was until recently at Yahoo, already has a Python repo for two of the sketching algorithms: Frequent Directions and Streaming Quantiles.

Also, there is an existing Python library published for the CountMin sketch. But I don't think there are any implementations of HLL, theta sketch, tuple sketch, or sampling sketches in Python.

@edoliberty

edoliberty commented Oct 9, 2017 via email

@leerho
Contributor

leerho commented Oct 11, 2017

I would strongly advise against attempting to write a Python DataSketches library from scratch:

  • The implementations of the various sketches in the core library have been highly optimized for performance, and as a result, the implementations are quite complex and leverage a lot of subtle techniques. Any single sketch, such as the HllSketch, is not one algorithm, but a collection of algorithms and techniques. Redesigning even a single sketch from scratch in Python is a huge task and without the knowledge of the design internals you would be at a big disadvantage.
  • A parallel implementation would not benefit from continuous updates and performance improvements of the core Java library. This would double the effort in maintaining and supporting the Python code base. Our small development team could not possibly take this on or support it.
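
To give a sense of the gap described above, here is a minimal sketch of textbook HyperLogLog in pure Python. It is not the sketches-core HllSketch and shares no code or API with it; the class name `NaiveHll`, the `lg_k` parameter, and the use of SHA-1 as the hash are assumptions made only for this illustration. It has a single dense mode and none of the multiple modes, compact serialization, or tuned estimators that an optimized library layers on top.

```python
import hashlib
import math


class NaiveHll:
    """Textbook HyperLogLog; illustrative only, not the sketches-core HllSketch."""

    def __init__(self, lg_k=12):
        self.lg_k = lg_k                  # log2 of the number of registers
        self.k = 1 << lg_k                # number of registers
        self.registers = [0] * self.k

    def update(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (arbitrary choice for this sketch).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.lg_k
        idx = h >> width                  # top lg_k bits select the register
        rest = h & ((1 << width) - 1)     # remaining bits
        # Rank is the 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        alpha = 0.7213 / (1.0 + 1.079 / self.k)   # bias constant, valid for k >= 128
        raw = alpha * self.k * self.k / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.k and zeros:
            # Small-range correction: fall back to linear counting.
            return self.k * math.log(self.k / zeros)
        return raw
```

Even this short version leaves out merging, serialization, and memory-efficient register packing, which is where much of the engineering effort in a production sketch goes.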

The public APIs of the core library, on the other hand, are quite stable, and "relatively" consistent across the different sketch families.

There are Python/Java tools out there, e.g., py4j, and several others. I would suggest that if you are passionate about Python, a great place to start would be to investigate various Python/Java interfaces and evaluate them for performance, stability, and ease of use. This, by itself, would be a major contribution that other Python developers could then leverage.
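
One way to begin such an evaluation is a small timing harness that measures the per-call cost of an `update`-style function, so that different bridges can be compared on the same workload. The sketch below is only a starting point under stated assumptions: the function names `per_call_overhead` and `compare` are hypothetical, and the two candidates shown are pure-Python stand-ins, not real Python/Java bridges.

```python
import time


def per_call_overhead(update, n=100_000):
    """Return the mean wall-clock seconds per call of update(i) over n calls."""
    start = time.perf_counter()
    for i in range(n):
        update(i)
    return (time.perf_counter() - start) / n


def compare(candidates, n=100_000):
    """Time each named callable; return (name, seconds_per_call) pairs, fastest first."""
    results = [(name, per_call_overhead(fn, n)) for name, fn in candidates.items()]
    return sorted(results, key=lambda pair: pair[1])


if __name__ == "__main__":
    seen = set()
    # Stand-ins only: a real comparison would pass the update method of a sketch
    # object reached through each bridge under test.
    stand_ins = {"python-set": seen.add, "no-op": lambda _: None}
    for name, cost in compare(stand_ins, n=10_000):
        print(f"{name}: {cost * 1e9:.0f} ns/call")
```

With a bridge such as py4j, the callable passed in could be the `update` method of a Java sketch object obtained through a running gateway; that wiring depends on the bridge and is not shown here. Stability and ease of use would still need to be judged separately from raw call overhead.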

@edoliberty

edoliberty commented Oct 11, 2017 via email

@leerho
Contributor

leerho commented Oct 11, 2017

My comment was in response to the suggestion:

Rather than an adaptor, it might make sense to write a Python library from scratch, rather than a wrapper for an existing Java library.

@edoliberty

edoliberty commented Oct 11, 2017 via email

@anirudhacharya
Member Author

I can work on this on weekends. I will invest time in this direction.

There are Python/Java tools out there, e.g., py4j, and several others. I would suggest that if you are passionate about Python, a great place to start would be to investigate various Python/Java interfaces and evaluate them for performance, stability, and ease of use. This, by itself, would be a major contribution that other Python developers could then leverage.

@edoliberty

edoliberty commented Oct 12, 2017 via email

@leerho
Contributor

leerho commented Nov 20, 2017

There has been no activity on this issue for several weeks, so I am closing this issue for now. We can always reopen this issue in the future.

@leerho leerho closed this as completed Nov 20, 2017