This repository has been archived by the owner on Mar 24, 2021. It is now read-only.

Python + Kafka community unification ideas #559

Closed
amontalenti opened this issue Jun 6, 2016 · 17 comments

Comments

@amontalenti
Contributor

amontalenti commented Jun 6, 2016

In the last couple of years, we can't help but notice that there has been quite a lot of fragmentation in the Python community around Kafka. As far as we can tell, there are three major projects as of this writing:

  • pykafka, this project, maintained by Emmett Butler and Keith Bourgoin from Parse.ly, with contributions by Yung-Chin Oei (pure Python with a librdkafka extension for speedups; evolved from a Kafka 0.7 driver; has tested support for Kafka 0.8.2 and 0.9.x)
  • kafka-python, maintained by Dana Powers, currently at Pandora (pure Python, clean implementation focused on Kafka 0.8 => 0.10)
  • confluent-kafka-python, recently released by the Confluent team as part of the broader Kafka 0.10 release (Python C extension module wrapping librdkafka; Kafka 0.9+ focused)

Each project has a different history, level of current support for Kafka, and set of features — and, of course, a different API. This is obviously not an ideal state for the Python user community around Kafka, both for those who currently have Kafka clusters in production and for those looking to adopt Kafka for new projects.

I wonder if anyone has any ideas for what should happen -- if anything -- to unify the Python community around Kafka.

@ottomata
Contributor

ottomata commented Jun 6, 2016

+1, but I'm not sure how this could happen naturally. Each of the mentioned clients has pros and cons of its own, and the maintainers of each project have reasons for developing their clients the way they have.

In any case, something using librdkafka is likely the smartest way to go.


@amontalenti
Contributor Author

@ottomata I don't think I'm looking for a "grand merge". I think, instead, I'm trying to think about how to make sure there isn't wasted effort from this point forward.

For example, pykafka and kafka-python provided varying degrees of production support for Kafka during its "stumbling infant" period in open source, and that's fine -- natural exploration/competition is what open source does best.

But now we have a stable librdkafka, and relatively mature pure Python projects. We have an opportunity to do things like: a) evaluate the relative pros and cons of each API; b) benchmark their real-world performance; and c) think about long-term maintenance.

There are also practical concerns of the limited maintainer hours for open source contributors and the migration path for existing users of each project.

I think what the community wants in the long term is something like what we have with Redis, Cassandra, and Elasticsearch -- a "clear" choice for the Python module/API to use (redis-py, cassandra-driver, and elasticsearch-py, respectively). There are historical projects that supported older versions of each system, but they now point people to the right place (e.g. pycassa now points people to cassandra-driver).

@ottomata
Contributor

ottomata commented Jun 9, 2016

Aye, having used all 3 clients, here are some passing opinions on them:

pykafka

For a time pykafka was simply better than kafka-python. It was the first to have a balanced consumer. pykafka does not have good dynamic topic production support (#354), which makes it hard to use for some use cases. Now that the Kafka API supports managing and balancing consumer groups itself, pykafka's interface feels a little fragmented. The community around pykafka is excellent. We have had several production issues, all of which have been pretty quickly responded to and fixed.

kafka-python

I actually like kafka-python's interface the best of all three. It feels the cleanest. However, even the newer version seems to perform fairly poorly for synchronous production (yes I know, never the answer bla bla), so I don't think we will use it in the future. Plus, I like the idea of outsourcing the work of the client to librdkafka, and kafka-python does not offer that.
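
To illustrate what I mean by "clean": producing and consuming with kafka-python looks roughly like this (a minimal sketch; broker address, topic, and group names are placeholders, and it assumes a broker is running):

```python
# Minimal kafka-python sketch -- broker, topic, and group names are placeholders.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("my-topic", b"hello")
producer.flush()

# The consumer is itself an iterator, which feels very Pythonic.
consumer = KafkaConsumer("my-topic",
                         bootstrap_servers="localhost:9092",
                         group_id="my-group",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```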

confluent-kafka-python

So far so good. It is very new, but I like using it and it seems to perform really well. It'd be nice if the consumer interface returned an iterable.
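
For reference, the consumer loop today looks roughly like the sketch below (config values and topic are placeholders), which is part of why an iterable interface would be welcome:

```python
# Rough confluent-kafka-python consumer sketch -- config values are placeholders.
from confluent_kafka import Consumer, KafkaError

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "my-group",
})
consumer.subscribe(["my-topic"])

while True:
    msg = consumer.poll(timeout=1.0)  # explicit poll() rather than iterating
    if msg is None:
        continue
    if msg.error():
        # Reaching the end of a partition is also reported via error().
        if msg.error().code() == KafkaError._PARTITION_EOF:
            continue
        raise Exception(msg.error())
    print(msg.topic(), msg.partition(), msg.offset(), msg.value())
```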

I think Wikimedia will end up using confluent-kafka-python in the near future. pykafka's lack of dynamic topic support is a no-go for us, and kafka-python's performance isn't good enough.

Using librdkafka seems like the right thing to do, as it is so widely used in many different languages and environments that you can generally expect it to be very solid. Wikimedia has an interest in a robust nodejs Kafka client as well, and I expect we will leverage librdkafka for that too.

@amontalenti
Contributor Author

@ottomata Thanks for this write-up.

@jofusa recently wrote up and shared this handy benchmark comparing pykafka, kafka-python, and confluent-kafka-python. The full thing is worth a read, but these timings speak for themselves. I've cleaned up the tables (rounded the numbers) and sorted each by msgs/s.

Consumer timings:

| consumer               | time (s) | MB/s  | msgs/s  |
|------------------------|----------|-------|---------|
| confluent-kafka-python | 3.83     | 24.93 | 261,408 |
| pykafka (rdkafka)      | 6.09     | 15.67 | 164,311 |
| kafka-python           | 26.56    | 3.59  | 37,668  |
| pykafka (python)       | 29.43    | 3.24  | 33,977  |

Producer timings:

| producer               | time (s) | MB/s  | msgs/s  |
|------------------------|----------|-------|---------|
| confluent-kafka-python | 5.45     | 17.50 | 183,456 |
| pykafka (rdkafka)      | 15.72    | 6.06  | 63,595  |
| pykafka (python)       | 57.31    | 1.66  | 17,446  |
| kafka-python           | 67.86    | 1.40  | 14,737  |

It seems like his timings confirm what our own benchmarks showed: librdkafka provides huge speedups across the board. The data supports @ottomata's thought that whatever the Python community unifies around should be librdkafka-based, as these benchmarks are pretty hard to ignore.
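
For anyone who wants to reproduce numbers like these against their own cluster, the producer side of such a benchmark is only a few lines. This is not @jofusa's actual harness, just a minimal sketch of the idea using confluent-kafka-python (message count, payload size, broker, and topic are made up):

```python
# Minimal producer-throughput sketch (not the original benchmark harness).
import time

from confluent_kafka import Producer

NUM_MESSAGES = 1000000            # made-up message count
PAYLOAD = b"x" * 100              # made-up message size
TOPIC = "benchmark-topic"         # made-up topic name

producer = Producer({"bootstrap.servers": "localhost:9092"})

start = time.time()
for _ in range(NUM_MESSAGES):
    while True:
        try:
            producer.produce(TOPIC, PAYLOAD)
            break
        except BufferError:
            # The local queue is full; serve delivery callbacks and retry.
            producer.poll(0.1)
producer.flush()                  # block until everything queued is delivered
elapsed = time.time() - start

print("%d msgs in %.2fs: %.0f msgs/s, %.2f MB/s" % (
    NUM_MESSAGES, elapsed, NUM_MESSAGES / elapsed,
    NUM_MESSAGES * len(PAYLOAD) / (elapsed * 1e6)))
```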

@amontalenti
Contributor Author

amontalenti commented Jun 16, 2016

Pulling in @dpkp, if he doesn't mind. He shared some good comments about kafka-python on the Hacker News thread discussing @jofusa's article.

Here's what strikes me about where we are between these 3 projects:

  • Pure Python "feels good". Having a pure Python wrapper around the Kafka internals, even if it means having to maintain a codebase separate from librdkafka's C code, may be a good thing, since it a) allows us to isolate driver/protocol bugs more easily; b) lets us "poke" at the internals as Pythonistas more easily; c) lets the driver get used in environments where C builds may be less easy to come by.
  • Pure Python is better for PyPy. It probably also gives us more flexibility to build support for Python-specific meta-libraries like asyncio, gevent, or tornado.
  • Python tooling can be helpful. Command-line tools around Kafka, akin to pykafka's kafka-tools CLI, tend to be helpful for people who are using Python + Kafka in production.
  • Python idioms make people happy. Confluent may have an interest in having a "unified" API between Java, C, C++, Python, etc., but we are probably more interested in using Python-native features to have an "idiomatic" API... e.g. iterators/generators/context managers and the like (see the sketch right after this list).
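
To make that last point concrete, an "idiomatic" consumer might look something like the sketch below. This is purely hypothetical -- the class and parameter names are made up, and it is not the interface of any existing client:

```python
# Hypothetical sketch of an "idiomatic" consumer wrapper.
# None of these names come from an existing Kafka client.


class IdiomaticConsumer:
    """A consumer that is both a context manager and an iterator."""

    def __init__(self, topic, group, brokers="localhost:9092"):
        self.topic, self.group, self.brokers = topic, group, brokers

    def __enter__(self):
        # A real implementation would connect and subscribe here.
        return self

    def __exit__(self, exc_type, exc, tb):
        # ...and commit offsets / close the connection here.
        return False

    def __iter__(self):
        # ...and yield messages as they arrive here.
        return iter([])


# Usage: plain `with` and `for` -- no poll loops, no callbacks.
with IdiomaticConsumer("my-topic", group="workers") as consumer:
    for message in consumer:
        print(message)
```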

Meanwhile, it seems clear that:

  • librdkafka is really fast -- like, it's so fast we can't ignore it... a lot of people come to Kafka looking for speed and librdkafka delivers it.
  • There are only so many volunteer hours to go around. Ultimately, @emmett9001, @kbourgoin, and @dpkp are maintaining open source drivers "in their spare time", whereas Confluent et al. are maintaining librdkafka as part of their business strategy in commercializing Kafka. The community would benefit from having a "single clear choice" for a driver, and we'd also benefit from collaborating, since we wouldn't spend all our time re-implementing the same stuff.

Obviously, this is sort of just providing the "lay of the land"... where we go from here, I'm not sure. @emmett9001, @kbourgoin and I are having a little get-together in NYC next week and maybe we can chat about some plans and ideas to share with the community here after we talk it through a bit. But I'd love to hear other feedback.

@amontalenti
Contributor Author

Also pulling in @yungchin here, one of the other "volunteers" :)

@dpkp

dpkp commented Jun 17, 2016

I have fun working on kafka-python. I hope others enjoy it. If there's a better driver out there, huzzah!

My one piece of advice is that all of you should get more involved on the kafka-dev mailing list, particularly with respect to API KIPs and wire protocol issues. The core team is very focused on the Java ecosystem, and that can lead to API designs that force client drivers into less-than-ideal positions. If nothing else, getting more non-Java perspectives into the mix would be a great improvement.

I know Magnus is working very hard on the librdkafka side, and it does appear that Confluent is paying a few more people to work on wrapper libraries. That will be a great benefit for users who want higher performance than they can get from a single Python process w/ GIL restrictions.

But I do think one of the great benefits of a community is a diversity of approaches and viewpoints. And Python is in a unique position because we can generally develop faster than other languages, and I believe we should continue to leverage that to the benefit of the entire community.

So with that said, yay for options! Have fun writing software. We only live once.

@brianbruggeman

brianbruggeman commented Jul 5, 2016

The splintering is something that I've found to be an irritation; @amontalenti, thank you for opening up a discussion. Echoing @dpkp, I'm not sure this is necessarily the proper location for the discussion, but at least there's a discussion somewhere. Kafka's rapid release cycle is at least partially a reason for the splintering. Keeping up with the feature creep and maintaining production-level code is tough. Everything has changed within the past 8 months.

I think there should probably be a frank discussion about pykafka specifically in this thread, or, lacking consensus, a push for everyone to help develop Confluent's API. Long term, it feels like their API will become the standard.

I feel that Confluent's entry, while performant, is not really usable for rapid iteration. It appears to be built by a team that knows C really well and has jumped into Python's C API, but hasn't really engaged the Python community at any length. Without knowing the librdkafka API, this library in its current state is unusable. In my ideal world, there would be a strong Python interface around confluent-kafka's current librdkafka skeleton that would also be compatible with PyPy. I think the best option here would be a better C interface with a ctypes and/or cffi version. That would make Confluent's client accessible not only to Python but also to JavaScript, Lua, Ruby, Lisp, Haskell, etc. -- any language that has a libffi interface. For us, confluent-kafka is a wait-and-see: we'll watch for updates, adoption, and traction.
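
To make the ctypes idea concrete: loading the shared library and calling one of its simplest entry points takes just a few lines. This is a minimal sketch that assumes librdkafka is installed on the system; a real binding would obviously need to declare far more of the API:

```python
# Minimal ctypes sketch: load librdkafka and call rd_kafka_version_str().
# Assumes librdkafka is installed; a real binding would declare much more of the API.
import ctypes
import ctypes.util

libpath = ctypes.util.find_library("rdkafka")
rdkafka = ctypes.CDLL(libpath)

# rd_kafka_version_str() returns a const char * with the library version.
rdkafka.rd_kafka_version_str.restype = ctypes.c_char_p
print("librdkafka version:", rdkafka.rd_kafka_version_str().decode())
```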

I prefer kafka-python's interface most of all, but at the same time, I think kafka-python's performance is lacking and its updates are sparser. In addition, while I appreciate the work done by the author, it's just one person. When we started putting together our code, we began initially with kafka-python and quickly moved to pykafka.

We are currently using pykafka because it has the most features and still appears to be performant. It is built with Python in mind and has been tested in a variety of environments. But I really dislike the threading and the current API. I also feel that there should be one obvious way to create a Consumer, as opposed to three variants.

I am currently supporting all three libraries in my backend, with a minor change to a configuration file selecting which library is used. However, this is obviously less than ideal. Moving forward, though, our support will be primarily with pykafka because it has the most Python support and the widest feature set.
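
The switch itself is roughly the pattern sketched below (simplified; the config keys and wrapper function are made up for illustration, and the real code normalizes many more options):

```python
# Simplified sketch of selecting a Kafka backend via configuration.
# The config keys and the make_producer() wrapper are made up for illustration.

def make_producer(config):
    backend = config.get("kafka_backend", "pykafka")

    if backend == "pykafka":
        from pykafka import KafkaClient
        client = KafkaClient(hosts=config["brokers"])
        return client.topics[config["topic"].encode()].get_producer()

    if backend == "kafka-python":
        from kafka import KafkaProducer
        return KafkaProducer(bootstrap_servers=config["brokers"])

    if backend == "confluent-kafka":
        from confluent_kafka import Producer
        return Producer({"bootstrap.servers": config["brokers"]})

    raise ValueError("unknown backend: %s" % backend)
```

Each backend's producer then exposes a different send/produce call, so the wrapper also has to paper over that, which is exactly the less-than-ideal part.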

@dpkp

dpkp commented Jul 6, 2016

@brianbruggeman what performance do you expect from a Python driver? I can push 100Mb/sec w/ a single core running a kafka-python producer. I am skeptical that anyone really needs more than that. I generally expect Kafka brokers to be network-bound first. Any normal deployment would have more producers than brokers, and if a single producer can max a network card, I think you've already hit overkill.

@brianbruggeman

@dpkp One of our data streams is currently sitting at about 30 MB/sec with spikes over 60 MB/sec. This is expected to grow in the second half of 2016.

@jianbin-wei

We are currently using pykafka. Our use case is more focused on producing to and consuming from topics dynamically. To do that, we have a simple wrapper that creates a producer/consumer for each topic within one process. However, we have run into some serious issues:

  • There are multiple threads for each partition. This causes a large number of threads and kills the CPU.
  • Using greenlets can alleviate the issue, but it would require the rest of our code to be gevent-compatible.
  • Separate consumers for each topic make our system hard to scale, i.e., the maximum number of processes we can run is the maximum partition count across all topics.

The Java interface looks much nicer. I did a quick check on the confluent-kafka interface, and it looks like what we need. One concern is how librdkafka can keep up with Kafka development and new features.
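
For context, the per-topic wrapper we use looks roughly like the sketch below (simplified; topic names, hosts, and group are made up). Because each producer and consumer instance manages its own worker threads (the issue described above), the thread count grows with the number of topics handled in one process:

```python
# Simplified sketch of a per-topic pykafka wrapper (names and hosts are made up).
from pykafka import KafkaClient

client = KafkaClient(hosts="127.0.0.1:9092")
producers = {}
consumers = {}


def get_producer(topic_name):
    # One pykafka producer (with its own worker threads) per topic.
    if topic_name not in producers:
        producers[topic_name] = client.topics[topic_name.encode()].get_producer()
    return producers[topic_name]


def get_consumer(topic_name, group):
    # One pykafka balanced consumer (with its own fetcher threads) per topic.
    if topic_name not in consumers:
        consumers[topic_name] = client.topics[topic_name.encode()].get_balanced_consumer(
            consumer_group=group.encode(),
            zookeeper_connect="127.0.0.1:2181",
        )
    return consumers[topic_name]


get_producer("events").produce(b"hello")
```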

@merrellb

I've recently been intrigued by the efforts of libraries like hyper-h2, h11, and wsproto to implement canonical, pure Python protocols and state machines without any networking. This seems like the most flexible approach, allowing a wide range of networking and opinionated frameworks to be layered on top. I am just starting to dive into these libraries, but I'm curious if anyone has an opinion on which of the Kafka clients it would be easier to rip all of the networking code out of :-)
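
The sans-IO shape those libraries share is roughly the following -- a hypothetical sketch of what a networking-free Kafka protocol core could expose, not any existing library's API, and the usage at the bottom assumes a broker listening on localhost:9092:

```python
# Hypothetical sans-IO sketch: the protocol object never touches a socket.
# Class and method names are made up; h11 and hyper-h2 expose a similar shape.
import socket


class KafkaProtocol:
    """Pure state machine: requests in, bytes out; bytes in, parsed events out."""

    def __init__(self):
        self._recv_buffer = b""

    def send_fetch_request(self, topic, partition, offset):
        # A real implementation would serialize a Fetch request here and
        # return the bytes for the caller to write to its own transport.
        return b""

    def receive_data(self, data):
        # Accumulate bytes and return any fully parsed responses as events.
        self._recv_buffer += data
        return []


# The caller owns all I/O: blocking sockets, asyncio, gevent, whatever it likes.
proto = KafkaProtocol()
sock = socket.create_connection(("localhost", 9092))
sock.sendall(proto.send_fetch_request("my-topic", 0, 0))
for event in proto.receive_data(sock.recv(4096)):
    print(event)
```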

@dpkp

dpkp commented May 4, 2017 via email

@emmettbutler
Contributor

Here's @mrocklin's blog post with an overview of the current state of the ecosystem: http://matthewrocklin.com/blog/work/2017/10/10/kafka-python

@brianbruggeman

@emmett9001 Thank you for that link; the thoughts there echo my impressions as well.

Ultimately, I think confluent-kafka is the long-term future for Python. Kafka is intended to be fast, and we need to take advantage of Python's C API to get to the needed performance level; the library essentially exposes librdkafka through CPython's C API. But the package itself is very unfriendly to Python developers, and the API, while functional, isn't designed with a Python developer in mind. It takes more digging than should be necessary to put the package to use.

In contrast, pykafka is definitely the friendliest Python package. It's relatively performant, but threading makes it clunky, and for any serious use of Kafka streaming, you'll need those threads to be performant.

FWIW (circling back from a year+ ago) - we went with confluent-kafka.

@dpkp

dpkp commented Oct 10, 2017

My humble perspective is that fewer people really care about performance than say they do. It makes for good blog posts, but what really matters is stability, bugfixes, and keeping up with server features and development. I think kafka-python manages to tackle these quite well.

Also, FWIW, I can consume almost 100Mb/sec on a single core running kafka-python on PyPy, which is certainly in line with the raw performance you see from librdkafka.

@amontalenti
Contributor Author

Closing for inactivity.
