Documentation on PyKafka vs kafka-python #334

Closed
microamp opened this Issue Nov 9, 2015 · 14 comments

9 participants
Contributor

microamp commented Nov 9, 2015

Hello. I'd like to play around with Kafka, but I don't know which client to start with. I know there is at least one other Python client, kafka-python. Is there any documentation comparing the two? I'll start with PyKafka in the meantime. :)

Member

emmett9001 commented Nov 9, 2015

@microamp Thanks, this is a great idea. There's currently no documentation on this, but to my knowledge the main differences are the specifics of the Python API and PyKafka's implementation of the BalancedConsumer. PyKafka strives to keep the API as pythonic as possible, which means using language features wherever they make client code simpler. This includes things like context managers for object cleanup and futures for asynchronous error handling. PyKafka's balanced consumer implements the Kafka project's notion of the "high level consumer", which uses ZooKeeper to balance consumption of partitions between multiple nodes in a consumer group. From what I understand, kafka-python is waiting to implement self-balancing consumers until Kafka 0.9, when this functionality will be supported natively by the Kafka server itself.
Also, the last time we did a speed test (which was admittedly a while ago at this point), PyKafka's consumer outperformed kafka-python's. I unfortunately no longer have the results from that test, so you may not want to bet too hard on PyKafka being significantly faster or slower - just figured I'd mention it.
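
To make the API points above a bit more concrete, here is a minimal sketch of a context-managed producer and a ZooKeeper-backed balanced consumer. The broker and ZooKeeper addresses, the topic name, the group name, and the 5-second consumer timeout are all placeholders assumed for illustration, not values from this thread. The PyKafka import is deferred into `main()` so the two helper functions can be read and exercised without a running cluster.

```python
def produce_all(topic, messages):
    """Send each message with a synchronous producer and return the count.

    The `with` block cleans up the producer on exit, even if produce() raises.
    """
    sent = 0
    with topic.get_sync_producer() as producer:
        for message in messages:
            producer.produce(message)
            sent += 1
    return sent


def consume_some(topic, group, zookeeper_hosts, limit):
    """Read up to `limit` message values as one member of a consumer group."""
    consumer = topic.get_balanced_consumer(
        consumer_group=group,
        zookeeper_connect=zookeeper_hosts,
        consumer_timeout_ms=5000,  # assumed: stop iterating after 5s of silence
    )
    values = []
    try:
        for message in consumer:
            if message is not None:
                values.append(message.value)
            if len(values) >= limit:
                break
    finally:
        consumer.stop()
    return values


def main():
    # Requires PyKafka plus reachable Kafka and ZooKeeper instances.
    from pykafka import KafkaClient

    client = KafkaClient(hosts="127.0.0.1:9092")  # placeholder address
    topic = client.topics[b"test"]                # placeholder topic
    produce_all(topic, [b"one", b"two", b"three"])
    print(consume_some(topic, b"testgroup", "127.0.0.1:2181", limit=3))
```

Calling `main()` needs a live cluster. Starting a second process with the same group name should cause the topic's partitions to be split between the two consumers.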

@emmett9001 emmett9001 self-assigned this Nov 9, 2015

Member

emmett9001 commented Nov 9, 2015

Some more research - the two libraries differ in which versions of Python they support. PyKafka supports 2.7, 3.4, 3.5, and PyPy, while kafka-python adds 2.6 and lacks 3.5 support. kafka-python also requires a ZooKeeper connection for offset management, which PyKafka does not. kafka-python supports versions of Kafka from 0.8.0 to 0.8.2, whereas PyKafka only supports 0.8.2.

Contributor

microamp commented Nov 10, 2015

@emmett9001

Thanks a lot for the reply. I find the information very helpful.

It's good to know that PyKafka supports Python 3.4+. It was still a work in progress the last time I checked a few months back. Good work, guys.

@emmett9001 emmett9001 closed this Nov 10, 2015

rduplain added a commit that referenced this issue Nov 10, 2015

rduplain added a commit that referenced this issue Nov 10, 2015

rduplain added a commit that referenced this issue Nov 11, 2015

yungchin added a commit that referenced this issue Nov 13, 2015

Merge remote-tracking branch 'parsely/master' into feature/rdkafka_extension

The producer-futures feature was backed out of master, which means the
expected interface for RdKafkaProducer._produce() has changed back, too.
I've addressed all merge conflicts here - the change in the _produce()
interface will be addressed in the next commit.

* parsely/master: (26 commits)
  changelog updates for 2.0.3, dev version
  increment version
  Catch IOError in recvall_into util.
  Catch IOError during connection response.
  re-import weakref
  Revert 52ae7a1
  Link #334 to README.
  drop autocommit logging to DEBUG level. fixes #337
  update after socket error in offset manager discovery
  remove unused condition
  unconditionally update partition leaders on update
  Load all topic values on values method.
  fix typo causing interpreter error in reset_offsets`
  fix outdenting error
  clarify functools import
  catch all exceptions when removing from zookeeper
  be very specific about the error we expect
  producer: minimal changes for gc'ability
  balancedconsumer: minimal changes for gc'ability (RFC)
  add logging, fix some retry/reconnect/update logic in simpleconsumer
  ...

Signed-off-by: Yung-Chin Oei <yungchin@yungchin.nl>

Conflicts:
	.travis.yml
	pykafka/simpleconsumer.py
	tests/pykafka/test_producer.py

Contributor

ottomata commented Nov 13, 2015

A difference between kafka-python and pykafka is the producer interface. kafka-python does not require that you know the topic when instantiating the producer. This is convenient if you need to produce to topics dynamically based on input (which I do!) :)
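
One way to get per-message topic selection from producers that are each bound to a single topic is to create them lazily and cache them by topic name. The following is only a sketch: `TopicRouter` is a hypothetical helper that belongs to neither library, and it assumes a client whose `topics[name].get_producer()` returns an object with `produce()` and `stop()` methods, which is the shape PyKafka's `KafkaClient` exposes.

```python
class TopicRouter:
    """Hypothetical helper: route messages to topics chosen at send time."""

    def __init__(self, client):
        self._client = client
        self._producers = {}  # topic name -> producer, created on first use

    def send(self, topic_name, message):
        """Produce `message` to `topic_name`, creating a producer if needed."""
        producer = self._producers.get(topic_name)
        if producer is None:
            producer = self._client.topics[topic_name].get_producer()
            self._producers[topic_name] = producer
        producer.produce(message)

    def close(self):
        """Stop every cached producer and forget them."""
        for producer in self._producers.values():
            producer.stop()
        self._producers.clear()
```

With PyKafka this might be used as `router = TopicRouter(KafkaClient(hosts="127.0.0.1:9092"))` followed by `router.send(b"clicks", b"payload")`, where the address and topic name are placeholders. The trade-off is one producer (with its own queue and worker) per topic, which may matter if the set of topics is large.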

Member

amontalenti commented Nov 13, 2015

@ottomata That seems like an interesting request for us to look at. Want to open a separate issue about that?

Contributor

ottomata commented Nov 13, 2015

Sure!

Contributor

cscheffler commented Nov 20, 2015

@emmett9001 @ottomata Just got pointed at this thread and thought I'd make a late contribution.

We compared pykafka and kafka-python about 2 months ago while trying to decide which one to use. In the end, the deciding factor for us was that balanced consumers were much easier to manage in pykafka.

We also discovered later that a pykafka producer doesn't die on a Kafka broker restart, while our kafka-python producers did.

Below are performance figures from a 3-node Kafka cluster running in EC2, using a single producer or consumer. The three numbers for each test are the quartiles measured for the test.

  • pykafka producer: 41400 – 46500 – 50200 messages per second
  • pykafka consumer: 12100 – 14400 – 23700 messages per second
  • kafka-python producer: 26500 – 27700 – 29500 messages per second
  • kafka-python consumer: 35000 – 37300 – 39100 messages per second

So, for clarification, the median performance of a pykafka producer was 46500 messages per second, with an interquartile range of 41400 (25th percentile) to 50200 (75th percentile). Hope that makes sense.
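
For anyone who wants to report their own runs in the same three-number form, here is a small sketch using only Python's standard library. The sample values are invented purely for illustration and are not measurements from this thread; `statistics.quantiles` requires Python 3.8 or later.

```python
import statistics


def throughput_quartiles(samples):
    """Return (25th percentile, median, 75th percentile) of the samples."""
    # n=4 yields the three cut points that split the data into quarters.
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return q1, median, q3


# Hypothetical per-run throughput figures, in messages per second.
runs = [41000, 43500, 46500, 48000, 50500]
q1, median, q3 = throughput_quartiles(runs)
print(f"{q1:.0f} - {median:.0f} - {q3:.0f} messages per second")
```

The `inclusive` method treats the samples as the whole population of runs. The default `exclusive` method gives slightly wider quartiles on small samples, so it is worth stating which one was used when comparing figures.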

Member

emmett9001 commented Nov 20, 2015

This is awesome, thanks for the performance numbers @cscheffler. Do you have anything to share on the methodology you used to find them?

Contributor

ottomata commented Nov 23, 2015

Cool! For the producer bench, did you just use the default parameters? I assume async with req_acks=1?

rghv commented Nov 25, 2015

@cscheffler Can you please share links to the test scripts, if they are open-sourced? I see https://github.com/cscheffler/kafka-demo, which uses pykafka. It would be a great help if you could share the kafka-python test scripts that were used in your comparison. Thanks!

Member

emmett9001 commented Jun 16, 2016

This writeup by @jofusa is the most thorough comparative benchmark of the Python Kafka clients I've seen.

soedjais commented Oct 27, 2016

Leaving a URL for another recent benchmark comparing pykafka 2.3.1, kafka-python 1.1.1, and confluent-kafka 0.9.1:
http://activisiongamescience.github.io/2016/06/15/Kafka-Client-Benchmarking/
Edit: already mentioned above by @emmett9001

Contributor

johnistan commented Oct 27, 2016

original author here. just a fyi those are one and the same

amitt001 commented Jul 19, 2017

It's July 2017. Is there any update or a more recent comparison?
I think kafka-python now supports balanced consumers as well.
