
Why Not Use Kafka Directly?

We didn't use Kafka directly for two reasons:

  • The migration cost was too high for us. We already had a large deployment of Suro clients and collectors before we needed Kafka. To switch to Kafka, we would have to complete the following tasks:

    • Update the Suro client to swap in the Kafka producer while keeping full backward compatibility. No application should require any change beyond a rebuild and redeployment.
    • Ask every team to rebuild and redeploy their applications. This would take a long time because each team has its own release schedule and priorities.
    • Update all the dashboards and automation we had built for Suro to support operating Kafka.
    • Update the Suro collector to consume data from Kafka. Note that Kafka itself is not a collector; it is a powerful persistent queue, so we would still have to process the data it delivers.
    • Introduce a ZooKeeper dependency into each application (no longer the case with Kafka 0.8, which had not yet been released when we started integrating with Kafka).
  • Supporting auto scaling with Kafka 0.7 was not easy. At Netflix we encourage teams to collect data at will: any team can start collecting a new type of event of reasonable volume at any time. That means Suro collectors must be able to scale up automatically to meet ever-increasing event volumes and numbers of event types, and scale down when traffic dwindles during off hours. Such auto scaling can be done with Kafka, but it would require careful implementation of the following features:

    • Draining Kafka brokers before terminating them during a scale-down operation.
    • Ensuring that when the Kafka cluster is scaled up, Kafka producers rebalance across the new brokers automatically.
    • Ensuring that when the number of Kafka consumers increases, no consumer becomes idle. The number of active consumers for a given topic is capped at that topic's partition count, so consumers beyond that cap receive no data (see the sketch after this list).
    • Enhancing Kafka 0.7 so that data is not lost when a broker terminates unexpectedly. Kafka 0.8 added replication to address this problem, but Kafka 0.7 did not have it.
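
To make the consumer-idleness point concrete, here is a minimal sketch. It assumes a simple round-robin assignment rather than Kafka's actual assignment logic, and the class and method names are hypothetical. It shows that with more consumers in a group than partitions in a topic, the surplus consumers own nothing:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A minimal sketch (not Kafka's actual assignment code) of why consumers
 * beyond a topic's partition count go idle: each partition is owned by at
 * most one consumer in a group, so with P partitions and C consumers,
 * max(C - P, 0) consumers receive no partitions at all.
 */
public class PartitionAssignmentSketch {

    /** Round-robin partitions over consumers; index i holds consumer i's partitions. */
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> owned = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            owned.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            owned.get(p % consumers).add(p);
        }
        return owned;
    }

    public static void main(String[] args) {
        int partitions = 4;
        int consumers = 6; // scaled past the partition count
        List<List<Integer>> owned = assign(partitions, consumers);
        for (int c = 0; c < consumers; c++) {
            System.out.printf("consumer-%d owns %s%s%n",
                    c, owned.get(c), owned.get(c).isEmpty() ? "  <- idle" : "");
        }
        // consumers 4 and 5 own no partitions, so adding them
        // to the group buys no extra parallelism.
    }
}
```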

Given all the associated costs and our relatively simple use cases for Kafka, we decided to put Kafka behind Suro. This also enabled our two-person team to move quickly in setting up a real-time data pipeline while continuing to operate and develop Suro.
