Camus
Camus is a tool built by LinkedIn which is essentially a distributed consumer running a MapReduce job underneath to consume messages from Kafka and save them to HDFS. It includes the following features and more:
- Automatic discovery of topics
- Avro schema management (in progress)
- Date partitioning
Camus is great for the Kafka->HDFS pipeline as it keeps track of the last offset consumed for each topic, and it also lets you whitelist and blacklist topics so that you consume only a subset of them. Moreover, it compresses the data before saving it to HDFS, which can save space by an order of magnitude. Camus is very easy to set up and is worth the time spent. You can read more about it here.
Prerequisites: Zookeeper, Kafka, Hadoop
The Camus jar is built with Maven. If you don't have Maven installed, install it on the node you will be setting Camus up on.
node:~$ sudo apt-get install maven
This must be done on the Hadoop namenode.
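As an optional sanity check before building, you can confirm that Maven (and the Java runtime it reports) is available on the node:

node:~$ mvn -version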
We will clone Camus and build the jar with Maven. We will install Camus in /usr/local.
node:~$ cd /usr/local/
node:~$ sudo git clone https://github.com/linkedin/camus.git
node:~$ sudo chown -R ubuntu camus
node:~$ cd ./camus/
node:~$ mvn clean package
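If the build succeeds, Maven should produce the shaded example jar under camus-example/target; a quick check (assuming the default 0.1.0-SNAPSHOT version in the cloned pom):

node:~$ ls /usr/local/camus/camus-example/target/camus-example-0.1.0-SNAPSHOT-shaded.jar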
Here we set up the HDFS paths and configure the Kafka broker addresses and ports.
node:~$ cd /usr/local/camus/camus-example/src/main/resources/
node:~$ vim camus.properties

Change the following entries in /usr/local/camus/camus-example/src/main/resources/camus.properties, replacing <namenode_public_dns> with the address of your Hadoop namenode:

# final top-level data output directory, sub-directory will be dynamically created for each topic pulled
etl.destination.path=hdfs://<namenode_public_dns>:9000/camus/topics
# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files
etl.execution.base.path=hdfs://<namenode_public_dns>:9000/camus/exec
# where completed Camus job output directories are kept, usually a sub-dir in the base.path
etl.execution.history.path=hdfs://<namenode_public_dns>:9000/camus/exec/history

Comment out the following line:

#camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder

Add the following two lines:

camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.StringRecordWriterProvider

Set kafka.brokers to a list of Kafka broker node addresses with the Kafka port, with no spaces around the commas or the '=' sign, e.g.

kafka.brokers=<public_dns_1>:9092,<public_dns_2>:9092,<public_dns_3>:9092

Finally, enable compression of the output:

etl.output.codec=gzip
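One thing to watch with JsonStringMessageDecoder: date partitioning works best when the decoder can read an event timestamp out of each JSON message. Depending on the Camus version you cloned, it exposes properties along the lines of the ones below; treat the property names and the format value as assumptions and check the decoder's source before relying on them.

# hypothetical addition to camus.properties: which JSON field holds the event timestamp
camus.message.timestamp.field=timestamp
# and how to parse it (a Java SimpleDateFormat pattern)
camus.message.timestamp.format=yyyy-MM-dd HH:mm:ss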
After the configuration file is set, we need to have the Camus jar and log4j.xml in Hadoop's path.
node:~$ cp /usr/local/camus/camus-example/target/camus-example-0.1.0-SNAPSHOT-shaded.jar /usr/local/hadoop/etc/hadoop/
node:~$ cp /usr/local/camus/camus-example/src/main/resources/log4j.xml /usr/local/hadoop/etc/hadoop
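A quick listing confirms that both files are now in Hadoop's configuration directory:

node:~$ ls /usr/local/hadoop/etc/hadoop/ | grep -E 'camus-example|log4j'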
Now Camus is configured to consume messages from the Kafka brokers. We can specify which topics should be ignored and which topics should be consumed as follows:
You can blacklist and whitelist topics by specifying topic names in the following two lines of the /usr/local/camus/camus-example/src/main/resources/camus.properties file:

kafka.blacklist.topics=blacklist-topic1,blacklist-topic2
kafka.whitelist.topics=whitelist-topic1,whitelist-topic2
First change to the directory where the Camus jar is, and then submit the Camus jar as a Hadoop job.
node:~$ cd /usr/local/hadoop/etc/hadoop/
node:~$ hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P /usr/local/camus/camus-example/src/main/resources/camus.properties
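Once the job completes, you can verify that data and bookkeeping files landed in the HDFS paths configured earlier (the sub-directory layout under each topic depends on the date-partitioning settings, so treat this as a sketch):

node:~$ hadoop fs -ls /camus/topics
node:~$ hadoop fs -ls /camus/exec/history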
Copy the cd and hadoop jar commands above into a file, save it as a shell script, and you can then execute that script from anywhere, which means you can run Camus from any directory (see the sketch below).
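For example, a minimal wrapper script (the file name run_camus.sh is just an illustration) could look like this:

#!/bin/bash
# run_camus.sh: submit the Camus job from any directory (hypothetical wrapper using the paths from this guide)
cd /usr/local/hadoop/etc/hadoop/
hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P /usr/local/camus/camus-example/src/main/resources/camus.properties

Make it executable and run it:

node:~$ chmod +x run_camus.sh
node:~$ ./run_camus.sh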