
Table of Contents

  1. Introduction
  2. Requirements
  3. Dependencies
  4. Clone Camus
  5. Configure Camus
  6. Run Camus

Introduction

Camus is a tool built by LinkedIn; it is essentially a distributed consumer that runs a MapReduce job underneath to pull messages from Kafka and save them to HDFS. It includes the following features, among others:

  • Automatic discovery of topics
  • Avro schema management (in progress)
  • Date partitioning

Camus is great for the Kafka->HDFS pipeline because it keeps track of the last offset consumed for each topic, and it lets you whitelist and blacklist topics so that you can consume only a subset of them. Moreover, it compresses the data before saving it to HDFS, which can save space by roughly an order of magnitude. Camus is easy to set up and is worth the time spent. You can read more about it here.

Requirements

Zookeeper, Kafka, Hadoop
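
Before moving on, it can help to sanity-check that all three are running (a rough sketch, assuming the Hadoop binaries and Kafka's bin scripts are on your PATH and Zookeeper listens on its default port 2181; adjust hostnames to your cluster):

node:~$ jps                                                  # should show NameNode/DataNode (and Kafka on broker nodes)
node:~$ hdfs dfs -ls /                                       # HDFS is reachable
node:~$ kafka-topics.sh --list --zookeeper localhost:2181    # lists existing Kafka topics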

Dependencies

The Camus jar is built with Maven. If you don't have Maven installed, install it on the node you will be setting up Camus on.

node:~$ sudo apt-get install maven
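
You can confirm the installation with:

node:~$ mvn -version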

Clone Camus

This must be done on the Hadoop namenode.

We will clone Camus and build the jar with Maven. We will install Camus in /usr/local.

node:~$ cd /usr/local/
node:~$ sudo git clone https://github.com/linkedin/camus.git
node:~$ sudo chown -R ubuntu camus
node:~$ cd ./camus/
node:~$ mvn clean package
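
The build may take a few minutes. Once it finishes, the shaded jar used in the following steps is produced under camus-example/target/:

node:~$ ls /usr/local/camus/camus-example/target/camus-example-0.1.0-SNAPSHOT-shaded.jar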

Configure Camus

Here we set the HDFS output paths and configure the Kafka broker addresses and ports.

node:~$ cd /usr/local/camus/camus-example/src/main/resources/
node:~$ vim camus.properties
Edit /usr/local/camus/camus-example/src/main/resources/camus.properties as follows.

Set the HDFS output paths, replacing <namenode_public_dns> with your Hadoop namenode's address:

# final top-level data output directory, sub-directory will be dynamically created for each topic pulled
etl.destination.path=hdfs://<namenode_public_dns>:9000/camus/topics
# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files
etl.execution.base.path=hdfs://<namenode_public_dns>:9000/camus/exec
# where completed Camus job output directories are kept, usually a sub-dir in the base.path
etl.execution.history.path=hdfs://<namenode_public_dns>:9000/camus/exec/history

Comment out the following line:

#camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder

Add the following two lines:

camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.StringRecordWriterProvider

Set kafka.brokers to a comma-separated list of Kafka broker addresses with the Kafka port, with no spaces around the commas or the '=' sign, e.g.

kafka.brokers=<public_dns_1>:9092,<public_dns_2>:9092,<public_dns_3>:9092

Finally, set the output compression codec:

etl.output.codec=gzip
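
To double-check the edits, you can grep camus.properties for the keys changed above (the values it prints should match what you entered; the DNS names are placeholders you must fill in yourself):

node:~$ grep -E 'etl\.destination|etl\.execution|message\.decoder|record\.writer|kafka\.brokers|output\.codec' camus.properties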

After the configuration file is set, we need to place the Camus jar and log4j.xml in Hadoop's configuration directory, which is on its classpath.

node:~$ cp /usr/local/camus/camus-example/target/camus-example-0.1.0-SNAPSHOT-shaded.jar /usr/local/hadoop/etc/hadoop/
node:~$ cp /usr/local/camus/camus-example/src/main/resources/log4j.xml /usr/local/hadoop/etc/hadoop

Now Camus is configured to consume messages from the Kafka brokers. We can specify which topics should be ignored and which topics should be consumed, as follows:

You can blacklist and whitelist topics by specifying topic names in the following two lines of the /usr/local/camus/camus-example/src/main/resources/camus.properties file:

kafka.blacklist.topics=blacklist-topic1,blacklist-topic2
kafka.whitelist.topics=whitelist-topic1,whitelist-topic2

Run Camus

You first need to change to the directory where the Camus jar is located, and then submit the jar as a Hadoop job.

node:~$ cd /usr/local/hadoop/etc/hadoop/
node:~$ hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P /usr/local/camus/camus-example/src/main/resources/camus.properties

Copy the above two commands into a file and save it as a shell script; you can then execute that script from any directory, which means you can run Camus from anywhere.
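
For example, a minimal wrapper script along these lines (the file name run_camus.sh is just a suggestion; the paths match this guide):

#!/bin/bash
# run_camus.sh - change to the directory holding the Camus jar and submit it as a Hadoop job
cd /usr/local/hadoop/etc/hadoop/ || exit 1
hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar \
  com.linkedin.camus.etl.kafka.CamusJob \
  -P /usr/local/camus/camus-example/src/main/resources/camus.properties

Make it executable and run it from any directory:

node:~$ chmod +x run_camus.sh
node:~$ ./run_camus.sh

Once the job completes, the consumed messages should appear under the configured etl.destination.path, e.g.:

node:~$ hdfs dfs -ls /camus/topics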
