Camus
Camus is a tool built by LinkedIn which is essentially a distributed consumer running a MapReduce job underneath to consume messages from Kafka and save them to HDFS. It includes the following features and more:
- Automatic discovery of topics
- Avro schema management (in progress)
- Date partitioning
Camus is great for the Kafka->HDFS pipeline as it keeps track of the last offset consumed for each topic, and it also lets you whitelist and blacklist topics so that you consume only a subset of them. Moreover, it compresses the data before saving it to HDFS, which can save space by an order of magnitude. Camus is very easy to set up and is worth the time spent. You can read more about it here.
Prerequisites: Zookeeper, Kafka, Hadoop
The Camus jar is built with Maven. If you don't have Maven installed, install it on the node you will be setting Camus up on.
node:~$ sudo apt-get install maven
This must be done on the Hadoop namenode.
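As an optional sanity check before building, you can confirm that Maven (and the Java runtime it reports) is available on the node:

node:~$ mvn -version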
We will clone Camus and build the jar with Maven. We will install Camus in /usr/local.
node:~$ cd /usr/local/
node:~$ sudo git clone https://github.com/linkedin/camus.git
node:~$ sudo chown -R ubuntu camus
node:~$ cd ./camus/
node:~$ mvn clean package
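If the build succeeds, Maven should produce the shaded example jar under camus-example/target; a quick check (assuming the default 0.1.0-SNAPSHOT version in the cloned pom):

node:~$ ls /usr/local/camus/camus-example/target/camus-example-0.1.0-SNAPSHOT-shaded.jar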
Here we set up the HDFS paths and configure the Kafka broker addresses and ports.
node:~$ cd /usr/local/camus/camus-example/src/main/resources/
node:~$ vim camus.properties

Change the following entries in /usr/local/camus/camus-example/src/main/resources/camus.properties, replacing <namenode_public_dns> with the address of your Hadoop namenode:

# final top-level data output directory, sub-directory will be dynamically created for each topic pulled
etl.destination.path=hdfs://<namenode_public_dns>:9000/camus/topics
# HDFS location where you want to keep execution files, i.e. offsets, error logs, and count files
etl.execution.base.path=hdfs://<namenode_public_dns>:9000/camus/exec
# where completed Camus job output directories are kept, usually a sub-dir in the base.path
etl.execution.history.path=hdfs://<namenode_public_dns>:9000/camus/exec/history

Comment out the following line:

#camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.LatestSchemaKafkaAvroMessageDecoder

Add the following two lines:

camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.JsonStringMessageDecoder
etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.StringRecordWriterProvider

Set kafka.brokers to a list of Kafka broker node addresses with the Kafka port, with no spaces around the commas or the '=' sign, e.g.

kafka.brokers=<public_dns_1>:9092,<public_dns_2>:9092,<public_dns_3>:9092

Finally, enable compression of the output:

etl.output.codec=gzip
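One thing to watch with JsonStringMessageDecoder: date partitioning works best when the decoder can read an event timestamp out of each JSON message. Depending on the Camus version you cloned, it exposes properties along the lines of the ones below; treat the property names and the format value as assumptions and check the decoder's source before relying on them.

# hypothetical addition to camus.properties: which JSON field holds the event timestamp
camus.message.timestamp.field=timestamp
# and how to parse it (a Java SimpleDateFormat pattern)
camus.message.timestamp.format=yyyy-MM-dd HH:mm:ss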
After the configuration file is set, we need to have the Camus jar and log4j.xml in Hadoop's path.
node:~$ cp /usr/local/camus/camus-example/target/camus-example-0.1.0-SNAPSHOT-shaded.jar /usr/local/hadoop/etc/hadoop/
node:~$ cp /usr/local/camus/camus-example/src/main/resources/log4j.xml /usr/local/hadoop/etc/hadoop
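A quick listing confirms that both files are now in Hadoop's configuration directory:

node:~$ ls /usr/local/hadoop/etc/hadoop/ | grep -E 'camus-example|log4j'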
Now Camus is configured to consume messages from the Kafka brokers. We can specify which topics should be ignored and which topics should be consumed as follows:
You can blacklist and whitelist topics by specifying topic names in the following two lines of the /usr/local/camus/camus-example/src/main/resources/camus.properties file:

kafka.blacklist.topics=blacklist-topic1,blacklist-topic2
kafka.whitelist.topics=whitelist-topic1,whitelist-topic2
First change to the directory where the Camus jar is, and then submit the Camus jar as a Hadoop job.
node:~$ cd /usr/local/hadoop/etc/hadoop/
node:~$ hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P /usr/local/camus/camus-example/src/main/resources/camus.properties
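Once the job completes, you can verify that data and bookkeeping files landed in the HDFS paths configured earlier (the sub-directory layout under each topic depends on the date-partitioning settings, so treat this as a sketch):

node:~$ hadoop fs -ls /camus/topics
node:~$ hadoop fs -ls /camus/exec/history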
Copy the cd and hadoop jar commands above into a file, save it as a shell script, and you can then execute that script from anywhere, which means you can run Camus from any directory (see the sketch below).
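For example, a minimal wrapper script (the file name run_camus.sh is just an illustration) could look like this:

#!/bin/bash
# run_camus.sh: submit the Camus job from any directory (hypothetical wrapper using the paths from this guide)
cd /usr/local/hadoop/etc/hadoop/
hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P /usr/local/camus/camus-example/src/main/resources/camus.properties

Make it executable and run it:

node:~$ chmod +x run_camus.sh
node:~$ ./run_camus.sh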