Installation
- Apache Kafka 2.4.0 for Scala 2.12 (download)
- Apache Spark 2.1.0 with Hadoop 2.7 (download)
- Elasticsearch 5.5.2 (download)
- Kibana 5.5.2 (download)
- Logstash 5.5.2 (download)
The topology used in CATRACA is shown in the figure.
This tutorial assumes all servers are on the network 10.10.10.0/24 with the following IP addresses:
- kafka01: 10.10.10.7
- kafka02: 10.10.10.3
- master: 10.10.10.6
- slave01: 10.10.10.9
- slave02: 10.10.10.13
- slave03: 10.10.10.11
- elastic01: 10.10.10.15
- kibana01: 10.10.10.10
The IP addresses, however, can be customized according to your network settings.
This installation process was performed on Ubuntu Server 16.04 amd64 servers.
On all nodes, we suggest editing /etc/hosts as follows:
127.0.0.1 localhost
10.10.10.6 master
10.10.10.9 slave01
10.10.10.13 slave02
10.10.10.11 slave03
10.10.10.7 kafka01
10.10.10.3 kafka02
10.10.10.15 elastic01
10.10.10.10 kibana01
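Appending these entries by hand on every node is error-prone; a small sketch (using the addresses above, which you should adjust to your own network) builds the block once so it can be reviewed before being appended:

```shell
# Build the CATRACA /etc/hosts entries once, review them, then append
# on each node. The addresses follow the topology above.
CATRACA_HOSTS="10.10.10.6 master
10.10.10.9 slave01
10.10.10.13 slave02
10.10.10.11 slave03
10.10.10.7 kafka01
10.10.10.3 kafka02
10.10.10.15 elastic01
10.10.10.10 kibana01"
echo "$CATRACA_HOSTS"                              # review before applying
# echo "$CATRACA_HOSTS" | sudo tee -a /etc/hosts   # then append on the node
```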
Verify that each node's hostname is correct, e.g. the master node's hostname MUST be master. To change the hostname, run:
sudo hostname master
Requirements (install on ALL nodes):
sudo apt-get install python-setuptools python-dev build-essential python-pip
sudo pip install numpy sklearn scipy python-geoip python-geoip-geolite2
Verify that all requirements were installed with: sudo pip freeze
Nodes must communicate with each other to send commands. To achieve this, we need to establish passwordless (key-based) SSH communication.
First, install the SSH server and client:
sudo apt-get install openssh-server openssh-client
To generate a new key pair, enter the following command at the terminal of your local machine (i.e. your computer):
ssh-keygen -t rsa -P ""
Assuming your local user is called "localuser", you will see output similar to the following:
Generating public/private rsa key pair.
Enter file in which to save the key (/Users/localuser/.ssh/id_rsa):
Hit return to accept this file name and path (or enter a new name).
Next, you will be prompted for a passphrase to secure the key with. You may either enter a passphrase or leave the passphrase blank.
NOTE: If you leave the passphrase blank, you will be able to use the private key for authentication without entering a passphrase. If you enter a passphrase, you will need both the private key and the passphrase to log in. Securing your keys with passphrases is more secure, but both methods have their uses and are more secure than basic password authentication.
This generates a key pair, a private key id_rsa and a public key id_rsa.pub, in the .ssh directory of localuser's home directory.
Authorize the key for the local machine itself:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Then display the public key with cat ~/.ssh/id_rsa.pub and copy it to your clipboard.
To enable the use of SSH key to authenticate as the new remote user, you must add the public key to a special file in the user's home directory.
For this, copy the content of ~/.ssh/id_rsa.pub from the master node and paste it into the ~/.ssh/authorized_keys file of all the slaves as well as the master.
On each node, create a new directory called .ssh and restrict its permissions with the following commands:
mkdir -p ~/.ssh
chmod 700 ~/.ssh
Now open a file in .ssh called authorized_keys with a text editor. We will use nano to edit the file:
nano .ssh/authorized_keys
Now insert your public key (which should be in your clipboard) by pasting it into the editor.
Hit CTRL-X to exit the file, then Y to save the changes that you made, then ENTER to confirm the file name.
Now restrict the permissions of the authorized_keys file with this command:
chmod 600 .ssh/authorized_keys
Type this command once to return to the root user:
exit
Restart SSH:
service ssh restart
Verify in /etc/ssh/sshd_config that key-pair authentication is enabled:
PubkeyAuthentication yes
Repeat the previous passwordless-SSH steps on all slaves and on the master.
Check that this process was done correctly by performing SSH to all slave nodes and to the master:
ssh slave01
ssh slave02
ssh slave03
ssh master
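The four checks above can be wrapped in a loop; a sketch assuming the hostnames from /etc/hosts and that keys are already in place (the actual ssh call is left commented so you can review it first):

```shell
# Loop over every node and report which ones accept key-based login.
# BatchMode=yes makes ssh fail fast instead of prompting for a password.
CATRACA_NODES="master slave01 slave02 slave03"
for host in $CATRACA_NODES; do
  echo "checking $host"
  # ssh -o BatchMode=yes "$host" true && echo "$host: key login OK"
done
```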
Next, update the package lists by running the following command:
sudo apt-get update
Once you have finished, run the following command to install JDK 8:
sudo apt-get install openjdk-8-jdk
You can also verify that JDK 8 is installed properly by running the following command:
sudo java -version
You should see output similar to this:
java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
This section will let you install all the necessary software to run the Kafka Cluster.
Before installing Apache Kafka, you will need to have ZooKeeper available and running. ZooKeeper is an open-source service for maintaining configuration information and providing naming, distributed synchronization, and group services.
The ZooKeeper package is available in Ubuntu's default repository; you can install it by running the following command:
sudo apt-get install zookeeperd
Once installation is finished, it will be started as a daemon automatically. By default, ZooKeeper will run on port 2181.
You can test it by running the following command:
sudo apt-get install net-tools
netstat -ant | grep :2181
If everything is fine, you should see the following output:
tcp6 0 0 :::2181 :::* LISTEN
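ZooKeeper also answers simple four-letter health commands on its client port; a sketch of such a check (the nc call is commented so the snippet is safe to review first):

```shell
# Sending the four-letter command "ruok" to ZooKeeper's client port
# returns "imok" when the server is healthy.
ZK_HOST=localhost
ZK_PORT=2181
echo "health check target: $ZK_HOST:$ZK_PORT"
# echo ruok | nc -q 1 "$ZK_HOST" "$ZK_PORT"   # expect: imok
```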
In CATRACA, Apache Kafka acts as a kind of universal pipeline for data. Kafka is a distributed messaging system, originally developed at LinkedIn.
Download the Apache Kafka package:
wget http://apache-mirror.rbc.ru/pub/apache/kafka/1.1.0/kafka_2.12-1.1.0.tgz
Next, create a directory for Kafka installation:
mkdir /opt/kafka
Extract the downloaded archive using the tar command into /opt/kafka:
tar -xvf kafka_2.12-1.1.0.tgz -C /opt/kafka/
The next step is to start the Kafka server. You can start it by running the kafka-server-start.sh script located in the /opt/kafka/kafka_2.12-1.1.0/bin/ directory:
sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.12-1.1.0/config/server.properties
If the server has started successfully, you should see output like the following:
[2017-07-22 21:43:48,279] WARN No meta.properties file under dir /tmp/kafka-logs/meta.properties
(kafka.server.BrokerMetadataCheckpoint)
[2017-07-22 21:43:48,516] INFO Kafka version : 0.10.0.1 (org.apache.kafka.common.utils.AppInfoParser)
[2017-07-22 21:43:48,525] INFO Kafka commitId : a7a17cdec9eaa6c5 (org.apache.kafka.common.utils.AppInfoParser)
[2017-07-22 21:43:48,527] INFO [Kafka Server 0], started (kafka.server.KafkaServer)
[2017-07-22 21:43:48,555] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
If you want to start the Kafka server as a background process, append & at the end:
sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.12-1.1.0/config/server.properties &
You now have a Kafka server running and listening on port 9092.
Now it is time to verify if the Kafka server is operating correctly.
To test Kafka, create a sample topic with name "testing" in Apache Kafka using the following command:
sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic testing
You should see the following output:
Created topic "testing".
Now, list the available topics on Apache Kafka by running the following command:
sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-topics.sh --list --bootstrap-server localhost:9092
You should see the following output:
testing
Now, publish sample messages to the Apache Kafka topic called testing by using the following producer command:
sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic testing
After running the above command, enter a message like "CATRACA topic working!!" and press Enter, then enter another message like "Where are you?".
Now, in another terminal, use the consumer command to check the messages on the Apache Kafka topic called testing:
sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic testing --from-beginning
You should see the following output:
CATRACA topic working!!
Where are you?
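The produce/consume round trip above can also be driven non-interactively, which is handy for scripted health checks. A sketch assuming the broker and the testing topic from the previous steps (the broker calls are commented so the snippet is safe to review first):

```shell
# Write the test message to a file, then pipe it through the console producer.
KAFKA_BIN=/opt/kafka/kafka_2.12-1.1.0/bin
echo "CATRACA topic working!!" > /tmp/catraca_msg.txt
cat /tmp/catraca_msg.txt
# sudo $KAFKA_BIN/kafka-console-producer.sh --broker-list localhost:9092 --topic testing < /tmp/catraca_msg.txt
# Read it back; --max-messages 1 makes the consumer exit after one message.
# sudo $KAFKA_BIN/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic testing --from-beginning --max-messages 1
```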
This section will guide you through installing Apache Hadoop and Apache Spark in a multi-node/cluster setup, as explained in the topology section. The following commands must be executed on ALL nodes of the Hadoop/Spark cluster. To run the commands on several nodes at once, we recommend the terminator terminal emulator.
In CATRACA, we use Apache Hadoop for distributed storage with the Hadoop Distributed File System (HDFS).
First, download Apache Hadoop (this tutorial uses hadoop-3.2.1.tar.gz) from the Apache download mirrors.
wget http://mirror.nbtelecom.com.br/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
Create the Hadoop installation parent directory
mkdir -p /opt/hadoop
Extract Hadoop source
tar -xzvf hadoop-3.2.1.tar.gz -C /opt/hadoop/ --strip-components=1
Update your user profile (~/.bashrc) by appending the following environment variables at the end of the file:
sudo nano ~/.bashrc
# -- HADOOP ENVIRONMENT VARIABLES START -- #
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# -- HADOOP ENVIRONMENT VARIABLES END -- #
Load the new environment variables.
source ~/.bashrc
To edit the file, run
sudo nano /opt/hadoop/etc/hadoop/hadoop-env.sh
and insert the following to update the JAVA_HOME variable (the path for the OpenJDK 8 installed earlier):
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
If you would like to run Hadoop as root (not recommended), you will possibly get an error message:
ERROR: Attempting to operate on hdfs namenode as root
ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.
To solve it, add these lines to hadoop-env.sh on all nodes:
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root
To edit the file, enter the following command
sudo nano /opt/hadoop/etc/hadoop/core-site.xml
Paste these lines inside the <configuration> tag:
<property>
<name>fs.defaultFS</name>
<value>hdfs://master:9000</value>
</property>
To edit the file, enter the following command
sudo nano /opt/hadoop/etc/hadoop/hdfs-site.xml
Paste these lines inside the <configuration> tag:
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop_tmp/hdfs/datanode</value>
</property>
To edit the file, enter the following command
sudo nano /opt/hadoop/etc/hadoop/yarn-site.xml
Paste/update these lines inside the <configuration> tag:
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8035</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8050</value>
</property>
To edit the file, run the following command
sudo nano /opt/hadoop/etc/hadoop/mapred-site.xml
Paste these lines inside the <configuration> tag:
<property>
<name>mapreduce.job.tracker</name>
<value>master:5431</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Specify the master node of the Hadoop cluster.
To edit the file, run the following command
sudo nano /opt/hadoop/etc/hadoop/masters
Add the name of the master node
master
Create the namenode directory
mkdir -p /opt/hadoop_tmp/hdfs/namenode
Specify the slave nodes of the Hadoop cluster.
To edit the file, run
sudo nano /opt/hadoop/etc/hadoop/workers
(in Hadoop 2.x this file is named slaves). Add the names of the slave nodes
slave01
slave02
slave03
Create the datanode directory
mkdir -p /opt/hadoop_tmp/hdfs/datanode
Format Namenode
hdfs namenode -format
Start the Hadoop cluster:
Start the HDFS daemons
/opt/hadoop/sbin/start-dfs.sh
Start the YARN daemons:
/opt/hadoop/sbin/start-yarn.sh
Verify Hadoop daemons:
jps
Output for the master node
3344 Jps
2233 SecondaryNameNode
2908 ResourceManager
2544 NameNode
Output for the slave nodes
3344 Jps
2233 NodeManager
2544 DataNode
Create a folder in the Hadoop file system
hdfs dfsadmin -safemode leave
hadoop fs -mkdir /user
hadoop fs -mkdir /user/app
Upload a file to the newly created path:
hdfs dfs -put <file> /user/app
hdfs dfs -ls /user/app/
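A quick end-to-end check of the steps above is to round-trip a small file through HDFS; a sketch assuming the daemons are running (the hdfs calls are commented for review):

```shell
# Create a small local file, then push it into the /user/app path created above.
echo "hello catraca" > /tmp/sample.txt
cat /tmp/sample.txt
# hdfs dfs -put /tmp/sample.txt /user/app/
# hdfs dfs -cat /user/app/sample.txt    # should print the same line back
```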
For more HDFS commands, see the HDFS documentation.
Ports:
- For HDFS web console – http://master:9870
- For ResourceManager – http://master:8088
Spark is used in CATRACA as the core processing engine: it ingests data coming from Kafka, analyzes and classifies the data, and then sends the results to the visualization layer.
Install Scala on all machines:
sudo apt-get install scala
Download and install Spark:
wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
mkdir /opt/spark
tar xzf spark-2.1.0-bin-hadoop2.7.tgz -C /opt/spark --strip-components=1
Now edit the .bashrc file located in the user's home directory
nano ~/.bashrc
and add the following environment variables:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=ipython
Save and close the ~/.bashrc file with Ctrl+X, then Y.
Load the new environment variables.
source ~/.bashrc
Now create the configuration file spark-env.sh in $SPARK_HOME/conf/ from its template and edit it:
cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
nano $SPARK_HOME/conf/spark-env.sh
and insert the following parameters:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_WORKER_CORES=8
export IPYTHON=1
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=ipython3
Create the configuration file slaves (in $SPARK_HOME/conf/) and add the following entries:
nano $SPARK_HOME/conf/slaves
slave01
slave02
slave03
Rename the spark-defaults template:
mv /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf
Update the spark-defaults.conf file
nano /opt/spark/conf/spark-defaults.conf
add the following lines
spark.master spark://master:7077
spark.eventLog.enabled true
spark.eventLog.dir /tmp/
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
NOTE: the spark.driver.memory parameter must be adjusted according to the slave machines; the MINIMUM configuration is 512m.
Create the spark-events folder on the master
mkdir /tmp/spark-events
To add dependencies, go to $SPARK_HOME (/opt/spark) and download the following packages on all nodes:
wget http://central.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.1.1/spark-streaming-kafka-0-8-assembly_2.11-2.1.1.jar
wget http://central.maven.org/maven2/org/elasticsearch/elasticsearch-spark-20_2.10/5.5.1/elasticsearch-spark-20_2.10-5.5.1.jar
On the master node start the cluster:
$SPARK_HOME/sbin/start-all.sh
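Once the master and workers are up, a simple way to validate the cluster is to submit the Pi example that ships with Spark; a sketch assuming the /opt/spark layout and the spark://master:7077 master URL configured above (the submit call is commented for review):

```shell
# Submit the bundled Pi example against the standalone master.
SPARK_MASTER=spark://master:7077
echo "submitting Pi example to $SPARK_MASTER"
# /opt/spark/bin/spark-submit --master "$SPARK_MASTER" /opt/spark/examples/src/main/python/pi.py 10
# A line like "Pi is roughly 3.14..." in the output confirms the workers ran the job.
```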
Ports:
- Spark master web UI – http://master:8080
In CATRACA, we use the Elastic Stack components Logstash, Elasticsearch, and Kibana. Logstash is used to collect and parse logs from applications or an IDS and send them to Kafka. The output of the processing layer is sent to Elasticsearch, which provides fast search and storage. Elasticsearch communicates through queries with the user interface that runs in Kibana, which presents the results to the user.
We will install Elasticsearch using the package manager from the Elastic repository.
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list
sudo apt update
sudo apt-get install elasticsearch
Once the installation is completed, open the elasticsearch.yml file and configure the Elasticsearch instance:
sudo nano /etc/elasticsearch/elasticsearch.yml
cluster.name: spark
network.host: 0.0.0.0
http.port: 9200
NOTE: in /etc/elasticsearch/jvm.options, memory must be configured according to the machine; the MINIMUM is 512m:
-Xms512m
-Xmx512m
Start the Elasticsearch service and set it to automatically start on boot:
sudo systemctl restart elasticsearch
sudo systemctl enable elasticsearch
sudo update-rc.d elasticsearch defaults 95 10
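To confirm the node is up, query its HTTP port; a sketch to run on elastic01 (the curl call is commented for review):

```shell
# Elasticsearch answers on port 9200 with a JSON banner that includes
# the cluster_name configured in elasticsearch.yml ("spark" above).
ES_URL=http://localhost:9200
echo "checking $ES_URL"
# curl -s "$ES_URL"    # expect "cluster_name" : "spark" in the response
```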
Same as Elasticsearch, we will install Kibana using the package manager from the Elastic repository.
sudo apt-get install kibana
Once the installation is completed, open the kibana.yml file and configure the Kibana instance:
sudo nano /etc/kibana/kibana.yml
server.host: 0.0.0.0
elasticsearch.url: "http://10.10.10.15:9200"
Start the Kibana service and set it to start automatically on boot:
sudo systemctl restart kibana
sudo systemctl enable kibana
sudo update-rc.d kibana defaults 96 9
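As with Elasticsearch, a quick port check confirms Kibana is serving; a sketch to run on kibana01 (the curl call is commented for review):

```shell
# Kibana serves its UI on port 5601 by default.
KIBANA_URL=http://localhost:5601
echo "checking $KIBANA_URL"
# curl -s -o /dev/null -w "%{http_code}\n" "$KIBANA_URL"   # expect 200
```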
Open Kibana on localhost on port 5601 and click on the Dev Tools section.
Create a new index using this mapping. If an index name is requested during the import, choose spark.
Go to Dev Tools, paste the mapping content into the console, then press play (green button, top right).
The output should be like:
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "spark"
}
Then, go to Management > Index Patterns and set a new index pattern as spark.
In order to load the CATRACA dashboard, in Kibana go to Management > Saved Objects and import. Import the CATRACA Dashboard and refresh the page.
The final step is to install Logstash using the package manager from the Elastic repository.
sudo apt install logstash
Go to the dashboard tab in the control panel and you will see the CATRACA Dashboard.