Installation

Requirements

Apache Kafka 2.4.0 for Scala 2.12 (download)

Apache Spark 2.1.0 + Hadoop 2.7 (download)

Elasticsearch 5.5.2 (download)

Kibana 5.5.2 (download)

Logstash 5.5.2 (download)

CATRACA Topology

The topology used in CATRACA is shown in Figure 2.

This tutorial assumes all servers are in the network 10.10.10.0/24 with the following IP addresses:

  • kafka01: 10.10.10.7

  • kafka02: 10.10.10.3

  • master: 10.10.10.6

  • slave01: 10.10.10.9

  • slave02: 10.10.10.13

  • slave03: 10.10.10.11

  • elastic01: 10.10.10.15

  • kibana01: 10.10.10.10

The IP addresses, however, can be customized according to your network settings.

Basic installation (for ALL servers)

This installation process was done on Ubuntu Server 16.04 amd64 servers.

On all nodes, we suggest editing /etc/hosts as follows:

127.0.0.1 localhost
10.10.10.6 master
10.10.10.9 slave01
10.10.10.13 slave02
10.10.10.11 slave03
10.10.10.7 kafka01
10.10.10.3 kafka02
10.10.10.15 elastic01   
10.10.10.10 kibana01

Verify that each node has its own correct hostname, e.g. the master node hostname MUST be master. To change the name, run:

sudo hostname master
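
Note that the hostname command only changes the name until the next reboot. To make it persistent (assuming a systemd-based system such as Ubuntu 16.04), you can use hostnamectl instead, which also updates /etc/hostname:

sudo hostnamectl set-hostname master   # writes the name to /etc/hostname as well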

Requirements (install on ALL nodes):

sudo apt-get install python-setuptools python-dev build-essential python-pip
sudo pip install numpy sklearn scipy python-geoip python-geoip-geolite2

Verify that all requirements were installed with: sudo pip freeze
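
As an additional sanity check, you can try importing the main scientific packages directly; a minimal sketch (the GeoIP modules are left out here):

python -c "import numpy, scipy, sklearn; print('numpy', numpy.__version__, '| scipy', scipy.__version__)"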

Configure Public Key Authentication

Nodes must communicate with each other to send commands. To achieve this, we need to set up passwordless SSH between them.

To generate a new key pair, enter the following commands in the terminal of the master node:

sudo apt-get install openssh-server openssh-client

ssh-keygen -t rsa -P ""

Assuming your local user is called "localuser", you will see output similar to the following:

Generating public/private rsa key pair.
Enter file in which to save the key (/Users/localuser/.ssh/id_rsa):
Hit return to accept this file name and path (or enter a new name).

Next, you will be prompted for a passphrase to secure the key with. You may either enter a passphrase or leave the passphrase blank.

NOTE: If you leave the passphrase blank, you will be able to use the private key for authentication without entering a passphrase. If you enter a passphrase, you will need both the private key and the passphrase to log in. Securing your keys with passphrases is more secure, but both methods have their uses and are more secure than basic password authentication.

This generates a key pair, a private key id_rsa and a public key id_rsa.pub, in the .ssh directory of localuser's home directory.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Copy the public key that appears on the screen to your clipboard.

Add Public Key to New Remote User (slaves)

To enable the use of SSH key to authenticate as the new remote user, you must add the public key to a special file in the user's home directory.

For this, copy the content of ~/.ssh/id_rsa.pub from the master node and paste it into the ~/.ssh/authorized_keys file of all the slaves as well as the master.
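
Alternatively, assuming password authentication is still enabled on the remote nodes, ssh-copy-id (shipped with openssh-client) can install the key and set the directory permissions in one step, making the manual steps below unnecessary:

ssh-copy-id slave01   # repeat for slave02, slave03 and master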

Create a new directory called .ssh and restrict its permissions with the following commands:

mkdir .ssh
chmod 700 .ssh

Now open a file in .ssh called authorized_keys with a text editor. We will use nano to edit the file:

nano .ssh/authorized_keys

Now insert your public key (which should be in your clipboard) by pasting it into the editor.

Hit CTRL-X to exit the file, then Y to save the changes that you made, then ENTER to confirm the file name.

Now restrict the permissions of the authorized_keys file with this command:

chmod 600 .ssh/authorized_keys

Type this command once to return to the root user:

exit

Restart SSH:

service ssh restart

Verify in /etc/ssh/sshd_config that public key authentication is enabled:

PubkeyAuthentication yes

Repeat the previous steps to enable passwordless SSH on all slaves and on the master.

Check that this process was done correctly by connecting over SSH to all slave nodes and to the master:

  ssh slave01
  ssh slave02
  ssh slave03
  ssh master

JAVA

Next, update the package index by running the following command:

sudo apt-get update

Once you have finished, run the following command to install JDK 8:

sudo apt-get install openjdk-8-jdk

You can also verify that JDK 8 is installed properly by running the following command:

sudo java -version

You should see output similar to this:

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

Kafka Cluster

This section will let you install all the necessary software to run the Kafka Cluster.

Install ZooKeeper

Before installing Apache Kafka, you will need to have ZooKeeper available and running. ZooKeeper is an open source service for maintaining configuration information, naming, and providing distributed synchronization and group services.

The ZooKeeper package is available in Ubuntu's default repository; you can install it by running the following command:

sudo apt-get install zookeeperd

Once installation is finished, it will be started as a daemon automatically. By default, ZooKeeper will run on port 2181.

You can test it by running the following command:

sudo apt-get install net-tools
netstat -ant | grep :2181

If everything's fine, you should see the following Output:

tcp6       0      0 :::2181                 :::*                    LISTEN
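
You can also ask ZooKeeper directly whether it is healthy with the four-letter ruok command (a quick check, assuming netcat is installed); a healthy server answers imok:

echo ruok | nc localhost 2181   # should print: imok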

Apache Kafka

In CATRACA, Apache Kafka acts as a kind of universal pipeline for data. Kafka is a distributed messaging system, originally developed at LinkedIn.

Installing Kafka

Download the Apache Kafka package:

wget http://apache-mirror.rbc.ru/pub/apache/kafka/1.1.0/kafka_2.12-1.1.0.tgz

Next, create a directory for Kafka installation:

mkdir /opt/kafka

Extract the downloaded archive using tar command in /opt/kafka:

tar -xvf kafka_2.12-1.1.0.tgz -C /opt/kafka/
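
Since the topology above has two brokers (kafka01 and kafka02), each broker needs a unique ID and both should point to the same ZooKeeper instance before being started. A minimal sketch of the relevant lines in the broker configuration; the listener addresses and the choice of kafka01 as the ZooKeeper host are assumptions based on the topology:

sudo nano /opt/kafka/kafka_2.12-1.1.0/config/server.properties
broker.id=0                              # use broker.id=1 on kafka02
listeners=PLAINTEXT://10.10.10.7:9092    # use 10.10.10.3 on kafka02
zookeeper.connect=kafka01:2181
log.dirs=/tmp/kafka-logs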

Starting Kafka Server

The next step is to start the Kafka server. You can start it by running the kafka-server-start.sh script located in the /opt/kafka/kafka_2.12-1.1.0/bin/ directory.

sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.12-1.1.0/config/server.properties

If the server has started successfully, you should see output like the following:

[2017-07-22 21:43:48,279] WARN No meta.properties file under dir /tmp/kafka-logs/meta.properties 
(kafka.server.BrokerMetadataCheckpoint)
[2017-07-22 21:43:48,516] INFO Kafka version : 0.10.0.1 (org.apache.kafka.common.utils.AppInfoParser)
[2017-07-22 21:43:48,525] INFO Kafka commitId : a7a17cdec9eaa6c5 (org.apache.kafka.common.utils.AppInfoParser)
[2017-07-22 21:43:48,527] INFO [Kafka Server 0], started (kafka.server.KafkaServer)
[2017-07-22 21:43:48,555] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)

If you want to start the Kafka server as a background process, add an & at the end:

sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-server-start.sh /opt/kafka/kafka_2.12-1.1.0/config/server.properties &

You now have a Kafka server running and listening on port 9092.

Testing Kafka Server

Now it is time to verify if the Kafka server is operating correctly.

Create a simple topic

To test Kafka, create a sample topic with name "testing" in Apache Kafka using the following command:

sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic testing

You should see the following output:

Created topic "testing".

Now, list the available topics on Apache Kafka by running the following command:

sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-topics.sh --list --bootstrap-server localhost:9092

You should see the following output:

testing
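
You can also inspect the partition count, replication factor, and leader of the new topic with the describe option:

sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic testing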

Now, publish some sample messages to the Apache Kafka topic called testing using the following producer command:

sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic testing

After running the above command, enter a message such as "CATRACA topic working!!", press Enter, and then enter another message such as "Where are you?".

Now, use the consumer command to check for messages on the Apache Kafka topic called testing. In another terminal, run:

sudo /opt/kafka/kafka_2.12-1.1.0/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic testing --from-beginning

You should see the following output:

CATRACA topic working!!
Where are you?

Spark/Hadoop Cluster

This section will guide you through installing Apache Hadoop and Apache Spark in a multi-node/cluster setup, as explained in the topology section. The following commands must be executed on ALL nodes of the Hadoop/Spark cluster. To run the commands on several nodes at once, we recommend using terminator.

Hadoop

In CATRACA, we use Apache Hadoop for distributed storage with the Hadoop Distributed File System (HDFS).

First, you need to download Apache Hadoop (the commands below use hadoop-3.2.1.tar.gz) or the latest stable release from the Apache download mirrors.

APPLY TO ALL NODES (SLAVES AND MASTERS)

wget http://mirror.nbtelecom.com.br/apache/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz

Create the hadoop installation parent dir

mkdir -p /opt/hadoop

Extract Hadoop source

tar -xzvf hadoop-3.2.1.tar.gz -C /opt/hadoop/ --strip-components=1

User profile:

sudo nano ~/.bashrc

Update the user configuration file by appending the following environment variables at the end of the file.

# -- HADOOP ENVIRONMENT VARIABLES START -- #
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# -- HADOOP ENVIRONMENT VARIABLES END -- #

Load the new environment variables.

source ~/.bashrc

Configuration file: hadoop-env.sh

To edit the file, run

sudo nano /opt/hadoop/etc/hadoop/hadoop-env.sh

And insert the following to set the JAVA_HOME variable (this path corresponds to the openjdk-8-jdk package installed earlier):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64

If you would like to run Hadoop as root (not recommended), you will probably get an error message:

ERROR: Attempting to operate on hdfs namenode as root ERROR: but there is no HDFS_NAMENODE_USER defined. Aborting operation.

To solve it, add these lines to hadoop-env.sh on all nodes:

export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_JOURNALNODE_USER=root

Configuration file: core-site.xml

To edit the file, enter the following command

sudo nano /opt/hadoop/etc/hadoop/core-site.xml

Paste these lines into <configuration> tag

<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
</property>

Configuration file: hdfs-site.xml

To edit the file, enter the following command

sudo nano /opt/hadoop/etc/hadoop/hdfs-site.xml

Paste these lines into <configuration> tag

<property>
      <name>dfs.replication</name>
      <value>3</value>
</property>
<property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/opt/hadoop_tmp/hdfs/namenode</value>
 </property>
 <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/opt/hadoop_tmp/hdfs/datanode</value>
 </property>

Configuration file: yarn-site.xml

To edit the file, enter the following command

sudo nano /opt/hadoop/etc/hadoop/yarn-site.xml

Paste/Update these lines into <configuration> tag

<property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
<value>master:8025</value>
</property>
<property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>master:8035</value>
</property>
<property>
    <name>yarn.resourcemanager.address</name>
    <value>master:8050</value>
</property>

Configuration file: mapred-site.xml

To edit the file, run the following command

sudo nano /opt/hadoop/etc/hadoop/mapred-site.xml

Paste these lines into <configuration> tag

<property>
    <name>mapreduce.job.tracker</name>
    <value>master:5431</value>
</property>
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

APPLY ONLY TO THE MASTER NODE

Update the masters file of the Hadoop cluster.

To edit the file, run the following command

sudo nano /opt/hadoop/etc/hadoop/masters

Add the name of the master node

master

Create namenode dir

mkdir -p /opt/hadoop_tmp/hdfs/namenode

APPLY ONLY TO THE SLAVES NODES

Update the workers file of the Hadoop cluster (in Hadoop 2.x this file is named slaves instead of workers).

To edit the file, run

sudo nano /opt/hadoop/etc/hadoop/workers

Add the names of the slave nodes

slave01
slave02
slave03

Create datanode dir

mkdir -p /opt/hadoop_tmp/hdfs/datanode

APPLY TO ALL NODES (SLAVES AND MASTERS)

Format Namenode

hdfs namenode -format

APPLY ONLY TO THE MASTER NODE

Execute the hadoop cluster:

Start hdfs daemons

/opt/hadoop/sbin/start-dfs.sh

Start MapReduce daemons:

/opt/hadoop/sbin/start-yarn.sh

APPLY TO ALL NODES (SLAVES AND MASTERS)

Verify Hadoop daemons:

jps

Output for Master node

3344 Jps
2233 SecondaryNameNode
2908 ResourceManager
2544 NameNode

Output for Slave nodes

3344 Jps
2233 NodeManager
2544 DataNode
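
Once the daemons are running, you can also confirm from the master that all three DataNodes have registered with the NameNode:

hdfs dfsadmin -report   # lists the live DataNodes and their capacity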

APPLY ONLY TO THE MASTER NODE

Create a folder in the Hadoop file system:

hdfs dfsadmin -safemode leave
hadoop fs -mkdir /user
hadoop fs -mkdir /user/app

Upload a file to the newly created path:

hdfs dfs -put <file> /user/app
hdfs dfs -ls /user/app/

For more HDFS commands, see the HDFS documentation.

Spark

Spark is the main processing core of CATRACA: it ingests the data coming from Kafka, analyzes and classifies it, and sends the results to the visualization layer.

Install Scala on all machines:

sudo apt-get install scala

Download and install spark

wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
mkdir /opt/spark
tar xzf spark-2.1.0-bin-hadoop2.7.tgz -C /opt/spark --strip-components=1

Now edit the .bashrc file located in the user's home directory:

nano ~/.bashrc 

And add the following environment variables:

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=ipython

Save and close the ~/.bashrc file with Ctrl+X and Y.

Load the new environment variables.

source ~/.bashrc 

Now edit configuration file spark-env.sh in $SPARK_HOME/conf/

cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
nano $SPARK_HOME/conf/spark-env.sh

And insert the following parameters

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_WORKER_CORES=8
export IPYTHON=1
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=ipython3

Create the configuration file slaves (in $SPARK_HOME/conf/) and add the following entries:

nano $SPARK_HOME/conf/slaves
slave01
slave02
slave03

Rename the spark-defaults.conf template:

mv /opt/spark/conf/spark-defaults.conf.template /opt/spark/conf/spark-defaults.conf

Update the spark-defaults.conf file:

nano /opt/spark/conf/spark-defaults.conf

add the following lines

spark.master                     spark://master:7077
spark.eventLog.enabled           true
spark.eventLog.dir               /tmp/
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              5g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

NOTE: the spark.driver.memory parameter must be adjusted according to the slave machines; the MINIMUM configuration is 512m.

Create the spark-events folder in the master

mkdir /tmp/spark-events 

To add dependencies, go to $SPARK_HOME (/opt/spark) and download the following packages on all nodes:

wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kafka-0-8-assembly_2.11/2.1.1/spark-streaming-kafka-0-8-assembly_2.11-2.1.1.jar
wget https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-20_2.10/5.5.1/elasticsearch-spark-20_2.10-5.5.1.jar

On the master node start the cluster:

$SPARK_HOME/sbin/start-all.sh
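
The standalone cluster should now be up; its web UI is available on port 8080 of the master node. As a sketch of how a job would be submitted against this cluster with the downloaded connectors (catraca_job.py is a placeholder name, not a file shipped with this guide):

$SPARK_HOME/bin/spark-submit \
  --master spark://master:7077 \
  --jars /opt/spark/spark-streaming-kafka-0-8-assembly_2.11-2.1.1.jar,/opt/spark/elasticsearch-spark-20_2.10-5.5.1.jar \
  catraca_job.py   # placeholder for the CATRACA streaming application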

ElasticStack Cluster

In CATRACA, we use the Elastic Stack tools Logstash, Elasticsearch, and Kibana. Logstash is used to collect and parse logs from applications or an IDS and send them to Kafka. The output of the processing layer is sent to Elasticsearch, which provides fast search and storage. Elasticsearch answers queries from the user interface running in Kibana, which presents the results to the user.

ElasticSearch

We will install Elasticsearch using the package manager from the Elastic repository.

wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list
sudo apt update
sudo apt-get install elasticsearch

Once the installation is completed, open the elasticsearch.yml file and set the cluster name, bind address, and HTTP port of the Elasticsearch instance:

sudo nano /etc/elasticsearch/elasticsearch.yml
cluster.name: spark
network.host: 0.0.0.0
http.port: 9200

NOTE: in /etc/elasticsearch/jvm.options the heap memory must be configured according to the machine; the MINIMUM is 512m.

-Xms512m
-Xmx512m

Start the Elasticsearch service and set it to automatically start on boot:

sudo systemctl restart elasticsearch
sudo systemctl enable elasticsearch
sudo update-rc.d elasticsearch defaults 95 10
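
You can verify that Elasticsearch is reachable from the other nodes with a simple HTTP request; the response should be a JSON document whose cluster_name is spark:

curl http://10.10.10.15:9200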

Kibana

As with Elasticsearch, we will install Kibana using the package manager from the Elastic repository.

sudo apt-get install kibana

Once the installation is completed, open the kibana.yml file and configure the server address and the Elasticsearch endpoint of the Kibana instance:

sudo nano /etc/kibana/kibana.yml
server.host: 0.0.0.0
elasticsearch.url: "http://10.10.10.15:9200"

Start the Kibana service and set it to start automatically on boot:

sudo systemctl restart kibana
sudo systemctl enable kibana
sudo update-rc.d kibana defaults 96 9

Open Kibana in your browser on port 5601 and click on the Dev Tools section:

Create a new index using this mapping. If an index name is requested during the import, choose spark.

Go to Dev Tools, paste the mapping content into the console, and press play (the green button at the top right).

The output should be like:

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "spark"
}
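
If you prefer the command line to the Kibana Dev Tools console, the same index can be created with curl; replace the empty body below with the content of the CATRACA mapping file (not reproduced here):

curl -XPUT 'http://10.10.10.15:9200/spark' -H 'Content-Type: application/json' -d '{}'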

Then, go to Management > Index Patterns and create a new index pattern named spark.

To load the CATRACA dashboard, go to Management > Saved Objects in Kibana, import the CATRACA Dashboard, and refresh the page.

Logstash

The final step is to install Logstash using the package manager from the Elastic repository.

sudo apt install logstash
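
This guide does not ship a ready-made pipeline, but as a minimal sketch of how Logstash could forward parsed logs to the Kafka cluster (the file path, codec, and topic name below are assumptions, not part of CATRACA), a pipeline definition could look like this:

sudo nano /etc/logstash/conf.d/catraca.conf
input {
  file {
    path => "/var/log/suricata/eve.json"   # example log source (assumption)
    codec => "json"
  }
}
output {
  kafka {
    bootstrap_servers => "10.10.10.7:9092,10.10.10.3:9092"
    topic_id => "flows"                    # example topic name (assumption)
  }
}

Once the pipeline file is in place, start Logstash with sudo systemctl start logstash.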

Go to the Dashboard tab in Kibana and you will see the CATRACA Dashboard.