
Setup Guide

First, start Druid from the apache-druid-26.0.0 folder:

bin/start-nano-quickstart

Then start the Kafka server (from the kafka_2.13-2.7.0 folder):

./bin/kafka-server-start.sh config/server.properties

Both run on the same ZooKeeper instance, so the start-up order matters.
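
If the topic does not exist yet, it can be created up front. A minimal sketch, assuming the topic name test1 used by the query further below (run from the Kafka folder):

./bin/kafka-topics.sh --create --topic test1 --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1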

Then run the DruidProducerMain.java file to push records onto the topic mentioned in the code; a rough sketch of such a producer follows.
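
A minimal sketch of what such a producer looks like. The topic name test1 and the tempValue field are assumptions taken from the SQL query below; the actual DruidProducerMain.java may differ.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Instant;
import java.util.Properties;
import java.util.Random;

public class DruidProducerMain {
    public static void main(String[] args) {
        Properties props = new Properties();
        // broker started above; on a remote Linux machine use its advertised address instead
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        Random random = new Random();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // JSON event with a timestamp and a tempValue field; the Kafka ingestion
                // spec on the Druid side maps the timestamp column to __time
                String event = String.format("{\"timestamp\":\"%s\",\"tempValue\":%.2f}",
                        Instant.now(), 20 + random.nextDouble() * 10);
                producer.send(new ProducerRecord<>("test1", event));
            }
            producer.flush();
        }
    }
}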

When shutting down, stop the Kafka server first, then the Druid server.

Druid Service SQL calls

SQL calls can be made to localhost:8888:

{
  "query" : "SELECT TIME_FLOOR(\"__time\", 'P1D') AS \"__time_by_day\" , AVG(\"tempValue\") AS \"avg_tempValue\" FROM \"test1\" GROUP BY 1 ORDER BY 2 DESC"
}

This query gets the average temperature on a daily basis (TIME_FLOOR with a 'P1D' period).

To send the request, use the following command (with the query saved in a query.json file):

curl -X POST -H'Content-Type: application/json' http://localhost:8888/druid/v2/sql/ -d @query.json
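
For an hourly average instead, the same shape of query works with a 'PT1H' period; a sketch of the request body:

{
  "query" : "SELECT TIME_FLOOR(\"__time\", 'PT1H') AS \"__time_by_hour\" , AVG(\"tempValue\") AS \"avg_tempValue\" FROM \"test1\" GROUP BY 1 ORDER BY 2 DESC"
}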

Running Kafka and Druid on a Linux machine

For Druid to work, make sure that port 8080 on the Linux machine is not already in use. We can verify this with

lsof -i :8080

and if some process is using it, we can kill it (if possible) using

kill -9 <process ID from the lsof output>

For Kafka to work, we need to change server.properties so that it accepts connections from hosts other than localhost, so we change advertised.listeners:

advertised.listeners = PLAINTEXT://192.168.1.21:9092

Here the IP address is that of the Linux machine running the Kafka and Druid instances.
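
Remote clients (such as the producer sketch above) would then use that address instead of localhost, roughly:

props.put("bootstrap.servers", "192.168.1.21:9092");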

Setting up Hadoop as the deep storage

We first download and untar Hadoop version 2.10.2, then change etc/hadoop/core-site.xml to:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

and etc/hadoop/hdfs-site.xml to:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Afterwards, we need to format the filesystem for the NameNode:

 bin/hdfs namenode -format

To start and stop the DFS daemons:

## to start
sbin/start-dfs.sh
## to stop
sbin/stop-dfs.sh

We can make directories in the Hadoop Distributed File System (HDFS) using

bin/hdfs dfs -mkdir input
# we can make a root path like
bin/hdfs dfs -mkdir -p /user/root

We create two paths for the Druid project (commands shown below):

  • druid/segments, to store data segments
  • druid/index-logs, to store the logs from the indexing (ingestion) tasks.
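
The corresponding HDFS commands would look roughly like this (paths matching the storage properties set below):

bin/hdfs dfs -mkdir -p /user/root/druid/segments
bin/hdfs dfs -mkdir -p /user/root/druid/index-logs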

Next, we copy all the Hadoop config files from etc/hadoop/*.xml to conf/druid/single-server/nano-quickstart/_common in the Druid installation,

and change the druid.storage properties in conf/druid/single-server/nano-quickstart/_common/common.runtime.properties:

druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://localhost:9000/user/root/druid/segments

# and for the indexing logs

druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=hdfs://localhost:9000/user/root/druid/index-logs
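
Note that HDFS deep storage also relies on the druid-hdfs-storage extension; it typically has to be listed in the same common.runtime.properties. A sketch (the exact list depends on the existing config, e.g. the Kafka indexing service extension used for ingestion):

druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service"]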

Running the entire application

First start the HDFS filesystem, then Druid, and then the Kafka broker:

# 1. inside the hadoop 2.10.2 folder
sbin/start-dfs.sh

# 2. inside the druid folder
bin/start-nano-quickstart

# 3. inside the kafka folder
./bin/kafka-server-start.sh config/server.properties

The application in its current state is working and is able to use HDFS as its deep storage.

Clustered Deployment

  • We copied the config files from the nano-quickstart into the cluster config.

  • All the files are the same except the indexer runtime configuration, which is not present in the nano-quickstart. Start-up for clustered deployment:

  • First start the HDFS server on the Ubuntu server:

    • sbin/start-dfs.sh [from the hadoop folder in the Druid project]
  • Next start the master server, data server, and query server:

    • bin/start-cluster-master-with-zk-server in DruidClusterProject on 10.10.10.45

    • bin/start-cluster-data-server in DruidClusterProject-Data on 10.10.10.46

    • bin/start-cluster-query-server in DruidClusterProject-Query on 10.10.10.47

  • Start the Kafka broker:

    • ./bin/kafka-server-start.sh config/server.properties on 10.10.10.48, inside the kafka_2.13-3.5.0 folder
# for the master server in DruidClusterProject
bin/start-cluster-master-with-zk-server
# for the data server in the DruidClusterProject-data
bin/start-cluster-data-server
# for the query server in the DruidClusterProject-query
bin/start-cluster-query-server

The order matters, so first start the master, then the data server, and then the query server.
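
For the data and query servers to reach ZooKeeper (which runs alongside the master via start-cluster-master-with-zk-server), the cluster common.runtime.properties typically points at the master machine; a rough sketch assuming the IPs above (the actual cluster configs may differ):

druid.zk.service.host=10.10.10.45
# druid.host is set per machine, e.g. on the data server:
druid.host=10.10.10.46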

Appendix- useful commands

To find where the Java SDK is:

whereis javac

To get the full path:

readlink -f /usr/bin/javac

the "/usr/bin/javac" is the output from above command (whereis javac)

To list all the places where Java SDKs are installed:

update-alternatives --list java
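
These paths are typically used to point JAVA_HOME at the right JDK for Hadoop and Druid; one common way, assuming the readlink output above:

export JAVA_HOME=$(dirname $(dirname $(readlink -f /usr/bin/javac)))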
