Zubair Saiyed edited this page Sep 7, 2016 · 3 revisions

Presto

Table of Contents

  1. Introduction
  2. Requirements
  3. Dependencies
  4. Install Presto
  5. Configure Presto
  6. Starting Presto
  7. Presto Command Line Interface (CLI)
  8. Checking Presto nodes
  9. Presto Example
  10. Requirements
  11. Create a table in Cassandra
  12. Create a table in Hive

### Introduction

Presto is a distributed SQL query engine designed to efficiently query large data sets spread across one or more heterogeneous data sources. If you work with terabytes or petabytes of data, you are likely using tools that interact with Hadoop and HDFS. Presto was designed as an alternative to tools that query HDFS through pipelines of MapReduce jobs, such as Hive or Pig, but it is not limited to HDFS: it can be, and has been, extended to operate over other kinds of data sources, including traditional relational databases and systems such as Cassandra. Presto can even join tables from different data sources, and it does all of this in memory.

Presto was designed to handle data warehousing and analytics: data analysis, aggregating large amounts of data and producing reports. These workloads are often classified as Online Analytical Processing (OLAP).

### Requirements
  • Java 8
### Dependencies

The latest Presto version requires Java 8. If Java 8 is not installed on all of your machines, install it as follows:

node:~$ sudo add-apt-repository ppa:webupd8team/java -y 
node:~$ sudo apt-get update 
node:~$ sudo apt-get install oracle-java8-installer
### Install Presto

This installation process needs to be executed on every node you want to include in your Presto cluster.

node:~$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.115/presto-server-0.115.tar.gz -P ~/Downloads
node:~$ sudo tar zxvf ~/Downloads/presto-server-* -C /usr/local
node:~$ sudo mv /usr/local/presto-server-* /usr/local/presto

Set the PRESTO_HOME environment variable and add it to PATH in .profile

node:~$ nano ~/.profile
Add the following for the Presto environment variables

export PRESTO_HOME=/usr/local/presto
export PATH=$PATH:$PRESTO_HOME/bin

node:~$ . ~/.profile

Presto needs a data directory for storing logs, local metadata, etc. We recommend creating this data directory outside of the installation directory, so that it can easily be preserved when upgrading Presto.

node:~$ sudo mkdir -p /var/presto/data
### Configure Presto

Create an etc directory inside the installation directory. This will hold the following configuration:

  • Node Properties: environmental configuration specific to each node
  • JVM Config: command line options for the Java Virtual Machine
  • Config Properties: configuration for the Presto server
  • Catalog Properties: configuration for Connectors (data sources)

Node Properties

The node properties file, etc/node.properties, contains configuration specific to each node. A node is a single installed instance of Presto on a machine. The following is a minimal etc/node.properties:

node:~$ mkdir $PRESTO_HOME/etc
node:~$ nano $PRESTO_HOME/etc/node.properties

Add the following to the node.properties file. Note that Java properties files only support comments on their own lines, not at the end of a value:

# All nodes in a cluster must share the same environment name
node.environment=production
# Presto stores logs and other data here
node.data-dir=/var/presto/data

Every node also needs a unique ID, so let's generate one and append it to node.properties

node:~$ TEMPUID=$(uuidgen)
node:~$ echo node.id=$TEMPUID >> $PRESTO_HOME/etc/node.properties
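The two steps above (writing the base properties, then appending a generated node.id) can be combined into a single pass. A minimal sketch; ETC_DIR points at a temporary directory here so it is safe to try, but on a real node you would use ETC_DIR=$PRESTO_HOME/etc:

```shell
# Sketch: build a complete node.properties in one pass.
# ETC_DIR is a temporary directory here; on a real node use
# ETC_DIR=$PRESTO_HOME/etc instead.
ETC_DIR=$(mktemp -d)
cat > "$ETC_DIR/node.properties" <<EOF
node.environment=production
node.data-dir=/var/presto/data
node.id=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)
EOF
cat "$ETC_DIR/node.properties"
```

The fallback to /proc/sys/kernel/random/uuid covers Linux machines where uuidgen is not installed.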

JVM Config

The JVM config file, etc/jvm.config, contains a list of command line options used for launching the Java Virtual Machine.

node:~$ nano $PRESTO_HOME/etc/jvm.config

Add the following to the jvm.config file

-server
-Xmx16G
-XX:+UseG1GC
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
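The -Xmx16G setting assumes the machine has well over 16 GB of RAM; before copying it verbatim, check total memory and leave headroom for the OS. A quick way to check on Linux (this assumes /proc/meminfo is available):

```shell
# Print total RAM in GB so you can size -Xmx with headroom for the OS.
# Assumes a Linux machine with /proc/meminfo.
awk '/MemTotal/ { printf "%.1f GB total RAM\n", $2 / 1024 / 1024 }' /proc/meminfo
```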

Config Properties

The config properties file, etc/config.properties, contains the configuration for the Presto server. Every Presto server can function as both a coordinator and a worker, but dedicating a single machine to only perform coordination work provides the best performance on larger clusters.

For the node you choose to be the coordinator, configure it as follows

coordinator-node:~$ nano $PRESTO_HOME/etc/config.properties

Add the following to the config.properties file

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
task.max-memory=1GB
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
# Replace example.net with the coordinator's hostname or IP
discovery.uri=http://example.net:8080

Configure all of the worker nodes as follows

worker-node:~$ nano $PRESTO_HOME/etc/config.properties

Add the following to the config.properties file

coordinator=false
http-server.http.port=8080
task.max-memory=1GB
query.max-memory=50GB
query.max-memory-per-node=1GB
# Must point at the coordinator: replace example.net with its hostname or IP
discovery.uri=http://example.net:8080
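Since every worker's config.properties differs from the coordinator's only in the discovery target, it can be templated so the coordinator address is set in exactly one place. A minimal sketch; COORDINATOR is a placeholder hostname, and ETC_DIR is a temporary directory here (use $PRESTO_HOME/etc on a real worker):

```shell
# Sketch: template the worker config so the coordinator address is
# defined once. COORDINATOR is an assumed placeholder hostname.
COORDINATOR=example.net
ETC_DIR=$(mktemp -d)   # on a real worker: ETC_DIR=$PRESTO_HOME/etc
cat > "$ETC_DIR/config.properties" <<EOF
coordinator=false
http-server.http.port=8080
task.max-memory=1GB
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery.uri=http://$COORDINATOR:8080
EOF
cat "$ETC_DIR/config.properties"
```

The same file could then be copied to each worker (for example with scp), rather than editing it by hand on every node.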

Next, create log.properties in etc as follows. This should be done on both coordinator and worker nodes.

node:~$ nano $PRESTO_HOME/etc/log.properties

Add the following to the log.properties file

com.facebook.presto=DEBUG

Catalog Properties

Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog. For example, the Hive connector maps each Hive database to a schema, so if the Hive connector is mounted as the hive catalog, and Hive contains a table clicks in database web, that table would be accessed in Presto as hive.web.clicks. Let’s mount Hive, Cassandra and jmx connectors as follows:

This should be done on both coordinator and worker nodes.

node:~$ mkdir $PRESTO_HOME/etc/catalog
node:~$ nano $PRESTO_HOME/etc/catalog/jmx.properties

Add the following to the jmx.properties file

connector.name=jmx

node:~$ nano $PRESTO_HOME/etc/catalog/cassandra.properties

Add the following to the cassandra.properties file

connector.name=cassandra
# Public IP of at least one of the Cassandra nodes
cassandra.contact-points=<public-ip>

node:~$ nano $PRESTO_HOME/etc/catalog/hive.properties

Add the following to the hive.properties file

connector.name=hive-hadoop2
# Private IP of the node where the Hive metastore runs
# (here, the same node as Hadoop's namenode)
hive.metastore.uri=thrift://<private-ip>:9083
### Starting Presto

This should be done on all the nodes you have your Presto cluster on.

node:~$ sudo $PRESTO_HOME/bin/launcher start

You can check the Web UI at http://<coordinator-node-public-dns>:8080.

To stop presto on any node, do the following:

node:~$ sudo $PRESTO_HOME/bin/launcher stop
### Presto Command Line Interface (CLI)

The CLI is typically installed on the coordinator node, but it can be installed on any machine, even one without Presto, as long as you point it at your coordinator.

To install the CLI on the coordinator (or a worker) node:

node:~$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.115/presto-cli-0.115-executable.jar
node:~$ sudo mv presto-cli-0.115-executable.jar $PRESTO_HOME/presto_cli
node:~$ sudo chmod +x $PRESTO_HOME/presto_cli

Let’s start the CLI

node:~$ $PRESTO_HOME/presto_cli --server localhost:8080 --catalog hive --schema default

This connects to Hive's default database. You can also connect to a Cassandra keyspace as follows:

node:~$ $PRESTO_HOME/presto_cli --server localhost:8080 --catalog cassandra --schema <keyspace-name>

At this point, don't worry about how to use Presto; the Presto Example section below walks through querying data.

If you are installing the CLI on a node other than your coordinator node, do as follows:

node:~$ wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.115/presto-cli-0.115-executable.jar
node:~$ sudo mv presto-cli-0.115-executable.jar /usr/local/presto_cli
node:~$ sudo chmod +x /usr/local/presto_cli

Let’s start the CLI

node:~$ /usr/local/presto_cli --server <coordinator-node-private-ip>:8080 --catalog hive --schema default

This connects to Hive's default database. You can also connect to a Cassandra keyspace as follows:

node:~$ /usr/local/presto_cli --server <coordinator-node-private-ip>:8080 --catalog cassandra --schema <keyspace-name>
### Checking Presto nodes

Execute the following query on the node you have CLI installed on to see all nodes in the Presto Cluster.

First start the CLI

node:~$ /usr/local/presto_cli --server localhost:8080 --catalog hive --schema default 

presto:default> SELECT * FROM system.runtime.nodes;

You should see the coordinator node and all the worker nodes.

Your Presto cluster is ready to use. Refer to the Presto Example section below for how to use Presto to query data from Hive and Cassandra.

If you are curious, read this blog about Presto architecture.

### Presto Example

In this example we will query tables in Hive and Cassandra from Presto.

### Requirements
  • Hive
  • Cassandra
### Create a table in Cassandra

This could be done on any node in the Cassandra cluster.

node:~$ cqlsh

cqlsh> CREATE KEYSPACE presto WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 3};
cqlsh> USE presto;
cqlsh:presto> CREATE TABLE user_purchases (user_id INT, item TEXT, quantity INT, amount FLOAT, place TEXT, PRIMARY KEY (user_id));
cqlsh:presto> INSERT INTO user_purchases (user_id, item, quantity, amount, place) VALUES (1, 'Shirt', 2, 3050.50, 'New York');
cqlsh:presto> INSERT INTO user_purchases (user_id, item, quantity, amount, place) VALUES (2, 'Shoes', 3, 8140.60, 'Palo Alto');
cqlsh:presto> INSERT INTO user_purchases (user_id, item, quantity, amount, place) VALUES (3, 'Mobile Phone', 1, 18300.00, 'San Francisco');
cqlsh:presto> INSERT INTO user_purchases (user_id, item, quantity, amount, place) VALUES (4, 'Laptop', 1, 40140.60, 'New York');

Check the table contents

cqlsh:presto> SELECT * FROM user_purchases;

You should see the four rows you just inserted.

### Create a table in Hive

We will create a CSV file and populate it with some data. This should be done on the Hadoop namenode.

node:~$ nano user_info.csv

Add the following lines to the CSV file

1,Steven,Smith,24,42000,Male
2,Pawan,Lathwal,24,30000,Male
3,Mariya,Gilbert,25,44000,Female
4,Taylor,Lockwood,24,41000,Male
5,Sanjiv,Singh,25,51000,Male

node:~$ hdfs dfs -mkdir /data
node:~$ hdfs dfs -copyFromLocal user_info.csv /data
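Hive typically loads rows with missing fields as NULL columns rather than failing, so it can be worth sanity-checking the CSV before copying it into HDFS. A small sketch that recreates the sample file and verifies every row has the six expected fields:

```shell
# Recreate the sample CSV, then check that every row has exactly
# 6 comma-separated fields (Hive loads missing fields as NULLs).
cat > user_info.csv <<'EOF'
1,Steven,Smith,24,42000,Male
2,Pawan,Lathwal,24,30000,Male
3,Mariya,Gilbert,25,44000,Female
4,Taylor,Lockwood,24,41000,Male
5,Sanjiv,Singh,25,51000,Male
EOF
awk -F',' 'NF != 6 { bad++ } END { printf "%d malformed rows\n", bad }' user_info.csv
# → 0 malformed rows
```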

Start Hive on the namenode.

namenode:~$ hive
hive> CREATE EXTERNAL TABLE user_info (id INT, fname STRING, lname STRING, age INT, salary INT, gender STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
hive> LOAD DATA INPATH '/data/user_info.csv' INTO TABLE user_info;

Check the table contents

hive> SELECT * FROM user_info;

You should see the five rows from the CSV file.

Note that the Hive metastore must be running, since Presto reads table metadata and data locations through it. Start the metastore using the following command:

namenode:~$ hive --service metastore

Now, let’s start Presto CLI on the node you installed Presto CLI on.

node:~$ $PRESTO_HOME/presto_cli --server localhost:8080 --catalog cassandra --schema presto

You can check the Cassandra table

presto:presto> SHOW TABLES;

You can also check the table contents

presto:presto> SELECT * FROM user_purchases;

You can also check the Hive table contents

presto:presto> SELECT * FROM hive.default.user_info;

You can think of these two tables as living in the same database. Now, let's write a query that joins them:

presto:presto> SELECT fname, lname, age, item, place FROM user_purchases JOIN hive.default.user_info ON hive.default.user_info.id = user_id;

The result combines each user's name and age from Hive with their purchases from Cassandra.

Presto can perform joins on tables from multiple data sources and all the computation is done in memory, which makes it really powerful and useful for ad hoc queries.