# Presto Overview

## What is Presto?

> Presto is an open-source, distributed SQL query engine that can connect to multiple data sources simultaneously

When Hadoop became popular, Hive was its de facto standard tool for data warehousing and SQL-like queries. However, some limitations of Hive started to appear over time, and alternative tools began emerging. Among these tools were Presto (a fork of which, formerly PrestoSQL, is now called Trino) and Impala.

Facebook released Presto as open source in 2013. In 2015, Teradata joined the Presto community and started offering support and software updates for the tool.

Presto can be described as a data virtualisation or interactive querying tool. It runs on a distributed cluster of machines, connects to a variety of data stores (for both structured and unstructured data), and accesses the data in those stores using a single, integrated query.

For example, if we have data stored in HDFS, Amazon S3 and Postgres, we can access all 3 of them in the _same SELECT query_. This is a very powerful feature that distinguishes Presto from other tools.
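
As a rough sketch of what such a query might look like, the snippet below holds a single SELECT that joins tables from three catalogs. Presto addresses tables as `catalog.schema.table`. The catalog names (`postgresql`, `hdfs_hive`, `s3_hive`) and all schemas, tables and columns are hypothetical and depend entirely on how the connectors are configured. The small helper only lists the catalogs a query touches; it is an illustration, not a real SQL parser.

```python
import re

# Hypothetical federated query: every catalog, schema, table and column
# name below is an assumption made for illustration only.
FEDERATED_QUERY = """
SELECT c.customer_name, o.order_total, e.last_seen
FROM postgresql.public.customers AS c
JOIN hdfs_hive.sales.order_totals AS o ON o.customer_id = c.customer_id
JOIN s3_hive.web.last_seen AS e ON e.customer_id = c.customer_id
"""

def catalogs_in(query):
    """Return the catalog part of each catalog.schema.table after FROM/JOIN."""
    pattern = r"\b(?:FROM|JOIN)\s+(\w+)\.\w+\.\w+"
    return re.findall(pattern, query, flags=re.IGNORECASE)

catalogs = catalogs_in(FEDERATED_QUERY)
print(catalogs)
```

The point of the sketch is that the three data stores appear side by side in one statement, distinguished only by the catalog prefix.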

Below is an image demonstrating where Presto fits in relation to the various data stores:

<p align="center">
  <img src="./images/presto-overview3.png" width=450>
</p>


## Features of Presto

- Open-source tool that can connect to a wide variety of back-end data stores simultaneously (including HDFS, SQL databases, NoSQL data stores, S3 and Snowflake)
- Uses SQL for data querying. It has been dubbed the "SQL on Anything" engine.
- It can be considered as a "Hive 2.0" tool that excels in fast, interactive queries (rather than long-running batch data processing queries)
- Designed from the ground up mainly to perform low-latency interactive analytics on massive datasets (petabytes in size)
- Uses in-memory processing (similar to Apache Spark)
- Separates data storage activities from computational logic, and can scale each of them independently
- Provides industry-grade reliability. It's currently deployed by many well-known companies such as Facebook, Airbnb, Netflix, Dropbox and Groupon
- Presto processes data in a columnar fashion, so columnar file formats are the ones it can handle most efficiently (for instance, RCFile, ORC and Parquet)
    - Columnar compression helps to save up to 75% of disk storage space over regular file storage methods
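
To illustrate why a columnar layout suits analytical queries, here is a minimal sketch with a toy in-memory table. It compares how many values are touched when summing one column in a row layout versus a column layout, and shows run-length encoding as one simple form of columnar compression. The "values read" count is a deliberately simplified cost model, and real formats such as ORC and Parquet use far more elaborate encodings.

```python
from itertools import groupby

# Toy table in row layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "country": "UK", "amount": 10.0},
    {"order_id": 2, "country": "UK", "amount": 20.0},
    {"order_id": 3, "country": "UK", "amount": 5.0},
    {"order_id": 4, "country": "FR", "amount": 15.0},
]

def to_columns(records):
    """Pivot row records into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}

def sum_from_rows(records, column):
    """Row layout: every field of every record is touched to sum one column."""
    values_read = sum(len(record) for record in records)
    return sum(record[column] for record in records), values_read

def sum_from_columns(columns, column):
    """Column layout: only the requested column is touched."""
    values = columns[column]
    return sum(values), len(values)

def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

columns = to_columns(rows)
row_total, row_reads = sum_from_rows(rows, "amount")
col_total, col_reads = sum_from_columns(columns, "amount")
encoded_country = run_length_encode(columns["country"])
```

Both layouts give the same total, but the column layout touches only the `amount` values, and the repetitive `country` column collapses into two pairs.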

## Components of Presto

> Presto is a distributed application which consists of 3 main components: Coordinators (the tool's central brain), workers (which implement the actual data querying tasks) and clients (who submit query requests and get the results)

Below is a diagram showing how the 3 components fit together:

<p align="center">
  <img src="./images/presto-architecture2.png" width=600>
</p>

Next, let's take a closer look at each component:

### Coordinator

- Plays the role of the "central brain" for the tool
- All clients must connect to the coordinator in order to send query requests
- Coordinates the data transfer between the worker nodes and the client
- Communicates with all remote data sources to access _metadata_ and identify what tables and files are stored
- Responsible for parsing, analysing, planning and scheduling all client requests
- Contains 3 sub-components that assist in all of the above tasks:
  - Parser/analyser
  - Planner
  - Scheduler

### Worker
- Executes all the tasks that have been approved and scheduled by the coordinator
- Communicates with the remote data sources (such as HDFS) to read and access the data
- Performs all instructed tasks and submits feedback/results to the coordinator

### Client
- Different tools can be used to communicate with Presto and tell it which queries to run
- We call these _clients_. They send the query request to the coordinator and receive the final result
- Some examples include:
  - Presto's command line interface. Read more about it [here](https://prestodb.io/docs/current/installation/cli.html)
  - PopSQL, which is a graphical user interface client. Read more about it [here](https://popsql.com/presto-client)
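
Clients ultimately talk to the coordinator over HTTP. As a hedged sketch, the snippet below builds (but does not send) the initial request of Presto's client REST protocol: a POST to `/v1/statement` with the SQL text as the body and the user given in the `X-Presto-User` header. The coordinator host, port, user, catalog and schema are placeholders, and `build_statement_request` is a hypothetical helper name rather than part of any Presto library.

```python
import urllib.request

def build_statement_request(coordinator_url, sql, user, catalog=None, schema=None):
    """Build the initial POST that submits a query to the coordinator."""
    headers = {"X-Presto-User": user}
    if catalog:
        headers["X-Presto-Catalog"] = catalog
    if schema:
        headers["X-Presto-Schema"] = schema
    return urllib.request.Request(
        url=coordinator_url.rstrip("/") + "/v1/statement",
        data=sql.encode("utf-8"),  # the SQL text is the request body
        headers=headers,
        method="POST",
    )

# Placeholder values: replace with a real coordinator address and user.
request = build_statement_request(
    "http://presto-coordinator.example.com:8080",
    "SELECT 1",
    user="analyst",
    catalog="hive",
    schema="default",
)
```

Sending the request and then following the `nextUri` links in each JSON response until the results are complete is left out here, as it needs a running cluster. In practice the CLI or a client library handles that loop.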


## How do Presto Queries run?

At a high level, Presto queries go through 8 steps.

<p align="center">
  <img src="./images/presto-query-steps.png" width=900>
</p>


#### 1. Client submits a query
- The query automatically goes to the coordinator

#### 2. Coordinator analyses the query
- This analysis identifies which data sources are required, and what exactly needs to be done with that data

#### 3. Coordinator goes to those data sources
- It checks the metadata, available tables, column names, and whether or not the user has access to those tables

#### 4. Coordinator creates the _query plan_
- This is done by determining the required steps, checking which workers are available to run those steps, and scheduling the execution of those steps

#### 5. Coordinator communicates the plan to workers
- Once the plan is finalised, it is communicated to the appropriate worker nodes for execution

#### 6. Workers fetch required data
- Once the query plan is communicated, the workers fetch the data from the data sources identified earlier to prepare for data-related tasks

#### 7. Workers implement data transformations
- Next, the workers perform required tasks (such as data aggregation), and send the results back to the coordinator

#### 8. Coordinator sends results to the client
- Once the coordinator has received all required results from all workers, the final output is returned to the client/user
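
The eight steps above can be mimicked with a small toy simulation. This is only a sketch under heavy simplifications: the "query" is a Python dict instead of SQL, the "data source" is an in-memory dict of pre-split chunks, and threads stand in for worker nodes. None of the function names correspond to Presto internals.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy data source: one table, already divided into chunks that can be
# read independently of each other.
DATA_SOURCE = {"orders": [[10, 20, 30], [5, 15], [40]]}

def analyse(query):
    """Step 2: work out which table is needed and what to do with it."""
    return query["table"], query["aggregate"]

def check_metadata(table):
    """Step 3: confirm the table exists and count its chunks."""
    if table not in DATA_SOURCE:
        raise KeyError(f"unknown table: {table}")
    return len(DATA_SOURCE[table])

def create_plan(table, chunk_count):
    """Step 4: one task per chunk."""
    return [(table, index) for index in range(chunk_count)]

def worker_task(task):
    """Steps 6 and 7: fetch one chunk, then compute a partial sum."""
    table, index = task
    chunk = DATA_SOURCE[table][index]
    return sum(chunk)

def run_query(query, worker_count=3):
    """Step 1 is the call itself; step 8 is the returned value."""
    table, aggregate = analyse(query)
    if aggregate != "sum":
        raise ValueError("this sketch only supports 'sum'")
    plan = create_plan(table, check_metadata(table))
    with ThreadPoolExecutor(max_workers=worker_count) as workers:
        partial_sums = list(workers.map(worker_task, plan))  # step 5
    return sum(partial_sums)  # coordinator merges the partial results

result = run_query({"table": "orders", "aggregate": "sum"})
```

Each worker only sees its own chunk, and the coordinator combines the partial sums into the final answer.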

#### A few things to note about how query execution is implemented:
- All tasks run in parallel
    - If one task fails, all other tasks fail (meaning the entire query fails)
- Uses memory-to-memory transfer
    - No use of hard disk storage at all
- Leverages pipelined execution across all nodes in a massively parallel processing (MPP) manner
- Uses multi-threading to ensure all CPU cores are utilised effectively
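
The "one task fails, the whole query fails" behaviour noted above can be sketched with Python threads. This is an analogy only: `run_tasks` and `QueryFailed` are hypothetical names, and threads on one machine stand in for tasks spread across worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

class QueryFailed(Exception):
    """Raised when any single task of the query fails."""

def run_tasks(tasks):
    """Run all tasks in parallel; the first failure fails the whole query."""
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = [pool.submit(task) for task in tasks]
        results = []
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as error:
                # Cancel whatever has not started yet and give up:
                # there is no retry and no partial result.
                for other in futures:
                    other.cancel()
                raise QueryFailed(f"task failed: {error}") from error
        return results  # in completion order, not submission order

def failing_task():
    raise RuntimeError("worker lost")

try:
    run_tasks([lambda: 1, failing_task, lambda: 3])
    outcome = "succeeded"
except QueryFailed:
    outcome = "failed"
```

Even though two of the three tasks finish successfully, the caller only ever sees the failure, which mirrors the trade-off described in the list above.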


_Note: For a detailed list of available connectors that integrate Presto with various data stores, take a look at the [official documentation](https://prestodb.io/docs/current/connector)_


## Strengths of Presto

> The core strength of Presto is a feature called _data federation_, meaning that in a single query, Presto can connect to and combine data from multiple sources 

- There is a variety of readily available plugins which help to smoothly connect to various big data sources such as HDFS, Kafka, and NoSQL data stores
- Can connect to both relational data sources (such as Postgres) and NoSQL data sources (such as Cassandra)
- Presto leverages in-memory data processing
    - This is much faster than the disk-based data processing used by other tools like Hive
    - Uses distributed massively parallel processing techniques to optimise memory processing
    - Implements efficient utilisation of CPU cores via multi-threading
    - As memory usage never spills over outside of the RAM, Presto can run without a storage layer. This means that there is no vendor lock-in for any Hadoop distribution, no storage engine technology limitations, and no hidden costs for other storage services
- Data engineers can create their own custom connectors easily
- Authentication and database table access can be configured in the plugins to allow customised access to certain users

## Limitations of Presto

> The main limitation of Presto is that it does not have its own data storage layer. Accordingly, Presto must rely on other tools to store and access data. Moreover, machines running Presto require high RAM as it doesn't use a "spill to hard disk" approach when the RAM is full

Other limitations include:
- Performance-wise, the tool is not suitable for long-running/complex batch queries as it doesn't support "data spill to disk" 
- Security features for the tool are still maturing
- In large queries, if one part of the query fails and all other parts are successful, the entire query still fails
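
The memory limitation can be illustrated with a toy in-memory aggregation that fails outright once it would need more state than it is allowed to hold. This is a simplified sketch: the limit here is a count of distinct groups rather than bytes, and `count_by_key` and `MemoryLimitExceeded` are hypothetical names, not Presto APIs.

```python
class MemoryLimitExceeded(Exception):
    """Raised when the aggregation would exceed its in-memory budget."""

def count_by_key(rows, key, max_groups):
    """Count rows per key, holding all groups in memory at once."""
    counts = {}
    for row in rows:
        value = row[key]
        if value not in counts and len(counts) >= max_groups:
            # No spill to disk in this sketch: the query simply fails.
            raise MemoryLimitExceeded(
                f"more than {max_groups} groups needed for key '{key}'"
            )
        counts[value] = counts.get(value, 0) + 1
    return counts

events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
within_budget = count_by_key(events, "user", max_groups=2)
```

With `max_groups=1` the same call raises `MemoryLimitExceeded`, which is the kind of failure a complex query can hit when its working set outgrows the available RAM.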


## Presto Vs. Hive Comparison

> Presto is viewed as a more modern and advanced version of Hive. However, this doesn't mean that Presto is better in every situation. Let's examine how these 2 tools compare

<table>
    <thead>
        <tr>
            <th style="width:auto;text-align:center"></th>
            <th style="width:auto;text-align:center">Hive</th>
            <th style="width:auto;text-align:center">Presto</th>
        </tr>
    </thead>
    <tbody>
       <tr>
            <th>Processing Speed</th>
            <td><li> Hive has a higher latency in general
                <li> Optimised for query throughput - thus is well suited for complex, long-running batch queries on big data
				</td>
            <td><li> Presto uses memory-to-memory transfer, which is much faster for simple queries
                <li> Ideal for analytical queries 
            </td>
       </tr>
       <tr>
            <th>Data Processing Method</th>
            <td><li>  Hive uses MapReduce, which is a disk-based data processing approach 
            <li> Slower for analytical queries, but efficient for long running batch jobs  
            </td>
            <td><li> In-memory data processing
                <li> Faster for interactive queries
                <li> However, memory has a limit after which the query will fail if it's too complex
            </td>
      </tr>  
	  <tr>
            <th>Community Support</th>
            <td><li> Open-source tool
				<li> Community support is dwindling
			</td>
            <td><li> Open-source tool
                <li> Much more active and dynamic community support
            </td>
      </tr>
      <tr>
            <th>Data Management</th>
            <td><li> Hive uses a "pull" approach to pull data from various sources and store it in Hive
                <li> Hive can then implement required transformations
            </td>
            <td><li> Presto uses a "push" approach to push data execution steps to the data sources themselves
                <li> This is because Presto does not have its own data storage infrastructure
            </td>
      </tr>
		  <tr>
            <th>Integration Complexity</th>
            <td><li> Hive can integrate with other tools such as Amazon S3 and HBase
                <li> However, the integration can be complex and require customisations
            </td>
            <td><li> Presto is easier to integrate with various tools such as HDFS, Amazon S3, MongoDB, MySQL and Postgres
                <li> Comes with a wide array of readily available plugins
            </td>
      </tr>  
	  <tr>
            <th>Scalability</th>
            <td><li> Hive can scale up or down as required
                <li> It's best suited for big data stored in large clusters
			</td>
            <td><li> Presto can also scale up or down, and has been designed for use in the cloud
				<li> Can be used with small to medium sized clusters efficiently
			</td>
      </tr>
      <tr>
            <th>Query Fault-Tolerance</th>
            <td><li> Hive can tolerate some errors in query execution
                <li> Even if some errors occur, the query can still run until completion
			</td>
            <td><li> Presto does not tolerate any errors in query execution
				<li> If one part of the query fails, the entire query instantly fails
			</td>
      </tr>
      <tr>
            <th>Best Used For</th>
            <td><li> Large joins and aggregations on big data with a vast number of tables
                <li> Long-running scheduled batch jobs that require a lot of time (hours or days)
			</td>
            <td><li> Quickly exploring data interactively
			</td>
      </tr>
    </tbody>
</table>



## Presto Vs. Hive Experiment

- In one experiment run by a company called Treasure Data, 3 aggregate queries were executed on the same data with both tools. Presto significantly outperformed Hive, as the diagram below of query processing times shows

<p align="center">
  <img src="./images/presto-hive-experiment3.png" width=600>
</p>

Based on this study:
- For each of the first 8 elapsed minutes, the diagram shows the total number of queries each tool ran in order to reach the final result
- Notice that, as time goes by, Hive needs almost 10X more queries than Presto to get the same result
- The main takeaway is that Presto was faster and more efficient in this experiment


## Use Cases for Presto

> Presto has been gaining wider popularity over the past few years and is being adopted by an increasing number of global companies

<p align="center">
  <img src="./images/companies-using-presto.png" width=600>
</p>

### Facebook Use Case
- Facebook uses multiple on-premises production clusters with hundreds of nodes in total
- They had a massive Hadoop HDFS data warehouse with over 300 petabytes of data
- Thousands of daily internal active users running hundreds of queries concurrently
- Presto was created and used to query the data interactively
- Hive was more suited for large-scale reliable data computation

### Netflix Use Case
- Netflix had over 200 nodes in an Amazon EC2 cluster
- Nodes had over 25 Petabytes of data stored in Parquet files
- Over 350 users querying the data with approximately 3,000 queries daily
- Presto was used to enable efficient interactive data querying
- Results indicate that queries which took 1 or 2 MapReduce phases in Hadoop ran 10 to 100 times faster in Presto


## Key Takeaways
- Presto is an open-source, distributed SQL data querying engine that can connect to various sources of structured and unstructured data simultaneously
- It's a more modern alternative to Apache Hive that is faster for interactive querying, and can easily integrate with data stores such as HDFS, Amazon S3, MongoDB and Postgres
- Presto uses in-memory data processing and separates data storage activities from computational logic. This makes it flexible, scalable, and highly efficient in the cloud
- The tool is composed of three main components: the Coordinator (the central brain), the Worker (which performs the actual tasks) and the Client (which sends queries)
- The core strength of Presto is its ability to combine data from multiple data sources in one query. This feature is called _data federation_. 
- The main limitation of Presto is that it does not have its own data storage layer - it must rely on other tools to store and access data.
- Presto and Hive each have their own pros and cons. Presto is more efficient in interactive data querying while Hive excels at long-running batch jobs on massive datasets
- Major global companies are currently using Presto as part of their data engineering foundation. It's mainly leveraged to perform interactive data querying for reporting purposes