# Presto - Cassandra Integration

## Why Integrate Presto with Cassandra?

> As you may recall, Presto is a powerful data querying engine that does not provide its own data storage platform. Accordingly, we need to integrate Presto with other tools in order to be able to query data. 

One of the most popular NoSQL databases used by global companies is Apache Cassandra. Cassandra is a wide-column store that is optimized for fast writes. A common integration is therefore to connect Presto with Cassandra to leverage the strengths of both.

## Steps to Integrate Presto with Cassandra

To integrate Presto and Cassandra, we must first have both tools installed and successfully running on our system.

_Note: Refer to the notebooks covering [Presto Installation](https://portal.theaicore.com/lesson/6d812d61-a6ec-4e07-a5c7-5364e0823dc9) and [Cassandra Installation](https://portal.theaicore.com/lesson/50cf07cb-28e3-4e0c-9777-6ba221cd3374) for the detailed steps required to install each on your local machine._

For the remainder of this notebook, we'll assume that both tools have successfully been installed and that they operate correctly.

Let's begin.


#### 1. Create a New Presto `/etc/catalog/` Directory

In order to connect Presto with Cassandra, we need to create a Presto connector configuration file.  

To do this, we first need to go to the Presto home directory and create a new `etc/catalog/` directory inside it, if it doesn't already exist (this is a folder under the Presto installation, not the system-wide `/etc`). Presto looks for connector configuration files in this folder by default, so every data source we want Presto to query must have a configuration file there.

Let's go ahead and do this:

In [None]:
# Create a catalog folder to store connector configuration files
# Assumes $PRESTO_HOME points to your Presto installation directory, e.g. /usr/local/presto
cd $PRESTO_HOME/etc
sudo mkdir -p catalog

This should be the expected folder structure:

<p align="center">
  <img src="images/presto-tree.png" width=600>
</p>
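
Presto registers one catalog per `.properties` file in this folder and names the catalog after the file, so `cassandra.properties` becomes the `cassandra` catalog. The following is a minimal Python sketch of that naming rule. It runs against a temporary directory that merely stands in for `$PRESTO_HOME/etc/catalog`, and `list_catalogs` is a hypothetical helper, not part of Presto.

```python
from pathlib import Path
import tempfile


def list_catalogs(catalog_dir):
    """Return one catalog name per *.properties file, taken from the file stem."""
    return sorted(p.stem for p in Path(catalog_dir).glob("*.properties"))


# Demonstration in a throwaway directory (a stand-in for $PRESTO_HOME/etc/catalog).
with tempfile.TemporaryDirectory() as tmp:
    catalog_dir = Path(tmp) / "etc" / "catalog"
    catalog_dir.mkdir(parents=True)  # same effect as `mkdir -p`
    (catalog_dir / "cassandra.properties").write_text("connector.name=cassandra\n")
    demo_catalogs = list_catalogs(catalog_dir)

print(demo_catalogs)
```

Pointing `list_catalogs` at your real catalog folder is a quick way to see which catalog names Presto should expose after a restart.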

#### 2. Create Cassandra Properties File and Add Configurations

Create a new file called `cassandra.properties`.  We'll need to specify the following:

- `connector.name`
    - This is the type of tool Presto will connect to. In our case, it will be Cassandra.
    - For a full list of supported connectors, check the official [documentation guide](https://prestodb.io/docs/current/connector.html).
    <p>
- `cassandra.contact-points`
    - This is a comma-separated list of one or more IP addresses of the Cassandra nodes we will be connecting to
    - In our case, we'll be using the local machine (localhost), which has the IP address `127.0.0.1` by default on most machines

Run the below command from inside the `/etc/catalog/` directory:

In [None]:
# Create a new cassandra.properties file
sudo nano cassandra.properties

In [None]:
# Add this to the cassandra.properties file
# "cassandra" is one of the connector names listed in the docs
connector.name=cassandra
cassandra.contact-points=127.0.0.1

Save the file and exit. This should be the expected file content:

<p align="center">
  <img src="images/cassandra-properties.png" width=600>
</p>
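
One detail worth knowing about the properties format: `#` starts a comment only at the beginning of a line. Anything that follows a value on the same line becomes part of that value. The sketch below is a simplified parser (it only handles `key=value` lines, not the full Java properties syntax) with a hypothetical `check_cassandra_catalog` helper that flags the two settings used in this step.

```python
def parse_properties(text):
    """Parse simple key=value lines; full-line '#' or '!' comments are skipped."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_cassandra_catalog(props):
    """Return a list of problems found in the parsed catalog properties."""
    problems = []
    if props.get("connector.name") != "cassandra":
        problems.append("connector.name must be exactly 'cassandra'")
    if not props.get("cassandra.contact-points"):
        problems.append("cassandra.contact-points is missing or empty")
    return problems


good = "# Cassandra catalog\nconnector.name=cassandra\ncassandra.contact-points=127.0.0.1\n"
bad = "connector.name=cassandra # trailing note\ncassandra.contact-points=127.0.0.1\n"

print(check_cassandra_catalog(parse_properties(good)))  # no problems reported
print(check_cassandra_catalog(parse_properties(bad)))   # connector.name is flagged
```

In the second case the parsed connector name is `cassandra # trailing note`, which does not match any connector, so notes belong on their own line.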

#### 3. Check Cassandra Settings

Next, we need to ensure that the IP address Cassandra is configured to use (which should be `127.0.0.1`) is _the same as the one we added above_ in the `cassandra.properties` file. 

We also need to double-check that Cassandra's native transport port is set to the default, `9042`.

To do this, we need to go to the Cassandra conf directory and open the `cassandra.yaml` file to review the settings.

Here are the parameters we need to check:
- `seed_provider`
    - Lists the IP addresses of the Cassandra seed nodes, separated by commas.
    <p>
- `start_native_transport`
    - Enables the native transport server, which serves the CQL native protocol that clients such as `cqlsh` and the Presto connector use to connect
    
_Note: For a detailed list of all available parameters and a description of what each does, check the official [Cassandra documentation](https://cassandra.apache.org/doc/latest/cassandra/configuration/cass_yaml_file.html)_

In [None]:
# Open the cassandra.yaml file
# Assumes $CASSANDRA_HOME points to your Cassandra installation directory;
# on some installs the file lives in /etc/cassandra/conf instead
cd $CASSANDRA_HOME/conf
sudo nano cassandra.yaml

In [None]:
# Check the below settings are set
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "127.0.0.1"


This is what the file should look like:
<p align="center">
  <img src="images/cassandra-seed.png" width=600>
</p>


Next, look for the `start_native_transport` setting and ensure that it is enabled as follows:

In [None]:
# Ensure the below settings are set
start_native_transport: true
native_transport_port: 9042

<p align="center">
  <img src="images/cassandra-native.png" width=600>
</p>

#### 4. Start Cassandra

Next, we need to start Cassandra. 

To do this, run the below command from a terminal window:

_Note: The `-R` flag forces Cassandra to start even when it is run as the root user, which it otherwise refuses to do._

In [None]:
# Start Cassandra
# Alternatively, if Cassandra was installed as a service, start it with
# `sudo systemctl start cassandra` and check it with `sudo systemctl status cassandra`
cd $CASSANDRA_HOME
sudo cassandra -R

If everything runs smoothly, this would be the expected output:

<p align="center">
  <img src="images/cassandra-running.png" width=1200 height=1000>
</p>
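
Once Cassandra is up, one optional way to confirm that the native transport is accepting connections on port `9042` is a plain TCP connection test. The sketch below uses only Python's standard library. `is_port_open` is a hypothetical helper name, and a successful connection only shows that something is listening on the port, not that it is a healthy Cassandra node.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Example call (only meaningful once Cassandra is running):
#   is_port_open("127.0.0.1", 9042)
```

If the call returns `False`, re-check `start_native_transport` and `native_transport_port` in `cassandra.yaml` before moving on.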

#### 5. Run the Cassandra Shell

Next, we need to run the Cassandra shell by entering the below command from a new terminal window (keeping the previous Cassandra terminal open):

In [None]:
# Run the Cassandra shell
cd $CASSANDRA_HOME
sudo cqlsh

You should now be inside the shell, as seen below:
<p align="center">
  <img src="images/cql.png" width=600>
</p>


#### 6. Start Presto Server

Now we need to start the Presto server process by running the below commands from a third terminal screen (keep the previous two terminals running as is):

_Note: `launcher run` keeps the server in the foreground and prints its logs to the terminal, whereas `launcher start` runs it in the background as a daemon._

In [None]:
# Start the Presto server
cd $PRESTO_HOME/bin
sudo ./launcher run

If all goes well, this should be the expected output:
<p align="center">
  <img src="images/presto-server.png" width=600>
</p>
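
An optional way to check from a script that the server has finished starting is to read its HTTP info endpoint. The sketch below assumes that the server exposes `/v1/info` and that the JSON response contains a boolean `starting` field. Treat both as assumptions to verify against your Presto version. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_server_info(base_url="http://localhost:8080", timeout=5.0):
    """Fetch the server info document (assumed to live at /v1/info)."""
    with urlopen(f"{base_url}/v1/info", timeout=timeout) as response:
        return json.load(response)


def is_ready(info):
    """Treat the server as ready only when 'starting' is explicitly false."""
    return info.get("starting", True) is False


# Example call (requires the Presto server from this step to be running):
#   is_ready(fetch_server_info())
```

A missing `starting` field is treated as "not ready", so an unexpected response never reports success by accident.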

#### 7. Start Presto Client

Next, we need to start the Presto client (CLI), which we'll use to connect to the server. We'll need to provide some important parameters, which include:
- `--server`
    - Host and port of the Presto server to connect to
    <p>
- `--catalog`
    - Default catalog to use for the session; the name matches the properties file in `etc/catalog/`
    - In our case, this is `cassandra`

To do this, run the below command from a new terminal window:

In [None]:
# Start the Presto client (run this from the directory that contains the Presto CLI executable)
sudo ./presto --server localhost:8080 --catalog cassandra

If everything runs smoothly, this should be the output you see:

<p align="center">
  <img src="images/presto-shell.png" width=600>
</p>

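If the prompt appears without errors, the client has connected to the Presto server. The connection to Cassandra is confirmed once the first query against the `cassandra` catalog succeeds, which we'll do in the next section.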

Now we are ready to start using the Presto shell to interact with data stored in Cassandra.
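
For scripted checks, the same CLI can run a single statement and exit by using its `--execute` option, optionally with `--schema` to set a default schema. The sketch below only assembles the argument list. `build_presto_command` is a hypothetical helper, and `./presto` is assumed to be the CLI executable in the current directory.

```python
import shlex


def build_presto_command(server, catalog, schema=None, query=None, cli_path="./presto"):
    """Assemble the argument list for a Presto CLI invocation."""
    command = [cli_path, "--server", server, "--catalog", catalog]
    if schema:
        command += ["--schema", schema]
    if query:
        command += ["--execute", query]
    return command


demo_command = build_presto_command("localhost:8080", "cassandra", query="SHOW SCHEMAS")
print(shlex.join(demo_command))
# prints: ./presto --server localhost:8080 --catalog cassandra --execute 'SHOW SCHEMAS'
```

Passing the list to `subprocess.run` avoids shell-quoting problems when the query contains spaces or quotes.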

## Using Presto to Query Data in Cassandra

Integrating Presto with Cassandra allows us to leverage Presto's powerful data querying engine along with Cassandra's data storage benefits. 

In this part of the notebook, we'll be creating a new Cassandra keyspace and table, populating it with data, and then we'll connect to this table using Presto to perform some interactive queries on the data.

In particular, we'll be doing a JOIN operation on the data stored in Cassandra to demonstrate that, although Cassandra does not natively support JOINs, integrating it with Presto allows us to overcome this limitation.

Let's begin.

#### 1. Create a New Cassandra Keyspace

Now that both Presto and Cassandra are up and running, we'll create a new Cassandra keyspace.

Run the below command from within the _Cassandra shell_:

In [None]:
# Create a new Cassandra keyspace
CREATE KEYSPACE presto_cassandra WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

If everything runs smoothly, this is the expected output:

_Note: By using the `DESCRIBE KEYSPACES` command, we can see all available keyspaces:_

<p align="center">
  <img src="images/cassandra-keyspace-create2.png" width=600>
</p>

#### 2. Create a New Table

Next, we'll create a new column family (table) called `spacecraft_journey` inside the `presto_cassandra` keyspace.

Run the below command from the _Cassandra shell_:

In [None]:
# Create a new Cassandra table called spacecraft_journey
CREATE TABLE IF NOT EXISTS presto_cassandra.spacecraft_journey (
    spacecraft_name text,
    journey_id timeuuid,
    start timestamp,
    end timestamp,
    active boolean,
    summary text,
    PRIMARY KEY ((spacecraft_name), journey_id)
) WITH CLUSTERING ORDER BY (journey_id DESC);
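
In this definition, `spacecraft_name` is the partition key and `journey_id` is the clustering column: rows that share a spacecraft name are stored together, and within each partition they are kept sorted by `journey_id`, newest first. The toy Python sketch below mimics that layout in memory. The rows are made up, and small integers stand in for `timeuuid` values, which Cassandra orders by their time component.

```python
from collections import defaultdict


def layout_rows(rows):
    """Group rows by partition key, then sort each partition by clustering key, descending."""
    partitions = defaultdict(list)
    for spacecraft_name, journey_id, summary in rows:
        partitions[spacecraft_name].append((journey_id, summary))
    return {
        name: sorted(entries, key=lambda entry: entry[0], reverse=True)
        for name, entries in partitions.items()
    }


# Made-up rows: (spacecraft_name, journey_id, summary)
sample_rows = [
    ("vostok1", 1, "first orbit"),
    ("apollo11", 2, "launch"),
    ("apollo11", 3, "landing"),
]
layout = layout_rows(sample_rows)
print(layout["apollo11"])  # [(3, 'landing'), (2, 'launch')]
```

This is why queries that filter on `spacecraft_name` are cheap in Cassandra: they read a single, already sorted partition.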

Let's check that the table was created successfully.  We'll do this in both the Presto shell and the Cassandra shell to see how the outputs compare.

First, let's run the below command from within the _Presto shell_:

In [None]:
# From the Presto shell, switch to the presto_cassandra keyspace and check the tables it contains
USE presto_cassandra;
SHOW TABLES;

We should be able to see the below output:
<p align="center">
  <img src="images/presto-show-tables.png" width=600>
</p>


Next, let's check the table from the _Cassandra shell_ by running the below command: 

In [None]:
# From the Cassandra shell, switch to the presto_cassandra keyspace and check the tables it contains
USE presto_cassandra;
DESCRIBE TABLES;

<p align="center">
  <img src="images/cassandra-describe-table.png" width=600>
</p>

_Note: The commands differ because Presto uses SQL syntax (`SHOW TABLES`), while the Cassandra shell uses CQL and its own shell commands (`DESCRIBE TABLES`)._

#### 3. Download Data File

Now that the new table is ready, let's go ahead and populate it with some data. To do this, download [this CSV file](https://aicore-files.s3.amazonaws.com/Data-Eng/spaceship.csv).

Once the file is downloaded, move it to the default Cassandra data folder `/var/lib/cassandra/data` by running the below command from the directory you downloaded the file to:

In [None]:
# Move the CSV file into the default Cassandra data folder
sudo mv spaceship.csv /var/lib/cassandra/data/spaceship.csv

#### 4. Import Data into Cassandra Table

Next, let's import the `spaceship.csv` file into the Cassandra table.

Run the below command from the _Cassandra Shell_:

_Note: We need to specify the columns in the same order as they appear in the file._

In [None]:
# Import the CSV file into the spacecraft_journey table
COPY spacecraft_journey (spacecraft_name, journey_id, start, end, active, summary) FROM '/var/lib/cassandra/data/spaceship.csv' WITH HEADER = false;

Assuming the import completes successfully, this should be the expected output:
<p align="center">
  <img src="images/cassandra-spaceship-import.png" width=1200>
</p>
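
If the import reports parse errors, a quick sanity check is to confirm that every row in the file has the six fields listed in the `COPY` command. The sketch below uses Python's `csv` module on in-memory sample text. The sample rows are made up and `find_bad_rows` is a hypothetical helper. To check the real file, pass an open file handle instead of the `StringIO` object.

```python
import csv
import io

# Column order used in the COPY command above.
EXPECTED_COLUMNS = ["spacecraft_name", "journey_id", "start", "end", "active", "summary"]


def find_bad_rows(csv_file, expected_count=len(EXPECTED_COLUMNS)):
    """Return (row_number, field_count) for every row with the wrong number of fields."""
    bad = []
    for row_number, row in enumerate(csv.reader(csv_file), start=1):
        if len(row) != expected_count:
            bad.append((row_number, len(row)))
    return bad


# Made-up sample: the second row is missing two fields.
sample = io.StringIO(
    "vostok1,uuid-1,2021-01-01,2021-01-02,false,ok\n"
    "apollo11,uuid-2,2021-02-01,2021-02-02\n"
)
bad_rows = find_bad_rows(sample)
print(bad_rows)  # [(2, 4)]
```

The check only counts fields. It does not validate that each value matches the column's CQL type.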

#### 4. Check the Data Load from Presto

Once the data import step has completed successfully, let's check the data to make sure it loaded properly.

Run the below command from the _Presto shell_:

In [None]:
# Check the Cassandra table from within Presto
SELECT * FROM presto_cassandra.spacecraft_journey;

This should be the expected output:
<p align="center">
  <img src="images/presto-data-check.png" width=1200>
</p>

#### 5. Create 2nd Cassandra Table

Cassandra doesn't allow JOINs between two tables, but using Presto we have more flexibility and can work around some of Cassandra's limitations. To give this a try, we'll first create a second Cassandra table called `spacecraft_speed` and then JOIN it to the `spacecraft_journey` table.

Run the below command from within the _Cassandra shell_:

In [None]:
# Create a second table called spacecraft_speed
CREATE TABLE IF NOT EXISTS presto_cassandra.spacecraft_speed (
	spacecraft_name text,
	journey_id timeuuid,
	speed double,
	speed_unit text,
	reading_time timestamp,
	PRIMARY KEY ((spacecraft_name, journey_id), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

If everything runs smoothly, this should be the expected output:

<p align="center">
  <img src="images/cassandra-2nd-table.png" width=600>
</p>

#### 6. Download the 2nd CSV Data File

Download the `spaceship_speed` CSV file [from here](https://aicore-files.s3.amazonaws.com/Data-Eng/spaceship_speed.csv) and move it to the default Cassandra data folder `/var/lib/cassandra/data` by running the below command from the directory you downloaded the file to:

In [None]:
# Move the CSV file into the default Cassandra data folder
sudo mv spaceship_speed.csv /var/lib/cassandra/data/spaceship_speed.csv

#### 7. Import Data into 2nd Table

After the file has been downloaded and moved, enter the below command to import the data into the new `spacecraft_speed` table.

Run the command from the _Cassandra Shell_:

In [None]:
# Import the spaceship_speed data
COPY spacecraft_speed (spacecraft_name, journey_id, speed, speed_unit, reading_time) FROM '/var/lib/cassandra/data/spaceship_speed.csv' WITH HEADER = false;

This is the expected output assuming the import operation is successful:
<p align="center">
  <img src="images/cassandra-2nd-table-import.png" width=900>
</p>

#### 8. Check the Data Load of 2nd Table

Let's first check that the data was imported successfully.

Run the below command from the _Presto shell_:

In [None]:
# Check the spacecraft_speed data load from Presto
SELECT * FROM presto_cassandra.spacecraft_speed;

This is the expected output:
<p align="center">
  <img src="images/presto-2nd-table-check.png" width=600>
</p>

#### 9. Run a JOIN Operation

Now that we have both tables up and running and populated with data, let's try doing a JOIN operation. 

Remember, JOINs are not supported inside Cassandra, but Presto allows us to run one because the join is performed by Presto's own query engine.

We'll join both tables on the `journey_id` column by using a SQL `INNER JOIN` operation to return only the matching records from both tables.

Run the below command from within the _Presto shell_:

In [None]:
# Join the spacecraft_journey and spacecraft_speed tables by using the journey_id
SELECT j.spacecraft_name, j.summary, s.speed
FROM spacecraft_journey AS j
INNER JOIN spacecraft_speed AS s
    ON j.journey_id = s.journey_id;


Assuming everything runs smoothly, this is the expected output:

<p align="center">
  <img src="images/presto-join-tables.png" width=600>
</p>
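
Conceptually, Presto reads both tables through the Cassandra connector and matches the rows itself. The toy Python sketch below illustrates inner-join semantics on `journey_id` by indexing one table and probing it with the other. It uses made-up rows and is only an illustration of the result, not a description of how Presto implements joins internally.

```python
from collections import defaultdict


def inner_join(journeys, speeds):
    """Return (spacecraft_name, summary, speed) for rows that match on journey_id."""
    speeds_by_journey = defaultdict(list)
    for reading in speeds:
        speeds_by_journey[reading["journey_id"]].append(reading)

    result = []
    for journey in journeys:
        # Journeys without any speed reading produce no output rows.
        for reading in speeds_by_journey.get(journey["journey_id"], []):
            result.append((journey["spacecraft_name"], journey["summary"], reading["speed"]))
    return result


# Made-up rows for illustration only.
journeys = [
    {"spacecraft_name": "vostok1", "journey_id": "j1", "summary": "first orbit"},
    {"spacecraft_name": "apollo11", "journey_id": "j2", "summary": "moon landing"},
]
speeds = [
    {"journey_id": "j1", "speed": 7.8},
    {"journey_id": "j1", "speed": 7.9},
    {"journey_id": "j3", "speed": 1.0},
]
joined = inner_join(journeys, speeds)
print(joined)
```

Only journey `j1` appears in the result: `j2` has no speed readings and the reading for `j3` has no matching journey, which is exactly what `INNER JOIN` means.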

This is just one example, but the data stored in Cassandra could be queried in any other way using Presto.

## Common Presto Operations

Since Presto leverages SQL as the query language, it's possible to run a wide variety of commands.

For a detailed list of these commands, check the official [Presto SQL documentation](https://prestodb.io/docs/current/sql.html).

Some of the most popular types of operations include:

- `CREATE`
    - Used to create new schemas, tables, views and user-defined functions
- `ALTER`
    - Used to modify an already existing schema, table or user-defined function
- `DESCRIBE`
    - Used to describe the columns that a specific table contains
- `DROP`
    - Used to remove a schema, table, view or a user-defined function
- `SHOW`
    - Used to list/display the available catalogs, schemas, tables, user-defined functions, views, user roles and other parameters.

That should be enough to get you started and show you how to create and use a connector to integrate Presto with Cassandra.