# Apache Cassandra - Data Loading

In this notebook, we'll go through the steps required to load data into Cassandra and then query it.

## Loading Data into Cassandra and Querying it 

Now that we have the data store up and running, our next activity will be to create a new keyspace, column family (table) and to populate this table with some data.

_Note: For a full list of commands and an explanation of what each of them does, check out the official [documentation here](https://docs.datastax.com/en/dse/5.1/cql/cql/cqlQuickReference.html)_


#### 1. Run Cassandra then Open Shell

First of all, we need to run the Cassandra tool itself. The tool runs as a background process, which we can connect to using the Cassandra shell called `cqlsh`. 

We'll use this shell to write SQL-like commands that interact with the data.

_Note: We are running Cassandra in root (administrator) mode by using the -R flag. This is to have full access rights to create and delete keyspaces, tables etc._

Let's go ahead and run the below commands:

In [None]:
# Run the Cassandra process in the background using root privileges
sudo cassandra -R

# Open the Cassandra shell
sudo cqlsh

If all goes well, you should see the below output:

<p align="center">
  <img src="./images/cassandra-run3.png" width=800>
  <figcaption align="center"><cite>Cassandra Shell</cite></figcaption>
</p>

#### 2. Create a New Keyspace

Let's go ahead and create a new keyspace called `data`. We'll use the [`CREATE KEYSPACE`](https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlCreateKeyspace.html) command to do that. 

Run the below command from _inside the Cassandra shell_:

_Remember:_
- Cassandra keyspaces are analogous to databases
- Column families are analogous to tables

In [None]:
-- SimpleStrategy applies a single replication factor to the whole cluster, with no per-datacenter settings.
-- replication_factor sets how many copies of each row are stored. It is 1 here because we are running a single local node.
-- durable_writes controls whether writes go through the commit log. Keep it set to true so that the commit log is not bypassed.

CREATE KEYSPACE data WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = 'true';

This is what you should expect to see:

<p align="center">
  <img src="./images/cassandra-keyspace-create.png" width=1200>
  <figcaption align="center"><cite>Create Keyspace Command</cite></figcaption>
</p>
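
To make the role of each option in the statement above easier to see, here is a minimal Python sketch that assembles the same kind of `CREATE KEYSPACE` statement from parameters. The helper `build_create_keyspace` is hypothetical and only builds a string; it does not talk to Cassandra. It writes the boolean unquoted, which is the form used in the CQL reference.

```python
def build_create_keyspace(name, replication_factor=1, durable_writes=True):
    """Return a CREATE KEYSPACE statement that uses SimpleStrategy.

    Hypothetical helper for illustration: it only formats text.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    # The replication map names the strategy and the number of copies per row
    replication = (
        "{'class': 'SimpleStrategy', "
        f"'replication_factor': '{replication_factor}'}}"
    )
    return (
        f"CREATE KEYSPACE {name} WITH REPLICATION = {replication} "
        f"AND durable_writes = {str(durable_writes).lower()};"
    )


statement = build_create_keyspace("data")
print(statement)
```

The printed statement can be pasted into `cqlsh`. On a single local node, keep `replication_factor` at 1.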

#### 3. Create a New Table

Now that we have a keyspace available, we'll create a new table called `retail_data` inside that keyspace. This table will have the same columns as the CSV file that we'll download shortly. We'll use the `id` as the Primary Key.

To do this, we need to first switch to the `data` keyspace by the [`USE`](https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlUse.html) command, then create the column family inside of it by using the [`CREATE TABLE`](https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlCreateTable.html#cqlCreateTable) command.

Enter the below from within the Cassandra shell:

In [None]:
-- Switch to the data keyspace
USE data;

-- Create a new table in the data keyspace
CREATE TABLE data.retail_data (id int PRIMARY KEY, description text, quantity int, price decimal, customer int, country text);

This is the expected outcome:
<p align="center">
  <img src="./images/cassandra-create-table2.png" width=1200>
  <figcaption align="center"><cite>Create Table Command</cite></figcaption>
</p>

You can use the [`DESCRIBE TABLE`](https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlshDescribe.html) command to ensure that the table is properly set up. Run the below command:

In [None]:
-- Check the newly created table
DESCRIBE TABLE retail_data;

If all goes well, you should see the below output:

<p align="center">
  <img src="./images/cassandra-describe-table2.png" width=900>
  <figcaption align="center"><cite>DESCRIBE TABLE Output</cite></figcaption>
</p>

#### 4. Download the `retail.csv` File

Our next step is to import the `retail.csv` file into the newly created `retail_data` table. 

You can download the file by clicking on [this link](https://aicore-files.s3.amazonaws.com/Data-Eng/retail.csv).

Once the file is downloaded, copy (or move) it into the default Cassandra data folder `/var/lib/cassandra/data` by running the below command from the terminal:

_Note: First change into the directory where you downloaded the `retail.csv` file. You can also keep the file in any other directory of your choice, but in that case you must specify its full path when importing it._

In [None]:
# Copy retail.csv to the default Cassandra data folder
sudo cp retail.csv /var/lib/cassandra/data/retail.csv
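
Before importing, it can help to confirm that the file has one field per table column, in the same order as the table definition. The sketch below is a minimal check using Python's standard `csv` module. The sample text is made up and merely stands in for `retail.csv`, whose real header names may differ; `header_matches` is a hypothetical helper.

```python
import csv
import io

# Columns of the retail_data table, in the order they were declared
TABLE_COLUMNS = ["id", "description", "quantity", "price", "customer", "country"]


def header_matches(csv_file, expected_columns):
    """Return True when the first CSV row has one field per expected column."""
    header = next(csv.reader(csv_file), [])
    return len(header) == len(expected_columns)


# Made-up sample standing in for the real retail.csv file
sample = io.StringIO(
    "id,description,quantity,price,customer,country\n"
    "1,WHITE METAL LANTERN,6,3.39,17850,United Kingdom\n"
)
print(header_matches(sample, TABLE_COLUMNS))
```

To check the real file, pass a handle opened with `open("retail.csv", newline="")` instead of the in-memory sample.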

#### 5. Import the CSV File into the `retail_data` Table

Now that the file is in the data folder, we can go ahead and load it into the `retail_data` table. To do that, run the [`COPY`](https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlshCopy.html) command from within the Cassandra shell. We'll need to specify the columns and the file path, and indicate that the first row of the file is a header:

In [None]:
-- Import the data file into the retail_data table
COPY data.retail_data (id, description, quantity, price, customer, country) FROM '/var/lib/cassandra/data/retail.csv' WITH HEADER = TRUE;

_Note: Some records may fail to parse and be reported as errors. Don't worry: this does not stop the valid rows from being imported._

Once the data import completes successfully, you should see the below output:
<p align="center">
  <img src="./images/cassandra-file-import.png" width=800>
  <figcaption align="center"><cite>Data Import Prompt</cite></figcaption>
</p>
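
If you would like to know in advance which records are likely to be rejected, you can run a rough check of each row against the column types of `retail_data`. The Python sketch below does that for a made-up sample. It assumes that empty fields are loaded as nulls and that a missing `id` is an error because `id` is the primary key. The names `COLUMN_PARSERS`, `parse_row` and `split_valid_rows` are hypothetical, and the real import may accept or reject rows differently.

```python
import csv
import io
from decimal import Decimal, InvalidOperation

# One parser per column, mirroring the types in CREATE TABLE:
# id int, description text, quantity int, price decimal, customer int, country text
COLUMN_PARSERS = [int, str, int, Decimal, int, str]


def parse_row(row):
    """Convert one CSV row to typed values, raising on malformed input."""
    if len(row) != len(COLUMN_PARSERS):
        raise ValueError("wrong number of fields")
    # Assumption: an empty field becomes a null (None) value
    parsed = [
        parse(value) if value != "" else None
        for parse, value in zip(COLUMN_PARSERS, row)
    ]
    if parsed[0] is None:
        raise ValueError("the primary key id is missing")
    return parsed


def split_valid_rows(csv_file):
    """Return (valid, rejected) data rows after skipping the header row."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    valid, rejected = [], []
    for row in reader:
        try:
            valid.append(parse_row(row))
        except (ValueError, InvalidOperation):
            rejected.append(row)
    return valid, rejected


# Made-up sample standing in for retail.csv
sample = io.StringIO(
    "id,description,quantity,price,customer,country\n"
    "1,WHITE METAL LANTERN,6,3.39,17850,United Kingdom\n"
    "2,RED MUG,two,1.25,17850,France\n"
    ",BLUE PLATE,1,2.10,12583,France\n"
    "3,GREEN BOWL,4,,12583,France\n"
)
valid, rejected = split_valid_rows(sample)
print(len(valid), "valid rows,", len(rejected), "rejected rows")
```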


#### 6. Check Loaded Data

Now that the import has completed successfully, let's check the available data by using the [`SELECT`](https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlSelect.html) command as follows:

In [None]:
SELECT * FROM retail_data;

The expected output should look like:

<p align="center">
  <img src="./images/cassandra-select.png" width=600>
  <figcaption align="center"><cite>SELECT Command Output</cite></figcaption>
</p>

#### 7. Query the Data

Now that the data is ready for querying, there are a number of operations we can perform that are very similar to those in SQL. These include:

- `SELECT`
    - Retrieves data based on certain criteria
    - We can add filters using the `WHERE` clause
    - We can use `GROUP BY` to aggregate the data (only on primary key columns)
    - We can use `ORDER BY` to sort the data (only on clustering columns)
- `SUM`
    - Does a summation operation on the data
- `COUNT`
    - Counts the occurrences of data
- `MAX`
    - Finds the maximum value
- `MIN`
    - Finds the minimum value
- `AVG`
    - Calculates the average of a group of numbers

_Note: Check out more details on the aggregation operations in [this documentation](https://www.geeksforgeeks.org/aggregate-functions-in-cassandra/)_

Let's go ahead and try one aggregation operation. We can find the total price (`SUM`) of all products sold where `country = 'United Kingdom'` by running the below command:

In [None]:
SELECT SUM(price) FROM retail_data WHERE country = 'United Kingdom' ALLOW FILTERING;

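_Note: `ALLOW FILTERING` tells Cassandra to run the query even though it filters on `country`, a column that is neither part of the primary key nor indexed, so Cassandra may have to scan the whole table. If you run the query without specifying this, it will fail._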

Assuming the query completes successfully, you should see output similar to:

<p align="center">
  <img src="./images/cassandra-sum.png" width=900>
  <figcaption align="center"><cite>Filtered SUM Query Output</cite></figcaption>
</p>
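
As a rough analogy for what the filtered `SUM` involves, the Python sketch below adds up prices for one country by scanning every row. Because `country` is not part of the primary key, Cassandra cannot jump straight to the matching rows, and this is the kind of scan that `ALLOW FILTERING` opts into. The rows and the helper `sum_price_for_country` are made up for illustration and are not how Cassandra is implemented.

```python
from decimal import Decimal

# Made-up rows standing in for the retail_data table
rows = [
    {"id": 1, "price": Decimal("3.39"), "country": "United Kingdom"},
    {"id": 2, "price": Decimal("1.25"), "country": "France"},
    {"id": 3, "price": Decimal("2.10"), "country": "United Kingdom"},
]


def sum_price_for_country(rows, country):
    """Scan every row, keep the matching ones, and add up their prices."""
    scanned = 0
    total = Decimal("0")
    for row in rows:
        scanned += 1  # every row is read, whether it matches or not
        if row["country"] == country:
            total += row["price"]
    return total, scanned


total, scanned = sum_price_for_country(rows, "United Kingdom")
print(total, scanned)
```

The number of rows scanned grows with the size of the table rather than with the number of matches, which is why this kind of query is fine for exploration on a small local table but costly on large ones.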

#### 8. More Data Querying Examples

Let's run a few more queries on the data:

- Find the maximum price among United Kingdom rows with a quantity of 1:

In [None]:
-- Find the maximum price among United Kingdom rows where quantity = 1
-- Note: description is not aggregated, so it may not come from the row that holds the maximum price
SELECT description, MAX(price) FROM retail_data WHERE country = 'United Kingdom' AND quantity = 1 ALLOW FILTERING;

This is the expected output:

<p align="center">
  <img src="./images/cassandra-max3.png" width=900>
  <figcaption align="center"><cite>Filtered MAX Query Output</cite></figcaption>
</p>
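
As far as we understand Cassandra's behaviour, a plain column selected next to an aggregate such as `MAX(price)` is taken from the first row encountered, so the `description` returned is not guaranteed to belong to the most expensive product. One hedged workaround is to fetch the matching rows and pick the top one on the client. The Python sketch below shows the idea on made-up rows; `most_expensive` is a hypothetical helper.

```python
from decimal import Decimal

# Made-up rows standing in for the result of a filtered SELECT
rows = [
    {"description": "WHITE METAL LANTERN", "price": Decimal("3.39")},
    {"description": "RED MUG", "price": Decimal("1.25")},
    {"description": "GLASS VASE", "price": Decimal("7.95")},
]


def most_expensive(rows):
    """Return the whole row that holds the highest price, or None if empty."""
    return max(rows, key=lambda row: row["price"], default=None)


top = most_expensive(rows)
print(top["description"], top["price"])
```

Because the whole row is returned, the description and the price are guaranteed to come from the same record.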

- Count the sales records for France:

In [None]:
-- Count the France rows that have a quantity value
SELECT COUNT(quantity) FROM retail_data WHERE country='France' ALLOW FILTERING;

Expected output:
<p align="center">
  <img src="./images/cassandra-france.png" width=900>
  <figcaption align="center"><cite>Filtered COUNT Query Output</cite></figcaption>
</p>
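
Note that `COUNT(quantity)` counts rows with a non-null `quantity`, which is not the same as the number of units sold; for that you would use `SUM(quantity)`. The Python sketch below shows the difference on made-up rows; `count_and_sum_quantity` is a hypothetical helper.

```python
# Made-up rows standing in for the retail_data table
rows = [
    {"quantity": 6, "country": "France"},
    {"quantity": None, "country": "France"},
    {"quantity": 2, "country": "France"},
    {"quantity": 4, "country": "United Kingdom"},
]


def count_and_sum_quantity(rows, country):
    """Return (number of non-null quantities, total units) for one country."""
    quantities = [
        row["quantity"]
        for row in rows
        if row["country"] == country and row["quantity"] is not None
    ]
    return len(quantities), sum(quantities)


record_count, units_sold = count_and_sum_quantity(rows, "France")
print(record_count, units_sold)
```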


#### 9. Drop Table

If we want to delete a table, we can do so using the [`DROP TABLE`](https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlDropTable.html) command. 

Here is how to use it:

In [None]:
-- Delete the retail_data table
DROP TABLE retail_data;

This is the expected output:
<p align="center">
  <img src="./images/cassandra-drop-table.png" width=600>
  <figcaption align="center"><cite>DROP TABLE Command</cite></figcaption>
</p>

> Because Cassandra does not support operations like JOINs or complex subqueries, its queries are typically simpler than those you often encounter in SQL. For that reason, practising with the commands yourself is more useful than reading through further examples.

Now you are ready to fully explore data using Cassandra! 

## Key Takeaways

- To create a new Cassandra keyspace (which resembles a database in SQL), we can use the `CREATE KEYSPACE` command
- To create a new table, we can use the `CREATE TABLE` command and specify the keyspace name under which the table will belong
- `DESCRIBE TABLE` will display detailed information about any Cassandra-managed table
- To import data into a Cassandra-managed table, use the `COPY` command and specify the columns
- Similar to SQL, a `SELECT` statement can be used to query data in a Cassandra table 
- Cassandra provides aggregation commands such as `SUM`, `COUNT`, `MAX`, `MIN` and `AVG`
- Cassandra does not support JOIN operations on tables
