<img src="./images/logo.png" alt="Drawing" style="width: 500px;"/>

# **Exercise 1:** Exploring Sales Data with Apache Spark

This exercise will introduce **Apache Spark on HPE Ezmeral Unified Analytics**. We'll leverage Spark's powerful distributed processing capabilities to analyze sales data and uncover insights across three major European grocery stores.

In this exercise, you will:

- Set up a Spark session for interacting with data.
- Generate sample sales data for different countries and currencies.
- Explore techniques for data loading, transformation, and analysis using Spark SQL and DataFrames.
- Create Delta Tables and perform version control.

Feel free to modify and extend the code examples to suit your specific data analysis needs.

Let's get started!

### **Prerequisites:**

As instructed in the [Introductory notebook](./00.introduction.ipynb), ensure that you have run `pip install -r requirements.txt` in a Terminal window, located in the same working directory, prior to running this notebook. 

<div class="alert alert-block alert-danger">
    <b>Important:</b> Make sure you have selected <b>PySpark</b> as your notebook kernel - check the top right corner!
</div>

## **1. Create Spark Session**

Think about the most recent Excel spreadsheet you edited. It probably had tens or even hundreds of rows across tens of columns. When you ran an Excel command, such as a *SUM()* or a *VLOOKUP()*, you may have noticed that it took a fair bit of time to process. Maybe even the fans of your laptop sped up a bit as your computer worked to crunch the numbers.

Now, scale that same command out to a spreadsheet with tens of **millions** of rows across **thousands** of columns. That is the Big Data that companies must work with on a daily basis, and no single PC is going to run any *VLOOKUP* command on data of that size.

Instead of spreadsheets, the enterprise world is largely built upon **tables** in a variety of formats. Querying these tables to retrieve specific data takes a **mammoth** amount of compute. It makes no sense to have a single **compute server** executing these queries - it would be far faster to parallelize queries across several computers. Enter **Apache Spark**.

### Introduction to Apache Spark on HPE Ezmeral Unified Analytics

Apache Spark is a popular open-source big data framework that **distributes the computations** required to perform queries on large sets of data. This distribution, along with working with data in-memory rather than directly from storage disks, drastically brings down the time usually taken to query and index data. The combination of speed, versatility, and ease of use has made Spark the go-to framework when working with big data.
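
To make the idea of distributing a computation concrete, here is a toy sketch in plain Python rather than Spark itself. It assumes a made-up list of sales amounts, splits it into partitions, sums each partition on a separate worker thread, and then combines the partial results. The helper names `partition` and `distributed_sum` are illustrative only; Spark applies the same split, compute, combine pattern across executors on different machines.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, num_partitions):
    """Split rows into num_partitions contiguous chunks of near-equal size."""
    size, remainder = divmod(len(rows), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `remainder` chunks each take one extra row
        end = start + size + (1 if i < remainder else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def distributed_sum(rows, num_workers=4):
    """Sum each partition on its own worker, then combine the partial sums."""
    chunks = partition(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)


sales = list(range(1, 1001))  # 1,000 made-up sales amounts
total = distributed_sum(sales)
print(total)
```

Threads on one machine do not speed up a simple sum; the point is only to show the shape of the computation that Spark spreads over a cluster.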

Apache Spark comes pre-installed with **HPE Ezmeral Unified Analytics** and can leverage as much or as little of the compute available in a Unified Analytics cluster as a user desires. The core components of an Apache Spark deployment include:

<img src="./images/exercise1/spark_archi.PNG" alt="Drawing" style="width: 60%;"/>

**Driver:** The driver program coordinates the execution of Spark jobs. It submits tasks to executors, schedules operations, and manages communication between various components.

**Workers:** These are machines in the Spark cluster that manage executors. Each worker runs one or more executors. When running Spark on an HPE Ezmeral Unified Analytics deployment, Spark Workers are Kubernetes pods distributed among worker nodes of the Unified Analytics cluster, allowing them to scale across multiple machines as required.

**Executors:** Executors reside on worker nodes and carry out the actual computations assigned by the driver program. Each executor processes its share of the partitioned workload, so the work is spread across machines in the cluster.

**JVM:** Spark runs each executor inside a Java Virtual Machine (JVM) on its worker node.

On **HPE Ezmeral Unified Analytics**, you will use Apache Spark to analyze large datasets at high speed with a unified platform for batch processing, streaming, and machine learning.

### Create a Spark Interactive Session

Let's begin using Spark! Here, you use Unified Analytics' native integration of **Apache Livy** to create and manage an interactive Spark session. Livy is an open-source REST service that enables remote and interactive analytics on Apache Spark clusters. It provides a way to interact with Spark clusters programmatically using a REST API, allowing you to submit Spark jobs, run interactive queries, and manage Spark sessions from web applications without the need for a specific Spark client. As a result, multiple Unified Analytics users can interact with your Spark cluster concurrently and reliably!

First, let's connect to the Livy endpoint and create a new Spark interactive session. The Spark interactive
session is particularly useful for exploratory data analysis, prototyping, and iterative development. It allows you to
interactively work with large datasets, perform transformations, apply analytical operations, and build ML models using
Spark's distributed computing capabilities. 

To communicate with Livy and manage your sessions, you use Sparkmagic, an open-source tool that provides a Jupyter kernel
extension. Sparkmagic integrates with Livy to provide the underlying communication layer between the Jupyter kernel and
the Spark cluster.
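
For a rough idea of what travels over that communication layer, the sketch below builds (but does not send) the two kinds of REST requests that Livy's API defines for interactive work: one to create a session and one to run a statement in it. The address `https://livy.example.com` and the helper names are assumptions made for illustration. In this exercise Sparkmagic issues the real requests for you, so you never need to write this code.

```python
import json

# Hypothetical address; the real Livy endpoint is supplied by your cluster.
LIVY_URL = "https://livy.example.com"


def create_session_request(name):
    """Describe the request that asks Livy to start a PySpark session."""
    return {
        "method": "POST",
        "url": f"{LIVY_URL}/sessions",
        "body": json.dumps({"kind": "pyspark", "name": name}),
    }


def run_statement_request(session_id, code):
    """Describe the request that runs a snippet of code in an existing session."""
    return {
        "method": "POST",
        "url": f"{LIVY_URL}/sessions/{session_id}/statements",
        "body": json.dumps({"code": code}),
    }


session_req = create_session_request("retail-demo")
statement_req = run_statement_request(0, "spark.range(5).count()")
print(session_req["url"])
print(statement_req["url"])
```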

**Execute the cell below**, then:

1. Select the `Add Endpoint` tab.
1. Select `Single Sign-on` and ensure there is a Livy address in the `Address` field. 
1. Click `Add Endpoint`.
1. Select the `Create Session` tab.
1. Provide a name (e.g. `retail-demo`).
1. Select `python` under the Language field.
1. Click `Create Session` (right side).

Your session will take a few minutes to initialize.

Once ready, the Manage Sessions pane will activate, displaying
your session ID. When the session state turns to idle, you're all set!

In [None]:
%manage_spark

Now, let's check the status of the session.

1. Navigate back to the Unified Analytics dashboard.
1. In the sidebar navigation menu, select `Spark Interactive Sessions`.

![image.png](./images/exercise1/menu.PNG)

3. Here, you can check the status of your session. It will take 2-3 minutes to start. When the `State` says `Idle`, the session is ready. 

![image.png](./images/exercise1/session.PNG)

4. Scroll back up to the Notebook cell of the session (the `%manage_spark` command). Confirm under the `Manage Sessions` tab that the session is now shown as `Idle` too.

![image.png](./images/exercise1/session2.PNG)

### Configure Spark Interactive Session

1. Run the `%config_spark` magic command.
2. Leave the settings as they are. Click `Submit`.

<div class="alert alert-block alert-danger">
    <b>Important:</b> Ignore the resulting message and <b>do not</b> restart the kernel.
</div>

In [None]:
%config_spark

Next, let's import the required libraries for working with Spark in this notebook.

In [None]:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
import os

We will also define the paths that Spark will read files from and save files to. These paths are specific to the Unified Analytics directory structure and should be left as they are.

In [None]:
file_root = "file:///mounts/shared-volume/user/data"
delta_root = "file:///mounts/shared-volume/shared/retail-delta/data/"

You can now instantiate the Spark session. We'll add delta extensions to the configuration to be able to interact with the delta tables.

In [None]:
# Set up the Spark session
spark = SparkSession.builder \
    .appName("DataCleaningWithSpark") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.local.dir", "/mnt/shared/end2end-main-exercises/exercises") \
    .getOrCreate()

print("Pyspark session started")

## **2. Generating and Preparing Sales Data**

In this section, we are going to synthetically generate several years of sales data from our three retail stores located in three countries: Switzerland, Germany and the Czech Republic. This sales data will provide the basis for the remaining exercises, where we will learn to analyze, graph and build dashboards to gather insights between and across regions. 

**Optional:** To use `Data Sources` connected through Unified Analytics (such as MySQL, MariaDB and PostgreSQL databases), follow **this**.


### Generating Sales Data

A Python script has been provided which can generate the sales data for the three given country locations. 

The parameters for this script are:

- c: Country name.
- cu: Currency, to account for conversions between stores.
- s: Number of stores in that region.
- sy: Start Year
- ey: End Year
- csv: Resulting File Name
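
The provided script's source is not shown in this notebook, so the following is only a hypothetical sketch of how a command-line interface with these flags could be declared using Python's standard `argparse` module. The `dest` names (`country`, `currency`, and so on) are assumptions chosen for readability, not the script's actual variable names.

```python
import argparse


def build_parser():
    """Declare the flags used by the %run commands in this section (illustrative only)."""
    parser = argparse.ArgumentParser(description="Generate synthetic sales data (sketch).")
    parser.add_argument("-c", dest="country", required=True, help="Country name")
    parser.add_argument("-cu", dest="currency", required=True, help="Currency code, e.g. EUR")
    parser.add_argument("-s", dest="stores", type=int, required=True, help="Number of stores")
    parser.add_argument("-sy", dest="start_year", type=int, required=True, help="Start year")
    parser.add_argument("-ey", dest="end_year", type=int, required=True, help="End year")
    parser.add_argument("-csv", dest="csv_name", required=True, help="Resulting file name")
    return parser


# Parse the same arguments that the Germany command passes to the script
args = build_parser().parse_args(
    ["-c", "Germany", "-cu", "EUR", "-s", "5", "-sy", "2019", "-ey", "2023",
     "-csv", "germany_sales_data_2019_2023.csv"]
)
print(args.country, args.currency, args.stores, args.start_year, args.end_year)
```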

We'll see the first 10 rows of the newly created table. 

In [None]:
%run resources/create_csv.py -c "Germany" -cu EUR -s 5 -sy 2019 -ey 2023 -csv "germany_sales_data_2019_2023.csv"

In [None]:
%run resources/create_csv.py -c "Czech Republic" -cu CZK -s 5 -sy 2019 -ey 2023 -csv "resources/czech_sales_data_2019_2023.csv"

In [None]:
%run resources/create_csv.py -c "Swiss" -cu CHF -s 5 -sy 2019 -ey 2023 -csv "resources/swiss_sales_data_2019_2023.csv"

Next, we'll ensure that our Spark Interactive session can access the data.

In [None]:
# Define the directory path
data_path = file_root

# List files in the directory
files = spark.sparkContext.wholeTextFiles(data_path)

# Display the list of files
for file_path, _ in files.collect():
    print(file_path)

## **3. Create Delta Tables**

In this section, we will create Delta Tables from our CSV files that we can query using Unified Analytics. Delta Tables are tables stored in Delta Lake, an open-source storage layer that keeps the data in Apache Parquet files and records every change in a transaction log.

### Define an ETL Pipeline to create Delta Tables 

First, let's define some functions that will:

1. Load the data in from a CSV and return a Spark DataFrame.

In [None]:
from pyspark.sql.types import IntegerType

def load_data(spark, country, data_path):
    # Define the path to the CSV file
    csv_path = f"{data_path}/{country}_sales_data_2019_2023.csv"

    # Define the schema with specific data types
    schema = StructType([
        StructField("PRODUCTID", IntegerType(), True),
        StructField("PRODUCT", StringType(), True),
        StructField("TYPE", StringType(), True),
        StructField("UNITPRICE", DoubleType(), True),
        StructField("UNIT", StringType(), True),
        StructField("QTY", IntegerType(), True),
        StructField("TOTALSALES", DoubleType(), True),
        StructField("CURRENCY", StringType(), True),
        StructField("STORE", StringType(), True),
        StructField("COUNTRY", StringType(), True),
        StructField("YEAR", IntegerType(), True)
    ])

    # Read data from the CSV file with the specified schema
    df = spark.read \
        .format("csv") \
        .schema(schema) \
        .option("header", "true") \
        .load(csv_path)

    return df

2. Clean the data, in this case by ensuring the currency of each item is standardized in Euros.

In [None]:
def clean_data(df, spark, country):
    # Convert an amount to EUR using the rates declared in the Constants cell
    def to_eur(currency, amount):
        if amount is None:
            return None  # leave missing values untouched
        if currency == "CZK":
            return amount / CZK_TO_EUR_RATE
        if currency == "CHF":
            return amount / CHF_TO_EUR_RATE
        return amount  # already in EUR

    # Wrap the function as a UDF that returns a double
    convert_udf = udf(to_eur, DoubleType())

    # Apply the UDF, then mark every row as EUR
    corrected_df = df.withColumn("totalsales", convert_udf(col("currency"), col("totalsales"))) \
                     .withColumn("currency", lit("EUR"))

    # Show the results
    corrected_df.show()

    return corrected_df

3. Save the data as parquet files (Delta Tables).

In [None]:
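def write_data(df, country):
    delta_path = delta_root + country

    # os functions need a plain filesystem path, so strip the "file://" URI scheme
    local_path = delta_path.replace("file://", "", 1)

    # Create the directory if it doesn't exist yet
    os.makedirs(local_path, exist_ok=True)

    # Spark accepts the full URI
    df.write.format("delta").mode("overwrite").save(delta_path)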
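
To see the arithmetic behind the currency conversion in step 2 without starting Spark, here is a minimal plain-Python sketch. It uses the simplified rates this exercise declares (25 CZK per EUR and 1 CHF per EUR); the function name `to_eur` and the handling of missing amounts are assumptions for illustration.

```python
# Simplified rates used in this exercise: units of foreign currency per 1 EUR
CZK_TO_EUR_RATE = 25
CHF_TO_EUR_RATE = 1


def to_eur(currency, amount):
    """Convert an amount in CZK or CHF to EUR; other currencies pass through unchanged."""
    if amount is None:
        return None  # nothing to convert
    if currency == "CZK":
        return amount / CZK_TO_EUR_RATE
    if currency == "CHF":
        return amount / CHF_TO_EUR_RATE
    return amount


print(to_eur("CZK", 2500.0))  # 100.0
print(to_eur("EUR", 42.5))    # 42.5
```

Because the amounts are divided by the rate, each constant is read as "how many units of that currency equal one Euro".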

Great! We've just created functions that will **extract** the data from our generated CSV files, **transform** them into Delta Tables with the currency standardized, then **load** them into a new directory.

You guessed it! We have just created an **ETL pipeline!** 

After declaring our country list and our currency conversion rates, we can run the pipeline.

In [None]:
# Constants
COUNTRY_LIST = ["czech", "germany", "swiss"]
CZK_TO_EUR_RATE = 25
CHF_TO_EUR_RATE = 1

<div class="alert alert-block alert-warning">
<b>Hint:</b> As you can tell by the parameters passed to the create_csv.py script in Section 2, we can synthetically generate data for as many stores in as many European countries as we want! Feel free to experiment, so long as the countries are declared in the cell above <b>and the countries that are already there remain.</b>
</div>

In [None]:
for country in COUNTRY_LIST:
    # Load data from the CSV file
    df = load_data(spark, country, data_path)
    df.show()
    
    # Clean the data
    cleaned_df = clean_data(df, spark, country)
    cleaned_df.printSchema()
    
    # Write the cleaned data back to the Delta Table
    write_data(cleaned_df, country)

Now, we'll confirm the Delta Tables were created correctly.

In [None]:
for country in COUNTRY_LIST:
    # os.listdir needs a plain filesystem path, so strip the "file://" URI scheme
    selected_country_path = (delta_root + country).replace("file://", "", 1)
    files = os.listdir(selected_country_path)
    print("Table:", country)
    
    for file in files:
        if file.endswith(".parquet"):
            full_path = os.path.join(selected_country_path, file)
            print("Saved in:", full_path)

    print()

## **4. Exploring Dataset Version Control**

For the last part of this exercise, we'll explore how to best leverage the Delta Table format to clean and manipulate our datasets using our Spark Interactive session.

To "clean" data involves identifying and correcting errors, inconsistencies, and inaccuracies within datasets to ensure their reliability and usability for any given analytics use case. In today's data-driven world, this is a **crucial** step of any analytics, modelling or AI workflow.

This process typically includes tasks such as handling missing values, removing duplicates, standardizing formats, and resolving discrepancies, ultimately aiming to improve the quality and integrity of the data to ensure any insight generated through analysis is accurate and sound.

Cleaning data will often take several iterations. If you make a modification on a dataset that you want to roll back, doing so manually can often be a nightmare. This is where the Apache Parquet format and Delta Tables come into their own. 

Let's explore how to use Delta Tables as our own dataset version control!
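
Before touching a real table, it may help to picture versioning with a toy model in plain Python. The sketch below is not how Delta Lake is implemented (Delta stores Parquet data files plus a transaction log under `_delta_log`); it simply keeps every write as a new snapshot so that older versions stay readable. The class name `VersionedTable` and the sample rows are made up for illustration.

```python
class VersionedTable:
    """Toy version log: every write adds a snapshot and never replaces an old one."""

    def __init__(self):
        self._versions = []  # position in the list == version number

    def write(self, rows):
        """Store a copy of rows as the newest version and return its number."""
        self._versions.append(list(rows))
        return len(self._versions) - 1

    def read(self, version=None):
        """Return the rows of a given version, or of the latest one by default."""
        if version is None:
            version = len(self._versions) - 1
        return list(self._versions[version])

    def history(self):
        """List all version numbers, oldest first."""
        return list(range(len(self._versions)))


table = VersionedTable()
table.write([{"product": "bread", "qty": 3}])  # version 0: the original data
table.write([{"product": None, "qty": 3}])     # version 1: a "ruined" write
table.write(table.read(version=0))             # version 2: roll back by rewriting version 0

print(table.history())
print(table.read())
```

Note that rolling back adds a third version rather than deleting the second one.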

<div class="alert alert-block alert-warning">
For the smoothest experience, remove all previous versions of Delta Tables that may exist from previous runs of this exercise.<br><br> Open a <b> Terminal </b> window and run: <i>rm -r /mnt/shared/retail-delta</i>
</div>

### Ruining a perfectly good Delta Table.

First, let's load the `czech` Delta Table as a Spark DataFrame.

In [None]:
from delta.tables import DeltaTable
# Set the parameters
country = "czech"
delta_path = delta_root + country

# Read the Delta table using the load method
read_df = spark.read.format("delta").load(delta_path)

# Show the contents of the DataFrame
read_df.show()

Next, we'll modify the Delta Table by changing the data values in all of the other columns aside from `TYPE`, `UNITPRICE`, `QTY` and `totalsales` to `NULL`.

This will result in a new version of the `czech` Delta Table (new Parquet data files plus a transaction log entry in the `czech` Delta Table path), which becomes the default when the `czech` Delta Table is loaded.

In [None]:
# Keep only a subset of columns; the remaining columns will be written as NULL
selected_columns = ["type", "unitprice", "qty", "totalsales"]

# Overwrite the Delta Table using the same schema, but with data for the selected columns only.
df_select = read_df.select(*selected_columns)
df_select.write.format("delta").mode("overwrite").save(delta_path)

Let's load the `czech` Delta Table now and see how it looks.

In [None]:
# Read the Delta table using the load method
read_df_select = spark.read.format("delta").load(delta_path)

# Show the contents of the DataFrame
read_df_select.show()

# Display the schema of the current (modified) DataFrame
read_df_select.printSchema()

### Time Warp!

Whilst it's fun to ruin data intentionally, it is very common in data engineering practice to make the occasional mistake. Thankfully, by saving our datasets as Delta Tables, we now have "version control" for any datasets that we manipulate. To ensure we have a complete table with no `NULL` values for the remaining exercises, let's roll the `czech` Delta Table back to before we messed with it.

Let's first observe the two versions of the `czech` Delta Table - before and after our column manipulation. 

In [None]:
# Create a DeltaTable object
delta_table = DeltaTable.forPath(spark, delta_path)

# Get the history of the Delta table
history_df = delta_table.history()

# List all versions with timestamp
versions_with_timestamp = history_df.select("version", "timestamp").distinct().orderBy("version").collect()

# Display the list of versions with timestamp
print("List of Delta Table Versions with Timestamp:")
for version_info in versions_with_timestamp:
    version = version_info["version"]
    timestamp = version_info["timestamp"]
    print(f"Version: {version}, Timestamp: {timestamp}")


As expected, two versions - timestamped! Let's print them out in full. 

In [None]:
# Read a specific version (e.g., version 0) of the Delta table
print("Before Manipulation:")
read_df_version_0 = spark.read.format("delta").option("versionAsOf", "0").load(delta_path)
read_df_version_0.show()

# Read a specific version (e.g., version 1) of the Delta table
print("After Manipulation:")
read_df_version_1 = spark.read.format("delta").option("versionAsOf", "1").load(delta_path)
read_df_version_1.show()

Now, let's set the default version for the `czech` Delta Table as the original dataset (Version 0). We'll do this by overwriting the current Delta Table (Version 1) with the data from the original (Version 0).

In [None]:
# Read a specific version (e.g., version 0) of the Delta table
read_df_version_0 = spark.read.format("delta").option("versionAsOf", "0").load(delta_path)

# If you want to perform further actions or overwrite the current Delta table:
# Overwrite the current Delta table with version 0 data
read_df_version_0.write.format("delta").mode("overwrite").save(delta_path)
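
As an alternative to overwriting, recent Delta Lake releases expose a `restoreToVersion` method on `DeltaTable` in the Python API. The helper below is a minimal sketch under the assumption that the Delta Lake version on your cluster includes that method; the function name `restore_and_report` is made up for illustration. It restores the table and returns the newest version number recorded in the table history.

```python
def restore_and_report(delta_table, version):
    """Restore a Delta table to `version` and return the newest version number.

    `delta_table` is expected to be a delta.tables.DeltaTable, for example
    DeltaTable.forPath(spark, delta_path).
    """
    # Record a new commit whose contents match the requested older version
    delta_table.restoreToVersion(version)

    # history(1) returns only the most recent commit
    latest_commit = delta_table.history(1).collect()[0]
    return latest_commit["version"]
```

In this notebook it could be called as `restore_and_report(DeltaTable.forPath(spark, delta_path), 0)`. Like the overwrite approach, a restore adds a new version to the history rather than erasing the unwanted one.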

We'll load the default `czech` Delta Table in and see how it looks.

In [None]:
# Read the Delta table using the load method
read_df_select = spark.read.format("delta").load(delta_path)

# Show the contents of the DataFrame
read_df_select.show()

# Display the schema of the restored DataFrame
read_df_select.printSchema()

Our original data is back! As you will recall, whenever a Delta Table is modified, a new version of the table is recorded (new Parquet data files plus a transaction log entry). We can see this when we observe all of the versions once again.

In [None]:
# Create a DeltaTable object
delta_table = DeltaTable.forPath(spark, delta_path)

# Get the history of the Delta table
history_df = delta_table.history()

# List all versions with timestamp
versions_with_timestamp = history_df.select("version", "timestamp").distinct().orderBy("version").collect()

# Display the list of versions with timestamp
print("List of Delta Table Versions with Timestamp:")
for version_info in versions_with_timestamp:
    version = version_info["version"]
    timestamp = version_info["timestamp"]
    print(f"Version: {version}, Timestamp: {timestamp}")

# **Conclusion**

In this exercise, you learned to perform the basics of data engineering - all within a single notebook! 

**HPE Ezmeral Unified Analytics** makes this possible by natively including the most widely used open-source data tools and frameworks and making them available out-of-the-box. As a result, you spent this time on invaluable data preparation for the upcoming exercises instead of spending hours installing and connecting them all!

In the next exercise, you will learn how to use EzPresto on HPE Ezmeral Unified Analytics to prepare these datasets for visualization and modelling. 

