# Module 2 Notebook 0: Loading the Data

In this notebook, we'll focus specifically on loading the data from the parquet files in storage, verifying the contents, and storing it as Spark tables so that we can easily refer to it in all future notebooks.

### Setup

To get set up, do these tasks first: 

- Upload the `ecommerce_*.parquet` files to storage accessible to your Spark cluster.
- Regardless of how you get the files into your storage, replace the paths I use in the code below with paths that make sense for your environment, and use the right authentication method. The ones I use are for accessing storage in the Databricks workspace that I have access to.

First, define the paths of the files we are reading. Replace them with whatever is appropriate for your environment.

In [6]:
# Define paths for each file in the Workspace storage
customers_path = "data/ecommerce_customers.parquet"
products_path = "data/ecommerce_products.parquet"
interactions_path = "data/ecommerce_interactions.parquet"
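If the files live somewhere other than a relative `data/` folder, one option is to build the three paths from a single base location so that only one value changes per environment. This is a minimal sketch and not part of the original notebook: the `ECOMMERCE_DATA_DIR` environment variable and the `build_paths` helper are hypothetical names chosen for illustration.

```python
import os
import posixpath


def build_paths(base_dir):
    """Return the three parquet paths under base_dir (hypothetical helper)."""
    names = ("customers", "products", "interactions")
    return {
        name: posixpath.join(base_dir, f"ecommerce_{name}.parquet")
        for name in names
    }


# ECOMMERCE_DATA_DIR is an assumed variable name; "data" matches the folder used above.
paths = build_paths(os.environ.get("ECOMMERCE_DATA_DIR", "data"))
customers_path = paths["customers"]
products_path = paths["products"]
interactions_path = paths["interactions"]
```

`posixpath.join` is used rather than `os.path.join` so that the separator stays `/` for storage URIs regardless of the operating system running the notebook.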

Let's load and verify the DataFrames.

In [7]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("MonApplication") \
    .getOrCreate()

In [9]:
# Load the parquet files
customers_df = spark.read.parquet(customers_path)
products_df = spark.read.parquet(products_path)
interactions_df = spark.read.parquet(interactions_path)

customers_df.show(5)
products_df.show(5)
interactions_df.show(5)


+-----------+---+------+-------+-----------+----------------+
|customer_id|age|gender|country|tenure_days|membership_level|
+-----------+---+------+-------+-----------+----------------+
|          1| 40|     M|     US|        936|        Platinum|
|          2| 33|     M|     UK|        192|          Silver|
|          3| 42|     M|     MX|        160|          Bronze|
|          4| 53|     M|     AU|        823|          Bronze|
|          5| 32|     F|     US|        513|          Bronze|
+-----------+---+------+-------+-----------+----------------+
only showing top 5 rows

+----------+--------+-----+----------+
|product_id|category|price|avg_rating|
+----------+--------+-----+----------+
|         1|Clothing|33.64|       3.8|
|         2|Clothing|24.05|       3.5|
|         3|  Beauty|58.67|       4.1|
|         4|Clothing|11.15|       4.3|
|         5|  Beauty|47.74|       4.5|
+----------+--------+-----+----------+
only showing top 5 rows

+-----------+----------+-----------------
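Eyeballing `show(5)` is a reasonable first check, but a small programmatic check on column names can catch a wrong file or path earlier. The sketch below is an assumption about how one might do that; `EXPECTED_COLUMNS`, `missing_columns`, and `check_schema` are hypothetical names. The customers and products columns are copied from the output above. For interactions, only `interaction_type` (used later in this notebook) is checked, because the full column list is not visible here.

```python
# Column names copied from the customers and products output in this notebook.
EXPECTED_COLUMNS = {
    "customers": ["customer_id", "age", "gender", "country",
                  "tenure_days", "membership_level"],
    "products": ["product_id", "category", "price", "avg_rating"],
    # Only interaction_type is confirmed by this notebook; the remaining
    # interactions columns are deliberately left out rather than guessed.
    "interactions": ["interaction_type"],
}


def missing_columns(df, expected):
    """Return the expected column names that df does not have."""
    present = set(df.columns)
    return [name for name in expected if name not in present]


def check_schema(frames):
    """Raise ValueError if any DataFrame in frames lacks an expected column.

    frames maps a key of EXPECTED_COLUMNS to an object exposing a
    .columns list, as Spark DataFrames do.
    """
    for key, df in frames.items():
        missing = missing_columns(df, EXPECTED_COLUMNS[key])
        if missing:
            raise ValueError(f"{key} is missing columns: {missing}")
```

In the notebook this could be called as `check_schema({"customers": customers_df, "products": products_df, "interactions": interactions_df})`, which returns nothing when every expected column is present.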

Let's review the row counts and the distribution of interaction types.

In [10]:
print("Customers count:", customers_df.count())
print("Products count:", products_df.count())
print("Interactions count:", interactions_df.count())

Customers count: 10000
Products count: 1000
Interactions count: 112942


In [11]:
# Import under an alias so that Spark's round() does not shadow Python's built-in round()
from pyspark.sql import functions as F

# Count the records for each interaction_type
interaction_count_df = interactions_df.groupBy("interaction_type").agg(F.count("*").alias("count"))

# Total number of interactions, used as the denominator for the percentages
total_interactions = interactions_df.count()

# Percentage of all interactions per type, rounded to two decimals
interaction_percentage_df = interaction_count_df.withColumn(
    "percentage", F.round((F.col("count") / total_interactions) * 100, 2)
)

# Display the distribution. display() renders a table in Databricks notebooks;
# elsewhere, interaction_percentage_df.show() prints the rows instead.
display(interaction_percentage_df)

DataFrame[interaction_type: string, count: bigint, percentage: double]
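The percentage column is simply `count / total * 100`, rounded to two decimals. To make that arithmetic concrete, here is the same calculation in plain Python. It is a sketch for sanity-checking only; the `percentage_distribution` helper and the counts fed to it are hypothetical and are not taken from the dataset.

```python
import builtins


def percentage_distribution(counts):
    """Map each key to its share of the total, as a percentage with 2 decimals."""
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    # builtins.round is used explicitly so this still works if `round` has been
    # imported from pyspark.sql.functions, which would shadow the built-in.
    return {
        key: builtins.round(value / total * 100, 2)
        for key, value in counts.items()
    }


# Hypothetical counts for illustration only; they are not from the dataset.
example = percentage_distribution({"view": 4, "add_to_cart": 2, "purchase": 1})
# example == {"view": 57.14, "add_to_cart": 28.57, "purchase": 14.29}
```

Spark's `round` rounds halves up, while Python's built-in rounds halves to even, so the two can differ in the last digit on exact ties.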

Next, we'll save the loaded DataFrames as persistent tables in an `ecommerce` database so they can be used across all modules.

In [12]:
# Create database if it doesn't exist
spark.sql("CREATE DATABASE IF NOT EXISTS ecommerce")
spark.sql("USE ecommerce")

# Drop tables if they exist
spark.sql("DROP TABLE IF EXISTS ecommerce.interactions")
spark.sql("DROP TABLE IF EXISTS ecommerce.customers")
spark.sql("DROP TABLE IF EXISTS ecommerce.products")

# Create or replace persistent tables
interactions_df.write.mode("overwrite").saveAsTable("ecommerce.interactions")
customers_df.write.mode("overwrite").saveAsTable("ecommerce.customers")
products_df.write.mode("overwrite").saveAsTable("ecommerce.products")

print("Created persistent tables in 'ecommerce' database")

Created persistent tables in 'ecommerce' database
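Before moving on, it can be worth confirming that the three tables are actually registered in the catalog. The sketch below assumes the `spark` session from this notebook and uses `spark.catalog.listTables`; the `missing_tables` helper and the `EXPECTED_TABLES` constant are hypothetical names.

```python
EXPECTED_TABLES = ("interactions", "customers", "products")


def missing_tables(spark, database="ecommerce", expected=EXPECTED_TABLES):
    """Return the expected table names not found in the given database.

    Relies on spark.catalog.listTables(database), which returns objects
    with a .name attribute.
    """
    existing = {table.name.lower() for table in spark.catalog.listTables(database)}
    return [name for name in expected if name.lower() not in existing]
```

After the cell above has run successfully, `missing_tables(spark)` should return an empty list; any names it does return point to a write that did not complete.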


Now continue to Module 2 Notebook 1!

In [13]:
interactions_df = spark.table("ecommerce.interactions")
customers_df = spark.table("ecommerce.customers")
products_df = spark.table("ecommerce.products")

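In [14]:
# Optional cleanup: this removes the ecommerce database and all of its tables.
# Run it only after you have finished with every notebook that uses them.
# The %sql magic is not valid in a plain Python cell, so call spark.sql directly.
spark.sql("DROP DATABASE IF EXISTS ecommerce CASCADE")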