# PySpark Medallion Architecture Learning Guide

Welcome to your PySpark learning repository! This notebook will guide you through the core concepts of PySpark for data engineering, focusing on the Medallion Architecture (Bronze, Silver, Gold layers). You'll learn how to set up your local environment, ingest raw data, transform it, and aggregate it for analytical purposes.

## 1. Setting up your Spark Session

Before we begin, let's ensure your Spark session is correctly configured. We've provided a utility function in `conf/spark_session_config.py` to help with this. Run the following Python code to initialize your Spark session.

**Exercise 1.1**: Run the cell below to create a SparkSession and print its version.


In [10]:
import os

os.path.dirname(os.getcwd())

'/Users/vamsi_mbmax/Developer/VAM_Documents/01_vam_PROJECTS/LEARNING/proj_Databases/dev_proj_Databases/practise_db_book_pyspark_learn'

In [11]:
os.path.abspath(os.path.dirname(os.getcwd()))

'/Users/vamsi_mbmax/Developer/VAM_Documents/01_vam_PROJECTS/LEARNING/proj_Databases/dev_proj_Databases/practise_db_book_pyspark_learn'

In [12]:
# Import necessary modules and create SparkSession with proper configuration
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Add the project root to the Python path to import custom modules
project_root = os.path.dirname(os.getcwd())
if project_root not in sys.path:
    sys.path.append(project_root)

from conf.spark_session_config import get_spark_session

# Create SparkSession using our configuration utility
spark = get_spark_session(app_name="MedallionArchitectureGuide")

print(f"Spark Version: {spark.version}")
print(f"Spark Context available: {spark.sparkContext is not None}")
print(f"Project root: {project_root}")

Spark Version: 4.0.0
Spark Context available: True
Project root: /Users/vamsi_mbmax/Developer/VAM_Documents/01_vam_PROJECTS/LEARNING/proj_Databases/dev_proj_Databases/practise_db_book_pyspark_learn
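
The contents of `conf/spark_session_config.py` are not shown in this notebook. As a rough sketch only, a helper of this kind might look like the code below. The names `build_spark_conf` and `get_spark_session_sketch`, the `local[*]` master, and the shuffle-partition value are assumptions made for illustration. The two Delta entries are the standard Delta Lake session settings, and they only take effect if the Delta Lake JARs are also available to Spark.

```python
def build_spark_conf(app_name: str, enable_delta: bool = True) -> dict[str, str]:
    """Return Spark settings as a plain dict (hypothetical helper, not the repo's code)."""
    conf = {
        "spark.app.name": app_name,
        "spark.master": "local[*]",           # assumption: local learning setup
        "spark.sql.shuffle.partitions": "4",  # assumption: small local datasets
    }
    if enable_delta:
        # Standard Delta Lake session settings; the Delta JARs must be on the classpath.
        conf["spark.sql.extensions"] = "io.delta.sql.DeltaSparkSessionExtension"
        conf["spark.sql.catalog.spark_catalog"] = (
            "org.apache.spark.sql.delta.catalog.DeltaCatalog"
        )
    return conf


def get_spark_session_sketch(app_name: str = "MedallionArchitectureGuide",
                             enable_delta: bool = True):
    """Build (or reuse) a SparkSession from the settings above."""
    # Imported here so build_spark_conf stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in build_spark_conf(app_name, enable_delta).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Keeping the settings in a dictionary makes it easy to print them and check what a session will be created with before any JVM is started.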


## 2. Bronze Layer: Raw Data Ingestion

The Bronze layer is where raw data is ingested as-is from source systems. We've already generated some sample sales data for you in `data/raw/sales_data.csv`. The `scripts/bronze_ingestion.py` script reads this CSV and saves it as a Delta table in `data/bronze/sales`.

**Exercise 2.1**: Run the `bronze_ingestion.py` script from your terminal (or a new cell if you prefer, but it's designed to be run as a script). Then, use PySpark to read the ingested Bronze table and display its schema and a few rows.

```bash
# From your project root directory in the terminal:
PYTHONPATH=$(pwd) python3 scripts/bronze_ingestion.py
```

**Exercise 2.2**: Read the Bronze table using PySpark.


In [7]:
# from pyspark.sql import SparkSession
# from pyspark.sql import functions as F

# spark = SparkSession.builder.appName("MedallionArchitectureGuide").getOrCreate()

# spark.version

In [23]:
import findspark

findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("BronzeLayerIngestion").getOrCreate()

spark.version

'4.0.0'

In [20]:
# Read the raw sales CSV, using the first row as the header and inferring column types
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .csv("../data/raw/sales_data.csv")
)
print("Raw sales data (first 5 rows):")
df.limit(5).show()

Raw sales data (first 5 rows):
+--------------+------------+-------------+-----------+--------+------+----------------+
|transaction_id|product_name|customer_name|       city|quantity| price|transaction_date|
+--------------+------------+-------------+-----------+--------+------+----------------+
|             1|  Headphones|  Customer_11|    Phoenix|       4|333.86|      2024-02-23|
|             2|  Headphones|  Customer_28|    Phoenix|       4|411.97|      2024-12-02|
|             3|      Laptop|   Customer_5|    Phoenix|       4| 341.6|      2024-09-16|
|             4|     Monitor|  Customer_39|    Phoenix|       3|247.17|      2024-12-21|
|             5|       Mouse|  Customer_28|Los Angeles|       3| 11.75|      2024-06-27|
+--------------+------------+-------------+-----------+--------+------+----------------+



In [21]:
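# Note: this cell writes plain Parquet, while scripts/bronze_ingestion.py is described
# above as writing a Delta table to the same path. Use the matching format when reading it back.
df.write.mode("overwrite").parquet("../data/bronze/sales")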

In [22]:
spark.stop()
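
Exercise 2.2 asks you to read the Bronze table back, but the format on disk depends on whether the script (Delta) or the notebook cell (Parquet) wrote it last. The sketch below picks the format by looking for the `_delta_log` directory that Delta tables keep next to their data files. The helper names `detect_table_format` and `read_bronze` are hypothetical, and `spark` is assumed to be an active SparkSession.

```python
import os


def detect_table_format(path: str) -> str:
    """Return "delta" if the directory holds a Delta transaction log, else "parquet"."""
    return "delta" if os.path.isdir(os.path.join(path, "_delta_log")) else "parquet"


def read_bronze(spark, path: str = "../data/bronze/sales"):
    """Load the Bronze table with whichever format the directory appears to use."""
    return spark.read.format(detect_table_format(path)).load(path)
```

With an active session (create a new one if you have already called `spark.stop()`), `bronze_df = read_bronze(spark)` followed by `bronze_df.printSchema()` and `bronze_df.show(5)` covers the exercise.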

## 3. Silver Layer: Cleaned and Conformed Data

The Silver layer contains cleaned, conformed, and enriched data. In our example, we'll calculate the `total_price` and add a `processing_timestamp`. The `scripts/silver_transformation.py` script performs this transformation.
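
The script itself is not reproduced here, so the following is only a sketch of how such a transformation could be written. It assumes that `total_price` means `quantity * price` (using the column names from the sample CSV) and that `processing_timestamp` is the time the job runs. The names `SILVER_EXPRESSIONS` and `to_silver` are hypothetical.

```python
# SQL expressions applied to every Bronze row.
# Assumption: total_price = quantity * price; the script may define it differently.
SILVER_EXPRESSIONS = [
    "*",                                            # keep all Bronze columns
    "quantity * price AS total_price",              # derived revenue per transaction
    "current_timestamp() AS processing_timestamp",  # when this row was processed
]


def to_silver(bronze_df):
    """Return the Bronze DataFrame with the two derived Silver columns added."""
    return bronze_df.selectExpr(*SILVER_EXPRESSIONS)
```

Under that assumption, the first sample row (quantity 4, price 333.86) would get a `total_price` of 1335.44.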

**Exercise 3.1**: Run the `silver_transformation.py` script from your terminal. Then, read the Silver table and inspect its schema and data.

```bash
# From your project root directory in the terminal:
PYTHONPATH=$(pwd) python3 scripts/silver_transformation.py
```

**Exercise 3.2**: Read the Silver table using PySpark and display its content.


**Exercise 3.3**: Try to perform an additional transformation. For example, filter the data to only include sales where `quantity` is greater than 1.
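
One possible approach to Exercises 3.2 and 3.3 is sketched below. The Silver path `../data/silver/sales` and the `delta` format are assumptions made by analogy with the Bronze and Gold paths, so check them against `scripts/silver_transformation.py`. The helper names are hypothetical, and `spark` is assumed to be an active SparkSession.

```python
def read_silver(spark, path: str = "../data/silver/sales", fmt: str = "delta"):
    """Load the Silver table; both defaults are assumptions, adjust them to your project."""
    return spark.read.format(fmt).load(path)


def sales_above_quantity(df, threshold: int = 1):
    """Keep only rows whose quantity is strictly greater than threshold."""
    return df.filter(f"quantity > {int(threshold)}")
```

For example, `silver_df = read_silver(spark)` followed by `silver_df.printSchema()` and `sales_above_quantity(silver_df).show(5)` displays the schema and the first filtered rows. Every sample row shown in the Bronze section has a quantity of 3 or 4, so all of those rows would pass this filter.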


## 4. Gold Layer: Aggregated Data for Analytics

The Gold layer is optimized for analytics and reporting. Here, we'll aggregate the sales data to get the total revenue per product and city. The `scripts/gold_aggregation.py` script handles this.
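
As a hedged illustration of that aggregation (not the script's actual code), the sketch below groups Silver rows by product and city and sums `total_price`. The output column name `total_revenue` and the function name `to_gold` are assumptions.

```python
def to_gold(silver_df):
    """Aggregate Silver sales into total revenue per product and city."""
    return (
        silver_df.groupBy("product_name", "city")
        .sum("total_price")  # Spark names the result column "sum(total_price)"
        .withColumnRenamed("sum(total_price)", "total_revenue")  # assumed column name
    )
```

The rename step matters because the default column name contains parentheses, which is awkward to reference in later SQL queries.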

**Exercise 4.1**: Run the `gold_aggregation.py` script from your terminal. Then, read the Gold table and examine the aggregated results.

```bash
# From your project root directory in the terminal:
PYTHONPATH=$(pwd) python3 scripts/gold_aggregation.py
```

**Exercise 4.2**: Read the Gold table using PySpark and display its content.


In [None]:
# TODO: Read the Gold table using Delta format
# Hint: Use spark.read.format("delta").load("../data/gold/sales_summary")
# Display schema and show all rows


**Exercise 4.3**: Use Spark SQL to query the Gold table. For example, find the top 5 products by total revenue.


In [None]:
# TODO: Create a temporary view and query it using Spark SQL
# Hint: First create a temp view, then use spark.sql() to query
# Find top 5 products by total revenue
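
A minimal sketch of one way to approach this exercise follows. It assumes the Gold table has a `product_name` column and a revenue column named `total_revenue`; check the real schema with `printSchema()` first. The view name `gold_sales` and both function names are hypothetical. Because the Gold table is described as aggregated per product and city, the query sums again per product.

```python
def top_products_sql(view_name: str = "gold_sales", limit: int = 5) -> str:
    """Build a query for the highest-revenue products in a Gold temp view."""
    return (
        "SELECT product_name, SUM(total_revenue) AS product_revenue "  # assumed columns
        f"FROM {view_name} "
        "GROUP BY product_name "
        "ORDER BY product_revenue DESC "
        f"LIMIT {int(limit)}"
    )


def top_products(spark, gold_df, view_name: str = "gold_sales", limit: int = 5):
    """Register gold_df as a temp view and run the top-products query against it."""
    gold_df.createOrReplaceTempView(view_name)
    return spark.sql(top_products_sql(view_name, limit))
```

With `gold_df = spark.read.format("delta").load("../data/gold/sales_summary")` from the earlier hint, `top_products(spark, gold_df).show()` prints the result.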


## 5. Cleaning Up

When you have finished the exercises, stop your SparkSession to release the resources it holds.

**Exercise 5.1**: Stop the SparkSession.


In [None]:
# TODO: Stop the SparkSession
# Hint: Use spark.stop() and print a confirmation message


## Next Steps

- Experiment with different data transformations and aggregations.
- Try ingesting data from other formats (e.g., JSON, Parquet) into the Bronze layer.
- Explore more advanced PySpark features like UDFs, window functions, and structured streaming.
- Apply these concepts to your Databricks projects!

Happy PySparking!
