# PySpark Medallion Architecture Learning Guide

Welcome to your PySpark learning repository! This notebook will guide you through the core concepts of PySpark for data engineering, focusing on the Medallion Architecture (Bronze, Silver, Gold layers). You'll learn how to set up your local environment, ingest raw data, transform it, and aggregate it for analytical purposes.

## 1. Setting up your Spark Session

Before we begin, let's ensure your Spark session is correctly configured. We've provided a utility function in `conf/spark_session_config.py` to help with this. Run the following Python code to initialize your Spark session.

**Exercise 1.1**: Run the cell below to create a SparkSession and print its version.


In [3]:
# Import necessary modules and create SparkSession with proper configuration
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Add the project root to the Python path to import custom modules
project_root = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
if project_root not in sys.path:
    sys.path.append(project_root)

from conf.spark_session_config import get_spark_session

# Create SparkSession using our configuration utility
spark = get_spark_session(app_name="MedallionArchitectureGuide")

print(f"Spark Version: {spark.version}")
print(f"Spark Context available: {spark.sparkContext is not None}")
print(f"Project root: {project_root}")

Spark Version: 4.0.0
Spark Context available: True
Project root: /Users/vamsi_mbmax/Developer/VAM_Documents/01_vam_PROJECTS/LEARNING/proj_Databases/dev_proj_Databases/practise_db_book_pyspark_learn/ref_manus_pyspark_course/pyspark_learning_repo


## 2. Bronze Layer: Raw Data Ingestion

The Bronze layer is where raw data is ingested as-is from source systems. A sample sales dataset is already provided in `data/raw/sales_data.csv`. The `scripts/bronze_ingestion.py` script reads this CSV and writes it in Parquet format to `data/bronze/sales`.

**Exercise 2.1**: Run the `bronze_ingestion.py` script from your terminal (or a new cell if you prefer, but it's designed to be run as a script). Then, use PySpark to read the ingested Bronze table and display its schema and a few rows.

```bash
# From your project root directory in the terminal:
PYTHONPATH=$(pwd) python3 scripts/bronze_ingestion.py
```
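
If you call the ingestion function from a notebook cell instead, it can help to build input and output paths from the project root rather than typing absolute paths by hand. This is a minimal sketch using only the standard library. `layer_path` is a hypothetical helper name, and the assumption, mirroring how `project_root` is derived in the session setup cell, is that the notebook's working directory sits one level below the project root.

```python
from pathlib import Path

# Assumption: the notebook runs from a folder directly under the project root.
root_dir = Path.cwd().parent


def layer_path(*parts: str, root: Path = root_dir) -> str:
    """Join path parts under the project root and return a string for Spark."""
    return str(root.joinpath(*parts))


raw_csv = layer_path("data", "raw", "sales_data.csv")
bronze_dir = layer_path("data", "bronze", "sales")
print(raw_csv)
print(bronze_dir)
```

The resulting strings can be passed as `input_path` and `output_path`, which keeps the notebook portable across machines.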

**Exercise 2.2**: Read the Bronze table using PySpark.


In [5]:
# Run the bronze ingestion script to create the bronze layer
from scripts.bronze_ingestion import ingest_raw_sales_data

# Ingest data from raw CSV to bronze layer (Parquet format)
ingest_raw_sales_data(
    spark,
    input_path="/Users/vamsi_mbmax/Developer/VAM_Documents/01_vam_PROJECTS/LEARNING/proj_Databases/dev_proj_Databases/practise_db_book_pyspark_learn/data/raw/sales_data.csv",
    output_path="data/bronze/sales",
)

print("\n" + "="*50)
print("Bronze layer ingestion completed!")
print("="*50)

# Now read the bronze layer data to verify it was created correctly
bronze_df = spark.read.parquet("data/bronze/sales")
bronze_df.printSchema()
bronze_df.show(5)

Reading raw data from /Users/vamsi_mbmax/Developer/VAM_Documents/01_vam_PROJECTS/LEARNING/proj_Databases/dev_proj_Databases/practise_db_book_pyspark_learn/data/raw/sales_data.csv...
Data schema:
root
 |-- transaction_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- customer_name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- price: double (nullable = true)
 |-- transaction_date: date (nullable = true)

Number of records: 1000
Writing data to Bronze layer at data/bronze/sales...

Bronze layer ingestion complete.
Sample data:
+--------------+------------+-------------+-----------+--------+------+----------------+
|transaction_id|product_name|customer_name|       city|quantity| price|transaction_date|
+--------------+------------+-------------+-----------+--------+------+----------------+
|             1|    Keyboard|  Customer_49|   New York|       4|678.03|      2024-06-30|
|             2|      Laptop|  Customer_37|Los Angeles|       3|637.83|      2024-07-05|
|             3|      Laptop|  Customer_40|    Houston|       2|665.58|      2024-10-21|
|             4|  Headphones|   Customer_3|Los Angeles|       4|222.11|      2024-11-20|
|             5|       Mouse|  Customer_10|Los Angeles|       3|  87.2|      2024-07-25|
+--------------+------------+-------------+-----------+--------+------+----------------+
only showing top 5 rows

Bronze layer ingestion completed!
root
 |-- transaction_id: integer (nullable = true)
 |-- product_name: string (nullable = true)

## 3. Silver Layer: Cleaned and Conformed Data

The Silver layer contains cleaned, conformed, and enriched data. In our example, we'll calculate the `total_price` and add a `processing_timestamp`. The `scripts/silver_transformation.py` script performs this transformation.

**Exercise 3.1**: Run the `silver_transformation.py` script from your terminal. Then, read the Silver table and inspect its schema and data.

```bash
# From your project root directory in the terminal:
PYTHONPATH=$(pwd) python3 scripts/silver_transformation.py
```

**Exercise 3.2**: Read the Silver table using PySpark and display its content.


In [None]:
# Run the silver transformation script to create the silver layer
from scripts.silver_transformation import transform_sales_data

# Transform data from bronze to silver layer
transform_sales_data(spark, input_path="data/bronze/sales", output_path="data/silver/sales")

print("\n" + "="*50)
print("Silver layer transformation completed!")
print("="*50)

# Now read the silver layer data to verify it was created correctly
silver_df = spark.read.parquet("data/silver/sales")
silver_df.printSchema()
silver_df.show(5)

**Exercise 3.3**: Try to perform an additional transformation. For example, filter the data to only include sales where `quantity` is greater than 1.


In [None]:
# Exercise 3.3: Filter the silver_df to show only sales where quantity > 1
filtered_silver_df = silver_df.filter(silver_df.quantity > 1)

print(f"Records with quantity > 1: {filtered_silver_df.count()}")
print("Sample filtered data:")
filtered_silver_df.show(5)

# Additional transformations you can try:
# 1. Filter by date range
recent_sales = silver_df.filter(F.col("transaction_date") >= "2024-06-01")
print(f"\nRecent sales (from June 2024): {recent_sales.count()}")

# 2. Filter by high-value transactions
high_value_sales = silver_df.filter(F.col("total_price") > 1000)
print(f"High-value sales (> $1000): {high_value_sales.count()}")

# 3. Group by product and show summary statistics
product_summary = silver_df.groupBy("product_name").agg(
    F.count("transaction_id").alias("total_transactions"),
    F.sum("total_price").alias("total_revenue"),
    F.avg("total_price").alias("avg_order_value")
).orderBy(F.col("total_revenue").desc())

print("\nProduct summary:")
product_summary.show()

## 4. Gold Layer: Aggregated Data for Analytics

The Gold layer is optimized for analytics and reporting. Here, we'll aggregate the sales data to get the total revenue per product and city. The `scripts/gold_aggregation.py` script handles this.

**Exercise 4.1**: Run the `gold_aggregation.py` script from your terminal. Then, read the Gold table and examine the aggregated results.

```bash
# From your project root directory in the terminal:
PYTHONPATH=$(pwd) python3 scripts/gold_aggregation.py
```

**Exercise 4.2**: Read the Gold table using PySpark and display its content.


In [None]:
# Run the gold aggregation script to create the gold layer
from scripts.gold_aggregation import aggregate_sales_data

# Aggregate data from silver to gold layer
aggregate_sales_data(spark, input_path="data/silver/sales", output_path="data/gold/sales_summary")

print("\n" + "="*50)
print("Gold layer aggregation completed!")
print("="*50)

# Now read the gold layer data to verify it was created correctly
gold_df = spark.read.parquet("data/gold/sales_summary")
gold_df.printSchema()
gold_df.show()

**Exercise 4.3**: Use Spark SQL to query the Gold table. For example, find the top 5 products by total revenue.


In [None]:
# Exercise 4.3: Use Spark SQL to query the Gold table
# First, create a temporary view
gold_df.createOrReplaceTempView("sales_summary")

# Query 1: Find the top 5 products by total revenue
print("Top 5 products by total revenue:")
spark.sql("""
    SELECT product_name, SUM(total_revenue) as total_product_revenue
    FROM sales_summary
    GROUP BY product_name
    ORDER BY total_product_revenue DESC
    LIMIT 5
""").show()

# Query 2: Find the top 5 cities by total revenue
print("Top 5 cities by total revenue:")
spark.sql("""
    SELECT city, SUM(total_revenue) as total_city_revenue
    FROM sales_summary
    GROUP BY city
    ORDER BY total_city_revenue DESC
    LIMIT 5
""").show()

# Query 3: Find the best performing product-city combinations
print("Top 10 product-city combinations by revenue:")
spark.sql("""
    SELECT product_name, city, total_revenue, total_transactions
    FROM sales_summary
    ORDER BY total_revenue DESC
    LIMIT 10
""").show()

# Query 4: Summary statistics across all data
print("Overall summary statistics:")
spark.sql("""
    SELECT
        COUNT(*) as total_combinations,
        SUM(total_revenue) as grand_total_revenue,
        AVG(total_revenue) as avg_revenue_per_combination,
        SUM(total_transactions) as total_transactions,
        SUM(total_quantity_sold) as total_quantity_sold
    FROM sales_summary
""").show()

## 5. Cleaning Up

When you're done practicing, stop your SparkSession to release the resources it holds.

**Exercise 5.1**: Stop the SparkSession.


In [None]:
# Exercise 5.1: Stop the SparkSession
spark.stop()
print("SparkSession stopped.")

# Summary of what we accomplished:
print("\n" + "="*60)
print("MEDALLION ARCHITECTURE PIPELINE COMPLETED!")
print("="*60)
print("✅ Bronze Layer: Raw data ingested from CSV to Parquet")
print("✅ Silver Layer: Data cleaned, enriched with calculated columns")
print("✅ Gold Layer: Data aggregated for analytics and reporting")
print("✅ SQL Queries: Performed analytics on the Gold layer")
print("="*60)

## Next Steps

- Experiment with different data transformations and aggregations.
- Try ingesting data from other formats (e.g., JSON, Parquet) into the Bronze layer.
- Explore more advanced PySpark features like UDFs, window functions, and structured streaming.
- Apply these concepts to your Databricks projects!

Happy PySparking!