# PySpark Medallion Architecture Learning Guide

Welcome to your PySpark learning repository! This notebook will guide you through the core concepts of PySpark for data engineering, focusing on the Medallion Architecture (Bronze, Silver, Gold layers). You'll learn how to set up your local environment, ingest raw data, transform it, and aggregate it for analytical purposes.

## 1. Setting up your Spark Session

Before we begin, let's ensure your Spark session is correctly configured. We've provided a utility function in `conf/spark_session_config.py` to help with this. Run the following Python code to initialize your Spark session.
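
For orientation, a helper of this kind might look roughly like the sketch below. It is an assumption, not the contents of `conf/spark_session_config.py`: the names `delta_session_options` and `build_delta_session` are hypothetical, and the sketch supposes that the `pyspark` and `delta-spark` packages are installed and that Spark runs locally. The two configuration keys are the ones the Delta Lake documentation lists for enabling its SQL extension and catalog.

```python
def delta_session_options():
    """Spark settings that enable Delta Lake support."""
    return {
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_delta_session(app_name, master="local[*]"):
    """Hypothetical stand-in for the repository's get_spark_session helper."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = SparkSession.builder.appName(app_name).master(master)
    for key, value in delta_session_options().items():
        builder = builder.config(key, value)
    # configure_spark_with_delta_pip adds the Delta Lake package that matches
    # the installed delta-spark version to the session's dependencies.
    return configure_spark_with_delta_pip(builder).getOrCreate()
```

Keeping the options in a separate function makes them easy to inspect or extend without starting a JVM.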

**Exercise 1.1**: Run the cell below to create a SparkSession and print its version.


In [None]:
import os
import sys

# Add the project root to the Python path so the `conf` package can be imported.
# This assumes the notebook runs from a directory directly under the project root.
project_root = os.path.abspath(os.path.join(os.getcwd(), os.pardir))
if project_root not in sys.path:
    sys.path.append(project_root)

from conf.spark_session_config import get_spark_session

spark = get_spark_session(app_name="MedallionArchitectureGuide")
print(f"Spark Version: {spark.version}")

## 2. Bronze Layer: Raw Data Ingestion

The Bronze layer is where raw data is ingested as-is from source systems. We've already generated some sample sales data for you in `data/raw/sales_data.csv`. The `scripts/bronze_ingestion.py` script reads this CSV and saves it as a Delta table in `data/bronze/sales`.
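
As a rough illustration of what an as-is ingestion step can look like, here is a minimal sketch. It is not the actual `scripts/bronze_ingestion.py`: the function name `ingest_csv_to_bronze` is hypothetical, and the sketch assumes the CSV has a header row and that overwriting the target on each run is acceptable.

```python
def ingest_csv_to_bronze(spark, source_path, target_path):
    """Copy a CSV file into a Delta table without reshaping it."""
    raw_df = (
        spark.read
        .option("header", "true")         # assumption: first row holds column names
        .option("inferSchema", "false")   # keep every column as a string, as delivered
        .csv(source_path)
    )
    # Overwrite so that re-running the exercise produces the same table.
    raw_df.write.format("delta").mode("overwrite").save(target_path)
    return raw_df
```

From the project root, a call such as `ingest_csv_to_bronze(spark, "data/raw/sales_data.csv", "data/bronze/sales")` would mirror the paths used in this guide. Leaving type inference off keeps the Bronze table a faithful copy of the source; casting is deferred to the Silver layer.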

**Exercise 2.1**: Run the `bronze_ingestion.py` script from your terminal (or a new cell if you prefer, but it's designed to be run as a script). Then, use PySpark to read the ingested Bronze table and display its schema and a few rows.

```bash
# From your project root directory in the terminal:
PYTHONPATH="$(pwd)" python3 scripts/bronze_ingestion.py
```

**Exercise 2.2**: Read the Bronze table using PySpark.


In [None]:
bronze_df = spark.read.format("delta").load("../data/bronze/sales")
bronze_df.printSchema()
bronze_df.show(5)

## 3. Silver Layer: Cleaned and Conformed Data

The Silver layer contains cleaned, conformed, and enriched data. In our example, we'll calculate the `total_price` and add a `processing_timestamp`. The `scripts/silver_transformation.py` script performs this transformation.
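
The two derived columns can be sketched with SQL expressions passed to `DataFrame.selectExpr`. This is an illustration under assumptions, not the actual `scripts/silver_transformation.py`: the names `silver_expressions` and `to_silver` are hypothetical, and the price column name `unit_price` is a guess, so replace it with whatever your Bronze schema shows.

```python
def silver_expressions(price_col="unit_price"):
    """SQL expressions for the Silver columns; price_col is an assumed column name."""
    return [
        "*",  # keep every Bronze column unchanged
        f"CAST(quantity AS INT) * CAST({price_col} AS DOUBLE) AS total_price",
        "current_timestamp() AS processing_timestamp",
    ]


def to_silver(bronze_df, price_col="unit_price"):
    """Return bronze_df plus the total_price and processing_timestamp columns."""
    return bronze_df.selectExpr(*silver_expressions(price_col))
```

The explicit casts matter when the Bronze columns were loaded as strings; if they are already numeric, the casts are harmless.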

**Exercise 3.1**: Run the `silver_transformation.py` script from your terminal. Then, read the Silver table and inspect its schema and data.

```bash
# From your project root directory in the terminal:
PYTHONPATH="$(pwd)" python3 scripts/silver_transformation.py
```

**Exercise 3.2**: Read the Silver table using PySpark and display its content.


In [None]:
silver_df = spark.read.format("delta").load("../data/silver/sales")
silver_df.printSchema()
silver_df.show(5)

**Exercise 3.3**: Try to perform an additional transformation. For example, filter the data to only include sales where `quantity` is greater than 1.


In [None]:
# Your code here for Exercise 3.3
# filtered_silver_df = silver_df.filter(silver_df.quantity > 1)
# filtered_silver_df.show()

## 4. Gold Layer: Aggregated Data for Analytics

The Gold layer is optimized for analytics and reporting. Here, we'll aggregate the sales data to get the total revenue per product and city. The `scripts/gold_aggregation.py` script handles this.
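
A per-group revenue total of this kind can be sketched as follows. This is an assumption-laden illustration rather than the actual `scripts/gold_aggregation.py`: the function name `to_gold` and the grouping column names `product` and `city` are hypothetical, while `total_price` and `total_revenue` are the names used elsewhere in this guide.

```python
def to_gold(silver_df, group_cols=("product", "city")):
    """Sum total_price per group; the group column names are assumptions."""
    return (
        silver_df.groupBy(*group_cols)
        # The dict form of agg names its output column "sum(total_price)".
        .agg({"total_price": "sum"})
        .withColumnRenamed("sum(total_price)", "total_revenue")
    )
```

Writing the result with `.write.format("delta").mode("overwrite").save(...)` would then produce a table like the one read in Exercise 4.2.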

**Exercise 4.1**: Run the `gold_aggregation.py` script from your terminal. Then, read the Gold table and examine the aggregated results.

```bash
# From your project root directory in the terminal:
PYTHONPATH="$(pwd)" python3 scripts/gold_aggregation.py
```

**Exercise 4.2**: Read the Gold table using PySpark and display its content.


In [None]:
gold_df = spark.read.format("delta").load("../data/gold/sales_summary")
gold_df.printSchema()
gold_df.show()

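**Exercise 4.3**: Use Spark SQL to query the Gold table. For example, find the top 5 rows by total revenue. Because the table is aggregated per product and city, each row is a product and city pair rather than a product alone.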


In [None]:
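# Your code here for Exercise 4.3
# Register the DataFrame as a temporary view first so Spark SQL can resolve the name.
# gold_df.createOrReplaceTempView("sales_summary")
# spark.sql("SELECT * FROM sales_summary ORDER BY total_revenue DESC LIMIT 5").show()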

## 5. Cleaning Up

When you have finished practicing, stop your SparkSession to release the resources it holds.

**Exercise 5.1**: Stop the SparkSession.


In [None]:
spark.stop()
print("SparkSession stopped.")

## Next Steps

- Experiment with different data transformations and aggregations.
- Try ingesting data from other formats (e.g., JSON, Parquet) into the Bronze layer.
- Explore more advanced PySpark features like UDFs, window functions, and structured streaming.
- Apply these concepts to your Databricks projects!
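
As a starting point for the other-formats suggestion above, here is a minimal sketch of a reader that picks Spark's built-in data source from the file extension. The names `detect_format` and `read_raw` are hypothetical, and the header option for CSV is an assumption about your files.

```python
import os

# File extensions mapped to Spark's built-in data source names.
_FORMATS = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def detect_format(path):
    """Return the Spark data source name for a path, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in _FORMATS:
        raise ValueError(f"Unsupported file extension: {ext!r}")
    return _FORMATS[ext]


def read_raw(spark, path):
    """Load a raw file into a DataFrame using the detected format."""
    fmt = detect_format(path)
    reader = spark.read.format(fmt)
    if fmt == "csv":
        reader = reader.option("header", "true")  # assumption: CSVs carry a header row
    # Note: Spark's JSON source expects one JSON object per line by default.
    return reader.load(path)
```

The resulting DataFrame can be written to the Bronze layer in the same way as the CSV data.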

Happy PySparking!