# Lesson 7: Architecture Overview & Demo

Welcome to Lesson 7! In this notebook, we will explore the architecture of our Big Data platform and run a simple end-to-end data pipeline.

## 1. Architecture Overview

Our platform consists of several key components working together:

1.  **PostgreSQL (`postgres`)**: Our relational database source, hosting the `pagila` (DVD rental) dataset.
2.  **Apache Spark (`spark-master`, `spark-worker`)**: The distributed processing engine used for ETL (Extract, Transform, Load).
3.  **MinIO (`minio`)**: An S3-compatible object store that serves as our Data Lake (Bronze, Silver, Gold layers).
4.  **Trino (`trino`)**: A distributed SQL query engine for analytics across our data sources.
5.  **Jupyter (`jupyter`)**: This environment, used for development and exploration.
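
Before running the pipeline, it can help to confirm that the services listed above accept connections from this notebook. The following is a minimal sketch, not part of the lesson's code: the hostnames come from the list above, the Postgres and MinIO ports are the ones used later in this notebook, and the Spark and Trino ports are assumed defaults that may differ in your setup.

```python
import socket

# Hostnames come from the component list. The Postgres and MinIO ports appear
# later in this notebook; the Spark and Trino ports are assumptions.
SERVICES = {
    "postgres": 5432,
    "minio": 9000,
    "spark-master": 7077,  # assumption: Spark standalone default master port
    "trino": 8080,         # assumption: Trino default HTTP port
}


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hostnames.
        return False


def check_services(services=SERVICES):
    """Map each service name to True/False depending on TCP reachability."""
    return {name: is_reachable(name, port) for name, port in services.items()}
```

Calling `check_services()` from a notebook cell returns a dictionary such as `{"postgres": True, ...}`. A `False` entry only means the TCP port did not answer; it says nothing about credentials or service health.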

### Data Flow Diagram

```mermaid
graph LR
    A["Postgres (Source)"] -->|Extract| B(Spark Engine)
    B -->|Load Raw| C[MinIO Bronze]
    C -->|Read| B
    B -->|Transform| D[MinIO Silver]
    D -->|Read| E[Trino]
    E -->|Analyze| F[Jupyter/SQL Client]
```

## 2. Environment Setup

First, we need to initialize our Spark session with the configuration it needs to communicate with MinIO (via the S3A connector) and Postgres (via JDBC).

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Lesson7-Architecture-Demo") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4,org.postgresql:postgresql:42.5.4") \
    .getOrCreate()

print("Spark Session Created Successfully!")
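
The cell above hardcodes the MinIO endpoint and credentials, which is fine for a local demo. As a hedged alternative, the sketch below builds the same S3A settings from environment variables and falls back to the values used above. The variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, and `MINIO_SECRET_KEY` are hypothetical and are not defined anywhere in this lesson.

```python
import os


def s3a_options(env=os.environ):
    """Build the S3A settings used above, preferring environment variables.

    The environment variable names are hypothetical; the defaults match the
    values hardcoded in the session cell.
    """
    return {
        "spark.hadoop.fs.s3a.endpoint": env.get("MINIO_ENDPOINT", "http://minio:9000"),
        "spark.hadoop.fs.s3a.access.key": env.get("MINIO_ACCESS_KEY", "minioadmin"),
        "spark.hadoop.fs.s3a.secret.key": env.get("MINIO_SECRET_KEY", "minioadmin"),
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }


# Show the resolved endpoint without printing any secrets.
print(s3a_options()["spark.hadoop.fs.s3a.endpoint"])
```

To apply the result, loop over the dictionary and call `.config(key, value)` on the builder for each pair before `.getOrCreate()`.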

## 3. Step 1: Extract (Postgres to Spark)

We will read the `customer` table from the `pagila` database currently running in our Postgres container.

In [None]:
# Postgres Connection Details
jdbc_url = "jdbc:postgresql://postgres:5432/pagila"
db_properties = {
    "user": "postgres",
    "password": "postgres",
    "driver": "org.postgresql.Driver"
}

# Read data from Postgres
df_customer = spark.read.jdbc(url=jdbc_url, table="customer", properties=db_properties)

print(f"Extracted {df_customer.count()} records from Postgres.")
df_customer.show(5)
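
The read above pulls the whole `customer` table. Spark's JDBC source also accepts a parenthesized subquery with an alias in place of a table name, so the database can do the filtering and column selection. The helper below is an illustrative sketch (the name `pushdown_table` is hypothetical) that only builds that string; this lesson deliberately lands the full table in Bronze, so treat it as an option rather than the method used here.

```python
def pushdown_table(table, columns, where=None, alias="src"):
    """Return a parenthesized subquery usable as the JDBC `table` argument.

    Inputs are interpolated directly into SQL, so pass trusted values only.
    """
    query = f"SELECT {', '.join(columns)} FROM {table}"
    if where:
        query += f" WHERE {where}"
    return f"({query}) AS {alias}"


customer_subquery = pushdown_table(
    "customer",
    ["customer_id", "first_name", "last_name", "email", "create_date"],
    where="active = 1",
)
print(customer_subquery)
```

The resulting string can be passed as `table=customer_subquery` to `spark.read.jdbc` together with the same `jdbc_url` and `db_properties`.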

## 4. Step 2: Load (Spark to MinIO - Bronze)

Now we save this raw data into our Data Lake (MinIO) in the `bronze` bucket. We'll use Parquet, a columnar file format that stores the schema with the data and lets analytical queries read only the columns they need.

In [None]:
# Define S3 path
bronze_path = "s3a://bronze/customer_raw"

# Write to MinIO
df_customer.write.mode("overwrite").parquet(bronze_path)

print(f"Data written to {bronze_path}")
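
This notebook uses one bucket per layer (`s3a://bronze/...` here, `s3a://silver/...` later). A small helper can keep those paths consistent. This is a sketch under two assumptions: the function name `layer_path` is hypothetical, and the Gold layer is assumed to live in a bucket named `gold`, which this lesson never writes to.

```python
# One bucket per layer. "gold" is an assumed bucket name; the notebook only
# writes to the bronze and silver buckets.
LAYERS = ("bronze", "silver", "gold")


def layer_path(layer, dataset):
    """Return the s3a:// path for a dataset in the given Data Lake layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}, expected one of {LAYERS}")
    return f"s3a://{layer}/{dataset}"


print(layer_path("bronze", "customer_raw"))
print(layer_path("silver", "customer_active"))
```

Rejecting unknown layer names catches typos such as `"bronz"` before Spark attempts to write to a bucket that does not exist.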

## 5. Step 3: Transform (Bronze to Silver)

Let's perform a simple transformation. We'll read the data back from Bronze, keep only active customers (`active = 1`), and select a subset of columns.

In [None]:
# Read from Bronze
df_bronze = spark.read.parquet(bronze_path)

# Transformation: Filter active customers (active=1) and select key fields
df_silver = df_bronze.filter("active = 1") \
    .select("customer_id", "first_name", "last_name", "email", "create_date")

print(f"Transformed data has {df_silver.count()} active customers.")

# Write to Silver bucket
silver_path = "s3a://silver/customer_active"
df_silver.write.mode("overwrite").parquet(silver_path)

print(f"Transformed data written to {silver_path}")
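
To make the filter-then-select logic concrete, here is the same transformation on plain Python dictionaries. The sample rows are made up for illustration; `store_id` stands in for the columns that the `select` drops.

```python
# Hypothetical sample rows; the values are invented for this illustration.
bronze_rows = [
    {"customer_id": 1, "first_name": "ANA", "last_name": "DOE",
     "email": "ana.doe@example.com", "create_date": "2020-01-01",
     "store_id": 1, "active": 1},
    {"customer_id": 2, "first_name": "BEN", "last_name": "ROE",
     "email": "ben.roe@example.com", "create_date": "2020-01-02",
     "store_id": 2, "active": 0},
    {"customer_id": 3, "first_name": "CAL", "last_name": "POE",
     "email": "cal.poe@example.com", "create_date": "2020-01-03",
     "store_id": 1, "active": 1},
]

SILVER_COLUMNS = ["customer_id", "first_name", "last_name", "email", "create_date"]


def to_silver(rows):
    """Keep rows with active == 1 and project them onto SILVER_COLUMNS."""
    return [
        {col: row[col] for col in SILVER_COLUMNS}
        for row in rows
        if row["active"] == 1
    ]


silver_rows = to_silver(bronze_rows)
print([row["customer_id"] for row in silver_rows])  # [1, 3]
```

Spark applies the same two steps, but lazily and in parallel across partitions; nothing is computed until an action such as `count()` or `write` runs.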

## 6. Step 4: Verification

Let's read back the silver data to verify everything worked as expected.

In [None]:
df_verify = spark.read.parquet(silver_path)
df_verify.show(10)

# Optional: Stop Spark Session
spark.stop()
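
Inspecting `show(10)` by eye works for a demo. A programmatic check is one possible complement, sketched below under stated assumptions: the function name `silver_problems` is hypothetical, and it takes plain values (a column list and two row counts) so that it does not depend on a live Spark session.

```python
EXPECTED_COLUMNS = ["customer_id", "first_name", "last_name", "email", "create_date"]


def silver_problems(columns, silver_count, bronze_count):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if list(columns) != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {list(columns)}")
    if silver_count == 0:
        problems.append("silver is empty")
    if silver_count > bronze_count:
        # A filter can only remove rows, so silver cannot exceed bronze.
        problems.append("silver has more rows than bronze")
    return problems


# Example with made-up counts.
print(silver_problems(EXPECTED_COLUMNS, 10, 12))  # []
```

In the notebook, this would be called as `silver_problems(df_verify.columns, df_verify.count(), df_bronze.count())`, and it has to run before `spark.stop()`.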

## Conclusion

You have successfully run a data pipeline that:
1.  Extracted data from a transactional DB (Postgres).
2.  Loaded it into a raw data lake layer (MinIO Bronze).
3.  Transformed it and saved it to a refined layer (MinIO Silver).

This is the foundational pattern for modern data engineering platforms!