# PySpark Basics Demo

This notebook demonstrates basic PySpark operations and lays the groundwork for later ETL development.

**Runtime**: AWS Glue Interactive Session (Notebook)

**Learning objectives**:
- Understand SparkSession initialization
- Learn basic DataFrame operations
- Read and write CSV and Parquet files

## 1. Initialize Glue Session

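In a Glue notebook, run the magic commands before any other code. They configure the session (idle timeout in minutes, Glue version, worker type, and number of workers) before it starts.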

In [None]:
# Glue Notebook magic configuration
%idle_timeout 60
%glue_version 4.0
%worker_type G.1X
%number_of_workers 2

In [None]:
# Import required libraries
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Initialize Glue Context
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

print("Spark version:", spark.version)
print("Initialization complete!")

## 2. Create sample data

Let's create some sample data to simulate an e-commerce scenario.

In [None]:
# Create product data
products_data = [
    (1, "Organic Banana", 24, 4),
    (2, "Whole Milk", 84, 16),
    (3, "Organic Strawberries", 24, 4),
    (4, "Bag of Organic Bananas", 24, 4),
    (5, "Organic Baby Spinach", 123, 4),
    (6, "Large Lemon", 24, 4),
    (7, "Strawberries", 24, 4),
    (8, "Limes", 24, 4),
    (9, "Organic Avocado", 24, 4),
    (10, "Organic Whole Milk", 84, 16),
]

products_schema = StructType([
    StructField("product_id", IntegerType(), False),
    StructField("product_name", StringType(), True),
    StructField("aisle_id", IntegerType(), True),
    StructField("department_id", IntegerType(), True),
])

products_df = spark.createDataFrame(products_data, products_schema)
products_df.show()

In [None]:
# Create aisle data
aisles_data = [
    (24, "fresh fruits"),
    (84, "milk"),
    (123, "packaged vegetables fruits"),
]

aisles_df = spark.createDataFrame(aisles_data, ["aisle_id", "aisle"])
aisles_df.show()

In [None]:
# Create department data
departments_data = [
    (4, "produce"),
    (16, "dairy eggs"),
]

departments_df = spark.createDataFrame(departments_data, ["department_id", "department"])
departments_df.show()

## 3. Basic DataFrame operations

In [None]:
# View schema
print("=== Products Schema ===")
products_df.printSchema()

In [None]:
# Count records
print(f"Product count: {products_df.count()}")
print(f"Aisle count: {aisles_df.count()}")
print(f"Department count: {departments_df.count()}")

In [None]:
# Select specific columns
products_df.select("product_id", "product_name").show(5)

In [None]:
# Filter data
# Find all organic products
organic_products = products_df.filter(
    F.col("product_name").contains("Organic")
)
organic_products.show()

In [None]:
# Add a new column
products_with_flag = products_df.withColumn(
    "is_organic",
    F.when(F.col("product_name").contains("Organic"), True).otherwise(False)
)
products_with_flag.show()

## 4. JOIN operations

Joins are the core operation in Bronze to Silver transformations, where raw tables are combined into cleaned, enriched ones.

In [None]:
# LEFT JOIN: products + aisles
products_with_aisle = products_df.join(
    aisles_df,
    on="aisle_id",
    how="left"
)
products_with_aisle.show()

In [None]:
# Multi-table JOIN: products + aisles + departments
dim_products = products_df \
    .join(aisles_df, "aisle_id", "left") \
    .join(departments_df, "department_id", "left") \
    .select(
        F.col("product_id"),
        F.col("product_name"),
        F.col("aisle_id"),
        F.col("aisle"),
        F.col("department_id"),
        F.col("department")
    )

print("=== Dimension table: dim_products ===")
dim_products.show()

## 5. Aggregations

In [None]:
# Count products by department
products_by_dept = dim_products.groupBy("department").agg(
    F.count("product_id").alias("product_count")
)
products_by_dept.show()

In [None]:
# Count products by aisle and sort
products_by_aisle = dim_products.groupBy("aisle").agg(
    F.count("product_id").alias("product_count")
).orderBy(F.desc("product_count"))

products_by_aisle.show()

## 6. De-duplication and null handling

In [None]:
# Create data with duplicates and nulls
dirty_data = [
    (1, "Apple", 10),
    (1, "Apple", 10),      # duplicate
    (2, "Banana", None),   # null
    (3, None, 20),         # null
]

dirty_df = spark.createDataFrame(dirty_data, ["id", "name", "price"])
print("=== Raw dirty data ===")
dirty_df.show()

In [None]:
# Remove duplicates: keep one row per id (which row survives is not guaranteed)
deduped_df = dirty_df.dropDuplicates(["id"])
print("=== After de-duplication ===")
deduped_df.show()

In [None]:
# Fill nulls
clean_df = deduped_df.fillna({
    "name": "Unknown",
    "price": 0
})
print("=== After filling nulls ===")
clean_df.show()

## 7. Write files

**Note**: Replace `YOUR_BUCKET_NAME` with your S3 bucket name.
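
One way to avoid writing to a bucket literally named `YOUR_BUCKET_NAME` is to build the output paths through a small guard. The helper below is a hypothetical sketch, not part of the notebook: `demo_path` and `PLACEHOLDER` are illustrative names, and `my-example-bucket` stands in for a real bucket.

```python
PLACEHOLDER = "YOUR_BUCKET_NAME"


def demo_path(bucket_name: str, dataset: str) -> str:
    """Build the s3:// output path for a demo dataset, rejecting the placeholder."""
    if not bucket_name or bucket_name == PLACEHOLDER:
        raise ValueError("Set BUCKET_NAME to a real S3 bucket before writing.")
    return f"s3://{bucket_name}/demo/{dataset}/"


# Hypothetical bucket name, for illustration only.
print(demo_path("my-example-bucket", "dim_products_parquet"))
```

Failing fast here surfaces the mistake as a clear `ValueError` in the notebook rather than as an S3 access error partway through a write.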

In [None]:
# Set your S3 bucket name
BUCKET_NAME = "YOUR_BUCKET_NAME"  # ← replace with your bucket name

# Write CSV
dim_products.write \
    .mode("overwrite") \
    .option("header", "true") \
    .csv(f"s3://{BUCKET_NAME}/demo/dim_products_csv/")

print("CSV write complete!")

In [None]:
# Write Parquet (recommended format)
dim_products.write \
    .mode("overwrite") \
    .parquet(f"s3://{BUCKET_NAME}/demo/dim_products_parquet/")

print("Parquet write complete!")

In [None]:
# Read back the Parquet we just wrote
read_back_df = spark.read.parquet(f"s3://{BUCKET_NAME}/demo/dim_products_parquet/")
read_back_df.show()

## 8. Class exercise

Try to complete the following tasks:

1. Create an orders DataFrame with the columns `order_id`, `user_id`, and `order_dow` (day of week)
2. Count the number of orders per day of week
3. Find the day of week with the most orders

In [None]:
# Exercise code area
# Hint:
# orders_data = [(1, 101, 0), (2, 102, 1), ...]
# orders_df = spark.createDataFrame(orders_data, ["order_id", "user_id", "order_dow"])

# Your code:


## Clean up resources

Stop the session when you are done so that idle workers do not keep accruing cost.

In [None]:
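# Stop the Spark session (uncomment to run)
# spark.stop()
# print("Session stopped")
# In a Glue interactive session, the %stop_session magic ends the session itself.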