# LAB 02: ELT Ingestion & Transformations

**Duration:** ~40 min | **Day:** 1 | **Difficulty:** Beginner-Intermediate
**After module:** M02: ELT Data Ingestion

> *"Load raw data from CSV/JSON files, transform it, and save as Delta tables in the Bronze layer."*

Complete the `# TODO` cells below. Each task has a validation cell.

## Setup

In [None]:
%run ../../setup/00_setup

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
from pyspark.sql.functions import col, concat, lit, upper, trim, count

---
## Task 1: Read Customers CSV with Explicit Schema

Define a `StructType` schema and read the customers CSV file.

**Columns:** customer_id (string), first_name (string), last_name (string), email (string), city (string), state (string), country (string)

In [None]:
# TODO: Define the schema
customers_schema = StructType([
    StructField("customer_id", ________, True),
    StructField("first_name", ________, True),
    StructField("last_name", ________, True),
    StructField("email", ________, True),
    StructField("city", ________, True),
    StructField("state", ________, True),
    StructField("country", ________, True),
])

# TODO: Read the CSV with your schema
customers_path = f"{DATASET_PATH}/customers/customers.csv"

df_customers = (
    spark.read
    .format(________)
    .schema(________)
    .option("header", True)
    .load(customers_path)
)

df_customers.printSchema()
display(df_customers.limit(5))

In [None]:
# -- Validation --
assert df_customers.count() > 0, "DataFrame is empty!"
assert df_customers.schema["customer_id"].dataType == StringType(), "customer_id should be StringType"
assert df_customers.schema["first_name"].dataType == StringType(), "first_name should be StringType"
print(f"Task 1 OK: {df_customers.count()} customers loaded with correct schema")

---
## Task 2: Read Orders JSON

Read the orders batch JSON file. Spark infers the schema from the field names and values in the JSON records, so no explicit definition is needed here.

**File:** `{DATASET_PATH}/orders/orders_batch.json`

In [None]:
# TODO: Read orders JSON
orders_path = f"{DATASET_PATH}/orders/orders_batch.json"

df_orders = (
    spark.read
    .format(________)
    .load(orders_path)
)

df_orders.printSchema()
display(df_orders.limit(5))

In [None]:
# -- Validation --
assert df_orders.count() > 0, "Orders DataFrame is empty!"
print(f"Task 2 OK: {df_orders.count()} orders loaded")

---
## Task 3: Read Products CSV

Read the products CSV file. Use `inferSchema` this time (for comparison with Task 1).

**File:** `{DATASET_PATH}/products/products.csv`

In [None]:
# TODO: Read products CSV with inferSchema
products_path = f"{DATASET_PATH}/products/products.csv"

df_products = (
    spark.read
    .format("csv")
    .option("header", True)
    .option(________, ________)
    .load(products_path)
)

df_products.printSchema()
display(df_products.limit(5))

In [None]:
# -- Validation --
assert df_products.count() > 0, "Products DataFrame is empty!"
print(f"Task 3 OK: {df_products.count()} products loaded")

---
## Task 4: Transform Customer Data

Apply the following transformations to create `df_customers_clean`:

1. **Select** columns: customer_id, first_name, last_name, email, city, country
2. **Add column** `full_name` = first_name + " " + last_name
3. **Transform** email to lowercase using `lower()`
4. **Filter** to keep only non-null emails

In [None]:
from pyspark.sql.functions import lower

# TODO: Apply transformations
df_customers_clean = (
    df_customers
    .select("customer_id", "first_name", "last_name", "email", "city", "country")
    .withColumn("full_name", concat(col("first_name"), lit(" "), col(________)))
    .withColumn("email", lower(col(________)))
    .filter(col("email").________())
)

display(df_customers_clean.limit(5))

In [None]:
# -- Validation --
assert "full_name" in df_customers_clean.columns, "Missing 'full_name' column"
# Check every row, not only the first one
assert df_customers_clean.filter(col("email") != lower(col("email"))).count() == 0, "Email should be lowercase"
assert df_customers_clean.filter(col("email").isNull()).count() == 0, "Should have no null emails"
print(f"Task 4 OK: {df_customers_clean.count()} clean customer records")

---
## Task 5: Temporary View + SQL Query

1. Register `df_customers_clean` as temp view `v_customers`
2. Write a SQL query to count customers per **country**, ordered by count DESC

In [None]:
# TODO: Create temporary view
df_customers_clean.createOrReplaceTempView(________)

In [None]:
# TODO: SQL query - customers per country
df_by_country = spark.sql("""
    SELECT ________, COUNT(*) as customer_count
    FROM v_customers
    GROUP BY ________
    ORDER BY customer_count ________
""")

display(df_by_country)

In [None]:
# -- Validation --
assert df_by_country.count() > 0, "Query returned no results"
assert "customer_count" in df_by_country.columns, "Missing 'customer_count' column"
top_rows = df_by_country.take(2)  # fetch the first two rows in one action
first_row, second_row = top_rows[0], top_rows[-1]  # same row twice if only one country
assert first_row["customer_count"] >= second_row["customer_count"], "Should be sorted DESC"
print(f"Task 5 OK: Found {df_by_country.count()} countries")

---
## Task 6: Save as Delta Tables (Bronze Layer)

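Save `df_customers_clean`, `df_orders`, and `df_products` as managed Delta tables in the Bronze schema. On Databricks, `saveAsTable` writes Delta by default, so no explicit `.format("delta")` is needed.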

Use `mode("overwrite")` so the lab can be re-run.

In [None]:
# TODO: Save customers to Bronze
(
    df_customers_clean
    .write
    .mode(________)
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.customers")
)
print("customers saved")

# TODO: Save orders to Bronze
(
    df_orders
    .write
    .mode(________)
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.orders")
)
print("orders saved")

# TODO: Save products to Bronze
(
    df_products
    .write
    .mode(________)
    .saveAsTable(f"{CATALOG}.{BRONZE_SCHEMA}.products")
)
print("products saved")

In [None]:
# -- Validation --
tables = [f"{CATALOG}.{BRONZE_SCHEMA}.customers",
          f"{CATALOG}.{BRONZE_SCHEMA}.orders",
          f"{CATALOG}.{BRONZE_SCHEMA}.products"]

for t in tables:
    c = spark.table(t).count()
    assert c > 0, f"Table {t} is empty!"
    print(f"  {t}: {c} rows")

print("\nTask 6 OK: All Bronze tables created!")

---
## Task 7: Verify with SQL

Run a SQL query to show all tables in your Bronze schema.

In [None]:
# Show all tables in the Bronze schema (the query is provided; just run the cell)
display(spark.sql(f"SHOW TABLES IN {CATALOG}.{BRONZE_SCHEMA}"))

---
## Lab Complete!

You have:
- Read CSV (explicit and inferred schema) and JSON (inferred schema) files
- Applied transformations: select, withColumn, filter, concat, lower
- Created a temp view and ran SQL aggregation queries
- Saved 3 Delta tables in the Bronze layer

> **Exam Tip:** With `inferSchema`, Spark makes an extra pass over the CSV data to work out column types before loading it. Prefer an explicit schema in production. Parquet files embed their schema; JSON does not, so Spark infers a JSON schema by scanning the records.

> **Next:** LAB 03 - Delta DML & Time Travel

## Cleanup (Optional)

In [None]:
# Optional cleanup
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.customers")
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.orders")
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.products")
print("LAB 02 complete. Bronze tables preserved for LAB 03.")