# Workshop 1: Ingestion & Transformations

## The Story

You are a Data Engineer at a retail company. The marketing team has requested a clean list of customers to run a new email campaign.
The data is currently sitting in a CSV file in the landing zone, but it's raw and needs processing.

**Your Mission:**
1. Ingest the raw customer data from CSV.
2. Select only the relevant columns (ID, name, email, company, phone).
3. Create a `FullName` column by combining First and Last names.
4. Add an audit timestamp to track when the data was processed.
5. Save the clean data as a Delta table for the marketing team to use.

**Time:** 30 minutes


In [0]:
%run ../00_setup

In [0]:
# --- INDEPENDENT SETUP ---
# Ensure source data exists for this workshop
import os

# Define path (the same location the workshop cells below load from)
source_dir = f"{DATASET_BASE_PATH}/workshop"
source_file = f"{source_dir}/Customers.csv"

# Check if source file exists
try:
    dbutils.fs.ls(source_file)
    print(f"Source file found: {source_file}")
except Exception:
    # dbutils.fs.ls raises if the path does not exist
    print(f"WARNING: Source file not found at {source_file}. Please ensure datasets are uploaded to the Volume.")

print(f"Catalog: {CATALOG}")
print(f"Bronze Schema: {BRONZE_SCHEMA}")
print(f"Silver Schema: {SILVER_SCHEMA}")
print(f"Gold Schema:   {GOLD_SCHEMA}")

In [0]:
# Check if variables were loaded
print(f"Catalog: {CATALOG}")
print(f"Volume:  {DATASET_BASE_PATH}")

## Step 1: Source Data Exploration

Before loading data, let's see what we have available in the source directory.


In [0]:
# List files in the workshop directory
dbutils.fs.ls(f"{DATASET_BASE_PATH}/workshop/")

## Step 2: Loading Customer Data

### Task 2.1: Load the `Customers.csv` file

**Requirements:**
- Use CSV format
- File has headers
- Let Spark automatically detect data types (`inferSchema`)

**Hint:**
```python
spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("path")
```


In [0]:
# Path to file
customers_path = f"{DATASET_BASE_PATH}/workshop/Customers.csv"

# TODO: Load Customers.csv file into df_customers DataFrame
df_customers = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(customers_path)
)

display(df_customers)

In [0]:
df_customers.createOrReplaceTempView("customer_view")

In [0]:
%sql

select * from customer_view

In [0]:
# Check result
print(f"Loaded {df_customers.count()} customers")
display(df_customers.limit(5))


## Step 3: Transformations

### Task 3.1: Select required columns

The marketing team needs only:
- `CustomerID`
- `FirstName`
- `LastName`
- `EmailAddress`
- `CompanyName`
- `Phone`

**Hint:** Use `.select("column1", "column2", ...)`


In [0]:
%sql

select CustomerID,
FirstName,
LastName,
EmailAddress,
CompanyName,
Phone from customer_view

In [0]:
# TODO: Select only required columns
df_customers_clean = df_customers.select(
    # Add columns
)

### Task 3.2: Create `FullName` column

Combine `FirstName` and `LastName` into a single `FullName` column.

**Hint:** Use the `concat_ws` function, which joins column values with a separator and skips nulls.


In [0]:
from pyspark.sql.functions import concat_ws, col, upper, trim, current_timestamp

# TODO: Add FullName column
df_customers_enriched = df_customers_clean.withColumn(
    "FullName",
    # Complete the code here
)

### Task 3.3: Filter invalid emails

Keep only customers whose `EmailAddress` contains `@`. Rows with a missing email address are dropped as well, because a null never satisfies the condition.

In [0]:
# TODO: Keep only rows where EmailAddress contains '@'
df_customers_filtered = df_customers_enriched.filter(
    # Complete the code here
)

### Task 3.4: Analyze Company Distribution

Check how many customers belong to each company. Sort the result by count in descending order.

In [0]:
# TODO: Group by CompanyName and count
# display(...)

## Step 4: Adding Audit Column

### Task 4.1: Add audit column

Add an `ingestion_timestamp` column holding the current time. Recording when each row was processed is good practice in ETL!


In [0]:
# TODO: Add ingestion_timestamp column
df_final = df_customers_filtered.withColumn(
    "ingestion_timestamp",
    # Complete the code here - use current_timestamp()
)


## Step 5: Save to Delta Lake

### Task 5.1: Save as Delta table

Save the resulting DataFrame as a managed Delta Lake table named `customers_silver`.

**Hint:**
```python
df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("catalog.schema.table_name")
```


In [0]:
table_name = f"{CATALOG}.{SILVER_SCHEMA}.customers_silver"

# TODO: Save df_final as Delta table
(
    df_final.write
    # Complete the code here
)

print(f"Saved table: {table_name}")

## Step 6: Verification

Let's check if the table was created correctly.


In [0]:
# Check the table
display(spark.table(table_name))

In [0]:
# Check Delta metadata
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

## Step 7: SQL Access (The Lakehouse Advantage)

You just created a table using Python. Now, let's query it immediately using SQL!
This demonstrates how Data Engineers and Data Analysts can work on the same data.


In [0]:
# Query the table using SQL (via Spark)
display(spark.sql(f"""
SELECT * 
FROM {CATALOG}.{SILVER_SCHEMA}.customers_silver 
WHERE CompanyName = 'Johnson and Sons'
"""))

## Cleanup (Optional)


In [0]:
# WARNING: Uncomment only if you want to delete the table!
# spark.sql(f"DROP TABLE IF EXISTS {table_name}")

# Solution

The complete code is below. Try to solve it yourself first!


In [0]:
# ============================================================
# FULL SOLUTION - Workshop 1: Ingestion & Transformations
# ============================================================

from pyspark.sql.functions import concat_ws, col, current_timestamp

# --- Step 2: Loading data ---
customers_path = f"{DATASET_BASE_PATH}/workshop/Customers.csv"

df_customers = (
    spark.read
    .format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load(customers_path)
)

# --- Step 3: Transformations ---
df_customers_clean = df_customers.select(
    "CustomerID", "FirstName", "LastName", 
    "EmailAddress", "CompanyName", "Phone"
)

df_customers_enriched = df_customers_clean.withColumn(
    "FullName",
    concat_ws(" ", col("FirstName"), col("LastName"))
)

# Task 3.3: Filter
df_customers_filtered = df_customers_enriched.filter(col("EmailAddress").contains("@"))

# Task 3.4: Analysis
print("Company Distribution:")
# Using display() allows for built-in plotting!
display(df_customers_filtered.groupBy("CompanyName").count().orderBy("count", ascending=False))

# --- Step 4: Add audit column ---
df_final = df_customers_filtered.withColumn(
    "ingestion_timestamp",
    current_timestamp()
)

# --- Step 5: Save to Delta ---
table_name = f"{CATALOG}.{SILVER_SCHEMA}.customers_silver"

(
    df_final.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(table_name)
)


display(spark.sql(f"""
SELECT * 
FROM {CATALOG}.{SILVER_SCHEMA}.customers_silver 
WHERE CompanyName = 'Johnson and Sons'
"""))

print(f"Solution executed! Table: {table_name}")
print(f"Row count: {spark.table(table_name).count()}")
display(spark.table(table_name).limit(5))