---
## 🔧 Step 0: Environment Setup

Run the cell below to load configuration variables (`catalog`, `schema`, `volume_path`).

In [None]:
# Load configuration variables
%run ../00_setup

In [None]:
# Check if variables were loaded
print(f"📁 Catalog: {catalog}")
print(f"📁 Schema:  {schema}")
print(f"📁 Volume:  {volume_path}")
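
If the setup notebook did not define one of these variables, the `print` calls above stop with a `NameError` at the first missing name. As a minimal sketch, the check below reports every missing or empty variable at once. The helper name `missing_config` and the sample values are hypothetical, chosen only for illustration.

```python
def missing_config(namespace, required=("catalog", "schema", "volume_path")):
    """Return the required names that are absent or empty in `namespace`."""
    return [name for name in required if not namespace.get(name)]


# Hypothetical sample values; in the notebook, call missing_config(globals()).
sample_namespace = {"catalog": "workshop", "schema": "demo", "volume_path": ""}
print(missing_config(sample_namespace))  # ['volume_path']
```

An empty list means all three variables are set and the rest of the workshop can proceed.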

---
## 📂 Step 1: Source Data Exploration

Before loading data, let's see what we have available.

In [None]:
# List files in the workshop directory
dbutils.fs.ls(f"{volume_path}/main/")
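
`dbutils.fs.ls` returns a list of file entries that expose `path`, `name`, and `size` attributes, so the listing can be filtered with plain Python. The sketch below uses a `namedtuple` stand-in for those entries so that it also runs outside Databricks. The helper name `csv_names` and the sample paths are hypothetical.

```python
from collections import namedtuple

# Stand-in for the entries returned by dbutils.fs.ls (fields used here: path, name, size).
FileEntry = namedtuple("FileEntry", ["path", "name", "size"])


def csv_names(entries):
    """Return the sorted names of entries that end in .csv (case-insensitive)."""
    return sorted(e.name for e in entries if e.name.lower().endswith(".csv"))


# Hypothetical listing; in the notebook, pass dbutils.fs.ls(f"{volume_path}/main/").
sample_listing = [
    FileEntry("/example/main/Customers.csv", "Customers.csv", 1024),
    FileEntry("/example/main/readme.txt", "readme.txt", 64),
]
print(csv_names(sample_listing))  # ['Customers.csv']
```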

---
## 📥 Step 2: Loading Customer Data

### Task 2.1: Load `Customers.csv` file

**Requirements:**
- Use the CSV format
- Treat the first row as column names (`header`)
- Let Spark infer the column data types (`inferSchema`)

**Hint:**
```python
spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("path")
```

In [None]:
# Path to file
customers_path = f"{volume_path}/main/Customers.csv"

# TODO: Load Customers.csv file into df_customers DataFrame
df_customers = (
    spark.read
    # Complete the code here
    # .format(...)
    # .option(...)
    # .load(...)
)

In [None]:
# Check result - should be 847 rows
print(f"✅ Loaded {df_customers.count()} customers")
display(df_customers.limit(5))

---
## 🔄 Step 3: Transformations

### Task 3.1: Select required columns from customers

The marketing team needs only these columns:
- `CustomerID`
- `FirstName`
- `LastName`
- `EmailAddress`
- `CompanyName`
- `Phone`

**Hint:** Use `.select("column1", "column2", ...)`
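
`select` fails when a requested column does not exist, for example when the header row was not read and Spark named the columns `_c0`, `_c1`, and so on. A minimal sketch of a pre-check on the column names follows. `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration.

```python
REQUIRED_COLUMNS = [
    "CustomerID", "FirstName", "LastName",
    "EmailAddress", "CompanyName", "Phone",
]


def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are not present in `available`.

    The comparison is exact; Spark resolves column names case-insensitively
    by default, so this check is slightly stricter than Spark itself.
    """
    present = set(available)
    return [name for name in required if name not in present]


# Column names of a CSV loaded without the header option: nothing matches.
print(missing_columns(["_c0", "_c1", "_c2"]))
# In the notebook: missing_columns(df_customers.columns)
```

An empty result suggests the `select` call will find every column it asks for.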

In [None]:
# TODO: Select only required columns
df_customers_clean = df_customers.select(
    # Add columns
)

### Task 3.2: Create `FullName` column

Combine `FirstName` and `LastName` into a single `FullName` column.

**Hint:** Use the `concat_ws` or `concat` function:
```python
from pyspark.sql.functions import concat_ws, col
df.withColumn("FullName", concat_ws(" ", col("FirstName"), col("LastName")))
```
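
The two functions differ in how they treat missing values: `concat` returns `NULL` as soon as any input is `NULL`, while `concat_ws` skips `NULL` inputs and joins the rest. The sketch below models those two rules in plain Python, with `None` standing in for `NULL`. The function names and sample values are hypothetical, and this illustrates the semantics rather than being Spark code.

```python
def concat_like(*values):
    """Model of Spark's concat: the result is None if any input is None."""
    if any(v is None for v in values):
        return None
    return "".join(values)


def concat_ws_like(sep, *values):
    """Model of Spark's concat_ws: None inputs are skipped before joining."""
    return sep.join(v for v in values if v is not None)


print(concat_ws_like(" ", "Ada", "Lovelace"))  # Ada Lovelace
print(concat_ws_like(" ", "Ada", None))        # Ada
print(concat_like("Ada", " ", None))           # None
```

For `FullName`, `concat_ws` is therefore the safer choice when either name column may be missing.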

In [None]:
from pyspark.sql.functions import concat_ws, col, upper, trim, current_timestamp

# TODO: Add FullName column
df_customers_enriched = df_customers_clean.withColumn(
    "FullName",
    # Complete the code here
)

---
## 🔗 Step 4: Adding Audit Column

### Task 4.1: Add audit column

Add an `ingestion_timestamp` column that holds the current time. Recording when each row was loaded is good ETL practice.

In [None]:
# TODO: Add ingestion_timestamp column
df_final = df_customers_enriched.withColumn(
    "ingestion_timestamp",
    # Complete the code here - use current_timestamp()
)

---
## 💾 Step 5: Save to Delta Lake

### Task 5.1: Save as Delta table

Save the resulting DataFrame as a managed Delta Lake table.

**Hint:**
```python
df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("catalog.schema.table_name")
```
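
A three-part table name built by plain string formatting may fail to parse if any part contains a character such as a hyphen, because Spark reads the string as an identifier. A minimal sketch of a helper that backtick-quotes each part is shown below. `quote_identifier` and `qualified_name` are hypothetical helpers, not part of PySpark, and the sample catalog and schema names are made up.

```python
def quote_identifier(part):
    """Wrap one identifier part in backticks, doubling any embedded backtick."""
    return "`" + part.replace("`", "``") + "`"


def qualified_name(catalog, schema, table):
    """Build a backtick-quoted catalog.schema.table name."""
    return ".".join(quote_identifier(p) for p in (catalog, schema, table))


# Hypothetical catalog and schema names, used only for illustration.
print(qualified_name("dev-catalog", "workshop", "customers_silver"))
# `dev-catalog`.`workshop`.`customers_silver`
```

The quoted string should be usable wherever a table name is expected, such as `saveAsTable` or a `spark.sql` statement.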

In [None]:
table_name = f"{catalog}.{schema}.customers_silver"

# TODO: Save df_final as Delta table
(
    df_final.write
    # Complete the code here
)

print(f"✅ Saved table: {table_name}")

---
## ✅ Step 6: Verification

Let's check if the table was created correctly.

In [None]:
# Check the table
display(spark.table(table_name))

In [None]:
# Check Delta metadata
display(spark.sql(f"DESCRIBE DETAIL {table_name}"))

---
## 🎯 Bonus: Additional Tasks (if you have time)

1. **Add validation:** Check whether every email address has a valid shape (an `@` with text on both sides)
2. **Aggregation:** Count how many customers belong to each company
3. **Filtering:** Find all customers from the company "A Bike Store"

In [None]:
# Bonus 1: Email validation
# TODO: Count how many emails contain '@'
# df_final.filter(col("EmailAddress").contains("@")).count()
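
A plain `contains("@")` check also accepts values such as `@` or `user@`, which have no text on one side. A stricter option is a regular expression that requires exactly one `@` with non-whitespace text on both sides. The sketch below tries such a pattern with Python's `re` module. The pattern, the helper name `looks_like_email`, and the sample addresses are assumptions for illustration. In PySpark, the same pattern string could be passed to `col("EmailAddress").rlike(...)`.

```python
import re

# Assumed pattern: one "@" with at least one non-space, non-"@" character on each side.
EMAIL_PATTERN = r"^[^@\s]+@[^@\s]+$"


def looks_like_email(value):
    """Return True if `value` has exactly one '@' with text on both sides."""
    return value is not None and re.search(EMAIL_PATTERN, value) is not None


# Print each sample address next to the result of the check.
for sample in ["user@example.com", "@example.com", "user@", "a@b@c"]:
    print(sample, looks_like_email(sample))
```

This only checks the overall shape of the value. It does not confirm that the address exists or that the domain is valid.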

In [None]:
# Bonus 2: Customers per company
# TODO: groupBy aggregation
# df_final.groupBy("CompanyName").count().orderBy("count", ascending=False).show()

---
## 🧹 Cleanup (optional)

If you want to remove the resources you created:

In [None]:
# WARNING: Uncomment only if you want to delete the table!
# spark.sql(f"DROP TABLE IF EXISTS {table_name}")

---
---

# 📋 SOLUTION

⚠️ **Don't look here until you've tried it yourself!** ⚠️

Below you'll find the complete code solving all workshop tasks.

In [None]:
# ============================================================
# 📋 FULL SOLUTION - Workshop 1: Ingestion & Transformations
# ============================================================

from pyspark.sql.functions import concat_ws, col, current_timestamp

# --- Step 2: Loading data ---
customers_path = f"{volume_path}/main/Customers.csv"

df_customers = (
    spark.read
    .format("csv")
    .option("header", True)
    .option("inferSchema", True)
    .load(customers_path)
)

# --- Step 3: Transformations ---
df_customers_clean = df_customers.select(
    "CustomerID", "FirstName", "LastName", 
    "EmailAddress", "CompanyName", "Phone"
)

df_customers_enriched = df_customers_clean.withColumn(
    "FullName",
    concat_ws(" ", col("FirstName"), col("LastName"))
)

# --- Step 4: Add audit column ---
df_final = df_customers_enriched.withColumn(
    "ingestion_timestamp",
    current_timestamp()
)

# --- Step 5: Save to Delta ---
table_name = f"{catalog}.{schema}.customers_silver"

(
    df_final.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(table_name)
)

print(f"✅ Solution executed! Table: {table_name}")
print(f"📊 Row count: {spark.table(table_name).count()}")
display(spark.table(table_name).limit(5))