# M02: ELT Ingestion & Transformations

| Exam Domain | Weight |
|---|---|
| ELT with Spark SQL and Python | 29% |
| Incremental Data Processing | 20% |

---

## Why This Matters

> **Best Practice:** Use `inferSchema=true` only for exploration. In production, define schemas explicitly.

| Scenario | With inferSchema | With explicit schema |
|---|---|---|
| Large file | Scans entire file first | Direct read |
| Column `"123"` | INT or STRING? | You control it |
| New column added | Schema changes silently | Extra column is ignored by default; fails fast with `mode=FAILFAST` |

### Bronze Layer Philosophy

In the Medallion architecture, the Bronze layer should:
- Keep data as STRING (preserve original values)
- Add metadata (ingestion time, source file)
- NOT apply business logic
- Be idempotent (safe to re-run)

### Data Processing Patterns

| Pattern | Description | Trigger | Use Case |
|---|---|---|---|
| **Full Load** | Entire dataset reloaded | Batch Read | Small tables |
| **Incremental Batch** | Only new/changed data | `Trigger.AvailableNow` | Daily/Hourly ETL |
| **Continuous** | Real-time processing | `Trigger.ProcessingTime` | Monitoring, alerts |

<img src="../../../assets/images/1ba507cd604849878e74e586f4df3559.png" width="800">

---

## ELT vs ETL

| Aspect | ETL | ELT |
|---|---|---|
| Transform location | External engine (before load) | Inside platform (after load) |
| Data stored | Only transformed | Raw + transformed |
| Flexibility | Fixed schema upfront | Schema-on-read, evolve later |
| Databricks | **Not recommended** | **Default pattern** |

> **Exam Tip:** Databricks uses ELT: Extract raw data → Load as-is into Bronze (Delta) → Transform in Silver/Gold layers.

## Setup

Running the setup notebook and configuring paths for all data sources used in this module.

---

In [0]:
%run ../../setup/00_setup

### Configuration

Defining file paths for customers, orders, and products datasets.

In [0]:
# Paths to data directories (subdirectories in DATASET_PATH from 00_setup)
CUSTOMERS_PATH = f"{DATASET_PATH}/customers"
ORDERS_PATH = f"{DATASET_PATH}/orders"
PRODUCTS_PATH = f"{DATASET_PATH}/products"

# Paths to specific files
CUSTOMERS_CSV = f"{CUSTOMERS_PATH}/customers.csv"
ORDERS_JSON = f"{ORDERS_PATH}/orders_batch.json"
PRODUCTS_PARQUET = f"{PRODUCTS_PATH}/products.parquet"
EXCEL_PATH = f"{DATASET_PATH}/customers/customers_extented.xlsx"  # Note: filename has typo 'extented'

### Imports

PySpark types for schema definition and functions for transformations.

In [0]:
# Import PySpark types for schema definition
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    DoubleType, TimestampType, DateType, ArrayType
)

# Import PySpark functions for transformations and aggregations
from pyspark.sql.functions import (
    col, lit, concat, upper, lower, year, month, day,
    sum, avg, min, max, count, stddev, desc, asc,
    explode, explode_outer, struct, array,
    get_json_object, from_json, to_json,
    to_date, current_date, datediff
)

import time
import pandas as pd

## CSV Import (Customers)

Loading CSV files into DataFrames using `spark.read` with schema inference and manual schema definition. We'll work with customer data throughout this section.

---

### Auto Schema Inference

Using `inferSchema=true` to let Spark detect column types automatically.

In [0]:
# Load CSV data with automatic schema inference
customers_auto_df = (
    spark.read
    .format("csv")
    .option("header", "true")       # First line contains column names
    .option("inferSchema", "true")  # Automatic data type inference
    .load(CUSTOMERS_CSV)
)

In [0]:
# Verify loaded data
customers_auto_df.printSchema()
display(customers_auto_df.limit(5))

### Extended Reader Options

Key CSV options for production:

| Option | Purpose |
|---|---|
| `delimiter` | Custom column separator (e.g. `;`) |
| `quote` | Character for text fields |
| `escape` | Escape special characters |
| `mode` | `PERMISSIVE` / `DROPMALFORMED` / `FAILFAST` |

### Manual Schema Definition

Defining an explicit `StructType` schema for type safety and performance.

In [0]:
# Schema definition for customers
# Structure: customer_id (string), first_name (string), last_name (string), email (string), city (string), country (string), registration_date (timestamp)
customers_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("first_name", StringType(), nullable=True),
    StructField("last_name", StringType(), nullable=True),
    # Field order follows the CSV column order listed above; the reader applies the schema by position
    StructField("email", StringType(), nullable=True),
    StructField("city", StringType(), nullable=True),
    StructField("country", StringType(), nullable=True),
    StructField("registration_date", TimestampType(), nullable=True)
])

In [0]:
# Load CSV data with manually defined schema
customers_df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .schema(customers_schema)
    .load(CUSTOMERS_CSV)
)

In [0]:
# Verify loaded data
customers_df.printSchema()
display(customers_df.limit(5))

## JSON Import (Orders)

Reading JSON files into DataFrames with automatic and manual schema approaches. JSON supports nested structures natively.

---

### Auto Schema Inference

Letting Spark infer the JSON schema automatically.

In [0]:
# Load JSON data with automatic schema inference
orders_auto_df = (
    spark.read
    .format("json")
    # No inferSchema option here: the JSON reader infers the schema whenever none is supplied
    .load(ORDERS_JSON)
)

In [0]:
# Verify loaded data
orders_auto_df.printSchema()
display(orders_auto_df.limit(5))

### Manual Schema Definition

Providing an explicit schema to select and type only the columns we need.

In [0]:
# Schema definition for orders
# Actual structure: order_id, customer_id, product_id, store_id, order_datetime, quantity, unit_price, discount_percent, total_amount, payment_method
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=True),  # Can be null in data
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_datetime", StringType(), nullable=True),  # String, will convert to timestamp later
    StructField("total_amount", DoubleType(), nullable=False),
    StructField("payment_method", StringType(), nullable=False)
])

In [0]:
# Load JSON data with manually defined schema
orders_df = (
    spark.read
    .format("json")
    .schema(orders_schema)
    .load(ORDERS_JSON)
)

In [0]:
# Verify loaded data
orders_df.printSchema()
display(orders_df.limit(5))

## Parquet Import (Products)

Reading Parquet files, the preferred columnar file format in Databricks. Parquet files embed their schema, so no manual definition is needed.

---

In [0]:
# Parquet already contains built-in schema - no need to define it
products_df = (
    spark.read
    .format("parquet")
    .load(PRODUCTS_PARQUET)
)

In [0]:
# Verify loaded data
products_df.printSchema()
display(products_df.limit(5))

## Excel Import (Extended Customers)

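Reading an Excel sheet with `format("excel")`. The `dataAddress` option selects the sheet and cell range, and `headerRows` sets how many rows of that range are treated as headers.

> **Note:** Excel is not a core Spark format. Depending on the Databricks Runtime version, `format("excel")` is either built in or requires an Excel reader library such as `spark-excel` to be installed on the cluster. Option names can differ between the two.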

In [0]:
# Load a cell range from the Excel file; "Arkusz1" is the sheet name (Polish for "Sheet1")

customers_extended_df = (
    spark.read
    .format("excel")
    .option("dataAddress", "Arkusz1!D6:M26")
    .option("headerRows", 1)
    .load(EXCEL_PATH)
)

In [0]:
# Verify loaded data
customers_extended_df.printSchema()
display(customers_extended_df.limit(5))

## Performance: inferSchema vs Manual Schema

Comparing read performance between automatic schema inference and explicit schema definition to demonstrate why manual schemas are recommended in production.

---

In [0]:
# Automatic schema inference
start_auto = time.time()
df_auto = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(CUSTOMERS_CSV)
)
count_auto = df_auto.count()  # Action - forces execution
time_auto = time.time() - start_auto

In [0]:
# Manual schema definition
start_manual = time.time()
df_manual = (
    spark.read
    .format("csv")
    .option("header", "true")
    .schema(customers_schema)
    .load(CUSTOMERS_CSV)
)
count_manual = df_manual.count()  # Action - forces execution
time_manual = time.time() - start_manual

In [0]:
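# Comparison (single-run timings are noisy; on small files the difference can be negligible or negative)
print(f"inferSchema: {time_auto:.2f}s | manual schema: {time_manual:.2f}s")
speedup = (time_auto - time_manual) / time_auto * 100
print(f"\nManual schema changed the read time by {speedup:.1f}% (positive = faster)")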

## DataFrame Transformations

Core PySpark transformations for data engineering: selecting columns, adding/renaming columns, type casting, deduplication, and sorting.

---

### select — Choosing Columns

Projecting specific columns and applying expressions in the `select()` method.

In [0]:
# Select specific columns
customers_selected = customers_df.select("customer_id", "first_name", "last_name", "email")
display(customers_selected.limit(5))

# Select with column expressions
customers_transformed = customers_df.select(
    col("customer_id"),
    upper(col("first_name")).alias("first_name_upper"),
    col("email")
)
display(customers_transformed.limit(5))

### withColumn — Adding/Modifying Columns

Creating new columns or modifying existing ones using expressions.

In [0]:
# Add a new column
customers_with_fullname = customers_df.withColumn(
    "full_name", 
    concat(col("first_name"), lit(" "), col("last_name"))
)
display(customers_with_fullname.select("customer_id", "first_name", "last_name", "full_name").limit(5))

# Add multiple columns
customers_enriched = customers_auto_df \
    .withColumn("email_lower", lower(col("email"))) \
    .withColumn("registration_year", year(col("registration_date")))
    
display(customers_enriched.select("customer_id", "email", "email_lower", "registration_date", "registration_year").limit(5))

### cast — Type Conversion

Converting column data types using `cast()`.

In [0]:
# Cast customer_id to string (a no-op for customers_df, where it is already StringType; shown for the syntax)
customers_casted = customers_df.withColumn("customer_id_str", col("customer_id").cast(StringType()))
customers_df.select("customer_id").printSchema()
customers_casted.select("customer_id_str").printSchema()

# Cast timestamp to date
customers_date = customers_df.withColumn("registration_date_only", col("registration_date").cast(DateType()))
display(customers_date.select("customer_id", "registration_date", "registration_date_only").limit(5))

### withColumnRenamed — Renaming

Renaming columns using `withColumnRenamed()` or `select()` with `alias()`.

In [0]:
# Rename single column
customers_renamed = customers_df.withColumnRenamed("customer_id", "id")
display(customers_renamed.limit(5))

# Rename multiple columns using select with alias
customers_multi_renamed = customers_df.select(
    col("customer_id").alias("id"),
    col("first_name").alias("fname"),
    col("last_name").alias("lname"),
    col("email"),
    col("city")
)
display(customers_multi_renamed.limit(5))

### drop — Removing Columns

Removing unwanted columns from a DataFrame.

In [0]:
# Drop single column
customers_dropped = customers_df.drop("email")

# Drop multiple columns
customers_minimal = customers_df.drop("email", "city", "country")
display(customers_minimal.limit(5))

### distinct / dropDuplicates — Unique Rows

Removing duplicate rows entirely or based on specific columns.

In [0]:
# Get distinct countries
distinct_countries = customers_df.select("country").distinct()
display(distinct_countries.orderBy("country"))

# Drop duplicates based on specific columns
unique_locations = customers_df.select("city", "country").dropDuplicates(["city", "country"])
display(unique_locations.orderBy("country", "city").limit(10))

### orderBy — Sorting

Sorting rows by one or more columns in ascending or descending order.

In [0]:
# Sort by single column (ascending)
customers_sorted_asc = customers_df.orderBy("registration_date")
display(customers_sorted_asc.select("customer_id", "first_name", "last_name", "registration_date").limit(5))

# Sort by multiple columns with different directions
customers_sorted_multi = customers_df.orderBy(asc("country"), desc("registration_date"))
display(customers_sorted_multi.select("customer_id", "first_name", "country", "registration_date").limit(10))

## Filtering Data

Filtering rows using conditions, multiple predicates, `isin()`, null handling, and string pattern matching.

---

### Simple Conditions

`filter()` and `where()` are equivalent.

In [0]:
# Filter by country
usa_customers = customers_auto_df.filter(col("country") == "USA")
display(usa_customers.select("customer_id", "first_name", "last_name", "country").limit(5))

# Filter using where (equivalent to filter)
nyc_customers = customers_auto_df.where(col("city") == "New York")
display(nyc_customers.select("customer_id", "first_name", "city", "country").limit(5))

### Multiple Conditions (`&` = AND, `|` = OR)

Combining conditions with logical operators — remember to wrap each condition in parentheses.

In [0]:
# AND condition
usa_2023 = customers_df.filter(
    (col("country") == "USA") & (year(col("registration_date")) == 2023)
)
display(usa_2023.select("customer_id", "first_name", "country", "registration_date").limit(5))

# OR condition
usa_or_uk = customers_df.filter(
    (col("country") == "USA") | (col("country") == "UK")
)
display(usa_or_uk.select("customer_id", "first_name", "country").limit(5))

### isin — Filter by List

Filtering rows where a column value matches any item in a list.

In [0]:
# Filter by list of countries
selected_countries = ["USA", "UK", "Germany", "France"]
customers_selected_countries = customers_df.filter(col("country").isin(selected_countries))

# Show distribution by country
display(customers_selected_countries.groupBy("country").count().orderBy(desc("count")))

### Null Handling (`isNull`, `isNotNull`)

Filtering rows based on null or non-null values.

In [0]:
# Filter rows where email is NOT null
customers_with_email = customers_df.filter(col("email").isNotNull())

# Filter rows where city IS null
customers_no_city = customers_df.filter(col("city").isNull())
if customers_no_city.count() > 0:
    display(customers_no_city.select("customer_id", "first_name", "city", "country").limit(5))

### String Operations (`like`, `contains`, `startswith`)

Pattern matching and string-based filtering on column values.

In [0]:
# Filter using like (SQL-style pattern matching)
gmail_customers = customers_df.filter(col("email").like("%@gmail.com"))
display(gmail_customers.select("customer_id", "first_name", "email").limit(5))

# Filter using contains
new_cities = customers_df.filter(col("city").contains("New"))
display(new_cities.select("customer_id", "first_name", "city").limit(5))

# Filter using startswith
j_names = customers_df.filter(col("first_name").startswith("J"))
display(j_names.select("customer_id", "first_name", "last_name").limit(5))

## Aggregations

Computing summary statistics with `groupBy`, `agg()`, and built-in aggregate functions like `count`, `sum`, `avg`, `min`, and `max`.

---

### groupBy with count, sum, avg

Basic grouping and aggregation operations.

In [0]:
# Count by country
customers_by_country = customers_df.groupBy("country").count().orderBy(desc("count"))
display(customers_by_country.limit(10))

In [0]:
# Sum and average on orders
revenue_by_payment = orders_df.groupBy("payment_method").agg(
    sum("total_amount").alias("total_revenue"),
    avg("total_amount").alias("avg_order_value"),
    count("*").alias("order_count")
).orderBy(desc("total_revenue"))

display(revenue_by_payment)

### min / max

Finding minimum and maximum values across the dataset.

In [0]:
# Min and max order amounts
order_stats = orders_df.agg(
    min("total_amount").alias("min_amount"),
    max("total_amount").alias("max_amount"),
    avg("total_amount").alias("avg_amount")
)
display(order_stats)

### Multiple Aggregations with agg()

Combining several aggregate functions in a single `agg()` call.

In [0]:
# Example: Multiple aggregations on orders by customer
customer_stats = orders_df.groupBy("customer_id").agg(
    count("*").alias("total_orders"),
    sum("total_amount").alias("total_spent"),
    avg("total_amount").alias("avg_order_value"),
    min("total_amount").alias("min_order"),
    max("total_amount").alias("max_order")
).orderBy(desc("total_spent"))

display(customer_stats.limit(10))

### HAVING Equivalent (filter after groupBy)

Applying `filter()` after `groupBy().agg()` to replicate SQL's `HAVING` clause.

In [0]:
# Customers with more than 5 orders
high_frequency = orders_df.groupBy("customer_id").agg(
    count("*").alias("order_count"),
    sum("total_amount").alias("total_spent")
).filter(col("order_count") > 5).orderBy(desc("order_count"))

display(high_frequency.limit(10))

## Temporary Views & SQL

Registering DataFrames as temporary views and querying them with SQL. Covers temp views, global temp views, and the DataFrame API vs SQL equivalence.

---

### Creating Temporary Views

Registering DataFrames as views for SQL access within the current SparkSession.

| View Type | Scope | Persistence | Syntax |
|---|---|---|---|
| **Temp View** | Current SparkSession | Session only | `CREATE TEMP VIEW` |
| **Global Temp View** | All sessions on cluster | Until cluster restart | `CREATE GLOBAL TEMP VIEW` → `global_temp.name` |
| **Permanent View** | Unity Catalog | Persistent | `CREATE VIEW catalog.schema.name` |

> **Exam Tip:** Temp views are NOT visible across notebooks. Views store the query definition, not data.

In [0]:
# Create temporary views from our DataFrames

customers_df.createOrReplaceTempView("customers")
orders_df.createOrReplaceTempView("orders")
products_df.createOrReplaceTempView("products")

print("\nThese views are available for SQL queries in this session.")

### SQL Queries via spark.sql()

Executing SQL statements against registered temporary views.

In [0]:
# Simple SELECT query
result = spark.sql("""
    SELECT 
        customer_id,
        COUNT(*) as order_count,
        SUM(total_amount) as total_spent
    FROM orders
    GROUP BY customer_id
    ORDER BY order_count DESC
    LIMIT 10
""")

display(result)

### DataFrame API vs SQL — Same Result

Demonstrating that both approaches produce identical results, because both are planned by the same Catalyst optimizer.

In [0]:
# Same query - two approaches

# DataFrame API
df_api_result = customers_df  \
    .groupBy("city") \
    .count() \
    .orderBy(desc("count")) \
    .limit(5)

display(df_api_result)

In [0]:
# SQL
sql_result = spark.sql("""
    SELECT 
        city,
        COUNT(*) as count
    FROM customers
    GROUP BY city
    ORDER BY count DESC
    LIMIT 5
""")

display(sql_result)

### Global Temporary Views

Creating views accessible across all SparkSessions on the cluster via `global_temp` schema.

In [0]:
# Create global temporary view
customers_df.createOrReplaceGlobalTempView("global_customers")

In [0]:
# Query global temp view
global_result = spark.sql("""
    SELECT country, COUNT(*) as count
    FROM global_temp.global_customers
    GROUP BY country
    ORDER BY count DESC
    LIMIT 5
""")

display(global_result)

### Complex SQL (JOINs + Aggregations)

Combining JOINs with GROUP BY and HAVING in a single SQL query.

In [0]:
# Example: Complex query with JOIN and aggregation

complex_query = spark.sql("""
    SELECT 
        c.customer_id,
        c.first_name,
        c.last_name,
        c.country,
        COUNT(o.order_id) as total_orders,
        SUM(o.total_amount) as total_spent,
        AVG(o.total_amount) as avg_order_value,
        MAX(o.total_amount) as largest_order
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = try_cast(o.customer_id AS STRING)
    GROUP BY c.customer_id, c.first_name, c.last_name, c.country
    HAVING COUNT(o.order_id) > 0
    ORDER BY total_spent DESC
    LIMIT 10
""")

display(complex_query)

| Criteria | SQL | DataFrame API |
|---|---|---|
| **Best for** | JOINs, ad-hoc analysis | Pipelines, custom logic |
| **Team** | SQL-first, analysts | Engineers, ML |
| **IDE support** | Limited | Autocomplete, type safety |

## JSON Operations

Working with complex JSON data: flattening arrays with `explode`, accessing nested fields with dot notation, and parsing JSON strings with `get_json_object` and `from_json`.

---

### explode — Flattening Arrays

Converting array elements into individual rows using `explode()` and `explode_outer()`.

In [0]:
# Create sample data with arrays to demonstrate explode
sample_data = [
    (1, "Customer A", ["product_1", "product_2", "product_3"]),
    (2, "Customer B", ["product_1"]),
    (3, "Customer C", []),  # Empty array
    (4, "Customer D", None)  # Null array
]

sample_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("customer_name", StringType(), False),
    StructField("purchased_products", ArrayType(StringType()), True)
])

sample_df = spark.createDataFrame(sample_data, schema=sample_schema)
display(sample_df)

In [0]:
# explode() - skips null and empty arrays
exploded_df = sample_df.select(
    "customer_id",
    "customer_name",
    explode("purchased_products").alias("product")
)
display(exploded_df)

# explode_outer() - keeps null and empty arrays
exploded_outer_df = sample_df.select(
    "customer_id",
    "customer_name",
    explode_outer("purchased_products").alias("product")
)
display(exploded_outer_df)

### Nested JSON — dot notation

Accessing fields within nested structs using `col("parent.child")` syntax.

In [0]:
# Create sample data with nested structures
nested_data = customers_df.select(
    col("customer_id"),
    struct(
        col("first_name"),
        col("last_name"),
        col("email")
    ).alias("personal_info"),
    struct(
        col("city"),
        col("country")
    ).alias("location")
)

nested_data.printSchema()
display(nested_data.limit(3))

# Access nested fields
flattened = nested_data.select(
    "customer_id",
    col("personal_info.first_name").alias("first_name"),
    col("personal_info.email").alias("email"),
    col("location.country").alias("country")
)
display(flattened.limit(5))

### Parsing JSON Strings (`get_json_object`, `from_json`)

Extracting values from JSON-encoded string columns.

In [0]:
# Create sample data with JSON strings
json_string_data = [
    (1, '{"name": "John", "age": 30, "city": "New York"}'),
    (2, '{"name": "Jane", "age": 25, "city": "London"}'),
    (3, '{"name": "Bob", "age": 35, "city": "Paris"}')
]

json_df = spark.createDataFrame(json_string_data, ["id", "json_data"])
display(json_df)

# Extract fields using get_json_object
parsed_df = json_df.select(
    "id",
    get_json_object("json_data", "$.name").alias("name"),
    get_json_object("json_data", "$.age").alias("age"),
    get_json_object("json_data", "$.city").alias("city")
)
display(parsed_df)

## Joins

Combining DataFrames using different join types: inner, left, right, and full outer. All join types use the same Catalyst optimizer under the hood.

---

### Inner Join

Returning only rows with matching keys in both DataFrames.

In [0]:
# Inner join - customers with their orders

customers_with_orders = customers_df.join(
    orders_df,
    customers_df.customer_id == orders_df.customer_id,
    "inner"
).select(
    customers_df.customer_id,
    customers_df.first_name,
    customers_df.last_name,
    orders_df.order_id,
    orders_df.total_amount,
    orders_df.payment_method
)

display(customers_with_orders.limit(10))

### Left Join

Keeping all rows from the left DataFrame, with nulls where no match exists on the right.

In [0]:
# Left join - all orders with customer details (if available)

orders_with_customers = orders_df.join(
    customers_df,
    orders_df['customer_id'] == customers_df['customer_id'],
    "left"
).select(
    orders_df.order_id,
    orders_df.customer_id,
    customers_df.first_name,
    customers_df.last_name,
    orders_df.total_amount
)

display(orders_with_customers.limit(10))

### Right Join

Keeping all rows from the right DataFrame, with nulls where no match exists on the left.

In [0]:
# Right join - all customers with their orders (if any)
# A separate variable name keeps the inner-join result above from being overwritten

all_customers_with_orders = orders_df.join(
    customers_df,
    orders_df['customer_id'] == customers_df['customer_id'],
    "right"
).select(
    customers_df.customer_id,
    customers_df.first_name,
    customers_df.last_name,
    orders_df.order_id,
    orders_df.total_amount
)

display(all_customers_with_orders.limit(10))

### Full Outer Join

Keeping all rows from both DataFrames, with nulls on either side where no match exists.

In [0]:
# Full outer join - all customers and all orders
# Both customer_id columns are kept so that unmatched rows on either side are visible

full_join = customers_df.join(
    orders_df,
    customers_df.customer_id == orders_df.customer_id,
    "outer"
).select(
    customers_df.customer_id.alias("cust_id"),
    orders_df.customer_id.alias("order_cust_id"),
    customers_df.first_name,
    orders_df.order_id,
    orders_df.total_amount
)

print("\nSample with potential nulls:")
display(full_join.limit(10))

## read_files() — Unity Catalog Native Reader

| Feature | `read_files()` | `spark.read` |
|---|---|---|
| Language | SQL | Python |
| Schema inference | Automatic | Manual or `inferSchema` |
| UC integration | Native | Via path |
| Use case | SQL-first workflows | Programmatic pipelines |

```sql
-- Read CSV from a Volume
SELECT * FROM read_files('/Volumes/catalog/schema/volume/file.csv', format => 'csv', header => true);

-- Create table from files
CREATE TABLE bronze.customers AS SELECT * FROM read_files('/Volumes/.../file.csv');
```

> **Exam Tip:** `read_files()` is the recommended way to read files in SQL workflows with Unity Catalog.

In [0]:
# read_files() expects a string literal path, so the Python variable is interpolated into the SQL text

# Read CSV from a Volume
display(spark.sql(f"""
    SELECT * FROM read_files(
      '{CUSTOMERS_CSV}',
      format => 'csv',
      header => true
    )
"""))

# Create table from files
spark.sql(f"""
    CREATE TABLE bronze.customers AS
    SELECT * FROM read_files('{CUSTOMERS_CSV}', format => 'csv', header => true)
""")


---

## Summary

| Topic | Key Takeaway |
|---|---|
| **Schema** | Use explicit schemas in production for performance and data quality |
| **Formats** | CSV (compatibility), Parquet (performance), JSON (flexibility) |
| **ELT** | Load raw → Bronze, transform in Silver/Gold |
| **Transformations** | `select`, `withColumn`, `cast`, `drop`, `distinct`, `orderBy` |
| **Filtering** | `filter`/`where`, `isin`, `isNull`, string ops |
| **Aggregations** | `groupBy` + `agg()`, HAVING = filter after groupBy |
| **Views** | Temp (session), Global Temp (`global_temp.`), Permanent (UC) |
| **JSON** | `explode` for arrays, dot notation for nested, `get_json_object` for strings |
| **Joins** | inner, left, right, outer — same Catalyst optimizer |
| **read_files()** | SQL-native file reader, recommended for UC |
