# M02: ELT Ingestion & Transformations

## 2.1. The Story Continues...

Remember our e-commerce company? The data team has just gained access to multiple source systems:

* **CRM System** exports customers as CSV (daily dump)
* **Order Management** sends JSON via API 
* **Product Catalog** is in Parquet (from a legacy Hadoop system)
* **Extended Customer Data** arrives in Excel from the marketing team

**Your mission:** Import data from all sources, transform it, analyze it, and prepare it for downstream analytics.

---

## 2.2. Why This Matters (Real-World Context)

### The "inferSchema" Trap

Most tutorials show: `spark.read.option("inferSchema", "true")`

**In production, this is often a bad idea:**

| Scenario         | With inferSchema         | With explicit schema      |
|------------------|-------------------------|--------------------------|
| 10 GB file       | Scans entire file first | Direct read              |
| Column `"123"`   | Might be INT or STRING? | You control it           |
| New column added | Schema changes silently | Fails fast (good!)       |
| Null values      | Might guess wrong type  | Explicit nullable        |

**Rule of thumb:** Use `inferSchema=true` for exploration, explicit schema for production.

### The Bronze Layer Philosophy

In the Medallion architecture, the Bronze layer should:
* Keep data as STRING (preserve original values)
* Add metadata (ingestion time, source file)
* NOT apply business logic
* Be idempotent (safe to re-run)

---

### Data Process Patterns

Understanding how data flows is critical for designing robust pipelines.

| Pattern | Description | Trigger Type | Use Case |
|---------|-------------|--------------|----------|
| **Full Load** | Reloads the entire dataset from scratch every time. | Batch Read | Small tables, full history refreshes. |
| **Incremental Batch** | Processes only new/changed data but runs as a scheduled job. | `Trigger.AvailableNow` (or `Once`) | Daily/hourly ETL; efficient and cost-effective. |
| **Continuous** | Continuously processes data as it arrives to minimize latency. | `Trigger.ProcessingTime` | Real-time monitoring, alerts. |

![Data Ingestion Architecture](../../../assets/images/1ba507cd604849878e74e586f4df3559.png)

<br>

### Module Content Overview

| Section | Topics Covered |
|---------|----------------|
| **Part 1: Data Ingestion** | • **DataFrame Reader API** – support for CSV, JSON, Parquet, Excel <br> • **Schema Definition** – explicit vs inferred approach <br> • **Format Options** – handling delimiters, quotes, headers |
| **Part 2: DataFrame Operations** | • **Transformations** – `select`, `withColumn`, `cast`, `rename`, `drop` <br> • **Filtering** – complex conditions, null handling <br> • **Aggregations** – `groupBy`, `agg`, statistical functions |
| **Part 3: Advanced Techniques** | • **SQL Integration** – Temp Views, mixed Python/SQL logic <br> • **Complex Types** – `explode`, `struct`, JSON parsing <br> • **Joins** – combining multiple datasets efficiently |

---

## 2.3. ELT vs ETL

**Key Concept for Exam:**

| Aspect | ETL (Extract-Transform-Load) | ELT (Extract-Load-Transform) |
|--------|------------------------------|-------------------------------|
| Transform Location | External engine (before load) | Inside data platform (after load) |
| Data Storage | Only transformed data stored | Raw + transformed data stored |
| Flexibility | Fixed schema upfront | Schema-on-read, evolve later |
| Scalability | Limited by transform engine | Scales with cloud compute |
| Use Case | Traditional DW | Modern Lakehouse |
| Databricks Approach | Supported, but not the typical pattern | **Default pattern** |

**Databricks follows the ELT pattern:**
1. **Extract** raw data from the source systems (minimal or no transformation)
2. **Load** it as-is into Delta tables in the Bronze layer
3. **Transform** it using Spark SQL/PySpark in the Silver and Gold layers

**Why ELT in Lakehouse?**
- Raw data preserved for reprocessing
- Transformations can be updated without re-ingestion
- Compute and storage scale independently
- Schema evolution handled by Delta Lake

## 2.4. Environment Initialization

Run the central configuration script:

In [0]:
%run ../../setup/00_setup

### 2.4.1. Notebook Configuration

Define variables specific to this notebook:

In [0]:
# Paths to data directories (subdirectories in DATASET_PATH from 00_setup)
CUSTOMERS_PATH = f"{DATASET_PATH}/customers"
ORDERS_PATH = f"{DATASET_PATH}/orders"
PRODUCTS_PATH = f"{DATASET_PATH}/products"

# Paths to specific files
CUSTOMERS_CSV = f"{CUSTOMERS_PATH}/customers.csv"
ORDERS_JSON = f"{ORDERS_PATH}/orders_batch.json"
PRODUCTS_PARQUET = f"{PRODUCTS_PATH}/products.parquet"
EXCEL_PATH = f"{DATASET_PATH}/customers/customers_extented.xlsx"  # Note: filename has typo 'extented'

### 2.4.2. Import Libraries

Import all necessary libraries for data ingestion and transformations:

In [0]:
# Import PySpark types for schema definition
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, 
    DoubleType, TimestampType, DateType, ArrayType
)

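# Import PySpark functions for transformations and aggregations
# Note: sum, min, and max imported here shadow the Python built-ins of the same name in this notebook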
from pyspark.sql.functions import (
    col, lit, concat, upper, lower, year, month, day,
    sum, avg, min, max, count, stddev, desc, asc,
    explode, explode_outer, struct, array,
    get_json_object, from_json, to_json,
    to_date, current_date, datediff
)

import time
import pandas as pd

## 2.5. CSV Data Import (Customers)

### 2.5.1. Loading CSV with Automatic Schema Inference

In [0]:
# Load CSV data with automatic schema inference
customers_auto_df = (
    spark.read
    .format("csv")
    .option("header", "true")       # First line contains column names
    .option("inferSchema", "true")  # Automatic data type inference
    .load(CUSTOMERS_CSV)
)

In [0]:
# Verify loaded data
customers_auto_df.printSchema()
display(customers_auto_df.limit(5))

### 2.5.2. Extended Reader Options – Typical Production Issues

In a production environment, we often encounter CSV files with:

- a different separator (`;` instead of `,`),
- quotes inside fields,
- corrupted rows.

Key options:

- `delimiter` – custom column separator,
- `quote` – character opening/closing text fields,
- `escape` – way to "escape" special characters,
- `mode` – handling of malformed records (`PERMISSIVE`, `DROPMALFORMED`, `FAILFAST`).

### 2.5.3. Manual Schema Definition for CSV

**Best Practice:** Defining schema manually ensures control and performance.

In [0]:
# Schema definition for customers
# Column order: customer_id, first_name, last_name, email, city, country, registration_date
# With an explicit schema, CSV columns are matched by position (not by header name),
# so the field order below must match the file.
# Note: file-based readers generally treat columns as nullable even when nullable=False is declared.
customers_schema = StructType([
    StructField("customer_id", StringType(), nullable=False),
    StructField("first_name", StringType(), nullable=True),
    StructField("last_name", StringType(), nullable=True),
    StructField("email", StringType(), nullable=True),
    StructField("city", StringType(), nullable=True),
    StructField("country", StringType(), nullable=True),
    StructField("registration_date", TimestampType(), nullable=True)
])

In [0]:
# Load CSV data with manually defined schema
customers_df = (
    spark.read
    .format("csv")
    .option("header", "true")
    .schema(customers_schema)
    .load(CUSTOMERS_CSV)
)

In [0]:
# Verify loaded data
customers_df.printSchema()
display(customers_df.limit(5))

## 2.6. JSON Data Import (Orders)

### 2.6.1. Loading JSON with Automatic Schema Inference

In [0]:
# Load JSON data with automatic schema inference
# (the JSON reader infers the schema whenever none is supplied; it has no inferSchema option)
orders_auto_df = (
    spark.read
    .format("json")
    .load(ORDERS_JSON)
)

In [0]:
# Verify loaded data
orders_auto_df.printSchema()
display(orders_auto_df.limit(5))

### 2.6.2. Manual Schema Definition for JSON

In [0]:
# Schema definition for orders
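# Actual structure: order_id, customer_id, product_id, store_id, order_datetime, quantity, unit_price, discount_percent, total_amount, payment_method
# Only the fields used later are declared; JSON fields are matched by name, so the others are not read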
orders_schema = StructType([
    StructField("order_id", StringType(), nullable=True),  # Can be null in data
    StructField("customer_id", StringType(), nullable=False),
    StructField("order_datetime", StringType(), nullable=True),  # String, will convert to timestamp later
    StructField("total_amount", DoubleType(), nullable=False),
    StructField("payment_method", StringType(), nullable=False)
])

In [0]:
# Load JSON data with manually defined schema
orders_df = (
    spark.read
    .format("json")
    .schema(orders_schema)
    .load(ORDERS_JSON)
)

In [0]:
# Verify loaded data
orders_df.printSchema()
display(orders_df.limit(5))

## 2.7. Parquet Data Import (Products)

### 2.7.1. Loading Parquet (built-in schema)

In [0]:
# Parquet already contains built-in schema - no need to define it
products_df = (
    spark.read
    .format("parquet")
    .load(PRODUCTS_PARQUET)
)

In [0]:
# Verify loaded data
products_df.printSchema()
display(products_df.limit(5))

## 2.8. Excel Data Import (Extended Customer Data)

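### 2.8.1. Loading Excel Files

Recent Databricks Runtime versions provide a built-in reader for Excel files, exposed as `format("excel")`.

**Benefits:**
* Native Spark integration (distributed processing)
* Supports large Excel files
* Consistent API with other formats
* Works with Unity Catalog Volumes

**Note:** The options used below (`dataAddress`, `headerRows`) follow the built-in reader. If your runtime does not include it, a third-party library such as spark-excel must be installed on the cluster, and its option names differ (for example, `header` instead of `headerRows`).

In [0]:
# Load Excel file: read the range D6:M26 from the sheet "Arkusz1" (Polish for "Sheet1"),
# treating the first row of that range as the header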

customers_extended_df = (
    spark.read
    .format("excel")
    .option("dataAddress", "Arkusz1!D6:M26")
    .option("headerRows", 1)
    .load(EXCEL_PATH)
)

In [0]:
# Verify loaded data
customers_extended_df.printSchema()
display(customers_extended_df.limit(5))

## 2.9. Performance Comparison

### 2.9.1. Loading CSV: inferSchema vs Manual Schema

In [0]:
# Automatic schema inference
start_auto = time.time()
df_auto = (
    spark.read
    .format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(CUSTOMERS_CSV)
)
count_auto = df_auto.count()  # Action - forces execution
time_auto = time.time() - start_auto

In [0]:
# Manual schema definition
start_manual = time.time()
df_manual = (
    spark.read
    .format("csv")
    .option("header", "true")
    .schema(customers_schema)
    .load(CUSTOMERS_CSV)
)
count_manual = df_manual.count()  # Action - forces execution
time_manual = time.time() - start_manual

In [0]:
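# Comparison (a single run on a small file is noisy; treat the numbers as indicative only)
print(f"inferSchema: {time_auto:.2f}s ({count_auto} rows), manual schema: {time_manual:.2f}s ({count_manual} rows)")
speedup = (time_auto - time_manual) / time_auto * 100
print(f"\nConclusion: manual schema changed load time by {speedup:.1f}% (positive = faster than inferSchema)")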

## 2.10. DataFrame Transformations

Now that we have loaded data from multiple sources (CSV, JSON, Parquet, Excel), let's explore common DataFrame transformation operations.

### 2.10.1. Select - Choosing Columns

The `select()` operation allows you to choose specific columns from a DataFrame.

In [0]:
# Select specific columns
customers_selected = customers_df.select("customer_id", "first_name", "last_name", "email")
display(customers_selected.limit(5))

# Select with column expressions
customers_transformed = customers_df.select(
    col("customer_id"),
    upper(col("first_name")).alias("first_name_upper"),
    col("email")
)
display(customers_transformed.limit(5))

### 2.10.2. WithColumn - Adding/Modifying Columns

The `withColumn()` operation adds a new column or replaces an existing one.

In [0]:
# Add a new column
customers_with_fullname = customers_df.withColumn(
    "full_name", 
    concat(col("first_name"), lit(" "), col("last_name"))
)
display(customers_with_fullname.select("customer_id", "first_name", "last_name", "full_name").limit(5))

# Add multiple columns
customers_enriched = customers_auto_df \
    .withColumn("email_lower", lower(col("email"))) \
    .withColumn("registration_year", year(col("registration_date")))
    
display(customers_enriched.select("customer_id", "email", "email_lower", "registration_date", "registration_year").limit(5))

### 2.10.3. Cast - Type Conversion

The `cast()` operation converts column data types.

In [0]:
# Cast customer_id to string (it is already a string in customers_df, so this only illustrates the syntax)
customers_casted = customers_df.withColumn("customer_id_str", col("customer_id").cast(StringType()))
customers_df.select("customer_id").printSchema()
customers_casted.select("customer_id_str").printSchema()

# Cast timestamp to date
customers_date = customers_df.withColumn("registration_date_only", col("registration_date").cast(DateType()))
display(customers_date.select("customer_id", "registration_date", "registration_date_only").limit(5))

### 2.10.4. Rename - Changing Column Names

The `withColumnRenamed()` operation renames columns.

In [0]:
# Rename single column
customers_renamed = customers_df.withColumnRenamed("customer_id", "id")
display(customers_renamed.limit(5))

# Rename multiple columns using select with alias
customers_multi_renamed = customers_df.select(
    col("customer_id").alias("id"),
    col("first_name").alias("fname"),
    col("last_name").alias("lname"),
    col("email"),
    col("city")
)
display(customers_multi_renamed.limit(5))

### 2.10.5. Drop - Removing Columns

The `drop()` operation removes columns from a DataFrame.

In [0]:
# Drop single column
customers_dropped = customers_df.drop("email")

# Drop multiple columns
customers_minimal = customers_df.drop("email", "city", "country")
display(customers_minimal.limit(5))

### 2.10.6. Distinct - Unique Rows

The `distinct()` operation returns unique rows, and `dropDuplicates()` allows column-specific deduplication.

In [0]:
# Get distinct countries
distinct_countries = customers_df.select("country").distinct()
display(distinct_countries.orderBy("country"))

# Drop duplicates based on specific columns
unique_locations = customers_df.select("city", "country").dropDuplicates(["city", "country"])
display(unique_locations.orderBy("country", "city").limit(10))

### 2.10.7. OrderBy - Sorting Data

The `orderBy()` or `sort()` operations sort DataFrame rows.

In [0]:
# Sort by single column (ascending)
customers_sorted_asc = customers_df.orderBy("registration_date")
display(customers_sorted_asc.select("customer_id", "first_name", "last_name", "registration_date").limit(5))

# Sort by multiple columns with different directions
customers_sorted_multi = customers_df.orderBy(asc("country"), desc("registration_date"))
display(customers_sorted_multi.select("customer_id", "first_name", "country", "registration_date").limit(10))

## 2.11. Filtering Data

Filtering allows you to select rows that meet specific conditions.

### 2.11.1. Simple Filter Conditions

Use `filter()` or `where()` (they are equivalent) to apply conditions.

In [0]:
# Filter by country
usa_customers = customers_auto_df.filter(col("country") == "USA")
display(usa_customers.select("customer_id", "first_name", "last_name", "country").limit(5))

# Filter using where (equivalent to filter)
nyc_customers = customers_auto_df.where(col("city") == "New York")
display(nyc_customers.select("customer_id", "first_name", "city", "country").limit(5))

### 2.11.2. Multiple Conditions

Combine conditions using `&` (AND) and `|` (OR). Remember to wrap each condition in parentheses.

In [0]:
# AND condition
usa_2023 = customers_df.filter(
    (col("country") == "USA") & (year(col("registration_date")) == 2023)
)
display(usa_2023.select("customer_id", "first_name", "country", "registration_date").limit(5))

# OR condition
usa_or_uk = customers_df.filter(
    (col("country") == "USA") | (col("country") == "UK")
)
display(usa_or_uk.select("customer_id", "first_name", "country").limit(5))

### 2.11.3. isin() - Filtering by List of Values

The `isin()` method checks if a column value is in a list.

In [0]:
# Filter by list of countries
selected_countries = ["USA", "UK", "Germany", "France"]
customers_selected_countries = customers_df.filter(col("country").isin(selected_countries))

# Show distribution by country
display(customers_selected_countries.groupBy("country").count().orderBy(desc("count")))

### 2.11.4. Null Handling

Use `isNull()` and `isNotNull()` to filter based on null values.

In [0]:
# Filter rows where email is NOT null
customers_with_email = customers_df.filter(col("email").isNotNull())

# Filter rows where city IS null
customers_no_city = customers_df.filter(col("city").isNull())
if customers_no_city.count() > 0:
    display(customers_no_city.select("customer_id", "first_name", "city", "country").limit(5))

### 2.11.5. String Operations

Use string functions like `like()`, `contains()`, `startswith()`, and `endswith()` for pattern matching.

In [0]:
# Filter using like (SQL-style pattern matching)
gmail_customers = customers_df.filter(col("email").like("%@gmail.com"))
display(gmail_customers.select("customer_id", "first_name", "email").limit(5))

# Filter using contains
new_cities = customers_df.filter(col("city").contains("New"))
display(new_cities.select("customer_id", "first_name", "city").limit(5))

# Filter using startswith
j_names = customers_df.filter(col("first_name").startswith("J"))
display(j_names.select("customer_id", "first_name", "last_name").limit(5))

## 2.12. Aggregations and Grouping

Aggregations allow you to compute summary statistics on your data.

### 2.12.1. Basic Aggregations

Use `groupBy()` with aggregation functions like `count()`, `sum()`, `avg()`.

In [0]:
# Count by country
customers_by_country = customers_df.groupBy("country").count().orderBy(desc("count"))
display(customers_by_country.limit(10))

In [0]:
# Sum and average on orders
revenue_by_payment = orders_df.groupBy("payment_method").agg(
    sum("total_amount").alias("total_revenue"),
    avg("total_amount").alias("avg_order_value"),
    count("*").alias("order_count")
).orderBy(desc("total_revenue"))

display(revenue_by_payment)

### 2.12.2. Min/Max Aggregations

Find minimum and maximum values in your data.

In [0]:
# Min and max order amounts
order_stats = orders_df.agg(
    min("total_amount").alias("min_amount"),
    max("total_amount").alias("max_amount"),
    avg("total_amount").alias("avg_amount")
)
display(order_stats)

### 2.12.3. Multiple Aggregations with agg()

Use `agg()` to apply multiple aggregation functions at once.

In [0]:
# Example: Multiple aggregations on orders by customer
customer_stats = orders_df.groupBy("customer_id").agg(
    count("*").alias("total_orders"),
    sum("total_amount").alias("total_spent"),
    avg("total_amount").alias("avg_order_value"),
    min("total_amount").alias("min_order"),
    max("total_amount").alias("max_order")
).orderBy(desc("total_spent"))

display(customer_stats.limit(10))

### 2.12.4. Having Clause (Filter After Aggregation)

In Spark, use `filter()` after `groupBy()` to implement SQL's HAVING clause.

In [0]:
# Customers with more than 5 orders
high_frequency = orders_df.groupBy("customer_id").agg(
    count("*").alias("order_count"),
    sum("total_amount").alias("total_spent")
).filter(col("order_count") > 5).orderBy(desc("order_count"))

display(high_frequency.limit(10))

## 2.13. Temporary Views & SQL

Temporary views bridge the DataFrame API and SQL: once a DataFrame is registered as a view, it can be queried with SQL.

### 2.13.1. Creating Temporary Views

**Types of Views in Databricks (Exam Topic):**

| View Type | Scope | Persistence | Syntax |
|-----------|-------|-------------|--------|
| Temp View | Current SparkSession | Session only | `CREATE TEMP VIEW` |
| Global Temp View | All SparkSessions in cluster | Until cluster restart | `CREATE GLOBAL TEMP VIEW` (access via `global_temp.name`) |
| Permanent View | Unity Catalog | Persistent | `CREATE VIEW catalog.schema.name` |

**Key Exam Points:**
- Temp views are **not visible** across notebooks
- Global temp views use special schema `global_temp`
- Permanent views are stored in Unity Catalog with full governance
- Views do NOT store data - they store the query definition

In [0]:
# Create temporary views from our DataFrames

customers_df.createOrReplaceTempView("customers")
orders_df.createOrReplaceTempView("orders")
products_df.createOrReplaceTempView("products")

print("\nThese views are available for SQL queries in this session.")

### 2.13.2. Running SQL Queries

Use `spark.sql()` to run SQL queries against temporary views.

In [0]:
# Simple SELECT query
result = spark.sql("""
    SELECT 
        customer_id,
        COUNT(*) as order_count,
        SUM(total_amount) as total_spent
    FROM orders
    GROUP BY customer_id
    ORDER BY order_count DESC
    LIMIT 10
""")

display(result)

### 2.13.3. DataFrame API vs SQL - Same Result, Different Syntax

Compare how the same operation looks in DataFrame API and SQL.

In [0]:
# Same query - two approaches

# DataFrame API
df_api_result = customers_df  \
    .groupBy("city") \
    .count() \
    .orderBy(desc("count")) \
    .limit(5)

display(df_api_result)

In [0]:
# SQL
sql_result = spark.sql("""
    SELECT 
        city,
        COUNT(*) as count
    FROM customers
    GROUP BY city
    ORDER BY count DESC
    LIMIT 5
""")

display(sql_result)

### 2.13.4. Global Temporary Views

Global temp views are accessible across different notebooks on the same cluster (across SparkSessions).

In [0]:
# Create global temporary view
customers_df.createOrReplaceGlobalTempView("global_customers")

In [0]:
# Query global temp view
global_result = spark.sql("""
    SELECT country, COUNT(*) as count
    FROM global_temp.global_customers
    GROUP BY country
    ORDER BY count DESC
    LIMIT 5
""")

display(global_result)

### 2.13.5. Complex SQL Queries

Combine multiple operations in SQL queries.

In [0]:
# Example: Complex query with JOIN and aggregation

complex_query = spark.sql("""
    SELECT 
        c.customer_id,
        c.first_name,
        c.last_name,
        c.country,
        COUNT(o.order_id) as total_orders,
        SUM(o.total_amount) as total_spent,
        AVG(o.total_amount) as avg_order_value,
        MAX(o.total_amount) as largest_order
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = try_cast(o.customer_id AS STRING)
    GROUP BY c.customer_id, c.first_name, c.last_name, c.country
    HAVING COUNT(o.order_id) > 0
    ORDER BY total_spent DESC
    LIMIT 10
""")

display(complex_query)

### 2.13.6. When to Use SQL vs DataFrame API?

**Use SQL when:**
* Team is more familiar with SQL
* Complex queries with multiple JOINs
* Ad-hoc analysis and exploration

**Use DataFrame API when:**
* Building reusable data pipelines
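* Need logic that can be composed, parameterized, and unit-tested in Python (note: PySpark has no compile-time type checks; errors surface when the query is analyzed)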
* Complex transformations with custom logic
* Better IDE support and autocomplete

## 2.14. JSON Operations

Working with semi-structured JSON data requires special operations.

### 2.14.1. Explode - Flattening Arrays

The `explode()` function creates a new row for each element in an array.

In [0]:
# Create sample data with arrays to demonstrate explode
sample_data = [
    (1, "Customer A", ["product_1", "product_2", "product_3"]),
    (2, "Customer B", ["product_1"]),
    (3, "Customer C", []),  # Empty array
    (4, "Customer D", None)  # Null array
]

sample_schema = StructType([
    StructField("customer_id", IntegerType(), False),
    StructField("customer_name", StringType(), False),
    StructField("purchased_products", ArrayType(StringType()), True)
])

sample_df = spark.createDataFrame(sample_data, schema=sample_schema)
display(sample_df)

In [0]:
# explode() - skips null and empty arrays
exploded_df = sample_df.select(
    "customer_id",
    "customer_name",
    explode("purchased_products").alias("product")
)
display(exploded_df)

# explode_outer() - keeps rows with null or empty arrays (product becomes null)
exploded_outer_df = sample_df.select(
    "customer_id",
    "customer_name",
    explode_outer("purchased_products").alias("product")
)
display(exploded_outer_df)

### 2.14.2. Nested JSON Structures

Access nested fields using dot notation or `getField()`.

In [0]:
# Create sample data with nested structures
nested_data = customers_df.select(
    col("customer_id"),
    struct(
        col("first_name"),
        col("last_name"),
        col("email")
    ).alias("personal_info"),
    struct(
        col("city"),
        col("country")
    ).alias("location")
)

nested_data.printSchema()
display(nested_data.limit(3))

# Access nested fields
flattened = nested_data.select(
    "customer_id",
    col("personal_info.first_name").alias("first_name"),
    col("personal_info.email").alias("email"),
    col("location.country").alias("country")
)
display(flattened.limit(5))

### 2.14.3. Parsing JSON Strings

Use `get_json_object()` or `from_json()` to parse JSON stored as strings.

In [0]:
# Create sample data with JSON strings
json_string_data = [
    (1, '{"name": "John", "age": 30, "city": "New York"}'),
    (2, '{"name": "Jane", "age": 25, "city": "London"}'),
    (3, '{"name": "Bob", "age": 35, "city": "Paris"}')
]

json_df = spark.createDataFrame(json_string_data, ["id", "json_data"])
display(json_df)

# Extract fields using get_json_object
parsed_df = json_df.select(
    "id",
    get_json_object("json_data", "$.name").alias("name"),
    get_json_object("json_data", "$.age").alias("age"),
    get_json_object("json_data", "$.city").alias("city")
)
display(parsed_df)

## 2.15. Joins - Combining Datasets

Joins allow you to combine data from multiple DataFrames based on common keys.

### 2.15.1. Inner Join

Inner join returns only matching rows from both DataFrames.

In [0]:
# Inner join - customers with their orders

customers_with_orders = customers_df.join(
    orders_df,
    customers_df.customer_id == orders_df.customer_id,
    "inner"
).select(
    customers_df.customer_id,
    customers_df.first_name,
    customers_df.last_name,
    orders_df.order_id,
    orders_df.total_amount,
    orders_df.payment_method
)

display(customers_with_orders.limit(10))

### 2.15.2. Left Join

Left join returns all rows from the left DataFrame and matching rows from the right.

In [0]:
# Left join - all orders with customer details (if available)

orders_with_customers = orders_df.join(
    customers_df,
    orders_df['customer_id'] == customers_df['customer_id'],
    "left"
).select(
    orders_df.order_id,
    orders_df.customer_id,
    customers_df.first_name,
    customers_df.last_name,
    orders_df.total_amount
)

display(orders_with_customers.limit(10))

### 2.15.3. Right Join

Right join returns all rows from the right DataFrame and matching rows from the left.

In [0]:
# Right join - all customers with their orders (if any)

all_customers_with_orders = orders_df.join(
    customers_df,
    orders_df['customer_id'] == customers_df['customer_id'],
    "right"
).select(
    customers_df.customer_id,
    customers_df.first_name,
    customers_df.last_name,
    orders_df.order_id,
    orders_df.total_amount
)

display(all_customers_with_orders.limit(10))

### 2.15.4. Full Outer Join

Full outer join returns all rows from both DataFrames, with nulls where there's no match.

In [0]:
# Full outer join - all customers and all orders

from pyspark.sql.functions import col

full_join = customers_df.join(
    orders_df,
    customers_df.customer_id == orders_df.customer_id,
    "outer"
).select(
    customers_df.customer_id.alias("cust_id"),
    orders_df.customer_id.alias("order_cust_id"),
    customers_df.first_name,
    col("order_id").alias("order_id"),
    orders_df.total_amount
)

print("\nSample with potential nulls:")
display(full_join.limit(10))

## 2.16. read_files() - Unity Catalog Native Reader

**Theoretical Introduction:**

`read_files()` is a Databricks SQL function that reads files from Volumes or cloud storage.

| Feature | `read_files()` | `spark.read` |
|---------|---------------|-------------|
| Language | SQL | Python |
| Schema inference | Automatic | Manual or `inferSchema` |
| UC integration | Native | Via path |
| Use case | SQL-first workflows | Programmatic workflows |

```sql
-- Read CSV from a Volume
SELECT * FROM read_files(
  '/Volumes/catalog/schema/volume/customers.csv',
  format => 'csv',
  header => true
);

-- Create table from files
CREATE TABLE bronze.customers AS
SELECT * FROM read_files('/Volumes/catalog/schema/volume/customers.csv');
```

**Exam Note:** `read_files()` is the recommended way to read files in SQL workflows with Unity Catalog.

In [0]:
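# read_files() is a SQL function, and a %sql cell cannot see the Python variable CUSTOMERS_CSV.
# Interpolate the path into the statement and run it through spark.sql() instead.
display(spark.sql(f"""
    SELECT * FROM read_files(
      '{CUSTOMERS_CSV}',
      format => 'csv',
      header => true
    )
"""))

# Create a table from the same file
# (assumes a schema named `bronze` already exists in the current catalog)
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS bronze.customers AS
    SELECT * FROM read_files('{CUSTOMERS_CSV}', format => 'csv', header => true)
""")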


## 2.17. Summary

In this comprehensive notebook you learned:

**Data Ingestion (Sections 2.5-2.9)**
* Loading CSV with inferSchema vs manual schema
* Loading JSON with automatic schema detection
* Loading Parquet (built-in schema)
* Loading Excel with `format("excel")`
* Performance comparison and best practices

**DataFrame Transformations (Section 2.10)**
* `select()` - choosing columns
* `withColumn()` - adding/modifying columns
* `cast()` - type conversion
* `withColumnRenamed()` - renaming columns
* `drop()` - removing columns
* `distinct()` and `dropDuplicates()` - unique rows
* `orderBy()` - sorting data

**Filtering Data (Section 2.11)**
* Simple filter conditions with `filter()` and `where()`
* Multiple conditions with `&` (AND) and `|` (OR)
* `isin()` - filtering by list of values
* `isNull()` and `isNotNull()` - null handling
* String operations: `like()`, `contains()`, `startswith()`

**Aggregations (Section 2.12)**
* `groupBy()` with `count()`, `sum()`, `avg()`
* `min()` and `max()` aggregations
* Multiple aggregations with `agg()`
* HAVING clause equivalent with `filter()` after groupBy

**Temporary Views & SQL (Section 2.13)**
* Creating temp views with `createOrReplaceTempView()`
* Running SQL queries with `spark.sql()`
* DataFrame API vs SQL comparison
* Global temporary views
* Complex SQL queries with JOINs

**JSON Operations (Section 2.14)**
* `explode()` and `explode_outer()` for arrays
* Accessing nested structures with dot notation
* `struct()` for creating nested structures
* `get_json_object()` for parsing JSON strings

**Joins (Section 2.15)**
* Inner join - matching rows only
* Left join - all left + matching right
* Right join - all right + matching left
* Full outer join - all rows from both
* Joining with extended data sources

---

### Key Takeaways

1. **Schema Definition**: Use explicit schemas in production for performance and data quality
2. **Format Selection**: CSV for compatibility, Parquet for performance, JSON for flexibility
3. **Transformations**: Chain operations for readable, maintainable code
4. **SQL Bridge**: Use temp views to leverage SQL when appropriate
5. **Joins**: Choose the right join type based on your data requirements

**Next Steps**: Apply these techniques to build ingestion pipelines! 