## Context and Requirements

- **Training Day**: Day 1 - Data Ingestion
- **Notebook Type**: Workshop
- **Technical Requirements**:
  - Databricks Runtime 13.0+ (recommended: 14.3 LTS)
  - Unity Catalog enabled
  - Permissions: CREATE TABLE, CREATE SCHEMA, SELECT, MODIFY
  - Cluster: Standard with minimum 2 workers

## Theoretical Introduction

**Section Objective:** Understand the two file ingestion methods used in this workshop, COPY INTO and Auto Loader, and when each one fits in the Databricks Lakehouse

### COPY INTO - Batch Ingestion
- **Purpose**: Load data from external files into Delta tables
- **Idempotency**: Tracks files that were already loaded and skips them on re-run, so repeated executions do not duplicate data
- **Use case**: Scheduled batch jobs, one-time data migrations
- **Supported formats**: CSV, JSON, Parquet, Avro, ORC, TEXT
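
The idempotency bullet above can be illustrated with a toy model in plain Python. This is only a sketch of the idea (remember which files were loaded, skip them next time), not how Databricks implements COPY INTO; the class `ToyCopyInto` and the file name are hypothetical.

```python
# Toy model of idempotent loading; the class and file names are hypothetical.
class ToyCopyInto:
    def __init__(self):
        self.loaded_files = set()   # stands in for the record of files already loaded
        self.rows = []              # stands in for the target table

    def copy_into(self, files):
        """Load rows from files not seen before; return the number of rows added."""
        added = 0
        for path, file_rows in files.items():
            if path in self.loaded_files:
                continue  # already loaded: skip, so a re-run adds nothing
            self.rows.extend(file_rows)
            self.loaded_files.add(path)
            added += len(file_rows)
        return added


table = ToyCopyInto()
batch = {"customers.csv": [{"customer_id": "CUST000001"}, {"customer_id": "CUST000002"}]}

first_run = table.copy_into(batch)
second_run = table.copy_into(batch)   # same file again
print(first_run, second_run, len(table.rows))   # prints: 2 0 2
```

Task 1.4 performs the same check against the real table: the row count before and after a second COPY INTO should match.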

### Auto Loader - Streaming Ingestion
- **Purpose**: Incrementally process new files as they arrive
- **cloudFiles format**: Uses `.format("cloudFiles")` for streaming read
- **Schema inference**: Infers the schema from incoming files and evolves it as new columns appear (the inferred schema is stored at `cloudFiles.schemaLocation`)
- **Use case**: Near real-time processing, continuous data pipelines

### Key Differences

| Feature | COPY INTO | Auto Loader |
|---------|-----------|-------------|
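| Processing | Batch (SQL command) | Incremental (Structured Streaming) |
| File tracking | Built-in | Checkpoint-based |
| Schema evolution | Opt-in (`mergeSchema`) | Automatic |
| Scalability | Medium (thousands of files) | High (millions of files) |
| Cost | Compute for each execution | Compute while the stream runs |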

## Environment Initialization

Run the initialization script for per-user catalog and schema isolation:

In [None]:
%run ../00_setup

## Configuration

Define workshop-specific variables and paths:

In [None]:
# Workshop configuration
from pyspark.sql import functions as F
from pyspark.sql.types import *

# Paths to data files (defined in 00_setup)
CUSTOMERS_CSV = f"{DATASET_BASE_PATH}/customers/customers.csv"
ORDERS_JSON = f"{DATASET_BASE_PATH}/orders/orders_batch.json"
ORDERS_STREAM_PATH = f"{DATASET_BASE_PATH}/orders/stream/"
PRODUCTS_PARQUET = f"{DATASET_BASE_PATH}/products/products.parquet"

# Checkpoint path for Auto Loader
CHECKPOINT_PATH = f"{DATASET_BASE_PATH}/_checkpoints/workshop_02"

# Display configuration
print("="*60)
print("WORKSHOP CONFIGURATION")
print("="*60)
print(f"Catalog:          {CATALOG}")
print(f"Bronze Schema:    {BRONZE_SCHEMA}")
print(f"Customers CSV:    {CUSTOMERS_CSV}")
print(f"Orders JSON:      {ORDERS_JSON}")
print(f"Orders Stream:    {ORDERS_STREAM_PATH}")
print(f"Products Parquet: {PRODUCTS_PARQUET}")
print(f"Checkpoint Path:  {CHECKPOINT_PATH}")
print("="*60)

**Data Schema Reference:**

**Customers (CSV):**
- `customer_id` (STRING): CUST000001
- `first_name`, `last_name` (STRING): Customer name
- `email` (STRING): Email address
- `phone` (STRING): Phone number
- `city`, `state`, `country` (STRING): Location
- `registration_date` (DATE): Registration date
- `customer_segment` (STRING): Basic, Premium, etc.

**Orders (JSON):**
- `order_id` (STRING): ORD00000001
- `customer_id`, `product_id`, `store_id` (STRING): Foreign keys
- `order_datetime` (TIMESTAMP): Order timestamp
- `quantity` (INT), `unit_price`, `discount_percent`, `total_amount` (DOUBLE)
- `payment_method` (STRING): Cash, Credit Card, etc.

**Products (Parquet):**
- `product_id` (STRING): PROD000001
- `product_name` (STRING): Product name
- `subcategory_code` (STRING): Category code
- `brand` (STRING): Brand name
- `unit_cost`, `list_price` (DOUBLE): Prices
- `weight_kg` (DOUBLE): Weight
- `status` (STRING): Active, Inactive
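
Part 1 keeps every source column as STRING in the bronze tables and adds a typed `_ingestion_timestamp` column. As a rough sketch of that convention, the column lists above can be turned into DDL with a small helper. The function `bronze_ddl` and the `demo_catalog.bronze` prefix are hypothetical and only serve the illustration; the notebook itself uses `{CATALOG}.{BRONZE_SCHEMA}`.

```python
def bronze_ddl(table_name, source_columns, metadata_columns=None):
    """Build a CREATE TABLE statement with every source column typed as STRING."""
    metadata_columns = metadata_columns or {"_ingestion_timestamp": "TIMESTAMP"}
    column_defs = [f"{name} STRING" for name in source_columns]
    column_defs += [f"{name} {dtype}" for name, dtype in metadata_columns.items()]
    body = ",\n    ".join(column_defs)
    return f"CREATE TABLE IF NOT EXISTS {table_name} (\n    {body}\n)\nUSING DELTA"


customer_columns = [
    "customer_id", "first_name", "last_name", "email", "phone",
    "city", "state", "country", "registration_date", "customer_segment",
]
# "demo_catalog.bronze" is a placeholder prefix, not a name from this workshop.
ddl = bronze_ddl("demo_catalog.bronze.bronze_customers_batch", customer_columns)
print(ddl)
```

Note that `registration_date` is listed as DATE in the reference but is still stored as STRING here; casting to the proper type is left to the silver layer.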

---

## Part 1: COPY INTO - Batch Ingestion

### Task 1.1: CSV File Ingestion (Customers)

**Objective:** Load customer data from a CSV file into a Delta table using COPY INTO.

**Instructions:**
1. Create target table `bronze_customers_batch` with all source columns as STRING (bronze layer principle)
2. Use `COPY INTO` to load data from `customers.csv`
3. Add `_ingestion_timestamp` metadata column
4. Verify loaded records

**Hints:**
- Use `USING DELTA` for table format
- Use `FILEFORMAT = CSV` with `FORMAT_OPTIONS ('header' = 'true')`
- Use `current_timestamp()` for ingestion timestamp
- Do NOT use inferSchema - keep all data as STRING in bronze layer

In [None]:
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch (
        customer_id STRING,
        first_name STRING,
        last_name STRING,
        email STRING,
        phone STRING,
        city STRING,
        state STRING,
        country STRING,
        registration_date STRING,
        customer_segment STRING,
        _ingestion_timestamp TIMESTAMP
    )
    USING ____
""")

In [None]:
spark.sql(f"""
    ____ INTO {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
    FROM (
        SELECT 
            customer_id,
            first_name,
            last_name,
            email,
            phone,
            city,
            state,
            country,
            registration_date,
            customer_segment,
            ____() as _ingestion_timestamp
        FROM '{CUSTOMERS_CSV}'
    )
    FILEFORMAT = ____
    FORMAT_OPTIONS ('header' = '____')
""")

In [None]:
count_result = spark.sql(f"""
    SELECT COUNT(*) as total_records 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
""").collect()[0]['total_records']

print(f"Total customers loaded: {count_result}")

spark.sql(f"""
    SELECT customer_id, first_name, last_name, email, customer_segment, _ingestion_timestamp
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
    LIMIT 5
""").show(truncate=False)

### Task 1.2: JSON File Ingestion (Orders)

**Objective:** Load order data from a JSON file into a Delta table.

**Instructions:**
1. Create target table `bronze_orders_batch` with all source columns as STRING
2. Use `COPY INTO` with `FILEFORMAT = JSON`
3. Add ingestion timestamp

**Hints:**
- JSON format does not need `header` option
- All columns remain STRING in bronze layer
- Type casting happens in silver layer

In [None]:
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_batch (
        order_id STRING,
        customer_id STRING,
        product_id STRING,
        store_id STRING,
        order_datetime STRING,
        quantity STRING,
        unit_price STRING,
        discount_percent STRING,
        total_amount STRING,
        payment_method STRING,
        _ingestion_timestamp TIMESTAMP
    )
    USING DELTA
""")

In [None]:
spark.sql(f"""
    COPY INTO {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_batch
    FROM (
        SELECT 
            order_id,
            customer_id,
            product_id,
            store_id,
            ____,
            quantity,
            unit_price,
            discount_percent,
            total_amount,
            payment_method,
            current_timestamp() as _ingestion_timestamp
        FROM '{____}'
    )
    FILEFORMAT = ____
""")

In [None]:
count_result = spark.sql(f"""
    SELECT COUNT(*) as total_records 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_batch
""").collect()[0]['total_records']

print(f"Total orders loaded: {count_result}")

spark.sql(f"""
    SELECT order_id, customer_id, order_datetime, total_amount, payment_method
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_batch
    LIMIT 5
""").show(truncate=False)

### Task 1.3: Parquet File Ingestion (Products)

**Objective:** Load product data from Parquet file with source file metadata.

**Instructions:**
1. Create target table `bronze_products_batch` with all source columns as STRING
2. Use `COPY INTO` with `FILEFORMAT = PARQUET`
3. Add `_source_file` column using `_metadata.file_path`

**Hints:**
- Use `_metadata.file_path` to capture source file path
- Even for Parquet, keep STRING types in bronze for consistency

In [None]:
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_products_batch (
        product_id STRING,
        product_name STRING,
        subcategory_code STRING,
        brand STRING,
        unit_cost STRING,
        list_price STRING,
        weight_kg STRING,
        status STRING,
        _source_file STRING,
        _ingestion_timestamp TIMESTAMP
    )
    USING DELTA
""")

In [None]:
spark.sql(f"""
    COPY INTO {CATALOG}.{BRONZE_SCHEMA}.bronze_products_batch
    FROM (
        SELECT 
            product_id,
            product_name,
            subcategory_code,
            brand,
            CAST(unit_cost AS STRING) as unit_cost,
            CAST(list_price AS STRING) as list_price,
            CAST(weight_kg AS STRING) as weight_kg,
            status,
            ____.____ as _source_file,
            current_timestamp() as _ingestion_timestamp
        FROM '{PRODUCTS_PARQUET}'
    )
    FILEFORMAT = ____
""")

In [None]:
count_result = spark.sql(f"""
    SELECT COUNT(*) as total_records 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_products_batch
""").collect()[0]['total_records']

print(f"Total products loaded: {count_result}")

spark.sql(f"""
    SELECT product_id, product_name, brand, list_price, status, _source_file
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_products_batch
    LIMIT 5
""").show(truncate=False)

### Task 1.4: Testing COPY INTO Idempotency

**Objective:** Verify that COPY INTO is idempotent and doesn't duplicate data.

**Instructions:**
1. Check record count before re-run
2. Execute COPY INTO again
3. Verify record count unchanged

In [None]:
before_count = spark.sql(f"""
    SELECT COUNT(*) as count 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
""").collect()[0]['count']

print(f"Records BEFORE re-run: {before_count}")

In [None]:
spark.sql(f"""
    COPY INTO {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
    FROM (
        SELECT 
            customer_id, first_name, last_name, email, phone,
            city, state, country, registration_date, customer_segment,
            current_timestamp() as _ingestion_timestamp
        FROM '{CUSTOMERS_CSV}'
    )
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true')
""")

print("COPY INTO executed again...")

In [None]:
after_count = spark.sql(f"""
    SELECT COUNT(*) as count 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
""").collect()[0]['count']

print(f"Records AFTER re-run: {after_count}")
print(f"\nIdempotency test: {'PASSED - No duplicates!' if before_count == after_count else 'FAILED - Data was duplicated!'}")

---

## Part 2: Auto Loader - Streaming Ingestion

### Task 2.1: Auto Loader for CSV Files

**Objective:** Set up streaming ingestion for customer data using Auto Loader.

**Instructions:**
1. Configure `cloudFiles` format for streaming read
2. Set schema location for schema evolution
3. Add metadata columns (`_ingestion_timestamp`, `_source_file`)
4. Write stream to Delta table

**Hints:**
- Format: `cloudFiles`
- Option `cloudFiles.format`: `csv`
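- Use `current_timestamp()` and `input_file_name()` for metadata (if `input_file_name()` is unavailable on your cluster, for example in Unity Catalog shared access mode, use `col("_metadata.file_path")` instead)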
- Output mode: `append`
- Set `checkpointLocation` for streaming state

In [None]:
from pyspark.sql.functions import current_timestamp, input_file_name

customers_stream = (
    spark.readStream
    .format("____")
    .option("cloudFiles.format", "____")
    .option("cloudFiles.schemaLocation", f"{CHECKPOINT_PATH}/customers_schema")
    .option("header", "true")
    .load(CUSTOMERS_CSV)
)

In [None]:
customers_enriched = (
    customers_stream
    .withColumn("_ingestion_timestamp", ____)
    .withColumn("_source_file", ____)
)

In [None]:
query_customers = (
    customers_enriched.writeStream
    .format("____")
    .outputMode("____")
    .option("checkpointLocation", f"{CHECKPOINT_PATH}/customers_stream")
    .option("mergeSchema", "true")
    .toTable(f"{CATALOG}.{BRONZE_SCHEMA}.bronze_customers_stream")
)

In [None]:
import time

# Give the stream time to process its first micro-batch before counting rows
time.sleep(15)

count_result = spark.sql(f"""
    SELECT COUNT(*) as total_records 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_stream
""").collect()[0]['total_records']

print(f"Records ingested via Auto Loader: {count_result}")
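
A fixed 15-second sleep can be too short on a cold cluster and needlessly long on a warm one. One possible alternative is to poll until a condition holds or a timeout expires. The helper below is a minimal sketch in plain Python; `wait_until` and the stand-in predicate `rows_arrived` are hypothetical names, not part of the workshop setup.

```python
import time


def wait_until(predicate, timeout_s=60.0, interval_s=1.0):
    """Poll predicate() until it returns True or timeout_s elapses; return the outcome."""
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Stand-in for "has the table received rows yet?": becomes true on the third check.
checks = {"n": 0}

def rows_arrived():
    checks["n"] += 1
    return checks["n"] >= 3


ok = wait_until(rows_arrived, timeout_s=5.0, interval_s=0.01)
timed_out = wait_until(lambda: False, timeout_s=0.05, interval_s=0.01)
print(ok, timed_out)   # prints: True False
```

In the notebook, the predicate could be something like `lambda: spark.table(table_name).count() > 0`, where `table_name` is the fully qualified target table.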

### Task 2.2: Auto Loader for JSON Stream

**Objective:** Set up streaming ingestion for order files arriving in the stream folder.

**Instructions:**
1. Configure Auto Loader for JSON format
2. Configure rescued data column for invalid records

**Hints:**
- Option `cloudFiles.format`: `json`
- Option `cloudFiles.rescuedDataColumn`: `_rescued_data`
- Set `checkpointLocation` to `orders_stream` subfolder

In [None]:
orders_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "____")
    .option("cloudFiles.schemaLocation", f"{CHECKPOINT_PATH}/orders_schema")
    .option("cloudFiles.rescuedDataColumn", "____")
    .load(ORDERS_STREAM_PATH)
)

In [None]:
query_orders = (
    orders_stream.writeStream
    .format("delta")
    .outputMode("____")
    .option("checkpointLocation", f"{CHECKPOINT_PATH}/____")
    .toTable(f"{CATALOG}.{BRONZE_SCHEMA}.bronze_orders_stream")
)

In [None]:
time.sleep(15)

count_result = spark.sql(f"""
    SELECT COUNT(*) as total_records 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_stream
""").collect()[0]['total_records']

print(f"Orders ingested via Auto Loader: {count_result}")

spark.sql(f"""
    SELECT order_id, customer_id, order_datetime, total_amount
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_stream
    LIMIT 5
""").show()

### Task 2.3: Monitoring Streaming Queries

**Objective:** Learn to monitor active streaming queries.

**Instructions:**
1. List all active streams
2. Check stream status and progress
3. View metrics (processed records, batch duration)

**Hints:**
- Use `spark.streams.active` to get list of streams
- Use `stream.lastProgress` for metrics

In [None]:
active_streams = spark.streams.____

print(f"Number of active streams: {len(active_streams)}")
print("="*60)

for stream in active_streams:
    print(f"\nStream ID: {stream.id}")
    print(f"Name: {stream.name}")
    print(f"Is Active: {stream.isActive}")
    print(f"Status: {stream.status}")

In [None]:
if len(active_streams) > 0:
    last_progress = active_streams[0].____
    
    if last_progress:
        print("Last Progress Metrics:")
        print(f"  Batch ID: {last_progress.get('batchId', 'N/A')}")
        print(f"  Num Input Rows: {last_progress.get('numInputRows', 'N/A')}")
        print(f"  Input Rows/sec: {last_progress.get('inputRowsPerSecond', 'N/A')}")
        print(f"  Processed Rows/sec: {last_progress.get('processedRowsPerSecond', 'N/A')}")
    else:
        print("No progress data available yet.")
else:
    print("No active streams.")
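
The metrics printed above can also be condensed into a single line, which is handy when looping over several streams. The sketch below assumes the progress object behaves like a dict with the same keys read in the cell above; `summarize_progress` is a hypothetical helper and the sample values are made up for illustration.

```python
def summarize_progress(progress):
    """Return a one-line summary of a streaming progress dict, tolerating missing keys."""
    if not progress:
        return "No progress data available yet."
    fields = [
        ("batch", progress.get("batchId")),
        ("input rows", progress.get("numInputRows")),
        ("input rows/s", progress.get("inputRowsPerSecond")),
        ("processed rows/s", progress.get("processedRowsPerSecond")),
    ]
    return ", ".join(f"{label}={'N/A' if value is None else value}" for label, value in fields)


# Hypothetical sample using the keys read above; the values are made up.
sample = {"batchId": 3, "numInputRows": 1200, "inputRowsPerSecond": 80.0}
print(summarize_progress(sample))
# prints: batch=3, input rows=1200, input rows/s=80.0, processed rows/s=N/A
print(summarize_progress(None))
# prints: No progress data available yet.
```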

### Task 2.4: Stopping Streaming Queries

**Objective:** Gracefully stop all active streaming queries.

**Instructions:**
1. Iterate over active streams
2. Stop each stream
3. Verify all streams stopped

**Hints:**
- Use `stream.stop()` to stop a stream
- Iterate over `spark.streams.active`

In [None]:
for stream in spark.streams.active:
    print(f"Stopping stream: {stream.name or stream.id}")
    stream.____()

print("\nAll streams stopped!")

In [None]:
print(f"Active streams remaining: {len(spark.streams.active)}")
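
The stop-and-verify steps can be wrapped in a small function that reports which queries it stopped. This is a sketch only: `stop_all` is a hypothetical helper, and `FakeStream` is a stand-in that mimics just the attributes used in this task (`name`, `id`, `isActive`, `stop()`) so the example runs without a cluster.

```python
def stop_all(streams):
    """Stop every query in streams; return the labels of those that were stopped."""
    stopped = []
    for stream in list(streams):        # copy first so the loop is unaffected by changes
        label = stream.name or str(stream.id)
        stream.stop()
        stopped.append(label)
    return stopped


class FakeStream:
    """Hypothetical stand-in exposing the attributes the loop above relies on."""
    def __init__(self, name, stream_id):
        self.name, self.id, self.isActive = name, stream_id, True

    def stop(self):
        self.isActive = False


fakes = [FakeStream("orders", "id-1"), FakeStream(None, "id-2")]
stopped = stop_all(fakes)
print(stopped, [s.isActive for s in fakes])
# prints: ['orders', 'id-2'] [False, False]
```

In the notebook the call would be `stop_all(spark.streams.active)`.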

---

## Part 3: Comparison and Analysis

### Task 3.1: Compare COPY INTO vs Auto Loader

**Objective:** Analyze and compare results from both ingestion methods.

In [None]:
comparison = spark.sql(f"""
    SELECT 'COPY INTO (batch)' as method, COUNT(*) as records 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
    UNION ALL
    SELECT 'Auto Loader (stream)' as method, COUNT(*) as records 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_stream
""")

comparison.show()

In [None]:
spark.sql(f"""
    ____ HISTORY {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
""").select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)

In [None]:
spark.sql(f"""
    DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_stream
""").select("version", "timestamp", "operation", "operationMetrics").show(truncate=False)

---

## Workshop Summary

**Achieved Objectives:**
- Implemented batch ingestion with COPY INTO for CSV, JSON, Parquet
- Configured Auto Loader for streaming ingestion
- Added metadata columns for lineage tracking
- Verified COPY INTO idempotency
- Monitored and managed streaming queries

**When to Use COPY INTO:**
- Scheduled batch processing
- Known, stable data structure
- Need for explicit control over loading
- Built-in idempotency without checkpoints

**When to Use Auto Loader:**
- Near real-time processing requirements
- Schema evolution expected
- Continuous monitoring for new files
- High scalability needs

### Quick Reference

| Command | Usage |
|---------|-------|
| `COPY INTO table FROM 'path' FILEFORMAT = CSV` | Batch load from files |
| `.format("cloudFiles")` | Auto Loader streaming read |
| `.option("cloudFiles.format", "csv")` | Specify source format |
| `.option("checkpointLocation", path)` | Set checkpoint for streaming |
| `spark.streams.active` | List active streams |
| `stream.stop()` | Stop a streaming query |
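
The three COPY INTO statements in this workshop share one shape: target table, a `SELECT` over the source path, `FILEFORMAT`, and optional `FORMAT_OPTIONS`. As a sketch, that shape can be assembled by a helper. `build_copy_into` is a hypothetical function, the table name and path are placeholders, and the inputs are assumed to be trusted strings since nothing is escaped.

```python
def build_copy_into(table, source_path, select_exprs, file_format, format_options=None):
    """Assemble a COPY INTO statement from trusted strings (no escaping is done)."""
    select_list = ",\n        ".join(select_exprs)
    lines = [
        f"COPY INTO {table}",
        "FROM (",
        f"    SELECT\n        {select_list}",
        f"    FROM '{source_path}'",
        ")",
        f"FILEFORMAT = {file_format.upper()}",
    ]
    if format_options:
        opts = ", ".join(f"'{key}' = '{value}'" for key, value in format_options.items())
        lines.append(f"FORMAT_OPTIONS ({opts})")
    return "\n".join(lines)


stmt = build_copy_into(
    table="demo_catalog.bronze.bronze_customers_batch",   # placeholder table name
    source_path="/tmp/demo/customers.csv",                 # placeholder path
    select_exprs=["customer_id", "current_timestamp() AS _ingestion_timestamp"],
    file_format="csv",
    format_options={"header": "true"},
)
print(stmt)
```

The resulting string can be passed to `spark.sql(...)` in the same way as the hand-written statements in Part 1.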

---

## Solutions

Below are the complete solutions for all workshop tasks.

In [None]:
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch (
        customer_id STRING,
        first_name STRING,
        last_name STRING,
        email STRING,
        phone STRING,
        city STRING,
        state STRING,
        country STRING,
        registration_date STRING,
        customer_segment STRING,
        _ingestion_timestamp TIMESTAMP
    )
    USING DELTA
""")

spark.sql(f"""
    COPY INTO {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
    FROM (
        SELECT 
            customer_id, first_name, last_name, email, phone,
            city, state, country, registration_date, customer_segment,
            current_timestamp() as _ingestion_timestamp
        FROM '{CUSTOMERS_CSV}'
    )
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true')
""")

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_batch (
        order_id STRING,
        customer_id STRING,
        product_id STRING,
        store_id STRING,
        order_datetime STRING,
        quantity STRING,
        unit_price STRING,
        discount_percent STRING,
        total_amount STRING,
        payment_method STRING,
        _ingestion_timestamp TIMESTAMP
    )
    USING DELTA
""")

spark.sql(f"""
    COPY INTO {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_batch
    FROM (
        SELECT 
            order_id, customer_id, product_id, store_id,
            order_datetime, quantity, unit_price, discount_percent, 
            total_amount, payment_method,
            current_timestamp() as _ingestion_timestamp
        FROM '{ORDERS_JSON}'
    )
    FILEFORMAT = JSON
""")

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_products_batch (
        product_id STRING,
        product_name STRING,
        subcategory_code STRING,
        brand STRING,
        unit_cost STRING,
        list_price STRING,
        weight_kg STRING,
        status STRING,
        _source_file STRING,
        _ingestion_timestamp TIMESTAMP
    )
    USING DELTA
""")

spark.sql(f"""
    COPY INTO {CATALOG}.{BRONZE_SCHEMA}.bronze_products_batch
    FROM (
        SELECT 
            product_id, product_name, subcategory_code, brand,
            CAST(unit_cost AS STRING) as unit_cost,
            CAST(list_price AS STRING) as list_price,
            CAST(weight_kg AS STRING) as weight_kg,
            status,
            _metadata.file_path as _source_file,
            current_timestamp() as _ingestion_timestamp
        FROM '{PRODUCTS_PARQUET}'
    )
    FILEFORMAT = PARQUET
""")

print("Part 1 Solutions executed!")

In [None]:
from pyspark.sql.functions import current_timestamp, input_file_name

customers_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", f"{CHECKPOINT_PATH}/customers_schema")
    .option("header", "true")
    .load(CUSTOMERS_CSV)
)

customers_enriched = (
    customers_stream
    .withColumn("_ingestion_timestamp", current_timestamp())
    .withColumn("_source_file", input_file_name())
)

query_customers = (
    customers_enriched.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{CHECKPOINT_PATH}/customers_stream")
    .option("mergeSchema", "true")
    .toTable(f"{CATALOG}.{BRONZE_SCHEMA}.bronze_customers_stream")
)

orders_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", f"{CHECKPOINT_PATH}/orders_schema")
    .option("cloudFiles.rescuedDataColumn", "_rescued_data")
    .load(ORDERS_STREAM_PATH)
)

query_orders = (
    orders_stream.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", f"{CHECKPOINT_PATH}/orders_stream")
    .toTable(f"{CATALOG}.{BRONZE_SCHEMA}.bronze_orders_stream")
)

active_streams = spark.streams.active
print(f"Active streams: {len(active_streams)}")

for stream in active_streams:
    print(f"  - {stream.name or stream.id}: {stream.status}")
    if stream.lastProgress:
        print(f"    Last batch: {stream.lastProgress.get('batchId')}")

print("\nPart 2 Solutions executed!")

In [None]:
comparison = spark.sql(f"""
    SELECT 'COPY INTO' as method, COUNT(*) as records 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
    UNION ALL
    SELECT 'Auto Loader' as method, COUNT(*) as records 
    FROM {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_stream
""")
comparison.show()

spark.sql(f"""
    DESCRIBE HISTORY {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch
""").select("version", "operation").show()

print("Part 3 Solutions executed!")

---

## Resource Cleanup (optional)

In [None]:
for stream in spark.streams.active:
    stream.stop()

# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_batch")
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_batch")
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_products_batch")
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_customers_stream")
# spark.sql(f"DROP TABLE IF EXISTS {CATALOG}.{BRONZE_SCHEMA}.bronze_orders_stream")
# dbutils.fs.rm(CHECKPOINT_PATH, recurse=True)

print("Streams stopped. Uncomment DROP statements to delete tables.")