# Delta Lake Deep Dive: Hands-on with Apache Spark

Executable, didactic walkthrough of Delta Lake operations using PySpark/Delta.

**Sections**
- Environment setup
- Table creation (SQL + DataFrame API)
- DML (INSERT/UPDATE/DELETE) + MERGE
- Read queries + time travel
- Delta management (HISTORY, VACUUM, OPTIMIZE where supported)
- Schema evolution (ADD/RENAME/DROP)
- Bounded (batch) vs Continuous (streaming) reads
- Catalog configuration examples (reference)


## 0) Environment Setup

If you’re running locally, you typically need:

```bash
pip install pyspark==3.4.1 delta-spark==2.4.0
```

This notebook assumes a Delta-enabled Spark environment (Databricks or OSS Delta via `delta-spark`).
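The pinned pair above matters because each `delta-spark` release is built against one Spark minor version. As a minimal sketch, the check below encodes that pairing in a lookup table. The table contents are an assumption recalled from the Delta Lake release notes and should be verified against the official compatibility matrix; `DELTA_TO_SPARK`, `minor`, and `is_compatible` are hypothetical names, not part of either library.

```python
# Assumed pairing of delta-spark minor versions to Spark minor versions.
# Verify against the official Delta Lake releases page before relying on it.
DELTA_TO_SPARK = {
    "2.0": "3.2",
    "2.1": "3.3",
    "2.2": "3.3",
    "2.3": "3.3",
    "2.4": "3.4",
    "3.0": "3.5",
}


def minor(version: str) -> str:
    """Return 'major.minor' from a version string such as '3.4.1'."""
    return ".".join(version.split(".")[:2])


def is_compatible(pyspark_version: str, delta_version: str) -> bool:
    """True if the delta-spark release is listed for this Spark minor version."""
    return DELTA_TO_SPARK.get(minor(delta_version)) == minor(pyspark_version)


print(is_compatible("3.4.1", "2.4.0"))  # the pair pinned in the pip command
print(is_compatible("3.5.0", "2.4.0"))  # a mismatched pair
```

A mismatched pair typically fails at session start-up with class-loading errors, so a check like this is cheapest to run before building the `SparkSession`.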


In [1]:
import os
import pyspark
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Use a predictable local path for path-based Delta examples
BASE_DIR = os.environ.get('DELTA_DEMO_BASE', '/tmp/delta_demo')
CUSTOMERS_PATH = os.path.join(BASE_DIR, 'customers_delta')
STREAM_OUT_PATH = os.path.join(BASE_DIR, 'stream_out')
STREAM_CHECKPOINT = os.path.join(BASE_DIR, 'stream_checkpoint')

builder = (
    SparkSession.builder.appName('DeltaLakeDemo')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
spark.sparkContext.setLogLevel('WARN')
print('PySpark version:', pyspark.__version__)
print('BASE_DIR:', BASE_DIR)


PySpark version: 3.4.1
BASE_DIR: /tmp/delta_demo


### Output validation helpers
A couple of tiny helpers so we can validate outputs after each major step.


In [2]:
from pyspark.sql import functions as F

def assert_eq(actual, expected, msg=''):
    if actual != expected:
        raise AssertionError(f"{msg} Expected {expected}, got {actual}")
    print(f"✅ {msg} = {expected}")

def assert_true(cond, msg=''):
    if not cond:
        raise AssertionError(msg or 'Assertion failed')
    print(f"✅ {msg}")


## 1) Create a Delta table (Spark SQL)

We create a managed Delta table partitioned by `state` and then run basic DML.


In [3]:
# Clean slate
spark.sql('DROP TABLE IF EXISTS customers_delta_demo')

spark.sql('''
CREATE TABLE customers_delta_demo (
  customer_id INT,
  first_name STRING,
  last_name STRING,
  email STRING,
  charges FLOAT,
  state STRING
) USING DELTA
PARTITIONED BY (state)
''')
print('✅ Created table customers_delta_demo')


✅ Created table customers_delta_demo


### Insert + Update + Delete (SQL DML)


In [4]:
spark.sql('''
INSERT INTO customers_delta_demo VALUES
  (10, 'Lin', 'Chan', 'lin.chan@example.com', 425.3, 'CA'),
  (11, 'Iris', 'Huang', 'iris.huang@example.com', 820.0, 'NY')
''')

# Update charges for CA
spark.sql("UPDATE customers_delta_demo SET charges = charges * 1.05 WHERE state = 'CA'")

# Delete low-charge customers
spark.sql("DELETE FROM customers_delta_demo WHERE charges < 250")
print('✅ DML completed')


✅ DML completed


### Validate: row count and partitions


In [5]:
df = spark.table('customers_delta_demo')
df.show(truncate=False)

cnt = df.count()
assert_eq(cnt, 2, 'customers_delta_demo row count')

states = [r['state'] for r in df.select('state').distinct().collect()]
assert_true(set(states) == {'CA', 'NY'}, 'distinct states are CA and NY')


+-----------+----------+---------+----------------------+-------+-----+
|customer_id|first_name|last_name|email                 |charges|state|
+-----------+----------+---------+----------------------+-------+-----+
|11         |Iris      |Huang    |iris.huang@example.com|820.0  |NY   |
|10         |Lin       |Chan     |lin.chan@example.com  |446.565|CA   |
+-----------+----------+---------+----------------------+-------+-----+

✅ customers_delta_demo row count = 2
✅ distinct states are CA and NY


## 2) Create/overwrite using the DataFrame API

We overwrite the same table via the DataFrame API to demonstrate non-SQL writes.

Note: This will replace the prior two-row contents.


In [6]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

schema = StructType([
    StructField('customer_id', IntegerType(), True),
    StructField('first_name', StringType(), True),
    StructField('last_name', StringType(), True),
    StructField('email', StringType(), True),
    StructField('charges', FloatType(), True),
    StructField('state', StringType(), True),
])

data = [
    (1, 'John',  'Doe',   'john.doe@example.com',     250.5,  'CA'),
    (2, 'Jane',  'Smith', 'jane.smith@example.com',  300.0,  'NY'),
    (3, 'Alice', 'Brown', 'alice.brown@example.com', 180.75, 'TX'),
]

df_seed = spark.createDataFrame(data, schema=schema)
df_seed.write.format('delta').mode('overwrite').partitionBy('state').saveAsTable('customers_delta_demo')
print('✅ Overwrote customers_delta_demo via DataFrame API')


✅ Overwrote customers_delta_demo via DataFrame API


### Validate: the overwrite happened


In [7]:
df = spark.table('customers_delta_demo')
df.orderBy('customer_id').show(truncate=False)
assert_eq(df.count(), 3, 'row count after overwrite')

tx_cnt = df.filter(F.col('state') == 'TX').count()
assert_eq(tx_cnt, 1, 'TX row count')
min_charges = df.agg(F.min('charges')).first()[0]
assert_true(min_charges < 250, 'minimum charges < 250 exists')


+-----------+----------+---------+-----------------------+-------+-----+
|customer_id|first_name|last_name|email                  |charges|state|
+-----------+----------+---------+-----------------------+-------+-----+
|1          |John      |Doe      |john.doe@example.com   |250.5  |CA   |
|2          |Jane      |Smith    |jane.smith@example.com |300.0  |NY   |
|3          |Alice     |Brown    |alice.brown@example.com|180.75 |TX   |
+-----------+----------+---------+-----------------------+-------+-----+

✅ row count after overwrite = 3
✅ TX row count = 1
✅ minimum charges < 250 exists


## 3) DML Operations (INSERT/DELETE/UPDATE) + MERGE

We’ll add a few rows, delete some, update some, and then demonstrate an UPSERT with `MERGE INTO`.


In [8]:
spark.sql("""
INSERT INTO customers_delta_demo VALUES
  (21, 'John',  'Doe',   'john.doe+21@example.com', 150.75, 'CA'),
  (22, 'Alice', 'Smith', 'alice.smith@example.com', 200.50, 'NY'),
  (23, 'Bob',   'Brown', 'bob.brown@example.com',   175.25, 'TX'),
  (24, 'Emily', 'Davis', 'emily.davis@example.com',  220.30, 'FL')
""")

spark.sql("DELETE FROM customers_delta_demo WHERE customer_id = 21")
spark.sql("DELETE FROM customers_delta_demo WHERE state = 'TX'")
spark.sql("UPDATE customers_delta_demo SET charges = charges * 1.1 WHERE state = 'CA'")
print('✅ INSERT/DELETE/UPDATE complete')


✅ INSERT/DELETE/UPDATE complete


### Validate: deletions and updates applied


In [9]:
df = spark.table('customers_delta_demo')
assert_eq(df.filter(F.col('customer_id') == 21).count(), 0, 'customer_id 21 deleted')
assert_eq(df.filter(F.col('state') == 'TX').count(), 0, 'TX deleted')
ca_cnt = df.filter(F.col('state') == 'CA').count()
assert_true(ca_cnt > 0, 'CA rows exist')
assert_true(df.filter((F.col('state') == 'CA') & (F.col('charges') <= 0)).count() == 0, 'CA charges are positive')
df.orderBy('customer_id').show(truncate=False)


✅ customer_id 21 deleted = 0
✅ TX deleted = 0
✅ CA rows exist
✅ CA charges are positive
+-----------+----------+---------+-----------------------+-------+-----+
|customer_id|first_name|last_name|email                  |charges|state|
+-----------+----------+---------+-----------------------+-------+-----+
|1          |John      |Doe      |john.doe@example.com   |275.55 |CA   |
|2          |Jane      |Smith    |jane.smith@example.com |300.0  |NY   |
|22         |Alice     |Smith    |alice.smith@example.com|200.5  |NY   |
|24         |Emily     |Davis    |emily.davis@example.com|220.3  |FL   |
+-----------+----------+---------+-----------------------+-------+-----+



### MERGE INTO (Upsert)

We create a small staging table with updates and new inserts, then merge into `customers_delta_demo`.
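Before running the SQL, it may help to see the row-level semantics of `WHEN MATCHED THEN UPDATE SET *` and `WHEN NOT MATCHED THEN INSERT *` in isolation. The sketch below is a plain-Python model keyed on `customer_id`, not Delta's implementation. `merge_upsert` is a hypothetical helper, and the rows are a column subset of the data used in this notebook.

```python
def merge_upsert(target_rows, source_rows, key="customer_id"):
    """Plain-Python model of MERGE with UPDATE SET * and INSERT *."""
    source_keys = [row[key] for row in source_rows]
    if len(source_keys) != len(set(source_keys)):
        # Stricter than Delta, which only errors when several source rows
        # match the same target row.
        raise ValueError("duplicate keys in source")
    merged = {row[key]: dict(row) for row in target_rows}
    for src in source_rows:
        # Matched key: every column is replaced. Unmatched key: row is inserted.
        merged[src[key]] = dict(src)
    return sorted(merged.values(), key=lambda row: row[key])


target = [
    {"customer_id": 1, "email": "john.doe@example.com", "charges": 275.55, "state": "CA"},
    {"customer_id": 2, "email": "jane.smith@example.com", "charges": 300.0, "state": "NY"},
    {"customer_id": 22, "email": "alice.smith@example.com", "charges": 200.5, "state": "NY"},
    {"customer_id": 24, "email": "emily.davis@example.com", "charges": 220.3, "state": "FL"},
]
source = [
    {"customer_id": 2, "email": "jane.smith+updated@example.com", "charges": 999.0, "state": "NY"},
    {"customer_id": 99, "email": "zane.otto@example.com", "charges": 1234.5, "state": "TX"},
]

result = merge_upsert(target, source)
print([row["customer_id"] for row in result])  # [1, 2, 22, 24, 99]
```

Rows that appear only in the target (ids 1, 22, 24) pass through unchanged, which is also what the MERGE statement does when no `WHEN NOT MATCHED BY SOURCE` clause is present.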


In [10]:
spark.sql('DROP TABLE IF EXISTS staging_updates')
spark.sql('''
CREATE TABLE staging_updates (
  customer_id INT,
  first_name STRING,
  last_name STRING,
  email STRING,
  charges FLOAT,
  state STRING
) USING DELTA
''')

spark.sql('''
INSERT INTO staging_updates VALUES
  (2,  'Jane', 'Smith', 'jane.smith+updated@example.com', 999.0, 'NY'),
  (99, 'Zane', 'Otto',  'zane.otto@example.com',        1234.5, 'TX')
''')

spark.sql('''
MERGE INTO customers_delta_demo AS target
USING staging_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
''')
print('✅ MERGE completed')


✅ MERGE completed


### Validate: MERGE updated id=2 and inserted id=99


In [11]:
df = spark.table('customers_delta_demo')
r2 = df.filter(F.col('customer_id') == 2).select('email', 'charges').collect()[0]
assert_true('updated' in r2['email'], 'customer_id=2 email updated')
assert_true(abs(float(r2['charges']) - 999.0) < 1e-6, 'customer_id=2 charges updated to 999.0')
assert_eq(df.filter(F.col('customer_id') == 99).count(), 1, 'customer_id=99 inserted')
df.orderBy('customer_id').show(truncate=False)


✅ customer_id=2 email updated
✅ customer_id=2 charges updated to 999.0
✅ customer_id=99 inserted = 1
+-----------+----------+---------+------------------------------+-------+-----+
|customer_id|first_name|last_name|email                         |charges|state|
+-----------+----------+---------+------------------------------+-------+-----+
|1          |John      |Doe      |john.doe@example.com          |275.55 |CA   |
|2          |Jane      |Smith    |jane.smith+updated@example.com|999.0  |NY   |
|22         |Alice     |Smith    |alice.smith@example.com       |200.5  |NY   |
|24         |Emily     |Davis    |emily.davis@example.com       |220.3  |FL   |
|99         |Zane      |Otto     |zane.otto@example.com         |1234.5 |TX   |
+-----------+----------+---------+------------------------------+-------+-----+



## 4) Read Queries & Time Travel

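Every commit to a Delta table is recorded as a numbered version in its transaction log, so an earlier snapshot can be read back by version number (`versionAsOf`) or by timestamp (`timestampAsOf`).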
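Each commit is stored as a JSON file in the table's `_delta_log` directory, named with the zero-padded 20-digit version number. As a rough illustration, the sketch below lists the versions present for a table on a local file system. `list_commit_versions` is a hypothetical helper, and the demo builds a fake log directory rather than reading a real table.

```python
import re
import tempfile
from pathlib import Path

# Commit files look like 00000000000000000003.json (20 digits, then .json).
COMMIT_FILE = re.compile(r"^(\d{20})\.json$")


def list_commit_versions(table_path):
    """Return the sorted commit versions found under <table_path>/_delta_log."""
    log_dir = Path(table_path) / "_delta_log"
    versions = []
    for entry in log_dir.iterdir():
        match = COMMIT_FILE.match(entry.name)
        if match:
            versions.append(int(match.group(1)))
    return sorted(versions)


# Demo against a fake log directory (empty files stand in for real commits).
with tempfile.TemporaryDirectory() as tmp:
    fake_log = Path(tmp) / "_delta_log"
    fake_log.mkdir()
    for name in ("00000000000000000000.json",
                 "00000000000000000001.json",
                 "_last_checkpoint"):
        (fake_log / name).touch()
    print(list_commit_versions(tmp))  # [0, 1]
```

For everyday use, `DeltaTable.history()` remains the supported way to enumerate versions; reading the directory directly is mainly useful for understanding what time travel resolves against.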


In [12]:
from delta.tables import DeltaTable
dt = DeltaTable.forName(spark, 'customers_delta_demo')
history_df = dt.history()
history_df.show(truncate=False)

versions = [int(r['version']) for r in history_df.select('version').collect()]
min_v, max_v = min(versions), max(versions)
print('min_version:', min_v, 'max_version:', max_v)
assert_true(max_v >= min_v, 'history has at least one version')


[`history()` output truncated in the source; the table begins with the columns `version`, `timestamp`, `userId`, `userName`, `operation`, ...]

### Time travel read (versionAsOf)


In [13]:
historic = spark.read.format('delta').option('versionAsOf', min_v).table('customers_delta_demo')
historic.show(truncate=False)

assert_eq(historic.count(), 0, "version 0 is empty (table created, no data yet)")


+-----------+----------+---------+-----+-------+-----+
|customer_id|first_name|last_name|email|charges|state|
+-----------+----------+---------+-----+-------+-----+
+-----------+----------+---------+-----+-------+-----+

✅ version 0 is empty (table created, no data yet) = 0


## 5) Delta Management Commands

- `VACUUM` cleans up unreferenced files (retention rules apply)
- `OPTIMIZE` compacts small files, optionally with `ZORDER BY` on non-partition columns
- `DESCRIBE DETAIL/HISTORY` are useful diagnostics


In [14]:
spark.sql('DESCRIBE HISTORY customers_delta_demo').show(truncate=False)
try:
    spark.sql('DESCRIBE DETAIL customers_delta_demo').show(truncate=False)
except Exception as e:
    print('DESCRIBE DETAIL not available in this environment:', type(e).__name__, str(e)[:200])


[`DESCRIBE HISTORY` output truncated in the source; the table begins with the columns `version`, `timestamp`, `userId`, `userName`, `operation`, ...]

### VACUUM (dry run + actual)


In [15]:
try:
    spark.sql('VACUUM customers_delta_demo DRY RUN').show(truncate=False)
except Exception as e:
    print('VACUUM DRY RUN not available:', type(e).__name__, str(e)[:200])

try:
    spark.sql('VACUUM customers_delta_demo RETAIN 168 HOURS')
    print('✅ VACUUM completed (retain 168 hours)')
except Exception as e:
    print('VACUUM failed (environment-dependent):', type(e).__name__, str(e)[:200])


+----+
|path|
+----+
+----+

✅ VACUUM completed (retain 168 hours)
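The empty dry-run listing is consistent with the retention rule: VACUUM only removes files that the current table version no longer references and that are older than the retention window, and the files replaced earlier in this notebook are presumably far younger than 168 hours. The sketch below is a simplified model of that rule, not Delta's implementation. `vacuum_candidates` and the file paths are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def vacuum_candidates(files, now, retain_hours=168):
    """Return paths that are unreferenced and older than the retention window."""
    cutoff = now - timedelta(hours=retain_hours)
    return [
        f["path"]
        for f in files
        if not f["referenced"] and f["modified"] < cutoff
    ]


now = datetime(2025, 12, 2, 20, 0, tzinfo=timezone.utc)
files = [
    # Still referenced by the current version: never a candidate, however old.
    {"path": "state=CA/part-live.parquet", "referenced": True,
     "modified": now - timedelta(days=30)},
    # Unreferenced but recent: protected by the retention window.
    {"path": "state=TX/part-recent.parquet", "referenced": False,
     "modified": now - timedelta(minutes=5)},
    # Unreferenced and older than 168 hours: eligible for deletion.
    {"path": "state=TX/part-old.parquet", "referenced": False,
     "modified": now - timedelta(days=8)},
]

print(vacuum_candidates(files, now))  # ['state=TX/part-old.parquet']
```

Once a file is vacuumed, time travel to any version that needs it stops working, which is why the retention window should cover the oldest version still expected to be read.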


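### OPTIMIZE (optional)

OSS Delta 2.x supports `OPTIMIZE` and `ZORDER BY`, but Z-ordering cannot target a partition column. Because `state` is the partition column here, the call below raises an error, and the broad `except` reports it as unsupported; Z-ordering on a data column such as `customer_id` avoids that error.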


In [16]:
try:
    spark.sql('OPTIMIZE customers_delta_demo ZORDER BY (state)')
    print('✅ OPTIMIZE completed')
except Exception:
    print('OPTIMIZE not supported in this environment — skipping.')


OPTIMIZE not supported in this environment — skipping.


## 6) More Spark SQL DDL/DML examples (Schema Evolution + CTAS)

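We’ll create a derived table via CTAS, then demonstrate ADD/RENAME/DROP columns. RENAME COLUMN and DROP COLUMN require column mapping (`delta.columnMapping.mode = 'name'`), which in turn needs reader version 2 and writer version 5, so those table properties are set first.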


In [17]:
spark.sql("""
CREATE OR REPLACE TABLE high_value_customers
USING DELTA
AS
SELECT customer_id, first_name, last_name, state, charges
FROM customers_delta_demo
WHERE charges > 500
""")
print('✅ Created/updated high_value_customers')

hvc = spark.table('high_value_customers')
hvc.show(truncate=False)
assert_true(set(hvc.columns) == {'customer_id','first_name','last_name','state','charges'}, 'high_value_customers columns correct')


✅ Created/updated high_value_customers
+-----------+----------+---------+-----+-------+
|customer_id|first_name|last_name|state|charges|
+-----------+----------+---------+-----+-------+
|2          |Jane      |Smith    |NY   |999.0  |
|99         |Zane      |Otto     |TX   |1234.5 |
+-----------+----------+---------+-----+-------+

✅ high_value_customers columns correct


In [18]:
# enable column mapping (and upgrade protocol if needed)
spark.sql("""
ALTER TABLE customers_delta_demo SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.minReaderVersion' = '2',
  'delta.minWriterVersion' = '5'
)
""")

spark.sql("ALTER TABLE customers_delta_demo ADD COLUMNS (phone_number STRING)")
cols = spark.table('customers_delta_demo').columns
assert_true('phone_number' in cols, 'phone_number added')

spark.sql("ALTER TABLE customers_delta_demo RENAME COLUMN charges TO total_spent")
cols = spark.table('customers_delta_demo').columns
assert_true('total_spent' in cols and 'charges' not in cols, 'charges renamed to total_spent')

spark.sql("ALTER TABLE customers_delta_demo DROP COLUMN phone_number")
cols = spark.table('customers_delta_demo').columns
assert_true('phone_number' not in cols, 'phone_number dropped')

spark.table('customers_delta_demo').show(truncate=False)


✅ phone_number added
✅ charges renamed to total_spent
✅ phone_number dropped
+-----------+----------+---------+------------------------------+-----------+-----+
|customer_id|first_name|last_name|email                         |total_spent|state|
+-----------+----------+---------+------------------------------+-----------+-----+
|2          |Jane      |Smith    |jane.smith+updated@example.com|999.0      |NY   |
|22         |Alice     |Smith    |alice.smith@example.com       |200.5      |NY   |
|24         |Emily     |Davis    |emily.davis@example.com       |220.3      |FL   |
|99         |Zane      |Otto     |zane.otto@example.com         |1234.5     |TX   |
|1          |John      |Doe      |john.doe@example.com          |275.55     |CA   |
+-----------+----------+---------+------------------------------+-----------+-----+
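The rename and drop above are metadata-only operations: in column mapping mode `name`, Delta tracks a physical column name separately from the logical one, so the Parquet files are not rewritten. The toy model below illustrates the idea. `ColumnMappedSchema` and the `col-N` physical names are hypothetical and only approximate how Delta assigns physical names.

```python
class ColumnMappedSchema:
    """Toy model: logical column names map to fixed physical names."""

    def __init__(self, logical_names):
        # Physical names are assigned once and never change afterwards.
        self.columns = [
            (name, f"col-{i}") for i, name in enumerate(logical_names, start=1)
        ]
        self._next_id = len(self.columns) + 1

    def add(self, logical):
        self.columns.append((logical, f"col-{self._next_id}"))
        self._next_id += 1

    def rename(self, old, new):
        # Only the logical name changes; the physical name stays put.
        self.columns = [(new if l == old else l, p) for l, p in self.columns]

    def drop(self, logical):
        # The column disappears from the schema; data files are untouched.
        self.columns = [(l, p) for l, p in self.columns if l != logical]

    def logical_names(self):
        return [l for l, _ in self.columns]

    def physical_name(self, logical):
        return dict(self.columns)[logical]


schema = ColumnMappedSchema(
    ["customer_id", "first_name", "last_name", "email", "charges", "state"]
)
before = schema.physical_name("charges")
schema.add("phone_number")
schema.rename("charges", "total_spent")
schema.drop("phone_number")

print(schema.logical_names())
print(before == schema.physical_name("total_spent"))  # True: same physical column
```

Because dropped columns stay in the existing files, reclaiming their storage requires rewriting the data, which is a separate step from `DROP COLUMN`.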



### Validate schema via DESCRIBE TABLE


In [19]:
spark.sql('DESCRIBE TABLE customers_delta_demo').show(truncate=False)


+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|customer_id            |int      |null   |
|first_name             |string   |null   |
|last_name              |string   |null   |
|email                  |string   |null   |
|total_spent            |float    |null   |
|state                  |string   |null   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|state                  |string   |null   |
+-----------------------+---------+-------+



## 7) MERGE after rename (total_spent)

After renaming, staging data must match the updated schema.
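With `UPDATE SET *` and `INSERT *`, the source is expected to supply every target column by name, so a staging table that still uses `charges` would fail analysis unless schema evolution is enabled. A small pre-flight check along these lines can surface the gap before the MERGE runs. `merge_star_column_gaps` is a hypothetical helper that works on plain column lists, such as those returned by `spark.table(...).columns`.

```python
def merge_star_column_gaps(target_columns, source_columns):
    """Return target columns that the source lacks (case-insensitive match)."""
    source = {c.lower() for c in source_columns}
    return [c for c in target_columns if c.lower() not in source]


target_cols = ["customer_id", "first_name", "last_name", "email", "total_spent", "state"]
old_staging_cols = ["customer_id", "first_name", "last_name", "email", "charges", "state"]
new_staging_cols = ["customer_id", "first_name", "last_name", "email", "total_spent", "state"]

print(merge_star_column_gaps(target_cols, old_staging_cols))  # ['total_spent']
print(merge_star_column_gaps(target_cols, new_staging_cols))  # []
```

An empty list means the star clauses can resolve every target column from the staging table.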


In [20]:
spark.sql('DROP TABLE IF EXISTS staging_updates_v2')
spark.sql('''
CREATE TABLE staging_updates_v2 (
  customer_id INT,
  first_name STRING,
  last_name STRING,
  email STRING,
  total_spent FLOAT,
  state STRING
) USING DELTA
''')

spark.sql('''
INSERT INTO staging_updates_v2 VALUES
  (10, 'Lin', 'Chan', 'lin.chan+v2@example.com', 1111.0, 'CA'),
  (77, 'Nova', 'Kerr', 'nova.kerr@example.com',  2222.0, 'WA')
''')

spark.sql('''
MERGE INTO customers_delta_demo AS target
USING staging_updates_v2 AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
''')
print('✅ MERGE v2 completed')


✅ MERGE v2 completed


### Validate: id=10 and id=77 inserted

Neither staging key exists in the target at this point (`customer_id = 10` was removed by the overwrite in section 2), so both rows take the `WHEN NOT MATCHED` branch.


In [21]:
df = spark.table('customers_delta_demo')
r10 = df.filter(F.col('customer_id') == 10).select('email', 'total_spent').collect()[0]
assert_true('v2' in r10['email'], 'customer_id=10 inserted with v2 email')
assert_true(abs(float(r10['total_spent']) - 1111.0) < 1e-6, 'customer_id=10 total_spent is 1111.0')
assert_eq(df.filter(F.col('customer_id') == 77).count(), 1, 'customer_id=77 inserted')
df.orderBy('customer_id').show(truncate=False)


✅ customer_id=10 inserted with v2 email
✅ customer_id=10 total_spent is 1111.0
✅ customer_id=77 inserted = 1
+-----------+----------+---------+------------------------------+-----------+-----+
|customer_id|first_name|last_name|email                         |total_spent|state|
+-----------+----------+---------+------------------------------+-----------+-----+
|1          |John      |Doe      |john.doe@example.com          |275.55     |CA   |
|2          |Jane      |Smith    |jane.smith+updated@example.com|999.0      |NY   |
|10         |Lin       |Chan     |lin.chan+v2@example.com       |1111.0     |CA   |
|22         |Alice     |Smith    |alice.smith@example.com       |200.5      |NY   |
|24         |Emily     |Davis    |emily.davis@example.com       |220.3      |FL   |
|77         |Nova      |Kerr     |nova.kerr@example.com         |2222.0     |WA   |
|99         |Zane      |Otto     |zane.otto@example.com         |1234.5     |TX   |
+-----------+----------+---------+------------------------------+-----------+-----+

## 8) Path-based Delta table examples (bounded reads)

Write a path-based Delta table, then read it back (latest + versionAsOf).


In [22]:
import shutil

if os.path.exists(CUSTOMERS_PATH):
    shutil.rmtree(CUSTOMERS_PATH)

spark.table('customers_delta_demo').write.format('delta').mode('overwrite').save(CUSTOMERS_PATH)
print('✅ Wrote path-based Delta table to:', CUSTOMERS_PATH)

latest_path_df = spark.read.format('delta').load(CUSTOMERS_PATH)
latest_path_df.show(truncate=False)
assert_true(latest_path_df.count() > 0, 'path-based latest read has rows')


✅ Wrote path-based Delta table to: /tmp/delta_demo/customers_delta
+-----------+----------+---------+------------------------------+-----------+-----+
|customer_id|first_name|last_name|email                         |total_spent|state|
+-----------+----------+---------+------------------------------+-----------+-----+
|2          |Jane      |Smith    |jane.smith+updated@example.com|999.0      |NY   |
|24         |Emily     |Davis    |emily.davis@example.com       |220.3      |FL   |
|22         |Alice     |Smith    |alice.smith@example.com       |200.5      |NY   |
|10         |Lin       |Chan     |lin.chan+v2@example.com       |1111.0     |CA   |
|99         |Zane      |Otto     |zane.otto@example.com         |1234.5     |TX   |
|77         |Nova      |Kerr     |nova.kerr@example.com         |2222.0     |WA   |
|1          |John      |Doe      |john.doe@example.com          |275.55     |CA   |
+-----------+----------+---------+------------------------------+-----------+-----+

✅ path-based latest read has rows

In [23]:
from delta.tables import DeltaTable
dt_path = DeltaTable.forPath(spark, CUSTOMERS_PATH)
hist = dt_path.history()
hist.show(truncate=False)
versions = [int(r['version']) for r in hist.select('version').collect()]
min_v = min(versions)

path_historic = spark.read.format('delta').option('versionAsOf', min_v).load(CUSTOMERS_PATH)
path_historic.show(truncate=False)
assert_true(path_historic.count() > 0, 'path-based historic read has rows')


+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+------------------------------------------------------------+------------+-----------------------------------+
|version|timestamp              |userId|userName|operation|operationParameters                   |job |notebook|clusterId|readVersion|isolationLevel|isBlindAppend|operationMetrics                                            |userMetadata|engineInfo                         |
+-------+-----------------------+------+--------+---------+--------------------------------------+----+--------+---------+-----------+--------------+-------------+------------------------------------------------------------+------------+-----------------------------------+
|0      |2025-12-02 19:56:25.839|null  |null    |WRITE    |{mode -> Overwrite, partitionBy -> []}|null|null    |null     |null       |Serializable  |false        |{numFiles -> 7,

## 9) Continuous (streaming) reads in PySpark

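Runnable streaming template using `trigger(once=True)` so it processes the data currently available and stops. Spark 3.4 deprecates this trigger in favor of `trigger(availableNow=True)`, which also stops after draining the available data but may split it across several micro-batches.
We stream from the *path-based* Delta table.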


In [24]:
import shutil

for p in [STREAM_OUT_PATH, STREAM_CHECKPOINT]:
    if os.path.exists(p):
        shutil.rmtree(p)

stream_df = spark.readStream.format('delta').load(CUSTOMERS_PATH)

q = (
    stream_df
    .groupBy('state')
    .count()
    .writeStream
    .outputMode('complete')
    .format('console')
    .option('checkpointLocation', STREAM_CHECKPOINT)
    .trigger(once=True)
    .start()
)
q.awaitTermination()
print('✅ Streaming micro-batch (once) completed')


✅ Streaming micro-batch (once) completed


### Append data and run streaming again


In [25]:
from pyspark.sql import Row, functions as F

# Load existing schema from the delta path
base = spark.read.format("delta").load(CUSTOMERS_PATH)

new_rows = [
    Row(customer_id=500, first_name="Stream", last_name="One", email="stream.one@example.com", total_spent=10.0, state="CA"),
    Row(customer_id=501, first_name="Stream", last_name="Two", email="stream.two@example.com", total_spent=20.0, state="WA"),
]

new_df = spark.createDataFrame(new_rows)

# Force types to match existing table
new_df = (
    new_df
    .withColumn("customer_id", F.col("customer_id").cast(base.schema["customer_id"].dataType))
    .withColumn("total_spent", F.col("total_spent").cast(base.schema["total_spent"].dataType))
)

# (optional but nice) reorder columns to match
new_df = new_df.select([c.name for c in base.schema.fields])

new_df.write.format("delta").mode("append").save(CUSTOMERS_PATH)
print('✅ Appended new rows to path-based table')

q2 = (
    spark.readStream.format('delta').load(CUSTOMERS_PATH)
    .groupBy('state')
    .count()
    .writeStream
    .outputMode('complete')
    .format('console')
    .option('checkpointLocation', STREAM_CHECKPOINT)
    .trigger(once=True)
    .start()
)
q2.awaitTermination()
print('✅ Second streaming micro-batch completed')


✅ Appended new rows to path-based table
✅ Second streaming micro-batch completed
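Because both queries share the same `checkpointLocation`, the second run resumes where the first stopped: it reads only the appended rows, folds them into the stored aggregation state, and (in `complete` mode) emits the full set of counts again. The sketch below models that behavior in plain Python with a JSON file standing in for the checkpoint. `run_once` is a hypothetical helper, and a real checkpoint stores source offsets and state-store files rather than this format.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def run_once(table_rows, checkpoint_file):
    """Process rows not yet seen, update per-state counts, return all counts."""
    if checkpoint_file.exists():
        state = json.loads(checkpoint_file.read_text())
    else:
        state = {"offset": 0, "counts": {}}
    counts = Counter(state["counts"])
    new_rows = table_rows[state["offset"]:]          # only unprocessed rows
    counts.update(row["state"] for row in new_rows)  # fold into stored state
    checkpoint_file.write_text(
        json.dumps({"offset": len(table_rows), "counts": dict(counts)})
    )
    return dict(counts)  # 'complete' mode: every group is emitted each run


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    # States of the seven rows in the path-based table before the append.
    table = [{"state": s} for s in ["CA", "NY", "CA", "NY", "FL", "WA", "TX"]]
    first = run_once(table, checkpoint)
    # The two appended rows (customer_id 500 and 501).
    table += [{"state": "CA"}, {"state": "WA"}]
    second = run_once(table, checkpoint)

print(sorted(first.items()))
print(sorted(second.items()))
```

Deleting the checkpoint between runs, as the first streaming cell does, resets this state and makes the next run reprocess the table from the beginning.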


### Validate: appended rows exist in batch read


In [26]:
latest = spark.read.format('delta').load(CUSTOMERS_PATH)
assert_eq(latest.filter(F.col('customer_id').isin([500, 501])).count(), 2, 'stream-appended rows present')
latest.orderBy('customer_id').show(truncate=False)


✅ stream-appended rows present = 2
+-----------+----------+---------+------------------------------+-----------+-----+
|customer_id|first_name|last_name|email                         |total_spent|state|
+-----------+----------+---------+------------------------------+-----------+-----+
|1          |John      |Doe      |john.doe@example.com          |275.55     |CA   |
|2          |Jane      |Smith    |jane.smith+updated@example.com|999.0      |NY   |
|10         |Lin       |Chan     |lin.chan+v2@example.com       |1111.0     |CA   |
|22         |Alice     |Smith    |alice.smith@example.com       |200.5      |NY   |
|24         |Emily     |Davis    |emily.davis@example.com       |220.3      |FL   |
|77         |Nova      |Kerr     |nova.kerr@example.com         |2222.0     |WA   |
|99         |Zane      |Otto     |zane.otto@example.com         |1234.5     |TX   |
|500        |Stream    |One      |stream.one@example.com        |10.0       |CA   |
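|501        |Stream    |Two      |stream.two@example.com        |20.0       |WA   |
+-----------+----------+---------+------------------------------+-----------+-----+
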

## 10) Catalog configurations in Spark (reference)

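Delta Lake plugs into Spark's built-in session catalog (`spark_catalog`) through `DeltaCatalog`, as configured in section 0, so the metastore behind it is selected with standard Spark and Hive settings rather than with per-catalog `type` or `warehouse` keys. Host names and bucket paths below are placeholders.

### Session catalog with a file-system warehouse
```text
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.warehouse.dir=s3a://deltalake/warehouse
```

### Hive Metastore catalog
```text
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalogImplementation=hive
spark.hadoop.hive.metastore.uris=thrift://metastore-host:9083
```

### AWS Glue Data Catalog (on Amazon EMR)
```text
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.catalogImplementation=hive
spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```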


## Closing Notes

- This notebook is runnable end-to-end, with validation checks after key operations.
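- The `OPTIMIZE` cell skips on any exception; in the recorded run the skip most likely comes from `ZORDER BY (state)` targeting the partition column rather than from missing `OPTIMIZE` support.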
