# Apache Hudi Deep Dive (Chapter 4)

End-to-end Hudi walkthrough for Spark: install dependencies, start Spark with the Hudi bundle, create a small customer dataset, write/read a Copy-On-Write table, run DDL/DML patterns, and validate results along the way. Each step is separated for clarity and reproducibility.

## 1) Environment setup

In [1]:
# Install Python deps inside the container if needed
!pip install -q pyspark findspark

## 2) Build a SparkSession configured for Hudi

In [2]:
from pyspark.sql import SparkSession

In [3]:
HUDI_VERSION = "0.15.0"
SPARK_MAJOR = "3.5"  # Change to match your Spark (3.5, 3.4, 3.3, ...)

hudi_bundle = f"org.apache.hudi:hudi-spark{SPARK_MAJOR}-bundle_2.12:{HUDI_VERSION}"
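
The bundle coordinate above encodes three things: the Spark line (`3.5`), the Scala version (`2.12`), and the Hudi release. A small helper can make those pieces explicit and reject a malformed Spark version before Spark tries to resolve the package. This is only a sketch: `hudi_bundle_coordinate` is a hypothetical helper (not part of Hudi or PySpark), and the Scala default simply mirrors the cell above.

```python
import re


def hudi_bundle_coordinate(spark_major: str, hudi_version: str, scala: str = "2.12") -> str:
    """Build the Maven coordinate for a Hudi Spark bundle (hypothetical helper)."""
    # Expect "major.minor" such as "3.5"; reject values like "3" or "3.5.0".
    if not re.fullmatch(r"\d+\.\d+", spark_major):
        raise ValueError(f"expected a Spark version like '3.5', got {spark_major!r}")
    return f"org.apache.hudi:hudi-spark{spark_major}-bundle_{scala}:{hudi_version}"


coordinate = hudi_bundle_coordinate("3.5", "0.15.0")
print(coordinate)
```

The returned string can be passed to `spark.jars.packages` in the same way as `hudi_bundle`.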

In [4]:
spark = (
    SparkSession.builder
    .appName("Apache Hudi Chapter 4 Demo")
    .config("spark.jars.packages", hudi_bundle)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.hudi.catalog.HoodieCatalog")
    .getOrCreate()
)

In [5]:
print("Spark version:", spark.version)
spark

Spark version: 3.5.0


## 3) Create a sample customer dataset

In [6]:
from pyspark.sql import Row

sample_rows = [
    Row(customer_id=1, first_name="John", last_name="Doe", email="john.doe@example.com", charges=150.75, state="CA"),
    Row(customer_id=2, first_name="Jane", last_name="Smith", email="jane.smith@example.com", charges=950.00, state="NY"),
    Row(customer_id=3, first_name="Tom", last_name="Lee", email="tom.lee@example.com", charges=1200.00, state="CA"),
]

In [7]:
df = spark.createDataFrame(sample_rows)

In [8]:
print("Row count:", df.count())
df.printSchema()
df.show()

Row count: 3
root
 |-- customer_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- charges: double (nullable = true)
 |-- state: string (nullable = true)

+-----------+----------+---------+--------------------+-------+-----+
|customer_id|first_name|last_name|               email|charges|state|
+-----------+----------+---------+--------------------+-------+-----+
|          1|      John|      Doe|john.doe@example.com| 150.75|   CA|
|          2|      Jane|    Smith|jane.smith@exampl...|  950.0|   NY|
|          3|       Tom|      Lee| tom.lee@example.com| 1200.0|   CA|
+-----------+----------+---------+--------------------+-------+-----+



## 4) Write a Copy-On-Write Hudi table

In [9]:
base_path = "/tmp/hudi_customers_cow"
table_name = "customers_hudi_demo"

In [10]:
(
    df.write
    .format("hudi")
    .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE")
    .option("hoodie.table.name", table_name)
    .option("hoodie.datasource.write.recordkey.field", "customer_id")
    .option("hoodie.datasource.write.partitionpath.field", "state")
    .option("hoodie.datasource.write.precombine.field", "charges")
    .option("hoodie.datasource.write.hive_style_partitioning", "true")
    .mode("overwrite")
    .save(base_path)
)

In [11]:
written_df = spark.read.format("hudi").load(base_path)
print("Written rows:", written_df.count())
written_df.select("customer_id", "state", "charges").orderBy("customer_id").show()

Written rows: 3
+-----------+-----+-------+
|customer_id|state|charges|
+-----------+-----+-------+
|          1|   CA| 150.75|
|          2|   NY|  950.0|
|          3|   CA| 1200.0|
+-----------+-----+-------+
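
The write above names a record key (`customer_id`) and a precombine field (`charges`). As a rough mental model, when two incoming records share a key, the record with the larger precombine value wins. The plain-Python sketch below imitates that rule on dictionaries. It is an illustration under that assumption, not Hudi's implementation, and `dedupe_by_precombine` is a hypothetical name. In practice the precombine field is usually an event-time or version column rather than a measure such as `charges`.

```python
from typing import Dict, Iterable, List


def dedupe_by_precombine(rows: Iterable[dict], key: str, precombine: str) -> List[dict]:
    """Keep one row per key, preferring the largest precombine value (illustrative only)."""
    latest: Dict[object, dict] = {}
    for row in rows:
        current = latest.get(row[key])
        # Keep the incoming row unless the stored row has a strictly larger value.
        if current is None or row[precombine] >= current[precombine]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])


batch = [
    {"customer_id": 1, "charges": 150.75, "state": "CA"},
    {"customer_id": 1, "charges": 175.00, "state": "CA"},  # same key, larger value
    {"customer_id": 2, "charges": 950.00, "state": "NY"},
]

result = dedupe_by_precombine(batch, key="customer_id", precombine="charges")
for row in result:
    print(row)
```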



## 5) Register the table and run a snapshot query

In [12]:
spark.sql("DROP TABLE IF EXISTS customers_hudi_demo")

DataFrame[]

In [13]:
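# Column types match the schema Spark inferred when the table was written
# (customer_id: long, charges: double).
spark.sql(f"""
CREATE TABLE customers_hudi_demo (
  customer_id BIGINT,
  first_name  STRING,
  last_name   STRING,
  email       STRING,
  charges     DOUBLE,
  state       STRING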
)
USING hudi
LOCATION '{base_path}'
PARTITIONED BY (state)
""")

DataFrame[]

In [14]:
spark.sql("SELECT * FROM customers_hudi_demo ORDER BY customer_id").show()

+-------------------+--------------------+------------------+----------------------+--------------------+-----------+----------+---------+--------------------+-------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|customer_id|first_name|last_name|               email|charges|state|
+-------------------+--------------------+------------------+----------------------+--------------------+-----------+----------+---------+--------------------+-------+-----+
|  20251209073341872|20251209073341872...|                 1|              state=CA|e17666c0-25af-41a...|          1|      John|      Doe|john.doe@example.com| 150.75|   CA|
|  20251209073341872|20251209073341872...|                 2|              state=NY|2d65233d-bfbe-45f...|          2|      Jane|    Smith|jane.smith@exampl...|  950.0|   NY|
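|  20251209073341872|20251209073341872...|                 3|              state=CA|e17666c0-25af-41a...|          3|       Tom|      Lee| tom.lee@example.com| 1200.0|   CA|
+-------------------+--------------------+------------------+----------------------+--------------------+-----------+----------+---------+--------------------+-------+-----+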
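
The `_hoodie_commit_time` values in this output are instants on the Hudi timeline, written as 17 digits in the form `yyyyMMddHHmmssSSS`. Converting one to a `datetime` makes it easier to compare commits or to pick a point for a later time-travel query. The helper below is a minimal sketch: `parse_commit_time` is a hypothetical name, and the result is a naive `datetime` because the instant string itself carries no time zone.

```python
from datetime import datetime, timedelta


def parse_commit_time(instant: str) -> datetime:
    """Parse a 17-digit Hudi commit instant into a naive datetime (hypothetical helper)."""
    if len(instant) != 17 or not instant.isdigit():
        raise ValueError(f"expected 17 digits (yyyyMMddHHmmssSSS), got {instant!r}")
    # The first 14 digits carry date and time down to seconds; the last 3 are milliseconds.
    whole_seconds = datetime.strptime(instant[:14], "%Y%m%d%H%M%S")
    return whole_seconds + timedelta(milliseconds=int(instant[14:]))


parsed = parse_commit_time("20251209073341872")
print(parsed.isoformat(timespec="milliseconds"))
```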

## 6) DDL examples (create, CTAS, alter, rename)

In [15]:
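# No-op in this run because the table already exists; shown as the DDL pattern
# for creating a partitioned Hudi table without an explicit LOCATION.
spark.sql("""
CREATE TABLE IF NOT EXISTS customers_hudi_demo (
  customer_id BIGINT,
  first_name STRING,
  last_name  STRING,
  email      STRING,
  charges    DOUBLE,
  state      STRING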
)
USING hudi
PARTITIONED BY (state)
""")

DataFrame[]

In [16]:
spark.sql("DROP TABLE IF EXISTS high_value_customers PURGE")

DataFrame[]

In [17]:
spark.sql("""
CREATE TABLE high_value_customers
USING hudi
PARTITIONED BY (state)
AS
SELECT customer_id, first_name, last_name, state, charges
FROM customers_hudi_demo
WHERE charges > 1000
""")

DataFrame[]

In [18]:
spark.sql("SELECT * FROM high_value_customers ORDER BY customer_id").show()

+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+-------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|customer_id|first_name|last_name|charges|state|
+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+-------+-----+
|  20251209073347349|20251209073347349...|20251209073347349...|              state=CA|4cb0eab4-b446-496...|          3|       Tom|      Lee| 1200.0|   CA|
+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+-------+-----+



In [19]:
spark.sql("""
ALTER TABLE customers_hudi_demo
ADD COLUMN phone_number STRING
""")

DataFrame[]

In [20]:
spark.sql("""
CREATE TABLE customers_hudi_demo_v2
USING hudi
PARTITIONED BY (state)
AS
SELECT customer_id, first_name, last_name, email, charges AS total_spent, phone_number, state
FROM customers_hudi_demo
""")

DataFrame[]

In [21]:
spark.sql("DROP TABLE customers_hudi_demo")

DataFrame[]

In [22]:
spark.sql("ALTER TABLE customers_hudi_demo_v2 RENAME TO customers_hudi_demo")

DataFrame[]

In [23]:
spark.sql("DESCRIBE TABLE customers_hudi_demo").show(truncate=False)

+-----------------------+---------+-------+
|col_name               |data_type|comment|
+-----------------------+---------+-------+
|_hoodie_commit_time    |string   |NULL   |
|_hoodie_commit_seqno   |string   |NULL   |
|_hoodie_record_key     |string   |NULL   |
|_hoodie_partition_path |string   |NULL   |
|_hoodie_file_name      |string   |NULL   |
|customer_id            |bigint   |NULL   |
|first_name             |string   |NULL   |
|last_name              |string   |NULL   |
|email                  |string   |NULL   |
|total_spent            |double   |NULL   |
|phone_number           |string   |NULL   |
|state                  |string   |NULL   |
|# Partition Information|         |       |
|# col_name             |data_type|comment|
|state                  |string   |NULL   |
+-----------------------+---------+-------+



## 7) DML operations (insert, merge, overwrite, delete, update)

In [24]:
spark.sql("SET hoodie.write.set.null.for.missing.columns=true")


DataFrame[key: string, value: string]

In [25]:
spark.sql("""
INSERT INTO customers_hudi_demo (customer_id, first_name, last_name, email, total_spent, phone_number, state)
VALUES (1, 'John', 'Doe', 'john.doe@example.com', 150.75, NULL, 'CA')
""")

DataFrame[]

In [26]:
spark.sql("""
INSERT INTO customers_hudi_demo (customer_id, first_name, last_name, email, total_spent, phone_number, state)
VALUES (2, 'Jane', 'Smith', 'jane.smith@example.com', 250.00, NULL, 'NY')
""")

DataFrame[]

In [27]:
spark.sql("SELECT * FROM customers_hudi_demo ORDER BY customer_id").show()

+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+--------------------+-----------+------------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|customer_id|first_name|last_name|               email|total_spent|phone_number|state|
+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+--------------------+-----------+------------+-----+
|  20251209073407411|20251209073407411...|20251209073407411...|              state=CA|aa7ffb9c-1e96-4bb...|          1|      John|      Doe|john.doe@example.com|     150.75|        NULL|   CA|
|  20251209073426894|20251209073426894...|20251209073426894...|              state=CA|aa7ffb9c-1e96-4bb...|          1|      John|      Doe|john.doe@example.com|     150.75|        NULL|   CA|
|  20251209073407411|20251209073407

In [28]:

spark.sql("""
CREATE OR REPLACE TEMP VIEW updates AS
SELECT 1 AS customer_id, 'John' AS first_name, 'Doe' AS last_name,
       'john.new@example.com' AS email, 200.00 AS total_spent, 'CA' AS state, CAST(NULL AS STRING) AS phone_number
UNION ALL
SELECT 3, 'Alice', 'Brown', 'alice@example.com', 300.00, 'TX', CAST(NULL AS STRING)
""")


DataFrame[]

In [29]:
spark.sql("""
MERGE INTO customers_hudi_demo AS target
USING updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")

DataFrame[]
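
The `MERGE INTO` above updates every target row whose `customer_id` appears in `updates` and inserts the source rows that match nothing. Because the earlier inserts left two rows for customer 1, both are updated. The sketch below replays that logic on plain dictionaries to make the matched and not-matched branches visible. It is a simplified model of the SQL semantics, not how Hudi executes the merge, and `merge_into` is a hypothetical helper.

```python
from typing import Dict, List


def merge_into(target: List[dict], source: List[dict], key: str) -> List[dict]:
    """Model MERGE with UPDATE SET * / INSERT * on lists of dicts (illustrative only)."""
    source_by_key: Dict[object, dict] = {row[key]: row for row in source}
    matched_keys = set()
    merged: List[dict] = []
    for row in target:
        update = source_by_key.get(row[key])
        if update is not None:
            # WHEN MATCHED THEN UPDATE SET *: source values overwrite target values.
            merged.append({**row, **update})
            matched_keys.add(row[key])
        else:
            merged.append(dict(row))
    # WHEN NOT MATCHED THEN INSERT *: append source rows that matched no target row.
    for source_key, row in source_by_key.items():
        if source_key not in matched_keys:
            merged.append(dict(row))
    return merged


target = [
    {"customer_id": 1, "email": "john.doe@example.com", "total_spent": 150.75, "state": "CA"},
    {"customer_id": 1, "email": "john.doe@example.com", "total_spent": 150.75, "state": "CA"},
    {"customer_id": 2, "email": "jane.smith@example.com", "total_spent": 250.0, "state": "NY"},
]
source = [
    {"customer_id": 1, "email": "john.new@example.com", "total_spent": 200.0, "state": "CA"},
    {"customer_id": 3, "email": "alice@example.com", "total_spent": 300.0, "state": "TX"},
]

result = merge_into(target, source, key="customer_id")
for row in result:
    print(row)
```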

In [30]:
spark.sql("SELECT * FROM customers_hudi_demo ORDER BY customer_id").show()

+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+--------------------+-----------+------------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|customer_id|first_name|last_name|               email|total_spent|phone_number|state|
+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+--------------------+-----------+------------+-----+
|  20251209073455497|20251209073455497...|20251209073407411...|              state=CA|aa7ffb9c-1e96-4bb...|          1|      John|      Doe|john.new@example.com|      200.0|        NULL|   CA|
|  20251209073455497|20251209073455497...|20251209073426894...|              state=CA|aa7ffb9c-1e96-4bb...|          1|      John|      Doe|john.new@example.com|      200.0|        NULL|   CA|
|  20251209073407411|20251209073407

In [31]:

spark.sql("""
CREATE OR REPLACE TEMP VIEW staging_updates AS
SELECT
  1 AS customer_id,
  'John' AS first_name,
  'Doe' AS last_name,
  'john.ca@example.com' AS email,
  220.0 AS total_spent,
  'CA' AS state,
  CAST(NULL AS STRING) AS phone_number
""")


DataFrame[]

In [32]:
spark.sql("""
INSERT OVERWRITE customers_hudi_demo
PARTITION (state = 'CA')
SELECT customer_id, first_name, last_name, email, total_spent, phone_number
FROM staging_updates
WHERE state = 'CA'
""")

DataFrame[]

In [33]:
spark.sql("""
INSERT OVERWRITE customers_hudi_demo
SELECT customer_id, first_name, last_name, email, total_spent, phone_number, state
FROM staging_updates
GROUP BY customer_id, first_name, last_name, email, total_spent, phone_number, state
""")

DataFrame[]

In [34]:
spark.sql("SELECT * FROM customers_hudi_demo ORDER BY customer_id").show()

+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+-------------------+-----------+------------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|  _hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name|customer_id|first_name|last_name|              email|total_spent|phone_number|state|
+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+-------------------+-----------+------------+-----+
|  20251209073536551|20251209073536551...|20251209073536551...|              state=CA|85169c22-6c8a-496...|          1|      John|      Doe|john.ca@example.com|      220.0|        NULL|   CA|
+-------------------+--------------------+--------------------+----------------------+--------------------+-----------+----------+---------+-------------------+-----------+------------+-----+



In [38]:
spark.sql("""
DELETE FROM customers_hudi_demo
WHERE customer_id = 1
""")

DataFrame[]

In [39]:
spark.sql("""
DELETE FROM customers_hudi_demo
WHERE state = 'CA'
""")

DataFrame[]

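> Plain `UPDATE` statements can fail on Hudi tables in some setups (for example, with errors that reference the `_hoodie_*` meta columns), so the next cell expresses the update as a `MERGE INTO` instead. Because the two `DELETE` statements above have already emptied the table, the merge matches no rows and the final query returns an empty result.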


In [40]:

spark.sql("""
MERGE INTO customers_hudi_demo AS target
USING (
  SELECT customer_id, total_spent * 1.1 AS new_total
  FROM customers_hudi_demo
  WHERE state = 'NY'
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.total_spent = source.new_total
""")


DataFrame[]

In [41]:
spark.sql("""
SELECT * FROM customers_hudi_demo ORDER BY customer_id
""").show()

+-------------------+--------------------+------------------+----------------------+-----------------+-----------+----------+---------+-----+-----------+------------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|customer_id|first_name|last_name|email|total_spent|phone_number|state|
+-------------------+--------------------+------------------+----------------------+-----------------+-----------+----------+---------+-----+-----------+------------+-----+
+-------------------+--------------------+------------------+----------------------+-----------------+-----------+----------+---------+-----+-----------+------------+-----+



## 8) Read patterns: snapshot, record index, time travel, CDC templates

In [42]:
spark.sql("SET hoodie.enable.data.skipping=true")
spark.sql("SET hoodie.metadata.column.stats.enable=true")
spark.sql("SET hoodie.metadata.enable=true")

DataFrame[key: string, value: string]

In [43]:
spark.sql("""
SELECT *
FROM customers_hudi_demo
WHERE total_spent > 1.0 AND total_spent < 1000.0
""").show()

+-------------------+--------------------+------------------+----------------------+-----------------+-----------+----------+---------+-----+-----------+------------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|customer_id|first_name|last_name|email|total_spent|phone_number|state|
+-------------------+--------------------+------------------+----------------------+-----------------+-----------+----------+---------+-----+-----------+------------+-----+
+-------------------+--------------------+------------------+----------------------+-----------------+-----------+----------+---------+-----+-----------+------------+-----+



In [44]:
spark.sql("SET hoodie.metadata.record.index.enable=true")
spark.sql("""
SELECT *
FROM customers_hudi_demo
WHERE customer_id = 2
""").show()

+-------------------+--------------------+------------------+----------------------+-----------------+-----------+----------+---------+-----+-----------+------------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|customer_id|first_name|last_name|email|total_spent|phone_number|state|
+-------------------+--------------------+------------------+----------------------+-----------------+-----------+----------+---------+-----+-----------+------------+-----+
+-------------------+--------------------+------------------+----------------------+-----------------+-----------+----------+---------+-----+-----------+------------+-----+



In [46]:
time_travel_sql = """
SELECT *
FROM customers_hudi_demo
TIMESTAMP AS OF '2025-01-01 00:00:00.000'
WHERE total_spent > 100.0
"""
print("Time-travel template:", time_travel_sql)
# spark.sql(time_travel_sql).show()

Time-travel template: 
SELECT *
FROM customers_hudi_demo
TIMESTAMP AS OF '2025-01-01 00:00:00.000'
WHERE total_spent > 100.0
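
The template uses a placeholder timestamp. A more useful starting point is a real commit instant taken from the `_hoodie_commit_time` column in the earlier outputs. The sketch below reformats such an instant into the `yyyy-MM-dd HH:mm:ss.SSS` form used by the template and renders the query text. `time_travel_query` is a hypothetical helper that only builds a string; whether the query returns rows depends on the table's actual commit history.

```python
def time_travel_query(table: str, instant: str, predicate: str = "") -> str:
    """Render a TIMESTAMP AS OF query from a 17-digit commit instant (hypothetical helper)."""
    if len(instant) != 17 or not instant.isdigit():
        raise ValueError(f"expected 17 digits (yyyyMMddHHmmssSSS), got {instant!r}")
    # Reformat yyyyMMddHHmmssSSS as 'yyyy-MM-dd HH:mm:ss.SSS'.
    timestamp = (
        f"{instant[0:4]}-{instant[4:6]}-{instant[6:8]} "
        f"{instant[8:10]}:{instant[10:12]}:{instant[12:14]}.{instant[14:17]}"
    )
    sql = f"SELECT *\nFROM {table}\nTIMESTAMP AS OF '{timestamp}'"
    if predicate:
        sql += f"\nWHERE {predicate}"
    return sql


# Instant copied from the _hoodie_commit_time column shown after the first INSERT.
query = time_travel_query("customers_hudi_demo", "20251209073407411", "total_spent > 100.0")
print(query)
```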



In [48]:
cdc_template = """
SELECT *
FROM hudi_table_changes(
  'customers_hudi_demo',
  'cdc',
  'earliest',
  NULL
)
"""

latest_state_template = """
SELECT *
FROM hudi_table_changes(
  'customers_hudi_demo',
  'latest_state',
  'earliest',
  NULL
)
"""
print("CDC template:", cdc_template)
print("Incremental latest_state template:", latest_state_template)

CDC template: 
SELECT *
FROM hudi_table_changes(
  'customers_hudi_demo',
  'cdc',
  'earliest',
  NULL
)

Incremental latest_state template: 
SELECT *
FROM hudi_table_changes(
  'customers_hudi_demo',
  'latest_state',
  'earliest',
  NULL
)



## 9) Common Hudi config cheat sheet

In [49]:
hudi_config_cheatsheet = {
    "hoodie.write.set.null.for.missing.columns": "true",
    "hoodie.schema.on.read.enable": "true",
    "hoodie.table.cdc.supplemental.logging.mode": "op_key,op_old,op_new",
    "hoodie.compact.inline.max.delta.commits": "10",
    "hoodie.datasource.compaction.async.enable": "true",
    "hoodie.compact.inline": "true",
    "hoodie.clean.automatic": "true",
    "hoodie.cleaner.commits.retained": "10",
    "hoodie.clean.async": "true",
    "hoodie.parquet.small.file.limit": "104857600",
    "hoodie.parquet.max.file.size": "125829120",
    "hoodie.copyonwrite.record.size.estimate": "1024",
    "hoodie.merge.small.file.group.candidates.limit": "5",
    "hoodie.logfile.max.size": "1073741824",
    "hoodie.clustering.plan.strategy.small.file.limit": "134217728",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "134217728",
    "hoodie.keep.max.commits": "20",
    "hoodie.cleaner.fileversions.retained": "20",
}

hudi_config_cheatsheet

{'hoodie.write.set.null.for.missing.columns': 'true',
 'hoodie.schema.on.read.enable': 'true',
 'hoodie.table.cdc.supplemental.logging.mode': 'op_key,op_old,op_new',
 'hoodie.compact.inline.max.delta.commits': '10',
 'hoodie.datasource.compaction.async.enable': 'true',
 'hoodie.compact.inline': 'true',
 'hoodie.clean.automatic': 'true',
 'hoodie.cleaner.commits.retained': '10',
 'hoodie.clean.async': 'true',
 'hoodie.parquet.small.file.limit': '104857600',
 'hoodie.parquet.max.file.size': '125829120',
 'hoodie.copyonwrite.record.size.estimate': '1024',
 'hoodie.merge.small.file.group.candidates.limit': '5',
 'hoodie.logfile.max.size': '1073741824',
 'hoodie.clustering.plan.strategy.small.file.limit': '134217728',
 'hoodie.clustering.plan.strategy.target.file.max.bytes': '134217728',
 'hoodie.keep.max.commits': '20',
 'hoodie.cleaner.fileversions.retained': '20'}
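
Several of the settings above are file-size thresholds expressed in bytes, which are hard to read at a glance. The sketch below converts them to binary units as a quick sanity check. `human_bytes` is a hypothetical helper, and the dictionary repeats only the size-related entries from the cheat sheet.

```python
def human_bytes(value: str) -> str:
    """Render a byte count as a binary-unit string (hypothetical helper)."""
    size = int(value)
    for unit, factor in (("GiB", 1024 ** 3), ("MiB", 1024 ** 2), ("KiB", 1024)):
        # Use the largest unit that divides the value evenly.
        if size >= factor and size % factor == 0:
            return f"{size // factor} {unit}"
    return f"{size} B"


# Size-related entries copied from the cheat sheet above.
size_settings = {
    "hoodie.parquet.small.file.limit": "104857600",
    "hoodie.parquet.max.file.size": "125829120",
    "hoodie.logfile.max.size": "1073741824",
    "hoodie.clustering.plan.strategy.small.file.limit": "134217728",
    "hoodie.clustering.plan.strategy.target.file.max.bytes": "134217728",
}

readable = {key: human_bytes(value) for key, value in size_settings.items()}
for key, value in readable.items():
    print(f"{key}: {value}")
```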

## 10) Shut down Spark when finished

In [50]:
spark.stop()

# Section 4: Flink + Hudi code snippets (for shell / Flink SQL)

The commands below run in a shell or Flink SQL CLI, not in Python.

## 4.1 Environment setup
```bash
export FLINK_VERSION=1.17
export HUDI_VERSION=0.15.0
# Placeholder paths: adjust to your Flink and Hadoop installations
export FLINK_HOME=/path/to/flink
export HADOOP_HOME=/path/to/hadoop
export HADOOP_CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath)"
$FLINK_HOME/bin/start-cluster.sh
wget "https://repo1.maven.org/maven2/org/apache/hudi/hudi-flink${FLINK_VERSION}-bundle/${HUDI_VERSION}/hudi-flink${FLINK_VERSION}-bundle-${HUDI_VERSION}.jar" -P "$FLINK_HOME/lib/"
$FLINK_HOME/bin/sql-client.sh embedded -j "$FLINK_HOME/lib/hudi-flink${FLINK_VERSION}-bundle-${HUDI_VERSION}.jar" shell
```

## 4.2 Flink SQL Examples
```sql
CREATE CATALOG hudi_catalog
WITH (
  'type' = 'hudi',
  'catalog.path' = 'file:///tmp/hudi_catalog',
  'hive.conf.dir' = '/path/to/hive/conf',
  'mode' = 'hms'
);
USE CATALOG hudi_catalog;
CREATE DATABASE db;
USE db;
CREATE TABLE product_daily_price (
  id   BIGINT PRIMARY KEY NOT ENFORCED,
  name STRING,
  price DOUBLE,
  ts   BIGINT,
  dt   STRING
)
PARTITIONED BY (dt)
WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/hudi_table',
  'table.type' = 'MERGE_ON_READ',
  'precombine.field' = 'ts',
  'hoodie.cleaner.fileversions.retained' = '20',
  'hoodie.keep.max.commits' = '20',
  'hoodie.datasource.write.hive_style_partitioning' = 'true'
);
INSERT INTO product_daily_price
SELECT 1, 'Lakehouse Book', 50, 1732256367, '2024-11-21';
INSERT INTO product_daily_price /*+ OPTIONS('write.operation' = 'upsert') */
SELECT 1, 'Lakehouse Book', 60, 1732256367, '2024-11-21';
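-- UPDATE and DELETE on Hudi tables generally need Flink 1.17+ and batch execution, e.g.:
-- SET 'execution.runtime-mode' = 'batch';
UPDATE product_daily_price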
SET price = price * 2, ts = 1732258867
WHERE id = 1;
DELETE FROM product_daily_price
WHERE price < 50;
INSERT INTO product_daily_price /*+ OPTIONS('hoodie.keep.max.commits' = '10') */
SELECT 2, 'Another Book', 40, 1732256367, '2024-11-21';
```