# Lesson 14 - Introduction to Delta Lake

---

## PySpark Technical Notes: Introduction to Delta Lake

**Introduction**

Traditional data lakes, often built on top of distributed file systems like HDFS or cloud storage like AWS S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS) using formats like Apache Parquet or ORC, provide massive scalability and cost-effectiveness for storing vast amounts of diverse data. However, they historically lacked critical features found in traditional data warehouses, leading to challenges around data reliability, consistency, and manageability. Issues like failed production jobs leaving data in corrupt states, inability to enforce data quality standards (schema), and difficulty performing updates or deletes were common, often leading to unreliable "data swamps."

Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions, scalable metadata handling, schema enforcement, and time travel capabilities directly to data lakes. It runs on top of your existing storage (S3, ADLS, GCS, HDFS) and is fully compatible with Apache Spark APIs. By using Delta Lake with PySpark, organizations can build reliable, high-performance data pipelines and enable data warehousing capabilities directly on their data lake, creating what is often referred to as a "Lakehouse" architecture.

These notes explore the core concepts of Delta Lake and how to leverage its features using PySpark.

---

### 1. Why Delta Lake? Addressing Data Lake Challenges

**Theory**

Files in standard data lake storage formats like Parquet are immutable: once written, a file cannot be updated in place, so changing a single row means rewriting the file that contains it. This poses several challenges that Delta Lake aims to solve:

1.  **Lack of ACID Transactions:** If a Spark job writing multiple Parquet files fails mid-way, the data lake is left in an inconsistent state with partial writes. There is no atomicity, i.e., no guarantee that the whole operation either succeeds or fails cleanly. Concurrent reads during writes can see inconsistent data, and concurrent writes can corrupt data.
2.  **Schema Enforcement Issues:** Data lakes traditionally follow a "schema-on-read" approach. While flexible, this can lead to data quality issues if data with incorrect schemas (wrong data types, missing/extra columns) is ingested. Detecting and fixing these issues downstream is complex and costly.
3.  **Difficulty with Updates and Deletes:** Performing record-level updates or deletes on Parquet data is inefficient. It typically requires rewriting entire partitions or datasets, which is slow and computationally expensive. This makes use cases like Change Data Capture (CDC) or GDPR/CCPA compliance (right to be forgotten) difficult to implement.
4.  **Scalability of Metadata:** Listing large numbers of files (common in partitioned datasets) in cloud storage can be slow and become a bottleneck, especially for tables with millions of small files.
5.  **Lack of Versioning:** There's no built-in way to easily query the state of the data as it was at a specific point in time, making auditing, rollbacks, or reproducing experiments difficult.

Delta Lake addresses these by introducing a transaction log (`_delta_log`) alongside the data files (which are typically stored in Parquet format). This log is the single source of truth, recording every transaction that modifies the data or table metadata.

---

### 2. ACID Transactions

**Theory**

ACID properties guarantee data reliability during transactions (read/write operations):

*   **Atomicity:** Ensures that all changes within a transaction are performed successfully, or none are. If a job writing to a Delta table fails, the transaction is aborted, and the table remains in its previous consistent state.
*   **Consistency:** Guarantees that data is always in a valid state. Schema enforcement (discussed next) plays a key role here. Transactions bring the data from one valid state to another.
*   **Isolation:** Ensures that concurrent transactions do not interfere with each other. Delta Lake uses optimistic concurrency control: a transaction records the table version it read, performs its work, and, before committing, checks whether other transactions have committed in the meantime. If those commits do not conflict with what it read or wrote (for example, two independent appends), it commits as the next version; if they do conflict, the commit fails with a concurrency exception and the operation typically needs to be retried. Readers get snapshot isolation: each read sees a consistent snapshot of the table at one specific version, even while writes are in progress.
*   **Durability:** Ensures that once a transaction is committed, its changes are permanent and survive system failures. This is achieved by writing the transaction log entries reliably to the underlying distributed storage.
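
The optimistic concurrency idea in the Isolation bullet can be sketched with plain files. This is a toy model, not Delta Lake's implementation: the helper names (`latest_version`, `try_commit`, `commit_with_retry`) are hypothetical, a local temporary directory stands in for the log directory, and the sketch assumes every write is a blind append that never logically conflicts with another.

```python
import json
import os
import tempfile


def latest_version(log_dir):
    """Return the highest committed version in log_dir, or -1 if it is empty."""
    versions = [int(name.split(".")[0]) for name in os.listdir(log_dir)
                if name.endswith(".json")]
    return max(versions, default=-1)


def try_commit(log_dir, version, actions):
    """Create <version>.json only if it does not exist yet (put-if-absent)."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" raises FileExistsError if another writer got there first.
        with open(path, "x") as f:
            f.write("\n".join(json.dumps(a) for a in actions))
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, read_version, actions, max_attempts=5):
    """Try to commit right after read_version; on a clash, retry at the next version."""
    version = read_version + 1
    for _ in range(max_attempts):
        if try_commit(log_dir, version, actions):
            return version
        # A real implementation would first check whether the winning commit
        # logically conflicts with this one; this toy assumes it never does.
        version = latest_version(log_dir) + 1
    raise RuntimeError("too many concurrent commits")


log_dir = tempfile.mkdtemp()
# Both writers read the (empty) table at version -1 before committing.
v_a = commit_with_retry(log_dir, -1, [{"add": {"path": "a.parquet"}}])
v_b = commit_with_retry(log_dir, -1, [{"add": {"path": "b.parquet"}}])
print(v_a, v_b)
```

Writer A claims version 0; writer B finds that version taken and lands on version 1 instead of overwriting A's commit.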

**How Delta Achieves ACID:**

The core is the `_delta_log` directory within the Delta table's root directory.

1.  When a transaction occurs (e.g., writing a DataFrame), Spark stages the new data files (Parquet).
2.  It records the changes (files added, files removed) in a JSON file within the `_delta_log` directory (e.g., `00000000000000000001.json`).
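3.  Committing the transaction means atomically creating this JSON file. Delta relies on the storage layer (or on a coordination mechanism layered over it) to guarantee that the file becomes visible all at once and that only one writer can create the file for a given version, so the whole transaction either appears in the log or does not.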
4.  Readers consult the log first to determine the current valid version of the table and which data files constitute that version. They ignore any staged files from incomplete transactions.
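
As a rough illustration of step 4, the sketch below replays a hand-written log to work out which data files are live. It is a simplification under stated assumptions: the commit files contain only `add` and `remove` actions with a `path` field (real commits carry more fields and other action types, and real tables also use checkpoints), and `write_commit` and `live_files` are hypothetical helper names.

```python
import json
import os
import tempfile


def write_commit(log_dir, version, actions):
    """Write one commit file: newline-delimited JSON, one action per line."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    with open(path, "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions))


def live_files(log_dir, as_of_version=None):
    """Replay add/remove actions in version order and return the live data files."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        version = int(name.split(".")[0])
        if as_of_version is not None and version > as_of_version:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_delta_log")
os.makedirs(log_dir)

# Version 0: initial write. Version 1: append. Version 2: overwrite.
write_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
write_commit(log_dir, 1, [{"add": {"path": "part-1.parquet"}}])
write_commit(log_dir, 2, [{"remove": {"path": "part-0.parquet"}},
                          {"remove": {"path": "part-1.parquet"}},
                          {"add": {"path": "part-2.parquet"}}])

# A crashed job staged this file but never wrote a commit, so readers never see it.
open(os.path.join(table_dir, "part-orphan.parquet"), "w").close()

latest = sorted(live_files(log_dir))
at_v1 = sorted(live_files(log_dir, as_of_version=1))
print(latest)  # ['part-2.parquet']
print(at_v1)   # ['part-0.parquet', 'part-1.parquet']
```

The orphaned file sits in the table directory, but because no commit references it, it never shows up in any snapshot.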

**Code Example (Illustrating Atomic Appends/Overwrites)**

```python
from pyspark.sql import SparkSession
from delta import * # Import Delta Lake helper functions

# --- Setup Spark Session with Delta Lake support ---
# Note: This configuration is typically needed when running PySpark locally or on clusters
# where Delta Lake is not pre-configured (like standard Spark distributions).
# Databricks Runtime includes Delta Lake.
builder = SparkSession.builder.appName("DeltaACIDDemo") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Define data and paths
data1 = [(1, "Apple"), (2, "Banana")]
columns = ["id", "fruit"]
delta_path = "/tmp/delta_acid_table" # Use a suitable path (local or cloud storage)

# --- Initial Write (Transaction 1) ---
print("Performing initial write...")
df1 = spark.createDataFrame(data1, columns)
df1.write.format("delta").mode("overwrite").save(delta_path)
print(f"Data written to Delta table at: {delta_path}")

# Read the initial state
print("\nReading initial state:")
df_read1 = spark.read.format("delta").load(delta_path)
df_read1.show()

# --- Append Data (Transaction 2 - Atomic) ---
print("\nPerforming atomic append...")
data2 = [(3, "Cherry")]
df2 = spark.createDataFrame(data2, columns)
df2.write.format("delta").mode("append").save(delta_path) # Append is atomic

# Read the state after append
print("\nReading state after append:")
df_read2 = spark.read.format("delta").load(delta_path)
df_read2.show()

# --- Overwrite Data (Transaction 3 - Atomic) ---
print("\nPerforming atomic overwrite...")
data3 = [(4, "Date"), (5, "Elderberry")]
df3 = spark.createDataFrame(data3, columns)
# Overwrite replaces the entire table atomically
df3.write.format("delta").mode("overwrite").save(delta_path)

# Read the final state
print("\nReading state after overwrite:")
df_read3 = spark.read.format("delta").load(delta_path)
df_read3.show()

# Simulate a FAILED write (conceptual - hard to force failure easily in simple script)
# Imagine a large job writing data4 that crashes halfway.
# Because Delta uses atomic commits to the log, the table would remain in the
# state after Transaction 3 (data3), not partially written with data4.
# The staged Parquet files from the failed job would exist but wouldn't be referenced
# by the transaction log, so readers would ignore them.
print("\n(Conceptual) If a subsequent write failed, the table remains consistent.")

```

**Line-by-Line Explanation:**

1.  `from delta import *`: Imports necessary functions from the Delta Lake library, like `configure_spark_with_delta_pip`.
2.  `builder = SparkSession.builder...`: Standard SparkSession builder setup.
3.  `.config("spark.sql.extensions", ...)` and `.config("spark.sql.catalog.spark_catalog", ...)`: These configurations integrate Delta Lake's SQL parser and catalog capabilities into Spark.
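4.  `spark = configure_spark_with_delta_pip(builder).getOrCreate()`: A helper function from the `delta-spark` package that adds the matching Delta Lake JARs to the session (via `spark.jars.packages`) when Delta was installed with `pip`. The extension and catalog settings above still have to be set explicitly.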
5.  `df1.write.format("delta").mode("overwrite").save(delta_path)`: Writes the DataFrame `df1` to the specified path in Delta format. `mode("overwrite")` ensures any existing data at the path is replaced. This is the first transaction.
6.  `spark.read.format("delta").load(delta_path)`: Reads the data back from the Delta table. It automatically finds the latest valid version by consulting the `_delta_log`.
7.  `df2.write.format("delta").mode("append").save(delta_path)`: Appends data from `df2` to the existing Delta table. This is Transaction 2, also atomic.
8.  `df3.write.format("delta").mode("overwrite").save(delta_path)`: Overwrites the *entire* table contents with `df3`. This is Transaction 3, atomic. Even though it replaces all data, it's done as a single transactional unit.

**Practical Use Cases:**

*   **Reliable ETL Pipelines:** Ensures that data pipelines either complete fully or leave the target table untouched, preventing data corruption from partial writes.
*   **Streaming Data Ingestion:** Delta Lake is a popular sink for Spark Structured Streaming, providing exactly-once write semantics and allowing concurrent batch reads while streaming continues.
*   **Concurrent Operations:** Allows multiple users or jobs to safely read and write to the same table concurrently (with optimistic concurrency handling potential conflicts).

---

### 3. Schema Enforcement

**Theory**

Schema enforcement prevents writing data to a Delta table that does not conform to the table's predefined schema. By default, Delta Lake enforces schema on write:

*   If the DataFrame being written has columns that are *not* present in the target Delta table's schema, the write operation will fail.
*   If the DataFrame being written has different data types for existing columns compared to the target table, the write will fail.
*   If the DataFrame is *missing* columns that exist in the target table, those columns are written as `null`; the write fails only if a missing column is declared non-nullable.

This strictness ensures data quality and consistency, preventing accidental corruption of the table structure.
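
The three rules above can be written down as a small check. This is a toy model in plain Python, not Delta Lake's validation code: schemas are represented as dictionaries, type names are plain strings, and the strict type-equality test is a simplification. The names `check_write_schema` and `outcome` are hypothetical.

```python
def check_write_schema(table_schema, df_schema):
    """Raise ValueError if df_schema may not be written to table_schema.

    table_schema maps column -> (type_name, nullable).
    df_schema maps column -> type_name.
    """
    extra = sorted(set(df_schema) - set(table_schema))
    if extra:
        raise ValueError(f"columns not in table schema: {extra}")
    for col, (col_type, nullable) in table_schema.items():
        if col not in df_schema:
            if not nullable:
                raise ValueError(f"missing non-nullable column: {col}")
            continue  # a missing nullable column would be filled with null
        if df_schema[col] != col_type:
            raise ValueError(
                f"type mismatch for {col}: table has {col_type}, "
                f"DataFrame has {df_schema[col]}")


table = {"id": ("long", False), "fruit": ("string", True)}


def outcome(df_schema):
    """Return 'ok' or a short rejection message for a candidate write."""
    try:
        check_write_schema(table, df_schema)
        return "ok"
    except ValueError as e:
        return f"rejected: {e}"


print(outcome({"id": "long", "fruit": "string"}))                     # matching schema
print(outcome({"id": "long"}))                                        # nullable column missing
print(outcome({"id": "long", "fruit": "string", "color": "string"}))  # extra column
print(outcome({"id": "string", "fruit": "string"}))                   # wrong type
print(outcome({"fruit": "string"}))                                   # non-nullable column missing
```

The first two writes pass and the last three are rejected, one for each rule.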

**Schema Evolution:** While enforcement is the default, Delta Lake also supports explicit schema evolution. If you intend to change the table schema (e.g., add new columns), you can use the `mergeSchema` option during the write:

*   `df.write.format("delta").option("mergeSchema", "true").mode("append").save(path)`: Allows adding new columns present in the DataFrame but not in the target table. Existing columns must still match data types.
*   `df.write.format("delta").option("overwriteSchema", "true").mode("overwrite").save(path)`: Allows completely replacing the table schema and data with the schema and data of the DataFrame being written (use with caution!).
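
To make the `mergeSchema` behaviour concrete, here is a minimal sketch of what the evolved schema looks like after an append. It is an illustration only: `merge_schema` is a hypothetical helper, schemas are plain dictionaries of column name to type name, and any type widening the real implementation may permit is ignored.

```python
def merge_schema(table_schema, df_schema):
    """Return the evolved schema for an append with schema merging enabled.

    Both arguments map column name -> type name. New columns are appended;
    a changed type on an existing column is rejected.
    """
    merged = dict(table_schema)  # keep existing columns and their order
    for col, col_type in df_schema.items():
        if col not in merged:
            merged[col] = col_type  # new column goes at the end
        elif merged[col] != col_type:
            raise ValueError(
                f"cannot merge {col}: table has {merged[col]}, "
                f"DataFrame has {col_type}")
    return merged


table = {"id": "long", "fruit": "string"}
evolved = merge_schema(table, {"id": "long", "fruit": "string", "color": "string"})
print(evolved)  # {'id': 'long', 'fruit': 'string', 'color': 'string'}

try:
    merge_schema(table, {"id": "string", "fruit": "string"})
    type_change_allowed = True
except ValueError:
    type_change_allowed = False
print(type_change_allowed)  # False
```

Rows written before the evolution have no value for `color`, which is why the new column reads back as `null` for them.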

**Code Example**

```python
# Continuing from the previous SparkSession and delta_path setup

# Current Schema (after overwrite with df3):
print("Current table schema:")
current_df = spark.read.format("delta").load(delta_path)
current_df.printSchema()
current_df.show()

# --- Attempt 1: Write with Incompatible Schema (Extra Column) ---
print("\nAttempting to write data with an extra column (will fail by default)...")
data_extra_col = [(6, "Fig", "Green")]
df_extra_col = spark.createDataFrame(data_extra_col, ["id", "fruit", "color"]) # Added 'color'

try:
    df_extra_col.write.format("delta").mode("append").save(delta_path)
except Exception as e:
    print(f"Write failed as expected: {e}")

# --- Attempt 2: Write with Incompatible Schema (Wrong Data Type) ---
print("\nAttempting to write data with wrong data type for 'id' (will fail by default)...")
data_wrong_type = [("7", "Grape")] # 'id' as string instead of integer
df_wrong_type = spark.createDataFrame(data_wrong_type, ["id", "fruit"])

try:
    df_wrong_type.write.format("delta").mode("append").save(delta_path)
except Exception as e:
    print(f"Write failed as expected: {e}")

# --- Attempt 3: Write with Schema Evolution (Adding a new column) ---
print("\nAppending data with a new column using schema evolution ('mergeSchema')...")
df_extra_col.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save(delta_path)
print("Write successful with schema evolution.")

# Read the table with the evolved schema
print("\nReading table after schema evolution:")
df_evolved = spark.read.format("delta").load(delta_path)
df_evolved.printSchema() # Note the new 'color' column
df_evolved.show()
# Explanation: Rows written before the schema change will have null for the new 'color' column.

# --- Attempt 4: Overwriting Schema ---
print("\nOverwriting table and schema using 'overwriteSchema'...")
data_new_schema = [("A", 100.5), ("B", 200.0)]
df_new_schema = spark.createDataFrame(data_new_schema, ["item_code", "value"])

df_new_schema.write.format("delta") \
    .option("overwriteSchema", "true") \
    .mode("overwrite") \
    .save(delta_path)
print("Table and schema overwritten.")

# Read the completely new table structure
print("\nReading table after schema overwrite:")
df_overwritten = spark.read.format("delta").load(delta_path)
df_overwritten.printSchema()
df_overwritten.show()
```

**Line-by-Line Explanation:**

1.  `current_df.printSchema()`: Displays the schema of the Delta table before attempting modifications.
2.  `df_extra_col = ...`: Creates a DataFrame with an additional `color` column not present in the target table.
3.  `try...except`: Catches the exception thrown by Spark/Delta when attempting to write `df_extra_col` without schema evolution, demonstrating default enforcement.
4.  `df_wrong_type = ...`: Creates a DataFrame where the `id` column is a string, mismatching the `LongType` that Spark inferred for `id` from the Python integers in the earlier writes.
5.  The second `try...except` block catches the error due to the data type mismatch.
6.  `df_extra_col.write...option("mergeSchema", "true")...save(delta_path)`: Performs the write again, but this time includes the `mergeSchema` option. Delta Lake adds the new `color` column to the table's schema definition in the transaction log.
7.  `df_evolved.printSchema()`: Shows the updated schema including the nullable `color` column.
8.  `df_new_schema = ...`: Creates a DataFrame with a completely different structure.
9.  `df_new_schema.write...option("overwriteSchema", "true")...save(delta_path)`: Uses `overwriteSchema` along with `mode("overwrite")`. This replaces not only the data but also the schema of the target table entirely.
10. `df_overwritten.printSchema()`: Shows the final, completely changed schema.

**Practical Use Cases:**

*   **Data Quality:** Prevents accidental corruption of tables by ensuring incoming data adheres to expected structures and types.
*   **Pipeline Robustness:** Makes ETL pipelines more resilient to unexpected changes in source data schemas.
*   **Controlled Evolution:** Allows deliberate, controlled changes to table schemas over time as requirements evolve, without breaking the table.

---

### 4. Time Travel & Versioning

**Theory**

Every operation (write, update, delete, merge, schema change) that modifies a Delta table creates a new version. Delta Lake retains the transaction log entries and, through them, the history of the table. This allows users to query previous versions of the table data, a feature known as "Time Travel."

You can query older versions using either:

1.  **Version Number:** Use `VERSION AS OF <version_number>` in SQL, or the `versionAsOf` reader option in PySpark. Version numbers are sequential integers starting from 0.
2.  **Timestamp:** Use `TIMESTAMP AS OF <timestamp_string>` in SQL, or the `timestampAsOf` reader option in PySpark. Delta finds the latest version committed at or before the specified timestamp. The timestamp string should be in a format Spark can parse (e.g., `'yyyy-MM-dd HH:mm:ss'`).

**How it Works:**

When you query a specific version or timestamp, Delta consults the transaction log to identify the set of data files (Parquet files) that constituted the table at that specific point in time. It then instructs Spark to read only those specific files.
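
Resolving a timestamp to a version is essentially a search over commit times. The sketch below shows the idea with a hypothetical, hard-coded commit history and a hypothetical helper `version_as_of`. How the toy treats out-of-range timestamps is an assumption: it returns the newest version for any later timestamp, whereas the real reader may reject timestamps beyond the last commit.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit history: (version, commit timestamp), sorted by version.
commits = [
    (0, datetime(2024, 1, 1, 9, 0, 0)),
    (1, datetime(2024, 1, 1, 9, 5, 0)),
    (2, datetime(2024, 1, 2, 14, 30, 0)),
]


def version_as_of(commits, ts):
    """Return the latest version committed at or before ts."""
    timestamps = [t for _, t in commits]
    idx = bisect_right(timestamps, ts)  # number of commits with timestamp <= ts
    if idx == 0:
        raise ValueError(f"{ts} is before the earliest commit")
    return commits[idx - 1][0]


print(version_as_of(commits, datetime(2024, 1, 1, 9, 5, 0)))   # exactly at version 1
print(version_as_of(commits, datetime(2024, 1, 1, 12, 0, 0)))  # between versions 1 and 2
print(version_as_of(commits, datetime(2024, 1, 3, 0, 0, 0)))   # after the last commit
```

Once the version is known, reading it reduces to replaying the log up to that version and loading only the files that were live at that point.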

**Code Example**

```python
# Continuing from the previous SparkSession and delta_path setup
# Let's see the history of our table first

from delta.tables import * # Import DeltaTable for history and DML

delta_table = DeltaTable.forPath(spark, delta_path)

print("\nShowing table history:")
history_df = delta_table.history()
history_df.select("version", "timestamp", "operation", "operationParameters").show(truncate=False)

# Assuming the 'overwriteSchema' write was version 4 (check history output)
target_version = history_df.selectExpr("max(version)").first()[0] - 1 # Get previous version
if target_version < 0: target_version = 0

print(f"\n--- Time Travelling ---")

# --- Query using Version Number ---
print(f"\nReading table AS OF VERSION {target_version}:")
try:
    df_version = spark.read.format("delta").option("versionAsOf", target_version).load(delta_path)
    df_version.show()
    df_version.printSchema()
except Exception as e:
    print(f"Could not read version {target_version}: {e}") # Might fail if version doesn't exist or was VACUUMed

# --- Query using Timestamp ---
# Get timestamp of the target version from history
target_timestamp = history_df.filter(f"version = {target_version}").select("timestamp").first()[0]
# Format timestamp for the option (adjust precision if needed)
timestamp_str = target_timestamp.strftime('%Y-%m-%d %H:%M:%S.%f')[:-3] # Example format

print(f"\nReading table AS OF TIMESTAMP '{timestamp_str}' (corresponds to version {target_version}):")
try:
    df_timestamp = spark.read.format("delta").option("timestampAsOf", timestamp_str).load(delta_path)
    df_timestamp.show()
    df_timestamp.printSchema()
except Exception as e:
    print(f"Could not read timestamp '{timestamp_str}': {e}")

# Query the latest version (standard read)
print("\nReading latest version (default):")
df_latest = spark.read.format("delta").load(delta_path)
df_latest.show()
df_latest.printSchema()
```

**Line-by-Line Explanation:**

1.  `from delta.tables import *`: Imports the `DeltaTable` class, which provides methods like `history()`, `update()`, `delete()`, `merge()`.
2.  `delta_table = DeltaTable.forPath(spark, delta_path)`: Creates a `DeltaTable` object representing the table at the specified path.
3.  `history_df = delta_table.history()`: Retrieves the transaction history of the table as a DataFrame. Each row represents a committed transaction (version).
4.  `history_df.select(...).show()`: Displays key information about each transaction, like the version number, timestamp, operation type (WRITE, MERGE, etc.), and parameters.
5.  `target_version = ...`: Determines a previous version number to query (e.g., the second to last version).
6.  `spark.read.format("delta").option("versionAsOf", target_version).load(delta_path)`: Reads the Delta table, specifically requesting the data as it existed at `target_version`.
7.  `target_timestamp = ...`: Retrieves the timestamp associated with the `target_version` from the history DataFrame.
8.  `timestamp_str = ...`: Formats the timestamp into a string suitable for the `timestampAsOf` option.
9.  `spark.read.format("delta").option("timestampAsOf", timestamp_str).load(delta_path)`: Reads the Delta table as it existed at the specified timestamp.

**Practical Use Cases:**

*   **Auditing & Governance:** Track changes to data over time, identify when specific records were modified or deleted.
*   **Debugging Data Pipelines:** If a pipeline introduces bad data, time travel allows inspecting the table state *before* the problematic job ran.
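*   **Rollbacks:** Revert the table to a previous known-good state if a recent transaction caused issues, either with the `RESTORE` command (`DeltaTable.restoreToVersion()` in Python, where the installed Delta version supports it) or by reading the old version and overwriting the table with it.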
*   **Reproducing ML Experiments:** Query the exact version of the dataset used for training a model to ensure reproducibility.

**Important Note on VACUUM:** Delta Lake does not remove old data files as part of normal writes; files that are no longer part of the current version stay in storage until you run `VACUUM`. The `VACUUM` command physically removes data files that are no longer referenced by the table and are older than a retention period (default is 7 days). Once files are vacuumed, versions that depend on them can no longer be read. Log entries have their own retention setting (`delta.logRetentionDuration`, 30 days by default), which also bounds how far back time travel can reach. Use `VACUUM` carefully to manage storage costs while preserving necessary history.
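
The retention rule can be illustrated with a few lines of date arithmetic. This is a simplified model rather than the actual `VACUUM` logic: it only considers files that were logically removed from the table, the removal times are made-up sample data, and `vacuum_candidates` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def vacuum_candidates(removed_files, now, retention=timedelta(days=7)):
    """Return paths whose logical removal is older than the retention window.

    removed_files maps path -> time the file stopped being part of the table.
    """
    cutoff = now - retention
    return sorted(path for path, removed_at in removed_files.items()
                  if removed_at < cutoff)


now = datetime(2024, 1, 20)
removed = {
    "part-0.parquet": datetime(2024, 1, 5),   # removed 15 days ago
    "part-1.parquet": datetime(2024, 1, 18),  # removed 2 days ago
}

deletable = vacuum_candidates(removed, now)
print(deletable)  # ['part-0.parquet']
```

Here `part-1.parquet` survives because it left the table only two days ago, so versions from the last week remain readable.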

---

### 5. Advanced Considerations & Best Practices

*   **Updates, Deletes, and Merges:** Delta Lake supports standard DML operations (`UPDATE`, `DELETE`, `MERGE INTO`) using the `DeltaTable` API or SQL. These operations also create new table versions and are ACID compliant. The `MERGE INTO` (upsert) operation is particularly powerful for efficiently applying changes from a source dataset to a target Delta table.
    ```python
    # Example MERGE (Upsert)
    # Assume delta_table is the target DeltaTable object and the target table
    # has columns 'id', 'fruit', 'color'.
    # source_df contains new data/updates with the same columns.
    # Requires: from pyspark.sql.functions import col

    # delta_table.alias("target").merge(
    #     source=source_df.alias("source"),
    #     condition="target.id = source.id"  # Join condition
    # ).whenMatchedUpdate(set={          # Action if match found
    #     "fruit": col("source.fruit"),
    #     "color": col("source.color")
    # }).whenNotMatchedInsert(values={    # Action if no match found
    #     "id": col("source.id"),
    #     "fruit": col("source.fruit"),
    #     "color": col("source.color")
    # }).execute()
    ```
*   **Partitioning:** Just like with Parquet, partitioning Delta tables (e.g., by date) using `partitionBy()` during writes significantly improves query performance when filtering on partition columns, as Spark can skip reading irrelevant partitions (partition pruning). Delta Lake stores partition information in the transaction log, making partition discovery faster than traditional file listing.
*   **OPTIMIZE Command:** Over time, many small files can accumulate in a Delta table due to frequent appends, merges, or streaming updates. This "small file problem" hurts read performance. The `OPTIMIZE` command compacts small files into larger ones, improving read throughput.
    ```python
    # delta_table.optimize().executeCompaction() # Basic compaction
    # spark.sql(f"OPTIMIZE '{delta_path}'")      # SQL equivalent
    ```
*   **Z-Ordering (Multi-dimensional Clustering):** An enhancement to `OPTIMIZE`. `OPTIMIZE ZORDER BY (col1, col2)` co-locates related data within the compacted files based on the specified Z-Order columns. If queries frequently filter on these columns (especially non-partition columns), Z-Ordering can significantly improve data skipping and query speed. Z-Ordering helps most on high-cardinality columns that are frequently used in filters; low-cardinality filter columns are usually better served by partitioning.
    ```python
    # delta_table.optimize().executeZOrderBy("filter_col1", "filter_col2")
    # spark.sql(f"OPTIMIZE '{delta_path}' ZORDER BY (filter_col1, filter_col2)")
    ```
*   **Performance Tuning:** Besides partitioning and optimization, standard Spark tuning practices apply (e.g., cluster sizing, shuffle partitions, memory management). Delta Lake's transaction log processing adds a small overhead but enables features that often lead to better overall pipeline reliability and performance compared to managing raw Parquet files manually.
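
To show why Z-Ordering helps data skipping on more than one column, the toy below sorts a 4x4 grid of `(x, y)` rows two ways, packs them into four "files", and counts how many files a filter must open based on per-file min/max values. It illustrates the general Z-order (Morton) curve idea and is not Delta Lake's implementation; `z_value`, `pack`, and `files_to_scan` are hypothetical helpers.

```python
def z_value(x, y, bits=4):
    """Interleave the low `bits` bits of x and y into one Morton code."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x takes the even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y takes the odd bit positions
    return z


def pack(rows, rows_per_file=4):
    """Split an ordered list of rows into fixed-size 'files'."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_scan(files, col, value):
    """Count files whose [min, max] range on column index `col` contains value."""
    return sum(1 for f in files
               if min(r[col] for r in f) <= value <= max(r[col] for r in f))


rows = [(x, y) for x in range(4) for y in range(4)]

linear_files = pack(sorted(rows))                                # sorted by x, then y
z_files = pack(sorted(rows, key=lambda r: z_value(r[0], r[1])))  # Z-ordered

# Filter on x = 3, then on y = 0, under each layout.
print(files_to_scan(linear_files, 0, 3), files_to_scan(linear_files, 1, 0))  # 1 4
print(files_to_scan(z_files, 0, 3), files_to_scan(z_files, 1, 0))            # 2 2
```

A plain sort is ideal for the leading column but useless for the second one, while the Z-ordered layout lets min/max statistics prune half the files for a filter on either column.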

---

**Conclusion**

Delta Lake significantly enhances data lakes by bringing reliability, data quality guarantees, and powerful DML/versioning features previously associated with data warehouses. By leveraging ACID transactions, schema enforcement, and time travel through its transaction log mechanism, Delta Lake allows organizations to build robust, scalable, and trustworthy "Lakehouse" architectures directly on cost-effective cloud storage. Its seamless integration with the PySpark API makes it a powerful tool for data engineers and data scientists working with large-scale datasets in the Spark ecosystem. Understanding and utilizing these features effectively is key to building modern, reliable data platforms.

---