<a href="https://colab.research.google.com/github/CynicDog/delta-lake-lab/blob/main/Delta_Lake_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to Delta Lake Colab!

Delta Lake is an open-source storage layer that brings **ACID transactions, scalable metadata handling, and unified streaming and batch data processing** to your existing data lake (such as Parquet files on S3 or local storage).  

This notebook is designed for **hands-on experimentation** with Delta Lake using **PySpark**. You will be able to:

- Create, read, and write Delta tables  
- Explore Delta table features such as **updates, deletes, and merges**  
- Examine Delta table history and the `_delta_log`  
- Experiment with **time travel** to query older versions of data  

Additionally, this is a space to explore **Databricks features and solutions** in a Colab environment, understand how they work, and implement them yourself.

In [1]:
# Uninstall any existing conflicting packages to avoid version conflicts.
# - pyspark: removes any pre-installed PySpark version
# - delta-spark: removes any previous Delta Lake Python package
# - dataproc-spark-connect: removes Google Colab’s built-in Spark connect package
!pip uninstall -y pyspark delta-spark dataproc-spark-connect

# Install compatible versions:
# - PySpark 3.5.1: works with Delta Lake 3.2.0
# - delta-spark 3.2.0: Delta Lake Python library
!pip install -q pyspark==3.5.1 delta-spark==3.2.0

def get_spark():
    """Creates and returns a SparkSession configured for Delta Lake.

    This function sets up a SparkSession with the necessary Delta Lake
    extensions and catalog, ensuring that Delta features such as
    time travel, updates, and deletes are available.

    Returns:
        pyspark.sql.SparkSession: Configured SparkSession for Delta Lake.
    """
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    # Build the SparkSession with Delta Lake configurations
    builder = (
        SparkSession.builder.appName("DeltaLakeApp")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    )

    # Apply Delta Lake pip configuration and return the SparkSession
    return configure_spark_with_delta_pip(builder).getOrCreate()

spark = get_spark()
spark

Found existing installation: pyspark 3.5.1
Uninstalling pyspark-3.5.1:
  Successfully uninstalled pyspark-3.5.1
Found existing installation: dataproc-spark-connect 0.8.3
Uninstalling dataproc-spark-connect-0.8.3:
  Successfully uninstalled dataproc-spark-connect-0.8.3
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 4.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
  Building wheel for pyspark (setup.py) ... done


# Introduction

## Common Use Cases

Delta Lake is widely used across organizations of all sizes because it provides a reliable and scalable foundation for managing large volumes of data in analytics and AI workflows. One of its most common applications is modernizing existing data lakes: by adding ACID transactions, schema enforcement, and scalable metadata handling on top of open storage formats, Delta Lake helps teams resolve the long-standing issues of unreliable or inconsistent data that often hinder traditional lakes. Many organizations also use Delta Lake as part of a lakehouse architecture to support <u>data warehousing techniques, enabling fast and dependable SQL analytics</u> while still maintaining the <u>flexibility and cost-efficiency of a data lake environment</u>. Because Delta Lake unifies batch and streaming data, it plays a central role in real-time data processing workloads, allowing developers to ingest continuous streams while applying the same transformation logic used for historical batch data.

Beyond analytics, Delta Lake is a key component in machine learning and data science pipelines. It provides teams with a consistent and high-quality source of truth, ensuring that training datasets remain accurate, reproducible, and easy to version. Data engineering teams rely on Delta Lake to build robust pipelines that maintain data quality across ingestion, transformation, and operationalization stages. At the same time, business intelligence users benefit from its SQL accessibility, which makes it simple to query data directly from dashboards and reporting tools. Overall, Delta Lake’s emphasis on reliability, performance, and openness makes it an essential platform for data engineers, data scientists, and analysts working across modern big data ecosystems.

## Key Features

Delta Lake introduces a number of core capabilities that form the backbone of the lakehouse paradigm. At its foundation, Delta Lake provides ACID transactions, ensuring that every data modification is executed safely and consistently, even under concurrent workloads or unexpected failures. These transactional guarantees are made possible by Delta Lake’s scalable metadata handling, which uses an append-only transaction log to record every change to a table. This design allows Delta Lake to manage very large datasets without suffering from the metadata bottlenecks common in raw data lakes.

One of the most powerful capabilities enabled by the transaction log is time travel, which allows users to query earlier versions of a table by version number or timestamp. This feature is particularly valuable for debugging, validating model inputs, recovering from accidental deletions, or meeting audit and regulatory requirements. Delta Lake also unifies batch and streaming processing, allowing Spark Structured Streaming jobs to operate with the same APIs and logic used for batch workloads, while the underlying storage guarantees preserve correctness and consistency in both modes.

To maintain data quality, Delta Lake enforces schemas on write and supports controlled schema evolution, preventing corrupted or malformed data from entering pipelines while still allowing tables to adapt as requirements change. Delta Lake also tracks a complete audit history of all operations, enabling transparency into who made changes and when. Modern workloads rely heavily on DML operations—such as updates, deletes, and merges—and Delta Lake provides efficient support for these across multiple execution engines and languages. Its open-source nature encourages broad adoption and collaboration, while its performance optimizations ensure that most workloads run efficiently without extensive tuning. Taken together, these features create a storage layer that is both powerful and approachable for a wide range of data professionals.

## Anatomy of a Delta Lake Table

A Delta Lake table is composed of several tightly integrated components that together provide reliable storage, strong transactional guarantees, and efficient performance on large datasets. Each part of the table contributes to how Delta Lake manages data, tracks changes, and scales across distributed environments. Understanding these components makes it easier to reason about Delta Lake’s behavior, optimize pipelines, and work more effectively with the lakehouse format.

### Data Files

At the base of every Delta Lake table are the data files themselves, stored in the Parquet format. These files hold the raw records and are distributed across object stores or file systems such as HDFS, Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage, or MinIO. Parquet is chosen because its columnar layout, compression, and encoding techniques make it highly efficient for analytical queries and large-scale processing. Delta Lake does not alter the Parquet format but enhances its reliability by layering transactional control and metadata management on top of it.

### Transaction Log

Above the data files sits the transaction log—often referred to as the **_delta_log**—which is the heart of Delta Lake’s architecture. This log is a chronological sequence of JSON entries, each representing a single transaction against the table. Every change, whether inserting new data files, removing outdated ones, or modifying table metadata, is written as a new log entry. By recording operations rather than mutating files directly, the transaction log guarantees ACID semantics: all changes are atomic, consistent, isolated, and durable. This log is the mechanism that makes time travel, concurrent writes, schema enforcement, and recoverability possible.
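
To make the role of the log more concrete, here is a minimal sketch of how a writer might publish a commit so that two writers can never both claim the same version. It is a toy model rather than Delta Lake's implementation: the helper `try_commit` and the sample actions are hypothetical, and the local file system's exclusive-create mode (`open(..., "x")`) stands in for the put-if-absent guarantee that a real log store has to provide. The 20-digit zero-padded file name matches the commit files listed later in this notebook.

```python
import json
import os
import tempfile


def try_commit(log_dir, version, actions):
    """Write actions as commit `version`; return False if that version already exists."""
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file exists, so only one writer can win a version.
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")  # one JSON action per line
        return True
    except FileExistsError:
        return False


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": {"path": "part-0.parquet"}}])
second = try_commit(log_dir, 0, [{"add": {"path": "part-1.parquet"}}])
print(first, second)         # True False
print(os.listdir(log_dir))   # ['00000000000000000000.json']
```

In this model, a writer that loses the race would re-read the log, check whether the winning commit conflicts with its own changes, and retry at the next version.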

### Metadata

The metadata tracked in the `_delta_log` describes the structure, layout, and configuration of the table. It includes the table’s schema, partition columns, per-file statistics used for data skipping, and the minimum protocol versions that readers and writers must support. Metadata can be accessed programmatically through Spark, SQL, Python, or Rust APIs, giving users full insight into how the table is organized and how it has evolved. This metadata layer enables Delta Lake to optimize queries, enforce structural constraints, and adapt as data grows or workloads change.

### Schema

A Delta Lake table’s schema defines the structure of its data, including column names, data types, and nested fields. The schema is enforced whenever data is written, preventing corrupted or mismatched records from entering the table. Delta Lake also supports schema evolution, allowing new columns to be added or existing structures to change without breaking downstream processes. Because the schema and its modifications are captured in the transaction log, every version of the table retains a complete understanding of how the data was structured at that point in time.
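
The difference between enforcement and evolution can be sketched with plain dictionaries. This is a simplified model under stated assumptions, not Delta Lake's actual rules: `check_write` and its `allow_new_columns` flag are hypothetical names, types are compared as exact strings, and columns missing from the incoming data are simply allowed. In PySpark, the switch that plays a similar role on a write is the `mergeSchema` option.

```python
def check_write(table_schema, incoming_schema, allow_new_columns=False):
    """Return the schema after the write, or raise if the write must be rejected.

    Both schemas are dicts mapping column name -> type name.
    """
    merged = dict(table_schema)
    for name, dtype in incoming_schema.items():
        if name not in table_schema:
            if not allow_new_columns:
                raise ValueError(f"unexpected column: {name}")
            merged[name] = dtype  # schema evolution: append the new column
        elif table_schema[name] != dtype:
            raise ValueError(f"type mismatch for {name}: {table_schema[name]} vs {dtype}")
    return merged


table = {"id": "long", "name": "string", "amount": "long"}

# Extra column without evolution enabled: the write is rejected.
try:
    check_write(table, {"id": "long", "tier": "string"})
except ValueError as e:
    print(e)  # unexpected column: tier

# Extra column with evolution enabled: the write is accepted and the schema grows.
evolved = check_write(table, {"id": "long", "tier": "string"}, allow_new_columns=True)
print(list(evolved))  # ['id', 'name', 'amount', 'tier']
```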

### Checkpoints

To improve performance when reading table history, Delta Lake periodically writes **checkpoints**, which are compact Parquet summaries of the current state of the table. Instead of replaying every JSON log entry from the beginning, readers can load the most recent checkpoint and then apply only the newer transactions that follow it. By default, a checkpoint is generated every ten commits. This optimization significantly speeds up table initialization, reduces metadata overhead, and allows large tables to remain responsive even at massive scale.
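
As a rough illustration of that read path, the sketch below takes a listing of log file names and works out which checkpoint to load and which commits to replay on top of it. The function `plan_read` and the file listing are hypothetical, and the sketch only handles single-file checkpoints named `<version>.checkpoint.parquet`; a real reader has more cases to cover, including the `_last_checkpoint` pointer file.

```python
import re

COMMIT = re.compile(r"^(\d{20})\.json$")
CHECKPOINT = re.compile(r"^(\d{20})\.checkpoint\.parquet$")


def plan_read(log_files):
    """Return (checkpoint version or None, commit versions to replay after it)."""
    checkpoints = [int(m.group(1)) for f in log_files if (m := CHECKPOINT.match(f))]
    commits = sorted(int(m.group(1)) for f in log_files if (m := COMMIT.match(f)))
    start = max(checkpoints) if checkpoints else None
    to_replay = [v for v in commits if start is None or v > start]
    return start, to_replay


# Hypothetical listing: 13 commits (versions 0-12) and a checkpoint at version 10.
files = [f"{v:020d}.json" for v in range(13)] + [f"{10:020d}.checkpoint.parquet"]
print(plan_read(files))  # (10, [11, 12])
```

Without the checkpoint, the same reader would have to replay all thirteen commits.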


# Getting Started

## 1. Basic CRUD Operations with Delta Tables

This cell creates a small DataFrame with `id`, `name`, and `amount` columns and saves it as a Delta table at `/tmp/delta-crud-table`.  
We use `overwrite` mode to replace any existing data. Delta Lake automatically maintains a transaction log (`_delta_log`) for ACID compliance and time travel.

In [None]:
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Create: Create a Delta table from a PySpark DataFrame
data = [(1, "Alice", 100), (2, "Bob", 200), (3, "Charlie", 300)]
columns = ["id", "name", "amount"]

df = spark.createDataFrame(data, columns)
delta_path = "/tmp/delta-crud-table"

# Save the DataFrame as a Delta table (overwrite mode)
df.write.format("delta").mode("overwrite").save(delta_path)

This cell loads the Delta table from the specified path into a PySpark DataFrame. We then display its contents using `show()`.  
Delta Lake ensures we always read a consistent snapshot of the table, even after updates or deletes.

In [None]:
# Read: Load the Delta table into a DataFrame
delta_df = spark.read.format("delta").load(delta_path)
delta_df.show()

+---+-------+------+
| id|   name|amount|
+---+-------+------+
|  2|    Bob|   200|
|  3|Charlie|   300|
|  1|  Alice|   100|
+---+-------+------+



Update Bob's `amount` to 250 using a `DeltaTable` object. The `condition` selects the rows, and `set` specifies the column to update. Delta Lake ensures the update is transactional and consistent.

In [None]:
# Update: Update records in Delta table
delta_table = DeltaTable.forPath(spark, delta_path)

# Update Bob's amount to 250
delta_table.update(
    condition="name = 'Bob'",  # Rows matching this condition will be updated
    set={"amount": "250"}      # Columns to update
)
delta_table.toDF().show()

+---+-------+------+
| id|   name|amount|
+---+-------+------+
|  2|    Bob|   250|
|  3|Charlie|   300|
|  1|  Alice|   100|
+---+-------+------+



Delete Alice's row using the `condition` parameter. Delta Lake ensures the deletion is transactional and preserves table consistency.

In [None]:
# Delete: Delete records from Delta table
# Delete Alice's row
delta_table.delete(condition="name = 'Alice'")
delta_table.toDF().show()

+---+-------+------+
| id|   name|amount|
+---+-------+------+
|  2|    Bob|   250|
|  3|Charlie|   300|
+---+-------+------+



## 2. Merging / Upserting Data

Delta Lake supports upserts into existing tables through the `DeltaTable` API and its `DeltaMergeBuilder`. A merge operation lets you define how incoming records should interact with existing ones by specifying a matching condition and separate actions for matched and unmatched rows. When the condition identifies a matching record, Delta Lake can update the existing row; when no match is found, a new row is inserted. Both actions are chained onto a single `merge()` call, and because the entire merge is treated as one ACID transaction, the table always remains in a consistent state even under concurrent workloads or failures.

Calling `merge()` on a `DeltaTable` returns a `DeltaMergeBuilder`, which you use to define how new data should be merged into the existing table. Each combination of match outcome and action has its own method, such as `whenMatchedUpdate()` for updates and `whenNotMatchedInsert()` for inserts.  
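
Before running a real merge, it may help to see the matched and not-matched rules spelled out in plain Python. The sketch below models only the row-level semantics on lists of dictionaries; `upsert` is a hypothetical helper, and it says nothing about how Delta Lake executes a merge or makes it transactional.

```python
def upsert(target_rows, source_rows, key="id"):
    """Return the target rows after merging the source rows on `key`."""
    merged = {row[key]: dict(row) for row in target_rows}
    for row in source_rows:
        if row[key] in merged:
            merged[row[key]].update(row)   # when matched: update the existing row
        else:
            merged[row[key]] = dict(row)   # when not matched: insert a new row
    return list(merged.values())


target = [
    {"id": 2, "name": "Bob", "amount": 250},
    {"id": 3, "name": "Charlie", "amount": 300},
]
source = [
    {"id": 3, "name": "Charlie", "amount": 350},  # existing id: updated
    {"id": 4, "name": "David", "amount": 400},    # new id: inserted
]

result = upsert(target, source)
for row in result:
    print(row)
```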

In this example, we merge a DataFrame of new records into the Delta table: rows with matching `id`s are updated, while new rows are inserted. Chaining these actions onto a single `merge()` ensures that the updates and inserts commit together as one atomic operation and that the table remains consistent.

In [None]:
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# New data to merge (upsert) into the Delta table
data = [
    (3, "Charlie", 350),  # Existing ID, will update
    (4, "David", 400)     # New ID, will insert
]
new_df = spark.createDataFrame(data, ["id", "name", "amount"])

# Perform merge (upsert) operation
delta_table.alias("target").merge(
    source=new_df.alias("source"),           # New data
    condition="target.id = source.id"        # Matching condition
).whenMatchedUpdate(
    set={
        "name": "source.name",               # Update existing rows
        "amount": "source.amount"
    }
).whenNotMatchedInsert(
    values={
        "id": "source.id",                   # Insert new rows
        "name": "source.name",
        "amount": "source.amount"
    }
).execute()

# Show table after merge/upsert
delta_table.toDF().show()

+---+-------+------+
| id|   name|amount|
+---+-------+------+
|  2|    Bob|   250|
|  3|Charlie|   350|
|  4|  David|   400|
+---+-------+------+



# ACID in Depth

Delta Lake provides **ACID (Atomicity, Consistency, Isolation, Durability) guarantees** on top of your data lake.  
This ensures that every operation—whether a simple write, update, delete, or merge—is transactional and consistent, even in the presence of concurrent operations or failures.

Delta Lake achieves this using the **_delta_log** folder, which tracks all changes made to the table as a sequence of JSON and checkpoint files.

## The `_delta_log` Folder

The `_delta_log` folder contains:

- **JSON commit files** (`00000000000000000000.json`, etc.): Each file represents a single transaction and records all of its actions (such as `add`, `remove`, `metaData`, and `commitInfo`).  
- **Checkpoint Parquet files** (`*.checkpoint.parquet`): Created periodically so that readers can rebuild the table state without reading every JSON file.  

These files together allow Delta Lake to:

- Track table history for **time travel**
- Ensure **atomicity**: either a transaction fully completes or has no effect
- Maintain **consistency**: table state always conforms to schema and constraints
- Provide **isolation**: concurrent operations see consistent snapshots
- Guarantee **durability**: committed changes survive crashes or failures

In [None]:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
import os
spark = get_spark()

# Create initial DataFrame
data = [(1, "Alice", 100), (2, "Bob", 200)]
columns = ["id", "name", "amount"]
df = spark.createDataFrame(data, columns)

# Define Delta table path
delta_path = "/tmp/acid_demo_table"

# Write DataFrame as Delta table (initial commit)
df.write.format("delta").mode("overwrite").save(delta_path)

# Load the table as a DeltaTable; later cells use it to generate more commits
delta_table = DeltaTable.forPath(spark, delta_path)

Running this command lists all the files in the `_delta_log` folder of the Delta table. Each `.json` file corresponds to a single transaction and records the actions it performed, such as adding or removing data files. The `_commits` entry is a subdirectory rather than a commit file; it is reserved for commits staged through an external commit coordinator and plays no part in the examples here. The numbering of the JSON files reflects the sequence of operations, which Delta Lake uses to enforce ACID guarantees and enable time travel.

In [None]:
# List all files in the Delta table log folder
!ls /tmp/acid_demo_table/_delta_log

00000000000000000000.json  00000000000000000003.json  00000000000000000006.json
00000000000000000001.json  00000000000000000004.json  00000000000000000007.json
00000000000000000002.json  00000000000000000005.json  _commits


<h2> Understanding a Delta Lake Commit JSON File </h2>

When you write data to a Delta table, Delta Lake records the operation as a **commit** in the `_delta_log` folder. Each commit is stored as a **JSON file**, such as `00000000000000000000.json`. This file is crucial because it is **the atomic record of the transaction** and contains all the metadata, schema information, and file operations associated with that write.

Let’s dissect the structure and fields of the first commit file written for the table above:

<h3> 1. <code>commitInfo</code> Object </h3>

```json
{"commitInfo":{
    "timestamp":1763730403573,
    "operation":"WRITE",
    "operationParameters":{"mode":"Overwrite","partitionBy":"[]"},
    "isolationLevel":"Serializable",
    "isBlindAppend":false,
    "operationMetrics":{"numFiles":"2","numOutputRows":"2","numOutputBytes":"1934"},
    "engineInfo":"Apache-Spark/3.5.1 Delta-Lake/3.2.0",
    "txnId":"1e8029ec-2cf7-4712-8c72-2b18bf4a35e0"
}}
```

**Explanation:**

* **`timestamp`** – The exact time the commit occurred (in milliseconds since epoch). Useful for time travel queries.
* **`operation`** – The type of operation performed (`WRITE`, `UPDATE`, `DELETE`, `MERGE`, etc.). In this case, it’s a `WRITE`.
* **`operationParameters`** – Additional info about the operation:

  * `mode`: Write mode (`Overwrite` here).
  * `partitionBy`: Any partition columns used (empty array here).
* **`isolationLevel`** – The transactional isolation level used (`Serializable` ensures full ACID isolation).
* **`isBlindAppend`** – Indicates if the operation is a blind append (true if data is appended without checking existing files). Here it’s false.
* **`operationMetrics`** – Metrics about the commit:

  * `numFiles`: Number of files written (2).
  * `numOutputRows`: Number of rows written (2).
  * `numOutputBytes`: Approximate size in bytes (1934).
* **`engineInfo`** – Spark and Delta versions used for the operation.
* **`txnId`** – Unique identifier for this transaction. Every commit has a unique ID to track operations.

**Key insight:** This object gives **a full audit trail of what the transaction did**, including metrics and configuration.
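
As a small worked example, the snippet below parses a `commitInfo` line with the standard library. The line is abridged from the object above; everything else (variable names, the bytes-per-row calculation) is illustrative. It shows two details that are easy to miss: the timestamp is in milliseconds, and the metric values are stored as strings.

```python
import json
from datetime import datetime, timezone

# One line of a commit file, abridged from the commitInfo object above.
line = (
    '{"commitInfo":{"timestamp":1763730403573,"operation":"WRITE",'
    '"operationParameters":{"mode":"Overwrite","partitionBy":"[]"},'
    '"operationMetrics":{"numFiles":"2","numOutputRows":"2","numOutputBytes":"1934"}}}'
)

info = json.loads(line)["commitInfo"]

# The timestamp is in milliseconds since the Unix epoch.
committed_at = datetime.fromtimestamp(info["timestamp"] / 1000, tz=timezone.utc)

# Metric values are serialized as strings, so convert before doing arithmetic.
rows = int(info["operationMetrics"]["numOutputRows"])
size = int(info["operationMetrics"]["numOutputBytes"])

print(info["operation"], committed_at.isoformat())
print(f"{rows} rows, {size / rows:.0f} bytes per row")
```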

<h3> 2. <code>metaData</code> Object </h3>

```json
{"metaData":{
    "id":"f4d6ad9d-c095-4c34-bb48-64b4011a43c3",
    "format":{"provider":"parquet","options":{}},
    "schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true},{\"name\":\"name\",\"type\":\"string\",\"nullable\":true},{\"name\":\"amount\",\"type\":\"long\",\"nullable\":true}]}",
    "partitionColumns":[],
    "configuration":{},
    "createdTime":1763730402440
}}
```

**Explanation:**

* **`id`** – A unique identifier for the table metadata.
* **`format`** – The storage format (`parquet`) and any specific options.
* **`schemaString`** – The table schema serialized as a JSON string. Here we have three columns:

  * `id` (long, nullable)
  * `name` (string, nullable)
  * `amount` (long, nullable)
* **`partitionColumns`** – Lists any columns used for partitioning (empty here).
* **`configuration`** – Any table-level configuration properties. Empty in this simple example.
* **`createdTime`** – Timestamp when the table metadata was created.

**Key insight:** This object records **the schema of the table** at this commit. Delta uses this to validate writes and enable schema evolution.
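
Because `schemaString` is itself a JSON document stored inside a JSON string, reading it takes two decoding steps. The sketch below rebuilds a trimmed-down version of the `metaData` action above (only the fields used here) and extracts the column names and types; the variable names are illustrative.

```python
import json

# The metaData action from above, trimmed to the fields used here.
meta_line = json.dumps({
    "metaData": {
        "schemaString": json.dumps({
            "type": "struct",
            "fields": [
                {"name": "id", "type": "long", "nullable": True},
                {"name": "name", "type": "string", "nullable": True},
                {"name": "amount", "type": "long", "nullable": True},
            ],
        }),
        "partitionColumns": [],
    }
})

meta = json.loads(meta_line)["metaData"]      # first decode: the action itself
schema = json.loads(meta["schemaString"])     # second decode: the embedded schema

columns = [(f["name"], f["type"]) for f in schema["fields"]]
print(columns)  # [('id', 'long'), ('name', 'string'), ('amount', 'long')]
```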

<h3> 3. <code>protocol</code> Object </h3>

```json
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
```

**Explanation:**

* **`minReaderVersion`** – Minimum Delta protocol version required to read this table.
* **`minWriterVersion`** – Minimum Delta protocol version required to write to this table.

**Key insight:** The protocol object ensures **compatibility across different Delta Lake versions**, so older readers/writers know if they can interact with this table safely.
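
The version check itself is simple enough to sketch. The function `can_access` below is hypothetical and compares version numbers only; newer protocol versions also list named table features that a client must understand, which this sketch ignores.

```python
def can_access(table_protocol, client_reader_version, client_writer_version):
    """Return (can_read, can_write) for a client against a table's protocol action."""
    can_read = client_reader_version >= table_protocol["minReaderVersion"]
    # Writing also requires being able to read the table.
    can_write = can_read and client_writer_version >= table_protocol["minWriterVersion"]
    return can_read, can_write


protocol = {"minReaderVersion": 1, "minWriterVersion": 2}

print(can_access(protocol, client_reader_version=1, client_writer_version=2))  # (True, True)
print(can_access(protocol, client_reader_version=1, client_writer_version=1))  # (True, False)
```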

<h3> 4. <code>add</code> Objects </h3>

```json
{"add":{
    "path":"part-00000-d40ffd41-6e37-447c-9ec7-e1ec58f31ef1-c000.snappy.parquet",
    "partitionValues":{},
    "size":974,
    "modificationTime":1763730402871,
    "dataChange":true,
    "stats":"{\"numRecords\":1,\"minValues\":{\"id\":1,\"name\":\"Alice\",\"amount\":100},\"maxValues\":{\"id\":1,\"name\":\"Alice\",\"amount\":100},\"nullCount\":{\"id\":0,\"name\":0,\"amount\":0}}"
}}
```

**Explanation:**

* **`path`** – The relative file path of the Parquet file that was added.
* **`partitionValues`** – Partition values for the file (empty here since the table isn’t partitioned).
* **`size`** – File size in bytes.
* **`modificationTime`** – Time the file was written, in milliseconds since epoch.
* **`dataChange`** – `true` indicates that this action changes the table’s contents. `false` marks actions, such as those produced by file compaction, that only rearrange existing data.
* **`stats`** – Statistics about this file:

  * `numRecords`: Number of rows in the file.
  * `minValues` / `maxValues`: Min/max per column.
  * `nullCount`: Count of nulls per column.

**Key insight:** Each `add` entry describes **exactly what Parquet files were added** in this transaction. Delta uses this to maintain ACID guarantees and efficiently plan queries.
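
To show how the `stats` field can help plan a query, here is a minimal sketch of min/max data skipping for a filter like `amount > 150`. The first file's statistics are copied from the `add` action above; the second file, its path, and the helper `may_contain` are hypothetical. A real engine handles many more predicate shapes, but the idea is the same: skip any file whose statistics prove it cannot match.

```python
import json


def may_contain(add_action, column, lower_bound):
    """Return False only if the stats prove no row has column > lower_bound."""
    stats = json.loads(add_action["stats"])   # stats is stored as a JSON string
    return stats["maxValues"][column] > lower_bound


files = [
    # Stats copied from the add action above.
    {"path": "part-00000-d40ffd41-6e37-447c-9ec7-e1ec58f31ef1-c000.snappy.parquet",
     "stats": json.dumps({"numRecords": 1,
                          "minValues": {"id": 1, "name": "Alice", "amount": 100},
                          "maxValues": {"id": 1, "name": "Alice", "amount": 100},
                          "nullCount": {"id": 0, "name": 0, "amount": 0}})},
    # Hypothetical second file holding Bob's row; the path is made up.
    {"path": "part-00001-hypothetical.snappy.parquet",
     "stats": json.dumps({"numRecords": 1,
                          "minValues": {"id": 2, "name": "Bob", "amount": 200},
                          "maxValues": {"id": 2, "name": "Bob", "amount": 200},
                          "nullCount": {"id": 0, "name": 0, "amount": 0}})},
]

# Query: WHERE amount > 150. Only files whose max(amount) exceeds 150 are read.
to_scan = [f["path"] for f in files if may_contain(f, "amount", 150)]
print(to_scan)  # ['part-00001-hypothetical.snappy.parquet']
```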

<h3> Putting it all together </h3>

1. **`commitInfo`** – Audit trail of the transaction.
2. **`metaData`** – Table schema and configuration at this commit.
3. **`protocol`** – Ensures version compatibility.
4. **`add` / `remove` / other actions** – Physical file changes for this commit.

Delta Lake builds the table state **by replaying all commits** in `_delta_log`, combining all `add` and `remove` actions, while ensuring **atomicity, consistency, isolation, and durability**.

This single JSON file is therefore **both a record of the operation and the foundation of Delta Lake’s ACID guarantees**.
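
The replay described above can be sketched in a few lines. This toy version tracks only which data files are live; the commit history and file names are made up, and `replay` is a hypothetical helper rather than anything from the Delta Lake API.

```python
def replay(commits):
    """Return the set of live data file paths after applying commits in order.

    `commits` is a list of commits; each commit is a list of action dicts.
    """
    live = set()
    for actions in commits:
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
            # commitInfo, metaData and protocol do not change the file set
    return live


# Hypothetical three-commit history with made-up file names.
commits = [
    [{"commitInfo": {"operation": "WRITE"}},
     {"add": {"path": "a.parquet"}}, {"add": {"path": "b.parquet"}}],
    [{"commitInfo": {"operation": "UPDATE"}},
     {"remove": {"path": "b.parquet"}}, {"add": {"path": "c.parquet"}}],
    [{"commitInfo": {"operation": "DELETE"}},
     {"remove": {"path": "a.parquet"}}],
]

print(sorted(replay(commits)))       # ['c.parquet']
print(sorted(replay(commits[:1])))   # ['a.parquet', 'b.parquet']
```

Replaying only a prefix of the history, as in the second call, yields the file set of an older version, which is the idea behind time travel.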

In [None]:
!cat /tmp/acid_demo_table/_delta_log/00000000000000000000.json | jq

{
  "commitInfo": {
    "timestamp": 1763730403573,
    "operation": "WRITE",
    "operationParameters": {
      "mode": "Overwrite",
      "partitionBy": "[]"
    },
    "isolationLevel": "Serializable",
    "isBlindAppend": false,
    "operationMetrics": {
      "numFiles": "2",
      "numOutputRows": "2",
      "numOutputBytes": "1934"
    },
    "engineInfo": "Apache-Spark/3.5.1 Delta-Lake/3.2.0",

The cell below updates Bob's `amount` to 250, deletes Alice's row, and appends new rows for Charlie and David.  
Each operation creates a new commit in `_delta_log`, ensuring the table remains consistent and transactional.

In [None]:
# Update Bob's amount
delta_table.update(
    condition="name = 'Bob'",
    set={"amount": "250"}
)

# Delete Alice's row
delta_table.delete(condition="name = 'Alice'")

# Insert new rows (append)
new_data = [(3, "Charlie", 300), (4, "David", 400)]
new_df = spark.createDataFrame(new_data, ["id", "name", "amount"])
new_df.write.format("delta").mode("append").save(delta_path)

When we update a row in a Delta table, Delta creates a commit JSON in `_delta_log` to record the transaction. The `commitInfo` section logs the operation type, timestamp, read version, isolation level, and detailed metrics like the number of rows updated and files added or removed.  

The `remove` entry marks the old Parquet file that contained the outdated data as deleted, while the `add` entry points to a new Parquet file with the updated row. Delta never modifies files in place; instead, it replaces them atomically. This design preserves ACID guarantees, allows time travel queries, and keeps a complete, auditable history of all changes to the table.  
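
The replace-rather-than-modify pattern can be modeled as follows. This is a sketch under simplifying assumptions: `update_file` is a hypothetical helper, rows are plain dictionaries instead of Parquet data, the new file name is invented, and deletion vectors (which can avoid rewriting a file) are ignored.

```python
import uuid


def update_file(path, rows, predicate, changes):
    """Return (actions, new_rows) for rewriting one data file.

    If no row matches, the file is left alone and no actions are produced.
    """
    if not any(predicate(r) for r in rows):
        return [], rows
    new_rows = [{**r, **changes} if predicate(r) else dict(r) for r in rows]
    new_path = f"part-{uuid.uuid4()}.parquet"   # made-up naming scheme
    actions = [{"remove": {"path": path}}, {"add": {"path": new_path}}]
    return actions, new_rows


# Hypothetical file containing Bob's row.
old_rows = [{"id": 2, "name": "Bob", "amount": 200}]
actions, rows = update_file(
    "part-old.parquet",
    old_rows,
    predicate=lambda r: r["name"] == "Bob",
    changes={"amount": 250},
)
print([next(iter(a)) for a in actions])  # ['remove', 'add']
print(rows)                              # [{'id': 2, 'name': 'Bob', 'amount': 250}]
```

The original rows are never mutated, so a reader that still points at the old file keeps seeing the old values.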

By inspecting these commit files, you can see exactly what changed in each operation and how Delta manages consistent snapshots of the table over time.

In [None]:
!cat /tmp/acid_demo_table/_delta_log/00000000000000000001.json | jq

{
  "commitInfo": {
    "timestamp": 1763730410838,
    "operation": "UPDATE",
    "operationParameters": {
      "predicate": "[\"(name#7709 = Bob)\"]"
    },
    "readVersion": 0,
    "isolationLevel": "Serializable",
    "isBlindAppend": false,
    "operationMetrics": {
      "numRemovedFiles": "1",
      "numRemovedBytes": "960",
      "numCopiedRows": "0",
      "numDeletionVectorsAdded": "0",
      "numDeletionVecto

## Time Travel

Time Travel in Delta Lake allows you to query a table **as it existed at a specific point in time or version**.  
By using `.option("versionAsOf", n)` when reading a Delta table, you can access historical data safely and consistently.  
For example, version 0 shows the original write, while later versions reflect updates, deletes, or merges.  
This feature is useful for **auditing changes, recovering previous data, or comparing versions** over time.  
All previous states are reconstructed from the `_delta_log` transaction logs, ensuring **ACID compliance** even in large-scale analytics.
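
Besides `versionAsOf`, the PySpark reader also accepts a `timestampAsOf` option. One way to think about a timestamp query is that it first resolves to a version: the latest commit at or before the requested time. The sketch below models that lookup; `version_as_of` is a hypothetical helper, and the two timestamps are the ones from the commit files shown earlier in this notebook.

```python
def version_as_of(commit_timestamps, requested_ms):
    """Return the latest version committed at or before requested_ms.

    `commit_timestamps` maps version -> commit time in epoch milliseconds.
    """
    eligible = [v for v, ts in commit_timestamps.items() if ts <= requested_ms]
    if not eligible:
        raise ValueError("requested time is before the first commit")
    return max(eligible)


# Timestamps of versions 0 and 1 taken from the commit files shown earlier.
history = {0: 1763730403573, 1: 1763730410838}

print(version_as_of(history, 1763730405000))  # 0
print(version_as_of(history, 1763730410838))  # 1
```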

In [2]:
from pyspark.sql import SparkSession
from delta.tables import DeltaTable
import os

# Create SparkSession with Delta Lake support
spark = get_spark()

# Create initial Delta table
data = spark.createDataFrame([
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "Charlie", 28)
], ["id", "name", "age"])

delta_path = "/tmp/delta-table"
data.write.format("delta").mode("overwrite").save(delta_path)  # Write initial data

# Load Delta table for further operations
deltaTable = DeltaTable.forPath(spark, delta_path)

# Update operation: change name where id=1
deltaTable.update(
    condition="id == 1",
    set={"name": "'Alicia'"}
)

# Delete operation: remove rows where age > 28
deltaTable.delete(condition="age > 28")

# Merge operation: upsert new and updated data
new_data = spark.createDataFrame([
    (2, "Bobby", 31),  # id 2 was deleted above, so this row is inserted
    (4, "David", 22)   # Insert new record
], ["id", "name", "age"])

deltaTable.alias("t").merge(
    new_data.alias("s"),
    "t.id = s.id"
).whenMatchedUpdate(set={
    "name": "s.name",
    "age": "s.age"
}).whenNotMatchedInsert(values={
    "id": "s.id",
    "name": "s.name",
    "age": "s.age"
}).execute()

# Time Travel: read previous versions of the table
# Version 0: original write
print("=== Version 0 (original write) ===")
spark.read.format("delta").option("versionAsOf", 0).load(delta_path).show()

# Version 1: after update
print("=== Version 1 (after update) ===")
spark.read.format("delta").option("versionAsOf", 1).load(delta_path).show()

# Version 2: after delete
print("=== Version 2 (after delete) ===")
spark.read.format("delta").option("versionAsOf", 2).load(delta_path).show()

# Version 3: after merge
print("=== Version 3 (after merge) ===")
spark.read.format("delta").option("versionAsOf", 3).load(delta_path).show()

# Inspect Delta table history with metrics
deltaTable.history().select(
    "version",
    "timestamp",
    "operation",
    "operationMetrics"
).show(truncate=False)

# Inspect physical files on disk
print("\n--- Delta Table Files ---")
for root, dirs, files in os.walk(delta_path):
    for f in files:
        print(os.path.join(root, f))


=== Version 0 (original write) ===
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  2|    Bob| 30|
|  3|Charlie| 28|
|  1|  Alice| 25|
+---+-------+---+

=== Version 1 (after update) ===
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  2|    Bob| 30|
|  3|Charlie| 28|
|  1| Alicia| 25|
+---+-------+---+

=== Version 2 (after delete) ===
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  3|Charlie| 28|
|  1| Alicia| 25|
+---+-------+---+

=== Version 3 (after merge) ===
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  3|Charlie| 28|
|  1| Alicia| 25|
|  2|  Bobby| 31|
|  4|  David| 22|
+---+-------+---+

+-------+-----------------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------