# What is the Delta Lake transaction log?

The Delta Lake transaction log, also known as the **_Delta Log_**, is an ordered record of every change that has ever been performed on a Delta Lake table since its inception.

The Delta Log is the core of Delta Lake: it is what guarantees **atomicity, consistency, isolation, and durability** (ACID) for user-initiated transactions.

## How Does the Transaction Log Work?

Whenever a user performs an operation to modify a table (e.g., INSERT, DELETE, UPDATE or MERGE), Delta Lake breaks that operation down into a series of discrete steps composed of one or more of the actions below.

- **Update metadata**: Updates the table’s metadata, including but not limited to the table’s name, schema, or partitioning.
- **Add file**: Adds a data file to the transaction log.
- **Remove file**: Removes a data file from the transaction log.
- **Set transaction**: Records that a Structured Streaming job has committed a micro-batch with the given ID.
- **Change protocol**: Enables new features by switching the Delta Lake transaction log to the newest software protocol.
- **Commit info**: Contains information about the commit: which operation was performed, from where, and at what time.

Those actions are then recorded in the transaction log as ordered, atomic units known as **_commits_**.
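To make the idea of a commit concrete, here is a minimal sketch in plain Python (no Spark needed). It assumes a commit file is newline-delimited JSON with one action per line, keyed by the action type, which matches the `add` and `commitInfo` columns that appear when a commit file is read later in this notebook. The sample lines, the file path, and the helper `actions_in_commit` are made up for illustration.

```python
import json

# Hypothetical contents of one commit file: one JSON action per line.
sample_commit = "\n".join([
    '{"commitInfo": {"operation": "WRITE", "operationParameters": {"mode": "Overwrite"}}}',
    '{"metaData": {"id": "example-table-id", "partitionColumns": []}}',
    '{"add": {"path": "part-00000-example.snappy.parquet", "size": 854, "dataChange": true}}',
])

def actions_in_commit(commit_text):
    """Return (action_type, payload) pairs in the order they appear in the commit."""
    actions = []
    for line in commit_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Each line is assumed to hold exactly one action, keyed by its type.
        (action_type, payload), = record.items()
        actions.append((action_type, payload))
    return actions

for action_type, payload in actions_in_commit(sample_commit):
    print(action_type, "->", sorted(payload))
```

Keeping one action per line means a commit can be read line by line without loading a single large JSON document.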

In [1]:
# Clean previous run
spark.sql("DROP TABLE IF EXISTS demo.training")

# Create a DataFrame with ids 0 through 4
data = spark.range(0, 5)

# Write the DataFrame to the OneLake location as a Delta table
data.write.mode("overwrite").format("delta").save("Tables/training")



## Delta Lake API

In [2]:
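from delta.tables import DeltaTable

# Load the table by name and show its details (location, number of files, size, ...)
delta_table = DeltaTable.forName(spark, "training")
details = delta_table.detail()
display(details)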


In [3]:
delta_log = spark.read.json("Tables/training/_delta_log/00000000000000000000.json")


## Metadata of the Transaction Log

The metadata contains a lot of interesting information, but let’s focus on the fields related to the file system.


In [4]:
delta_log.printSchema()


root
 |-- add: struct (nullable = true)
 |    |-- dataChange: boolean (nullable = true)
 |    |-- modificationTime: long (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- size: long (nullable = true)
 |    |-- stats: string (nullable = true)
 |    |-- tags: struct (nullable = true)
 |    |    |-- VORDER: string (nullable = true)
 |-- commitInfo: struct (nullable = true)
 |    |-- engineInfo: string (nullable = true)
 |    |-- isBlindAppend: boolean (nullable = true)
 |    |-- isolationLevel: string (nullable = true)
 |    |-- operation: string (nullable = true)
 |    |-- operationMetrics: struct (nullable = true)
 |    |    |-- numFiles: string (nullable = true)
 |    |    |-- numOutputBytes: string (nullable = true)
 |    |    |-- numOutputRows: string (nullable = true)
 |    |-- operationParameters: struct (nullable = true)
 |    |    |-- mode: string (nullable = true)
 |    |    |-- partitionBy: string (nullable = true)
 |    |-- tags: struct (nullable = true)
 | 

In [6]:
display(delta_log)


In [7]:
mssparkutils.fs.ls("Tables/training/")


[FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/part-00000-4385b14d-cbfe-44ae-b8de-ae1b3630cd51-c000.snappy.parquet, name=part-00000-4385b14d-cbfe-44ae-b8de-ae1b3630cd51-c000.snappy.parquet, size=854)]

In [8]:
sc.binaryFiles("Tables/training/").count()


1

## Commit Information

In [9]:
# Commit Information
display(delta_log.select(
    "commitInfo.engineInfo",
    "commitInfo.isBlindAppend",
    "commitInfo.isolationLevel",
    "commitInfo.operation",
    "commitInfo.operationMetrics.numFiles",
    "commitInfo.operationMetrics.numOutputBytes",
    "commitInfo.operationMetrics.numOutputRows",
    "commitInfo.operationParameters.mode",
    "commitInfo.operationParameters.partitionBy",
    "commitInfo.tags.VORDER",
    "commitInfo.timestamp",
    "commitInfo.txnId")
    .where("commitInfo is not null"))


## Add information

In [10]:
# Add info
display(delta_log.select(
    "add.dataChange",
    "add.modificationTime",
    "add.path", 
    "add.size",
    "add.stats",
    "add.tags.VORDER")
    .where("add is not null"))


In [11]:
%%sql
-- Describe table history using table name
DESCRIBE HISTORY demo.training


<Spark SQL result set with 1 rows and 15 fields>

In [None]:
%%sql

-- Describe table history using file path
DESCRIBE HISTORY delta.`/table_delta`

In [12]:
spark.read.table("training").count()


5

## Version 1: Appending data

In [13]:
# Add 4 new rows of data to our Delta table
data = spark.range(6, 10)
data.write.format("delta").mode("append").save("Tables/training")


    You can confirm that the table now contains 9 rows by running the following command.

In [14]:
spark.read.table("training").count()


9

    Review the file system.

In [15]:
mssparkutils.fs.ls("Tables/training/")


[FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/part-00000-4385b14d-cbfe-44ae-b8de-ae1b3630cd51-c000.snappy.parquet, name=part-00000-4385b14d-cbfe-44ae-b8de-ae1b3630cd51-c000.snappy.parquet, size=854),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/part-00000-d468e34f-c3f8-48cb-8120-03321e883732-c000.snappy.parquet, name=part-00000-d468e34f-c3f8-48cb-8120-03321e883732-c000.snappy.parquet, size=851)]

In [16]:
sc.binaryFiles("Tables/training/").count()


2

In [17]:
delta_log1 = spark.read.json("Tables/training/_delta_log/00000000000000000001.json")


    Next, let’s look at version 1 of the Delta table by running the following command to review the path column of the add metadata.

In [18]:
# Add info
display(delta_log1.select(
    "add.dataChange",
    "add.modificationTime",
    "add.path", 
    "add.size",
    "add.stats",
    "add.tags.VORDER")
    .where("add is not null"))


    Review the table history. Let’s focus on the operationMetrics for version 1.

In [19]:
%%sql
-- Describe table history using table name
DESCRIBE HISTORY demo.training


<Spark SQL result set with 2 rows and 15 fields>

## Version 2: Deleting data

In [20]:
from delta.tables import DeltaTable
from pyspark.sql.functions import col

# Delete the row with id = 2 through the Delta Lake API
delta_table = DeltaTable.forName(spark, "Training")
delta_table.delete(col("id") == 2)


    OR

In [None]:
%%sql
-- Delete from Delta table where id = 2
DELETE FROM Training WHERE id = 2

    Review the data. You can confirm that the table now contains 8 rows.



In [21]:
spark.read.table("training").count()


8

    Review the file system. Re-running the file listing shows, as expected, more data files and a new JSON commit file.

In [22]:
mssparkutils.fs.ls("Tables/training/_delta_log")


[FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000000.json, name=00000000000000000000.json, size=1048),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000001.json, name=00000000000000000001.json, size=708),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000002.json, name=00000000000000000002.json, size=1101),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/_temporary, name=_temporary, size=0)]
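The commit file names in the listing above follow a fixed pattern: the table version, zero-padded to 20 digits, followed by `.json`. A small sketch of that mapping in plain Python; the helper names are hypothetical, and the pattern is simply read off the listing.

```python
def commit_file_name(version):
    """Build the commit file name for a table version (20-digit zero padding)."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{version:020d}.json"

def version_from_file_name(name):
    """Recover the version number from a commit file name."""
    stem, _, extension = name.partition(".")
    if extension != "json" or not stem.isdigit():
        raise ValueError(f"not a commit file: {name}")
    return int(stem)

print(commit_file_name(2))                                  # 00000000000000000002.json
print(version_from_file_name("00000000000000000001.json"))  # 1
```

Because the names are zero-padded, sorting them as plain strings also sorts them by version.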

In [23]:
mssparkutils.fs.ls("Tables/training/")


[FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log, name=_delta_log, size=0),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/part-00000-10f680c9-e850-4846-83aa-64520f0ddc9a-c000.snappy.parquet, name=part-00000-10f680c9-e850-4846-83aa-64520f0ddc9a-c000.snappy.parquet, size=849),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/part-00000-4385b14d-cbfe-44ae-b8de-ae1b3630cd51-c000.snappy.parquet, name=part-00000-4385b14d-cbfe-44ae-b8de-ae1b3630cd51-c000.snappy.parquet, size=854),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/part-00000-d468e34f-c3f8-48cb-8120-03321e883732-c000.snappy.parquet, n

In [24]:
sc.binaryFiles("Tables/training/").count()


3

In [25]:
delta_log2 = spark.read.json("Tables/training/_delta_log/00000000000000000002.json")


In [26]:
display(delta_log2)


## Remove Information

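It is important to note that in the transaction log a deletion is recorded as a _tombstone_.

_**We have not deleted the files**_. Instead, the files are marked as removed, so that when you query the table, Delta **will not** include the removed files. The underlying files remain on storage until they are cleaned up, for example by running `VACUUM` after the retention period has passed.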
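A minimal sketch of how tombstones affect reads: replay the `add` and `remove` actions in commit order and keep only the paths that were added and not later removed. The file names and the `live_files` helper are hypothetical, and real readers handle details that this sketch ignores.

```python
def live_files(commits):
    """Replay commits in order and return the set of data files a reader should scan."""
    active = set()
    for actions in commits:
        for action_type, path in actions:
            if action_type == "add":
                active.add(path)
            elif action_type == "remove":
                # Tombstone: the file stays on storage but is no longer part of the table.
                active.discard(path)
    return active

# Hypothetical history: a write, an append, then a commit that replaces one file.
history = [
    [("add", "part-a.parquet")],                                # version 0
    [("add", "part-b.parquet")],                                # version 1
    [("remove", "part-a.parquet"), ("add", "part-c.parquet")],  # version 2
]

print(sorted(live_files(history)))  # ['part-b.parquet', 'part-c.parquet']
```

Replaying only the first two commits still returns `part-a.parquet`, which is why the physical file has to stay on storage for older versions to remain readable.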

In [27]:
# Remove information
display(delta_log2.select(
    "remove.dataChange",
    "remove.deletionTimestamp",
    "remove.extendedFileMetadata",
    "remove.path",
    "remove.size",
    "remove.tags.VORDER")
    .where("remove is not null"))


In [28]:
%%sql
-- Describe table history 
DESCRIBE HISTORY demo.training


<Spark SQL result set with 3 rows and 15 fields>

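    The delete added no new rows. The file listings above show one additional Parquet file, though, which suggests that Delta rewrote the affected data file without the deleted row and tombstoned the original.
    Look at the operationMetrics column to confirm.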

## Checkpoint File

To keep reads fast as the log grows, Delta Lake periodically writes checkpoint files; by default, one is written every 10 commits. A checkpoint file saves the entire state of the table at a point in time, in native Parquet format that is quick and easy for Spark to read. In other words, checkpoints offer the Spark reader a sort of **“shortcut”** to fully reproducing a table’s state, which lets Spark avoid reprocessing what could be thousands of tiny, inefficient JSON files.
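The shortcut can be sketched as a file-selection step: find the newest checkpoint, then read only the JSON commits with a higher version. This is a simplified illustration in plain Python. The function name `files_to_read` is hypothetical, and it only considers single-file checkpoints named `<version>.checkpoint.parquet`, the naming used later in this notebook.

```python
import re

COMMIT_RE = re.compile(r"^(\d{20})\.json$")
CHECKPOINT_RE = re.compile(r"^(\d{20})\.checkpoint\.parquet$")

def files_to_read(log_file_names):
    """Pick the latest checkpoint (if any) plus the commits that come after it."""
    commits = {}
    checkpoints = {}
    for name in log_file_names:
        commit_match = COMMIT_RE.match(name)
        checkpoint_match = CHECKPOINT_RE.match(name)
        if commit_match:
            commits[int(commit_match.group(1))] = name
        elif checkpoint_match:
            checkpoints[int(checkpoint_match.group(1))] = name
    # Without a checkpoint, every commit from version 0 must be replayed.
    start = max(checkpoints) if checkpoints else -1
    selected = [checkpoints[start]] if checkpoints else []
    selected += [commits[v] for v in sorted(commits) if v > start]
    return selected

# Hypothetical log directory: commits for versions 0-12 and one checkpoint at version 10.
names = [f"{v:020d}.json" for v in range(13)]
names.append(f"{10:020d}.checkpoint.parquet")

print(files_to_read(names))
```

With these sample names, the reader needs three files (the checkpoint plus commits 11 and 12) instead of thirteen.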

In [29]:
mssparkutils.fs.ls("Tables/training/_delta_log/")


[FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000000.json, name=00000000000000000000.json, size=1048),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000001.json, name=00000000000000000001.json, size=708),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000002.json, name=00000000000000000002.json, size=1101),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/_temporary, name=_temporary, size=0)]

In [30]:
# Generate a checkpoint file by appending ten single-row commits (versions 3 through 12)

rows_to_insert = [
    {"id": 5},
    {"id": 6},
    {"id": 7},
    {"id": 8},
    {"id": 9},
    {"id": 10},
    {"id": 11},
    {"id": 12},
    {"id": 13},
    {"id": 14}
]

# Iterate over the rows and insert them into the Delta table
for row_data in rows_to_insert:
    # Create a DataFrame from the row data
    data = spark.createDataFrame([row_data])
    
    # Insert the DataFrame into the Delta table
    data.write.format("delta").mode("append").save("Tables/training")



In [31]:
mssparkutils.fs.ls("Tables/training/_delta_log/")


[FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000000.json, name=00000000000000000000.json, size=1048),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000001.json, name=00000000000000000001.json, size=708),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000002.json, name=00000000000000000002.json, size=1101),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.com/cbb3f5fb-9a8e-40cc-8b0a-367745f5540c/Tables/training/_delta_log/00000000000000000003.json, name=00000000000000000003.json, size=708),
 FileInfo(path=abfss://7e0f3777-dd5b-4c28-b9bb-b53152661882@onelake.dfs.fabric.microsoft.c

The metadata you will read from the checkpoint is a union of all of the previous transactions. This becomes apparent when you query the add information.
In addition to the checkpoint file being in Parquet format (so Spark can read it even faster) and containing all of the transactions prior to it, notice that the stats are still stored as a JSON string, just as they were in the original JSON files.
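Because `stats` is a JSON string, it has to be parsed before individual statistics can be used. A minimal sketch in plain Python; the sample string is made up for illustration, with field names matching the ones parsed later in this notebook (`numRecords`, `minValues`, `maxValues`, `nullCount`). The `may_contain` helper is hypothetical and shows one typical use of such statistics: deciding whether a file can be skipped for a given value.

```python
import json

# Hypothetical stats value for a file holding ids 0 through 4.
stats_json = '{"numRecords": 5, "minValues": {"id": 0}, "maxValues": {"id": 4}, "nullCount": {"id": 0}}'

stats = json.loads(stats_json)

def may_contain(file_stats, column, value):
    """Return False only when the min/max statistics rule the value out for this file."""
    low = file_stats["minValues"].get(column)
    high = file_stats["maxValues"].get(column)
    if low is None or high is None:
        return True  # no statistics for this column, so the file cannot be skipped
    return low <= value <= high

print(stats["numRecords"])          # 5
print(may_contain(stats, "id", 2))  # True
print(may_contain(stats, "id", 7))  # False
```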


In [32]:
checkpoint = spark.read.parquet("Tables/training/_delta_log/00000000000000000010.checkpoint.parquet")


In [33]:
checkpoint.printSchema()


root
 |-- txn: struct (nullable = true)
 |    |-- appId: string (nullable = true)
 |    |-- version: long (nullable = true)
 |    |-- lastUpdated: long (nullable = true)
 |-- add: struct (nullable = true)
 |    |-- path: string (nullable = true)
 |    |-- partitionValues: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- size: long (nullable = true)
 |    |-- modificationTime: long (nullable = true)
 |    |-- dataChange: boolean (nullable = true)
 |    |-- tags: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)
 |    |-- deletionVector: struct (nullable = true)
 |    |    |-- storageType: string (nullable = true)
 |    |    |-- pathOrInlineDv: string (nullable = true)
 |    |    |-- offset: integer (nullable = true)
 |    |    |-- sizeInBytes: integer (nullable = true)
 |    |    |-- cardinality: long (nullable = true)
 |    |    |-- maxRowIndex: long (nullable = true

In [34]:
display(checkpoint.select("add.path","add.stats").where("add is not null"))


In [35]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import from_json

# Schema for the stats JSON string; minValues, maxValues, and nullCount are read as strings
schema = StructType([StructField("numRecords", IntegerType(), False),
                StructField("minValues", StringType(), False),
                StructField("maxValues", StringType(), False),
                StructField("nullCount", StringType(), False)])

# Parse the stats string into a struct column
checkpoint = checkpoint.withColumn("parsed_stats", from_json(checkpoint["add.stats"], schema))

display(checkpoint.select("add.path", "parsed_stats.numRecords","parsed_stats.minValues","parsed_stats.maxValues","parsed_stats.nullCount").where("add is not null"))


## Clean up

In [36]:
spark.sql("DROP TABLE IF EXISTS demo.training")


DataFrame[]