### Theory — What is Delta Lake?
Delta Lake is an open-source storage layer built on top of your data lake that:

- Adds reliability to big data storage

- Supports ACID transactions (Atomicity, Consistency, Isolation, Durability)

- Handles scalable metadata

- Allows streaming + batch processing on the same data

- Keeps a transaction log (_delta_log) that enables:

  - Versioning (Time Travel)

  - Data updates and deletes

  - Schema enforcement & evolution
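
The log-based features in the list above can be pictured with a toy model. The sketch below is not Delta Lake's actual implementation. It only assumes that each commit records which data files were added and removed, and shows how replaying commits up to a given version yields the files that make up that version. The names `commits` and `files_at_version` are made up for illustration.

```python
# Toy model of a transaction log: NOT Delta Lake's real implementation.
# Each commit is a list of ("add" | "remove", file_name) actions.
commits = [
    [("add", "part-0.parquet")],                                # version 0: initial write
    [("remove", "part-0.parquet"), ("add", "part-1.parquet")],  # version 1: overwrite
    [("add", "part-2.parquet")],                                # version 2: append
]


def files_at_version(log, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for actions in log[: version + 1]:
        for action, file_name in actions:
            if action == "add":
                live.add(file_name)
            else:
                live.discard(file_name)
    return live


print(sorted(files_at_version(commits, 0)))  # ['part-0.parquet']
print(sorted(files_at_version(commits, 2)))  # ['part-1.parquet', 'part-2.parquet']
```

Because old files are only marked as removed rather than erased, reading version 0 is still possible after version 1 has replaced it, which is the idea behind time travel.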

💡 In Azure Databricks:

- All new tables are Delta tables by default.

- Delta tables can be read and modified with SQL commands or with the PySpark DataFrame API.

____

#### Upload CSV to DBFS:

DBFS (Databricks File System) is a distributed file system in Databricks.
We upload our CSV file here so Spark can access it.

- Upload the file with /FileStore/tables/ as the destination folder, so that it is available at /FileStore/tables/salary.csv.

_____

#### Define Schema & Read CSV:

- We define the schema explicitly instead of relying on type inference. This fixes the column types up front and avoids CANNOT_DETERMINE_TYPE errors.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema for salary.csv
schema = StructType([
    StructField("name", StringType(), True),
    StructField("id", IntegerType(), True),
    StructField("age", IntegerType(), True),
    StructField("department", StringType(), True),
    StructField("salary", IntegerType(), True)
])

In [0]:
# Read CSV into DataFrame
df = spark.read.option("header", True) \
               .schema(schema) \
               .csv("/FileStore/tables/salary.csv")

display(df)

name,id,age,department,salary
user1,1,25,Jr manager,98000
user2,2,30,sr manager,100000
user3,6,35,sr manager,100000
user4,4,32,head,70000
user5,1,45,Jr manager,60000
user6,6,47,head2,45000
user7,5,21,worker,25000
user8,1,22,Jr manager,50000
user9,10,54,lead,45000
user10,59,52,lead2,50000
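
The same schema can also be written as a DDL string, which `DataFrameReader.schema` accepts in place of a `StructType`. The sketch below assumes the notebook's `spark` session is passed in. `SALARY_DDL` and `read_salary_csv` are hypothetical names, and the `FAILFAST` mode is an optional choice that is not part of the original notebook.

```python
# Hypothetical names: SALARY_DDL and read_salary_csv are for illustration only.
SALARY_DDL = "name STRING, id INT, age INT, department STRING, salary INT"


def read_salary_csv(spark, path="/FileStore/tables/salary.csv"):
    """Read the CSV with an explicit DDL schema.

    mode=FAILFAST makes Spark raise an error on malformed rows
    instead of silently loading them as nulls.
    """
    return (
        spark.read
        .option("header", True)
        .option("mode", "FAILFAST")
        .schema(SALARY_DDL)
        .csv(path)
    )
```

In a notebook this would be called as `df = read_salary_csv(spark)`. Spark evaluates lazily, so a malformed row only surfaces when an action such as `display(df)` runs.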


#### Write Data in Delta Format:
We store the data in Delta format so that the table gains ACID transactions and time travel.


In [0]:
# Write DataFrame to Delta format
df.write.format("delta").mode("overwrite").save("/FileStore/tables/delta_train/")

In [0]:
# Verify files in Delta location
display(dbutils.fs.ls("/FileStore/tables/delta_train/"))

path,name,size,modificationTime
dbfs:/FileStore/tables/delta_train/_delta_log/,_delta_log/,0,1754894574000
dbfs:/FileStore/tables/delta_train/part-00000-6e97caf1-014e-4488-b749-c90a6c61116a.c000.snappy.parquet,part-00000-6e97caf1-014e-4488-b749-c90a6c61116a.c000.snappy.parquet,1812,1754895491000
dbfs:/FileStore/tables/delta_train/part-00000-ad127322-7a84-42ec-ba20-d8ffa4eece2d.c000.snappy.parquet,part-00000-ad127322-7a84-42ec-ba20-d8ffa4eece2d.c000.snappy.parquet,1812,1754895036000
dbfs:/FileStore/tables/delta_train/part-00000-b0413123-7afc-4245-b56b-f97fc77e4919.c000.snappy.parquet,part-00000-b0413123-7afc-4245-b56b-f97fc77e4919.c000.snappy.parquet,1812,1754894576000
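
The listing shows three Parquet files because the overwrite was run three times (versions 0 to 2 in the table history). Each overwrite writes a new file and records the previous one as removed in the log instead of deleting it, which keeps older versions readable. The commits themselves live in `_delta_log` as JSON files named after the zero-padded, 20-digit version number. A minimal sketch of that naming rule, with `commit_file_path` as a hypothetical helper:

```python
def commit_file_path(table_path, version):
    """Return the path of the JSON commit file for a given table version."""
    return f"{table_path.rstrip('/')}/_delta_log/{version:020d}.json"


print(commit_file_path("/FileStore/tables/delta_train/", 0))
# /FileStore/tables/delta_train/_delta_log/00000000000000000000.json
```

In a notebook, passing such a path to `spark.read.json(...)` or `dbutils.fs.head(...)` shows the add and remove actions recorded for that version.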


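#### Create Database & External Delta Table:
We register the Delta files as a table inside a database. Because the table is created with a LOCATION, it is an external table that points at the existing files.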

In [0]:
# Create database if not exists
spark.sql("CREATE DATABASE IF NOT EXISTS delta_training")

DataFrame[]

In [0]:
# Create external Delta table pointing to our Delta data
spark.sql("""
CREATE TABLE IF NOT EXISTS delta_training.emp_file
USING DELTA
LOCATION '/FileStore/tables/delta_train/'
""")

DataFrame[]

In [0]:
# List tables in the database
spark.sql("SHOW TABLES IN delta_training").display()

database,tableName,isTemporary
delta_training,emp_file,False
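
Dropping an external table removes only the metastore entry; the files under its location stay in place. One way to confirm which format and location a table points at is `DESCRIBE DETAIL`. The sketch below assumes the notebook's `spark` session is passed in, and `table_format_and_location` is a hypothetical helper name.

```python
def table_format_and_location(spark, table_name):
    """Return (format, location) as reported by DESCRIBE DETAIL for a table."""
    row = spark.sql(f"DESCRIBE DETAIL {table_name}").first()
    return row["format"], row["location"]
```

Calling `table_format_and_location(spark, "delta_training.emp_file")` in the notebook should report the `delta` format together with the path given in the `CREATE TABLE` statement.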


#### Query the Delta Table:
Now we can query data using SQL.

In [0]:
# Verify table
spark.sql("SELECT * FROM delta_training.emp_file").display()

name,id,age,department,salary
user1,1,25,Jr manager,98000
user2,2,30,sr manager,100000
user3,6,35,sr manager,100000
user4,4,32,head,70000
user5,1,45,Jr manager,60000
user6,6,47,head2,45000
user7,5,21,worker,25000
user8,1,22,Jr manager,50000
user9,10,54,lead,45000
user10,59,52,lead2,50000


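#### Update Records:
Delta Lake supports in-place updates, which plain Parquet files do not: Parquet files are immutable, so changing a row means rewriting the file yourself.

- Example: Increase the salary by 5000 for every employee whose department is Jr manager.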



In [0]:
from delta.tables import DeltaTable

# Load Delta table object
deltaTable = DeltaTable.forPath(spark, "/FileStore/tables/delta_train/")

# Update conditionally
deltaTable.update(
    condition="department = 'Jr manager'",
    set={"salary": "salary + 5000"}
)

# Verify update
display(spark.sql("SELECT * FROM delta_training.emp_file"))

name,id,age,department,salary
user2,2,30,sr manager,100000
user3,6,35,sr manager,100000
user4,4,32,head,70000
user6,6,47,head2,45000
user7,5,21,worker,25000
user9,10,54,lead,45000
user10,59,52,lead2,50000
user11,6,25,head2,50000
user12,2,27,sr manager,70000
user13,59,54,lead2,45000
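
The same update can also be expressed as a SQL statement. To make the rule concrete, the sketch below additionally applies it in plain Python to three rows taken from the CSV shown earlier. The variable names are illustrative, and the plain-Python part only mimics the rule; it does not touch the Delta table.

```python
# SQL form of the DeltaTable.update call (run with spark.sql(UPDATE_SQL) in a notebook).
UPDATE_SQL = """
UPDATE delta_training.emp_file
SET salary = salary + 5000
WHERE department = 'Jr manager'
"""

# Plain-Python illustration of the same rule on a few rows from the input CSV.
rows = [
    {"name": "user1", "department": "Jr manager", "salary": 98000},
    {"name": "user2", "department": "sr manager", "salary": 100000},
    {"name": "user8", "department": "Jr manager", "salary": 50000},
]

# Only rows whose department matches the condition get the raise.
updated = [
    {**r, "salary": r["salary"] + 5000} if r["department"] == "Jr manager" else r
    for r in rows
]

for r in updated:
    print(r["name"], r["salary"])
# user1 103000
# user2 100000
# user8 55000
```

Note that the condition is an exact, case-sensitive string match, so rows with the department sr manager are left unchanged.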


#### Delete Records:
We can delete rows from a Delta table.
- Example: Remove employees older than 40.

In [0]:
deltaTable.delete("age > 40")

# Verify delete
display(spark.sql("SELECT * FROM delta_training.emp_file"))

name,id,age,department,salary
user1,1,25,Jr manager,103000
user8,1,22,Jr manager,55000
user15,1,32,Jr manager,55000
user2,2,30,sr manager,100000
user3,6,35,sr manager,100000
user4,4,32,head,70000
user7,5,21,worker,25000
user11,6,25,head2,50000
user12,2,27,sr manager,70000
user14,2,25,sr manager,70000


#### View Table History:
Delta Lake stores a transaction log (_delta_log) that records every change.
- DESCRIBE HISTORY lets us see all operations.

In [0]:
display(spark.sql("DESCRIBE HISTORY delta_training.emp_file"))

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
6,2025-08-11T07:13:45Z,141850294526812,azuser4016_mml.local@techademy.com,OPTIMIZE,"Map(predicate -> [], auto -> true, clusterBy -> [], zOrderBy -> [], batchId -> 0)",,List(3302580720025900),0806-091707-wzczz7ob,5.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 1, numRemovedBytes -> 1814, p25FileSize -> 1712, numDeletionVectorsRemoved -> 1, minFileSize -> 1712, numAddedFiles -> 1, maxFileSize -> 1712, p75FileSize -> 1712, p50FileSize -> 1712, numAddedBytes -> 1712)",,Databricks-Runtime/16.4.x-photon-scala2.12
5,2025-08-11T07:13:44Z,141850294526812,azuser4016_mml.local@techademy.com,DELETE,"Map(predicate -> [""(age#1092 > 40)""])",,List(3302580720025900),0806-091707-wzczz7ob,4.0,WriteSerializable,False,"Map(numRemovedFiles -> 0, numRemovedBytes -> 0, numCopiedRows -> 0, numDeletionVectorsAdded -> 1, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 788, numDeletionVectorsUpdated -> 0, numDeletedRows -> 6, scanTimeMs -> 492, numAddedFiles -> 0, numAddedBytes -> 0, rewriteTimeMs -> 296)",,Databricks-Runtime/16.4.x-photon-scala2.12
4,2025-08-11T07:11:25Z,141850294526812,azuser4016_mml.local@techademy.com,OPTIMIZE,"Map(predicate -> [], auto -> true, clusterBy -> [], zOrderBy -> [], batchId -> 0)",,List(3302580720025900),0806-091707-wzczz7ob,3.0,SnapshotIsolation,False,"Map(numRemovedFiles -> 2, numRemovedBytes -> 3307, p25FileSize -> 1814, numDeletionVectorsRemoved -> 1, minFileSize -> 1814, numAddedFiles -> 1, maxFileSize -> 1814, p75FileSize -> 1814, p50FileSize -> 1814, numAddedBytes -> 1814)",,Databricks-Runtime/16.4.x-photon-scala2.12
3,2025-08-11T07:11:23Z,141850294526812,azuser4016_mml.local@techademy.com,UPDATE,"Map(predicate -> [""(department#1093 = Jr manager)""])",,List(3302580720025900),0806-091707-wzczz7ob,2.0,WriteSerializable,False,"Map(numRemovedFiles -> 0, numRemovedBytes -> 0, numCopiedRows -> 0, numDeletionVectorsAdded -> 1, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 4064, numDeletionVectorsUpdated -> 0, scanTimeMs -> 2527, numAddedFiles -> 1, numUpdatedRows -> 4, numAddedBytes -> 1495, rewriteTimeMs -> 1519)",,Databricks-Runtime/16.4.x-photon-scala2.12
2,2025-08-11T06:58:11Z,141850294526812,azuser4016_mml.local@techademy.com,WRITE,"Map(mode -> Overwrite, statsOnLoad -> false, partitionBy -> [])",,List(3302580720025900),0806-091707-wzczz7ob,1.0,WriteSerializable,False,"Map(numFiles -> 1, numRemovedFiles -> 1, numRemovedBytes -> 1812, numOutputRows -> 20, numOutputBytes -> 1812)",,Databricks-Runtime/16.4.x-photon-scala2.12
1,2025-08-11T06:50:37Z,141850294526812,azuser4016_mml.local@techademy.com,WRITE,"Map(mode -> Overwrite, statsOnLoad -> false, partitionBy -> [])",,List(3302580720025900),0806-091707-wzczz7ob,0.0,WriteSerializable,False,"Map(numFiles -> 1, numRemovedFiles -> 1, numRemovedBytes -> 1812, numOutputRows -> 20, numOutputBytes -> 1812)",,Databricks-Runtime/16.4.x-photon-scala2.12
0,2025-08-11T06:42:59Z,141850294526812,azuser4016_mml.local@techademy.com,WRITE,"Map(mode -> Overwrite, statsOnLoad -> false, partitionBy -> [])",,List(3302580720025900),0806-091707-wzczz7ob,,WriteSerializable,False,"Map(numFiles -> 1, numRemovedFiles -> 0, numRemovedBytes -> 0, numOutputRows -> 20, numOutputBytes -> 1812)",,Databricks-Runtime/16.4.x-photon-scala2.12
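
The full history row is wide, so selecting a few columns makes it easier to read. A minimal sketch, assuming the notebook's `spark` session is passed in and using the hypothetical helper name `history_summary`:

```python
def history_summary(spark, table_name):
    """Return only version, timestamp, operation, and metrics from the table history."""
    return spark.sql(f"DESCRIBE HISTORY {table_name}").select(
        "version", "timestamp", "operation", "operationMetrics"
    )
```

In the notebook, `display(history_summary(spark, "delta_training.emp_file"))` would show one compact row per version, for example the three WRITE operations followed by UPDATE, OPTIMIZE, DELETE, and OPTIMIZE in the output above.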


#### Time Travel (Query Older Versions):
We can query previous versions of the table by version number or timestamp.

In [0]:
%sql
-- Example using version number
SELECT * FROM delta_training.emp_file VERSION AS OF 0;

name,id,age,department,salary
user1,1,25,Jr manager,98000
user2,2,30,sr manager,100000
user3,6,35,sr manager,100000
user4,4,32,head,70000
user5,1,45,Jr manager,60000
user6,6,47,head2,45000
user7,5,21,worker,25000
user8,1,22,Jr manager,50000
user9,10,54,lead,45000
user10,59,52,lead2,50000


In [0]:
%sql
-- Example using timestamp
SELECT * FROM delta_training.emp_file TIMESTAMP AS OF '2025-08-11 06:50:00';

name,id,age,department,salary
user1,1,25,Jr manager,98000
user2,2,30,sr manager,100000
user3,6,35,sr manager,100000
user4,4,32,head,70000
user5,1,45,Jr manager,60000
user6,6,47,head2,45000
user7,5,21,worker,25000
user8,1,22,Jr manager,50000
user9,10,54,lead,45000
user10,59,52,lead2,50000
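
Both queries return the same rows because a timestamp query resolves to the latest version committed at or before that timestamp. The sketch below reproduces that rule in plain Python, using commit times copied from the history output and assuming the session time zone is UTC. It also shows the DataFrame reader form of `VERSION AS OF`. The names `history`, `version_as_of`, and `read_version` are made up for illustration.

```python
from datetime import datetime

# (version, commit timestamp) pairs copied from the DESCRIBE HISTORY output.
# The trailing "Z" is dropped; all values are treated as UTC.
history = [
    (0, "2025-08-11T06:42:59"),
    (1, "2025-08-11T06:50:37"),
    (2, "2025-08-11T06:58:11"),
    (3, "2025-08-11T07:11:23"),
]


def version_as_of(history, timestamp):
    """Return the latest version committed at or before `timestamp`, or None."""
    target = datetime.fromisoformat(timestamp)
    eligible = [v for v, ts in history if datetime.fromisoformat(ts) <= target]
    return max(eligible) if eligible else None


def read_version(spark, path, version):
    """DataFrame API form of VERSION AS OF: load one version of a Delta path."""
    return spark.read.format("delta").option("versionAsOf", version).load(path)


print(version_as_of(history, "2025-08-11 06:50:00"))  # 0
print(version_as_of(history, "2025-08-11 07:00:00"))  # 2
```

The timestamp 06:50:00 falls between the commits of version 0 (06:42:59) and version 1 (06:50:37), so it resolves to version 0. The reader also accepts a `timestampAsOf` option for the timestamp form.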


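#### Optimize:
OPTIMIZE compacts small files to improve performance.

- Combines many small Parquet files into fewer, larger ones.

- Improves query performance because fewer files have to be opened and scanned.

- Does not delete historical versions of the table: the files it replaces are only marked as removed in the transaction log, so time travel still works.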

In [0]:
%sql
-- Compact small files
OPTIMIZE delta_training.emp_file;

path,metrics
dbfs:/FileStore/tables/delta_train,"List(0, 0, List(null, null, 0.0, 0, 0), List(null, null, 0.0, 0, 0), 0, null, null, 0, 0, 1, 1, true, 0, 0, 1754896776967, 1754896777407, 4, 0, null, List(0, 0), null, 5, 5, 0, 0, null)"
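
Compaction can also be triggered from Python. The sketch below assumes a runtime where `DeltaTable.optimize()` is available (Delta Lake 2.0 or later, or a recent Databricks runtime) and takes an existing `DeltaTable` object. `compact_table` is a hypothetical helper name.

```python
def compact_table(delta_table):
    """Run file compaction on a DeltaTable and return the metrics DataFrame."""
    return delta_table.optimize().executeCompaction()
```

In the notebook this could be called as `compact_table(DeltaTable.forName(spark, "delta_training.emp_file"))`, which corresponds to the `OPTIMIZE` statement in the cell above.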
