# Delta Lake Deep Dive – Storage, ACID, and Optimization

This document explains **how Delta Lake works internally**, how data is stored and versioned, and how performance optimizations such as **predicate pushdown, OPTIMIZE, ZORDER, and VACUUM** affect physical data layout.

The walkthrough uses a **sales dataset** and demonstrates reading, writing, updating, optimizing, and organizing data into the appropriate lakehouse layer.


## 1. Sales Dataset

The dataset represents transactional sales data generated by an online retail system. Each record represents a single order line item.

**Columns**
- customer_id: Unique customer identifier
- order_id: Order identifier
- country: Customer country
- product: Product category
- quantity: Units sold
- amount: Total amount for the line item


In [None]:
sales_data = [
    (1, "ORD-001", "India", "Laptop", 1, 75000),
    (2, "ORD-002", "USA", "Phone", 2, 40000),
    (3, "ORD-003", "India", "Tablet", 1, 30000),
    (4, "ORD-004", "UK", "Laptop", 1, 72000),
    (5, "ORD-005", "India", "Phone", 3, 60000)
]

columns = ["customer_id", "order_id", "country", "product", "quantity", "amount"]
df_sales = spark.createDataFrame(sales_data, columns)
df_sales.show()

## 2. Writing Data as a Delta Table (Bronze Layer)

When data is written using the Delta format, two things happen:
1. Data is stored as Parquet files
2. A transaction log is created to track table state


In [None]:
df_sales.write \
  .format("delta") \
  .mode("overwrite") \
  .saveAsTable("training_catalog.bronze.sales_delta")

## 3. Delta Transaction Log

Delta Lake maintains a directory named `_delta_log` at the root of every Delta table. This directory contains **ordered commit files** that define the table state.

### JSON Commit Files
Each JSON file represents a **single atomic transaction**. The file records:
- Files added to the table
- Files removed from the table
- Schema and metadata changes
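
To make the list above concrete, here is a minimal sketch of reading one commit file. A commit is newline-delimited JSON with one action object per line, keyed by action type (`add`, `remove`, `metaData`, `commitInfo`). The file contents and paths below are hypothetical and heavily trimmed; real commits carry many more fields such as sizes, timestamps, and statistics.

```python
import json

# Hypothetical, trimmed contents of a single commit file (assumption:
# illustration only, not copied from a real table).
commit_text = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"metaData": {"id": "example-table-id"}}),
    json.dumps({"add": {"path": "part-00000.parquet"}}),
    json.dumps({"remove": {"path": "part-00001.parquet"}}),
])


def summarize_commit(text):
    """Group the actions of one commit by action type."""
    summary = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)  # one JSON object per line
        for kind, payload in action.items():
            summary.setdefault(kind, []).append(payload)
    return summary


summary = summarize_commit(commit_text)
added = [a["path"] for a in summary.get("add", [])]
removed = [r["path"] for r in summary.get("remove", [])]
print("added:", added)
print("removed:", removed)
```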

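Delta Lake uses **optimistic concurrency control**: writers prepare their changes without locking the table, and conflicts are detected only when they try to commit, so concurrent writers cannot corrupt the table.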
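
The core of this scheme can be sketched with plain files: each writer tries to create the next numbered commit file, and only one writer can succeed for a given version. The sketch below uses a local temporary directory and Python's exclusive-create mode as a stand-in for the storage layer. The helper names are hypothetical, and the retry logic is simplified: Delta Lake also checks whether the winning commit actually conflicts with the loser's changes before retrying.

```python
import os
import tempfile


def commit_path(log_dir, version):
    # Commit files are named with a zero-padded 20-digit version number.
    return os.path.join(log_dir, f"{version:020d}.json")


def latest_version(log_dir):
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=-1)


def try_commit(log_dir, read_version, payload):
    """Try to publish version read_version + 1; return True on success."""
    target = commit_path(log_dir, read_version + 1)
    try:
        # Mode "x" fails if the file already exists, so only one writer
        # can ever create a given version.
        with open(target, "x") as handle:
            handle.write(payload)
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, payload, max_attempts=5):
    for _ in range(max_attempts):
        read_version = latest_version(log_dir)
        if try_commit(log_dir, read_version, payload):
            return read_version + 1
    raise RuntimeError("too many conflicting commits")


log_dir = tempfile.mkdtemp()
# Two writers both read the empty log (version -1) and race for version 0.
first = try_commit(log_dir, -1, '{"writer": "A"}')
second = try_commit(log_dir, -1, '{"writer": "B"}')
# Writer B lost the race, so it re-reads the log and commits on top of version 0.
retry_version = commit_with_retry(log_dir, '{"writer": "B"}')
print(first, second, retry_version)
```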

**How ACID works:**

Delta Lake records **every change** in `_delta_log`.
- JSON files → individual commits
- Parquet checkpoint files → periodic snapshots of the table state, so readers do not have to replay every commit

In [None]:
%fs
ls /Volumes/training_catalog/bronze/sales_delta/_delta_log

### ACID Guarantees

- **Atomicity**: Each transaction either commits in full or has no effect
- **Consistency**: Schema and constraints are enforced on every write
- **Isolation**: Readers see a consistent snapshot, even while writes are in progress
- **Durability**: Committed changes are persisted in the transaction log


## 4. Data Modifications (INSERT, UPDATE, DELETE)

Delta Lake never modifies Parquet files in place. Instead, it writes new files and updates the transaction log to reflect changes.
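
A simplified copy-on-write model shows what this means for an UPDATE. In the sketch below, a dictionary stands in for storage, and the function and file names are hypothetical. Files containing matching rows are rewritten under new names, and the commit records a `remove` for the old file plus an `add` for the new one. The old file stays on storage, which is what keeps earlier versions readable.

```python
# Hypothetical in-memory "storage": file name -> rows of (order_id, country, amount).
files = {
    "part-0.parquet": [("ORD-001", "India", 75000), ("ORD-002", "USA", 40000)],
    "part-1.parquet": [("ORD-004", "UK", 72000)],
}


def copy_on_write_update(files, predicate, transform, new_name):
    """Return (new_files, actions) for an UPDATE without mutating any file."""
    new_files = dict(files)  # old files are kept, never edited in place
    actions = []
    counter = 0
    for path, rows in files.items():
        if not any(predicate(row) for row in rows):
            continue  # untouched files stay referenced as they are
        rewritten = [transform(row) if predicate(row) else row for row in rows]
        new_path = new_name(counter)
        counter += 1
        new_files[new_path] = rewritten
        actions.append({"remove": {"path": path}})
        actions.append({"add": {"path": new_path}})
    return new_files, actions


# Mirror of: UPDATE ... SET amount = amount + 5000 WHERE country = 'India'
new_files, actions = copy_on_write_update(
    files,
    predicate=lambda row: row[1] == "India",
    transform=lambda row: (row[0], row[1], row[2] + 5000),
    new_name=lambda i: f"part-{i}-v2.parquet",
)
print(actions)
```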

In [None]:
%sql
INSERT INTO training_catalog.bronze.sales_delta VALUES
(6, 'ORD-006', 'USA', 'Laptop', 1, 80000)

In [None]:
%sql
UPDATE training_catalog.bronze.sales_delta
SET amount = amount + 5000
WHERE country = 'India'

In [None]:
%sql
DELETE FROM training_catalog.bronze.sales_delta
WHERE order_id = 'ORD-002'

## 5. Time Travel

Delta Lake reconstructs table state by replaying the transaction log. This allows querying historical versions of the data, which is useful for auditing and debugging.

**Use cases:**
- Debug broken pipelines
- Re-run ML training
- Regulatory audits
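
Replaying the log up to a chosen version can be sketched in a few lines. The commit history below is hypothetical and reduced to `add` and `remove` actions; checkpoints and metadata are left out of this model.

```python
# Hypothetical commit history: one list of actions per table version.
commits = [
    [{"add": {"path": "a.parquet"}}, {"add": {"path": "b.parquet"}}],      # v0: initial write
    [{"add": {"path": "c.parquet"}}],                                      # v1: INSERT
    [{"remove": {"path": "a.parquet"}}, {"add": {"path": "a2.parquet"}}],  # v2: UPDATE
]


def snapshot(commits, version):
    """Return the set of data files that make up the table at `version`."""
    if not 0 <= version < len(commits):
        raise ValueError(f"version {version} does not exist")
    active = set()
    for actions in commits[: version + 1]:
        for action in actions:
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return active


v0_files = sorted(snapshot(commits, 0))
v2_files = sorted(snapshot(commits, 2))
print(v0_files)
print(v2_files)
```

On the table itself, the Delta SQL syntax `SELECT * FROM training_catalog.bronze.sales_delta VERSION AS OF 0` reads an earlier version, using the version numbers listed by `DESCRIBE HISTORY`.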

In [None]:
%sql
DESCRIBE HISTORY training_catalog.bronze.sales_delta

## 6. Predicate Pushdown and Data Skipping

Predicate pushdown allows query filters to be applied at the storage layer. Delta Lake stores per-file, column-level statistics (such as min and max values) in the transaction log.

During query execution, files whose statistics show they cannot contain matching rows are skipped entirely, reducing I/O.
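
The skipping decision itself is a simple range check. The sketch below assumes hypothetical min/max statistics for the `amount` column and a filter of the form `amount > 50000`; a file is read only if its maximum value could satisfy the filter.

```python
# Hypothetical per-file statistics for the `amount` column.
file_stats = {
    "part-0.parquet": {"min": 30000, "max": 45000},
    "part-1.parquet": {"min": 60000, "max": 80000},
    "part-2.parquet": {"min": 40000, "max": 72000},
}


def files_to_scan(file_stats, lower_bound):
    """Keep only files that might contain rows with amount > lower_bound."""
    return [
        path
        for path, stats in file_stats.items()
        if stats["max"] > lower_bound
    ]


selected = files_to_scan(file_stats, 50000)
print(selected)  # part-0.parquet is skipped: its max (45000) is below the bound
```

Note that `part-2.parquet` must still be scanned even though some of its rows fall below the bound; statistics can rule files out, but they cannot guarantee that every row in a kept file matches.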


## 7. Small Files Problem

Frequent writes generate many small Parquet files. Each file introduces metadata and I/O overhead, which degrades query performance over time.

In [None]:
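# Repartition into 20 partitions to force many small output files
# (Spark writes one file per non-empty partition).
df_sales.repartition(20) \
  .write \
  .format("delta") \
  .mode("overwrite") \
  .saveAsTable("training_catalog.bronze.sales_small_files")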

## 8. OPTIMIZE

OPTIMIZE rewrites many small files into fewer, larger files. The transaction log is updated to reference the new files while preserving historical versions.

**What OPTIMIZE does:**
- Compacts small files
- Improves scan performance
- Reduces metadata overhead
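
Compaction can be thought of as bin packing: small files are grouped until a target size is reached, and each group is rewritten as one file. The greedy planner below is a simplified model, not Delta's actual algorithm; the file names, sizes, and target size are hypothetical.

```python
def plan_compaction(file_sizes, target_size):
    """Greedily group small files into bins of at most target_size."""
    bins, current, current_size = [], [], 0
    for path, size in sorted(file_sizes.items()):
        if size >= target_size:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        bins.append(current)
    # A bin with a single file would be rewritten for no gain, so drop it.
    return [group for group in bins if len(group) > 1]


# Twenty small files of 10 units each, plus one file that is already large.
file_sizes = {f"part-{i:02d}.parquet": 10 for i in range(20)}
file_sizes["big.parquet"] = 500

plan = plan_compaction(file_sizes, target_size=100)
print(len(plan), [len(group) for group in plan])
```

Each group in the plan corresponds to one new file in the commit: an `add` for the compacted file and a `remove` for every small file it replaces.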

In [None]:
%sql
OPTIMIZE training_catalog.bronze.sales_small_files

## 9. Z-ORDER

Z-Ordering reorganizes data so that related values are stored close together. This improves data skipping when filters are applied on Z-ordered columns.

**What ZORDER does:**
- Physically co-locates related data
- Enables efficient predicate pruning
- Reduces file scans
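
The idea behind Z-ordering is a space-filling curve: the bits of several column values are interleaved into one sort key, so rows that are close in all columns end up close in the sort order. The sketch below assumes two columns that are already encoded as small integers (for example, codes for `country` and `product`); Delta's actual implementation differs in detail.

```python
def interleave_bits(x, y, bits=4):
    """Interleave the low `bits` bits of x and y into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z


# All combinations of two integer-coded columns on a 4 x 4 grid.
points = [(x, y) for x in range(4) for y in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))

# Write the sorted rows into "files" of four rows each.
files = [ordered[i:i + 4] for i in range(0, len(ordered), 4)]

# Per-file min/max for both columns, as used by data skipping.
ranges = [
    {
        "x": (min(p[0] for p in f), max(p[0] for p in f)),
        "y": (min(p[1] for p in f), max(p[1] for p in f)),
    }
    for f in files
]
for r in ranges:
    print(r)
```

Every file covers a narrow range in both columns, so a filter on either column can skip half of the files. Sorting by `x` alone would give narrow ranges for `x` but each file would span the full range of `y`.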

In [None]:
%sql
OPTIMIZE training_catalog.bronze.sales_small_files
ZORDER BY (country, product)

## 10. VACUUM

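VACUUM removes data files that are no longer referenced by the table and are older than the retention period. This frees storage but limits time travel, because versions that depend on the deleted files can no longer be read.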

**What VACUUM does:**
- Removes obsolete data files
- Frees storage
- Limits time travel

⚠️ Never reduce retention blindly in production.
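
The selection rule can be modelled as two conditions: the file is not referenced by the current table state, and it is older than the retention window. The sketch below uses a fixed, hypothetical "current time" and treats a last-modified timestamp as a stand-in for the age Delta tracks; file names are invented for illustration.

```python
from datetime import datetime, timedelta


def vacuum_candidates(all_files, referenced, now, retention_hours=168):
    """Return files that are unreferenced and older than the retention window."""
    cutoff = now - timedelta(hours=retention_hours)
    return sorted(
        path
        for path, modified in all_files.items()
        if path not in referenced and modified < cutoff
    )


now = datetime(2024, 1, 31, 12, 0)  # fixed, hypothetical "current time"
all_files = {
    "old-unreferenced.parquet": now - timedelta(days=10),
    "recent-unreferenced.parquet": now - timedelta(days=2),
    "old-referenced.parquet": now - timedelta(days=30),
}
referenced = {"old-referenced.parquet"}

to_delete = vacuum_candidates(all_files, referenced, now)
print(to_delete)
```

With the 168-hour (7-day) window, the file removed two days ago survives, so versions from the last week remain queryable. Shrinking the window would delete it and break time travel to those versions.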


In [None]:
%sql
VACUUM training_catalog.bronze.sales_small_files RETAIN 168 HOURS

## 11. Writing Data to the Silver Layer

The Silver layer contains cleansed, optimized, and query-efficient data.

In [None]:
%sql
CREATE TABLE training_catalog.silver.sales_silver
USING DELTA
AS SELECT * FROM training_catalog.bronze.sales_small_files