### Delta Lake

Delta Lake is an open-source storage layer that brings reliability to data lakes. It sits on top of your existing cloud storage (such as S3 or ADLS) and adds the structure and performance features of a data warehouse while keeping the flexibility of a data lake.

**1. Why do we need it?**
- In a standard data lake, you are just saving raw files (CSV, Parquet, JSON). This causes three major problems:
  - **Failed Jobs:** If a job crashes halfway through, you end up with "garbage" partial data.
  - **No Updates:** You can't easily run an `UPDATE` or `DELETE` on a raw Parquet file.
  - **Data Quality:** There is no way to stop "bad data" from being written into your lake.

**2. The Core Features**
- Delta Lake solves these problems using four main pillars:
  - **ACID Transactions:** Atomicity, Consistency, Isolation, Durability. A job either succeeds completely or fails completely, so there is no more partial data.
  - **Scalable Metadata:** It uses a Transaction Log (the `_delta_log` folder) to keep track of every file, so Spark can find a table's files without a slow directory listing.
  - **Time Travel:** Because Delta keeps a history of changes, you can query your data as it existed at a specific version or point in time.
    - Example: `SELECT * FROM my_table VERSION AS OF 5`
  - **Schema Enforcement:** It prevents "dirty data" from entering your table. If you try to write a String into an Integer column, Delta will block it.

**3. The Medallion Architecture**
- Databricks recommends organizing Delta tables into three layers ("zones"), with data quality improving at each step:
  - **Bronze (Raw):** The landing zone. Data is kept in its rawest form.
  - **Silver (Cleaned):** Data is filtered, joined, and standardized. Great for Data Scientists.
  - **Gold (Aggregated):** Final, business-ready tables (e.g., "Monthly Revenue"). Ready for Power BI or Tableau.
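
The flow between the three layers can be pictured with a small, framework-free sketch. Plain Python lists stand in for Delta tables here, and the record fields (`order_id`, `amount`, `month`) and function names are hypothetical, chosen only for illustration. In a real pipeline each step would read from and write to a Delta table.

```python
from collections import defaultdict

# Bronze: raw records exactly as they arrived (duplicates and bad values included).
bronze = [
    {"order_id": "1", "amount": "10.50", "month": "2019-10"},
    {"order_id": "1", "amount": "10.50", "month": "2019-10"},  # duplicate
    {"order_id": "2", "amount": "oops", "month": "2019-10"},   # unparseable amount
    {"order_id": "3", "amount": "4.50", "month": "2019-11"},
]

def to_silver(rows):
    """Clean and standardize: cast types, drop bad rows, de-duplicate by order_id."""
    seen, cleaned = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows that fail validation
        if row["order_id"] in seen:
            continue  # skip duplicates
        seen.add(row["order_id"])
        cleaned.append(
            {"order_id": int(row["order_id"]), "amount": amount, "month": row["month"]}
        )
    return cleaned

def to_gold(rows):
    """Aggregate to a business-level table: revenue per month."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["month"]] += row["amount"]
    return dict(revenue)

silver = to_silver(bronze)
gold = to_gold(silver)
```

Here `silver` keeps two valid, de-duplicated orders, and `gold` holds one revenue figure per month.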

**4. Simple Syntax**
- The best part is that Delta uses standard SQL/PySpark syntax. For reads and writes, you just change the format from `parquet` to `delta`.
```
# Writing data as Delta (assumes an existing DataFrame `df`)
df.write.format("delta").save("/mnt/delta/orders")

# Updating rows in place (something you can't do on plain Parquet files)
from delta.tables import DeltaTable
deltaTable = DeltaTable.forPath(spark, "/mnt/delta/orders")
# The values in the set dict are SQL expressions, hence the inner quotes around 'shipped'
deltaTable.update("id = 123", {"status": "'shipped'"})
```

### ACID

#### The Four Pillars of ACID

**1. Atomicity (All or Nothing)**
- Atomicity ensures that a transaction is "atomic": it either succeeds completely or fails completely.
- The Benefit: If your Databricks cluster terminates in the middle of a 10 GB write, the new files are never recorded in the transaction log, so readers ignore them. To the user, it looks as if the write never started, and the table stays clean. (The orphaned files can be cleaned up later with `VACUUM`.)

**2. Consistency (Rules are Followed)**
- Consistency ensures that the data moves from one valid state to another, following all defined rules (like Schema Enforcement).
- The Benefit: If you try to write a "String" into a column defined as an "Integer," Delta Lake will block the transaction. This prevents "data corruption" where different rows have different data types.

**3. Isolation (No Interference)**
- Isolation ensures that multiple users can read and write to the same table at the exact same time without seeing each other's partial work.
- The Benefit: A Data Analyst can run a report on your Sales table while a Data Engineer is currently appending new November data. The Analyst will see a "consistent snapshot" of the data as it was before the update started.

**4. Durability (It Stays Saved)**
- Durability guarantees that once a transaction is committed, it remains committed, even in the event of a system failure or power loss.
- The Benefit: In Delta Lake, a transaction is only "committed" once it is written to the Transaction Log (`_delta_log`). Once it's in that log, it is permanent.
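
One way to picture the commit step is as an exclusive file creation: whoever manages to create the commit file for a given version wins, and everyone else must retry. The sketch below is a simplified model under stated assumptions, not Delta's actual implementation. A local temporary directory stands in for cloud storage, `try_commit` is a hypothetical helper, and real Delta relies on storage-specific mechanisms and conflict checks that are omitted here.

```python
import json
import os
import tempfile

def try_commit(log_dir, version, actions):
    """Try to publish `actions` as commit number `version`.

    Returns True if this writer created the commit file, False if another
    writer already claimed that version.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" fails if the file already exists, so only one writer can win.
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False

log_dir = os.path.join(tempfile.mkdtemp(), "_delta_log")
os.makedirs(log_dir)

# Two writers race to publish version 1; only the first one succeeds.
first = try_commit(log_dir, 1, [{"add": {"path": "part-a.parquet"}}])
second = try_commit(log_dir, 1, [{"add": {"path": "part-b.parquet"}}])

# The loser re-reads the log and retries as the next version.
retry = try_commit(log_dir, 2, [{"add": {"path": "part-b.parquet"}}])
```

Because a commit is a single file that either exists or does not, readers never observe a half-finished transaction.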

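**How it works: The Transaction Log**
- The secret to ACID in Delta Lake is the Transaction Log (also called the DeltaLog), stored in the `_delta_log` folder.
  - Writing data first: A write first creates its new Parquet data files. At this point they are invisible to readers.
  - Committing: The write is then committed by adding a new JSON file to the `_delta_log` folder. Until that file exists, the transaction has not happened.
  - Tracking Files: Each commit file records exactly which new Parquet files were added and which old ones were removed.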
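
To make the "added and removed files" idea concrete, here is a minimal sketch of how a reader could replay the log to find the table's current files. It is a simplified model: the in-memory `commits` dict stands in for the JSON files in `_delta_log`, the action fields are reduced to just a path, and `live_files` is a hypothetical helper name.

```python
import json

# Each string stands in for one commit file (one JSON action per line).
commits = {
    0: '{"add": {"path": "part-0001.parquet"}}\n{"add": {"path": "part-0002.parquet"}}',
    1: '{"add": {"path": "part-0003.parquet"}}',
    2: '{"remove": {"path": "part-0001.parquet"}}\n{"add": {"path": "part-0004.parquet"}}',
}

def live_files(commits, as_of=None):
    """Replay commits in version order and return the set of active data files."""
    files = set()
    for version in sorted(commits):
        if as_of is not None and version > as_of:
            break  # stop early to reconstruct an older snapshot
        for line in commits[version].splitlines():
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

latest = live_files(commits)            # current state of the table
version_1 = live_files(commits, as_of=1)  # the table as it was at version 1
```

Stopping the replay at an earlier version is the same idea that makes time travel possible, and any file that never appears in a commit is simply never read.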

### Schema enforcement

- It ensures data quality by preventing "poisonous" data from breaking your tables.
- In a traditional Data Lake, if you try to write a file with an extra column or a different data type, the write succeeds, but your downstream dashboards or ML models crash later. Schema Enforcement stops this at the source.

**1. How it Works**
- When a write operation is triggered, Delta Lake compares the schema of the incoming DataFrame to the existing table's schema.
  - **Data Types:** If the table expects an Integer but you send a String, the write fails.
  - **Column Names:** If you send a column that doesn't exist in the table, the write fails.
  - **Missing Columns:** If you send fewer columns than the table has, Delta allows it and fills the missing values with null (unless the column does not allow nulls).
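
The three rules above can be sketched as a small check. This is an illustrative model only: schemas are plain dicts of column name to type name, and `enforce_schema` is a hypothetical function, not part of the Delta API.

```python
def enforce_schema(table_schema, incoming_schema):
    """Return the columns to null-fill, or raise if the write must be rejected.

    Both schemas are plain dicts mapping column name -> type name.
    """
    extra = [c for c in incoming_schema if c not in table_schema]
    if extra:
        raise ValueError(f"schema mismatch: unknown columns {extra}")
    for column, dtype in incoming_schema.items():
        if table_schema[column] != dtype:
            raise ValueError(
                f"schema mismatch: {column} is {table_schema[column]}, got {dtype}"
            )
    # Columns the writer left out are allowed and will be stored as null.
    return [c for c in table_schema if c not in incoming_schema]

table_schema = {"user_id": "int", "event_time": "timestamp", "price": "double"}

# Fewer columns than the table: accepted, event_time will be null.
missing = enforce_schema(table_schema, {"user_id": "int", "price": "double"})

# Wrong type: rejected before any data is written.
try:
    enforce_schema(table_schema, {"user_id": "string"})
    type_error = None
except ValueError as exc:
    type_error = str(exc)
```

The point of the design is that the check runs before the write is committed, so a bad batch never becomes part of the table.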

**2. The Benefit: No More "Data Swamps"**
- Without enforcement, data lakes tend to get messy over time as different teams upload files with slightly different formats.

| Feature | Standard Parquet/CSV | Delta Lake |
| ----- | ----- | ----- |
| New Column | Added silently (can break code). | Blocked (unless explicitly allowed). |
| Wrong Data Type | Written as-is (causes crash later). | Immediate error at write time. |
| Case Sensitivity | Varies by engine. | Column names are case-insensitive but case-preserving; names that differ only by case are rejected. |

**3. What if you want to change the schema? (Schema Evolution)**
- Sometimes your data naturally changes (e.g., you start collecting a new `discount_code` field). You can explicitly tell Delta to allow this change using Schema Evolution.
- By adding `.option("mergeSchema", "true")`, you tell Spark: "I know the schema is different; please update the table to match my new data."
```
# This will fail if the schema doesn't match
df.write.format("delta").mode("append").save(path)

# This will succeed and update the table with new columns
df.write.format("delta").mode("append") \
  .option("mergeSchema", "true") \
  .save(path)
```
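
Conceptually, schema evolution merges the incoming schema into the table schema instead of rejecting the write. The sketch below models only the simplest case, appending new top-level columns. `merge_schema` is a hypothetical helper, and Delta's real rules (nested fields, any permitted type changes) are more detailed than this.

```python
def merge_schema(table_schema, incoming_schema):
    """Return the evolved schema: existing columns first, new columns appended."""
    merged = dict(table_schema)
    for column, dtype in incoming_schema.items():
        if column not in merged:
            merged[column] = dtype  # new column: add it to the table
        elif merged[column] != dtype:
            # In this simplified model, existing column types never change.
            raise ValueError(
                f"type conflict on {column}: {merged[column]} vs {dtype}"
            )
    return merged

table_schema = {"user_id": "int", "price": "double"}
incoming = {"user_id": "int", "price": "double", "discount_code": "string"}

evolved = merge_schema(table_schema, incoming)
```

Existing rows would read the new `discount_code` column as null, since they were written before it existed.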

### Delta vs Parquet

- Think of Parquet as the bricks and Delta Lake as the entire building.
- Delta Lake uses Parquet files to store the actual data, but adds a "Brain" (the Transaction Log) on top to manage them.

**1. Key Differences at a Glance**

| Feature | Vanilla Parquet | Delta Lake |
| ----- | ----- | ----- |
| Data Storage | Columnar files (`.parquet`) | Columnar files + Transaction Log (`_delta_log`) |
| Transactions | None (crashes cause "dirty" data) | ACID (it works or it fails; no mess) |
| Updates/Deletes | Must rewrite the affected files or partitions yourself | Easy `UPDATE`, `DELETE`, and `MERGE` |
| History | Latest version only | Time Travel (query past versions) |
| Data Quality | None (allows wrong data types) | Schema Enforcement (blocks bad data) |
| Performance | Basic (limited file skipping) | Advanced (Z-Ordering & fast metadata) |

**2. The "Hidden" Problem with Parquet**
- If you have a folder full of Parquet files, Spark has to "list" all those files every time you run a query.
- If you have 10,000 files in cloud object storage (S3/Azure), this "listing" can noticeably delay the query before it even starts.
- Delta Lake avoids this by storing the list of files in its Transaction Log. Spark reads the log (which is periodically compacted into checkpoint files) instead of listing the directory, so it knows exactly where to go. This keeps query planning fast as your data grows.
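
Besides file paths, log entries can also carry per-file statistics such as column minimums and maximums, which is what lets an engine skip files that cannot match a filter. The sketch below illustrates the idea under simplified assumptions: the `file_stats` layout and the `files_to_scan` helper are hypothetical and only loosely shaped like what Delta records.

```python
# Hypothetical per-file statistics (column minimums and maximums).
file_stats = [
    {"path": "part-0001.parquet", "min": {"price": 0.5}, "max": {"price": 49.9}},
    {"path": "part-0002.parquet", "min": {"price": 50.0}, "max": {"price": 199.0}},
    {"path": "part-0003.parquet", "min": {"price": 150.0}, "max": {"price": 999.0}},
]

def files_to_scan(file_stats, column, lower_bound):
    """Keep only files that could contain rows where column > lower_bound."""
    return [f["path"] for f in file_stats if f["max"][column] > lower_bound]

# For a filter like `price > 100`, the first file can be skipped entirely.
candidates = files_to_scan(file_stats, "price", 100.0)
```

No data file is opened to make this decision; the pruning happens from metadata alone.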

**3. Visual Comparison**
- Parquet: Just a bunch of files. If a write fails, you don't know which files are good or bad.
- Delta: A "Managed" collection. The log acts as a checklist. If a file isn't on the checklist, Spark ignores it.

**4. When to use which?**
- Use Parquet for static, one-time exports or sharing data with legacy systems that don't support Delta.
- Use Delta for almost everything else in Databricks, especially if your data changes, needs to be updated, or requires high reliability.

```
# Convert a CSV file to Delta format

# Define the file path
file_path = "/Volumes/workspace/ecommerce/ecommerce_data/2019-Oct.csv"

# Read the CSV, using the header row for column names and inferring column types
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(file_path)

# Write the DataFrame out as Delta
df.write.format("delta").save("/Volumes/workspace/ecommerce/ecommerce_data/delta")
```

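```
# Create a Delta table (PySpark or SQL; run one option or the other)

# Option A: PySpark, creates a managed table from the DataFrame
df.write.format("delta").saveAsTable("sales_october")
```

```
%sql
-- Option B: SQL, registers the Delta files written earlier as a table
CREATE TABLE IF NOT EXISTS sales_october
USING DELTA
LOCATION '/Volumes/workspace/ecommerce/ecommerce_data/delta';
```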

```
# Test schema enforcement

# Create a dummy row with a NEW column, 'extra_info'
new_data = spark.createDataFrame([("user_1", "2026-01-12", "oops")], ["user_id", "event_time", "extra_info"])

# Try to append it to the existing table.
# This is expected to FAIL with a schema mismatch error,
# because 'extra_info' is not part of the table's schema.
new_data.write.format("delta").mode("append").saveAsTable("sales_october")
```

```
# Expose the DataFrame to SQL as a temporary view
df.createOrReplaceTempView("updates_df")
```

```
%sql
-- Ensure only one source row per key for MERGE
CREATE OR REPLACE TEMP VIEW updates_df_dedup AS
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY event_time, user_id, product_id ORDER BY event_time
  ) AS rn
  FROM updates_df
) WHERE rn = 1;

-- Upsert: update rows whose key already exists, insert the rest
MERGE INTO sales_october AS target
USING updates_df_dedup AS source
ON target.event_time = source.event_time
   AND target.user_id = source.user_id
   AND target.product_id = source.product_id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
```
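
The de-duplicate-then-merge pattern above can be modeled in a few lines of plain Python. This is only a sketch of the semantics, with lists of dicts standing in for tables and hypothetical helper names `dedup` and `merge`. It keeps the first source row seen per key, whereas the SQL picks whichever row receives `rn = 1`.

```python
def dedup(rows, key):
    """Keep the first row seen for each key."""
    first = {}
    for row in rows:
        first.setdefault(tuple(row[k] for k in key), row)
    return list(first.values())

def merge(target, source, key):
    """Upsert: replace matched rows, append unmatched ones."""
    index = {tuple(row[k] for k in key): i for i, row in enumerate(target)}
    result = [dict(row) for row in target]
    for row in source:
        k = tuple(row[c] for c in key)
        if k in index:
            result[index[k]] = dict(row)  # WHEN MATCHED THEN UPDATE SET *
        else:
            result.append(dict(row))      # WHEN NOT MATCHED THEN INSERT *
    return result

key = ("event_time", "user_id", "product_id")
target = [{"event_time": "t1", "user_id": 1, "product_id": 10, "price": 5.0}]
updates = [
    {"event_time": "t1", "user_id": 1, "product_id": 10, "price": 6.0},
    {"event_time": "t1", "user_id": 1, "product_id": 10, "price": 7.0},  # duplicate key
    {"event_time": "t2", "user_id": 2, "product_id": 20, "price": 9.0},
]

merged = merge(target, dedup(updates, key), key)
```

The de-duplication step matters because a merge is ambiguous when several source rows match the same target row, and Delta rejects that case with an error rather than picking one.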