### Time Travel (Version history)

- Time Travel is one of the most popular features of Delta Lake. It allows you to query a table as it existed at a specific point in time or at a specific "version."
- Every time you change a Delta table (like your MERGE or INSERT tasks), Delta creates a new version. 
- It doesn’t delete the old data; it just keeps track of the changes in the Transaction Log.
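
The idea of "keep the old files and just log the changes" can be sketched with a toy model in plain Python. This is only an illustration: the `log` structure, the file names, and the `snapshot` helper are hypothetical and do not reflect Delta Lake's real log format.

```python
# Toy model of a versioned table: every change appends a commit to a log.
# Hypothetical structure for illustration; not Delta Lake's actual log format.

def snapshot(log, version):
    """Replay commits 0..version and return the set of live data files."""
    live = set()
    for commit in log[: version + 1]:
        live -= set(commit["remove"])  # files this commit marks as no longer current
        live |= set(commit["add"])     # files this commit adds
    return live

log = [
    {"op": "WRITE",  "add": ["part-0.parquet"], "remove": []},                  # version 0
    {"op": "UPDATE", "add": ["part-1.parquet"], "remove": ["part-0.parquet"]},  # version 1
    {"op": "INSERT", "add": ["part-2.parquet"], "remove": []},                  # version 2
]

print(sorted(snapshot(log, 0)))  # files visible at version 0
print(sorted(snapshot(log, 2)))  # files visible at the latest version
```

Note that `part-0.parquet` is never deleted by the update. It simply stops being part of the newer versions, which is why an older version can still be read.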

**1. How to see your History**
- First, you look at the "Log Book" to see every change ever made to the table.
```sql
DESCRIBE HISTORY student_grades;
```
```python
from delta.tables import DeltaTable

history = DeltaTable.forName(spark, "student_grades").history()
display(history)
```
This returns one row per version, for example Version 0 (original write), Version 1 (update), Version 2 (delete), together with the timestamp and the operation that created each version.

**2. How to "Time Travel"**
- Imagine a student named "Sam" had a grade of 85. You updated it to 95, but then realized the first grade was actually correct. You can query the table exactly as it was before the change.

- By Version Number:
```sql
-- See the table exactly as it was at the very start
SELECT * FROM student_grades VERSION AS OF 0;
```
- By Timestamp:
```sql
-- See the grades as they were at 10:00 AM this morning
SELECT * FROM student_grades TIMESTAMP AS OF '2026-01-13 10:00:00';
```
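
A timestamp query has to be mapped to a version first. A minimal sketch of that lookup, assuming the rule "use the latest version committed at or before the requested time": the commit times and the `version_as_of` helper are made up for illustration, and real Delta Lake may behave differently at the edges (for example, for timestamps after the last commit).

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical commit times for versions 0, 1, 2 (list index = version number).
commit_times = [
    datetime(2026, 1, 13, 8, 0),
    datetime(2026, 1, 13, 9, 30),
    datetime(2026, 1, 13, 11, 15),
]

def version_as_of(commit_times, ts):
    """Return the latest version committed at or before ts."""
    idx = bisect_right(commit_times, ts) - 1
    if idx < 0:
        raise ValueError("timestamp is earlier than the first commit")
    return idx

# 10:00 falls after version 1 (09:30) and before version 2 (11:15).
print(version_as_of(commit_times, datetime(2026, 1, 13, 10, 0)))
```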

**3. Why is this useful?**
- Undo Mistakes: If you accidentally delete all students from the table, you don't lose the data. You just "travel back" to the version before the deletion.
- Comparison: You can compare today's grades with last month's grades side-by-side using two different versions of the same table.
- Audit: If someone asks, "Why did Sam have an 85 last week?", you can pull up the exact data from last week to prove it.
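
The comparison use case boils down to diffing two snapshots of the same table. A small plain-Python sketch, assuming each snapshot has already been loaded into a dictionary of student to grade (the data and the `changed_grades` helper are hypothetical):

```python
def changed_grades(old, new):
    """Return {student: (old_grade, new_grade)} for every student whose grade differs."""
    students = old.keys() | new.keys()
    return {
        s: (old.get(s), new.get(s))
        for s in sorted(students)
        if old.get(s) != new.get(s)
    }

last_month = {"Sam": 85, "Ria": 90}          # e.g. rows read from an older version
today = {"Sam": 95, "Ria": 90, "Lee": 70}    # e.g. rows read from the current version

print(changed_grades(last_month, today))
```

A student missing from one side shows up with `None` for that side, which makes inserts and deletes visible alongside updates.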

**4. How to Restore (The "Undo" Button)**
- If the current version is wrong and you want the table itself to go back to an old version:
```sql
-- Make version 0 the current state of the table.
-- RESTORE is recorded as a new version, so the versions in between stay in the history.
RESTORE TABLE student_grades TO VERSION AS OF 0;
```
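
One way to picture a restore is as "append a copy of an old version to the end of the history" rather than "erase everything after it". The toy model below illustrates that idea only; the `history` list and `restore` helper are hypothetical and not Delta Lake code.

```python
# Toy history: list index = version number, value = table contents at that version.
# Hypothetical data for illustration only.
history = [
    {"Sam": 85},   # version 0
    {"Sam": 95},   # version 1 (the mistaken update)
]

def restore(history, version):
    """Append a new version whose contents copy an older version."""
    history.append(dict(history[version]))
    return len(history) - 1  # number of the newly created version

new_version = restore(history, 0)
print(new_version, history[new_version])
```

In this model the mistaken version 1 is still there after the restore, so the restore itself could be undone in the same way.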

### MERGE Operations (upsert)
- A MERGE operation is like a "Smart Update." Instead of just adding new data (which creates duplicates) or overwriting everything (which is slow), Delta Lake looks at your table and decides row-by-row what to do.
- Think of it as a "Search and Act" command.

**1. The Logic: "Find, then Decide"**
- Imagine you have a Staff List table and you receive a new list of updates today.
- Search: Delta compares the New List to the Existing Table using a unique ID (like employee_id).
- Match Found: If the ID exists, it Updates the person's info (e.g., they got a promotion).
- No Match: If the ID is new, it Inserts them as a new employee.
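
The "find, then decide" steps above can be sketched on plain Python dictionaries. This is a toy illustration of upsert logic, not how Delta Lake executes a merge; the sample rows and the `merge_upsert` helper are hypothetical.

```python
def merge_upsert(target, updates, key="employee_id"):
    """Return a new list of rows: update rows whose key matches, insert the rest."""
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in merged:
            merged[row[key]].update(row)   # match found: update the existing row
        else:
            merged[row[key]] = dict(row)   # no match: insert as a new row
    return list(merged.values())

staff = [
    {"employee_id": 1, "name": "Ana", "role": "Analyst"},
    {"employee_id": 2, "name": "Ben", "role": "Engineer"},
]
updates = [
    {"employee_id": 1, "name": "Ana", "role": "Senior Analyst"},  # existing ID
    {"employee_id": 3, "name": "Cy", "role": "Designer"},         # new ID
]

result = merge_upsert(staff, updates)
for row in result:
    print(row)
```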

**2. How to do it in SQL**
- This is the most common way to handle "Upserts" (Update + Insert).
```sql
-- updates_view is a temp view registered from the updates DataFrame, e.g.
-- updates_df.createOrReplaceTempView("updates_view")
MERGE INTO staff_table AS target
USING updates_view AS source
ON target.employee_id = source.employee_id
WHEN MATCHED THEN
  UPDATE SET
    target.salary = source.salary,
    target.role = source.role
WHEN NOT MATCHED THEN
  INSERT (employee_id, name, salary, role)
  VALUES (source.employee_id, source.name, source.salary, source.role);
```

**3. How to do it in PySpark**
- PySpark gives you more control if you want to write the logic programmatically.
```python
from delta.tables import DeltaTable

# 1. Load the existing table
staff_table = DeltaTable.forName(spark, "staff_table")

# 2. Run the merge
staff_table.alias("target") \
  .merge(
    source = updates_df.alias("source"),
    condition = "target.employee_id = source.employee_id"
  ) \
  .whenMatchedUpdateAll() \
  .whenNotMatchedInsertAll() \
  .execute()
```

**4. Why is this a "Superpower"?**
- No Duplicates: Because rows are matched on the ID, an existing person is updated instead of being added a second time (as long as each ID appears only once in the incoming data).
- Efficiency: Delta Lake doesn't rewrite the whole table. It only rewrites the data files that contain rows you are changing.
- Reliability: If the cluster crashes halfway through the Merge, nothing is committed to the "Transaction Log", so the table stays at its previous version. It's all or nothing.

**5. Common Mistake to Avoid**
- The "Unresolved" Error: As you saw earlier, if you try to Merge using a column that doesn't exist (like order_id when the column is actually product_id), the operation will fail. Always check your column names first!
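
A quick pre-check for this mistake is to compare the merge key against the column names on both sides before running the merge. A minimal sketch in plain Python, assuming the column lists have already been collected (for example from a DataFrame's `columns` attribute); the `missing_columns` helper and the sample column names are hypothetical.

```python
def missing_columns(required, available):
    """Return the required column names that are absent from `available`."""
    available_set = set(available)
    return [col for col in required if col not in available_set]

target_cols = ["product_id", "name", "price"]   # e.g. columns of the target table
source_cols = ["product_id", "price"]           # e.g. columns of the updates

for key in ("order_id", "product_id"):
    problems = missing_columns([key], target_cols) + missing_columns([key], source_cols)
    print(key, "OK" if not problems else "missing")
```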

### OPTIMIZE & ZORDER
- As your data grows, even a fast Delta table can slow down. 
- OPTIMIZE and ZORDER are the two "clean-up" tools you use to keep your queries running at lightning speed.
- Think of it like organizing a messy library so you can find a specific book in seconds instead of hours.

**1. OPTIMIZE (Compaction)**
- When you write data frequently, Spark creates many tiny files. Reading 1,000 tiny files is much slower than reading one big file.
- OPTIMIZE takes all those small "splinter" files and compacts them into larger, more efficient chunks.
```sql
OPTIMIZE student_grades;
```
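
Compaction can be pictured as packing many small files into a few files of a target size. The sketch below is a simple greedy packing in plain Python, shown only to illustrate the idea; the 1024 MB target and the `plan_compaction` helper are assumptions, and Delta Lake's actual file-selection logic is more involved.

```python
def plan_compaction(file_sizes_mb, target_mb=1024):
    """Greedily group small files into bins of at most target_mb each."""
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        # Start a new output file when the next input would overflow the target.
        if current and current_size + size > target_mb:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

small_files = [10] * 1000           # 1,000 files of 10 MB each
plan = plan_compaction(small_files)
print(len(small_files), "files ->", len(plan), "files")
```

The total amount of data is unchanged. Only the number of files a query has to open goes down.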

**2. Z-ORDER (Data Clustering)**
- Even with big files, Spark might still have to read every file to find one student.
- Z-ORDER physically rearranges the data so that rows with similar values are stored together in the same files.
- If you often search for students by their subject_id, Z-Ordering by that column lets Spark skip the files that cannot contain the subject you asked for.
```sql
OPTIMIZE student_grades
ZORDER BY (subject_id);
```
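
Why clustering helps can be shown with a toy version of file skipping: if each file records the minimum and maximum subject_id it contains, a query can ignore files whose range cannot include the value. The per-file ranges and the `files_to_read` helper below are hypothetical and only illustrate the principle.

```python
def files_to_read(file_stats, value):
    """Keep only files whose [min, max] range could contain `value`."""
    return [name for name, (lo, hi) in file_stats.items() if lo <= value <= hi]

# Hypothetical min/max of subject_id per file.
unclustered = {"f1": (1, 98), "f2": (2, 99), "f3": (1, 100), "f4": (3, 97)}
clustered   = {"f1": (1, 25), "f2": (26, 50), "f3": (51, 75), "f4": (76, 100)}

print(files_to_read(unclustered, 42))  # every file might hold subject 42
print(files_to_read(clustered, 42))    # only one file can hold subject 42
```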

**3. How they work together (The Library Analogy)**
- Without Optimize: Your books are scattered in 100 tiny boxes. You have to open every box to find anything.
- With Optimize: You move all the books onto 5 large, sturdy shelves.
- With Z-Order: You arrange the books on those shelves alphabetically. Now, if you need a book starting with "Z," you skip the first 4 shelves entirely.

**4. When should you use them?**
- OPTIMIZE: Run this once a day or once a week to clean up "fragmented" data.
- Z-ORDER: Use this only on columns you use most often in your WHERE clauses (like user_id, date, or product_id).

### VACUUM for cleanup
- While Time Travel is a superpower that lets you see old versions of your data, those old files take up storage space and cost money. 
- VACUUM is the "cleanup crew" that permanently deletes those old files once you no longer need them.
- Think of it like emptying the Recycle Bin on your computer.

**1. What does VACUUM do?**
- When you update or delete data in Delta Lake, the old files aren't immediately erased (so you can Time Travel). 
- VACUUM looks for files that are no longer part of the "latest" version of the table and have been "expired" for a certain amount of time.

**2. The "Retention Period"**
- By default, Delta Lake won't let you vacuum files that are less than 7 days old. 
- This is a safety feature to make sure you don't delete data that someone might still be querying or trying to time travel to.
- SQL Example:
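```sql
-- Remove files older than the default 7 days
VACUUM student_grades;

-- Remove files older than 240 hours (10 days)
-- Note: a window shorter than 168 hours (7 days) is rejected unless the retention safety check is disabled
VACUUM student_grades RETAIN 240 HOURS;
```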
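
The retention rule can be sketched as a simple filter: a file is a cleanup candidate only if it is no longer part of the latest version and it was removed longer ago than the retention window. The file names, dates, and the `vacuum_candidates` helper below are hypothetical, for illustration only.

```python
from datetime import datetime, timedelta

def vacuum_candidates(removed_files, now, retain_hours=168):
    """Return files removed from the table longer ago than the retention window."""
    cutoff = now - timedelta(hours=retain_hours)
    return [name for name, removed_at in removed_files.items() if removed_at < cutoff]

now = datetime(2026, 1, 20, 12, 0)
# Hypothetical files that are no longer part of the latest version, with removal times.
removed_files = {
    "old-a.parquet": datetime(2026, 1, 5, 9, 0),    # removed about 15 days ago
    "old-b.parquet": datetime(2026, 1, 18, 9, 0),   # removed about 2 days ago
}

print(vacuum_candidates(removed_files, now))                  # default 7-day window
print(vacuum_candidates(removed_files, now, retain_hours=0))  # no retention at all
```

With the 7-day window, the recently removed file survives, so versions that still need it remain readable. With a window of 0 hours, both files are selected.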

**3. The Library Analogy (Updated)**
- Time Travel: You keep every old edition of a textbook in the basement just in case.
- VACUUM: You realize the basement is full and costs too much to rent. You decide to throw away any edition older than 2020.
- Warning: Once you throw them away, you can no longer "Time Travel" back to see them!

**4. Important Warning**
- Once you run VACUUM, the old data is permanently gone.
- If you try to run a Time Travel query on a version that has been vacuumed, you will get an error saying the underlying files are missing.

```python
# Create an employee table

data = [
    (1, "Alice", "HR", 5000),
    (2, "Bob", "IT", 6000),
    (3, "Charlie", "Sales", 4500),
]

columns = ["emp_id", "name", "dept", "salary"]

# Create a DataFrame and write it as a Delta table
df = spark.createDataFrame(data, columns)
df.write.format("delta").mode("overwrite").saveAsTable("employee")

print("Initial employee table created (version 0 on a fresh table)")
```

#### Task 1 - incremental MERGE

```python
new_data = [
    (1, "Alice", "HR", 5500),  # Update: salary increased
    (4, "David", "IT", 5000),  # New: new employee
]

# Register the updates as a temp view so the SQL cell can reference them
updates_df = spark.createDataFrame(new_data, columns)
updates_df.createOrReplaceTempView("updates_view")
```

```sql
%sql
MERGE INTO employee AS target
USING updates_view AS source
ON target.emp_id = source.emp_id
WHEN MATCHED THEN
  UPDATE SET target.salary = source.salary
WHEN NOT MATCHED THEN
  INSERT *;
```

#### Task 2 - time travel

```sql
%sql
-- See the history
DESCRIBE HISTORY employee;

-- Query version 0
SELECT * FROM employee VERSION AS OF 0;
```

#### Task 3 - optimize tables

```sql
%sql
OPTIMIZE employee
ZORDER BY (dept);
```

#### Task 4 - clean old files: VACUUM

```sql
%sql
-- See which files are no longer required (nothing is deleted)
VACUUM employee DRY RUN;
```

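```sql
%sql
-- To actually delete (a retention below 7 days requires disabling the safety check)
-- Warning: RETAIN 0 HOURS removes the files needed to time travel to older versions
-- SET spark.databricks.delta.retentionDurationCheck.enabled = false;
-- VACUUM employee RETAIN 0 HOURS;
```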