# Time Travel

One of Delta Lake's most powerful features is the ability to look back through a table's history, i.e., the versions produced by the changes made to the underlying Delta table.

You can query previous snapshots of your Delta Lake table by using a feature called **Time Travel**. For example, to access data that you have since overwritten, query a snapshot of the table from before the overwrite using the _**versionAsOf**_ option.

Time Travel is an **extremely powerful** feature that uses the _Delta Lake transaction log_ to access data that is no longer in the current version of the table. Omitting the _**versionAsOf**_ option (or specifying a later version number) returns the newer data again.

For more information, see [Query an older snapshot of a table (time travel)](https://docs.delta.io/latest/delta-batch.html#query-an-older-snapshot-of-a-table-time-travel).

Time travel has many use cases, including:

- Re-creating analyses, reports, or outputs (for example, the output of a machine learning model). This could be useful for debugging or auditing, especially in regulated industries. 
- Writing complex temporal queries.
- Fixing mistakes in your data.
- Providing snapshot isolation for a set of queries on fast-changing tables.
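
As a rough illustration of the snapshot-isolation use case, the sketch below pins several queries to one table version so that they all read the same data even if the table changes in between. The helper name `pinned_query` is hypothetical (it is not part of Delta Lake); it only assembles SQL text with the `VERSION AS OF` clause used later in this notebook. The commented-out lines assume an active `spark` session and that `demo.time_travel_demo` already exists.

```python
from typing import Optional


def pinned_query(table: str, version: int, where: Optional[str] = None) -> str:
    """Build a SELECT that reads one fixed version of a Delta table.

    Hypothetical helper for illustration; it only assembles SQL text.
    """
    if version < 0:
        raise ValueError("Delta table versions start at 0")
    sql = f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    if where:
        sql += f" WHERE {where}"
    return sql


# Every query built with the same version number reads the same snapshot.
open_rows_sql = pinned_query("demo.time_travel_demo", 0, "action = 'open'")
all_rows_sql = pinned_query("demo.time_travel_demo", 0)
print(open_rows_sql)

# In a notebook with an active Spark session (assumed), run them with:
# display(spark.sql(open_rows_sql))
# display(spark.sql(all_rows_sql))
```

Passing the version number around explicitly, rather than reading the latest data each time, is what keeps a group of related queries consistent with each other.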

In [None]:
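# Generate dummy data

from pyspark.sql.functions import col  # col() is used in later cells

data = [(1, "open"), (2, "close"), (3, "open"), (4, "open"), (5, "close")]
schema = ["id", "action"]

df = spark.createDataFrame(data=data, schema=schema)

# Use the schema-qualified name that the rest of the notebook queries
delta_table_name = "demo.time_travel_demo"

# Make sure the target schema exists before writing the table
spark.sql("CREATE SCHEMA IF NOT EXISTS demo")
spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")
df.write.format("delta").mode("overwrite").saveAsTable(delta_table_name)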

## Version 0

Display the version history of the Delta table. The initial write above created **version 0**.

In [None]:
import delta

delta_info = delta.DeltaTable.forName(spark, "demo.time_travel_demo")
display(delta_info.history())

Or, equivalently, in SQL:

In [None]:
%%sql

DESCRIBE HISTORY demo.time_travel_demo

Check some data before making changes:

In [None]:
%%sql

SELECT * 
FROM demo.time_travel_demo 
WHERE 
        action = 'open' 
LIMIT 5

## Version 1

Now update some data in the Delta table, which will create **version 1**.

In [None]:
import delta
delta_info = delta.DeltaTable.forName(spark, "demo.time_travel_demo")

# Values in `set` are SQL expression strings, so the literal 'close' keeps its inner quotes
delta_info.update(
    condition = col("id") == 4,
    set = {'action': "'close'"}
    )

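Or, in SQL (run only one of the two update cells; running both commits an extra version and shifts the version numbers used below):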

In [None]:
%%sql

UPDATE demo.time_travel_demo SET action = 'close' where id = 4 

Check the data after making the change:

In [None]:
%%sql

SELECT * 
FROM demo.time_travel_demo 
WHERE id = 4


Get info on the new version:

In [None]:
delta_info = delta.DeltaTable.forName(spark, "demo.time_travel_demo")
display(delta_info.history())

## Version 2


Now delete some data from the Delta table, which will create **version 2**.

In [None]:
delta_info.delete(col("id") == 3)

In [None]:
%%sql 

SELECT * 
FROM demo.time_travel_demo 
WHERE id = 3

Get info on the new version. There are now three versions of our Delta table (0, 1, and 2):

In [None]:
delta_info = delta.DeltaTable.forName(spark, "demo.time_travel_demo")
display(delta_info.history())

# Time Travel by version

A version is a number that can be obtained from the output of the DESCRIBE HISTORY command.
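
Rather than reading the version number off the history output by eye, it can be picked programmatically. The sketch below is a minimal, hypothetical helper (`version_before` is not a Delta Lake function) that finds the version just before the most recent commit of a given operation, which is handy when undoing a mistake. It works on plain dictionaries; the sample rows are illustrative values shaped like the `version` and `operation` history columns, and the exact operation name recorded for the initial write may differ in your environment.

```python
from typing import Dict, List, Optional


def version_before(history: List[Dict], operation: str) -> Optional[int]:
    """Return the version just before the most recent commit of `operation`.

    Returns None if the operation never happened or happened at version 0.
    """
    matching = [row["version"] for row in history if row["operation"] == operation]
    if not matching:
        return None
    latest = max(matching)
    return latest - 1 if latest > 0 else None


# Illustrative rows shaped like the history output (sample values, not real output).
sample_history = [
    {"version": 2, "operation": "DELETE"},
    {"version": 1, "operation": "UPDATE"},
    {"version": 0, "operation": "CREATE TABLE AS SELECT"},
]

# Version 1 is the last snapshot that still contains the deleted row.
print(version_before(sample_history, "DELETE"))

# With an active Spark session (assumed), real rows could be collected with:
# rows = [r.asDict() for r in delta_info.history().select("version", "operation").collect()]
```

The returned number can then be passed to the _**versionAsOf**_ option in the cells that follow.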

In [None]:
version0 = spark.read.option("versionAsOf", 0).table("demo.time_travel_demo")

In [None]:
display(version0.select("*"))

In [None]:
%%sql 

SELECT * 
FROM demo.time_travel_demo VERSION AS OF 0 

# Time Travel by timestamp

In [None]:
delta_info = delta.DeltaTable.forName(spark, "demo.time_travel_demo")
display(delta_info.history())

> Get the **timestamp for version 1** from the history output above and _replace_ the timestamp in the commands below with it.

In [None]:
version1 = spark.read.option("timestampAsOf","2024-02-17 11:38:22.827").table("demo.time_travel_demo")

display(version1.select("*"))

In [None]:
%%sql 

SELECT * 
FROM demo.time_travel_demo TIMESTAMP AS OF '2024-02-17 11:38:22.827'
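
The timestamp in the commands above does not have to be copied by hand. As a minimal sketch, the hypothetical helper below (`timestamp_for_version` is not part of Delta Lake) looks up a version in history rows and formats its commit timestamp with millisecond precision, matching the string format used above. The sample rows are illustrative: the version 1 value is the example timestamp from this notebook and the version 0 value is made up.

```python
from datetime import datetime
from typing import Dict, List


def timestamp_for_version(history: List[Dict], version: int) -> str:
    """Return the commit timestamp of `version` as 'YYYY-MM-DD HH:MM:SS.mmm'."""
    for row in history:
        if row["version"] == version:
            # %f yields microseconds (6 digits); drop the last 3 to keep milliseconds
            return row["timestamp"].strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    raise ValueError(f"version {version} not found in history")


# Illustrative rows shaped like the history output (sample values, not real output).
sample_history = [
    {"version": 1, "timestamp": datetime(2024, 2, 17, 11, 38, 22, 827000)},
    {"version": 0, "timestamp": datetime(2024, 2, 17, 11, 37, 5, 120000)},
]

ts_v1 = timestamp_for_version(sample_history, 1)
print(ts_v1)

# With an active Spark session (assumed), real rows could be collected with:
# rows = [r.asDict() for r in delta_info.history().select("version", "timestamp").collect()]
# version1 = spark.read.option("timestampAsOf", timestamp_for_version(rows, 1)).table("demo.time_travel_demo")
```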

# Set up Data Retention

To time travel to a previous version, you **must retain both the log and the data files** for that version.

By default, Delta tables retain the commit history **for 30 days**, so you can time travel to versions up to 30 days old. However, if you run <u>VACUUM</u>, you lose the ability to go back to versions older than the default 7-day data file retention period.

_<u>Log Files</u>_

- **delta.logRetentionDuration** - controls how long the history for a table is kept. The default is 30 days.

_<u>Data Files</u>_

- **delta.deletedFileRetentionDuration** - controls how long ago a file must have been deleted before being a candidate for <u>VACUUM</u>. The default interval is 7 days.

You **must set both of these properties** to ensure table history is retained for a longer duration on tables with frequent <u>VACUUM</u> operations.
To access 30 days of historical data even if you run <u>VACUUM</u> on the Delta table, set _delta.deletedFileRetentionDuration = "interval 30 days"_. <mark>This setting may cause your storage costs to go up</mark>.
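
To make the interaction between the two properties concrete, here is a small arithmetic sketch. It assumes a simplified model in which VACUUM runs regularly and a version is readable only while both its log entries and its data files still exist, so the shorter of the two retention periods decides how far back you can go. The function name `time_travel_window` is hypothetical and real behavior can differ in detail.

```python
from datetime import timedelta


def time_travel_window(log_retention: timedelta, deleted_file_retention: timedelta) -> timedelta:
    """Approximate how far back time travel works when VACUUM runs regularly.

    Simplified model: both the log and the data files are needed,
    so the shorter retention period wins.
    """
    return min(log_retention, deleted_file_retention)


# Defaults described above: 30-day log retention, 7-day deleted file retention.
defaults = time_travel_window(timedelta(days=30), timedelta(days=7))

# After raising delta.deletedFileRetentionDuration to 30 days.
extended = time_travel_window(timedelta(days=30), timedelta(days=30))

print(defaults.days, extended.days)
```

Under this model the defaults give roughly 7 days of time travel on a regularly vacuumed table, which is why both properties are set in the cells below.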

In [None]:
%%sql

ALTER TABLE demo.time_travel_demo SET TBLPROPERTIES ('delta.deletedFileRetentionDuration'='interval 30 days')

In [None]:
%%sql

ALTER TABLE demo.time_travel_demo SET TBLPROPERTIES ('delta.logRetentionDuration'='interval 30 days')

Show the table properties:

In [None]:
display(spark.sql("SHOW TBLPROPERTIES demo.time_travel_demo"))

Or, equivalently, in SQL:

In [None]:
%%sql

SHOW TBLPROPERTIES demo.time_travel_demo

# Clean up

In [None]:
spark.sql("DROP TABLE IF EXISTS demo.time_travel_demo")