# Vacuum 

You can remove files that are no longer referenced by a Delta table and are older than the retention threshold by running the **VACUUM** command.

VACUUM deletes files that do not belong to the latest version of the table.
The default retention threshold is 7 days, and **_deletedFileRetentionDuration_** controls the minimum age a file must reach before vacuum can delete it.
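
To make the retention rule concrete, here is a minimal pure-Python sketch of the age test described above. It is a simplified model, not Delta Lake's implementation: the function name `files_eligible_for_vacuum`, the file names, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta

def files_eligible_for_vacuum(removed_files, now, retention=timedelta(days=7)):
    """Return the names of files whose removal is older than the retention period.

    removed_files maps a file name to the time the file stopped being
    referenced by the table (hypothetical input, for illustration only).
    """
    cutoff = now - retention
    return sorted(name for name, removed_at in removed_files.items() if removed_at < cutoff)

now = datetime(2024, 1, 15, 12, 0)
removed = {
    "part-0001.parquet": datetime(2024, 1, 1, 9, 0),   # unreferenced for about 14 days
    "part-0002.parquet": datetime(2024, 1, 14, 9, 0),  # unreferenced for about 1 day
}

# With the 7-day default, only the older file passes the age test
eligible_default = files_eligible_for_vacuum(removed, now)
# With a 1-hour retention, both files pass
eligible_one_hour = files_eligible_for_vacuum(removed, now, retention=timedelta(hours=1))
print(eligible_default)
print(eligible_one_hour)
```

Lowering the retention widens the set of files vacuum may delete, which is why the rest of this notebook changes the table property before running vacuum.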

In [None]:
# Generate dummy data

from pyspark.sql.functions import *

data = [(1,"open"),(2,"close"),(3,"open"),(4,"open"),(5,"close")]
schema =["id","action"]

df = spark.createDataFrame(data=data, schema=schema)

delta_table_name = 'vacuum_demo'

spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")
df.write.format("delta").mode("overwrite").saveAsTable(delta_table_name)

## MSSparkUtils
Microsoft Spark Utilities (MSSparkUtils) is a built-in package that helps you perform common tasks. You can use MSSparkUtils to work with file systems, get environment variables, chain notebooks together, and work with secrets.

MSSparkUtils is available in **PySpark (Python)**, **Scala**, and **R (Preview)** notebooks.

In [None]:
mssparkutils.fs.ls("Tables/vacuum_demo/")

    Checking how many files there are in the storage account
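
To compare file counts before and after vacuum, a small helper like the one below can tally the data files in a listing. This is a sketch under assumptions: it assumes each listing entry exposes a `name` attribute, and it uses a stand-in list so it runs outside the notebook. The helper name `count_parquet_files` and the sample file names are hypothetical.

```python
from types import SimpleNamespace

def count_parquet_files(entries):
    """Count entries whose name ends with .parquet, skipping folders such as _delta_log."""
    return sum(1 for entry in entries if entry.name.endswith(".parquet"))

# Stand-in listing; in the notebook you would pass the result of
# mssparkutils.fs.ls("Tables/vacuum_demo/") instead (assumed to expose .name)
sample_listing = [
    SimpleNamespace(name="_delta_log"),
    SimpleNamespace(name="part-00000-aaaa.snappy.parquet"),
    SimpleNamespace(name="part-00001-bbbb.snappy.parquet"),
]

parquet_count = count_parquet_files(sample_listing)
print(parquet_count)
```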

## Delta table details

In [None]:
from delta.tables import DeltaTable
from pyspark.sql.functions import col

# Handle to the Delta table, reused by the update and vacuum cells below
delta_info = DeltaTable.forName(spark, 'demo.vacuum_demo')

    Showing table properties

In [None]:
display(spark.sql("SHOW TBLPROPERTIES demo.vacuum_demo"))

    Making some changes in the table to generate new files

In [None]:
delta_info.update(
    condition = col("id") == 4,
    set = {'action': "'close'"} 
    )

    OR using SQL

In [None]:
%%sql

UPDATE demo.vacuum_demo SET action = 'close' WHERE id = 4

    Check the Lakehouse explorer to see all the files.

## Altering/Adding table property

In [None]:
%%sql
ALTER TABLE demo.vacuum_demo SET TBLPROPERTIES ('delta.deletedFileRetentionDuration'='interval 1 hour')

    Showing table properties

In [None]:
display(spark.sql("SHOW TBLPROPERTIES demo.vacuum_demo"))

## DRY RUN

Return a list of up to 1000 files to be deleted.

In [None]:
%%sql

VACUUM demo.vacuum_demo DRY RUN
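
The dry run statement can also carry an explicit retention through the `RETAIN n HOURS` clause. The helper below is a minimal sketch that only assembles the SQL text; the function name `build_vacuum_sql` is hypothetical, and in the notebook the resulting string would be passed to `spark.sql(...)`.

```python
def build_vacuum_sql(table_name, retain_hours=None, dry_run=False):
    """Assemble a VACUUM statement as a string (does not execute anything)."""
    if retain_hours is not None and retain_hours < 0:
        raise ValueError("retain_hours must be non-negative")
    parts = [f"VACUUM {table_name}"]
    if retain_hours is not None:
        parts.append(f"RETAIN {retain_hours} HOURS")
    if dry_run:
        parts.append("DRY RUN")
    return " ".join(parts)

# Same statement as the SQL cell above
dry_run_sql = build_vacuum_sql("demo.vacuum_demo", dry_run=True)
# Dry run with an explicit 7-day (168-hour) retention
retain_sql = build_vacuum_sql("demo.vacuum_demo", retain_hours=168, dry_run=True)
print(dry_run_sql)
print(retain_sql)
```

Building the statement in one place keeps the dry run and the real run identical except for the `DRY RUN` suffix.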

In [None]:
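# Not a dry run: this requests a real vacuum with a 0-hour retention.
# While the retention duration check is enabled, Delta Lake is expected to
# reject a retention shorter than the table's configured threshold.
delta_info.vacuum(0)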

## Retention Check

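**spark.databricks.delta.retentionDurationCheck.enabled**

A safety check that makes vacuum fail when it is asked to use a retention period shorter than the table's `delta.deletedFileRetentionDuration`, which is the shortest duration for Delta Lake to keep logically deleted data files before deleting them physically. The retention period exists to prevent failures in stale readers after compactions or partition overwrites, so disable the check only when you are sure no reader or writer still needs the older files.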
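
The effect of the safety check toggled in the cells below can be pictured with a small pure-Python model. This is an illustrative sketch only, not Delta Lake's code: the function name `validate_retention` and the error message are hypothetical.

```python
def validate_retention(requested_hours, configured_hours, check_enabled=True):
    """Model of the safety check: refuse a retention shorter than the configured one."""
    if check_enabled and requested_hours < configured_hours:
        raise ValueError(
            f"requested retention of {requested_hours}h is shorter than "
            f"the configured {configured_hours}h"
        )
    return requested_hours

# With the check enabled, a 0-hour vacuum on a table configured for 1 hour is refused
try:
    validate_retention(0, 1)
    rejected = False
except ValueError:
    rejected = True

# With the check disabled, the same request goes through
allowed = validate_retention(0, 1, check_enabled=False)
print(rejected, allowed)
```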


In [None]:
spark.conf.get('spark.databricks.delta.retentionDurationCheck.enabled')

In [None]:
spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', 'false')

In [None]:
spark.conf.get('spark.databricks.delta.retentionDurationCheck.enabled')

**VACUUM DRY RUN** lists files only if at least one file is older than the deletedFileRetentionDuration threshold.

_Forcing a lower value_

In [None]:
%%sql
ALTER TABLE demo.vacuum_demo SET TBLPROPERTIES ('delta.deletedFileRetentionDuration'='interval 0 hour')

In [None]:
%%sql

VACUUM demo.vacuum_demo DRY RUN

In [None]:
%%sql
ALTER TABLE demo.vacuum_demo SET TBLPROPERTIES ('delta.deletedFileRetentionDuration'='interval 1 hour')

## Vacuum

In [None]:
delta_info.vacuum(0)

    Check the Lakehouse explorer to see all the files.

In [None]:
spark.conf.set('spark.databricks.delta.retentionDurationCheck.enabled', 'true')
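
Because forgetting to re-enable the check leaves the session unprotected, one option is to wrap the toggle in a context manager that always restores the previous value. The sketch below assumes only that the configuration object offers `get(key)` and `set(key, value)`, as `spark.conf` does. The names `retention_check_disabled` and `FakeConf` are hypothetical, and `FakeConf` stands in for `spark.conf` so the example runs outside the notebook.

```python
from contextlib import contextmanager

RETENTION_CHECK_KEY = "spark.databricks.delta.retentionDurationCheck.enabled"

@contextmanager
def retention_check_disabled(conf):
    """Disable the retention check, then restore the previous value on exit."""
    previous = conf.get(RETENTION_CHECK_KEY)
    conf.set(RETENTION_CHECK_KEY, "false")
    try:
        yield
    finally:
        # Runs even if the body raises, so the check is never left disabled
        conf.set(RETENTION_CHECK_KEY, previous)

class FakeConf:
    """Minimal stand-in for spark.conf, for illustration only."""
    def __init__(self):
        self._values = {RETENTION_CHECK_KEY: "true"}
    def get(self, key):
        return self._values[key]
    def set(self, key, value):
        self._values[key] = value

conf = FakeConf()
with retention_check_disabled(conf):
    value_inside = conf.get(RETENTION_CHECK_KEY)  # the vacuum call would go here
value_after = conf.get(RETENTION_CHECK_KEY)
print(value_inside, value_after)
```

In the notebook, the body of the `with` block would hold the `delta_info.vacuum(0)` call, with `spark.conf` passed in place of `FakeConf()`.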

## Clean up

In [None]:
spark.sql("DROP TABLE IF EXISTS demo.vacuum_demo")