## Summary of Maintenance Routines

- Vacuum: Removes old files to free up space.
- Optimize: Combines small files to improve read performance.
- Z-Ordering: Organizes the data to improve performance in filtered queries.
- Update/Delete: Allows efficient data modification operations.
- History/Time Travel: Audits and accesses previous versions of the data.
- Compaction: Groups small files together to improve read efficiency.

Keeping a Delta Lake well maintained is essential for performance, data integrity, and efficient resource usage. The sections below cover the main maintenance routines of Delta Lake: when, how, and why to use each one.

### 1. Vacuum
When to use: To remove old files that are no longer needed, such as those generated by update, merge, or delete operations.

Why to use: Delta Lake keeps old data versions (history) to provide features like time travel and rollback. Over time, these old files can consume a lot of disk space. Vacuum removes these files, freeing up space.

Recommendation: Avoid setting the retention period below 7 days (168 hours) unless you accept that time travel to older versions will stop working. The 7-day default keeps data recovery possible while still cleaning up obsolete files.
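
Before deleting anything, it can help to preview what a vacuum would remove. The sketch below is a hypothetical helper (`build_vacuum_sql` and `weakens_time_travel` are illustrative names, not part of Delta Lake) that converts a retention period in days to hours and builds a `VACUUM ... RETAIN ... HOURS` statement, optionally with `DRY RUN`. It only builds the string; running it with `spark.sql(...)` is left to the notebook.

```python
DEFAULT_RETENTION_DAYS = 7  # the 7-day default mentioned above


def build_vacuum_sql(table_name: str,
                     retention_days: float = DEFAULT_RETENTION_DAYS,
                     dry_run: bool = True) -> str:
    """Build a VACUUM statement (hypothetical helper, not a Delta Lake API)."""
    if retention_days < 0:
        raise ValueError("retention_days must be non-negative")
    hours = int(retention_days * 24)
    sql = f"VACUUM {table_name} RETAIN {hours} HOURS"
    if dry_run:
        # DRY RUN lists the files that would be deleted without deleting them
        sql += " DRY RUN"
    return sql


def weakens_time_travel(retention_days: float) -> bool:
    """True when the retention is shorter than the 7-day default."""
    return retention_days < DEFAULT_RETENTION_DAYS


print(build_vacuum_sql("sales_case.gold_fact_sales"))
# VACUUM sales_case.gold_fact_sales RETAIN 168 HOURS DRY RUN
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch: the destructive form has to be requested explicitly.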

In [0]:
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
# Note: importing max from pyspark.sql.functions shadows Python's built-in max
from pyspark.sql.functions import lit, max, current_timestamp, col, monotonically_increasing_id

In [0]:
# Specify the Delta table
delta_table = DeltaTable.forName(spark, "sales_case.gold_fact_sales")

# Remove unreferenced files older than 168 hours (7 days)
delta_table.vacuum(168)


Out[5]: DataFrame[]

In [0]:
df_sales = spark.read.table('sales_case.gold_fact_sales')

# Option 1: run the vacuum in Python through spark.sql,
# keeping the last 7 days (168 hours) of data
spark.sql("VACUUM sales_case.gold_fact_sales RETAIN 168 HOURS")

# Disable the retention duration check.
# This is only needed when the retention is shorter than the table's
# retention threshold (7 days by default).
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Option 2: run the vacuum in Python through the DeltaTable API
delta_table = DeltaTable.forName(spark, "sales_case.gold_fact_sales")
delta_table.vacuum(168)

# Re-enable the retention duration check
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")


SalesDate,sk_product,sk_category,sk_segment,sk_manufacturer,sk_client,Units,UnitPrice,UnitCost,SalesTotal,Year,Month
2011-03-04,77,2,6,1,5534,1,124.42,90.83,90.83,2011,3
2011-03-08,77,2,6,1,9318,1,124.42,90.83,90.83,2011,3
2011-03-08,77,2,6,1,4263,1,124.42,90.83,90.83,2011,3
2011-03-11,77,2,6,1,60129548537,1,124.42,90.83,90.83,2011,3
2011-03-25,77,2,6,1,8589939250,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,8589942648,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,34359742238,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,42949673405,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,60129543047,1,124.42,90.83,90.83,2011,3
2011-03-17,77,2,6,1,42949675525,1,124.42,90.83,90.83,2011,3


Out[4]: DataFrame[path: string]

### 2. Optimize
When to use: To optimize the layout of files stored in Delta Lake, especially after many write or update operations that can generate small files.

Why to use: Delta Lake can end up with many small files after write or merge operations. This can hurt query performance due to the overhead of reading many files. Optimize combines small files into larger files, improving read and processing performance.

Recommendation: Use optimize at regular intervals or after large write operations to ensure that the data layout remains efficient. To further enhance performance, optimize can be combined with Z-Ordering.
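
To make the idea of combining small files concrete, here is a simplified pure-Python model of compaction planning. It groups file sizes into bins whose total stays at or under a target size. This is only an illustration of the concept, not the algorithm that Delta Lake's `OPTIMIZE` uses, and `plan_compaction` and the sizes are hypothetical.

```python
from typing import List


def plan_compaction(file_sizes: List[int], target_size: int) -> List[List[int]]:
    """Group file sizes into bins of at most target_size (largest files first).

    A file larger than target_size is placed in a bin of its own.
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    bins: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current bin when adding this file would exceed the target
        if current and current_total + size > target_size:
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)
    return bins


# Five small files (hypothetical sizes in MB) packed into bins of at most 100 MB
print(plan_compaction([40, 10, 30, 20, 50], target_size=100))
# [[50, 40], [30, 20, 10]]
```

In this model five files become two, so a reader opens fewer files to scan the same data.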

In [0]:

delta_table = DeltaTable.forName(spark, "sales_case.gold_fact_sales")
delta_table.optimize().executeCompaction()

Out[2]: DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterParallelism:bigint,totalScheduledTasks:bigint,autoCompactParallelismStats:struct<maxClusterActiveParallelism:bigint,minClusterActiveParallelism:bigint,maxSessionActiveParallelism:bigint,minSessionActiveParallelism:bi

### 3. Z-Ordering
When to use: To optimize queries that frequently filter on specific columns, such as date or key columns.

Why to use: Z-Ordering colocates rows with similar values of the chosen column or columns in the same files. Queries that filter on those columns can then skip files that cannot contain matching rows, reducing the time required to retrieve the filtered records.

Recommendation: Use Z-Ordering on columns that are frequently used in filter clauses to improve the reading of related data. Combine this with optimize to have the data more efficiently organized on disk.
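
The name comes from the Z-order (Morton) curve, which maps several dimensions onto one sort key by interleaving their bits. The sketch below illustrates that idea for two non-negative integers. It is not Delta Lake's implementation, and `interleave_bits` is a hypothetical helper.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a Z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return key


# Sort the points of a 4x4 grid by their Z-order key
points = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
print(ordered[:4])
# [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The first four points in key order form a 2x2 block: values that are close in both dimensions end up close in the sort order, which is what lets a filter on either column skip whole files.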

In [0]:
# Execute Z-Ordering optimization on the column "SalesDate"
# executeZOrderBy runs the operation itself and returns a DataFrame with metrics
delta_table = DeltaTable.forName(spark, "sales_case.gold_fact_sales")
delta_table.optimize().executeZOrderBy("SalesDate")

# Using SQL
spark.sql("OPTIMIZE sales_case.gold_fact_sales ZORDER BY (SalesDate)")


Out[10]: DataFrame[path: string, metrics: struct<numFilesAdded:bigint,numFilesRemoved:bigint,filesAdded:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,filesRemoved:struct<min:bigint,max:bigint,avg:double,totalFiles:bigint,totalSize:bigint>,partitionsOptimized:bigint,zOrderStats:struct<strategyName:string,inputCubeFiles:struct<num:bigint,size:bigint>,inputOtherFiles:struct<num:bigint,size:bigint>,inputNumCubes:bigint,mergedFiles:struct<num:bigint,size:bigint>,numOutputCubes:bigint,mergedNumCubes:bigint>,numBatches:bigint,totalConsideredFiles:bigint,totalFilesSkipped:bigint,preserveInsertionOrder:boolean,numFilesSkippedToReduceWriteAmplification:bigint,numBytesSkippedToReduceWriteAmplification:bigint,startTimeMs:bigint,endTimeMs:bigint,totalClusterParallelism:bigint,totalScheduledTasks:bigint,autoCompactParallelismStats:struct<maxClusterActiveParallelism:bigint,minClusterActiveParallelism:bigint,maxSessionActiveParallelism:bigint,minSessionActiveParallelism:b

### 4. Update, Delete, and Upsert Operations
When to use: To modify or remove data directly in a Delta table without needing to overwrite the entire table.

Why to use: Delta Lake allows you to perform upsert (a combination of update and insert) and delete operations, which is essential in data pipelines that require continuous corrections, removals, or updates, such as fact tables or historical data.

Recommendation: These operations are useful for efficiently adjusting data, especially when the data volume is not massive or when data needs frequent corrections.
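
The semantics of an upsert can be sketched without Spark. The pure-Python model below applies the two merge branches to lists of dictionaries: a source row whose key matches a target row updates it, and any other source row is inserted. `upsert` is a hypothetical helper for illustration only, and the rows reuse the manufacturer values from this notebook.

```python
from typing import Dict, List

Row = Dict[str, object]


def upsert(target: List[Row], source: List[Row], key: str) -> List[Row]:
    """Return a new list: matched source rows update the target, others are inserted."""
    merged = {row[key]: dict(row) for row in target}   # copy so target is untouched
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)    # when matched: update
        else:
            merged[row[key]] = dict(row)    # when not matched: insert
    return list(merged.values())


target_rows = [{"manufacturerID": 7, "manufacturer": "VanArsdel"}]
source_rows = [
    {"manufacturerID": 7, "manufacturer": "VanArsdel Inc."},
    {"manufacturerID": 8, "manufacturer": "New manufacturer"},
]
print(upsert(target_rows, source_rows, key="manufacturerID"))
# [{'manufacturerID': 7, 'manufacturer': 'VanArsdel Inc.'}, {'manufacturerID': 8, 'manufacturer': 'New manufacturer'}]
```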

### Insert Example


In [0]:
delta_table = DeltaTable.forName(spark, "sales_case.gold_dim_manufacturer")
# Compute the next value for sk_manufacturer
next_sk = delta_table.toDF().select(max("sk_manufacturer")).collect()[0][0] + 1

# Create new row to be inserted
new_row = spark.createDataFrame([
    (8, "New manufacturer", next_sk)  
], ["manufacturerID", "manufacturer", "sk_manufacturer"])

# Add column
new_row = new_row.withColumn("Date_updated", current_timestamp())

new_row.show()

# Run Insert
delta_table.alias("target").merge(
    new_row.alias("source"),
    "target.manufacturerID = source.manufacturerID"
).whenNotMatchedInsertAll().execute()


+--------------+----------------+---------------+--------------------+
|manufacturerID|    manufacturer|sk_manufacturer|        Date_updated|
+--------------+----------------+---------------+--------------------+
|             8|New manufacturer|              3|2025-03-09 22:53:...|
+--------------+----------------+---------------+--------------------+



In [0]:
display(spark.sql("select * from sales_case.gold_dim_manufacturer"))

ManufacturerID,Manufacturer,sk_manufacturer
8,New manufacturer,2
7,VanArsdel,1


### Update Example

In [0]:
delta_table = DeltaTable.forName(spark, "sales_case.gold_dim_manufacturer")
delta_table.update(
    condition = col("manufacturerID") == 7,  
    set = { 
        "manufacturer": "'VanArsdel Inc.'"
    }
)

In [0]:
display(spark.sql("select * from sales_case.gold_dim_manufacturer"))

ManufacturerID,Manufacturer,sk_manufacturer
8,New manufacturer,2
7,VanArsdel Inc.,1


### Delete Example

In [0]:
# Delete example
delta_table.delete(condition = col("manufacturerID") == 8)


### UPSERT Example

In [0]:

# Load the source DataFrame (new data)
df_silver = spark.read.table('sales_case.silver_sales_table')

# Extract the unique manufacturers for the Manufacturer dimension
dim_manufacturer_df = df_silver.select("manufacturerID", "manufacturer").dropDuplicates()

# Add a surrogate key.
# Caution: monotonically_increasing_id() yields unique but not consecutive values,
# and the matched branch below overwrites sk_manufacturer on existing rows.
dim_manufacturer_df = dim_manufacturer_df.withColumn("sk_manufacturer", monotonically_increasing_id()+1)

# Load the target Delta table (existing table)
df_target = DeltaTable.forName(spark, "sales_case.gold_dim_manufacturer")

# Run the merge: update matching rows, insert new ones
df_target.alias("target").merge(
    dim_manufacturer_df.alias("source"),
    "target.manufacturerID = source.manufacturerID"
).whenMatchedUpdate(set={
    "manufacturer": "source.manufacturer",
    "sk_manufacturer": "source.sk_manufacturer"
}).whenNotMatchedInsert(values={
    "manufacturer": "source.manufacturer",
    "manufacturerID": "source.manufacturerID",
    "sk_manufacturer": "source.sk_manufacturer"
}).execute()


### 5. History and Time Travel
When to use: To audit changes in the Delta table or to access previous versions of the data.

Why to use: Delta Lake maintains a transaction log that allows tracking all modifications made to the table. This is useful for auditing and recovering data from a previous point in time.

Recommendation: Use history and time travel to debug issues or restore previous versions of the data when necessary. Keep in mind that vacuum deletes the data files behind old versions, so time travel only reaches versions whose files are still retained.
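
Time travel depends on both the transaction log and the data files it references. The sketch below is a deliberately simplified model of that relationship: each commit lists the files it adds and removes, and replaying the commits up to a version gives the files that version needs. The log structure, file names, and helper functions are hypothetical and far simpler than Delta Lake's actual log format.

```python
from typing import Dict, List, Set

# A commit lists the data files it adds and removes (simplified model)
Commit = Dict[str, List[str]]


def snapshot_at(log: List[Commit], version: int) -> Set[str]:
    """Replay commits 0..version and return the data files that version needs."""
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} is not in the log")
    live: Set[str] = set()
    for commit in log[: version + 1]:
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live


def can_time_travel(log: List[Commit], version: int, files_on_disk: Set[str]) -> bool:
    """A version is readable only if every file it needs still exists."""
    return snapshot_at(log, version) <= files_on_disk


log = [
    {"add": ["a.parquet"]},                           # version 0: initial write
    {"add": ["b.parquet"]},                           # version 1: append
    {"remove": ["a.parquet"], "add": ["c.parquet"]},  # version 2: update rewrites a
]
print(sorted(snapshot_at(log, 1)))  # ['a.parquet', 'b.parquet']
print(sorted(snapshot_at(log, 2)))  # ['b.parquet', 'c.parquet']

# After a vacuum deletes a.parquet, version 1 can no longer be read
after_vacuum = {"b.parquet", "c.parquet"}
print(can_time_travel(log, 1, after_vacuum))  # False
print(can_time_travel(log, 2, after_vacuum))  # True
```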

### Getting history of a table

In [0]:
history_df = DeltaTable.forName(spark, "sales_case.gold_dim_manufacturer").history()

display(history_df)

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
9,2025-03-09T22:57:41.000+0000,942633494302329,rafael_rampineli@hotmail.com,MERGE,"Map(predicate -> [""(manufacturerID#39150 = manufacturerID#39023)""], matchedPredicates -> [{""actionType"":""update""}], notMatchedPredicates -> [{""actionType"":""insert""}], notMatchedBySourcePredicates -> [])",,List(2907468745482919),0309-155603-pos6077b,8.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 1, numTargetBytesAdded -> 1186, numTargetBytesRemoved -> 1221, numTargetDeletionVectorsAdded -> 0, numTargetRowsMatchedUpdated -> 1, executionTimeMs -> 13229, materializeSourceTimeMs -> 8985, numTargetRowsInserted -> 0, numTargetRowsMatchedDeleted -> 0, scanTimeMs -> 2742, numTargetRowsUpdated -> 1, numOutputRows -> 1, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 1, numTargetFilesRemoved -> 1, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1241)",,Databricks-Runtime/12.2.x-scala2.12
8,2025-03-09T22:53:57.000+0000,942633494302329,rafael_rampineli@hotmail.com,DELETE,"Map(predicate -> [""(manufacturerID#37905 = 8)""])",,List(2907468745482919),0309-155603-pos6077b,7.0,WriteSerializable,False,"Map(numRemovedFiles -> 1, numRemovedBytes -> 1235, numCopiedRows -> 0, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1178, numDeletedRows -> 1, scanTimeMs -> 640, numAddedFiles -> 0, numAddedBytes -> 0, rewriteTimeMs -> 538)",,Databricks-Runtime/12.2.x-scala2.12
7,2025-03-09T22:53:26.000+0000,942633494302329,rafael_rampineli@hotmail.com,UPDATE,"Map(predicate -> [""(manufacturerID#37905 = 7)""])",,List(2907468745482919),0309-155603-pos6077b,6.0,WriteSerializable,False,"Map(numRemovedFiles -> 1, numRemovedBytes -> 1186, numCopiedRows -> 0, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 1284, scanTimeMs -> 604, numAddedFiles -> 1, numUpdatedRows -> 1, numAddedBytes -> 1221, rewriteTimeMs -> 668)",,Databricks-Runtime/12.2.x-scala2.12
6,2025-03-09T22:53:05.000+0000,942633494302329,rafael_rampineli@hotmail.com,MERGE,"Map(predicate -> [""(cast(manufacturerID#37095 as bigint) = manufacturerID#37192L)""], matchedPredicates -> [], notMatchedPredicates -> [{""actionType"":""insert""}], notMatchedBySourcePredicates -> [])",,List(2907468745482919),0309-155603-pos6077b,5.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 0, numTargetBytesAdded -> 0, numTargetBytesRemoved -> 0, numTargetDeletionVectorsAdded -> 0, numTargetRowsMatchedUpdated -> 0, executionTimeMs -> 1730, materializeSourceTimeMs -> 15, numTargetRowsInserted -> 0, numTargetRowsMatchedDeleted -> 0, scanTimeMs -> 0, numTargetRowsUpdated -> 0, numOutputRows -> 0, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 1, numTargetFilesRemoved -> 0, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1650)",,Databricks-Runtime/12.2.x-scala2.12
5,2025-03-09T22:48:21.000+0000,942633494302329,rafael_rampineli@hotmail.com,MERGE,"Map(predicate -> [""(cast(manufacturerID#35901 as bigint) = manufacturerID#36109L)""], matchedPredicates -> [], notMatchedPredicates -> [{""actionType"":""insert""}], notMatchedBySourcePredicates -> [])",,List(2907468745482919),0309-155603-pos6077b,4.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 0, numTargetBytesAdded -> 0, numTargetBytesRemoved -> 0, numTargetDeletionVectorsAdded -> 0, numTargetRowsMatchedUpdated -> 0, executionTimeMs -> 1289, materializeSourceTimeMs -> 5, numTargetRowsInserted -> 0, numTargetRowsMatchedDeleted -> 0, scanTimeMs -> 0, numTargetRowsUpdated -> 0, numOutputRows -> 0, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 1, numTargetFilesRemoved -> 0, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1209)",,Databricks-Runtime/12.2.x-scala2.12
4,2025-03-09T22:47:40.000+0000,942633494302329,rafael_rampineli@hotmail.com,MERGE,"Map(predicate -> [""(cast(manufacturerID#35213 as bigint) = manufacturerID#35318L)""], matchedPredicates -> [], notMatchedPredicates -> [{""actionType"":""insert""}], notMatchedBySourcePredicates -> [])",,List(2907468745482919),0309-155603-pos6077b,3.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 0, numTargetBytesAdded -> 0, numTargetBytesRemoved -> 0, numTargetDeletionVectorsAdded -> 0, numTargetRowsMatchedUpdated -> 0, executionTimeMs -> 1600, materializeSourceTimeMs -> 12, numTargetRowsInserted -> 0, numTargetRowsMatchedDeleted -> 0, scanTimeMs -> 0, numTargetRowsUpdated -> 0, numOutputRows -> 0, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 1, numTargetFilesRemoved -> 0, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1523)",,Databricks-Runtime/12.2.x-scala2.12
3,2025-03-09T22:44:53.000+0000,942633494302329,rafael_rampineli@hotmail.com,MERGE,"Map(predicate -> [""(cast(manufacturerID#34372 as bigint) = manufacturerID#34477L)""], matchedPredicates -> [], notMatchedPredicates -> [{""actionType"":""insert""}], notMatchedBySourcePredicates -> [])",,List(2907468745482919),0309-155603-pos6077b,2.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 0, numTargetBytesAdded -> 0, numTargetBytesRemoved -> 0, numTargetDeletionVectorsAdded -> 0, numTargetRowsMatchedUpdated -> 0, executionTimeMs -> 1504, materializeSourceTimeMs -> 5, numTargetRowsInserted -> 0, numTargetRowsMatchedDeleted -> 0, scanTimeMs -> 0, numTargetRowsUpdated -> 0, numOutputRows -> 0, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 1, numTargetFilesRemoved -> 0, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1435)",,Databricks-Runtime/12.2.x-scala2.12
2,2025-03-09T22:39:54.000+0000,942633494302329,rafael_rampineli@hotmail.com,MERGE,"Map(predicate -> [""(cast(manufacturerID#33529 as bigint) = manufacturerID#33634L)""], matchedPredicates -> [], notMatchedPredicates -> [{""actionType"":""insert""}], notMatchedBySourcePredicates -> [])",,List(2907468745482919),0309-155603-pos6077b,1.0,WriteSerializable,False,"Map(numTargetRowsCopied -> 0, numTargetRowsDeleted -> 0, numTargetFilesAdded -> 1, numTargetBytesAdded -> 1235, numTargetBytesRemoved -> 0, numTargetDeletionVectorsAdded -> 0, numTargetRowsMatchedUpdated -> 0, executionTimeMs -> 1822, materializeSourceTimeMs -> 5, numTargetRowsInserted -> 1, numTargetRowsMatchedDeleted -> 0, scanTimeMs -> 0, numTargetRowsUpdated -> 0, numOutputRows -> 1, numTargetDeletionVectorsRemoved -> 0, numTargetRowsNotMatchedBySourceUpdated -> 0, numTargetChangeFilesAdded -> 0, numSourceRows -> 1, numTargetFilesRemoved -> 0, numTargetRowsNotMatchedBySourceDeleted -> 0, rewriteTimeMs -> 1752)",,Databricks-Runtime/12.2.x-scala2.12
1,2025-03-09T21:18:53.000+0000,942633494302329,rafael_rampineli@hotmail.com,CREATE OR REPLACE TABLE AS SELECT,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(2907468745482829),0309-155603-pos6077b,0.0,WriteSerializable,False,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1186)",,Databricks-Runtime/12.2.x-scala2.12
0,2025-03-09T21:06:09.000+0000,942633494302329,rafael_rampineli@hotmail.com,CREATE OR REPLACE TABLE AS SELECT,"Map(isManaged -> true, description -> null, partitionBy -> [], properties -> {})",,List(2907468745482829),0309-155603-pos6077b,,WriteSerializable,False,"Map(numFiles -> 1, numOutputRows -> 1, numOutputBytes -> 1186)",,Databricks-Runtime/12.2.x-scala2.12


### Time Travel: accessing a table version

In [0]:
# Load the Delta table with a specific version number using Time Travel
df_time_travel_version = spark.read.option("versionAsOf", 9).table("sales_case.gold_dim_manufacturer")

# Show the DataFrame for the historical version
df_time_travel_version.show()


+--------------+------------+---------------+
|ManufacturerID|Manufacturer|sk_manufacturer|
+--------------+------------+---------------+
|             7|   VanArsdel|              1|
+--------------+------------+---------------+



In [0]:
# Load the Delta table with a specific version number using Time Travel
df_time_travel_version = spark.read.option("versionAsOf", 3).table("sales_case.gold_dim_manufacturer")

# Show the DataFrame for the historical version
df_time_travel_version.show()

+--------------+----------------+---------------+
|ManufacturerID|    Manufacturer|sk_manufacturer|
+--------------+----------------+---------------+
|             8|New manufacturer|              2|
|             7|       VanArsdel|              1|
+--------------+----------------+---------------+



### 6. Restoring an old version of a Delta table


In [0]:

delta_table = DeltaTable.forName(spark, "sales_case.gold_dim_manufacturer")

# Restore the table to version 2
delta_table.restoreToVersion(2)

# Display the table
display(spark.read.table('sales_case.gold_dim_manufacturer'))


ManufacturerID,Manufacturer,sk_manufacturer
8,New manufacturer,2
7,VanArsdel,1


### 7. Compaction

When to use: To group small files resulting from multiple write operations into larger files, improving read performance.

Why to use: Over time, write operations can generate many small files, and each file adds open, read, and metadata overhead at query time. Compaction rewrites these small files into fewer, larger ones to improve read performance and reduce that overhead.

Recommendation: Perform compaction operations regularly or after large write operations to maintain the data layout in optimized files.
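
When compacting by rewriting a table, a practical question is how many output files to aim for. One rough approach, sketched below, divides the total data size by a target file size and rounds up. The helper name and the 128 MiB target are assumptions for illustration, not values prescribed by Delta Lake.

```python
import math

MIB = 1024 * 1024


def target_file_count(total_bytes: int, target_file_bytes: int = 128 * MIB) -> int:
    """Number of files needed so that each is at most target_file_bytes (at least 1)."""
    if total_bytes < 0 or target_file_bytes <= 0:
        raise ValueError("sizes must be non-negative and the target positive")
    count = math.ceil(total_bytes / target_file_bytes)
    return count if count >= 1 else 1


print(target_file_count(1024 * MIB))   # 8
print(target_file_count(500_000))      # 1
```

The result could then be passed to `repartition(n)` before the write. For a table smaller than the target size this yields a single file.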

In [0]:
df = spark.read.table('sales_case.gold_dim_manufacturer')
# Rewrite the table from 2 partitions; maxRecordsPerFile caps the rows per file.
# checkpointLocation applies only to streaming writes, so it is omitted here.
df.repartition(2) \
    .write.option("maxRecordsPerFile", 1000000) \
    .mode("overwrite") \
    .format("delta") \
    .saveAsTable('sales_case.gold_dim_manufacturer')


1. Partitioning with repartition

repartition is used to increase or decrease the number of partitions evenly, redistributing the data through a shuffle. It is useful when you need more parallelism.

In [0]:
df_region = spark.read.table('sales_case.gold_dim_region')
# Check the initial number of partitions
print(f"# Partitions before: {df_region.rdd.getNumPartitions()}")

# Set to 2 partitions using repartition
df_region_repartition = df_region.repartition(2)

# Persist the data to a Delta table
df_region_repartition \
    .write.option("maxRecordsPerFile", 1000000) \
    .mode("overwrite") \
    .format("delta") \
    .saveAsTable('sales_case.gold_dim_region')

# Check the number of partitions after repartition
print(f"# Partitions After: {df_region_repartition.rdd.getNumPartitions()}")


# Partitions before: 4
# Partitions After: 2


2. Repartitioning with a Specific Column

If the dataset contains a key column (such as Region or Date), you can use repartition to redistribute the data based on a specific column, which can be useful to ensure that related data is processed together.

In [0]:
df_region = spark.read.table('sales_case.gold_dim_region')
# Repartition by the Region column
df_region_repartition = df_region.repartition(10, "Region")

df_region_repartition \
    .write.option("maxRecordsPerFile", 1000000) \
    .mode("overwrite") \
    .format("delta") \
    .saveAsTable('sales_case.gold_dim_region_rep')

# Check the number of partitions after repartitioning by the "Region" column
print(f"# Partitions after repartitioned by region column: {df_region_repartition.rdd.getNumPartitions()}")


# Partitions after repartitioned by region column: 10


3. Reducing Partitions with coalesce

coalesce is used to reduce the number of partitions without performing a shuffle, which is useful when you want to consolidate partitions and reduce the number of tasks, such as when writing to disk.

In [0]:
# Repartition to 100, then use coalesce to reduce the partitions to 5
df_coalesced = df_region.repartition(100).coalesce(5)

# Persist the data to a Delta table
df_coalesced \
    .write.option("maxRecordsPerFile", 1000000) \
    .mode("overwrite") \
    .format("delta") \
    .saveAsTable('sales_case.gold_dim_region_colaesce')

# Check the number of partitions after coalesce
print(f"# partitions after coalesce: {df_coalesced.rdd.getNumPartitions()}")


# partitions after coalesce: 5


Summary of Techniques:

- repartition(n): 
Redistributes the data evenly into n partitions through a full shuffle. Useful for increasing the number of partitions or ensuring better distribution.

- repartition(col): 
Redistributes the data by hashing one or more columns, so rows with the same column values land in the same partition.

- coalesce(n): 
Reduces the number of partitions without a full shuffle by merging existing partitions. The resulting partitions can be uneven.

When to Use:

- repartition: 
Use when you want to increase the number of partitions or redistribute the data more evenly, for example when the existing partitions are skewed or there are many small ones.

- coalesce: 
Use when reducing the number of partitions while avoiding a shuffle, especially just before writing data to storage.
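
The difference between the two approaches can be sketched with plain Python lists standing in for partitions. In this simplified model, coalescing merges whole existing partitions, so no row leaves its original group, while repartitioning deals every row out again. These functions are hypothetical illustrations; Spark chooses its actual groupings and row placement differently.

```python
from typing import List, Sequence


def coalesce_model(partitions: Sequence[Sequence[int]], n: int) -> List[List[int]]:
    """Merge whole neighboring partitions into at most n groups (no shuffle)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    n = n if n < len(partitions) else len(partitions)  # never increases the count
    groups: List[List[int]] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        groups[i * n // len(partitions)].extend(part)
    return groups


def repartition_model(partitions: Sequence[Sequence[int]], n: int) -> List[List[int]]:
    """Deal every row round-robin into n new partitions (a full shuffle)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rows = [row for part in partitions for row in part]
    groups: List[List[int]] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        groups[i % n].append(row)
    return groups


# Four skewed partitions: one large, three tiny
skewed = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
print([len(g) for g in coalesce_model(skewed, 2)])     # [7, 2]
print([len(g) for g in repartition_model(skewed, 2)])  # [5, 4]
```

The coalesced result inherits the skew of its inputs, whereas the repartitioned result is balanced at the cost of moving every row.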

### Checking Repartition/Compaction

In [0]:
#%fs ls /mnt/lhdw/gold/vendas_delta/
# Checking our delta table information
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/sales_case.db/"))

path,name,size,modificationTime
dbfs:/user/hive/warehouse/sales_case.db/bronze_sales_table/,bronze_sales_table/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_category/,gold_dim_category/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_client/,gold_dim_client/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_manufacturer/,gold_dim_manufacturer/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_product/,gold_dim_product/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/,gold_dim_region/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_colaesce/,gold_dim_region_colaesce/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_rep/,gold_dim_region_rep/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_segment/,gold_dim_segment/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_fact_sales/,gold_fact_sales/,0,0


In [0]:
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/"))

path,name,size,modificationTime
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/_delta_log/,_delta_log/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/part-00000-99b75f1e-9700-4df7-8605-e1203e55aeb5-c000.snappy.parquet,part-00000-99b75f1e-9700-4df7-8605-e1203e55aeb5-c000.snappy.parquet,127154,1741554401000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/part-00000-f353c8a1-dfd4-4884-bd8a-9205f5095cd8-c000.snappy.parquet,part-00000-f353c8a1-dfd4-4884-bd8a-9205f5095cd8-c000.snappy.parquet,247445,1741561723000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/part-00001-67ab2126-df95-4891-84ce-d339eb4cc6d1-c000.snappy.parquet,part-00001-67ab2126-df95-4891-84ce-d339eb4cc6d1-c000.snappy.parquet,247196,1741561723000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/part-00001-d32e7ca9-4a4b-4512-8df2-0ebcac93f7eb-c000.snappy.parquet,part-00001-d32e7ca9-4a4b-4512-8df2-0ebcac93f7eb-c000.snappy.parquet,125727,1741554401000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/part-00002-e72b3223-94b9-48b7-b5dd-7896b97b6eeb-c000.snappy.parquet,part-00002-e72b3223-94b9-48b7-b5dd-7896b97b6eeb-c000.snappy.parquet,126301,1741554401000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/part-00003-de91f6cf-98e6-4087-ba16-0d07dea438b1-c000.snappy.parquet,part-00003-de91f6cf-98e6-4087-ba16-0d07dea438b1-c000.snappy.parquet,133094,1741554401000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/part-00004-8709e613-d182-403f-bc6f-95bc7232e991-c000.snappy.parquet,part-00004-8709e613-d182-403f-bc6f-95bc7232e991-c000.snappy.parquet,67406,1741561810000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/part-00005-e787e4d3-2e7e-4ac6-acbc-1e91cc67a6e5-c000.snappy.parquet,part-00005-e787e4d3-2e7e-4ac6-acbc-1e91cc67a6e5-c000.snappy.parquet,173432,1741561810000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region/part-00009-613b7f9f-1ecf-4cc5-981f-e1bc0db7a2aa-c000.snappy.parquet,part-00009-613b7f9f-1ecf-4cc5-981f-e1bc0db7a2aa-c000.snappy.parquet,228538,1741561810000


In [0]:
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_colaesce/"))

path,name,size,modificationTime
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_colaesce/_delta_log/,_delta_log/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_colaesce/part-00000-dd218818-5cbd-4e9d-a928-a5c89ea31322-c000.snappy.parquet,part-00000-dd218818-5cbd-4e9d-a928-a5c89ea31322-c000.snappy.parquet,104918,1741561943000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_colaesce/part-00001-b6061552-2c0a-40b8-bcea-8d5dd7356517-c000.snappy.parquet,part-00001-b6061552-2c0a-40b8-bcea-8d5dd7356517-c000.snappy.parquet,104170,1741561943000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_colaesce/part-00002-d545ba5e-1ed5-4f69-b9d9-071b8da7e320-c000.snappy.parquet,part-00002-d545ba5e-1ed5-4f69-b9d9-071b8da7e320-c000.snappy.parquet,105154,1741561943000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_colaesce/part-00003-66ac7a8d-a381-4ac3-8680-8c677ea775ee-c000.snappy.parquet,part-00003-66ac7a8d-a381-4ac3-8680-8c677ea775ee-c000.snappy.parquet,104641,1741561943000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_colaesce/part-00004-f638388e-b391-47da-9032-084ff998d66c-c000.snappy.parquet,part-00004-f638388e-b391-47da-9032-084ff998d66c-c000.snappy.parquet,104055,1741561943000


In [0]:
display(dbutils.fs.ls("dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_rep/"))

path,name,size,modificationTime
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_rep/_delta_log/,_delta_log/,0,0
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_rep/part-00004-68ce2ed7-3993-469d-aa13-f13ca4307f23-c000.snappy.parquet,part-00004-68ce2ed7-3993-469d-aa13-f13ca4307f23-c000.snappy.parquet,67406,1741562318000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_rep/part-00005-9d1c05ca-d17a-47c7-a6c2-f91d67977981-c000.snappy.parquet,part-00005-9d1c05ca-d17a-47c7-a6c2-f91d67977981-c000.snappy.parquet,173432,1741562318000
dbfs:/user/hive/warehouse/sales_case.db/gold_dim_region_rep/part-00009-5196676e-09c8-4d39-ba2d-45c6181c1612-c000.snappy.parquet,part-00009-5196676e-09c8-4d39-ba2d-45c6181c1612-c000.snappy.parquet,228538,1741562318000


In [0]:
%fs ls /mnt/lhdw/gold/vendas_delta/geo_coalesce/

path,name,size,modificationTime
dbfs:/mnt/lhdw/gold/vendas_delta/geo_coalesce/_delta_log/,_delta_log/,0,0
dbfs:/mnt/lhdw/gold/vendas_delta/geo_coalesce/part-00000-da5c299f-2306-49ad-ab30-0c80f9e62e43-c000.snappy.parquet,part-00000-da5c299f-2306-49ad-ab30-0c80f9e62e43-c000.snappy.parquet,82039,1727816375000
dbfs:/mnt/lhdw/gold/vendas_delta/geo_coalesce/part-00001-6746cb47-d3af-4800-a635-f31e8cd5c57d-c000.snappy.parquet,part-00001-6746cb47-d3af-4800-a635-f31e8cd5c57d-c000.snappy.parquet,82622,1727816375000
dbfs:/mnt/lhdw/gold/vendas_delta/geo_coalesce/part-00002-7f7e6236-5d10-4f91-b315-f762dfd22d76-c000.snappy.parquet,part-00002-7f7e6236-5d10-4f91-b315-f762dfd22d76-c000.snappy.parquet,82522,1727816375000
dbfs:/mnt/lhdw/gold/vendas_delta/geo_coalesce/part-00003-425738a7-3f4a-4857-8d71-80531cce9533-c000.snappy.parquet,part-00003-425738a7-3f4a-4857-8d71-80531cce9533-c000.snappy.parquet,82471,1727816375000
dbfs:/mnt/lhdw/gold/vendas_delta/geo_coalesce/part-00004-015bdc49-55f3-426a-bd63-a8e8eb39ccc8-c000.snappy.parquet,part-00004-015bdc49-55f3-426a-bd63-a8e8eb39ccc8-c000.snappy.parquet,82138,1727816375000
