
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>


# File Explosion
We see many data engineers partitioning their tables in ways that cause major performance issues without improving query performance. This is called "over-partitioning". This demo shows what that looks like in practice.

##### Useful References
- [Partitioning Recommendations](https://docs.databricks.com/en/tables/partitions.html)
- [CREATE TABLE Syntax](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-table-using.html)
- [About ZORDER](https://docs.databricks.com/en/delta/data-skipping.html)
- [About Liquid Clustering](https://docs.databricks.com/en/delta/clustering.html)

### Set up the classroom and disable caching
Run the following cell to set up the lesson.

In [0]:
%run ./Includes/Classroom-Setup-04.1

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.


Resetting the learning environment (4.1):
| No action taken

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-engineer-learning-path/v04"

Validating the locally installed datasets:
| listing local files...(5 seconds)
| validation completed...(5 seconds total)

Creating & using the schema "labuser9308469_1740138925_77gg_da_adewd_4_1" in the catalog "hive_metastore"...(1 seconds)


Run the following cell, which sets the Spark configuration `spark.databricks.io.cache.enabled` to `False` to disable the disk cache. With caching off, repeated reads are not served from cache, so the effect of the optimizations is more apparent.

In [0]:
spark.conf.set('spark.databricks.io.cache.enabled', False)


### Process & Write IoT data
Let's generate some fake IoT data. This first time around, we'll generate only 2,500 rows.

In [0]:
from pyspark.sql.functions import col, from_unixtime, hash, lit, rand

df = (spark
      .range(0, 2500)
      .select(
          hash('id').alias('id'),  # hash the sequential ids so they look random
          rand().alias('value'),   # random reading in [0, 1)
          from_unixtime(lit(1701692381) + col('id')).alias('time')  # one row per second
      ))

df.display()

id,value,time
-1670924195,0.7004760670799265,2023-12-04 12:19:41
-1712319331,0.1845052646390091,2023-12-04 12:19:42
-797927272,0.872582005013014,2023-12-04 12:19:43
519220707,0.1120077344591474,2023-12-04 12:19:44
1344313940,0.2112286677760532,2023-12-04 12:19:45
1607884268,0.7052048004815178,2023-12-04 12:19:46
-1767354555,0.4491498646955369,2023-12-04 12:19:47
1293116811,0.4670710241460388,2023-12-04 12:19:48
-1131184084,0.0781718532952855,2023-12-04 12:19:49
1504843649,0.5760727926139791,2023-12-04 12:19:50
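
To make the `time` column easier to reason about, here is a minimal plain-Python sketch of the same arithmetic: a base epoch second plus the row's original (unhashed) `id`. It assumes the session time zone is UTC, which is consistent with the output displayed above; `row_time` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime, timezone

BASE_EPOCH = 1701692381  # same constant as in the cell above


def row_time(row_id: int) -> str:
    """Mimic from_unixtime(BASE_EPOCH + id), assuming a UTC session time zone."""
    ts = datetime.fromtimestamp(BASE_EPOCH + row_id, tz=timezone.utc)
    return ts.strftime("%Y-%m-%d %H:%M:%S")


first_time = row_time(0)     # row generated from id 0
last_time = row_time(2499)   # row generated from id 2499
print(first_time, last_time)
```

Under that assumption, the 2,500 rows span one row per second from 12:19:41 to 13:01:20 on 2023-12-04.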


Now we'll write the data to a table partitioned by `id`. Because practically every `id` value is unique, each row lands in its own partition directory and therefore its own file. Writing 2,500 rows this way takes a long time. Note how long it takes to generate the table.

In [0]:

(df
 .write
 .mode('overwrite')
 .option("overwriteSchema", "true")
 .partitionBy('id')
 .saveAsTable("iot_data")
)
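
The cost of this write comes from the cardinality of the partition column. The following plain-Python sketch counts how many partitions 2,500 rows would produce for two candidate keys. It is an illustration under assumptions: the raw integer stands in for Spark's `hash()` output, and the date-based key is a hypothetical alternative that does not appear in this notebook.

```python
from datetime import datetime, timezone

BASE_EPOCH = 1701692381
NUM_ROWS = 2500

# Stand-in rows: (id, timestamp). The notebook hashes the id with Spark's hash();
# the raw integer is used here because it is equally unique for this illustration.
rows = [
    (i, datetime.fromtimestamp(BASE_EPOCH + i, tz=timezone.utc))
    for i in range(NUM_ROWS)
]

# Each distinct partition value becomes its own directory with at least one file.
partitions_by_id = {row_id for row_id, _ in rows}
partitions_by_date = {ts.date() for _, ts in rows}  # hypothetical low-cardinality key

print(len(partitions_by_id), len(partitions_by_date))
```

A key with one distinct value per row yields 2,500 partitions, while a key derived from the calendar date yields a single partition for the same data.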

### Query the Table
Run the following two queries against the table we just wrote, and note how long each one takes to execute.

In [0]:
%sql
SELECT * FROM iot_data WHERE id = 519220707

id,value,time
519220707,0.1120077344591474,2023-12-04 12:19:44


In [0]:
%sql
SELECT avg(value) FROM iot_data WHERE time >= '2023-12-04 12:19:00' AND time <= '2023-12-04 13:01:20'

avg(value)
0.4928446630442581
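
The two queries stress the layout differently. The sketch below is a simplified model, not Spark's actual planner: it treats the table as one file per `id` partition and counts how many files each filter has to consider. The names `files_opened_for_id` and `files_opened_for_time_range` are hypothetical, and integer ids stand in for the hashed values.

```python
BASE_EPOCH = 1701692381   # same constant used to generate the data
NUM_ROWS = 2500

# Hypothetical model: one file per partition value, keyed by a stand-in id,
# storing the epoch second of the single row in that file.
files = {row_id: BASE_EPOCH + row_id for row_id in range(NUM_ROWS)}


def files_opened_for_id(target_id):
    # The filter is on the partition column, so only the matching partition is opened.
    return [key for key in files if key == target_id]


def files_opened_for_time_range(start_epoch, end_epoch):
    # `time` is not the partition column, so no partition can be pruned up front.
    candidates = list(files)
    matching = [key for key in candidates if start_epoch <= files[key] <= end_epoch]
    return candidates, matching


by_id = files_opened_for_id(3)
# 12:19:00 is 41 seconds before the first row; 13:01:20 is the last row.
candidates, matching = files_opened_for_time_range(BASE_EPOCH - 41, BASE_EPOCH + 2499)
print(len(by_id), len(candidates), len(matching))
```

In this model the `id` lookup touches one file, while the time-range aggregate has to visit all 2,500 tiny files, since the window covers every row.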


### Fixing the Problem

Up to this point, we have been working with 2,500 rows of data. We are now going to increase the volume dramatically to 50,000,000 rows. Partitioning a dataset this large by `id`, as in the code above, would create tens of millions of tiny files and take far longer to write.

As before, the following cell generates the data.

In [0]:
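from pyspark.sql.functions import col, from_unixtime, hash, lit, rand

df = (spark
      .range(0, 50000000, 1, 32)  # 50,000,000 rows split across 32 Spark partitions
      .select(
          hash('id').alias('id'),  # hash the sequential ids so they look random
          rand().alias('value'),   # random reading in [0, 1)
          from_unixtime(lit(1701692381) + col('id')).alias('time')  # one row per second
      ))

df.display()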

id,value,time
-1670924195,0.4755591242335129,2023-12-04 12:19:41
-1712319331,0.1478178756478926,2023-12-04 12:19:42
-797927272,0.5194975065418095,2023-12-04 12:19:43
519220707,0.3404843222988339,2023-12-04 12:19:44
1344313940,0.0151437514856965,2023-12-04 12:19:45
1607884268,0.5629979439109172,2023-12-04 12:19:46
-1767354555,0.7755208720063631,2023-12-04 12:19:47
1293116811,0.395524807885832,2023-12-04 12:19:48
-1131184084,0.4014444652048761,2023-12-04 12:19:49
1504843649,0.7432852345704998,2023-12-04 12:19:50



Now we'll write the data to a table again, this time without partitioning. Compared with the partitioned version, this approach:
- takes less time to write, even on larger datasets
- writes far fewer files
- selects a single `id` in about the same time
- filters by `time` faster

In [0]:
(df
 .write
 .option("overwriteSchema", "true")
 .mode('overwrite')
 .saveAsTable("iot_data")
)

### Validate optimization
The next two cells repeat the queries from earlier and will put this change to the test. The first cell should run almost as fast as before, and the second cell should run much faster.

In [0]:
%sql
SELECT * FROM iot_data WHERE id = 519220707

id,value,time
519220707,0.3404843222988339,2023-12-04 12:19:44


In [0]:
%sql
SELECT avg(value) FROM iot_data WHERE time >= '2023-12-04 12:19:00' AND time <= '2023-12-04 13:01:20'

avg(value)
0.5100211151447167
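
One plausible reason the time filter is fast on the unpartitioned table is file-level data skipping. The sketch below models it in plain Python under two assumptions that the notebook does not state: that the write produces one file per Spark partition (32, from `range(0, 50000000, 1, 32)`), and that the table format keeps per-file min/max statistics for `time`, as Delta Lake's data skipping does. The real file count can differ depending on cluster and write settings.

```python
NUM_ROWS = 50_000_000
NUM_FILES = 32            # assumption: one file per Spark partition
BASE_EPOCH = 1701692381

rows_per_file = NUM_ROWS // NUM_FILES

# range() hands each Spark partition a contiguous block of ids, so every
# hypothetical file covers a contiguous block of seconds: (min_epoch, max_epoch).
file_stats = [
    (BASE_EPOCH + i * rows_per_file, BASE_EPOCH + (i + 1) * rows_per_file - 1)
    for i in range(NUM_FILES)
]

window_start = BASE_EPOCH - 41     # 2023-12-04 12:19:00
window_end = BASE_EPOCH + 2499     # 2023-12-04 13:01:20

# A file can be skipped when its [min, max] range does not overlap the query window.
files_to_scan = [
    (lo, hi) for lo, hi in file_stats
    if hi >= window_start and lo <= window_end
]
print(len(files_to_scan), "of", NUM_FILES, "files overlap the query window")
```

In this model only the first file overlaps the window, so the other 31 can be skipped from their statistics alone.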


### Liquid Clustering
An alternative to partitioning is [Liquid Clustering](https://docs.databricks.com/en/delta/clustering.html). Liquid Clustering often performs much better than partitioning, especially on high-cardinality columns. We will look at Liquid Clustering in the next demo.


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>