# Data Skipping

Data skipping information is collected **automatically** when you write data into a Delta Lake table. Delta Lake uses this information (the minimum and maximum values of each column in each data file) at query time to skip files that cannot contain matching rows, which makes queries faster.

You do not need to configure data skipping; the feature is activated whenever applicable. However, its effectiveness depends on the layout of your data. For best results, apply Z-Ordering.
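
To make the idea concrete, here is a minimal pure-Python sketch of min/max file skipping. It is not Delta Lake's implementation: the `FileStats` class and the `build_stats` and `files_to_scan` helpers are hypothetical names introduced only for illustration. It also hints at why layout matters, since the same values lead to fewer candidate files when they are clustered.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Hypothetical per-file statistics for a single column."""
    path: str
    min_value: int
    max_value: int


def build_stats(values: List[int], rows_per_file: int) -> List[FileStats]:
    """Split values into fixed-size 'files' and record min/max for each one."""
    stats = []
    for i in range(0, len(values), rows_per_file):
        chunk = values[i:i + rows_per_file]
        stats.append(FileStats(f"part-{i // rows_per_file}", min(chunk), max(chunk)))
    return stats


def files_to_scan(stats: List[FileStats], target: int) -> List[str]:
    """Keep only the files whose [min, max] range could contain `target`."""
    return [s.path for s in stats if s.min_value <= target <= s.max_value]


# The same 12 values in two layouts: interleaved vs. sorted (clustered).
interleaved = [1, 50, 99, 2, 51, 98, 3, 52, 97, 4, 53, 96]
clustered = sorted(interleaved)

scan_interleaved = files_to_scan(build_stats(interleaved, 3), target=52)
scan_clustered = files_to_scan(build_stats(clustered, 3), target=52)

print(len(scan_interleaved), "files to scan with interleaved data")
print(len(scan_clustered), "file(s) to scan with clustered data")
```

With interleaved data every file has a wide min/max range, so none can be skipped; with clustered data only one file can contain the value.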

Collecting statistics on a column that contains long values, such as strings or binary data, is an expensive operation. To avoid collecting statistics on such columns, you can configure the table property **delta.dataSkippingNumIndexedCols**.

This property sets a cutoff on the position index of columns in the table's schema. All columns with a position index less than the **delta.dataSkippingNumIndexedCols** property will have statistics collected. The default value is 32.

For the purposes of collecting statistics, each field within a nested column is considered an individual column. To avoid collecting statistics on columns containing long values, either set the **delta.dataSkippingNumIndexedCols** property so that the long-value columns come after this index in the table's schema, or move those columns to an index position greater than the **delta.dataSkippingNumIndexedCols** property by using **ALTER TABLE ALTER COLUMN**.
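
The rule above can be sketched in plain Python. This is only an approximation of the behavior described in the text, not Delta Lake code: the schema representation and the `leaf_columns` and `indexed_columns` helpers are hypothetical and exist only for this example.

```python
from typing import List, Tuple, Union

# A schema is an ordered list of (name, type) pairs; a nested struct is
# represented by using another such list as the type.
Schema = List[Tuple[str, Union[str, list]]]


def leaf_columns(schema: Schema, prefix: str = "") -> List[str]:
    """Flatten a schema into leaf column paths, preserving schema order."""
    leaves = []
    for name, dtype in schema:
        path = f"{prefix}{name}"
        if isinstance(dtype, list):
            # Each field of a nested column counts as an individual column.
            leaves.extend(leaf_columns(dtype, prefix=f"{path}."))
        else:
            leaves.append(path)
    return leaves


def indexed_columns(schema: Schema, num_indexed_cols: int) -> List[str]:
    """Return the leaf columns whose position index is below the limit."""
    return leaf_columns(schema)[:num_indexed_cols]


example_schema: Schema = [
    ("id", "long"),
    ("address", [("city", "string"), ("zip", "string")]),
    ("payload", "binary"),
    ("amount", "double"),
]

print(indexed_columns(example_schema, 3))
```

In this made-up schema a limit of 3 covers `id` and the two nested `address` fields, so neither `payload` nor `amount` gets statistics. Moving `amount` ahead of `payload` and raising the limit to 4 would keep stats on `amount` while still skipping the binary column.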



## Generate dummy data

The test data is generated with the [Data Generator](https://github.com/databrickslabs/dbldatagen) library (dbldatagen).

In [None]:
%pip install dbldatagen
%pip install jmespath

In [None]:
import dbldatagen as dg
from pyspark.sql.types import IntegerType, FloatType, StringType

column_count = 35  # generates r_0 .. r_34, so the table has more than 32 columns
data_rows = 1000

# Spec: id column, 35 float columns, then 5 code columns (code5 is defined last)
df_spec = (
    dg.DataGenerator(spark, name="test_data_set1", rows=data_rows, partitions=4)
    .withIdOutput()
    .withColumn("r", FloatType(),
                expr="floor(rand() * 350) * (86400 + 3600)",
                numColumns=column_count)
    .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
    .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
    .withColumn("code3", StringType(), values=['a', 'b', 'c'])
    .withColumn("code4", StringType(), values=['a', 'b', 'c'],
                random=True)
    .withColumn("code5", StringType(), values=['a', 'b', 'c'],
                random=True, weights=[9, 1, 1])
)

# Start from a clean table
delta_table_name = 'demo.data_skipping_demo'
spark.sql(f"DROP TABLE IF EXISTS {delta_table_name}")

df = df_spec.build()
df.write.format("delta").saveAsTable(delta_table_name)

## Checking stats

As you can see, there are more than 32 columns in the table, so only the first 32 columns will have statistics collected.

In [None]:
%%sql
DESCRIBE demo.data_skipping_demo

In [None]:
from pyspark.sql.types import StructType, StructField, LongType, StringType
from pyspark.sql.functions import col, from_json

# Read the first commit of the Delta log (the initial write)
deltalog = spark.read.json("Tables/data_skipping_demo/_delta_log/00000000000000000000.json")

# minValues, maxValues and nullCount are kept as raw JSON strings
schema = StructType([
    StructField('numRecords', LongType(), True),
    StructField('minValues', StringType(), True),
    StructField('maxValues', StringType(), True),
    StructField('nullCount', StringType(), True)
])

df_add = (
    deltalog
    .where("add is not null")  # keep only the rows that describe added files
    .select(from_json(col('add.stats'), schema).alias('stats'))
    .select('stats.numRecords', 'stats.minValues', 'stats.maxValues', 'stats.nullCount')
)
display(df_add)
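
To check which columns actually carry statistics, the `stats` string can be parsed one level deeper. The sketch below uses only the standard library and a hand-written sample in the same shape as an `add.stats` value; the sample values are invented for illustration and are not output from the demo table, and `columns_with_stats` is a hypothetical helper.

```python
import json

# Hand-written sample in the shape of an `add.stats` string. The values are
# invented for illustration and are not taken from the demo table.
sample_stats = (
    '{"numRecords": 250,'
    ' "minValues": {"id": 0, "r_0": 0.0},'
    ' "maxValues": {"id": 249, "r_0": 31320000.0},'
    ' "nullCount": {"id": 0, "r_0": 0}}'
)


def columns_with_stats(stats_json: str) -> list:
    """Return the sorted column names that have min or max statistics."""
    stats = json.loads(stats_json)
    return sorted(set(stats.get("minValues", {})) | set(stats.get("maxValues", {})))


parsed = json.loads(sample_stats)
print(parsed["numRecords"])
print(columns_with_stats(sample_stats))
```

Applied to a real stats string collected from the log, the same helper would list the columns that fall within the indexed range.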


## Change column 

Now, let's move column code5, which currently sits beyond the first 32 columns, to the position right after r_0.

In [None]:
%%sql
ALTER TABLE demo.data_skipping_demo CHANGE COLUMN code5 AFTER r_0

In [None]:
%%sql
DESCRIBE demo.data_skipping_demo

## Checking Delta log

In [None]:
import delta

delta_info = delta.DeltaTable.forName(spark, "demo.data_skipping_demo")

display(delta_info.history())

In [None]:
deltalog = spark.read.json("Tables/data_skipping_demo/_delta_log/00000000000000000001.json")
display(deltalog)

## Appending new data

In [None]:
from pyspark.sql.types import IntegerType, FloatType, StringType
column_count = 35
data_rows = 10
df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                                                  partitions=4)
           .withIdOutput()
           .withColumn("r", FloatType(), 
                            expr="floor(rand() * 350) * (86400 + 3600)",
                            numColumns=column_count)
           .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
           .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
           .withColumn("code3", StringType(), values=['a', 'b', 'c'])
           .withColumn("code4", StringType(), values=['a', 'b', 'c'], 
                          random=True)
           .withColumn("code5", StringType(), values=['a', 'b', 'c'], 
                          random=True, weights=[9, 1, 1])
 
           )
                            
df = df_spec.build()
df.write.format("delta").mode("append").saveAsTable("demo.data_skipping_demo")

In [None]:
display(delta_info.history())

New stats are collected only for new data.

Column code5 now has stats. However, because it moved into the first 32 columns, the last column with stats has changed.

In [None]:
deltalog = spark.read.json("Tables/data_skipping_demo/_delta_log/00000000000000000002.json")

schema = StructType([StructField("numRecords", IntegerType(), False),
                StructField("minValues", StringType(), False),
                StructField("maxValues", StringType(), False), 
                StructField("nullCount", StringType(), False)])

deltalog = deltalog.withColumn("parsed_stats", from_json(deltalog["add.stats"], schema))

display(deltalog.select("add.path", "parsed_stats.numRecords","parsed_stats.minValues","parsed_stats.maxValues","parsed_stats.nullCount").where("add is not null"))
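
The shift in the last indexed column can be reproduced with plain list operations. This is a sketch under the assumption that the table's columns follow the generator's order (id, then r_0 to r_34, then code1 to code5); `move_after` is a hypothetical helper that only mimics the effect of the column move on that ordering.

```python
def move_after(columns, name, anchor):
    """Return a new column list with `name` placed right after `anchor`."""
    remaining = [c for c in columns if c != name]
    position = remaining.index(anchor) + 1
    return remaining[:position] + [name] + remaining[position:]


# Assumed column order: id, r_0..r_34, code1..code5.
columns = ["id"] + [f"r_{i}" for i in range(35)] + [f"code{i}" for i in range(1, 6)]
limit = 32  # first 32 columns get statistics, as described earlier

indexed_before = columns[:limit]
indexed_after = move_after(columns, "code5", "r_0")[:limit]

print("last indexed column before the move:", indexed_before[-1])
print("last indexed column after the move: ", indexed_after[-1])
print("code5 indexed after the move:", "code5" in indexed_after)
```

Under that assumption, code5 enters the indexed range and one of the r columns at the end of the range drops out of it for newly written files.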

## Increasing / decreasing the number of columns with stats

In [None]:
%%sql
ALTER TABLE demo.data_skipping_demo SET TBLPROPERTIES ("delta.dataSkippingNumIndexedCols" = 5)

In [None]:
from pyspark.sql.types import IntegerType, FloatType, StringType
column_count = 35
data_rows = 10
df_spec = (dg.DataGenerator(spark, name="test_data_set1", rows=data_rows,
                                                  partitions=4)
           .withIdOutput()
           .withColumn("r", FloatType(), 
                            expr="floor(rand() * 350) * (86400 + 3600)",
                            numColumns=column_count)
           .withColumn("code1", IntegerType(), minValue=100, maxValue=200)
           .withColumn("code2", IntegerType(), minValue=0, maxValue=10)
           .withColumn("code3", StringType(), values=['a', 'b', 'c'])
           .withColumn("code4", StringType(), values=['a', 'b', 'c'], 
                          random=True)
           .withColumn("code5", StringType(), values=['a', 'b', 'c'], 
                          random=True, weights=[9, 1, 1])
 
           )
                            
df = df_spec.build()
df.write.format("delta").mode("append").saveAsTable("demo.data_skipping_demo")

In [None]:
display(delta_info.history())

In [None]:
deltalog = spark.read.json("Tables/data_skipping_demo/_delta_log/00000000000000000004.json")

schema = StructType([StructField("numRecords", IntegerType(), False),
                StructField("minValues", StringType(), False),
                StructField("maxValues", StringType(), False), 
                StructField("nullCount", StringType(), False)])

deltalog = deltalog.withColumn("parsed_stats", from_json(deltalog["add.stats"], schema))

display(deltalog.select("add.path", "parsed_stats.numRecords","parsed_stats.minValues","parsed_stats.maxValues","parsed_stats.nullCount").where("add is not null"))

## Clean up

In [None]:
spark.sql("DROP TABLE IF EXISTS demo.data_skipping_demo")  