# Data Skipping Sample in Scala


[Data skipping](https://cloud.ibm.com/docs/services/AnalyticsEngine?topic=AnalyticsEngine-data-skipping&locale=en) can significantly boost the performance of SQL queries by skipping over irrelevant data objects or files based on summary metadata associated with each object.

For every column in the object, the summary metadata might include minimum and maximum values, a list or bloom filter of the appearing values, or other metadata which succinctly represents the data in that column. This metadata is used during query evaluation to skip over objects which have no relevant data.
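
To make the idea concrete, here is a minimal, self-contained sketch of min/max skipping in plain Scala. It is not the library's implementation: the `MinMax` case class, the helper names, and the object names are all hypothetical and exist only for illustration.

```scala
object MinMaxSkippingSketch {
  // Hypothetical per-object summary: the min and max of one numeric column.
  final case class MinMax(objectName: String, min: Double, max: Double)

  // For a predicate `column < bound`, an object holds no matching rows
  // when even its smallest value is not below the bound.
  def canSkipLessThan(md: MinMax, bound: Double): Boolean = md.min >= bound

  // Keep only the objects that might contain matching rows.
  def objectsToRead(metadata: Seq[MinMax], bound: Double): Seq[String] =
    metadata.filterNot(canSkipLessThan(_, bound)).map(_.objectName)

  def main(args: Array[String]): Unit = {
    val metadata = Seq(
      MinMax("object-a", min = 20.0, max = 20.0),
      MinMax("object-b", min = 30.0, max = 30.0)
    )
    // For `column < 30`, object-b can be skipped without being read.
    println(objectsToRead(metadata, bound = 30.0))
  }
}
```

Only the metadata is consulted to decide which objects to read, which is why skipping never changes query results: an object is dropped only when its summary proves it cannot contain a match.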

All Spark native data formats are supported, including Parquet, ORC, CSV, JSON and Avro. Data skipping is a performance optimization feature, which means that it does not affect the content of the query results.

To use this feature, you need to create indexes on one or more columns of the data set. After this is done, Spark SQL queries can benefit from data skipping. In general, you should index the columns which are queried most often in the WHERE clause.

### Table of Contents:
* [1. Set up the environment](#cell0)
* [2. Creating a sample dataset](#cell1)
* [3. Set up the DataSkipping library](#cell2)
    * [3.1 Indexing a dataset](#cell2.1)
* [4. Using the data skipping indexes](#cell3)
    * [4.1 Running queries](#cell3.2)
* [Authors](#authors)

<a id="cell0"></a>
## Set up the environment

In [1]:
import com.ibm.metaindex.metadata.metadatastore.parquet.{Parquet, ParquetMetadataBackend}
import com.ibm.metaindex.{MetaIndexManager, Registration}
import org.apache.log4j.{Level, LogManager}

**(Optional)** Set the log level to DEBUG for the metaindex package to view the skipped objects.

In [2]:
LogManager.getLogger("com.ibm.metaindex").setLevel(Level.DEBUG)

### Configure Stocator
For more information on how to configure credentials, see the [Stocator documentation](https://github.com/CODAIT/stocator).

See the [list of endpoints](https://cloud.ibm.com/docs/services/cloud-object-storage?topic=cloud-object-storage-endpoints).
If you are running in IBM Cloud, make sure you choose the private endpoint of your bucket.

In [3]:
spark.sparkContext.hadoopConfiguration.set("fs.cos.service.endpoint", "https://s3.private.us-east.cloud-object-storage.appdomain.cloud")
spark.sparkContext.hadoopConfiguration.set("fs.cos.service.access.key", "")
spark.sparkContext.hadoopConfiguration.set("fs.cos.service.secret.key", "")

<a id="cell1"></a>
## Creating a sample dataset

Create a sample dataset consisting of two rows, each stored in a different object, to demonstrate data skipping.
Replace `dataset_location` with your own bucket location in which to save the sample dataset.

In [4]:
// trick to import spark implicits
import org.apache.spark.sql.SparkSession
val spark2: SparkSession = spark
import spark2.implicits._

val dataset_location = "cos://<mybucket>.service/tmp/sampleskipping" // e.g. cos://guytestssouth.service/tmp/sampleskipping
case class Record(dt: String, temp: Double, city: String, vid: String)

val ds = Seq(Record("2017-07-07", 20.0, "Tel-Aviv", "a"), Record("2017-07-08", 30.0, "Jerusalem", "b")).toDS()

// use partitionBy to make sure we have two objects
ds.write.partitionBy("dt").mode("overwrite").parquet(dataset_location)

spark2 = org.apache.spark.sql.SparkSession@1379d40
dataset_location = cos://cloud-object-storage-my-cos-standard-9f6.service/tmp/sampleskipping
defined class Record
ds = [dt: string, temp: double ... 2 more fields]


[dt: string, temp: double ... 2 more fields]

<a id="cell2"></a>
## Set up the DataSkipping library
In this example, we set a JVM-wide parameter that specifies a base path under which all of the indexes are stored.

Metadata can be stored on the same storage system as the data, but not under the same path. For more configuration options, see [Data skipping configuration options](https://cloud.ibm.com/docs/services/AnalyticsEngine?topic=AnalyticsEngine-data-skipping-config-options&locale=en).\
Replace `md_base_path` with a location in which to save the metadata.

In [5]:
val md_base_path = "cos://<mybucket>.service/tmp/sampleskippingmetadata" // e.g. cos://guytestssouth.service/tmp/sampleskippingmetadata
Registration.setDefaultMetaDataStore(Parquet)
val jvmParameters = new java.util.HashMap[String, String]()
jvmParameters.put("spark.ibm.metaindex.parquet.mdlocation", md_base_path)
jvmParameters.put("spark.ibm.metaindex.parquet.mdlocation.type", "EXPLICIT_BASE_PATH_LOCATION")
MetaIndexManager.setConf(jvmParameters)

md_base_path = cos://cloud-object-storage-my-cos-standard-9f6.service/tmp/sampleskippingmetadata
jvmParameters = {spark.ibm.metaindex.parquet.mdlocation.type=EXPLICIT_BASE_PATH_LOCATION, spark.ibm.metaindex.parquet.mdlocation=cos://cloud-object-storage-my-cos-standard-9f6.service/tmp/sampleskippingmetadata}


{spark.ibm.metaindex.parquet.mdlocation.type=EXPLICIT_BASE_PATH_LOCATION, spark.ibm.metaindex.parquet.mdlocation=cos://cloud-object-storage-my-cos-standard-9f6.service/tmp/sampleskippingmetadata}

<a id="cell2.1"></a>
### Indexing a dataset

Skip this step if the data set is already indexed.

When creating a data skipping index on a data set, first decide which columns to index, then choose an index type for each column.\
These choices are workload and data dependent. Typically, choose columns to which predicates are applied in many queries.\
Currently the following index types are supported:

1. Min/max – stores the minimum and maximum values for a column. Applies to all types except complex types.
2. Value list – stores the list of values appearing in a column. Applies to all types except complex types.
3. Bloom filter – stores a bloom filter with a false positive probability of 1%. Applies to ByteType, StringType, LongType, IntegerType, and ShortType.


- Choose a value list if the number of distinct values in an object is typically much smaller than the total number of values in that object.
- Bloom filters are recommended for columns with high cardinality (otherwise, with a value list, the index can get as big as that column in the data set).
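
The trade-off between a value list and a bloom filter can be sketched in plain Scala. This is a toy illustration under stated assumptions, not the library's implementation: `valueListCanSkip` and `TinyBloomFilter` are hypothetical names, and the bit count and number of hash functions are arbitrary choices rather than the settings behind the 1% false positive probability mentioned above.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

object IndexTypeSketch {
  // Value list: the exact set of distinct values seen in one object.
  // The object can be skipped when none of the wanted values appear in it.
  // Its size grows with the number of distinct values.
  def valueListCanSkip(valuesInObject: Set[String], wanted: Set[String]): Boolean =
    valuesInObject.intersect(wanted).isEmpty

  // Toy bloom filter: fixed size regardless of cardinality (illustrative only).
  final class TinyBloomFilter(numBits: Int, numHashes: Int) {
    private val bits = mutable.BitSet.empty

    // Map a value to numHashes bit positions, using the loop index as the hash seed.
    private def positions(value: String): Seq[Int] =
      (0 until numHashes).map { seed =>
        Math.floorMod(MurmurHash3.stringHash(value, seed), numBits)
      }

    def add(value: String): Unit = positions(value).foreach(p => bits += p)

    // false means "definitely absent"; true means "possibly present".
    def mightContain(value: String): Boolean = positions(value).forall(bits.contains)
  }

  def main(args: Array[String]): Unit = {
    // No wanted city appears in this object, so it can be skipped.
    println(valueListCanSkip(Set("Tel-Aviv"), Set("Jerusalem", "Ramat-Gan")))

    val filter = new TinyBloomFilter(numBits = 1024, numHashes = 3)
    filter.add("a")
    println(filter.mightContain("a"))   // always true: bloom filters have no false negatives
    println(filter.mightContain("abc")) // usually false; true would be a false positive
  }
}
```

A false positive only causes an object to be read unnecessarily, so it costs performance but never correctness.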

In [6]:
val reader = spark.read.format("parquet")
val im = new MetaIndexManager(spark, dataset_location, ParquetMetadataBackend)

// remove existing index first
if (im.isIndexed()) {
  im.removeIndex()
}

// indexing
im.indexBuilder()
  .addMinMaxIndex("temp")
  .addValueListIndex("city")
  .addBloomFilterIndex("vid")
  .build(reader)
  .show(false)

+-------+-----------------+-------------------+
|status |new_entries_added|old_entries_removed|
+-------+-----------------+-------------------+
|SUCCESS|2                |0                  |
+-------+-----------------+-------------------+



reader = org.apache.spark.sql.DataFrameReader@215de63b
im = com.ibm.metaindex.MetaIndexManager@2e2147d5


com.ibm.metaindex.MetaIndexManager@2e2147d5

Note that each of the index types has a corresponding method in the indexBuilder class of the form:

`add[IndexType]Index(<index_params>)`

For example:

`addMinMaxIndex(col: String)`

`addValueListIndex(col: String)`

`addBloomFilterIndex(col: String)`

**(Optional)** To refresh an indexed dataset, use:

In [7]:
im.refreshIndex(reader).show(false)

+-------+-----------------+-------------------+
|status |new_entries_added|old_entries_removed|
+-------+-----------------+-------------------+
|SUCCESS|0                |0                  |
+-------+-----------------+-------------------+



#### View index status

In [8]:
im.getIndexStats(reader).show(false)

+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
|Data Skipping Index Stats|cloud-object-storage-my-cos-standard-9f6/tmp/sampleskipping                                                                                       |
+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------+
|Status                   |Up to date                                                                                                                                        |
|Total objects indexed    |2                                                                                                                                                 |
|# Metadata properties    |                                                                                                  

<a id="cell3"></a>
## Using the data skipping indexes 

### Injecting the data skipping rule and enabling data skipping
The rule injection should be done only once per Spark session.

In [9]:
// inject the data skipping rule
MetaIndexManager.injectDataSkippingRule(spark)

// enable data skipping
MetaIndexManager.enableFiltering(spark)

// you can disable the data skipping any time by running:
// MetaIndexManager.disableFiltering(spark)

<a id="cell3.2"></a>
### Running queries

#### Load the dataframe

In [10]:
val df = reader.load(dataset_location)
df.createOrReplaceTempView("metergen")

df = [temp: double, city: string ... 2 more fields]


[temp: double, city: string ... 2 more fields]

Example query which uses the min/max index:

In [11]:
spark.sql("select count(*) from metergen where temp < 30").show()

+--------+
|count(1)|
+--------+
|       1|
+--------+



View the data skipping statistics:

In [12]:
MetaIndexManager.getLatestQueryAggregatedStats(spark).show(false)

+-------+-----------+-------------+------------+-----------+----------+
|status |isSkippable|skipped_Bytes|skipped_Objs|total_Bytes|total_Objs|
+-------+-----------+-------------+------------+-----------+----------+
|SUCCESS|true       |863          |1           |1717       |2         |
+-------+-----------+-------------+------------+-----------+----------+
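
As a quick worked example, the fraction of data skipped can be derived from the numbers in the table above. The `AggregatedStats` case class and helper names below are hypothetical and simply mirror the column names; they are not part of the library.

```scala
object SkipRatioSketch {
  // Hypothetical holder mirroring the columns of the stats table.
  final case class AggregatedStats(skippedBytes: Long, skippedObjs: Long,
                                   totalBytes: Long, totalObjs: Long)

  // Fraction of bytes that were never read; 0.0 when there is nothing to read.
  def skippedByteFraction(s: AggregatedStats): Double =
    if (s.totalBytes == 0) 0.0 else s.skippedBytes.toDouble / s.totalBytes

  // Fraction of objects that were skipped; 0.0 when there are no objects.
  def skippedObjectFraction(s: AggregatedStats): Double =
    if (s.totalObjs == 0) 0.0 else s.skippedObjs.toDouble / s.totalObjs

  def main(args: Array[String]): Unit = {
    // Values copied from the table above.
    val stats = AggregatedStats(skippedBytes = 863, skippedObjs = 1,
                                totalBytes = 1717, totalObjs = 2)
    println(skippedByteFraction(stats))   // 863 / 1717, roughly half of the bytes
    println(skippedObjectFraction(stats)) // 1 / 2
  }
}
```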



**Optional:** clear the stats for the next query (otherwise, stats will accumulate)


In [13]:
MetaIndexManager.clearStats()

Example query which uses value list index:

In [14]:
spark.sql("select count(*) from metergen where city IN ('Jerusalem', 'Ramat-Gan')").show()

+--------+
|count(1)|
+--------+
|       1|
+--------+



View the data skipping statistics as follows:

In [15]:
MetaIndexManager.getLatestQueryAggregatedStats(spark).show(false)

+-------+-----------+-------------+------------+-----------+----------+
|status |isSkippable|skipped_Bytes|skipped_Objs|total_Bytes|total_Objs|
+-------+-----------+-------------+------------+-----------+----------+
|SUCCESS|true       |854          |1           |1717       |2         |
+-------+-----------+-------------+------------+-----------+----------+



**Optional:** clear the stats for the next query (otherwise, stats will accumulate)

In [16]:
MetaIndexManager.clearStats()

Example query which uses bloom filter index:

In [17]:
spark.sql("select count(*) from metergen where vid = 'abc'").show()

+--------+
|count(1)|
+--------+
|       0|
+--------+



View the data skipping statistics as follows:

In [18]:
MetaIndexManager.getLatestQueryAggregatedStats(spark).show(false)

+-------+-----------+-------------+------------+-----------+----------+
|status |isSkippable|skipped_Bytes|skipped_Objs|total_Bytes|total_Objs|
+-------+-----------+-------------+------------+-----------+----------+
|SUCCESS|true       |1717         |2           |1717       |2         |
+-------+-----------+-------------+------------+-----------+----------+



**Optional:** clear the stats for the next query (otherwise, stats will accumulate)

In [19]:
MetaIndexManager.clearStats()

<a id="authors"></a> 
### Authors

**Guy Khazma**, Cloud Data Researcher of IBM.

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.
