# Benchmarking IBM Db2 Event Store's Time Series Performance in Consecutive Duplication Removal

## Abstract
This notebook presents a simple benchmark that compares the performance of consecutive duplication removal using IBM Db2 Event Store's time series functions against a naive approach.

## Procedure
- A simple table is created whose records contain a monotonically increasing timestamp.
- A dataframe is created, repartitioned, and cached. The same cached dataframe is used for both the Event Store time series approach and the naive approach, so that data fetching time does not affect the query performance measurements.
- Consecutive duplication removal performance is measured with IBM Db2 Event Store's time series functions.
    - Performance is first measured with one time series per year.
    - Performance is then measured with one time series per date.
- Consecutive duplication removal performance is measured with the naive approach.
- The results are compared and interpreted.
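
To make the term concrete, here is a minimal pure-Python sketch of what "consecutive duplication removal" means. It is only an illustration on a hypothetical in-memory list of `(timestamp, value)` pairs, not the Event Store implementation; the function name `remove_consecutive_duplicates` and the sample readings are assumptions made for this example.

```python
from itertools import groupby


def remove_consecutive_duplicates(records):
    """Keep the earliest record of every run of identical consecutive values.

    `records` is an iterable of (timestamp, value) pairs sorted by timestamp.
    """
    # groupby() starts a new group whenever the value changes, so each group
    # is one run of consecutive duplicates; next(run) is its earliest record.
    return [next(run) for _, run in groupby(records, key=lambda r: r[1])]


# Hypothetical readings: 20.0 repeats at t=2, and again at t=5 and t=6.
readings = [(1, 20.0), (2, 20.0), (3, 21.5), (4, 20.0), (5, 20.0), (6, 20.0)]
deduplicated = remove_consecutive_duplicates(readings)
print(deduplicated)  # [(1, 20.0), (3, 21.5), (4, 20.0)]
```

Note that the value 20.0 at `t=4` is kept even though 20.0 appeared earlier: only repeats that are adjacent in time count as duplicates.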

In [None]:
# show current Spark application ID
sc.applicationId

In [None]:
from eventstore.oltp import EventContext
from eventstore.sql import EventSession
from eventstore.common import ConfigurationReader
from pyspark.sql import SparkSession

ConfigurationReader.setEventUser("admin")
ConfigurationReader.setEventPassword("password")

In [None]:
sparkSession = SparkSession.builder.appName("EventStore SQL in Python").getOrCreate()
eventSession = EventSession(sparkSession.sparkContext, "EVENTDB")
eventSession.set_query_read_option("SnapshotNow")
eventSession._jvm.org.apache.spark.sql.types.SqlTimeSeries.register(eventSession._jsparkSession)
eventSession.open_database()
ctx = EventContext.get_event_context("EVENTDB")

In [None]:
from eventstore.catalog import TableSchema
from pyspark.sql.types import *

### Part 1: Function preparation

In [None]:
def duplication_stats(L0_sdf, es_session):
    """
    Adapted from naive_on_change. Calculates and prints the following:
    - total number of rows
    - total number of duplications (records whose reading repeats the reading at the previous
      timestamp, i.e. every record in a run of consecutive duplicates except the earliest)
    - percentage of duplicated records to be removed
    - expected number of rows/records after duplication removal
    ---
    @param: L0_sdf: Spark Dataframe : Dataframe whose duplication statistics will be calculated
    @param: es_session: EventSession to be used
    @return: int : expected row count after duplication removal
    ---
    Example:
    # duplication stats should be same for granularity of years/dates
    duplication_stats(raw_table_with_dates, eventSession)
    Return:
            Total number of rows: 9999990
            Total number of duplications: 999502
            Duplication percentage: 0.09995029995029996
            Expected row number after processing: 9000488
    """
    from pyspark.sql import Window
    from pyspark.sql.functions import lag
    
    # We will use current time to build the temp views names
    def tabletag():
        from time import time
        return 'TABLE'+str(int(time()*1000000))
    
    # Time sort
    L1_sdf = L0_sdf.orderBy('timestamp')
    L0_sortedTable = tabletag()
    L1_sdf.createOrReplaceTempView(L0_sortedTable)
    
    # Keep first record
    L1_first_sdf = es_session.sql('SELECT timestamp, value FROM ' + L0_sortedTable + ' LIMIT 1')
    
    # Prepare lag program
    eng_col = L1_sdf['value']
    lag_eng = lag(eng_col).over(Window.orderBy('timestamp'))
    L1_sdf = L1_sdf.withColumn('prev_value', lag_eng)
    
    # Prepare diff program
    prev_eng_col = L1_sdf['prev_value']
    L1_sdf = L1_sdf.withColumn('diff_value', eng_col - prev_eng_col)
    
    # Prepare on-change filter program
    diff_eng_col = L1_sdf['diff_value']    
    L1_count = L1_sdf.count()
    L1_dup_count = L1_sdf.filter(diff_eng_col == 0).count()
    L1_dup_percent = L1_dup_count / L1_count
    L1_unique_count= L1_count - L1_dup_count
    print("Total number of rows: {}".format(L1_count))
    print("Total number of duplications: {}".format(L1_dup_count))
    print("Duplication percentage: {}".format(L1_dup_percent))
    print("Expected row number after processing: {}".format(L1_unique_count))
    return L1_unique_count

### Part 2: Data preparation
As a proof of concept, a simple table is created with 10 million randomly generated records.  
```
root
 |-- KEY: integer (nullable = false)
 |-- TIMESTAMP: long (nullable = false)
 |-- VALUE: float (nullable = false)
```

In [None]:
num_records = 10000000
table_name = "t" + str(num_records)

In [None]:
"""
Table creation/ data loading are commented out after the first run.
"""
# # Define table schema to be created
# with EventContext.get_event_context("EVENTDB") as ctx:
#     schema = StructType([
#         StructField("key", IntegerType(), nullable = False),
#         StructField("timestamp", LongType(), nullable = False),
#         StructField("value", FloatType(), nullable = False)
#     ])  
#     table_schema = TableSchema(table_name, schema,
#                                 sharding_columns=["key"],
#                                 pk_columns=["key","timestamp"])

In [None]:
# try:
#     ctx.create_table(table_schema)
# except Exception as error:
#     print(error)
#     pass
    
# table_names = ctx.get_names_of_tables()
# for idx, name in enumerate(table_names):
#     print(name)

In [None]:
# table = eventSession.load_event_table(table_name)

In [None]:
# ingest data into table
# import os
# resolved_table_schema = ctx.get_table(table_name)
# print(resolved_table_schema)
# for letter in list(["a","b","c","d","e","f","g","h","i","j"]):
#     with open(os.environ['DSX_PROJECT_DIR']+'/datasets/csv_10000000_realtime_xa'+letter+'.csv') as f:
#         f.readline()
#         content = f.readlines()
#         content = [l.split(",") for l in content]
#         batch = [dict(key=int(c[0]), timestamp=int(c[1]), value=float(c[2])) for c in content]
#         ctx.batch_insert(resolved_table_schema, batch)

### 2.1 Optimize parallelism by repartitioning
Note that when a query is pushed down to Db2 Event Store and the data is retrieved, Spark receives the result as a dataframe with a single partition. The user must therefore repartition the dataframe explicitly.  
A good starting point is one partition per CPU core in the Spark cluster.  
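
As a rough worked example of the one-partition-per-core rule, the arithmetic below assumes the 3-node, 16-cores-per-node cluster used in this notebook and the row count shown in the `duplication_stats` example output. The variable names are chosen for illustration only.

```python
# Assumed cluster shape: 3 nodes with 16 cores each.
cores_per_node = 16
num_nodes = 3
num_partitions = cores_per_node * num_nodes

# Row count taken from the duplication_stats example output.
total_rows = 9_999_990
rows_per_partition = total_rows / num_partitions

print(num_partitions)             # 48
print(round(rows_per_partition))  # 208333
```

With evenly sized partitions, each core would process roughly 208 thousand rows.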

In [None]:
# verify ingested result
raw_table = eventSession.load_event_table(table_name)

print("number of partitions prior to time series (after loading table): ",raw_table.rdd.getNumPartitions())
print("partition sizes prior to time series (after loading table): ", raw_table.rdd.mapPartitions(lambda s: iter([sum(1 for _ in s)])).collect())

In [None]:
"""
repartition the dataframe into 48 partitions (16 cores/node * 3 nodes = 48 partitions)
"""
raw_table_after_partition = raw_table.repartition(48)

print("number of partitions after repartitioning: ",raw_table_after_partition.rdd.getNumPartitions())
print("partition sizes after repartitioning: ", raw_table_after_partition.rdd.mapPartitions(lambda s: iter([sum(1 for _ in s)])).collect())

In [None]:
raw_table_after_partition.createOrReplaceTempView("raw_table_partitioned")

### 2.2 Generating new clustering key for time series creation

Records are clustered into ranges, such as years or dates, and one time series is created for each range.  
Consecutive duplication removal then runs on each time series independently, which greatly increases parallelism and reduces computation time.  

Two subtle effects of the clustering granularity are worth noting:

1/ **Performance**

In general, performance improves as the clustering granularity gets finer.

When records are clustered at a finer granularity, e.g. dates vs. years, more time series are created.  
Duplication removal runs concurrently on all time series, so the performance is better.  
  
2/ **Number of remaining duplications**

With finer clustering granularity, more time series are created, and the number of duplications left over increases.

For example, suppose the records are grouped by day: [day 1] [day 2] … [day n]. Duplicates are removed within each time series separately, so if the last value of day 1 duplicates the first value of day 2, that duplicate is not caught. That said, this effect is hard to avoid at scale, because it depends on how much data is queried at a time: a user who queries one day at a time will run into the same issue unless they keep track of the last value of each day and carry it over to the next query.
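
The boundary effect can be reproduced with a small pure-Python sketch. The two "days" of readings below are hypothetical, and `dedup` is a stand-in for per-series duplication removal, not the Event Store function.

```python
from itertools import groupby


def dedup(values):
    # Keep the first value of each run of identical consecutive values.
    return [value for value, _ in groupby(values)]


# Hypothetical readings for two consecutive days; the last value of day 1
# equals the first value of day 2.
day1 = [20.0, 20.0, 21.0]
day2 = [21.0, 22.0, 22.0]

global_result = dedup(day1 + day2)          # one time series over both days
per_day_result = dedup(day1) + dedup(day2)  # one time series per day

print(global_result)   # [20.0, 21.0, 22.0]
print(per_day_result)  # [20.0, 21.0, 21.0, 22.0]
```

The per-day result keeps one extra row: the 21.0 at the start of day 2 is not recognised as a duplicate because its predecessor lives in a different time series.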

---
Two dataframes are created with different clustering granularities for performance comparison:

- raw_table_with_years: Clustering key in years

- raw_table_with_dates: Clustering key in dates
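
The sketch below shows, in plain Python, how an epoch timestamp in milliseconds maps to the two clustering keys. It assumes UTC for simplicity; Spark formats timestamps in its own configured time zone, so keys near midnight may differ. The helper name `clustering_keys` is hypothetical.

```python
from datetime import datetime, timezone


def clustering_keys(timestamp_ms):
    """Return (year_key, date_key) for an epoch timestamp in milliseconds."""
    # Timestamps are stored in milliseconds, so divide by 1000 to get seconds.
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y"), moment.strftime("%Y-%m-%d")


# 1546300800000 ms is 2019-01-01T00:00:00Z.
year_key, date_key = clustering_keys(1_546_300_800_000)
print(year_key, date_key)  # 2019 2019-01-01
```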

In [None]:
# ts granularity in years
# ('yyyy' is the calendar year; 'YYYY' is the week-based year, which differs around New Year)
raw_table_with_years = eventSession.sql("select key, from_unixtime(TIMESTAMP/1000,'yyyy') as key2, TIMESTAMP, value from raw_table_partitioned").cache()
# ts granularity in dates
raw_table_with_dates = eventSession.sql("select key, from_unixtime(TIMESTAMP/1000,'yyyy-MM-dd') as key2, TIMESTAMP, value from raw_table_partitioned").cache()

In [None]:
raw_table_with_dates.show(5)

In [None]:
raw_table_with_years.show(5)

**Show the range of the timestamp**

- Notice that the timestamps span 19 years, or 7045 days.
- Notice that there are 6929 distinct dates, which means records exist for almost every date.
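
A quick check of the figures quoted above, using only the two numbers from the text (the variable names are illustrative):

```python
span_days = 7045       # days between the earliest and the latest record
distinct_dates = 6929  # distinct values of key2 at date granularity

span_years = span_days / 365.25
coverage = distinct_dates / span_days

print(round(span_years, 1))  # 19.3
print(round(coverage, 3))    # 0.984
```

About 98% of the days in the range have at least one record, so the date-granularity run creates close to one time series per calendar day.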

In [None]:
raw_table_with_dates.agg({"key2":"max"}).collect()[0]

In [None]:
raw_table_with_dates.agg({"key2":"min"}).collect()[0]

In [None]:
raw_table_with_dates.select("key2").distinct().count()

In [None]:
# duplication stats should be the same for year and date granularity
"""
Note that ~10% of the records are duplicates that need to be removed.
Expected total number of rows after processing: 9000488
"""
expected_row_num = duplication_stats(raw_table_with_dates, eventSession)

## Part 3. Performance Analysis
### 3.1 IBM Event Store Time Series Performance

In [None]:
def create(df, key_col, ts_col, val_col, new_key_name="joined_primary_keys", time_series_name=None):
    """
    Highly efficient algorithm that creates a time series from the Spark Dataframe.
    ---
    @param df: Spark Dataframe : Containing input columns for time series creation
    @param key_col: List[String] : List of column names that form the primary key for the time series creation
    @param ts_col: String : Column name of timestamp
    @param val_col: String : Column name of value
    @param new_key_name: String : Column name of the joined primary key column to be created.
    @param time_series_name: String : [Default: <val_col>_time_series] Column name of the time series column to be created.
    @return: Spark Dataframe : Contains 2 columns: the key column and the time series column
    ---
    Example:
    ts_df = create(raw_table_with_dates, ["SATID","PKID","DATE"], "TIMESTAMP", "READING")
    """
    from pyspark.sql import DataFrame
    from pyspark.sql.functions import concat, col, lit
    ts_column_name = val_col + "_time_series"
    df = df.withColumn(new_key_name, concat(*key_col))
    ts_df = DataFrame(
        df.sql_ctx._jvm.com.ibm.research.time_series.spark_timeseries_sql.utils.api.java.TimeSeriesDataFrame.create(
            df._jdf,
            new_key_name,
            ts_col,
            val_col
        ),
        df.sql_ctx
    )
    if time_series_name:
        ts_df = ts_df.withColumnRenamed(ts_column_name, time_series_name)
    return ts_df

### 3.1.1 Performance Comparison between SQL UDAF and create function

1/ **Spark User Defined Aggregate Function (UDAF) : `TIME_SERIES`**

A Spark UDAF goes through the dataframe row by row and builds a new time series by combining the series aggregated so far with the current row.
Because Spark RDDs are immutable, multiple intermediate RDDs are created.

For instance, with 3 rows [1] | [2] | [3], the aggregation proceeds as [1] … [1] + [2] = [1,2] … [1,2] + [3] = [1,2,3]. Thus 5 intermediate RDDs are created: [1] [2] [3] [1,2] [1,2,3].

Example: 
```sql
stmt = "SELECT location, TIME_SERIES(timestamp, humidity) AS ts FROM dht_raw_table where humidity < 70 GROUP BY location"
```
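
The count of five intermediate objects for three rows can be reproduced with a toy model. This is only a sketch that mimics the row-by-row merging with immutable Python tuples; it makes no claim about how Spark executes the UDAF internally, and `incremental_aggregate` is a hypothetical name.

```python
def incremental_aggregate(rows):
    """Fold rows one at a time, building a new immutable series at each step."""
    created = []  # every immutable object built along the way
    series = None
    for row in rows:
        single = (row,)  # each row is first wrapped as a one-element series
        created.append(single)
        if series is None:
            series = single
        else:
            # Tuples are immutable, so merging allocates a brand-new tuple.
            series = series + single
            created.append(series)
    return series, created


series, created = incremental_aggregate([1, 2, 3])
print(series)        # (1, 2, 3)
print(len(created))  # 5
```

In this model, `n` rows produce `2n - 1` objects, and the merged series grow with every step, which is why the cost becomes more noticeable for long time series.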

2/ **create function:**

The create function simply groups the given dataframe by key and creates one time series for each key at once. Its performance advantage becomes increasingly clear as the dataframe and the time series grow.

Example: 
```python
create(dht_table, ["location"], "timestamp", "HUMIDITY", "LOCATION")
```
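
By contrast, the group-then-build idea can be sketched in plain Python as follows. The rows and the helper `build_series_per_key` are hypothetical and only illustrate the principle; this is not the code behind `TimeSeriesDataFrame.create`.

```python
from collections import defaultdict


def build_series_per_key(rows):
    """Group (key, timestamp, value) rows by key and build each series once."""
    grouped = defaultdict(list)
    for key, timestamp, value in rows:
        grouped[key].append((timestamp, value))
    # Each group is sorted by timestamp a single time, so no intermediate
    # partial series are created.
    return {key: sorted(observations) for key, observations in grouped.items()}


rows = [("kitchen", 2, 41.0), ("attic", 1, 55.0), ("kitchen", 1, 40.0)]
series_by_key = build_series_per_key(rows)
print(series_by_key["kitchen"])  # [(1, 40.0), (2, 41.0)]
```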
 



In [None]:
%%time
'''
creating ts using create function
'''
create(raw_table_with_dates, ["key2"], "timestamp", "value", new_key_name="key2").show()

In [None]:
raw_table_with_dates.createOrReplaceTempView("raw_table_partitioned")

In [None]:
%%time
'''
creating ts using Spark UDAF
'''

stmt = """SELECT key2, TIME_SERIES(timestamp, value) AS ts FROM raw_table_partitioned GROUP BY key2"""

eventSession.sql(stmt).show()

### 3.1.2 Processing Performance

#### Case 1: TS granularity in years

In [None]:
%%time
import time
start = time.time()

ts_df = create(raw_table_with_years, ["key2"], "timestamp", "value", new_key_name="key2")
ts_df.createOrReplaceTempView("ts_table")
ts_df_unique = eventSession.sql("select key2, ts_explode(ts_remove_consecutive_duplicates(value_time_series)) as (time_tick, value) from ts_table")
# force execution
ts_df_unique.show()
end = time.time()
print("Total processing time is ",end - start, "seconds")

In [None]:
row_count_after_process = ts_df_unique.count()
print("Row count after process: ", row_count_after_process)

In [None]:
print("Number of duplicated records remaining after processing: {}".format(row_count_after_process - expected_row_num))

In [None]:
temp_ts_df = eventSession.sql("select key2, ts_count(value_time_series) as c from ts_table")
print("number of partitions after time series creation: ",temp_ts_df.rdd.getNumPartitions())
print("partition sizes after time series creation: ", temp_ts_df.rdd.mapPartitions(lambda s: iter([sum(1 for _ in s)])).collect())

#### Case 2: TS granularity in dates

When compared with the granularity in years:

**Pros:** Reducing the granularity to dates increases the number of time series created, leading to better parallelism.

**Cons:** Increasing the number of time series also increases the number of duplications that remain after duplication removal. When grouping by key [day 1] [day 2] … [day n], duplicates are removed within each time series separately, so if the last value of day 1 duplicates the first value of day 2, that duplicate is not caught.

In [None]:
%%time
import time
start = time.time()

ts_df = create(raw_table_with_dates, ["key2"], "timestamp", "value","key2")
ts_df.createOrReplaceTempView("ts_table")
ts_df_unique = eventSession.sql("select key2, ts_explode(ts_remove_consecutive_duplicates(value_time_series)) as (time_tick, value) from ts_table")
ts_df_unique.show()
end = time.time()
print("Total processing time: ", end - start, "seconds")

In [None]:
ts_df.show(5)

In [None]:
row_count_after_process = ts_df_unique.count()
print("Row count after process: ", row_count_after_process)

In [None]:
print("Number of duplicated records remaining after processing: {}".format(row_count_after_process - expected_row_num))

In [None]:
"""
Partition distribution of time series created.
"""
temp_ts_df = eventSession.sql("select key2, ts_count(value_time_series) as c from ts_table")
print("number of partitions after time series creation: ",temp_ts_df.rdd.getNumPartitions())
print("partition sizes after time series creation: ", temp_ts_df.rdd.mapPartitions(lambda s: iter([sum(1 for _ in s)])).collect())

### 3.2 Naive approach performance

In the naive approach, we compare the reading at the current timestamp versus the reading at the previous timestamp.
If there are consecutive duplications, only the earliest reading is kept.
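
The comparison against the previous reading can be sketched without Spark as follows. This is a simplified in-memory illustration of the lag-based idea, assuming a list of `(timestamp, value)` pairs with distinct timestamps; `naive_on_change_rows` is a hypothetical name, not the function used in the benchmark.

```python
def naive_on_change_rows(records):
    """Drop every record whose value equals the value at the previous timestamp."""
    ordered = sorted(records)  # time sort
    # Equivalent of lag(value): the previous value, or None for the first row.
    previous_values = [None] + [value for _, value in ordered[:-1]]
    kept = []
    for (timestamp, value), previous in zip(ordered, previous_values):
        # The first row has no previous value and is always kept.
        if previous is None or value != previous:
            kept.append((timestamp, value))
    return kept


sample = [(3, 7.0), (1, 5.0), (2, 5.0), (4, 7.0), (5, 5.0)]
print(naive_on_change_rows(sample))  # [(1, 5.0), (3, 7.0), (5, 5.0)]
```

In the Spark version, the first row's difference is null, so the `!= 0` filter drops it. That is why that function fetches the first record separately and unions it back in.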

In [None]:
def naive_on_change(L0_sdf, es_session):
    from pyspark.sql import Window
    from pyspark.sql.functions import lag
    
    # We will use current time to build the temp views names
    def tabletag():
        from time import time
        return 'TABLE'+str(int(time()*1000000))
    
    # Time sort
    L1_sdf = L0_sdf.orderBy('TIMESTAMP')
    L0_sortedTable = tabletag()
    L1_sdf.createOrReplaceTempView(L0_sortedTable)
    
    # Keep first record
    L1_first_sdf = es_session.sql('SELECT TIMESTAMP, value FROM ' + L0_sortedTable + ' LIMIT 1')
    
    # Prepare lag program, for record at each timestamp, add a column of readings of the previous timestamp
    eng_col = L1_sdf['value']
    lag_eng = lag(eng_col).over(Window.orderBy('TIMESTAMP'))
    L1_sdf = L1_sdf.withColumn('prev_value', lag_eng)
    
    # Prepare diff program, compare reading(now_timestamp) v.s reading(previous_timestamp)
    prev_eng_col = L1_sdf['prev_value']
    L1_sdf = L1_sdf.withColumn('diff_value', eng_col - prev_eng_col)
    
    # Prepare on-change filter program, drop record row if reading(now_timestamp) is identical with the reading(previous_timestamp)
    diff_eng_col = L1_sdf['diff_value']
    L1_sdf = L1_sdf.filter(diff_eng_col != 0)
    
    # Remove intermediate computing columns; for each run of consecutive duplicates,
    # only the earliest record (the one with the smallest timestamp) remains.
    L1_sdf = L1_sdf.select('TIMESTAMP', 'value')
    
    # Append first record
    L1_sdf = L1_first_sdf.union(L1_sdf).distinct().orderBy('TIMESTAMP')
    
    # Return result
    return L1_sdf

In [None]:
start = time.time()
df = naive_on_change(raw_table_with_dates,eventSession)
df.show()
end = time.time()
print(end - start, "seconds")


In [None]:
df.explain()

In [None]:
df.count()

## Conclusion:

Total processing times are:
- Time series approach
    - granularity in years: 103.34106540679932 seconds
    - granularity in dates: 78.09961080551147 seconds
- Naive approach: 126.62504267692566 seconds
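
To put the timings in relative terms, the short calculation below derives speedup factors from the three measurements listed above. The dictionary keys are labels chosen for this example.

```python
timings_seconds = {
    "time series, yearly key": 103.34106540679932,
    "time series, daily key": 78.09961080551147,
    "naive": 126.62504267692566,
}

naive = timings_seconds["naive"]
# Speedup = naive time divided by the time of each time series run.
speedups = {
    name: naive / seconds
    for name, seconds in timings_seconds.items()
    if name != "naive"
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x the speed of the naive approach")
```

On this single run, the yearly key is about 1.23 times as fast as the naive approach and the daily key about 1.62 times as fast.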


In general, the performance of consecutive duplication removal using IBM Db2 Event Store's time series approach is of the same order of magnitude as that of the naive approach. There are, however, cases in which IBM Db2 Event Store's time series functions perform significantly better than the naive approach. The time series approach also has the advantages of reusability and a flexible clustering key.

**1/ Performance:**

**1.1/ Time series creation using Spark UDAF SQL versus Create function**



**1.2/ Consecutive duplication removal using Time Series approach versus Naive approach**

The time series approach generally gets faster as the clustering granularity used to create the time series gets finer. For example, duplication removal on time series created per year is generally slower than on time series created per day. The reason is that finer granularity produces more time series, which allows the duplication removal to run concurrently on multiple time series.

There is a small caveat, however: the number of remaining duplications increases with finer clustering granularity. For example, suppose the records are grouped by day: [day 1] [day 2] … [day n]. Duplicates are removed within each time series separately, so if the last value of day 1 duplicates the first value of day 2, that duplicate is not caught. That said, this effect is hard to avoid at scale, because it depends on how much data is queried at a time: a user who queries one day at a time will run into the same issue unless they keep track of the last value of each day and carry it over to the next query.

**2/ Reusability:**

An intermediate dataframe containing the time series is created. These time series are compatible with other Event Store time series functions and can easily be cached and reused in later operations, whereas the naive approach requires manual manipulation.

**3/ Flexible clustering key:**

The create function provides a highly efficient way of creating time series from the provided keys. It accepts multiple key columns.
Suppose a user has a table:
```
root
 |-- DEVICEID: integer (nullable = false)
 |-- SENSORID: integer (nullable = false)
 |-- TIMESTAMP: long (nullable = false)
 |-- READING: double (nullable = false)
 ```
The user can choose to create time series on `key = ["DEVICEID"]` to eliminate consecutive duplications per device, or on `key = ["DEVICEID", "SENSORID"]` to eliminate consecutive duplications per sensor on each device.
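
One point to watch with multiple key columns: the `create` helper in this notebook joins them with `concat(*key_col)`, i.e. without a separator. The pure-Python sketch below assumes the integer keys are rendered as plain strings before being joined, and shows how two different (DEVICEID, SENSORID) pairs could then end up with the same joined key. The helper `joined_key` and the sample IDs are hypothetical.

```python
def joined_key(parts, separator=""):
    """Join key parts into one string, optionally with a separator."""
    return separator.join(str(part) for part in parts)


device_a = (1, 23)  # DEVICEID=1,  SENSORID=23
device_b = (12, 3)  # DEVICEID=12, SENSORID=3

# Without a separator both pairs become "123".
collides_without_separator = joined_key(device_a) == joined_key(device_b)
# With a separator they stay distinct: "1|23" vs "12|3".
collides_with_separator = joined_key(device_a, "|") == joined_key(device_b, "|")

print(collides_without_separator)  # True
print(collides_with_separator)     # False
```

If such collisions are possible in the data, one option might be to join the key columns with a separator, for example with `pyspark.sql.functions.concat_ws`, before creating the time series.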