# Querying IBM Db2 Event Store
IBM Db2 Event Store is a hybrid transactional/analytical processing (HTAP) system. This notebook demonstrates best practices for querying a table stored in IBM Db2 Event Store.

***Pre-Req: Event_Store_Table_Creation***

## Connect to IBM Db2 Event Store

### Determine the IP address of your host

Obtain the IP address of the host that you want to connect to by running the appropriate command for your operating system:

* On Mac, run: `ifconfig`
* On Windows, run: `ipconfig`
* On Linux, run: `hostname -i`
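
If Python is available on the host, the address can also be looked up programmatically. The following is a minimal sketch using only the standard library; the helper name `local_ip_address` is hypothetical, and the probe address is an arbitrary documentation-range address that is never actually contacted. It reports the address of the machine it runs on, so it only helps when it is run on the Event Store host itself.

```python
import socket


def local_ip_address() -> str:
    """Return this machine's primary IPv4 address, or 127.0.0.1 as a fallback."""
    try:
        # Connecting a UDP socket sends no packets; it only asks the OS which
        # local interface it would use to reach the given address.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
            probe.connect(("192.0.2.1", 80))
            return probe.getsockname()[0]
    except OSError:
        # No usable network route; fall back to the loopback address.
        return "127.0.0.1"


print(local_ip_address())
```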

Edit the `HOST = "XXX.XXX.XXX.XXX"` value in the next cell to provide the IP address.

In [2]:
# Set your host IP address
HOST = "XXX.XXX.XXX.XXX"

# Port will be 1100 for version 1.1.2 or later (5555 for version 1.1.1)
PORT = "1100"

# Database name
DB_NAME = "TESTDB"

# Table name
TABLE_NAME = "IOT_TEMPERATURE"

## Import Python modules

In [3]:
from eventstore.common import ConfigurationReader
from eventstore.oltp import EventContext
from eventstore.sql import EventSession
from pyspark.sql import SparkSession

## Connect to Event Store

In [4]:
endpoint = HOST + ":" + PORT
print("Event Store connection endpoint:", endpoint)
ConfigurationReader.setConnectionEndpoints(endpoint)

Event Store connection endpoint: 192.168.0.106:1100
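
The port depends on the Event Store version, as noted in the configuration cell. A small sketch of that rule is shown here; `port_for_version` and `build_endpoint` are hypothetical helpers, not part of the `eventstore` package, and versions earlier than 1.1.1 are rejected because this notebook does not say which port they use.

```python
def port_for_version(version: str) -> str:
    """Pick the connection port from a version string such as '1.1.2'."""
    parts = tuple(int(p) for p in version.split("."))
    if parts >= (1, 1, 2):
        return "1100"
    if parts == (1, 1, 1):
        return "5555"
    # Assumption: anything else is unsupported by this sketch.
    raise ValueError("no known port for version " + version)


def build_endpoint(host: str, version: str) -> str:
    """Build the 'host:port' string expected by setConnectionEndpoints."""
    return host + ":" + port_for_version(version)


print(build_endpoint("192.168.0.106", "1.1.2"))
```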


## Open the database

The cells in this section open the database and create a temporary view for the table that we created previously.

To run Spark SQL queries, you first have to set up a Db2 Event Store Spark session. The EventSession class extends the optimizer of the SparkSession class.

In [5]:
sparkSession = SparkSession.builder.appName("EventStore SQL in Python").getOrCreate()
eventSession = EventSession(sparkSession.sparkContext, DB_NAME)

The next cell opens the database so that operations can be run against it.

In [6]:
eventSession.open_database()

With the following cells, we list all existing tables and then load the table we previously created into the `tab` DataFrame reference, which is used later in this notebook to create a temporary view.

In [7]:
with EventContext.get_event_context(DB_NAME) as ctx:
   print("Event context successfully retrieved.")

print("Table names:")
table_names = ctx.get_names_of_tables()
for name in table_names:
   print(name)

Event context successfully retrieved.
Table names:
IOT_TEMPERATURE


In [8]:
tab = eventSession.load_event_table(TABLE_NAME)

Let's recall the table schema we previously created.

In [9]:
try:
    resolved_table_schema = ctx.get_table(TABLE_NAME)
    print(resolved_table_schema)
except Exception as err:
    # Include the underlying error, since the failure may have another cause
    print("Table not found:", err)

ResolvedTableSchema(tableName=IOT_TEMPERATURE, schema=StructType(List(StructField(deviceID,IntegerType,false),StructField(sensorID,IntegerType,false),StructField(ts,LongType,false),StructField(ambient_temp,DoubleType,false),StructField(power,DoubleType,false),StructField(temperature,DoubleType,false))), sharding_columns=['deviceID', 'sensorID'], pk_columns=['deviceID', 'sensorID', 'ts'], partition_columns=None)


## Best Practices for efficient queries

In the next cell we create a lazily evaluated view that we can use like a Hive table in Spark SQL. The view is only evaluated when we actually run a query or cache its results. We call this view `readings`, and that is how we refer to it in the queries below:

In [10]:
tab.createOrReplaceTempView("readings")

In [11]:
query = "SELECT count(*) FROM readings"
print("{}\nRunning query in Event Store...".format(query))
df_data = eventSession.sql(query)
df_data.toPandas()

SELECT count(*) FROM readings
Running query in Event Store...


| | count(1) |
|---|---|
| 0 | 999999 |


Let's have a look at the record structure:

In [12]:
query = "SELECT * FROM readings LIMIT 1"
print("{}\nRunning query in Event Store...".format(query))
df_data = eventSession.sql(query)
df_data.toPandas()

SELECT * FROM readings LIMIT 1
Running query in Event Store...


| | deviceID | sensorID | ts | ambient_temp | power | temperature |
|---|---|---|---|---|---|---|
| 0 | 1 | 24 | 1541019343497 | 22.545444 | 9.834895 | 39.065559 |


In [13]:
query = "SELECT MIN(ts), MAX(ts) FROM readings"
print("{}\nRunning query in Event Store...".format(query))
df_data = eventSession.sql(query)
df_data.toPandas()


SELECT MIN(ts), MAX(ts) FROM readings
Running query in Event Store...


| | min(ts) | max(ts) |
|---|---|---|
| 0 | 1541019343497 | 1541773999825 |


## Optimal query through the index

- Index queries significantly reduce the amount of data that needs to be scanned for results.
    - Indexes in IBM Db2 Event Store are formed asynchronously to avoid insert latency.
    - They are stored as a Log Structured Merge (LSM) Tree.
    - The index is formed by "runs", which include sequences of sorted keys. These runs are written to disk during "Share" processing.
    - These index runs are merged together over time to improve scan and I/O efficiency.

- For optimal query performance, you must specify equality on all the equal_columns in the index and a range on the sort column in the index.
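
To see why equality on the equal_columns plus a range on the sort column suits a sorted index, consider the toy model below. It is a conceptual sketch only, not Event Store's implementation: each run is a plain Python list of `(deviceID, sensorID, ts)` keys, runs are merged with `heapq.merge`, and the lookup is a binary search. The function names and sample keys are made up for illustration.

```python
import heapq
from bisect import bisect_left, bisect_right

# Each run is a sorted list of (deviceID, sensorID, ts) keys.
run_a = [(1, 12, 100), (1, 12, 300), (2, 7, 150)]
run_b = [(1, 5, 120), (1, 12, 200), (2, 7, 400)]


def merge_runs(*runs):
    """Merge already-sorted runs into one sorted run."""
    return list(heapq.merge(*runs))


def range_lookup(run, device_id, sensor_id, ts_low, ts_high):
    """Return keys matching both IDs exactly with ts_low < ts < ts_high."""
    # With the leading columns fixed, matching keys form one contiguous slice.
    start = bisect_right(run, (device_id, sensor_id, ts_low))
    stop = bisect_left(run, (device_id, sensor_id, ts_high))
    return run[start:stop]


merged = merge_runs(run_a, run_b)
print(merged)
print(range_lookup(merged, 1, 12, 100, 300))
```

Because the keys are sorted by the equal_columns first and the sort column last, the lookup touches only one contiguous slice of the merged run instead of every key.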

For example, the following query retrieves all the values in a range of timestamps for a specific device and sensor. It specifies equality on `deviceID` and `sensorID`, which are the equal_columns in the index schema, and a range on `ts`, which is the sort column for the index.

In [14]:
index_query = "SELECT ts, temperature  FROM readings where deviceID=1 and sensorID=12 and ts >1541021271619 and ts < 1541043671128 order by ts"
print("{}\nCreating a dataframe for the query ...".format(index_query))
df_index_query = eventSession.sql(index_query)

SELECT ts, temperature  FROM readings where deviceID=1 and sensorID=12 and ts >1541021271619 and ts < 1541043671128 order by ts
Creating a dataframe for the query ...


Then the following cell runs the query and caches the results. Note that this caching is for demonstration purposes, to show the time it takes to run the query and cache the results in memory within Spark. Caching is recommended when you are going to do additional processing on the cached data, because the query against IBM Db2 Event Store is then only run once.

In [15]:
%%time
df_index_query.cache()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 634 ms


DataFrame[ts: bigint, temperature: double]

Finally, the results from the cached data can be displayed:

In [16]:
df_index_query.toPandas()

| | ts | temperature |
|---|---|---|
| 0 | 1541021321393 | 40.204513 |
| 1 | 1541021364651 | 44.899199 |
| 2 | 1541021373123 | 38.694953 |
| 3 | 1541021399823 | 41.095604 |
| 4 | 1541021508549 | 43.350289 |
| 5 | 1541021523675 | 33.144291 |
| 6 | 1541021584929 | 40.710335 |
| 7 | 1541021751002 | 42.303187 |
| 8 | 1541021814624 | 43.006241 |
| 9 | 1541021838730 | 36.723280 |
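
When such queries are generated from code, the rule can be captured in a small helper. The sketch below assumes integer-valued key columns, as in the `IOT_TEMPERATURE` schema; `build_index_query` is a hypothetical helper and not part of the `eventstore` package.

```python
def build_index_query(table, select_cols, equal_filters, sort_col, low, high):
    """Build a query with equality on each equal_column and a range on the sort column."""
    # int() rejects non-numeric input, so no quoting or escaping is needed.
    predicates = ["{}={}".format(col, int(val)) for col, val in equal_filters.items()]
    predicates.append("{} > {}".format(sort_col, int(low)))
    predicates.append("{} < {}".format(sort_col, int(high)))
    return "SELECT {} FROM {} WHERE {} ORDER BY {}".format(
        ", ".join(select_cols), table, " AND ".join(predicates), sort_col
    )


query = build_index_query(
    "readings",
    ["ts", "temperature"],
    {"deviceID": 1, "sensorID": 12},
    "ts",
    1541021271619,
    1541043671128,
)
print(query)
```

The resulting string can be passed to `eventSession.sql(...)` in the same way as the hand-written query above.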


## Sub-optimal query

The next query is sub-optimal: it specifies equality on only one of the equal_columns in the index schema, and for this reason ends up doing a full scan of the table.

In [17]:
fullscan_query = "SELECT count(*) FROM readings where sensorID = 7"
print("{}\nCreating a dataframe for the query...".format(fullscan_query))
df_fullscan_query = eventSession.sql(fullscan_query)

SELECT count(*)  FROM readings where sensorID = 7
Creating a dataframe for the query...


In [18]:
%%time
df_fullscan_query.cache()

CPU times: user 10 ms, sys: 0 ns, total: 10 ms
Wall time: 371 ms


DataFrame[count(1): bigint]

In [19]:
df_fullscan_query.toPandas()

| | count(1) |
|---|---|
| 0 | 20396 |
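
A quick way to check a query against the rule before running it is to compare the columns it constrains with the index schema. This is a minimal sketch that assumes the index described in this notebook (`deviceID` and `sensorID` as equal_columns, `ts` as the sort column); `is_optimal_index_query` is a hypothetical helper that only inspects column names you pass in and does not parse SQL.

```python
EQUAL_COLUMNS = ("deviceID", "sensorID")  # equal_columns of the index
SORT_COLUMN = "ts"  # sort column of the index


def is_optimal_index_query(equality_cols, range_cols):
    """True when every equal_column has an equality predicate and the
    sort column has a range predicate."""
    return set(EQUAL_COLUMNS) <= set(equality_cols) and SORT_COLUMN in set(range_cols)


# The index query: equality on both IDs, range on ts.
print(is_optimal_index_query({"deviceID", "sensorID"}, {"ts"}))
# The query in this section: equality on sensorID only.
print(is_optimal_index_query({"sensorID"}, set()))
```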


## Accessing multiple sensorIDs optimally

The easiest way to write a query that needs to retrieve multiple values for one of the equal_columns in the index schema is to use an *In-List*. With this, you get optimal index access across multiple sensorIDs.

In this example, we specify equality for a specific deviceID and an In-List for the four sensors we are trying to retrieve. To limit the number of records returned, we also include a range of timestamps.

In [20]:
inlist_query = "SELECT deviceID, sensorID, ts FROM readings where deviceID=1 and sensorID in (1, 5, 7, 12) and ts >1541021271619 and ts < 1541043671128 order by ts"
print("{}\nCreating a dataframe for the query...".format(inlist_query))
df_inlist_query = eventSession.sql(inlist_query)

SELECT deviceID, sensorID, ts  FROM readings where deviceID=1 and sensorID in (1, 5, 7, 12) and ts >1541021271619 and ts < 1541043671128 order by ts
Creating a dataframe for the query...


In [21]:
%%time
df_inlist_query.cache()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 226 ms


DataFrame[deviceID: int, sensorID: int, ts: bigint]

In [22]:
df_inlist_query.toPandas()

| | deviceID | sensorID | ts |
|---|---|---|---|
| 0 | 1 | 12 | 1541021321393 |
| 1 | 1 | 5 | 1541021350290 |
| 2 | 1 | 7 | 1541021353694 |
| 3 | 1 | 12 | 1541021364651 |
| 4 | 1 | 7 | 1541021365768 |
| 5 | 1 | 12 | 1541021373123 |
| 6 | 1 | 12 | 1541021399823 |
| 7 | 1 | 1 | 1541021441838 |
| 8 | 1 | 1 | 1541021447795 |
| 9 | 1 | 5 | 1541021451651 |
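
An In-List returns the same rows as running one fully specified query per sensor and combining the results. The sketch below illustrates this with SQLite from the Python standard library as a stand-in SQL engine, using a small made-up data set. It demonstrates the SQL semantics only and says nothing about how Event Store executes the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (deviceID INTEGER, sensorID INTEGER, ts INTEGER)")
# Made-up rows: device 1, five sensors, unique timestamps.
rows = [(1, s, 1000 + 10 * i + s) for i in range(5) for s in (1, 5, 7, 9, 12)]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)

# One query with an In-List on sensorID.
in_list = conn.execute(
    "SELECT deviceID, sensorID, ts FROM readings "
    "WHERE deviceID = 1 AND sensorID IN (1, 5, 7, 12) AND ts > 1000 AND ts < 1040 "
    "ORDER BY ts"
).fetchall()

# The same rows, fetched with one equality query per sensor.
per_sensor = []
for sensor in (1, 5, 7, 12):
    per_sensor += conn.execute(
        "SELECT deviceID, sensorID, ts FROM readings "
        "WHERE deviceID = 1 AND sensorID = ? AND ts > 1000 AND ts < 1040",
        (sensor,),
    ).fetchall()

print(in_list == sorted(per_sensor, key=lambda row: row[2]))
```

The In-List form keeps this to a single query and a single result set, which is why it is the simpler choice.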


## Exploiting the synopsis table: Advanced medium-weight queries

- Event Store tables include a synopsis table, which summarizes the minimum and maximum values of the data for each range of rows.

     - It contains one range for every 1000 rows.
     - It is stored in a separate internal table in the shared storage layer.
     - It is Parquet-compressed to minimize footprint.
     - For highly selective queries, it can improve performance by up to 1000x.

- Using an equality or a range predicate on a clustered field (for example, the `ts` column in our case, because the data is inserted into the table in order) is faster than a full scan because it should be able to exploit the synopsis table, but it is slower than an optimal index scan.
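
The idea behind a synopsis can be sketched in a few lines of plain Python. This is a conceptual model under simplifying assumptions (a single in-memory column and synthetic timestamps), not Event Store's implementation: it keeps a minimum and maximum for every 1000 rows and skips any block whose range cannot overlap the predicate. The function names are hypothetical.

```python
BLOCK_SIZE = 1000  # one synopsis range per 1000 rows


def build_synopsis(values, block_size=BLOCK_SIZE):
    """Return (start_row, end_row, min, max) for each block of rows."""
    synopsis = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        synopsis.append((start, start + len(block), min(block), max(block)))
    return synopsis


def scan_with_synopsis(values, synopsis, low, high):
    """Return values with low < v < high, skipping blocks that cannot match."""
    matches, blocks_read = [], 0
    for start, end, block_min, block_max in synopsis:
        if block_max <= low or block_min >= high:
            continue  # the whole block lies outside the requested range
        blocks_read += 1
        matches.extend(v for v in values[start:end] if low < v < high)
    return matches, blocks_read


ts_column = list(range(0, 100_000, 10))  # 10,000 synthetic rows in ts order
synopsis = build_synopsis(ts_column)
matches, blocks_read = scan_with_synopsis(ts_column, synopsis, 25_000, 26_000)
print(len(matches), "rows matched;", blocks_read, "of", len(synopsis), "blocks read")
```

Pruning works well here because the column is clustered: each block covers a narrow, non-overlapping range of values. On an unclustered column, most blocks would span the whole value range and few could be skipped.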

In [23]:
synopsis_query = "SELECT deviceID, sensorID, ts FROM readings where ts >1541021271619 and ts < 1541043671128 order by ts"
print("{}\nCreating a dataframe for the query...".format(synopsis_query))
df_synopsis_query = eventSession.sql(synopsis_query)

SELECT deviceID, sensorID, ts FROM readings where ts >1541021271619 and ts < 1541043671128 order by ts
Creating a dataframe for the query...


In [24]:
%%time
df_synopsis_query.cache()

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 742 ms


DataFrame[deviceID: int, sensorID: int, ts: bigint]

In [25]:
df_synopsis_query.toPandas()

| | deviceID | sensorID | ts |
|---|---|---|---|
| 0 | 1 | 9 | 1541021272082 |
| 1 | 1 | 3 | 1541021272990 |
| 2 | 1 | 29 | 1541021273199 |
| 3 | 1 | 23 | 1541021274488 |
| 4 | 2 | 17 | 1541021275216 |
| 5 | 1 | 6 | 1541021276157 |
| 6 | 2 | 47 | 1541021276987 |
| 7 | 2 | 21 | 1541021277039 |
| 8 | 2 | 37 | 1541021277872 |
| 9 | 2 | 46 | 1541021278309 |


## Summary
This demo introduced you to best practices for querying a table stored in an IBM Db2 Event Store database.

## Next Step
`Event_Store_Data_Analytics.ipynb` shows you how to perform data analytics on IBM Db2 Event Store using multiple scientific tools.