# Querying an IBM Db2 Event Store Database Table
IBM Db2 Event Store is a hybrid transactional/analytical processing (HTAP) system designed for IoT workloads. It is powered by the Db2 Common SQL Engine, a sophisticated SQL-based analytics query engine, so it can handle complex queries quickly and efficiently.

> This demo was created with IBM Db2 Event Store 2.0 Enterprise Edition.

In the previous notebook, "EventStore_Table_Creation", we created a database and a table with an index, and we saw how to ingest sample data into the table.

***Pre-Req: EventStore_Table_Creation***

In [None]:
# import event store's Python client interface libraries
from eventstore.oltp import EventContext
from eventstore.sql import EventSession
from pyspark.sql import SparkSession
from eventstore.common import ConfigurationReader

<a id="connect-to-es"></a>
### 1. Set up connection to IBM Db2 Event Store

**In this demo, we assume your IBM Db2 Event Store is installed with Watson Studio Local (WSL).**

You need to set the Watson Studio Local `userID` and `password` that will be used to connect to the IBM Db2 Event Store instance.

By default, the connection is established to the IBM Db2 Event Store instance on the current Watson Studio Local cluster.

For more details on setting up an IBM Db2 Event Store connection in a Jupyter notebook, see the official documentation:
https://www.ibm.com/support/knowledgecenter/en/SSGNPV_2.0.0/dsx/jupyter_prereq.html

In [None]:
# Using the configuration reader API, set up the userID and password that 
# will be used to connect to IBM Db2 Event Store.

ConfigurationReader.setEventUser("admin")
ConfigurationReader.setEventPassword("password")
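
Hardcoded credentials in a notebook are easy to leak. As a minimal sketch, the helper below reads them from environment variables instead. The variable names `EVENTSTORE_USER` and `EVENTSTORE_PASSWORD` and the function `read_credentials` are hypothetical, chosen only for illustration; they are not part of the Event Store client.

```python
import os


def read_credentials(env=None):
    """Return (user, password) read from environment variables.

    The variable names are hypothetical; adapt them to your environment.
    """
    env = os.environ if env is None else env
    user = env.get("EVENTSTORE_USER")
    password = env.get("EVENTSTORE_PASSWORD")
    if not user or not password:
        # Fail early with a clear message rather than connecting with empty values
        raise RuntimeError(
            "Set EVENTSTORE_USER and EVENTSTORE_PASSWORD before connecting"
        )
    return user, password
```

The returned values can then be passed to `ConfigurationReader.setEventUser(user)` and `ConfigurationReader.setEventPassword(password)` in place of the literals above.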

### 2. Connect to the database
**By default, an IBM Db2 Event Store 2.0 instance has a database named `EVENTDB`, and this default database should not be deleted. Each IBM Db2 Event Store 2.0 instance supports exactly ONE database.**

In [None]:
dbName = "EVENTDB"

To run Spark SQL queries, you must set up a Db2 Event Store Spark session. SparkSession is the entry point to programming Spark with the Dataset and DataFrame API. The EventSession class extends the optimizer of the SparkSession class.

In [None]:
sparkSession = SparkSession.builder.appName("EventStore SQL in Python").getOrCreate()
eventSession = EventSession(sparkSession.sparkContext, dbName)

Now you can execute the command to connect to the database in the event session you created:

In [None]:
eventSession.open_database()

### 3. Exploring the database by retrieving all tables
With the following cells we can list all user-created tables and explore the schema of the tables we are interested in.

In [None]:
with EventContext.get_event_context(dbName) as ctx:
    print("Event context successfully retrieved.")

# List the names of all tables in the database
table_names = ctx.get_names_of_tables()
for name in table_names:
    print(name)

In [None]:
tabName = "IOT_TEMP"

In [None]:
tab = eventSession.load_event_table(tabName)

Let's recall the table schema we previously created.

In [None]:
try:
    resolved_table_schema = ctx.get_table(tabName)
    print(resolved_table_schema)
except Exception as err:
    # Report the underlying error instead of silently discarding it
    print("Table not found: {}".format(err))

As a reminder, the table created in the previous notebook has the following table schema and index specification:

```python
tabSchema = TableSchema(tabName, StructType([
    StructField("DEVICEID", IntegerType(), nullable = False),
    StructField("SENSORID", IntegerType(), nullable = False),
    StructField("TS", LongType(), nullable = False),
    StructField("AMBIENT_TEMP", DoubleType(), nullable = False),
    StructField("POWER", DoubleType(), nullable = False),
    StructField("TEMPERATURE", DoubleType(), nullable = False)
    ]),
    sharding_columns = ["DEVICEID", "SENSORID"],
    pk_columns = ["DEVICEID", "SENSORID", "TS"]
)

indexSchema = IndexSpecification(
          index_name=tabName + "Index",
          table_schema=tabSchema,
          equal_columns = ["DEVICEID", "SENSORID"],
          sort_columns = [
            SortSpecification("TS", ColumnOrder.DESCENDING_NULLS_LAST)],
          include_columns = ["TEMPERATURE"]
        )
```

### 4. Best Practices for efficient queries

#### 4.1 Basic Event Store queries

In the next cell we create a lazily evaluated view that can be used much like a Hive table in Spark SQL. It is only evaluated when we actually run a query or cache its results. We call this view "readings", and that is how we refer to it in the queries below:

In [None]:
tab.createOrReplaceTempView("readings")

In [None]:
query = "SELECT count(*) FROM readings"
print("{}\nRunning query in Event Store...".format(query))
df_data = eventSession.sql(query)
df_data.toPandas()

Let's have a look at the record structure:

In [None]:
query = "SELECT * FROM readings LIMIT 1"
print("{}\nRunning query in Event Store...".format(query))
df_data = eventSession.sql(query)
df_data.toPandas()

In [None]:
query = "SELECT MIN(ts), MAX(ts) FROM readings"
print("{}\nRunning query in Event Store...".format(query))
df_data = eventSession.sql(query)
df_data.toPandas()
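
The `ts` bounds used in the queries below are large integers. Assuming they are Unix epoch timestamps in milliseconds (an assumption; this notebook does not state the unit), the following sketch converts between such values and UTC datetimes. The helper names `to_epoch_ms` and `from_epoch_ms` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Reference point for Unix time, expressed as a timezone-aware datetime
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MS = timedelta(milliseconds=1)


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to integer epoch milliseconds."""
    # Floor division of two timedeltas yields an exact integer
    return (dt - EPOCH) // ONE_MS


def from_epoch_ms(ms):
    """Convert integer epoch milliseconds to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)
```

With helpers like these, a range predicate such as `ts > {start} AND ts < {end}` can be built from readable dates rather than hand-written integers.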


#### 4.2 Optimal query through the index

- Index queries significantly reduce the amount of data that needs to be scanned for results.
    - Indexes in IBM Db2 Event Store are formed asynchronously to avoid insert latency.
    - They are stored as a Log Structured Merge (LSM) Tree.
    - The index is made up of "runs", each of which contains a sequence of sorted keys. These runs are written to disk during “Share” processing.
    - These index runs are merged together over time to improve scan and I/O efficiency.

- For optimal query performance, you must specify equality on all the `equal_columns` in the index and a range on the sort column in the index.

For example, the following query retrieves all the values in a range of timestamps for a specific device and sensor. Both `deviceID` and `sensorID` are in the `equal_columns` definition of the index schema, and the `ts` column is the sort column of the index.

In [None]:
index_query = "SELECT ts, temperature FROM readings WHERE deviceID = 1 AND sensorID = 12 AND ts > 1541021271619 AND ts < 1541043671128 ORDER BY ts"
print("{}\nCreating a dataframe for the query ...".format(index_query))
df_index_query = eventSession.sql(index_query)

The following cell marks the query results to be cached in memory within Spark. This caching is for demonstration purposes. Note that Spark's `cache()` is lazy: the query runs and the cache is populated on the first action against the DataFrame, such as the `toPandas()` call further below, so the time reported for the `cache()` cell alone does not include the query run. Caching is recommended when you plan to do additional processing on the results, because the query against IBM Db2 Event Store is then run only once.

In [None]:
%%time
df_index_query.cache()

Finally, the results can be displayed as a pandas DataFrame:

In [None]:
df_index_query.toPandas()
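
To picture why equality on the `equal_columns` plus a range on the sort column is the optimal access pattern, here is a simplified, plain-Python sketch of sorted index runs. It is not Event Store's implementation: keys are plain `(deviceID, sensorID, ts)` tuples sorted in ascending order (the real index sorts `TS` descending), and the function names `merge_runs` and `range_lookup` are hypothetical.

```python
import heapq
from bisect import bisect_left, bisect_right


def merge_runs(*runs):
    """Merge several sorted runs of keys into one larger sorted run."""
    return list(heapq.merge(*runs))


def range_lookup(run, device_id, sensor_id, ts_low, ts_high):
    """Return keys matching the equality columns with ts_low < ts < ts_high.

    Because the run is sorted, two binary searches locate the matching
    slice without scanning the rest of the keys.
    """
    start = bisect_right(run, (device_id, sensor_id, ts_low))
    end = bisect_left(run, (device_id, sensor_id, ts_high))
    return run[start:end]


# Two small runs, each already sorted by (deviceID, sensorID, ts)
run_a = [(1, 7, 100), (1, 12, 100), (1, 12, 300)]
run_b = [(1, 12, 200), (2, 12, 150)]

merged = merge_runs(run_a, run_b)
matches = range_lookup(merged, device_id=1, sensor_id=12, ts_low=100, ts_high=300)
```

If a predicate on one of the leading key columns is missing, the matching keys are no longer contiguous in the sorted order, which is why such queries cannot use this kind of lookup.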

#### 4.3 Sub-optimal query

The next query is sub-optimal: it specifies equality on only one of the `equal_columns` in the index schema, and for this reason it ends up doing a full scan of the table.

In [None]:
fullscan_query = "SELECT count(*) FROM readings WHERE sensorID = 7"
print("{}\nCreating a dataframe for the query...".format(fullscan_query))
df_fullscan_query = eventSession.sql(fullscan_query)

In [None]:
%%time
df_fullscan_query.cache()

In [None]:
df_fullscan_query.toPandas()
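
Spark evaluates DataFrames lazily, and `cache()` only marks a DataFrame for caching; the query runs when the first action (for example `count()` or `toPandas()`) is executed. To compare the indexed query with the full scan, it can therefore help to time an action. The helper `timed` below is a hypothetical, plain-Python sketch that works with any callable.

```python
import time


def timed(action, *args, **kwargs):
    """Call `action` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = action(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example with the DataFrames from this notebook (not executed here):
#   rows, seconds = timed(df_fullscan_query.count)
#   print("full scan: {} rows in {:.2f}s".format(rows, seconds))
```

Timing the same action for `df_index_query` and `df_fullscan_query` gives a like-for-like comparison of the two access paths.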

#### 4.4 Accessing multiple sensorIDs optimally

The easiest way to write a query that needs to retrieve multiple values for one of the `equal_columns` in the index schema is to use an *In-List*. With this, you can get optimal index access across multiple sensorIDs.

In this example we specify equality for a specific deviceID and an In-List for the four sensors we are trying to retrieve. To limit the number of records returned, we also include a range of timestamps.

In [None]:
inlist_query = "SELECT deviceID, sensorID, ts FROM readings WHERE deviceID = 1 AND sensorID IN (1, 5, 7, 12) AND ts > 1541021271619 AND ts < 1541043671128 ORDER BY ts"
print("{}\nCreating a dataframe for the query...".format(inlist_query))
df_inlist_query = eventSession.sql(inlist_query)

In [None]:
%%time
df_inlist_query.cache()

In [None]:
df_inlist_query.toPandas()
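
When the list of sensors comes from code rather than being typed by hand, the query string can be assembled programmatically. The function `build_inlist_query` below is a hypothetical helper, not part of the Event Store client; it is a sketch that assumes the `readings` view and the column names used in this notebook.

```python
def build_inlist_query(columns, device_id, sensor_ids, ts_low, ts_high, view="readings"):
    """Build a query with equality on deviceID, an In-List on sensorID,
    and an open range on ts, matching the index access pattern above."""
    # Coerce every value to int so that only numeric literals are interpolated
    sensors = ", ".join(str(int(s)) for s in sensor_ids)
    if not sensors:
        raise ValueError("sensor_ids must not be empty")
    return (
        "SELECT {cols} FROM {view} "
        "WHERE deviceID = {dev} AND sensorID IN ({sensors}) "
        "AND ts > {lo} AND ts < {hi} ORDER BY ts"
    ).format(
        cols=", ".join(columns),
        view=view,
        dev=int(device_id),
        sensors=sensors,
        lo=int(ts_low),
        hi=int(ts_high),
    )
```

The resulting string can be passed to `eventSession.sql(...)` in the same way as the hand-written query above.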

#### 4.5 Exploiting the synopsis table: Advanced medium-weight queries

- Event Store tables include a synopsis table, which summarizes the minimum/maximum values of the data for each range of rows in the table.
    - It contains one range for every 1000 rows.
    - It is stored in a separate internal table in the shared storage layer.
    - It is Parquet compressed to minimize its footprint.
    - For highly selective queries, it can improve performance by up to 1000x.

- Using an equality or a range predicate on a clustered field (for example, the `TS` column in our case, because the data is inserted into the table in order) is faster than a full scan, as it should be able to exploit the synopsis table, but it is slower than an optimal index scan.
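
The idea behind the synopsis table can be sketched in plain Python: keep the minimum and maximum of a column for each range of rows, and skip any range whose bounds cannot satisfy the predicate. This is a simplified illustration, not Event Store's implementation, and the names `build_synopsis` and `ranges_to_scan` are hypothetical.

```python
def build_synopsis(values, rows_per_range=1000):
    """Return one (start_row, end_row, min, max) entry per range of rows."""
    synopsis = []
    for start in range(0, len(values), rows_per_range):
        chunk = values[start:start + rows_per_range]
        synopsis.append((start, start + len(chunk), min(chunk), max(chunk)))
    return synopsis


def ranges_to_scan(synopsis, low, high):
    """Keep only row ranges that may contain a value with low < value < high."""
    return [
        (start, end)
        for start, end, range_min, range_max in synopsis
        if range_max > low and range_min < high
    ]


# A clustered column: values arrive in increasing order, like timestamps
ts_column = list(range(10000))
synopsis = build_synopsis(ts_column)
candidates = ranges_to_scan(synopsis, low=2500, high=3200)
```

In this sketch only 2 of the 10 row ranges need to be read. The pruning works well because the column is clustered; for a column with values scattered across all ranges, nearly every range would overlap the predicate and little would be skipped.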

In [None]:
synopsis_query = "SELECT deviceID, sensorID, ts FROM readings WHERE ts > 1541021271619 AND ts < 1541043671128 ORDER BY ts"
print("{}\nCreating a dataframe for the query...".format(synopsis_query))
df_synopsis_query = eventSession.sql(synopsis_query)

In [None]:
%%time
df_synopsis_query.cache()

In [None]:
df_synopsis_query.toPandas()

## Summary
This demo introduced best practices for querying a table stored in an IBM Db2 Event Store database.

## Next Step
`"Event_Store_Data_Analytics.ipynb"` will show you how to perform data analytics with IBM Db2 Event Store using multiple scientific tools.