# IBM Db2 Event Store - Data Analytics using Python API 

IBM Db2 Event Store is a hybrid transactional/analytical processing (HTAP) system. It extends the Spark SQL interface to accelerate analytics queries. 

This notebook illustrates how the IBM Db2 Event Store can be integrated with multiple popular scientific tools to perform data analytics.

***Pre-Req: Event_Store_Table_Creation***

## Connect to IBM Db2 Event Store

Edit the values in the next cell to match your deployment.

In [None]:
CONNECTION_ENDPOINT=""

EVENT_USER_ID=""

EVENT_PASSWORD=""

# Port will be 1100 for version 1.1.2 or later (5555 for version 1.1.1)
PORT = "30370"

DEPLOYMENT_ID=""

# Database name
DB_NAME = "EVENTDB"

# Table name
TABLE_NAME = "IOT_TEMPERATURE"

HOSTNAME=""

DEPLOYMENT_SPACE=""

In [None]:
bearerToken=!echo `curl --silent -k -X GET https://{HOSTNAME}:443/v1/preauth/validateAuth -u admin:password |python -c "import sys, json; print(json.load(sys.stdin)['accessToken'])"`
bearerToken=bearerToken[0]
keystorePassword=!echo `curl -k --silent -X GET -H "authorization: Bearer {bearerToken}" "https://{HOSTNAME}:443/icp4data-databases/{DEPLOYMENT_ID}/zen/com/ibm/event/api/v1/oltp/keystore_password"`
keystorePassword
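
The token request above can also be written in plain Python instead of a shell pipeline. The following is a minimal sketch, assuming the same `/v1/preauth/validateAuth` endpoint and the same JSON response with an `accessToken` field; `parse_access_token` and `fetch_bearer_token` are hypothetical helper names, and certificate verification is disabled only to mirror `curl -k`.

```python
import base64
import json
import ssl
import urllib.request


def parse_access_token(body):
    """Extract the 'accessToken' field from the JSON response body."""
    return json.loads(body)["accessToken"]


def fetch_bearer_token(hostname, user, password):
    """Hypothetical helper: request a bearer token using HTTP basic auth."""
    url = "https://{}:443/v1/preauth/validateAuth".format(hostname)
    credentials = "{}:{}".format(user, password).encode("utf-8")
    request = urllib.request.Request(
        url,
        headers={"Authorization": "Basic " + base64.b64encode(credentials).decode("ascii")},
    )
    # Skip certificate verification, like `curl -k`. Assumption: this is only
    # acceptable for clusters that use self-signed certificates.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return parse_access_token(response.read().decode("utf-8"))
```

Calling `fetch_bearer_token(HOSTNAME, user, password)` would return the token as a plain string rather than the one-element list that the `!` shell syntax produces.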

## Import Python modules

In [None]:
## Note: Only run this cell if your IBM Db2 Event Store is installed with IBM Cloud Pak for Data (CP4D)

# In IBM Cloud Pak for Data, we need to create a symlink to ensure the Event Store Python library is
# properly exposed to the Spark runtime.
import os
src = '/home/spark/user_home/eventstore/eventstore'
dst = '/home/spark/shared/user-libs/python3.6/eventstore'
try:
    # Remove a stale link so that os.symlink does not fail because dst already exists
    os.remove(dst)
except EnvironmentError:
    print("Symlink doesn't exist yet.")
os.symlink(src, dst)
print("Created symlink to include Event Store Python library.")
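
The remove-then-link pattern above can be wrapped in a small function so that the cell is safe to rerun. This is a sketch only; `refresh_symlink` is a hypothetical helper name, not part of the Event Store library.

```python
import os


def refresh_symlink(src, dst):
    """Point dst at src, replacing an existing symlink at dst if there is one.

    Returns True if an old link was replaced, False if the link is new.
    """
    replaced = os.path.islink(dst)
    if replaced:
        os.remove(dst)
    os.symlink(src, dst)
    return replaced
```

Checking `os.path.islink` first avoids catching a broad exception and makes it explicit that only a symlink, not a real directory, is ever removed.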


In [None]:
%matplotlib inline  

from eventstore.common import ConfigurationReader
from eventstore.oltp import EventContext
from eventstore.sql import EventSession
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from scipy import stats
import warnings
import datetime

warnings.filterwarnings('ignore')
plt.style.use("fivethirtyeight")

## Connect to Event Store

In [None]:
ConfigurationReader.setConnectionEndpoints(CONNECTION_ENDPOINT)
ConfigurationReader.setEventUser(EVENT_USER_ID)
ConfigurationReader.setEventPassword(EVENT_PASSWORD)
ConfigurationReader.setSslKeyAndTrustStorePasswords(keystorePassword[0])
ConfigurationReader.setDeploymentID(DEPLOYMENT_ID)
ConfigurationReader.getSslTrustStorePassword()

## Open the database

The cells in this section are used to open the database and create a temporary view for the table that we created previously.   

To run Spark SQL queries, you must set up a Db2 Event Store Spark session. The EventSession class extends the optimizer of the SparkSession class.

In [None]:
sparkSession = SparkSession.builder.appName("EventStore SQL in Python").getOrCreate()
eventSession = EventSession(sparkSession.sparkContext, DB_NAME)

Now you can execute the command to open the database in the event session you created:

In [None]:
eventSession.open_database()

## Access an existing table in the database
The following code section retrieves the names of all tables that exist in the database.

In [None]:
with EventContext.get_event_context(DB_NAME) as ctx:
   print("Event context successfully retrieved.")

print("Table names:")
table_names = ctx.get_names_of_tables()
for name in table_names:
   print(name)

Now that we have the names of the existing tables, we load the table of interest and get a DataFrame reference that we can use to query it.

In [None]:
tab = eventSession.load_event_table(TABLE_NAME)
print("Table " + TABLE_NAME + " successfully loaded.")

The next cell retrieves the schema of the table we want to investigate:

In [None]:
try:
    resolved_table_schema = ctx.get_table(TABLE_NAME)
    print(resolved_table_schema)
except Exception as err:
    # Include the underlying error so that other failures are not hidden
    print("Table not found: {}".format(err))

In the following cell, we create a temporary view with that DataFrame called `readings` that we will use in the queries below.

In [None]:
tab.createOrReplaceTempView("readings")

## Data Analytics with IBM Db2 Event Store
Data analytics tasks can be performed on tables stored in the IBM Db2 Event Store database with a variety of data analytics tools.

Let's first take a look at the timestamp range of the records.

In [None]:
query = "SELECT MIN(ts) MIN_TS, MAX(ts) MAX_TS FROM readings"
print("{}\nRunning query in Event Store...".format(query))
df_data = eventSession.sql(query)
df_data.toPandas()

The following cell converts the timestamps from milliseconds to datetimes to make them human-readable.

In [None]:
MIN_TS=1541019342393
MAX_TS=1541773999825
print("The time range of the dataset is from {} to {}".format(
    datetime.datetime.fromtimestamp(MIN_TS/1000).strftime('%Y-%m-%d %H:%M:%S'), 
    datetime.datetime.fromtimestamp(MAX_TS/1000).strftime('%Y-%m-%d %H:%M:%S')))
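
Note that `datetime.datetime.fromtimestamp` without a time zone argument formats the value in the local time zone of the machine running the notebook. The sketch below shows a UTC variant; `ms_to_utc_string` is a hypothetical helper name, and treating the `ts` column as UTC epoch milliseconds is an assumption.

```python
import datetime


def ms_to_utc_string(ts_ms):
    """Format an epoch timestamp in milliseconds as a UTC date-time string."""
    moment = datetime.datetime.fromtimestamp(ts_ms / 1000, tz=datetime.timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


print(ms_to_utc_string(1541030400000))  # 2018-11-01 00:00:00
```

Formatting in UTC keeps the printed range consistent no matter where the notebook is executed.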

## Sample Problem
Assume we are only interested in the data recorded by the 12th sensor on the 1st device on 2018-11-01, and we want to investigate the effects of power consumption and ambient temperature on the temperature recorded by the sensor on that date.


Because the timestamp is recorded in milliseconds, we need to convert the datetime of interest to a time range in milliseconds, and then use the range as a filter in the query.

In [None]:
start_ts = (datetime.datetime(2018,11,1,0,0) - datetime.datetime(1970,1,1)).total_seconds() * 1000
end_ts = (datetime.datetime(2018,11,2,0,0) - datetime.datetime(1970,1,1)).total_seconds() * 1000
print("The time range of datetime 2018-11-01 in milliseconds is from {:.0f} to {:.0f}".format(start_ts, end_ts))
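
The same conversion can be written with time-zone-aware datetimes, which avoids subtracting the epoch by hand. This is a minimal sketch, assuming the day boundaries are meant in UTC; `day_range_ms` is a hypothetical helper name.

```python
import datetime


def day_range_ms(year, month, day):
    """Return (start, end) of a UTC calendar day in epoch milliseconds.

    The end value is the start of the following day, so it is exclusive.
    """
    start = datetime.datetime(year, month, day, tzinfo=datetime.timezone.utc)
    end = start + datetime.timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


start_ms, end_ms = day_range_ms(2018, 11, 1)
print(start_ms, end_ms)  # 1541030400000 1541116800000
```

Returning integers rather than floats makes the values directly usable in a SQL filter on a millisecond timestamp column.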

IBM Db2 Event Store extends the Spark SQL functionality, which allows users to apply filters with ease.  

In the following cell, the relevant data is extracted according to the problem scope. Because the query specifies a particular device and sensor, it fully exploits the index.

In [None]:
query = "SELECT * FROM readings WHERE deviceID=1 AND sensorID=12 AND ts >1541030400000 AND ts < 1541116800000 ORDER BY ts"
print("{}\nRunning query in Event Store...".format(query))
refined_data = eventSession.sql(query)
refined_data.createOrReplaceTempView("refined_reading")
refined_data.toPandas()
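
Rather than pasting the millisecond literals into the SQL string, the query text can be built from the computed range. The following is a sketch under stated assumptions: `build_readings_query` is a hypothetical helper, the column names are those used in the query above, and the lower bound uses `>=` (unlike the `>` above) so that a reading taken exactly at midnight is included.

```python
def build_readings_query(device_id, sensor_id, start_ms, end_ms, view="readings"):
    """Build the filter query for one device and sensor over [start_ms, end_ms).

    int() is applied to every value so that only numbers reach the SQL text.
    """
    return (
        "SELECT * FROM {view} "
        "WHERE deviceID={device} AND sensorID={sensor} "
        "AND ts >= {start} AND ts < {end} ORDER BY ts"
    ).format(
        view=view,
        device=int(device_id),
        sensor=int(sensor_id),
        start=int(start_ms),
        end=int(end_ms),
    )


query = build_readings_query(1, 12, 1541030400000, 1541116800000)
print(query)
```

The resulting string can be passed to `eventSession.sql(query)` in the same way as the hand-written query.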

### Basic Statistics 
For numerical data, knowing the descriptive summary statistics can help a lot in understanding the distribution of the data.

IBM Db2 Event Store extends the Spark DataFrame functionality. We can use the `describe` function to retrieve statistics about data stored in an IBM Db2 Event Store table.

In [None]:
refined_data.describe().toPandas()
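
For each numeric column, `describe` reports the count, mean, standard deviation, minimum, and maximum. The sketch below computes the same five figures for a plain Python list to make their meaning concrete. `summarize` is a hypothetical helper, the input values are made up for illustration, and the sample (n - 1) standard deviation is assumed.

```python
import statistics


def summarize(values):
    """Return the five summary figures for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "max": max(values),
    }


# Made-up readings, for illustration only
print(summarize([1.0, 2.0, 3.0, 4.0]))
```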

It's worth noticing that some power readings are negative, which may be caused by sensor error. The following query keeps only the records with a positive power reading.

In [None]:
query = "SELECT * FROM readings WHERE deviceID=1 AND sensorID=12 AND ts >1541030400000 AND ts < 1541116800000 AND power > 0 ORDER BY ts"
print("{}\nRunning query in Event Store...".format(query))
refined_data = eventSession.sql(query)
refined_data.createOrReplaceTempView("refined_reading")

The following cell counts the total number of records in the refined view.

In [None]:
query = "SELECT count(*) count FROM refined_reading"
print("{}\nRunning query in Event Store...".format(query))
df_data = eventSession.sql(query)
df_data.toPandas()

### Covariance and correlation
- Covariance is a measure of how two variables change with respect to each other. It can be examined by calling the `.stat.cov()` function on the DataFrame.

In [None]:
refined_data.stat.cov("AMBIENT_TEMP","TEMPERATURE")

In [None]:
refined_data.stat.cov("POWER","TEMPERATURE")

- Correlation is a normalized measure of covariance that is easier to interpret, as it quantifies the statistical dependence between two random variables on a fixed scale from -1 to 1. It can be examined by calling the `.stat.corr()` function on the DataFrame.

In [None]:
refined_data.stat.corr("AMBIENT_TEMP","TEMPERATURE")

In [None]:
refined_data.stat.corr("POWER","TEMPERATURE")
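
To make the relationship between the two measures concrete, the sketch below computes the sample covariance and the Pearson correlation in plain Python. The function names and the input values are illustrative assumptions and are not taken from the table.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance: mean-centered products summed and divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    std_x = math.sqrt(sample_covariance(xs, xs))
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)


# Made-up values, for illustration only
power = [1.0, 2.0, 3.0, 4.0]
temperature = [2.0, 4.0, 6.0, 8.0]
print(sample_covariance(power, temperature))
print(pearson_correlation(power, temperature))
```

Covariance depends on the units of both variables, whereas correlation does not, which is why the correlation values are the easier ones to compare across column pairs.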

### Visualization
Visualization of each feature provides insights into the underlying distributions.

- Distribution of Ambient Temperature

In [None]:
query = "SELECT ambient_temp FROM refined_reading"
print("{}\nRunning query in Event Store...".format(query))
ambient_temp = eventSession.sql(query)
ambient_temp= ambient_temp.toPandas()
ambient_temp.head()

In [None]:
fig, axs = plt.subplots(1,3, figsize=(16,6))
stats.probplot(ambient_temp.iloc[:,0], plot=axs[0])  # draw on the axes created above
axs[1].boxplot(ambient_temp.iloc[:,0])
axs[1].set_title("Boxplot on Ambient_temp")
axs[2].hist(ambient_temp.iloc[:,0], bins = 20)
axs[2].set_title("Histogram on Ambient_temp")

- Distribution of Power Consumption

In [None]:
query = "SELECT power FROM refined_reading"
print("{}\nRunning query in Event Store...".format(query))
power = eventSession.sql(query)
power= power.toPandas()
power.head()

In [None]:
fig, axs = plt.subplots(1,3, figsize=(16,6))
stats.probplot(power.iloc[:,0], plot=axs[0])  # draw on the axes created above
axs[1].boxplot(power.iloc[:,0])
axs[1].set_title("Boxplot on Power")
axs[2].hist(power.iloc[:,0], bins = 20)
axs[2].set_title("Histogram on Power")

- Distribution of Sensor Temperature

In [None]:
query = "SELECT temperature FROM refined_reading"
print("{}\nRunning query in Event Store...".format(query))
temperature = eventSession.sql(query)
temperature= temperature.toPandas()
temperature.head()

In [None]:
fig, axs = plt.subplots(1,3, figsize=(16,6))
stats.probplot(temperature.iloc[:,0], plot=axs[0])  # draw on the axes created above
axs[1].boxplot(temperature.iloc[:,0])
axs[1].set_title("Boxplot on Temperature")
axs[2].hist(temperature.iloc[:,0], bins = 20)
axs[2].set_title("Histogram on Temperature")

- Input-variable vs. Target-variable

In [None]:
fig, axs = plt.subplots(1,2, figsize=(16,6))
axs[0].scatter(power.iloc[:,0], temperature.iloc[:,0])
axs[0].set_xlabel("power in kW")
axs[0].set_ylabel("temperature in celsius")
axs[0].set_title("Power vs. Temperature")
axs[1].scatter(ambient_temp.iloc[:,0], temperature.iloc[:,0])
axs[1].set_xlabel("ambient_temp in celsius")
axs[1].set_ylabel("temperature in celsius")
axs[1].set_title("Ambient_temp  vs. Temperature")

**By observing the plots above, we noticed:**
- The distributions of power consumption, ambient temperature, and sensor temperature each follow a roughly normal distribution.
- The scatter plots suggest that the sensor temperature has a linear relationship with both power consumption and ambient temperature.

## Summary
This notebook introduced you to data analytics using IBM Db2 Event Store.

## Next Step
`"Event_Store_ML_Model_Deployment.ipynb"` will show you how to build and deploy a machine learning model.

<p><font size=-1 color=gray>
&copy; Copyright 2019 IBM Corp. All Rights Reserved.
<p>
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language governing permissions and
limitations under the License.
</font></p>