# IBM Db2 Event Store - Machine Learning Modeling and Model Deployment 
IBM Db2 Event Store is a hybrid transactional/analytical processing (HTAP) system. This notebook illustrates machine learning modeling and model deployment with IBM Db2 Event Store.

***Pre-Req: Event Store Data Analytics***

When you finish this demo, you will know:
- How to build a machine learning model
- How to save and deploy the model
- How to make realtime predictions with the deployed model

## Connect to IBM Db2 Event Store

### Determine the IP address of your host

Obtain the IP address of the host that you want to connect to by running the appropriate command for your operating system:

* On Mac, run: `ifconfig`
* On Windows, run: `ipconfig`
* On Linux, run: `hostname -i`

Edit the `HOST = "XXX.XXX.XXX.XXX"` value in the next cell to provide the IP address.

In [1]:
# Set your host IP address
HOST = "192.168.0.104"

# Port will be 1100 for version 1.1.2 or later (5555 for version 1.1.1)
PORT = "1100"

# Database name
DB_NAME = "TESTDB"

# Table name
TABLE_NAME = "IOT_TEMPERATURE"

## Import Python modules

In [2]:
from eventstore.common import ConfigurationReader
from eventstore.oltp import EventContext
from eventstore.sql import EventSession
from pyspark.sql import SparkSession
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

## Connect to Event Store

In [3]:
endpoint = HOST + ":" + PORT
print("Event Store connection endpoint:", endpoint)
ConfigurationReader.setConnectionEndpoints(endpoint)

Event Store connection endpoint: 192.168.0.104:1100
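
The endpoint string is just the host and port joined by a colon, so a leftover placeholder such as `XXX.XXX.XXX.XXX` only surfaces as an error when the connection is attempted. As a minimal sketch, the values could be checked before they are passed to `ConfigurationReader`. The helper name `build_endpoint` is hypothetical and not part of the Event Store client, and the check assumes an IPv4 address.

```python
import ipaddress


def build_endpoint(host: str, port: str) -> str:
    """Validate an IPv4 host and a TCP port, then return 'host:port'."""
    # IPv4Address raises a ValueError subclass for anything that is not a valid IPv4 literal
    ipaddress.IPv4Address(host)
    port_number = int(port)  # raises ValueError for non-numeric text
    if not 1 <= port_number <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"{host}:{port_number}"


print(build_endpoint("192.168.0.104", "1100"))
```

Calling `build_endpoint("XXX.XXX.XXX.XXX", "1100")` raises a `ValueError`, which flags the unedited placeholder early.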


## Open the database

The following code is used to open a database to be able to access its tables and data.

To run Spark SQL queries, you must set up a Db2 Event Store Spark session. The EventSession class extends the optimizer of the SparkSession class.

In [4]:
sparkSession = SparkSession.builder.appName("EventStore SQL in Python").getOrCreate()
eventSession = EventSession(sparkSession.sparkContext, DB_NAME)

Now you can execute the command to open the database in the event session you created:

In [5]:
eventSession.open_database()

## Access an existing table in the database
The following code section retrieves the names of all tables that exist in the database.

In [6]:
with EventContext.get_event_context(DB_NAME) as ctx:
   print("Event context successfully retrieved.")

print("Table names:")
table_names = ctx.get_names_of_tables()
for name in table_names:
   print(name)

Event context successfully retrieved.
Table names:
IOT_TEMPERATURE


Now we have the name of the existing table. Next, we load the table and get a DataFrame reference that lets us query it. The following code loads the table and creates a temporary view with the same name as the table.

In [7]:
tab = eventSession.load_event_table(TABLE_NAME)
tab.createOrReplaceTempView(TABLE_NAME)
print("Table " + TABLE_NAME + " successfully loaded and temporary view created.")

Table IOT_TEMPERATURE successfully loaded and temporary view created.


The next code retrieves the schema of the table we want to investigate:

In [8]:
try:
    resolved_table_schema = ctx.get_table(TABLE_NAME)
    print(resolved_table_schema)
except Exception as err:
    print("Table not found:", err)

ResolvedTableSchema(tableName=IOT_TEMPERATURE, schema=StructType(List(StructField(deviceID,IntegerType,false),StructField(sensorID,IntegerType,false),StructField(ts,LongType,false),StructField(ambient_temp,DoubleType,false),StructField(power,DoubleType,false),StructField(temperature,DoubleType,false))), sharding_columns=['deviceID', 'sensorID'], pk_columns=['deviceID', 'sensorID', 'ts'], partition_columns=None)


## Machine Learning Modeling
This section shows how to build a machine learning model with the data stored in the IBM Db2 Event Store database.

### Recall from the *Event_Store_Data_Analytics* notebook
- There are two input variables: ambient temperature and power consumption. The dependent variable is the sensor temperature reading.
- All features follow a normal distribution.
- There is an obvious linear relationship between each independent variable and the dependent variable.

Now let's build a linear model that predicts sensor temperature from power consumption and ambient temperature, using the data stored in the IBM Db2 Event Store database table.

First import the relevant PySpark machine learning libraries:

In [9]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler, StandardScaler 
from pyspark.ml import Pipeline
from pyspark.sql.functions import col  # only col() is used; a wildcard import would shadow Python built-ins such as sum and max
from pyspark.ml.evaluation import RegressionEvaluator

The following cell builds a new Spark SQL DataFrame from the `tab` DataFrame, selecting `temperature` as the label alongside the two input variables, and prints the schema of the resulting `variable_df` DataFrame.

In [10]:
variables = ["ambient_temp", "power"]
variable_df = tab.select(col("temperature").alias("label"), *variables)
variable_df.printSchema()

root
 |-- label: double (nullable = false)
 |-- ambient_temp: double (nullable = false)
 |-- power: double (nullable = false)



Now we randomly split the DataFrame into a training set (75%) and a test set (25%).

We first build and train the model on the training set, then evaluate the model performance on the test set.

In [11]:
training, test = variable_df.randomSplit([0.75, 0.25], 42)
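
`randomSplit` normalizes the weights and assigns each row to a split at random, so the resulting sizes are close to 75/25 but not exact, and the seed (42 here) makes the assignment repeatable. The sketch below illustrates that idea in plain Python. It is an illustration of the sampling scheme, not Spark's implementation, and the function name `random_split` is hypothetical.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of each split's slice of [0, 1)
    bounds = [w / total for w in accumulate(weights)]
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        # Pick the first split whose upper bound exceeds the draw
        index = next((i for i, upper in enumerate(bounds) if draw < upper),
                     len(weights) - 1)
        splits[index].append(row)
    return splits


rows = list(range(10_000))
training_rows, test_rows = random_split(rows, [0.75, 0.25], seed=42)
print(len(training_rows), len(test_rows))
```

Because every row is drawn independently, the two lengths always add up to the input size, while the training share only approximates 75%.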

The model is built as a pipeline with three stages: *vector assembly*, *standardization*, and *model definition*.

In the following cell we define the three stages.

The input columns are first assembled into a single feature vector. Then, each feature is scaled to unit standard deviation (with its default settings, PySpark's `StandardScaler` does not center the data around zero). Finally, the linear model is defined with regularization.

In [12]:
vectorAssembler = VectorAssembler(inputCols=variables, outputCol="unscaled_variables")
standardScaler = StandardScaler(inputCol="unscaled_variables", outputCol="features")
linear_model = LinearRegression(maxIter=10, regParam=.01)

stages = [vectorAssembler, standardScaler, linear_model]
pipeline = Pipeline(stages=stages)
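
To make the scaling stage concrete, here is a minimal plain-Python sketch of what `StandardScaler` does to one column, assuming its default settings (`withStd=True`, `withMean=False`): each value is divided by the sample standard deviation and the mean is left alone. The helper name and the input values are made up for illustration.

```python
from statistics import mean, stdev


def scale_to_unit_std(column):
    """Divide by the sample standard deviation without subtracting the mean."""
    sigma = stdev(column)  # sample standard deviation (n - 1 in the denominator)
    return [value / sigma for value in column]


ambient_temp = [18.0, 19.0, 20.0, 21.0]  # made-up values for illustration
scaled = scale_to_unit_std(ambient_temp)
print("std:", round(stdev(scaled), 6), "mean:", round(mean(scaled), 3))
```

The scaled column has a standard deviation of 1 while its mean stays well away from 0. This matches the prediction output later in this notebook, where an `ambient_temp` of about 19.0 maps to a scaled feature of about 9.5.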

The model is then trained on the training set. The trained model is used to make predictions on the test set.

In [13]:
model = pipeline.fit(training)
prediction = model.transform(test)
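
For reference, with `regParam=.01` and `elasticNetParam` left at its default of 0, the fit corresponds to ridge (L2) regularized least squares. The following is a sketch of that objective as described in the Spark MLlib documentation, ignoring Spark's internal feature standardization. The notation is introduced here for illustration: $n$ training rows, scaled feature vector $x_i$, label $y_i$, weights $w$, intercept $b$, and $\lambda$ for `regParam`.

```latex
\min_{w,\,b} \;\; \frac{1}{2n} \sum_{i=1}^{n} \left( w^{\top} x_i + b - y_i \right)^2
\;+\; \frac{\lambda}{2} \, \lVert w \rVert_2^2 ,
\qquad \lambda = 0.01
```

The first term is the mean squared error on the training set; the second term penalizes large weights.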

In the following cell we show the first 10 rows out of the approximately 250 thousand in the prediction:

In [14]:
prediction.show(10)

+------------------+------------------+------------------+--------------------+--------------------+------------------+
|             label|      ambient_temp|             power|  unscaled_variables|            features|        prediction|
+------------------+------------------+------------------+--------------------+--------------------+------------------+
|26.867718317810823|19.000636049568666|  4.69462969258635|[19.0006360495686...|[9.49537135581001...|32.073958640642964|
|26.890434961120306| 18.08741047903665|0.5006280784763337|[18.0874104790366...|[9.03899632177425...|28.799124258335304|
| 27.24124282274609| 15.64429094179144| 8.926407054969541|[15.6442909417914...|[7.81807260047048...| 29.83167720931017|
|28.299657924798513|16.952745077149896| 8.257884674094836|[16.9527450771498...|[8.47195902221244...|31.194962128749516|
|28.386549343567314|18.789370305272453|3.9206903754100915|[18.7893703052724...|[9.38979348506817...| 31.41421580790227|
|28.553495601501716|17.485530835101855| 

### Model Evaluation
The performance of the linear model we just built can be evaluated using multiple error metrics.

We first load and define a regression evaluator using PySpark.

In [15]:
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")

We then evaluate the model performance with multiple error metrics.

In [16]:
rmse = evaluator.evaluate(prediction)

mae = evaluator.evaluate(prediction, {evaluator.metricName: "mae"})

r2 = evaluator.evaluate(prediction, {evaluator.metricName: "r2"})
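
The three metrics have simple definitions. As a minimal sketch, they can be computed in plain Python for a small list of labels and predictions. The function name `regression_metrics` and the sample numbers are made up for illustration; the notebook itself relies on `RegressionEvaluator`.

```python
from math import sqrt


def regression_metrics(labels, predictions):
    """Compute RMSE, MAE and R^2 from paired labels and predictions."""
    n = len(labels)
    errors = [p - y for y, p in zip(labels, predictions)]
    mean_label = sum(labels) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((y - mean_label) ** 2 for y in labels)   # total sum of squares
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(metrics)
```

RMSE and MAE are in the units of the label, so lower is better; r2 is unitless, and 1 means a perfect fit.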

Finally, we put the error metrics into a pandas DataFrame so they are easier to read.

In [17]:
error_df = {"r2":r2, "mae":mae, "rmse":rmse}
error_df = pd.DataFrame.from_dict(error_df, orient="index")
error_df.columns = ["error metrics"]

In [18]:
# Show error metrics
error_df

| | error metrics |
|---|---|
| mae | 1.195381 |
| r2 | 0.800141 |
| rmse | 1.497963 |


**Model Summarization**  
The r2 metric is the proportion of the variance in the data that the model explains. Our model's r2 of about 0.80 means that roughly 80% of the variance in the test data is explained by the model, with a typical prediction error (RMSE) of about 1.5 in the units of the temperature reading.

## Model Deployment
Now that the model is trained, you can deploy it. Once deployed, the model can score data streamed into IBM Db2 Event Store in real time.

* If you are using the **Enterprise Edition** of Db2 Event Store, you can use the save function in the `dsx_ml` library.
* If you are using the **Developer Edition** of Db2 Event Store, you need to add a Machine Learning service. You can use one with a trial account on IBM Cloud.
  * Sign in and create the service [here](https://console.ng.bluemix.net/catalog/services/machine-learning).
  * Click on `Service credentials` and then `New credential` and `Add`.
  * Use `View credentials` and copy the credentials JSON.
  * Use the JSON to set the `wml_credentials` variable below.
  * After running `pip install watson-machine-learning-client`, you may need to restart your kernel and run the notebook again from the top.
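
Pasting the credentials JSON directly into a notebook makes it easy to share secrets by accident. As one possible alternative, the JSON could be read from an environment variable. This is a sketch under assumptions: the variable name `WML_CREDENTIALS_JSON` and the helper `load_wml_credentials` are hypothetical, and the placeholder key below does not reflect the real credential fields.

```python
import json
import os


def load_wml_credentials(env_var="WML_CREDENTIALS_JSON", environ=os.environ):
    """Parse the service credentials JSON stored in an environment variable."""
    raw = environ.get(env_var)
    if not raw:
        raise RuntimeError(f"Set {env_var} to the credentials JSON copied from IBM Cloud")
    credentials = json.loads(raw)
    if not isinstance(credentials, dict) or not credentials:
        raise ValueError(f"{env_var} must contain a non-empty JSON object")
    return credentials


# Demo with a stand-in mapping; in practice the default os.environ is used
demo_env = {"WML_CREDENTIALS_JSON": '{"placeholder_key": "placeholder_value"}'}
wml_credentials = load_wml_credentials(environ=demo_env)
print(list(wml_credentials))
```

With the variable set in the shell that starts the notebook server, `wml_credentials = load_wml_credentials()` would replace the hard-coded dictionary.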

In [19]:
# This cell will attempt to initialize dsx_ml and set the use_cloud_ml toggle.
# Later cells will use the use_cloud_ml toggle, to choose the necessary API.

import os
import json

use_cloud_ml = False
try:
    from dsx_ml.ml import save
    if os.environ.get('DSX_TOKEN'):
        print('Using dsx_ml to deploy model to IBM Db2 Event Store.')
    else:
        print('DSX_TOKEN not found, try using IBM Cloud Machine Learning.')
        use_cloud_ml = True
except ImportError:
    print('Cannot import dsx_ml. Try using IBM Cloud Machine Learning.')
    use_cloud_ml = True

Cannot import dsx_ml. Try using IBM Cloud Machine Learning.


### With Db2 Event Store Developer Edition plus Machine Learning on IBM Cloud, save the model with metadata.

In [20]:
# If you are using IBM Cloud for your ML deployment...
#
# * The use_cloud_ml toggle should be set to True.
# * You need to set wml_credentials to your service credentials JSON.
# * You most likely will need to restart your kernel after running the pip install (below).
# * After the pip install runs once, you may want to comment out that line.

if use_cloud_ml:
    print('Using IBM Cloud Machine Learning')
    
    !pip install --user watson-machine-learning-client==1.0.364
    from watson_machine_learning_client import WatsonMachineLearningAPIClient
    
    # EDIT HERE TO SET YOUR CREDENTIALS:
    wml_credentials = {}
    
    client = WatsonMachineLearningAPIClient(wml_credentials)
    
    # Store the model
    saved_model = client.repository.store_model(
        model=model,
        pipeline=pipeline,
        training_data=training,
        meta_props={client.repository.ModelMetaNames.NAME: "Linear regression model to predict IOT sensor temperature"})

    published_model_uid = client.repository.get_model_uid(saved_model)
    model_details = client.repository.get_details(published_model_uid)
    print('Model Details:')
    print(json.dumps(model_details, indent=2))
    print('List Models:')
    client.repository.list_models()

    # Create an online deployment
    created_deployment = client.deployments.create(published_model_uid, name="IOT sensor temperature prediction")
    scoring_endpoint = client.deployments.get_scoring_url(created_deployment)
    print('Scoring Endpoint:')
    print(scoring_endpoint)
    print('List Deployments')
    client.deployments.list()
    
else:
    print('Not using remote IBM Cloud Machine Learning.')

Using IBM Cloud Machine Learning
Model Details:
{
  "metadata": {
    "created_at": "2019-04-12T00:41:16.125Z",
    "modified_at": "2019-04-12T00:41:16.231Z",
    "guid": "f8a4775b-874d-43a7-94f6-447056c84eda",
    "url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/fe36dba1-149e-40a5-add0-a16300e295be/published_models/f8a4775b-874d-43a7-94f6-447056c84eda"
  },
  "entity": {
    "feedback_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/fe36dba1-149e-40a5-add0-a16300e295be/published_models/f8a4775b-874d-43a7-94f6-447056c84eda/feedback",
    "learning_configuration_url": "https://us-south.ml.cloud.ibm.com/v3/wml_instances/fe36dba1-149e-40a5-add0-a16300e295be/published_models/f8a4775b-874d-43a7-94f6-447056c84eda/learning_configuration",
    "latest_version": {
      "created_at": "2019-04-12T00:41:16.231Z",
      "guid": "b6a1d415-9183-447f-8e9f-bc8b2d26f033",
      "url": "https://us-south.ml.cloud.ibm.com/v3/ml_assets/models/f8a4775b-874d-43a7-94f6-447056c84eda/versions/b6a1d415-9183-447f-8e9f-bc8b2d26

### With Db2 Event Store Enterprise Edition, save the model with metadata.

After saving the model, we define a request header that carries the authorization token, and then retrieve the saved model's scoring endpoint so that the model can be called externally. Note that the host name `dsxl-api` in the endpoint needs to be replaced with the external IP address of your IBM Watson Studio cluster.

In [21]:
if not use_cloud_ml:
    model_name = "Event_Store_IOT_Sensor_Temperature_Prediction_Model"
    saved_model = save(name=model_name, 
                       model=model,
                       test_data=test,
                       algorithm_type="Regression",
                       source='Event_Store_Modeling.ipynb',
                       description="Linear regression model to predict IOT sensor temperature"
                      )

    import os
    import requests

    header_online = {'Content-Type': 'application/json', 'Authorization': os.environ['DSX_TOKEN']}
    # Retrieve the endpoint to the saved model
    print(saved_model["scoring_endpoint"])

### Make a Prediction with the Deployed Model
Now the model has been saved and deployed. The model's endpoint can be used to make predictions on new data through the online scoring service.  

The following sample code calls the scoring endpoint to make predictions on new data. A prediction can be made for a single data point or for a batch.

First, create two sample data points to be scored by the model.

In [22]:
# Create 2 new test data points
new_data = {"deviceID" : 2, "sensorID": 24, "ts": 1541430459386, "ambient_temp": 30, "power": 10}
new_data2 = {"deviceID" : 1, "sensorID": 12, "ts": 1541230400000, "ambient_temp": 16, "power": 50}

# Set fields to use for the IBM Cloud Machine Learning API
fields = tuple(new_data.keys())
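
The IBM Cloud scoring call used in the cells below takes a payload with a `fields` list and a `values` list of rows. When several dictionaries are sent as a batch, every row must follow the same column order as `fields`. A minimal sketch of a helper that enforces this by indexing each row by field name follows. The name `build_scoring_payload` is hypothetical, and `new_data2` lists its keys in a different order here only to show the alignment.

```python
import json


def build_scoring_payload(rows):
    """Build a {"fields": [...], "values": [[...], ...]} payload from dicts."""
    fields = list(rows[0].keys())
    values = []
    for row in rows:
        if set(row) != set(fields):
            raise ValueError(f"row keys {sorted(row)} do not match fields {sorted(fields)}")
        # Index by field name so every row follows the same column order
        values.append([row[name] for name in fields])
    return {"fields": fields, "values": values}


new_data = {"deviceID": 2, "sensorID": 24, "ts": 1541430459386, "ambient_temp": 30, "power": 10}
new_data2 = {"power": 50, "ambient_temp": 16, "ts": 1541230400000, "sensorID": 12, "deviceID": 1}

payload = build_scoring_payload([new_data, new_data2])
print(json.dumps(payload, indent=2))
```

The same helper covers the single-row case by passing a one-element list.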

- Single datum prediction

In [23]:
if use_cloud_ml:
    predictions = client.deployments.score(
        scoring_endpoint, {"fields": tuple(new_data.keys()),
                           "values": [tuple(new_data.values())]})
    print(json.dumps(predictions, indent=2))
else:
    payload_scoring = [new_data]
    scoring_response = requests.post(saved_model["scoring_endpoint"], json=payload_scoring, headers=header_online, verify=False)
    print(scoring_response.text)

{
  "values": [
    [
      10.0,
      24.0,
      30.0,
      1541430459386,
      2.0,
      [
        30.0,
        10.0
      ],
      [
        14.992189731499383,
        3.336760556487845
      ],
      48.98055760884435
    ]
  ],
  "fields": [
    "power",
    "sensorID",
    "ambient_temp",
    "ts",
    "deviceID",
    "unscaled_variables",
    "features",
    "prediction"
  ]
}


Because this is a regression model, each scored row yields a single numeric prediction, which we can extract from the response.

In [24]:
if use_cloud_ml:
    prediction_index = predictions["fields"].index("prediction")
    # print(json.dumps(predictions, indent=2))
    print("predictions: ", [value[prediction_index] for value in predictions["values"]])
else:
    print("predictions: ", scoring_response.json()["object"]["output"]["predictions"])

predictions:  [48.98055760884435]
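
Since the IBM Cloud response pairs a `fields` list with rows of `values`, each row can also be turned into a dictionary keyed by field name, which avoids tracking column positions by hand. The sketch below uses a copy of the response shown above; the helper name `rows_as_dicts` is hypothetical.

```python
def rows_as_dicts(response):
    """Pair every row of values with the response's field names."""
    return [dict(zip(response["fields"], values)) for values in response["values"]]


# Copy of the single-datum scoring response shown earlier in this notebook
response = {
    "fields": ["power", "sensorID", "ambient_temp", "ts", "deviceID",
               "unscaled_variables", "features", "prediction"],
    "values": [[10.0, 24.0, 30.0, 1541430459386, 2.0,
                [30.0, 10.0],
                [14.992189731499383, 3.336760556487845],
                48.98055760884435]],
}

rows = rows_as_dicts(response)
print(rows[0]["prediction"], rows[0]["ambient_temp"])
```

Each element of `rows` then exposes the inputs, the intermediate pipeline columns, and the prediction under their own names.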


- Batch prediction

In [25]:
if use_cloud_ml:
    predictions = client.deployments.score(
        scoring_endpoint, {"fields": tuple(new_data.keys()),
                           "values": [tuple(new_data.values()),tuple(new_data2.values())]})
    # print(json.dumps(predictions, indent=2))
    print("predictions: ", [value[prediction_index] for value in predictions["values"]])
else:
    payload_scoring = [new_data, new_data2]
    scoring_response = requests.post(saved_model["scoring_endpoint"], json=payload_scoring, headers=header_online, verify=False)
    print(scoring_response.text)
    print("predictions: ", scoring_response.json()["object"]["output"]["predictions"])
        

predictions:  [48.98055760884435, 50.76838700137278]


## Summary
This notebook introduced you to machine learning and model deployment with IBM Db2 Event Store.

<p><font size=-1 color=gray>
&copy; Copyright 2019 IBM Corp. All Rights Reserved.
<p>
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file
except in compliance with the License. You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
express or implied. See the License for the specific language governing permissions and
limitations under the License.
</font></p>