# IBM Streams PMML scoring sample application

This sample demonstrates creating a Streams Python application to perform scoring with a PMML model and viewing the results.

In this notebook, you'll see examples of how to:
 1. [Set up your data connections](#setup)
 2. [Create the application](#create)
 3. [Submit the application](#launch)
 4. [Connect to the running application to view data](#view)
 5. [Stop the application](#cancel)

# Overview

**About the sample**

This application simulates patient data that is used to determine the best drug for each patient. The patient data comprises age, sex, blood pressure, cholesterol, and blood sodium and potassium concentrations. A trained PMML model is used to find the best drug.  
This ML scenario is a sample scenario provided in IBM Cloud for creating an SPSS model inside Watson Studio.


**How it works**

The Python application created in this notebook is submitted to the IBM Streams service for execution. Once the application is running in the service, you can connect to it from the notebook to retrieve the results.


<img src="https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2019/04/how-it-works.jpg" alt="How it works">


### Documentation

- [Streams Python development guide](https://ibmstreams.github.io/streamsx.documentation/docs/latest/python/)
- [Streams Python API](https://streamsxtopology.readthedocs.io/)



<a name="setup"></a>
# 1. Setup
## 1.1 Add credentials for the IBM Streams service

With the cell below selected, click the "Connect to instance" button in the toolbar to insert the credentials for the service.

<a target="blank" href="https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2019/02/connect_icp4d.gif">See an example</a>.
![ ](attachment:Drug_Sample.jar.jpg)
![ ](attachment:DRUG_SAMPLE.csv.jpg)

## 1.2 Install or upgrade `streamsx.pmml` package

This sample needs the Python package `streamsx.pmml`, which provides the PMML scoring functionality, and the `streamsx.standard` package for common streaming functions.   
Install `streamsx.pmml` version 1.0.0 or later.



In [None]:
import sys
!{sys.executable} -m pip install --user --upgrade streamsx.pmml
!{sys.executable} -m pip install --user --upgrade streamsx.standard

# When you need to install a specific version of the package, run this line instead:
#!{sys.executable} -m pip install --user streamsx.pmml==somever

In [None]:
import streamsx.topology.context
import streamsx.pmml as pmml
print("INFO: streamsx package version: " + streamsx.topology.context.__version__)
print("INFO: streamsx.pmml package version: " + pmml.__version__)

## 1.3 Prepare your project with necessary sample data

This sample expects a PMML model and a data set of patient data in your project. For convenience, both files are embedded in this notebook as hidden attachments.
To set up your project with this data, run the next code cells. They extract the files and copy them to the correct places in your project, where they appear as if you had added the data set via the web GUI and exported the PMML model from the `Flow Modeler`.
This notebook creates a model named `Drug` and a data set named `DRUG_SAMPLE.csv`. If your project already contains elements with those names, rename or delete them; otherwise the notebook will not work properly.

In [None]:
%%javascript
// get the notebook name from the page
var samplenotebookname = document.getElementsByTagName("body")[0].getAttribute("data-notebook-name"); 
IPython.notebook.kernel.execute('samplenotebookname="' + samplenotebookname + '";');


<div class="output_wrapper output output_area output_subarea output_text output_stream output_stderr">
<b> The next cell has to be executed manually and cannot be run together with the cell above via "Run All". </b>
</div>

In [None]:
import nbformat
import base64

# use the notebook name to open the notebook file and get the attachments
nb = nbformat.read(samplenotebookname, nbformat.current_nbformat)
base64model = base64.b64decode(nb.cells[0].attachments['Drug_Sample.jar.jpg']['image/jpeg'])
base64data = base64.b64decode(nb.cells[0].attachments['DRUG_SAMPLE.csv.jpg']['image/jpeg'])

# remove leftovers from earlier runs (-f: no error if they do not exist)
!rm -rf Drug
!rm -f Drug_Sample.jar
# write the model archive, unpack it and move the model into the project
with open('Drug_Sample.jar', 'wb') as fp:
    fp.write(base64model)
!jar -xf Drug_Sample.jar 
!mv Drug ../models/
!rm Drug_Sample.jar

# write the data set; the following line moves it into the project
!rm -f DRUG_SAMPLE.csv
with open('DRUG_SAMPLE.csv', 'wb') as fp:
    fp.write(base64data)
!mv DRUG_SAMPLE.csv ../datasets/DRUG_SAMPLE.csv
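
The extraction above works because notebook attachments are stored as base64 text, keyed by attachment name and MIME type. The following stand-alone sketch repeats the decode-and-write step on a hand-made dictionary. The attachment name `example.csv.jpg`, its content, and the helper `extract_attachment` are invented for illustration; no real notebook file is read.

```python
import base64
import os
import tempfile

# Hypothetical stand-in for nb.cells[0].attachments:
# attachment name -> MIME type -> base64 text.
attachments = {
    "example.csv.jpg": {
        "image/jpeg": base64.b64encode(b"Age,Sex\n23,F\n").decode("ascii")
    }
}


def extract_attachment(attachments, name, target_path, mime="image/jpeg"):
    """Decode one base64 attachment and write the raw bytes to target_path."""
    raw = base64.b64decode(attachments[name][mime])
    with open(target_path, "wb") as fp:
        fp.write(raw)
    return len(raw)


# Write into a temporary directory so the project folders stay untouched.
target = os.path.join(tempfile.mkdtemp(), "example.csv")
written = extract_attachment(attachments, "example.csv.jpg", target)
print(written, "bytes written to", target)
```

The `.jpg` suffix and the `image/jpeg` MIME type only serve as a container here; the decoded bytes are whatever file was embedded.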


<a id="create"></a>
# 2. Create the application
This application ingests simulated patient data and determines a drug for each patient based on that data.  

All Streams applications start with a `Topology` object, so start by creating one:

In [None]:
from streamsx.topology.topology import Topology

topo = Topology(name="PMMLScoring")

# add files to be contained in the archive which is deployed to the node running the application
# in this sample we need the `dataset` with sample data to be present at the worker node
topo.add_file_dependency("../datasets/DRUG_SAMPLE.csv", 'etc')

## 2.1 Define data schemas

Some functions need an explicit definition of the schema to be used. The required schemas are defined below.


In [None]:
from streamsx.topology.schema import StreamSchema


# create streams schema for FileSource
rawData = StreamSchema(
    "tuple<int32 age, rstring sex, rstring bloodPressure, rstring cholesterol, float64 bloodSodiumConcentration, float64 bloodPotassionConcentration,rstring referenceDrug>"
)

# extend the schema for the score function
scoredData = rawData.extend(StreamSchema("tuple<float64 NA_to_K, rstring predictedDrug>"))

## 2.2 Define Input 

The sample uses a CSV file as its data source. The file is read line by line, and each line becomes one record (tuple). 
Processing is throttled in this sample so that it does not drive resources to their limits. Removing `throttle()` and `sleep()` gives you full performance.

In [None]:
import os
import streamsx.ec
import streamsx.standard.utility as util
import streamsx.spl.op as op
from enum import IntEnum
from time import sleep

# enum for usage with the filesink, filesource
class DataFormats(IntEnum):
    csv = 0
    txt = 1

# define a source class
# it generates the names of the files to be processed by the application
# the sample has only one file with sample data,
# so for demo purposes it returns the same filename every time
class NameGenerator(object):
    def __call__(self):
        while True:
            sleep(3)
            yield os.path.join(streamsx.ec.get_application_directory(), "etc", "DRUG_SAMPLE.csv")

    
# filename stream, each tuple is a filename; the FileSource operator needs a structured schema, so we generate one
filenames = topo.source(NameGenerator()).as_string().map(lambda i : { "fileName":i }, schema="tuple<rstring fileName>")


# read the csv records from the file for each filename received in filename stream
filerecords = op.Map("spl.adapter::FileSource", filenames, schema=rawData, params={"format":DataFormats.csv, "hasHeaderLine":True }).stream

# throttle the tuple rate so that the view content is easy to follow in this sample
# do not do this in a real application
records = util.throttle(filerecords, 50.0)
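
To make the record layout concrete, the following stand-alone sketch parses CSV text into dictionaries typed like the `rawData` schema. It only illustrates the idea of one tuple per line with a skipped header line; it is not how the `FileSource` operator is implemented. The helper `parse_records`, the header names, and the two data rows are invented for this example and are not taken from `DRUG_SAMPLE.csv`.

```python
import csv
import io

# Attribute names and Python types mirroring the rawData schema above.
RAW_FIELDS = [
    ("age", int),
    ("sex", str),
    ("bloodPressure", str),
    ("cholesterol", str),
    ("bloodSodiumConcentration", float),
    ("bloodPotassionConcentration", float),
    ("referenceDrug", str),
]


def parse_records(text, has_header_line=True):
    """Yield one typed dict per CSV line, skipping the header if present."""
    reader = csv.reader(io.StringIO(text))
    if has_header_line:
        next(reader, None)
    for row in reader:
        yield {name: cast(value) for (name, cast), value in zip(RAW_FIELDS, row)}


# Invented sample rows for illustration only.
sample = (
    "Age,Sex,BP,Cholesterol,Na,K,Drug\n"
    "30,F,HIGH,NORMAL,0.75,0.05,drugY\n"
    "45,M,LOW,HIGH,0.60,0.06,drugC\n"
)
records_local = list(parse_records(sample))
print(records_local[0])
```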

## 2.3 Define Data Analytics to be performed on your data --> ML scoring

In addition to reading streaming data from a source, you need to define the analytics to perform on that data. In this sample you use an ML model, in PMML format, to predict a drug based on a patient's medical data. Such a model is created in your project either by exporting it from a `Flow Modeler` or by importing it via the web GUI.
The PMML model file is loaded from your project's model location. It is added to the application archive because it is needed at the node executing your application.

Before scoring, the data must be preprocessed: the ML model does not expect separate sodium and potassium concentrations but the quotient of both, Na/K.

Steps are outlined in the code below.
See the [PMML streamsx package documentation](https://streamsxpmml.readthedocs.io/en/latest/)  for details.

In [None]:
import streamsx.pmml as pmml

# preprocessing: derive one input predictor field (NA_to_K)
preprocess_op = op.Map("spl.relational::Functor", records, schema=scoredData)
preprocess_op.NA_to_K = preprocess_op.output('bloodSodiumConcentration / bloodPotassionConcentration')
preprocess_op.predictedDrug = preprocess_op.output('""')
preprocess = preprocess_op.stream

# score the records
score = pmml.score(preprocess,
    schema=scoredData,
    model_input_attribute_mapping='Age=age,BP=bloodPressure,Cholesterol=cholesterol,Na_to_K=NA_to_K,Sex=sex',
    model_path='../models/Drug/1/model',
    model_output_attribute_mapping='predictedDrug=Drug.PredictedValue'
)
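
The preprocessing and the input mapping can be tried out in plain Python. The sketch below derives the quotient and then renames tuple attributes to the model field names listed in `model_input_attribute_mapping`. It is an illustration under stated assumptions, not the implementation used by `streamsx.pmml`: the helpers `derive_na_to_k`, `parse_mapping` and `build_model_input` are hypothetical, and the patient values are invented.

```python
def derive_na_to_k(record):
    """Return a copy of the record with the NA_to_K quotient added."""
    enriched = dict(record)
    enriched["NA_to_K"] = (
        record["bloodSodiumConcentration"] / record["bloodPotassionConcentration"]
    )
    return enriched


def parse_mapping(mapping):
    """Turn 'ModelField=attribute,...' into a {model_field: attribute} dict."""
    pairs = (item.split("=", 1) for item in mapping.split(","))
    return {model_field.strip(): attr.strip() for model_field, attr in pairs}


def build_model_input(record, mapping):
    """Select tuple attributes and rename them to the model's field names."""
    return {field: record[attr] for field, attr in parse_mapping(mapping).items()}


# Invented sample values for illustration only.
patient = {
    "age": 30,
    "sex": "F",
    "bloodPressure": "HIGH",
    "cholesterol": "NORMAL",
    "bloodSodiumConcentration": 0.75,
    "bloodPotassionConcentration": 0.0625,
    "referenceDrug": "drugY",
}
model_input = build_model_input(
    derive_na_to_k(patient),
    "Age=age,BP=bloodPressure,Cholesterol=cholesterol,Na_to_K=NA_to_K,Sex=sex",
)
print(model_input)
```

Note that `referenceDrug` is not part of the mapping, so it never reaches the model; it stays on the tuple only for comparison with the prediction.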

## 2.4 Create a `View` to preview the tuples on the result `Stream` 


A `View` is a connection to a `Stream` that becomes activated when the application is running. We examine the data from within the notebook in section 4, below.

In [None]:
score_view = score.view(name="ScoredRecords", description="Sample of scored records")

## 2.5 Define Output

The `score` stream is our final result.  We will use `Stream.publish()` to make this stream available to other applications. 

If you want to send the stream to another database or system, you would use a sink function (similar to the source function) and invoke it using `Stream.for_each`.
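
A sink function is simply a callable that is invoked once per tuple. The following minimal sketch shows the shape of such a callable and exercises it locally, without Streams. The class `CollectingSink` and the name `send_to_db` are hypothetical, the sample tuples are invented, and the sketch assumes that tuples arrive as dictionaries keyed by attribute name.

```python
class CollectingSink(object):
    """Hypothetical sink: a callable that receives one tuple per call."""

    def __init__(self):
        self.rows = []

    def __call__(self, tuple_):
        # A real sink would write to a database or another system here;
        # this sketch only keeps the values in memory.
        self.rows.append((tuple_["age"], tuple_["predictedDrug"]))


send_to_db = CollectingSink()

# Local stand-in for passing the callable to Stream.for_each:
# call the sink once per (invented) scored tuple.
for scored in [
    {"age": 30, "predictedDrug": "drugY"},
    {"age": 45, "predictedDrug": "drugC"},
]:
    send_to_db(scored)

print(send_to_db.rows)
```

Using a class rather than a plain function leaves room for state such as a connection handle that is opened once and reused for every tuple.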



In [None]:
import json
# publish results as JSON
score.publish(topic="ScoredRecords",
                        schema=json, 
                        name="PublishScores")

# Other options include:
# invoke a sink function (send_to_db is a placeholder for your own callable):
# score.for_each(func=send_to_db)

<a name="launch"></a>

# 3. Submit the application
A running Streams application is called a *job*. This next cell submits the application for execution and prints the resulting job id.


In [None]:
from streamsx.topology import context

# cfg holds the connection settings inserted by the "Connect to instance" button in section 1.1
# Disable SSL certificate verification if necessary
cfg[context.ConfigParams.SSL_VERIFY] = False

# build and submit
submission_result = context.submit('DISTRIBUTED', 
                                   topo, 
                                   cfg)

print(submission_result)
# The submission_result object contains information about the running application, or job
if submission_result.job:
    print("JobId: ", submission_result.job.id , "Name: ", submission_result.job.name)


<a name="view"></a>

# 4. Use the `View` to access data from the job
Now that the job is started, use the `View` object you created in step 2.4 to start retrieving data from a `Stream`.   
   
Compare the attributes 'referenceDrug' and 'predictedDrug' in the output below. The predicted drug is the result of the scoring. Both values should be equal.

In [None]:
# Connect to the view and display the data
queue = score_view.start_data_fetch()
try:
    for val in range(10):
        print(queue.get())    
finally:
    score_view.stop_data_fetch()
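
Instead of comparing the two attributes by eye, you can compute how often they agree. The sketch below does this for a list of tuples, assuming each fetched item is a dictionary with `referenceDrug` and `predictedDrug` keys. The helper `agreement_rate` is hypothetical and the sample tuples are invented; in the notebook you would collect the items returned by `queue.get()` instead.

```python
def agreement_rate(tuples):
    """Fraction of tuples whose predictedDrug equals referenceDrug."""
    tuples = list(tuples)
    if not tuples:
        return 0.0
    matches = sum(1 for t in tuples if t["predictedDrug"] == t["referenceDrug"])
    return matches / len(tuples)


# Invented tuples for illustration; one of the four deliberately disagrees.
fetched = [
    {"referenceDrug": "drugY", "predictedDrug": "drugY"},
    {"referenceDrug": "drugC", "predictedDrug": "drugC"},
    {"referenceDrug": "drugX", "predictedDrug": "drugA"},
    {"referenceDrug": "drugB", "predictedDrug": "drugB"},
]
rate = agreement_rate(fetched)
print("agreement: {:.0%}".format(rate))
```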

## 4.1 Display the results in real time
Calling `View.display()` from the notebook displays the results of the view in a table that is updated in real-time.

In [None]:
# Display the results for 30 seconds
score_view.display(duration=30)


## 4.2 See job status 

You can view job status and logs by going to **My Instances** > **Jobs**. Find your job based on the id printed above.
Retrieve job logs using the "Download jobs" action from the job's context menu.

To view other information about the job such as detailed metrics, access the Streams Console.  Go to **My Instances** > **Provisioned Instances**. Select the Streams instance and open the URL listed under *externalConsoleEndpoint* or *serviceConsoleEndpoint*.

<a name="cancel"></a>

# 5. Cancel the job

This cell generates a widget you can use to cancel the job.

In [None]:
#cancel the job in the IBM Streams service
submission_result.cancel_job_button()

You can also interact with the job through the [Job](https://streamsxtopology.readthedocs.io/en/stable/streamsx.rest_primitives.html#streamsx.rest_primitives.Job) object returned from `submission_result.job`

For example, use `job.cancel()` to cancel the running job directly.

# Summary

We started with a `Stream` called `records`, which contained the data we wanted to analyze. Next, we applied simple preprocessing and scored the data with an ML model, which produced the `score` stream. Finally, this stream is published to the `ScoredRecords` topic so that other applications can subscribe to it.

After submitting the application to the Streams service, we connected to the `score_view` view to see the results within the notebook.