# IBM Streams sample application

This sample demonstrates how to create a Streams Python application that performs some analytics, and how to view the results.

In this notebook, you'll see examples of how to:
 1. [Set up your data connections](#setup)
 2. [Create the application](#create)
 3. [Submit the application](#launch)
 4. [Connect to the running application to view data](#view)
 5. [Stop the application](#cancel)

# Overview

**About the sample**

This application simulates a data hub that receives readings from sensors. It computes the 30 second rolling average of the reported readings using [Pandas](https://pandas.pydata.org/).  

**How it works**
   
The Python application created in this notebook is submitted to the IBM Streams service for execution. Once the application is running in the service, you can connect to it from the notebook to retrieve the results.

<img src="https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2019/04/how-it-works.jpg" alt="How it works">


### Documentation

- [Streams Python development guide](https://ibmstreams.github.io/streamsx.documentation/docs/latest/python/)
- [Streams Python API](https://streamsxtopology.readthedocs.io/)



<a name="setup"></a>
# 1.  Build setup

Streams Python applications or topologies can be executed in multiple scenarios, such as from a notebook running within an IBM Cloud Pak for Data project, or a standalone application connecting to a local installation.

In each case, you have to connect to the Streams instance to submit the application for execution.
However, the information required to connect to the instance depends on the scenario. 

Therefore, choose the scenario from the list below that matches your target Streams installation and use the code snippets provided to connect to the instance. 

Each snippet will also set the [`contextType`](https://streamsxtopology.readthedocs.io/en/latest/streamsx.topology.context.html#streamsx.topology.context.ContextTypes) which determines the execution context.

-  Run on a Streams service in IBM Cloud Pak for Data
    - [From a project](#cpd_1)
    - [Without a project](#cpd_2)
- [Run on a standalone Streams installation (v5.2+, installed via Kubernetes)](#edge)
- [Run on the Streaming Analytics service on IBM Cloud](#sas)
- [Run on a local installation of Streams v4.2 or v4.3](#v4)


<a name="cpd_1"></a>

### 1.1.1a Option 1a: Submit to Streams on IBM Cloud Pak for Data from a project
In this context you need to provide the name of the Streams instance.

1. From the navigation menu, click **My instances**.
2. Click the **Provisioned Instances** tab.
3. Update the value of `streams_instance_name` in the cell below according to your Streams instance name.

In [None]:
from icpd_core import icpd_util
from streamsx.topology import context

streams_instance_name = "sample-streams"  # Change this to the name of your Streams instance

cfg = icpd_util.get_service_instance_details(name=streams_instance_name)

# This specifies how the application will be deployed
contextType = context.ContextTypes.DISTRIBUTED


print("Saved credentials, continue to section 1.2")

<a name="cpd_2"></a>
### 1.1.1b Option 1b:    Submit to Streams on Cloud Pak for Data *without* a Cloud Pak for Data project  

Collect the following environment information. Set the values for each variable where indicated in the following cell.

- `CP4D_URL` - Cloud Pak for Data deployment URL, e.g. `https://cp4d_server:31843`. 

- `STREAMS_INSTANCE_ID`:
    1. From the navigation menu, click **My instances**.
    2. Click the **Provisioned Instances** tab.
    3. Select the Streams instance you want to use, and set the value of `STREAMS_INSTANCE_ID` in the cell below to the name of the instance.

- `STREAMS_USERNAME` - (optional) User name to submit the job as, defaulting to the current operating system user name.
- `STREAMS_PASSWORD` - Password for authentication.

Contact your administrator for details.

If you authenticate with a username and password, enter them when prompted. Otherwise, delete those lines before running the cell.

In [73]:
import os
import getpass

from streamsx.topology import context
cfg = {}

CP4D_URL = ""  # Paste the Cloud Pak for Data URL here
STREAMS_INSTANCE_ID = ""  # Paste the Streams instance name here

os.environ["STREAMS_INSTANCE_ID"]= STREAMS_INSTANCE_ID
os.environ["CP4D_URL"]= CP4D_URL
os.environ["STREAMS_USERNAME"]= getpass.getpass("Streams username")
os.environ["STREAMS_PASSWORD"]= getpass.getpass("Streams password")

# This specifies how the application will be deployed
contextType = context.ContextTypes.DISTRIBUTED

print("Saved credentials, continue to section 1.2")



Streams username········
Streams password········
Saved credentials, continue to section 1.2
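
Before moving on, it can help to confirm that the variables listed above are actually set. The following is a minimal sketch and not part of the `streamsx` API: `missing_env_vars` is a hypothetical helper, and treating the username and password as required is an assumption that only holds if you authenticate that way.

```python
import os

# Names taken from the list above; adjust if you authenticate differently.
REQUIRED_VARS = ["CP4D_URL", "STREAMS_INSTANCE_ID",
                 "STREAMS_USERNAME", "STREAMS_PASSWORD"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All connection variables are set")
```

The optional `env` argument exists only so the helper can be tried against a plain dictionary instead of the real environment.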


<a name="edge"></a>
### 1.1.2 Option 2:    Submit to a  Standalone Streams installation
To submit a Streams application, you need the following information from the Streams instance:

- `STREAMS_BUILD_URL` - Streams build service URL, e.g. when the service is exposed as node port: `https://<NODE-IP>:<NODE-PORT>`
- `STREAMS_REST_URL` - Streams SWS service (REST API) URL, e.g. when the service is exposed as node port: `https://<NODE-IP>:<NODE-PORT>`
- `STREAMS_USERNAME` - (optional) User name to submit the job as, defaulting to the current operating system user name.
- `STREAMS_PASSWORD` - Password for authentication.

The documentation has the steps to retrieve the URLs for the Build and REST service.
Set the values for each variable where indicated in the following cell.

In [11]:
import os
import getpass

from streamsx.topology import context
cfg = {}
STREAMS_BUILD_URL = ""  # Paste the build service URL here
STREAMS_REST_URL = ""  # Paste the REST service URL here
os.environ["STREAMS_REST_URL"]= STREAMS_REST_URL
os.environ["STREAMS_BUILD_URL"]= STREAMS_BUILD_URL
os.environ["STREAMS_USERNAME"]= getpass.getpass("Streams username")
os.environ["STREAMS_PASSWORD"]= getpass.getpass("Streams password")

# This specifies how the application will be deployed
contextType = context.ContextTypes.DISTRIBUTED

print("Saved credentials, continue to section 1.2")

Streams username········
Streams password········
Saved credentials, continue to section 1.2


<a name="sas"></a>
### 1.1.3 Option 3:    Submit to the Streaming Analytics service
To connect to the Streaming Analytics service in IBM Cloud, you need the **service instance name** and the **service credentials**.
Retrieve your service name and credentials from the Streaming Analytics service dashboard.

- Service instance name: This is the name of the service instance, shown at the top of the service dashboard.
- Service credentials: To copy your service credentials, open the Streaming Analytics service dashboard, click **Service Credentials**, then **View Credentials**, and copy the contents of the cell. Click **Add new credentials** if there are no credentials listed.

See the image below for an example. Click to enlarge.
<a href="https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2019/11/copycredentials.png">
<img width="600" height="500" src="https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2019/11/copycredentials.png"></a>

Run the following cells and enter the name and credentials when prompted.


In [48]:
#Do not modify this cell
SA_credentials = None
service_name = ""

In [71]:
import getpass
import json  # needed for json.loads below

from streamsx.topology.context import ConfigParams
from streamsx.topology import context

if SA_credentials is None:
    SA_credentials = getpass.getpass('Streaming Analytics credentials:')
    service_name = input('Streaming Analytics name:')

vs = {'streaming-analytics': [{'name': service_name, 'credentials': json.loads(SA_credentials)}]}
cfg = {}
cfg[ConfigParams.VCAP_SERVICES] = vs
cfg[ConfigParams.SERVICE_NAME] = service_name

# This specifies how the application will be deployed

contextType = context.ContextTypes.STREAMING_ANALYTICS_SERVICE
print("Saved credentials, continue to section 1.2")

Saved credentials, continue to section 1.2


<a name="v4"></a>
### 1.1.4 Option 4:    Submit to  Streams v4.2 or v4.3

If you are using the Streams Quick Start Edition, you do not have to do any further configuration. 

Otherwise, make sure that the `STREAMS_INSTANCE_ID` and `STREAMS_DOMAIN_ID` are set as environment variables.


In [None]:
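import os
from streamsx.topology import context

cfg = {}  # Used when the application is submitted in section 3

# Uncomment and set these if the environment variables are not already defined
# STREAMS_INSTANCE_ID = ""  # Set instance ID
# STREAMS_DOMAIN_ID = ""  # Set domain ID
# os.environ["STREAMS_INSTANCE_ID"] = STREAMS_INSTANCE_ID
# os.environ["STREAMS_DOMAIN_ID"] = STREAMS_DOMAIN_ID

# This specifies how the application will be deployed
contextType = context.ContextTypes.DISTRIBUTED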

### 1.2 Verify `streamsx` package version

Run the cell below to check which version of the `streamsx` package is installed.  

If you need to upgrade, run `import sys` in a cell and then one of the following:

- `!{sys.executable} -m pip install --user --upgrade streamsx` to upgrade to the latest version of the package.
- `!{sys.executable} -m pip install --user streamsx==somever` to install a specific version of the package, where `somever` is the version you want.


In [12]:
import streamsx.topology.context
print("INFO: streamsx package version: " + streamsx.topology.context.__version__)
#For more details uncomment line below.
#!pip show streamsx

INFO: streamsx package version: 1.13.14


<a id="create"></a>
# 2. Create the application
This application is going to ingest readings from simulated sensors and compute the 30 second rolling average value for each sensor.  

All Streams applications start with  a `Topology` object, so start by creating one:


In [13]:
from streamsx.topology.topology import Topology

topo = Topology(name="SensorAverages")

## 2.1 Define sources
Your application needs some data to analyze, so the first step is to define a data source that produces the data being processed. 

Next, use the data source to create a `Stream` object. A `Stream` is a potentially infinite sequence of tuples containing the data to be analyzed.

Tuples are Python objects by default. Other supported formats include JSON. [See the doc for all supported formats](http://ibmstreams.github.io/streamsx.topology/doc/pythondoc/streamsx.topology.topology.html#stream).

### 2.1.1 Define a source class

Define a callable class that will produce the data to be analyzed.

This example class produces readings from sensors.

In [14]:
import random 
import time
from datetime import datetime, timedelta

# Define a callable source 
class SensorReadingsSource(object):
    def __call__(self):
        # This example generates random data. A real source could
        # instead query a database, read a data set, or open a file.
        while True:
            time.sleep(0.001)
            sensor_id = random.randint(1, 100)
            reading = {}
            reading["sensor_id"] = "sensor_" + str(sensor_id)
            reading["value"] = random.random() * 3000
            reading["ts"] = int(datetime.now().timestamp())
            yield reading 
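
A source callable can be exercised locally before it is attached to a topology, because calling it simply returns an iterator. The sketch below uses a simplified stand-in, `DemoReadingsSource` (a hypothetical name, not the class above), and `itertools.islice` to take a few tuples from the otherwise endless generator.

```python
import itertools
import random
import time

class DemoReadingsSource(object):
    """Simplified stand-in for a source class: yields readings forever."""
    def __call__(self):
        while True:
            yield {"sensor_id": "sensor_" + str(random.randint(1, 100)),
                   "value": random.random() * 3000,
                   "ts": int(time.time())}

# Calling the instance returns a generator; islice stops after 3 tuples.
sample = list(itertools.islice(DemoReadingsSource()(), 3))
for reading in sample:
    print(reading)
```

The values are random, so the printed readings differ on every run.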

### 2.1.2 Create the `Stream`

Create a `Stream` called  `readings` that will contain the simulated data that `SensorReadingsSource` produces:

In [15]:
#Create a stream from the data using Topology.source
readings = topo.source(SensorReadingsSource(), name="Readings")

## 2.2 Analyze data

Use a variety of methods in the `Stream` class to analyze your in-flight data, including applying machine learning models.

See the [common operations section](https://ibmstreams.github.io/streamsx.documentation/docs/python/1.6/python-appapi-devguide-4/) of the developer guide and the [documentation on the Stream class](https://ibmstreams.github.io/streamsx.topology/doc/pythondoc/streamsx.topology.topology.html#streamsx.topology.topology.Stream) for more details.


### 2.2.1 Filter data from the  `Stream`  

Use `Stream.filter()` to remove data that doesn't match a certain condition.

In [16]:
# Accept only values greater than 100

valid_readings = readings.filter(lambda x : x["value"] > 100,
                                 name="ValidReadings")

# You could create another stream of the invalid data:
# invalid_readings = readings.filter(lambda x : x["value"] <= 100,)
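
Conceptually, `Stream.filter()` keeps a tuple when the function returns a true value and drops it otherwise. The following plain-Python sketch applies the same condition to a small hand-written list. The sample values are made up for illustration, and the code does not use `streamsx`; it only mirrors the behaviour on a finite list.

```python
# Made-up sample tuples, shaped like the readings produced by the source
sample_readings = [
    {"sensor_id": "sensor_1", "value": 42.0},
    {"sensor_id": "sensor_2", "value": 250.5},
    {"sensor_id": "sensor_3", "value": 100.0},
]

def is_valid(reading):
    """Same condition as the ValidReadings filter: value greater than 100."""
    return reading["value"] > 100

valid = [r for r in sample_readings if is_valid(r)]
invalid = [r for r in sample_readings if not is_valid(r)]

print([r["sensor_id"] for r in valid])
print([r["sensor_id"] for r in invalid])
```

Note that a value of exactly 100 is rejected, because the condition is a strict comparison.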

### 2.2.2  Compute averages on the  `Stream`  

Define a function to compute the 30 second rolling average for the readings.

Steps are outlined in the code below.
See the [Window class documentation](http://ibmstreams.github.io/streamsx.topology/doc/pythondoc/streamsx.topology.topology.html#streamsx.topology.topology.Window)  for details.
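
To make the idea of a time-based rolling window concrete, here is a standard-library sketch that keeps only the readings from the last 30 seconds and averages them per sensor. It illustrates the concept under simplifying assumptions (timestamps in whole seconds, readings arriving in timestamp order, everything in one process) and is not how the Streams runtime implements windows. `RollingWindow` is a hypothetical name.

```python
from collections import defaultdict, deque

class RollingWindow:
    """Keep readings whose timestamp is within `size` seconds of the newest one."""
    def __init__(self, size):
        self.size = size
        self.items = deque()

    def insert(self, reading):
        self.items.append(reading)
        # Evict readings that have fallen out of the window
        newest = reading["ts"]
        while self.items and self.items[0]["ts"] <= newest - self.size:
            self.items.popleft()

    def averages(self):
        """Return the mean value per sensor for the current window contents."""
        totals = defaultdict(lambda: [0.0, 0])
        for r in self.items:
            totals[r["sensor_id"]][0] += r["value"]
            totals[r["sensor_id"]][1] += 1
        return {sid: total / count for sid, (total, count) in totals.items()}

window_demo = RollingWindow(size=30)
window_demo.insert({"sensor_id": "sensor_1", "value": 200.0, "ts": 0})
window_demo.insert({"sensor_id": "sensor_1", "value": 400.0, "ts": 10})
window_demo.insert({"sensor_id": "sensor_2", "value": 150.0, "ts": 35})
print(window_demo.averages())
```

By the time the third reading arrives, the reading at `ts` 0 is more than 30 seconds old and no longer contributes to the average for `sensor_1`.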



In [17]:
import pandas as pd

# 1. Define aggregation function
    
def average_reading(items_in_window):
    df = pd.DataFrame(items_in_window)
    readings_by_id = df.groupby("sensor_id")
    
    averages = readings_by_id["value"].mean()
    period_end = df["ts"].max()

    result = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for sensor_id, avg in averages.items():
        result.append({"average": avg,
                       "sensor_id": sensor_id,
                       "period_end": time.ctime(period_end)})
               
    return result

# 2. Define window: e.g. a 30 second rolling average, updated every second

interval = timedelta(seconds=30)
window = valid_readings.last(size=interval).trigger(when=timedelta(seconds=1))


# 3. Pass aggregation function to Window.aggregate
# average_reading returns a list of the averages for each sensor,
# use flat map to convert it to individual tuples, one per sensor
rolling_average = window.aggregate(average_reading).flat_map()
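
The role of `flat_map()` here is easiest to see on a small example: the aggregation returns one list per window evaluation, and flattening turns each list element into its own tuple. The sketch below mimics that with `itertools.chain` on made-up data. It is an analogy in plain Python, not a call into `streamsx`.

```python
import itertools

# Made-up aggregation results: one list per window evaluation
window_results = [
    [{"sensor_id": "sensor_1", "average": 1200.0},
     {"sensor_id": "sensor_2", "average": 900.0}],
    [{"sensor_id": "sensor_1", "average": 1250.0}],
]

# Flattening yields the individual dictionaries, one per sensor
flattened = list(itertools.chain.from_iterable(window_results))

for item in flattened:
    print(item["sensor_id"], item["average"])
```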


## 2.3 Create a `View` to preview the tuples on the `Stream`


A `View` is a connection to a `Stream` that becomes activated when the application is running. We examine the data from within the notebook in section 4, below.


In [18]:
averages_view = rolling_average.view(name="RollingAverage", description="Sample of rolling averages for each sensor")

## 2.4 Define output

The `rolling_average` stream is our final result.  We will use `Stream.publish()` to make this stream available to other applications. 

If you want to send the stream to another database or system, you would use a sink function (similar to the source function) and invoke it using `Stream.for_each`.
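
As a rough sketch of such a sink, the callable below appends each tuple to a file as one JSON object per line. `JsonLinesSink` and the file paths are hypothetical names chosen for illustration. A real sink would typically write to your database or messaging client instead, and in a distributed job the file would be created on the host that runs the operator, not in the notebook.

```python
import json
import os
import tempfile

class JsonLinesSink(object):
    """Hypothetical sink: appends each tuple to a file as a JSON line."""
    def __init__(self, path):
        self.path = path

    def __call__(self, tuple_):
        with open(self.path, "a") as f:
            f.write(json.dumps(tuple_) + "\n")

# Local check of the callable, outside of any Streams job
path = os.path.join(tempfile.mkdtemp(), "averages.jsonl")
sink = JsonLinesSink(path)
sink({"sensor_id": "sensor_1", "average": 1401.08})

with open(path) as f:
    written = [json.loads(line) for line in f]
print(written)

# In the topology it would be attached with something like:
# rolling_average.for_each(JsonLinesSink("/path/to/averages.jsonl"))
```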



In [19]:
import json
# publish results as JSON
rolling_average.publish(topic="AverageReadings",
                        schema=json, 
                        name="PublishAverage")

# Other options include:
# invoke another sink function:
# rolling_average.for_each(func=send_to_db)

<streamsx.topology.topology.Sink at 0x7f045c70c668>

<a name="launch"></a>

# 3. Submit the application
A running Streams application is called a *job*. This next cell submits the application for execution and prints the resulting job id.

In [21]:
from streamsx.topology import context

# Disable SSL certificate verification if necessary
cfg[context.ConfigParams.SSL_VERIFY] = False

print("Submitting to " + contextType  + " context")
# submit the topology 'topo'
submission_result = context.submit(contextType, topo, config=cfg)

# The submission_result object contains information about the running application, or job
if submission_result.job:
    streams_job = submission_result.job
    print("JobId: ", streams_job.id, "\nJob name: ", streams_job.name)

Submitting to DISTRIBUTED context


Insecure host connections enabled.


JobId:  9 
Job name:  notebook::SensorAverages_9


<a name="view"></a>

# 4. Use a `View` to access data from the job
Now that the job is started, use the `View` object you created in step 2.3 to start retrieving data from a `Stream`.

In [22]:
# Connect to the view and display the data
queue = averages_view.start_data_fetch()
try:
    for _ in range(10):
        print(queue.get())    
finally:
    averages_view.stop_data_fetch()

{'average': 1401.0884850389868, 'sensor_id': 'sensor_1', 'period_end': 'Mon Nov 11 22:21:31 2019'}
{'average': 1354.7456766554853, 'sensor_id': 'sensor_10', 'period_end': 'Mon Nov 11 22:21:31 2019'}
{'average': 1823.6135796171936, 'sensor_id': 'sensor_100', 'period_end': 'Mon Nov 11 22:21:31 2019'}
{'average': 1617.1620805690814, 'sensor_id': 'sensor_11', 'period_end': 'Mon Nov 11 22:21:31 2019'}
{'average': 1821.7990644663246, 'sensor_id': 'sensor_12', 'period_end': 'Mon Nov 11 22:21:31 2019'}
{'average': 1538.6331621586512, 'sensor_id': 'sensor_13', 'period_end': 'Mon Nov 11 22:21:31 2019'}
{'average': 1516.803832278493, 'sensor_id': 'sensor_14', 'period_end': 'Mon Nov 11 22:21:31 2019'}
{'average': 1450.0449833862046, 'sensor_id': 'sensor_15', 'period_end': 'Mon Nov 11 22:21:31 2019'}
{'average': 1532.283679450917, 'sensor_id': 'sensor_16', 'period_end': 'Mon Nov 11 22:21:31 2019'}
{'average': 1519.6040973560582, 'sensor_id': 'sensor_17', 'period_end': 'Mon Nov 11 22:21:31 2019'}
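
`queue.get()` blocks until a tuple arrives, so the loop above waits indefinitely if the job produces no data. Assuming the object returned by `start_data_fetch()` behaves like a standard `queue.Queue`, a helper along these lines can bound the wait. `take` is a hypothetical function, demonstrated here on a local queue rather than a live view.

```python
import queue

def take(q, count, timeout=5.0):
    """Collect up to `count` items, giving up when a get() times out."""
    items = []
    for _ in range(count):
        try:
            items.append(q.get(timeout=timeout))
        except queue.Empty:
            break  # no more data arrived in time
    return items

# Local demonstration with a plain queue holding two made-up items
demo = queue.Queue()
demo.put({"sensor_id": "sensor_1", "average": 1401.08})
demo.put({"sensor_id": "sensor_10", "average": 1354.74})

print(take(demo, 10, timeout=0.1))
```

With a live view, the call would look like `take(queue, 10)`, still wrapped in the same `try`/`finally` so that `stop_data_fetch()` is always called.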


## 4.1 Display the results in real time
Calling `View.display()` from the notebook displays the results of the view in a table that is updated in real-time.

In [None]:
# Display the results for 30 seconds
averages_view.display(duration=30)


## 4.2 See job status 

#### View job status in IBM Cloud Pak for Data
You can view job status and logs by going to **My Instances** > **Jobs**. Find your job based on the id printed above.
Retrieve job logs using the "Download logs" action from the job's context menu.

To view other information about the job, such as detailed metrics, access the graph. Go to **My Instances** > **Jobs** and select the "View graph" action for the running job.

#### View job status in other Streams installations

- Open the Streams Console. 
    - **IBM Cloud**: Open the service instance page and click **Launch**.
    - **Standalone Streams instance in IBM Cloud Pak for Data**: Get the URL from the documentation.
    - **Local Streams installation**: `streamtool geturl`.

<a name="cancel"></a>

# 5. Cancel the job

This cell generates a widget you can use to cancel the job.

In [None]:
#cancel the job in the IBM Streams service
submission_result.cancel_job_button()

You can also interact with the job through the [Job](https://streamsxtopology.readthedocs.io/en/stable/streamsx.rest_primitives.html#streamsx.rest_primitives.Job) object available as `submission_result.job`.

For example, use `job.cancel()` to cancel the running job directly.

# Summary

We started with a `Stream` called `readings`, which contained the data we wanted to analyze. Next, we used methods of the `Stream` class to perform simple analysis and produce the `rolling_average` stream. This stream is published so that other applications running within our Streams instance can access it.

After submitting the application to the Streams service, we connected to the view on the `rolling_average` stream to see the results within the notebook.