# Create and run a data flow with IBM Watson Data APIs

## Introduction
Use the IBM Watson Data Flows service to create and run data flows in a runtime engine. A flow can read data from a large variety of sources, process that data using pre-defined operations or custom code, and then write it to one or more targets. The runtime engine can handle large amounts of data so it's ideally suited for reading, processing, and writing data at volume.

The sources and targets that are supported include both Cloud and on-premises offerings as well as data assets in projects. Cloud offerings include IBM Cloud Object Storage, Amazon S3, and Azure, among others. On-premises offerings include IBM Db2, Microsoft SQL Server, and Oracle, among others.

This notebook contains the steps and code to get you started with creating, running and debugging data flows using the Watson Data APIs. 

For a list of the supported connection types and the properties each one supports, see [IBM Watson Data Flows Service - Data Asset and Connection Properties](https://api.dataplatform.ibm.com/v2/data_flows/doc/dataasset_and_connection_properties.html).

This notebook runs on Python and Spark.

#### Prerequisites
This tutorial uses the [Customer demographics and sales](https://dataplatform.ibm.com/exchange/public/entry/view/f8ccaf607372882403a37d9019b3abf4) data set. You can add it to your project from its tile in the [Watson Studio community](https://dataplatform.ibm.com/community?context=analytics).

## Table of contents

1. [Setup](#setup)<br>
    1.1 [Environments](#setup1)<br>
    1.2 [Project Token](#setup2)<br>
    1.3 [Authorization](#setup3)<br>
2. [Creating a data flow](#create)<br>
    2.1 [Retrieving a data asset](#create1)<br>
    2.2 [Defining a source in a data flow](#create2)<br>
    2.3 [Defining an operation in a data flow](#create3)<br>
    2.4 [Defining a target in a data flow](#create4)<br>
    2.5 [Creating the data flow](#create5)<br>
3. [Working with data flow runs](#run)<br>
    3.1 [What is a data flow run?](#run1)<br>
    3.2 [Run state life cycle](#run2)<br>
    3.3 [Run a data flow](#run3)<br>
    3.4 [Get a data flow run summary](#run4)<br>
    3.5 [Troubleshooting a failed run](#run5)<br>
4. [Resources](#resources)

##  <a id="setup"></a>1. Setup ##

In [None]:
import requests
import json
import uuid

def pretty_print(json_content):
    parsed_json = json.loads(json_content)
    print(json.dumps(parsed_json, indent=4, sort_keys=True))

####  <a id="setup1"></a>1.1 Environments
The data flows service is currently deployed only to the US South region of IBM Cloud. Use this environment URL in place of {service_URL} in the examples below:

    US South: https://api.dataplatform.cloud.ibm.com


In [2]:
service_URL = "https://api.dataplatform.cloud.ibm.com"

####  <a id="setup2"></a>1.2 Project Token
Insert a project token from the action bar (more > Insert project token). Project tokens are used to access project resources like data sources and connections.

In [3]:
# @hidden_cell
# The project token is an authorization token used by platform APIs to access project resources such as data sources and connections.


In [4]:
project_id = pc.projectID

####  <a id="setup3"></a>1.3 Authorization
An IAM Bearer token is required in order to access IBM Watson Data APIs. For information on how to generate an IAM token see <a href="http://ibm.biz/wdp-api#getting" target="_blank" rel="noopener noreferrer">here</a>.

In [5]:
# Replace <IAM Access Token> with your generated IAM Access Token
authorization = "Bearer <IAM Access Token>"
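
One way to obtain that value programmatically is to exchange an IBM Cloud API key for an access token. The sketch below assumes the public IAM token endpoint `https://iam.cloud.ibm.com/identity/token` and its `urn:ibm:params:oauth:grant-type:apikey` grant type. The helper names are hypothetical and are not part of the original notebook.

```python
IAM_TOKEN_URL = "https://iam.cloud.ibm.com/identity/token"  # assumed IAM endpoint


def build_iam_token_form(api_key):
    """Return the form fields that exchange an API key for an access token."""
    return {
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
        "apikey": api_key,
    }


def authorization_from_token(token_response):
    """Build the Authorization header value from a parsed IAM token response."""
    return "Bearer " + token_response["access_token"]


def fetch_authorization(api_key):
    """Call the IAM endpoint (network access required) and return the header value."""
    import requests  # imported lazily so the helpers above have no dependencies

    response = requests.post(
        IAM_TOKEN_URL,
        data=build_iam_token_form(api_key),
        headers={"Accept": "application/json"},
        timeout=30,
    )
    response.raise_for_status()
    return authorization_from_token(response.json())
```

With a valid key, `authorization = fetch_authorization("<API key>")` would replace the hard-coded string above. IAM tokens expire, so a long-running notebook may need to call it again.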

##  <a id="create"></a>2. Creating a data flow ##
The following example shows how to create a data flow that reads data from an existing data asset in a Watson Studio project, filters the data, and writes the data to a data asset in the same project. The data flow created for this example contains a linear pipeline, although in the general case the pipeline forms a directed acyclic graph (DAG).

####  <a id="create1"></a>2.1 Retrieving a data asset ####
Begin by retrieving a list of data assets from a Watson Studio project and choose one to use as the source of the data flow. For further information on the data asset service, see <a href="http://ibm.biz/wdp-api#create-a-data-asset" target="_blank" rel="noopener noreferrer">IBM Watson Data API documentation</a>.

In [7]:
query = {
    "query": "asset.asset_type:data_asset"
}
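# Search the project for assets of type data_asset; the call returns a response object
search_response = requests.post(service_URL + "/v2/asset_types/data_asset/search?project_id=" + project_id, headers={'Authorization': authorization}, json=query)
pretty_print(search_response.text)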

{
    "results": [
        {
            "href": "https://catalogs-yp-prod.mybluemix.net/v2/assets/3621bfda-a92b-4802-9c0d-3eea9c2c0b82?project_id=d12f9685-2693-4c84-af48-3eb6c71e3013",
            "metadata": {
                "asset_attributes": [
                    "data_asset"
                ],
                "asset_id": "3621bfda-a92b-4802-9c0d-3eea9c2c0b82",
                "asset_state": "available",
                "asset_type": "data_asset",
                "catalog_id": "92f22af3-523b-4a35-8e16-e399d1c0e53e",
                "created_at": "2017-12-19T16:34:47Z",
                "name": "customers_orders1_opt.csv",
                "origin_country": "us",
                "owner": "******",
                "rating": 0,
                "rov": {
                    "mode": 0
                },
                "sandbox_id": "d12f9685-2693-4c84-af48-3eb6c71e3013",
                "size": 5648773,
                "usage": {
                    "access_count": 0.0,
                

In the response you can see that the data asset has the ID `3621bfda-a92b-4802-9c0d-3eea9c2c0b82`. You'll need this ID later, when you reference the asset in the data flow you create.
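
Rather than copying the ID by hand, you can look it up by asset name in the parsed search response. This is a minimal sketch: `find_asset_id` is a hypothetical helper, and the sample dictionary is a trimmed-down stand-in for the response shown above.

```python
def find_asset_id(search_results, asset_name):
    """Return the asset_id of the first result whose metadata.name matches, else None."""
    for result in search_results.get("results", []):
        metadata = result.get("metadata", {})
        if metadata.get("name") == asset_name:
            return metadata.get("asset_id")
    return None


# Trimmed-down sample with the same shape as the search response
sample_search_results = {
    "results": [
        {
            "metadata": {
                "asset_id": "3621bfda-a92b-4802-9c0d-3eea9c2c0b82",
                "asset_type": "data_asset",
                "name": "customers_orders1_opt.csv",
            }
        }
    ]
}

source_asset_id = find_asset_id(sample_search_results, "customers_orders1_opt.csv")
print(source_asset_id)
```

In the notebook, the first argument would be the parsed JSON body of the search call instead of the sample.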

####  <a id="create2"></a>2.2 Defining a source in a data flow ####
A data flow can contain one or more data sources. A data source is defined as a *binding node* in the data flow *pipeline*, which has one output and no inputs. The *binding node* must reference either a connection or a data asset. Depending on the type of connection or data asset, additional *properties* might also need to be specified. Refer to [IBM Watson Data Flows Service - Data Asset and Connection Properties](https://api.dataplatform.ibm.com/v2/data_flows/doc/dataasset_and_connection_properties.html) to determine which properties are applicable for a given connection, and which of those are required. 

For the following example, reference the data asset you retrieved earlier. The *binding node* for the data flow's source is:


In [8]:
source_binding_node = {  
  "id":"source1",
  "type":"binding",
  "output":{  
     "id":"source1Output"
  },
  "data_asset":{  
     "ref":"3621bfda-a92b-4802-9c0d-3eea9c2c0b82"
  }
}

The `output` attribute declares the ID of the *output port* of this source as `source1Output` so that other nodes can read from it. You can see the data asset with ID `3621bfda-a92b-4802-9c0d-3eea9c2c0b82` is being referenced.

####  <a id="create3"></a>2.3 Defining an operation in a data flow ####
A data flow can contain zero or more operations, with a typical operation having one or more inputs and one or more outputs. An operation input is linked to the output of a source or another operation. An operation can also have additional parameters which define how the operation performs its work. An operation is defined as an *execution node* in the data flow *pipeline*.

The following example creates a filter operation that retains only the rows whose `STATE` field is not empty (that is, not `""`). The *execution node* for the filter operation is:

In [9]:
filter_operation = {  
  "id":"operation1",
  "type":"execution_node",
  "op":"com.ibm.wdp.transformer.FreeformCode",
  "parameters":{  
     "FREEFORM_CODE":"filter(STATE!=\"\")"
  },
  "inputs":[  
     {  
        "id":"inputPort1",
        "links":[  
           {  
              "node_id_ref":"source1",
              "port_id_ref":"source1Output"
           }
        ]
     }
  ],
  "outputs":[  
     {  
        "id":"outputPort1"
     }
  ]
}

The `inputs` attribute declares an *input port* with ID `inputPort1`, which references the *output port* of the source node (node ID `source1` and port ID `source1Output`). The `outputs` attribute declares the ID of the *output port* of this operation as `outputPort1` so that other nodes can read from it.

For this example, the operation is defined as a freeform operation, denoted by the `op` attribute value of `com.ibm.wdp.transformer.FreeformCode`. A freeform operation has a single parameter named `FREEFORM_CODE`, whose value is a snippet of Sparklyr code. In this snippet, the `filter` function is called with an argument that retains only those rows with a non-empty value in the `STATE` field.
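
If you build several freeform operations, a small factory function keeps the node structure in one place. The sketch below reproduces the node shown above; `make_freeform_node` and its default port IDs are hypothetical names chosen for illustration.

```python
import json


def make_freeform_node(node_id, freeform_code, source_node_id, source_port_id,
                       input_port_id="inputPort1", output_port_id="outputPort1"):
    """Build an execution node that runs a freeform code snippet on one input."""
    return {
        "id": node_id,
        "type": "execution_node",
        "op": "com.ibm.wdp.transformer.FreeformCode",
        "parameters": {"FREEFORM_CODE": freeform_code},
        "inputs": [
            {
                "id": input_port_id,
                "links": [
                    {"node_id_ref": source_node_id, "port_id_ref": source_port_id}
                ],
            }
        ],
        "outputs": [{"id": output_port_id}],
    }


# Single quotes on the outside avoid escaping the double quotes in the snippet
node = make_freeform_node("operation1", 'filter(STATE!="")', "source1", "source1Output")
print(json.dumps(node["parameters"]))
```

The printed JSON shows the inner double quotes escaped with backslashes, which is the form the snippet takes when the node is serialized in a request body.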

####  <a id="create4"></a>2.4 Defining a target in a data flow ####
A data flow can contain zero or more targets. A target is defined as a *binding node* in the data flow *pipeline* which has one input and no outputs. As with the source, the *binding node* must reference either a connection or a data asset. When using a data asset as a target, specify either the ID or name of an existing data asset.

In the following example, a data asset is referenced by its name. The *binding node* for the data flow's target is:

In [10]:
target_binding_node = {  
  "id":"target1",
  "type":"binding",
  "input":{  
     "id":"target1Input",
     "link":{  
        "node_id_ref":"operation1",
        "port_id_ref":"outputPort1"
     }
  },
  "data_asset":{  
     "properties":{  
        "name":"my_shapedFile.csv"
     }
  }
}

The `input` attribute declares an *input port* with ID `target1Input` which references the *output port* of the operation node (node ID `operation1` and port ID `outputPort1`). The name of the data asset to create or update is specified as `my_shapedFile.csv`. Unless otherwise specified, this data asset is assumed to be in the same catalog or project as that which contains the data flow.
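
Because every link refers to another node by `node_id_ref` and `port_id_ref`, a typo in either value breaks the pipeline. The following sketch is an optional local sanity check, not something the service requires; the helper names are hypothetical. It collects the declared output ports and reports any link that does not resolve to one of them.

```python
def collect_output_ports(nodes):
    """Return the set of (node_id, port_id) pairs declared as outputs."""
    ports = set()
    for node in nodes:
        if "output" in node:                      # binding node used as a source
            ports.add((node["id"], node["output"]["id"]))
        for port in node.get("outputs", []):      # execution node
            ports.add((node["id"], port["id"]))
    return ports


def find_broken_links(nodes):
    """Return (node_id, (node_id_ref, port_id_ref)) for each unresolved link."""
    ports = collect_output_ports(nodes)
    broken = []
    for node in nodes:
        links = []
        link = node.get("input", {}).get("link")  # binding node used as a target
        if link is not None:
            links.append(link)
        for port in node.get("inputs", []):       # execution node
            links.extend(port.get("links", []))
        for link in links:
            ref = (link["node_id_ref"], link["port_id_ref"])
            if ref not in ports:
                broken.append((node["id"], ref))
    return broken


# Reduced versions of the three nodes defined in this section
sample_nodes = [
    {"id": "source1", "type": "binding", "output": {"id": "source1Output"}},
    {
        "id": "operation1",
        "type": "execution_node",
        "inputs": [{"id": "inputPort1", "links": [
            {"node_id_ref": "source1", "port_id_ref": "source1Output"}]}],
        "outputs": [{"id": "outputPort1"}],
    },
    {"id": "target1", "type": "binding", "input": {"id": "target1Input", "link": {
        "node_id_ref": "operation1", "port_id_ref": "outputPort1"}}},
]

print(find_broken_links(sample_nodes))
```

An empty list means every link points at a declared output port.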

####  <a id="create5"></a>2.5 Creating the data flow ####
Putting it all together, you can now call the API to create the data flow with the following POST method:

```POST {service_URL}/v2/data_flows```

The new data flow can be stored in a catalog or project. Use either the `catalog_id` **or** `project_id` query parameter, depending on where you want to store the data flow. An example request to create a data flow is shown below.
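
The "one or the other" rule for the query parameters can be enforced when the URL is built. This is a small sketch under that assumption; `data_flows_url` is a hypothetical helper, not part of the Watson Data APIs.

```python
from urllib.parse import urlencode


def data_flows_url(service_url, project_id=None, catalog_id=None):
    """Build the data flows collection URL with exactly one container ID."""
    if (project_id is None) == (catalog_id is None):
        raise ValueError("Specify exactly one of project_id or catalog_id")
    if project_id is not None:
        query = {"project_id": project_id}
    else:
        query = {"catalog_id": catalog_id}
    return service_url + "/v2/data_flows?" + urlencode(query)


print(data_flows_url("https://api.dataplatform.cloud.ibm.com",
                     project_id="d12f9685-2693-4c84-af48-3eb6c71e3013"))
```

Passing both IDs, or neither, raises a `ValueError` before any request is sent.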

In [11]:
dataflow = {  
   "name":"my_dataflow_" + str(uuid.uuid4()),
   "pipeline":{  
      "doc_type":"pipeline",
      "version":"1.0",
      "primary_pipeline":"pipeline1",
      "pipelines":[  
         {  
            "id":"pipeline1",
            "runtime":"Spark",
            "nodes":[  
            ]
         }
      ]
   }
}

dataflow["pipeline"]["pipelines"][0]["nodes"].append(source_binding_node)
dataflow["pipeline"]["pipelines"][0]["nodes"].append(filter_operation)
dataflow["pipeline"]["pipelines"][0]["nodes"].append(target_binding_node)

dataflow_response = requests.post(service_URL + "/v2/data_flows?project_id=" + project_id, headers={'Authorization': authorization}, json=dataflow)
data_flow_id = json.loads(dataflow_response.text)["metadata"]["asset_id"]
pretty_print(dataflow_response.text)

{
    "entity": {
        "name": "my_dataflow_0a67cdf4-0047-4ab0-8352-30ea2009af1a",
        "pipeline": {
            "doc_type": "pipeline",
            "id": "53be959f-8cfd-4186-ba4a-d2bf869277a7",
            "pipelines": [
                {
                    "id": "pipeline1",
                    "nodes": [
                        {
                            "data_asset": {
                                "ref": "3621bfda-a92b-4802-9c0d-3eea9c2c0b82"
                            },
                            "id": "source1",
                            "output": {
                                "id": "source1Output"
                            },
                            "type": "binding"
                        },
                        {
                            "id": "operation1",
                            "inputs": [
                                {
                                    "id": "inputPort1",
                                    "links": [
          

The response shows that the data flow was created with an ID of `4dd043be-2bdb-4312-9d75-ddd2b045db55` (its `metadata.asset_id` value), which you will need later to run the data flow.
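
If the request is rejected, the response body has no `metadata` attribute and the ID lookup fails with a `KeyError` that hides the real cause. The sketch below checks the HTTP status first; `asset_id_from_response` is a hypothetical helper, and treating any non-2xx status as a failure is an assumption.

```python
import json


def asset_id_from_response(status_code, body_text):
    """Return metadata.asset_id from a response body, or raise on an HTTP error."""
    if not 200 <= status_code < 300:
        raise RuntimeError(
            "Request failed with HTTP {}: {}".format(status_code, body_text)
        )
    return json.loads(body_text)["metadata"]["asset_id"]


# Sample body containing only the attribute this helper reads
sample_body = '{"metadata": {"asset_id": "4dd043be-2bdb-4312-9d75-ddd2b045db55"}}'
print(asset_id_from_response(201, sample_body))
```

In the notebook this could be called as `asset_id_from_response(dataflow_response.status_code, dataflow_response.text)`.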

##  <a id="run"></a>3. Working with data flow runs ##

#### <a id="run1"></a>3.1 What is a data flow run? ####
Each time a data flow is run, a new data flow run asset is created and stored in the project or catalog to record this event. This asset stores detailed metrics such as how many rows were read and written, a copy of the data flow that was run, and any logs from the engine. During a run, the information in this asset is updated to reflect the current state of the run. When the run completes (successfully or not), the information in the asset is updated one final time. If and when the data flow is deleted, any run assets of that data flow are also deleted.

There are four components of a data flow run, which are accessible using different APIs.

- Summary (`GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}`). A quick, at-a-glance view of a run with a summary of how many rows in total were read and written.
- Detailed metrics (`GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/metrics`). Detailed metrics for each binding node in the data flow (that is, each source and target).
- Data flow (`GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/origin`). A copy of the data flow that was run at that point in time. (Remember that data flows can be modified between runs.)
- Logs (`GET /v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/logs`). The logs from the engine, which are useful for diagnosing run failures.
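
The four endpoints share a common prefix, so their URLs can be generated together. This is a minimal sketch; `run_component_urls` is a hypothetical helper, and the IDs in the example call are placeholders.

```python
def run_component_urls(service_url, data_flow_id, data_flow_run_id, project_id):
    """Return the URL of each data flow run component, keyed by component name."""
    base = "{}/v2/data_flows/{}/runs/{}".format(
        service_url, data_flow_id, data_flow_run_id
    )
    query = "?project_id=" + project_id
    return {
        "summary": base + query,
        "metrics": base + "/metrics" + query,
        "origin": base + "/origin" + query,
        "logs": base + "/logs" + query,
    }


urls = run_component_urls(
    "https://api.dataplatform.cloud.ibm.com", "flow-id", "run-id", "project-id"
)
for name, url in urls.items():
    print(name, url)
```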

#### <a id="run2"></a>3.2 Run state life cycle ####
A data flow run has a defined life cycle, which is shown by its `state` attribute. The `state` attribute can have one of the following values:

- `starting`: The run was created but was not yet submitted to the engine.
- `queued`: The run was submitted to the engine and it is pending.
- `running`: The run is currently in progress.
- `finished`: The run finished and was successful.
- `error`: The run did not complete. An error occurred either before the run was sent to the engine or while the run was in progress.
- `stopping`: The run was canceled but it is still running.
- `stopped`: The run is no longer in progress.

The run states that define phases of progress are: `starting`, `queued`, `running`, `stopping`. The run states that define states of completion are:  `finished`, `error`, `stopped`.

The following are typical state transitions you would expect to see:

1. The run completed successfully: `starting` -> `queued` -> `running` -> `finished`.
2. The run failed (for example, connection credentials were incorrect): `starting` -> `queued` -> `running` -> `error`.
3. The run could not be sent to the engine (for example, the connection referenced does not exist): `starting` -> `error`.
4. The run was stopped (for example, at the user's request): `starting` -> `queued` -> `running` -> `stopping` -> `stopped`.
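
The split between in-progress and completion states suggests a simple polling loop: keep reading the state until it is one of the completion states. The sketch below is one way to write it. `wait_for_completion` and its parameters are hypothetical, and the state is supplied by a callable so that the loop can be tried without calling the service.

```python
import time

IN_PROGRESS_STATES = {"starting", "queued", "running", "stopping"}
COMPLETED_STATES = {"finished", "error", "stopped"}


def wait_for_completion(fetch_state, poll_interval_secs=5.0, timeout_secs=600.0,
                        sleep=time.sleep):
    """Poll fetch_state() until it returns a completion state, then return it."""
    waited = 0.0
    while True:
        state = fetch_state()
        if state in COMPLETED_STATES:
            return state
        if state not in IN_PROGRESS_STATES:
            raise ValueError("Unexpected run state: {!r}".format(state))
        if waited >= timeout_secs:
            raise TimeoutError("Run still {!r} after {} seconds".format(state, waited))
        sleep(poll_interval_secs)
        waited += poll_interval_secs


# Simulate the first transition sequence listed above, without sleeping
simulated_states = iter(["starting", "queued", "running", "finished"])
final_state = wait_for_completion(lambda: next(simulated_states),
                                  sleep=lambda secs: None)
print(final_state)
```

Against the service, `fetch_state` would wrap the run summary call and return the `state` attribute of the response's `entity` object.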

#### <a id="run3"></a>3.3 Run a data flow ####
To run a data flow, call the following POST API:

```
POST {service_URL}/v2/data_flows/{data_flow_id}/runs?project_id={project_id}
```

The value of `data_flow_id` is the `metadata.asset_id` from your data flow. The following cell makes this call and prints the response:

In [12]:
dataflow_run_response = requests.post(service_URL + "/v2/data_flows/" + data_flow_id + "/runs?project_id=" + project_id, headers={'Authorization': authorization}, json={})
data_flow_run_id = json.loads(dataflow_run_response.text)["metadata"]["asset_id"]
pretty_print(dataflow_run_response.text)

{
    "entity": {
        "configuration": {},
        "data_flow_ref": "4dd043be-2bdb-4312-9d75-ddd2b045db55",
        "name": "my_dataflow_0a67cdf4-0047-4ab0-8352-30ea2009af1a",
        "rov": {
            "members": [],
            "mode": 0
        },
        "state": "starting",
        "tags": []
    },
    "metadata": {
        "asset_id": "9f1fb8ca-5bd3-4bc3-8f23-00d53356883e",
        "asset_type": "data_flow_run",
        "create_time": "2018-01-31T12:10:00.000Z",
        "creator": "******",
        "href": "https://api.dataplatform.ibm.com/v2/data_flows/4dd043be-2bdb-4312-9d75-ddd2b045db55/runs/9f1fb8ca-5bd3-4bc3-8f23-00d53356883e?project_id=d12f9685-2693-4c84-af48-3eb6c71e3013",
        "project_id": "d12f9685-2693-4c84-af48-3eb6c71e3013",
        "usage": {
            "access_count": 0,
            "last_access_time": "2018-01-31T12:10:00.597Z",
            "last_accessor": "******",
            "last_modification_time": "2018-01-31T12:10:00.597Z",
            "last_mod

#### <a id="run4"></a>3.4 Get a data flow run summary ####
To retrieve the latest summary of a data flow run, call the following GET method:
```
GET {service_URL}/v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}?project_id={project_id}
```

The value of `data_flow_id` is the `metadata.asset_id` from your data flow. The value of `data_flow_run_id` is the `metadata.asset_id` from your data flow run. The following cell makes this call and prints the response:

In [15]:
dataflow_run_summary = requests.get(service_URL + "/v2/data_flows/" + data_flow_id + "/runs/" + data_flow_run_id + "?project_id=" + project_id, headers={'Authorization': authorization})
pretty_print(dataflow_run_summary.text)

{
    "entity": {
        "configuration": {},
        "data_flow_ref": "4dd043be-2bdb-4312-9d75-ddd2b045db55",
        "engine_state": {
            "engine_run_id": "f9790715-a987-40d2-a009-286990d928bd",
            "session_cookie": "route=Spark; HttpOnly; Secure"
        },
        "name": "my_dataflow_0a67cdf4-0047-4ab0-8352-30ea2009af1a",
        "rov": {
            "members": [],
            "mode": 0
        },
        "state": "finished",
        "summary": {
            "completed_date": "2018-01-31T12:10:40.259Z",
            "engine_completed_date": "2018-01-31T12:10:39.621Z",
            "engine_elapsed_secs": 23,
            "engine_started_date": "2018-01-31T12:10:16.562Z",
            "engine_status_date": "2018-01-31T12:10:39.622Z",
            "engine_submitted_date": "2018-01-31T12:10:08.489Z",
            "total_bytes_read": 4351424,
            "total_bytes_written": 3850572,
            "total_rows_read": 13733,
            "total_rows_written": 12186
        },
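
The totals in `summary` are enough to see what the filter did. The following sketch derives the number of rows the flow dropped from a parsed run summary; `summarize_run` is a hypothetical helper, and the sample dictionary contains only the attributes it reads, with the values shown above.

```python
def summarize_run(run):
    """Return the run state, rows dropped by the flow, and engine elapsed time."""
    entity = run["entity"]
    summary = entity["summary"]
    return {
        "state": entity["state"],
        "rows_filtered_out": summary["total_rows_read"] - summary["total_rows_written"],
        "engine_elapsed_secs": summary["engine_elapsed_secs"],
    }


sample_run = {
    "entity": {
        "state": "finished",
        "summary": {
            "engine_elapsed_secs": 23,
            "total_rows_read": 13733,
            "total_rows_written": 12186,
        },
    }
}

print(summarize_run(sample_run))
```

For this run, 13733 rows were read and 12186 were written, so the filter removed 1547 rows with an empty `STATE` value.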

#### <a id="run5"></a>3.5 Troubleshooting a failed run ####
If a data flow run fails, the `state` attribute is set to the value `error`. In addition to this, the run asset itself has an attribute called `error` which is set to a concise description of the error (where available from the engine). If this information is not available from the engine, a more general message is set in the `error` attribute. This means that the `error` attribute is never left unset if a run fails. The following example shows the `error` payload produced if a schema specified in a source connection's properties doesn't exist:

```json
{
    "error": {
        "trace": "1c09deb8-c3f9-4dc1-ad5a-0fc4e7c97071",
        "errors": [
            {
                "code": "runtime_failed",
                "message": "While the process was running a fatal error occurred in the engine (see logs for more details): SCAPI: CDICO2005E: Table could not be found: \"BADSCHEMAGOSALESHR.EMPLOYEE\" is an undefined name.. SQLCODE=-204, SQLSTATE=42704, DRIVER=4.20.4\ncom.ibm.connect.api.SCAPIException: CDICO2005E: Table could not be found: \"BADSCHEMAGOSALESHR.EMPLOYEE\" is an undefined name.. SQLCODE=-204, SQLSTATE=42704, DRIVER=4.20.4\n\tat com.ibm.connect.jdbc.JdbcInputInteraction.init(JdbcInputInteraction.java:158)\n\t...",
                "extra": {
                    "account": "2d0d29d5b8d2701036042ca4cab8b613",
                    "diagnostics": "[PROJECT_ID-ff1ab70b-0553-409a-93f9-ccc31471c218] [DATA_FLOW_ID-cfdacdb4-3180-466f-8d4c-be7badea5d64] [DATA_FLOW_NAME-my_dataflow] [DATA_FLOW_RUN_ID-ed09488c-6d51-48c4-b190-7096f25645d5]",
                    "environment_name": "ypprod",
                    "http_status": 400,
                    "id": "CDIWA0129E",
                    "source_cluster": "NULL",
                    "service_version": "1.0.471",
                    "source_component": "WDP-DataFlows",
                    "timestamp": "2017-12-19T19:52:09.438Z",
                    "transaction_id": "71c7d19b-a91b-40b1-9a14-4535d76e9e16",
                    "user": "******"
                }
            }
        ]
    }
}
```
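
The `message` value can include a full stack trace, so it is often more readable to print only the error code and the first line of each message. This is a minimal sketch that assumes the payload has the shape shown above; `describe_run_error` is a hypothetical helper and the sample message is shortened for illustration.

```python
def describe_run_error(error_payload):
    """Return a (code, first line of message) pair for each reported error."""
    descriptions = []
    for entry in error_payload.get("error", {}).get("errors", []):
        message = entry.get("message", "")
        headline = message.splitlines()[0] if message else ""
        descriptions.append((entry.get("code"), headline))
    return descriptions


sample_error = {
    "error": {
        "trace": "1c09deb8-c3f9-4dc1-ad5a-0fc4e7c97071",
        "errors": [
            {
                "code": "runtime_failed",
                "message": "Table could not be found\n\tat shortened.stack.Trace(...)",
            }
        ],
    }
}

for code, headline in describe_run_error(sample_error):
    print(code + ": " + headline)
```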

To get the logs produced by the engine, use the following API:

```
GET {service_URL}/v2/data_flows/{data_flow_id}/runs/{data_flow_run_id}/logs?project_id={project_id}
```

The following cell makes this call and prints the response:

In [17]:
dataflow_run_logs = requests.get(service_URL + "/v2/data_flows/" + data_flow_id + "/runs/" + data_flow_run_id + "/logs?project_id=" + project_id, headers={'Authorization': authorization})
pretty_print(dataflow_run_logs.text)

{
    "first": {
        "href": "https://api.dataplatform.ibm.com/v2/data_flows/4dd043be-2bdb-4312-9d75-ddd2b045db55/runs/9f1fb8ca-5bd3-4bc3-8f23-00d53356883e/logs?project_id=d12f9685-2693-4c84-af48-3eb6c71e3013&offset=0&limit=100&raw_logs=false"
    },
    "last": {
        "href": "https://api.dataplatform.ibm.com/v2/data_flows/4dd043be-2bdb-4312-9d75-ddd2b045db55/runs/9f1fb8ca-5bd3-4bc3-8f23-00d53356883e/logs?project_id=d12f9685-2693-4c84-af48-3eb6c71e3013&offset=0&limit=100&raw_logs=false"
    },
    "limit": 100,
    "logs": [
        {
            "date": "2018-01-31T12:10:15.000Z",
            "event_id": "0",
            "message_text": "Job requested for activity '4dd043be-2bdb-4312-9d75-ddd2b045db55' with run id 'f9790715-a987-40d2-a009-286990d928bd' by user '*****'",
            "type": "info"
        },
        {
            "date": "2018-01-31T12:10:15.000Z",
            "event_id": "1",
            "message_text": "Job submitted to cluster 'spark'",
            "type": "

## <a id="resources"></a>4. Resources ##
For further information, see <a href="http://ibm.biz/wdp-api" target="_blank" rel="noopener noreferrer">IBM Watson Core Services</a>.

## Author
**Damian Cummins** is a Cloud Application Developer with the Data Refinery and IBM Watson teams at IBM. 

Copyright © IBM Corp. 2018, 2019. This notebook and its source code are released under the terms of the MIT License.
