<table style="border: none" align="left">
    <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/pmservice/cars-4-you/master/static/images/logo.png" width="200" alt="Icon"></th>
       <th style="border: none"><font face="verdana" size="5" color="black"><b>Business Area Prediction</b></font></th>
   </tr>
</table>

<img align=left src="https://github.com/pmservice/cars-4-you/raw/master/static/images/business_area.png" width="560" alt="Icon">


Contents
- [0. Setup](#setup)
- [1. Introduction](#introduction)
- [2. Load and explore data](#load)
- [3. Create an Apache Spark machine learning model](#model)
- [4. Store the model in the Watson Machine Learning repository](#persistence)
- [5. Deploy the model in the IBM Cloud](#deploy)
- [6. Payload logging for Spark model](#payload_logging)

**Note:** This notebook works correctly with the `Python 3.5 with Spark 2.1` kernel. Please **do not change the kernel**.

<a id="setup"></a>
## 0. Setup

Use the cell below to upgrade the `watson-machine-learning-client` package.

In [54]:
!rm -rf $PIP_BUILD/watson-machine-learning-client
!pip install --upgrade watson-machine-learning-client

Requirement already up-to-date: watson-machine-learning-client in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s081-fcdcc2c8c4a157-70f20d2e11bc/.local/lib/python3.5/site-packages
Requirement already up-to-date: tqdm in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s081-fcdcc2c8c4a157-70f20d2e11bc/.local/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: tabulate in /usr/local/src/conda3_runtime.v37/home/envs/DSX-Python35-Spark/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: urllib3 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s081-fcdcc2c8c4a157-70f20d2e11bc/.local/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: certifi in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s081-fcdcc2c8c4a157-70f20d2e11bc/.local/lib/python3.5/site-packages (from watson-machine-learning-client)
Requirement already up-to-date: pandas in /gpfs/global_fs01/sym_s

**Note**: Please restart the kernel (Kernel -> Restart)

<a id="introduction"></a>
## 1. Introduction

This notebook creates a Spark MLlib model that predicts the Business Area from client feedback. It shows how to train, store, and deploy the model for scoring.

<a id="load"></a>
## 2. Load and explore data

In this section you will load the data as an Apache Spark DataFrame and perform a basic exploration.

Read the data from the DB2 database into a Spark DataFrame and show a sample record.

In [93]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# @hidden_cell
# The following code is used to access your data and contains your credentials.
# You might want to remove those credentials before you share your notebook.

properties_db2 = {
    'driver': 'com.ibm.db2.jcc.DB2Driver',
    'jdbcurl': 'jdbc:db2://dashdb-entry-yp-dal10-01.services.dal.bluemix.net:50000/BLUDB',
    'user': 'dash5120',
    'password': 'G5_CehiL4_Ux'
}

table_name = 'CAR_RENTAL_TRAINING'
df_data = spark.read.jdbc(properties_db2['jdbcurl'], table='.'.join([properties_db2['user'], table_name]), properties=properties_db2)
df_data.head()

Row(ID=2805, Gender='Male', Status='S', Children=1, Age=Decimal('43.91'), Customer_Status='Active', Car_Owner='Yes', Customer_Service='The rental clerk was nice, but I swear she was deaf.  We had to repeat everything twice and although we had a reservation she treated us like a walk in, so that all the information that we gave to reserve the car we had to repeat all over again.', Satisfaction=0, Business_Area='Service: Knowledge', Action='On-demand pickup location')

**Tip:** The code above can be inserted from the Data menu by selecting the `Insert SparkSession DataFrame` option.

**Note:** The inserted code has been modified to work with the code in the cells below.

As you can see, the data contains eleven fields. `Business_Area` is the field you want to predict, using the feedback text in the `Customer_Service` field.

In [94]:
print("Number of records: " + str(df_data.count()))

Number of records: 486


Let's look at the distribution of the target field.

In [95]:
df_data.select('Business_Area').groupBy('Business_Area').count().show(truncate=False)

+----------------------------------+-----+
|Business_Area                     |count|
+----------------------------------+-----+
|Service: Accessibility            |26   |
|Product: Functioning              |150  |
|Service: Attitude                 |24   |
|Service: Orders/Contracts         |32   |
|Product: Availability/Variety/Size|42   |
|Product: Pricing and Billing      |24   |
|Product: Information              |8    |
|Service: Knowledge                |180  |
+----------------------------------+-----+
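
The class counts are noticeably imbalanced, which is worth keeping in mind when reading any accuracy figure. As a rough, hedged reference point, the sketch below computes the accuracy of a trivial classifier that always predicts the most frequent class. The counts are copied from the table above; the baseline is computed on the full data set rather than on a held-out split, so treat it as an approximation only.

```python
# Class counts copied from the distribution table above.
class_counts = {
    "Service: Accessibility": 26,
    "Product: Functioning": 150,
    "Service: Attitude": 24,
    "Service: Orders/Contracts": 32,
    "Product: Availability/Variety/Size": 42,
    "Product: Pricing and Billing": 24,
    "Product: Information": 8,
    "Service: Knowledge": 180,
}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_counts[majority_label] / total

print("Total records:", total)
print("Majority class:", majority_label)
print("Majority-class baseline accuracy: %.2f" % majority_baseline)
```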



<a id="model"></a>
## 3. Create an Apache Spark machine learning model

In this section you will learn how to:

- [3.1 Prepare data for model training and evaluation](#prep)
- [3.2 Create an Apache Spark machine learning pipeline](#pipe)
- [3.3 Train a model](#train)

<a id="prep"></a>
### 3.1 Prepare data for model training and evaluation

In this subsection you will split your data into training and test data sets.

In [None]:
train_data, test_data = df_data.select("ID", "Customer_Service", "Business_Area").randomSplit([0.8, 0.2], 24)

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

### 3.2 Create the pipeline<a id="pipe"></a>

In this section you will create an Apache Spark machine learning pipeline and then train the model.

In [96]:
from pyspark.ml.feature import StringIndexer, IndexToString, HashingTF, IDF, Tokenizer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model
from pyspark.sql.types import *

In the first data preprocessing step, create features from the `Customer_Service` field.

In [97]:
tokenizer = Tokenizer(inputCol="Customer_Service", outputCol="words")
hashing_tf = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol='hash')
idf = IDF(inputCol=hashing_tf.getOutputCol(), outputCol="features", minDocFreq=5)
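
To make the three feature stages more concrete, here is a minimal pure-Python sketch of the same idea: lowercase whitespace tokenization, hashed term counts, and inverse document frequency weighting with a minimum document frequency. It is an illustration under stated assumptions, not Spark's implementation: `zlib.crc32` stands in for the MurmurHash3 function that Spark uses, so bucket indices will differ, and the helper names (`tokenize`, `hashing_tf`, `fit_idf`, `apply_idf`) and sample sentences are hypothetical.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the default size of the HashingTF vector


def tokenize(text):
    # Lowercase, then split on whitespace (punctuation stays attached).
    return text.lower().split()


def hashing_tf(tokens, num_features=NUM_FEATURES):
    # Count tokens per hash bucket. crc32 is a stand-in hash (assumption).
    counts = Counter()
    for token in tokens:
        counts[zlib.crc32(token.encode("utf-8")) % num_features] += 1
    return dict(counts)


def fit_idf(tf_vectors, min_doc_freq=0):
    # idf = log((m + 1) / (df + 1)); buckets that occur in fewer than
    # min_doc_freq documents get weight 0.
    m = len(tf_vectors)
    doc_freq = Counter()
    for vector in tf_vectors:
        doc_freq.update(vector.keys())
    return {
        bucket: math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for bucket, df in doc_freq.items()
    }


def apply_idf(tf_vector, idf_weights):
    # Scale each term count by the weight of its bucket.
    return {b: tf * idf_weights.get(b, 0.0) for b, tf in tf_vector.items()}


docs = [
    "The car was clean",
    "The clerk was rude",
    "The car price was high",
]
tf_vectors = [hashing_tf(tokenize(d)) for d in docs]
idf_weights = fit_idf(tf_vectors, min_doc_freq=2)
features = [apply_idf(v, idf_weights) for v in tf_vectors]
print(features[0])
```

With `minDocFreq=5`, as in the cell above, any term seen in fewer than five training documents contributes nothing to the feature vector, which trims noise from rare words in a data set of a few hundred comments.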

In the following step, use the `StringIndexer` transformer to convert `Business_Area` to a numeric label.

In [98]:
string_indexer_label = StringIndexer(inputCol="Business_Area", outputCol="label").fit(train_data)

Add a decision tree model to predict `Business_Area`.

In [99]:
dt_area = DecisionTreeClassifier(labelCol="label", featuresCol=idf.getOutputCol())

Finally, set up a transformer to convert the indexed labels back to the original labels.

In [100]:
label_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=string_indexer_label.labels)
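
The round trip between string labels and numeric indices can be sketched in plain Python. This is an illustration only: by default `StringIndexer` assigns index 0 to the most frequent label, and the alphabetical tie-breaking used here is an assumption made to keep the sketch deterministic. The function names and the sample list are hypothetical.

```python
from collections import Counter


def fit_string_indexer(values):
    # Most frequent label first; ties broken alphabetically (assumption).
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))


def index_labels(values, labels):
    # Map each string label to its position in the fitted label list.
    lookup = {label: float(i) for i, label in enumerate(labels)}
    return [lookup[v] for v in values]


def index_to_string(indices, labels):
    # Inverse mapping, as IndexToString does for the prediction column.
    return [labels[int(i)] for i in indices]


areas = [
    "Service: Knowledge", "Product: Functioning", "Service: Knowledge",
    "Service: Attitude", "Product: Functioning", "Service: Knowledge",
]
labels = fit_string_indexer(areas)
indexed = index_labels(areas, labels)

print(labels)
print(indexed)
print(index_to_string(indexed, labels) == areas)
```

Passing the fitted `labels` list to the inverse step is what keeps the two mappings consistent, which is why the notebook fits the indexer before building `label_converter`.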

In [101]:
pipeline = Pipeline(stages=[tokenizer, hashing_tf, idf, string_indexer_label, dt_area, label_converter])

### 3.3 Train the model<a id="train"></a>

In this subsection you will train the model and evaluate its accuracy.

In [102]:
model = pipeline.fit(train_data)

In [103]:
predictions = model.transform(test_data)
predictions.select('Customer_Service','Business_Area','predictedLabel').show(3)

+--------------------+--------------------+------------------+
|    Customer_Service|       Business_Area|    predictedLabel|
+--------------------+--------------------+------------------+
|Agents always wan...|   Service: Attitude|Service: Knowledge|
|Did not have some...|Service: Accessib...|Service: Knowledge|
|I was penalty cha...|Product: Pricing ...|Service: Knowledge|
+--------------------+--------------------+------------------+
only showing top 3 rows



In [104]:
predictions.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Customer_Service: string (nullable = true)
 |-- Business_Area: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- hash: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)
 |-- predictedLabel: string (nullable = true)



In [105]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print("Accuracy = %3.2f" % accuracy)

Accuracy = 0.49


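**Note:** The accuracy of the model is low. However, a single customer comment can reasonably relate to more than one Business Area. In such cases a top-k metric (for example, k=3), which counts a prediction as correct when the true label is among the k most probable classes, would be better suited for model evaluation.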
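
A minimal sketch of such a top-k accuracy metric is shown below. It assumes you have, for each record, a list of class probabilities (like the values in the `probability` column) and the index of the true class (like the `label` column). The function name and the sample numbers are hypothetical and are not results from this model.

```python
def top_k_accuracy(probabilities, true_labels, k=3):
    # A record counts as a hit when its true class index is among the
    # k classes with the highest predicted probability.
    if not true_labels:
        raise ValueError("true_labels must not be empty")
    hits = 0
    for probs, true_label in zip(probabilities, true_labels):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        if int(true_label) in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Made-up probabilities for four records and four classes.
probabilities = [
    [0.50, 0.30, 0.15, 0.05],
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.35, 0.20, 0.05],
    [0.70, 0.15, 0.10, 0.05],
]
true_labels = [1.0, 3.0, 2.0, 3.0]

print(top_k_accuracy(probabilities, true_labels, k=1))
print(top_k_accuracy(probabilities, true_labels, k=3))
```

With `k=1` the metric reduces to ordinary accuracy, so it can be compared directly with the value reported by the evaluator.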

<a id="persistence"></a>
## 4. Store the model in the repository

In this section you will store the trained model in the Watson Machine Learning repository. Some metadata is optional when a model is stored; we provide it here so that the Continuous Learning System can be configured.

In [106]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

We need Watson Machine Learning credentials to store the model in the repository.

In [107]:
# @hidden_cell
# How to get associated service credentials

wml_credentials = {
  "instance_id": "000263d8-04e0-4060-ad69-fcfe40069018",
  "password": "7419325b-3de4-476c-94cb-4b158fa335b0",
  "url": "https://us-south.ml.cloud.ibm.com",
  "username": "cdc4b5da-8380-42f1-bd82-da044b283959"
}

In [108]:
client = WatsonMachineLearningAPIClient(wml_credentials)

In [109]:
client.version

'1.0.260'

Use the code in the cell below to store the model in the Watson Machine Learning repository.

In [110]:
published_model_details = client.repository.store_model(model=model, meta_props={'name':'CARS4U - Business Area Prediction Model'}, training_data=train_data, pipeline=pipeline)

In [111]:
model_uid = client.repository.get_model_uid(published_model_details)
print(model_uid)

ef914435-09d3-4fc5-a637-8958f1ade572


<a id="deploy"></a>
## 5. Deploy model in the IBM Cloud

In this section you will learn how to create a model deployment in the IBM Cloud and retrieve information about the scoring endpoint.

In [None]:
deployment_details = client.deployments.create(asset_uid=model_uid, name='CARS4U - Business Area Prediction Model Deployment')

You can score new data with the deployed model through its scoring endpoint. Use the following command to get the scoring endpoint URL.

In [112]:
scoring_url = client.deployments.get_scoring_url(deployment_details)
print(scoring_url)

https://us-south.ml.cloud.ibm.com/v3/wml_instances/000263d8-04e0-4060-ad69-fcfe40069018/deployments/86837723-331c-4fb7-a60e-c858c734d74a/online


<a id="payload_logging"></a>
## 6. Payload logging

In this section you will configure payload logging for online scoring and learn how to:

- [6.1 Payload logging setup](#payload_logging_setup)
- [6.2 Payload logging with transaction ID](#payload_logging_scoring)
- [6.3 Scoring](#scoring)

<a id="payload_logging_setup"></a>
### 6.1. Setup

First, get the `deployment_uid` of the model deployed in the IBM Cloud.

In [113]:
deployment_uid = client.deployments.get_uid(deployment_details)

Next, provide the configuration of the database to which the scoring payload will be logged.

In [114]:
# @hidden_cell
postgres_connection = {
  'database':'compose',
  'password':"""WHDHTGJYSXKJTMET""",
  'port':'47860',
  'host':'sl-us-south-1-portal.28.dblayer.com',
  'username':'admin'
}

**Tip:** You can use the Data panel to insert the PostgreSQL connection credentials.

In [115]:
payload_data_reference = {
    "type": "postgresql",
    "location": {
        "tablename": "public.cars4u_business_area_prediction_payload"
    },
    "connection": {
            "uri": "postgres://{username}:{password}@{host}:{port}/{database}".format(**postgres_connection)
        }
}
print(payload_data_reference)

{'connection': {'uri': 'postgres://admin:WHDHTGJYSXKJTMET@sl-us-south-1-portal.28.dblayer.com:47860/compose'}, 'location': {'tablename': 'public.cars4u_business_area_prediction_payload'}, 'type': 'postgresql'}
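
Building the URI with plain string formatting works as long as the user name and password contain no reserved URI characters such as `@`, `:` or `/`. The sketch below shows one hedged way to percent-encode those parts with the standard library. The helper name `build_postgres_uri` and the credentials are hypothetical, and whether the payload logging service decodes percent-encoded passwords is an assumption you would need to verify.

```python
from urllib.parse import quote, urlsplit


def build_postgres_uri(connection):
    # Percent-encode user and password so characters such as '@', ':' or '/'
    # are not mistaken for URI delimiters.
    return "postgres://{user}:{password}@{host}:{port}/{database}".format(
        user=quote(connection["username"], safe=""),
        password=quote(connection["password"], safe=""),
        host=connection["host"],
        port=connection["port"],
        database=connection["database"],
    )


# Hypothetical credentials, for illustration only.
example_connection = {
    "database": "compose",
    "username": "admin",
    "password": "p@ss:word/1",
    "host": "db.example.com",
    "port": "5432",
}

uri = build_postgres_uri(example_connection)
parts = urlsplit(uri)

print(uri)
print(parts.hostname, parts.port)
```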


In [116]:
payload_metadata = {client.deployments.PayloadLoggingMetaNames.PAYLOAD_DATA_REFERENCE: payload_data_reference}

Now we are ready to set up payload logging for the deployed model.

In [117]:
config_details = client.deployments.setup_payload_logging(deployment_uid, meta_props=payload_metadata)
print(config_details)

{'dynamic_schema_update': False, 'payload_store': {'connection': {'host': 'sl-us-south-1-portal.28.dblayer.com:47860', 'db': 'compose', 'uri': 'postgres://admin:WHDHTGJYSXKJTMET@sl-us-south-1-portal.28.dblayer.com:47860/compose'}, 'location': {'tablename': 'public.cars4u_business_area_prediction_payload'}, 'type': 'postgresql'}, 'output_data_schema': {'fields': [{'metadata': {'name': 'ID', 'scale': 0}, 'type': 'integer', 'name': 'ID', 'nullable': True}, {'metadata': {'name': 'Customer_Service', 'scale': 0}, 'type': 'string', 'name': 'Customer_Service', 'nullable': True}, {'metadata': {'modeling_role': 'prediction'}, 'type': 'double', 'name': 'prediction', 'nullable': True}, {'metadata': {'modeling_role': 'prediction-probability'}, 'type': 'double', 'name': 'prediction_probability', 'nullable': True}, {'metadata': {'modeling_role': 'probability'}, 'type': {'containsNull': True, 'elementType': 'double', 'type': 'array'}, 'name': 'probability', 'nullable': True}], 'type': 'struct'}}


<a id="payload_logging_scoring"></a>
### 6.2. Payload logging with transaction ID

Here we show how to add a transaction ID to payload logging.

In [118]:
import uuid

scoring_request_payload = {
    "fields": ["ID","Customer_Service"],
    "values": [[1,'Although there were no available car in category stuff assisted in finding an available vehicle.'],
               [2,'Car rental cost was higher because I decided to pay cash'],
               [3,'Do not try sell what I do not need.']]
}

transaction_id = 'transaction-id-' + uuid.uuid4().hex
print("Transaction ID:", transaction_id)

Transaction ID: transaction-id-44ad66ad71fe453eb995cd1e63eed22f
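
The scoring payload uses parallel `fields` and `values` lists rather than one dictionary per record. If your data starts out as a list of dictionaries, a small conversion helper can be convenient. The sketch below is an illustration only; both function names are hypothetical and are not part of the client library.

```python
def records_to_payload(records, fields):
    # Build the {"fields": [...], "values": [[...], ...]} layout,
    # keeping the column order given in `fields`.
    return {
        "fields": list(fields),
        "values": [[record[field] for field in fields] for record in records],
    }


def payload_to_records(payload):
    # Pair each row of values with the field names.
    return [dict(zip(payload["fields"], row)) for row in payload["values"]]


records = [
    {"ID": 1, "Customer_Service": "Car rental cost was higher because I decided to pay cash"},
    {"ID": 2, "Customer_Service": "Do not try sell what I do not need."},
]

payload = records_to_payload(records, ["ID", "Customer_Service"])
print(payload["fields"])
print(payload_to_records(payload) == records)
```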


Now we are ready to send a scoring request with the transaction ID.

<a id="scoring"></a>
### 6.3. Scoring

Scoring with transaction ID.

In [126]:
scoring_response = client.deployments.score(scoring_url, scoring_request_payload, transaction_id=transaction_id)

Scoring without transaction ID.

In [125]:
import json

scoring_response = client.deployments.score(scoring_url, scoring_request_payload)

print(json.dumps(scoring_response, indent=3))

{
   "fields": [
      "ID",
      "Customer_Service",
      "Business_Area",
      "words",
      "hash",
      "features",
      "label",
      "rawPrediction",
      "probability",
      "prediction",
      "predictedLabel"
   ],
   "values": [
      [
         1,
         "Although there were no available car in category stuff assisted in finding an available vehicle.",
         "Product: Functioning",
         [
            "although",
            "there",
            "were",
            "no",
            "available",
            "car",
            "in",
            "category",
            "stuff",
            "assisted",
            "in",
            "finding",
            "an",
            "available",
            "vehicle."
         ],
         [
            262144,
            [
               33209,
               56750,
               58227,
               68435,
               92612,
               156250,
               180535,
               194536,
               222453,

---