<table style="border: none" align="left">
    <tr style="border: none">
       <th style="border: none"><img src="https://raw.githubusercontent.com/pmservice/cars-4-you/master/static/images/logo.png" width="200" alt="Icon"></th>
       <th style="border: none"><font face="verdana" size="5" color="black"><b>Business Area Prediction</b></font></th>
   </tr>
</table>

<img align=left src="https://github.com/pmservice/cars-4-you/raw/master/static/images/business_area.png" width="560" alt="Icon">


Contents
- [0. Setup](#setup)
- [1. Introduction](#introduction)
- [2. Load and explore data](#load)
- [3. Create an Apache Spark machine learning model](#model)
- [4. Store the model in the Watson Machine Learning repository](#persistence)
- [5. Deploy the model in the IBM Cloud](#deploy)
- [6. Scoring](#scoring)

**Note:** This notebook works correctly with the `Python 3.5 with Spark 2.1` kernel. Please **do not change the kernel**.

<a id="setup"></a>
## 0. Setup

In this section, use the cell below to upgrade the `watson-machine-learning-client` package.

In [1]:
!rm -rf $PIP_BUILD/watson-machine-learning-client
!pip install --upgrade watson-machine-learning-client==1.0.260

Requirement already up-to-date: watson-machine-learning-client==1.0.260 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s081-fcdcc2c8c4a157-70f20d2e11bc/.local/lib/python3.5/site-packages
Requirement already up-to-date: tqdm in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s081-fcdcc2c8c4a157-70f20d2e11bc/.local/lib/python3.5/site-packages (from watson-machine-learning-client==1.0.260)
Requirement already up-to-date: tabulate in /usr/local/src/conda3_runtime.v38/home/envs/DSX-Python35-Spark/lib/python3.5/site-packages (from watson-machine-learning-client==1.0.260)
Requirement already up-to-date: urllib3 in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s081-fcdcc2c8c4a157-70f20d2e11bc/.local/lib/python3.5/site-packages (from watson-machine-learning-client==1.0.260)
Requirement already up-to-date: certifi in /gpfs/global_fs01/sym_shared/YPProdSpark/user/s081-fcdcc2c8c4a157-70f20d2e11bc/.local/lib/python3.5/site-packages (from watson-machine-learning-client==1.0.260)
Requirement already 

**Note**: Please restart the kernel (Kernel -> Restart)

<a id="introduction"></a>
## 1. Introduction

This notebook creates a Spark MLlib model that predicts the Business Area based on client feedback. It shows how to train, store, and deploy the model for scoring.

<a id="load"></a>
## 2. Load and explore data

In this section you will load the data as an Apache Spark DataFrame and perform a basic exploration.

Read the data from a Db2 database into a Spark DataFrame and show a sample record.

**Tip:** If needed, put your service credentials here.

In [2]:
# The code was removed by Watson Studio for sharing.

Row(ID=74, Gender='Male', Status='M', Children=1, Age=Decimal('26.26'), Customer_Status='Active', Car_Owner='No', Customer_Service='no wait for pick up and drop off was great, help with luggage, face to face directions to hotel, recommended entertainment for area.', Satisfaction=1, Business_Area='Product: Information', Action='NA')

**Tip:** The code above can be inserted from the Data menu by selecting the `Insert SparkSession DataFrame` option.

**Note:** The inserted code was modified to work with the code in the cells below.

As you can see, the data contains eleven fields. The `Business_Area` field is the one you would like to predict, using the feedback text in the `Customer_Service` field.

In [3]:
print("Number of records: " + str(df_data.count()))

Number of records: 482


Let's look at the distribution of the target field.

In [4]:
df_data.select('Business_Area').groupBy('Business_Area').count().show(truncate=False)

+----------------------------------+-----+
|Business_Area                     |count|
+----------------------------------+-----+
|Service: Accessibility            |26   |
|Product: Functioning              |150  |
|Service: Attitude                 |24   |
|Service: Orders/Contracts         |32   |
|Product: Availability/Variety/Size|38   |
|Product: Pricing and Billing      |24   |
|Product: Information              |8    |
|Service: Knowledge                |180  |
+----------------------------------+-----+
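
The distribution above is imbalanced, which is worth keeping in mind when the model's accuracy is reported later. As a rough sketch, the class shares and a majority-class baseline can be computed from the counts in the table. The baseline is computed here on the full data set, not on the test split, so treat it as an approximate reference point only.

```python
# Class counts copied from the distribution table above.
counts = {
    "Service: Accessibility": 26,
    "Product: Functioning": 150,
    "Service: Attitude": 24,
    "Service: Orders/Contracts": 32,
    "Product: Availability/Variety/Size": 38,
    "Product: Pricing and Billing": 24,
    "Product: Information": 8,
    "Service: Knowledge": 180,
}

total = sum(counts.values())
shares = {area: n / total for area, n in counts.items()}

# Accuracy obtained by always predicting the most frequent class.
majority_area = max(counts, key=counts.get)
majority_baseline = counts[majority_area] / total

for area, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print("%-36s %5.1f%%" % (area, 100 * share))
print("Majority-class baseline accuracy: %.2f" % majority_baseline)
```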



<a id="model"></a>
## 3. Create an Apache Spark machine learning model

In this section you will learn how to:

- [3.1 Prepare data for model training and evaluation](#prep)
- [3.2 Create an Apache Spark machine learning pipeline](#pipe)
- [3.3 Train a model](#train)

<a id="prep"></a>
### 3.1 Prepare data for model training and evaluation

In this subsection you will split your data into a training set and a test set.

In [5]:
train_data, test_data = df_data.select("ID", "Customer_Service", "Business_Area").randomSplit([0.8, 0.2], 24)

print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))

Number of training records: 387
Number of testing records : 95
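
The split is 387/95 rather than an exact 80/20 because `randomSplit` treats the weights as sampling probabilities, not exact proportions. The following pure-Python sketch illustrates the idea under a simplifying assumption: each row gets one independent random draw. The helper `random_split` is hypothetical, and Python's `random` module will not reproduce Spark's counts for the same seed.

```python
import random


def random_split(rows, train_weight, seed):
    """Assign each row to train or test using one independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # A row goes to the training set with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(482))  # stand-in for the 482 records
train, test = random_split(rows, 0.8, seed=24)

print("Number of training records: " + str(len(train)))
print("Number of testing records : " + str(len(test)))
```

Fixing the seed makes the split repeatable, which is why the notebook passes `24` as the second argument.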


<a id="pipe"></a>
### 3.2 Create the pipeline

In this subsection you will create an Apache Spark machine learning pipeline, which you will then use to train the model.

In [6]:
from pyspark.ml.feature import StringIndexer, IndexToString, HashingTF, IDF, Tokenizer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model
from pyspark.sql.types import *


In the first data preprocessing step, create features from the `Customer_Service` field.

In [7]:
tokenizer = Tokenizer(inputCol="Customer_Service", outputCol="words")
hashing_tf = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol='hash')
idf = IDF(inputCol=hashing_tf.getOutputCol(), outputCol="features", minDocFreq=5)
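
To make the three stages above more concrete, here is a small pure-Python sketch of the same idea: tokenize, hash words into a fixed number of buckets, then weight the counts by a smoothed inverse document frequency. It is an illustration, not Spark's implementation: `zlib.crc32` is a stand-in hash chosen for this sketch, `NUM_FEATURES` is deliberately tiny, and the sample sentences are made up.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose; chosen only for this sketch


def tokenize(text):
    # Lowercase and split on whitespace; punctuation stays attached to words.
    return text.lower().split()


def hashing_tf(words, num_features=NUM_FEATURES):
    # Count how many words fall into each hash bucket.
    return Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in words)


def fit_idf(tf_vectors, min_doc_freq=0):
    # Smoothed IDF: log((m + 1) / (df + 1)). Buckets that occur in fewer
    # than min_doc_freq documents get weight 0.
    m = len(tf_vectors)
    doc_freq = Counter(bucket for tf in tf_vectors for bucket in tf)
    idf = {}
    for bucket, df in doc_freq.items():
        idf[bucket] = math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0
    return idf


def tf_idf(tf, idf):
    # Scale each bucket count by its IDF weight.
    return {bucket: count * idf.get(bucket, 0.0) for bucket, count in tf.items()}


# Made-up feedback sentences used only for illustration.
docs = [
    "The car was clean and the staff was helpful",
    "The staff did not know the pricing",
    "No car was available at pick up",
]
tf_vectors = [hashing_tf(tokenize(d)) for d in docs]
idf = fit_idf(tf_vectors, min_doc_freq=2)
print(tf_idf(tf_vectors[0], idf))
```

With `minDocFreq=5`, as in the cell above, words that appear in fewer than five training comments contribute nothing to the features.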

In the following step, use the `StringIndexer` transformer to convert `Business_Area` to numeric label indices.

In [8]:
string_indexer_label = StringIndexer(inputCol="Business_Area", outputCol="label").fit(train_data)

Add a decision tree model to predict `Business_Area`.

In [9]:
dt_area = DecisionTreeClassifier(labelCol="label", featuresCol=idf.getOutputCol())

Finally, set up a transformer to convert the indexed labels back to the original labels.

In [10]:
label_converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=string_indexer_label.labels)
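
The indexer and the converter are inverses of each other. The following pure-Python sketch shows the round trip on a made-up sample. It assumes frequency-descending ordering, where the most frequent label gets index 0. The alphabetical tie-break and the helper names `fit_labels`, `to_index`, and `to_label` are assumptions of this sketch.

```python
from collections import Counter


def fit_labels(values):
    # Most frequent label first; ties broken alphabetically (sketch assumption).
    freq = Counter(values)
    return sorted(freq, key=lambda label: (-freq[label], label))


def to_index(labels, value):
    # String label -> numeric index, stored as a float like the `label` column.
    return float(labels.index(value))


def to_label(labels, index):
    # Numeric index -> original string label.
    return labels[int(index)]


# Made-up sample of target values.
sample = (
    ["Service: Knowledge"] * 3
    + ["Product: Functioning"] * 2
    + ["Product: Information"]
)

labels = fit_labels(sample)
idx = to_index(labels, "Product: Functioning")
print(labels)
print(idx, "->", to_label(labels, idx))
```

This is why `label_converter` is given `string_indexer_label.labels`: the same ordered list must be used in both directions.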

In [11]:
pipeline = Pipeline(stages=[tokenizer, hashing_tf, idf, string_indexer_label, dt_area, label_converter])

<a id="train"></a>
### 3.3 Train the model

In this subsection you will train the model and evaluate its accuracy.

In [12]:
model = pipeline.fit(train_data)

In [13]:
predictions = model.transform(test_data)
predictions.select('Customer_Service','Business_Area','predictedLabel').show(3)

+--------------------+--------------------+--------------------+
|    Customer_Service|       Business_Area|      predictedLabel|
+--------------------+--------------------+--------------------+
|Initially the rep...|Product: Availabi...|  Service: Knowledge|
|I have had a few ...|Product: Availabi...|  Service: Knowledge|
|They did not have...|Product: Availabi...|Product: Availabi...|
+--------------------+--------------------+--------------------+
only showing top 3 rows



In [14]:
predictions.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- Customer_Service: string (nullable = true)
 |-- Business_Area: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- hash: vector (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)
 |-- predictedLabel: string (nullable = true)



In [15]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print("Accuracy = %3.2f" % accuracy)

Accuracy = 0.55


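**Note:** The accuracy of the model is low. However, a customer comment can relate to more than one Business Area. In such cases a top-k metric (for example, k=3), which counts a prediction as correct when the true label is among the k most probable classes, would be better suited for model evaluation.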
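
A minimal sketch of such a top-k metric follows. It works on plain Python lists of class probabilities and label indices. The function `top_k_accuracy` and the sample numbers are hypothetical and are not part of the notebook's pipeline.

```python
def top_k_accuracy(probabilities, labels, k=3):
    """Share of rows whose true label index is among the k most probable classes."""
    if not labels:
        raise ValueError("labels must not be empty")
    hits = 0
    for probs, label in zip(probabilities, labels):
        # Indices of the k classes with the highest predicted probability.
        top_k = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
        hits += int(label) in top_k
    return hits / len(labels)


# Made-up probability vectors and true label indices for three rows.
probabilities = [
    [0.50, 0.30, 0.20],
    [0.10, 0.60, 0.30],
    [0.40, 0.35, 0.25],
]
labels = [0.0, 2.0, 2.0]

for k in (1, 2, 3):
    print("top-%d accuracy = %3.2f" % (k, top_k_accuracy(probabilities, labels, k)))
```

To apply this to the notebook's results, the inputs could be collected from the `probability` and `label` columns of `predictions`, for example via `predictions.select("probability", "label").collect()`. This is a suggestion and has not been run against this data.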

<a id="persistence"></a>
## 4. Store the model in the repository

In this section you will store the trained model in the Watson Machine Learning repository. Some metadata is optional when a model is stored, but we provide it so that the Continuous Learning System can be configured.

In [16]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient



Watson Machine Learning credentials are required to store the model in the repository.

**Tip:** If needed, put your service credentials here.

In [17]:
# The code was removed by Watson Studio for sharing.

In [18]:
client = WatsonMachineLearningAPIClient(wml_credentials)

In [19]:
client.version

'1.0.260'

Use the code in the cell below to store the model in the Watson Machine Learning repository.

In [20]:
published_model_details = client.repository.store_model(model=model, meta_props={'name':'CARS4U - Business Area Prediction Model'}, training_data=train_data, pipeline=pipeline)

In [21]:
model_uid = client.repository.get_model_uid(published_model_details)
print(model_uid)

d928c3e2-8eb6-4d64-892a-e3607297c89d


<a id="deploy"></a>
## 5. Deploy the model in the IBM Cloud

In this section you will learn how to create a model deployment in the IBM Cloud and retrieve information about the scoring endpoint.

In [22]:
deployment_details = client.deployments.create(asset_uid=model_uid, name='CARS4U - Business Area Prediction Model Deployment')



#######################################################################################

Synchronous deployment creation for uid: 'd928c3e2-8eb6-4d64-892a-e3607297c89d' started

#######################################################################################


INITIALIZING
DEPLOY_SUCCESS


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='4bb27d68-f2ba-4825-b451-c745450d83b4'
------------------------------------------------------------------------------------------------




You can use the deployed model to score new data through its scoring endpoint. Use the following command to get the scoring endpoint URL.

In [23]:
scoring_url = client.deployments.get_scoring_url(deployment_details)
print(scoring_url)

https://us-south.ml.cloud.ibm.com/v3/wml_instances/aaed6937-c0e7-4307-8a17-361aca257c7e/deployments/4bb27d68-f2ba-4825-b451-c745450d83b4/online


<a id="scoring"></a>
## 6. Scoring
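
The scoring calls in this section use `scoring_request_payload` and `transaction_id`, which are not defined in the cells shown here. The sketch below shows one plausible way to build them. The `fields`/`values` layout and the sample record are inferred from the scoring response printed at the end of this section. Using a UUID string as the transaction ID is an assumption.

```python
import json
import uuid

# Input columns expected by the pipeline (same columns as the training data).
fields = ["ID", "Customer_Service", "Business_Area"]

# One record to score; the text is taken from the response shown in this section.
values = [
    [
        1,
        "Although there were no available car in category stuff assisted in finding an available vehicle.",
        "Service: Knowledge",
    ]
]

scoring_request_payload = {"fields": fields, "values": values}

# Assumption: any unique string can serve as the transaction ID.
transaction_id = str(uuid.uuid4())

print(json.dumps(scoring_request_payload, indent=3))
```

Each inner list in `values` must have one entry per name in `fields`, in the same order.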

Scoring with transaction ID.

In [30]:
scoring_response = client.deployments.score(scoring_url, scoring_request_payload, transaction_id=transaction_id)

Scoring without transaction ID.

In [31]:
import json

scoring_response = client.deployments.score(scoring_url, scoring_request_payload)

print(json.dumps(scoring_response, indent=3))

{
   "fields": [
      "ID",
      "Customer_Service",
      "Business_Area",
      "words",
      "hash",
      "features",
      "label",
      "rawPrediction",
      "probability",
      "prediction",
      "predictedLabel"
   ],
   "values": [
      [
         1,
         "Although there were no available car in category stuff assisted in finding an available vehicle.",
         "Service: Knowledge",
         [
            "although",
            "there",
            "were",
            "no",
            "available",
            "car",
            "in",
            "category",
            "stuff",
            "assisted",
            "in",
            "finding",
            "an",
            "available",
            "vehicle."
         ],
         [
            262144,
            [
               33209,
               56750,
               58227,
               68435,
               92612,
               156250,
               180535,
               194536,
               222453,
 

---