<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Use Spark and Python to Predict Equipment Purchase</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
   <tr style="border: none">
       <th style="border: none"><img src="https://github.com/pmservice/wml-sample-models/blob/master/spark/product-line-prediction/images/products_graphics.png?raw=true" alt="Icon"> </th>
   </tr>
</table>

This notebook demonstrates how to analyze a classification problem using the <a href="http://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html" target="_blank" rel="noopener no referrer">PySpark ML package</a>.

Some familiarity with Python is helpful. This notebook is compatible with Python 3.6 and Spark 2.x.

You will use a publicly available data set, **GoSales Transactions for Naive Bayes Model**, which details anonymous outdoor equipment purchases. This data set will be used to predict clients' interests in terms of product line, such as golf accessories, camping equipment, and so forth.

**Note**: In this notebook, we use the GoSales data available to the <a  href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/8044492073eb964f46597b4be06ff5ea" target="_blank" rel="noopener no referrer">Watson Studio Community</a>.

## Learning goals

You will learn how to:

-  Load a CSV file into a Spark DataFrame.
-  Explore data.
-  Prepare data for training and evaluation.
-  Create a Spark machine learning pipeline.
-  Train and evaluate a model.
-  Store a pipeline and model in the Watson Machine Learning (WML) repository.
-  Deploy a model for online scoring via the Watson Machine Learning (WML) API.
-  Score the model using sample data via the Watson Machine Learning (WML) API.
-  Explore and visualize the prediction results using the plotly package.


## Contents

This notebook contains the following parts:

1.	[Set up the environment](#setup)
2.	[Load and explore the data](#load)
3.	[Build a Spark machine learning model](#model)
4.	[Store the model in the WML repository](#persistence)
5.	[Predict locally and visualize](#visualization)
6.	[Deploy and score in a Cloud](#scoring)
7.	[Summary and next steps](#summary)

<a id="setup"></a>
## 1. Set up the environment

Before you use the sample code in this notebook, you must perform the following setup tasks:

-  Create a <a href="https://cloud.ibm.com/catalog/services/machine-learning" target="_blank" rel="noopener no referrer">Watson Machine Learning (WML) Service</a> instance (a lite plan is offered and information about how to create the instance can be found <a href="https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html" target="_blank" rel="noopener no referrer">here</a>).
-  Make sure that you are using a Spark 2.x kernel.
-  Download **GoSales Transactions** from the Watson Studio Community (code provided below).

<a id="load"></a>
## 2. Load and explore the data

In this section, you will load the data as a Spark DataFrame and explore the data.

Use `wget` to download the data to the IBM General Parallel File System (GPFS), then use the Spark `read` method to load it into a Spark DataFrame.

In [None]:
# Install wget if you don't already have it installed.
!pip install --upgrade wget

Import the data link. To get the data link:
1. Select the **GoSales Transactions for Naive Bayes Model** from the Watson Studio community.
2. Click the **Data Access Link**, then copy the link information.
3. Paste the link information in `link_to_data` in the cell below.


In [2]:
import wget

link_to_data = 'Enter data link here'
filename = wget.download(link_to_data)

print(filename)

GoSales_Tx_NaiveBayes (1).csv


In [3]:
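from pyspark.sql import SparkSession

# Reuse the active Spark session, or create one if none exists.
spark = SparkSession.builder.getOrCreate()

# Read the CSV file: the first line holds column names, and column types are inferred.
df = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load(filename)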
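
To make the `header` and `inferSchema` options more concrete, here is a minimal stand-alone sketch that uses only Python's `csv` module on three rows from this data set. It is not how Spark parses files. The `infer` helper is a hypothetical stand-in for schema inference that only tries `int`.

```python
import csv
import io

# Three rows in the same layout as GoSales_Tx_NaiveBayes.csv (header + data).
sample = """PRODUCT_LINE,GENDER,AGE,MARITAL_STATUS,PROFESSION
Outdoor Protection,F,49,Married,Other
Camping Equipment,M,33,Single,Other
Camping Equipment,M,37,Married,Trades
"""

def infer(value):
    """Hypothetical stand-in for schema inference: try int, else keep the string."""
    try:
        return int(value)
    except ValueError:
        return value

# DictReader treats the first line as column names, like option('header', 'true').
rows = [{name: infer(value) for name, value in record.items()}
        for record in csv.DictReader(io.StringIO(sample))]

print(rows[0])
```

Without the type-inference step, every field, including `AGE`, would stay a string.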

The CSV file, **GoSales_Tx_NaiveBayes.csv**, is now available in the IBM General Parallel File System (GPFS), which serves as your local file system, and has been loaded into a Spark DataFrame. Display the DataFrame using `pixiedust`.  
`pixiedust` is an open-source Python helper library that works as an add-on to Jupyter notebooks to improve the user experience of working with data.  
`pixiedust` documentation/code can be found <a href="https://github.com/pixiedust/pixiedust" target="_blank" rel="noopener no referrer">here</a>. 

In [None]:
!pip install --upgrade pixiedust

In [5]:
import pixiedust

Pixiedust database opened successfully
Table VERSION_TRACKER created successfully
Table METRICS_TRACKER created successfully

Share anonymous install statistics? (opt-out instructions)

PixieDust will record metadata on its environment the next time the package is installed or updated. The data is anonymized and aggregated to help plan for future releases, and records only the following values:

{
   "data_sent": currentDate,
   "runtime": "python",
   "application_version": currentPixiedustVersion,
   "space_id": nonIdentifyingUniqueId,
   "config": {
       "repository_id": "https://github.com/ibm-watson-data-lab/pixiedust",
       "target_runtimes": ["Data Science Experience"],
       "event_id": "web",
       "event_organizer": "dev-journeys"
   }
}
You can opt out by calling pixiedust.optOut() in a new cell.


Pixiedust runtime updated. Please restart kernel
Table SPARK_PACKAGES created successfully
Table USER_PREFERENCES created successfully
Table service_connections created successfully


In [6]:
pixiedust.optOut()

Pixiedust will not collect anonymous install statistics.


In [None]:
display(df)

PRODUCT_LINE,GENDER,AGE,MARITAL_STATUS,PROFESSION
Outdoor Protection,F,49,Married,Other
Camping Equipment,M,33,Single,Other
Camping Equipment,M,37,Married,Trades
Mountaineering Equipment,M,35,Married,Executive
Camping Equipment,M,35,Married,Other
Personal Accessories,F,28,Single,Professional
Camping Equipment,M,24,Single,Other
Mountaineering Equipment,F,27,Single,Student
Camping Equipment,F,25,Single,Other
Camping Equipment,F,25,Single,Other


As you can see, the data contains five fields. `PRODUCT_LINE` is the one you would like to predict (the label); the other four are features (predictors).
You can check the schema of the DataFrame by clicking the `Schema` panel.


`brunel` provides a succinct language for defining interactive data visualizations based on tabular data.  
`brunel` documentation/code can be found <a href="https://github.com/Brunel-Visualization/Brunel" target="_blank" rel="noopener no referrer">here</a>. 

In [None]:
!pip install --upgrade brunel

You have to convert the PySpark DataFrame into a Pandas DataFrame first in order to pass it to `brunel`.

In [9]:
df_pd = df.toPandas()

The following cells plot four bar charts with `brunel`. Zooming in and out is supported.

In [10]:
%brunel data('df_pd') bar x(GENDER) y(#count)

<IPython.core.display.Javascript object>

In [11]:
%brunel data('df_pd') bar x(MARITAL_STATUS) y(#count)

<IPython.core.display.Javascript object>

In [12]:
%brunel data('df_pd') bar x(PROFESSION) y(#count)

<IPython.core.display.Javascript object>

In [13]:
%brunel data('df_pd') bar x(PRODUCT_LINE) y(#count)

<IPython.core.display.Javascript object>

Since three of the predictors (`GENDER`, `MARITAL_STATUS`, and `PROFESSION`) and the target are categorical, you can perform chi-squared tests on them. A chi-squared test of independence can be performed when both the predictor and the target (label) are categorical; its goal is to assess the relationship between two categorical variables.

You will use `scipy.stats` module for the chi-squared test.

In [14]:
from scipy import stats
import pandas as pd

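The `chisquare` function performs a one-way chi-squared goodness-of-fit test, by default against equal expected frequencies for all categories, and returns the test statistic and the p-value.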

In [15]:
stats.chisquare(df_pd['GENDER'].value_counts())

Power_divergenceResult(statistic=99.78596561110005, pvalue=1.6978915791618042e-23)

In [16]:
stats.chisquare(df_pd['MARITAL_STATUS'].value_counts())

Power_divergenceResult(statistic=18131.09191396136, pvalue=0.0)

In [17]:
stats.chisquare(df_pd['PROFESSION'].value_counts())

Power_divergenceResult(statistic=59934.1604262099, pvalue=0.0)

In [18]:
stats.chisquare(df_pd['PRODUCT_LINE'].value_counts())

Power_divergenceResult(statistic=24592.717685719977, pvalue=0.0)
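
The goodness-of-fit statistic is simple enough to check by hand. The sketch below reproduces the value reported for `GENDER` using plain Python. The two observed counts are the column sums of the `GENDER` by `PRODUCT_LINE` cross-tabulation shown later in this notebook, and the uniform expectation mirrors the default behavior of `chisquare`.

```python
# Observed GENDER counts (column sums of the GENDER x PRODUCT_LINE cross-tabulation).
observed = {'F': 28900, 'M': 31352}

total = sum(observed.values())
# Under the default null hypothesis every category is equally likely.
expected = total / len(observed)

# Chi-squared statistic: sum over categories of (observed - expected)^2 / expected.
statistic = sum((o - expected) ** 2 / expected for o in observed.values())

print(total, expected, statistic)
```

The statistic comes out at roughly 99.79, in line with the `chisquare` output for `GENDER`. A large value here only says that the two genders are not equally frequent; it says nothing yet about the relationship with `PRODUCT_LINE`.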

Let's create a cross-tabulation matrix (contingency table) of each categorical predictor against the target and get the chi-squared test results.

In [19]:
# Product lines in alphabetical order, matching the row order produced by pd.crosstab.
target_classes = ['Camping Equipment', 'Golf Equipment', 'Mountaineering Equipment', 'Outdoor Protection', 'Personal Accessories']

Cross-tabulation matrix for predictor `GENDER` and target `PRODUCT_LINE`.

In [20]:
cont_gender = pd.crosstab(df_pd['PRODUCT_LINE'], df_pd['GENDER'])

In [21]:
cont_gender_df = cont_gender
cont_gender_df.index = target_classes
cont_gender_df.index.name = 'PRODUCT_LINE'

In [22]:
cont_gender_df

PRODUCT_LINE \ GENDER,F,M
Camping Equipment,9398,14658
Golf Equipment,2247,4214
Mountaineering Equipment,3379,6635
Outdoor Protection,1917,621
Personal Accessories,11959,5224


The first value in the output of the `chi2_contingency` function is the chi-squared test statistic, the second value is the p-value, the third value is the degrees of freedom, and the last value is the table of expected frequencies.

In [23]:
stats.chi2_contingency(cont_gender)

(6019.443820817951, 0.0, 4, array([[11538.51158468, 12517.48841532],
        [ 3099.03239726,  3361.96760274],
        [ 4803.23640709,  5210.76359291],
        [ 1217.35710018,  1320.64289982],
        [ 8241.86251079,  8941.13748921]]))

Using `stats.chi2_contingency`, you can test whether two categorical variables, here a predictor $A$ and the target $B$, are independent.

$H_{0}$ (null hypothesis): $A$ and $B$ are independent.  
$H_{1}$ (alternative hypothesis): $A$ and $B$ are dependent.

If $p$ < $0.05$, reject $H_{0}$ and conclude that $A$ and $B$ are dependent; otherwise, there is not enough evidence to reject independence.

Since the reported $p$-value is $0.0$ (too small to represent in floating point), $H_{0}$ is rejected: `GENDER` and `PRODUCT_LINE` are dependent.
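
The figures returned by `chi2_contingency` for `GENDER` can be reproduced by hand. The following sketch, in plain Python, computes the expected counts under independence, the test statistic, and the degrees of freedom from the observed cross-tabulation. It skips the p-value, which would need the chi-squared distribution function.

```python
# Observed counts (rows: product lines, columns: F, M) from the cross-tabulation.
observed = [
    [9398, 14658],   # Camping Equipment
    [2247, 4214],    # Golf Equipment
    [3379, 6635],    # Mountaineering Equipment
    [1917, 621],     # Outdoor Protection
    [11959, 5224],   # Personal Accessories
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Sum (observed - expected)^2 / expected over every cell of the table.
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
dof = (len(row_totals) - 1) * (len(col_totals) - 1)

print(statistic, dof, expected[0])
```

The largest contributions come from `Personal Accessories`, where far more women and far fewer men appear than independence would predict.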

Cross-tabulation matrix for predictor `MARITAL_STATUS` and target `PRODUCT_LINE`.

In [24]:
cont_marital = pd.crosstab(df_pd['PRODUCT_LINE'], df_pd['MARITAL_STATUS'])
cont_marital_df = cont_marital
cont_marital_df.index = target_classes
cont_marital_df.index.name = 'PRODUCT_LINE'

In [25]:
cont_marital_df

PRODUCT_LINE \ MARITAL_STATUS,Married,Single,Unspecified
Camping Equipment,14293,8243,1520
Golf Equipment,4833,425,1203
Mountaineering Equipment,2757,6593,664
Outdoor Protection,1676,577,285
Personal Accessories,7220,8711,1252


As with `GENDER`, `chi2_contingency` returns the chi-squared test statistic, the p-value, the degrees of freedom, and the table of expected frequencies, in that order.

In [26]:
stats.chi2_contingency(cont_marital)

(7818.266841594747,
 0.0,
 8,
 array([[12288.71446591,  9801.34674368,  1965.93879041],
        [ 3300.52311956,  2632.4618104 ,   528.01507004],
        [ 5115.52987453,  4080.0917148 ,   818.37841068],
        [ 1296.50637323,  1034.07956582,   207.41406094],
        [ 8777.72616677,  7001.02016531,  1404.25366793]]))

Using `stats.chi2_contingency`, you can test whether two categorical variables, here a predictor $A$ and the target $B$, are independent.

$H_{0}$ (null hypothesis): $A$ and $B$ are independent.  
$H_{1}$ (alternative hypothesis): $A$ and $B$ are dependent.

If $p$ < $0.05$, reject $H_{0}$ and conclude that $A$ and $B$ are dependent; otherwise, there is not enough evidence to reject independence.

Since the reported $p$-value is $0.0$, $H_{0}$ is rejected: `MARITAL_STATUS` and `PRODUCT_LINE` are dependent.

Cross-tabulation matrix for predictor `PROFESSION` and target `PRODUCT_LINE`.

In [27]:
cont_profession = pd.crosstab(df_pd['PRODUCT_LINE'], df_pd['PROFESSION'])
cont_profession_df = cont_profession
cont_profession_df.index = target_classes
cont_profession_df.index.name = 'PRODUCT_LINE'

In [28]:
cont_profession_df

PRODUCT_LINE \ PROFESSION,Executive,Hospitality,Other,Professional,Retail,Retired,Sales,Student,Trades
Camping Equipment,3764,1967,9650,1861,619,30,3440,503,2222
Golf Equipment,175,458,3462,1022,21,442,356,54,471
Mountaineering Equipment,496,83,4029,2133,448,14,1264,812,735
Outdoor Protection,292,301,1136,135,179,123,79,220,73
Personal Accessories,1144,502,6226,3787,1518,574,1569,1356,507


As with `GENDER`, `chi2_contingency` returns the chi-squared test statistic, the p-value, the degrees of freedom, and the table of expected frequencies, in that order.

In [29]:
stats.chi2_contingency(cont_profession)

(10261.273753141253,
 0.0,
 32,
 array([[2344.03465445, 1321.93812654, 9782.98094669, 3568.55420567,
         1111.92923056,  472.3203877 , 2678.21230831, 1175.81026356,
         1600.21987652],
        [ 629.56467835,  355.04831375, 2627.52909447,  958.4481511 ,
          298.64377946,  126.85658567,  719.31866162,  315.80105225,
          429.78968333],
        [ 975.77165903,  550.29466242, 4072.44642501, 1485.51304521,
          462.87243577,  196.61690898, 1114.88269269,  489.46474806,
          666.13742282],
        [ 247.30462059,  139.46952798, 1032.141904  ,  376.49611631,
          117.3127863 ,   49.83160725,  282.56164111,  124.05247959,
          168.82931687],
        [1674.32438757,  944.24936932, 6987.90162982, 2548.98848171,
          794.24176791,  337.37451039, 1913.02469628,  839.87145655,
         1143.02370046]]))

Using `stats.chi2_contingency`, you can test whether two categorical variables, here a predictor $A$ and the target $B$, are independent.

$H_{0}$ (null hypothesis): $A$ and $B$ are independent.  
$H_{1}$ (alternative hypothesis): $A$ and $B$ are dependent.

If $p$ < $0.05$, reject $H_{0}$ and conclude that $A$ and $B$ are dependent; otherwise, there is not enough evidence to reject independence.

Since the reported $p$-value is $0.0$, $H_{0}$ is rejected: `PROFESSION` and `PRODUCT_LINE` are dependent.

<a id="model"></a>
## 3. Build a Spark machine learning model

In this section, you will learn how to:

- [3.1 Split data](#prep)
- [3.2 Build a Spark machine learning pipeline](#pipe)
- [3.3 Train a model](#train)

### 3.1 Split data<a id="prep"></a>

In this subsection, you will split your data into: 
- Train data set
- Test data set
- Prediction data set

In [30]:
split_data = df.randomSplit([0.8, 0.18, 0.02], 24)
train_data = split_data[0]
test_data = split_data[1]
predict_data = split_data[2]

print('Number of training records: ' + str(train_data.count()))
print('Number of testing records : ' + str(test_data.count()))
print('Number of prediction records : ' + str(predict_data.count()))

Number of training records: 48176
Number of testing records : 10860
Number of prediction records : 1216


As you can see, your data has been successfully split into three data sets: 

-  The train data set, which is the largest group, is used for training.
-  The test data set will be used for model evaluation and is used to test the assumptions of the model.
-  The prediction data set will be used for prediction.
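
The record counts above are close to, but not exactly, 80%, 18%, and 2% of the data. A weighted random split typically assigns each row independently, so the sizes fluctuate around the target proportions. The sketch below illustrates that idea in plain Python. The `random_split` function is a hypothetical illustration, not Spark's implementation, and it will not reproduce the exact counts shown above.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one bucket with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative, normalized boundaries, for example [0.8, 0.98, 1.0].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last bucket also catches draws above a rounded-down final bound.
            if draw < bound or i == len(bounds) - 1:
                buckets[i].append(row)
                break
    return buckets

# 60252 is the total number of records in the three data sets above.
train, test, predict = random_split(range(60252), [0.8, 0.18, 0.02], seed=24)
print(len(train), len(test), len(predict))
```

Fixing the seed makes the split repeatable, which is why the notebook passes `24` as the second argument to `randomSplit`.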

### 3.2 Create the pipeline<a id="pipe"></a>

In this subsection, you will create a Spark machine learning pipeline and train the model.

In the first step, you need to import the Spark machine learning modules that will be needed in the subsequent steps.

In [31]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

In the following step, use `StringIndexer` to convert each string field into a numeric index.

In [32]:
stringIndexer_label = StringIndexer(inputCol='PRODUCT_LINE', outputCol='label').fit(df)
stringIndexer_prof = StringIndexer(inputCol='PROFESSION', outputCol='PROFESSION_IX')
stringIndexer_gend = StringIndexer(inputCol='GENDER', outputCol='GENDER_IX')
stringIndexer_mar = StringIndexer(inputCol='MARITAL_STATUS', outputCol='MARITAL_STATUS_IX')
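
By default, `StringIndexer` orders labels by descending frequency, so the most common value receives index `0.0`. The following plain-Python sketch mimics that behavior on a toy column. The `fit_string_index` function is a hypothetical helper for illustration only, and it does not address how ties between equally frequent labels are resolved.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent value first."""
    counts = Counter(values)
    ordered = [label for label, _ in counts.most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

# Toy column in which M occurs more often than F, as in this data set.
gender = ['M', 'F', 'M', 'M', 'F']
index = fit_string_index(gender)

print(index)
print([index[value] for value in gender])
```

Because the indices depend on the frequencies in the data used for fitting, the label indexer in the cell above is fitted once on the full DataFrame so that its `labels` list can later be used to convert predictions back to product-line names.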

In the following step, create a feature vector to combine all features (predictors) together.

In [33]:
vectorAssembler_features = VectorAssembler(inputCols=['GENDER_IX', 'AGE', 'MARITAL_STATUS_IX', 'PROFESSION_IX'], outputCol='features')

Next, select the estimator you want to use for classification. `Random Forest` is used in this example.

In [34]:
rf = RandomForestClassifier(labelCol='label', featuresCol='features')

Finally, convert the indexed labels back to original labels.

In [35]:
labelConverter = IndexToString(inputCol='prediction', outputCol='predictedLabel', labels=stringIndexer_label.labels)

Now build the pipeline. A pipeline consists of transformers and an estimator.

In [36]:
pipeline_rf = Pipeline(stages=[stringIndexer_label, stringIndexer_prof, stringIndexer_gend, stringIndexer_mar, vectorAssembler_features, rf, labelConverter])

### 3.3 Train a model<a id="train"></a>

Now, you can train your Random Forest model by using the previously defined **pipeline** and **train data**.

In [None]:
display(train_data)

PRODUCT_LINE,GENDER,AGE,MARITAL_STATUS,PROFESSION
Camping Equipment,F,19,Single,Hospitality
Camping Equipment,F,19,Single,Other
Camping Equipment,F,20,Married,Retail
Camping Equipment,F,20,Single,Other
Camping Equipment,F,21,Single,Other
Camping Equipment,F,22,Single,Other
Camping Equipment,F,22,Single,Other
Camping Equipment,F,22,Single,Other
Camping Equipment,F,22,Single,Retail
Camping Equipment,F,22,Single,Student


In order to train the `Random Forest` model, run the following cell.

In [38]:
model_rf = pipeline_rf.fit(train_data)

You can check your **model accuracy** now. Use **test data** to evaluate the model.

In [39]:
predictions = model_rf.transform(test_data)
evaluatorRF = MulticlassClassificationEvaluator(labelCol='label', predictionCol='prediction', metricName='accuracy')
accuracy = evaluatorRF.evaluate(predictions)

print('Accuracy = {:.2f}%'.format(accuracy*100))
print('Test Error = {:.2f}%'.format((1.0 - accuracy)*100))

Accuracy = 58.58%
Test Error = 41.42%
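
For reference, the accuracy metric is the fraction of records whose predicted index equals the true label index, and the test error is its complement. The sketch below computes both in plain Python. The five label and prediction values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def accuracy(labels, predictions):
    """Fraction of positions where the predicted index equals the true index."""
    if len(labels) != len(predictions):
        raise ValueError('labels and predictions must have the same length')
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

# Hypothetical indexed labels and predictions for five records.
labels = [0.0, 0.0, 1.0, 3.0, 1.0]
predictions = [0.0, 1.0, 1.0, 0.0, 1.0]

acc = accuracy(labels, predictions)
print('Accuracy = {:.2f}%'.format(acc * 100))
print('Test Error = {:.2f}%'.format((1.0 - acc) * 100))
```

Accuracy treats all classes alike, so with unevenly sized product lines it can hide weak performance on the rarer ones.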


You can tune your model to achieve better accuracy. For simplicity, the tuning step is omitted in this example.

<a id="persistence"></a>
## 4. Store the model in the WML repository

In this section, you will learn how to use the `watson-machine-learning-client` package to store your pipeline and model in the WML repository.

- [4.1 Install required package](#lib)
- [4.2 Save pipeline and model](#save)
- [4.3 Load the model](#load)

### 4.1 Install required package<a id="lib"></a>

**Note**: Python 3.6 and Spark version >= 2.3 are required.

In [40]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

Authenticate the Watson Machine Learning service on the IBM Cloud.

**Tip**: Authentication information (your credentials) can be found in the <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-get-wml-credentials.html" target="_blank" rel="noopener no referrer">Service credentials</a> tab of the service instance that you created on the IBM Cloud. 

If you cannot find the **instance_id** field in **Service Credentials**, click **New credential (+)** to generate new authentication information. 

**Action**: Enter your Watson Machine Learning service instance credentials here.

In [41]:
wml_credentials = {
    "apikey": "***",
    "instance_id": "***",
    "password": "***",
    "url": "https://ibm-watson-ml.mybluemix.net",
    "username": "***"
}
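
If you prefer not to paste credentials into a notebook cell, one option is to read them from environment variables. The sketch below is only an illustration of that idea: the `WML_*` variable names and the `load_wml_credentials` helper are assumptions made for this example, not part of the Watson Machine Learning client.

```python
import os

# Keys expected in the credentials dictionary used in this notebook.
CREDENTIAL_KEYS = ('apikey', 'instance_id', 'password', 'url', 'username')

def load_wml_credentials(env=None):
    """Build the credentials dict from WML_* variables (hypothetical names)."""
    env = os.environ if env is None else env
    missing = [key for key in CREDENTIAL_KEYS if 'WML_' + key.upper() not in env]
    if missing:
        raise KeyError('Missing environment variables for: ' + ', '.join(missing))
    return {key: env['WML_' + key.upper()] for key in CREDENTIAL_KEYS}

# Demonstration with a plain dict standing in for os.environ.
fake_env = {'WML_' + key.upper(): 'value-for-' + key for key in CREDENTIAL_KEYS}
wml_credentials_from_env = load_wml_credentials(fake_env)
print(sorted(wml_credentials_from_env))
```

The resulting dictionary has the same keys as `wml_credentials`, so it could be passed to the client in the same way.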

In [43]:
client = WatsonMachineLearningAPIClient(wml_credentials)

### 4.2 Save the pipeline and model<a id="save"></a>

In this subsection, you will learn how to save pipeline and model artifacts to your Watson Machine Learning instance.

In [None]:
published_model_details = client.repository.store_model(model=model_rf, meta_props={'name':'Product line model'}, training_data=train_data, pipeline=pipeline_rf)

In [45]:
model_uid = client.repository.get_model_uid(published_model_details)
print(model_uid)

c2674950-55eb-4aaa-92b6-829293c40533


Get saved model metadata from Watson Machine Learning.

**Tip**: Use `client.repository.ModelMetaNames.show()` to get the list of available props.

In [46]:
client.repository.ModelMetaNames.show()

-----------------------  ----  --------
META_PROP NAME           TYPE  REQUIRED
NAME                     str   Y
DESCRIPTION              str   N
AUTHOR_NAME              str   N
FRAMEWORK_NAME           str   N
FRAMEWORK_VERSION        str   N
FRAMEWORK_LIBRARIES      list  N
RUNTIME_NAME             str   N
RUNTIME_VERSION          str   N
TRAINING_DATA_SCHEMA     dict  N
INPUT_DATA_SCHEMA        dict  N
TRAINING_DATA_REFERENCE  dict  N
EVALUATION_METHOD        str   N
EVALUATION_METRICS       list  N
OUTPUT_DATA_SCHEMA       dict  N
LABEL_FIELD              str   N
TRANSFORMED_LABEL_FIELD  str   N
RUNTIME_UID              str   N
TRAINING_DEFINITION_URL  str   N
-----------------------  ----  --------


### 4.3 Load the model<a id="load"></a>

In this subsection, you will learn how to load a saved model from the specified Watson Machine Learning instance.

In [47]:
loaded_model = client.repository.load(model_uid)

You can print the model type to make sure that the model has been loaded correctly.

In [48]:
print(type(loaded_model))

<class 'pyspark.ml.pipeline.PipelineModel'>


As you can see, the loaded object is a `PipelineModel`, as expected.

<a id="visualization"></a>
## 5. Predict locally and visualize prediction results

In this section, you will learn how to score the loaded model using test data and visualize the prediction results with the Plotly package.

- [5.1 Make a local prediction using previously loaded model and test data](#local)
- [5.2 Use Plotly to visualize data](#plotly)

### 5.1 Make a local prediction using previously loaded model and test data<a id="local"></a>

In this subsection, you will score the model with the *predict_data* data set.

In [49]:
predictions = loaded_model.transform(predict_data)

Preview the predictions DataFrame via `pixiedust`.

In [None]:
display(predictions)

PRODUCT_LINE,GENDER,AGE,MARITAL_STATUS,PROFESSION,label,PROFESSION_IX,GENDER_IX,MARITAL_STATUS_IX,features,rawPrediction,probability,prediction,predictedLabel
Camping Equipment,F,20,Single,Other,0.0,0.0,1.0,1.0,"[1.0,20.0,1.0,0.0]","[5.509971379543874,9.932445048374051,3.882448513724272,0.24210362834569382,0.43303143001210975]","[0.2754985689771937,0.49662225241870256,0.1941224256862136,0.01210518141728469,0.02165157150060549]",1.0,Personal Accessories
Camping Equipment,F,22,Single,Hospitality,0.0,5.0,1.0,1.0,"[1.0,22.0,1.0,5.0]","[12.877234637010023,4.845533467611932,1.5167842751902962,0.3953789653889419,0.36506865479880696]","[0.643861731850501,0.24227667338059655,0.0758392137595148,0.01976894826944709,0.018253432739940345]",0.0,Camping Equipment
Camping Equipment,F,24,Single,Retail,0.0,7.0,1.0,1.0,"[1.0,24.0,1.0,7.0]","[2.8665285541871945,14.22005860476708,2.252511197964372,0.2121665384340576,0.44873510464729693]","[0.1433264277093597,0.7110029302383539,0.11262555989821858,0.010608326921702878,0.022436755232364842]",1.0,Personal Accessories
Camping Equipment,F,25,Single,Other,0.0,0.0,1.0,1.0,"[1.0,25.0,1.0,0.0]","[5.485362805161418,9.898660887852433,3.9439604608798824,0.24589679734979636,0.426119048756471]","[0.27426814025807084,0.49493304439262154,0.1971980230439941,0.012294839867489816,0.021305952437823548]",1.0,Personal Accessories
Camping Equipment,F,26,Married,Other,0.0,0.0,1.0,0.0,"[1.0,26.0,0.0,0.0]","[6.766698963425116,5.84414318442204,4.24008335358142,1.234732484621389,1.9143420139500358]","[0.3383349481712557,0.29220715922110196,0.212004167679071,0.06173662423106944,0.09571710069750178]",0.0,Camping Equipment
Camping Equipment,F,26,Single,Professional,0.0,1.0,1.0,1.0,"[1.0,26.0,1.0,1.0]","[2.8995171595105704,12.722261667400003,3.7588870395863205,0.23213410268653276,0.3872000308165724]","[0.14497585797552853,0.6361130833700002,0.18794435197931603,0.011606705134326639,0.01936000154082862]",1.0,Personal Accessories
Camping Equipment,F,33,Married,Other,0.0,0.0,1.0,0.0,"[1.0,33.0,0.0,0.0]","[9.266307608035678,4.263733295159572,2.1631987892878124,2.5269023375174995,1.7798579699994401]","[0.4633153804017839,0.2131866647579786,0.10815993946439062,0.12634511687587496,0.08899289849997201]",0.0,Camping Equipment
Camping Equipment,F,35,Single,Retail,0.0,7.0,1.0,1.0,"[1.0,35.0,1.0,7.0]","[3.4458574468443306,13.924699345461564,1.094231184207484,0.35830447289014233,1.17690755059648]","[0.17229287234221652,0.6962349672730782,0.054711559210374204,0.017915223644507115,0.058845377529823995]",1.0,Personal Accessories
Camping Equipment,F,37,Married,Other,0.0,0.0,1.0,0.0,"[1.0,37.0,0.0,0.0]","[9.266307608035678,4.263733295159572,2.1631987892878124,2.5269023375174995,1.7798579699994401]","[0.4633153804017839,0.2131866647579786,0.10815993946439062,0.12634511687587496,0.08899289849997201]",0.0,Camping Equipment
Camping Equipment,F,37,Single,Retail,0.0,7.0,1.0,1.0,"[1.0,37.0,1.0,7.0]","[3.304443305430189,14.045189965952185,1.094231184207484,0.3922150068006762,1.163920537609467]","[0.16522216527150946,0.7022594982976093,0.054711559210374204,0.01961075034003381,0.05819602688047335]",1.0,Personal Accessories
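
In the rows shown, the `probability` vector appears to be the `rawPrediction` vector divided by its sum, and `prediction` is the index of the largest probability. The sketch below checks this for the first row using plain Python. It is an observation drawn from this output, not a description of Spark's internals.

```python
# rawPrediction from the first row of the predictions preview.
raw_prediction = [5.509971379543874, 9.932445048374051, 3.882448513724272,
                  0.24210362834569382, 0.43303143001210975]

# Normalize the raw scores so that they sum to 1.
total = sum(raw_prediction)
probability = [value / total for value in raw_prediction]

# The predicted index is the position of the largest probability.
prediction = max(range(len(probability)), key=probability.__getitem__)

print(total, probability, prediction)
```

Index `1` corresponds to `Personal Accessories` in this output, which matches the `predictedLabel` column for that row.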


By tabulating a count, you can see which product line is the most popular.

In [None]:
display(predictions.select('predictedLabel').groupBy('predictedLabel').count())

predictedLabel,count
Camping Equipment,782
Golf Equipment,70
Mountaineering Equipment,48
Personal Accessories,316
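
To put these counts in proportion, the following plain-Python sketch converts them into shares of all predicted records. Note that only four of the five product lines appear: `Outdoor Protection` was not predicted for any record in this data set.

```python
# Predicted-label counts from the table above.
counts = {
    'Camping Equipment': 782,
    'Golf Equipment': 70,
    'Mountaineering Equipment': 48,
    'Personal Accessories': 316,
}

total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}
most_popular = max(counts, key=counts.get)

# Print the product lines from the largest share to the smallest.
for label, share in sorted(shares.items(), key=lambda item: -item[1]):
    print('{:<26}{:>6.1%}'.format(label, share))
```

The total of 1216 equals the number of prediction records reported when the data was split.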


### 5.2 Use Plotly to visualize data <a id="plotly"></a>

In this subsection, you will use the Plotly package to explore the prediction results. Plotly is an online analytics and data visualization tool.

First, you need to install the required packages. You can do it by running the following code. Run it once only.

In [None]:
!pip install --upgrade plotly
# !pip install cufflinks==0.8.2

Import Plotly and the other required packages.

In [53]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
# import cufflinks as cf
# import plotly.graph_objs as go
from plotly.graph_objs import Layout, Figure, Pie, Bar
# import plotly.plotly as py
import os  # needed for os.environ below
import sys

# Enable offline Plotly rendering inside the notebook.
init_notebook_mode(connected=True)
# Make modules in the home directory importable.
sys.path.append(os.environ['HOME'])

Convert the Spark DataFrame to a Pandas DataFrame.

In [54]:
predictions_pdf = predictions.select('prediction', 'predictedLabel', 'GENDER', 'AGE', 'PROFESSION', 'MARITAL_STATUS').toPandas()

Plot a pie chart that shows the predicted product-line interest.

In [55]:
cumulative_stats = predictions_pdf.groupby(['predictedLabel']).count()
product_data = [Pie(labels=cumulative_stats.index, values=cumulative_stats['GENDER'])]
product_layout = Layout(title='Predicted product line client interest distribution')

fig = Figure(data=product_data, layout=product_layout)
iplot(fig)

With this data set, perform some analysis of the mean AGE per product line by using a bar chart.

In [56]:
# Select AGE before aggregating so that non-numeric columns are not averaged.
mean_age = predictions_pdf.groupby('predictedLabel')['AGE'].mean()
age_data = [Bar(y=mean_age, x=mean_age.index)]

age_layout = Layout(
    title='Mean AGE per predicted product line',
    xaxis=dict(title = 'Product Line', showline=False),
    yaxis=dict(title = 'Mean AGE'))

fig = Figure(data=age_data, layout=age_layout)
iplot(fig)

Based on the bar plot you created, the following conclusion can be reached: The mean age of clients that are interested in golf equipment is predicted to be over 50 years old.

<a id="scoring"></a>
## 6. Deploy and score in the WML repository

In this section, you will learn how to create an online deployment, create an online scoring endpoint, and score a new data record using the `watson-machine-learning-client` package.

**Note:** You can also use the REST API to deploy and score.
For more information about REST APIs, see the <a href="http://watson-ml-api.mybluemix.net/" target="_blank" rel="noopener noreferrer">Swagger Documentation</a>.

#### Create an online deployment for the published model.

In [57]:
deployment_details = client.deployments.create(model_uid, name='Product line model deployment')



#######################################################################################

Synchronous deployment creation for uid: 'c2674950-55eb-4aaa-92b6-829293c40533' started

#######################################################################################


INITIALIZING
DEPLOY_SUCCESS


------------------------------------------------------------------------------------------------
Successfully finished deployment creation, deployment_uid='1a679262-48aa-4213-8184-349f7aa60b20'
------------------------------------------------------------------------------------------------




#### Create an online scoring endpoint. 

In [58]:
scoring_url = client.deployments.get_scoring_url(deployment_details)

Now, you can send new scoring records (new data) for which you would like to get predictions. To do that, run the following sample code: 

In [59]:
payload_scoring = {'fields': ['GENDER','AGE','MARITAL_STATUS','PROFESSION'],'values': [['M',23,'Single','Student'],['M',55,'Single','Executive']]}

client.deployments.score(scoring_url, payload_scoring)

{'fields': ['GENDER',
  'AGE',
  'MARITAL_STATUS',
  'PROFESSION',
  'PRODUCT_LINE',
  'label',
  'PROFESSION_IX',
  'GENDER_IX',
  'MARITAL_STATUS_IX',
  'features',
  'rawPrediction',
  'probability',
  'prediction',
  'predictedLabel'],
 'values': [['M',
   23,
   'Single',
   'Student',
   'Camping Equipment',
   0.0,
   6.0,
   0.0,
   1.0,
   [0.0, 23.0, 1.0, 6.0],
   [5.570605067417983,
    6.7285830309330175,
    5.782009212142643,
    0.1766529669798611,
    1.742149722526497],
   [0.2785302533708991,
    0.3364291515466508,
    0.2891004606071321,
    0.008832648348993053,
    0.08710748612632484],
   1.0,
   'Personal Accessories'],
  ['M',
   55,
   'Single',
   'Executive',
   'Camping Equipment',
   0.0,
   3.0,
   0.0,
   1.0,
   [0.0, 55.0, 1.0, 3.0],
   [2.632879457632312,
    4.479278937861745,
    2.7938862335667167,
    10.010685179962001,
    0.08327019097722486],
   [0.1316439728816156,
    0.22396394689308724,
    0.13969431167833585,
    0.5005342589981001,
    

As you can see, a single 23-year-old male student is predicted to be interested in personal accessories (predictedLabel: Personal Accessories, prediction: 1.0). You can also see that a single 55-year-old male executive is predicted to be interested in golf equipment.
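
Both the scoring payload and the response pair a `fields` list with rows of `values`. The sketch below shows one way to build such a payload from dictionaries and to turn a response back into dictionaries. The `build_payload` and `rows_as_dicts` helpers are hypothetical conveniences, not part of the client, and the response here is abbreviated and written by hand in the same layout as the output above.

```python
def build_payload(records, fields):
    """Turn a list of dicts into the {'fields': [...], 'values': [[...]]} layout."""
    return {'fields': list(fields),
            'values': [[record[name] for name in fields] for record in records]}

def rows_as_dicts(response):
    """Pair every row of 'values' with the names in 'fields'."""
    return [dict(zip(response['fields'], row)) for row in response['values']]

fields = ['GENDER', 'AGE', 'MARITAL_STATUS', 'PROFESSION']
records = [
    {'GENDER': 'M', 'AGE': 23, 'MARITAL_STATUS': 'Single', 'PROFESSION': 'Student'},
    {'GENDER': 'M', 'AGE': 55, 'MARITAL_STATUS': 'Single', 'PROFESSION': 'Executive'},
]
payload = build_payload(records, fields)

# Abbreviated, hand-written response in the same layout as the scoring output.
response = {'fields': fields + ['prediction', 'predictedLabel'],
            'values': [['M', 23, 'Single', 'Student', 1.0, 'Personal Accessories']]}
for row in rows_as_dicts(response):
    print(row['AGE'], row['predictedLabel'])
```

Here `payload` has the same content as `payload_scoring` in the cell above, and accessing results by field name avoids counting positions in long response rows.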

<a id="summary"></a>
## 7. Summary and next steps     

You successfully completed this notebook! 
 
You learned how to use Spark Machine Learning as well as the Watson Machine Learning (WML) API client for model creation and deployment. 
 
Check out our <a href="https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html" target="_blank" rel="noopener noreferrer">Online Documentation</a> for more samples, tutorials, documentation, how-tos, and blog posts. 

### Authors

**Lukasz Cmielowski**, Ph.D., is an Automation Architect and Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge.  
**Jihyoung Kim**, Ph.D., is a Data Scientist at IBM who strives to make data science easy for everyone through Watson Studio.

Copyright © 2017-2019 IBM. This notebook and its source code are released under the terms of the MIT License.