<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Use Spark and Python to predict customer churn</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
  <tr style="border: none">
       <th style="border: none"><img src="https://github.com/pmservice/wml-sample-models/blob/master/spark/customer-satisfaction-prediction/images/users_banner_2-03.png?raw=true" width="600" alt="Icon"> </th>
   </tr>
</table>

This notebook demonstrates how to build a predictive model and score the model with new data. 

Some familiarity with Python is helpful. This notebook is compatible with Python and Spark.

You will use a data set, **Telco Customer Churn**, which contains anonymous customer data of a fictional telecommunications company. Use the details of this data set to predict customer churn, which is critical to the business because it is easier to retain existing customers than to acquire new ones.

## Learning goals

In this notebook, you will learn how to:

-  Load a CSV file into a Spark DataFrame.
-  Explore data.
-  Prepare data for training and evaluation.
-  Create a Spark machine learning pipeline.
-  Train and evaluate a model.
-  Store a pipeline and model in the Watson Machine Learning (WML) repository.
-  Explore and visualize the prediction result using the plotly package.
-  Deploy the trained model for scoring using the Watson Machine Learning (WML) API.


## Contents

This notebook contains the following parts:

1.	[Set up the environment](#setup)
2.	[Load and explore data](#load)
3.	[Create a Spark machine learning model](#model)
4.	[Store the model in the WML repository](#persistence)
5.	[Predict locally and visualize prediction results](#visualization)
6.	[Deploy and score on the IBM Cloud](#scoring)
7.	[Summary and next steps](#summary)

<a id="setup"></a>
## 1. Set up the environment

Before you use the sample code in this notebook, you must perform the following tasks:

-  Create a <a href="https://cloud.ibm.com/catalog/services/machine-learning" target="_blank" rel="noopener noreferrer">Watson Machine Learning (WML) Service</a> instance (a free plan is offered, and information about how to create the instance can be found <a href="https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html" target="_blank" rel="noopener noreferrer">here</a>).
-  Make sure that you are using a Spark kernel in the notebook.

### Create the Telco Customer Churn Data Asset  

The fictional Telco Customer Churn data is available in the <a href="https://github.com/pmservice/wml-sample-models" target="_blank" rel="noopener noreferrer">GitHub repository</a> that includes this notebook.

Run the following cells to download the data set from the <a href="https://github.com/pmservice/wml-sample-models/tree/master/spark/customer-satisfaction-prediction/data" target="_blank" rel="noopener noreferrer">data folder</a> of that GitHub repository to your local file system.


In [1]:
from contextlib import suppress
import os

_filename = 'Telco_customer_churn.csv'

with suppress(OSError):
    os.remove(_filename)

del _filename

In [1]:
# Install the wget package, if necessary, by running the following command:
!pip install --upgrade wget

Waiting for a Spark session to start...
Spark Initialization Done! ApplicationId = app-20200124165729-0000
KERNEL_ID = 4e69ed8d-d470-4ea1-8705-1ec5f651b5c1
Collecting wget
  Downloading https://files.pythonhosted.org/packages/47/6a/62e288da7bcda82b935ff0c6cfe542970f04e29c756b0e147251b2fb251f/wget-3.2.zip
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Stored in directory: /home/spark/shared/.cache/pip/wheels/40/15/30/7d8f7cea2902b4db79e3fea550d7d7b85ecb27ef992b618f3f
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [2]:
import wget

link_to_data = 'https://raw.githubusercontent.com/pmservice/wml-sample-models/master/spark/customer-satisfaction-prediction/data/Telco_customer_churn.csv'
filename = wget.download(link_to_data)

print(filename)

Telco_customer_churn.csv


Now that you have downloaded the data set to your local directory, you can read it into a Spark DataFrame.

<a id="load"></a>
## 2. Load and explore data

In this section, you will load the downloaded dataset as a Spark DataFrame and perform basic data exploration. 

Instead of using the `wget` package, you can upload the `Telco Customer Churn` data set from your computer or through a data connection by using the `Find and add data` option in the top right corner - its icon looks like binary digits (01 00).

If you uploaded the data set from your computer, you can find it in the **Files** section - you might need to reload the notebook for the file to appear. Under the name of the file, you will find a dropdown menu - **Insert to code**. You can load your data into a Spark DataFrame by selecting **Insert SparkSession DataFrame**.

In this notebook, we assume that you obtained the data via `wget` instead of uploading the data from your computer.

In this section, you will learn how to:

- [2.1 Load data into Spark DataFrame](#loaddataframe)
- [2.2 Explore and visualize data](#explore)
- [2.3 Perform statistical tests](#stat)

### 2.1 Load data into Spark DataFrame <a id="loaddataframe"></a>

In this subsection, you will learn how to load the downloaded data into a Spark DataFrame.

In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_data = spark.read\
    .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
    .option('header', 'true')\
    .option('inferSchema', 'true')\
    .option('nanValue', ' ')\
    .option('nullValue', ' ')\
    .load(filename)

### 2.2 Explore and visualize data <a id="explore"></a>

In this subsection, you explore the data and create visualizations.

`pixiedust` is an open-source Python helper library that works as an add-on to Jupyter notebooks to improve the user experience of working with data.  
`pixiedust` documentation/code can be found <a href="https://github.com/pixiedust/pixiedust" target="_blank" rel="noopener noreferrer">here</a>.
Install `pixiedust` to view the Spark DataFrame in a richer way than the default Spark DataFrame `show()` method allows.

In [None]:
!pip install --upgrade pixiedust

Import `pixiedust`.

In [5]:
import pixiedust

Pixiedust database opened successfully
Table VERSION_TRACKER created successfully
Table METRICS_TRACKER created successfully

Share anonymous install statistics? (opt-out instructions)

PixieDust will record metadata on its environment the next time the package is installed or updated. The data is anonymized and aggregated to help plan for future releases, and records only the following values:

{
   "data_sent": currentDate,
   "runtime": "python",
   "application_version": currentPixiedustVersion,
   "space_id": nonIdentifyingUniqueId,
   "config": {
       "repository_id": "https://github.com/ibm-watson-data-lab/pixiedust",
       "target_runtimes": ["Data Science Experience"],
       "event_id": "web",
       "event_organizer": "dev-journeys"
   }
}
You can opt out by calling pixiedust.optOut() in a new cell.


Pixiedust runtime updated. Please restart kernel
Table SPARK_PACKAGES created successfully
Table USER_PREFERENCES created successfully
Table service_connections created successfully


You can run the following method if you don't want ``pixiedust`` collecting user statistics.

In [6]:
pixiedust.optOut()

Pixiedust will not collect anonymous install statistics.


In this notebook, ``pixiedust`` will only be used as a DataFrame viewer. However, ``pixiedust`` can also be used as a data visualization tool. You can find the details of the visualization functionality of ``pixiedust`` <a href="https://pixiedust.github.io/pixiedust/displayapi.html" target="_blank" rel="noopener noreferrer">here</a>.

`display` is the name of the function that enables `pixiedust` to create a DataFrame viewer.

In [None]:
display(df_data)

Gender,Senior_Citizen,Partner,Dependents,Tenure_Months,Phone_Service,Multiple_Lines,Internet_Service,Online_Security,Online_Backup,Device_Protection,Tech_Support,Streaming_TV,Streaming_Movies,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Churn_Label
Male,No,No,No,59,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Two year,No,Mailed check,19.35,1099.6,Yes
Female,No,No,No,1,Yes,No,Fiber optic,Yes,No,No,No,No,No,Month-to-month,Yes,Mailed check,73.6,73.6,Yes
Male,No,Yes,No,3,Yes,No,DSL,No,No,Yes,No,No,No,Month-to-month,Yes,Electronic check,50.15,168.15,Yes
Male,No,No,No,62,Yes,Yes,Fiber optic,No,No,Yes,No,Yes,Yes,One year,Yes,Electronic check,96.75,6125.4,Yes
Male,No,Yes,No,4,Yes,No,Fiber optic,No,No,Yes,Yes,No,No,Month-to-month,Yes,Mailed check,80.6,319.15,Yes
Male,No,Yes,No,32,Yes,Yes,Fiber optic,Yes,No,No,No,Yes,Yes,Month-to-month,Yes,Credit card (automatic),99.55,3204.65,Yes
Female,No,No,No,5,Yes,No,Fiber optic,No,No,No,No,No,Yes,Month-to-month,Yes,Electronic check,80.0,412.5,Yes
Male,Yes,Yes,No,15,Yes,Yes,Fiber optic,No,No,No,No,No,Yes,Month-to-month,Yes,Bank transfer (automatic),85.6,1345.55,Yes
Male,No,No,No,1,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,Yes,Mailed check,19.45,19.45,Yes
Male,No,No,No,66,Yes,No,Fiber optic,No,Yes,Yes,No,Yes,Yes,Month-to-month,No,Electronic check,99.5,6710.5,Yes


You can check the schema of the DataFrame by clicking on the `Schema` panel in the DataFrame viewer created by `pixiedust`.

Also, `pixiedust` tells you how many rows exist in the DataFrame (in this case, 7043 rows). The `Churn_Label` field is the one that you want to predict (the label).

Now, you will check if all records have complete data.

In [8]:
df_complete = df_data.dropna()

print('Number of records with complete data: %3g' % df_complete.count())

Number of records with complete data: 7032


You can see that some records have missing values: 7032 of the 7043 records are complete. All of the missing values are in the `Total_Charges` feature. For training and evaluation, you will use the data set with the incomplete records removed.
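
One way to check where the missing values sit is to count the blank cells per column. The following is a minimal plain-Python sketch of that check. The three sample rows and the `count_missing` helper are illustrative assumptions, not part of the data set or of the notebook. On the Spark DataFrame itself, the same idea can be expressed by filtering each column with `isNull()`.

```python
import csv
import io
from collections import Counter

# Hypothetical three-row sample in the same layout as the data set
# (only three of the columns are shown; the values are made up).
SAMPLE_CSV = (
    "Tenure_Months,Monthly_Charges,Total_Charges\n"
    "0,52.55, \n"
    "1,73.6,73.6\n"
    "3,50.15,168.15\n"
)


def count_missing(csv_text, null_value=" "):
    """Count, per column, the cells that are empty or equal to null_value."""
    missing = Counter()
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        for column, value in row.items():
            if value is None or value == "" or value == null_value:
                missing[column] += 1
    return dict(missing)


print(count_missing(SAMPLE_CSV))
```

The `null_value` default of a single space mirrors the `nullValue` option that was passed to the CSV reader in section 2.1.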

Inspect the class distribution in the label column.

In [None]:
display(df_complete.groupBy('Churn_Label').count())

Churn_Label,count
No,5163
Yes,1869
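
The two classes are not balanced, which matters when you judge the model later. As a small sketch, the counts from the output above can be turned into class shares (the variable names are illustrative):

```python
# Counts taken from the class distribution output above.
class_counts = {"No": 5163, "Yes": 1869}

total = sum(class_counts.values())
class_share = {label: count / total for label, count in class_counts.items()}

for label, share in class_share.items():
    print("{}: {:.1%}".format(label, share))
```

Roughly three out of four customers did not churn, so a model that always predicts `No` would already be right about 73% of the time. This is one reason to look at a metric such as AUC rather than at plain accuracy.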


You will visualize the data via `brunel`.  
`brunel` defines a highly succinct and novel language that defines interactive data visualizations based on tabular data.  
`brunel` documentation/code can be found <a href="https://github.com/Brunel-Visualization/Brunel" target="_blank" rel="noopener noreferrer">here</a>. 

In [None]:
!pip install --upgrade brunel

You have to convert the PySpark DataFrame into a Pandas DataFrame first in order to pass it to `brunel`.

In [11]:
df_pd = df_complete.toPandas()

There are 19 features (predictors) and 1 label (target). Of the 19 features, 16 are categorical. In this subsection, you will visualize 1 continuous feature and 3 categorical features.

In [12]:
df_pd.columns

Index(['Gender', 'Senior_Citizen', 'Partner', 'Dependents', 'Tenure_Months',
       'Phone_Service', 'Multiple_Lines', 'Internet_Service',
       'Online_Security', 'Online_Backup', 'Device_Protection', 'Tech_Support',
       'Streaming_TV', 'Streaming_Movies', 'Contract', 'Paperless_Billing',
       'Payment_Method', 'Monthly_Charges', 'Total_Charges', 'Churn_Label'],
      dtype='object')

First, let's plot the distribution of a continuous feature (predictor) - `Total_Charges`.

In [13]:
df_tc = df_pd['Total_Charges']
df_tc = df_tc.sort_values(ascending=True).to_frame()
df_tc = df_tc.reset_index(drop=True).reset_index()
df_tc = df_tc.rename(index=str, columns={'index': 'Record'})

In [14]:
%brunel data('df_tc') x(Record) y(Total_Charges)

<IPython.core.display.Javascript object>

Let's plot the distribution of a categorical feature (predictor) - `Gender`.

In [15]:
%brunel data('df_pd') bar x(Gender) y(#count)

<IPython.core.display.Javascript object>

Let's plot the distribution of a categorical feature (predictor) - `Internet_Service`.

In [16]:
%brunel data('df_pd') bar x(Internet_Service) y(#count)

<IPython.core.display.Javascript object>

Let's plot the distribution of a categorical feature (predictor) - `Streaming_Movies`.

In [17]:
%brunel data('df_pd') bar x(Streaming_Movies) y(#count)

<IPython.core.display.Javascript object>

### 2.3 Perform statistical tests <a id="stat"></a>

In this subsection, you will perform statistical tests, specifically chi-squared tests, on categorical features (predictors). A chi-squared test can be performed when both the feature (predictor) and the target (label) are categorical. The goal of the chi-squared test is to assess the relationship between two categorical variables.

For demo purposes, only the 3 categorical features (predictors) visualized in section [2.2 Explore and visualize data](#explore) are selected for the chi-squared test.

You will use the `scipy.stats` module for the chi-squared test.

In [18]:
from scipy import stats
import pandas as pd

In [19]:
stats.chisquare(df_pd['Churn_Label'].value_counts())

Power_divergenceResult(statistic=1543.0085324232082, pvalue=0.0)

In [20]:
stats.chisquare(df_pd['Gender'].value_counts())

Power_divergenceResult(statistic=0.6194539249146758, pvalue=0.43125028325456827)

In [21]:
stats.chisquare(df_pd['Internet_Service'].value_counts())

Power_divergenceResult(statistic=533.1331058020478, pvalue=1.7045785344130917e-116)

In [22]:
stats.chisquare(df_pd['Streaming_Movies'].value_counts())

Power_divergenceResult(statistic=435.0315699658703, pvalue=3.4205414303055575e-95)
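
Note that `stats.chisquare` with a single list of counts is a goodness-of-fit test: by default it compares the observed counts with equal expected counts for every category. It therefore measures how unevenly a single column is distributed, not how the column relates to `Churn_Label`. The following plain-Python sketch reproduces the statistic for the `Churn_Label` counts (the helper name `chisquare_uniform` is illustrative):

```python
def chisquare_uniform(observed):
    """Chi-squared goodness-of-fit statistic against equal expected counts."""
    expected = sum(observed) / len(observed)
    return sum((count - expected) ** 2 / expected for count in observed)


churn_counts = [5163, 1869]  # 'No' and 'Yes' counts from section 2.2
print(chisquare_uniform(churn_counts))
```

The printed value agrees with the `statistic` that `stats.chisquare` returned for `Churn_Label` above.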

Let's create a cross-tabulation matrix for each predictor and get the chi-squared test results.

In [23]:
# pd.crosstab sorts the row index alphabetically, so 'No' comes before 'Yes'.
target_classes = ['No', 'Yes']

Cross-tabulation matrix for predictor `Gender` and target `Churn_Label`.

In [24]:
cont_gender = pd.crosstab(df_pd['Churn_Label'], df_pd['Gender'])

In [25]:
cont_gender_df = cont_gender
cont_gender_df.index = target_classes
cont_gender_df.index.name = 'Churn_Label'

In [26]:
cont_gender_df

Churn_Label,Female,Male
No,2544,2619
Yes,939,930


The first value of the output of the `chi2_contingency` method is the chi-squared test statistic, the second value is the $p$-value, the third value is the degrees of freedom, and the last value is the contingency table with expected values.

In [27]:
stats.chi2_contingency(cont_gender)

(0.47545453727386294,
 0.4904884707065509,
 1,
 array([[2557.27090444, 2605.72909556],
        [ 925.72909556,  943.27090444]]))

Using `stats.chi2_contingency`, you can check if two features (predictors) are independent or not.

$H_{0}$ (null hypothesis): Predictor $A$ and predictor $B$ are independent.  
$H_{1}$ (alternative hypothesis): Predictor $A$ and predictor $B$ are dependent.

If $p$ < $0.05$, reject $H_{0}$ and treat $A$ and $B$ as dependent; otherwise, there is not enough evidence to reject $H_{0}$.

Since the $p$-value is $0.49$, $H_{0}$ (null hypothesis) cannot be rejected - the test finds no evidence that `Gender` and `Churn_Label` are dependent.
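
To make the numbers returned by `chi2_contingency` less opaque, here is a minimal plain-Python sketch of the same computation. It assumes the usual definitions: expected counts come from the row and column totals, and a continuity (Yates) correction is applied only for 2x2 tables. The function name `chi2_contingency_sketch` is illustrative, and the $p$-value is implemented only for 1 or 2 degrees of freedom, where a closed form exists.

```python
import math


def chi2_contingency_sketch(table, yates=None):
    """Return (statistic, p_value, dof, expected) for a contingency table.

    The p-value uses closed forms that exist only for 1 or 2 degrees of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count of a cell under independence: row total * column total / total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    if yates is None:
        yates = dof == 1  # continuity correction only for 2x2 tables
    statistic = 0.0
    for obs_row, exp_row in zip(table, expected):
        for obs, exp in zip(obs_row, exp_row):
            diff = abs(obs - exp)
            if yates:
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / exp
    if dof == 1:
        p_value = math.erfc(math.sqrt(statistic / 2))
    elif dof == 2:
        p_value = math.exp(-statistic / 2)
    else:
        raise ValueError("closed-form p-value implemented for dof 1 or 2 only")
    return statistic, p_value, dof, expected


# Rows: the two Churn_Label classes; columns: Female, Male.
gender_table = [[2544, 2619], [939, 930]]
statistic, p_value, dof, expected = chi2_contingency_sketch(gender_table)
print(statistic, p_value, dof)
```

For the `Gender` table, the sketch yields the same statistic, $p$-value, and degrees of freedom as the `chi2_contingency` output above.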

Cross-tabulation matrix for predictor `Internet_Service` and target `Churn_Label`.

In [28]:
cont_int = pd.crosstab(df_pd['Churn_Label'], df_pd['Internet_Service'])

In [29]:
cont_int_df = cont_int
cont_int_df.index = target_classes
cont_int_df.index.name = 'Churn_Label'

In [30]:
cont_int_df

Churn_Label,DSL,Fiber optic,No
No,1957,1799,1407
Yes,459,1297,113


The first value of the output of the `chi2_contingency` method is the chi-squared test statistic, the second value is the $p$-value, the third value is the degrees of freedom, and the last value is the contingency table with expected values.

In [31]:
stats.chi2_contingency(cont_int)

(728.6956143058695,
 5.831198962236941e-159,
 2,
 array([[1773.86348123, 2273.12969283, 1116.00682594],
        [ 642.13651877,  822.87030717,  403.99317406]]))

Using `stats.chi2_contingency`, you can check if two features (predictors) are independent or not.

$H_{0}$ (null hypothesis): Predictor $A$ and predictor $B$ are independent.  
$H_{1}$ (alternative hypothesis): Predictor $A$ and predictor $B$ are dependent.

If $p$ < $0.05$, reject $H_{0}$ and treat $A$ and $B$ as dependent; otherwise, there is not enough evidence to reject $H_{0}$.

Since the $p$-value is $5.83\times10^{-159}$, $H_{0}$ (null hypothesis) is rejected - `Internet_Service` and `Churn_Label` are dependent.

Cross-tabulation matrix for predictor `Streaming_Movies` and target `Churn_Label`.

In [32]:
cont_sm = pd.crosstab(df_pd['Churn_Label'], df_pd['Streaming_Movies'])

In [33]:
cont_sm_df = cont_sm
cont_sm_df.index = target_classes
cont_sm_df.index.name = 'Churn_Label'

In [34]:
cont_sm_df

Churn_Label,No,No internet service,Yes
No,1843,1407,1913
Yes,938,113,818


The first value of the output of the `chi2_contingency` method is the chi-squared test statistic, the second value is the $p$-value, the third value is the degrees of freedom, and the last value is the contingency table with expected values.

In [35]:
stats.chi2_contingency(cont_sm)

(374.26843157324595,
 5.353560421401323e-82,
 2,
 array([[2041.85196246, 1116.00682594, 2005.1412116 ],
        [ 739.14803754,  403.99317406,  725.8587884 ]]))

Using `stats.chi2_contingency`, you can check if two features (predictors) are independent or not.

$H_{0}$ (null hypothesis): Predictor $A$ and predictor $B$ are independent.  
$H_{1}$ (alternative hypothesis): Predictor $A$ and predictor $B$ are dependent.

If $p$ < $0.05$, reject $H_{0}$ and treat $A$ and $B$ as dependent; otherwise, there is not enough evidence to reject $H_{0}$.

Since the $p$-value is $5.35\times10^{-82}$, $H_{0}$ (null hypothesis) is rejected - `Streaming_Movies` and `Churn_Label` are dependent.

<a id="model"></a>
## 3. Create a Spark machine learning model

In this section, you will learn how to:

- [3.1 Split data](#prep)
- [3.2 Create a Spark machine learning pipeline](#pipe)
- [3.3 Train a model](#train)

### 3.1 Split data<a id="prep"></a>

In this subsection, you will split your data into: 
- train data set
- test data set
- predict data set

In [36]:
(train_data, test_data, predict_data) = df_complete.randomSplit([0.8, 0.18, 0.02], 24)

print('Number of records for training: {}'.format(train_data.count()))
print('Number of records for evaluation: {}'.format(test_data.count()))
print('Number of records for prediction: {}'.format(predict_data.count()))

Number of records for training: 5621
Number of records for evaluation: 1263
Number of records for prediction: 148


As you can see, your data has been successfully split into three data sets: 

-  The train data set, which is the largest group, is used for training.
-  The test data set will be used for model evaluation and to test the assumptions of the model.
-  The predict data set will be used for prediction.
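
The record counts above are close to, but not exactly, 80%, 18%, and 2% of the 7032 records. That is expected for a random split: each record is assigned to a data set by a random draw, so the sizes fluctuate around the requested weights. The following plain-Python sketch illustrates the idea. It is a conceptual illustration only, not Spark's implementation, and the `random_split` helper is hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one bucket with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 0.98, 1.0] for weights [0.8, 0.18, 0.02].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            # The last bucket catches any rounding error in the final bound.
            if draw < upper or index == len(bounds) - 1:
                buckets[index].append(row)
                break
    return buckets


train, test, predict = random_split(range(7032), [0.8, 0.18, 0.02], seed=24)
print(len(train), len(test), len(predict))
```

As in the `randomSplit` call above, the weights are normalized and the seed makes the result repeatable. The exact counts differ from Spark's because the random number generators differ.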

### 3.2 Create a Spark machine learning pipeline<a id="pipe"></a>

In this subsection, you will create a Spark machine learning pipeline that is then used to train the model.

In the first step, you need to import the Spark machine learning packages that will be needed in the subsequent steps.

In [37]:
from pyspark.ml.feature import StringIndexer, IndexToString, RFormula
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml import Pipeline, Model

In the following step, convert all the predictors to a feature vector and convert the label to a numeric value.

In [38]:
df_data.columns

['Gender',
 'Senior_Citizen',
 'Partner',
 'Dependents',
 'Tenure_Months',
 'Phone_Service',
 'Multiple_Lines',
 'Internet_Service',
 'Online_Security',
 'Online_Backup',
 'Device_Protection',
 'Tech_Support',
 'Streaming_TV',
 'Streaming_Movies',
 'Contract',
 'Paperless_Billing',
 'Payment_Method',
 'Monthly_Charges',
 'Total_Charges',
 'Churn_Label']

In [39]:
lab = StringIndexer(inputCol = 'Churn_Label', outputCol = 'label')
features = RFormula(formula = '~ Gender + Senior_Citizen +  Partner + Dependents + Tenure_Months + Phone_Service \
+ Multiple_Lines + Internet_Service + Online_Security + Online_Backup + Device_Protection + Tech_Support \
+ Streaming_TV + Streaming_Movies + Contract + Paperless_Billing + Payment_Method + Monthly_Charges + Total_Charges - 1')
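
By default, `StringIndexer` orders the distinct label values by descending frequency, so the most frequent value receives index `0.0`. The following plain-Python sketch mimics that mapping for a small hypothetical list of labels. The helper `fit_string_index` and the alphabetical tie-breaking are assumptions made for illustration.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically in this sketch.
    ordered = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: float(index) for index, value in enumerate(ordered)}


labels = ["No", "Yes", "No", "No", "Yes"]  # hypothetical Churn_Label values
mapping = fit_string_index(labels)
print(mapping)
print([mapping[value] for value in labels])
```

Because `No` is the more frequent class in this data set, `No` maps to `0.0` and `Yes` maps to `1.0` in the `label` column.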

Next, define estimators you want to use for classification. Logistic Regression is used in the following example.

In [40]:
lr = LogisticRegression(maxIter = 10)

Now build the pipeline. A pipeline consists of transformers and an estimator.

In [41]:
pipeline_lr = Pipeline(stages = [features, lab, lr])

### 3.3 Train the model<a id="train"></a>

Now, you can train your Logistic Regression model by using the previously defined **pipeline** and **train data**.

In [42]:
model_lr = pipeline_lr.fit(train_data)

You can check your **Area Under the Curve (AUC)** now. AUC is the default metric of `BinaryClassificationEvaluator`. Use **test data** to evaluate the model.

In [43]:
predictions = model_lr.transform(test_data)
evaluator = BinaryClassificationEvaluator(labelCol='label', rawPredictionCol='rawPrediction')
auc = evaluator.evaluate(predictions)

print('Test dataset:')
print('Area Under the Curve = {:.2f}%'.format((auc*100)))

Test dataset:
Area Under the Curve = 85.97%
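
The area under the ROC curve can be read as the probability that a randomly chosen positive record (here, a churned customer) receives a higher score than a randomly chosen negative record. The following plain-Python sketch computes AUC directly from that definition on four hypothetical records. It illustrates the metric and is not the way `BinaryClassificationEvaluator` computes it internally.

```python
def area_under_roc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [score for label, score in zip(labels, scores) if label == 1]
    negatives = [score for label, score in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
print(area_under_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the 85.97% reported above is 0.8597 on that scale.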


You can now tune your model to achieve better results. For simplicity, tuning is omitted from this notebook.

<a id="persistence"></a>
## 4. Store the model in the WML repository

In this section, you will learn how to use Python client libraries to store your pipeline and model in the WML repository and make predictions.

- [4.1 Install required packages](#lib)
- [4.2 Save the pipeline and model](#save)
- [4.3 Load the model](#loadmodel)

### 4.1 Install required packages<a id="lib"></a>

First, import required WML client libraries.

**Note**: Spark 2.3+ is required.

In [44]:
from watson_machine_learning_client import WatsonMachineLearningAPIClient

Authenticate to the Watson Machine Learning service on IBM Cloud.

**Tip**: Authentication information (your credentials) can be found in the <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-get-wml-credentials.html" target="_blank" rel="noopener noreferrer">Service Credentials</a> tab of the service instance that you created on IBM Cloud. <BR>If you cannot see the **instance_id** field in **Service Credentials**, click **New credential (+)** to generate new authentication information. 

**Action**: Enter your Watson Machine Learning service instance credentials here.

In [45]:
wml_credentials = {
    'apikey': '***',
    'url': '***',
    'instance_id': '***'
}

In [46]:
# @hidden_cell

wml_credentials = {
    'apikey': '***',
    'instance_id': '***',
    'url': '***'
}

In [47]:
client = WatsonMachineLearningAPIClient(wml_credentials)

### 4.2 Save the pipeline and model<a id="save"></a>

In this subsection, you will learn how to save the pipeline and model artifacts to your WML instance.

In [48]:
saved_model = client.repository.store_model(model=model_lr, 
                                            meta_props={'name': 'Customer churn Spark model'}, 
                                            training_data=train_data, 
                                            pipeline=pipeline_lr)

Get the saved model metadata from the WML repository.

In [49]:
published_model_ID = client.repository.get_model_uid(saved_model)

print('Model ID: {}'.format(published_model_ID))

Model ID: 0de1cde8-024f-4ce8-9a10-71421bcf2a95


The Model ID can be used to retrieve the latest model version from the WML repository.

### 4.3 Load the model<a id="loadmodel"></a>

In this subsection, you will learn how to load a saved model from a specified WML repository.

In [50]:
loaded_model = client.repository.load(published_model_ID)

In [51]:
print(type(loaded_model))

<class 'pyspark.ml.pipeline.PipelineModel'>


As you can see, the loaded object is a Spark `PipelineModel`, as expected. You have now learned how to save a model to the WML repository and load it back.

<a id="visualization"></a>
## 5. Predict locally and visualize prediction results

In this section, you will learn how to score test data using the loaded model and visualize the prediction results with the plotly package.

- [5.1 Make a local prediction using previously loaded model and test data](#local)
- [5.2 Use Plotly to visualize prediction results](#plotly)

### 5.1 Make a local prediction using previously loaded model and test data<a id="local"></a>

In this subsection, you will score the `predict_data` data set.

In [52]:
predictions = loaded_model.transform(predict_data)

Check the results by viewing the predictions DataFrame via `pixiedust`.

In [None]:
display((predictions.toPandas()).head())

Gender,Senior_Citizen,Partner,Dependents,Tenure_Months,Phone_Service,Multiple_Lines,Internet_Service,Online_Security,Online_Backup,Device_Protection,Tech_Support,Streaming_TV,Streaming_Movies,Contract,Paperless_Billing,Payment_Method,Monthly_Charges,Total_Charges,Churn_Label,features,label,rawPrediction,probability,prediction
Female,No,No,No,2,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,75.9,143.35,Yes,"(31,[1,2,3,4,5,6,8,9,11,13,15,17,19,21,23,25,26,29,30],[1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,75.9,143.35])",1.0,"[-1.142952966329993,1.142952966329993]","[0.2417786054497176,0.7582213945502824]",1.0
Female,No,No,No,2,Yes,Yes,Fiber optic,No,No,No,No,No,Yes,Month-to-month,Yes,Electronic check,85.7,169.8,Yes,"(31,[1,2,3,4,5,6,8,9,11,13,15,17,19,22,23,25,26,29,30],[1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,85.7,169.8])",1.0,"[-1.3785260731845268,1.3785260731845268]","[0.20124582317854675,0.7987541768214533]",1.0
Female,No,No,No,3,Yes,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,Month-to-month,No,Electronic check,19.75,58.85,Yes,"(31,[1,2,3,4,5,6,7,23,26,29,30],[1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,19.75,58.85])",1.0,"[0.9695232366209245,-0.9695232366209245]","[0.7250244585367621,0.2749755414632378]",0.0
Female,No,No,No,3,Yes,Yes,Fiber optic,No,No,No,No,No,No,Month-to-month,No,Electronic check,75.25,242.0,Yes,"(31,[1,2,3,4,5,6,8,9,11,13,15,17,19,21,23,26,29,30],[1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,75.25,242.0])",1.0,"[-0.8045267033823671,0.8045267033823671]","[0.3090580467631725,0.6909419532368275]",1.0
Female,No,No,No,4,Yes,Yes,Fiber optic,No,Yes,No,Yes,No,Yes,Month-to-month,Yes,Electronic check,93.5,362.2,Yes,"(31,[1,2,3,4,5,6,8,9,11,14,15,18,19,22,23,25,26,29,30],[1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,93.5,362.2])",1.0,"[-0.9613189967611451,0.9613189967611451]","[0.27661418748428257,0.7233858125157174]",1.0
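
The `rawPrediction`, `probability`, and `prediction` columns are related: for binary logistic regression, the probability is the logistic (sigmoid) function of the raw margin, and the predicted class is 1.0 when the churn probability exceeds the default threshold of 0.5. The following sketch checks this for the first row of the output above (the `sigmoid` helper is illustrative):

```python
import math


def sigmoid(margin):
    """Logistic function: maps a real-valued margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))


raw_prediction = [-1.142952966329993, 1.142952966329993]  # first row above
probability = [sigmoid(margin) for margin in raw_prediction]
prediction = float(probability[1] > 0.5)  # index 1 is the 'Yes' (churn) class
print(probability, prediction)
```

The computed probabilities agree with the `probability` column of that row, and the resulting prediction is 1.0 (churn).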


You can check the count of each predicted label.

In [None]:
display(predictions.select('prediction').groupBy('prediction').count())

prediction,count
0.0,101.0
1.0,47.0


### 5.2 Use Plotly to visualize prediction results <a id="plotly"></a>

In this subsection, you will use `plotly`, an online analytics and data visualization tool, to explore the prediction results. 

**Example**: First, you need to install required packages. You can do it by running the following code. Run it one time only.

In [55]:
!pip install --upgrade plotly -q

tensorflow 1.13.1 requires tensorboard<1.14.0,>=1.13.0, which is not installed.
ibm-cos-sdk-core 2.4.3 has requirement urllib3<1.25,>=1.20, but you'll have urllib3 1.25.8 which is incompatible.
botocore 1.12.82 has requirement urllib3<1.25,>=1.20, but you'll have urllib3 1.25.8 which is incompatible.


Import `plotly` and the other required modules.

In [56]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import Layout, Figure, Pie, Bar

import sys

init_notebook_mode(connected=True)
sys.path.append(''.join([os.environ['HOME']])) 

Plot a pie chart that shows the churn split in the `predict_data` set. The chart groups the records by the `label` column, that is, by observed churn.

In [57]:
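# Order by label so that row 0 is label 0.0 ('No') and row 1 is label 1.0 ('Yes').
cumulative_stats = predictions.groupby(['label']).count().orderBy('label')
labels_data_plot = ['No', 'Yes']
values_data_plot = [cumulative_stats.select('count').collect()[x][0] for x in range(2)]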

In [58]:
product_data = [Pie(
    labels=labels_data_plot,
    values=values_data_plot,
)]

product_layout = Layout(
    title='Churn',
)

fig = Figure(data=product_data, layout=product_layout)
iplot(fig)

Let's do some analysis of Mean Monthly Charges per churn class.

In [59]:
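# Mean Monthly_Charges per label; orderBy keeps the values aligned with ['No', 'Yes'].
y_data_plot = [row[1] for row in predictions.groupby(['label']).mean('Monthly_Charges').orderBy('label').collect()]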

In [60]:
age_data = [Bar(
    y = y_data_plot,
    x = labels_data_plot
)]

age_layout = Layout(
    title = 'Mean Monthly Charges per churn class',
    xaxis = dict(
        title = 'Churn',
        showline = False
    ),
    yaxis=dict(
        title = 'Mean Monthly Charges',
    )
)

fig = Figure(data=age_data, layout=age_layout)
iplot(fig)

Based on the bar plot, you are likely to conclude that the mean monthly charges are higher for churned customers than for customers who did not churn, as you might have expected.

<a id="scoring"></a>
## 6. Deploy and score the model in the WML repository

In this section, you will learn how to create a deployment and score a new data (test) record by using the Watson Machine Learning REST API. For more information about REST APIs, see the <a href="http://watson-ml-api.mybluemix.net/" target="_blank" rel="noopener noreferrer">Swagger Documentation</a>.

- [6.1 Create a model deployment](#deploy)
- [6.2 Score the deployed model](#score)

### 6.1 Create a model deployment  <a id="deploy"></a>

Now, you can create a scoring endpoint. Run the code in this subsection, which uses the GUID of the saved model (the same value as `published_model_ID`) to look up the deployments URL and create the scoring endpoint in the WML repository.

#### Create the access token for the WML service.

To work with the WML REST API you must generate an IAM token. To do this, use the following sample code:

In [61]:
# First, import the libraries needed for the REST calls.
import urllib3
import requests
import json

# Create the token.
url     = "https://iam.bluemix.net/oidc/token"
headers = {"Content-Type" : "application/x-www-form-urlencoded"}
data    = "apikey=" + wml_credentials['apikey'] + "&grant_type=urn:ibm:params:oauth:grant-type:apikey"
IBM_cloud_IAM_uid = "bx"
IBM_cloud_IAM_pwd = "bx"
response  = requests.post(url, headers=headers, data=data, auth=(IBM_cloud_IAM_uid, IBM_cloud_IAM_pwd))
iam_token = response.json()["access_token"]
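
The request body above is built by string concatenation, which works as long as the API key contains no characters that are reserved in form-encoded data. As a hedged alternative, the body can be built with `urllib.parse.urlencode`, which escapes such characters. The helper name `build_token_request_body` and the sample key below are illustrative assumptions.

```python
from urllib.parse import parse_qs, urlencode


def build_token_request_body(apikey):
    """Return the form-encoded body for the IAM token request."""
    return urlencode({
        "apikey": apikey,
        "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
    })


body = build_token_request_body("my+api/key")  # hypothetical key with reserved characters
print(body)
print(parse_qs(body))  # decoding the body recovers the original values
```

`requests.post` also accepts a dictionary for its `data` argument and form-encodes it itself, so passing the dictionary directly is another option.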

#### Get the published models url from instance details.

In [None]:
endpoint_instance = wml_credentials['url'] + '/v3/wml_instances/' + wml_credentials['instance_id']
header = {
    'Content-Type': 'application/json', 
    'Authorization': 'Bearer ' + iam_token
}

response_get_instance = requests.get(endpoint_instance, headers=header)

print(response_get_instance)
json.loads(response_get_instance.text) # Use this line to load the instance details

In [63]:
endpoint_published_models = json.loads(response_get_instance.text).get('entity').get('published_models').get('url')

# print(endpoint_published_models)

#### Get a list of the published models.

Run the following code, which uses the published models endpoint to get the deployments URL.

In [64]:
header = {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer ' + iam_token
}
response_get = requests.get(endpoint_published_models, headers=header)
# print(response_get.text)

#### Get the published model deployment URL.

In [65]:
[endpoint_deployments] = [x.get('entity').get('deployments').get('url')
                          for x in json.loads(response_get.text).get('resources')
                          if x.get('metadata').get('guid') == saved_model['metadata']['guid']]

print(endpoint_deployments)

https://us-south.ml.cloud.ibm.com/v3/wml_instances/b4b6c696-172c-4164-8049-c0b621dbf3c9/published_models/0de1cde8-024f-4ce8-9a10-71421bcf2a95/deployments
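
The single-element unpacking `[endpoint_deployments] = [...]` raises a `ValueError` when no model, or more than one model, matches the GUID. A more explicit version of the same lookup might look like the sketch below. The function name `find_deployments_url` and the sample resources are hypothetical and follow only the response shape used in the cell above.

```python
def find_deployments_url(resources, model_guid):
    """Return the deployments URL of the published model with `model_guid`."""
    matches = [
        resource["entity"]["deployments"]["url"]
        for resource in resources
        if resource["metadata"]["guid"] == model_guid
    ]
    if len(matches) != 1:
        raise LookupError(
            "Expected exactly one model with guid {!r}, found {}".format(model_guid, len(matches))
        )
    return matches[0]


# Stand-in for json.loads(response_get.text)['resources'] (placeholder values).
sample_resources = [
    {"metadata": {"guid": "model-a"},
     "entity": {"deployments": {"url": "https://example.invalid/model-a/deployments"}}},
    {"metadata": {"guid": "model-b"},
     "entity": {"deployments": {"url": "https://example.invalid/model-b/deployments"}}},
]
print(find_deployments_url(sample_resources, "model-b"))
```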


#### Create a deployment for the published model.

Run the following code to deploy the model.

In [66]:
deploy_header = {
    'Content-Type': 'application/json', 
    'Authorization': 'Bearer ' + iam_token 
}

deploy_payload = {
    'type':'online', 
    'name': 'Customer Churn Prediction', 
    'description': 'Online Deployment'
}

deploy_response = requests.post(endpoint_deployments, json=deploy_payload, headers=deploy_header)

In [None]:
json.loads(deploy_response.text)

Obtain the deployment UID.

In [None]:
deployment_uid = json.loads(deploy_response.text)['metadata']['guid']
deployment_uid

### 6.2 Score the deployed model <a id="score"></a>

To score the deployed model, you need the scoring URL, which you can retrieve with the following code.

In [None]:
scoring_endpoint = json.loads(deploy_response.text)['entity']['scoring_url']
scoring_endpoint

Create a scoring payload to test your deployed model.

In [70]:
payload_scoring = {
    "fields": ["Gender", "Senior_Citizen", "Partner", "Dependents", "Tenure_Months", "Phone_Service", "Multiple_Lines",
               "Internet_Service", "Online_Security", "Online_Backup", "Device_Protection", "Tech_Support", "Streaming_TV",
               "Streaming_Movies", "Contract", "Paperless_Billing", "Payment_Method", "Monthly_Charges", "Total_Charges"],
    "values": [['Female', 'No', 'No', 'No', 2, 'Yes', 'Yes', 'Fiber optic', 'No','No',
                'No', 'No', 'No', 'No', 'Month-to-month', 'Yes', 'Electronic check', 75.9, 143.35],
               ['Male', 'Yes', 'Yes', 'Yes', 64, 'Yes', 'Yes', 'No', 'No internet service', 'No internet service',
                'No internet service', 'No internet service', 'No internet service', 'No internet service', 'Two year', 'Yes', 'Bank transfer (automatic)', 24.4, 1548.65]]
}
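
In the payload above, each row in `values` must list its entries in the same order as `fields`. If your records are stored as dictionaries, a small helper can enforce that ordering. This is a sketch only: `build_scoring_payload` is a hypothetical name, and the example uses just three of the fields for brevity, whereas the deployed model expects all of them.

```python
def build_scoring_payload(fields, records):
    """Build a {'fields': ..., 'values': ...} payload from a list of dict records."""
    values = []
    for record in records:
        missing = [field for field in fields if field not in record]
        if missing:
            raise KeyError("Record is missing fields: {}".format(missing))
        # Order each row by `fields`, regardless of the dict's own key order.
        values.append([record[field] for field in fields])
    return {"fields": list(fields), "values": values}


# Shortened illustration using values from the first customer above.
example_fields = ["Gender", "Tenure_Months", "Monthly_Charges"]
example_records = [{"Monthly_Charges": 75.9, "Gender": "Female", "Tenure_Months": 2}]
print(build_scoring_payload(example_fields, example_records))
```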

Send the scoring payload to the deployed model to get the predictions.

In [71]:
response_scoring = requests.post(scoring_endpoint, json=payload_scoring, headers=header)
print("Scoring response")
response_json = json.loads(response_scoring.text)
print(response_json)

Scoring response
{'fields': ['Gender', 'Senior_Citizen', 'Partner', 'Dependents', 'Tenure_Months', 'Phone_Service', 'Multiple_Lines', 'Internet_Service', 'Online_Security', 'Online_Backup', 'Device_Protection', 'Tech_Support', 'Streaming_TV', 'Streaming_Movies', 'Contract', 'Paperless_Billing', 'Payment_Method', 'Monthly_Charges', 'Total_Charges', 'features', 'rawPrediction', 'probability', 'prediction'], 'values': [['Female', 'No', 'No', 'No', 2, 'Yes', 'Yes', 'Fiber optic', 'No', 'No', 'No', 'No', 'No', 'No', 'Month-to-month', 'Yes', 'Electronic check', 75.9, 143.35, [31, [1, 2, 3, 4, 5, 6, 8, 9, 11, 13, 15, 17, 19, 21, 23, 25, 26, 29, 30], [1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 75.9, 143.35]], [-1.142952966329993, 1.142952966329993], [0.2417786054497176, 0.7582213945502824], 1.0], ['Male', 'Yes', 'Yes', 'Yes', 64, 'Yes', 'Yes', 'No', 'No internet service', 'No internet service', 'No internet service', 'No internet service', 'No internet

In [72]:
print("First customer prediction: {} (Customer churned/left), Probability: {}".format(response_json['values'][0][-1], response_json['values'][0][-2][1]))
print("Second customer prediction: {} (Customer did not churn), Probability: {}".format(response_json['values'][1][-1], response_json['values'][1][-2][0]))

First customer prediction: 1.0 (Customer churned/left), Probability: 0.7582213945502824
Second customer prediction: 0.0 (Customer did not churn), Probability: 0.9935872315322964


The model predicts that the first customer churns (class 1.0, with a probability of about 0.76) and that the second customer does not churn (class 0.0, with a probability of about 0.99).
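
The cell above reads the prediction and probability by position (`[-1]` and `[-2]`). As a hedged alternative, you could look the columns up by name in the response's `fields` list. The helper `summarize_predictions` is hypothetical, the sample response is a trimmed copy of the first row of the output above, and the sketch assumes that the probability list is ordered by class label (index 0 for class 0.0, index 1 for class 1.0).

```python
def summarize_predictions(response_json):
    """Return (prediction, probability of the predicted class) for each scored row."""
    fields = response_json["fields"]
    prediction_index = fields.index("prediction")
    probability_index = fields.index("probability")
    summaries = []
    for row in response_json["values"]:
        prediction = row[prediction_index]
        probabilities = row[probability_index]
        # Assumption: probabilities are ordered by class label (0.0 first, then 1.0).
        summaries.append((prediction, probabilities[int(prediction)]))
    return summaries


# Trimmed sample: only the model output columns of the first scored customer.
sample_response = {
    "fields": ["rawPrediction", "probability", "prediction"],
    "values": [[[-1.142952966329993, 1.142952966329993],
                [0.2417786054497176, 0.7582213945502824],
                1.0]],
}
for prediction, probability in summarize_predictions(sample_response):
    print("Prediction: {}, Probability: {}".format(prediction, probability))
```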

<a id="summary"></a>
## 7. Summary and next steps

You successfully completed this notebook! 
 
You learned how to use Spark machine learning as well as Watson Machine Learning for model creation and deployment. 
 
Check out our <a href="https://dataplatform.ibm.com/docs/content/analyze-data/wml-setup.html" target="_blank" rel="noopener noreferrer">Online Documentation</a> for more samples, tutorials, documentation, how-tos, and blog posts.

### Authors

**Umit Cakmak** is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially improve clients' ability to turn data into actionable knowledge.<br><br>
**Jihyoung Kim**, Ph.D., is a Data Scientist at IBM who strives to make data science easy for everyone through Watson Studio.<br><br>
**Ananya Kaushik** is a Data Scientist at IBM.

Copyright © 2017-2019 IBM. This notebook and its source code are released under the terms of the MIT License.
