## Predict Customer Churn Use Case Implementation
The objective is to follow the CRISP-DM methodology to build a model that predicts customer churn.
![CRISP-DM](https://raw.githubusercontent.com/yfphoon/dsx_demo/master/crisp_dm.png)


## Table of contents

1. [Step 1: Download the customer churn data](#download)<br/>
2. [Step 2: Read data into Spark DataFrames](#getdata)<br/>
3. [Step 3: Merge Files](#merge)<br/>
4. [Step 4: Rename some columns](#rename)<br/>
5. [Step 5: Data understanding](#dataunderstanding)<br/>
    5.1 [Dataset overview](#overview)<br/>
    5.2 [Exploratory data analysis](#eda)<br/>
    5.3 [Interactive query with SparkSQL](#sparksql)<br/>
6. [Step 6: Introduction to Spark pipelines](#intropipeline)<br/>
    6.1 [StringIndexer](#stringindexer)<br/>
    6.2 [IndexToString](#indextostring)<br/>
    6.3 [OneHotEncoder](#onehotencoder)<br/>
    6.4 [Bucketizer](#bucketizer)<br/>
    6.5 [VectorAssembler](#vectorassembler)<br/>
    6.6 [Normalizer](#normalizer)<br/>
7. [Step 7: Applying Spark pipeline concepts to customer churn data](#applypipelineconcepts)<br/>
8. [Step 8: Creating a Spark ML pipeline](#createpipeline)<br/>
9. [Step 9: Score the test dataset](#scoretestdata)<br/>
10. [Step 10: Model evaluation](#evaluate)<br/>
11. [Step 11: Tune the hyperparameters to find the best model](#tune)<br/>
12. [Step 12: Execute inline invocation of best model](#execute)<br/>
13. [Step 13: Save model](#save)<br/>

<a id="download"></a>
# <span style="color:#fa04d9"> Step 1: Download the customer churn data</span>

In [None]:
#Run once to install the wget package
!pip install wget

In [None]:
# download data from GitHub repository
import wget
url_churn='https://raw.githubusercontent.com/yfphoon/dsx_demo/master/data/customer_churn/churn.csv'
url_customer='https://raw.githubusercontent.com/yfphoon/dsx_demo/master/data/customer_churn/customer.csv'

#remove existing files before downloading
!rm -f churn.csv
!rm -f customer.csv

churnFilename=wget.download(url_churn)
customerFilename=wget.download(url_customer)

#list existing files
!ls -l churn.csv
!ls -l customer.csv

<a id="getdata"></a>
# <span style="color:#fa04d9">Step 2: Read data into Spark DataFrames</span>

Note: See the Spark DataFrame API documentation to learn more about the supported operations: https://spark.apache.org/docs/2.0.0-preview/api/python/pyspark.sql.html#pyspark.sql.DataFrame

In [None]:
churn_df = sqlContext.read.option("header", "true").option("inferSchema", "true").csv(churnFilename)
customer_df = sqlContext.read.option("header", "true").option("inferSchema", "true").csv(customerFilename)

#Note that the pre-spark 2.0 method for reading the csv files would have been
#churn_df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load(churnFilename)
#customer_df = sqlContext.read.format("csv").option("header", "true").option("inferSchema", "true").load(customerFilename)

#### <span style="color:blue">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Take a look at the first 5 rows of the newly loaded Spark DataFrames.</span>

In [None]:
customer_df.show(5)

In [None]:
churn_df.show(5)

<a id="merge"></a>
# <span style="color:#fa04d9">Step 3: Merge Files </span>


In [None]:
data_df = customer_df.join(churn_df,customer_df['ID'] == churn_df['ID']).select(customer_df['*'], churn_df['CHURN'])

In [None]:
data_df.show(5)

<a id="rename"></a>
# <span style="color:#fa04d9">Step 4: Rename some columns </span>
This step is not required; it simply removes the spaces from some column names so that they are easier to type.

In [None]:
# withColumnRenamed renames an existing column in a Spark DataFrame and returns a new Spark DataFrame
data_df = data_df.withColumnRenamed("Est Income", "EstIncome").withColumnRenamed("Car Owner","CarOwner")
data_df.toPandas().head()

<a id="dataunderstanding"></a>
# <span style="color:#fa04d9">Step 5: Data understanding </span>

<a id="overview"></a>
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Dataset Overview
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The Pandas library has a powerful set of commands to analyze data. As an example, see the use of "describe" below.

In [None]:
df_pandas = data_df.toPandas()
# print() with a single argument runs under both Python 2 and Python 3
print("There are " + str(len(df_pandas)) + " observations in the customer history dataset.")
print("There are " + str(len(df_pandas.columns)) + " variables in the dataset.")

print("\n******************Descriptive statistics*****************************\n")
print(df_pandas.drop(['ID'], axis = 1).describe())


<a id="eda"></a>
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Exploratory Data Analysis

The **Brunel** Visualization library provides a highly succinct and novel language that defines interactive data visualizations based on tabular data. The language is well suited for both data scientists and more aggressive business users. The system interprets the language and produces visualizations using the user's choice of existing lower-level visualization technologies typically used by application engineers such as RAVE or D3. 

More information about Brunel Visualization: https://github.com/Brunel-Visualization/Brunel/wiki

Try Brunel visualization here:  http://brunel.mybluemix.net/gallery_app/renderer

In [None]:
import brunel
df_pandas = data_df.toPandas()
%brunel data('df_pandas') stack bar x(Paymethod) y(#count) color(CHURN) bin(Paymethod) percent(#count) label(#count) tooltip(#all) | x(LongDistance) y(Usage) point color(Paymethod) tooltip(LongDistance, Usage) :: width=1100, height=400 

In [None]:
# Heat map
%brunel data('df_pandas') x(LocalBilltype) y(Dropped) color(#count:red) style('symbol:rect; size:100%; stroke:none') tooltip(Dropped,#count)

**PixieDust** is a Python helper library for Spark IPython notebooks. One of its main features is visualization. Unlike other APIs, which produce only static output, PixieDust creates an interactive UI in which you can explore data.<br/>
More information about PixieDust: https://github.com/ibm-cds-labs/pixiedust?cm_mc_uid=78151411419314871783930&cm_mc_sid_50200000=1487962969

**If you haven't already installed it, uncomment and run the following cell to install the pixiedust Python library in your notebook environment. You only need to run it once**


In [None]:
#!pip install --user --upgrade pixiedust

In [None]:
from pixiedust.display import *
display(data_df)

<a id="sparksql"></a>
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Interactive query with Spark SQL

In [None]:
# Spark SQL also allows you to use standard SQL
data_df.createOrReplaceTempView("data_df")
sql = """
SELECT c.*
FROM data_df c
WHERE c.EstIncome>90000

"""
spark.sql(sql).toPandas().head()

<a id="intropipeline"></a>
# <span style="color:#fa04d9">Step 6: Introduction to Spark Pipelines (optional; if you are already familiar with these concepts, skip to Step 7)</span>

In [None]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

### In the next section, you will build a SparkML Pipeline, which consists of Transformers and Estimators. If you are not familiar with the terms "Transformer", "Estimator" and "Pipeline", use this section to learn them. If you already know these concepts, skip directly to Step 7.


## <span style="color:green">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In this section, you will get familiar with a few important Spark ML concepts:
### <span style="color:green">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;* Discovering some Estimators, Transformers and what they do.
### <span style="color:green">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;* Introduction to the notion of a Spark Machine Learning Pipeline.</span>

<a id="stringindexer"></a>
## <span style="color:green">Getting familiar with the SparkML Estimator: <a href="https://spark.apache.org/docs/latest/ml-features.html#stringindexer">StringIndexer</a> </span>

### StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.<br><br> Note that StringIndexer is an estimator, not a transformer. StringIndexer needs to scan the data it is given as input, to find the most frequent string and assign to it label 0, and then label 1 to the next most frequent string and so on. It will then produce a StringIndexerModel which is a transformer which can be applied to the input data using the "transform" method.

<div class="panel-group" id="accordion-10">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-10" href="#collapse1-10">
        Click on this link to expand this cell, then copy and paste the code which will appear in a new cell just below, and execute that new cell to see how StringIndexer works. (You may subsequently delete that new cell and proceed with this notebook).</a>
      </h4>
    </div>
    <div id="collapse1-10" class="panel-collapse collapse">
      <div class="panel-body">
from pyspark.ml.feature import StringIndexer<br>
<br>
df = spark.createDataFrame( <br>
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")], <br>
    ["id", "category"]) <br>
<br>
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex") <br>
indexed = indexer.fit(df).transform(df) <br>
indexed.show()
      </div>
    </div>
  </div>
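
To make the frequency ordering concrete, here is a minimal plain-Python sketch of what StringIndexer computes. It is an illustration only, not Spark's implementation: the function name `fit_string_indexer` is hypothetical, and the alphabetical tie-break between equally frequent labels is an assumption of this sketch.

```python
from collections import Counter


def fit_string_indexer(values):
    """Return a label -> index mapping ordered by descending frequency.

    Ties are broken alphabetically; that tie-break is an assumption of
    this sketch, not something the notebook specifies.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return dict((label, float(index)) for index, label in enumerate(ordered))


categories = ["a", "b", "c", "a", "a", "c"]

# "fit": scan the data once to learn the mapping
mapping = fit_string_indexer(categories)

# "transform": apply the learned mapping to every value
indexed = [mapping[value] for value in categories]

print(mapping)
print(indexed)
```

The split between the two steps mirrors the estimator/transformer distinction: the mapping is learned once and can then be applied to any data containing the same labels.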

<a id="indextostring"></a>
## <span style="color:green">Getting familiar with the SparkML Transformer: <a href="https://spark.apache.org/docs/latest/ml-features.html#indextostring">IndexToString</a> </span>

### Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model with those indices and retrieve the original labels from the column of predicted indices with IndexToString. However, you are free to supply your own labels.

<div class="panel-group" id="accordion-11">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-11" href="#collapse1-11">
        Click on this link to expand this cell, then copy and paste the code which will appear in a new cell just below, and execute that new cell to see how IndexToString works. (You may subsequently delete that new cell and proceed with this notebook).</a>
      </h4>
    </div>
    <div id="collapse1-11" class="panel-collapse collapse">
      <div class="panel-body">
from pyspark.ml.feature import IndexToString, StringIndexer <br>
<br>
df = spark.createDataFrame(<br>
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],<br>
    ["id", "category"])<br>
<br>
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")<br>
model = indexer.fit(df)<br>
indexed = model.transform(df)<br>
<br>
print("Transformed string column '%s' to indexed column '%s'"<br>
      % (indexer.getInputCol(), indexer.getOutputCol()))<br>
indexed.show()<br>
<br>
print("StringIndexer will store labels in output column metadata\n")<br>
<br>
converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory")<br>
converted = converter.transform(indexed)<br>
<br>
print("Transformed indexed column '%s' back to original string column '%s' using "<br>
      "labels in metadata" % (converter.getInputCol(), converter.getOutputCol()))<br>
converted.select("id", "categoryIndex", "originalCategory").show()
      </div>
    </div>
  </div>
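
The reverse mapping can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Spark's code: `index_to_string` is a hypothetical helper, and the `labels` list stands in for the label order a fitted indexer would record.

```python
def index_to_string(indices, labels):
    """Map each numeric index back to the label stored at that position."""
    return [labels[int(index)] for index in indices]


# Labels in index order (most frequent first), as assumed for this sketch.
labels = ["a", "c", "b"]

# Indices such as a model might predict.
predicted_indices = [0.0, 2.0, 1.0, 0.0]

original = index_to_string(predicted_indices, labels)
print(original)
```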

<a id="onehotencoder"></a>
## <span style="color:green">Getting familiar with the SparkML Transformer: <a href="https://spark.apache.org/docs/latest/ml-features.html#onehotencoder">OneHotEncoder</a> </span>

### One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms that expect numeric features, such as Logistic Regression, to use categorical features. In the Spark 2.x releases this notebook targets, OneHotEncoder is a transformer; from Spark 3.0 onward it is an estimator that must be fit before it can transform data.

<div class="panel-group" id="accordion-12">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-12" href="#collapse1-12">
        Click on this link to expand this cell, then copy and paste the code which will appear in a new cell just below, and execute that cell to see how OneHotEncoder works. (You may subsequently delete that new cell and proceed with this notebook).</a>
      </h4>
    </div>
    <div id="collapse1-12" class="panel-collapse collapse">
      <div class="panel-body">
from pyspark.ml.feature import OneHotEncoder, StringIndexer <br>
<br>
df = spark.createDataFrame([ <br>
    (0, "a"), <br>
    (1, "b"), <br>
    (2, "c"), <br>
    (3, "a"), <br>
    (4, "a"), <br>
    (5, "c")  <br>
    ], ["id", "category"]) <br>
<br>
stringIndexer = StringIndexer(inputCol="category", outputCol="categoryIndex") <br>
model = stringIndexer.fit(df) <br>
indexed = model.transform(df) <br>
<br>
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec") <br>
encoded = encoder.transform(indexed) <br>
encoded.show()
      </div>
    </div>
  </div>
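
The phrase "at most a single one-value" is easier to see in code. The sketch below is plain Python rather than Spark, and `one_hot` is a hypothetical helper. It assumes the last category is dropped, which corresponds to the encoder's `dropLast` parameter being left at its default of true, so the last category is encoded as an all-zero vector.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 values.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is represented by all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    position = int(index)
    if position < size:
        vector[position] = 1.0
    return vector


# Three categories with indices 0.0, 1.0 and 2.0
encoded = [one_hot(index, 3) for index in (0.0, 1.0, 2.0)]
print(encoded)

# Keeping the last category gives one slot per category
full = [one_hot(index, 3, drop_last=False) for index in (0.0, 1.0, 2.0)]
print(full)
```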

<a id="bucketizer"></a>
## <span style="color:green">Getting familiar with the SparkML Transformer: <a href="https://spark.apache.org/docs/latest/ml-features.html#bucketizer">Bucketizer</a> </span>

### Bucketizer transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users. It takes a `splits` parameter that lists the bucket boundaries: n+1 split points define n buckets. Bucketizing data is also referred to as "binning".

<div class="panel-group" id="accordion-13">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-13" href="#collapse1-13">
        Click on this link to expand this cell, then copy and paste the code which will appear in a new cell just below, and execute that cell to see how Bucketizer works. (You may subsequently delete that new cell and proceed with this notebook).</a>
      </h4>
    </div>
    <div id="collapse1-13" class="panel-collapse collapse">
      <div class="panel-body">
from pyspark.ml.feature import Bucketizer<br>
<br>
splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")] <br>
<br>
data = [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)] <br>
dataFrame = spark.createDataFrame(data, ["features"]) <br>
<br>
bucketizer = Bucketizer(splits=splits, inputCol="features", outputCol="bucketedFeatures") <br>
<br>
# Transform original data into its bucket index. <br>
bucketedData = bucketizer.transform(dataFrame) <br>
<br>
print("Bucketizer output with %d buckets" % (len(bucketizer.getSplits())-1)) <br>
bucketedData.show()
      </div>
    </div>
  </div>
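
The way split points translate into bucket indices can be sketched in plain Python. This is an illustration, not Spark's implementation; `bucketize` is a hypothetical helper. It assumes each bucket includes its lower bound and excludes its upper bound, except the last bucket, which also includes its upper bound.

```python
from bisect import bisect_right


def bucketize(value, splits):
    """Return the index of the bucket that contains value.

    splits must be sorted; n + 1 split points define n buckets.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value %r lies outside the splits" % (value,))
    if value == splits[-1]:
        # The last bucket is closed on the right.
        return len(splits) - 2
    return bisect_right(splits, value) - 1


splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]
values = [-999.9, -0.5, -0.3, 0.0, 0.2, 999.9]

buckets = [bucketize(value, splits) for value in values]
print(buckets)
```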

<a id="vectorassembler"></a>
## <span style="color:green">Getting familiar with the SparkML Transformer: <a href="https://spark.apache.org/docs/latest/ml-features.html#vectorassembler">VectorAssembler</a> </span>

### VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.

<div class="panel-group" id="accordion-14">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-14" href="#collapse1-14">
        Click on this link to expand this cell, then copy and paste the code which will appear in a new cell just below, and execute that cell to see how VectorAssembler works. (You may subsequently delete that new cell and proceed with this notebook).</a>
      </h4>
    </div>
    <div id="collapse1-14" class="panel-collapse collapse">
      <div class="panel-body">
from pyspark.ml.linalg import Vectors <br>
from pyspark.ml.feature import VectorAssembler <br>
<br>
dataset = spark.createDataFrame( <br>
    [(0, 18, 1.0, Vectors.dense([0.0, 10.0, 0.5]), 1.0)], <br>
    ["id", "hour", "mobile", "userFeatures", "clicked"]) <br>
<br>
assembler = VectorAssembler( <br>
    inputCols=["hour", "mobile", "userFeatures"], <br>
    outputCol="features") <br>
<br>
output = assembler.transform(dataset) <br>
print("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'") <br>
output.select("features", "clicked").show(truncate=False) <br>
      </div>
    </div>
  </div>
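
Conceptually, assembling is a flattening step. The plain-Python sketch below is an illustration only: `assemble` is a hypothetical helper, and a row is modelled as a dictionary whose values are either numbers or lists of numbers.

```python
def assemble(row, input_cols):
    """Concatenate the chosen columns of one row into a flat feature list."""
    features = []
    for name in input_cols:
        value = row[name]
        if isinstance(value, (list, tuple)):
            # Vector-valued column: append every element
            features.extend(float(item) for item in value)
        else:
            # Scalar column: append the single value
            features.append(float(value))
    return features


row = {
    "id": 0,
    "hour": 18,
    "mobile": 1.0,
    "userFeatures": [0.0, 10.0, 0.5],
    "clicked": 1.0,
}

features = assemble(row, ["hour", "mobile", "userFeatures"])
print(features)
```

Columns that are not listed, such as `id` and `clicked` here, are left out of the feature list.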

<a id="normalizer"></a>
## <span style="color:green">Getting familiar with the SparkML Transformer: <a href="https://spark.apache.org/docs/latest/ml-features.html#normalizer">Normalizer</a> </span>

### Normalizer is a Transformer which transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes parameter p, which specifies the p-norm used for normalization. (p=2 by default.) This normalization can help standardize your input data and improve the behavior of learning algorithms.

<div class="panel-group" id="accordion-15">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-15" href="#collapse1-15">
        Click on this link to expand this cell, then copy and paste the code which will appear in a new cell just below, and execute that cell to see how Normalizer works. (You may subsequently delete that new cell and proceed with this notebook).</a>
      </h4>
    </div>
    <div id="collapse1-15" class="panel-collapse collapse">
      <div class="panel-body">
from pyspark.ml.feature import Normalizer<br>
from pyspark.ml.linalg import Vectors<br>
<br>
dataFrame = spark.createDataFrame([ <br>
    (0, Vectors.dense([1.0, 0.5, -1.0]),), <br>
    (1, Vectors.dense([2.0, 1.0, 1.0]),), <br>
    (2, Vectors.dense([4.0, 10.0, 2.0]),) <br>
], ["id", "features"]) <br>
<br>
# Normalize each Vector using $L^1$ norm.<br>
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)<br>
l1NormData = normalizer.transform(dataFrame) <br>
print("Normalized using L^1 norm") <br>
l1NormData.show() <br>
<br>
# Normalize each Vector using $L^\infty$ norm. <br>
lInfNormData = normalizer.transform(dataFrame, {normalizer.p: float("inf")}) <br>
print("Normalized using L^inf norm") <br>
lInfNormData.show()
      </div>
    </div>
  </div>
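
The arithmetic behind p-norm normalization is short enough to write out. The sketch below is plain Python, not Spark; the helpers `p_norm` and `normalize` are hypothetical names, and returning an all-zero vector unchanged is an assumption of this sketch.

```python
import math


def p_norm(vector, p):
    """Compute the p-norm of a list of numbers; p may be float('inf')."""
    if math.isinf(p):
        return max(abs(x) for x in vector)
    return sum(abs(x) ** p for x in vector) ** (1.0 / p)


def normalize(vector, p=2.0):
    """Scale vector so that its p-norm equals 1."""
    norm = p_norm(vector, p)
    if norm == 0:
        # Assumption of this sketch: leave an all-zero vector unchanged.
        return list(vector)
    return [x / norm for x in vector]


vector = [1.0, 0.5, -1.0]

l1 = normalize(vector, p=1.0)             # divide by |1.0| + |0.5| + |-1.0| = 2.5
linf = normalize(vector, p=float("inf"))  # divide by the largest absolute value, 1.0
l2 = normalize([3.0, 4.0])                # divide by sqrt(9 + 16) = 5

print(l1)
print(linf)
print(l2)
```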

## <span style="color:green">Several other Estimators and Transformers are described in the <a href="https://spark.apache.org/docs/latest/ml-features.html">Apache Spark documentation</a>.</span>

<a id="applypipelineconcepts"></a>
# <span style="color:#fa04d9">Step 7: Applying the concepts described above to our customer churn dataset: 
** * a) Defining and applying the StringIndexer Estimator to input columns Gender, Status, CarOwner, Paymethod, LocalBilltype, LongDistanceBilltype. **<br>
** * b) Defining and applying VectorAssembler to the columns above to group them as one input vector to the model. **<br>
** * c) Defining and applying a StringIndexer Estimator to the target label column "CHURN", to encode the T/F values as 0/1. ** <br>
** * d) Defining and applying an IndexToString Transformer to convert the 0/1 predictions of our model back to T/F values. ** <br>
** * e) Defining the Random Forest estimator itself, which will be trained on the input training data to produce the actual model which will perform the predictions. **

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a) Defining a StringIndexer for the String columns in our dataset

In [None]:
### In this dataset, we will encode columns Gender, Status, CarOwner, Paymethod, LocalBilltype and LongDistanceBilltype
# StringIndexer encodes a string column of labels to a column of label indices. 
SI1 = StringIndexer(inputCol='Gender', outputCol='GenderEncoded')
SI2 = StringIndexer(inputCol='Status',outputCol='StatusEncoded')
SI3 = StringIndexer(inputCol='CarOwner',outputCol='CarOwnerEncoded')
SI4 = StringIndexer(inputCol='Paymethod',outputCol='PaymethodEncoded')
SI5 = StringIndexer(inputCol='LocalBilltype',outputCol='LocalBilltypeEncoded')
SI6 = StringIndexer(inputCol='LongDistanceBilltype',outputCol='LongDistanceBilltypeEncoded')

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b) Define a Vector Assembler for all the columns of interest to be passed into the chosen machine learning model (columns which are encoded as well as those kept as is)

In [None]:
# The Pipelines API requires that input variables are passed in a single vector column
assembler = VectorAssembler(inputCols=["GenderEncoded", "StatusEncoded", "CarOwnerEncoded", "PaymethodEncoded", \
                                       "LocalBilltypeEncoded", "LongDistanceBilltypeEncoded", "Children", "EstIncome", "Age", \
                                       "LongDistance", "International", "Local", "Dropped","Usage","RatePlan"], outputCol="features")

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;c) Defining a StringIndexer for the label column of our model (CHURN column. The values True and False will be converted to 0 and 1)

In [None]:
# encode the label column
labelIndexer = StringIndexer(inputCol='CHURN', outputCol='label').fit(data_df)

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;d) Defining an IndexToString transformer to bring the labels back to True and False once the predictions are done. The model will produce a column named "prediction" which will contain 0 or 1. We will convert it back to True and False in a column named "predictedLabel"

In [None]:
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;e) Defining a Random Forest estimator. This is a popular tree based classifier method

In [None]:
# instantiate the algorithm, take the default settings
rf=RandomForestClassifier(labelCol="label", featuresCol="features")

<a id="createpipeline"></a>
# <span style="color:#fa04d9">Step 8: Creating a Spark ML pipeline:
** * All the individual components of the pipeline have been defined in the section above. Notice how we will now "group" them into a pipeline object</span> **

### In machine learning, it is common to run a sequence of algorithms to process and learn from data. E.g., a simple text document processing workflow might include several stages:
* Split each document’s text into words. 
* Convert each document’s words into a numerical feature vector.  
* Learn a prediction model using the feature vectors and labels.<br>

### MLlib represents such a workflow as a Pipeline, which consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. 

We will now build the Spark pipeline including the operations defined in Step 7 above.
"Pipeline" is an API in SparkML. A pipeline defines a sequence of transformers and estimators to perform the analysis in stages.
Additional information on SparkML is available online, including at this link: https://spark.apache.org/docs/2.0.2/ml-guide.html

In [None]:
# build the pipeline
pipeline = Pipeline(stages=[SI1,SI2,SI3,SI4,SI5,SI6, labelIndexer, assembler, rf, labelConverter])

### Split the data into Training and Testing sets (this is a standard best practice in data science)

In [None]:
# Split data into train and test datasets
(trainingData, testingData) = data_df.randomSplit([0.7, 0.3],seed=9)
trainingData.cache()
testingData.cache()

### Build the model by fitting the whole pipeline on the training data set. <br><br>Note that the pipeline interface will correctly call fit+transform or just transform alone for each stage of the pipeline, depending on whether the current stage is an Estimator (such as StringIndexer) or a Transformer.

In [None]:
# Build model. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages.
model = pipeline.fit(trainingData)
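
The fit-then-transform behaviour described above can be sketched with toy classes in plain Python. This is not Spark's implementation; every class and function name below (`LabelIndexer`, `LabelIndexerModel`, `FeatureLister`, `fit_pipeline`, `transform_pipeline`) is hypothetical, and rows are modelled as dictionaries.

```python
class LabelIndexer(object):
    """Toy estimator: fit() learns a label -> index mapping and returns a transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        labels = sorted(set(row[self.input_col] for row in rows))
        mapping = dict((label, index) for index, label in enumerate(labels))
        return LabelIndexerModel(self.input_col, self.output_col, mapping)


class LabelIndexerModel(object):
    """Toy transformer produced by LabelIndexer.fit()."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        result = []
        for row in rows:
            new_row = dict(row)
            new_row[self.output_col] = self.mapping[row[self.input_col]]
            result.append(new_row)
        return result


class FeatureLister(object):
    """Toy transformer: gathers the chosen columns into one list-valued column."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        result = []
        for row in rows:
            new_row = dict(row)
            new_row[self.output_col] = [row[name] for name in self.input_cols]
            result.append(new_row)
        return result


def fit_pipeline(stages, rows):
    """Fit estimators in order, feeding each stage's output to the next one."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)   # estimator: learn from the data first
        rows = stage.transform(rows)  # then pass the transformed rows downstream
        fitted.append(stage)
    return fitted


def transform_pipeline(fitted_stages, rows):
    """Apply every fitted stage in order; nothing is learned at this point."""
    for stage in fitted_stages:
        rows = stage.transform(rows)
    return rows


training = [{"Gender": "F", "Age": 30}, {"Gender": "M", "Age": 41}]
stages = [LabelIndexer("Gender", "GenderEncoded"),
          FeatureLister(["GenderEncoded", "Age"], "features")]

fitted = fit_pipeline(stages, training)
scored = transform_pipeline(fitted, [{"Gender": "M", "Age": 25}])
print(scored)
```

The list returned by `fit_pipeline` plays the role of the fitted pipeline model: it contains only transformers, so scoring new rows never refits anything.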

<a id="scoretestdata"></a>
# <span style="color:#fa04d9">Step 9: Score the test data set </span>

In [None]:
result=model.transform(testingData)
result_display=result.select(result["ID"],result["CHURN"],result["Label"],result["predictedLabel"],result["prediction"],result["probability"])
result_display.toPandas().head(6)

<a id="evaluate"></a>
# <span style="color:#fa04d9">Step 10: Model Evaluation </span>
** Find accuracy of the model and the Area Under the ROC Curve **

In [None]:
print('Model Accuracy = {:.2f}.'.format(result.filter(result.label == result.prediction).count() / float(result.count())))

### Create an evaluator for the binary classification using area under the ROC Curve as the evaluation metric
Receiver operating characteristic (ROC) is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.

Additional reading on this metric can be found very easily online, such as at this wikipedia link: https://en.wikipedia.org/wiki/Receiver_operating_characteristic
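
Both metrics can be computed by hand on a tiny example. The sketch below is plain Python with made-up labels and scores, not output from the churn model; the helper names are hypothetical. It uses the pairwise interpretation of the area under the ROC curve: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half.

```python
def accuracy(labels, predictions):
    """Fraction of rows whose prediction equals the label."""
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / float(len(labels))


def area_under_roc(labels, scores):
    """Pairwise AUC: share of positive/negative pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up data for illustration only
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.7, 0.8]

# Hard predictions obtained by thresholding the scores at 0.5
predictions = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(labels, predictions)
auc_scores = area_under_roc(labels, scores)
auc_hard = area_under_roc(labels, predictions)

print("accuracy = %.2f" % acc)
print("AUC from continuous scores = %.4f" % auc_scores)
print("AUC from hard 0/1 predictions = %.4f" % auc_hard)
```

Scoring the hard 0/1 predictions discards the ranking among examples on the same side of the threshold, which is why the two AUC values differ.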


In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

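# Evaluate model
# Use the classifier's continuous rawPrediction column so the ROC curve is traced over many thresholds;
# the hard 0/1 "prediction" column would reduce the curve to a single operating point.
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC")
print('Area under ROC curve = {:.2f}.'.format(evaluator.evaluate(result)))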

<a id="tune"></a>
# <span style="color:#fa04d9">Step 11:  Tune the hyperparameters to find the best model </span>

### Build a Parameter Grid specifying the parameters to be evaluated to determine the best combination

In [None]:
# set different levels for the maxDepth
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
paramGrid = (ParamGridBuilder().addGrid(rf.maxDepth,[4,6,8]).build())

### Create a cross validator to tune the pipeline with the generated parameter grid
Cross-validation attempts to fit the underlying estimator with user-specified combinations of parameters, cross-evaluate the fitted models, and output the best one.

In [None]:
# perform 3 fold cross validation
cv = CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
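
The search that a cross validator performs can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: all names are hypothetical, the folds are assigned deterministically rather than at random, and the toy "model" is just a threshold taken from the parameter grid.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    folds = [list(range(start, n_rows, num_folds)) for start in range(num_folds)]
    for held_out in range(num_folds):
        validation = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, validation


def cross_validate(rows, param_grid, fit, score, num_folds=3):
    """Return the parameter setting with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    for params in param_grid:
        fold_scores = []
        for train_idx, val_idx in k_fold_indices(len(rows), num_folds):
            model = fit([rows[i] for i in train_idx], params)
            fold_scores.append(score(model, [rows[i] for i in val_idx]))
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Toy data: (x, label) pairs where the label is 1 when x >= 5
rows = [(x, 1 if x >= 5 else 0) for x in range(1, 10)]


def fit(train_rows, params):
    # Toy "training": the model is simply the threshold under evaluation.
    return params["threshold"]


def score(threshold, validation_rows):
    # Accuracy of the rule "predict 1 when x >= threshold"
    hits = sum(1 for x, label in validation_rows if (1 if x >= threshold else 0) == label)
    return hits / float(len(validation_rows))


param_grid = [{"threshold": 3}, {"threshold": 5}, {"threshold": 7}]
best_params, best_score = cross_validate(rows, param_grid, fit, score, num_folds=3)
print(best_params, best_score)
```

Every candidate in the grid is trained and scored once per fold, so the cost grows with the grid size multiplied by the number of folds.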

In [None]:
# train the model
cvModel = cv.fit(trainingData)

# pick the best model
best_rfModel = cvModel.bestModel

In [None]:
# score the test data set with the best model
cvresult=best_rfModel.transform(testingData)
cvresults_show=cvresult.select(cvresult["ID"],cvresult["CHURN"],cvresult["Label"],cvresult["predictedLabel"],cvresult["prediction"],cvresult["probability"])
cvresults_show.toPandas().head()

In [None]:
print('Model Accuracy of the best fitted model = {:.2f}.'.format(cvresult.filter(cvresult.label == cvresult.prediction).count()/ float(cvresult.count())))
print('Model Accuracy of the default model = {:.2f}.'.format(result.filter(result.label == result.prediction).count() / float(result.count())))
print('   ')
print('Area under the ROC curve of best fitted model = {:.2f}.'.format(evaluator.evaluate(cvresult)))
print('Area under the ROC curve of the default model = {:.2f}.'.format(evaluator.evaluate(result)))

<a id="execute"></a>
# <span style="color:#fa04d9">Step 12: Execute an inline invocation of the best model which was just identified </span>

### Let us now make a prediction for a customer whose attributes we make up ourselves

In [None]:
Gender = 'F'
Status = 'M'
CarOwner = 'N'
Paymethod = 'CC'
LocalBilltype = 'Budget'
LongDistanceBilltype = 'Standard'
Children = 1
EstIncome = 45000
Age = 30
LongDistance = 30
International = 0
Local = 100
Dropped = 0
Usage = 150
RatePlan = 2

Features = (spark.createDataFrame([(Gender, Status, CarOwner, Paymethod, LocalBilltype, LongDistanceBilltype, Children, EstIncome, Age, LongDistance, \
                                              International, Local, Dropped, Usage, RatePlan)],
    ['Gender', 'Status', 'CarOwner', 'Paymethod', 'LocalBilltype', 'LongDistanceBilltype', 'Children', 'EstIncome', 'Age', 'LongDistance', \
     'International', 'Local', 'Dropped', 'Usage', 'RatePlan']))
Features.show()

In [None]:
ChurnPrediction = best_rfModel.transform(Features)
ChurnPrediction.select('rawPrediction', 'probability', 'prediction').show(1, False)

### Mini Exercise: Change the number of children and/or the EstIncome in the cell prior to the one above, and observe the impact on the prediction:
* It seems that a number of children lower than 3 will result in churn, but a customer with 3 children or more will not churn.
* The rule above is true for lower incomes. With higher incomes, churn is less likely (if we change the income to 145,000 the model does not seem to predict churn anymore, regardless of the number of children)

<a id="save"></a>
# <span style="color:#fa04d9"> Step 13: Save Model </span>
** Save the best model in Object Storage. **

** A separate notebook has been created for "batch scoring deployment". This deployment notebook retrieves the model from object storage and applies it to a new dataset. The notebook can be scheduled to run via the Notebook scheduler or through the deployment interface in IBM WML. (In order to schedule through WML, the model needs to first be saved in the WML repository, which can be done using the appropriate API calls) **

In [None]:
# Overwrite any existing saved model in the specified path
best_rfModel.write().overwrite().save("PredictChurn.churnModel")

You have come to the end of this notebook.

**Sidney Phoon**<br/>
**Elena Lowery**<br/>
**Rich Tarro**<br/>
**Mokhtar Kandil**<br/>
July, 2017