# Predicting churn with the SPSS random tree algorithm

This Scala 2.10 notebook shows you how to create a predictive model of churn rate by using IBM SPSS Algorithm on Apache Spark version 1.6. You'll learn how to create an SPSS random tree model by using the IBM SPSS Machine Learning API, and how to view the model with IBM SPSS Model Viewer.

A random tree model is an ensemble of multiple classification and regression trees (CART), which lets it generate accurate predictive models for complex classification and regression problems. Each tree is grown from a bootstrap sample that is produced by resampling the original data points with replacement. As a tree grows, the best split variable for each node is selected from a specified smaller number of variables that are drawn randomly from the full set of variables. Each tree grows without pruning. During the scoring phase, the random tree algorithm aggregates the tree scores by majority voting (for classification) or by averaging (for regression).
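The three building blocks described above (bootstrap resampling, a random subset of candidate split variables, and score aggregation) can be sketched in plain Scala. This is only an illustration of the general technique, not the SPSS implementation; the object and method names are hypothetical.

```scala
import scala.util.Random

// Hypothetical helper names; a sketch of the general random trees idea only.
object RandomTreesSketch {
  // Bootstrap sample: n row indices drawn uniformly with replacement,
  // so some rows repeat and others are left out.
  def bootstrapIndices(n: Int, rng: Random): Vector[Int] =
    Vector.fill(n)(rng.nextInt(n))

  // Candidate split variables for one node: k variables drawn at random
  // (without repeats) from the full set.
  def candidateVariables(all: Seq[String], k: Int, rng: Random): Seq[String] =
    rng.shuffle(all).take(k)

  // Classification scoring: the category predicted by the most trees wins.
  // Ties are broken arbitrarily in this sketch.
  def majorityVote(votes: Seq[Int]): Int =
    votes.groupBy(identity).maxBy(_._2.size)._1

  // Regression scoring: the mean of the per-tree predictions.
  def average(scores: Seq[Double]): Double =
    scores.sum / scores.size
}
```

For example, `RandomTreesSketch.majorityVote(Seq(1, 0, 1, 1, 0))` returns `1`, because three of the five trees voted for category 1.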

In this notebook, you'll use telecommunications data to build a model that predicts which customers are likely to leave for a competitor, so that you can take action to retain them.
    
To get the most out of this notebook, you should have some familiarity with the Scala programming language.

## Contents 
This notebook contains the following main sections:

1. [Load the Telco Churn data to the cloud data repository.](#overview)
1. [Prepare the data.](#prepare)
1. [Configure the RandomTrees model.](#configure) 
1. [View the model.](#view)
1. [Summary and next steps.](#next)    

<a id="overview"></a>
## 1. Load the Telco Churn data to the cloud data repository.
Telco Churn is a hypothetical data file that concerns a telecommunications company's efforts to reduce turnover in its customer base. Each case corresponds to a separate customer and records various demographic and service usage information. Before you can work with the data, you must download the telco.csv and telco_Feb.csv files from the GitHub repository by using their URLs.


In [1]:
import sys.process._
import java.net.URL
import java.io.File

// Download the training data into the notebook's working directory.
val link_telco = "https://raw.githubusercontent.com/AlgorithmDemo/SampleData/master/telco.csv"
(new URL(link_telco) #> new File("telco.csv")).!!

// Download the February data, which is used later to evaluate the model.
val link_telco_Feb = "https://raw.githubusercontent.com/AlgorithmDemo/SampleData/master/telco_Feb.csv"
(new URL(link_telco_Feb) #> new File("telco_Feb.csv")).!!

<a id="prepare"></a>
## 2. Prepare the data.

After downloading the CSV files that contain the data, you must create a SQLContext, put the data from the telco.csv file into a Spark DataFrame, and show the first row in the DataFrame.

In [2]:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val dfTelco = sqlContext.
    read.
    format("com.databricks.spark.csv").
    option("header", "true").
    option("inferschema", "true").
    load("telco.csv")

dfTelco.show(1)

+------+------+---+-------+-------+------+---+------+------+------+------+--------+-----+--------+--------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+--------+-----+-----+--------+------+--------+-------+------+-----+----------------+-------+-------+----------------+-------+----------------+-------+-----+
|region|tenure|age|marital|address|income| ed|employ|retire|gender|reside|tollfree|equip|callcard|wireless|longmon|tollmon|equipmon|cardmon|wiremon|longten|tollten|equipten|cardten|wireten|multline|voice|pager|internet|callid|callwait|forward|confer|ebill|         loglong|logtoll|logequi|         logcard|logwire|           lninc|custcat|churn|
+------+------+---+-------+-------+------+---+------+------+------+------+--------+-----+--------+--------+-------+-------+--------+-------+-------+-------+-------+--------+-------+-------+--------+-----+-----+--------+------+--------+-------+------+-----+----------------+-------+-------+----------------+--

Review the data. Print the schema of the DataFrame to look at what kind of data you have.

In [3]:
dfTelco.printSchema

root
 |-- region: integer (nullable = true)
 |-- tenure: integer (nullable = true)
 |-- age: integer (nullable = true)
 |-- marital: integer (nullable = true)
 |-- address: integer (nullable = true)
 |-- income: integer (nullable = true)
 |-- ed: integer (nullable = true)
 |-- employ: integer (nullable = true)
 |-- retire: integer (nullable = true)
 |-- gender: integer (nullable = true)
 |-- reside: integer (nullable = true)
 |-- tollfree: integer (nullable = true)
 |-- equip: integer (nullable = true)
 |-- callcard: integer (nullable = true)
 |-- wireless: integer (nullable = true)
 |-- longmon: double (nullable = true)
 |-- tollmon: double (nullable = true)
 |-- equipmon: double (nullable = true)
 |-- cardmon: double (nullable = true)
 |-- wiremon: double (nullable = true)
 |-- longten: double (nullable = true)
 |-- tollten: double (nullable = true)
 |-- equipten: double (nullable = true)
 |-- cardten: double (nullable = true)
 |-- wireten: double (nullable = true)
 |-- multline: int

Create a DataFrame for the telco_Feb.csv data. You'll use the telco.csv data to build the model, and use the February data to measure its accuracy.

In [4]:
val dfTelcoFeb = sqlContext.
    read.
    format("com.databricks.spark.csv").
    option("header", "true").
    option("inferschema", "true").
    load("telco_Feb.csv")

<a id="configure"></a>
## 3. Configure the RandomTrees model.

The following code imports the libraries, sets the ordinal and nominal variables, and creates the random trees estimator with churn as the target field. Because no inputFieldList value is set, all fields except the target, frequency, and analysis weight fields are treated as input fields. To make the random tree model build faster, the number of trees is set to 10 instead of the default value, which is 100.

In [5]:
import com.ibm.spss.ml.classificationandregression.ensemble.RandomTrees
import com.ibm.spss.ml.utils.DataFrameImplicits.DataFrameEnrichImplicitsClass

val ordinal = Array("ed")
val nominal = Array("region",
     "marital",
     "retire",
     "gender",
     "tollfree",
     "equip",
     "callcard",
     "wireless","multline",
     "voice","pager","internet","callid","callwait","forward","confer",
     "ebill",
     "custcat",
     "churn"
   )
val srf = RandomTrees().setTargetField("churn").setNumTrees(10)
val srfModel = srf.fit(dfTelco.setNominalMeasure(nominal,true).setOrdinalMeasure(ordinal,true))

Do the prediction and get your results.

In [6]:
val predResult = srfModel.transform(dfTelcoFeb)
val predResultNew = predResult.withColumn("prediction", predResult("prediction").cast("double")).
    withColumn("churn", predResult("churn").cast("double"))

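To get the accuracy result, use the Apache Spark **MulticlassClassificationEvaluator** class. In Spark 1.6, its `precision` metric is the overall precision across all classes, which is the same value as accuracy. Notice that the accuracy on the February data is above 90%.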

In [7]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val evaluator = new MulticlassClassificationEvaluator().setLabelCol("churn").setMetricName("precision")
val acc_result = evaluator.evaluate(predResultNew)
println(s"acc_result:$acc_result")

acc_result:0.948
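The evaluator is asked for `precision` but the result is read as accuracy. The reason can be sketched in plain Scala: when true positives and false positives are summed over all classes, every prediction falls into exactly one of the two groups, so the overall precision is the fraction of correct predictions. The object name, method names, and toy values below are hypothetical and are not taken from the notebook's data.

```scala
// Hypothetical names and toy data; illustrates the metric, not Spark's code.
object OverallPrecisionSketch {
  // Accuracy: the fraction of (prediction, label) pairs that agree.
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (p, l) => p == l }.toDouble / pairs.size

  // Overall precision: true positives summed over every predicted class,
  // divided by all positives (true plus false) summed over every class.
  def overallPrecision(pairs: Seq[(Double, Double)]): Double = {
    val classes = pairs.map(_._1).distinct
    val tp = classes.map(c => pairs.count { case (p, l) => p == c && l == c }).sum
    val fp = classes.map(c => pairs.count { case (p, l) => p == c && l != c }).sum
    tp.toDouble / (tp + fp)
  }
}

// Toy (prediction, churn) pairs: three of the five predictions are correct.
val toyPairs = Seq((1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0))
println(OverallPrecisionSketch.accuracy(toyPairs))         // 0.6
println(OverallPrecisionSketch.overallPrecision(toyPairs)) // 0.6
```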


<a id="view"></a>
## 4. View the model.

### Show the random trees model result.
To see the result, import the IBM SPSS Model Viewer, which you can use to explore different views of the model.

### 4.1 Generate a project token

Before you can run the model viewer, you need to generate a project token.

1. In the **My Projects** banner, click the **More** icon and then click **Insert project token**. The project token is inserted into the first cell of the notebook, before the title.
2. Copy the text, which appears at the beginning of the notebook, into the following cell and run it.

### 4.2 Start the model viewer

Run the code in the following cell to start SPSS Model Viewer, where you can see a visualization of the model along with model statistics and other characteristics.

In [9]:
import com.ibm.spss.scala.ModelViewer
kernel.magics.html(ModelViewer.toHTML(pc, srfModel))

0,1
Target Field,churn
Model Building Method,Random Trees Classification
Number of Predictors Input,36
Model Accuracy,0.692
Misclassification Rate,0.308

Records,Number,Percent
Included,1000,100.0
Excluded,0,0.0
Total,1000,100.0

0,1
Mean,35.526
Standard Deviation,21.349
Minimum,1.0
Maximum,72.0
N,1000.0

0,1
Mean,41.684
Standard Deviation,12.553
Minimum,18.0
Maximum,77.0
N,1000.0

0,1
Mean,11.551
Standard Deviation,10.082
Minimum,0.0
Maximum,55.0
N,1000.0

Value,Count,Percent
2,334,33.4
3,344,34.4
1,322,32.2

0,1
Mean,11.723
Standard Deviation,10.358
Minimum,0.9
Maximum,99.95
N,1000.0

0,1
Mean,10.987
Standard Deviation,10.077
Minimum,0.0
Maximum,47.0
N,1000.0

0,1
Mean,77.535
Standard Deviation,106.991
Minimum,9.0
Maximum,1668.0
N,1000.0

0,1
Mean,13.781
Standard Deviation,14.077
Minimum,0.0
Maximum,109.25
N,1000.0

0,1
Mean,574.05
Standard Deviation,789.579
Minimum,0.9
Maximum,7257.6
N,1000.0

0,1
Mean,13.274
Standard Deviation,16.894
Minimum,0.0
Maximum,173.0
N,1000.0

Value,Count,Percent
1,204,20.4
2,287,28.7
3,209,20.9
4,234,23.4
5,66,6.6

0,1
Mean,14.22
Standard Deviation,19.059
Minimum,0.0
Maximum,77.7
N,1000.0

Value,Count,Percent
1,495,49.5
0,505,50.5

0,1
Mean,2.182
Standard Deviation,0.734
Minimum,-0.105
Maximum,4.605
N,1000.0

0,1
Mean,2.331
Standard Deviation,1.435
Minimum,1.0
Maximum,8.0
N,1000.0

0,1
Mean,605.774
Standard Deviation,829.711
Minimum,0.0
Maximum,7515.0
N,1000.0

0,1
Mean,3.957
Standard Deviation,0.803
Minimum,2.197
Maximum,7.419
N,1000.0

0,1
Mean,11.584
Standard Deviation,19.71
Minimum,0.0
Maximum,111.95
N,1000.0

0,1
Mean,551.259
Standard Deviation,915.289
Minimum,0.0
Maximum,5916.0
N,1000.0

Value,Count,Percent
0,483,48.3
1,517,51.7

0,1
Mean,465.633
Standard Deviation,856.844
Minimum,0.0
Maximum,5028.65
N,1000.0

Value,Count,Percent
1,266,26.6
4,236,23.6
3,281,28.1
2,217,21.7

0,1
Mean,442.737
Standard Deviation,970.985
Minimum,0.0
Maximum,7856.85
N,1000.0

Value,Count,Percent
1,678,67.8
0,322,32.2

Value,Count,Percent
0,526,52.6
1,474,47.4

Value,Count,Percent
0,696,69.6
1,304,30.4

Value,Count,Percent
0,614,61.4
1,386,38.6

Value,Count,Percent
0,498,49.8
1,502,50.2

Value,Count,Percent
0,629,62.9
1,371,37.1

Value,Count,Percent
0,525,52.5
1,475,47.5

Value,Count,Percent
0,739,73.9
1,261,26.1

Value,Count,Percent
0,519,51.9
1,481,48.1

Value,Count,Percent
0,632,63.2
1,368,36.8

Value,Count,Percent
1,493,49.3
0,507,50.7

Value,Count,Percent
0,704,70.4
1,296,29.6

Value,Count,Percent
0,515,51.5
1,485,48.5

Decision Rule,Most Frequent Category,Rule Accuracy,Ensemble Accuracy,Interestingness Index
internet = 0 AND gender = 1 AND loglong > 2.56109578814555 AND equipmon <= 0.0,0,0.976,1.0,0.953
tollmon <= 21.25 AND internet = 0 AND tenure > 48.0 AND tenure > 13.0 AND equip = 0,0,0.976,0.976,0.952
multline = 1 AND longmon > 16.15 AND age > 36.0 AND equipmon <= 0.0,0,0.971,0.971,0.944
longten > 85.05 AND employ > 8.0 AND tenure > 11.0 AND income <= 47.0 AND equipmon <= 0.0,0,0.967,1.0,0.934
multline = 1 AND lninc > 3.55534806148941 AND address > 8.0 AND ebill = 0 AND tenure > 30.0,0,0.927,1.0,0.859

Observed,Predicted 1,Predicted 0,Percent Correct
1,120,154,43.8%
0,152,569,78.9%
Percent Correct,44.1%,78.7%,69.2%
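The percentages in the classification table above follow directly from the four counts. The following sketch, which uses hypothetical names and only the counts shown in the table, reproduces them: each row percentage is the share of an observed category that was predicted correctly, each column percentage is the share of a predicted category that was correct, and the overall figure is the share of all counted records on the diagonal.

```scala
// Hypothetical names; the counts are copied from the classification table.
object ConfusionMatrixSketch {
  val obs1Pred1 = 120 // observed 1, predicted 1
  val obs1Pred0 = 154 // observed 1, predicted 0
  val obs0Pred1 = 152 // observed 0, predicted 1
  val obs0Pred0 = 569 // observed 0, predicted 0

  // Percentage rounded to one decimal place.
  def pct(num: Int, den: Int): Double = math.round(1000.0 * num / den) / 10.0

  val rowPct1 = pct(obs1Pred1, obs1Pred1 + obs1Pred0) // 43.8
  val rowPct0 = pct(obs0Pred0, obs0Pred1 + obs0Pred0) // 78.9
  val colPct1 = pct(obs1Pred1, obs1Pred1 + obs0Pred1) // 44.1
  val colPct0 = pct(obs0Pred0, obs1Pred0 + obs0Pred0) // 78.7

  val total = obs1Pred1 + obs1Pred0 + obs0Pred1 + obs0Pred0
  val overallPct = pct(obs1Pred1 + obs0Pred0, total)  // 69.2
}
```

The overall value of 69.2% corresponds to the Model Accuracy of 0.692 that the viewer reports, and the remaining 30.8% corresponds to the Misclassification Rate of 0.308.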


### Export the XML files (PMML, StatXML) for other detail statistics.
By exporting your results to different formats, such as Predictive Model Markup Language (PMML) or StatXML, you can share your statistical analyses outside of IBM Data Science Experience.

In [10]:
import java.io.PrintWriter

// Export the model as PMML.
srfModel.toPMML("randomTrees_pmml.xml")

// Export the StatXML output; the writer is closed even if the write fails.
val statXML = srfModel.statXML()
val writer = new PrintWriter("StatXML.xml")
try writer.write(statXML) finally writer.close()

<a id="next"></a>
## 5. Summary and next steps.
You have created a predictive model of churn rate by using IBM SPSS Algorithm on Apache Spark. Next, you can create a different model and compare the model evaluations, such as the test of model effects, residuals, and so on. See the [SPSS documentation](https://apsportal.ibm.com/docs/content/kc_gen/integrations-gen2.html).

## Authors

Wang Zhiyuan and Yu Wenpei are SPSS Algorithm Engineers at IBM.

Copyright © 2017 IBM. This notebook and its source code are released under the terms of the MIT License.