## SPSS Decision Tree with Visualization

First we import the libraries we need for the decision tree, the model viewer, and Spark SQL.

In [13]:
import com.ibm.spss.ml.classificationandregression.tree.CHAID
import com.ibm.spss.scala.ModelViewer
import org.apache.spark.sql.SQLContext

Here we create a sqlContext from the existing Spark context sc. It is our entry point for Spark SQL and dataframes.

In [14]:
val sqlContext = new SQLContext(sc)

This function configures Hadoop's Swift connector with our object storage credentials so that Spark can read our CSV file.

In [15]:
def setHadoopConfig(credentials: collection.mutable.Map[String, String]) = {
    val prefix = "fs.swift.service." + credentials("name") 
    val hconf = sc.hadoopConfiguration
    hconf.set(prefix + ".auth.url", credentials("auth_url") + "/v3/auth/tokens")
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials("project_id"))
    hconf.set(prefix + ".username", credentials("user_id"))
    hconf.set(prefix + ".password", credentials("password"))
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials("region"))
    hconf.setBoolean(prefix + ".public", true)
}

### To import data
- Click on the cell below this one.
- Click the 1010 icon at the top right.
- Click "Insert to Code" under transactions.csv.
- Rename "var credentials_..." in the first line to "var credentials".
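For orientation, the later cells treat credentials as a mutable map of strings. The sketch below lists only the keys that setHadoopConfig and the following cells read. Every value is a made-up placeholder, and the names exampleCredentials and exampleUrl are hypothetical, chosen so that running the sketch does not overwrite the real credentials.

```scala
import scala.collection.mutable

// Placeholder values only: the real map comes from "Insert to Code".
// The keys are the ones read by setHadoopConfig and the cells below.
val exampleCredentials = mutable.Map[String, String](
  "auth_url"   -> "https://identity.example.com",
  "project_id" -> "example-project-id",
  "user_id"    -> "example-user-id",
  "password"   -> "example-password",
  "region"     -> "example-region",
  "container"  -> "notebooks",
  "filename"   -> "transactions.csv"
)

// The notebook sets "name" itself in the next cell.
exampleCredentials("name") = "chaid"

// The Swift URL is assembled as swift://<container>.<name>/<filename>.
val exampleUrl = "swift://" + exampleCredentials("container") + "." +
  exampleCredentials("name") + "/" + exampleCredentials("filename")
```

With these placeholders the assembled URL is swift://notebooks.chaid/transactions.csv, which has the same shape as the path the notebook passes to the CSV reader.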

In [17]:
credentials("name") = "chaid"
setHadoopConfig(credentials)

In [18]:
val filePath = "swift://" + credentials("container") + "." + credentials("name") + "/"
val fileName = credentials("filename")

This creates a Spark SQL dataframe from the transactions.csv data set, reading the column names from the header row and inferring the column types.

In [19]:
val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferschema", "true")
    .load(filePath + fileName)

How many records are in the data set?

In [8]:
df.count

60252

Are there any nulls? No: dropping every row that contains a null leaves the count unchanged.

In [9]:
df.na.drop.count

60252

In [10]:
df.show(5)

+--------------------+------------+-----------------+---------------+-----+--------------+------+---+--------------+------------+
|        PRODUCT_LINE|PRODUCT_TYPE|CUST_ORDER_NUMBER|           CITY|STATE|       COUNTRY|GENDER|AGE|MARITAL_STATUS|  PROFESSION|
+--------------------+------------+-----------------+---------------+-----+--------------+------+---+--------------+------------+
|Personal Accessories|  Navigation|           174344|       Plymouth|   NA|United Kingdom|     M| 27|        Single|Professional|
|Personal Accessories|     Eyewear|           170637|        Leipzig|   NA|       Germany|     F| 39|       Married|       Other|
|Mountaineering Eq...|        Rope|           170637|        Leipzig|   NA|       Germany|     F| 39|       Married|       Other|
|Personal Accessories|  Binoculars|           170641|         Manaus|BR-AM|        Brazil|     F| 56|   Unspecified| Hospitality|
|      Golf Equipment|       Woods|           170643|College Station|   TX| United States|

In [11]:
df.printSchema

root
 |-- PRODUCT_LINE: string (nullable = true)
 |-- PRODUCT_TYPE: string (nullable = true)
 |-- CUST_ORDER_NUMBER: integer (nullable = true)
 |-- CITY: string (nullable = true)
 |-- STATE: string (nullable = true)
 |-- COUNTRY: string (nullable = true)
 |-- GENDER: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- MARITAL_STATUS: string (nullable = true)
 |-- PROFESSION: string (nullable = true)



AGE needs to be binned before it can be used in the CHAID decision tree. We will leave it out for now. Try coding this up as a challenge!

In [12]:
df.describe("AGE").show

+-------+------------------+
|summary|               AGE|
+-------+------------------+
|  count|             60252|
|   mean|  34.1874792538007|
| stddev|10.105477019283859|
|    min|                17|
|    max|                69|
+-------+------------------+
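One possible starting point for the binning challenge, sketched in plain Scala, is to map each age to a labelled band. The names ageEdges, ageBand and sampleBands and the cut points 25, 35, 45 and 55 are assumptions chosen for illustration (AGE runs from 17 to 69 in this data set). They are not taken from the original analysis.

```scala
// Hypothetical cut points, in ascending order; adjust to suit the analysis.
val ageEdges = Vector(25, 35, 45, 55)

// Returns a label such as "<25", "25-34" or "55+" for the given age.
def ageBand(age: Int, edges: Seq[Int] = ageEdges): String = {
  // Index of the first cut point strictly greater than the age, or -1 if none.
  val idx = edges.indexWhere(age < _)
  if (idx == 0) s"<${edges.head}"
  else if (idx == -1) s"${edges.last}+"
  else s"${edges(idx - 1)}-${edges(idx) - 1}"
}

// Ages taken from the sample rows and the summary shown above.
val sampleBands = Seq(17, 27, 39, 56, 69).map(a => a -> ageBand(a))
```

In the notebook, a function like this could be wrapped with org.apache.spark.sql.functions.udf and applied through df.withColumn to add a string column (for example AGE_BAND), which could then be listed alongside the other predictors passed to CHAID.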



Now we split our data into training and test sets, roughly 60% and 40% of the records, using a fixed seed so the split is repeatable.

In [13]:
val Array(training, test) = df.randomSplit(Array(0.6, 0.4), seed=12345)

Here we choose our model, set PRODUCT_LINE as the target, and set GENDER, PROFESSION and MARITAL_STATUS as the predictors.

In [14]:
val chaid = CHAID().setTargetField("PRODUCT_LINE").setInputFieldList(Array("GENDER", "PROFESSION", "MARITAL_STATUS"))

In [15]:
val chaidModel = chaid.fit(training)

### Model Visualization

The model viewer produces a single output that contains everything you need to evaluate your model: 3 tables and 2 interactive charts. For this model the Predictor Importance is constant, so that chart is not drawn.

In [None]:
kernel.magics.html(ModelViewer.toHTML(chaidModel))