# Part 9 Machine Learning With Spark ML
As the last step, you are given a dataset called `data/ccdefault.csv`. The dataset represents defaults of credit card clients. It has 30,000 cases and 24 different attributes. More details about the dataset are available in `data/ccdefault.txt`. In this task you should build three models, compare their results, and decide which one is the best solution. Here are the suggested steps:
1. Load the data.
2. Carry out some exploratory analyses (e.g., how various features and the target variable are distributed).
3. Train a model to predict the target variable (risk of `default`).
  - Employ three different models (logistic regression, decision tree, and random forest).
  - Compare the models' performances (e.g., AUC).
  - Defend your choice of best model (e.g., what are the strengths and weaknesses of each of these models?).
4. What more would you do with this data? Anything to help you devise a better solution?

---
# 1. Get the data
We start by loading the dataset. Setting `inferSchema` to `true` makes Spark infer the column types automatically by reading the file, and the `header` option reads the column names from the first line of the file.

In [43]:
val default = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("data/ccdefault.csv")

default: org.apache.spark.sql.DataFrame = [ID: int, LIMIT_BAL: int ... 23 more fields]


---
# 2. Discover the data to gain insights
Here we analyse the data by obtaining a statistical summary of the attributes and by computing the correlation between several features. In the end, we combine 18 of the 24 features, in 3 groups of 6, to generate 3 new attributes.

## 2.1. Schema and dimension
Print the schema of the dataset

In [44]:
default.printSchema()

root
 |-- ID: integer (nullable = true)
 |-- LIMIT_BAL: integer (nullable = true)
 |-- SEX: integer (nullable = true)
 |-- EDUCATION: integer (nullable = true)
 |-- MARRIAGE: integer (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- PAY_0: integer (nullable = true)
 |-- PAY_2: integer (nullable = true)
 |-- PAY_3: integer (nullable = true)
 |-- PAY_4: integer (nullable = true)
 |-- PAY_5: integer (nullable = true)
 |-- PAY_6: integer (nullable = true)
 |-- BILL_AMT1: integer (nullable = true)
 |-- BILL_AMT2: integer (nullable = true)
 |-- BILL_AMT3: integer (nullable = true)
 |-- BILL_AMT4: integer (nullable = true)
 |-- BILL_AMT5: integer (nullable = true)
 |-- BILL_AMT6: integer (nullable = true)
 |-- PAY_AMT1: integer (nullable = true)
 |-- PAY_AMT2: integer (nullable = true)
 |-- PAY_AMT3: integer (nullable = true)
 |-- PAY_AMT4: integer (nullable = true)
 |-- PAY_AMT5: integer (nullable = true)
 |-- PAY_AMT6: integer (nullable = true)
 |-- DEFAULT: integer (nullable = true)

Print the number of records in the dataset.

In [45]:
default.count()

res35: Long = 30000


## 2.2. Look at the data
Print the first five records of the dataset.

In [46]:
default.show(5)

+---+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-------+
| ID|LIMIT_BAL|SEX|EDUCATION|MARRIAGE|AGE|PAY_0|PAY_2|PAY_3|PAY_4|PAY_5|PAY_6|BILL_AMT1|BILL_AMT2|BILL_AMT3|BILL_AMT4|BILL_AMT5|BILL_AMT6|PAY_AMT1|PAY_AMT2|PAY_AMT3|PAY_AMT4|PAY_AMT5|PAY_AMT6|DEFAULT|
+---+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-------+
|  1|    20000|  2|        2|       1| 24|    2|    2|   -1|   -1|   -2|   -2|     3913|     3102|      689|        0|        0|        0|       0|     689|       0|       0|       0|       0|      1|
|  2|   120000|  2|        2|       2| 26|   -1|    2|    0|    0|    0|    2|     2682|     1725|     2682|     3272|     3455|     3261|       0|    1000|    1000|    1000|       0|    2000|    

Print the number of records with age more than 30.

In [47]:
default.filter("age > 30").count()

res37: Long = 18987


## 2.3. Statistical summary
A summary of the table statistics for the attributes `age`, `sex`, `education`, and `limit_bal`.

In [48]:
default.describe("age", "sex", "education", "limit_bal").show()

+-------+-----------------+------------------+------------------+------------------+
|summary|              age|               sex|         education|         limit_bal|
+-------+-----------------+------------------+------------------+------------------+
|  count|            30000|             30000|             30000|             30000|
|   mean|          35.4855|1.6037333333333332|1.8531333333333333|167484.32266666667|
| stddev|9.217904068090155|0.4891291960902602|0.7903486597207269|129747.66156720246|
|    min|               21|                 1|                 0|             10000|
|    max|               79|                 2|                 6|           1000000|
+-------+-----------------+------------------+------------------+------------------+



## 2.4. Correlation among attributes
We compute the standard correlation coefficient (Pearson) between every pair of the attributes `LIMIT_BAL`, `SEX`, `EDUCATION`, `MARRIAGE`, and `AGE`. The rows and columns of the resulting 5 by 5 matrix follow this order.

In [49]:
import org.apache.spark.ml.feature.VectorAssembler

val va = new VectorAssembler().setInputCols(Array("LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE")).setOutputCol("features")

val defaultAttrs = va.transform(default)

defaultAttrs.show(5)

+---+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-------+--------------------+
| ID|LIMIT_BAL|SEX|EDUCATION|MARRIAGE|AGE|PAY_0|PAY_2|PAY_3|PAY_4|PAY_5|PAY_6|BILL_AMT1|BILL_AMT2|BILL_AMT3|BILL_AMT4|BILL_AMT5|BILL_AMT6|PAY_AMT1|PAY_AMT2|PAY_AMT3|PAY_AMT4|PAY_AMT5|PAY_AMT6|DEFAULT|            features|
+---+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-------+--------------------+
|  1|    20000|  2|        2|       1| 24|    2|    2|   -1|   -1|   -2|   -2|     3913|     3102|      689|        0|        0|        0|       0|     689|       0|       0|       0|       0|      1|[20000.0,2.0,2.0,...|
|  2|   120000|  2|        2|       2| 26|   -1|    2|    0|    0|    0|    2|     2682|     1725|     2682|    

import org.apache.spark.ml.feature.VectorAssembler
va: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_6e03b13ab469
defaultAttrs: org.apache.spark.sql.DataFrame = [ID: int, LIMIT_BAL: int ... 24 more fields]


In [50]:
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val Row(coeff: Matrix) = Correlation.corr(defaultAttrs, "features").head

println(s"The standard correlation coefficient:\n ${coeff}")

The standard correlation coefficient:
 1.0                   0.024755235111645853  -0.2191606982292233   ... (5 total)
0.024755235111645853  1.0                   0.01423193616219367   ...
-0.2191606982292233   0.01423193616219367   1.0                   ...
-0.10813941027800818  -0.03138884007083411  -0.14346434041145634  ...
0.14471279755736938   -0.09087364652720994  0.17506066148814436   ...


import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
coeff: org.apache.spark.ml.linalg.Matrix =
1.0                   0.024755235111645853  -0.2191606982292233   ... (5 total)
0.024755235111645853  1.0                   0.01423193616219367   ...
-0.2191606982292233   0.01423193616219367   1.0                   ...
-0.10813941027800818  -0.03138884007083411  -0.14346434041145634  ...
0.14471279755736938   -0.09087364652720994  0.17506066148814436   ...
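
To make the entries of this matrix more concrete, here is a minimal sketch of the Pearson coefficient for a single pair of columns, written in plain Scala without Spark. The helper `pearson` and the toy values are assumptions made for illustration; they are not part of the notebook or of the dataset.

```scala
// Pearson correlation of two equally long samples (plain Scala, no Spark).
def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
  require(xs.length == ys.length && xs.nonEmpty, "samples must be non-empty and equally long")
  val n = xs.length
  val meanX = xs.sum / n
  val meanY = ys.sum / n
  // covariance and variances up to the common factor 1/n, which cancels in the ratio
  val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
  val varX = xs.map(x => math.pow(x - meanX, 2)).sum
  val varY = ys.map(y => math.pow(y - meanY, 2)).sum
  cov / math.sqrt(varX * varY)
}

// toy data (hypothetical, not taken from ccdefault.csv)
val age   = Seq(24.0, 26.0, 34.0, 37.0, 57.0)
val limit = Seq(20.0, 120.0, 90.0, 50.0, 50.0)

val rSelf = pearson(age, age)     // a column against itself
val rSym1 = pearson(age, limit)
val rSym2 = pearson(limit, age)   // same pair, swapped
println(f"self = $rSelf%.3f, age/limit = $rSym1%.3f, limit/age = $rSym2%.3f")
```

This explains two properties visible in the output above: the diagonal is 1.0 because every attribute is perfectly correlated with itself, and the matrix is symmetric because swapping the two columns does not change the coefficient.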


## 2.5. Combine and make new attributes
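We chose to combine the monthly payment parameters by averaging them over the six months. This generates 3 new features: `repayment` (average of `PAY_0` and `PAY_2` to `PAY_6`), `billStatement` (average of `BILL_AMT1` to `BILL_AMT6`), and `payment` (average of `PAY_AMT1` to `PAY_AMT6`).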

In [51]:
import org.apache.spark.sql.functions._

val repayCol = Array(col("PAY_0"), col("PAY_2"), col("PAY_3"), col("PAY_4"), col("PAY_5"), col("PAY_6"))
val averageRepay = repayCol.foldLeft(lit(0)){(x, y) => x+y}/repayCol.length
val default1 = default.withColumn("repayment", averageRepay)


val billCol = Array(col("BILL_AMT1"), col("BILL_AMT2"), col("BILL_AMT3"), col("BILL_AMT4"), col("BILL_AMT5"), col("BILL_AMT6"))
val averageBill = billCol.foldLeft(lit(0)){(x, y) => x+y}/billCol.length
val default2 = default1.withColumn("billStatement", averageBill)

val payCol = Array(col("PAY_AMT1"), col("PAY_AMT2"), col("PAY_AMT3"), col("PAY_AMT4"), col("PAY_AMT5"), col("PAY_AMT6"))
val averagePay = payCol.foldLeft(lit(0)){(x, y) => x+y}/payCol.length
val defaultExtra = default2.withColumn("payment", averagePay)

defaultExtra.select("repayment", "billStatement", "payment").show(5)


+-------------------+------------------+------------------+
|          repayment|     billStatement|           payment|
+-------------------+------------------+------------------+
|-0.3333333333333333|            1284.0|114.83333333333333|
|                0.5|2846.1666666666665| 833.3333333333334|
|                0.0|16942.166666666668|1836.3333333333333|
|                0.0|38555.666666666664|            1398.0|
|-0.3333333333333333|18223.166666666668|            9841.5|
+-------------------+------------------+------------------+
only showing top 5 rows



import org.apache.spark.sql.functions._
repayCol: Array[org.apache.spark.sql.Column] = Array(PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6)
averageRepay: org.apache.spark.sql.Column = (((((((0 + PAY_0) + PAY_2) + PAY_3) + PAY_4) + PAY_5) + PAY_6) / 6)
default1: org.apache.spark.sql.DataFrame = [ID: int, LIMIT_BAL: int ... 24 more fields]
billCol: Array[org.apache.spark.sql.Column] = Array(BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6)
averageBill: org.apache.spark.sql.Column = (((((((0 + BILL_AMT1) + BILL_AMT2) + BILL_AMT3) + BILL_AMT4) + BILL_AMT5) + BILL_AMT6) / 6)
default2: org.apache.spark.sql.DataFrame = [ID: int, LIMIT_BAL: int ... 25 more fields]
payCol: Array[org.apache.spark.sql.Column] = Array(PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6)
averagePay...
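
As a quick sanity check, the averages of the first record can be recomputed by hand. The sketch below uses plain Scala and the values of record 1 as printed by `default.show(5)`; the helper name `rowAverage` is an assumption made for this illustration.

```scala
// First record as printed by default.show(5) above
val repayStatus = Seq(2, 2, -1, -1, -2, -2)       // PAY_0, PAY_2 ... PAY_6
val billAmounts = Seq(3913, 3102, 689, 0, 0, 0)   // BILL_AMT1 ... BILL_AMT6
val payAmounts  = Seq(0, 689, 0, 0, 0, 0)         // PAY_AMT1 ... PAY_AMT6

// same arithmetic as the foldLeft over columns: sum first, then divide by the count
def rowAverage(values: Seq[Int]): Double = values.sum.toDouble / values.length

val repayment     = rowAverage(repayStatus)
val billStatement = rowAverage(billAmounts)
val payment       = rowAverage(payAmounts)
println(s"repayment=$repayment billStatement=$billStatement payment=$payment")
```

The three values agree with the first row of the table above, which confirms that each new attribute is a plain arithmetic mean over the six monthly columns.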

---
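# 3. Prepare the data for Machine Learning algorithms
We rename the target column `DEFAULT` to `label` and drop the `ID` column, which is only a record identifier. The cells below start from the original `default` DataFrame, so the three averaged attributes from section 2.5 are not among the training features.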


In [52]:
val renamedDefault = default.withColumnRenamed("DEFAULT", "label")

val filteredDefault = renamedDefault.drop("ID")

renamedDefault: org.apache.spark.sql.DataFrame = [ID: int, LIMIT_BAL: int ... 23 more fields]
filteredDefault: org.apache.spark.sql.DataFrame = [LIMIT_BAL: int, SEX: int ... 22 more fields]


In [53]:
// label columns
val colLabel = "label"

// numerical columns
val colNum = filteredDefault.columns.filter(_ != colLabel)

colLabel: String = label
colNum: Array[String] = Array(LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6)


## 3.1. Prepare continuous attributes
### Data cleaning
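The check below counts the null, empty, and NaN entries in every feature column. All counts are zero, so the dataset does not have missing values.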

In [54]:
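// count null, empty, or NaN entries per column; the explicit tuple avoids relying on argument auto-tupling
for (c <- colNum) {
    println((c, filteredDefault.filter(filteredDefault(c).isNull || filteredDefault(c) === "" || filteredDefault(c).isNaN).count()))
}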

(LIMIT_BAL,0)
(SEX,0)
(EDUCATION,0)
(MARRIAGE,0)
(AGE,0)
(PAY_0,0)
(PAY_2,0)
(PAY_3,0)
(PAY_4,0)
(PAY_5,0)
(PAY_6,0)
(BILL_AMT1,0)
(BILL_AMT2,0)
(BILL_AMT3,0)
(BILL_AMT4,0)
(BILL_AMT5,0)
(BILL_AMT6,0)
(PAY_AMT1,0)
(PAY_AMT2,0)
(PAY_AMT3,0)
(PAY_AMT4,0)
(PAY_AMT5,0)
(PAY_AMT6,0)


### Scaling
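Here we standardize the values of the attributes so that each one has unit variance. With its default settings, `StandardScaler` divides each attribute by its standard deviation and does not subtract the mean, so the scaled values are not centred at zero.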

In [55]:
import org.apache.spark.ml.feature.{VectorAssembler, StandardScaler}

val va = new VectorAssembler().setInputCols(colNum).setOutputCol("featuresscale")
val featuredDefault = va.transform(filteredDefault) 
val scaler = new StandardScaler().setInputCol("featuresscale").setOutputCol("scaled")
val scaledDefault = scaler.fit(featuredDefault).transform(featuredDefault)

scaledDefault.show(5)
scaledDefault.columns

+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-----+--------------------+--------------------+
|LIMIT_BAL|SEX|EDUCATION|MARRIAGE|AGE|PAY_0|PAY_2|PAY_3|PAY_4|PAY_5|PAY_6|BILL_AMT1|BILL_AMT2|BILL_AMT3|BILL_AMT4|BILL_AMT5|BILL_AMT6|PAY_AMT1|PAY_AMT2|PAY_AMT3|PAY_AMT4|PAY_AMT5|PAY_AMT6|label|       featuresscale|              scaled|
+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-----+--------------------+--------------------+
|    20000|  2|        2|       1| 24|    2|    2|   -1|   -1|   -2|   -2|     3913|     3102|      689|        0|        0|        0|       0|     689|       0|       0|       0|       0|    1|[20000.0,2.0,2.0,...|[0.15414535998894...|
|   120000|  2|        2|       2| 26|   -1|    2|  

import org.apache.spark.ml.feature.{VectorAssembler, StandardScaler}
va: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_e6a758dc86ee
featuredDefault: org.apache.spark.sql.DataFrame = [LIMIT_BAL: int, SEX: int ... 23 more fields]
scaler: org.apache.spark.ml.feature.StandardScaler = stdScal_28e53d6ab527
scaledDefault: org.apache.spark.sql.DataFrame = [LIMIT_BAL: int, SEX: int ... 24 more fields]
res43: Array[String] = Array(LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6, BILL_AMT1, BILL_AMT2, BILL_AMT3, BILL_AMT4, BILL_AMT5, BILL_AMT6, PAY_AMT1, PAY_AMT2, PAY_AMT3, PAY_AMT4, PAY_AMT5, PAY_AMT6, label, featuresscale, scaled)
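
The first entry of the `scaled` vector can be checked by hand. The sketch below is plain Scala and assumes scaling to unit standard deviation without centring, which is our reading of the `StandardScaler` defaults; the helpers `sampleStd` and `scaleToUnitStd` and the toy column are hypothetical. The standard deviation of `limit_bal` is taken from the `describe` output in section 2.3.

```scala
// Sample standard deviation (n - 1 in the denominator).
def sampleStd(xs: Seq[Double]): Double = {
  val mean = xs.sum / xs.length
  math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / (xs.length - 1))
}

// Scale to unit standard deviation without centring.
def scaleToUnitStd(xs: Seq[Double]): Seq[Double] = {
  val std = sampleStd(xs)
  xs.map(_ / std)
}

// LIMIT_BAL of the first record and the stddev reported by describe() above
val limitBalStd   = 129747.66156720246
val firstLimitBal = 20000.0
val scaledFirst   = firstLimitBal / limitBalStd
println(scaledFirst)   // compare with the first entry of the `scaled` vector

// toy column (hypothetical values) to show the effect on the spread
val toy    = Seq(10.0, 20.0, 30.0, 40.0)
val scaled = scaleToUnitStd(toy)
println(sampleStd(scaled))
```

The first printed value starts with 0.154145, which agrees with the `scaled` column shown above. Because nothing is subtracted, all scaled values keep the sign of the original attribute.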


In [56]:
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}

val numPipeline = new Pipeline().setStages(Array(va, scaler))
val pipeline = new Pipeline().setStages(Array(numPipeline))
val newDefault = pipeline.fit(filteredDefault).transform(filteredDefault)
newDefault.show(5)

+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-----+--------------------+--------------------+
|LIMIT_BAL|SEX|EDUCATION|MARRIAGE|AGE|PAY_0|PAY_2|PAY_3|PAY_4|PAY_5|PAY_6|BILL_AMT1|BILL_AMT2|BILL_AMT3|BILL_AMT4|BILL_AMT5|BILL_AMT6|PAY_AMT1|PAY_AMT2|PAY_AMT3|PAY_AMT4|PAY_AMT5|PAY_AMT6|label|       featuresscale|              scaled|
+---------+---+---------+--------+---+-----+-----+-----+-----+-----+-----+---------+---------+---------+---------+---------+---------+--------+--------+--------+--------+--------+--------+-----+--------------------+--------------------+
|    20000|  2|        2|       1| 24|    2|    2|   -1|   -1|   -2|   -2|     3913|     3102|      689|        0|        0|        0|       0|     689|       0|       0|       0|       0|    1|[20000.0,2.0,2.0,...|[0.15414535998894...|
|   120000|  2|        2|       2| 26|   -1|    2|  

import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
numPipeline: org.apache.spark.ml.Pipeline = pipeline_d197a4463891
pipeline: org.apache.spark.ml.Pipeline = pipeline_a584609d8436
newDefault: org.apache.spark.sql.DataFrame = [LIMIT_BAL: int, SEX: int ... 24 more fields]


Create the `dataset` to be used for training: the scaled vector becomes the `features` column, next to `label`.

In [57]:
val va2 = new VectorAssembler().setInputCols(Array("scaled")).setOutputCol("features")
val dataset = va2.transform(newDefault).select("features", "label")

dataset.show(5)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.15414535998894...|    1|
|[0.92487215993365...|    1|
|[0.69365411995024...|    0|
|[0.38536339997235...|    0|
|[0.38536339997235...|    0|
+--------------------+-----+
only showing top 5 rows



va2: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_7ae4f5490324
dataset: org.apache.spark.sql.DataFrame = [features: vector, label: int]


---
# 4. Make a model
Here we are going to train three different models on the 0/1 `label`:
* Logistic regression
* Decision tree (fitted with `DecisionTreeRegressor`)
* Random forest (fitted with `RandomForestRegressor`)


In [58]:
val Array(trainSet, testSet) = dataset.randomSplit(Array(0.8, 0.2))

trainSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, label: int]
testSet: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, label: int]
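
The split above is drawn without a seed, so every run produces a different train and test set. As a minimal sketch of a repeatable split, `randomSplit` also accepts a seed as its second argument. The toy DataFrame, the seed value, and the local `SparkSession` below are assumptions made so that the example runs on its own; in the notebook `spark` and `dataset` already exist.

```scala
import org.apache.spark.sql.SparkSession

// local session for the sketch; in the notebook `spark` already exists
val spark = SparkSession.builder().master("local[1]").appName("split-sketch").getOrCreate()

// stand-in for `dataset` (hypothetical: 1,000 rows with a single id column)
val toyData = spark.range(1000).toDF("id")

// fixing the seed makes the 80/20 partition repeatable across runs
val splitSeed = 42L   // arbitrary value chosen for this sketch
val Array(trainA, testA) = toyData.randomSplit(Array(0.8, 0.2), splitSeed)
val Array(trainB, testB) = toyData.randomSplit(Array(0.8, 0.2), splitSeed)

// both calls should select exactly the same test rows
val sameTestRows = testA.except(testB).count() == 0 && testB.except(testA).count() == 0
println(s"train=${trainA.count()} test=${testA.count()} identical test sets: $sameTestRows")
```

With a fixed seed, differences in AUC between two runs can be attributed to the models rather than to the rows that happened to land in the test set.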


## 4.1. Logistic regression model
Now, train a logistic regression model using the `LogisticRegression` class. Then, print the coefficients and intercept of the model, as well as the ROC curve and the area under it on the training set, which are available from the summary returned by `binarySummary`.

In [59]:
import org.apache.spark.ml.classification.LogisticRegression

// train the model
val lr = new LogisticRegression().setMaxIter(10)
val lrModel = lr.fit(trainSet)
val trainingSummary = lrModel.binarySummary

val roc = trainingSummary.roc
roc.show()

println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
println(s"AreaUnderROC: ${trainingSummary.areaUnderROC}")

+--------------------+--------------------+
|                 FPR|                 TPR|
+--------------------+--------------------+
|                 0.0|                 0.0|
|0.003705692803437...|0.032003012048192774|
|0.006874328678839957| 0.06588855421686747|
|0.009828141783029001| 0.10052710843373494|
|0.013480128893662728| 0.13271837349397592|
|0.016970998925886143| 0.16547439759036145|
|0.020461868958109558| 0.19823042168674698|
|0.024543501611170783|  0.2289156626506024|
| 0.02916219119226638|  0.2577183734939759|
| 0.03431793770139635| 0.28463855421686746|
|   0.040171858216971| 0.30911144578313254|
| 0.04505907626208378|  0.3369728915662651|
| 0.05080558539205156|  0.3618222891566265|
|0.057518796992481205|  0.3832831325301205|
| 0.06439312567132116|  0.4041792168674699|
| 0.07234156820622986| 0.42131024096385544|
| 0.07986036519871106|  0.4399472891566265|
| 0.08909774436090226| 0.45256024096385544|
|  0.0985499462943072| 0.46442018072289154|
| 0.10735767991407089|  0.478727

import org.apache.spark.ml.classification.LogisticRegression
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_cf7792d896b2
lrModel: org.apache.spark.ml.classification.LogisticRegressionModel = LogisticRegressionModel: uid = logreg_cf7792d896b2, numClasses = 2, numFeatures = 23
trainingSummary: org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary = org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummaryImpl@3d213494
roc: org.apache.spark.sql.DataFrame = [FPR: double, TPR: double]


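Now, use `BinaryClassificationEvaluator` to measure the area under the ROC curve (AUC) of the model on the test dataset. Note that the evaluator below reads the `prediction` column, which holds hard 0/1 class predictions rather than continuous scores, so the resulting AUC describes a single decision threshold and is likely to understate how well the model ranks clients.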

In [60]:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator


// make predictions on the test data
val predictions = lrModel.transform(testSet)
predictions.select("prediction", "label", "features").show(5)

val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC").setRawPredictionCol("prediction").setLabelCol("label")
val roc = evaluator.evaluate(predictions)
println(s"AreaUnderROC on test data = $roc")

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|    0|(23,[0,1,2,3,4,5,...|
|       0.0|    0|(23,[0,1,2,3,4,5,...|
|       0.0|    1|(23,[0,1,2,3,4,5,...|
|       0.0|    0|(23,[0,1,2,3,4,5,...|
|       0.0|    1|(23,[0,1,2,3,4,5,...|
+----------+-----+--------------------+
only showing top 5 rows

AreaUnderROC on test data = 0.5982478105592436


import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
predictions: org.apache.spark.sql.DataFrame = [features: vector, label: int ... 3 more fields]
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_4f6a0b944620
roc: Double = 0.5982478105592436
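
The effect of evaluating hard predictions instead of scores can be illustrated without Spark. The sketch below computes the AUC as the fraction of (positive, negative) pairs that are ranked correctly, with ties counting one half. The function `auc` and the toy scores are assumptions made for illustration, not model output. In Spark, continuous scores of this kind are available in the `rawPrediction` column, which is the column `BinaryClassificationEvaluator` reads by default.

```scala
// AUC as the probability that a random positive is scored above a random
// negative (ties count one half). Plain Scala, quadratic in the sample size.
def auc(scores: Seq[Double], labels: Seq[Int]): Double = {
  val scored    = scores.zip(labels)
  val positives = scored.collect { case (s, 1) => s }
  val negatives = scored.collect { case (s, 0) => s }
  require(positives.nonEmpty && negatives.nonEmpty, "both classes are required")
  val wins = for (p <- positives; n <- negatives)
    yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
  wins.sum / (positives.size * negatives.size)
}

// toy scores and labels (hypothetical, not model output)
val labels = Seq(1, 1, 0, 1, 0, 0, 0)
val scores = Seq(0.9, 0.6, 0.45, 0.4, 0.3, 0.2, 0.1)

// hard 0/1 predictions at a 0.5 threshold
val hard = scores.map(s => if (s >= 0.5) 1.0 else 0.0)

val aucFromScores = auc(scores, labels)
val aucFromHard   = auc(hard, labels)
println(f"AUC from scores = $aucFromScores%.4f, from hard predictions = $aucFromHard%.4f")
```

In this toy case thresholding turns many correctly ranked pairs into ties, and the AUC drops from 11/12 to 10/12. The same mechanism may explain part of the gap between the logistic regression AUC and the AUC of the two tree-based models below, which are scored on continuous predictions.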


## 4.2. Decision tree regression
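Repeat what you have done for the logistic regression model to build a decision tree model. Use the `DecisionTreeRegressor` to fit the model on the 0/1 label and then measure its AUC on the test dataset. The regressor's `prediction` is a continuous value (the mean label of the matching leaf), so it can serve directly as a ranking score.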

In [61]:
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.ml.evaluation.RegressionEvaluator

val dt = new DecisionTreeRegressor().setLabelCol("label").setFeaturesCol("features")

// train the model
val dtModel = dt.fit(trainSet)

// make predictions on the test data
val predictions = dtModel.transform(testSet)
predictions.select("prediction", "label", "features").show(10)

// score the test set with the continuous prediction and compute the test AUC
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC").setRawPredictionCol("prediction").setLabelCol("label")
val roc = evaluator.evaluate(predictions)
println(s"AreaUnderROC on test data = $roc")

+------------------+-----+--------------------+
|        prediction|label|            features|
+------------------+-----+--------------------+
|0.2987012987012987|    0|(23,[0,1,2,3,4,5,...|
|0.2987012987012987|    0|(23,[0,1,2,3,4,5,...|
|0.2987012987012987|    1|(23,[0,1,2,3,4,5,...|
|0.2987012987012987|    0|(23,[0,1,2,3,4,5,...|
|0.2987012987012987|    1|(23,[0,1,2,3,4,5,...|
|0.2987012987012987|    0|(23,[0,1,2,3,4,5,...|
|0.2987012987012987|    1|(23,[0,1,2,3,4,5,...|
|0.2987012987012987|    1|(23,[0,1,2,3,4,5,...|
|0.2987012987012987|    0|(23,[0,1,2,3,4,5,...|
|0.2987012987012987|    0|(23,[0,1,2,3,4,5,...|
+------------------+-----+--------------------+
only showing top 10 rows

AreaUnderROC on test data = 0.7479485933575502


import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.ml.evaluation.RegressionEvaluator
dt: org.apache.spark.ml.regression.DecisionTreeRegressor = dtr_d219f4d6059b
dtModel: org.apache.spark.ml.regression.DecisionTreeRegressionModel = DecisionTreeRegressionModel (uid=dtr_d219f4d6059b) of depth 5 with 63 nodes
predictions: org.apache.spark.sql.DataFrame = [features: vector, label: int ... 1 more field]
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_68400c51e586
roc: Double = 0.7479485933575502
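
Since the target is binary, the tree could also be fitted as a classifier. The following is a minimal sketch, not the approach used in this notebook: it trains a `DecisionTreeClassifier` on a tiny hypothetical dataset and evaluates the AUC on the `rawPrediction` column. The toy data and the local `SparkSession` are assumptions made so that the example runs on its own.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

// local session for the sketch; in the notebook `spark` already exists
val spark = SparkSession.builder().master("local[1]").appName("dt-classifier-sketch").getOrCreate()
import spark.implicits._

// toy training data (hypothetical): one feature, label 1 for larger values
val toy = Seq(
  (Vectors.dense(0.1), 0.0), (Vectors.dense(0.2), 0.0),
  (Vectors.dense(0.3), 0.0), (Vectors.dense(0.4), 0.0),
  (Vectors.dense(0.6), 1.0), (Vectors.dense(0.7), 1.0),
  (Vectors.dense(0.8), 1.0), (Vectors.dense(0.9), 1.0)
).toDF("features", "label")

// classifier variant of the tree; maxDepth = 5 mirrors the depth reported above
val dtc = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxDepth(5)
val dtcModel = dtc.fit(toy)

// the classifier adds rawPrediction, probability and prediction columns
val scored = dtcModel.transform(toy)

// rawPrediction is the evaluator's default score column
val aucEvaluator = new BinaryClassificationEvaluator()
  .setMetricName("areaUnderROC")
  .setLabelCol("label")
val toyAuc = aucEvaluator.evaluate(scored)
println(s"AUC on the toy training data = $toyAuc")
```

The same pattern should carry over to `RandomForestClassifier` and to the logistic regression model, so that all three models are compared on continuous scores through the same evaluator.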


## 4.3. Random forest regression
Let's measure the test AUC of a random forest model. You can use the `RandomForestRegressor` to make a random forest model.

In [62]:
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.evaluation.RegressionEvaluator

val rf = new RandomForestRegressor().setLabelCol("label").setFeaturesCol("features")

// train the model
val rfModel = rf.fit(trainSet)

// make predictions on the test data
val predictions = rfModel.transform(testSet)
predictions.select("prediction", "label", "features").show(5)

// score the test set with the continuous prediction and compute the test AUC
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC").setRawPredictionCol("prediction").setLabelCol("label")
val roc = evaluator.evaluate(predictions)
println(s"AreaUnderROC on test data = $roc")

+-------------------+-----+--------------------+
|         prediction|label|            features|
+-------------------+-----+--------------------+
|0.28602066035377427|    0|(23,[0,1,2,3,4,5,...|
| 0.2922952491960832|    0|(23,[0,1,2,3,4,5,...|
| 0.2922952491960832|    1|(23,[0,1,2,3,4,5,...|
| 0.2922952491960832|    0|(23,[0,1,2,3,4,5,...|
|0.25418748520266976|    1|(23,[0,1,2,3,4,5,...|
+-------------------+-----+--------------------+
only showing top 5 rows

AreaUnderROC on test data = 0.7668859344670713


import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.evaluation.RegressionEvaluator
rf: org.apache.spark.ml.regression.RandomForestRegressor = rfr_2fded48f4edf
rfModel: org.apache.spark.ml.regression.RandomForestRegressionModel = RandomForestRegressionModel (uid=rfr_2fded48f4edf) with 20 trees
predictions: org.apache.spark.sql.DataFrame = [features: vector, label: int ... 1 more field]
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator = binEval_e1de9b771033
roc: Double = 0.7668859344670713


## The models we used

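We tested three different models to predict default: logistic regression, decision tree, and random forest. We used the AUC on the test set to compare them, and the scores improved in this order. Two caveats apply. First, the two classes are not equally represented in the dataset and the train/test split is random and unseeded, so the exact numbers depend on the test set that is drawn. Second, the logistic regression AUC was computed from hard 0/1 predictions, whereas the two tree-based models were scored on continuous predictions, so its value is not directly comparable with the other two. Logistic regression has a linear decision boundary: it is less prone to overfitting than a single decision tree, but it cannot capture non-linear relations between the attributes and the target. Decision trees are simple, explainable models and already perform quite well here. Random forests combine many decision trees, which usually makes them more robust and more accurate than a single tree, at the cost of interpretability. Here the random forest reached the highest test AUC of the three.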

AreaUnderROC for test sets:

Logistic Regression: 0.5982478105592436

Decision Tree: 0.7479485933575502

Random Forest: 0.7668859344670713

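The given attributes are relevant to the problem of financial default: both tree-based models reached a test AUC of about 0.75 or higher, and logistic regression reached about 0.60 even when scored on hard predictions.
The models could become more accurate if we obtained more historical data and trained on more aggregate features, such as the averaged attributes from section 2.5, which were not used for training here. More granular data for parameters like education would also help to improve the predictions.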