# Freddie Mac Mortgage Data

### Pipeline - Random Forest

Thomas Butler (vra2cf), Andrej Erkelens (wsw3fa), Matt Suozzi (mds5dd)

labelCol: Delinquency Status

- Downsample and Split train and test data
- StringIndexer and OneHotEncoder for categorical features
- VectorAssembler of numerical features
- StandardScaler to scale features (if necessary)
- Cross-validation to train Random Forest
- Tune Hyperparameters on train data
- Evaluate on test: Accuracy, Confusion Matrix, ROC/AUROC


In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql.functions import *
from pyspark.sql import functions as F
import pandas as pd
import numpy as np
import time

In [2]:
# Load Data
df = spark.read.parquet("/project/ds5559/Group_6_Housing/DS-5110-Final-Project/df_model_1.parquet")

In [3]:
#create binary variable for response
df = df.withColumn("label", when(df.Current_Loan_Delinquency_Status_Cat == "Past_Due", 1.0).\
                             when(df.Current_Loan_Delinquency_Status_Cat == "Deliquent",1.0).\
                            when(df.Current_Loan_Delinquency_Status_Cat == "0",0.0))
df.columns

['Current_Loan_Delinquency_Status',
 'Current_Loan_Delinquency_Status_Cat',
 'Credit_Score',
 'First_Time_Homebuyer_Flag',
 'Occupancy_Status',
 'Original_Combined_Loan-to-Value_CLTV',
 'Original_Debt-to-Income_DTI_Ratio',
 'Original_Interest_Rate',
 'Loan_Purpose',
 'Original_Loan_Term',
 'Number_of_Borrowers',
 'Loan_Age',
 'Actual_Loss_Calculation',
 'Delinquent_Accrued_Interest',
 'Remaining_Months_to_Legal_Maturity',
 'Modification_Flag',
 'Change_of_Interest_Rate',
 'label']

In [5]:
df.select('Credit_Score','Original_Combined_Loan-to-Value_CLTV','Original_Debt-to-Income_DTI_Ratio').summary().show()

+-------+-----------------+------------------------------------+---------------------------------+
|summary|     Credit_Score|Original_Combined_Loan-to-Value_CLTV|Original_Debt-to-Income_DTI_Ratio|
+-------+-----------------+------------------------------------+---------------------------------+
|  count|          4411156|                             4411156|                          4411156|
|   mean|755.3636792713746|                   73.03009279200282|                33.99194451522458|
| stddev|42.65619126289339|                   16.93961230990166|                9.666894418350566|
|    min|              309|                                   6|                                1|
|    25%|              727|                                  63|                               27|
|    50%|              764|                                  75|                               35|
|    75%|              790|                                  85|                               42|
|    max| 

In [4]:
total = df.count()

response_cat = df.select("label").\
                groupBy('label').\
                agg(F.count('label').alias("label_count")).\
                withColumn("Percent", col('label_count')/total*100)

response_cat.show()

+-----+-----------+-----------------+
|label|label_count|          Percent|
+-----+-----------+-----------------+
|  0.0|    4152122|94.12775245309847|
|  1.0|     259034|5.872247546901538|
+-----+-----------+-----------------+



Only 5.9% of loans are delinquent, so we will downsample the non-delinquent (majority) class in the training data.
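As a rough check on how aggressive the downsampling has to be, the sketch below derives the sampling fraction for the majority class from the counts printed above. It is plain Python, and the helper name `majority_fraction` is hypothetical; the notebook computes the same ratio on the training split rather than on the full dataset.

```python
# Class counts copied from the label summary above (full dataset, before any split).
N_NEG = 4152122  # label 0.0, not delinquent
N_POS = 259034   # label 1.0, delinquent


def majority_fraction(n_pos: int, n_neg: int) -> float:
    """Fraction of majority-class rows to keep so both classes end up roughly equal."""
    return n_pos / n_neg


frac = majority_fraction(N_POS, N_NEG)
expected_neg_kept = N_NEG * frac  # expected number of label-0 rows after sampling

print(f"keep about {frac:.2%} of label-0 rows")
print(f"expected label-0 rows kept: {expected_neg_kept:.0f}")
```

Because `sampleBy` samples each row independently, the realized count only approximates this expectation.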

In [4]:
# Load libraries for Pipeline
from pyspark.ml.feature import Bucketizer
from pyspark.ml.feature import OneHotEncoder, StringIndexer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.feature import StandardScaler
from pyspark.ml import Pipeline

# Load libraries for Models
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics

### Create Pipeline with Transformers


In [5]:
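# Split data 50/50, then downsample the majority class in train because delinquency is rare
seed = 314
train_test = [0.5,0.5]
train, test = df.randomSplit(train_test, seed)

# Ratio of delinquent to non-delinquent loans in train
frac=train.filter(train.label==1).count()/train.filter(train.label==0).count()

# Keep every delinquent loan and a matching fraction of non-delinquent loans
train = train.sampleBy("label", fractions={0:frac, 1:1}, seed=seed)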

# Use smaller subset to test code
sample = df.sampleBy("label", fractions={0: 0.05, 1: 0.05}, seed=seed)
holdout = df.subtract(sample)

In [6]:
# Transform features

# Bucketize FICO scores into categories
splits = [300, 580, 670, 740, 800, 850]
bucketizer = Bucketizer(splits=splits, inputCol="Credit_Score", outputCol="Credit_Score_Groups")

# Transform Categorical variables into dummy variables using StringIndexer and OneHotEncoder
stringIndexer_HF = StringIndexer(inputCol="First_Time_Homebuyer_Flag", outputCol="First_Time_Homebuyer_Flag_Index")
encoder_HF = OneHotEncoder(inputCol="First_Time_Homebuyer_Flag_Index", outputCol="First_Time_Homebuyer_Flag_Vec")

stringIndexer_LP = StringIndexer(inputCol="Loan_Purpose", outputCol="Loan_Purpose_Index")
encoder_LP = OneHotEncoder(inputCol="Loan_Purpose_Index", outputCol="Loan_Purpose_Vec")

stringIndexer_OS = StringIndexer(inputCol="Occupancy_Status", outputCol="Occupancy_Status_Index")
encoder_OS = OneHotEncoder(inputCol="Occupancy_Status_Index", outputCol="Occupancy_Status_Vec")
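To make the bucket boundaries concrete, here is a plain-Python sketch of the mapping that `Bucketizer` applies with these splits: each bucket is `[lower, upper)`, and the last bucket also includes its upper bound. `bucket_index` is a hypothetical helper, not part of PySpark, and it assumes the default behaviour of raising an error on out-of-range values.

```python
from bisect import bisect_right

# Same split points as the Bucketizer above: 6 splits define 5 buckets.
SPLITS = [300, 580, 670, 740, 800, 850]


def bucket_index(score: float, splits=SPLITS) -> float:
    """Return the bucket index for a score, mimicking [lower, upper) buckets."""
    if score < splits[0] or score > splits[-1]:
        raise ValueError(f"score {score} is outside [{splits[0]}, {splits[-1]}]")
    if score == splits[-1]:
        # The top bucket is closed on the right, so 850 falls in the last bucket.
        return float(len(splits) - 2)
    return float(bisect_right(splits, score) - 1)


for score in (309, 580, 764, 850):
    print(score, "->", bucket_index(score))
```

A score outside 300 to 850 would raise here, which is worth keeping in mind if the data ever contains placeholder credit scores.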

In [7]:
# Assemble features
numericCols = ["Original_Combined_Loan-to-Value_CLTV", "Original_Debt-to-Income_DTI_Ratio", "Original_Interest_Rate"]
indexCategoricalCols = ["Credit_Score_Groups","First_Time_Homebuyer_Flag_Index", "Loan_Purpose_Index", "Occupancy_Status_Index"]
categoricalCols = ["Credit_Score_Groups", "First_Time_Homebuyer_Flag_Vec", "Loan_Purpose_Vec","Occupancy_Status_Vec"]

# no scaling, no OHE
assemblerInputs_simple = numericCols + indexCategoricalCols
assembler_simple = VectorAssembler(inputCols=assemblerInputs_simple, outputCol="features")

# no scaling, with OHE
assemblerInputs_mix = numericCols + categoricalCols
assembler_mix = VectorAssembler(inputCols=assemblerInputs_mix, outputCol="features_OHE")
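The `_Vec` columns assembled here come from `StringIndexer` followed by `OneHotEncoder`. The sketch below imitates the two steps in plain Python: categories are indexed by descending frequency, and with `dropLast=true` the last index becomes an all-zero vector. The helper names and the sample values are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each category to an index, most frequent first (ties alphabetical in this sketch)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return {cat: float(i) for i, cat in enumerate(ordered)}


def one_hot_drop_last(index: float, n_categories: int):
    """One-hot encode an index into n_categories - 1 slots; the last index maps to all zeros."""
    vec = [0.0] * (n_categories - 1)
    if int(index) < n_categories - 1:
        vec[int(index)] = 1.0
    return vec


# Illustrative category codes only; the frequencies are made up.
purposes = ["P", "P", "P", "N", "N", "C"]
mapping = fit_string_index(purposes)
encoded = {cat: one_hot_drop_last(idx, len(mapping)) for cat, idx in mapping.items()}

for cat, vec in encoded.items():
    print(cat, mapping[cat], vec)
```

A three-level column therefore contributes two slots to the assembled vector, which matches the size-2 vectors shown in the prediction output.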

In [8]:
# Scale the assembled numerical features, then combine them with the OHE columns
assembler_num = VectorAssembler(inputCols=numericCols, outputCol="features")
std = StandardScaler(inputCol='features',outputCol='scaled_features')
assemblerInputs_full = ['scaled_features'] + categoricalCols
assembler_full = VectorAssembler(inputCols=assemblerInputs_full, outputCol='features_full')
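With its defaults, `StandardScaler` divides each feature by its sample standard deviation and does not center it. The plain-Python sketch below applies that rescaling to a few illustrative interest rates and checks that the ordering of the values is unchanged, which is one reason tree ensembles are generally insensitive to this kind of scaling. `scale_to_unit_std` and `ranks` are hypothetical helpers.

```python
from statistics import stdev


def scale_to_unit_std(values):
    """Divide by the sample standard deviation (n - 1 denominator), without centering."""
    s = stdev(values)
    return [v / s for v in values]


def ranks(xs):
    """Indices of xs in ascending order of value (stable for ties)."""
    return sorted(range(len(xs)), key=lambda i: xs[i])


# Illustrative interest rates, not a sample drawn from the dataset.
rates = [3.5, 4.875, 3.5, 4.875, 5.5]
scaled = scale_to_unit_std(rates)

print("std after scaling:", round(stdev(scaled), 6))
print("ordering preserved:", ranks(rates) == ranks(scaled))
```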

In [9]:
# Set up pipelines -  do not use
#pipelineRF = Pipeline(stages=[bucketizer, stringIndexer_HF, encoder_HF, stringIndexer_LP,encoder_LP, stringIndexer_OS,encoder_OS, assembler,std,assembler2])

In [10]:
#transform data - do not use
#model1 = pipelineRF.fit(df)
#df = model1.transform(df)

In [11]:
# Define Random Forest Model
rf_simple = RandomForestClassifier(labelCol='label', featuresCol="features", maxBins=10, seed=seed)

rf_mix = RandomForestClassifier(labelCol='label', featuresCol="features_OHE", maxBins=10, seed=seed)

rf_full = RandomForestClassifier(labelCol='label', featuresCol="features_full", maxBins=10, seed=seed)

In [12]:
# Set up pipelines
# no scaling, no OHE
pipelineRF_simple = Pipeline(stages=[bucketizer, stringIndexer_HF, stringIndexer_LP, stringIndexer_OS,assembler_simple, rf_simple])

# no scaling with OHE
pipelineRF_mix = Pipeline(stages=[bucketizer, stringIndexer_HF, encoder_HF, stringIndexer_LP,encoder_LP, stringIndexer_OS,encoder_OS, assembler_mix, rf_mix])

# scaling and OHE
pipelineRF_full = Pipeline(stages=[bucketizer, stringIndexer_HF, encoder_HF, stringIndexer_LP,encoder_LP, stringIndexer_OS,encoder_OS, assembler_num,std,assembler_full, rf_full])

In [13]:
# Evaluator
evalRF = BinaryClassificationEvaluator()

### Train and Evaluate Model

Train and test using each pipeline to see if using one hot encoding and scaled numerical variables improves the model.

In [104]:
# no scaling, no OHE
m1_start = time.time()
modelRF_simple = pipelineRF_simple.fit(train)
m1_end = time.time()
m1_time = m1_end-m1_start
print('Total time for training was ' + str(m1_time))

Total time for training was 128.6145749092102


In [105]:
predRF_simple = modelRF_simple.transform(test)

In [106]:
predRF_simple.select('label','features','prediction', 'probability').show(5, truncate=False)

+-----+---------------------------------+----------+----------------------------------------+
|label|features                         |prediction|probability                             |
+-----+---------------------------------+----------+----------------------------------------+
|0.0  |[60.0,44.0,3.5,1.0,0.0,1.0,0.0]  |1.0       |[0.48313075183666554,0.5168692481633345]|
|0.0  |[70.0,41.0,4.875,1.0,0.0,1.0,0.0]|1.0       |[0.3343314247674514,0.6656685752325486] |
|0.0  |[80.0,44.0,3.5,1.0,0.0,2.0,0.0]  |1.0       |[0.4808054817882799,0.5191945182117201] |
|0.0  |[50.0,36.0,4.875,1.0,2.0,0.0,0.0]|1.0       |[0.33552186164907244,0.6644781383509276]|
|0.0  |[80.0,45.0,5.5,1.0,2.0,0.0,0.0]  |1.0       |[0.33168855006413356,0.6683114499358666]|
+-----+---------------------------------+----------+----------------------------------------+
only showing top 5 rows



In [107]:
#Best AUROC
print('Best Test AUROC: ', evalRF.evaluate(predRF_simple))

Best Test AUROC:  0.7055191515036103
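`BinaryClassificationEvaluator` reports area under the ROC curve by default. One way to read that number: it is the probability that a randomly chosen delinquent loan receives a higher score than a randomly chosen non-delinquent one. The sketch below computes that pairwise form on made-up scores; `auroc_pairwise` is a hypothetical helper, and Spark derives the area from the ROC curve instead, so this only illustrates the interpretation.

```python
def auroc_pairwise(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1.0]
    neg = [s for y, s in zip(labels, scores) if y == 0.0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and class-1 probabilities, for illustration only.
labels = [1.0, 1.0, 0.0, 0.0, 0.0]
scores = [0.68, 0.52, 0.66, 0.48, 0.33]

auc = auroc_pairwise(labels, scores)
print(f"AUROC = {auc:.4f}")
```

On this reading, a value near 0.71 means the model orders a delinquent loan above a non-delinquent one in roughly seven of ten such pairs.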


In [146]:
print(modelRF_simple.stages)
modelRF_simple.stages[-1].featureImportances

[Bucketizer_638a059ebd32, StringIndexerModel: uid=StringIndexer_072da252b4f2, handleInvalid=error, StringIndexerModel: uid=StringIndexer_abefe9a09ee6, handleInvalid=error, StringIndexerModel: uid=StringIndexer_2b845242b6ed, handleInvalid=error, VectorAssembler_1e488bd4ac1d, RandomForestClassificationModel: uid=RandomForestClassifier_fec2bd05c95f, numTrees=20, numClasses=2, numFeatures=7]


SparseVector(7, {0: 0.0172, 1: 0.1392, 2: 0.7129, 3: 0.1221, 4: 0.0003, 5: 0.0063, 6: 0.002})

Features: LTV, DTI, Interest Rate, Credit Score Bucket, First Time Home Buyer, Loan Purpose, Occupancy Status

 - Interest Rate is the most important feature.

 - First Time Home Buyer is the least important feature.

In [12]:
# no scaling, with OHE
m1_start = time.time()
modelRF_mix = pipelineRF_mix.fit(train)
m1_end = time.time()
m1_time = m1_end-m1_start
print('Total time for training was ' + str(m1_time))

Total time for training was 303.9675626754761


In [13]:
predRF_mix = modelRF_mix.transform(test)

In [14]:
#Best AUROC
print('Best Test AUROC: ', evalRF.evaluate(predRF_mix))

Best Test AUROC:  0.7070914736438928


In [15]:
predRF_mix.select('label','features_OHE','prediction', 'probability').show(5, truncate=False)

+-----+---------------------------------------------+----------+---------------------------------------+
|label|features_OHE                                 |prediction|probability                            |
+-----+---------------------------------------------+----------+---------------------------------------+
|0.0  |[60.0,44.0,3.5,1.0,1.0,0.0,0.0,1.0,1.0,0.0]  |1.0       |[0.4891504322809889,0.510849567719011] |
|0.0  |[70.0,41.0,4.875,1.0,1.0,0.0,0.0,1.0,1.0,0.0]|1.0       |[0.36110233222043703,0.638897667779563]|
|0.0  |[80.0,44.0,3.5,1.0,1.0,0.0,0.0,0.0,1.0,0.0]  |1.0       |[0.48038545519456904,0.519614544805431]|
|0.0  |[50.0,36.0,4.875,1.0,0.0,0.0,1.0,0.0,1.0,0.0]|1.0       |[0.3362504244334831,0.663749575566517] |
|0.0  |[80.0,45.0,5.5,1.0,0.0,0.0,1.0,0.0,1.0,0.0]  |1.0       |[0.333709031200867,0.6662909687991331] |
+-----+---------------------------------------------+----------+---------------------------------------+
only showing top 5 rows



In [23]:
print(modelRF_mix.stages)

[Bucketizer_f02e81c254ea, StringIndexerModel: uid=StringIndexer_a7255a325d85, handleInvalid=error, OneHotEncoderModel: uid=OneHotEncoder_18197dfddee4, dropLast=true, handleInvalid=error, StringIndexerModel: uid=StringIndexer_375bb71f9de8, handleInvalid=error, OneHotEncoderModel: uid=OneHotEncoder_baa3bd06864b, dropLast=true, handleInvalid=error, StringIndexerModel: uid=StringIndexer_e9a4962030cc, handleInvalid=error, OneHotEncoderModel: uid=OneHotEncoder_122296066df2, dropLast=true, handleInvalid=error, VectorAssembler_6ea02726aebb, RandomForestClassificationModel: uid=RandomForestClassifier_be83f853d175, numTrees=20, numClasses=2, numFeatures=10]


In [24]:
modelRF_mix.stages[-1].featureImportances

SparseVector(10, {0: 0.0185, 1: 0.1358, 2: 0.6172, 3: 0.1666, 4: 0.003, 5: 0.0, 6: 0.0074, 7: 0.0468, 8: 0.0005, 9: 0.0042})
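The importance vector is indexed by position in `features_OHE`, so reading it requires expanding each assembler input into its slots. The sketch below does that expansion in plain Python. The slot widths are an assumption based on the size-2 one-hot vectors and `numFeatures=10` shown in this notebook, and `expand_slots` is a hypothetical helper.

```python
# Assembler inputs in order, with the number of slots each contributes.
ASSEMBLER_INPUTS = [
    ("Original_Combined_Loan-to-Value_CLTV", 1),
    ("Original_Debt-to-Income_DTI_Ratio", 1),
    ("Original_Interest_Rate", 1),
    ("Credit_Score_Groups", 1),
    ("First_Time_Homebuyer_Flag_Vec", 2),
    ("Loan_Purpose_Vec", 2),
    ("Occupancy_Status_Vec", 2),
]

# Importances copied from the SparseVector printed above.
IMPORTANCES = {0: 0.0185, 1: 0.1358, 2: 0.6172, 3: 0.1666, 4: 0.003,
               5: 0.0, 6: 0.0074, 7: 0.0468, 8: 0.0005, 9: 0.0042}


def expand_slots(inputs):
    """Turn (column, width) pairs into one name per vector slot."""
    names = []
    for col, width in inputs:
        if width == 1:
            names.append(col)
        else:
            names.extend(f"{col}[{i}]" for i in range(width))
    return names


slot_names = expand_slots(ASSEMBLER_INPUTS)
by_name = {slot_names[i]: w for i, w in IMPORTANCES.items()}

for name, weight in sorted(by_name.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:40s} {weight:.4f}")
```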

In [32]:
modelRF_mix.stages[-1].explainParams()

"bootstrap: Whether bootstrap samples are used when building trees. (default: True)\ncacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. Users can set how often should the cache be checkpointed or disable it by setting checkpointInterval. (default: False)\ncheckpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations. Note: this setting will be ignored if the checkpoint directory is not set in the SparkContext. (default: 10)\nfeatureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the fe

In [44]:
modelRF_mix.stages[-1].featureSubsetStrategy

Param(parent='RandomForestClassifier_be83f853d175', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'")

In [43]:
modelRF_mix.stages[-1].toDebugString

'RandomForestClassificationModel: uid=RandomForestClassifier_be83f853d175, numTrees=20, numClasses=2, numFeatures=10\n  Tree 0 (weight 1.0):\n    If (feature 1 <= 34.5)\n     If (feature 7 in {1.0})\n      If (feature 2 <= 3.609999895095825)\n       Predict: 0.0\n      Else (feature 2 > 3.609999895095825)\n       If (feature 2 <= 3.9830000400543213)\n        Predict: 0.0\n       Else (feature 2 > 3.9830000400543213)\n        If (feature 3 in {4.0})\n         Predict: 0.0\n        Else (feature 3 not in {4.0})\n         Predict: 1.0\n     Else (feature 7 not in {1.0})\n      If (feature 9 in {1.0})\n       Predict: 1.0\n      Else (feature 9 not in {1.0})\n       If (feature 2 <= 3.609999895095825)\n        Predict: 0.0\n       Else (feature 2 > 3.609999895095825)\n        If (feature 3 in {1.0,3.0,4.0})\n         Predict: 0.0\n        Else (feature 3 not in {1.0,3.0,4.0})\n         Predict: 1.0\n    Else (feature 1 > 34.5)\n     If (feature 6 in {0.0})\n      If (feature 0 <= 69.5)\n  

In [117]:
# scaling & OHE
m1_start = time.time()
modelRF_full = pipelineRF_full.fit(train)
m1_end = time.time()
m1_time = m1_end-m1_start
print('Total time for training was ' + str(m1_time))

Total time for training was 125.89264965057373


In [118]:
predRF_full = modelRF_full.transform(test)

In [119]:
#Best AUROC
print('Best Test AUROC: ', evalRF.evaluate(predRF_full))

Best Test AUROC:  0.70709201727127


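Scaling the numerical features did not meaningfully change test AUROC (0.7071 with or without scaling). Its recorded training time (126 s) was also no higher than that of the other two pipelines (129 s and 304 s), although the cell numbers suggest these runs came from different sessions, so the timings are only roughly comparable.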

### Adjust Features

Interest Rate was the most important feature. Because the underwriter sets the rate based on the borrower's creditworthiness, it may act as a proxy for other risk factors, so we refit the model without it.

In [147]:
numericCols_noInt = ["Original_Combined_Loan-to-Value_CLTV", "Original_Debt-to-Income_DTI_Ratio"]

# no scaling, with OHE, no Interest
assemblerInputs_noInt = numericCols_noInt + categoricalCols
assembler_noInt = VectorAssembler(inputCols=assemblerInputs_noInt, outputCol="features_noInt")

rf_noInt = RandomForestClassifier(labelCol='label', featuresCol="features_noInt", maxBins=10, seed=seed)

# no scaling, with OHE, no Int
pipelineRF_noInt = Pipeline(stages=[bucketizer, stringIndexer_HF, encoder_HF, stringIndexer_LP,encoder_LP, stringIndexer_OS,encoder_OS, assembler_noInt, rf_noInt])


In [148]:
# no scaling, with OHE, no Int
m1_start = time.time()
modelRF_noInt = pipelineRF_noInt.fit(train)
m1_end = time.time()
m1_time = m1_end-m1_start
print('Total time for training was ' + str(m1_time))

Total time for training was 106.7457287311554


In [149]:
predRF_noInt = modelRF_noInt.transform(test)

In [208]:
predRF_noInt.summary

<bound method DataFrame.summary of DataFrame[Current_Loan_Delinquency_Status: string, Current_Loan_Delinquency_Status_Cat: string, Credit_Score: int, First_Time_Homebuyer_Flag: string, Occupancy_Status: string, Original_Combined_Loan-to-Value_CLTV: int, Original_Debt-to-Income_DTI_Ratio: int, Original_Interest_Rate: float, Loan_Purpose: string, Original_Loan_Term: int, Number_of_Borrowers: string, Loan_Age: int, Remaining_Months_to_Legal_Maturity: int, Modification_Flag: string, Change_of_Interest_Rate: float, label: double, Credit_Score_Groups: double, First_Time_Homebuyer_Flag_Index: double, First_Time_Homebuyer_Flag_Vec: vector, Loan_Purpose_Index: double, Loan_Purpose_Vec: vector, Occupancy_Status_Index: double, Occupancy_Status_Vec: vector, features_noInt: vector, rawPrediction: vector, probability: vector, prediction: double]>

In [216]:
predRF_noInt.select('Original_Combined_Loan-to-Value_CLTV', 'Original_Debt-to-Income_DTI_Ratio', 'Credit_Score_Groups','features_noInt').show(5, truncate=False)

+------------------------------------+---------------------------------+-------------------+---------------------------------------+
|Original_Combined_Loan-to-Value_CLTV|Original_Debt-to-Income_DTI_Ratio|Credit_Score_Groups|features_noInt                         |
+------------------------------------+---------------------------------+-------------------+---------------------------------------+
|60                                  |44                               |1.0                |[60.0,44.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0]|
|70                                  |41                               |1.0                |[70.0,41.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0]|
|80                                  |44                               |1.0                |[80.0,44.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0]|
|50                                  |36                               |1.0                |[50.0,36.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0]|
|80                                  |45                             

In [215]:
predRF_noInt.select('First_Time_Homebuyer_Flag_Vec','Loan_Purpose_Vec','Occupancy_Status_Vec','features_noInt').show(5, truncate=False)

+-----------------------------+----------------+--------------------+---------------------------------------+
|First_Time_Homebuyer_Flag_Vec|Loan_Purpose_Vec|Occupancy_Status_Vec|features_noInt                         |
+-----------------------------+----------------+--------------------+---------------------------------------+
|(2,[0],[1.0])                |(2,[1],[1.0])   |(2,[0],[1.0])       |[60.0,44.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0]|
|(2,[0],[1.0])                |(2,[1],[1.0])   |(2,[0],[1.0])       |[70.0,41.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0]|
|(2,[0],[1.0])                |(2,[],[])       |(2,[0],[1.0])       |[80.0,44.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0]|
|(2,[],[])                    |(2,[0],[1.0])   |(2,[0],[1.0])       |[50.0,36.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0]|
|(2,[],[])                    |(2,[0],[1.0])   |(2,[0],[1.0])       |[80.0,45.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0]|
+-----------------------------+----------------+--------------------+---------------------------------------+
only showi

In [150]:
#Best AUROC
print('Best Test AUROC: ', evalRF.evaluate(predRF_noInt))

Best Test AUROC:  0.6598292668181213


In [152]:
modelRF_noInt.stages[-1].featureImportances

SparseVector(9, {0: 0.0677, 1: 0.3388, 2: 0.4142, 3: 0.0024, 4: 0.0002, 5: 0.0312, 6: 0.1191, 7: 0.0056, 8: 0.0209})

Model performance declines when Interest Rate is removed as a feature: test AUROC drops from 0.707 to 0.660.

### Train using Cross-Validation

Train and test using the mix pipeline (one-hot encoding, no scaling), which had essentially the same test AUROC as the full pipeline without needing the scaling stage.

In [14]:
# Cross-validation - RF
gridRF_simple = (ParamGridBuilder().addGrid(rf_simple.maxDepth, [2, 4, 6]). addGrid(rf_simple.numTrees, [10, 25, 100]).build())

gridRF_mix = (ParamGridBuilder().addGrid(rf_mix.maxDepth, [2, 4, 6]). addGrid(rf_mix.numTrees, [10, 25, 100]).build())

gridRF_full = (ParamGridBuilder().addGrid(rf_full.maxDepth, [2, 4, 6]). addGrid(rf_full.numTrees, [10, 25, 100]).build())

In [15]:
#cvRF = CrossValidator(estimator=rf, evaluator=evalRF, estimatorParamMaps=gridRF, numFolds=10, seed=seed).setParallelism(4)

cvRF_simple = CrossValidator(estimator=pipelineRF_simple, evaluator=evalRF, estimatorParamMaps=gridRF_simple, numFolds=10, seed=seed).setParallelism(4)

cvRF_mix = CrossValidator(estimator=pipelineRF_mix, evaluator=evalRF, estimatorParamMaps=gridRF_mix, numFolds=10, seed=seed).setParallelism(4)

cvRF_full = CrossValidator(estimator=pipelineRF_full, evaluator=evalRF, estimatorParamMaps=gridRF_full, numFolds=10, seed=seed).setParallelism(4)
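It helps to count how many models this grid asks Spark to fit before launching it. The sketch below enumerates the grid with `itertools.product`; the fit count assumes `CrossValidator` trains every parameter combination on every fold and then refits the best combination once on the full training set.

```python
from itertools import product

# Same grid values and fold count as the cells above.
MAX_DEPTH = [2, 4, 6]
NUM_TREES = [10, 25, 100]
NUM_FOLDS = 10

grid = [{"maxDepth": d, "numTrees": t} for d, t in product(MAX_DEPTH, NUM_TREES)]

fits_in_cv = len(grid) * NUM_FOLDS  # one fit per combination per fold
total_fits = fits_in_cv + 1         # plus the final refit of the best combination

print("parameter combinations:", len(grid))
print("fits during cross-validation:", fits_in_cv)
print("total fits:", total_fits)
```

Each of those fits runs the whole pipeline, since the estimator passed to `CrossValidator` is the pipeline rather than the bare classifier.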

In [16]:
m1_start = time.time()
cv_modelRF = cvRF_mix.fit(train)
m1_end = time.time()
m1_time = m1_end-m1_start
print('Total time for training with 4 parallel processors was ' + str(m1_time))

Total time for training with 4 parallel processors was 2590.406844139099


In [17]:
bestModel = cv_modelRF.bestModel

In [18]:
bestModel.stages

[Bucketizer_fe47c4c2eb51,
 StringIndexerModel: uid=StringIndexer_02925e40e79c, handleInvalid=error,
 OneHotEncoderModel: uid=OneHotEncoder_7ec0e1f78364, dropLast=true, handleInvalid=error,
 StringIndexerModel: uid=StringIndexer_31f24c248372, handleInvalid=error,
 OneHotEncoderModel: uid=OneHotEncoder_33f778af7eb1, dropLast=true, handleInvalid=error,
 StringIndexerModel: uid=StringIndexer_68a09efd0820, handleInvalid=error,
 OneHotEncoderModel: uid=OneHotEncoder_69cd0133dce7, dropLast=true, handleInvalid=error,
 VectorAssembler_cfee5aa9d1fd,
 RandomForestClassificationModel: uid=RandomForestClassifier_a8f61f3490db, numTrees=100, numClasses=2, numFeatures=10]

In [21]:
dir(bestModel.stages[1])

['__abstractmethods__',
 '__class__',
 '__del__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slotnames__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_call_java',
 '_copyValues',
 '_copy_params',
 '_create_from_java_class',
 '_create_params_from_java',
 '_defaultParamMap',
 '_dummy',
 '_empty_java_param_map',
 '_from_java',
 '_java_obj',
 '_make_java_param_pair',
 '_new_java_array',
 '_new_java_obj',
 '_paramMap',
 '_params',
 '_randomUID',
 '_resetUid',
 '_resolveParam',
 '_set',
 '_setDefault',
 '_shouldOwn',
 '_testOwnParam',
 '_to_java',
 '_transfer_param_map_from_java',
 '_transfer_param_map_to_java',
 '_transfer_params_from_java',
 '_transfer_params_to_java',
 '_transform',
 'clear',
 'copy',
 '

In [19]:
predRF_best = bestModel.transform(test)

In [20]:
predRF_best.summary

<bound method DataFrame.summary of DataFrame[Current_Loan_Delinquency_Status: string, Current_Loan_Delinquency_Status_Cat: string, Credit_Score: int, First_Time_Homebuyer_Flag: string, Occupancy_Status: string, Original_Combined_Loan-to-Value_CLTV: int, Original_Debt-to-Income_DTI_Ratio: int, Original_Interest_Rate: float, Loan_Purpose: string, Original_Loan_Term: int, Number_of_Borrowers: string, Loan_Age: int, Actual_Loss_Calculation: int, Delinquent_Accrued_Interest: int, Remaining_Months_to_Legal_Maturity: int, Modification_Flag: string, Change_of_Interest_Rate: float, label: double, Credit_Score_Groups: double, First_Time_Homebuyer_Flag_Index: double, First_Time_Homebuyer_Flag_Vec: vector, Loan_Purpose_Index: double, Loan_Purpose_Vec: vector, Occupancy_Status_Index: double, Occupancy_Status_Vec: vector, features_OHE: vector, rawPrediction: vector, probability: vector, prediction: double]>

In [40]:
predRF_best.select('label','features_OHE','prediction', 'probability').show(5, truncate=False)

+-----+---------------------------------------------+----------+----------------------------------------+
|label|features_OHE                                 |prediction|probability                             |
+-----+---------------------------------------------+----------+----------------------------------------+
|0.0  |[60.0,44.0,3.5,1.0,1.0,0.0,0.0,1.0,1.0,0.0]  |1.0       |[0.4232928083845522,0.5767071916154478] |
|0.0  |[70.0,41.0,4.875,1.0,1.0,0.0,0.0,1.0,1.0,0.0]|1.0       |[0.32542674356044254,0.6745732564395575]|
|0.0  |[80.0,44.0,3.5,1.0,1.0,0.0,0.0,0.0,1.0,0.0]  |1.0       |[0.43647588922294056,0.5635241107770596]|
|0.0  |[50.0,36.0,4.875,1.0,0.0,0.0,1.0,0.0,1.0,0.0]|1.0       |[0.3276871473526159,0.672312852647384]  |
|0.0  |[80.0,45.0,5.5,1.0,0.0,0.0,1.0,0.0,1.0,0.0]  |1.0       |[0.31759154669927375,0.6824084533007263]|
+-----+---------------------------------------------+----------+----------------------------------------+
only showing top 5 rows



Many of the predicted probabilities are in the 40-60% range, which does not provide convincing certainty.

In [52]:
#Best AUROC
print('Best Test AUROC: ', evalRF.evaluate(predRF_best))

Best Test AUROC:  0.7101361382126586


In [53]:
# These summary metrics describe the training data, not the test set
print("Accuracy: " + str(bestModel.stages[-1].summary.accuracy))
print("AUROC: " + str(bestModel.stages[-1].summary.areaUnderROC))

Accuracy: 0.6581011631052205
AUROC: 0.7095038867943781


In [54]:
predRF_best.select('First_Time_Homebuyer_Flag_Vec','Loan_Purpose_Vec','Occupancy_Status_Vec','features_OHE').show(5, truncate=False)

+-----------------------------+----------------+--------------------+---------------------------------------------+
|First_Time_Homebuyer_Flag_Vec|Loan_Purpose_Vec|Occupancy_Status_Vec|features_OHE                                 |
+-----------------------------+----------------+--------------------+---------------------------------------------+
|(2,[0],[1.0])                |(2,[1],[1.0])   |(2,[0],[1.0])       |[60.0,44.0,3.5,1.0,1.0,0.0,0.0,1.0,1.0,0.0]  |
|(2,[0],[1.0])                |(2,[1],[1.0])   |(2,[0],[1.0])       |[70.0,41.0,4.875,1.0,1.0,0.0,0.0,1.0,1.0,0.0]|
|(2,[0],[1.0])                |(2,[],[])       |(2,[0],[1.0])       |[80.0,44.0,3.5,1.0,1.0,0.0,0.0,0.0,1.0,0.0]  |
|(2,[],[])                    |(2,[0],[1.0])   |(2,[0],[1.0])       |[50.0,36.0,4.875,1.0,0.0,0.0,1.0,0.0,1.0,0.0]|
|(2,[],[])                    |(2,[0],[1.0])   |(2,[0],[1.0])       |[80.0,45.0,5.5,1.0,0.0,0.0,1.0,0.0,1.0,0.0]  |
+-----------------------------+----------------+--------------------+---

In [55]:
print(bestModel.stages[-1].trees)
print(bestModel.stages[-1].totalNumNodes)

[DecisionTreeClassificationModel: uid=dtc_b5eb55148744, depth=6, numNodes=51, numClasses=2, numFeatures=10, DecisionTreeClassificationModel: uid=dtc_19b6718abf7b, depth=6, numNodes=65, numClasses=2, numFeatures=10, DecisionTreeClassificationModel: uid=dtc_a0232e642e95, depth=6, numNodes=59, numClasses=2, numFeatures=10, DecisionTreeClassificationModel: uid=dtc_9305d35995c8, depth=6, numNodes=53, numClasses=2, numFeatures=10, DecisionTreeClassificationModel: uid=dtc_7dcbc6491764, depth=6, numNodes=47, numClasses=2, numFeatures=10, DecisionTreeClassificationModel: uid=dtc_f2339e2b4121, depth=6, numNodes=59, numClasses=2, numFeatures=10, DecisionTreeClassificationModel: uid=dtc_664f18d9fea0, depth=6, numNodes=57, numClasses=2, numFeatures=10, DecisionTreeClassificationModel: uid=dtc_abef1ed3c4d3, depth=6, numNodes=61, numClasses=2, numFeatures=10, DecisionTreeClassificationModel: uid=dtc_5cf9b19c232c, depth=6, numNodes=67, numClasses=2, numFeatures=10, DecisionTreeClassificationModel: uid

In [56]:
bestModel.stages[-1].featureSubsetStrategy

Param(parent='RandomForestClassifier_f7ce0bdb6967', name='featureSubsetStrategy', doc="The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto'")

In [57]:
bestModel.stages[-1].toDebugString

'RandomForestClassificationModel: uid=RandomForestClassifier_f7ce0bdb6967, numTrees=100, numClasses=2, numFeatures=10\n  Tree 0 (weight 1.0):\n    If (feature 1 <= 34.5)\n     If (feature 2 <= 3.5475000143051147)\n      If (feature 2 <= 3.127500057220459)\n       Predict: 0.0\n      Else (feature 2 > 3.127500057220459)\n       If (feature 6 in {1.0})\n        If (feature 0 <= 94.5)\n         Predict: 0.0\n        Else (feature 0 > 94.5)\n         If (feature 0 <= 99.5)\n          Predict: 0.0\n         Else (feature 0 > 99.5)\n          Predict: 1.0\n       Else (feature 6 not in {1.0})\n        If (feature 2 <= 3.374500036239624)\n         If (feature 3 in {1.0})\n          Predict: 1.0\n         Else (feature 3 not in {1.0})\n          Predict: 0.0\n        Else (feature 2 > 3.374500036239624)\n         Predict: 0.0\n     Else (feature 2 > 3.5475000143051147)\n      If (feature 2 <= 4.387500047683716)\n       If (feature 3 in {4.0})\n        Predict: 0.0\n       Else (feature 3 not i

In [58]:
bestModel.stages[-1].featureImportances

SparseVector(10, {0: 0.0126, 1: 0.1406, 2: 0.651, 3: 0.1574, 4: 0.0009, 5: 0.0005, 6: 0.0049, 7: 0.0287, 8: 0.0006, 9: 0.0028})

#### Top 5 Feature Importance (cumulative):
 1. Interest Rate: 0.651
 2. Credit Score Group: 0.1574 (0.8084)
 3. Debt-to-Income: 0.1406 (0.949)
 4. Loan Purpose\[1]: 0.0287 (0.9777)
 5. Loan-to-Value: 0.0126 (0.9903)
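The ranking and running totals can be recomputed from the `featureImportances` vector printed above. The sketch below is plain Python with the values copied from that output; the short display names in `SLOT_NAMES` are labels chosen for this example.

```python
from itertools import accumulate

# Importances copied from the bestModel SparseVector printed above.
IMPORTANCES = {0: 0.0126, 1: 0.1406, 2: 0.651, 3: 0.1574, 4: 0.0009,
               5: 0.0005, 6: 0.0049, 7: 0.0287, 8: 0.0006, 9: 0.0028}

# Display names for the slots that appear in the top five (illustrative labels).
SLOT_NAMES = {0: "Loan-to-Value", 1: "Debt-to-Income", 2: "Interest Rate",
              3: "Credit Score Group", 7: "Loan Purpose[1]"}

# Sort slots by importance, keep the top five, and accumulate their weights.
ranked = sorted(IMPORTANCES.items(), key=lambda kv: kv[1], reverse=True)[:5]
cumulative = list(accumulate(weight for _, weight in ranked))

for (slot, weight), running in zip(ranked, cumulative):
    name = SLOT_NAMES.get(slot, f"slot {slot}")
    print(f"{name}: {weight:.4f} ({running:.4f})")
```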

In [167]:
#save or load F_measure
path_df = '/project/ds5559/Group_6_Housing/DS-5110-Final-Project/F_measure.parquet'

In [168]:
F_measure = bestModel.stages[-1].summary.fMeasureByThreshold

In [170]:
maxF_measure = F_measure.groupBy().max('F-Measure').select('max(F-Measure)').head()
best_thres = F_measure.where(F_measure["F-Measure"] == maxF_measure['max(F-Measure)']).select('threshold').head()["threshold"]
print("Best F-Measure: " + str(maxF_measure[0]))
print("Best Threshold: " + str(best_thres))

Best F-Measure: 0.6975419146151758
Best Threshold: 0.3354955139729926
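The threshold search above takes the row of `fMeasureByThreshold` with the largest F-measure. The same selection, written in plain Python on made-up precision and recall values (the numbers below are illustrative, not taken from the model), might look like this; `f_measure` is a hypothetical helper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; returns 0.0 when precision and recall are both zero."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (threshold, precision, recall) rows; illustrative values only.
curve = [
    (0.30, 0.55, 0.95),
    (0.40, 0.62, 0.80),
    (0.50, 0.66, 0.65),
    (0.60, 0.72, 0.40),
]

# Score every threshold and keep the one with the highest F-measure.
scored = [(t, f_measure(p, r)) for t, p, r in curve]
best_threshold, best_f = max(scored, key=lambda tf: tf[1])

print(f"best threshold: {best_threshold}, F-measure: {best_f:.4f}")
```

Since `summary` is computed on the training data, which was downsampled to roughly balanced classes, a threshold picked this way may behave differently on the unbalanced test set.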


In [172]:
# Evaluate accuracy and confusion matrix on the test data

from pyspark.mllib.evaluation import MulticlassMetrics

temp = predRF_best.select("probability", "label")

# Predict 1.0 when the delinquency probability reaches the chosen threshold
def round_thres(num):
    if num >= best_thres:
        return 1.0
    else:
        return 0.0

predsAndLabels = temp.rdd.map(lambda p: (round_thres(float(p.probability[1])),p.label))

metrics = MulticlassMetrics(predsAndLabels)

print('model accuracy: ' + str(metrics.accuracy))

# Precision for label 1.0 (delinquent)
print('model precision: ' + str(metrics.precision(1.0)))

# F-score for label 0.0 (not delinquent) with beta = 1.0
print('model F-Score: ' + str(metrics.fMeasure(0.0, 1.0)))

# Rows are actual labels, columns are predicted labels
con_matrix = metrics.confusionMatrix().toArray()
print(con_matrix)

# Negative predictive value: NPV = TN / (TN + FN)
NPV = con_matrix[0][0]/(con_matrix[0][0]+con_matrix[1][0])
print("Negative Predictive Value: " + str(NPV))

model accuracy: 0.40136770669363164
model precision: 0.07970169811081115
model F-Score: 0.5391504478259792
[[ 772591. 1304267.]
 [  16508.  112955.]]
Negative Predictive Value: 0.979079938005244
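The reported metrics can all be derived from the four cells of the confusion matrix. The sketch below recomputes them in plain Python from the matrix printed above, reading rows as actual labels and columns as predicted labels; `binary_metrics` is a hypothetical helper, and recall and specificity are added here for context.

```python
# Confusion matrix copied from the output above: rows = actual, columns = predicted.
CONFUSION = [[772591.0, 1304267.0],
             [16508.0, 112955.0]]


def binary_metrics(cm):
    """Derive common binary metrics, treating label 1.0 as the positive class."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "npv": tn / (tn + fn),
        "specificity": tn / (tn + fp),
    }


m = binary_metrics(CONFUSION)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

At this threshold the model catches most delinquent loans (recall of roughly 0.87) but also flags most non-delinquent loans (specificity of roughly 0.37), which is why accuracy and precision are low while NPV is high.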


In [184]:
print("Precision by Label: " + str(bestModel.stages[-1].summary.precisionByLabel))
print("Weighted Precision: " + str(bestModel.stages[-1].summary.weightedPrecision))

Precision by Label: [0.66014629671734, 0.6569922050609955]
Weighted Precision: 0.6585670936244447


In [185]:
print("Recall by Label: " + str(bestModel.stages[-1].summary.recallByLabel))
print("Weighted Recall: " +  str(bestModel.stages[-1].summary.weightedRecall))

Recall by Label: [0.6516247862123404, 0.665449830594809]
Weighted Recall: 0.6585467641467146


In [186]:
print("True Positive Rate by Label: " + str(bestModel.stages[-1].summary.truePositiveRateByLabel))
print("Weighted True Positive Rate: " + str(bestModel.stages[-1].summary.weightedTruePositiveRate))

True Positive Rate by Label: [0.6516247862123404, 0.665449830594809]
Weighted True Positive Rate: 0.6585467641467146


In [187]:
print("False Positive Rate by Label: " + str(bestModel.stages[-1].summary.falsePositiveRateByLabel))
print("Weighted False Positive Rate: " +  str(bestModel.stages[-1].summary.weightedFalsePositiveRate))

False Positive Rate by Label: [0.334550169405191, 0.3483752137876595]
Weighted False Positive Rate: 0.34147214733956516
