# Mortgage Cross-Validation with GPU-Accelerated XGBoost

In this notebook, we show how to leverage GPUs to accelerate XGBoost cross-validation on the mortgage dataset and find the best model for a given grid of parameters.

Note: CrossValidation cannot be run with the latest cudf v21.06.1 because of some API changes. We plan to release a new XGBoost jar with the fix soon. Until then, this notebook uses cudf v0.19.2 and rapids-4-spark v0.5.0.

## Import classes
First, we need to load some common classes that both the GPU version and the CPU version use:

In [1]:
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassificationModel, XGBoostClassifier}

import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.sql.types.{FloatType, IntegerType, StructField, StructType}

What is new to xgboost-spark users is **rapids.CrossValidator**:

In [2]:
import ml.dmlc.xgboost4j.scala.spark.rapids.CrossValidator

## Set dataset path

In [3]:
val trainParquetPath="/data/mortgage/parquet/train"
val evalParquetPath="/data/mortgage/parquet/eval"

trainParquetPath = /data/mortgage/parquet/train
evalParquetPath = /data/mortgage/parquet/eval


/data/mortgage/parquet/eval

## Set the schema of the dataset

In [4]:
val labelColName = "delinquency_12"
val schema = StructType(List(
    StructField("orig_channel", FloatType),
    StructField("first_home_buyer", FloatType),
    StructField("loan_purpose", FloatType),
    StructField("property_type", FloatType),
    StructField("occupancy_status", FloatType),
    StructField("property_state", FloatType),
    StructField("product_type", FloatType),
    StructField("relocation_mortgage_indicator", FloatType),
    StructField("seller_name", FloatType),
    StructField("mod_flag", FloatType),
    StructField("orig_interest_rate", FloatType),
    StructField("orig_upb", IntegerType),
    StructField("orig_loan_term", IntegerType),
    StructField("orig_ltv", FloatType),
    StructField("orig_cltv", FloatType),
    StructField("num_borrowers", FloatType),
    StructField("dti", FloatType),
    StructField("borrower_credit_score", FloatType),
    StructField("num_units", IntegerType),
    StructField("zip", IntegerType),
    StructField("mortgage_insurance_percent", FloatType),
    StructField("current_loan_delinquency_status", IntegerType),
    StructField("current_actual_upb", FloatType),
    StructField("interest_rate", FloatType),
    StructField("loan_age", FloatType),
    StructField("msa", FloatType),
    StructField("non_interest_bearing_upb", FloatType),
    StructField(labelColName, IntegerType)))

labelColName = delinquency_12
schema = StructType(StructField(orig_channel,FloatType,true), StructField(first_home_buyer,FloatType,true), StructField(loan_purpose,FloatType,true), StructField(property_type,FloatType,true), StructField(occupancy_status,FloatType,true), StructField(property_state,FloatType,true), StructField(product_type,FloatType,true), StructField(relocation_mortgage_indicator,FloatType,true), StructField(seller_name,FloatType,true), StructField(mod_flag,FloatType,true), StructField(orig_interest_rate,FloatType,true), StructField(orig_upb,IntegerType,true), StructField(orig_loan_term,IntegerType,true), StructField(orig_ltv,FloatType,true), StructField(orig_cltv,FloatType,true), StructField(num_borrowers,FloatType,true), Str...



## Create a new spark session and load data
We must create a new Spark session to run all subsequent Spark operations. It will also be used to initialize the `GpuDataReader`, which is a data reader powered by the GPU.

NOTE: In this notebook, the dependency jars were uploaded when the Toree kernel was installed. If they are not uploaded at installation time, they can also be added from within the notebook with the [%AddJar magic](https://toree.incubator.apache.org/docs/current/user/faq/). There is one restriction, however: a jar added with `%AddJar` is only available if `%AddJar` is called after a new Spark session has been created, so it must be used as follows:

```scala
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("mortgage-gpu-cv").getOrCreate()
%AddJar file:/data/libs/cudf-0.19.2-cuda10-1.jar
%AddJar file:/data/libs/xgboost4j_3.0-1.3.0-0.1.0.jar
%AddJar file:/data/libs/xgboost4j-spark_3.0-1.3.0-0.1.0.jar
%AddJar file:/data/libs/rapids-4-spark_2.12-0.5.0.jar
// ...
```

In [5]:
val spark = SparkSession.builder().appName("mortgage-gpu-cv").getOrCreate()
val trainDs = spark.read.parquet(trainParquetPath)

spark = org.apache.spark.sql.SparkSession@4ef2d54c
trainDs = [orig_channel: double, first_home_buyer: double ... 26 more fields]


[orig_channel: double, first_home_buyer: double ... 26 more fields]

## Select the features to train on

In [6]:
val featureNames = schema.filter(_.name != labelColName).map(_.name)

featureNames = List(orig_channel, first_home_buyer, loan_purpose, property_type, occupancy_status, property_state, product_type, relocation_mortgage_indicator, seller_name, mod_flag, orig_interest_rate, orig_upb, orig_loan_term, orig_ltv, orig_cltv, num_borrowers, dti, borrower_credit_score, num_units, zip, mortgage_insurance_percent, current_loan_delinquency_status, current_actual_upb, interest_rate, loan_age, msa, non_interest_bearing_upb)



In [7]:
val classifierParam = Map(
    "eta" -> 0.1,
    "gamma" -> 0.1,
    "missing" -> 0.0,
    "max_depth" -> 10,
    "max_leaves" -> 256,
    "grow_policy" -> "depthwise",
    "objective" -> "binary:logistic",
    "min_child_weight" -> 30,
    "lambda" -> 1,
    "scale_pos_weight" -> 2,
    "subsample" -> 1,
    "nthread" -> 1,
    "num_round" -> 100,
    "tree_method" -> "gpu_hist")

classifierParam = Map(min_child_weight -> 30, grow_policy -> depthwise, scale_pos_weight -> 2, subsample -> 1, lambda -> 1, max_depth -> 10, objective -> binary:logistic, num_round -> 100, missing -> 0.0, tree_method -> gpu_hist, eta -> 0.1, max_leaves -> 256, gamma -> 0.1, nthread -> 1)



## Construct CrossValidator

In [8]:
val classifier = new XGBoostClassifier(classifierParam)
    .setLabelCol(labelColName)
    .setFeaturesCols(featureNames)
val paramGrid = new ParamGridBuilder()
    .addGrid(classifier.maxDepth, Array(3, 10))
    .addGrid(classifier.eta, Array(0.2, 0.6))
    .build()
val evaluator = new MulticlassClassificationEvaluator().setLabelCol(labelColName)
val cv = new CrossValidator()
    .setEstimator(classifier)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(3)

classifier = xgbc_9b881658ae0b
paramGrid = 
evaluator = MulticlassClassificationEvaluator: uid=mcEval_5057139582af, metricName=f1, metricLabel=0.0, beta=1.0, eps=1.0E-15
cv = cv_510a3b5b16bc


Array({
	xgbc_9b881658ae0b-eta: 0.2,
	xgbc_9b881658ae0b-maxDepth: 3
}, {
	xgbc_9b881658ae0b-eta: 0.2,
	xgbc_9b881658ae0b-maxDepth: 10
}, {
	xgbc_9b881658ae0b-eta: 0.6,
	xgbc_9b881658ae0b-maxDepth: 3
}, {
	xgbc_9b881658ae0b-eta: 0.6,
	xgbc_9b881658ae0b-maxDepth: 10
})


cv_510a3b5b16bc
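
The parameter grid above is the Cartesian product of the `maxDepth` and `eta` candidates, which is why the output lists four parameter maps. The following plain-Scala sketch (no Spark required) reproduces that expansion and estimates the number of training runs. The object name `ParamGridSketch` is hypothetical. The fit count assumes the same behavior as Spark's standard `CrossValidator` (one fit per candidate and fold, plus a final refit of the best candidate on the full training set), and that the `rapids.CrossValidator` follows it.

```scala
object ParamGridSketch {
  // Candidate values copied from the ParamGridBuilder call above.
  val maxDepths: Seq[Int] = Seq(3, 10)
  val etas: Seq[Double] = Seq(0.2, 0.6)
  val numFolds: Int = 3

  // Cartesian product of the two grids: one entry per candidate setting.
  val combos: Seq[(Int, Double)] =
    for (depth <- maxDepths; eta <- etas) yield (depth, eta)

  // Assumption: one fit per (candidate, fold) pair, plus one final refit
  // of the best candidate on the full training set.
  val totalFits: Int = combos.size * numFolds + 1

  def main(args: Array[String]): Unit = {
    combos.foreach { case (d, e) => println(s"maxDepth=$d, eta=$e") }
    println(s"total fits: $totalFits")
  }
}
```

Because the cost grows with the product of all grid sizes and the fold count, adding a third parameter with two values would roughly double the training time.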

## Train with CrossValidator

In [9]:
val model = cv.fit(trainDs).bestModel.asInstanceOf[XGBoostClassificationModel]

Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.93, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.93, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.93, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.93, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.93, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.93, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.93, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.93, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}
Tracker started, with env={DMLC_NUM_SERV

model = xgbc_9b881658ae0b


xgbc_9b881658ae0b
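
As a rough illustration of how `bestModel` is chosen, the sketch below averages a metric over the folds for each candidate and keeps the candidate with the highest average, which is how Spark's standard `CrossValidator` selects a winner for a larger-is-better metric such as F1. The per-fold scores are invented for illustration and are not results from this notebook. `BestParamSketch` and `Candidate` are hypothetical names.

```scala
object BestParamSketch {
  final case class Candidate(maxDepth: Int, eta: Double)

  // Hypothetical per-fold F1 scores, invented purely for illustration.
  val foldMetrics: Map[Candidate, Seq[Double]] = Map(
    Candidate(3, 0.2)  -> Seq(0.90, 0.92, 0.91),
    Candidate(10, 0.2) -> Seq(0.95, 0.94, 0.96),
    Candidate(3, 0.6)  -> Seq(0.91, 0.93, 0.92),
    Candidate(10, 0.6) -> Seq(0.93, 0.92, 0.94)
  )

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // F1 is a larger-is-better metric, so keep the highest fold average.
  val best: Candidate =
    foldMetrics.maxBy { case (_, scores) => mean(scores) }._1
}
```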

## Transform with the best model trained by CrossValidator

In [10]:
val transformDs = spark.read.parquet(evalParquetPath)
val df = model.transform(transformDs).cache()
df.drop(featureNames: _*).show(5)

transformDs = [orig_channel: double, first_home_buyer: double ... 26 more fields]
df = [orig_channel: double, first_home_buyer: double ... 29 more fields]


+--------------+--------------------+--------------------+----------+
|delinquency_12|       rawPrediction|         probability|prediction|
+--------------+--------------------+--------------------+----------+
|             0|[6.45624113082885...|[0.99843177443835...|       0.0|
|             0|[8.55493927001953...|[0.99980744560889...|       0.0|
|             0|[5.98186397552490...|[0.99748223810456...|       0.0|
|             0|[9.41403198242187...|[0.99991843526368...|       0.0|
|             0|[3.82117009162902...|[0.97856726497411...|       0.0|
+--------------+--------------------+--------------------+----------+
only showing top 5 rows



[orig_channel: double, first_home_buyer: double ... 29 more fields]
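
With the `binary:logistic` objective, a probability is the logistic (sigmoid) function of a raw margin. The sketch below checks this against the first entries of `rawPrediction` and `probability` displayed in the table above. It only uses the values as they appear there (truncated by `show`), so the match is approximate, and `LogisticSketch` is a hypothetical name.

```scala
object LogisticSketch {
  // Logistic (sigmoid) function mapping a raw margin to a probability.
  def sigmoid(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))

  // First rawPrediction entries of the first two rows, as displayed above.
  val rawFirst: Seq[Double] = Seq(6.45624113082885, 8.55493927001953)

  // Expected to be close to the first probability entries of the same rows.
  val probs: Seq[Double] = rawFirst.map(sigmoid)

  def main(args: Array[String]): Unit =
    rawFirst.zip(probs).foreach { case (r, p) => println(s"raw=$r -> prob=$p") }
}
```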

In [11]:
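val evaluator = new MulticlassClassificationEvaluator().setLabelCol(labelColName)
// The evaluator's default metric is F1 (see metricName=f1 in the output),
// so the value stored in `accuracy` is an F1 score rather than plain accuracy.
val accuracy = evaluator.evaluate(df)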

evaluator = MulticlassClassificationEvaluator: uid=mcEval_f16f1f7867dc, metricName=f1, metricLabel=0.0, beta=1.0, eps=1.0E-15
accuracy = 0.9899912947173429


0.9899912947173429
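
The evaluator output shows `metricName=f1`, which in Spark's `MulticlassClassificationEvaluator` is the weighted F1 score: per-class F1 values averaged with weights equal to each class's share of the true labels. The plain-Scala sketch below, on a small invented set of (label, prediction) pairs, shows that weighted F1 and accuracy can differ. `MetricSketch` and its data are hypothetical and unrelated to the mortgage results.

```scala
object MetricSketch {
  // (label, prediction) pairs, invented for illustration only.
  val pairs: Seq[(Int, Int)] =
    Seq((0, 0), (0, 0), (0, 0), (0, 0), (1, 1), (1, 0))

  // Fraction of rows whose prediction equals the label.
  def accuracy(ps: Seq[(Int, Int)]): Double =
    ps.count { case (l, p) => l == p }.toDouble / ps.size

  // F1 for a single class, treating that class as the positive one.
  def f1For(cls: Int, ps: Seq[(Int, Int)]): Double = {
    val tp = ps.count { case (l, p) => l == cls && p == cls }.toDouble
    val fp = ps.count { case (l, p) => l != cls && p == cls }.toDouble
    val fn = ps.count { case (l, p) => l == cls && p != cls }.toDouble
    if (tp == 0.0) 0.0 else 2 * tp / (2 * tp + fp + fn)
  }

  // Weighted F1: per-class F1 weighted by each class's share of the labels.
  def weightedF1(ps: Seq[(Int, Int)]): Double =
    ps.map(_._1).distinct.map { cls =>
      val weight = ps.count(_._1 == cls).toDouble / ps.size
      weight * f1For(cls, ps)
    }.sum
}
```

To report plain accuracy instead, the evaluator can be configured with `setMetricName("accuracy")`.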

In [12]:
spark.close()