# Introduction to XGBoost Spark with GPU

The mortgage example uses an XGBoost classifier for binary classification. This notebook shows how to load the data, train an XGBoost model, and use that model to predict whether a loan will become delinquent (the `delinquency_12` label). Compared with the original XGBoost Spark code, there is only one API difference.

## Load libraries
First, load the common libraries used by both the GPU and CPU versions of XGBoost.

In [1]:
import ml.dmlc.xgboost4j.scala.spark.{XGBoostClassifier, XGBoostClassificationModel}
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

The CPU version also requires some extra libraries, such as:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.FloatType
```

## Set the dataset path

In [2]:
// You need to update them to your real paths!
val dataRoot = sys.env.getOrElse("DATA_ROOT", "/data")
val trainPath = dataRoot + "/mortgage/csv/train/"
val evalPath  = dataRoot + "/mortgage/csv/test/"
val transPath = dataRoot + "/mortgage/csv/test/"

trainPath = /data/mortgage/csv/train/
evalPath = /data/mortgage/csv/test/
transPath = /data/mortgage/csv/test/


/data/mortgage/csv/test/

## Build the schema and parameters
The mortgage data has 28 columns: 27 features and 1 label. "delinquency_12" is the label column. The schema is used later to load the data.

The next block also defines some key parameters used in the XGBoost training process.

In [3]:
val labelColName = "delinquency_12"
val schema = StructType(List(
  StructField("orig_channel", DoubleType),
  StructField("first_home_buyer", DoubleType),
  StructField("loan_purpose", DoubleType),
  StructField("property_type", DoubleType),
  StructField("occupancy_status", DoubleType),
  StructField("property_state", DoubleType),
  StructField("product_type", DoubleType),
  StructField("relocation_mortgage_indicator", DoubleType),
  StructField("seller_name", DoubleType),
  StructField("mod_flag", DoubleType),
  StructField("orig_interest_rate", DoubleType),
  StructField("orig_upb", IntegerType),
  StructField("orig_loan_term", IntegerType),
  StructField("orig_ltv", DoubleType),
  StructField("orig_cltv", DoubleType),
  StructField("num_borrowers", DoubleType),
  StructField("dti", DoubleType),
  StructField("borrower_credit_score", DoubleType),
  StructField("num_units", IntegerType),
  StructField("zip", IntegerType),
  StructField("mortgage_insurance_percent", DoubleType),
  StructField("current_loan_delinquency_status", IntegerType),
  StructField("current_actual_upb", DoubleType),
  StructField("interest_rate", DoubleType),
  StructField("loan_age", DoubleType),
  StructField("msa", DoubleType),
  StructField("non_interest_bearing_upb", DoubleType),
  StructField(labelColName, IntegerType)))

val featureNames = schema.filter(_.name != labelColName).map(_.name)

val commParamMap = Map(
  "eta" -> 0.1,
  "gamma" -> 0.1,
  "missing" -> 0.0,
  "max_depth" -> 10,
  "max_leaves" -> 256,
  "objective" -> "binary:logistic",
  "grow_policy" -> "depthwise",
  "min_child_weight" -> 30,
  "lambda" -> 1,
  "scale_pos_weight" -> 2,
  "subsample" -> 1,
  "nthread" -> 1,
  "num_round" -> 100)

labelColName = delinquency_12
schema = StructType(StructField(orig_channel,DoubleType,true), StructField(first_home_buyer,DoubleType,true), StructField(loan_purpose,DoubleType,true), StructField(property_type,DoubleType,true), StructField(occupancy_status,DoubleType,true), StructField(property_state,DoubleType,true), StructField(product_type,DoubleType,true), StructField(relocation_mortgage_indicator,DoubleType,true), StructField(seller_name,DoubleType,true), StructField(mod_flag,DoubleType,true), StructField(orig_interest_rate,DoubleType,true), StructField(orig_upb,IntegerType,true), StructField(orig_loan_term,IntegerType,true), StructField(orig_ltv,DoubleType,true), StructField(orig_cltv,DoubleType,true), StructField(num_borrowers,DoubleT...


StructType(StructField(orig_channel,DoubleType,true), StructField(first_home_buyer,DoubleType,true), StructField(loan_purpose,DoubleType,true), StructField(property_type,DoubleType,true), StructField(occupancy_status,DoubleType,true), StructField(property_state,DoubleType,true), StructField(product_type,DoubleType,true), StructField(relocation_mortgage_indicator,DoubleType,true), StructField(seller_name,DoubleType,true), StructField(mod_flag,DoubleType,true), StructField(orig_interest_rate,DoubleType,true), StructField(orig_upb,IntegerType,true), StructField(orig_loan_term,IntegerType,true), StructField(orig_ltv,DoubleType,true), StructField(orig_cltv,DoubleType,true), StructField(num_borrowers,DoubleT...
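The feature list above is simply every schema column except the label. As a minimal sketch of that relationship, the snippet below builds a hypothetical three-column schema (the names `demoLabel`, `demoSchema` and `demoFeatures` are illustrative and not part of the notebook) and checks that the feature count is one less than the column count. It uses `fieldNames`, which should be equivalent to the `filter`/`map` form used in the cell above.

```scala
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

// Hypothetical miniature schema: two features plus the label column.
val demoLabel = "delinquency_12"
val demoSchema = StructType(List(
  StructField("orig_channel", DoubleType),
  StructField("dti", DoubleType),
  StructField(demoLabel, IntegerType)))

// Every column except the label is treated as a feature.
val demoFeatures = demoSchema.fieldNames.filter(_ != demoLabel).toSeq

// The feature count must be exactly one less than the column count.
require(demoFeatures.length == demoSchema.length - 1)
println(s"${demoSchema.length} columns, ${demoFeatures.length} features")
```

Running the same `require` against the real `schema` and `featureNames` is a cheap way to catch a misspelled label name before training.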

## Create a new spark session and load data

A new Spark session must be created before running any of the Spark operations that follow.

NOTE: In this notebook, the dependency jars were loaded when the Toree kernel was installed. Alternatively, the jars can be loaded into the notebook with the [%AddJar magic](https://toree.incubator.apache.org/docs/current/user/faq/). However, `%AddJar` has one restriction: the added jar is only available if `%AddJar` is called immediately after a new Spark session is created, as shown below:

```scala
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("mortgage-GPU").getOrCreate
%AddJar file:/data/libs/cudf-XXX-cuda10.jar
%AddJar file:/data/libs/rapids-4-spark-XXX.jar
%AddJar file:/data/libs/xgboost4j_3.0-XXX.jar
%AddJar file:/data/libs/xgboost4j-spark_3.0-XXX.jar
// ...
```

##### Please note that the new jar "rapids-4-spark-XXX.jar" is only needed for the GPU version; do not add it to the dependency list for the CPU version.

In [4]:
// Build the spark session and data reader as usual
val sparkSession = SparkSession.builder.appName("mortgage-gpu").getOrCreate
val reader = sparkSession.read.option("header", true).schema(schema)

sparkSession = org.apache.spark.sql.SparkSession@56233666
reader = org.apache.spark.sql.DataFrameReader@62964667


org.apache.spark.sql.DataFrameReader@62964667

In [5]:
val trainSet = reader.csv(trainPath)
val evalSet  = reader.csv(evalPath)
val transSet = reader.csv(transPath)

trainSet = [orig_channel: double, first_home_buyer: double ... 26 more fields]
evalSet = [orig_channel: double, first_home_buyer: double ... 26 more fields]
transSet = [orig_channel: double, first_home_buyer: double ... 26 more fields]


[orig_channel: double, first_home_buyer: double ... 26 more fields]

## Set XGBoost parameters and build an XGBoostClassifier

For the CPU version, it is recommended to set `num_workers` to the number of CPU cores; for the GPU version, it should be set to the number of GPUs in the Spark cluster.

The `tree_method` also differs between the CPU and GPU versions. Currently only "gpu_hist" is supported for training on GPU.

```scala
// CPU version: the two parameters that differ from the GPU version
  "num_workers" -> 12,
  "tree_method" -> "hist",
```

In [6]:
val xgbParamFinal = commParamMap ++ Map("tree_method" -> "gpu_hist", "num_workers" -> 1)

xgbParamFinal = Map(min_child_weight -> 30, grow_policy -> depthwise, scale_pos_weight -> 2, num_workers -> 1, subsample -> 1, lambda -> 1, max_depth -> 10, objective -> binary:logistic, num_round -> 100, missing -> 0.0, tree_method -> gpu_hist, eta -> 0.1, max_leaves -> 256, gamma -> 0.1, nthread -> 1)


Map(min_child_weight -> 30, grow_policy -> depthwise, scale_pos_weight -> 2, num_workers -> 1, subsample -> 1, lambda -> 1, max_depth -> 10, objective -> binary:logistic, num_round -> 100, missing -> 0.0, tree_method -> gpu_hist, eta -> 0.1, max_leaves -> 256, gamma -> 0.1, nthread -> 1)
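The `++` operator used above merges two maps, with entries from the right-hand map winning on duplicate keys. The following sketch, which uses a small subset of the common parameters and the CPU and GPU values quoted in this section (the names `common`, `cpuParams`, `gpuParams` and `differing` are illustrative), shows how both configurations can be derived from one shared map and how to list the keys on which they differ.

```scala
// Parameters shared by both versions (a subset of the notebook's commParamMap).
val common: Map[String, Any] = Map(
  "eta" -> 0.1,
  "max_depth" -> 10,
  "objective" -> "binary:logistic",
  "num_round" -> 100)

// Version-specific additions, using the values quoted in this section.
val cpuParams = common ++ Map("tree_method" -> "hist", "num_workers" -> 12)
val gpuParams = common ++ Map("tree_method" -> "gpu_hist", "num_workers" -> 1)

// Keys on which the two configurations disagree.
val differing = cpuParams.keySet.filter(k => cpuParams(k) != gpuParams(k))
println(differing.toList.sorted) // List(num_workers, tree_method)
```

Keeping the shared settings in one map makes it easier to compare CPU and GPU runs, since only the two listed keys change between them.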

Here is the only API difference: `setFeaturesCol` in the CPU version versus `setFeaturesCols` in the GPU version.

As the extra imports listed earlier suggest, the CPU version needs `VectorAssembler` to assemble multiple feature columns into one column, because `setFeaturesCol` only accepts a single feature column of type `vector`.

`setFeaturesCols`, by contrast, accepts multiple columns, so the feature column names can be passed directly to `XGBoostClassifier`.

CPU version:

```scala
val xgbClassifier  = new XGBoostClassifier(paramMap)
  .setLabelCol(labelName)
  .setFeaturesCol("features")
```

In [7]:
val xgbClassifier = new XGBoostClassifier(xgbParamFinal)
      .setLabelCol(labelColName)
      // === diff ===
      .setFeaturesCols(featureNames)

xgbClassifier = xgbc_51efa8a205b1


xgbc_51efa8a205b1

## Benchmark and train
The `Benchmark` object is used to measure the elapsed time of selected operations.

Training with evaluation sets is also supported in two ways, matching the CPU version's behavior:

* Call API `setEvalSets` after initializing an XGBoostClassifier

```scala
xgbClassifier.setEvalSets(Map("eval" -> evalSet))
```

* Use parameter `eval_sets` when initializing an XGBoostClassifier

```scala
val paramMapWithEval = paramMap + ("eval_sets" -> Map("eval" -> evalSet))
val xgbClassifierWithEval = new XGBoostClassifier(paramMapWithEval)
```

This notebook uses the API approach to set the evaluation sets.

In [8]:
xgbClassifier.setEvalSets(Map("eval" -> evalSet))

xgbc_51efa8a205b1

In [9]:
object Benchmark {
  def time[R](phase: String)(block: => R): (R, Float) = {
    val t0 = System.currentTimeMillis
    val result = block // call-by-name
    val t1 = System.currentTimeMillis
    println("Elapsed time [" + phase + "]: " + ((t1 - t0).toFloat / 1000) + "s")
    (result, (t1 - t0).toFloat / 1000)
  }
}

defined object Benchmark
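The timing helper returns both the block's result and the elapsed seconds, although the cells in this notebook discard the second value with `_`. As a small sketch of how the elapsed time could be captured as well, the snippet below uses a compact copy of the helper (named `DemoBenchmark` here so that it runs on its own) with a hypothetical workload that has nothing to do with Spark.

```scala
// Compact copy of the timing helper so this sketch runs on its own.
object DemoBenchmark {
  def time[R](phase: String)(block: => R): (R, Float) = {
    val t0 = System.currentTimeMillis
    val result = block // the by-name block is evaluated here, exactly once
    val seconds = (System.currentTimeMillis - t0).toFloat / 1000
    println(s"Elapsed time [$phase]: ${seconds}s")
    (result, seconds)
  }
}

// Hypothetical workload: sum the first million integers.
val (total, elapsed) = DemoBenchmark.time("sum") {
  (1L to 1000000L).sum
}
println(s"total = $total, elapsed = ${elapsed}s")
```

Because Spark transformations are lazy, timing a block only measures real work when the block triggers an action, which is why the transform cell later in this notebook forces evaluation inside the timed block.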


The CPU version requires an extra step before fitting the data to the classifier: using `VectorAssembler` to assemble all feature columns into one column. The following code snippet shows how to do the vectorizing.

```scala
object Vectorize {
  def apply(df: DataFrame, featureNames: Seq[String], labelName: String): DataFrame = {
    val toFloat = df.schema.map(f => col(f.name).cast(FloatType))
    new VectorAssembler()
      .setInputCols(featureNames.toArray)
      .setOutputCol("features")
      .transform(df.select(toFloat:_*))
      .select(col("features"), col(labelName))
  }
}

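// New vals are used because the datasets loaded in this notebook are immutable vals.
val trainVec = Vectorize(trainSet, featureNames, labelColName)
val evalVec = Vectorize(evalSet, featureNames, labelColName)
val transVec = Vectorize(transSet, featureNames, labelColName)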
```

Fortunately, `VectorAssembler` is not needed for the GPU version. Just fit the loaded data directly to the `XGBoostClassifier`.

In [10]:
// Start training
println("\n------ Training ------")
val (xgbClassificationModel, _) = Benchmark.time("train") {
  xgbClassifier.fit(trainSet)
}


------ Training ------
Tracker started, with env={DMLC_NUM_SERVER=0, DMLC_TRACKER_URI=10.19.183.78, DMLC_TRACKER_PORT=9091, DMLC_NUM_WORKER=1}


xgbClassificationModel = xgbc_51efa8a205b1


Elapsed time [train]: 35.409s


xgbc_51efa8a205b1

## Transformation and evaluation
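This step uses `transSet` to evaluate the model and prints a few useful columns to show the prediction results. After that, `MulticlassClassificationEvaluator` computes an overall score for the predictions. Note that the evaluator is created without an explicit metric, so it uses its default, F1 (see `metricName=f1` in the output below); the value printed under the "Accuracy" heading is therefore an F1 score.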

In [11]:
println("\n------ Transforming ------")
val (results, _) = Benchmark.time("transform") {
  val ret = xgbClassificationModel.transform(transSet).cache()
  ret.foreachPartition((_: Iterator[_]) => ())
  ret
}
results.select("orig_channel", labelColName,"rawPrediction","probability","prediction").show(10)

println("\n------Accuracy of Evaluation------")
val evaluator = new MulticlassClassificationEvaluator().setLabelCol(labelColName)
val accuracy = evaluator.evaluate(results)
println(accuracy)


------ Transforming ------
Elapsed time [transform]: 10.922s
+------------+--------------+--------------------+--------------------+----------+
|orig_channel|delinquency_12|       rawPrediction|         probability|prediction|
+------------+--------------+--------------------+--------------------+----------+
|         0.0|             0|[7.94447755813598...|[0.99964551060111...|       0.0|
|         0.0|             0|[4.54992532730102...|[0.98954252060502...|       0.0|
|         0.0|             0|[4.54992532730102...|[0.98954252060502...|       0.0|
|         0.0|             0|[4.54992532730102...|[0.98954252060502...|       0.0|
|         0.0|             0|[4.54992532730102...|[0.98954252060502...|       0.0|
|         0.0|             0|[4.54992532730102...|[0.98954252060502...|       0.0|
|         0.0|             0|[4.54992532730102...|[0.98954252060502...|       0.0|
|         0.0|             0|[4.27546072006225...|[0.98628507554531...|       0.0|
|         0.0|           

results = [orig_channel: double, first_home_buyer: double ... 29 more fields]
evaluator = MulticlassClassificationEvaluator: uid=mcEval_cfa6376fa392, metricName=f1, metricLabel=0.0, beta=1.0, eps=1.0E-15
accuracy = 0.9876288410129955


0.9876288410129955
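The evaluator output above reports `metricName=f1`, so the printed number is not plain accuracy. The sketch below illustrates the difference on made-up (label, prediction) pairs in plain Scala. The weighted-F1 formula (per-class F1 averaged by each class's share of the labels) is my understanding of how the default metric is defined and should be treated as an assumption; all names in the snippet are illustrative.

```scala
// Hypothetical (label, prediction) pairs: 8 true negatives, 1 true positive, 1 missed positive.
val pairs: Seq[(Double, Double)] =
  Seq.fill(8)((0.0, 0.0)) ++ Seq((1.0, 1.0), (1.0, 0.0))

// Accuracy: fraction of rows where the prediction equals the label.
val accuracyDemo = pairs.count { case (l, p) => l == p }.toDouble / pairs.size

// F1 for one class, from its precision and recall (0 when both are 0).
def f1For(cls: Double): Double = {
  val tp = pairs.count { case (l, p) => l == cls && p == cls }.toDouble
  val predicted = pairs.count(_._2 == cls)
  val actual = pairs.count(_._1 == cls)
  val precision = if (predicted == 0) 0.0 else tp / predicted
  val recall = if (actual == 0) 0.0 else tp / actual
  if (precision + recall == 0) 0.0 else 2 * precision * recall / (precision + recall)
}

// Weighted F1: per-class F1 averaged with each class's share of the labels.
val weightedF1 = pairs.map(_._1).distinct.map { cls =>
  f1For(cls) * pairs.count(_._1 == cls) / pairs.size
}.sum

println(f"accuracy = $accuracyDemo%.4f, weighted F1 = $weightedF1%.4f")
```

On imbalanced data such as this, the two numbers diverge because a missed positive hurts the minority class's F1 more than it hurts overall accuracy. To report plain accuracy from the evaluator itself, `setMetricName("accuracy")` can be called on it.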

## Save the model to disk and load model
Save the model to disk, then load it back into memory and use the loaded model to run a new prediction.

In [12]:
xgbClassificationModel.write.overwrite.save(dataRoot + "/model/mortgage")

val modelFromDisk = XGBoostClassificationModel.load(dataRoot + "/model/mortgage")

val (results2, _) = Benchmark.time("transform2") {
  modelFromDisk.transform(transSet)
}
results2.show(10)

Elapsed time [transform2]: 0.072s


modelFromDisk = xgbc_51efa8a205b1
results2 = [orig_channel: double, first_home_buyer: double ... 29 more fields]


+------------+----------------+------------+-------------+----------------+--------------+------------+-----------------------------+-----------+--------+------------------+--------+--------------+--------+---------+-------------+----+---------------------+---------+---+--------------------------+-------------------------------+------------------+-------------+--------+-------+------------------------+--------------+--------------------+--------------------+----------+
|orig_channel|first_home_buyer|loan_purpose|property_type|occupancy_status|property_state|product_type|relocation_mortgage_indicator|seller_name|mod_flag|orig_interest_rate|orig_upb|orig_loan_term|orig_ltv|orig_cltv|num_borrowers| dti|borrower_credit_score|num_units|zip|mortgage_insurance_percent|current_loan_delinquency_status|current_actual_upb|interest_rate|loan_age|    msa|non_interest_bearing_upb|delinquency_12|       rawPrediction|         probability|prediction|
+------------+----------------+------------+--------

[orig_channel: double, first_home_buyer: double ... 29 more fields]

In [13]:
sparkSession.close()