# Introduction to XGBoost-Spark Cross Validation with GPU

The goal of this notebook is to show how to leverage GPUs to accelerate XGBoost-Spark cross validation for hyperparameter tuning. The best model found over the given hyperparameter grid is returned.

Note: CrossValidation cannot be run with the latest cudf v21.06.1 because of some API changes. We plan to release a new XGBoost jar with the fix soon. Until then, this notebook stays on cudf v0.19.2 and rapids-4-spark v0.5.0.

The 'Taxi' application is used as the example here.

A few libraries are required for this notebook:
  1. NumPy
  2. cudf jar
  3. xgboost4j jar
  4. xgboost4j-spark jar

#### Import the Required Libraries

In [1]:
from ml.dmlc.xgboost4j.scala.spark import XGBoostRegressionModel, XGBoostRegressor
from ml.dmlc.xgboost4j.scala.spark.rapids import CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, IntegerType, StructField, StructType
from time import time

As shown above, `CrossValidator` is imported from the package `ml.dmlc.xgboost4j.scala.spark.rapids`, not from Spark's own `pyspark.ml.tuning` module.

#### Create a Spark Session

In [2]:
spark = SparkSession.builder.appName("taxi-cv-gpu-python").getOrCreate()

#### Specify the Data Schema and Load the Data

In [3]:
label = 'fare_amount'
schema = StructType([
    StructField('vendor_id', FloatType()),
    StructField('passenger_count', FloatType()),
    StructField('trip_distance', FloatType()),
    StructField('pickup_longitude', FloatType()),
    StructField('pickup_latitude', FloatType()),
    StructField('rate_code', FloatType()),
    StructField('store_and_fwd', FloatType()),
    StructField('dropoff_longitude', FloatType()),
    StructField('dropoff_latitude', FloatType()),
    StructField(label, FloatType()),
    StructField('hour', FloatType()),
    StructField('year', IntegerType()),
    StructField('month', IntegerType()),
    StructField('day', FloatType()),
    StructField('day_of_week', FloatType()),
    StructField('is_weekend', FloatType()),
])

features = [ x.name for x in schema if x.name != label ]

train_data = spark.read.parquet('/data/taxi/parquet/train')
trans_data = spark.read.parquet('/data/taxi/parquet/eval')
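
The `features` list above is simply every schema column except the label. As a minimal sketch that runs without Spark, the same filter can be applied to plain column-name strings (the strings stand in for the `StructField` objects and are copied from the schema above):

```python
# Column names copied from the schema above; plain strings stand in for StructField.
label = 'fare_amount'
columns = [
    'vendor_id', 'passenger_count', 'trip_distance',
    'pickup_longitude', 'pickup_latitude', 'rate_code', 'store_and_fwd',
    'dropoff_longitude', 'dropoff_latitude', label,
    'hour', 'year', 'month', 'day', 'day_of_week', 'is_weekend',
]

# Same filter as in the cell above: keep everything except the label column.
features = [name for name in columns if name != label]
print(len(columns), 'columns,', len(features), 'features')
```

The label has to be excluded here because the regressor receives it separately through `setLabelCol`.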

#### Build an XGBoost-Spark CrossValidator

In [4]:
# First build the GPU version of the regressor, using setFeaturesCols to set the feature columns
params = {
    'eta': 0.05,
    'maxDepth': 8,
    'subsample': 0.8,
    'gamma': 1.0,
    'numRound': 100,
    'numWorkers': 1,
    'treeMethod': 'gpu_hist',
}
regressor = XGBoostRegressor(**params).setLabelCol(label).setFeaturesCols(features)
# Then build the evaluator and the hyperparameters
evaluator = (RegressionEvaluator()
    .setLabelCol(label))
param_grid = (ParamGridBuilder()
    .addGrid(regressor.maxDepth, [3, 6])
    .addGrid(regressor.numRound, [100, 200])
    .build())
# Finally the cross validator
cross_validator = (CrossValidator()
    .setEstimator(regressor)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(param_grid)
    .setNumFolds(3))
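
To get a feel for how much work this configuration implies, the grid can be expanded by hand. The sketch below uses only the standard library and assumes this `CrossValidator` follows the same procedure as Spark's own: one fit per parameter combination per fold, then one final refit of the best combination on the full training set. The names `grid`, `combos`, and `fits_during_cv` are illustrative and not part of any library.

```python
from itertools import product

# Values copied from the param_grid and setNumFolds calls above.
grid = {'maxDepth': [3, 6], 'numRound': [100, 200]}
num_folds = 3

# Cartesian product of all grid values, one dict per combination.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
for combo in combos:
    print(combo)

# Assumption: one fit per (combination, fold), plus one final refit of the winner.
fits_during_cv = len(combos) * num_folds
total_fits = fits_during_cv + 1
print(fits_during_cv, 'fits during cross validation,', total_fits, 'in total')
```

Under that assumption, the grid values take the place of the `maxDepth` and `numRound` entries in `params` for each fit, so the grid, not the initial dictionary, decides which values are actually tried.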

#### Start Cross Validation by Fitting Data to CrossValidator

In [5]:
def with_benchmark(phrase, action):
    start = time()
    result = action()
    end = time()
    print('{} takes {} seconds'.format(phrase, round(end - start, 2)))
    return result
model = with_benchmark('Cross-Validation', lambda: cross_validator.fit(train_data)).bestModel

Cross-Validation takes 73.77 seconds
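
Conceptually, the `fit` call above performs a k-fold search. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the helpers `k_fold_indices` and `cross_validate`, the toy `fit` and `score` functions, and the `shrink` hyperparameter are all hypothetical and exist only to show the control flow (split, train on k-1 folds, score on the held-out fold, average, pick the lowest error).

```python
import random

def k_fold_indices(n_rows, num_folds, seed=0):
    """Shuffle row indices and deal them into num_folds roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::num_folds] for i in range(num_folds)]

def cross_validate(rows, candidates, fit, score, num_folds=3):
    """Return (best_candidate, avg_scores); a lower score is better (e.g. RMSE)."""
    folds = k_fold_indices(len(rows), num_folds)
    avg_scores = []
    for candidate in candidates:
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train = [r for i, r in enumerate(rows) if i not in held]
            valid = [rows[i] for i in held_out]
            model = fit(train, candidate)
            fold_scores.append(score(model, valid))
        avg_scores.append(sum(fold_scores) / num_folds)
    best = candidates[avg_scores.index(min(avg_scores))]
    return best, avg_scores

# Toy "model": predict the training mean scaled by a hypothetical hyperparameter.
def fit(train, shrink):
    return shrink * sum(train) / len(train)

# RMSE of a constant prediction against the held-out values.
def score(model, valid):
    return (sum((y - model) ** 2 for y in valid) / len(valid)) ** 0.5

rows = [10.0, 12.0, 11.0, 9.0, 10.5, 11.5, 9.5, 10.0, 12.5]
best, scores = cross_validate(rows, [0.5, 1.0], fit, score)
print('best candidate:', best)
print('average RMSE per candidate:', scores)
```

In the notebook, the candidates are the four grid combinations, the score is the evaluator's metric, and `.bestModel` is the model for the winning combination.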


#### Transform with the Best Model

In [6]:
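def transform():
    # cache() is lazy, so run a no-op action over every partition to force
    # the predictions to be computed and cached inside the timed call
    result = model.transform(trans_data).cache()
    result.foreachPartition(lambda _: None)
    return result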
result = with_benchmark('Transforming', transform)
result.select(label, 'prediction').show(5)

Transforming takes 1.33 seconds
+-----------+-----------------+
|fare_amount|       prediction|
+-----------+-----------------+
|        2.5|34.38509750366211|
|       45.0|37.97528839111328|
|        2.5|28.55727195739746|
|       45.0|40.39316177368164|
|       45.0|36.12188720703125|
+-----------+-----------------+
only showing top 5 rows



#### Evaluation

In [7]:
# RegressionEvaluator reports RMSE by default, so name the result accordingly
rmse = with_benchmark(
    'Evaluation',
    lambda: RegressionEvaluator().setLabelCol(label).evaluate(result))
print('RMSE is ' + str(rmse))

Evaluation takes 0.26 seconds
RMSE is 3.5167114187894883
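
For reference, RMSE is the square root of the mean squared difference between label and prediction. The sketch below computes it by hand for the five rows printed by `show(5)` earlier; the `rmse` helper is illustrative and is not the evaluator's code.

```python
import math

# The five (fare_amount, prediction) rows printed by show(5) above.
labels = [2.5, 45.0, 2.5, 45.0, 45.0]
predictions = [
    34.38509750366211,
    37.97528839111328,
    28.55727195739746,
    40.39316177368164,
    36.12188720703125,
]

def rmse(y_true, y_pred):
    """Root mean squared error over paired labels and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

sample_rmse = rmse(labels, predictions)
print('RMSE over the five displayed rows:', sample_rmse)
```

The value for these five rows is much larger than the 3.52 reported over the whole evaluation set, which suggests that the first rows shown are not representative of the data as a whole.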


In [8]:
spark.stop()