# Introduction to XGBoost-Spark Cross Validation with GPU

The goal of this notebook is to show how to leverage GPUs to accelerate XGBoost-Spark cross validation for hyperparameter tuning. Cross validation evaluates every hyperparameter combination in the given grid and returns the best model.

The 'Taxi' application is used as the example.

A few libraries are required for this notebook:
  1. cudf-cu11
  2. xgboost
  3. scikit-learn
  4. numpy
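
Before running the notebook, it can help to confirm that these packages are importable in the Python environment used by the driver. The sketch below is one possible check, not part of the original notebook. Note that two import names differ from the package names (`cudf-cu11` is imported as `cudf`, `scikit-learn` as `sklearn`). The helper name `find_missing` is hypothetical, and `pyspark` is added to the list only because the import cell needs it.

```python
from importlib.util import find_spec

# Mapping of pip package name -> top-level import name.
# pyspark is an addition here: the notebook imports it directly.
REQUIRED = {
    "cudf-cu11": "cudf",
    "xgboost": "xgboost",
    "scikit-learn": "sklearn",
    "numpy": "numpy",
    "pyspark": "pyspark",
}

def find_missing(required):
    """Return the pip names whose top-level module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]

missing = find_missing(REQUIRED)
print("Missing packages:", missing if missing else "none")
```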

#### Import the Required Libraries

In [1]:
from xgboost.spark import SparkXGBRegressor, SparkXGBRegressorModel
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, IntegerType, StructField, StructType
from time import time
from pyspark.conf import SparkConf
import os
# os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
# os.environ['PYSPARK_DRIVER_PYTHON'] = "./environment/bin/python"

#### Create a Spark Session

In [2]:
SPARK_MASTER_URL = os.getenv("SPARK_MASTER_URL", "/your-url")

RAPIDS_JAR = os.getenv("RAPIDS_JAR", "/your-jar-path")

# Update these values to match your actual hardware resources
driverMem = os.getenv("DRIVER_MEM", "2g")
executorMem = os.getenv("EXECUTOR_MEM", "2g")
pinnedPoolSize = os.getenv("PINNED_POOL_SIZE", "2g")
concurrentGpuTasks = os.getenv("CONCURRENT_GPU_TASKS", "2")
executorCores = int(os.getenv("EXECUTOR_CORES", "2"))
# Common spark settings
conf = SparkConf()
conf.setMaster(SPARK_MASTER_URL)
conf.setAppName("Microbenchmark on GPU")
conf.set("spark.executor.instances","1")
conf.set("spark.driver.memory", driverMem)
# Tasks run in GPU memory, so a large host memory setting is not needed
conf.set("spark.executor.memory", executorMem)
# Tasks run on GPU cores, so few CPU cores are needed
conf.set("spark.executor.cores", executorCores)

# Plugin settings
conf.set("spark.executor.resource.gpu.amount", "1")
conf.set("spark.rapids.sql.concurrentGpuTasks", concurrentGpuTasks)
conf.set("spark.rapids.memory.pinnedPool.size", pinnedPoolSize)
conf.set("spark.rapids.memory.gpu.allocFraction","0.7")
conf.set("spark.locality.wait","0")
# Note: only a value of 1 is supported, see
# https://github.com/dmlc/xgboost/blame/master/python-package/xgboost/spark/core.py#L370-L374
conf.set("spark.task.resource.gpu.amount", 1) 
conf.set("spark.rapids.sql.enabled", "true") 
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.sql.cache.serializer","com.nvidia.spark.ParquetCachedBatchSerializer")
conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 200000) 
conf.set("spark.driver.extraClassPath", RAPIDS_JAR)
conf.set("spark.executor.extraClassPath", RAPIDS_JAR)
# Uncomment the next line if you ship a packed Python environment as an archive
# conf.set("spark.yarn.dist.archives", "your_pyspark_venv.tar.gz#environment")
# Create spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()

reader = spark.read

2022-11-30 08:02:09,748 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-11-30 08:02:10,103 WARN resource.ResourceUtils: The configuration of cores (exec = 2 task = 1, runnable tasks = 2) will result in wasted resources due to resource gpu limiting the number of runnable tasks per executor to: 1. Please adjust your configuration.
2022-11-30 08:02:23,737 WARN rapids.RapidsPluginUtils: RAPIDS Accelerator 24.02.0 using cudf 24.02.0.
2022-11-30 08:02:23,752 WARN rapids.RapidsPluginUtils: spark.rapids.sql.multiThreadedRead.numThreads is set to 20.
2022-11-30 08:02:23,756 WARN rapids.RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
2022-11-30 08:02:23,757 WARN rapids.RapidsPluginUtils: spark.rapids.sql.explain is se
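
The `ResourceUtils` warning in the output above follows from how Spark derives the number of concurrent task slots per executor: each resource (CPU cores, GPUs) allows `executor amount / task amount` tasks, and the smallest of those limits wins. The sketch below reproduces that arithmetic for the values used in this notebook. The function name `task_slots` is hypothetical, and `spark.task.cpus` is assumed to be left at its default of 1.

```python
def task_slots(executor_cores, task_cpus, executor_gpus, task_gpus):
    """Concurrent tasks per executor: the most restrictive resource wins."""
    cpu_slots = executor_cores // task_cpus
    gpu_slots = int(executor_gpus / task_gpus)
    return min(cpu_slots, gpu_slots)

# Values from the configuration above: 2 cores and 1 GPU per executor,
# 1 CPU (assumed default) and 1 GPU per task.
slots = task_slots(executor_cores=2, task_cpus=1, executor_gpus=1, task_gpus=1)
print(slots)  # 1: the single GPU limits the executor to one task at a time
```

With one GPU per task, the second core has no task to run, which is what the warning reports.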

#### Specify the Data Schema and Load the Data

In [3]:
label = 'fare_amount'
schema = StructType([
    StructField('vendor_id', FloatType()),
    StructField('passenger_count', FloatType()),
    StructField('trip_distance', FloatType()),
    StructField('pickup_longitude', FloatType()),
    StructField('pickup_latitude', FloatType()),
    StructField('rate_code', FloatType()),
    StructField('store_and_fwd', FloatType()),
    StructField('dropoff_longitude', FloatType()),
    StructField('dropoff_latitude', FloatType()),
    StructField(label, FloatType()),
    StructField('hour', FloatType()),
    StructField('year', IntegerType()),
    StructField('month', IntegerType()),
    StructField('day', FloatType()),
    StructField('day_of_week', FloatType()),
    StructField('is_weekend', FloatType()),
])

features = [ x.name for x in schema if x.name != label ]

# You need to update them to your real paths!
dataRoot = os.getenv("DATA_ROOT", "/data")
train_path = dataRoot + "/taxi/csv/train"
eval_path = dataRoot + "/taxi/csv/test"

data_format = 'csv'
has_header = 'true'
if data_format == 'csv':
    train_data = reader.schema(schema).option('header',has_header).csv(train_path)
    trans_data = reader.schema(schema).option('header',has_header).csv(eval_path)
else:
    train_data = reader.load(train_path)
    trans_data = reader.load(eval_path)

#### Build an XGBoost-Spark CrossValidator

In [4]:
# First build a GPU regressor; passing a list of column names as features_col sets the feature columns
params = {
    "tree_method": "gpu_hist",
    "grow_policy": "depthwise",
    "num_workers": 1,
    # use_gpu is a boolean parameter, so pass True rather than the string "true"
    "use_gpu": True,
}
params['features_col'] = features
params['label_col'] = label

regressor = SparkXGBRegressor(**params)
# Then build the evaluator and the hyperparameters
evaluator = (RegressionEvaluator()
    .setLabelCol(label))
param_grid = (ParamGridBuilder()
    .addGrid(regressor.max_depth, [3, 6])
    .addGrid(regressor.n_estimators, [100, 200])
    .build())
# Finally, build the cross validator
cross_validator = (CrossValidator()
    .setEstimator(regressor)
    .setEvaluator(evaluator)
    .setEstimatorParamMaps(param_grid)
    .setNumFolds(2))
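
To get a sense of the cost, the grid above holds 2 × 2 = 4 parameter combinations. With 2 folds, each combination is trained twice, followed by one final fit of the best combination on the full training set. The following sketch enumerates the combinations in plain Python; it does not touch Spark, and the variable names are illustrative only.

```python
from itertools import product

# Same values as the ParamGridBuilder call above.
grid = {
    "max_depth": [3, 6],
    "n_estimators": [100, 200],
}
num_folds = 2

# Cartesian product of all value lists, one dict per combination.
param_maps = [dict(zip(grid, values)) for values in product(*grid.values())]
for pm in param_maps:
    print(pm)

# Each combination is fit once per fold, plus one refit of the winner.
total_fits = len(param_maps) * num_folds + 1
print("combinations:", len(param_maps), "total fits:", total_fits)
```

Adding values to either list multiplies the number of fits, so the grid size is the main lever on cross-validation time.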

#### Start Cross Validation by Fitting Data to CrossValidator

In [5]:
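def with_benchmark(phrase, action):
    """Run action(), print its wall-clock duration, and return its result."""
    start = time()
    result = action()
    end = time()
    print('{} takes {} seconds'.format(phrase, round(end - start, 2)))
    return result

# fit() runs the full cross validation; bestModel is the winning model refit on all of train_data
model = with_benchmark('Cross-Validation', lambda: cross_validator.fit(train_data)).bestModel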

If features_cols param set, then features_col param is ignored.
2022-11-30 08:03:14,308 WARN rapids.GpuOverrides: 
! <DeserializeToObjectExec> cannot run on GPU because not all expressions can be replaced; GPU does not currently support the operator class org.apache.spark.sql.execution.DeserializeToObjectExec
  ! <CreateExternalRow> createexternalrow(prediction#889, fare_amount#890, 1.0#891, StructField(prediction,DoubleType,true), StructField(fare_amount,DoubleType,true), StructField(1.0,DoubleType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
    @Expression <AttributeReference> prediction#889 could run on GPU
    @Expression <AttributeReference> fare_amount#890 could run on GPU
    @Expression <AttributeReference> 1.0#891 could run on GPU
  !Expression <AttributeReference> obj#895 cannot run on GPU because expression AttributeReference obj#895 produces an unsupported type Object

If features_cols param set, then features_col param is ignored.

Cross-Validation takes 55.19 seconds
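
`CrossValidator` averages the evaluator's metric over the folds for each parameter combination and keeps the combination with the best average. For RMSE, the default metric of `RegressionEvaluator`, lower is better. The sketch below mimics that selection step in plain Python. The per-fold RMSE values are made up for illustration and are not results from this run.

```python
# Hypothetical per-fold RMSE values, keyed by (max_depth, n_estimators).
# These numbers are invented for illustration, not taken from the run above.
fold_rmse = {
    (3, 100): [2.40, 2.50],
    (3, 200): [2.30, 2.35],
    (6, 100): [2.10, 2.20],
    (6, 200): [2.05, 2.15],
}

# Average the metric across folds for every combination.
avg_rmse = {params: sum(v) / len(v) for params, v in fold_rmse.items()}

# RMSE is an error metric, so the smallest average wins.
best_params = min(avg_rmse, key=avg_rmse.get)
print("best (max_depth, n_estimators):", best_params)
```

On a fitted `CrossValidatorModel`, the corresponding per-combination averages are exposed as `avgMetrics`.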


                                                                                

#### Transform On the Best Model

In [6]:
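def transform():
    # Spark is lazy: cache() marks the result for reuse, and the no-op
    # foreachPartition forces the prediction to actually run so it can be timed
    result = model.transform(trans_data).cache()
    result.foreachPartition(lambda _: None)
    return result
result = with_benchmark('Transforming', transform)
result.select(label, 'prediction').show(5)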

Transforming takes 0.23 seconds


2022-11-30 08:03:45,503 WARN rapids.GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU



+-----------+-----------+
|fare_amount| prediction|
+-----------+-----------+
|        5.0| 5.01032114|
|       34.0|  31.134758|
|       10.0|9.288980484|
|       16.5|15.33446312|
|        7.0|8.197098732|
+-----------+-----------+
only showing top 5 rows



#### Evaluation

In [7]:
# RegressionEvaluator reports RMSE by default, so name the result accordingly
rmse = with_benchmark(
    'Evaluation',
    lambda: RegressionEvaluator().setLabelCol(label).evaluate(result))
print('RMSE is ' + str(rmse))

Evaluation takes 0.05 seconds
RMSE is 2.055690464034438


2022-11-30 08:03:45,728 WARN rapids.GpuOverrides: 
! <DeserializeToObjectExec> cannot run on GPU because not all expressions can be replaced; GPU does not currently support the operator class org.apache.spark.sql.execution.DeserializeToObjectExec
  ! <CreateExternalRow> createexternalrow(prediction#7645, fare_amount#8271, 1.0#8272, StructField(prediction,DoubleType,true), StructField(fare_amount,DoubleType,true), StructField(1.0,DoubleType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
    @Expression <AttributeReference> prediction#7645 could run on GPU
    @Expression <AttributeReference> fare_amount#8271 could run on GPU
    @Expression <AttributeReference> 1.0#8272 could run on GPU
  !Expression <AttributeReference> obj#8276 cannot run on GPU because expression AttributeReference obj#8276 produces an unsupported type ObjectType(interface org.apache.spark.sql.Row)
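
The RMSE reported above is the square root of the mean squared difference between `fare_amount` and `prediction`. As a rough illustration, the sketch below applies that formula to the five rows printed in the transform step. Since it covers only those five rows, its result differs from the RMSE over the full test set.

```python
from math import sqrt

# (fare_amount, prediction) pairs copied from the five rows shown earlier.
rows = [
    (5.0, 5.01032114),
    (34.0, 31.134758),
    (10.0, 9.288980484),
    (16.5, 15.33446312),
    (7.0, 8.197098732),
]

def rmse(pairs):
    """Root mean squared error over (label, prediction) pairs."""
    squared_errors = [(label - pred) ** 2 for label, pred in pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse(rows), 3))
```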



In [8]:
spark.stop()