# Introduction to XGBoost Spark3.1 with GPU

Taxi is an example of an XGBoost regressor. This notebook shows how to load data, train an XGBoost model, and use that model to predict the "fare_amount" of a taxi trip.

A few libraries required for this notebook:
  1. cudf-cu11
  2. xgboost
  3. scikit-learn
  4. numpy

This notebook also illustrates how easily sample CPU-based Spark xgboost4j code can be ported to GPU. No API change is required to run Spark XGBoost on GPU, because the CPU and GPU versions call the same API. The one difference is that a CPU run needs the training dataset to be vectorized before it is fitted to the regressor.

#### Import Required Libraries

In [1]:
from xgboost.spark import SparkXGBRegressor, SparkXGBRegressorModel
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, IntegerType, StructField, StructType
from time import time
from pyspark.conf import SparkConf
import os
# if you pass/unpack the archive file and enable the environment
# os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
# os.environ['PYSPARK_DRIVER_PYTHON'] = "./environment/bin/python"

In addition, the CPU version requires two extra imports.
```Python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
```

#### Create Spark Session and Data Reader

In [2]:
SPARK_MASTER_URL = os.getenv("SPARK_MASTER_URL", "/your-url")

RAPIDS_JAR = os.getenv("RAPIDS_JAR", "/your-jar-path")

# You need to update with your real hardware resource 
driverMem = os.getenv("DRIVER_MEM", "2g")
executorMem = os.getenv("EXECUTOR_MEM", "2g")
pinnedPoolSize = os.getenv("PINNED_POOL_SIZE", "2g")
concurrentGpuTasks = os.getenv("CONCURRENT_GPU_TASKS", "2")
executorCores = int(os.getenv("EXECUTOR_CORES", "2"))
# Common spark settings
conf = SparkConf()
conf.setMaster(SPARK_MASTER_URL)
conf.setAppName("Microbenchmark on GPU")
conf.set("spark.executor.instances","1")
conf.set("spark.driver.memory", driverMem)
## The tasks will run on GPU memory, so there is no need to set a high host memory
conf.set("spark.executor.memory", executorMem)
## The tasks will run on GPU cores, so there is no need to use many cpu cores
conf.set("spark.executor.cores", executorCores)

# Plugin settings
conf.set("spark.executor.resource.gpu.amount", "1")
conf.set("spark.rapids.sql.concurrentGpuTasks", concurrentGpuTasks)
conf.set("spark.rapids.memory.pinnedPool.size", pinnedPoolSize)
# Since pyspark and xgboost share the same GPU, cap the RAPIDS allocation so that some GPU memory is left for xgboost; this avoids GPU OOM during training
conf.set("spark.rapids.memory.gpu.allocFraction","0.7")
conf.set("spark.locality.wait","0")
# Note: only a value of 1 is supported; see https://github.com/dmlc/xgboost/blame/master/python-package/xgboost/spark/core.py#L370-L374
conf.set("spark.task.resource.gpu.amount", 1) 
conf.set("spark.rapids.sql.enabled", "true") 
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.sql.cache.serializer","com.nvidia.spark.ParquetCachedBatchSerializer")
conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 200000) 
conf.set("spark.driver.extraClassPath", RAPIDS_JAR)
conf.set("spark.executor.extraClassPath", RAPIDS_JAR)

# if you pass/unpack the archive file and enable the environment
# conf.set("spark.yarn.dist.archives", "your_pyspark_venv.tar.gz#environment")
# Create spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()

reader = spark.read

2022-11-30 07:51:19,104 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-11-30 07:51:19,480 WARN resource.ResourceUtils: The configuration of cores (exec = 2 task = 1, runnable tasks = 2) will result in wasted resources due to resource gpu limiting the number of runnable tasks per executor to: 1. Please adjust your configuration.
2022-11-30 07:51:33,277 WARN rapids.RapidsPluginUtils: RAPIDS Accelerator 23.06.0 using cudf 23.06.0.
2022-11-30 07:51:33,292 WARN rapids.RapidsPluginUtils: spark.rapids.sql.multiThreadedRead.numThreads is set to 20.
2022-11-30 07:51:33,295 WARN rapids.RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.
2022-11-30 07:51:33,295 WARN rapids.RapidsPluginUtils: spark.rapids.sql.explain is se
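The `ResourceUtils` warning in the output above follows from how the executor and task resource amounts combine: the number of tasks that can run at once on an executor is bounded by the scarcest resource. The sketch below reproduces that arithmetic in plain Python. `runnable_tasks_per_executor` is a hypothetical helper written for this illustration, not a Spark API, and it assumes `spark.task.cpus` is left at its default of 1.

```python
def runnable_tasks_per_executor(executor_cores, task_cpus, executor_gpus, task_gpus):
    """Return how many tasks fit on one executor, limited by the scarcest resource."""
    by_cpu = executor_cores // task_cpus       # slots allowed by CPU cores
    by_gpu = int(executor_gpus / task_gpus)    # slots allowed by GPUs
    return min(by_cpu, by_gpu)


# Values from the configuration above: 2 executor cores, 1 GPU per executor,
# 1 GPU per task, and the assumed default of 1 CPU per task.
slots = runnable_tasks_per_executor(executor_cores=2, task_cpus=1,
                                    executor_gpus=1, task_gpus=1)
print(slots)  # prints 1
```

With these settings the GPU is the limiting resource, so one of the two executor cores stays idle during GPU tasks, which is what the warning reports.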

#### Specify the Data Schema and Load the Data

In [3]:
label = 'fare_amount'
schema = StructType([
    StructField('vendor_id', FloatType()),
    StructField('passenger_count', FloatType()),
    StructField('trip_distance', FloatType()),
    StructField('pickup_longitude', FloatType()),
    StructField('pickup_latitude', FloatType()),
    StructField('rate_code', FloatType()),
    StructField('store_and_fwd', FloatType()),
    StructField('dropoff_longitude', FloatType()),
    StructField('dropoff_latitude', FloatType()),
    StructField(label, FloatType()),
    StructField('hour', FloatType()),
    StructField('year', IntegerType()),
    StructField('month', IntegerType()),
    StructField('day', FloatType()),
    StructField('day_of_week', FloatType()),
    StructField('is_weekend', FloatType()),
])
features = [ x.name for x in schema if x.name != label ]

# You need to update them to your real paths!
dataRoot = os.getenv("DATA_ROOT", "/data")
train_path = dataRoot + "/taxi/csv/train"
eval_path = dataRoot + "/taxi/csv/test"

data_format = 'csv'
has_header = 'true'
if data_format == 'csv':
    train_data = reader.schema(schema).option('header',has_header).csv(train_path)
    trans_data = reader.schema(schema).option('header',has_header).csv(eval_path)
else:
    train_data = reader.load(train_path)
    trans_data = reader.load(eval_path)

Note on the CPU version: vectorization is required before fitting data to the regressor, which means you need to assemble all feature columns into a single vector column.

```Python
def vectorize(data_frame):
    to_floats = [ col(x.name).cast(FloatType()) for x in data_frame.schema ]
    return (VectorAssembler()
        .setInputCols(features)
        .setOutputCol('features')
        .transform(data_frame.select(to_floats))
        .select(col('features'), col(label)))

train_data = vectorize(train_data)
trans_data = vectorize(trans_data)
```
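To make the effect of vectorization concrete, here is a plain-Python sketch of what assembling does to a single row. It is only an illustration, not the Spark implementation: `assemble_row` is a hypothetical helper, and the row is a made-up dictionary that borrows column names from the schema above.

```python
def assemble_row(row, feature_names, label_name):
    """Collapse the named feature values into one list, keeping only the label beside it."""
    vector = [float(row[name]) for name in feature_names]
    return {"features": vector, label_name: float(row[label_name])}


# Hypothetical single row using column names from the schema above.
row = {"trip_distance": 1.5, "passenger_count": 1, "fare_amount": 7.0}
assembled = assemble_row(row, ["trip_distance", "passenger_count"], "fare_amount")
print(assembled)  # {'features': [1.5, 1.0], 'fare_amount': 7.0}
```

After assembling, the individual feature columns no longer exist as separate fields; only `features` and the label remain.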

#### Create an XGBoost Regressor

In [4]:
params = { 
    "tree_method": "gpu_hist",
    "grow_policy": "depthwise",
    "num_workers": 1,
    "use_gpu": "true",
}
params['features_col'] = features
params['label_col'] = label
    
regressor = SparkXGBRegressor(**params)

For the GPU version, set the `num_workers` parameter to the number of GPUs in the Spark cluster; for the CPU version, it is usually set to the number of CPU cores.

As for the tree method, the GPU version currently supports only `gpu_hist`, while `hist` is the method designed and used for CPU training.

An example of a CPU classifier:
```Python
from xgboost.spark import SparkXGBClassifier

classifier = SparkXGBClassifier(
  features_col=features,
  label_col=label,  
  num_workers=1024,
  use_gpu=False,
)
```
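The CPU and GPU settings described above differ only in a few parameter values, so they can be produced by one small function. The following is a minimal sketch under that assumption; `build_xgb_params` is a hypothetical helper, not part of xgboost, and the feature names in the usage lines are placeholders taken from the schema.

```python
def build_xgb_params(features_col, label_col, use_gpu, num_workers):
    """Build keyword arguments for SparkXGBRegressor for either a CPU or a GPU run."""
    return {
        "features_col": features_col,
        "label_col": label_col,
        "num_workers": num_workers,  # number of GPUs (GPU run) or CPU cores (CPU run)
        "use_gpu": use_gpu,
        # gpu_hist for GPU training, hist for CPU training, as described above
        "tree_method": "gpu_hist" if use_gpu else "hist",
    }


gpu_params = build_xgb_params(["trip_distance", "hour"], "fare_amount",
                              use_gpu=True, num_workers=1)
# For a CPU run on vectorized data, the features column is the assembled 'features' column.
cpu_params = build_xgb_params("features", "fare_amount",
                              use_gpu=False, num_workers=2)
print(gpu_params["tree_method"], cpu_params["tree_method"])  # gpu_hist hist
```

Either dictionary can then be unpacked into the estimator, for example `SparkXGBRegressor(**gpu_params)`, once a Spark session is available.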

#### Train the Model with Benchmark

In [5]:
def with_benchmark(phrase, action):
    start = time()
    result = action()
    end = time()
    print('{} takes {} seconds'.format(phrase, round(end - start, 2)))
    return result
model = with_benchmark('Training', lambda: regressor.fit(train_data))

If features_cols param set, then features_col param is ignored.
[Stage 2:>                                                          (0 + 1) / 1]

Training takes 24.08 seconds
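The timing helper above relies on `time()`, which follows the system clock. As an optional variation, the sketch below uses `time.perf_counter()`, a monotonic clock intended for measuring elapsed intervals. `with_benchmark_monotonic` is a hypothetical name chosen here, and the summation is only a stand-in workload so the example runs without Spark.

```python
from time import perf_counter


def with_benchmark_monotonic(phrase, action):
    """Run action(), print how long it took, and return its result."""
    start = perf_counter()
    result = action()
    elapsed = perf_counter() - start
    print('{} takes {} seconds'.format(phrase, round(elapsed, 2)))
    return result


# Stand-in workload; in the notebook the action would be regressor.fit(train_data).
total = with_benchmark_monotonic('Summation', lambda: sum(range(1000)))
```

The helper returns whatever the action returns, so it can replace `with_benchmark` without changing the calling code.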




#### Save and Reload the Model

In [6]:
model.write().overwrite().save(dataRoot + '/model/taxi')

If features_cols param set, then features_col param is ignored.


In [7]:
loaded_model = SparkXGBRegressorModel.load(dataRoot + '/model/taxi')

#### Transformation and Show Result Sample

In [8]:
def transform():
    result = loaded_model.transform(trans_data).cache()
    result.foreachPartition(lambda _: None)
    return result
result = with_benchmark('Transformation', transform)
result.select('vendor_id', 'passenger_count', 'trip_distance', label, 'prediction').show(5)

2022-11-30 07:52:27,357 WARN util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


Transformation takes 0.93 seconds


2022-11-30 07:52:28,189 WARN rapids.GpuOverrides: 
!Exec <CollectLimitExec> cannot run on GPU because the Exec CollectLimitExec has been disabled, and is disabled by default because Collect Limit replacement can be slower on the GPU, if huge number of rows in a batch it could help by limiting the number of rows transferred from GPU to CPU. Set spark.rapids.sql.exec.CollectLimitExec to true if you wish to enable it
  @Partitioning <SinglePartition$> could run on GPU



+--------------+---------------+-------------+-----------+-----------+
|     vendor_id|passenger_count|trip_distance|fare_amount| prediction|
+--------------+---------------+-------------+-----------+-----------+
|1.559730432E09|            2.0|  0.699999988|        5.0|5.046935558|
|1.559730432E09|            3.0|  10.69999981|       34.0|31.72706413|
|1.559730432E09|            1.0|  2.299999952|       10.0|9.294451714|
|1.559730432E09|            1.0|  4.400000095|       16.5|15.05233097|
|1.559730432E09|            1.0|          1.5|        7.0|8.995832443|
+--------------+---------------+-------------+-----------+-----------+
only showing top 5 rows



Note on the CPU version: after vectorization, the individual feature columns are no longer available to `select`, so use `result.show(5)` instead.

#### Evaluation

In [9]:
# RegressionEvaluator reports RMSE by default, so name the result accordingly
rmse = with_benchmark(
    'Evaluation',
    lambda: RegressionEvaluator().setLabelCol(label).evaluate(result))
print('RMSE is ' + str(rmse))

Evaluation takes 0.22 seconds
RMSE is 1.9141528471228921


2022-11-30 07:52:28,580 WARN rapids.GpuOverrides: 
! <DeserializeToObjectExec> cannot run on GPU because not all expressions can be replaced; GPU does not currently support the operator class org.apache.spark.sql.execution.DeserializeToObjectExec
  ! <CreateExternalRow> createexternalrow(prediction#87, fare_amount#728, 1.0#729, StructField(prediction,DoubleType,true), StructField(fare_amount,DoubleType,true), StructField(1.0,DoubleType,false)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
    @Expression <AttributeReference> prediction#87 could run on GPU
    @Expression <AttributeReference> fare_amount#728 could run on GPU
    @Expression <AttributeReference> 1.0#729 could run on GPU
  !Expression <AttributeReference> obj#733 cannot run on GPU because expression AttributeReference obj#733 produces an unsupported type ObjectType(interface org.apache.spark.sql.Row)
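For reference, the RMSE reported by the evaluator is the square root of the mean squared difference between label and prediction. The sketch below computes it in plain Python for the five rows displayed in the transformation output. `rmse_of` is a hypothetical helper, and because it covers only those five rows, its value is not expected to match the RMSE over the full evaluation set.

```python
from math import sqrt


def rmse_of(labels, predictions):
    """Root mean squared error between two equal-length, non-empty sequences."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# fare_amount and prediction values copied from the five displayed rows.
labels = [5.0, 34.0, 10.0, 16.5, 7.0]
predictions = [5.046935558, 31.72706413, 9.294451714, 15.05233097, 8.995832443]
sample_rmse = rmse_of(labels, predictions)
print('RMSE on the 5 displayed rows is ' + str(sample_rmse))
```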



#### Stop

In [10]:
spark.stop()