# Introduction to XGBoost Spark with GPU

The goal of this notebook is to show how to train an XGBoost model on GPUs with the Spark RAPIDS XGBoost library. The dataset used in this notebook is derived from Fannie Mae’s Single-Family Loan Performance Data, with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae. The notebook uses XGBoost to train a 12-month mortgage loan delinquency prediction model.

A few libraries are required for this notebook:
  1. cudf-cu11
  2. xgboost
  3. scikit-learn
  4. numpy

This notebook also illustrates how easily sample CPU-based Spark XGBoost code can be ported to GPU. No API change is required to run Spark XGBoost on GPU because the CPU and GPU versions call the same API. The one difference is that a CPU run must vectorize the training dataset before fitting it to the classifier.

#### Import All Libraries

In [1]:
import os

# Uncomment if you ship a packed Python environment archive and unpack it as ./environment
# os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"

In [2]:
from xgboost.spark import SparkXGBClassifier, SparkXGBClassifierModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, IntegerType, StructField, StructType, DoubleType
from pyspark.conf import SparkConf
from time import time

In addition, the CPU version requires two extra imports.
```Python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
```

#### Create Spark Session and Data Reader

In [3]:
SPARK_MASTER_URL = os.getenv("SPARK_MASTER_URL", "/your-url")
RAPIDS_JAR = os.getenv("RAPIDS_JAR", "/your-jar-path")

# Update these values to match your actual hardware resources
driverMem = os.getenv("DRIVER_MEM", "10g")
executorMem = os.getenv("EXECUTOR_MEM", "10g")
pinnedPoolSize = os.getenv("PINNED_POOL_SIZE", "2g")
concurrentGpuTasks = os.getenv("CONCURRENT_GPU_TASKS", "2")
executorCores = int(os.getenv("EXECUTOR_CORES", "4"))

# Common spark settings
conf = SparkConf()
conf.setMaster(SPARK_MASTER_URL)
conf.setAppName("Microbenchmark on GPU")
conf.set("spark.driver.memory", driverMem)
# Tasks use GPU memory, so a large host memory setting is unnecessary
conf.set("spark.executor.memory", executorMem)
# Tasks run on GPU cores, so many CPU cores are unnecessary
conf.set("spark.executor.cores", executorCores)

# Plugin settings
conf.set("spark.executor.resource.gpu.amount", "1")
conf.set("spark.rapids.sql.concurrentGpuTasks", concurrentGpuTasks)
conf.set("spark.rapids.memory.pinnedPool.size", pinnedPoolSize)
# Note: only a value of 1 is supported; see https://github.com/dmlc/xgboost/blame/master/python-package/xgboost/spark/core.py#L370-L374
conf.set("spark.task.resource.gpu.amount", 1)
# PySpark and XGBoost share the same GPU, so limit the RAPIDS allocation to leave memory for XGBoost and avoid GPU OOM during training
conf.set("spark.rapids.memory.gpu.allocFraction", "0.6")
conf.set("spark.rapids.sql.enabled", "true")
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.sql.cache.serializer","com.nvidia.spark.ParquetCachedBatchSerializer")
conf.set("spark.driver.extraClassPath", RAPIDS_JAR)
conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 200000) 
conf.set("spark.executor.extraClassPath", RAPIDS_JAR)
conf.set("spark.jars", RAPIDS_JAR)

# Uncomment if you ship a packed Python environment archive; "#environment" unpacks it as ./environment
# conf.set("spark.yarn.dist.archives", "your_pyspark_venv.tar.gz#environment")

# Create spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
reader = spark.read

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/11/24 06:14:05 WARN org.apache.spark.resource.ResourceUtils: The configuration of cores (exec = 4 task = 1, runnable tasks = 4) will result in wasted resources due to resource gpu limiting the number of runnable tasks per executor to: 1. Please adjust your configuration.
22/11/24 06:14:06 INFO org.apache.spark.SparkEnv: Registering MapOutputTracker
22/11/24 06:14:06 INFO org.apache.spark.SparkEnv: Registering BlockManagerMaster
22/11/24 06:14:06 INFO org.apache.spark.SparkEnv: Registering BlockManagerMasterHeartbeat
22/11/24 06:14:06 INFO org.apache.spark.SparkEnv: Registering OutputCommitCoordinator
22/11/24 06:14:07 WARN com.nvidia.spark.rapids.RapidsPluginUtils: RAPIDS Accelerator 23.04.0 using cudf 23.04.0.
22/11/24 06:14:07 WARN com.nvidia.spark.rapids.RapidsPluginUtils: spark.rapids.sql.multiThreadedRead.numThreads is set to 20.
22/11/24 06:14:07 WA
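
The `ResourceUtils` warning in the log above reflects how task slots are sized: an executor can run only as many concurrent tasks as its scarcest resource allows. The sketch below reproduces that arithmetic in plain Python. The function name `runnable_tasks_per_executor` is hypothetical, and the calculation is a simplified model of the scheduling rule rather than Spark's actual code.

```python
def runnable_tasks_per_executor(executor_cores, task_cpus, executor_gpus, task_gpu_amount):
    """Simplified model: concurrent tasks are capped by the scarcest resource."""
    slots_by_cpu = executor_cores // task_cpus
    # round() guards against floating-point error before truncating to an integer
    slots_by_gpu = int(round(executor_gpus / task_gpu_amount, 6))
    return min(slots_by_cpu, slots_by_gpu)


# Values from this notebook: 4 cores and 1 GPU per executor, 1 CPU and 1 GPU per task
slots = runnable_tasks_per_executor(executor_cores=4, task_cpus=1,
                                    executor_gpus=1, task_gpu_amount=1)
print("runnable tasks per executor:", slots)
```

With these settings the GPU is the limiting resource, so the remaining executor cores cannot be used by additional GPU tasks, which is what the warning reports.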

#### Specify the Data Schema and Load the Data

In [4]:
label = 'delinquency_12'
schema = StructType([
    StructField('orig_channel', FloatType()),
    StructField('first_home_buyer', FloatType()),
    StructField('loan_purpose', FloatType()),
    StructField('property_type', FloatType()),
    StructField('occupancy_status', FloatType()),
    StructField('property_state', FloatType()),
    StructField('product_type', FloatType()),
    StructField('relocation_mortgage_indicator', FloatType()),
    StructField('seller_name', FloatType()),
    StructField('mod_flag', FloatType()),
    StructField('orig_interest_rate', FloatType()),
    StructField('orig_upb', DoubleType()),
    StructField('orig_loan_term', IntegerType()),
    StructField('orig_ltv', FloatType()),
    StructField('orig_cltv', FloatType()),
    StructField('num_borrowers', FloatType()),
    StructField('dti', FloatType()),
    StructField('borrower_credit_score', FloatType()),
    StructField('num_units', IntegerType()),
    StructField('zip', IntegerType()),
    StructField('mortgage_insurance_percent', FloatType()),
    StructField('current_loan_delinquency_status', IntegerType()),
    StructField('current_actual_upb', FloatType()),
    StructField('interest_rate', FloatType()),
    StructField('loan_age', FloatType()),
    StructField('msa', FloatType()),
    StructField('non_interest_bearing_upb', FloatType()),
    StructField(label, IntegerType()),
])
features = [ x.name for x in schema if x.name != label ]

# Update these to your actual data paths
dataRoot = os.getenv("DATA_ROOT", "/data")
train_path = dataRoot + "/mortgage/output/train"
eval_path = dataRoot + "/mortgage/output/eval"

data_format = 'parquet'
has_header = 'true'
if data_format == 'csv':
    train_data = reader.schema(schema).option('header',has_header).csv(train_path)
    trans_data = reader.schema(schema).option('header',has_header).csv(eval_path)
else:
    train_data = reader.load(train_path)
    trans_data = reader.load(eval_path)
  

Note: in the CPU version, vectorization is required before fitting data to the classifier, which means you need to assemble all feature columns into a single column.

```Python
def vectorize(data_frame):
    to_floats = [ col(x.name).cast(FloatType()) for x in data_frame.schema ]
    return (VectorAssembler()
        .setInputCols(features)
        .setOutputCol('features')
        .transform(data_frame.select(to_floats))
        .select(col('features'), col(label)))

train_data = vectorize(train_data)
trans_data = vectorize(trans_data)
```
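
To make the idea of "assembling" concrete without a Spark session, here is a minimal plain-Python sketch of the per-row effect: the named feature values are cast to floats and packed into one list, and the label is carried along. The helper `assemble_row` and the sample record are hypothetical; only the column names come from the schema in this notebook.

```python
def assemble_row(row, feature_names, label_name):
    """Pack the named feature values into a single list of floats, keeping the label."""
    vector = [float(row[name]) for name in feature_names]
    return {"features": vector, label_name: row[label_name]}


# Hypothetical record with two feature columns and the label column
record = {"orig_interest_rate": 4.25, "orig_loan_term": 360, "delinquency_12": 0}
assembled = assemble_row(record, ["orig_interest_rate", "orig_loan_term"], "delinquency_12")
print(assembled)
```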

#### Create an XGBoost Classifier

In [5]:
params = {
    "tree_method": "gpu_hist",
    "grow_policy": "depthwise",
    "num_workers": 1,
    "use_gpu": True,  # boolean flag
}
params['features_col'] = features
params['label_col'] = label
    
classifier = SparkXGBClassifier(**params)

The `num_workers` parameter should be set to the number of GPUs in the Spark cluster for the GPU version, while for the CPU version it is usually equal to the number of CPU cores.

As for the tree method, the GPU version currently supports only `gpu_hist`, while `hist` is the method designed for CPU training.

An example of a CPU classifier:
```Python
classifier = SparkXGBClassifier(
  features_col='features',  # vector column produced by vectorize()
  label_col=label,
  num_workers=1024,
  use_gpu=False,
)
```
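
The CPU and GPU settings discussed above differ only in a few parameters, so they can be gathered in one place. The following is a minimal sketch, not part of the original notebook: the helper `build_classifier_params` is hypothetical, and it assumes the CPU path reads the single `'features'` vector column produced by `vectorize`, while the GPU path takes the list of raw feature column names.

```python
def build_classifier_params(use_gpu, label_col, feature_cols, num_workers):
    """Return keyword arguments for the classifier, switching the CPU/GPU-specific ones."""
    return {
        "label_col": label_col,
        "num_workers": num_workers,
        "use_gpu": use_gpu,
        # gpu_hist for GPU training, hist for CPU training
        "tree_method": "gpu_hist" if use_gpu else "hist",
        # Assumption: the CPU path uses the assembled 'features' vector column
        "features_col": feature_cols if use_gpu else "features",
    }


gpu_params = build_classifier_params(True, "delinquency_12", ["dti", "loan_age"], num_workers=1)
cpu_params = build_classifier_params(False, "delinquency_12", ["dti", "loan_age"], num_workers=4)
print(gpu_params["tree_method"], cpu_params["tree_method"])
```

The resulting dictionary could then be unpacked into the constructor, for example `SparkXGBClassifier(**gpu_params)`.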

#### Train the Data with Benchmark

In [6]:
def with_benchmark(phrase, action):
    start = time()
    result = action()
    end = time()
    print('{} takes {} seconds'.format(phrase, round(end - start, 2)))
    return result
model = with_benchmark('Training', lambda: classifier.fit(train_data))

If features_cols param set, then features_col param is ignored.
22/11/24 06:14:44 WARN org.apache.spark.sql.catalyst.util.package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
[Stage 12:>                                                         (0 + 1) / 1]

  If you are loading a serialized model (like pickle in Python, RDS in R) generated by
  older XGBoost, please export the model by calling `Booster.save_model` from that version
  first, then load it back in current version. See:

    https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html

  for more details about differences between saving model and serializing.

Training takes 28.6 seconds
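
As a variation on the timing helper, the sketch below wraps the same measurement in a context manager and uses `time.perf_counter`, which is a monotonic clock intended for measuring intervals. The names `benchmark` and `timings` are hypothetical, and the summation merely stands in for a real training call.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def benchmark(phrase, results=None):
    """Time the enclosed block, print the duration, and optionally record it."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[phrase] = elapsed
        print('{} takes {} seconds'.format(phrase, round(elapsed, 2)))


timings = {}
with benchmark('Summation', timings):
    total = sum(range(1000))  # stand-in for a long-running action
```

Recording the durations in a dictionary makes it easy to compare the training, transformation, and evaluation phases after the run.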




#### Save and Reload the Model

In [7]:
model.write().overwrite().save(dataRoot + '/model/mortgage')

If features_cols param set, then features_col param is ignored.
                                                                                

In [8]:
loaded_model = SparkXGBClassifierModel.load(dataRoot + '/model/mortgage')

#### Transformation and Show Result Sample

In [9]:
def transform():
    result = loaded_model.transform(trans_data).cache()
    result.foreachPartition(lambda _: None)
    return result
result = with_benchmark('Transformation', transform)
result.select(label, 'rawPrediction', 'probability', 'prediction').show(5)

22/11/24 06:15:13 WARN com.nvidia.spark.rapids.GpuOverrides: 
!Exec <ProjectExec> cannot run on GPU because not all expressions can be replaced; unsupported data types in output: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 [rawPrediction#209, probability#275]
  @Expression <AttributeReference> orig_channel#56 could run on GPU
  @Expression <AttributeReference> first_home_buyer#57 could run on GPU
  @Expression <AttributeReference> loan_purpose#58 could run on GPU
  @Expression <AttributeReference> property_type#59 could run on GPU
  @Expression <AttributeReference> occupancy_status#60 could run on GPU
  @Expression <AttributeReference> property_state#61 could run on GPU
  @Expression <AttributeReference> product_type#62 could run on GPU
  @Expression <AttributeReference> relocation_mortgage_indicator#63 could run on GPU
  @Expression <AttributeReference> seller_name#64 could run on GPU
  @Expression <AttributeReference> mod_flag#65 could run on GPU
  @Expression <AttributeReference> 

Transformation takes 15.62 seconds
+--------------+--------------------+--------------------+----------+
|delinquency_12|       rawPrediction|         probability|prediction|
+--------------+--------------------+--------------------+----------+
|             0|[8.84631538391113...|[0.99985611438751...|       0.0|
|             0|[9.41864871978759...|[0.99991881847381...|       0.0|
|             0|[9.41864871978759...|[0.99991881847381...|       0.0|
|             0|[9.41864871978759...|[0.99991881847381...|       0.0|
|             0|[8.84631538391113...|[0.99985611438751...|       0.0|
+--------------+--------------------+--------------------+----------+
only showing top 5 rows
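
In the sample rows above, the first entry of `probability` is consistent with applying the logistic sigmoid to the first entry of `rawPrediction`. The sketch below checks this against the printed values. It is an observation about the displayed output, not a statement about the library's internals, and the comparison uses a loose tolerance because the values are truncated in the table.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# (rawPrediction[0], probability[0]) leading digits copied from the table above
rows = [
    (8.84631538391113, 0.99985611438751),
    (9.41864871978759, 0.99991881847381),
]
for raw, prob in rows:
    print(raw, sigmoid(raw), abs(sigmoid(raw) - prob) < 1e-6)
```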



#### Evaluation

In [10]:
def check_classification_accuracy(data_frame, label):
    # The evaluator's default metric is F1, so request accuracy explicitly
    accuracy = (MulticlassClassificationEvaluator()
                .setLabelCol(label)
                .setMetricName('accuracy')
                .evaluate(data_frame))
    print('-' * 100)
    print('Accuracy is ' + str(accuracy))

In [11]:
with_benchmark('Evaluation', lambda: check_classification_accuracy(result, label))

22/11/24 06:15:28 WARN com.nvidia.spark.rapids.GpuOverrides: 
! <DeserializeToObjectExec> cannot run on GPU because not all expressions can be replaced; GPU does not currently support the operator class org.apache.spark.sql.execution.DeserializeToObjectExec
  ! <CreateExternalRow> createexternalrow(prediction#243, delinquency_12#1450, 1.0#1449, newInstance(class org.apache.spark.ml.linalg.VectorUDT).deserialize, StructField(prediction,DoubleType,true), StructField(delinquency_12,DoubleType,true), StructField(1.0,DoubleType,false), StructField(probability,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)) cannot run on GPU because GPU does not currently support the operator class org.apache.spark.sql.catalyst.expressions.objects.CreateExternalRow
    @Expression <AttributeReference> prediction#243 could run on GPU
    @Expression <AttributeReference> delinquency_12#1450 could run on GPU
    @Expression <AttributeReference> 1.0#1449 could run on GPU
    ! <Invoke> newInstance(class org

----------------------------------------------------------------------------------------------------
Accuracy is 1.0
Evaluation takes 2.29 seconds
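
For reference, accuracy is the fraction of rows whose prediction equals the label. The plain-Python sketch below computes it for a handful of (label, prediction) pairs. The helper `accuracy` is hypothetical, and the sample pairs are the five rows shown in the transformation output, not the full evaluation set.

```python
def accuracy(pairs):
    """Fraction of (label, prediction) pairs in which the two values agree."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (label, prediction) pair")
    correct = sum(1 for label_value, predicted in pairs
                  if float(label_value) == float(predicted))
    return correct / len(pairs)


# The five sample rows displayed earlier: label 0, prediction 0.0
sample = [(0, 0.0)] * 5
print('Accuracy is ' + str(accuracy(sample)))
```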


                                                                                

In [12]:
spark.stop()