# Introduction to XGBoost Spark with GPU

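Agaricus is an example of an XGBoost classifier for binary classification, trained on the Agaricus (mushroom) dataset. This notebook shows how to load the data, train the XGBoost model, save and reload it, run predictions, and evaluate the result.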

A few libraries are required for this notebook:
  1. NumPy
  2. cudf jar
  3. xgboost4j jar
  4. xgboost4j-spark jar
  5. rapids-4-spark.jar
  
This notebook also illustrates how easily sample CPU-based Spark xgboost4j code can be ported to GPU. Only one API change is required to run Spark XGBoost on GPU: replace the CPU API `setFeaturesCol(feature)` with the new API `setFeaturesCols(features)`. This also removes the need for vectorization (assembling multiple feature columns into one column), since the GPU version can read multiple columns directly.

Note: For PySpark based XGBoost, please refer to the [Spark-RAPIDS-examples 22.04 branch](https://github.com/NVIDIA/spark-rapids-examples/tree/branch-22.04) that
uses [NVIDIA’s Spark XGBoost version](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.4.2-0.3.0/).

#### Import All Libraries

In [1]:
from ml.dmlc.xgboost4j.scala.spark import XGBoostClassificationModel, XGBoostClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType, StructField, StructType
from time import time
import os

In addition, the CPU version requires two extra imports:
```Python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col
```

#### Create Spark Session and Data Reader

In [2]:
spark = SparkSession.builder.getOrCreate()
reader = spark.read

#### Specify the Data Schema and Load the Data

In [3]:
label = 'label'
features = [ 'feature_' + str(i) for i in range(0, 126) ]
schema = StructType([ StructField(x, FloatType()) for x in [label] + features ])

# You need to update them to your real paths!
dataRoot = os.getenv("DATA_ROOT", "/data")
train_data = reader.schema(schema).option('header', True).csv(dataRoot + '/agaricus/csv/train')
trans_data = reader.schema(schema).option('header', True).csv(dataRoot + '/agaricus/csv/test')
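
As a quick sanity check on the schema above, the column list can be built and inspected in plain Python before it is handed to Spark. This is a minimal sketch that does not touch Spark; `expected_columns` is a hypothetical helper name, not part of any library.

```python
def expected_columns(label, n_features):
    """Return the label name followed by feature_0 .. feature_{n_features-1}."""
    return [label] + ['feature_' + str(i) for i in range(n_features)]

columns = expected_columns('label', 126)
print(len(columns))   # 127: one label column plus 126 feature columns
print(columns[:3])    # ['label', 'feature_0', 'feature_1']
print(columns[-1])    # feature_125
```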

Note: on the CPU version, vectorization is required before fitting the data to the classifier, which means all feature columns must be assembled into one column.

```Python
def vectorize(data_frame):
    to_floats = [ col(x.name).cast(FloatType()) for x in data_frame.schema ]
    return (VectorAssembler()
        .setInputCols(features)
        .setOutputCol('features')
        .transform(data_frame.select(to_floats))
        .select(col('features'), col(label)))

train_data = vectorize(train_data)
trans_data = vectorize(trans_data)
```
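
To make the idea of vectorization concrete, the sketch below mimics, for a single row in plain Python, what assembling feature columns into one column means conceptually. It is only an illustration: `assemble` is a hypothetical helper, the row values are made up, and the real `VectorAssembler` produces a Spark ML vector rather than a Python list.

```python
def assemble(row, feature_names, label_name):
    """Pack the named feature values of one row into a single list, keeping the label."""
    return {
        'features': [float(row[name]) for name in feature_names],
        label_name: float(row[label_name]),
    }

# A made-up row with three feature columns instead of 126.
row = {'label': 1, 'feature_0': 0, 'feature_1': 1, 'feature_2': 0}
assembled = assemble(row, ['feature_0', 'feature_1', 'feature_2'], 'label')
print(assembled)  # {'features': [0.0, 1.0, 0.0], 'label': 1.0}
```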

#### Create an XGBoostClassifier

In [4]:
params = { 
    'eta': 0.1,
    'missing': 0.0,
    'treeMethod': 'gpu_hist',
    'maxDepth': 2,
    'numWorkers': 1,
    'numRound' : 100,
}
classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCols(features)

The CPU version of the classifier provides the API `setFeaturesCol`, which accepts only a single column name, so multiple feature columns must be vectorized first.
```Python
classifier = XGBoostClassifier(**params).setLabelCol(label).setFeaturesCol('features')
```

The parameter `numWorkers` should be set to the number of GPUs in the Spark cluster for the GPU version, while for the CPU version it is usually equal to the number of CPU cores.
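
One way to derive the worker count for the GPU case is to multiply the number of executors by the GPUs assigned to each. The sketch below works on a plain dictionary of Spark-style settings; `gpu_num_workers` is a hypothetical helper, and it assumes a fixed number of executors with a whole number of GPUs each, which may not match every cluster setup.

```python
def gpu_num_workers(conf):
    """Estimate the total number of GPUs from Spark-style settings.

    Assumes a fixed number of executors, each with a whole number of GPUs.
    Missing settings fall back to 1.
    """
    executors = int(conf.get('spark.executor.instances', 1))
    gpus_per_executor = int(conf.get('spark.executor.resource.gpu.amount', 1))
    return executors * gpus_per_executor

# Example settings (assumed values, not taken from this notebook's cluster).
conf = {'spark.executor.instances': '2', 'spark.executor.resource.gpu.amount': '1'}
print(gpu_num_workers(conf))  # 2
```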

As for the tree method, the GPU version currently supports only `gpu_hist`, while `hist` is the corresponding method for CPU training.
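
The parameter differences between the two versions can be captured in one place. The following is a minimal sketch, with `build_params` as a hypothetical helper that reuses the values from the cell above and switches only the tree method; the CPU worker count of 8 is an arbitrary example.

```python
def build_params(use_gpu, num_workers):
    """Return the classifier parameters, switching only the tree method."""
    return {
        'eta': 0.1,
        'missing': 0.0,
        'treeMethod': 'gpu_hist' if use_gpu else 'hist',
        'maxDepth': 2,
        'numWorkers': num_workers,
        'numRound': 100,
    }

gpu_params = build_params(use_gpu=True, num_workers=1)
cpu_params = build_params(use_gpu=False, num_workers=8)
print(gpu_params['treeMethod'], cpu_params['treeMethod'])  # gpu_hist hist
```

The resulting dictionary can be passed as `XGBoostClassifier(**gpu_params)` in the same way as `params` above.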

#### Train the Model with Benchmark

In [5]:
def with_benchmark(phrase, action):
    start = time()
    result = action()
    end = time()
    print('{} takes {} seconds'.format(phrase, round(end - start, 2)))
    return result
model = with_benchmark('Training', lambda: classifier.fit(train_data))

Training takes 27.95 seconds
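
The same timing idea can also be written as a context manager, which avoids wrapping the action in a lambda. This is an alternative sketch, not the notebook's own helper; `benchmark` is a hypothetical name and the summation is only a stand-in workload.

```python
from contextlib import contextmanager
from time import perf_counter

@contextmanager
def benchmark(phrase):
    """Print how long the enclosed block took, even if it raises."""
    start = perf_counter()
    try:
        yield
    finally:
        print('{} takes {} seconds'.format(phrase, round(perf_counter() - start, 2)))

# Stand-in workload; in this notebook the body would be classifier.fit(train_data).
with benchmark('Summation'):
    total = sum(range(1000000))
print(total)  # 499999500000
```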


#### Save and Reload the Model

In [6]:
model.write().overwrite().save(dataRoot + '/new-model-path')
loaded_model = XGBoostClassificationModel().load(dataRoot + '/new-model-path')

#### Transformation and Show Result Sample

In [7]:
def transform():
    result = loaded_model.transform(trans_data).cache()
    result.foreachPartition(lambda _: None)
    return result
result = with_benchmark('Transformation', transform)
result.select(label, 'rawPrediction', 'probability', 'prediction').show(5)

Transformation takes 2.63 seconds
+-----+--------------------+--------------------+----------+
|label|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+----------+
|  1.0|[-0.9667757749557...|[0.03322422504425...|       1.0|
|  0.0|[-0.0080436170101...|[0.99195638298988...|       0.0|
|  0.0|[-0.0080436170101...|[0.99195638298988...|       0.0|
|  0.0|[-0.1416745483875...|[0.85832545161247...|       0.0|
|  0.0|[-0.0747678577899...|[0.92523214221000...|       0.0|
+-----+--------------------+--------------------+----------+
only showing top 5 rows
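
The `probability` column holds one entry per class, and, consistent with the rows shown above, `prediction` is the index of the largest entry. A minimal sketch of that relationship follows; `predict_from_probability` is a hypothetical helper, and the probabilities are rounded illustrative values rather than the notebook's exact output.

```python
def predict_from_probability(probability):
    """Return the index of the most probable class as a float, like the prediction column."""
    best = max(range(len(probability)), key=lambda i: probability[i])
    return float(best)

print(predict_from_probability([0.03, 0.97]))  # 1.0
print(predict_from_probability([0.99, 0.01]))  # 0.0
```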



#### Evaluation

In [8]:
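# Note: MulticlassClassificationEvaluator uses the F1 score as its default metric;
# chain .setMetricName('accuracy') to report plain accuracy instead.
accuracy = with_benchmark(
    'Evaluation',
    lambda: MulticlassClassificationEvaluator().setLabelCol(label).evaluate(result))
print('Accuracy is ' + str(accuracy))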

Evaluation takes 0.29 seconds
Accuracy is 0.9987577063864658
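
For reference, plain accuracy is the fraction of rows whose prediction equals the label. The sketch below computes it by hand on a few made-up `(label, prediction)` pairs; `accuracy_of` is a hypothetical helper and the sample data is not taken from the Agaricus results.

```python
def accuracy_of(pairs):
    """Fraction of (label, prediction) pairs where the two values match."""
    pairs = list(pairs)
    correct = sum(1 for label, prediction in pairs if label == prediction)
    return correct / len(pairs)

# Four made-up rows: three correct predictions and one wrong.
sample = [(1.0, 1.0), (0.0, 0.0), (0.0, 0.0), (0.0, 1.0)]
print(accuracy_of(sample))  # 0.75
```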


#### Stop

In [9]:
spark.stop()