.. doc/jvm/xgboost4j_spark_gpu_tutorial.rst
XGBoost4J-Spark-GPU Tutorial (version 1.6.0+)
#############################################

**XGBoost4J-Spark-GPU** is an open source library aiming to accelerate distributed XGBoost training on an Apache Spark cluster from
end to end with GPUs by leveraging the `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_ product.
This tutorial will show you how to use **XGBoost4J-Spark-GPU**.
Build an ML Application with XGBoost4J-Spark-GPU
************************************************
Add XGBoost to Your Project
===========================
Before we go into the details of how to use XGBoost4J-Spark-GPU, you should first consult
:ref:`Installation from Maven repository <install_jvm_packages>` in order to add XGBoost4J-Spark-GPU as
a dependency for your project. We provide both stable releases and snapshots.

Data Preparation
================

In this section, we use the `Iris <https://archive.ics.uci.edu/ml/datasets/iris>`_ dataset as an example to
showcase how we use Apache Spark to transform a raw dataset and make it fit the data interface of XGBoost.

The Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width",
"petal length" and "petal width". In addition, it contains the "class" column, which is essentially the
label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica".

Read Dataset with Spark's Built-In Reader
-----------------------------------------

.. code-block:: scala

   // ...
     .schema(schema)
     .csv(dataPath)

In the first line, we create an instance of a `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_,
which is the entry point of any Spark application working with DataFrames. The ``schema`` variable
defines the schema of the DataFrame wrapping the Iris data. With this explicitly set schema, we
can define the column names as well as their types; otherwise the column names would be
the default ones derived by Spark, such as ``_col0``, etc. Finally, we can use Spark's
built-in CSV reader to load the Iris CSV file as a DataFrame named ``xgbInput``.

Apache Spark also contains many built-in readers for other formats, such as ORC, Parquet, Avro, and JSON.
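
For illustration, a minimal sketch of these readers is shown below; the file paths are invented for this example, and ``schema`` is assumed to be the ``StructType`` defined for the Iris columns in the snippet above:

.. code-block:: scala

   import org.apache.spark.sql.SparkSession

   // Sketch only: paths are illustrative, and `schema` is assumed to be
   // the StructType built for the Iris columns earlier in this section.
   val spark = SparkSession.builder().getOrCreate()

   val fromCsv     = spark.read.schema(schema).csv("/data/iris.csv")
   val fromParquet = spark.read.parquet("/data/iris.parquet")
   val fromOrc     = spark.read.orc("/data/iris.orc")
   val fromJson    = spark.read.json("/data/iris.json")

Self-describing formats such as Parquet and ORC carry their own schema, so the explicit ``schema(...)`` call is only needed for the CSV case.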
Transform Raw Iris Dataset
--------------------------

To make the Iris dataset recognizable to XGBoost, we need to encode the String-typed
label, i.e. "class", to a Double-typed label.
One way to convert the String-typed label to Double is to use Spark's built-in feature transformer
With window operations, we have mapped the string column of labels to label indices.
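
As an illustrative reconstruction (not the tutorial's verbatim code), the window-based encoding might look as follows, assuming the ``xgbInput`` DataFrame from the reading step and a ``labelName`` variable holding ``"class"``:

.. code-block:: scala

   import org.apache.spark.sql.expressions.Window
   import org.apache.spark.sql.functions.{col, dense_rank, lit}
   import org.apache.spark.sql.types.DoubleType

   // Rank the distinct string labels, then shift to a 0-based Double index.
   val spec = Window.orderBy(labelName)
   val labelTransformed = xgbInput
     .withColumn("rank", dense_rank().over(spec))
     .withColumn(labelName, (col("rank") - lit(1)).cast(DoubleType))
     .drop("rank")

   // Split into training and test sets for the steps that follow.
   val Array(trainDf, testDf) = labelTransformed.randomSplit(Array(0.7, 0.3), seed = 1)

Unlike ``StringIndexer``, these plain column expressions stay within operations the RAPIDS Accelerator can run on the GPU.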
Training
========

To train an XGBoost model for classification, we first need to declare an XGBoostClassifier.
The available parameters for training an XGBoost model can be found :doc:`here </parameter>`.
Similar to the XGBoost4J-Spark package, in addition to the default set of parameters,
XGBoost4J-Spark-GPU also supports the camel-case variant of these parameters to be
consistent with Spark's MLlib naming convention.
Specifically, each parameter in :doc:`this page </parameter>` has its equivalent form in
XGBoost4J-Spark-GPU with camel case. For example, to set ``max_depth`` for each tree, you can pass the ``maxDepth`` parameter.
Alternatively, you can do it through setters in XGBoostClassifier.
.. note::

   In contrast with XGBoost4J-Spark, which accepts both a feature column of VectorUDT type and
   an array of feature column names, XGBoost4J-Spark-GPU only accepts an array of feature
   column names, set via ``setFeaturesCol(value: Array[String])``.

After setting XGBoostClassifier parameters and feature/label columns, we can build a
transformer, XGBoostClassificationModel, by fitting XGBoostClassifier with the input
DataFrame. This ``fit`` operation is essentially the training process and the generated
model can then be used in other tasks like prediction.
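
As a sketch of this training flow, assuming a training DataFrame ``trainDf`` and a label column name ``labelName`` from the preparation steps (both names, and all parameter values below, are this sketch's assumptions, not recommendations):

.. code-block:: scala

   import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

   // Parameters use the camelCase MLlib-style names; values are illustrative.
   val xgbParams: Map[String, Any] = Map(
     "objective"  -> "multi:softprob",
     "numClass"   -> 3,
     "treeMethod" -> "gpu_hist"   // train on GPUs
   )

   val classifier = new XGBoostClassifier(xgbParams)
     .setLabelCol(labelName)
     // the GPU variant takes an array of feature column names (see the note above)
     .setFeaturesCol(Array("sepal length", "sepal width", "petal length", "petal width"))
     .setMaxDepth(6)      // setter equivalent of max_depth
     .setNumRound(100)

   // Fitting is the training step; it yields an XGBoostClassificationModel.
   val model = classifier.fit(trainDf)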
Prediction
==========

When we get a model, either an XGBoostClassificationModel or an XGBoostRegressionModel, it takes a DataFrame as input,
reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame
with the following columns by default:
* XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities (``probabilityCol``) and the eventual prediction labels (``predictionCol``) for each possible label.
* XGBoostRegressionModel will output the prediction label (``predictionCol``).
.. code-block:: scala

   // ...
   results.show()

With the above code snippet, we get a result DataFrame containing the margin, the probability for each class,
and the prediction for each instance.
.. code-block:: none

   ...

Submit the application
**********************

Here is an example of submitting an end-to-end XGBoost4J-Spark-GPU application to an
Apache Spark Standalone cluster, assuming the application main class is ``Iris`` and the
application jar is ``iris-1.0.0.jar``:
.. code-block:: bash

   # ...
     --class ${main_class} \
     ${app_jar}


* First, we need to specify the RAPIDS Accelerator, ``cudf``, ``xgboost4j-gpu``, and ``xgboost4j-spark-gpu`` packages via ``--packages``.
* Second, the RAPIDS Accelerator is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``.
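
Putting these two points together, a minimal ``spark-submit`` sketch might look like the following; the master URL and the package coordinates and versions are illustrative assumptions and must be aligned with your own Spark, CUDA, RAPIDS Accelerator, and XGBoost releases:

.. code-block:: bash

   # All coordinates and versions below are illustrative; match them to
   # your cluster's Spark, CUDA, RAPIDS Accelerator, and XGBoost versions.
   spark-submit \
     --master spark://<master-host>:7077 \
     --packages com.nvidia:rapids-4-spark_2.12:22.02.0,ai.rapids:cudf:22.02.0,ml.dmlc:xgboost4j-gpu_2.12:1.6.0,ml.dmlc:xgboost4j-spark-gpu_2.12:1.6.0 \
     --conf spark.plugins=com.nvidia.spark.SQLPlugin \
     --conf spark.executor.resource.gpu.amount=1 \
     --conf spark.task.resource.gpu.amount=1 \
     --class Iris \
     iris-1.0.0.jar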
For details about other RAPIDS Accelerator configurations, please refer to the `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_ documentation.
For the RAPIDS Accelerator Frequently Asked Questions, please refer to the