
Commit 4b00c64

wbo4958 and sameerz authored
[doc] improve xgboost4j-spark-gpu doc [skip ci] (dmlc#7793)
Co-authored-by: Sameer Raheja <sameerz@users.noreply.github.com>
1 parent 118192f commit 4b00c64

File tree

2 files changed: +39, -34 lines changed


doc/jvm/xgboost4j_spark_gpu_tutorial.rst

Lines changed: 34 additions & 34 deletions
@@ -2,8 +2,8 @@
 XGBoost4J-Spark-GPU Tutorial (version 1.6.0+)
 #############################################

-**XGBoost4J-Spark-GPU** is a project aiming to accelerate XGBoost distributed training on Spark from
-end to end with GPUs by leveraging the `Spark-Rapids <https://nvidia.github.io/spark-rapids/>`_ project.
+**XGBoost4J-Spark-GPU** is an open source library aiming to accelerate distributed XGBoost training on an Apache Spark cluster from
+end to end with GPUs by leveraging the `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_ product.

 This tutorial will show you how to use **XGBoost4J-Spark-GPU**.

@@ -15,8 +15,8 @@ This tutorial will show you how to use **XGBoost4J-Spark-GPU**.
 Build an ML Application with XGBoost4J-Spark-GPU
 ************************************************

-Adding XGBoost to Your Project
-==============================
+Add XGBoost to Your Project
+===========================

 Before we go into the tour of how to use XGBoost4J-Spark-GPU, you should first consult
 :ref:`Installation from Maven repository <install_jvm_packages>` in order to add XGBoost4J-Spark-GPU as
@@ -25,10 +25,10 @@ a dependency for your project. We provide both stable releases and snapshots.
 Data Preparation
 ================

-In this section, we use `Iris <https://archive.ics.uci.edu/ml/datasets/iris>`_ dataset as an example to
-showcase how we use Spark to transform raw dataset and make it fit to the data interface of XGBoost.
+In this section, we use the `Iris <https://archive.ics.uci.edu/ml/datasets/iris>`_ dataset as an example to
+showcase how we use Apache Spark to transform a raw dataset and make it fit the data interface of XGBoost.

-Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width",
+The Iris dataset is shipped in CSV format. Each instance contains 4 features, "sepal length", "sepal width",
 "petal length" and "petal width". In addition, it contains the "class" column, which is essentially the
 label with three possible values: "Iris Setosa", "Iris Versicolour" and "Iris Virginica".

@@ -54,26 +54,26 @@ Read Dataset with Spark's Built-In Reader
     .schema(schema)
     .csv(dataPath)

-At the first line, we create an instance of `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_
-which is the entry of any Spark program working with DataFrame. The ``schema`` variable
-defines the schema of DataFrame wrapping Iris data. With this explicitly set schema, we
-can define the columns' name as well as their types; otherwise the column name would be
+In the first line, we create an instance of a `SparkSession <https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession>`_
+which is the entry point of any Spark application working with DataFrames. The ``schema`` variable
+defines the schema of the DataFrame wrapping the Iris data. With this explicitly set schema, we
+can define the column names as well as their types; otherwise the column names would be
 the default ones derived by Spark, such as ``_col0``, etc. Finally, we can use Spark's
-built-in csv reader to load Iris csv file as a DataFrame named ``xgbInput``.
+built-in CSV reader to load the Iris CSV file as a DataFrame named ``xgbInput``.

-Spark also contains many built-in readers for other format. eg ORC, Parquet, Avro, Json.
+Apache Spark also contains many built-in readers for other formats such as ORC, Parquet, Avro, and JSON.

6667
Transform Raw Iris Dataset
6768
--------------------------
6869

69-
To make Iris dataset be recognizable to XGBoost, we need to encode String-typed
70-
label, i.e. "class", to Double-typed label.
70+
To make the Iris dataset recognizable to XGBoost, we need to encode the String-typed
71+
label, i.e. "class", to the Double-typed label.
7172

7273
One way to convert the String-typed label to Double is to use Spark's built-in feature transformer
7374
`StringIndexer <https://spark.apache.org/docs/2.3.1/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer>`_.
74-
but it has not been accelerated by Spark-Rapids yet, which means it will fall back
75-
to CPU to run and cause performance issue. Instead, we use an alternative way to acheive
76-
the same goal by the following code
75+
But this feature is not accelerated in RAPIDS Accelerator, which means it will fall back
76+
to CPU. Instead, we use an alternative way to achieve the same goal with the following code:
7777

7878
.. code-block:: scala
7979
@@ -102,7 +102,7 @@ the same goal by the following code
 +------------+-----------+------------+-----------+-----+

-With window operations, we have mapped string column of labels to label indices.
+With window operations, we have mapped the string column of labels to label indices.

 Training
 ========
@@ -133,7 +133,7 @@ To train a XGBoost model for classification, we need to claim a XGBoostClassifie
 The available parameters for training a XGBoost model can be found in :doc:`here </parameter>`.
 Similar to the XGBoost4J-Spark package, in addition to the default set of parameters,
 XGBoost4J-Spark-GPU also supports the camel-case variant of these parameters to be
-consistent with Spark's MLLIB naming convention.
+consistent with Spark's MLlib naming convention.

 Specifically, each parameter in :doc:`this page </parameter>` has its equivalent form in
 XGBoost4J-Spark-GPU with camel case. For example, to set ``max_depth`` for each tree, you can pass
@@ -149,12 +149,11 @@ you can do it through setters in XGBoostClassifer:
 .. note::

-  In contrast to the XGBoost4J-Spark package, which needs to first assemble the numeric
-  feature columns into one column with VectorUDF type by VectorAssembler, the
-  XGBoost4J-Spark-GPU does not require such transformation, it accepts an array of feature
+  In contrast with XGBoost4J-Spark, which accepts both a feature column of VectorUDT type and
+  an array of feature column names, XGBoost4J-Spark-GPU only accepts an array of feature
   column names by ``setFeaturesCol(value: Array[String])``.

-After we set XGBoostClassifier parameters and feature/label columns, we can build a
+After setting XGBoostClassifier parameters and feature/label columns, we can build a
 transformer, XGBoostClassificationModel by fitting XGBoostClassifier with the input
 DataFrame. This ``fit`` operation is essentially the training process and the generated
 model can then be used in other tasks like prediction.
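
Concretely, the training flow described above can be sketched as follows; the parameter values are illustrative, and ``labeled`` stands for the DataFrame prepared earlier:

.. code-block:: scala

   import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

   val features = Array("sepal length", "sepal width", "petal length", "petal width")

   val classifier = new XGBoostClassifier()
     .setFeaturesCol(features)   // an array of column names; no VectorAssembler step
     .setLabelCol("classIndex")
     .setNumClass(3)
     .setMaxDepth(6)             // camel-case variant of max_depth
     .setNumRound(100)
     .setTreeMethod("gpu_hist")  // GPU-accelerated tree construction

   // The fit call is the training process; it yields an XGBoostClassificationModel.
   val model = classifier.fit(labeled)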
@@ -166,12 +165,12 @@ model can then be used in other tasks like prediction.
 Prediction
 ==========

-When we get a model, either XGBoostClassificationModel or XGBoostRegressionModel, it takes a DataFrame,
-read the column containing feature vectors, predict for each feature vector, and output a new DataFrame
+When we get a model, either an XGBoostClassificationModel or an XGBoostRegressionModel, it takes a DataFrame as input,
+reads the column containing feature vectors, predicts for each feature vector, and outputs a new DataFrame
 with the following columns by default:

 * XGBoostClassificationModel will output margins (``rawPredictionCol``), probabilities(``probabilityCol``) and the eventual prediction labels (``predictionCol``) for each possible label.
-* XGBoostRegressionModel will output prediction label(``predictionCol``).
+* XGBoostRegressionModel will output a prediction label (``predictionCol``).

 .. code-block:: scala
@@ -180,7 +179,7 @@ with the following columns by default:
     results.show()

 With the above code snippet, we get a DataFrame as result, which contains the margin, probability for each class,
-and the prediction for each instance
+and the prediction for each instance.

 .. code-block:: none
@@ -213,8 +212,9 @@ and the prediction for each instance
 Submit the application
 **********************

-Take submitting the spark job to Spark Standalone cluster as an example, and assuming your application main class
-is ``Iris`` and the application jar is ``iris-1.0.0.jar``
+Here's an example of submitting an end-to-end XGBoost4J-Spark-GPU application to an
+Apache Spark Standalone cluster, assuming the application main class is ``Iris`` and the
+application jar is ``iris-1.0.0.jar``:

 .. code-block:: bash
@@ -237,10 +237,10 @@ is ``Iris`` and the application jar is ``iris-1.0.0.jar``
     --class ${main_class} \
     ${app_jar}

-* First, we need to specify the ``spark-rapids, cudf, xgboost4j-gpu, xgboost4j-spark-gpu`` packages by ``--packages``
-* Second, ``spark-rapids`` is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``
+* First, we need to specify the RAPIDS Accelerator, ``cudf``, ``xgboost4j-gpu`` and ``xgboost4j-spark-gpu`` packages by ``--packages``
+* Second, the RAPIDS Accelerator is a Spark plugin, so we need to configure it by specifying ``spark.plugins=com.nvidia.spark.SQLPlugin``

-For details about ``spark-rapids`` other configurations, please refer to `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_.
+For details about other RAPIDS Accelerator configurations, please refer to the `configuration <https://nvidia.github.io/spark-rapids/docs/configs.html>`_ page.

-For ``spark-rapids Frequently Asked Questions``, please refer to
+For RAPIDS Accelerator frequently asked questions, please refer to the
 `frequently-asked-questions <https://nvidia.github.io/spark-rapids/docs/FAQ.html#frequently-asked-questions>`_.
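
Besides the ``spark-submit`` flag shown above, the same plugin setting can be applied programmatically when building the session; a minimal sketch (the application name is an assumption, and the plugin jars must still be on the classpath, e.g. via ``--packages``):

.. code-block:: scala

   import org.apache.spark.sql.SparkSession

   // Enable the RAPIDS Accelerator plugin via Spark configuration.
   val spark = SparkSession.builder()
     .appName("Iris")
     .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
     .getOrCreate()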

doc/jvm/xgboost4j_spark_tutorial.rst

Lines changed: 5 additions & 0 deletions
@@ -127,6 +127,11 @@ Now, we have a DataFrame containing only two columns, "features" which contains
 "sepal length", "sepal width", "petal length" and "petal width" and "classIndex" which has Double-typed
 labels. A DataFrame like this (containing vector-represented features and numeric labels) can be fed to XGBoost4J-Spark's training engine directly.

+.. note::
+
+  From version 1.6.0+, there is no need to assemble the feature columns manually. Instead, users can specify an array of
+  feature column names by ``setFeaturesCol(value: Array[String])`` and XGBoost4J-Spark will assemble them internally.
+
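
A brief sketch of what the added note describes, assuming the Iris feature and label columns used earlier in this tutorial:

.. code-block:: scala

   import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

   // From 1.6.0+, pass the raw feature column names directly; XGBoost4J-Spark
   // assembles them internally, so no VectorAssembler stage is needed.
   val classifier = new XGBoostClassifier()
     .setFeaturesCol(Array("sepal length", "sepal width", "petal length", "petal width"))
     .setLabelCol("classIndex")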
 Dealing with missing values
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
