Commit cbf3a5f

wbo4958 and trivialfis authored
[pyspark][doc] add more doc for pyspark (dmlc#8271)
Co-authored-by: fis <jm.yuan@outlook.com>
1 parent c91fed0 commit cbf3a5f

1 file changed

+171
-16
lines changed

doc/tutorials/spark_estimator.rst

Lines changed: 171 additions & 16 deletions
@@ -1,12 +1,23 @@
1-
###############################
2-
Using XGBoost PySpark Estimator
3-
###############################
4-
Starting from version 1.7.0, xgboost supports pyspark estimator APIs. The feature is
5-
still experimental and not yet ready for production use.
1+
################################
2+
Distributed XGBoost with PySpark
3+
################################
4+
5+
Starting from version 1.7.0, xgboost supports pyspark estimator APIs.
6+
7+
.. note::
8+
9+
The feature is still experimental and not yet ready for production use.
10+
11+
.. contents::
12+
:backlinks: none
13+
:local:
14+
15+
*************************
16+
XGBoost PySpark Estimator
17+
*************************
618

7-
*****************
819
SparkXGBRegressor
9-
*****************
20+
=================
1021

1122
SparkXGBRegressor is a PySpark ML estimator. It implements the XGBoost regression
1223
algorithm based on XGBoost python library, and it can be used in PySpark Pipeline
@@ -17,10 +28,14 @@ We can create a `SparkXGBRegressor` estimator like:
1728
.. code-block:: python
1829
1930
from xgboost.spark import SparkXGBRegressor
20-
spark_reg_estimator = SparkXGBRegressor(num_workers=2, max_depth=5)
31+
spark_reg_estimator = SparkXGBRegressor(
32+
features_col="features",
33+
label_col="label",
34+
num_workers=2,
35+
)
2136
2237
23-
The above snippet create an spark estimator which can fit on a spark dataset,
38+
The above snippet creates a spark estimator which can fit on a spark dataset,
2439
and returns a spark model that can transform a spark dataset and generate a dataset
2540
with a prediction column. We can set almost all of the xgboost sklearn estimator parameters
2641
as `SparkXGBRegressor` parameters, but some parameters, such as `nthread`, are forbidden
@@ -30,8 +45,9 @@ such as `weight_col`, `validation_indicator_col`, `use_gpu`, for details please
3045

3146
The following code snippet shows how to train a spark xgboost regressor model,
3247
first we need to prepare a training dataset as a spark dataframe that contains
33-
"features" and "label" column, the "features" column must be `pyspark.ml.linalg.Vector`
34-
type or spark array type.
48+
"label" column and "features" column(s), the "features" column(s) must be `pyspark.ml.linalg.Vector`
49+
type or spark array type or a list of feature column names.
50+
3551

3652
.. code-block:: python
3753
@@ -51,17 +67,156 @@ type or spark array type.
5167
The above snippet returns a `transformed_test_spark_dataframe` that contains the input
5268
dataset columns and an appended column "prediction" representing the prediction results.
5369

54-
55-
******************
5670
SparkXGBClassifier
57-
******************
58-
71+
==================
5972

6073
The `SparkXGBClassifier` estimator has an API similar to `SparkXGBRegressor`, but it has some
6174
pyspark classifier specific params, e.g. `raw_prediction_col` and `probability_col` parameters.
6275
Correspondingly, by default, transforming a test dataset with `SparkXGBClassifierModel` will
6376
generate a result dataset with 3 new columns:
64-
6577
- "prediction": represents the predicted label.
6678
- "raw_prediction": represents the output margin values.
6779
- "probability": represents the prediction probability on each label.
80+
81+
82+
***************************
83+
XGBoost PySpark GPU support
84+
***************************
85+
86+
XGBoost PySpark supports GPU training and prediction. To enable GPU support, you first need
87+
to install the xgboost and cudf packages. Then you can set the `use_gpu` parameter to `True`.
88+
89+
The tutorial below shows how to train a model with XGBoost PySpark GPU on a Spark
90+
standalone cluster.
91+
92+
93+
Write your PySpark application
94+
==============================
95+
96+
.. code-block:: python
97+
98+
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBRegressor
99+
spark = SparkSession.builder.getOrCreate()
100+
101+
# read data into spark dataframe
102+
train_data_path = "xxxx/train"
103+
train_df = spark.read.parquet(train_data_path)
104+
105+
test_data_path = "xxxx/test"
106+
test_df = spark.read.parquet(test_data_path)
107+
108+
# assume the label column is named "class"
109+
label_name = "class"
110+
111+
# get a list with feature column names
112+
feature_names = [x.name for x in train_df.schema if x.name != label_name]
113+
114+
# create a xgboost pyspark regressor estimator and set use_gpu=True
115+
regressor = SparkXGBRegressor(
116+
features_col=feature_names,
117+
label_col=label_name,
118+
num_workers=2,
119+
use_gpu=True,
120+
)
121+
122+
# train and return the model
123+
model = regressor.fit(train_df)
124+
125+
# predict on test data
126+
predict_df = model.transform(test_df)
127+
predict_df.show()
128+
129+
Prepare the necessary packages
130+
==============================
131+
132+
We recommend using Conda or Virtualenv to manage python dependencies
133+
in PySpark. Please refer to
134+
`How to Manage Python Dependencies in PySpark <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_.
135+
136+
.. code-block:: bash
137+
138+
conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
139+
conda activate xgboost-env
140+
pip install xgboost
141+
pip install cudf
142+
conda pack -f -o xgboost-env.tar.gz
143+
144+
145+
Submit the PySpark application
146+
==============================
147+
148+
This step assumes your Spark cluster is already configured with GPU support. If it is not, please
149+
refer to `spark standalone configuration with GPU support <https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster>`_.
150+
151+
.. code-block:: bash
152+
153+
export PYSPARK_DRIVER_PYTHON=python
154+
export PYSPARK_PYTHON=./environment/bin/python
155+
156+
spark-submit \
157+
--master spark://<master-ip>:7077 \
158+
--conf spark.executor.resource.gpu.amount=1 \
159+
--conf spark.task.resource.gpu.amount=1 \
160+
--archives xgboost-env.tar.gz#environment \
161+
xgboost_app.py
162+
163+
164+
Model Persistence
165+
=================
166+
167+
Similar to standard PySpark ml estimators, one can persist and reuse the model with `save`
168+
and `load` methods:
169+
170+
.. code-block:: python
171+
172+
from xgboost.spark import SparkXGBRegressor, SparkXGBRegressorModel

regressor = SparkXGBRegressor()
173+
model = regressor.fit(train_df)
174+
# save the model
175+
model.save("/tmp/xgboost-pyspark-model")
176+
# load the model
177+
model2 = SparkXGBRegressorModel.load("/tmp/xgboost-pyspark-model")
178+
179+
To export the underlying booster model used by XGBoost:
180+
181+
.. code-block:: python
182+
183+
import xgboost as xgb
from xgboost.spark import SparkXGBRegressor

regressor = SparkXGBRegressor()
184+
model = regressor.fit(train_df)
185+
# the same booster object returned by xgboost.train
186+
booster: xgb.Booster = model.get_booster()
187+
booster.predict(...)
188+
booster.save_model("model.json")
189+
190+
This booster is shared by other Python interfaces and can be used by other language
191+
bindings like the C and R packages. Lastly, one can extract a booster file directly from
192+
a saved spark model without going through the getter:
193+
194+
.. code-block:: python
195+
196+
import xgboost as xgb
197+
bst = xgb.Booster()
198+
bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")
199+
200+
Accelerate the whole pipeline of xgboost pyspark
201+
================================================
202+
203+
With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_,
204+
you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
205+
without any code change by leveraging GPU.
206+
207+
Below is a simple example submit command for enabling GPU acceleration:
208+
209+
.. code-block:: bash
210+
211+
export PYSPARK_DRIVER_PYTHON=python
212+
export PYSPARK_PYTHON=./environment/bin/python
213+
214+
spark-submit \
215+
--master spark://<master-ip>:7077 \
216+
--conf spark.executor.resource.gpu.amount=1 \
217+
--conf spark.task.resource.gpu.amount=1 \
218+
--packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
219+
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
220+
--conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
221+
--archives xgboost-env.tar.gz#environment \
222+
xgboost_app.py
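
If the submission needs to be scripted, the two `spark-submit` commands in this tutorial can be assembled from Python. The helper below is a hypothetical sketch, not part of xgboost or Spark: `build_submit_command` and its parameters are names invented for this example, and the master URL is a placeholder. It only builds the argument list; the `PYSPARK_DRIVER_PYTHON` and `PYSPARK_PYTHON` environment variables still have to be set as shown above.

```python
import shlex
from typing import List


def build_submit_command(
    master_url: str,
    app_path: str = "xgboost_app.py",
    env_archive: str = "xgboost-env.tar.gz#environment",
    use_rapids: bool = False,
) -> List[str]:
    """Assemble the spark-submit argument list used in this tutorial."""
    command = [
        "spark-submit",
        "--master", master_url,
        "--conf", "spark.executor.resource.gpu.amount=1",
        "--conf", "spark.task.resource.gpu.amount=1",
    ]
    if use_rapids:
        # Extra options from the RAPIDS Accelerator example.
        command += [
            "--packages", "com.nvidia:rapids-4-spark_2.12:22.08.0",
            "--conf", "spark.plugins=com.nvidia.spark.SQLPlugin",
            "--conf", "spark.sql.execution.arrow.maxRecordsPerBatch=1000000",
        ]
    command += ["--archives", env_archive, app_path]
    return command


# Placeholder master URL; replace it with the address of your cluster.
command = build_submit_command("spark://127.0.0.1:7077", use_rapids=True)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once the environment variables are in place.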
