################################
Distributed XGBoost with PySpark
################################

Starting from version 1.7.0, xgboost supports pyspark estimator APIs.

.. note::

  The feature is still experimental and not yet ready for production use.

.. contents::
  :backlinks: none
  :local:

*************************
XGBoost PySpark Estimator
*************************

SparkXGBRegressor
=================

SparkXGBRegressor is a PySpark ML estimator. It implements the XGBoost regression
algorithm based on the XGBoost python library, and it can be used in PySpark Pipeline
and PySpark ML meta algorithms like CrossValidator/TrainValidationSplit/OneVsRest.

We can create a `SparkXGBRegressor` estimator like:

.. code-block:: python

  from xgboost.spark import SparkXGBRegressor

  spark_reg_estimator = SparkXGBRegressor(
      features_col="features",
      label_col="label",
      num_workers=2,
  )


The above snippet creates a spark estimator which can fit on a spark dataset and
return a spark model that can transform a spark dataset and generate a dataset with
a prediction column. We can set almost all of the xgboost sklearn estimator parameters
as `SparkXGBRegressor` parameters, but some parameters such as `nthread` are forbidden
in the spark estimator, and some parameters are replaced with pyspark specific
parameters such as `weight_col`, `validation_indicator_col` and `use_gpu`; for details
please see the `SparkXGBRegressor` doc.

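For example, a minimal sketch of the pyspark specific parameters, assuming the training
dataframe carries an instance-weight column named "weight" and a boolean column named
"isVal" marking validation rows (both column names are illustrative):

.. code-block:: python

  from xgboost.spark import SparkXGBRegressor

  # "weight" and "isVal" are hypothetical column names used for illustration
  regressor_with_weights = SparkXGBRegressor(
      features_col="features",
      label_col="label",
      weight_col="weight",
      validation_indicator_col="isVal",
  )
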
The following code snippet shows how to train a spark xgboost regressor model.
First, we need to prepare a training dataset as a spark dataframe containing a
"label" column and a "features" column(s); the "features" column(s) must be of
`pyspark.ml.linalg.Vector` type or spark array type, or `features_col` can be set
to a list of feature column names.

.. code-block:: python

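  # A minimal sketch of the described flow; the data values and the test
  # dataframe are illustrative.
  from pyspark.ml.linalg import Vectors
  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # training dataframe with a vector-typed "features" column and a "label" column
  train_spark_dataframe = spark.createDataFrame(
      [(Vectors.dense(1.0, 2.0, 3.0), 0.0), (Vectors.dense(4.0, 5.0, 6.0), 1.0)],
      ["features", "label"],
  )

  # test dataframe with the same "features" schema
  test_spark_dataframe = spark.createDataFrame(
      [(Vectors.dense(7.0, 8.0, 9.0),)], ["features"]
  )

  # fit the estimator created above and transform the test dataframe
  xgb_regressor_model = spark_reg_estimator.fit(train_spark_dataframe)
  transformed_test_spark_dataframe = xgb_regressor_model.transform(test_spark_dataframe)
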
The above code snippet returns a `transformed_test_spark_dataframe` that contains the
input dataset columns and an appended column "prediction" representing the prediction
results.

SparkXGBClassifier
==================

The `SparkXGBClassifier` estimator has a similar API to `SparkXGBRegressor`, but it has
some pyspark classifier specific parameters, e.g. `raw_prediction_col` and
`probability_col`. Correspondingly, by default, a `SparkXGBClassifierModel` transforming
the test dataset will generate a result dataset with 3 new columns (see the sketch after
this list):

- "prediction": represents the predicted label.
- "raw_prediction": represents the output margin values.
- "probability": represents the prediction probability on each label.


***************************
XGBoost PySpark GPU support
***************************

XGBoost PySpark supports GPU training and prediction. To enable GPU support, you first
need to install the xgboost and cudf packages, and then set the `use_gpu` parameter to
`True`.

The tutorial below shows how to train a model with XGBoost PySpark GPU on a Spark
standalone cluster.


Write your PySpark application
==============================

.. code-block:: python

  from pyspark.sql import SparkSession
  from xgboost.spark import SparkXGBRegressor

  spark = SparkSession.builder.getOrCreate()

  # read data into spark dataframe
  train_data_path = "xxxx/train"
  train_df = spark.read.parquet(train_data_path)

  test_data_path = "xxxx/test"
  test_df = spark.read.parquet(test_data_path)

  # assume the label column is named "class"
  label_name = "class"

  # get a list with feature column names
  feature_names = [x.name for x in train_df.schema if x.name != label_name]

  # create a xgboost pyspark regressor estimator and set use_gpu=True
  regressor = SparkXGBRegressor(
      features_col=feature_names,
      label_col=label_name,
      num_workers=2,
      use_gpu=True,
  )

  # train and return the model
  model = regressor.fit(train_df)

  # predict on test data
  predict_df = model.transform(test_df)
  predict_df.show()

Prepare the necessary packages
==============================

We recommend using Conda or Virtualenv to manage python dependencies
in PySpark. Please refer to
`How to Manage Python Dependencies in PySpark <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_.

.. code-block:: bash

  conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
  conda activate xgboost-env
  pip install xgboost
  pip install cudf
  conda pack -f -o xgboost-env.tar.gz

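An equivalent sketch with Virtualenv, assuming the `venv-pack` tool is used for
packaging (package names mirror the conda example above):

.. code-block:: bash

  python -m venv xgboost-env
  source xgboost-env/bin/activate
  pip install xgboost cudf venv-pack
  venv-pack -o xgboost-env.tar.gz

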
Submit the PySpark application
==============================

This assumes you have configured your Spark cluster with GPU support. If not yet, please
refer to `spark standalone configuration with GPU support <https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-on-prem.html#spark-standalone-cluster>`_.

.. code-block:: bash

  export PYSPARK_DRIVER_PYTHON=python
  export PYSPARK_PYTHON=./environment/bin/python

  spark-submit \
    --master spark://<master-ip>:7077 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --archives xgboost-env.tar.gz#environment \
    xgboost_app.py


Model Persistence
=================

Similar to standard PySpark ml estimators, one can persist and reuse the model with the
`save` and `load` methods:

.. code-block:: python

  from xgboost.spark import SparkXGBRegressor, SparkXGBRegressorModel

  regressor = SparkXGBRegressor()
  model = regressor.fit(train_df)
  # save the model
  model.save("/tmp/xgboost-pyspark-model")
  # load the model
  model2 = SparkXGBRegressorModel.load("/tmp/xgboost-pyspark-model")

To export the underlying booster model used by XGBoost:

.. code-block:: python

  import xgboost as xgb

  regressor = SparkXGBRegressor()
  model = regressor.fit(train_df)
  # the same booster object returned by xgboost.train
  booster: xgb.Booster = model.get_booster()
  booster.predict(...)
  booster.save_model("model.json")

This booster is shared by other Python interfaces and can be used by other language
bindings like the C and R packages. Lastly, one can extract the booster file directly
from a saved spark estimator without going through the getter:

.. code-block:: python

  import xgboost as xgb

  bst = xgb.Booster()
  bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")

Accelerate the whole pipeline of xgboost pyspark
================================================

With the `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_,
you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
without any code change by leveraging GPU.

Below is a simple example submit command for enabling GPU acceleration:

.. code-block:: bash

  export PYSPARK_DRIVER_PYTHON=python
  export PYSPARK_PYTHON=./environment/bin/python

  spark-submit \
    --master spark://<master-ip>:7077 \
    --conf spark.executor.resource.gpu.amount=1 \
    --conf spark.task.resource.gpu.amount=1 \
    --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
    --conf spark.plugins=com.nvidia.spark.SQLPlugin \
    --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
    --archives xgboost-env.tar.gz#environment \
    xgboost_app.py