Skip to content

Commit f23cc92

Browse files
[pyspark] User guide doc and tutorials (dmlc#8082)
Co-authored-by: Bobby Wang <wbo4958@gmail.com>
1 parent f801d3c commit f23cc92

File tree

4 files changed

+155
-4
lines changed

4 files changed

+155
-4
lines changed
Lines changed: 82 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,82 @@
1+
'''
2+
Collection of examples for using xgboost.spark estimator interface
3+
==================================================================
4+
5+
@author: Weichen Xu
6+
'''
7+
from pyspark.sql import SparkSession
8+
from pyspark.sql.functions import rand
9+
from pyspark.ml.linalg import Vectors
10+
import sklearn.datasets
11+
from sklearn.model_selection import train_test_split
12+
from xgboost.spark import SparkXGBClassifier, SparkXGBRegressor
13+
from pyspark.ml.evaluation import RegressionEvaluator, MulticlassClassificationEvaluator
14+
15+
16+
spark = SparkSession.builder.master("local[*]").getOrCreate()
17+
18+
19+
def create_spark_df(X, y):
20+
return spark.createDataFrame(
21+
spark.sparkContext.parallelize([
22+
(Vectors.dense(features), float(label))
23+
for features, label in zip(X, y)
24+
]),
25+
["features", "label"]
26+
)
27+
28+
29+
# load diabetes dataset (regression dataset)
30+
diabetes_X, diabetes_y = sklearn.datasets.load_diabetes(return_X_y=True)
31+
diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = \
32+
train_test_split(diabetes_X, diabetes_y, test_size=0.3, shuffle=True)
33+
34+
diabetes_train_spark_df = create_spark_df(diabetes_X_train, diabetes_y_train)
35+
diabetes_test_spark_df = create_spark_df(diabetes_X_test, diabetes_y_test)
36+
37+
# train xgboost regressor model
38+
xgb_regressor = SparkXGBRegressor(max_depth=5)
39+
xgb_regressor_model = xgb_regressor.fit(diabetes_train_spark_df)
40+
41+
transformed_diabetes_test_spark_df = xgb_regressor_model.transform(diabetes_test_spark_df)
42+
regressor_evaluator = RegressionEvaluator(metricName="rmse")
43+
print(f"regressor rmse={regressor_evaluator.evaluate(transformed_diabetes_test_spark_df)}")
44+
45+
diabetes_train_spark_df2 = diabetes_train_spark_df.withColumn(
46+
"validationIndicatorCol", rand(1) > 0.7
47+
)
48+
49+
# train xgboost regressor model with validation dataset
50+
xgb_regressor2 = SparkXGBRegressor(max_depth=5, validation_indicator_col="validationIndicatorCol")
51+
xgb_regressor_model2 = xgb_regressor.fit(diabetes_train_spark_df2)
52+
transformed_diabetes_test_spark_df2 = xgb_regressor_model2.transform(diabetes_test_spark_df)
53+
print(f"regressor2 rmse={regressor_evaluator.evaluate(transformed_diabetes_test_spark_df2)}")
54+
55+
56+
# load iris dataset (classification dataset)
57+
iris_X, iris_y = sklearn.datasets.load_iris(return_X_y=True)
58+
iris_X_train, iris_X_test, iris_y_train, iris_y_test = \
59+
train_test_split(iris_X, iris_y, test_size=0.3, shuffle=True)
60+
61+
iris_train_spark_df = create_spark_df(iris_X_train, iris_y_train)
62+
iris_test_spark_df = create_spark_df(iris_X_test, iris_y_test)
63+
64+
# train xgboost classifier model
65+
xgb_classifier = SparkXGBClassifier(max_depth=5)
66+
xgb_classifier_model = xgb_classifier.fit(iris_train_spark_df)
67+
68+
transformed_iris_test_spark_df = xgb_classifier_model.transform(iris_test_spark_df)
69+
classifier_evaluator = MulticlassClassificationEvaluator(metricName="f1")
70+
print(f"classifier f1={classifier_evaluator.evaluate(transformed_iris_test_spark_df)}")
71+
72+
iris_train_spark_df2 = iris_train_spark_df.withColumn(
73+
"validationIndicatorCol", rand(1) > 0.7
74+
)
75+
76+
# train xgboost classifier model with validation dataset
77+
xgb_classifier2 = SparkXGBClassifier(max_depth=5, validation_indicator_col="validationIndicatorCol")
78+
xgb_classifier_model2 = xgb_classifier.fit(iris_train_spark_df2)
79+
transformed_iris_test_spark_df2 = xgb_classifier_model2.transform(iris_test_spark_df)
80+
print(f"classifier2 f1={classifier_evaluator.evaluate(transformed_iris_test_spark_df2)}")
81+
82+
spark.stop()

doc/tutorials/spark_estimator.rst

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
###############################
2+
Using XGBoost PySpark Estimator
3+
###############################
4+
Starting from version 2.0, xgboost supports pyspark estimator APIs.
5+
The feature is still experimental and not yet ready for production use.
6+
7+
*****************
8+
SparkXGBRegressor
9+
*****************
10+
11+
SparkXGBRegressor is a PySpark ML estimator. It implements the XGBoost classification
12+
algorithm based on XGBoost python library, and it can be used in PySpark Pipeline
13+
and PySpark ML meta algorithms like CrossValidator/TrainValidationSplit/OneVsRest.
14+
15+
We can create a `SparkXGBRegressor` estimator like:
16+
17+
.. code-block:: python
18+
19+
from xgboost.spark import SparkXGBRegressor
20+
spark_reg_estimator = SparkXGBRegressor(num_workers=2, max_depth=5)
21+
22+
23+
The above snippet create an spark estimator which can fit on a spark dataset,
24+
and return a spark model that can transform a spark dataset and generate dataset
25+
with prediction column. We can set almost all of xgboost sklearn estimator parameters
26+
as `SparkXGBRegressor` parameters, but some parameter such as `nthread` is forbidden
27+
in spark estimator, and some parameters are replaced with pyspark specific parameters
28+
such as `weight_col`, `validation_indicator_col`, `use_gpu`, for details please see
29+
`SparkXGBRegressor` doc.
30+
31+
The following code snippet shows how to train a spark xgboost regressor model,
32+
first we need to prepare a training dataset as a spark dataframe contains
33+
"features" and "label" column, the "features" column must be `pyspark.ml.linalg.Vector`
34+
type or spark array type.
35+
36+
.. code-block:: python
37+
38+
xgb_regressor_model = xgb_regressor.fit(train_spark_dataframe)
39+
40+
41+
The following code snippet shows how to predict test data using a spark xgboost regressor model,
42+
first we need to prepare a test dataset as a spark dataframe contains
43+
"features" and "label" column, the "features" column must be `pyspark.ml.linalg.Vector`
44+
type or spark array type.
45+
46+
.. code-block:: python
47+
48+
transformed_test_spark_dataframe = xgb_regressor.predict(test_spark_dataframe)
49+
50+
51+
The above snippet code returns a `transformed_test_spark_dataframe` that contains the input
52+
dataset columns and an appended column "prediction" representing the prediction results.
53+
54+
55+
******************
56+
SparkXGBClassifier
57+
******************
58+
59+
60+
`SparkXGBClassifier` estimator has similar API with `SparkXGBRegressor`, but it has some
61+
pyspark classifier specific params, e.g. `raw_prediction_col` and `probability_col` parameters.
62+
Correspondingly, by default, `SparkXGBClassifierModel` transforming test dataset will
63+
generate result dataset with 3 new columns:
64+
- "prediction": represents the predicted label.
65+
- "raw_prediction": represents the output margin values.
66+
- "probability": represents the prediction probability on each label.

python-package/xgboost/spark/core.py

Lines changed: 0 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -379,10 +379,6 @@ def setParams(self, **kwargs): # pylint: disable=invalid-name
379379
)
380380
if k in _pyspark_param_alias_map:
381381
real_k = _pyspark_param_alias_map[k]
382-
if real_k in kwargs:
383-
raise ValueError(
384-
f"You should set only one of param '{k}' and '{real_k}'"
385-
)
386382
k = real_k
387383

388384
if self.hasParam(k):

python-package/xgboost/spark/estimator.py

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -31,6 +31,9 @@ class SparkXGBRegressor(_SparkXGBEstimator):
3131
3232
SparkXGBRegressor doesn't support `validate_features` and `output_margin` param.
3333
34+
SparkXGBRegressor doesn't support setting `nthread` xgboost param, instead, the `nthread`
35+
param for each xgboost worker will be set equal to `spark.task.cpus` config value.
36+
3437
callbacks:
3538
The export and import of the callback functions are at best effort.
3639
For details, see :py:attr:`xgboost.spark.SparkXGBRegressor.callbacks` param doc.
@@ -128,6 +131,10 @@ class SparkXGBClassifier(_SparkXGBEstimator, HasProbabilityCol, HasRawPrediction
128131
129132
SparkXGBClassifier doesn't support `validate_features` and `output_margin` param.
130133
134+
SparkXGBRegressor doesn't support setting `nthread` xgboost param, instead, the `nthread`
135+
param for each xgboost worker will be set equal to `spark.task.cpus` config value.
136+
137+
131138
Parameters
132139
----------
133140

0 commit comments

Comments
 (0)