sk-dist: Distributed scikit-learn meta-estimators in PySpark
What is it?
sk-dist is a Python module for machine learning built on top of
scikit-learn and is distributed under the Apache 2.0 software license.
The sk-dist module can be thought of as "distributed scikit-learn", as
its core functionality is to extend the joblib parallelization of
meta-estimator training to spark.
- Distributed Training - sk-dist parallelizes the training of scikit-learn meta-estimators with PySpark. This allows distributed training of these estimators without any constraint on the physical resources of any one machine. In all cases, spark artifacts are automatically stripped from the fitted estimator. These estimators can then be pickled and un-pickled for prediction tasks, operating identically at predict time to their scikit-learn counterparts. Supported tasks are:
- Grid Search: Hyperparameter optimization techniques, particularly GridSearchCV and RandomizedSearchCV, are distributed such that each parameter set candidate is trained in parallel.
- Multiclass Strategies: Multiclass classification strategies, particularly OneVsRestClassifier and OneVsOneClassifier, are distributed such that each binary problem is trained in parallel.
- Tree Ensembles: Decision tree ensembles for classification and regression, particularly RandomForest and ExtraTrees, are distributed such that each tree is trained in parallel.
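The predict-time equivalence described above can be illustrated with plain scikit-learn: a fitted grid search survives a pickle round trip and predicts like any other estimator. This sketch uses vanilla GridSearchCV rather than sk-dist's distributed counterpart (which, per the above, would fan the candidates out to spark workers and then strip the spark artifacts):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Each of the four parameter candidates is an independent fit; this is the
# unit of work that sk-dist trains in parallel on spark workers.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0, 100.0]},
    cv=3,
)
search.fit(X, y)

# Pickle and un-pickle the fitted estimator; predictions are unchanged,
# mirroring how sk-dist estimators behave identically at predict time.
restored = pickle.loads(pickle.dumps(search))
assert (restored.predict(X) == search.predict(X)).all()
```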
- Distributed Prediction - sk-dist provides a prediction module which builds vectorized UDFs for PySpark DataFrames using fitted scikit-learn estimators. This distributes the scikit-learn estimators, enabling large scale prediction with spark.
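Outside of spark, the core of such a vectorized UDF is simply a function from a pandas DataFrame of features to a pandas Series of predictions. A rough sketch of that idea, assuming a fitted scikit-learn model (the function names here are illustrative, not sk-dist's API; skdist.predict builds the actual PySpark UDF machinery):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

def make_predict_fn(model):
    """Illustrative stand-in for the pandas function a vectorized UDF wraps."""
    def predict_fn(pdf: pd.DataFrame) -> pd.Series:
        # Batch-predict on a whole pandas partition at once, which is what
        # makes pandas UDFs efficient compared to row-at-a-time UDFs.
        return pd.Series(model.predict(pdf.values))
    return predict_fn

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

predict_fn = make_predict_fn(model)
preds = predict_fn(pd.DataFrame(X))
```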
- Feature Encoding - sk-dist provides a flexible feature encoding utility called Encoderizer, which encodes mixed-type feature spaces using either default behavior or user-defined customizable settings. It is particularly aimed at text features, but it additionally handles numeric and dictionary type feature spaces.
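The kind of mixed-type encoding described above can be approximated with standard scikit-learn pieces. A minimal sketch of encoding a text-plus-numeric feature space (the column names and transformer choices are illustrative assumptions, not Encoderizer's actual defaults):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "text": ["free shipping today", "limited time offer", "thank you"],
    "amount": [10.0, 25.0, 5.0],
})

# Hypothetical stand-in for mixed-type encoding: tf-idf for the text
# column, standard scaling for the numeric column.
encoder = ColumnTransformer([
    ("text", TfidfVectorizer(), "text"),
    ("num", StandardScaler(), ["amount"]),
])
features = encoder.fit_transform(df)
```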
sk-dist requires:
- Python (>= 3.5)
- pandas (>=0.19.0)
- numpy (>=1.17.0)
- scipy (>=1.3.1)
- scikit-learn (>=0.21.3)
- joblib (>=0.11)
sk-dist does not support Python 2.
Most sk-dist functionality requires a spark installation as well as
PySpark. Some functionality can run without spark, so spark related
dependencies are not strictly required. The connection between sk-dist
and spark relies solely on a sparkContext passed as an argument to
various sk-dist classes upon instantiation.
A variety of spark configurations and setups will work. It is left up to
the user to configure their own spark setup. Testing has been done on
spark 2.4, though any spark 2.0+ version is expected to work.
An additional spark related dependency is pyarrow, which is used only by
skdist.predict functions. These functions use vectorized pandas UDFs,
which require pyarrow>=0.8.0. Depending on the spark version, it may be
necessary to set
spark.conf.set("spark.sql.execution.arrow.enabled", "true") in the
spark configuration.
The easiest way to install sk-dist is with pip:
pip install --upgrade sk-dist
You can also download the source code:
git clone https://github.com/Ibotta/sk-dist.git
With pytest installed, you can run tests locally.
A note about unit testing: unit tests are only written to test
functionality that (1) does not require a sparkContext and (2) has
no dependencies outside of the package requirements. This means that
much of the distributed spark functionality is not included in unit
tests.
For a more complete testing experience, and to ensure that your spark
distribution and configuration are compatible with sk-dist, run the
example scripts (which do instantiate a sparkContext) in your spark
environment.
The project was started at Ibotta Inc. on the machine learning team and open sourced in 2019.
It is currently maintained by the machine learning team at Ibotta.