sk-dist: Distributed scikit-learn meta-estimators in PySpark


What is it?

sk-dist is a Python module for machine learning built on top of scikit-learn and distributed under the Apache 2.0 software license. It can be thought of as "distributed scikit-learn": its core functionality extends scikit-learn's built-in joblib parallelization of meta-estimator training to Spark.

Main Features

  • Distributed Training - sk-dist parallelizes the training of scikit-learn meta-estimators with PySpark. This allows distributed training of these estimators without any constraint on the physical resources of any one machine. In all cases, Spark artifacts are automatically stripped from the fitted estimator. These estimators can then be pickled and un-pickled for prediction tasks, operating identically at predict time to their scikit-learn counterparts. Supported tasks include model selection (grid search and randomized search), multiclass strategies (one-vs-rest and one-vs-one), and tree ensembles (random forests and extra randomized trees).
  • Distributed Prediction - sk-dist provides a prediction module which builds vectorized UDFs for PySpark DataFrames using fitted scikit-learn estimators. This distributes the predict and predict_proba methods of scikit-learn estimators, enabling large scale prediction with scikit-learn.
  • Feature Encoding - sk-dist provides a flexible feature encoding utility called Encoderizer which encodes mixed-type feature spaces using either default behavior or user-defined settings. It is particularly aimed at text features, but it additionally handles numeric and dictionary-type feature spaces.
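As a sketch of the distributed training workflow, the example below fits a grid search over a scikit-learn estimator using sk-dist's DistGridSearchCV with a SparkContext. It assumes a running Spark environment and mirrors the scripts in the examples directory; verify exact argument names there before relying on it:

```python
# Sketch only: assumes a live SparkContext `sc` and a Spark installation.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from skdist.distribute.search import DistGridSearchCV

X, y = load_digits(return_X_y=True)

model = DistGridSearchCV(
    LogisticRegression(solver="liblinear"),
    {"C": [0.1, 1.0, 10.0]},  # hyperparameter grid, fit in parallel on Spark
    sc,                       # SparkContext; Spark artifacts are stripped after fit
    scoring="accuracy",
)
model.fit(X, y)  # afterwards, behaves like a fitted scikit-learn search object
```

Because Spark artifacts are stripped after fitting, the resulting model can be pickled and used for prediction anywhere scikit-learn runs, with no Spark dependency.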



Dependencies

sk-dist requires Python 3; Python 2 is not supported.

Spark Dependencies

Most sk-dist functionality requires a Spark installation as well as PySpark. Because some functionality can run without Spark, Spark-related packages are not hard installation requirements. The connection between sk-dist and Spark relies solely on a SparkContext passed as an argument to various sk-dist classes upon instantiation.

A variety of Spark configurations and setups will work. It is left up to the user to configure their own Spark setup. Testing has been done on Spark 2.4, though any Spark 2.0+ version is expected to work.

An additional Spark-related dependency is pyarrow, which is used only for skdist.predict functions. These use vectorized pandas UDFs, which require pyarrow>=0.8.0. Depending on the Spark version, it may be necessary to set spark.conf.set("spark.sql.execution.arrow.enabled", "true") in the Spark configuration.
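One way to apply that setting is at session creation time. The snippet below is a minimal configuration sketch (the app name is arbitrary) and assumes a local PySpark installation:

```python
# Configuration sketch: enables Arrow-backed pandas UDFs, which
# skdist.predict relies on for vectorized prediction.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("skdist-example")  # arbitrary app name
    .config("spark.sql.execution.arrow.enabled", "true")
    .getOrCreate()
)
sc = spark.sparkContext  # pass this SparkContext to sk-dist estimators
```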

User Installation

The easiest way to install sk-dist is with pip:

pip install --upgrade sk-dist

You can also download the source code:

git clone


With pytest installed, you can run tests locally:

pytest sk-dist
Testing Caveats

Unit tests only cover functionality that (1) does not require a SparkContext and (2) has no dependencies outside of the package requirements. This means that much of the distributed Spark functionality is not covered by unit tests.


For a more complete testing experience, and to ensure that your Spark distribution and configuration are compatible with sk-dist, consider running the examples (which do instantiate a SparkContext) in your Spark environment.


Background

The project was started at Ibotta Inc. on the machine learning team and open-sourced in 2019.

It is currently maintained by the machine learning team at Ibotta.
