Various additions to the README #16

Merged: 8 commits, Sep 26, 2019
README.rst: 33 changes (24 additions & 9 deletions)
@@ -9,14 +9,18 @@ sk-dist: Distributed scikit-learn meta-estimators in PySpark
What is it?
-----------

``sk-dist`` is a Python module for machine learning built on top of
``sk-dist`` is a Python package for machine learning built on top of
`scikit-learn <https://scikit-learn.org/stable/index.html>`__ and is
distributed under the `Apache 2.0 software
license <https://github.com/Ibotta/sk-dist/blob/master/LICENSE>`__. The
``sk-dist`` module can be thought of as "distributed scikit-learn" as
its core functionality is to extend the ``scikit-learn`` built-in
``joblib`` parallelization of meta-estimator training to
`spark <https://spark.apache.org/>`__.
`spark <https://spark.apache.org/>`__. A popular use case is the
parallelization of grid search as shown here:

.. figure:: https://github.com/Ibotta/sk-dist/blob/readme_enhancements/doc/images/grid_search.png
:alt: sk-dist
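
For a concrete starting point, here is a minimal sketch of that grid search
use case, assuming an active ``SparkContext`` and the ``DistGridSearchCV``
interface described in the package documentation (the dataset and parameter
grid are purely illustrative):

.. code-block:: python

    from pyspark.sql import SparkSession
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from skdist.distribute.search import DistGridSearchCV

    # sk-dist uses the SparkContext to parallelize the fit across the cluster
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    X, y = load_breast_cancer(return_X_y=True)

    # drop-in replacement for sklearn's GridSearchCV: each parameter set and
    # cross-validation fold becomes a Spark task instead of a local joblib job
    model = DistGridSearchCV(
        LogisticRegression(solver="liblinear"),
        {"C": [0.01, 0.1, 1.0, 10.0]},
        sc,
        scoring="roc_auc",
        cv=5,
    )
    model.fit(X, y)
    print(model.best_score_)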

Check out the `blog post <https://medium.com/building-ibotta/train-sklearn-100x-faster-bec530fc1f45>`__
for more information on the motivation and use cases of ``sk-dist``.
@@ -134,13 +138,17 @@ With ``pytest`` installed, you can run tests locally:
pytest sk-dist

Examples
^^^^^^^^
--------
Contributor: 💯
For a more complete testing experience and to ensure that your spark
distribution and configuration are compatible with ``sk-dist``, consider
running the
`examples <https://github.com/Ibotta/sk-dist/tree/master/examples>`__
(which do instantiate a ``sparkContext``) in your spark environment.
The package contains numerous
`examples <https://github.com/Ibotta/sk-dist/tree/master/examples>`__
on how to use ``sk-dist`` in practice. Examples of note are:

- `Grid Search with XGBoost <https://github.com/Ibotta/sk-dist/blob/master/examples/search/xgb.py>`__
- `Spark ML Benchmark Comparison <https://github.com/Ibotta/sk-dist/blob/master/examples/search/spark_ml.py>`__
- `Encoderizer with 20 Newsgroups <https://github.com/Ibotta/sk-dist/blob/master/examples/encoder/basic_usage.py>`__
- `One-Vs-Rest vs One-Vs-One <https://github.com/Ibotta/sk-dist/blob/master/examples/multiclass/basic_usage.py>`__
- `Large Scale Sklearn Prediction with PySpark UDFs <https://github.com/Ibotta/sk-dist/blob/master/examples/predict/basic_usage.py>`__ (a generic sketch of this pattern follows the list)
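
The last example covers large-scale batch prediction. As a generic
illustration of that pattern, the sketch below scores a Spark DataFrame with
a plain ``pandas_udf`` wrapping a fitted estimator; it does not use sk-dist's
own prediction helpers, so the names here are illustrative rather than part
of the package's API (``pyarrow`` is assumed to be installed):

.. code-block:: python

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    spark = SparkSession.builder.getOrCreate()

    # Fit a small model on the driver; in practice this could be any fitted
    # scikit-learn or sk-dist estimator.
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=10).fit(X, y)

    # A (potentially very large) Spark DataFrame of feature columns to score.
    feature_cols = ["f0", "f1", "f2", "f3"]
    df = spark.createDataFrame(
        [tuple(float(v) for v in row) for row in X], feature_cols
    )

    # The model ships to the executors inside the UDF closure, and each
    # partition is scored with vectorized pandas batches.
    @pandas_udf("double")
    def predict_udf(f0: pd.Series, f1: pd.Series,
                    f2: pd.Series, f3: pd.Series) -> pd.Series:
        features = pd.concat([f0, f1, f2, f3], axis=1).values
        return pd.Series(model.predict(features).astype(float))

    scored = df.withColumn("prediction", predict_udf(*feature_cols))
    scored.show(5)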

Background
----------
@@ -149,7 +157,14 @@ The project was started at `Ibotta
Inc. <https://medium.com/building-ibotta>`__ on the machine learning
team and open sourced in 2019.

It is currently maintained by the machine learning team at Ibotta.
It is currently maintained by the machine learning team at Ibotta. Special
thanks to those who contributed to ``sk-dist`` during its initial
development at Ibotta:

- `Evan Harris <https://github.com/denver1117>`__
- `Nicole Woytarowicz <https://github.com/nicolele>`__
- `Mike Lewis <https://github.com/Mikelew88>`__
- `Bobby Crimi <https://github.com/rpcrimi>`__

.. figure:: https://github.com/Ibotta/sk-dist/blob/master/doc/images/ibottaml.png
:alt: IbottaML
Binary file added doc/images/grid_search.png