Merged
84 changes: 79 additions & 5 deletions README.md
@@ -1,18 +1,92 @@

# scikit-learn_bench

Benchmark for optimizations to scikit-learn in the Intel(R) Distribution for
Python*. See benchmark results [here](https://intelpython.github.io/scikit-learn_bench).
**scikit-learn_bench** benchmarks various implementations of machine learning algorithms across data analytics frameworks. It can be extended with new frameworks and algorithms. It currently supports the [scikit-learn](https://scikit-learn.org/), [DAAL4PY](https://intelpython.github.io/daal4py/), [cuML](https://github.com/rapidsai/cuml), and [XGBoost](https://github.com/dmlc/xgboost) frameworks for commonly used [machine learning algorithms](#supported-algorithms).

See benchmark results [here](https://intelpython.github.io/scikit-learn_bench).


## Table of contents

* [Prerequisites](#prerequisites)
* [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
* [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
* [Supported algorithms](#supported-algorithms)
* [Algorithms parameters](#algorithms-parameters)
* [Legacy automatic building and running](#legacy-automatic-building-and-running)

## Prerequisites
- python and scikit-learn to run python versions
- `python` and `scikit-learn` to run python versions
- pandas when using its DataFrame as input data format
- `icc`, `ifort`, `mkl`, `daal` to compile and run native benchmarks
- the machine learning frameworks that you want to test. See [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking) for instructions on setting up the environment.

## How to create conda environment for benchmarking
`conda create -n skl_bench -c intel python=3.7 scikit-learn pandas`

Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.

* [**scikit-learn**](https://github.com/PivovarA/scikit-learn_bench/blob/master/sklearn/README.md#how-to-create-conda-environment-for-benchmarking)
* [**daal4py**](https://github.com/PivovarA/scikit-learn_bench/blob/master/daal4py/README.md#how-to-create-conda-environment-for-benchmarking)
* [**cuml**](https://github.com/PivovarA/scikit-learn_bench/blob/master/cuml/README.md#how-to-create-conda-environment-for-benchmarking)
* [**xgboost**](https://github.com/PivovarA/scikit-learn_bench/tree/master/xgboost/README.md#how-to-create-conda-environment-for-benchmarking)


## Running Python benchmarks with runner script
`python runner.py --config config_example.json [--output-format json --verbose]`

Run `python runner.py --config configs/config_example.json [--output-format json --verbose]` to launch benchmarks.

Runner options:
* ``config`` : path to the configuration file
* ``dummy-run`` : run the configuration parser and dataset generation without running the benchmarks
* ``verbose`` : print additional information while the benchmarks run
* ``output-format`` : *json* or *csv*. Output format that the benchmarks produce
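
As a minimal sketch of how the options above combine, the following Python helper assembles the runner command line as an argument list. The helper name `build_runner_command` is hypothetical, and the `--dummy-run` spelling is assumed from the option list above rather than confirmed against `runner.py`.

```python
import sys


def build_runner_command(config, output_format=None, verbose=False, dummy_run=False):
    """Return the argv list for a runner.py invocation (hypothetical helper)."""
    cmd = [sys.executable, "runner.py", "--config", config]
    if output_format is not None:
        if output_format not in ("json", "csv"):
            raise ValueError("output format must be 'json' or 'csv'")
        cmd += ["--output-format", output_format]
    if verbose:
        cmd.append("--verbose")
    if dummy_run:
        # Assumed flag spelling, taken from the option list above.
        cmd.append("--dummy-run")
    return cmd


cmd = build_runner_command(
    "configs/config_example.json", output_format="json", verbose=True
)
# Skip the interpreter path, which differs from machine to machine.
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.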

Benchmarks currently support the following frameworks:
* **scikit-learn**
* **daal4py**
* **cuml**
* **xgboost**

The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.

You can configure benchmarks by editing a config file. Check [config.json schema](https://github.com/PivovarA/scikit-learn_bench/blob/master/configs/README.md) for more details.

## Supported algorithms

| algorithm | benchmark name | sklearn | daal4py | cuml | xgboost |
|---|---|---|---|---|---|
|**[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)**|dbscan|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**|df_clsf|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)**|df_regr|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[pairwise_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)**|distances|:white_check_mark:|:white_check_mark:|:x:|:x:|
|**[KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)**|kmeans|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**|knn_clsf|:white_check_mark:|:x:|:white_check_mark:|:x:|
|**[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)**|linear|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)**|log_reg|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)**|pca|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)**|ridge|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)**|svm|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**|train_test_split|:white_check_mark:|:x:|:white_check_mark:|:x:|
|**[GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)**|gbt|:x:|:x:|:x:|:white_check_mark:|
|**[GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)**|gbt|:x:|:x:|:x:|:white_check_mark:|

## Algorithms parameters

You can launch benchmarks for each algorithm separately.
To do this, go to the directory with the benchmark:

`cd <framework>`

Run the following command:

`python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>`

You can find the list of supported parameters for each algorithm here:

* [**scikit-learn**](https://github.com/PivovarA/scikit-learn_bench/blob/master/sklearn/README.md#algorithms-parameters)
* [**daal4py**](https://github.com/PivovarA/scikit-learn_bench/blob/master/daal4py/README.md#algorithms-parameters)
* [**cuml**](https://github.com/PivovarA/scikit-learn_bench/blob/master/cuml/README.md#algorithms-parameters)
* [**xgboost**](https://github.com/PivovarA/scikit-learn_bench/tree/master/xgboost/README.md#algorithms-parameters)
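
The command above can also be assembled programmatically. The sketch below assumes that each parameter name in the per-framework tables maps to a `--<name>` flag, as `--dataset-name` does in the command above, and that switch-style parameters take no value. The file name `kmeans.py`, the dataset path, and the parameter values are hypothetical.

```python
def params_to_args(params):
    """Convert a parameter dict into a flat list of command-line arguments."""
    args = []
    for name, value in params.items():
        flag = "--" + name
        if value is True:
            # Switch-style parameters take no value.
            args.append(flag)
        elif value is False or value is None:
            # Unset switches and missing values are left out.
            continue
        else:
            args += [flag, str(value)]
    return args


# Hypothetical file name, dataset path, and parameter values, for illustration only.
command = ["python", "kmeans.py"] + params_to_args(
    {"dataset-name": "data/blobs.npy", "n-clusters": 10, "verbose": True, "header": False}
)
print(" ".join(command))
```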

## Legacy automatic building and running
- Run `make`. This will generate data, compile benchmarks, and run them.
67 changes: 67 additions & 0 deletions configs/README.md
@@ -0,0 +1,67 @@
## Config JSON Schema

Configure benchmarks by editing the `config.json` file.
You can configure some algorithm parameters, datasets, a list of frameworks to use, and the usage of some environment variables.
Refer to the tables below for descriptions of all fields in the configuration file.

- [Root Config Object](#root-config-object)
- [Common Object](#common-object)
- [Case Object](#case-object)
- [Dataset Object](#dataset-object)
- [Training Object](#training-object)
- [Testing Object](#testing-object)

### Root Config Object
| Field Name | Type | Description |
| ----- | ---- |------------ |
|omp_env| array[string] | For xgboost only. Environment variables used to set the number of OpenMP threads |
|common| [Common Object](#common-object)| **REQUIRED** common benchmark settings: frameworks and input data settings |
|cases| array[[Case Object](#case-object)] | **REQUIRED** list of algorithms with their parameters and training data |

### Common Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost* |
|data-format| array[string] | **REQUIRED** input data format. Data formats: *numpy*, *pandas* or *cudf* |
|data-order| array[string] | **REQUIRED** input data order. Data order: *C* (row-major, default) or *F* (column-major) |
|dtype| array[string] | **REQUIRED** input data type. Data type: *float64* (default) or *float32* |
|check-finiteness| array[] | Check finiteness in the sklearn input check (disabled by default) |
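
Every required field in the common object is an array. One plausible reading, which this sketch assumes rather than takes from `runner.py`, is that each combination of values yields one benchmark configuration. The function name `expand` is hypothetical.

```python
import itertools

common = {
    "lib": ["sklearn", "daal4py"],
    "data-format": ["numpy", "pandas"],
    "data-order": ["C"],
    "dtype": ["float64", "float32"],
}


def expand(common):
    """Yield one dict per combination of the array-valued fields."""
    keys = list(common)
    for values in itertools.product(*(common[key] for key in keys)):
        yield dict(zip(keys, values))


combos = list(expand(common))
print(len(combos))  # 2 * 2 * 1 * 2 = 8 combinations
print(combos[0])
```

Under this assumption, adding a value to any array multiplies the number of runs, so keep the arrays short when a single run is expensive.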

### Case Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost*|
|algorithm| string | **REQUIRED** benchmark name |
|dataset| array[[Dataset Object](#dataset-object)] | **REQUIRED** input data specifications. |
|benchmark parameters| array[Any] | **REQUIRED** algorithm parameters. The supported parameters are listed in the *Algorithms parameters* section of each framework's README |

### Dataset Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
|source| string | **REQUIRED** data source. It can be *synthetic* or *csv* |
|type| string | **REQUIRED** for synthetic data only. The type of task for which the dataset is generated. It can be *classification*, *blobs* or *regression* |
|n_classes| int | For *synthetic* data and for *classification* type only. The number of classes (or labels) of the classification problem |
|n_clusters| int | For *synthetic* data and for *blobs* type only. The number of centers to generate |
|n_features| int | **REQUIRED** For *synthetic* data only. The number of features to generate |
|name| string | Name of dataset |
|training| [Training Object](#training-object) | **REQUIRED** training data specification |
|testing| [Testing Object](#testing-object) | **REQUIRED** testing data specification |
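
The rules in the table above can be checked before launching a long run. The following sketch follows the table literally; the runner itself may be more lenient or may check other things. The function name `validate_dataset` and the message texts are hypothetical.

```python
SOURCES = {"synthetic", "csv"}
SYNTHETIC_TYPES = {"classification", "blobs", "regression"}


def validate_dataset(dataset):
    """Return a list of problems found in a dataset object (empty if none)."""
    problems = []
    source = dataset.get("source")
    if source not in SOURCES:
        problems.append("source must be 'synthetic' or 'csv'")
    if source == "synthetic":
        # type and n_features are only required for synthetic data.
        if dataset.get("type") not in SYNTHETIC_TYPES:
            problems.append("type must be classification, blobs or regression")
        if not isinstance(dataset.get("n_features"), int):
            problems.append("n_features must be an integer")
    for key in ("training", "testing"):
        if key not in dataset:
            problems.append(key + " is required")
    return problems


dataset = {
    "source": "synthetic",
    "type": "blobs",
    "n_clusters": 10,
    "n_features": 20,
    "training": {"n_samples": 1000},
}
print(validate_dataset(dataset))  # ['testing is required']
```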

### Training Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
| n_samples | int | The total number of the training points |
| x | str | The path to the training samples |
| y | str | The path to the training labels |

### Testing Object

| Field Name | Type | Description |
| ----- | ---- |------------ |
| n_samples | int | The total number of the testing points |
| x | str | The path to the testing samples |
| y | str | The path to the testing labels |
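
Putting the objects together, a complete configuration might look like the sketch below, built as a Python dictionary and serialized to JSON. All values are illustrative. In particular, passing algorithm parameters as extra keys of the case object (here `n-clusters`) is an assumption, since the tables above do not spell out how benchmark parameters are written.

```python
import json

config = {
    "common": {
        "lib": ["sklearn"],
        "data-format": ["numpy"],
        "data-order": ["C"],
        "dtype": ["float64"],
    },
    "cases": [
        {
            "lib": ["sklearn"],
            "algorithm": "kmeans",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "blobs",
                    "n_clusters": 10,
                    "n_features": 20,
                    "training": {"n_samples": 10000},
                    "testing": {"n_samples": 1000},
                }
            ],
            # Assumption: algorithm parameters are extra keys of the case object.
            "n-clusters": [10],
        }
    ],
}

text = json.dumps(config, indent=4)
print(text)
```

Saving `text` to a file and passing its path to the runner's `config` option would then select this configuration.
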
File renamed without changes.
151 changes: 151 additions & 0 deletions cuml/README.md
@@ -0,0 +1,151 @@

## How to create conda environment for benchmarking
`conda create -n skl_bench -c rapidsai -c conda-forge python=3.7 cuml pandas cudf`

## Algorithms parameters

You can launch benchmarks for each algorithm separately. The tables below list all supported parameters for each algorithm:

- [General](#general)
- [DBSCAN](#dbscan)
- [RandomForestClassifier](#randomforestclassifier)
- [RandomForestRegressor](#randomforestregressor)
- [pairwise_distances](#pairwise_distances)
- [KMeans](#kmeans)
- [KNeighborsClassifier](#kneighborsclassifier)
- [LinearRegression](#linearregression)
- [LogisticRegression](#logisticregression)
- [PCA](#pca)
- [Ridge Regression](#ridge)
- [SVC](#svc)
- [train_test_split](#train_test_split)

#### General
| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
|num-threads|int|-1| The number of threads to use|
|arch|str|?|Machine architecture, for bookkeeping|
|batch|str|?|Batch ID, for bookkeeping|
|prefix|str|sklearn|Prefix string, for bookkeeping|
|header|action|False|Output CSV header|
|verbose|action|False|Output extra debug messages|
|data-format|str|numpy|Data formats: *numpy*, *pandas* or *cudf*|
|data-order|str|C|Data order: C (row-major, default) or F (column-major)|
|dtype|np.dtype|np.float64|Data type: *float64* (default) or *float32*|
|check-finiteness|action|False|Check finiteness in the sklearn input check (disabled by default)|
|output-format|str|csv|Output format: *csv* (default) or *json*|
|time-method|str|mean_min|Method used for time measurements|
|box-filter-measurements|int|100|Maximum number of measurements in box filter|
|inner-loops|int|100|Maximum inner loop iterations. (we take the mean over inner iterations)|
|outer-loops|int|100|Maximum outer loop iterations. (we take the min over outer iterations)|
|time-limit|float|10|Target time to spend to benchmark|
|goal-outer-loops|int|10|The number of outer loops to aim for when automatically picking the number of inner loops. If zero, the number of inner loops is not chosen automatically|
|seed|int|12345|Seed to pass as random_state|
|dataset-name|str|None|Dataset name|
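
The descriptions of `inner-loops` and `outer-loops` suggest how the default `mean_min` time method works: average the time over the inner iterations, then take the minimum over the outer iterations. The following is a minimal sketch of that idea only. It ignores `time-limit`, `goal-outer-loops`, and the box filter, and the function name `measure_mean_min` and its `clock` parameter are hypothetical.

```python
import time


def measure_mean_min(func, inner_loops=100, outer_loops=100, clock=time.perf_counter):
    """Return the minimum over outer loops of the mean time per inner call."""
    best = float("inf")
    for _ in range(outer_loops):
        start = clock()
        for _ in range(inner_loops):
            func()
        # Mean time of one call within this outer iteration.
        mean_time = (clock() - start) / inner_loops
        best = min(best, mean_time)
    return best


elapsed = measure_mean_min(lambda: sum(range(1000)), inner_loops=10, outer_loops=5)
print(f"{elapsed:.2e} seconds per call")
```

Taking the mean over inner iterations amortizes timer overhead for fast calls, while the minimum over outer iterations discards runs disturbed by other activity on the machine.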


#### DBSCAN
| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| epsilon | float | 10 | Radius of neighborhood of a point|
| min_samples | int | 5 | The minimum number of samples required in a neighborhood to consider a point a core point |

#### RandomForestClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
|split-algorithm|str|hist|*hist* or *global_quantile*. The algorithm to determine how nodes are split in the tree|
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on features used at each split |
| max-depth | int | None | Upper bound on depth of constructed trees |
| min-samples-split | float_or_int | 2 | Minimum samples number for node splitting |
| max-leaf-nodes | int | None | Maximum leaf nodes per tree |
| min-impurity-decrease | float | 0 | Needed impurity decrease for node splitting |
| no-bootstrap | store_false | True | Don't control bootstrapping |

#### RandomForestRegressor

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
|split-algorithm|str|hist|*hist* or *global_quantile*. The algorithm to determine how nodes are split in the tree|
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on features used at each split |
| max-depth | int | None | Upper bound on depth of constructed trees |
| min-samples-split | float_or_int | 2 | Minimum samples number for node splitting |
| max-leaf-nodes | int | None | Maximum leaf nodes per tree |
| min-impurity-decrease | float | 0 | Needed impurity decrease for node splitting |
| no-bootstrap | action | True | Don't control bootstrapping |

#### KMeans

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| init | str | | Initial clusters |
| tol | float | 0 | Absolute threshold |
| maxiter | int | 100 | Maximum number of iterations |
| samples-per-batch | int | 32768 | The number of samples per batch |
| n-clusters | int | | The number of clusters |

#### KNeighborsClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| n-neighbors | int | 5 | The number of neighbors to use |
| weights | str | uniform | Weight function used in prediction |
| method | str | brute | Algorithm used to compute the nearest neighbors |
| metric | str | euclidean | Distance metric to use |

#### LinearRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
| solver | str | eig | *eig* or *svd*. Solver used for training |

#### LogisticRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| no-fit-intercept | action | True | Don't fit intercept|
| solver | str | qn | *qn*, *owl*. Solver to use|
| maxiter | int | 100 | Maximum iterations for the iterative solver |
| C | float | 1.0 | Regularization parameter |
| tol | float | None | Tolerance for solver |

#### PCA

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| svd-solver | str | full | *auto*, *full* or *jacobi*. SVD solver to use |
| n-components | int | None | The number of components to find |
| whiten | action | False | Perform whitening |

#### Ridge

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
| solver | str | eig | *eig*, *cd* or *svd*. Solver used for training |
| alpha | float | 1.0 | Regularization strength |

#### SVC

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| C | float | 0.01 | SVM slack parameter |
| kernel | str | linear | *linear* or *rbf*. SVM kernel function |
| gamma | float | None | Parameter for kernel="rbf" |
| maxiter | int | 2000 | Maximum iterations for the iterative solver |
| max-cache-size | int | 64 | Maximum cache size for SVM. |
| tol | float | 1e-16 | Tolerance passed to sklearn.svm.SVC |
| no-shrinking | action | True | Don't use shrinking heuristic |

#### train_test_split

| Parameter Name | Type | Default Value | Description |
| ----- | ---- |---- |---- |
| train-size | float | 0.75 | Size of training subset |
| test-size | float | 0.25 | Size of testing subset |
| do-not-shuffle | action | False | Do not perform data shuffle before splitting |
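
To illustrate what the three parameters above mean, here is a plain-Python sketch that splits row indices into training and testing subsets. It is not the cuML or scikit-learn implementation, whose rounding and shuffling details may differ. The function name `split_indices` is hypothetical, and the seed reuses the default from the general parameters table.

```python
import random


def split_indices(n_samples, train_size=0.75, test_size=0.25, shuffle=True, seed=12345):
    """Return (train, test) index lists for a dataset of n_samples rows."""
    indices = list(range(n_samples))
    if shuffle:
        # A seeded generator keeps the split reproducible.
        random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_size)
    n_test = int(n_samples * test_size)
    return indices[:n_train], indices[n_train:n_train + n_test]


# shuffle=False corresponds to passing do-not-shuffle.
train, test = split_indices(100, shuffle=False)
print(len(train), len(test))  # 75 25
```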