IntelPython · PivovarA · Jul 16, 2020 · Jun 10, 2020 · Jun 22, 2020 · Jun 24, 2020
diff --git a/README.md b/README.md
@@ -1,18 +1,92 @@
+
 # scikit-learn_bench
 
-Benchmark for optimizations to scikit-learn in the Intel(R) Distribution for
-Python*. See benchmark results [here](https://intelpython.github.io/scikit-learn_bench).
+**scikit-learn_bench** benchmarks various implementations of machine learning algorithms across data analytics frameworks. It can be extended to add new frameworks and algorithms. It currently supports the [scikit-learn](https://scikit-learn.org/), [DAAL4PY](https://intelpython.github.io/daal4py/), [cuML](https://github.com/rapidsai/cuml), and [XGBoost](https://github.com/dmlc/xgboost) frameworks for commonly used [machine learning algorithms](#supported-algorithms).
+
+See benchmark results [here](https://intelpython.github.io/scikit-learn_bench).
+
+
+## Table of contents
+
+* [Prerequisites](#prerequisites)
+* [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
+* [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
+* [Supported algorithms](#supported-algorithms)
+* [Algorithms parameters](#algorithms-parameters)
+* [Legacy automatic building and running](#legacy-automatic-building-and-running)
 
 ## Prerequisites
-- python and scikit-learn to run python versions
+- `python` and `scikit-learn` to run python versions
 - pandas when using its DataFrame as input data format
 - `icc`, `ifort`, `mkl`, `daal` to compile and run native benchmarks
+- the machine learning frameworks that you want to test. See [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking) for details on setting up the environment.
 
 ## How to create conda environment for benchmarking
-`conda create -n skl_bench -c intel python=3.7 scikit-learn pandas`
+
+Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.
+
+* [**scikit-learn**](https://github.com/PivovarA/scikit-learn_bench/blob/master/sklearn/README.md#how-to-create-conda-environment-for-benchmarking)
+* [**daal4py**](https://github.com/PivovarA/scikit-learn_bench/blob/master/daal4py/README.md#how-to-create-conda-environment-for-benchmarking)
+* [**cuml**](https://github.com/PivovarA/scikit-learn_bench/blob/master/cuml/README.md#how-to-create-conda-environment-for-benchmarking)
+* [**xgboost**](https://github.com/PivovarA/scikit-learn_bench/tree/master/xgboost/README.md#how-to-create-conda-environment-for-benchmarking)
+
 
 ## Running Python benchmarks with runner script
-`python runner.py --config config_example.json [--output-format json --verbose]`
+
+Run `python runner.py --config configs/config_example.json [--output-format json --verbose]` to launch benchmarks.
+
+Runner options:
+* ``config`` : the path to the configuration file
+* ``dummy-run`` : run the configuration parser and dataset generation without running the benchmarks
+* ``verbose`` : print additional information while the benchmarks run
+* ``output-format`` : *json* or *csv*. The output format that the benchmarks use when launched by the runner
+
+Benchmarks currently support the following frameworks:
+* **scikit-learn**
+* **daal4py**
+* **cuml**
+* **xgboost**
+
+The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.
+
+You can configure benchmarks by editing a config file. Check the [config.json schema](https://github.com/PivovarA/scikit-learn_bench/blob/master/configs/README.md) for more details.
+
+## Supported algorithms
+
+| algorithm  | benchmark name | sklearn | daal4py | cuml | xgboost |
+|---|---|---|---|---|---|
+|**[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)**|dbscan|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
+|**[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**|df_clfs|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
+|**[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)**|df_regr|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
+|**[pairwise_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)**|distances|:white_check_mark:|:white_check_mark:|:x:|:x:|
+|**[KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)**|kmeans|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
+|**[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**|knn_clsf|:white_check_mark:|:x:|:white_check_mark:|:x:|
+|**[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)**|linear|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
+|**[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)**|log_reg|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
+|**[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)**|pca|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
+|**[Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)**|ridge|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
+|**[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)**|svm|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
+|**[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**|train_test_split|:white_check_mark:|:x:|:white_check_mark:|:x:|
+|**[GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)**|gbt|:x:|:x:|:x:|:white_check_mark:|
+|**[GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)**|gbt|:x:|:x:|:x:|:white_check_mark:|
+
+##  Algorithms parameters
+
+You can launch benchmarks for each algorithm separately.
+To do this, go to the directory with the benchmark:
+
+    cd <framework>
+
+Run the following command:
+
+    python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
+
+You can find the list of supported parameters for each algorithm here:
+
+* [**scikit-learn**](https://github.com/PivovarA/scikit-learn_bench/blob/master/sklearn/README.md#algorithms-parameters)
+* [**daal4py**](https://github.com/PivovarA/scikit-learn_bench/blob/master/daal4py/README.md#algorithms-parameters)
+* [**cuml**](https://github.com/PivovarA/scikit-learn_bench/blob/master/cuml/README.md#algorithms-parameters)
+* [**xgboost**](https://github.com/PivovarA/scikit-learn_bench/tree/master/xgboost/README.md#algorithms-parameters)
 
 ## Legacy automatic building and running
 - Run `make`. This will generate data, compile benchmarks, and run them.
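
The runner invocation documented in this README can also be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_runner_command` is a hypothetical helper, and the `--dummy-run` spelling is an assumption derived from the `dummy-run` option name (only `--config`, `--output-format` and `--verbose` appear verbatim in the documented command).

```python
import shlex


def build_runner_command(config_path, output_format=None, verbose=False, dummy_run=False):
    """Assemble an argv list for runner.py from the options listed in the README."""
    if output_format not in (None, "json", "csv"):
        raise ValueError("output_format must be 'json' or 'csv'")
    argv = ["python", "runner.py", "--config", config_path]
    if output_format is not None:
        argv += ["--output-format", output_format]
    if verbose:
        argv.append("--verbose")
    if dummy_run:
        # Assumed flag spelling, based on the "dummy-run" option name.
        argv.append("--dummy-run")
    return argv


cmd = build_runner_command("configs/config_example.json", output_format="json", verbose=True)
print(shlex.join(cmd))
# python runner.py --config configs/config_example.json --output-format json --verbose
```

The resulting list could be passed to `subprocess.run(cmd, check=True)` from the repository root, presumably after a dummy run has confirmed that the configuration parses.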

diff --git a/configs/README.md b/configs/README.md
@@ -0,0 +1,67 @@
+##  Config JSON Schema
+
+Configure benchmarks by editing the `config.json` file.
+You can configure some algorithm parameters, datasets, a list of frameworks to use, and the usage of some environment variables.
+Refer to the tables below for descriptions of all fields in the configuration file.
+
+- [Root Config Object](#root-config-object)
+- [Common Object](#common-object)
+- [Case Object](#case-object)
+- [Dataset Object](#dataset-object)
+- [Training Object](#training-object)
+- [Testing Object](#testing-object)
+
+###  Root Config Object
+| Field Name  | Type | Description |
+| ----- | ---- |------------ |
+|omp_env| array[string] | For xgboost only. Specify an environment variable to set the number of OpenMP threads |
+|common| [Common Object](#common-object)| **REQUIRED** common benchmark settings: frameworks and input data settings |
+|cases| array[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters and training data |
+
+###  Common Object
+
+| Field Name  | Type | Description |
+| ----- | ---- |------------ |
+|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost* |
+|data-format| array[string] | **REQUIRED** input data format. Data formats: *numpy*, *pandas* or *cudf* |
+|data-order| array[string] | **REQUIRED**  input data order. Data order: *C* (row-major, default) or *F* (column-major) |
+|dtype| array[string] | **REQUIRED**  input data type. Data type: *float64* (default) or *float32* |
+|check-finiteness| array[] | Check finiteness in sklearn input check (disabled by default) |
+
+###  Case Object
+
+| Field Name  | Type | Description |
+| ----- | ---- |------------ |
+|lib| array[string] | **REQUIRED** list of test frameworks. It can be *sklearn*, *daal4py*, *cuml* or *xgboost*|
+|algorithm| string | **REQUIRED** benchmark name |
+|dataset| array[[Dataset Object](#dataset-object)] | **REQUIRED**  input data specifications. |
+|benchmark parameters| array[Any] | **REQUIRED** algorithm parameters. See the *Algorithms parameters* section of each framework's README for the list of supported parameters |
+
+###  Dataset Object
+
+| Field Name  | Type | Description |
+| ----- | ---- |------------ |
+|source| string | **REQUIRED** data source. It can be *synthetic* or *csv* |
+|type| string | **REQUIRED**  for synthetic data only. The type of task for which the dataset is generated. It can be *classification*, *blobs* or *regression* |
+|n_classes| int | For *synthetic* data and for *classification* type only. The number of classes (or labels) of the classification problem |
+|n_clusters| int | For *synthetic* data and for *blobs* type only. The number of centers to generate |
+|n_features| int | **REQUIRED**  For *synthetic* data only. The number of features to generate |
+|name| string | Name of dataset |
+|training| [Training Object](#training-object) | **REQUIRED** training data specification. See [Training Object](#training-object) |
+|testing| [Testing Object](#testing-object) | **REQUIRED** testing data specification. See [Testing Object](#testing-object) |
+
+###  Training Object
+
+| Field Name  | Type | Description |
+| ----- | ---- |------------ |
+| n_samples | int | The total number of the training points |
+| x | str | The path to the training samples |
+| y | str | The path to the training labels |
+
+###  Testing Object
+
+| Field Name  | Type | Description |
+| ----- | ---- |------------ |
+| n_samples | int | The total number of the testing points |
+| x | str | The path to the testing samples |
+| y | str | The path to the testing labels |
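
To make the schema in `configs/README.md` concrete, here is a small sketch that builds a config as a Python dictionary and checks it against the fields marked **REQUIRED** in the tables. It is illustrative only: `missing_fields` and the example values are hypothetical and not taken from `config_example.json`. The tables do not spell out how algorithm parameters are keyed inside a case, so the sketch leaves them out.

```python
import json

# Required fields per object, copied from the schema tables.
REQUIRED = {
    "root": ("common", "cases"),
    "common": ("lib", "data-format", "data-order", "dtype"),
    "case": ("lib", "algorithm", "dataset"),
    "dataset": ("source", "training", "testing"),
}


def missing_fields(config):
    """Return 'path: field' strings for required fields that are absent."""
    problems = []

    def check(obj, kind, path):
        for field in REQUIRED[kind]:
            if field not in obj:
                problems.append(f"{path}: {field}")

    check(config, "root", "root")
    check(config.get("common", {}), "common", "common")
    for i, case in enumerate(config.get("cases", [])):
        check(case, "case", f"cases[{i}]")
        for j, dataset in enumerate(case.get("dataset", [])):
            where = f"cases[{i}].dataset[{j}]"
            check(dataset, "dataset", where)
            # "type" and "n_features" are required for synthetic data only.
            if dataset.get("source") == "synthetic":
                for field in ("type", "n_features"):
                    if field not in dataset:
                        problems.append(f"{where}: {field}")
    return problems


example = {
    "common": {
        "lib": ["sklearn", "daal4py"],
        "data-format": ["numpy"],
        "data-order": ["C"],
        "dtype": ["float64"],
    },
    "cases": [
        {
            "lib": ["sklearn"],
            "algorithm": "kmeans",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "blobs",
                    "n_clusters": 10,
                    "n_features": 50,
                    "training": {"n_samples": 100000},
                    "testing": {"n_samples": 10000},
                }
            ],
        }
    ],
}

print(missing_fields(example))  # []
print(json.dumps(example, indent=4))
```

Writing the `json.dumps` output to a file gives something that can be passed to the runner via `--config`, assuming the remaining algorithm parameters are filled in as the framework READMEs describe.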
diff --git a/config_example.json → configs/config_example.json b/config_example.json → configs/config_example.json
diff --git a/cuml/README.md b/cuml/README.md
@@ -0,0 +1,151 @@
+
+## How to create conda environment for benchmarking
+`conda create -n skl_bench -c rapidsai -c conda-forge python=3.7 cuml pandas cudf`
+
+##  Algorithms parameters
+
+You can launch benchmarks for each algorithm separately. The tables below list all supported parameters for each algorithm:
+
+- [General](#general)
+- [DBSCAN](#dbscan)
+- [RandomForestClassifier](#randomforestclassifier)
+- [RandomForestRegressor](#randomforestregressor)
+- [pairwise_distances](#pairwise_distances)
+- [KMeans](#kmeans)
+- [KNeighborsClassifier](#kneighborsclassifier)
+- [LinearRegression](#linearregression)
+- [LogisticRegression](#logisticregression)
+- [PCA](#pca)
+- [Ridge Regression](#ridge)
+- [SVC](#svc)
+- [train_test_split](#train_test_split)
+
+#### General
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+|num-threads|int|-1| The number of threads to use|
+|arch|str|?|Machine architecture, for bookkeeping|
+|batch|str|?|Batch ID, for bookkeeping|
+|prefix|str|sklearn|Prefix string, for bookkeeping|
+|header|action|False|Output CSV header|
+|verbose|action|False|Output extra debug messages|
+|data-format|str|numpy|Data formats: *numpy*, *pandas* or *cudf*|
+|data-order|str|C|Data order: C (row-major, default) or F (column-major)|
+|dtype|np.dtype|np.float64|Data type: *float64* (default) or *float32*|
+|check-finiteness|action|False|Check finiteness in sklearn input check (disabled by default)|
+|output-format|str|csv|Output format: *csv* (default) or *json*|
+|time-method|str|mean_min|Method used for time measurements|
+|box-filter-measurements|int|100|Maximum number of measurements in box filter|
+|inner-loops|int|100|Maximum inner loop iterations. (we take the mean over inner iterations)|
+|outer-loops|int|100|Maximum outer loop iterations. (we take the min over outer iterations)|
+|time-limit|float|10|Target time to spend to benchmark|
+|goal-outer-loops|int|10|The number of outer loops to aim for while automatically picking the number of inner loops. If zero, do not automatically decide the number of inner loops|
+|seed|int|12345|Seed to pass as random_state|
+|dataset-name|str|None|Dataset name|
+
+
+#### DBSCAN
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| epsilon | float | 10 | Radius of neighborhood of a point|
+| min_samples | int | 5 | The minimum number of samples required in a neighborhood to consider a point a core point |
+
+#### RandomForestClassifier
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
+|split-algorithm|str|hist|*hist* or *global_quantile*. The algorithm to determine how nodes are split in the tree|
+| num-trees | int | 100 | The number of trees in the forest |
+| max-features | float_or_int | None | Upper bound on features used at each split |
+| max-depth | int | None | Upper bound on depth of constructed trees |
+| min-samples-split | float_or_int | 2 | Minimum samples number for node splitting |
+| max-leaf-nodes | int | None | Maximum leaf nodes per tree |
+| min-impurity-decrease | float | 0 | Needed impurity decrease for node splitting |
+| no-bootstrap | store_false | True | Don't control bootstrapping |
+
+#### RandomForestRegressor
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
+|split-algorithm|str|hist|*hist* or *global_quantile*. The algorithm to determine how nodes are split in the tree|
+| num-trees | int | 100 | The number of trees in the forest |
+| max-features | float_or_int | None | Upper bound on features used at each split |
+| max-depth | int | None | Upper bound on depth of constructed trees |
+| min-samples-split | float_or_int | 2 | Minimum samples number for node splitting |
+| max-leaf-nodes | int | None | Maximum leaf nodes per tree |
+| min-impurity-decrease | float | 0 | Needed impurity decrease for node splitting |
+| no-bootstrap | action | True | Don't control bootstrapping |
+
+#### KMeans
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| init | str |  | Initial clusters |
+| tol | float | 0 | Absolute threshold |
+| maxiter | int | 100 | Maximum number of iterations |
+| samples-per-batch | int | 32768 | The number of samples per batch |
+| n-clusters | int |  | The number of clusters |
+
+#### KNeighborsClassifier
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| n-neighbors | int | 5 | The number of neighbors to use |
+| weights | str | uniform | Weight function used in prediction |
+| method | str | brute | Algorithm used to compute the nearest neighbors |
+| metric | str | euclidean | Distance metric to use |
+
+#### LinearRegression
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
+| solver | str | eig | *eig* or *svd*. Solver used for training |
+
+#### LogisticRegression
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| no-fit-intercept | action | True | Don't fit intercept|
+| solver | str | qn | *qn*, *owl*. Solver to use|
+| maxiter | int | 100 | Maximum iterations for the iterative solver |
+| C | float | 1.0 | Regularization parameter |
+| tol | float | None | Tolerance for solver |
+
+#### PCA
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| svd-solver | str | full | *auto*, *full* or *jacobi*. SVD solver to use |
+| n-components | int | None | The number of components to find |
+| whiten | action | False | Perform whitening |
+
+#### Ridge
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| no-fit-intercept | action | True | Don't fit intercept (assume data already centered) |
+| solver | str | eig | *eig*, *cd* or *svd*. Solver used for training |
+| alpha | float | 1.0 | Regularization strength |
+
+#### SVC
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| C | float | 0.01 | SVM slack parameter |
+| kernel | str | linear | *linear* or *rbf*. SVM kernel function |
+| gamma | float | None | Parameter for kernel="rbf" |
+| maxiter | int | 2000 | Maximum iterations for the iterative solver |
+| max-cache-size | int | 64 | Maximum cache size for SVM. |
+| tol | float | 1e-16 | Tolerance passed to sklearn.svm.SVC |
+| no-shrinking | action | True | Don't use shrinking heuristic |
+
+#### train_test_split
+
+| Parameter Name  | Type | Default Value | Description |
+| ----- | ---- |---- |---- |
+| train-size | float | 0.75 | Size of training subset |
+| test-size | float | 0.25 | Size of testing subset |
+| do-not-shuffle | action | False | Do not perform data shuffle before splitting |
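
The General table in the cuml README describes a `mean_min` time method: the mean is taken over inner iterations and the minimum over outer iterations. The sketch below illustrates that idea only. It is an assumption about the approach, not the benchmark's actual implementation: `time_mean_min` is a hypothetical name, and the sketch ignores `goal-outer-loops` and `box-filter-measurements`.

```python
import time


def time_mean_min(func, *args, outer_loops=100, inner_loops=100, time_limit=10.0, **kwargs):
    """Return (best mean time per call in seconds, last result of func)."""
    best = float("inf")
    total = 0.0
    result = None
    for _ in range(outer_loops):
        start = time.perf_counter()
        for _ in range(inner_loops):
            result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        total += elapsed
        # Mean over the inner iterations, minimum over the outer ones.
        best = min(best, elapsed / inner_loops)
        # Stop early once the target time budget has been spent.
        if total > time_limit:
            break
    return best, result


best, result = time_mean_min(sorted, [3, 1, 2], outer_loops=5, inner_loops=10)
print(f"best mean time per call: {best:.3e} s, result: {result}")
```

Taking the minimum over outer loops tends to filter out runs disturbed by other system activity, while the inner mean smooths timer resolution for very short calls. The default values mirror the `outer-loops`, `inner-loops` and `time-limit` defaults in the table.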