Update README.md #29

Merged
14 commits, all by PivovarA:
- e7a216c Update README.md
- 97f1e5e fixed a typo
- 6ff1b75 apply general comments to new Benchmarks Documentation
- 0a09e23 add appropriate links
- e06b9e5 Add info about IDP to sklearn README
- 402b799 highlight code sample
- 30f170a rename supported algorithms
- 5c9b305 Change daal4py env build instruction
- 9e6a147 add ml frameworks to Prerequisites
- aa9cf21 apply Bill comments
- 7e29ed8 Update configs/README.md
- 3c92165 Update README.md
- 5705182 Apply suggestions from code review
- 52e96f2 apply some changes to the tables sells about capital letter, 'The' an…
## Config JSON Schema

Configure benchmarks by editing the `config.json` file.
You can configure some algorithm parameters, datasets, a list of frameworks to use, and the usage of some environment variables.
Refer to the tables below for descriptions of all fields in the configuration file.

- [Root Config Object](#root-config-object)
- [Common Object](#common-object)
- [Case Object](#case-object)
- [Dataset Object](#dataset-object)
- [Training Object](#training-object)
- [Testing Object](#testing-object)

### Root Config Object

| Field Name | Type | Description |
| ----- | ---- | ----------- |
| omp_env | array[string] | For xgboost only. Specifies environment variables that set the number of OpenMP threads |
| common | [Common Object](#common-object) | **REQUIRED** common benchmark settings: frameworks and input data settings |
| cases | array[[Case Object](#case-object)] | **REQUIRED** list of algorithms, their parameters, and training data |

### Common Object

| Field Name | Type | Description |
| ----- | ---- | ----------- |
| lib | array[string] | **REQUIRED** list of frameworks to test: *sklearn*, *daal4py*, *cuml*, or *xgboost* |
| data-format | array[string] | **REQUIRED** input data format: *numpy*, *pandas*, or *cudf* |
| data-order | array[string] | **REQUIRED** input data order: *C* (row-major, default) or *F* (column-major) |
| dtype | array[string] | **REQUIRED** input data type: *float64* (default) or *float32* |
| check-finitness | array[] | Check finiteness in the sklearn input check (disabled by default) |

### Case Object

| Field Name | Type | Description |
| ----- | ---- | ----------- |
| lib | array[string] | **REQUIRED** list of frameworks to test: *sklearn*, *daal4py*, *cuml*, or *xgboost* |
| algorithm | string | **REQUIRED** benchmark name |
| dataset | array[[Dataset Object](#dataset-object)] | **REQUIRED** input data specifications |
| benchmark parameters | array[Any] | **REQUIRED** algorithm parameters. A list of supported parameters can be found here |

### Dataset Object

| Field Name | Type | Description |
| ----- | ---- | ----------- |
| source | string | **REQUIRED** data source: *synthetic* or *csv* |
| type | string | **REQUIRED** for synthetic data only. The type of task for which the dataset is generated: *classification*, *blobs*, or *regression* |
| n_classes | int | For *synthetic* data of *classification* type only. The number of classes (labels) of the classification problem |
| n_clusters | int | For *synthetic* data of *blobs* type only. The number of centers to generate |
| n_features | int | **REQUIRED** for *synthetic* data only. The number of features to generate |
| name | string | Name of the dataset |
| training | [Training Object](#training-object) | **REQUIRED** training set specification |
| testing | [Testing Object](#testing-object) | **REQUIRED** testing set specification |

### Training Object

| Field Name | Type | Description |
| ----- | ---- | ----------- |
| n_samples | int | The total number of training points |
| x | str | The path to the training samples |
| y | str | The path to the training labels |

### Testing Object

| Field Name | Type | Description |
| ----- | ---- | ----------- |
| n_samples | int | The total number of testing points |
| x | str | The path to the testing samples |
| y | str | The path to the testing labels |
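
As an illustration of how the objects above nest, here is a hypothetical `config.json` sketch. The algorithm name, its parameter key, and the dataset name are invented for the example; consult the benchmark implementation for the exact spelling of supported keys:

```json
{
    "common": {
        "lib": ["sklearn"],
        "data-format": ["pandas"],
        "data-order": ["F"],
        "dtype": ["float64"]
    },
    "cases": [
        {
            "lib": ["sklearn"],
            "algorithm": "kmeans",
            "n-clusters": [10],
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "blobs",
                    "n_clusters": 10,
                    "n_features": 50,
                    "name": "example_blobs",
                    "training": { "n_samples": 100000 }
                }
            ]
        }
    ]
}
```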
File renamed without changes.
## How to create a conda environment for benchmarking

`conda create -n skl_bench -c rapidsai -c conda-forge python=3.7 cuml pandas cudf`
## Algorithm parameters

You can launch benchmarks for each algorithm separately. The tables below list all supported parameters for each algorithm:

- [General](#general)
- [DBSCAN](#dbscan)
- [RandomForestClassifier](#randomforestclassifier)
- [RandomForestRegressor](#randomforestregressor)
- [pairwise_distances](#pairwise_distances)
- [KMeans](#kmeans)
- [KNeighborsClassifier](#kneighborsclassifier)
- [LinearRegression](#linearregression)
- [LogisticRegression](#logisticregression)
- [PCA](#pca)
- [Ridge Regression](#ridge)
- [SVC](#svc)
- [train_test_split](#train_test_split)
#### General

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| num-threads | int | -1 | The number of threads to use |
| arch | str | ? | Machine architecture, for bookkeeping |
| batch | str | ? | Batch ID, for bookkeeping |
| prefix | str | sklearn | Prefix string, for bookkeeping |
| header | action | False | Output CSV header |
| verbose | action | False | Output extra debug messages |
| data-format | str | numpy | Data format: *numpy*, *pandas*, or *cudf* |
| data-order | str | C | Data order: *C* (row-major, default) or *F* (column-major) |
| dtype | np.dtype | np.float64 | Data type: *float64* (default) or *float32* |
| check-finiteness | action | False | Check finiteness in the sklearn input check (disabled by default) |
| output-format | str | csv | Output format: *csv* (default) or *json* |
| time-method | str | mean_min | Method used for time measurements |
| box-filter-measurements | int | 100 | Maximum number of measurements in the box filter |
| inner-loops | int | 100 | Maximum inner loop iterations (the mean is taken over inner iterations) |
| outer-loops | int | 100 | Maximum outer loop iterations (the min is taken over outer iterations) |
| time-limit | float | 10 | Target time to spend on the benchmark |
| goal-outer-loops | int | 10 | The number of outer loops to aim for while automatically picking the number of inner loops. If zero, the number of inner loops is not decided automatically |
| seed | int | 12345 | Seed passed as random_state |
| dataset-name | str | None | Dataset name |
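
The `action` type in the table denotes boolean flags that are off unless passed. A minimal sketch of how a few of the general flags could be declared with `argparse` (this is not the benchmark's actual parser; the flag names simply mirror the table):

```python
import argparse

# Hypothetical parser mirroring a subset of the General table above.
parser = argparse.ArgumentParser(description="benchmark runner (sketch)")
parser.add_argument("--num-threads", type=int, default=-1,
                    help="number of threads to use (-1 = all available)")
parser.add_argument("--data-order", choices=["C", "F"], default="C",
                    help="row-major (C) or column-major (F) input data")
parser.add_argument("--dtype", choices=["float64", "float32"],
                    default="float64", help="input data type")
parser.add_argument("--verbose", action="store_true",
                    help="output extra debug messages")
parser.add_argument("--box-filter-measurements", type=int, default=100,
                    help="maximum number of measurements in the box filter")

# Parse an example command line instead of sys.argv.
args = parser.parse_args(["--data-order", "F", "--verbose"])
print(args.data_order, args.dtype, args.verbose)  # F float64 True
```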
#### DBSCAN

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| epsilon | float | 10 | Radius of the neighborhood of a point |
| min_samples | int | 5 | The minimum number of samples required in a neighborhood to consider a point a core point |
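
These two knobs correspond to `eps` and `min_samples` in scikit-learn's `DBSCAN`; a standalone illustration on synthetic blobs (the data-generation settings are arbitrary):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Three tight synthetic blobs; arbitrary illustrative settings.
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.5,
                  random_state=12345)

# epsilon -> eps, min_samples -> min_samples in scikit-learn's estimator.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Label -1 marks noise points, so exclude it when counting clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```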
#### RandomForestClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
| split-algorithm | str | hist | *hist* or *global_quantile*. The algorithm used to determine how nodes are split in the tree |
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on the number of features used at each split |
| max-depth | int | None | Upper bound on the depth of constructed trees |
| min-samples-split | float_or_int | 2 | Minimum number of samples required to split a node |
| max-leaf-nodes | int | None | Maximum number of leaf nodes per tree |
| min-impurity-decrease | float | 0 | Impurity decrease required to split a node |
| no-bootstrap | store_false | True | Disable bootstrapping |
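
For the sklearn backend, most of these flags map directly onto `RandomForestClassifier` arguments (num-trees → `n_estimators`, max-depth → `max_depth`, no-bootstrap → `bootstrap=False`); `split-algorithm` is specific to GPU backends and has no sklearn equivalent. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Arbitrary synthetic classification task for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=12345)

# Table defaults expressed in scikit-learn terms.
clf = RandomForestClassifier(n_estimators=100, criterion="gini",
                             max_depth=None, min_samples_split=2,
                             bootstrap=True, random_state=12345)
clf.fit(X, y)
print("training accuracy:", round(clf.score(X, y), 3))
```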
#### RandomForestRegressor

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| criterion | str | gini | *gini* or *entropy*. The function to measure the quality of a split |
| split-algorithm | str | hist | *hist* or *global_quantile*. The algorithm used to determine how nodes are split in the tree |
| num-trees | int | 100 | The number of trees in the forest |
| max-features | float_or_int | None | Upper bound on the number of features used at each split |
| max-depth | int | None | Upper bound on the depth of constructed trees |
| min-samples-split | float_or_int | 2 | Minimum number of samples required to split a node |
| max-leaf-nodes | int | None | Maximum number of leaf nodes per tree |
| min-impurity-decrease | float | 0 | Impurity decrease required to split a node |
| no-bootstrap | action | True | Disable bootstrapping |
#### KMeans

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| init | str | | Initial clusters |
| tol | float | 0 | Absolute threshold |
| maxiter | int | 100 | Maximum number of iterations |
| samples-per-batch | int | 32768 | The number of samples per batch |
| n-clusters | int | | The number of clusters |
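
In sklearn terms, n-clusters, maxiter, and tol correspond to `n_clusters`, `max_iter`, and `tol` of `KMeans` (samples-per-batch applies to batched GPU implementations only). A minimal standalone illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Ten well-separated synthetic blobs; settings are illustrative.
X, _ = make_blobs(n_samples=1000, centers=10, n_features=50,
                  random_state=12345)

# Table knobs expressed as scikit-learn arguments.
km = KMeans(n_clusters=10, init="k-means++", max_iter=100, tol=0,
            n_init=10, random_state=12345)
km.fit(X)
print("iterations run:", km.n_iter_)
```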
#### KNeighborsClassifier

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| n-neighbors | int | 5 | The number of neighbors to use |
| weights | str | uniform | Weight function used in prediction |
| method | str | brute | Algorithm used to compute the nearest neighbors |
| metric | str | euclidean | Distance metric to use |
#### LinearRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| no-fit-intercept | action | True | Don't fit the intercept (assume the data is already centered) |
| solver | str | eig | *eig* or *svd*. Solver used for training |
#### LogisticRegression

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| no-fit-intercept | action | True | Don't fit the intercept |
| solver | str | qn | *qn* or *owl*. Solver to use |
| maxiter | int | 100 | Maximum number of iterations for the iterative solver |
| C | float | 1.0 | Regularization parameter |
| tol | float | None | Tolerance for the solver |
#### PCA

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| svd-solver | str | full | *auto*, *full*, or *jacobi*. SVD solver to use |
| n-components | int | None | The number of components to find |
| whiten | action | False | Perform whitening |
#### Ridge

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| no-fit-intercept | action | True | Don't fit the intercept (assume the data is already centered) |
| solver | str | eig | *eig*, *cd*, or *svd*. Solver used for training |
| alpha | float | 1.0 | Regularization strength |
#### SVC

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| C | float | 0.01 | SVM slack parameter |
| kernel | str | linear | *linear* or *rbf*. SVM kernel function |
| gamma | float | None | Parameter for kernel="rbf" |
| maxiter | int | 2000 | Maximum number of iterations for the iterative solver |
| max-cache-size | int | 64 | Maximum cache size for SVM |
| tol | float | 1e-16 | Tolerance passed to sklearn.svm.SVC |
| no-shrinking | action | True | Don't use the shrinking heuristic |
#### train_test_split

| Parameter Name | Type | Default Value | Description |
| ----- | ---- | ---- | ---- |
| train-size | float | 0.75 | Size of the training subset |
| test-size | float | 0.25 | Size of the testing subset |
| do-not-shuffle | action | False | Do not shuffle the data before splitting |
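
These flags correspond to the `train_size`, `test_size`, and `shuffle` arguments of scikit-learn's `train_test_split`; with the table defaults on 1000 samples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Arbitrary synthetic data; only the split sizes matter here.
X, y = make_classification(n_samples=1000, random_state=12345)

# Table defaults: train-size 0.75, test-size 0.25, shuffling enabled
# (do-not-shuffle would translate to shuffle=False).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, test_size=0.25, shuffle=True, random_state=12345)
print(len(X_train), len(X_test))  # 750 250
```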