
Datasets used for producing benchmarks in scikit-learn intelex #135

Open
vineel96 opened this issue May 4, 2023 · 7 comments

Comments


vineel96 commented May 4, 2023

Hello,
Can I get information on the datasets used to produce the benchmark results (speedup values) for the different scikit-learn algorithms, as shown in the figure under the Acceleration subsection at https://github.com/intel/scikit-learn-intelex? The image is also attached here:
[Image: scikit-learn-acceleration-2021 benchmark chart]

@Alexsandruss (Contributor) commented

Datasets are specified in this config: https://github.com/IntelPython/scikit-learn_bench/blob/master/configs/skl_config.json
Data generation/loading functions are defined here: https://github.com/IntelPython/scikit-learn_bench/tree/master/datasets
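For readers unfamiliar with those data generation functions, here is a minimal sketch of generating a synthetic clustering dataset with sklearn's make_blobs, similar in spirit to the repository's make_datasets.py. The sample counts, feature count, cluster count, and random seed below are illustrative assumptions, not the benchmark's actual settings:

```python
# Illustrative synthetic dataset generation (not the benchmark's script).
from sklearn.datasets import make_blobs

n_samples, n_features, n_clusters = 10_000, 50, 10
X, y = make_blobs(
    n_samples=n_samples,
    n_features=n_features,
    centers=n_clusters,   # number of Gaussian blobs
    random_state=42,      # fixed seed for reproducibility
)
print(X.shape, y.shape)   # (10000, 50) (10000,)
```

The benchmark configs pin dataset shapes and seeds in JSON so the same arrays can be regenerated for every run.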


vineel96 commented May 5, 2023

Hi,
Thank you for the links. So all experiments in the figure were done with synthetic datasets generated from sklearn's make_blobs (except for SVC and RF, where the dataset is named), using this script: https://github.com/IntelPython/scikit-learn_bench/blob/master/datasets/make_datasets.py — right?

@Alexsandruss (Contributor) commented

> Hi, Thank you for the links. So, all experiments in figure are done with synthetic datasets generated from sklearn's make_blobs (except for SVC and RF where dataset is mentioned) using this script https://github.com/IntelPython/scikit-learn_bench/blob/master/datasets/make_datasets.py right?

Yes, that's right.


vineel96 commented May 8, 2023

Thanks for the information

@vineel96 vineel96 closed this as completed May 8, 2023
@vineel96 vineel96 reopened this May 22, 2023
@vineel96 (Author) commented

Hi @Alexsandruss,
For the inference:

  1. Which data is used for k-means? (There is no "testing" attribute for kmeans in skl_config.json.)
  2. For kNN, are the training and testing samples generated separately, or are the training samples themselves reused for testing?
  3. For knn-kdt, linear regression, and ridge regression, no testing data info is provided, so which data is used for inference?
  4. For random forest and SVC, no info is provided about the train/test split. Which data is used for inference?
  5. Why is the DBSCAN algorithm not shown in the inference speedup graph?

@Alexsandruss (Contributor) commented

1-4. If the 'testing' field is not provided, then the same data is used for training and inference. The train/test split is defined in the data loaders for named datasets.
5. sklearn's DBSCAN doesn't have a separate function for inference.
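Point 5 can be verified directly against the scikit-learn API: DBSCAN exposes fit()/fit_predict() but no predict() for unseen data, whereas an algorithm like KMeans does. A toy check (the data here is illustrative, not the benchmark dataset):

```python
# DBSCAN labels only the data it was fit on; it has no predict() method,
# so there is no separate "inference" step to benchmark.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

labels = DBSCAN(eps=0.5).fit_predict(X)   # clustering and labeling in one call
print(hasattr(DBSCAN(), "predict"))       # False: no inference entry point
print(hasattr(KMeans(), "predict"))       # True: KMeans can label new points
```

This is why a training-time speedup can be reported for DBSCAN but an inference-time speedup cannot.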


vineel96 commented May 22, 2023

Hi @Alexsandruss,
1-4. Generally we use different data for inference and training, right? Is it OK to use the same training data for inference as well?
For named datasets, e.g. higgs_one_m for random forest, the speedup graph above shows the data size as 1M for both the inference graph and the training graph. But loader_classification.py (in the datasets folder) shows a different split: (1000000, 28) for training and (500000, 28) for inference. So which split is actually used in the inference speedup graph? (The same question applies to all named datasets.)
5. So which function is used for DBSCAN in the training speedup graph, fit() or fit_predict()?
6. For kNN with a k-d tree, there is no fit() function. So in the training speedup graph, is only the object creation KDTree() timed, or something else? Also, which function is used for inference? Is tree.query() used?
7. Can you also provide the parameter values used for each algorithm when generating the speedup graph above, e.g. for SVC and RF? I see parameter info for the other algorithms in skl_config.json.
8. Also, what are "time_method" and "time_limit" for kmeans in the skl_config.json file? And does n_clusters there refer to the initial number of clusters?
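On question 6, the scikit-learn API itself suggests the natural split, independent of how the benchmark harness actually times it: sklearn.neighbors.KDTree builds the tree in its constructor (there is no fit()), and KDTree.query() performs the neighbor lookup. A minimal sketch (sizes and k are illustrative assumptions):

```python
# Tree construction maps naturally to the "training" phase, and
# query() to the "inference" phase; whether the benchmark times it
# this way is exactly what the question above asks.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 8))

tree = KDTree(X, leaf_size=40)      # build phase: constructor, no fit()
dist, ind = tree.query(X[:5], k=3)  # lookup phase: k nearest neighbors
print(dist.shape, ind.shape)        # (5, 3) (5, 3)
```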
