
Datasets used for producing benchmarks in scikit-learn intelex #135

Open
vineel96 opened this issue May 4, 2023 · 7 comments

Comments


vineel96 commented May 4, 2023

Hello,
Can I get information on the datasets used to produce the benchmark results (speedup values) for the different scikit-learn algorithms, as shown in the figure under the Acceleration subsection at https://github.com/intel/scikit-learn-intelex? The image is also attached here:
[Image: scikit-learn-acceleration-2021 benchmark chart]

@Alexsandruss (Contributor) commented

Datasets are specified in this config: https://github.com/IntelPython/scikit-learn_bench/blob/master/configs/skl_config.json
Data generation/loading functions are defined here: https://github.com/IntelPython/scikit-learn_bench/tree/master/datasets
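For readers unfamiliar with those data generation functions, here is a minimal sketch of generating a synthetic clustering dataset with sklearn's make_blobs, similar in spirit to the repository's make_datasets.py. The sample counts, feature count, cluster count, and random seed below are illustrative assumptions, not the benchmark's actual settings:

```python
# Illustrative synthetic dataset generation (not the benchmark's script).
from sklearn.datasets import make_blobs

n_samples, n_features, n_clusters = 10_000, 50, 10
X, y = make_blobs(
    n_samples=n_samples,
    n_features=n_features,
    centers=n_clusters,   # number of Gaussian blobs
    random_state=42,      # fixed seed for reproducibility
)
print(X.shape, y.shape)   # (10000, 50) (10000,)
```

The benchmark configs pin dataset shapes and seeds in JSON so the same arrays can be regenerated for every run.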


vineel96 commented May 5, 2023

Hi,
Thank you for the links. So all experiments in the figure were done with synthetic datasets generated from sklearn's make_blobs (except for SVC and RF, where the dataset is named), using this script: https://github.com/IntelPython/scikit-learn_bench/blob/master/datasets/make_datasets.py — right?

@Alexsandruss (Contributor) commented

> Hi, Thank you for the links. So, all experiments in figure are done with synthetic datasets generated from sklearn's make_blobs (except for SVC and RF where dataset is mentioned) using this script https://github.com/IntelPython/scikit-learn_bench/blob/master/datasets/make_datasets.py right?

Yes, that's right.


vineel96 commented May 8, 2023

Thanks for the information

@vineel96 vineel96 closed this as completed May 8, 2023
@vineel96 vineel96 reopened this May 22, 2023
@vineel96 (Author) commented

Hi @Alexsandruss,
For the inference:

  1. Which data is used for k-means? (There is no "testing" attribute for kmeans in skl_config.json.)
  2. For kNN, are the training and testing samples generated separately, or are the training samples themselves reused for testing?
  3. For knn-kdt, linear regression, and ridge regression, no testing data info is provided, so which data is used for inference?
  4. For random forest and SVC, no info is provided about the train/test split. Which data is used for inference?
  5. Why is the DBSCAN algorithm not shown in the inference speedup graph?

@Alexsandruss (Contributor) commented

1-4. If the 'testing' field is not provided, then the same data is used for training and inference. The train/test split is defined in the data loaders for named datasets.
5. sklearn's DBSCAN doesn't have a separate function for inference.
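Point 5 can be verified directly against the scikit-learn API: DBSCAN exposes fit()/fit_predict() but no predict() for unseen data, whereas an algorithm like KMeans does. A toy check (the data here is illustrative, not the benchmark dataset):

```python
# DBSCAN labels only the data it was fit on; it has no predict() method,
# so there is no separate "inference" step to benchmark.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

labels = DBSCAN(eps=0.5).fit_predict(X)   # clustering and labeling in one call
print(hasattr(DBSCAN(), "predict"))       # False: no inference entry point
print(hasattr(KMeans(), "predict"))       # True: KMeans can label new points
```

This is why a training-time speedup can be reported for DBSCAN but an inference-time speedup cannot.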


vineel96 commented May 22, 2023

Hi @Alexsandruss,
1-4. Generally we use different data for inference and training, right? Is it OK to use the same training data for inference as well?
For named datasets, e.g. higgs_one_m for random forest, the speedup graph above shows the data size as 1M for both the inference graph and the training graph. But loader_classification.py (in the datasets folder) shows a different split: (1000000, 28) for training and (500000, 28) for inference. So which split is actually used in the inference speedup graph? (The same question applies to all named datasets.)
5. So which function is used for DBSCAN in the training speedup graph, fit() or fit_predict()?
6. For kNN with a k-d tree, there is no fit() function. So in the training speedup graph, is only the object creation KDTree() timed, or something else? Also, which function is used for inference? Is tree.query() used?
7. Can you also provide the parameter values used for each algorithm when generating the speedup graph above, e.g. for SVC and RF? I see parameter info for the other algorithms in skl_config.json.
8. Also, what are "time_method" and "time_limit" for kmeans in the skl_config.json file? And does n_clusters there refer to the initial number of clusters?
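On question 6, the scikit-learn API itself suggests the natural split, independent of how the benchmark harness actually times it: sklearn.neighbors.KDTree builds the tree in its constructor (there is no fit()), and KDTree.query() performs the neighbor lookup. A minimal sketch (sizes and k are illustrative assumptions):

```python
# Tree construction maps naturally to the "training" phase, and
# query() to the "inference" phase; whether the benchmark times it
# this way is exactly what the question above asks.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 8))

tree = KDTree(X, leaf_size=40)      # build phase: constructor, no fit()
dist, ind = tree.query(X[:5], k=3)  # lookup phase: k nearest neighbors
print(dist.shape, ind.shape)        # (5, 3) (5, 3)
```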
