Skip to content

This page is dedicated for the Data Governance in Data Lakes project by the DTIM research group of the ESSI department, UPC. We provide here the main datasets and sources for experimental evaluation of our techniques on OpenML data.

License

AymanUPC/datasets_proximity_openml

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Datasets Proximity Mining with OpenML

This page is dedicated for the Data Governance in Data Lakes project by the DTIM Research Group, UPC. We provide here the main datasets and sources for experimental evaluation of our techniques on OpenML data.

Getting Started

We provide the annotated ground-truth used in our experiments under the folder "datasets". There are two main types of files there provided in the Comma Separated Values (CSV) format:

  • The "index" CSV files: They store the original index of datasets collected for the training set or the testing set in the experiments. It comes with the OpenML dataset ID, the dataset name, and the meta-features collected for each dataset.
  • The "matching" CSV files: They store the meta-features' distances between all pairs of datasets in the training set or the testing set of the experiments. It comes with the OpenML dataset ID, the dataset name, the meta-features distances between the pair in each row, and the ground-truth of the relationships between the datasets in the pair: datesets_subject_main_match (1 or 0) for the Rel(d1,d2) relationship and datesets_duplicates_match (1 or 0) for the Dup(d1,d2) relationship.

To retrieve the original datasets from OpenML using the APIs provided by them and the dataset IDs (did) from our CSV files, please refer to the OpenML API guide.

Built With

License

This project is licensed under the Apache License 2.0 License - see the LICENSE.md file for details.

Acknowledgements

We are sincerely thankful to all the annotators who have validated and collaborated in creating the ground-truth datasets for the experiments. We thank the collaborators from the school of Pharmacy for helping us with the annotation of the datasets.

Reference

For more details about the datasets in this project, how they were collected, and for a detailed description of the data collected, please see the main publication resulting from this project: "Alserafi, A., Calders, T., Abelló, A., & Romero, O. (2017, October). DS-Prox: Dataset Proximity Mining for Governing the Data Lake. In International Conference on Similarity Search and Applications (pp. 284-299). Springer, Cham.".

About

This page is dedicated for the Data Governance in Data Lakes project by the DTIM research group of the ESSI department, UPC. We provide here the main datasets and sources for experimental evaluation of our techniques on OpenML data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published