This page is dedicated to the Data Governance in Data Lakes project by the DTIM Research Group, UPC. It provides the main datasets and sources used for the experimental evaluation of our techniques on OpenML data.
We provide the annotated ground truth used in our experiments in the "datasets" folder. It contains two main types of files, both in Comma-Separated Values (CSV) format:
- The "index" CSV files: These store the original index of the datasets collected for the training set or the testing set of the experiments. Each row contains the OpenML dataset ID, the dataset name, and the meta-features collected for that dataset.
- The "matching" CSV files: These store the meta-feature distances between all pairs of datasets in the training set or the testing set of the experiments. Each row contains the OpenML dataset ID and the dataset name of each dataset in the pair, the meta-feature distances between the two datasets, and the ground truth for the relationships between them: datesets_subject_main_match (1 or 0) for the Rel(d1,d2) relationship and datesets_duplicates_match (1 or 0) for the Dup(d1,d2) relationship.
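
As a rough illustration of how the ground-truth columns in a "matching" file could be used, the following Python sketch loads a CSV with the standard library and keeps only the pairs labelled 1 for a given relationship. The two label column names are taken from the description above; the helper names `load_matching_csv` and `positive_pairs` are hypothetical, and the sketch assumes the labels are stored as 1/0 (or 1.0/0.0) with no empty cells.

```python
import csv
from typing import Dict, Iterable, List

# Ground-truth column names as spelled in the description above.
REL_COLUMN = "datesets_subject_main_match"  # Rel(d1,d2)
DUP_COLUMN = "datesets_duplicates_match"    # Dup(d1,d2)


def load_matching_csv(path: str) -> List[Dict[str, str]]:
    """Read a "matching" CSV into a list of dicts, one per dataset pair."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def positive_pairs(rows: Iterable[Dict[str, str]], label_column: str) -> List[Dict[str, str]]:
    """Keep only the rows whose ground-truth label equals 1.

    Assumption: every row has a numeric label; an empty cell raises ValueError.
    """
    return [row for row in rows if int(float(row[label_column].strip())) == 1]
```

For example, `positive_pairs(load_matching_csv(path), DUP_COLUMN)` would return the pairs annotated as duplicates, where `path` points to one of the "matching" files in the "datasets" folder.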
To retrieve the original datasets from OpenML, use the OpenML APIs together with the dataset IDs (did) from our CSV files; please refer to the OpenML API guide for details.
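
One possible way to do this from Python is sketched below. It is not part of the project itself: it assumes the ID column in the CSV file is named `did` (adjust `id_column` if the header differs), and it relies on the separately installed `openml` Python package and network access for the download step. The helper names `read_dataset_ids` and `fetch_dataset` are hypothetical.

```python
import csv
from typing import List


def read_dataset_ids(index_csv_path: str, id_column: str = "did") -> List[int]:
    """Collect the distinct OpenML dataset IDs from a CSV file, in file order.

    Assumption: the ID column is called "did"; pass id_column otherwise.
    """
    seen = set()
    ids: List[int] = []
    with open(index_csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            did = int(float(row[id_column]))
            if did not in seen:
                seen.add(did)
                ids.append(did)
    return ids


def fetch_dataset(did: int):
    """Download one dataset with openml-python (needs `pip install openml`)."""
    import openml  # imported lazily so the CSV helper works without the package

    dataset = openml.datasets.get_dataset(did)
    # get_data() returns the feature table, the target (None when no target
    # is requested), a categorical mask, and the attribute names.
    features, _target, _categorical_mask, _attribute_names = dataset.get_data()
    return dataset.name, features
```

Calling `fetch_dataset(did)` for each ID returned by `read_dataset_ids` would download the datasets one at a time; the OpenML API guide remains the authoritative reference for other access methods.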
This project is licensed under the Apache License 2.0; see the LICENSE.md file for details.
We are sincerely thankful to all the annotators who have validated and collaborated in creating the ground-truth datasets for the experiments. We thank the collaborators from the school of Pharmacy for helping us with the annotation of the datasets.
For more details about the datasets in this project, including how they were collected and a detailed description of the data, please see the main publication resulting from this project: "Alserafi, A., Calders, T., Abelló, A., & Romero, O. (2017, October). DS-Prox: Dataset Proximity Mining for Governing the Data Lake. In International Conference on Similarity Search and Applications (pp. 284-299). Springer, Cham."