GitHub - Sanderror/OpenML_Data_Cleaner: OpenML Data Cleaner tool

The OpenML Data Cleaner tool consists of a few important files, described below.

Before running the code, install the packages listed in the requirements.txt file.

The app works with Python 3.9.

The dashapp.py file contains the app itself with all the functionalities of the OpenML Data Cleaner. To run the OpenML Data Cleaner, simply run this file.
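
The setup steps above (install the requirements, use Python 3.9, run dashapp.py) can be sketched as a small launcher script. This is only an illustration and not part of the repository: the function names `python_version_ok`, `build_commands`, and `setup_and_run` are hypothetical, and the sketch assumes it is called from the repository root with `pip` available for the current interpreter.

```python
import subprocess
import sys
from pathlib import Path


def python_version_ok():
    """Return True when the interpreter is Python 3.9, the version the README names."""
    return sys.version_info[:2] == (3, 9)


def build_commands(repo_dir="."):
    """Build the install and launch commands for the current interpreter."""
    repo = Path(repo_dir)
    install = [sys.executable, "-m", "pip", "install", "-r",
               str(repo / "requirements.txt")]
    launch = [sys.executable, str(repo / "dashapp.py")]
    return install, launch


def setup_and_run(repo_dir="."):
    """Install the requirements, then start the app (hypothetical helper)."""
    if not python_version_ok():
        # Other versions are untested according to the README, so only warn.
        print("Warning: the app is stated to work with Python 3.9.")
    install, launch = build_commands(repo_dir)
    subprocess.run(install, check=True)  # raises CalledProcessError on failure
    subprocess.run(launch, check=True)
```

Calling `setup_and_run()` from the repository root would install the packages and then start the app. Using `sys.executable` keeps both steps on the same interpreter, which avoids installing the packages into a different Python than the one that runs the app.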

Before running the file, add an OpenAI API key to the create_correct_cryp_section function in the create_sections.py file; the Cryptic Attribute Names correction needs it to work as intended.
For the best view, the tool currently requires zooming the browser out to 67%; otherwise the text and sections are too large (this will be fixed soon).
The app currently requires the user to upload a dataset as a .csv file. This will be changed soon.
To see the error detections and corrections for all error types, use the synthetic_data.csv file in the data folder of this repository.
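
Pasting the OpenAI API key directly into create_sections.py makes it easy to commit the key by accident. One possible alternative, shown here as a sketch only, is to read the key from an environment variable and pass the result to the place where the key is needed. The helper `get_openai_api_key` and the choice of the variable name `OPENAI_API_KEY` are assumptions for illustration; the repository itself does not define them.

```python
import os


def get_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper: raises a clear error when the variable is unset
    or empty, instead of failing later inside the correction step.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before starting the app."
        )
    return key
```

With this approach the key is set once in the shell that starts dashapp.py and never appears in the source files.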

This dashapp.py file makes use of:

the error_detection.py file, which contains functions for error detection
the error_correction.py file, which contains functions for error correction
the create_sections.py file, which is used to create certain sections within the tool

The pFAHES folder contains detection methods for disguised missing values, based on the paper by Qahtan et al. (2018).

These functions are used in the detect_dmv() function within the error_detection.py file.
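
To make the term concrete: a disguised missing value is a placeholder such as "?" or -1 that stands in for a missing entry. The toy sketch below only counts values that match a fixed list of common placeholder tokens. It is not the method of Qahtan et al. (2018), nor the code in pFAHES or detect_dmv(); the token list and the function name `find_placeholder_values` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical placeholder tokens; not taken from pFAHES or the paper.
SUSPECT_TOKENS = {"", "?", "-", "n/a", "na", "none", "null",
                  "unknown", "-1", "999", "9999"}


def find_placeholder_values(values):
    """Return {token: count} for values that match a suspect placeholder token."""
    # Normalise each value to a lower-case string before comparing.
    counts = Counter(str(v).strip().lower() for v in values)
    return {tok: n for tok, n in counts.items() if tok in SUSPECT_TOKENS}


ages = ["34", "51", "?", "27", "-1", "?", "45"]
print(find_placeholder_values(ages))  # {'?': 2, '-1': 1}
```

A fixed token list like this misses placeholders that look like ordinary values, which is the harder case that dedicated detection methods target.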

The lookups folder stores look-up files for the CrypticIdentifier() function created by Zhang et al. (2023).

The cached_files folder stores files while the Dash app is running.

The assets folder contains a .css file for the styling of some parts of the tool.

The data folder contains datasets and other files that were used for the experiment notebooks.

The experiment notebooks contain the procedure and results of the experiments (data quality improvement and cryptic column name query optimization).

