cleaning-data-lakes-in-a-self-supervised-manner

Self-supervised data cleaning experiments. Experiments available:

GARF (Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks)
RNN-GARF (GARF with LSTM instead of SeqGAN)
BiLSTM-GARF (GARF with Bidirectional LSTM instead of SeqGAN)
MPGARF (Multi-Core GARF)
Tuning-GARF (Hyperparameter search for network size in original GARF)
Linear-GARF (GARF with FD-detection instead of SeqGAN)
Distilled-GARF (RNN-GARF with Knowledge Distillation)
Column-Scalability (How well does GARF scale with increasing column size?)
Column-Ordering (Does column ordering has any effect on found rules?)
Tax-GARF (How well does GARF scale with with increasing tuple size?)
GARF Applicability (What needs to be done to a dataset so that we can not use GARF on it anymore?)
GARF Enhancement (Can we improve results by duplicating our dataset?)
Rule-Coverage (How much does a dataset need to change in order that we cannot reusage the found rules?)
Rule-Reusage (Can we transfer rules to Tax?)
LakeCleanerNaive (Multiple GARF models, without interaction, single core version)
LakeCleanerRNNNaive (Multiple RNN-GARF models, without interaction, single core version)
LakeCleaner (Multiple RNN-GARF models, rule transfer, dataset clustering, single core version)

Run an experiment:

main.py [-h] [-d DATASETS] [-e ERROR] [-g GENERATOR]
        [-i INTERVAL_FOR_ERROR_REMOVAL] [-l LIMIT] [-m METHOD]
        [-n N_JOBS] [-o] [-p] [-r RUNS] [-s] [-t]

optional arguments:
  -h, --help            show this help message and exit
  -d DATASETS, --datasets DATASETS
                        Datasets to run on (comma separated)
  -e ERROR, --error ERROR
                        Maximum error rate
  -g GENERATOR, --generator GENERATOR
                        Error generator, (0) = simple, (1) = BART (applies
                        only to original_experiment = False)
  -i INTERVAL_FOR_ERROR_REMOVAL, --interval_for_error_removal INTERVAL_FOR_ERROR_REMOVAL
                        The interval for removing defective tuples from
                        training set, 0 = [0], 1 = [0, 1], 2 = [0, 0.5, 1]
  -l LIMIT, --limit LIMIT
                        Limit the number of dataset rows
  -m METHOD, --method METHOD
                        Method to run
  -n N_JOBS, --n_jobs N_JOBS
                        Number of jobs to run in parallel
  -o, --original_experiment
                        Run the original experiment
  -p, --prepare         Prepare the databases
  -r RUNS, --runs RUNS  Number of runs per error rate
  -s, --send            Send push notification
  -t, --tex             Generate latex style plots, requires latex installed
                        on system

ToDo

Citations

Peng, Jinfeng, et al. "Self-supervised and Interpretable Data Cleaning with Sequence Generative Adversarial Networks." Proceedings of the VLDB Endowment 16.3 (2022): 433-446.
Wang, Pei, and Yeye He. "Uni-detect: A unified approach to automated error detection in tables." Proceedings of the 2019 International Conference on Management of Data. 2019.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
datasets		datasets
experiments		experiments
methods		methods
preparation		preparation
search		search
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
argparser.py		argparser.py
create_db.py		create_db.py
eva.py		eva.py
factory.py		factory.py
insert_error.py		insert_error.py
latex_plot.py		latex_plot.py
main.py		main.py
reset.py		reset.py

License

S-Eggers/MA-Cleaning-Data-Lakes-in-a-Self-Supervised-Manner

Folders and files

Latest commit

History

Repository files navigation

cleaning-data-lakes-in-a-self-supervised-manner

ToDo

Citations

About

Resources

License

Stars

Watchers

Forks

Languages