
SuDIS-ZJU/awesome-tabular-data-augmentation


♣️ Tabular Data Augmentation for Machine Learning: Progress and Prospects of Embracing Generative AI.

A survey providing a comprehensive examination of tabular data augmentation (TDA) methods tailored for ML scenarios, with a special emphasis on the recent advancements in incorporating generative AI techniques.

Check out the latest version of the paper on arXiv.

This content is a work in progress and will be continuously updated. Stay tuned!

An example of TDA for ML

① A data scientist aims to predict house prices from factors such as location and floor, using an original training set ($T^O$) with limited features and records. The initial ML model yields sub-optimal results because the data are insufficient and contain numerous missing or incorrect values. To improve performance, the scientist uses TDA to augment the original dataset with additional attributes (columns), records (rows), and corrected values (cells). ② Before augmentation, preparation steps such as table annotation (e.g., recovering the missing column type of the 4th column in $T^O$) make the TDA process more effective. ③ Augmentation can be achieved through retrieval-based methods (e.g., integrating the 2nd row of $T^C_1$ from an external data source) or generation-based methods that synthesize new data. ④ The augmented table ($T^A$) combines the original and new data. ⑤ After augmentation, evaluation steps assess the effectiveness of the TDA process. ⑥ Finally, the resulting augmented table enables the scientist to train a more accurate price prediction model.

Sources
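The three kinds of augmentation in the example above (new rows, new columns, and filled-in cells) can be sketched on a toy table. This is only an illustrative sketch in plain Python: the column names, the `house_id` key, the values, and the `augment` helper are all hypothetical and do not come from the survey, and a real retrieval-based method would first have to find and match the candidate table.

```python
# Toy original table T^O: one record per house; None marks a missing cell.
# All names and values here are made up for illustration.
original = [
    {"house_id": 1, "location": "Downtown", "floor": 3, "price": 520},
    {"house_id": 2, "location": "Suburb", "floor": None, "price": 310},
]

# Hypothetical external candidate table (in the role of T^C_1).
candidate = [
    {"house_id": 2, "floor": 1, "area": 95},
    {"house_id": 3, "location": "Suburb", "floor": 2, "area": 80, "price": 295},
    {"house_id": 1, "area": 120},
]


def augment(original, candidate, key="house_id"):
    """Return an augmented copy of `original` built from rows, columns, and cells of `candidate`."""
    by_key = {row[key]: dict(row) for row in original}  # copy, so T^O stays untouched
    for cand in candidate:
        if cand[key] not in by_key:
            # Row augmentation: a record the original table lacks.
            by_key[cand[key]] = dict(cand)
            continue
        target = by_key[cand[key]]
        for column, value in cand.items():
            if column not in target:
                target[column] = value  # column augmentation: a new attribute
            elif target[column] is None:
                target[column] = value  # cell augmentation: fill a missing value
    # Collect the full schema in first-seen order and pad every record to it.
    columns = []
    for row in by_key.values():
        for column in row:
            if column not in columns:
                columns.append(column)
    return [{c: row.get(c) for c in columns} for row in by_key.values()]


augmented = augment(original, candidate)  # plays the role of T^A
for row in augmented:
    print(row)
```

The sketch only fills cells that are missing; correcting cells that are present but wrong, as mentioned in the example, would need an extra error-detection step that is not shown here.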

TDA pipeline

An overview of the TDA pipeline and the task-based taxonomy of TDA approaches. The input and output of the TDA pipeline are the original table $T^O$ and the augmented table $T^A$, respectively. The pipeline comprises three main procedures: pre-augmentation, augmentation, and post-augmentation.

Sources

Pre-augmentation

In the TDA pipeline, pre-augmentation encompasses the preparation tasks that facilitate effective augmentation. Below is an overview of the pre-augmentation tasks and the TDA tasks they target.

Sources

Overview of pre-augmentation tasks. The table contains a non-exhaustive list of representative TDA-related works (arranged in chronological order) and the corresponding pre-augmentation tasks they involve.

No. Reference Pub.Year Publication Pre-augmentation Methods
1 Annotating and searching web tables using entities, types and relationships 2010 VLDB Table Annotation (Ontology), Table Representation (Content)
2 Recovering semantics of tables on the web 2011 VLDB Table Annotation (Ontology), Table Representation (Content), Schema Matching (Textual)
3 MICE: Multivariate Imputation by Chained Equations in R 2011 J. Stat. Soft. Table Representation (Content)
4 MissForest—non-parametric missing value imputation for mixed-type data 2012 Bioinformatics Table Representation (Content)
5 Finding related tables 2012 SIGMOD Table Representation (Content+Metadata), Schema Matching (Textual+Metadata), Entity Matching (KB)
6 InfoGather: entity augmentation and attribute discovery by holistic matching with web tables 2012 SIGMOD Table Simplification (Summarization), Table Representation (Content+Metadata), Table Indexing (Inverted index), Schema Matching (Textual+Metadata)
7 Towards a Hybrid Imputation Approach Using Web Tables 2015 BDC Table Representation (Content+Metadata), Table Indexing (Inverted index)
8 TabEL: Entity Linking in Web Tables 2015 ISWC Table Representation (Content), Entity Matching (DB)
9 Deep feature synthesis: Towards automating data science endeavors 2015 DSAA Table Representation (Content)
10 Entity Resolution in the Web of Data 2015 SLDSK Table Representation (Content), Entity Matching (KB)
11 ExploreKit: Automatic Feature Generation and Selection 2016 ICDM Table Representation (Content)
12 LSH ensemble: internet-scale domain search 2016 VLDB Table Representation (Content), Table Indexing (LSH)
13 EntiTables: Smart Assistance for Entity-Focused Tables 2017 SIGIR Table Representation (Content), Table Indexing (Inverted index), Entity Matching (KB+DB)
14 Table union search on open data 2018 VLDB Error Handling (Implicit), Table Annotation (Supervised-learning), Table Representation (Content), Table Indexing (LSH), Schema Matching (Textual)
15 Aurum: A Data Discovery System 2018 ICDE Table Representation (Content), Table Navigation (Linkage graph), Schema Matching (Numerical)
16 Data synthesis based on generative adversarial networks 2018 VLDB Table Representation (Content)
17 GAIN: Missing Data Imputation using Generative Adversarial Nets 2018 ICML Table Representation (Content)
18 Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval 2019 SIGIR Table Representation (Content), Entity Matching (KB+DB)
19 JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes 2019 SIGMOD Table Representation (Content), Table Indexing (Inverted index)
20 Sherlock: A Deep Learning Approach to Semantic Data Type Detection 2019 SIGKDD Table Annotation (Supervised-learning), Table Representation (Content)
21 Auto-completion for Data Cells in Relational Tables 2019 CIKM Table Simplification (Summarization), Table Representation (Content+Metadata), Entity Matching (KB+DB)
22 MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets 2019 ICML Table Representation (Content)
23 FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data 2019 IJCAI Table Representation (Content)
24 PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees 2019 ICLR Table Representation (Content)
25 Modeling Tabular data using Conditional GAN 2019 NIPS Table Representation (Content)
26 MisGAN: Learning from Incomplete Data with Generative Adversarial Networks 2019 ICLR Table Representation (Content)
27 Handling incomplete heterogeneous data using VAEs 2020 PR Table Representation (Content)
28 Dataset Discovery in Data Lakes 2020 ICDE Error Handling (Implicit), Table Annotation (Supervised-learning), Table Representation (Content), Table Indexing (LSH), Schema Matching (Textual+Numerical)
29 ARDA: automatic relational data augmentation for machine learning 2020 VLDB Error Handling (Explicit), Table Simplification (Sampling), Table Representation (Content)
30 Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks 2020 SIGMOD Error Handling (Explicit), Table Representation (Content+Context)
31 Sato: contextual semantic type detection in tables 2020 VLDB Error Handling (Implicit), Table Annotation (Supervised-learning), Table Simplification (Summarization), Table Representation (Content+Context)
32 Organizing Data Lakes for Navigation 2020 VLDB Table Representation (Content), Table Navigation (Linkage graph)
33 Discovering related data at scale 2021 VLDB Table Simplification (Sampling), Table Representation (Content+Metadata)
34 Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach 2021 ICDE Error Handling (Implicit), Table Representation (Content), Table Indexing (Inverted index), Schema Matching (Textual)
35 Automatic data acquisition for deep learning 2021 VLDB Table Representation (Content)
36 RONIN: data lake exploration 2021 VLDB Table Representation (Content), Table Navigation (Hierarchical structure)
37 Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning 2021 ESWA Table Representation (Content)
38 SIGRNN: Synthetic Minority Instances Generation in Imbalanced Datasets using a Recurrent Neural Network 2021 ICPRAM Table Representation (Content)
39 Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems 2021 ICMLA Table Representation (Content)
40 Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation 2022 SIGMOD Error Handling (Explicit), Table Representation (Content)
41 MATE: multi-attribute table extraction 2022 VLDB Table Representation (Content), Table Indexing (Inverted index)
42 Integrating Data Lake Tables 2022 VLDB Table Annotation (PLMs), Table Representation (Content), Table Navigation (Clustering)
43 Feature Augmentation with Reinforcement Learning 2022 ICDE Table Simplification (Sampling), Table Representation (Content)
44 Selective data acquisition in the wild for model charging 2022 VLDB Table Representation (Content), Table Navigation (Clustering)
45 A Sketch-based Index for Correlated Dataset Search 2022 ICDE Table Representation (Content), Table Navigation (Linkage graph), Schema Matching (Numerical)
46 StruBERT: Structure-aware BERT for Table Search and Matching 2022 WWW Table Representation (Content)
47 TURL: Table Understanding through Representation Learning 2022 SIGMOD Error Handling (Implicit), Table Simplification (Summarization), Table Representation (Content+Context+Metadata)
48 Data Lake Organization 2022 TKDE Table Representation (Content), Table Navigation (Linkage graph)
49 Annotating Columns with Pre-trained Language Models 2022 SIGMOD Table Annotation (PLMs), Table Representation (Content+Context)
50 TransTab: Learning Transferable Tabular Transformers Across Tables 2022 NIPS Table Representation (Content), Table Navigation (Linkage graph)
51 SOS: Score-based Oversampling for Tabular Data 2022 SIGKDD Table Representation (Content)
52 Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation 2023 SIGMOD Error Handling (Implicit), Table Annotation (PLMs), Table Simplification (Summarization), Table Representation (Content+Context+Metadata)
53 DeepJoin: Joinable Table Discovery with Pre-Trained Language Models 2023 VLDB Error Handling (Implicit), Table Simplification (Sampling), Table Representation (Content), Table Indexing (HNSW), Schema Matching (Textual)
54 SANTOS: Relationship-based Semantic Table Union Search 2023 SIGMOD Error Handling (Implicit), Table Annotation (Supervised-learning), Table Simplification (∖), Table Representation (Content+Context), Table Indexing (Inverted index), Schema Matching (Textual)
55 Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning 2023 VLDB Error Handling (Implicit), Table Annotation (∖), Table Simplification (Sampling), Table Representation (Content+Context), Table Indexing (LSH+HNSW), Schema Matching (Textual)
56 Automatic Table Union Search with Tabular Representation Learning 2023 ACL Findings Error Handling (Implicit), Table Simplification (Sampling), Table Representation (Content+Context)
57 Retrieval-Based Transformer for Table Augmentation 2023 ACL Findings Error Handling (Implicit), Table Representation (Content+Context+Metadata)
58 HyTrel: Hypergraph-enhanced Tabular Data Representation Learning 2023 NIPS Error Handling (Implicit), Table Representation (Content+Context)
59 GOGGLE: Generative modelling for tabular data by learning relational structure 2023 ICLR Table Representation (Content+Context)
60 Interpretable tabular data generation 2023 KAIS Table Representation (Content)
61 STaSy: Score-based Tabular data Synthesis 2023 ICLR Table Representation (Content)
62 CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis 2023 ICML Table Representation (Content)
63 Diffusion models for missing value imputation in tabular data 2023 CoRR Table Representation (Content)
64 Beyond Discrete Selection: Continuous Embedding Space Optimization for Generative Feature Selection 2023 ICDM Table Representation (Content)
65 Reinforcement-Enhanced Autoregressive Feature Transformation: Gradient-steered Search in Continuous Space for Postfix Expressions 2023 NIPS Table Representation (Content)
66 OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories 2024 CoRR Error Handling (Implicit), Table Representation (Content), Table Navigation (Linkage graph)
67 FeatNavigator: Automatic Feature Augmentation on Tabular Data 2024 CoRR Error Handling (Implicit), Table Representation (Content)
68 Controllable Tabular Data Synthesis Using Diffusion Models 2024 SIGMOD Table Representation (Content)
69 SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions 2024 CIDR Error Handling (Implicit), Table Representation (Content+Context)
70 Differentially Private Tabular Data Synthesis using Large Language Models 2024 CoRR Error Handling (Implicit), Table Representation (Content)

The illustration of different TDA scenarios and their suited pre-augmentation tasks.

Sources

Augmentation

Overview of the TDA tasks. The superscripts R and G indicate whether a TDA task is retrieval-based or generation-based, respectively.

Sources

The categorization of the TDA approaches, from both task-oriented and table-level perspectives. We also provide a concise introduction to the key TDA techniques within each category.

Sources

🔎 Retrieval-based

Entity Augmentation

[SIGMOD'12] InfoGather: entity augmentation and attribute discovery by holistic matching with web tables [paper]

[VLDB'18] Table union search on open data. [paper]

[ICDE'20] Dataset Discovery in Data Lakes [paper]

[SIGMOD'12] Finding related tables [paper]

[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]

[SIGIR'19] Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval [paper] [code]

[SIGMOD'23] SANTOS: Relationship-based Semantic Table Union Search [paper] [code]

[SIGMOD'20] Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks [paper] [code]

[NIPS'23] HyTrel: Hypergraph-enhanced Tabular Data Representation Learning [paper] [code]

[VLDB'23] Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. [paper] [code]

[ACL'23] Automatic Table Union Search with Tabular Representation Learning [paper]

Schema Augmentation

[VLDB'16] LSH ensemble: internet-scale domain search [paper] [code]

[SIGMOD'19] JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. [paper] [code]

[VLDB'20] ARDA: automatic relational data augmentation for machine learning [paper]

[ICDE'22] Feature Augmentation with Reinforcement Learning [paper]

[CoRR'24] FeatNavigator: Automatic Feature Augmentation on Tabular Data [paper]

[ICDE'21] Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach [paper]

[VLDB'23] DeepJoin: Joinable Table Discovery with Pre-Trained Language Models [paper]

[CoRR'24] OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories [paper]

[SIGMOD'20] Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks [paper] [code]

[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]
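Several of the schema augmentation works above search a data lake for columns that can be joined with a query column. As a rough illustration of that idea, the sketch below ranks candidate columns by how many distinct values they share with the query column. It is a brute-force scan written for this README, not the algorithm of any listed paper; the pre-augmentation table above lists indexes such as inverted indexes and LSH for those systems, which avoid scanning every candidate. The function name, the `table.column` naming, and the data are hypothetical.

```python
def top_k_joinable(query_values, candidate_columns, k=2):
    """Rank candidate columns by the size of their value overlap with the query column.

    Returns (column_name, overlap, containment) tuples, where containment is the
    fraction of distinct query values that also appear in the candidate column.
    """
    query = set(query_values)
    scored = []
    for name, values in candidate_columns.items():
        overlap = len(query & set(values))
        containment = overlap / len(query) if query else 0.0
        scored.append((name, overlap, containment))
    # Largest overlap first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:k]


# Hypothetical query column and data lake columns.
query_column = ["Hangzhou", "Shanghai", "Beijing", "Shenzhen"]
data_lake = {
    "city_stats.city": ["Beijing", "Shanghai", "Hangzhou", "Chengdu"],
    "airports.city": ["Shanghai", "Tokyo", "Seoul"],
    "products.name": ["Laptop", "Phone"],
}

results = top_k_joinable(query_column, data_lake, k=2)
for name, overlap, containment in results:
    print(f"{name}: overlap={overlap}, containment={containment:.2f}")
```

A column with high containment is a plausible join partner, so its sibling columns become candidate features for the original table; whether those features actually help still has to be checked downstream.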

Cell Completion

[SIGMOD'12] InfoGather: entity augmentation and attribute discovery by holistic matching with web tables [paper]

[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]

[ACL'23] Retrieval-Based Transformer for Table Augmentation [paper]

[SIGMOD'22] TURL: Table Understanding through Representation Learning [paper] [code]

[BDC'15] Towards a Hybrid Imputation Approach Using Web Tables [paper]

[CIKM'19] Auto-completion for Data Cells in Relational Tables [paper]

Table Integration

[SIGMOD'12] InfoGather: entity augmentation and attribute discovery by holistic matching with web tables [paper]

[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]

[ACL'23] Retrieval-Based Transformer for Table Augmentation [paper]

[VLDB'22] Integrating Data Lake Tables [paper] [code]

[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]

🔧 Generation-based

Record Generation

[SIGMOD'07] Privacy, accuracy, and consistency too: a holistic solution to contingency table release [paper]

[SIGMOD'14] PrivBayes: private data release via bayesian networks [paper]

[ICLR'19] PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees [paper]

[VLDB'18] Data synthesis based on generative adversarial networks [paper] [code]

[IJCAI'19] FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data [paper]

[KAIS'23] Interpretable tabular data generation [paper]

[ICLR'23] STaSy: Score-based Tabular data Synthesis [paper] [code]

[ICLR'23] GOGGLE: Generative modelling for tabular data by learning relational structure [paper]

[ICML'23] CoDi: Co-evolving Contrastive Diffusion Models for Mixed type Tabular Synthesis [paper] [code]

[SIGMOD'24] Controllable Tabular Data Synthesis Using Diffusion Models [paper]

[CoRR'24] Differentially Private Tabular Data Synthesis using Large Language Models [paper]

[NIPS'19] Modeling Tabular data using Conditional GAN [paper]

[ESWA'21] Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning [paper]

[ICPRAM'21] SIGRNN: Synthetic Minority Instances Generation in Imbalanced Datasets using a Recurrent Neural Network [paper] [code]

[SIGKDD'22] SOS: Score-based Oversampling for Tabular Data [paper] [code]

Feature Construction

[DSAA'15] Deep feature synthesis: Towards automating data science endeavors [paper]

[ICDM'16] ExploreKit: Automatic Feature Generation and Selection [paper]

[CIDR'24] SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions [paper]

[ICDM'23] Beyond Discrete Selection: Continuous Embedding Space Optimization for Generative Feature Selection [paper]

[NIPS'23] Reinforcement-Enhanced Autoregressive Feature Transformation: Gradient-steered Search in Continuous Space for Postfix Expressions [paper]

Cell Imputation

[J. Stat. Soft.'11] MICE: Multivariate Imputation by Chained Equations in R [paper]

[Bioinformatics'12] MissForest—non-parametric missing value imputation for mixed-type data [paper]

[ICML'19] MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets [paper]

[PR'20] Handling incomplete heterogeneous data using VAEs [paper]

[ICML'18] GAIN: Missing Data Imputation using Generative Adversarial Nets [paper]

[ICMLA'21] Multiple Imputation via Generative Adversarial Network for High dimensional Blockwise Missing Value Problems [paper]

[NIPS'22] Diffusion models for missing value imputation in tabular data [paper] [code]
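To make the cell imputation task concrete, the sketch below fills missing cells with a per-column statistic: the mean for numeric columns and the most frequent value otherwise. This is a deliberately simple baseline chosen for illustration, not one of the methods listed above, which model dependencies between columns rather than treating each column alone. The `impute` helper and the toy rows are hypothetical.

```python
from collections import Counter
from statistics import mean


def impute(rows):
    """Fill None cells column by column: mean for numeric columns, mode otherwise."""
    columns = list(rows[0].keys())
    fill = {}
    for column in columns:
        observed = [row[column] for row in rows if row[column] is not None]
        if not observed:
            fill[column] = None  # nothing to learn from; leave the gap
        elif all(isinstance(value, (int, float)) for value in observed):
            fill[column] = mean(observed)
        else:
            fill[column] = Counter(observed).most_common(1)[0][0]
    return [
        {c: (row[c] if row[c] is not None else fill[c]) for c in columns}
        for row in rows
    ]


# Toy table with one missing numeric cell and one missing categorical cell.
table = [
    {"location": "Downtown", "floor": 3},
    {"location": "Suburb", "floor": None},
    {"location": None, "floor": 5},
    {"location": "Suburb", "floor": 4},
]

completed = impute(table)
for row in completed:
    print(row)
```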

Table Synthesis

[NIPS'22] TransTab: Learning Transferable Tabular Transformers Across Tables [paper] [code]

Post-augmentation

📅 TDA Datasets

Representative datasets used in TDA studies, including their basic properties and the specific TDA tasks they are suitable for.

No. Dataset Tables Columns Avg # Columns Rows Avg # Rows URL
1 Web_Manual 371 - - 18921 51 http://websail-fe.cs.northwestern.edu/TabEL/#Web_Manual
2 Wiki_Link 6085 - - 121700 20 http://websail-fe.cs.northwestern.edu/TabEL/#Wiki_Link
3 WDC web table corpus 50M 250M 5 700M 14 https://webdatacommons.org/webtables/#results-2015
4 WikiTables corpus 1.6M 30.4M 19 - - http://websail-fe.cs.northwestern.edu/TabEL/#WikiTables
5 TUS Small 1,530 14,810 10 6.8M 4,466 https://github.com/RJMillerLab/table-union-search-benchmark
6 TUS Large 5,043 54,923 11 9.7M 1,915 https://github.com/RJMillerLab/table-union-search-benchmark
7 SANTOS Small 550 6,322 11 3.8M 6,921 https://github.com/northeastern-datalab/santos
8 SANTOS Large 11,090 123,477 11 70M 7,675 https://github.com/northeastern-datalab/santos
9 BTS 1 30 - 1M - https://www.transtats.bts.gov/DataIndex.asp
10 UCI datasets (e.g., Adult, Covertype) - - - - - https://archive.ics.uci.edu
11 Kaggle (e.g., Diabetes, Bank) - - - - - https://www.kaggle.com
12 OpenML repository (e.g., Heart, Horse) - - - - - https://www.openml.org

📐 Evaluation Policies

Original-table-based evaluation

[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]

[SIGIR'19] Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval [paper] [code]

[VLDB'18] Data synthesis based on generative adversarial networks [paper] [code]

[IJCAI'19] FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data [paper]

Model-based evaluation

[VLDB'22] Selective data acquisition in the wild for model charging [paper]

[IJCAI'19] FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data [paper]

[VLDB'23] Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. [paper] [code]

[ICDE'23] Metam: Goal-Oriented Data Discovery [paper]

[CoRR'24] FeatNavigator: Automatic Feature Augmentation on Tabular Data [paper]

[ICDE'22] Feature Augmentation with Reinforcement Learning [paper]

[VLDB'18] Data synthesis based on generative adversarial networks [paper] [code]

[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]
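One way to read model-based evaluation, in line with the house price example at the top of this page, is to train the same model on the original and on the augmented table and compare a downstream metric on held-out data. The sketch below does this with a one-feature least-squares line and mean absolute error. The data, the choice of model and metric, and the helper names are assumptions made for illustration; the papers listed above use their own models and protocols.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, points):
    """Average absolute gap between the line's predictions and the true targets."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in points) / len(points)


# Hypothetical (area, price) records; the held-out set follows price = 2 * area + 10.
original_train = [(60, 135), (70, 145)]
augmented_train = original_train + [(40, 90), (120, 250)]  # rows added by TDA
held_out = [(50, 110), (100, 210)]

mae_original = mean_absolute_error(fit_line(original_train), held_out)
mae_augmented = mean_absolute_error(fit_line(augmented_train), held_out)

print(f"MAE with original table:  {mae_original:.2f}")
print(f"MAE with augmented table: {mae_augmented:.2f}")
```

In this toy setup the added rows widen the range of observed areas, so the fitted line generalizes better and the error on the held-out set drops; with real data the comparison can go either way, which is exactly what the evaluation is meant to reveal.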

📈 Optimization Strategies

Iteration-based optimization

[VLDB'20] ARDA: automatic relational data augmentation for machine learning [paper]

[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]

[CoRR'24] FeatNavigator: Automatic Feature Augmentation on Tabular Data [paper]

Reinforcement-learning-based optimization

[VLDB'21] Automatic data acquisition for deep learning [paper]

[ICDE'22] Feature Augmentation with Reinforcement Learning [paper]

[VLDB'22] Selective data acquisition in the wild for model charging [paper]
