♣️ Tabular Data Augmentation for Machine Learning: Progress and Prospects of Embracing Generative AI.

A survey providing a comprehensive examination of tabular data augmentation (TDA) methods tailored for ML scenarios, with a special emphasis on the recent advancements in incorporating generative AI techniques.

Check out the latest version of the paper on arXiv.

This content is a work in progress and will be continuously updated. Stay tuned!

An example of TDA for ML

① A data scientist aims to predict house prices based on factors like location and floor using an original training set ($T^O$) with limited features and records. The initial ML model yields sub-optimal results due to insufficient data and numerous missing or incorrect values. To improve performance, the scientist uses TDA to augment the original dataset with additional attributes (columns), records (rows), and corrected values (cells). ② Before augmentation, preparation steps, such as table annotation (e.g., recovering the missing column type of the 4th column in $T^O$), make the TDA process more effective. ③ Augmentation can be achieved through retrieval-based methods (e.g., integrating the 2nd row in $T^C_1$ from an external data source) or generation-based methods that synthesize new data. ④ The augmented table ($T^A$) combines the original and new data. ⑤ After augmentation, evaluation steps assess the effectiveness of the TDA process. ⑥ Finally, the resulting augmented table enables the scientist to train a more accurate price prediction model.
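
The three granularities of augmentation in this example (columns, rows, and cells) can be illustrated with a minimal Python sketch. The table contents, column names, and helper functions below are hypothetical and chosen only for illustration; they do not come from the survey.

```python
# Toy illustration of augmenting a table with rows, columns, and cells.
# All names and values are hypothetical.

original = [  # T^O: a small training table with one missing cell
    {"id": 1, "location": "Downtown", "floor": 3, "price": 520},
    {"id": 2, "location": "Suburb", "floor": None, "price": 310},
]


def add_rows(table, new_records):
    """Row augmentation: append records that share the table's schema."""
    return table + [dict(record) for record in new_records]


def add_column(table, name, values_by_id):
    """Column augmentation: attach a new attribute looked up by the key 'id'."""
    return [{**row, name: values_by_id.get(row["id"])} for row in table]


def fill_cells(table, column, fill_value):
    """Cell augmentation: replace missing (None) values in one column."""
    return [
        {**row, column: fill_value if row[column] is None else row[column]}
        for row in table
    ]


augmented = add_rows(
    original, [{"id": 3, "location": "Suburb", "floor": 5, "price": 330}]
)
augmented = add_column(augmented, "area_sqm", {1: 80, 2: 95, 3: 100})
augmented = fill_cells(augmented, "floor", 1)  # T^A
```

Each helper returns a new list of rows, so `original` stays untouched and can still be compared against `augmented` afterwards.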

TDA pipeline

An overview of the TDA pipeline and the task-based taxonomy of TDA approaches. The input and output of the TDA pipeline are the original table $T^O$ and the augmented table $T^A$, respectively. The TDA pipeline comprises three main procedures: pre-augmentation, augmentation, and post-augmentation.
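
As a rough sketch of how the three procedures compose, the pipeline can be viewed as three functions applied in sequence, taking $T^O$ in and returning $T^A$ together with an evaluation report. The function names, the stage implementations, and the report format below are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Table = List[Dict[str, object]]


def tda_pipeline(
    t_o: Table,
    prepare: Callable[[Table], Table],
    augment: Callable[[Table], Table],
    evaluate: Callable[[Table, Table], Dict[str, float]],
) -> Tuple[Table, Dict[str, float]]:
    """Run pre-augmentation, augmentation, and post-augmentation in order."""
    prepared = prepare(t_o)        # pre-augmentation
    t_a = augment(prepared)        # augmentation
    report = evaluate(t_o, t_a)    # post-augmentation (evaluation)
    return t_a, report


# Hypothetical stage implementations, kept deliberately trivial.
def strip_strings(table: Table) -> Table:
    """Clean string cells by trimming surrounding whitespace."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in table
    ]


def append_candidates(table: Table) -> Table:
    """Stand-in for a retrieval- or generation-based method."""
    candidates = [{"location": "Suburb", "price": 300}]
    return table + candidates


def count_growth(t_o: Table, t_a: Table) -> Dict[str, float]:
    """Report how many records the augmentation added."""
    return {"rows_added": float(len(t_a) - len(t_o))}


t_a, report = tda_pipeline(
    [{"location": " Downtown ", "price": 500}],
    strip_strings,
    append_candidates,
    count_growth,
)
```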

Pre-augmentation ⏩

In the TDA pipeline, pre-augmentation encompasses preparation tasks that facilitate effective augmentation. Below is an overview of the pre-augmentation tasks and their target TDA tasks.

Overview of pre-augmentation tasks. The table contains a non-exhaustive list of representative TDA-related works (arranged in chronological order) and the corresponding pre-augmentation tasks they involve.

No.	Reference	Pub.Year	Publication	Pre-augmentation Methods
1	Annotating and searching web tables using entities, types and relationships	2010	VLDB	Table Annotation (Ontology), Table Representation(Content)
2	Recovering semantics of tables on the web	2011	VLDB	Table Annotation (Ontology), Table Representation(Content), Schema Matching(Textual)
3	MICE: Multivariate Imputation by Chained Equations in R	2011	J. Stat. Soft.	Table Representation(Content)
4	MissForest—non-parametric missing value imputation for mixed-type data	2012	Bioinformatics	Table Representation(Content)
5	Finding related tables	2012	SIGMOD	Table Representation(Content+Metadata), Schema Matching(Textual+Metadata), Entity Matching(KB)
6	InfoGather: entity augmentation and attribute discovery by holistic matching with web tables	2012	SIGMOD	Table Simplification(Summarization), Table Representation(Content+Metadata), Table Indexing(Inverted index), Schema Matching(Textual+Metadata)
7	Towards a Hybrid Imputation Approach Using Web Tables	2015	BDC	Table Representation(Content+Metadata), Table Indexing(Inverted index)
8	TabEL: Entity Linking in Web Tables	2015	ISWC	Table Representation(Content), Entity Matching(DB)
9	Deep feature synthesis: Towards automating data science endeavors	2015	DSAA	Table Representation(Content)
10	Entity Resolution in the Web of Data	2015	SLDSK	Table Representation(Content), Entity Matching(KB)
11	ExploreKit: Automatic Feature Generation and Selection	2016	ICDM	Table Representation(Content)
12	LSH ensemble: internet-scale domain search	2016	VLDB	Table Representation(Content), Table Indexing(LSH)
13	EntiTables: Smart Assistance for Entity-Focused Tables	2017	SIGIR	Table Representation(Content), Table Indexing(Inverted index), Entity Matching(KB+DB)
14	Table union search on open data	2018	VLDB	Error Handling(Implicit), Table Annotation(Supervised-learning), Table Representation(Content), Table Indexing(LSH), Schema Matching(Textual)
15	Aurum: A Data Discovery System	2018	ICDE	Table Representation(Content), Table Navigation(Linkage graph), Schema Matching(Numerical)
16	Data synthesis based on generative adversarial networks	2018	VLDB	Table Representation(Content)
17	GAIN: Missing Data Imputation using Generative Adversarial Nets	2018	ICML	Table Representation(Content)
18	Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval	2019	SIGIR	Table Representation(Content), Entity Matching(KB+DB)
19	JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes	2019	SIGMOD	Table Representation(Content), Table Indexing(Inverted index)
20	Sherlock: A Deep Learning Approach to Semantic Data Type Detection	2019	SIGKDD	Table Annotation(Supervised-learning), Table Representation(Content)
21	Auto-completion for Data Cells in Relational Tables	2019	CIKM	Table Simplification(Summarization), Table Representation(Content+Metadata), Entity Matching(KB+DB)
22	MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets	2019	ICML	Table Representation(Content)
23	FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data	2019	IJCAI	Table Representation(Content)
24	PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees	2019	ICLR	Table Representation(Content)
25	Modeling Tabular data using Conditional GAN	2019	Nips	Table Representation(Content)
26	MisGAN: Learning from Incomplete Data with Generative Adversarial Networks	2019	ICLR	Table Representation(Content)
27	Handling incomplete heterogeneous data using VAEs	2020	PR	Table Representation(Content)
28	Dataset Discovery in Data Lakes	2020	ICDE	Error Handling(Implicit), Table Annotation(Supervised-learning), Table Representation(Content), Table Indexing(LSH), Schema Matching(Textual+Numerical)
29	ARDA: automatic relational data augmentation for machine learning	2020	VLDB	Error Handling(Explicit), Table Simplification(Sampling), Table Representation(Content)
30	Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks	2020	SIGMOD	Error Handling(Explicit), Table Representation(Content+Context)
31	Sato: contextual semantic type detection in tables	2020	VLDB	Error Handling(Implicit), Table Annotation(Supervised-learning), Table Simplification(Summarization), Table Representation(Content+Context)
32	Organizing Data Lakes for Navigation	2020	VLDB	Table Representation (Content), Table Navigation (Linkage graph)
33	Discovering related data at scale	2021	VLDB	Table Simplification (Sampling), Table Representation (Content+Metadata)
34	Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach	2021	ICDE	Error Handling (Implicit), Table Representation (Content), Table Indexing (Inverted index), Schema Matching (Textual)
35	Automatic data acquisition for deep learning	2021	VLDB	Table Representation(Content)
36	RONIN: data lake exploration	2021	VLDB	Table Representation(Content), Table Navigation(Hierarchical structure)
37	Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning	2021	ESWA	Table Representation(Content)
38	SIGRNN: Synthetic Minority Instances Generation in Imbalanced Datasets using a Recurrent Neural Network	2021	ICPRAM	Table Representation(Content)
39	Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems	2021	ICMLA	Table Representation(Content)
40	Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation	2022	SIGMOD	Error Handling (Explicit), Table Representation (Content)
41	MATE: multi-attribute table extraction	2022	VLDB	Table Representation (Content), Table Indexing (Inverted index)
42	Integrating Data Lake Tables	2022	VLDB	Table Annotation (PLMs), Table Representation (Content), Table Navigation (Clustering)
43	Feature Augmentation with Reinforcement Learning	2022	ICDE	Table Simplification (Sampling), Table Representation (Content)
44	Selective data acquisition in the wild for model charging	2022	VLDB	Table Representation (Content), Table Navigation (Clustering)
45	A Sketch-based Index for Correlated Dataset Search	2022	ICDE	Table Representation(Content), Table Navigation(Linkage graph), Schema Matching(Numerical)
46	StruBERT: Structure-aware BERT for Table Search and Matching	2022	WWW	Table Representation(Content)
47	TURL: Table Understanding through Representation Learning	2022	SIGMOD	Error Handling (Implicit), Table Simplification (Summarization), Table Representation (Content+Context+Metadata)
48	Data Lake Organization	2022	TKDE	Table Representation (Content), Table Navigation (Linkage graph)
49	Annotating Columns with Pre-trained Language Models	2022	SIGMOD	Table Annotation (PLMs), Table Representation (Content+Context)
50	TransTab: Learning Transferable Tabular Transformers Across Tables	2022	Nips	Table Representation (Content), Table Navigation (Linkage graph)
51	SOS: Score-based Oversampling for Tabular Data	2022	SIGKDD	Table Representation(Content)
52	Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation	2023	SIGMOD	Error Handling (Implicit), Table Annotation (PLMs), Table Simplification (Summarization), Table Representation (Content+Context+Metadata)
53	DeepJoin: Joinable Table Discovery with Pre-Trained Language Models	2023	VLDB	Error Handling (Implicit), Table Simplification (Sampling), Table Representation (Content), Table Indexing (HNSW), Schema Matching (Textual)
54	SANTOS: Relationship-based Semantic Table Union Search	2023	SIGMOD	Error Handling (Implicit), Table Annotation (Supervised-learning), Table Simplification ($\smallsetminus$), Table Representation (Content+Context), Table Indexing (Inverted index), Schema Matching (Textual)
55	Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning	2023	VLDB	Error Handling (Implicit), Table Annotation ($\smallsetminus$), Table Simplification (Sampling), Table Representation (Content+Context), Table Indexing (LSH+HNSW), Schema Matching (Textual)
56	Automatic Table Union Search with Tabular Representation Learning	2023	ACL Findings	Error Handling (Implicit), Table Simplification (Sampling), Table Representation (Content+Context)
57	Retrieval-Based Transformer for Table Augmentation	2023	ACL Findings	Error Handling (Implicit), Table Representation (Content+Context+Metadata)
58	HyTrel: Hypergraph-enhanced Tabular Data Representation Learning	2023	Nips	Error Handling (Implicit), Table Representation (Content+Context)
59	GOGGLE: Generative modelling for tabular data by learning relational structure	2023	ICLR	Table Representation (Content+Context)
60	Interpretable tabular data generation	2023	KAIS	Table Representation(Content)
61	STaSy: Score-based Tabular data Synthesis	2023	ICLR	Table Representation(Content)
62	CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis	2023	ICML	Table Representation(Content)
63	Diffusion models for missing value imputation in tabular data	2023	CoRR	Table Representation(Content)
64	Beyond Discrete Selection: Continuous Embedding Space Optimization for Generative Feature Selection	2023	ICDM	Table Representation(Content)
65	Reinforcement-Enhanced Autoregressive Feature Transformation: Gradient-steered Search in Continuous Space for Postfix Expressions	2023	Nips	Table Representation(Content)
66	OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories	2024	CoRR	Error Handling(Implicit), Table Representation (Content), Table Navigation (Linkage graph)
67	FeatNavigator: Automatic Feature Augmentation on Tabular Data	2024	CoRR	Error Handling(Implicit), Table Representation (Content)
68	Controllable Tabular Data Synthesis Using Diffusion Models	2024	SIGMOD	Table Representation(Content)
69	SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions	2024	CIDR	Error Handling(Implicit), Table Representation (Content+Context)
70	Differentially Private Tabular Data Synthesis using Large Language Models	2024	CoRR	Error Handling(Implicit), Table Representation (Content)

The illustration of different TDA scenarios and their suited pre-augmentation tasks.

Augmentation ⏩

Overview of the TDA tasks. The superscripts R and G indicate whether a TDA task is retrieval-based or generation-based, respectively.

The categorization of the TDA approaches, from both task-oriented and table-level perspectives. We also provide a concise introduction to the key TDA techniques within each category.

🔎 Retrieval-based

Entity Augmentation

[SIGMOD'12] InfoGather: entity augmentation and attribute discovery by holistic matching with web tables [paper]

[VLDB'18] Table union search on open data. [paper]

[ICDE'20] Dataset Discovery in Data Lakes [paper]

[SIGMOD'12] Finding related tables [paper]

[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]

[SIGIR'19] Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval [paper] [code]

[SIGMOD'23] SANTOS: Relationship-based Semantic Table Union Search [paper] [code]

[SIGMOD'20] Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks [paper] [code]

[NIPS'23] HyTrel: Hypergraph-enhanced Tabular Data Representation Learning [paper] [code]

[VLDB'23] Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. [paper] [code]

[ACL'23] Automatic Table Union Search with Tabular Representation Learning [paper]

Schema Augmentation

[VLDB'16] LSH ensemble: internet-scale domain search [paper] [code]

[SIGMOD'19] JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. [paper] [code]

[VLDB'20] ARDA: automatic relational data augmentation for machine learning [paper]

[ICDE'22] Feature Augmentation with Reinforcement Learning [paper]

[CoRR'24] FeatNavigator: Automatic Feature Augmentation on Tabular Data [paper]

[ICDE'21] Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach [paper]

[VLDB'23] DeepJoin: Joinable Table Discovery with Pre-Trained Language Models [paper]

[CoRR'24] OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories [paper]

[SIGMOD'20] Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks [paper] [code]

[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]
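
Several of the entries above (for example LSH Ensemble and JOSIE) deal with finding joinable columns through set overlap. The brute-force sketch below only illustrates that underlying notion, using set containment |Q ∩ X| / |Q| as the score. It does not reproduce the indexing or search algorithms of any listed paper, and the function names and toy data lake are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def containment(query: set, candidate: set) -> float:
    """Fraction of query values that also appear in the candidate column."""
    if not query:
        return 0.0
    return len(query & candidate) / len(query)


def top_k_joinable(
    query_column: Iterable[str], lake: Dict[str, List[str]], k: int = 2
) -> List[Tuple[str, float]]:
    """Rank every column in the lake by containment of the query column."""
    q = set(query_column)
    scored = [(containment(q, set(values)), name) for name, values in lake.items()]
    # Highest score first; ties broken by column name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:k]]


# Hypothetical data lake: column name -> column values.
lake = {
    "listings.city": ["Boston", "Austin", "Denver", "Seattle"],
    "schools.city": ["Boston", "Austin"],
    "weather.station": ["KBOS", "KAUS"],
}

result = top_k_joinable(["Boston", "Austin", "Denver"], lake, k=2)
```

A linear scan like this costs one set intersection per candidate column, which is the cost that index structures such as inverted indexes or LSH are meant to avoid at data-lake scale.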

Cell Completion

[SIGMOD'12] InfoGather: entity augmentation and attribute discovery by holistic matching with web tables [paper]

[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]

[ACL'23] Retrieval-Based Transformer for Table Augmentation [paper]

[SIGMOD'22] TURL: Table Understanding through Representation Learning [paper] [code]

[BDC'15] Towards a Hybrid Imputation Approach Using Web Tables [paper]

[CIKM'19] Auto-completion for Data Cells in Relational Tables [paper]

Table Integration

[SIGMOD'12] InfoGather: entity augmentation and attribute discovery by holistic matching with web tables [paper]

[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]

[ACL'23] Retrieval-Based Transformer for Table Augmentation [paper]

[VLDB'22] Integrating Data Lake Tables [paper] [code]

[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]

🔧 Generation-based

Record Generation

[SIGMOD'07] Privacy, accuracy, and consistency too: a holistic solution to contingency table release [paper]

[SIGMOD'14] PrivBayes: private data release via bayesian networks [paper]

[ICLR'19] PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees [paper]

[VLDB'18] Data synthesis based on generative adversarial networks [paper] [code]

[IJCAI'19] FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data [paper]

[KAIS'23] Interpretable tabular data generation [paper]

[ICLR'23] STaSy: Score-based Tabular data Synthesis [paper] [code]

[ICLR'23] GOGGLE: Generative modelling for tabular data by learning relational structure [paper]

[ICML'23] CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis [paper] [code]

[SIGMOD'24] Controllable Tabular Data Synthesis Using Diffusion Models [paper]

[CoRR'24] Differentially Private Tabular Data Synthesis using Large Language Models [paper]

[NIPS'19] Modeling Tabular data using Conditional GAN [paper]

[ESWA'21] Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning [paper]

[ICPRAM'21] SIGRNN: Synthetic Minority Instances Generation in Imbalanced Datasets using a Recurrent Neural Network [paper] [code]

[SIGKDD'22] SOS: Score-based Oversampling for Tabular Data [paper] [code]

Feature Construction

[DSAA'15] Deep feature synthesis: Towards automating data science endeavors [paper]

[ICDM'16] ExploreKit: Automatic Feature Generation and Selection [paper]

[CIDR'24] SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions [paper]

[ICDM'23] Beyond Discrete Selection: Continuous Embedding Space Optimization for Generative Feature Selection [paper]

[NIPS'23] Reinforcement-Enhanced Autoregressive Feature Transformation: Gradient-steered Search in Continuous Space for Postfix Expressions [paper]

Cell Imputation

[J. Stat. Soft.'11] MICE: Multivariate Imputation by Chained Equations in R [paper]

[Bioinformatics'12] MissForest—non-parametric missing value imputation for mixed-type data [paper]

[ICML'19] MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets [paper]

[PR'20] Handling incomplete heterogeneous data using VAEs [paper]

[ICML'18] GAIN: Missing Data Imputation using Generative Adversarial Nets [paper]

[ICMLA'21] Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems [paper]

[Nips'22] Diffusion models for missing value imputation in tabular data [paper] [code]
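
The methods listed above learn a model of the data to fill in missing cells. For orientation, the sketch below shows the task itself with the simplest possible baseline, column-mean imputation. This is not any of the listed methods; the function name and the toy rows are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def impute_column_mean(rows: List[Row], column: str) -> List[Row]:
    """Replace None in `column` with the mean of the observed values."""
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        # Nothing to estimate from: return copies unchanged.
        return [dict(row) for row in rows]
    fill = mean(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]


rows = [
    {"floor": 2, "price": 300},
    {"floor": None, "price": 320},
    {"floor": 6, "price": 410},
]
imputed = impute_column_mean(rows, "floor")
```

Model-based imputers replace the single `fill` value with a prediction conditioned on the other columns of the same row.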

Table Synthesis

[NIPS'22] TransTab: Learning Transferable Tabular Transformers Across Tables [paper] [code]

Post-augmentation ⏩

📅 TDA Datasets

Representative datasets used in TDA studies, including their basic properties and the specific TDA tasks they are suitable for.

No.	Dataset	Tables	Columns	Avg # Columns	Rows	Avg # Rows	URL
1	Web_Manual	371	-	-	18921	51	http://websail-fe.cs.northwestern.edu/TabEL/#Web_Manual
2	Wiki_Link	6085	-	-	121700	20	http://websail-fe.cs.northwestern.edu/TabEL/#Wiki_Link
3	WDC web table corpus	50M	250M	5	700M	14	https://webdatacommons.org/webtables/#results-2015
4	WikiTables corpus	1.6M	30.4M	19	-	-	http://websail-fe.cs.northwestern.edu/TabEL/#WikiTables
5	TUS Small	1,530	14,810	10	6.8M	4,466	https://github.com/RJMillerLab/table-union-search-benchmark
6	TUS Large	5,043	54,923	11	9.7M	1,915	https://github.com/RJMillerLab/table-union-search-benchmark
7	SANTOS Small	550	6,322	11	3.8M	6,921	https://github.com/northeastern-datalab/santos
8	SANTOS Large	11,090	123,477	11	70M	7,675	https://github.com/northeastern-datalab/santos
9	BTS	1	30	-	1M	-	https://www.transtats.bts.gov/DataIndex.asp
10	UCI datasets (e.g., Adult, Covertype)	-	-	-	-	-	https://archive.ics.uci.edu
11	Kaggle (e.g., Diabetes, Bank)	-	-	-	-	-	https://www.kaggle.com
12	OpenML repository (e.g., Heart, Horse)	-	-	-	-	-	https://www.openml.org

📐 Evaluation Policies

Original-table-based evaluation

[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]

[SIGIR'19] Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval [paper] [code]

[VLDB'18] Data synthesis based on generative adversarial networks [paper] [code]

[IJCAI'19] FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data [paper]

Model-based evaluation

[VLDB'22] Selective data acquisition in the wild for model charging [paper]

[IJCAI'19] FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data [paper]

[VLDB'23] Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. [paper] [code]

[ICDE'23] Metam: Goal-Oriented Data Discovery [paper]

[CoRR'24] FeatNavigator: Automatic Feature Augmentation on Tabular Data [paper]

[ICDE'22] Feature Augmentation with Reinforcement Learning [paper]

[VLDB'18] Data synthesis based on generative adversarial networks [paper] [code]

[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]
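
Model-based evaluation judges an augmentation by its effect on a downstream model. A minimal sketch of that idea, assuming a one-feature least-squares regressor and a fixed held-out set: train the same model on $T^O$ and on $T^A$, then compare held-out error. The data points and helper names are hypothetical, and the listed papers use their own models and metrics.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (feature, target), e.g. (area, price)


def fit_line(points: List[Point]) -> Tuple[float, float]:
    """Ordinary least squares for y = a * x + b (needs two distinct x values)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mse(model: Tuple[float, float], points: List[Point]) -> float:
    """Mean squared error of the fitted line on a set of points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in points) / len(points)


# Hypothetical data: the original table is tiny and noisy.
t_o = [(50, 260), (60, 290)]
extra = [(80, 400), (100, 500), (120, 600)]   # augmented records
t_a = t_o + extra
held_out = [(70, 350), (90, 450), (110, 550)]

error_original = mse(fit_line(t_o), held_out)
error_augmented = mse(fit_line(t_a), held_out)
gain = error_original - error_augmented  # positive means augmentation helped
```

Keeping the model and the held-out set fixed across both runs is what makes the difference attributable to the augmentation rather than to the evaluation setup.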

📈 Optimization Strategies

Iteration-based optimization

[VLDB'20] ARDA: automatic relational data augmentation for machine learning [paper]

[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]

[CoRR'24] FeatNavigator: Automatic Feature Augmentation on Tabular Data [paper]
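
One plausible reading of iteration-based optimization is a feedback loop that keeps a candidate augmentation only when it improves a score. The greedy loop below is a generic sketch under that assumption, not the procedure of ARDA, Leva, or FeatNavigator. The `score` function and the per-feature usefulness values are hypothetical stand-ins for a real model evaluation.

```python
from typing import Callable, Dict, List, Tuple


def greedy_augment(
    base_features: List[str],
    candidates: List[str],
    score: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Repeatedly add any candidate feature that raises the score."""
    selected = list(base_features)
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for feature in candidates:
            if feature in selected:
                continue
            trial = score(selected + [feature])
            if trial > best:  # keep the feature only if it helps
                selected.append(feature)
                best = trial
                improved = True
    return selected, best


# Hypothetical usefulness of each feature; a real score would train a model.
usefulness: Dict[str, float] = {
    "location": 0.1,
    "floor": 0.2,
    "area": 0.5,
    "noise": -0.1,
}


def toy_score(features: List[str]) -> float:
    """Sum of per-feature usefulness, standing in for validation accuracy."""
    return sum(usefulness[f] for f in features)


selected, best = greedy_augment(["location"], ["floor", "noise", "area"], toy_score)
```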

Reinforcement-learning-based optimization

[VLDB'21] Automatic data acquisition for deep learning [paper]

[ICDE'22] Feature Augmentation with Reinforcement Learning [paper]

[VLDB'22] Selective data acquisition in the wild for model charging [paper]

SuDIS-ZJU/awesome-tabular-data-augmentation

Folders and files

Latest commit

History

Repository files navigation

♣️ Tabular Data Augmentation for Machine Learning: Progress and Prospects of Embracing Generative AI.

An example of TDA for ML

TDA pipeline

Pre-augmentation ⏩

Augmentation ⏩

🔎 Retrieval-based

Entity Augmentation

Schema Augmentation

Cell Completion

Table Integration

🔧 Generation-based

Record Generation

Feature Construction

Cell Imputation

Table Synthesis

Post-augmentation ⏩

📅 TDA Datasets

📐 Evaluation Polices

Original-table-based evaluation

Model-based evaluation

📈 Optimization Strategies

Iteration-based optimization

Reinforcement-learning-based optimization

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Packages