♣️ Tabular Data Augmentation for Machine Learning: Progress and Prospects of Embracing Generative AI.
A survey providing a comprehensive examination of tabular data augmentation (TDA) methods tailored for ML scenarios, with a special emphasis on the recent advancements in incorporating generative AI techniques.
Check out the latest version of the paper on arXiv.
This content is a work in progress and will be continuously updated. Stay tuned!
① A data scientist aims to predict house prices based on factors like location and floor using an original training set (
The overview of TDA pipeline and the task-based taxonomy for TDA approaches. The input and output of the TDA pipeline are the original table
Pre-augmentation ⏩
In the TDA pipeline, pre-augmentation encompasses the preparation tasks that facilitate effective augmentation. Below is an overview of the pre-augmentation tasks and the TDA tasks they target.
Overview of pre-augmentation tasks. The table contains a non-exhaustive list of representative TDA-related works (arranged in chronological order) and the corresponding pre-augmentation tasks they involve.
No. | Reference | Pub.Year | Publication | Pre-augmentation Methods |
---|---|---|---|---|
1 | Annotating and searching web tables using entities, types and relationships | 2010 | VLDB | Table Annotation (Ontology), Table Representation(Content) |
2 | Recovering semantics of tables on the web | 2011 | VLDB | Table Annotation (Ontology), Table Representation(Content), Schema Matching(Textual) |
3 | MICE: Multivariate Imputation by Chained Equations in R | 2011 | J. Stat. Soft. | Table Representation(Content) |
4 | MissForest—non-parametric missing value imputation for mixed-type data | 2012 | Bioinformatics | Table Representation(Content) |
5 | Finding related tables | 2012 | SIGMOD | Table Representation(Content+Metadata), Schema Matching(Textual+Metadata), Entity Matching(KB) |
6 | InfoGather: entity augmentation and attribute discovery by holistic matching with web tables | 2012 | SIGMOD | Table Simplification(Summarization), Table Representation(Content+Metadata), Table Indexing(Inverted index), Schema Matching(Textual+Metadata) |
7 | Towards a Hybrid Imputation Approach Using Web Tables | 2015 | BDC | Table Representation(Content+Metadata), Table Indexing(Inverted index) |
8 | TabEL: Entity Linking in Web Tables | 2015 | ISWC | Table Representation(Content), Entity Matching(DB) |
9 | Deep feature synthesis: Towards automating data science endeavors | 2015 | DSAA | Table Representation(Content) |
10 | Entity Resolution in the Web of Data | 2015 | SLDSK | Table Representation(Content), Entity Matching(KB) |
11 | ExploreKit: Automatic Feature Generation and Selection | 2016 | ICDM | Table Representation(Content) |
12 | LSH ensemble: internet-scale domain search | 2016 | VLDB | Table Representation(Content), Table Indexing(LSH) |
13 | EntiTables: Smart Assistance for Entity-Focused Tables | 2017 | SIGIR | Table Representation(Content), Table Indexing(Inverted index), Entity Matching(KB+DB) |
14 | Table union search on open data | 2018 | VLDB | Error Handling(Implicit), Table Annotation(Supervised-learning), Table Representation(Content), Table Indexing(LSH), Schema Matching(Textual) |
15 | Aurum: A Data Discovery System | 2018 | ICDE | Table Representation(Content), Table Navigation(Linkage graph), Schema Matching(Numerical) |
16 | Data synthesis based on generative adversarial networks | 2018 | VLDB | Table Representation(Content) |
17 | GAIN: Missing Data Imputation using Generative Adversarial Nets | 2018 | ICML | Table Representation(Content) |
18 | Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval | 2019 | SIGIR | Table Representation(Content), Entity Matching(KB+DB) |
19 | JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes | 2019 | SIGMOD | Table Representation(Content), Table Indexing(Inverted index) |
20 | Sherlock: A Deep Learning Approach to Semantic Data Type Detection | 2019 | SIGKDD | Table Annotation(Supervised-learning), Table Representation(Content) |
21 | Auto-completion for Data Cells in Relational Tables | 2019 | CIKM | Table Simplification(Summarization), Table Representation(Content+Metadata), Entity Matching(KB+DB) |
22 | MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets | 2019 | ICML | Table Representation(Content) |
23 | FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data | 2019 | IJCAI | Table Representation(Content) |
24 | PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees | 2019 | ICLR | Table Representation(Content) |
25 | Modeling Tabular data using Conditional GAN | 2019 | Nips | Table Representation(Content) |
26 | MisGAN: Learning from Incomplete Data with Generative Adversarial Networks | 2019 | ICLR | Table Representation(Content) |
27 | Handling incomplete heterogeneous data using VAEs | 2020 | PR | Table Representation(Content) |
28 | Dataset Discovery in Data Lakes | 2020 | ICDE | Error Handling(Implicit), Table Annotation(Supervised-learning), Table Representation(Content), Table Indexing(LSH), Schema Matching(Textual+Numerical) |
29 | ARDA: automatic relational data augmentation for machine learning | 2020 | VLDB | Error Handling(Explicit), Table Simplification(Sampling), Table Representation(Content) |
30 | Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks | 2020 | SIGMOD | Error Handling(Explicit), Table Representation(Content+Context) |
31 | Sato: contextual semantic type detection in tables | 2020 | VLDB | Error Handling(Implicit), Table Annotation(Supervised-learning), Table Simplification(Summarization), Table Representation(Content+Context) |
32 | Organizing Data Lakes for Navigation | 2020 | VLDB | Table Representation (Content), Table Navigation (Linkage graph) |
33 | Discovering related data at scale | 2021 | VLDB | Table Simplification (Sampling), Table Representation (Content+Metadata) |
34 | Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach | 2021 | ICDE | Error Handling (Implicit), Table Representation (Content), Table Indexing (Inverted index), Schema Matching (Textual) |
35 | Automatic data acquisition for deep learning | 2021 | VLDB | Table Representation(Content) |
36 | RONIN: data lake exploration | 2021 | VLDB | Table Representation(Content),Table Navigation(Hierarchical structure) |
37 | Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning | 2021 | ESWA | Table Representation(Content) |
38 | SIGRNN: Synthetic Minority Instances Generation in Imbalanced Datasets using a Recurrent Neural Network | 2021 | ICPRAM | Table Representation(Content) |
39 | Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems | 2021 | ICMLA | Table Representation(Content) |
40 | Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation | 2022 | SIGMOD | Error Handling (Explicit), Table Representation (Content) |
41 | MATE: multi-attribute table extraction | 2022 | VLDB | Table Representation (Content), Table Indexing (Inverted index) |
42 | Integrating Data Lake Tables | 2022 | VLDB | Table Annotation (PLMs), Table Representation (Content), Table Navigation (Clustering) |
43 | Feature Augmentation with Reinforcement Learning | 2022 | ICDE | Table Simplification (Sampling), Table Representation (Content) |
44 | Selective data acquisition in the wild for model charging | 2022 | VLDB | Table Representation (Content), Table Navigation (Clustering) |
45 | A Sketch-based Index for Correlated Dataset Search | 2022 | ICDE | Table Representation(Content), Table Navigation(Linkage graph), Schema Matching(Numerical) |
46 | StruBERT: Structure-aware BERT for Table Search and Matching | 2022 | WWW | Table Representation(Content) |
47 | TURL: Table Understanding through Representation Learning | 2022 | SIGMOD | Error Handling (Implicit), Table Simplification (Summarization), Table Representation (Content+Context+Metadata) |
48 | Data Lake Organization | 2022 | TKDE | Table Representation (Content), Table Navigation (Linkage graph) |
49 | Annotating Columns with Pre-trained Language Models | 2022 | SIGMOD | Table Annotation (PLMs), Table Representation (Content+Context) |
50 | TransTab: Learning Transferable Tabular Transformers Across Tables | 2022 | Nips | Table Representation (Content), Table Navigation (Linkage graph) |
51 | SOS: Score-based Oversampling for Tabular Data | 2022 | SIGKDD | Table Representation(Content) |
52 | Watchog: A Light-weight Contrastive Learning based Framework for Column Annotation | 2023 | SIGMOD | Error Handling (Implicit), Table Annotation (PLMs), Table Simplification (Summarization), Table Representation (Content+Context+Metadata) |
53 | DeepJoin: Joinable Table Discovery with Pre-Trained Language Models | 2023 | VLDB | Error Handling (Implicit), Table Simplification (Sampling), Table Representation (Content), Table Indexing (HNSW), Schema Matching (Textual) |
54 | SANTOS: Relationship-based Semantic Table Union Search | 2023 | SIGMOD | Error Handling (Implicit), Table Annotation (Supervised-learning), Table Simplification (∖), Table Representation (Content+Context), Table Indexing (Inverted index), Schema Matching (Textual) |
55 | Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning | 2023 | VLDB | Error Handling (Implicit), Table Annotation (∖), Table Simplification (Sampling), Table Representation (Content+Context), Table Indexing (LSH+HNSW), Schema Matching (Textual) |
56 | Automatic Table Union Search with Tabular Representation Learning | 2023 | ACL Findings | Error Handling (Implicit), Table Simplification (Sampling), Table Representation (Content+Context) |
57 | Retrieval-Based Transformer for Table Augmentation | 2023 | ACL Findings | Error Handling (Implicit), Table Representation (Content+Context+Metadata) |
58 | HyTrel: Hypergraph-enhanced Tabular Data Representation Learning | 2023 | Nips | Error Handling (Implicit), Table Representation (Content+Context) |
59 | GOGGLE: Generative modelling for tabular data by learning relational structure | 2023 | ICLR | Table Representation (Content+Context) |
60 | Interpretable tabular data generation | 2023 | KAIS | Table Representation(Content) |
61 | STaSy: Score-based Tabular data Synthesis | 2023 | ICLR | Table Representation(Content) |
62 | CoDi: Co-evolving Contrastive Diffusion Models for Mixed-type Tabular Synthesis | 2023 | ICML | Table Representation(Content) |
63 | Diffusion models for missing value imputation in tabular data | 2023 | CoRR | Table Representation(Content) |
64 | Beyond Discrete Selection: Continuous Embedding Space Optimization for Generative Feature Selection | 2023 | ICDM | Table Representation(Content) |
65 | Reinforcement-Enhanced Autoregressive Feature Transformation: Gradient-steered Search in Continuous Space for Postfix Expressions | 2023 | Nips | Table Representation(Content) |
66 | OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories | 2024 | CoRR | Error Handling(Implicit), Table Representation (Content), Table Navigation (Linkage graph) |
67 | FeatNavigator: Automatic Feature Augmentation on Tabular Data | 2024 | CoRR | Error Handling(Implicit), Table Representation (Content) |
68 | Controllable Tabular Data Synthesis Using Diffusion Models | 2024 | SIGMOD | Table Representation(Content) |
69 | SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions | 2024 | CIDR | Error Handling(Implicit), Table Representation (Content+Context) |
70 | Differentially Private Tabular Data Synthesis using Large Language Models | 2024 | CoRR | Error Handling(Implicit), Table Representation (Content) |
Illustration of different TDA scenarios and the pre-augmentation tasks suited to each.
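Several works in the table above list "Table Indexing (Inverted index)" as a pre-augmentation step. As a rough illustration only, the following sketch builds an inverted index from cell values to columns and ranks columns by value overlap with a query column. It is not the method of any cited system; the column ids, the toy data, and the helper names `build_inverted_index` and `top_k_overlap` are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(columns):
    """Map each normalized cell value to the set of column ids containing it.

    `columns` is a dict {column_id: iterable of cell values}.
    """
    index = defaultdict(set)
    for column_id, values in columns.items():
        for value in values:
            index[str(value).strip().lower()].add(column_id)
    return index


def top_k_overlap(index, query_values, k=2):
    """Rank indexed columns by how many distinct query values they share."""
    query = {str(value).strip().lower() for value in query_values}
    overlap = Counter()
    for value in query:
        # .get avoids inserting unseen values into the defaultdict.
        for column_id in index.get(value, ()):
            overlap[column_id] += 1
    return overlap.most_common(k)


# Hypothetical data-lake columns, keyed by "table.column".
lake = {
    "listings.city": ["Boston", "Austin", "Denver", "Seattle"],
    "schools.city": ["boston", "Denver", "Miami"],
    "weather.station": ["KBOS", "KAUS"],
}
index = build_inverted_index(lake)
result = top_k_overlap(index, ["Boston", "Denver", "Austin"])
```

Only columns that share at least one value with the query are ever touched, which is the main appeal of an inverted index over scanning every column; real systems add further pruning and approximate structures such as LSH or HNSW.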
Augmentation ⏩
Overview of the TDA tasks. The superscripts R and G indicate whether a TDA task is retrieval-based or generation-based, respectively.
The categorization of the TDA approaches, from both task-oriented and table-level perspectives. We also provide a concise introduction to the key TDA techniques within each category.
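To make the retrieval-based versus generation-based distinction concrete, here is a minimal sketch on a toy house-price table. It assumes a hypothetical external table with a `school_rating` column, and it uses independent per-column Gaussians as a deliberately crude stand-in for a learned generative model; none of the surveyed methods is implemented here.

```python
import random
import statistics

# Toy original table; the columns echo the house-price example in the overview.
original = [
    {"location": "downtown", "floor": 3, "price": 520.0},
    {"location": "suburb", "floor": 1, "price": 310.0},
    {"location": "downtown", "floor": 8, "price": 640.0},
]

# Hypothetical external table that a retrieval step might have found.
external = [
    {"location": "downtown", "school_rating": 8},
    {"location": "suburb", "school_rating": 6},
]


def retrieve_columns(table, external_table, key):
    """Retrieval-based: copy extra columns from the matching external row."""
    lookup = {row[key]: row for row in external_table}
    extra_names = [name for name in external_table[0] if name != key]
    augmented = []
    for row in table:
        match = lookup.get(row[key], {})
        # Unmatched rows still get the new columns, filled with None.
        extra = {name: match.get(name) for name in extra_names}
        augmented.append({**row, **extra})
    return augmented


def generate_rows(table, numeric_columns, n, seed=0):
    """Generation-based: sample new rows from per-column Gaussians.

    A real approach would fit a GAN, VAE, diffusion model, or LLM instead.
    """
    rng = random.Random(seed)
    params = {}
    for column in numeric_columns:
        values = [row[column] for row in table]
        params[column] = (statistics.mean(values), statistics.pstdev(values))
    return [
        {column: rng.gauss(mu, sigma) for column, (mu, sigma) in params.items()}
        for _ in range(n)
    ]


wide = retrieve_columns(original, external, key="location")
synthetic = generate_rows(original, ["floor", "price"], n=2)
```

Here `wide` has the same rows as `original` plus one retrieved column, while `synthetic` contains new rows that never existed in any source table. Sampling columns independently ignores correlations such as floor versus price, which is exactly the kind of structure the generative models listed below aim to capture.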
[SIGMOD'12] InfoGather: entity augmentation and attribute discovery by holistic matching with web tables [paper]
[VLDB'18] Table union search on open data. [paper]
[ICDE'20] Dataset Discovery in Data Lakes [paper]
[SIGMOD'12] Finding related tables [paper]
[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]
[SIGIR'19] Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval [paper] [code]
[SIGMOD'23] SANTOS: Relationship-based Semantic Table Union Search [paper] [code]
[SIGMOD'20] Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks [paper] [code]
[NIPS'23] HyTrel: Hypergraph-enhanced Tabular Data Representation Learning [paper] [code]
[VLDB'23] Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. [paper] [code]
[ACL'23] Automatic Table Union Search with Tabular Representation Learning [paper]
[VLDB'16] LSH ensemble: internet-scale domain search [paper] [code]
[SIGMOD'19] JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. [paper] [code]
[VLDB'20] ARDA: automatic relational data augmentation for machine learning [paper]
[ICDE'22] Feature Augmentation with Reinforcement Learning [paper]
[CoRR'24] FeatNavigator: Automatic Feature Augmentation on Tabular Data [paper]
[ICDE'21] Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach [paper]
[VLDB'23] DeepJoin: Joinable Table Discovery with Pre-Trained Language Models [paper]
[CoRR'24] OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories [paper]
[SIGMOD'20] Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks [paper] [code]
[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]
[SIGMOD'12] InfoGather: entity augmentation and attribute discovery by holistic matching with web tables [paper]
[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]
[ACL'23] Retrieval-Based Transformer for Table Augmentation [paper]
[SIGMOD'22] TURL: Table Understanding through Representation Learning [paper] [code]
[BDC'15] Towards a Hybrid Imputation Approach Using Web Tables [paper]
[CIKM'19] Auto-completion for Data Cells in Relational Tables [paper]
[SIGMOD'12] InfoGather: entity augmentation and attribute discovery by holistic matching with web tables [paper]
[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]
[ACL'23] Retrieval-Based Transformer for Table Augmentation [paper]
[VLDB'22] Integrating Data Lake Tables [paper] [code]
[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]
[SIGMOD'07] Privacy, accuracy, and consistency too: a holistic solution to contingency table release [paper]
[SIGMOD'14] PrivBayes: private data release via bayesian networks [paper]
[ICLR'19] PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees [paper]
[IJCAI'19] FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data [paper]
[KAIS'23] Interpretable tabular data generation [paper]
[ICLR'23] STaSy: Score-based Tabular data Synthesis [paper] [code]
[ICLR'23] GOGGLE: Generative modelling for tabular data by learning relational structure [paper]
[ICML'23] CoDi: Co-evolving Contrastive Diffusion Models for Mixed type Tabular Synthesis [paper] [code]
[SIGMOD'24] Controllable Tabular Data Synthesis Using Diffusion Models [paper]
[CoRR'24] Differentially Private Tabular Data Synthesis using Large Language Models [paper]
[NIPS'19] Modeling Tabular data using Conditional GAN [paper]
[ESWA'21] Conditional Wasserstein GAN-based oversampling of tabular data for imbalanced learning [paper]
[ICPRAM'21] SIGRNN: Synthetic Minority Instances Generation in Imbalanced Datasets using a Recurrent Neural Network [paper] [code]
[SIGKDD'22] SOS: Score-based Oversampling for Tabular Data [paper] [code]
[DSAA'15] Deep feature synthesis: Towards automating data science endeavors [paper]
[ICDM'16] ExploreKit: Automatic Feature Generation and Selection [paper]
[CIDR'24] SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions [paper]
[ICDM'23] Beyond Discrete Selection: Continuous Embedding Space Optimization for Generative Feature Selection [paper]
[NIPS'23] Reinforcement-Enhanced Autoregressive Feature Transformation: Gradient-steered Search in Continuous Space for Postfix Expressions [paper]
[J. Stat. Soft.'11] MICE: Multivariate Imputation by Chained Equations in R [paper]
[Bioinformatics'12] MissForest—non-parametric missing value imputation for mixed-type data [paper]
[ICML'19] MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets [paper]
[PR'20] Handling incomplete heterogeneous data using VAEs [paper]
[ICML'18] GAIN: Missing Data Imputation using Generative Adversarial Nets [paper]
[ICMLA'21] Multiple Imputation via Generative Adversarial Network for High dimensional Blockwise Missing Value Problems [paper]
[Nips'22] Diffusion models for missing value imputation in tabular data [paper] [code]
[NIPS'22] TransTab: Learning Transferable Tabular Transformers Across Tables [paper] [code]
Post-augmentation ⏩
Representative datasets used in TDA studies, including their basic properties and the specific TDA tasks they are suitable for.
No. | Dataset | Tables | Columns | Avg # Columns | Rows | Avg # Rows | URL |
---|---|---|---|---|---|---|---|
1 | Web_Manual | 371 | - | - | 18921 | 51 | http://websail-fe.cs.northwestern.edu/TabEL/#Web_Manual |
2 | Wiki_Link | 6085 | - | - | 121700 | 20 | http://websail-fe.cs.northwestern.edu/TabEL/#Wiki_Link |
3 | WDC web table corpus | 50M | 250M | 5 | 700M | 14 | https://webdatacommons.org/webtables/#results-2015 |
4 | WikiTables corpus | 1.6M | 30.4M | 19 | - | - | http://websail-fe.cs.northwestern.edu/TabEL/#WikiTables |
5 | TUS Small | 1,530 | 14,810 | 10 | 6.8M | 4,466 | https://github.com/RJMillerLab/table-union-search-benchmark |
6 | TUS Large | 5,043 | 54,923 | 11 | 9.7M | 1,915 | https://github.com/RJMillerLab/table-union-search-benchmark |
7 | SANTOS Small | 550 | 6,322 | 11 | 3.8M | 6,921 | https://github.com/northeastern-datalab/santos |
8 | SANTOS Large | 11,090 | 123,477 | 11 | 70M | 7,675 | https://github.com/northeastern-datalab/santos |
9 | BTS | 1 | 30 | - | 1M | - | https://www.transtats.bts.gov/DataIndex.asp |
10 | UCI datasets (e.g., Adult, Covertype) | - | - | - | - | - | https://archive.ics.uci.edu |
11 | Kaggle (e.g., Diabetes, Bank) | - | - | - | - | - | https://www.kaggle.com |
12 | OpenML repository (e.g., Heart, Horse) | - | - | - | - | - | https://www.openml.org |
[SIGIR'17] EntiTables: Smart Assistance for Entity-Focused Tables [paper] [code]
[SIGIR'19] Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval [paper] [code]
[VLDB'18] Data synthesis based on generative adversarial networks [paper] [code]
[IJCAI'19] FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data [paper]
[VLDB'22] Selective data acquisition in the wild for model charging [paper]
[IJCAI'19] FakeTables: Using GANs to Generate Functional Dependency Preserving Tables with Bounded Real Data [paper]
[VLDB'23] Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. [paper] [code]
[ICDE'23] Metam: Goal-Oriented Data Discovery [paper]
[CoRR'24] FeatNavigator: Automatic Feature Augmentation on Tabular Data [paper]
[ICDE'22] Feature Augmentation with Reinforcement Learning [paper]
[VLDB'18] Data synthesis based on generative adversarial networks [paper] [code]
[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]
[VLDB'20] ARDA: automatic relational data augmentation for machine learning [paper]
[SIGMOD'22] Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation [paper]
[CoRR'24] FeatNavigator: Automatic Feature Augmentation on Tabular Data [paper]
[VLDB'21] Automatic data acquisition for deep learning [paper]
[ICDE'22] Feature Augmentation with Reinforcement Learning [paper]
[VLDB'22] Selective data acquisition in the wild for model charging [paper]