TabLLM: Few-shot Classification of Tabular Data with Large Language Models

This repository contains the code to reproduce the results of the paper TabLLM: Few-shot Classification of Tabular Data with Large Language Models by Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag.

Reproducing the main results consists of three steps:

1. Creating textual serializations of the nine public tabular datasets
2. Training and evaluating TabLLM on the serialized datasets (using code from the t-few project)
3. Running the baseline models on the tabular datasets

We did not include the code to serialize and evaluate the private healthcare dataset due to privacy concerns. Also, code for some additional experiments is not included. Feel free to contact us if you have any questions concerning these experiments.

Preparing the Environment

We used conda to create a virtual environment with Python 3.8:

conda create -n tabllm python==3.8
conda activate tabllm

Next, install the necessary requirements.

conda install numpy scipy pandas scikit-learn
conda install pytorch==1.10.1 torchvision==0.11.2 torchaudio==0.10.1 cudatoolkit=11.3 -c pytorch -c conda-forge
pip install datasets transformers sentencepiece protobuf xgboost lightgbm tabpfn
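
As an optional sanity check after installation, the following minimal sketch reports which of the packages listed above cannot be found in the active environment. It is not part of the repository; the helper names `is_installed` and `missing_modules` are hypothetical, and the import names (for example `sklearn` for scikit-learn, `torch` for pytorch, `google.protobuf` for protobuf) are assumed from the usual conventions of these packages.

```python
import importlib.util

# Import names assumed to correspond to the packages installed above
# (e.g. scikit-learn is imported as "sklearn", pytorch as "torch").
REQUIRED_MODULES = [
    "numpy", "scipy", "pandas", "sklearn",
    "torch", "torchvision", "torchaudio",
    "datasets", "transformers", "sentencepiece", "google.protobuf",
    "xgboost", "lightgbm", "tabpfn",
]


def is_installed(module_name):
    """Return True if the module can be located in the current environment.

    The module itself is not imported; for dotted names such as
    "google.protobuf" the parent package is imported to resolve the child.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package is missing or has no valid spec.
        return False


def missing_modules(module_names=REQUIRED_MODULES):
    """Return the subset of module_names that cannot be located."""
    return [name for name in module_names if not is_installed(name)]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Run it inside the activated `tabllm` environment; an empty result only shows that the packages can be found, not that their versions match the ones pinned above.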

1. Creating Serialized Datasets

To create a textual serialization for one of the tabular datasets, execute the following script. Optional arguments select a specific serialization type. The script creates a folder containing a Hugging Face dataset in datasets_serialized:

create_external_datasets.py --dataset (car|income|diabetes|heart|bank|blood|calhousing|creditg|jungle) (--list) (--list (--tabletotext|--t0serialization|--values|--permuted|--shuffled))

For the Text GPT serialization, we used a script that queries the GPT-3 API with a row entry encoded as a list and the prompts given in the paper.

We provide the Text serializations in datasets_serialized. The other serializations are omitted here due to size constraints. The Text serialization achieved the best results in our experiments.
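
To make the idea of a serialization concrete, here is a minimal sketch that turns one table row into text in two ways: one short sentence per feature, in the spirit of the Text serialization, and one "column: value" line per feature, similar to the list encoding mentioned above. The function names, the exact template wording, and the example row are assumptions for illustration only; the templates actually used are defined in create_external_datasets.py and described in the paper.

```python
def serialize_text(row):
    """Serialize a row as one short sentence per feature (illustrative template)."""
    return " ".join(f"The {column} is {value}." for column, value in row.items())


def serialize_list(row):
    """Serialize a row as one '- column: value' line per feature (illustrative template)."""
    return "\n".join(f"- {column}: {value}" for column, value in row.items())


# Hypothetical example row; column names and values are made up.
example_row = {"age": 42, "education": "Masters", "occupation": "Sales"}

print(serialize_text(example_row))
print(serialize_list(example_row))
```

Both functions rely on the insertion order of the dictionary, so features appear in the same order as the columns of the row passed in.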

2. Train and Evaluate TabLLM on Serialized Datasets

We used the codebase of the t-few project for our experiments and made some small modifications to enable experiments with our custom datasets and templates. All changed files are included in the t-few folder. The script few-shot-pretrained-100k.sh runs all our TabLLM experiments for the different serializations and writes the results to an exp_out folder. For more information, please refer to the original t-few repository.

3. Running the Baseline Models

We tested TabLLM against several baselines. They use the standard, non-serialized datasets. The hyperparameter ranges are given in the paper. You can specify the baseline models and datasets that you want to run in the code. To run a baseline model, execute:

evaluate_external_dataset.py
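
For readers who want to experiment with baselines outside of this script, the sketch below shows one simple way a reproducible few-shot training subset could be drawn before fitting a model. It is an assumption for illustration, not the sampling procedure used in the repository; the function name `sample_few_shot` and its parameters are hypothetical.

```python
import random


def sample_few_shot(num_examples, num_shots, seed=0):
    """Return sorted indices of num_shots training examples, drawn without replacement.

    Using a dedicated random.Random(seed) keeps the draw reproducible and
    independent of the global random state.
    """
    if num_shots > num_examples:
        raise ValueError("num_shots exceeds the number of available examples")
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_examples), num_shots))


# Hypothetical usage: pick 4 of 10 training rows, then index features and labels.
labels = [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]
shot_indices = sample_few_shot(len(labels), num_shots=4, seed=42)
few_shot_labels = [labels[i] for i in shot_indices]
```

Plain random sampling does not guarantee that every class appears in a very small subset, so a stratified draw may be preferable when the number of shots is low.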

We hope these instructions help you to reproduce our results. Feel free to contact us if you have any questions!

Citation

If you want to cite our work, please use:

@inproceedings{hegselmann2023tabllm,
  title={Tabllm: Few-shot classification of tabular data with large language models},
  author={Hegselmann, Stefan and Buendia, Alejandro and Lang, Hunter and Agrawal, Monica and Jiang, Xiaoyi and Sontag, David},
  booktitle={International Conference on Artificial Intelligence and Statistics},
  pages={5549--5581},
  year={2023},
  organization={PMLR}
}

We use the code of the following projects:

@article{liu2022few,
  title={Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning},
  author={Liu, Haokun and Tam, Derek and Muqeeth, Mohammed and Mohta, Jay and Huang, Tenghao and Bansal, Mohit and Raffel, Colin A},
  journal={Advances in Neural Information Processing Systems},
  volume={35},
  pages={1950--1965},
  year={2022}
}

@inproceedings{bach2022promptsource,
  title={PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts},
  author={Bach, Stephen and Sanh, Victor and Yong, Zheng Xin and Webson, Albert and Raffel, Colin and Nayak, Nihal V and Sharma, Abheesht and Kim, Taewoon and Bari, M Saiful and F{\'e}vry, Thibault and others},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations},
  pages={93--104},
  year={2022}
}
