NL2SQL by generating candidate SQLs and ranking them.
The official repository containing the code and pre-trained models for our paper GAR: A Generate-and-Rank Approach for Natural Language to SQL Translation, published at the 2023 IEEE 39th International Conference on Data Engineering (ICDE).
If you use our code in your study, or find GAR useful, please cite it as follows:
@inproceedings{Yuankai2023:GAR,
author = {Yuankai Fan and Zhenying He and Tonghui Ren and Dianjun Guo and Lin Chen and Ruisi Zhu and Guanduo Chen and Yinan Jing and Kai Zhang and X. Sean Wang},
title = {{GAR}: A Generate-and-Rank Approach for Natural Language to SQL Translation},
booktitle = {39th {IEEE} International Conference on Data Engineering, {ICDE} 2023, Anaheim, CA, USA, April 3-7, 2023},
pages = {110--122},
publisher = {{IEEE}},
year = {2023},
url = {https://doi.org/10.1109/ICDE55515.2023.00016},
doi = {10.1109/ICDE55515.2023.00016},
timestamp = {Fri, 28 Jul 2023 08:30:20 +0200},
biburl = {https://dblp.org/rec/conf/icde/FanHRGCZCJZW23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This code implements:
- The GAR generate-and-rank method for NL2SQL translation.
- An NL2SQL model for the few-shot learning setting, which achieves significant improvements on several benchmarks.
TL;DR: We introduce GAR, a novel generate-and-rank approach to the NL2SQL translation problem. GAR assumes that a set of sample queries on a database is given, and it can use SQL queries that are "component-similar" to the given samples to answer NL queries.
The objective of NL2SQL translation is to convert a natural language query into an SQL query.
Although seq2seq-based approaches have shown good results on standard benchmarks, they may not perform well on more complex queries that demand an understanding of the database's specific structure and semantics. The main issue is that such complex queries require more training data on the target database, which is not generally provided in the benchmarks.
A more effective strategy would be to move away from the conventional seq2seq framework and aim to address the shortage of training data for the NL2SQL task. This is the approach taken by the GAR method.
Given a set of sample SQL queries, GAR uses the following three steps to do the translation:
- Generalization: Use a set of generalization rules to generalize the sample queries to provide good coverage for component-similar queries.
- SQL2NL: Translate the sample and generalized SQL queries to NL expressions called dialects.
- Learning-to-rank: Rank the dialect expressions by their semantic similarity to a given NL query; the SQL query whose dialect is the closest match is returned as the translation result.
This process is illustrated in the diagram below:
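As a complement to the diagram, here is a minimal, runnable Python sketch of the three steps. Everything in it is a toy stand-in: the real system applies its full rule set for generalization, its own SQL2NL translation, and trained ranking models rather than the string similarity used here, and all names below are illustrative.

```python
# Toy sketch of the three GAR steps (stand-ins only, not the repository's API).
from difflib import SequenceMatcher

# Step 1: Generalization -- expand the sample queries with rules.
# Here, a single toy rule swaps the aggregate function.
def generalize(samples):
    candidates = set(samples)
    for sql in samples:
        for old, new in [("max(", "min("), ("min(", "max(")]:
            if old in sql:
                candidates.add(sql.replace(old, new))
    return sorted(candidates)

# Step 2: SQL2NL -- translate each SQL query to an NL "dialect".
# A real translator is rule-driven; this is a lookup table for the demo.
DIALECTS = {
    "select max(population) from city": "find the largest population of all cities",
    "select min(population) from city": "find the smallest population of all cities",
}

# Step 3: Learning-to-rank -- pick the candidate whose dialect is closest
# to the NL query. GAR uses trained models; this uses string similarity.
def rank(nl_query, candidates):
    def score(sql):
        dialect = DIALECTS.get(sql, sql)
        return SequenceMatcher(None, nl_query.lower(), dialect).ratio()
    return max(candidates, key=score)

samples = ["select max(population) from city"]
candidates = generalize(samples)  # now also includes the min() variant
print(rank("what is the smallest population of all cities?", candidates))
# -> select min(population) from city
```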
First, set up a Python environment. This code base has been tested with Python 3.7.
- Install the required packages:
pip install -r requirements.txt --no-deps
- Download the Spider and GEO datasets and put the data into the datasets folder (the QBEN data is published here). Unpack the datasets to create the following directory structure:
/datasets
├── spider
│   ├── database
│   │   └── ...
│   ├── dev.json
│   ├── dev_gold.sql
│   ├── tables.json
│   ├── train_gold.sql
│   ├── train_others.json
│   └── train_spider.json
└── geo
    ├── database
    │   └── ...
    ├── dev.json
    ├── test.json
    ├── train.json
    └── tables.json
The training script train_pipeline.sh is located in the root directory. You can run it with:
$ bash train_pipeline.sh <dataset_name> <train_data_path> <dev_data_path> <table_path> <db_dir>
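For example, a Spider run could look as follows (the argument values are illustrative and assume the dataset layout shown above):

$ bash train_pipeline.sh spider datasets/spider/train_spider.json datasets/spider/dev.json datasets/spider/tables.json datasets/spider/database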
The training script will create the directory saved_models in the current directory, where training artifacts such as checkpoints are stored.
The training includes four phases (a sketch of how the two trained models compose at inference follows the list):
- Retrieval model training data generation. Note that this phase is expected to take some time, as it generates a large set of SQL-dialect pairs for each training database.
- Retrieval model training
- Re-ranking model training data generation
- Re-ranking model training
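Conceptually, the two trained models compose at inference into a retrieve-then-re-rank pipeline: the retrieval model cheaply narrows the pool of candidate dialects, and the re-ranking model scores the shortlist. The sketch below shows that two-stage composition; the function names and scorers are toy stand-ins, not the repository's API.

```python
# Sketch of two-stage ranking: a fast retrieval scorer prunes the candidate
# dialects, then a (typically more expensive) re-ranker scores the survivors.
from difflib import SequenceMatcher

def retrieval_score(query: str, dialect: str) -> float:
    # Cheap stand-in for the retrieval model: word-overlap (Jaccard) score.
    q, d = set(query.lower().split()), set(dialect.lower().split())
    return len(q & d) / max(len(q | d), 1)

def rerank_score(query: str, dialect: str) -> float:
    # Costlier stand-in for the re-ranker: character-level sequence similarity.
    return SequenceMatcher(None, query.lower(), dialect.lower()).ratio()

def translate(query: str, dialect_to_sql: dict, top_k: int = 10) -> str:
    # Stage 1: keep only the top-k dialects by the retrieval score.
    shortlist = sorted(dialect_to_sql, key=lambda d: retrieval_score(query, d),
                       reverse=True)[:top_k]
    # Stage 2: re-rank the shortlist and return the SQL of the best dialect.
    best = max(shortlist, key=lambda d: rerank_score(query, d))
    return dialect_to_sql[best]
```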
The default configuration is stored in configs/cofig.py. You can use this configuration to reproduce our results.
We provide ranking model checkpoints for the different datasets below. Since the generalization process is time-consuming, we also provide the generalized SQL queries needed to reproduce our experimental results.
Model | Dataset | Download Queries | Download Model |
---|---|---|---|
gar.geo | GEO | gar.geo.zip | gar.geo.tar.gz |
gar-j.geo | GEO | gar-j.geo.zip | gar-j.geo.tar.gz |
gar.spider | Spider | gar.spider.zip | gar.spider.tar.gz |
gar-j.spider | Spider | gar-j.spider.zip | gar-j.spider.tar.gz |
Unpack the model checkpoints and the corresponding generalized queries into the following directory structure:
/saved_models
├── gar.spider
│   └── ...
├── gar-j.spider
│   └── ...
├── gar.geo
│   └── ...
└── gar-j.geo
    └── ...
/serialization
├── gar.spider
├── gar-j.spider
├── gar.geo
└── gar-j.geo
The evaluation script test_pipeline.sh is located in the root directory. You can run it with:
$ bash test_pipeline.sh <dataset_name> <test_file_path> <test_gold_sql_file> <table_path> <db_dir>
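For example, evaluating on the Spider dev set could look as follows (paths are illustrative, matching the dataset layout above):

$ bash test_pipeline.sh spider datasets/spider/dev.json datasets/spider/dev_gold.sql datasets/spider/tables.json datasets/spider/database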
The evaluation script will create the directory output in the current directory, where the evaluation results will be stored.
GAR comes with a demonstration as well! See our demonstration paper GenSQL: A Generative Natural Language Interface to Database Systems, also published at the 2023 IEEE 39th International Conference on Data Engineering (ICDE).
If you want to try it on your own data, please contact us for permission to access the Admin Interface. 😊
Please cite it if you use GenSQL in your work:
@inproceedings{Yuankai2023:GenSQL,
author = {Yuankai Fan and Tonghui Ren and Zhenying He and X. Sean Wang and Ye Zhang and Xingang Li},
title = {{GenSQL}: A Generative Natural Language Interface to Database Systems},
booktitle = {39th {IEEE} International Conference on Data Engineering, {ICDE} 2023, Anaheim, CA, USA, April 3-7, 2023},
pages = {3603--3606},
publisher = {{IEEE}},
year = {2023},
url = {https://doi.org/10.1109/ICDE55515.2023.00278},
doi = {10.1109/ICDE55515.2023.00278},
timestamp = {Thu, 27 Jul 2023 17:17:25 +0200},
biburl = {https://dblp.org/rec/conf/icde/FanRHWZL23.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
This project welcomes contributions and suggestions 👍.
If you find bugs in our code, encounter problems when running the code, or have suggestions for GAR, please submit an issue or reach out to me (fanyuankai@fudan.edu.cn)!