ServiceNow completed its acquisition of Element AI on January 8, 2021. All references to Element AI in the materials that are part of this project should refer to ServiceNow.
This is the official implementation of the following paper:
Torsten Scholak, Nathan Schucher, Dzmitry Bahdanau. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).
If you use this code, please cite:
```bibtex
@inproceedings{Scholak2021:PICARD,
  author = {Torsten Scholak and Nathan Schucher and Dzmitry Bahdanau},
  title = "{PICARD}: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models",
  booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2021",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.emnlp-main.779",
  pages = "9895--9901",
}
```
This code implements:
- The PICARD algorithm for constrained decoding from language models.
- A text-to-SQL semantic parser based on pre-trained sequence-to-sequence models and PICARD, achieving state-of-the-art performance on both the Spider and CoSQL datasets.
TL;DR: We introduce PICARD -- a new method for simple and effective constrained decoding from large pre-trained language models. On the challenging Spider and CoSQL text-to-SQL datasets, PICARD significantly improves the performance of fine-tuned but otherwise unmodified T5 models. Using PICARD, our T5-3B models achieved state-of-the-art performance on both Spider and CoSQL.
In text-to-SQL translation, the goal is to translate a natural language question into a SQL query, e.g. "How many singers do we have?" into `SELECT count(*) FROM singer`. There are two main challenges to this task:
- The generated SQL needs to be semantically correct, that is, correctly reflect the meaning of the question.
- The SQL also needs to be valid, that is, it must not result in an execution error.
So far, there has been a trade-off between these two goals: the second problem can be solved by using a special decoder architecture that, by construction, always produces valid SQL. This is the approach taken by most prior work. Such decoders are called "constrained decoders", and they need to be trained from scratch on the text-to-SQL dataset. However, this limits the generality of the decoders, which is a problem for the first goal.
A better approach would be to use a pre-trained encoder-decoder model and to constrain its decoder to produce valid SQL after fine-tuning the model on the text-to-SQL task. This is the approach taken by the PICARD algorithm.
- It’s an incremental parsing algorithm that integrates with ordinary beam search.
- It doesn’t require any training.
- It doesn’t require modifying the model.
- It works with any model that generates a sequence of tokens (including language models).
- It doesn’t require a special vocabulary.
- It works with character-, sub-word-, and word-level language models.
The following picture shows how PICARD is integrated with beam search. Decoding starts from the left and proceeds to the right. The algorithm begins with a single token (usually `<s>`) and then keeps expanding the beam with hypotheses generated token-by-token by the decoder. At each decoding step and for each hypothesis, PICARD checks whether the next top-`k` tokens are valid. In the image above, only 3 token predictions are shown, and `k` is set to 2. Valid tokens (☑) are added to the beam, and invalid ones (☒) are discarded. The `k+1`-th, `k+2`-th, ... tokens are discarded, too. Like in normal beam search, the beam is then pruned to contain only the top-`n` hypotheses, where `n` is the beam size; in the image above it is set to 2 as well. Hypotheses that are terminated with the end-of-sentence token (usually `</s>`) are not expanded further. The algorithm stops when all hypotheses are terminated or when the maximum number of tokens has been reached.
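The following sketch shows how `k`, `n`, and the validity check fit together in a single decoding step. It is a hypothetical illustration, not the repository's code: `Hyp`, `stepBeam`, `valid`, and `predict` are made-up names, and the parsing-based check behind `valid` is described next.

```haskell
import Data.List (sortOn)

-- Hypothetical hypothesis type: the tokens decoded so far plus a score.
data Hyp = Hyp { tokens :: [String], score :: Double }

-- One PICARD-constrained beam-search step (illustrative sketch):
-- check the top-k next-token predictions of every hypothesis, keep only
-- the extensions the checker accepts, and prune the beam to the top-n.
stepBeam
  :: Int                          -- k: token predictions checked per hypothesis
  -> Int                          -- n: beam size
  -> ([String] -> String -> Bool) -- validity check (parsing-based in PICARD)
  -> (Hyp -> [(String, Double)])  -- model's scored predictions, best first
  -> [Hyp]                        -- current beam
  -> [Hyp]                        -- expanded and pruned beam
stepBeam k n valid predict beam =
  take n . sortOn (negate . score) $
    [ Hyp (tokens hyp ++ [tok]) (score hyp + logProb)
    | hyp <- beam
    , (tok, logProb) <- take k (predict hyp) -- k+1-th, k+2-th, ... discarded
    , valid (tokens hyp) tok                 -- invalid tokens (☒) rejected
    ]
```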
In PICARD, checking, accepting, and rejecting of tokens and token sequences is achieved through parsing. Parsing means that we attempt to assemble a data structure from the tokens that are currently in the beam or are about to be added to it. This data structure (and the parsing rules that are used to build it) encode the constraints we want to enforce.
In the case of SQL, the data structure we parse to is the abstract syntax tree (AST) of the SQL query. The parsing rules are defined in a computer program called a parser. Database engines, such as PostgreSQL, MySQL, and SQLite, have their own built-in parsers that they use internally to process SQL queries. For Spider and CoSQL, we have implemented a parser that supports a subset of the SQLite syntax and that checks additional constraints on the AST. In our implementation, the parsing rules are built up from simpler rules and primitives that are provided by a third-party parsing library.
PICARD uses a parsing library called attoparsec that supports incremental input. This is a special capability that is not available in many other parsing libraries. You can feed attoparsec a string that represents only part of the expected input to parse. When parsing reaches the end of an input fragment, attoparsec will return a continuation function that can be used to continue parsing. Think of the continuation function as a suspended computation that can be resumed later. Input fragments can be parsed one after the other when they become available until the input is complete.
Herein lies the key to PICARD: Incremental parsing of input fragments is exactly what we need to check tokens one by one during decoding.
In PICARD, parsing is initialized with an empty string, and attoparsec will return the first continuation function. We then call that continuation function with all the token predictions we want to check in the first decoding step. For those tokens that are valid, the continuation function will return a new continuation function that we can use to continue parsing in the next decoding step. For those tokens that are invalid, the continuation function will return a failure value which cannot be used to continue parsing. Such failures are discarded and never end up in the beam. We repeat the process until the end of the input is reached. The input is complete once the model predicts the end-of-sentence token. When that happens, we finalize the parsing by calling the continuation function with an empty string. If the parsing is successful, it will return the final AST. If not, it will return a failure value.
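To make this concrete, here is a minimal, self-contained sketch of the accept/reject loop against attoparsec's incremental interface. `toyParser` is a hypothetical stand-in for the real grammar in the Language.SQL.SpiderSQL.Parse module, and `checkToken` and `finalize` are illustrative names, not functions from the repository.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text (IResult (..), Parser, Result, asciiCI, feed, parse)
import qualified Data.Text as T

-- Toy stand-in for the real SQL grammar: the output must start with SELECT.
toyParser :: Parser ()
toyParser = asciiCI "SELECT" *> pure ()

-- Parsing is initialized with the empty string; attoparsec answers with a
-- Partial continuation because more input may still arrive.
start :: Result ()
start = parse toyParser ""

-- Feed one predicted token to a suspended parse. Fail means the token can
-- never lead to valid output, so the hypothesis is discarded; anything else
-- is a resumable state that goes back into the beam.
checkToken :: Result () -> T.Text -> Maybe (Result ())
checkToken r tok = case feed r tok of
  Fail {} -> Nothing
  r'      -> Just r'

-- When the model emits the end-of-sentence token, feeding the empty string
-- signals end of input; only Done means the output parses completely.
finalize :: Result () -> Bool
finalize r = case feed r "" of
  Done _ _ -> True
  _        -> False

main :: IO ()
main = do
  print (checkToken start "SEL")  -- Just (Partial _): still viable
  print (checkToken start "DROP") -- Nothing: rejected, never enters the beam
```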
The parsing rules are described at a high level in the PICARD paper. For details, see the PICARD code, specifically the Language.SQL.SpiderSQL.Parse module.
Let's look at the numbers:
On Spider
| URL | Based on | Exact-set Match Accuracy (Dev) | Exact-set Match Accuracy (Test) | Execution Accuracy (Dev) | Execution Accuracy (Test) |
|---|---|---|---|---|---|
| tscholak/cxmefzzi w PICARD | T5-3B | 75.5 % | 71.9 % | 79.3 % | 75.1 % |
| tscholak/cxmefzzi w/o PICARD | T5-3B | 71.5 % | 68.0 % | 74.4 % | 70.1 % |
| tscholak/3vnuv1vf w PICARD | t5.1.1.lm100k.large | 74.8 % | — | 79.2 % | — |
| tscholak/3vnuv1vf w/o PICARD | t5.1.1.lm100k.large | 71.2 % | — | 74.4 % | — |
| tscholak/1wnr382e w PICARD | T5-Large | 69.1 % | — | 72.9 % | — |
| tscholak/1wnr382e w/o PICARD | T5-Large | 65.3 % | — | 67.2 % | — |
| tscholak/1zha5ono w PICARD | t5.1.1.lm100k.base | 66.6 % | — | 68.4 % | — |
| tscholak/1zha5ono w/o PICARD | t5.1.1.lm100k.base | 59.4 % | — | 60.0 % | — |
Click on the links to download the models. tscholak/cxmefzzi and tscholak/1wnr382e are the versions of the model that we used in our experiments for the paper, reported as T5-3B and T5-Large, respectively. tscholak/cxmefzzi, tscholak/3vnuv1vf, and tscholak/1zha5ono were trained to use database content, whereas tscholak/1wnr382e was not.
Note that, without PICARD, 12% of the SQL queries generated by tscholak/cxmefzzi on Spider’s development set resulted in an execution error. With PICARD, this number decreased to 2%.
On CoSQL Dialogue State Tracking
| URL | Based on | Question Match Accuracy (Dev) | Question Match Accuracy (Test) | Interaction Match Accuracy (Dev) | Interaction Match Accuracy (Test) |
|---|---|---|---|---|---|
| tscholak/2e826ioa w PICARD | T5-3B | 56.9 % | 54.6 % | 24.2 % | 23.7 % |
| tscholak/2e826ioa w/o PICARD | T5-3B | 53.8 % | 51.4 % | 21.8 % | 21.7 % |
| tscholak/2jrayxos w PICARD | t5.1.1.lm100k.large | 54.2 % | — | — | — |
| tscholak/2jrayxos w/o PICARD | t5.1.1.lm100k.large | 52.5 % | — | — | — |
Click on the links to download the models. tscholak/2e826ioa is the version of the model that we used in our experiments for the paper, reported as T5-3B.
- Install Docker Desktop
- Install WSL2
- Set WSL2 as Docker Desktop's base engine
- This repository uses git submodules. Clone it like this into the Linux FS:
  ```
  $ git clone https://github.com/SethCram/Linguists-NLP-to-SQL.git
  $ cd Linguists-NLP-to-SQL
  $ git submodule update --init --recursive
  ```
- Optionally open the repo in VS Code via `code .`
- Run `make serve` to get the API up and running
- Optionally navigate to the API UI at http://localhost:8000/docs#/default/ask_ask__db_id___question__get
- Optionally head over to the frontend at https://github.com/SethCram/linguists-client#developer-setup and follow the setup steps
- Commit new changes to the GitHub repository
- Open up the Makefile and navigate to the top
- Change the image's id to the most recent git commit id, so the next image built doesn't overwrite or conflict with a pre-existing one and the code used to build the new image comes from the most recent git commit:
  ```
  GIT_HEAD_REF := $(shell git rev-parse HEAD)
  ```
- Make sure you're logged in on Docker Desktop as the owner or a collaborator of the Docker Hub repository
- Build a new image for evaluation (can take up to 2 hours):
  ```
  make build-eval-image
  ```
- Once again, change the Makefile's image id to ensure we're pulling down the correct image to test:
  ```
  GIT_HEAD_REF := [hardcodedGitCommitIdUsedToBuildTheImage]
  ```
- Pull down the new image and test it out:
  ```
  make serve
  ```
- If the image is stable, feel free to commit the new Makefile with `GIT_HEAD_REF := [hardcodedGitCommitIdUsedToBuildTheImage]` to the repository
- If issues are encountered, feel free to submit a GitHub issue or refer to the pre-existing ones at https://github.com/ServiceNow/picard/issues
The training script is located in `seq2seq/run_seq2seq.py`. You can run it with:
```
$ make train
```
The model will be trained on the Spider dataset by default. You can also train on CoSQL by running `make train-cosql`.

The training script will create the directory `train` in the current directory. Training artifacts like checkpoints will be stored in this directory.

The default configuration is stored in `configs/train.json`. The settings are optimized for a GPU with 40GB of memory.
These training settings should result in a model with at least 71% exact-set-match accuracy on the Spider development set. With PICARD, the accuracy should go up to at least 75%.
We have uploaded a model trained on the Spider dataset to the Hugging Face model hub: tscholak/cxmefzzi. A model trained on the CoSQL dialog state tracking dataset is available, too: tscholak/2e826ioa.
The evaluation script is located in `seq2seq/run_seq2seq.py`. You can run it with:
```
$ make eval
```
By default, the evaluation will be run on the Spider evaluation set. Evaluation on the CoSQL evaluation set can be run with `make eval-cosql`.

The evaluation script will create the directory `eval` in the current directory. The evaluation results will be stored there.

The default configuration is stored in `configs/eval.json`.
A trained model can be served using the `seq2seq/serve_seq2seq.py` script. The configuration file can be found in `configs/serve.json`. You can start serving with:
```
$ make serve
```
By default, the 800-million-parameter tscholak/3vnuv1vf model will be loaded. You can also load a different model by specifying the model name in the configuration file. The device to use can be specified as well. The default is to use the first available GPU. CPU can be used by specifying `-1`.

When the script is called, it uses the folder specified by the `db_path` option to look for SQL database files. The default folder is `database`, which will be created in the current directory. Initially, this folder will be empty, and you can add your own SQL files to it. The structure of the folder should be like this:
```
database/
  my_1st_database/
    my_1st_database.sqlite
  my_2nd_database/
    my_2nd_database.sqlite
```
where `my_1st_database` and `my_2nd_database` are the `db_id`s of the databases.
Once the server is up and running, use the Swagger UI to test inference with the `/ask` endpoint (a GET request of the form `/ask/{db_id}/{question}`). The server will be listening at http://localhost:8000/, and the Swagger UI will be available at http://localhost:8000/docs#/default/ask_ask__db_id___question__get.
There are three docker images that can be used to run the code:

- tscholak/text-to-sql-dev: Base image with development dependencies. Use this for development. Pull it with `make pull-dev-image` from the docker hub. Rebuild the image with `make build-dev-image`.
- tscholak/text-to-sql-train: Training image with development dependencies but without Picard dependencies. Use this for fine-tuning a model. Pull it with `make pull-train-image` from the docker hub. Rebuild the image with `make build-train-image`.
- tscholak/text-to-sql-eval: Training/evaluation image with all dependencies. Use this for evaluating a fine-tuned model with Picard. This image can also be used for training if you want to run evaluation during training with Picard. Pull it with `make pull-eval-image` from the docker hub. Rebuild the image with `make build-eval-image`.

All images are tagged with the current commit hash. The images are built with the buildx tool, which is available in the latest docker-ce. Use `make init-buildkit` to initialize the buildx tool on your machine. You can then use `make build-dev-image`, `make build-train-image`, etc. to rebuild the images. Local changes to the code will not be reflected in the docker images unless they are committed to git.
Make sure the chosen server has at least 50 GB of storage and 10 GB of RAM: the docker image is around 25 GB, the Hugging Face model is 3 GB, large database files are stored server-side, and the docker container likely requires at least 10 GB of RAM to build and run. The recommended server setup is Rocky Linux 8 connected to the internet via Ethernet or Wi-Fi.
- Make sure the package manager is up to date
  - on Ubuntu distributions (Ubuntu, Debian, Kali, etc.)
    ```
    sudo apt-get -y update
    ```
  - on RHEL distributions (RHEL, Rocky, Fedora, CentOS, etc.)
    ```
    sudo yum -y update
    ```
- Remove old docker versions and download Docker, Make, nginx, and git
  - on Ubuntu distributions
    ```
    sudo apt-get remove docker docker-engine docker.io containerd runc
    sudo apt install docker.io make nginx git
    ```
  - on RHEL distributions
    ```
    sudo yum remove docker \
                    docker-client \
                    docker-client-latest \
                    docker-common \
                    docker-latest \
                    docker-latest-logrotate \
                    docker-logrotate \
                    docker-engine \
                    podman \
                    runc
    sudo yum install -y yum-utils
    sudo yum-config-manager \
        --add-repo \
        https://download.docker.com/linux/centos/docker-ce.repo
    sudo yum install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin make nginx git
    ```
- Clone the repository and make sure it's up to date
  ```
  git clone https://github.com/SethCram/Linguists-NLP-to-SQL.git
  cd Linguists-NLP-to-SQL
  git submodule update --init --recursive
  ```
- Allow http traffic through port 80
  - on firewalld
    ```
    sudo sh -c "firewall-cmd --permanent --zone=public --add-service=http && firewall-cmd --reload"
    ```
  - on iptables
    ```
    sudo sh -c "iptables -I INPUT -p tcp -m tcp --dport 80 -j ACCEPT && service iptables save"
    ```
- Make sure docker and nginx automatically run on server restart
  ```
  sudo systemctl enable docker.service
  sudo systemctl enable nginx
  ```
- Start docker and install the docker image
  ```
  sudo systemctl start docker
  sudo make pull-eval-image
  ```
- If docker can't be accessed, try granting the current user docker access (may require user relogin)
    ```
    sudo usermod -a -G docker $USER
    ```
- If prompted for image selection, select the 3rd docker image (the one starting with "docker.io")
- Redirect the server traffic to the api
  - open the nginx config:
    ```
    sudo vi /etc/nginx/nginx.conf
    ```
  - add this inside the "server" block:
    ```
    location /api {
        proxy_pass http://localhost:8000/;
        proxy_http_version 1.1;
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_cache_bypass $http_upgrade;
    }
    ```
  - add this within the "http" block but outside the "server" block:
    ```
    map $http_upgrade $connection_upgrade {
        default upgrade;
        '' close;
    }
    upstream uvicorn {
        server unix:/tmp/uvicorn.sock;
    }
    ```
  - verify the syntax of the config file is okay and restart nginx using it:
    ```
    sudo nginx -t
    sudo service nginx restart
    ```
- if SELinux is being used, tell it to allow httpd traffic:
    ```
    sudo setsebool -P httpd_can_network_connect 1
    ```
- Launch the API interactively on port 8000
  ```
  sudo make serve
  ```
- Cancel the process with Ctrl-C once you have verified that it launched okay
- Relaunch the API in the background, indefinitely, and on server restart
  ```
  sudo make serve_prod
  ```
- Verify the API is running
  - Navigate to http://[publicIPAddress]/api/getDatabases/ in a browser or run
    ```
    curl http://[publicIPAddress]/api/getDatabases/
    ```
    and see either a list of uploaded databases or an error 500 with the message "There was an error when attempting to list all the database folders." Either way, the backend is running properly.
- Optionally, verify the API's fastAPI/Swagger/UI page is live
  - Navigate to http://[publicIPAddress]:8000/docs#/ in a browser or run
    ```
    curl http://[publicIPAddress]:8000/docs#/
    ```
  - If this fails, try installing Python, pip, and 'uvicorn[standard]', then run
    ```
    docker ps
    docker stop [containerId]
    docker rm [containerId]
    sudo make serve_prod
    ```
    and attempt to visit the API's UI again
- Head over to the frontend deployment instructions