- Download the `train.json` file, `train_tables.json` file, and `train_databases` file of the BIRD train set into the `data/bird/train` folder. Download address: https://bird-bench.github.io/
- Download the `dev.json` file, `dev_tables.json` file, and `dev_databases` file of the BIRD development set into the `data/bird/dev` folder. Download address: https://bird-bench.github.io/
- Download the corresponding model parameters (`Llama-3.1-8B-Instruct`, `Qwen2.5-Coder-7B-Instruct`, `Qwen2.5-7B-Instruct`, `deepseek-coder-6.7b-instruct`, `Mistral-7B-Instruct-v0.3`) from Hugging Face into the LLaMA-Factory directory. For example, the Llama download address is: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
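Once the downloads are in place, a quick sanity check can confirm the layout. This is a minimal sketch: the expected paths are taken from the steps above, and `train_databases`/`dev_databases` are assumed to be unzipped into the same folders.

```python
from pathlib import Path

# Expected BIRD layout after the downloads above; database archives are assumed unzipped.
EXPECTED = {
    "data/bird/train": ["train.json", "train_tables.json", "train_databases"],
    "data/bird/dev": ["dev.json", "dev_tables.json", "dev_databases"],
}

def missing_entries(root: str = ".") -> list:
    """Return the expected files/folders that are not present under `root`."""
    missing = []
    for folder, names in EXPECTED.items():
        for name in names:
            path = Path(root) / folder / name
            if not path.exists():
                missing.append(str(path))
    return missing

if __name__ == "__main__":
    gaps = missing_entries()
    print("All BIRD files found." if not gaps else f"Missing: {gaps}")
```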
```shell
conda create -n genlink python=3.9
conda activate genlink
pip install -r requirements.txt
```

```shell
# Step 1: Construct the schema format using the CodeS method and obtain the 'sft_bird_with_evidence_train_text2sql.json' file
# Step 2: Place 'sft_bird_with_evidence_train_text2sql.json' in the data/bird/train directory
```

Refer to the official CodeS script, the link is: https://github.com/RUCKBReasoning/codes
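Before moving on to training, it can be useful to peek at the constructed file. The snippet below is a minimal sketch that assumes the CodeS-style output is a JSON list of example dicts; check the CodeS repository for the exact field names.

```python
import json
from pathlib import Path

def summarize_sft_file(path: str) -> dict:
    """Load a CodeS-style SFT file and report its size and the keys of the first example."""
    examples = json.loads(Path(path).read_text(encoding="utf-8"))
    assert isinstance(examples, list) and examples, "expected a non-empty JSON list"
    return {"num_examples": len(examples), "first_example_keys": sorted(examples[0])}

if __name__ == "__main__":
    target = Path("data/bird/train/sft_bird_with_evidence_train_text2sql.json")
    if target.exists():
        print(summarize_sft_file(str(target)))
```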
```shell
# Step 1: Construct the BIRD training data `t2s_genlink_bird_train_9428.json`
# Run the function construct_bird_t2s_train_full_schema in text2sql.py
python core/text2sql.py
# Step 2: Use the train.sh command to train each model separately in LLaMA-Factory
# When computing resources are limited, the models can be trained one at a time, which requires at least 24GB of GPU memory
sh core/train.sh
```

For the training techniques and details of LLaMA-Factory, please refer to the official script: https://github.com/hiyouga/LLaMA-Factory
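Note that LLaMA-Factory only trains on datasets registered in its `data/dataset_info.json`. The entry below is a hedged sketch of such a registration: the dataset name `t2s_genlink_bird_train` and the column mapping are assumptions and must match the actual fields inside `t2s_genlink_bird_train_9428.json` and the dataset name used in `train.sh`.

```json
{
  "t2s_genlink_bird_train": {
    "file_name": "t2s_genlink_bird_train_9428.json",
    "columns": {
      "prompt": "instruction",
      "query": "input",
      "response": "output"
    }
  }
}
```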
```shell
# Step 1: Construct the BIRD dev set for the GDSL module
# Run the function construct_bird_t2s_infer_full_schema in text2sql.py to get the inference file "t2s_genlink_bird_dev_1534.json"
python core/text2sql.py
# Step 2: Use the inference_GDSL.sh command to run inference with each model separately in LLaMA-Factory
sh core/inference_GDSL.sh
# Step 3: Place all generated_predictions.json files in the output/bird/GDSL directory and rename them to (model)_generated_predictions.jsonl, for example llama8B_generated_predictions.jsonl
# Integrate the inference results of all models and extract the corresponding schema from the SQL
python core/merge.py
python core/extract_schema.py
```

```shell
# Step 1: Construct the BIRD dev set for the MMSG module
# Run the function construct_bird_t2s_infer_simplified_schema in text2sql.py to get the inference file "t2s_genlink_bird_dev_1534_simplified_schema.json"
python core/text2sql.py
# Step 2: Use the inference_MMSG.sh command to run inference with each model separately in LLaMA-Factory
sh core/inference_MMSG.sh
# Step 3: Place all generated_predictions.json files in the output/bird/MMSG directory
# Integrate the inference results of all models
python core/merge.py
# Step 4: Use the self-consistency method to select the optimal SQL from the candidates of multiple models as the final result
python core/select_candidate.py
```

Refer to the official evaluation script, the link is: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird
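The general idea behind the self-consistency step can be sketched as follows: each model contributes one SQL candidate per question, candidates are crudely normalized so superficially different strings can agree, and the most frequent normalized form wins. Function names and the normalization rules here are illustrative assumptions, not the actual API of `select_candidate.py`.

```python
from collections import Counter
import re

def normalize_sql(sql: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing semicolons."""
    return re.sub(r"\s+", " ", sql.strip().lower()).rstrip("; ")

def select_by_self_consistency(candidates: list) -> str:
    """Pick the SQL whose normalized form occurs most often across models."""
    counts = Counter(normalize_sql(c) for c in candidates)
    best_norm, _ = counts.most_common(1)[0]
    # Return the first original candidate matching the winning normalized form.
    return next(c for c in candidates if normalize_sql(c) == best_norm)

example = [
    "SELECT name FROM singer WHERE age > 30",
    "select name  from singer where age > 30;",
    "SELECT name FROM singer ORDER BY age",
]
print(select_by_self_consistency(example))  # the first candidate wins 2-1
```

A real implementation would typically break ties with model-level weights or execution results rather than first occurrence.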

