GenLink: Generation-Driven Schema-Linking via Multi-Model Learning for Text-to-SQL

Overview

Main Results

Project directory structure

  • Download the train.json, train_tables.json, and train_databases files of the BIRD train set into the data/bird/train folder. Download address: https://bird-bench.github.io/

  • Download the dev.json, dev_tables.json, and dev_databases files of the BIRD development set into the data/bird/dev folder. Download address: https://bird-bench.github.io/

  • Download the corresponding model parameters (Llama-3.1-8B-Instruct, Qwen2.5-Coder-7B-Instruct, Qwen2.5-7B-Instruct, deepseek-coder-6.7b-instruct, Mistral-7B-Instruct-v0.3) from Hugging Face into the LLaMA-Factory directory. For example, the Llama download address is https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. A quick sanity check for the data paths above is sketched after this list.
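
The following minimal sketch verifies that the BIRD files landed where the later scripts expect them; the paths simply mirror the bullets above, and check_data.py is a hypothetical helper, not part of the repository.

# check_data.py (hypothetical) -- sanity-check the BIRD data layout described above
import os

expected = [
    "data/bird/train/train.json",
    "data/bird/train/train_tables.json",
    "data/bird/train/train_databases",
    "data/bird/dev/dev.json",
    "data/bird/dev/dev_tables.json",
    "data/bird/dev/dev_databases",
]
for path in expected:
    print(path, "OK" if os.path.exists(path) else "MISSING")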

Environment

conda create -n genlink python=3.9
conda activate genlink
pip install -r requirements.txt

RUN

1. Data Preprocessing

# Step 1: Construct the schema format using the CodeS method to obtain the 'sft_bird_with_evidence_train_text2sql.json' file
# Step 2: Place 'sft_bird_with_evidence_train_text2sql.json' in the data/bird/train directory

Refer to the official CodeS script: https://github.com/RUCKBReasoning/codes
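
The exact record layout of the preprocessed file is defined by the CodeS script; the snippet below is only a hedged way to peek at it, assuming the file is a JSON array. inspect_preprocessed.py is a hypothetical name, not part of the repository.

# inspect_preprocessed.py (hypothetical) -- peek at the CodeS preprocessing output
import json

with open("data/bird/train/sft_bird_with_evidence_train_text2sql.json") as f:
    samples = json.load(f)  # assumed: a JSON array of training records

print(len(samples), "records")
print(json.dumps(samples[0], indent=2)[:500])  # first record, truncated for display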

2. Model training on LLaMA-Factory

# Step 1: Construct the BIRD training data `t2s_genlink_bird_train_9428.json`
# Run the function construct_bird_t2s_train_full_schema in text2sql.py
python core/text2sql.py 

# Step 2: Use the train.sh command to train each model separately in LLaMA-Factory
# When computing resources are limited, the models can be trained one at a time; each run needs at least 24 GB of GPU memory
sh core/train.sh

For the training techniques and details of LLaMA-Factory, please refer to the official repository: https://github.com/hiyouga/LLaMA-Factory
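
LLaMA-Factory resolves custom datasets by name through its data/dataset_info.json registry, so the file built in Step 1 has to be registered before train.sh can reference it. A minimal sketch of that registration; the dataset key and column mapping below are assumptions, so match them to whatever name train.sh actually uses.

# register_dataset.py (hypothetical) -- add the Step 1 file to LLaMA-Factory's dataset registry
import json

registry = "LLaMA-Factory/data/dataset_info.json"
with open(registry) as f:
    info = json.load(f)

# key and column mapping are illustrative assumptions
info["t2s_genlink_bird_train_9428"] = {
    "file_name": "t2s_genlink_bird_train_9428.json",
    "columns": {"prompt": "instruction", "query": "input", "response": "output"},
}

with open(registry, "w") as f:
    json.dump(info, f, indent=2)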

3. Generation-Driven Schema Linking (GDSL)

# Step 1: Construct the BIRD dev set for the GDSL module
# Run the function construct_bird_t2s_infer_full_schema in text2sql.py to obtain the inference file "t2s_genlink_bird_dev_1534.json"
python core/text2sql.py 

# Step 2: Use the inference_GDSL.sh command to run inference with each model in LLaMA-Factory
sh core/inference_GDSL.sh

# Step 3: Place all generated_predictions.jsonl files in the output/bird/GDSL directory and rename them to (model)_generated_predictions.jsonl, e.g. llama8B_generated_predictions.jsonl
# Integrate the inference results of all models and extract the referenced schema from the generated SQL (the idea is sketched after this block)
python core/merge.py
python core/extract_schema.py
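
At the heart of GDSL is recovering a focused schema from the SQL each model generates. A minimal sketch of that idea using sqlglot; this illustrates the technique only and is not the repository's extract_schema.py.

# schema_from_sql.py (hypothetical) -- pull the referenced tables/columns out of a query
from sqlglot import exp, parse_one

def referenced_schema(sql: str):
    tree = parse_one(sql, read="sqlite")
    tables = sorted({t.name for t in tree.find_all(exp.Table)})
    columns = sorted({c.name for c in tree.find_all(exp.Column)})
    return tables, columns

tables, columns = referenced_schema(
    "SELECT s.name FROM schools AS s JOIN scores AS sc ON s.id = sc.school_id WHERE sc.math > 600"
)
print(tables)   # ['schools', 'scores']
print(columns)  # ['id', 'math', 'name', 'school_id']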

4. Multi-Model SQL Generation (MMSG)

# Step 1: Construct the BIRD dev set for the MMSG module
# Run the function construct_bird_t2s_infer_simplified_schema in text2sql.py to obtain the inference file "t2s_genlink_bird_dev_1534_simplified_schema.json"
python core/text2sql.py 

# Step 2: Use the inference_MMSG.sh command to run inference with each model in LLaMA-Factory
sh core/inference_MMSG.sh

# Step 3: Place all generated_predictions.jsonl files in the output/bird/MMSG directory
# Integrate the inference results of all models
python core/merge.py

# Step 4: Use the self-consistency method to select the best SQL across models as the final result (the idea is sketched after this block)
python core/select_candidate.py
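
For reference, the self-consistency idea behind select_candidate.py: execute every candidate SQL against the target database, group candidates by their execution result, and keep one SQL from the largest group. A minimal sketch of the technique, not the repository's implementation.

# self_consistency.py (hypothetical) -- majority vote over execution results
import sqlite3
from collections import Counter

def select_by_self_consistency(candidates, db_path):
    results = {}
    for sql in candidates:
        con = sqlite3.connect(db_path)
        try:
            rows = frozenset(con.execute(sql).fetchall())
        except Exception:
            continue  # candidates that fail to execute get no vote
        finally:
            con.close()
        results[sql] = rows
    if not results:
        return candidates[0]  # fall back when nothing executes
    majority, _ = Counter(results.values()).most_common(1)[0]
    return next(sql for sql, rows in results.items() if rows == majority)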

Evaluation

Execution (EX) Evaluation:

Refer to the official BIRD evaluation script: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird
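
EX counts a prediction as correct iff executing it returns the same result as the gold SQL on the corresponding database. A minimal sketch of the criterion (the official script additionally handles per-query timeouts and parallel execution):

# ex_match.py (hypothetical) -- the execution-accuracy comparison in miniature
import sqlite3

def execution_match(pred_sql, gold_sql, db_path):
    con = sqlite3.connect(db_path)
    try:
        pred = set(con.execute(pred_sql).fetchall())
        gold = set(con.execute(gold_sql).fetchall())
    except Exception:
        return False  # a prediction that fails to run counts as wrong
    finally:
        con.close()
    return pred == gold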
