GenLink: Generation-Driven Schema-Linking via Multi-Model Learning for Text-to-SQL

Overview

Main Results

Project directory structure

  • Download the train.json, train_tables.json, and train_databases files of the BIRD train set into the data/bird/train folder. Download address: https://bird-bench.github.io/

  • Download the dev.json, dev_tables.json, and dev_databases files of the BIRD development set into the data/bird/dev folder. Download address: https://bird-bench.github.io/

  • Download the corresponding model parameters (Llama-3.1-8B-Instruct, Qwen2.5-Coder-7B-Instruct, Qwen2.5-7B-Instruct, deepseek-coder-6.7b-instruct, Mistral-7B-Instruct-v0.3) from Hugging Face into the LLaMA-Factory directory. For example, the Llama download address is https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct. A quick sanity check for the data paths above is sketched after this list.
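
The following minimal sketch verifies that the BIRD files landed where the later scripts expect them; the paths simply mirror the bullets above, and check_data.py is a hypothetical helper, not part of the repository.

# check_data.py (hypothetical) -- sanity-check the BIRD data layout described above
import os

expected = [
    "data/bird/train/train.json",
    "data/bird/train/train_tables.json",
    "data/bird/train/train_databases",
    "data/bird/dev/dev.json",
    "data/bird/dev/dev_tables.json",
    "data/bird/dev/dev_databases",
]
for path in expected:
    print(path, "OK" if os.path.exists(path) else "MISSING")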

Environment

conda create -n genlink python=3.9
conda activate genlink
pip install -r requirements.txt

RUN

1. Data Preprocessing

# Step 1: Construct the schema format using the CodeS method to obtain the 'sft_bird_with_evidence_train_text2sql.json' file
# Step 2: Place 'sft_bird_with_evidence_train_text2sql.json' in the data/bird/train directory

Refer to the official CodeS script: https://github.com/RUCKBReasoning/codes
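
The exact record layout of the preprocessed file is defined by the CodeS script; the snippet below is only a hedged way to peek at it, assuming the file is a JSON array. inspect_preprocessed.py is a hypothetical name, not part of the repository.

# inspect_preprocessed.py (hypothetical) -- peek at the CodeS preprocessing output
import json

with open("data/bird/train/sft_bird_with_evidence_train_text2sql.json") as f:
    samples = json.load(f)  # assumed: a JSON array of training records

print(len(samples), "records")
print(json.dumps(samples[0], indent=2)[:500])  # first record, truncated for display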

2. Model training on LLaMA-Factory

# Step 1: Construct the BIRD training data `t2s_genlink_bird_train_9428.json`
# Run the function construct_bird_t2s_train_full_schema in text2sql.py
python core/text2sql.py 

# Step 2: Use the train.sh command to train each model separately in LLaMA-Factory
# When computing resources are limited, the models can be trained one at a time; each run needs at least 24 GB of GPU memory
sh core/train.sh

For the training techniques and details of LLaMA-Factory, please refer to the official repository: https://github.com/hiyouga/LLaMA-Factory
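
LLaMA-Factory resolves custom datasets by name through its data/dataset_info.json registry, so the file built in Step 1 has to be registered before train.sh can reference it. A minimal sketch of that registration; the dataset key and column mapping below are assumptions, so match them to whatever name train.sh actually uses.

# register_dataset.py (hypothetical) -- add the Step 1 file to LLaMA-Factory's dataset registry
import json

registry = "LLaMA-Factory/data/dataset_info.json"
with open(registry) as f:
    info = json.load(f)

# key and column mapping are illustrative assumptions
info["t2s_genlink_bird_train_9428"] = {
    "file_name": "t2s_genlink_bird_train_9428.json",
    "columns": {"prompt": "instruction", "query": "input", "response": "output"},
}

with open(registry, "w") as f:
    json.dump(info, f, indent=2)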

3. Generation-Driven Schema Linking (GDSL)

# Step 1: Construct the BIRD dev set for the GDSL module
# Run the function construct_bird_t2s_infer_full_schema in text2sql.py to obtain the inference file "t2s_genlink_bird_dev_1534.json"
python core/text2sql.py 

# Step 2: Use the inference_GDSL.sh command to run inference with each model in LLaMA-Factory
sh core/inference_GDSL.sh

# Step 3: Place all generated_predictions.jsonl files in the output/bird/GDSL directory and rename them to (model)_generated_predictions.jsonl, e.g. llama8B_generated_predictions.jsonl
# Integrate the inference results of all models and extract the referenced schema from the generated SQL (the idea is sketched after this block)
python core/merge.py
python core/extract_schema.py
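
At the heart of GDSL is recovering a focused schema from the SQL each model generates. A minimal sketch of that idea using sqlglot; this illustrates the technique only and is not the repository's extract_schema.py.

# schema_from_sql.py (hypothetical) -- pull the referenced tables/columns out of a query
from sqlglot import exp, parse_one

def referenced_schema(sql: str):
    tree = parse_one(sql, read="sqlite")
    tables = sorted({t.name for t in tree.find_all(exp.Table)})
    columns = sorted({c.name for c in tree.find_all(exp.Column)})
    return tables, columns

tables, columns = referenced_schema(
    "SELECT s.name FROM schools AS s JOIN scores AS sc ON s.id = sc.school_id WHERE sc.math > 600"
)
print(tables)   # ['schools', 'scores']
print(columns)  # ['id', 'math', 'name', 'school_id']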

4. Multi-Model SQL Generation (MMSG)

# Step 1: Construct the BIRD dev set for the MMSG module
# Run the function construct_bird_t2s_infer_simplified_schema in text2sql.py to obtain the inference file "t2s_genlink_bird_dev_1534_simplified_schema.json"
python core/text2sql.py 

# Step 2: Use the inference_MMSG.sh command to run inference with each model in LLaMA-Factory
sh core/inference_MMSG.sh

# Step 3: Place all generated_predictions.jsonl files in the output/bird/MMSG directory
# Integrate the inference results of all models
python core/merge.py

# Step 4: Use the self-consistency method to select the best SQL across models as the final result (the idea is sketched after this block)
python core/select_candidate.py
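
For reference, the self-consistency idea behind select_candidate.py: execute every candidate SQL against the target database, group candidates by their execution result, and keep one SQL from the largest group. A minimal sketch of the technique, not the repository's implementation.

# self_consistency.py (hypothetical) -- majority vote over execution results
import sqlite3
from collections import Counter

def select_by_self_consistency(candidates, db_path):
    results = {}
    for sql in candidates:
        con = sqlite3.connect(db_path)
        try:
            rows = frozenset(con.execute(sql).fetchall())
        except Exception:
            continue  # candidates that fail to execute get no vote
        finally:
            con.close()
        results[sql] = rows
    if not results:
        return candidates[0]  # fall back when nothing executes
    majority, _ = Counter(results.values()).most_common(1)[0]
    return next(sql for sql, rows in results.items() if rows == majority)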

Evaluation

Execution (EX) Evaluation:

Refer to the official BIRD evaluation script: https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/bird
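
EX counts a prediction as correct iff executing it returns the same result as the gold SQL on the corresponding database. A minimal sketch of the criterion (the official script additionally handles per-query timeouts and parallel execution):

# ex_match.py (hypothetical) -- the execution-accuracy comparison in miniature
import sqlite3

def execution_match(pred_sql, gold_sql, db_path):
    con = sqlite3.connect(db_path)
    try:
        pred = set(con.execute(pred_sql).fetchall())
        gold = set(con.execute(gold_sql).fetchall())
    except Exception:
        return False  # a prediction that fails to run counts as wrong
    finally:
        con.close()
    return pred == gold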
