This repository contains the source code and experimental artifacts for the paper:
"DynaQuery: A Self-Adapting Framework for Querying Structured and Multimodal Data"
This artifact fully supports the reproduction of all main experimental results presented in the paper, covering Research Questions 1, 2, and 3.
- /dynaquery: Core source code for the DynaQuery framework.
- /experiments: Self-contained Python scripts used to generate and evaluate all experimental results.
- /setup: Files required for the initial experimental setup (e.g., database dump).
- /artifacts: Static, verifiable artifacts (e.g., execution logs, ground truth) that support the qualitative analyses in the paper.
- download_data.sh: Script to automatically download external datasets.
- requirements.txt: Python package dependencies required to run the code.
- LICENSE: License for this repository.
- /LICENSES: Licenses for all third-party code and data used in this work.
Prerequisites:
- A Unix-like environment (Linux, macOS, WSL) with bash, wget/curl, unzip, and git.
- Python 3.9+
- gdown for downloading from Google Drive (pip install gdown).
- A Kaggle account and the kaggle CLI package.
- A running MySQL server instance (required for the RQ2 case study, RQ3, and the live demo).
git clone https://github.com/aymanehassini/DynaQuery.git
cd DynaQuery
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Create a .env file by copying the provided template:

cp .env.example .env

Then, edit the .env file and add your GOOGLE_API_KEY and database credentials.
Our primary training dataset is hosted on Kaggle. The automated download script requires the Kaggle API to be configured.
- Log in to your Kaggle account.
- Go to your Account Settings page and click "Create New API Token".
This will download a kaggle.json file.
- Place this file in a .kaggle directory in your home folder:
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

Our experiments rely on a fine-tuned BERT model. For convenience and reproducibility, we provide the exact model checkpoint used in our paper.
1- Navigate to the Releases page of this GitHub repository and download the bert-checkpoint-rq2.zip file.
2- Unzip the file to a stable location on your machine.
3- Open dynaquery/config/settings.py and update the CHECKPOINT_PATH variable to point to the location of the unzipped directory.
We strongly recommend using an absolute path.
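For example, the edit is a one-line change (the path below is illustrative; use the absolute path on your own machine):

# dynaquery/config/settings.py (excerpt): point CHECKPOINT_PATH at the
# directory produced by unzipping bert-checkpoint-rq2.zip.
CHECKPOINT_PATH = "/home/youruser/models/bert-checkpoint-rq2"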
This script will download all large external datasets (Spider, BIRD, and our Annotated Rationale Dataset) into a new external_data/ directory.
bash download_data.sh

This section details how to reproduce the key results from the paper.
IMPORTANT: All commands must be run from the root directory of the DynaQuery repository.
The following three-stage process reproduces our results on the Spider benchmark, as presented in Tables [2] and [3] of the paper.
This first experiment evaluates the component-level performance (Precision, Recall, and F1-Score) of our SILE linker against the RAG baseline on a reproducible 500-entry sample.
Command:
python -m experiments.rq1.run_rq3_spider_linking

Step 1 (Data Preparation):
The script first checks for the existence of the full Spider Schema Linking dataset in external_data/. If found, it programmatically creates a reproducible 500-entry random sample and saves it to outputs/spider/spider_linking_sample_500.jsonl.
Step 2 (Evaluation):
It then iterates through this newly created 500-entry sample, running both the DynaQuery (SILE) and RAG Baseline linkers for each question.
Step 3 (Reporting):
Finally, it aggregates the results and prints the final performance scores to the console.
Expected Output: The script will run for approximately 45 minutes, depending on API latency. Upon completion, it will print a formatted table to the console with the final scores. The F1-Score should be approximately 0.77 for DynaQuery (SILE) and 0.34 for the RAG Baseline.
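For reference, here is a minimal sketch of how set-based precision, recall, and F1 can be computed for a linker's output on one question; the exact matching criterion and field names used by our script may differ:

# Hypothetical sketch: set-based linking metrics for one benchmark question.
# Assumes gold and predicted schema elements are normalized "table.column"
# strings; the real script's matching rules may be stricter or looser.
def linking_prf1(gold: set[str], pred: set[str]) -> tuple[float, float, float]:
    if not gold and not pred:
        return 1.0, 1.0, 1.0
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example call; the reported scores are aggregated over all 500 sampled questions.
p, r, f1 = linking_prf1({"singer.name", "singer.age"}, {"singer.name", "concert.year"})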
This script uses the sample file created in Stage 1 to run the full DynaQuery and RAG pipelines, generating the SQL queries needed for the final evaluation.
Command:
python -m experiments.rq1.run_rq1_spider_e2e

Expected Output:
The script will run for a significant amount of time (approx. 2-3 hours). It will create three new files in the outputs/spider/ directory:
- predictions_dynaquery-schema-linking.sql
- predictions_rag-schema-linking.sql
- dev_gold_sample-schema-linking.sql
This final stage uses our robust, self-contained analysis script to calculate the end-to-end Execution Accuracy and break down the results by query difficulty. This script uses our direct-execution protocol and a heuristic-based classifier modeled on the official Spider hardness rules.
python -m experiments.rq1.analyze_hardness \
--pred_file outputs/spider/predictions_dynaquery-schema-linking.sql \
--gold_file outputs/spider/dev_gold_sample-schema-linking.sql \
--db_dir external_data/spider/database/

python -m experiments.rq1.analyze_hardness \
--pred_file outputs/spider/predictions_rag-schema-linking.sql \
--gold_file outputs/spider/dev_gold_sample-schema-linking.sql \
--db_dir external_data/spider/database/

Expected Output: Each command will run quickly and print a formatted table breaking down the Execution Accuracy by difficulty. The overall EA should be 80.0% for DynaQuery and 57.1% for the RAG baseline.
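To make the direct-execution protocol and the hardness heuristic concrete, here is a simplified sketch; the actual analyze_hardness script's rules, thresholds, and result-comparison details may differ:

import sqlite3

def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Direct-execution check: run both queries against the same SQLite
    database and compare their result sets order-insensitively."""
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to execute counts as a miss
    finally:
        conn.close()
    return sorted(map(tuple, pred_rows)) == sorted(map(tuple, gold_rows))

def hardness_bucket(sql: str) -> str:
    """Toy keyword-counting heuristic in the spirit of the Spider hardness
    rules; the real classifier works on the parsed query, not raw keywords."""
    s = sql.upper()
    score = sum(kw in s for kw in ("JOIN", "GROUP BY", "ORDER BY", "HAVING",
                                   "INTERSECT", "UNION", "EXCEPT", "(SELECT"))
    if score == 0:
        return "easy"
    if score == 1:
        return "medium"
    return "hard" if score == 2 else "extra"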
This four-stage process reproduces our results on the BIRD benchmark, as presented in Tables [4] and [5] of the paper.
This first script is the main entry point for the BIRD experiment. It performs two key tasks:
- It creates a reproducible, 500-entry stratified random sample of the BIRD development set.
- It runs the full DynaQuery and RAG pipelines on this sample, generating "sparse" prediction files that are compatible with the official BIRD evaluation tools.
Command:
python -m experiments.rq1.run_rq3_bird_e2e

Output:
- data_samples/bird_dev_sample_500.json (the stratified sample file)
- outputs/bird_dynaquery/predict_dev.json (DynaQuery's (SILE) predictions)
- outputs/bird_rag/predict_dev.json (RAG's predictions)
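A minimal sketch of the kind of stratified sampling this stage performs, assuming BIRD dev entries carry a difficulty label and using an illustrative fixed seed (the script's actual seed, keys, and proportions may differ):

import json
import random
from collections import defaultdict

def stratified_sample(dev_path: str, n_total: int = 500, seed: int = 42) -> list[dict]:
    # Group BIRD dev questions by difficulty, then draw from each group
    # proportionally so the sample mirrors the full distribution.
    with open(dev_path, encoding="utf-8") as f:
        entries = json.load(f)
    by_difficulty = defaultdict(list)
    for e in entries:
        by_difficulty[e.get("difficulty", "unknown")].append(e)
    rng = random.Random(seed)
    sample = []
    for group in by_difficulty.values():
        k = round(n_total * len(group) / len(entries))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample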
This stage uses our modified versions of the official BIRD evaluation scripts to execute all queries and produce raw, per-query JSON reports for both Execution Accuracy (EA) and Valid Efficiency Score (VES).
Commands for DynaQuery:
- Get Raw EA Report
python -m vendor.evaluation \
--predicted_sql_path outputs/bird_dynaquery/ \
--ground_truth_path external_data/bird/ \
--data_mode dev \
--db_root_path external_data/bird/dev_databases/ \
--diff_json_path external_data/bird/dev.json \
--output_path outputs/bird/dynaquery_bird_results_ea.json

- Get Raw VES Report
python -m vendor.evaluation_ves \
--predicted_sql_path outputs/bird_dynaquery/ \
--ground_truth_path external_data/bird/ \
--data_mode dev \
--db_root_path external_data/bird/dev_databases/ \
--diff_json_path external_data/bird/dev.json \
--output_path outputs/bird/dynaquery_bird_results_ves.json

Commands for RAG Baseline:
- Get Raw EA Report
python -m vendor.evaluation \
--predicted_sql_path outputs/bird_rag/ \
--ground_truth_path external_data/bird/ \
--data_mode dev \
--db_root_path external_data/bird/dev_databases/ \
--diff_json_path external_data/bird/dev.json \
--output_path outputs/bird/rag_bird_results_ea.json

- Get Raw VES Report
python -m vendor.evaluation_ves \
--predicted_sql_path outputs/bird_rag/ \
--ground_truth_path external_data/bird/ \
--data_mode dev \
--db_root_path external_data/bird/dev_databases/ \
--diff_json_path external_data/bird/dev.json \
--output_path outputs/bird/rag_bird_results_ves.json

Output:
- outputs/bird/dynaquery_bird_results_ea.json
- outputs/bird/dynaquery_bird_results_ves.json
- outputs/bird/rag_bird_results_ea.json
- outputs/bird/rag_bird_results_ves.json
This stage uses our custom analysis scripts to parse the raw reports from Stage 2 and calculate the true, final scores for our 500-item sample.
Step 3a: Calculate Final Execution Accuracy (Table [6]) and Generate Failure Reports
python -m experiments.rq1.calculate_true_ves

Output:
- Prints the final, stratified VES breakdown tables for both models.
This final script takes the failure reports generated in Stage 3a and uses sqlglot to automatically categorize every error, producing the final analysis for the paper.
Command:
python -m experiments.rq1.analyze_rq1_bird_failures

Output:
- Prints the two failure analysis summary tables (one for DynaQuery, one for RAG).
- Creates detailed ..._analyzed.json reports in the outputs/bird/ directory.
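As a rough illustration of sqlglot-based categorization, the sketch below separates unparsable SQL from coarse structural mismatches; the categories used in the paper's analysis are more fine-grained and may differ:

import sqlglot
from sqlglot import exp

def categorize_failure(pred_sql: str, gold_sql: str) -> str:
    """Coarse, illustrative failure taxonomy; not the paper's exact scheme."""
    try:
        pred_ast = sqlglot.parse_one(pred_sql, read="sqlite")
    except sqlglot.errors.ParseError:
        return "syntax_error"
    gold_ast = sqlglot.parse_one(gold_sql, read="sqlite")

    def n_joins(ast):
        return len(list(ast.find_all(exp.Join)))

    if n_joins(pred_ast) != n_joins(gold_ast):
        return "join_mismatch"
    if bool(pred_ast.find(exp.Group)) != bool(gold_ast.find(exp.Group)):
        return "aggregation_mismatch"
    return "other_semantic_error"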
The OOD case study requires a small, persistent MySQL database.
First, log in to your MySQL server and create a new database:
-- (In your MySQL client)
CREATE DATABASE dynaquery_ood;

From your terminal (not the MySQL client), run the following command to load the schema and data.
Replace [username] with your MySQL username. You will be prompted for your password.
mysql -u [username] -p dynaquery_ood < setup/ood_database.sql

Update Credentials:
Ensure your .env file is configured with the correct credentials to access the dynaquery_ood database.
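If you want to sanity-check connectivity before running the experiments, here is a minimal sketch using mysql-connector-python; the environment variable names are placeholders and should match whatever your .env template defines:

import os
import mysql.connector  # pip install mysql-connector-python
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the current directory

# NOTE: DB_HOST / DB_USER / DB_PASSWORD are illustrative names; use the
# variable names from your own .env file.
conn = mysql.connector.connect(
    host=os.getenv("DB_HOST", "localhost"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASSWORD"),
    database="dynaquery_ood",
)
cursor = conn.cursor()
cursor.execute("SHOW TABLES;")
print(cursor.fetchall())  # should list the tables loaded from setup/ood_database.sql
conn.close()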
This experiment evaluates the three classifier architectures on our 1,000-sample IID test set.
The required train_split.csv (4,000 samples) and test_split.csv (1,000 samples) files are included in the DynaQuery-Eval-5K benchmark, which is downloaded into the external_data/ directory during setup.
This command re-trains the BERT model from scratch using the provided training split and evaluates it on the test split.
The command includes the exact hyperparameters we used in training.
python -m experiments.rq2.bert.train \
--train_file external_data/dynaquery_eval_5k_benchmark/train_split.csv \
--test_file external_data/dynaquery_eval_5k_benchmark/test_split.csv \
--epochs 4 \
--learning_rate 2e-5 \
--batch_size 32 \
--output_dir outputs/rq2/bert_retrained_checkpoint

Expected Output:
The script will train the model and then print a final evaluation report.
The eval_F1_macro score should be approximately 0.991.
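For orientation, here is a condensed sketch of BERT fine-tuning with the Hugging Face Trainer using the same hyperparameters; the CSV column names, base checkpoint, and label encoding are assumptions, and the bundled training script remains the authoritative version:

import pandas as pd
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed CSV layout: a "text" column with the query and an integer "label" column.
train_df = pd.read_csv("external_data/dynaquery_eval_5k_benchmark/train_split.csv")
test_df = pd.read_csv("external_data/dynaquery_eval_5k_benchmark/test_split.csv")

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed base model

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_pandas(train_df).map(tokenize, batched=True)
test_ds = Dataset.from_pandas(test_df).map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=train_df["label"].nunique())

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"F1_macro": f1_score(labels, logits.argmax(-1), average="macro")}

args = TrainingArguments(
    output_dir="outputs/rq2/bert_retrained_checkpoint",
    num_train_epochs=4,               # matches --epochs 4
    learning_rate=2e-5,               # matches --learning_rate 2e-5
    per_device_train_batch_size=32,   # matches --batch_size 32
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=test_ds, compute_metrics=compute_metrics)
trainer.train()
print(trainer.evaluate())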
LLM (Rule-Based Prompt):
This is the default behavior of the script and represents our final, recommended architecture.
python -m experiments.rq2.run_rq2_classifier_comparison

Expected Output:
The macro F1-score should be approximately 78.0%.
(Optional) LLM (Descriptive Prompt):
To reproduce the score for the descriptive prompt, you must manually edit the dynaquery/chains/llm_classifier.py file.
Inside the get_llm_native_classifier_chain function, perform the following two steps:
- Comment out the active "Rule-Based Prompt" string.
- Uncomment the inactive "Descriptive Prompt" string provided in the comments.
After saving the change, run the evaluation command again from the project root directory:

python -m experiments.rq2.run_rq2_classifier_comparison

Expected Output:
With the descriptive prompt active, the macro F1-score should be approximately 94.7%.
Artifact Note:
For full transparency, we also provide the script used to generate the data splits at experiments/rq2/bert/split.py.
Running this script is not necessary to reproduce the results, as the output files are already provided in the Kaggle download.
The analysis in the paper is based on static, verifiable execution logs from our OOD experiment, which are provided as research artifacts.
- Verifiable Execution Logs: The complete console output for each pipeline is located in:
  - artifacts/rq2/bert_run_log.txt
  - artifacts/rq2/llm_descriptive_run_log.txt
  - artifacts/rq2/llm_rule_based_run_log.txt
- Human-Annotated Ground Truth: The expected correct classification and justification for each query is provided in the subdirectory:
artifacts/rq2/ground_truth/
To verify our claims, inspect these execution logs. For instance, the BERT failure in Figure 2(a) can be confirmed by examining the "Question 2" block in artifacts/rq2/bert_run_log.txt.
This final evaluation tests the end-to-end generalization of the full DynaQuery framework on our novel Olist Multimodal Benchmark.
The Olist benchmark requires a dedicated MySQL database.
- Create the Database:
CREATE DATABASE dynaquery_olist;
- Load the Data:
mysql -u [username] -p dynaquery_olist < setup/olist_multimodal_benchmark.sql

- Update Credentials:
Ensure your .env file is configured to access the dynaquery_olist database.
The evaluation consists of running our two main pipelines against their respective benchmark query suites (provided in artifacts/rq3/benchmark_suites/).
The experiment is run in two modes: a baseline "schema-aware" mode and a "semantics-aware" mode, controlled by the presence of the dynaquery/config/schema_comments.json file.
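The format of schema_comments.json is defined by the repository; as a purely hypothetical illustration of the idea behind semantics-awareness (column-level comments keyed by table, merged into the schema description seen by the LLM), consider the following sketch:

# Purely illustrative structure; the real dynaquery/config/schema_comments.json
# may be organized differently.
example_comments = {
    "products": {
        "product_photos_qty": "Number of photos published with the product listing.",
    },
    "order_reviews": {
        "review_comment_message": "Free-text customer review (Portuguese).",
    },
}

def enrich_schema_line(table: str, column: str, comments: dict) -> str:
    """Append a semantic comment to a column description when one exists."""
    note = comments.get(table, {}).get(column)
    return f"{table}.{column}" + (f"  -- {note}" if note else "")

print(enrich_schema_line("products", "product_photos_qty", example_comments))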
- Run the Baseline (Schema-Aware) Evaluation:
First, ensure the schema comments file does not exist by temporarily renaming it:
mv dynaquery/config/schema_comments.json dynaquery/config/schema_comments.json.bak

Now, run the benchmark scripts:
# Run SQP Baseline
python -m experiments.rq3.run_sqp_benchmark \
--benchmark_file artifacts/rq3/benchmark_suites/sqp_benchmark_suite.csv \
--results_file outputs/rq3/sqp_baseline_results.csv
# Run MMP Baseline
python -m experiments.rq3.run_mmp_benchmark \
--benchmark_file artifacts/rq3/benchmark_suites/mmp_benchmark_suite.csv \
--results_file outputs/rq3/mmp_baseline_results.csv

Expected Output:
- The overall Execution Accuracy (SQP) should be approximately 65.0%.
- The F1-Score (MMP) should be approximately 54.0%.
- Run the Enriched (Semantics-Aware) Evaluation:
Restore the schema comments file to enable semantics-awareness:
mv dynaquery/config/schema_comments.json.bak dynaquery/config/schema_comments.json

Now, run the benchmark scripts again:
# Run SQP Enriched
python -m experiments.rq3.run_sqp_benchmark \
--benchmark_file artifacts/rq3/benchmark_suites/sqp_benchmark_suite.csv \
--results_file outputs/rq3/sqp_enriched_results.csv
# Run MMP Enriched
python -m experiments.rq3.run_mmp_benchmark \
--benchmark_file artifacts/rq3/benchmark_suites/mmp_benchmark_suite.csv \
--results_file outputs/rq3/mmp_enriched_results.csv

Expected Output:
- The overall Execution Accuracy (SQP) should be approximately 95.0%.
- The F1-Score (MMP) should be approximately 93.0%.
The summary results printed to the console by the scripts directly correspond to Tables 9 and 10 in the paper.
For a more detailed, row-by-row verification of our findings, we provide the complete outputs from our original experimental run as verifiable artifacts.
These are located in the artifacts/rq3/verifiable_logs/ directory and include:
- Detailed Result CSVs (e.g., results-sqp-enriched.csv): These files contain the generated SQL for each query (for SQP) or the final set of accepted IDs (for MMP), allowing for a direct audit of our accuracy calculations, as sketched below.
- Verbose Log Files (e.g., output-sqp-enriched.log): These files contain the complete console trace for each run, showing the step-by-step reasoning of the LLM components.
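To recompute an F1-style score from such a result CSV yourself, a minimal sketch is shown below; the filename, column names, and ID serialization are placeholders, and how the paper aggregates F1 may differ:

import ast
import pandas as pd

# Placeholder filename and columns ("accepted_ids", "gold_ids"); check the
# actual CSV headers in artifacts/rq3/verifiable_logs/ and adjust.
df = pd.read_csv("artifacts/rq3/verifiable_logs/results-sqp-enriched.csv")

def f1_for_row(row) -> float:
    pred = set(ast.literal_eval(str(row["accepted_ids"])))
    gold = set(ast.literal_eval(str(row["gold_ids"])))
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print("Mean per-query F1:", df.apply(f1_for_row, axis=1).mean())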
The original code for the DynaQuery framework is licensed under the MIT License.
This work builds upon, includes derivative works of, and uses data from several excellent open-source projects. We are grateful to their creators for making their work available.
- Spider Benchmark Tooling: Our evaluation scripts in /experiments/analysis/ are derivative works of the official Spider benchmark tooling and are used in compliance with the Apache License 2.0. The full license text is available in LICENSES/LICENSE-SPIDER-EVAL.txt.
- BIRD Benchmark Tooling: The evaluation scripts in vendor/bird_eval/ are derivative works of the official BIRD benchmark tooling and are used in compliance with the MIT License. The full license text is available in LICENSES/LICENSE-BIRD-EVAL.txt.
- Public Datasets: This project requires several public datasets for evaluation. Please see LICENSES/DATA_LICENSES.md for information regarding their respective licenses.