This project proposes Text2VectorSQL, a new task aimed at building a unified natural language interface for querying both structured and unstructured data.
Traditional Text2SQL systems have made significant progress in accessing structured data like tables, but they cannot handle semantic or multi-modal queries. Meanwhile, vector search has become the standard for querying unstructured data (such as text and images), but integrating it with SQL (known as VectorSQL) still relies on manual, error-prone query construction and lacks standardized evaluation methods.
Text2VectorSQL aims to bridge this fundamental gap. This repository provides a complete foundational ecosystem, divided into four core modules:
- Data Synthesis (`Data_Synthesizer`): A scalable pipeline for automatically synthesizing "quadruplet" training data (database, natural language question, VectorSQL query, chain-of-thought) starting from public base tables; see the illustrative sample after this list.
- Execution Engine (`Execution_Engine`): A backend engine responsible for parsing and executing VectorSQL queries. It resolves the special `lembed(model, text)` function by calling the Embedding Service, then translates the query into a native form compatible with the target database (SQLite, PostgreSQL, ClickHouse).
- Embedding Service (`Embedding_Service`): A high-performance API service based on FastAPI that provides on-demand text and image vectorization for the Execution Engine. It supports multiple models and multiple GPUs, and automatically caches models.
- Evaluation Framework (`Evaluation_Framework`): A framework for comprehensively evaluating the performance of Text2VectorSQL models. It provides an accurate assessment by executing both the model-generated SQL and the gold SQL, then comparing their execution results (rather than the SQL strings).
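
To make the quadruplet format concrete, a hypothetical sample might look like the following (the field names are illustrative, not `Data_Synthesizer`'s actual schema; the query reuses the `musical` example from Scenario 2 below):

```python
# A hypothetical Text2VectorSQL training sample (field names are illustrative).
sample = {
    "database": "musical",  # identifier of the synthesized database
    "question": "Which five musicals are closest in category to opera?",
    "vectorsql": (
        "SELECT Name FROM musical "
        "ORDER BY L2Distance(Category_embedding, "
        "lembed('all-MiniLM-L6-v2', 'opera')) LIMIT 5;"
    ),
    "chain_of_thought": (
        "The question asks for semantic similarity to 'opera', so compare "
        "Category_embedding against the embedded query text with L2Distance "
        "and return the top 5 rows."
    ),
}
```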
- Clone this repository:

```bash
git clone https://github.com/OpenDCAI/Text2VectorSQL.git --depth 1
cd Text2VectorSQL
```

- Install the dependencies for each module you need. Each module (`Data_Synthesizer`, `Execution_Engine`, `Embedding_Service`, `Evaluation_Framework`) has a `requirements.txt` file in its directory. For example, to install the dependencies for the Execution Engine:

```bash
cd Execution_Engine
pip install -r requirements.txt
```
Depending on your goal, you can use the project's toolchain according to the following scenarios:
**Scenario 1: Start the Embedding Service**

The `Execution_Engine` depends on this service to fetch vectors.
- Configure the service: Go to the `Embedding_Service/` directory, create a `config.yaml` file, and specify the models you want to use.

```yaml
# Example config.yaml
server:
  host: "0.0.0.0"
  port: 8000
models:
  - name: "all-MiniLM-L6-v2"
    hf_model_path: "sentence-transformers/all-MiniLM-L6-v2"
    local_model_path: "./models/all-MiniLM-L6-v2"
    trust_remote_code: true
  # ... other models
```
- Start the service:

```bash
cd Embedding_Service/
bash run.sh
```

The service will run on `http://0.0.0.0:8000`. Models are downloaded automatically on the first run.
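
Once the service is up, you can smoke-test it from Python. The request and response shapes below are assumptions for illustration; consult the actual API schema in `Embedding_Service/`:

```python
import requests

# Health check.
print(requests.get("http://127.0.0.1:8000/health", timeout=10).json())

# Hypothetical /embed request body; the real schema may differ.
resp = requests.post(
    "http://127.0.0.1:8000/embed",
    json={"model": "all-MiniLM-L6-v2", "texts": ["opera"]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # expected: vector(s) for the input text
```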
**Scenario 2: Execute a VectorSQL Query**

Ensure the Embedding Service from Scenario 1 is running. The `Execution_Engine` can be used as a command-line tool:
- Configure the engine: Go to the `Execution_Engine/` directory, create `engine_config.yaml`, and specify the Embedding Service address and the database connection information.

```yaml
embedding_service:
  url: "http://127.0.0.1:8000/embed"  # Corresponds to the service in Scenario 1
database_connections:
  clickhouse:
    host: "localhost"
    port: 8123
    # ...
timeouts:
  sql_execution: 60
```
- Run the query:

```bash
cd Execution_Engine/
python execution_engine.py \
  --sql "SELECT Name FROM musical m ORDER BY L2Distance(Category_embedding, lembed('all-MiniLM-L6-v2','opera')) LIMIT 5;" \
  --db-type "clickhouse" \
  --db-identifier "musical" \
  --config "engine_config.yaml"
```
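
For this query, the engine embeds `'opera'` with `all-MiniLM-L6-v2` via the Embedding Service, splices the resulting vector literal into the `ORDER BY` clause in place of the `lembed(...)` call, and executes the rewritten query against the `musical` database in ClickHouse.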
**Scenario 3: Synthesize Training Data**

Use the `Data_Synthesizer` module.
- Configure the pipeline: Go to the `Data_Synthesizer/` directory and copy `pipeline/config.yaml.example` to `config.yaml`. Fill in your LLM API key, base URL, etc., in `config.yaml`.
- Select a dataset: Edit `pipeline/general_pipeline.py` and modify the `DATASET_BACKEND` and `DATASET_TO_LOAD` variables at the top.
- Run the pipeline:

```bash
cd Data_Synthesizer/
python pipeline/general_pipeline.py
```

The final synthesized dataset will be saved in the `result_path` configured in `config.yaml`.
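
For reference, the edit at the top of `pipeline/general_pipeline.py` might look like this (the values shown are placeholders, not a list of supported backends or datasets):

```python
# Top of pipeline/general_pipeline.py -- illustrative placeholder values.
DATASET_BACKEND = "huggingface"  # hypothetical backend identifier
DATASET_TO_LOAD = "wikitables"   # hypothetical dataset name
```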
**Scenario 4: Evaluate a Model**

Use the `Evaluation_Framework` module.
- Prepare data: Ensure you have an evaluation file (`eval_data_file`) containing the model's predicted SQL. If not, run `generate.py` to create it.
- Configure evaluation: Go to the `Evaluation_Framework/` directory and create `evaluation_config.yaml`. Configure the database type (`db_type`), the database file root directory (`base_dir`), the input file (`eval_data_file`), and the evaluation metrics (`metrics`).
- Run the evaluation pipeline:

```bash
cd Evaluation_Framework/
# Run the full pipeline (SQL execution + result evaluation)
python run_eval_pipeline.py --all --config evaluation_config.yaml

# Or run in steps
# python run_eval_pipeline.py --execute --config evaluation_config.yaml
# python run_eval_pipeline.py --evaluate --config evaluation_config.yaml
```
The evaluation report (a JSON file) will be saved in the configured output path. You can use `aggregate_results.py` to aggregate multiple reports into a CSV, as sketched below.
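
For instance, aggregation along the lines of `aggregate_results.py` could look as follows (the real script's report schema, directory layout, and CLI may differ):

```python
import csv
import glob
import json

# Collect the per-experiment metrics from each JSON report.
rows = []
for path in glob.glob("reports/*.json"):  # hypothetical report directory
    with open(path) as f:
        report = json.load(f)
    rows.append({"report": path, **report.get("metrics", {})})  # assumed schema

# Write one CSV row per report for side-by-side comparison.
fieldnames = sorted({key for row in rows for key in row})
with open("summary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
```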
The `Data_Synthesizer` module is the foundation for training powerful Text2VectorSQL models (like UniVectorSQL). It generates high-quality Text2VectorSQL datasets through an automated pipeline.
Core Pipeline:
- Database Synthesis & Enhancement (`database_synthesis`): Generates structured databases based on Web tables.
- Database Vectorization (`vectorization`): Identifies semantically rich columns (e.g., descriptions), uses Sentence Transformers to generate vector embeddings, and builds new databases that support vector queries.
- VectorSQL & Question Synthesis (`synthesis_sql`, `synthesis_nl`): Automatically generates VectorSQL queries of varying complexity and reverse-translates them into corresponding natural language questions.
- Chain-of-Thought Synthesis (`synthesis_cot`): Generates detailed reasoning steps (Chain-of-Thought) for each data sample, explaining the derivation process from the question to the VectorSQL.
You can run the complete end-to-end synthesis process with a single command using the `pipeline/general_pipeline.py` script.
The `Execution_Engine` is the runtime core of Text2VectorSQL. It acts as a bridge, parsing VectorSQL queries that contain semantic search intent and translating them into native queries that the database can understand.
Core Features:
- Parses the `lembed` Function: The engine is specifically designed to process queries with the `lembed(model, text)` syntax.
- Dynamic Vectorization: The engine parses the query, extracts all unique `(model, text)` combinations, and sends a network request to an external Embedding Service to obtain the vector representations for this text.
- SQL Translation & Execution: After receiving the vectors, the engine replaces each `lembed(...)` call with the database's native vector literal (e.g., `[0.1, 0.2, ...]` in PostgreSQL), then connects to the target database (PostgreSQL, ClickHouse, or SQLite) and executes the translated native query; see the sketch after this list.
- Robustness: The engine implements robust timeout and error management for network requests and database execution.
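
The translation step can be pictured with a minimal sketch (illustrative only, not the engine's actual implementation; the `/embed` request shape and the `embeddings` response field are assumptions):

```python
import re

import requests

# Matches lembed('<model>', '<text>') calls in a VectorSQL string.
LEMBED = re.compile(r"lembed\('([^']+)',\s*'([^']+)'\)")

def embed(model: str, text: str) -> list[float]:
    # Assumed request/response shape for the Embedding Service.
    resp = requests.post(
        "http://127.0.0.1:8000/embed",
        json={"model": model, "texts": [text]},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embeddings"][0]  # hypothetical response field

def translate(sql: str) -> str:
    """Replace each lembed(model, text) call with a native vector literal."""
    def repl(match: re.Match) -> str:
        vector = embed(match.group(1), match.group(2))
        return "[" + ", ".join(f"{x:.6f}" for x in vector) + "]"
    return LEMBED.sub(repl, sql)
```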
The `Embedding_Service` is a standalone, high-performance API service that acts as the vectorization backend for the `Execution_Engine`.
Key Features:
- High-Performance: Based on FastAPI and Uvicorn, providing asynchronous processing capabilities.
- Multi-Model & Multi-GPU: Supports loading and managing multiple models simultaneously via `config.yaml` (e.g., `all-MiniLM-L6-v2` or `CLIP`), with support for tensor parallelism.
- Automatic Caching: Automatically downloads models from the Hugging Face Hub on startup and caches them locally, avoiding repeated downloads.
- Core API Endpoints:
  - `/embed`: Receives a model name and a list of texts/images, and returns vectors.
  - `/health`: Health check.
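
A stripped-down sketch of such a service is shown below (a minimal illustration, not the actual implementation: model loading from `config.yaml`, batching, image inputs, and multi-GPU handling are all omitted, and the request schema is assumed):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Illustrative only: the real service loads its model list from config.yaml.
MODELS = {
    "all-MiniLM-L6-v2": SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2"),
}

class EmbedRequest(BaseModel):
    model: str        # model name, e.g. "all-MiniLM-L6-v2"
    texts: list[str]  # texts to vectorize

@app.post("/embed")
def embed(req: EmbedRequest):
    vectors = MODELS[req.model].encode(req.texts).tolist()
    return {"embeddings": vectors}

@app.get("/health")
def health():
    return {"status": "ok"}
```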
To objectively evaluate the true capabilities of Text2VectorSQL models, we built the dedicated `Evaluation_Framework`. The framework compares the execution results of the predicted SQL and the gold SQL, rather than just the SQL strings. This is crucial for vector search, as approximate results are acceptable and SQL queries with different syntax can produce the same result.
Evaluation Pipeline:
- SQL Generation (`generate.py`): Calls the model (via vLLM or an API) to generate predicted SQL queries for the evaluation questions.
- SQL Execution (`sql_executor.py`): Executes the predicted SQL and all gold SQL queries separately in sandboxed processes (with timeouts) and caches both sets of execution results.
- Result Evaluation (`evaluate_results.py`): Compares the predicted results with the gold results and calculates a series of evaluation metrics (the set-based metrics are sketched after this list):
  - Set-based Metrics: Precision, Recall, F1-Score.
  - Rank-based Metrics: nDCG@k, MAP, MRR.
  - Decomposed Metrics: $ACC_{SQL}$ (SQL skeleton correctness) and $ACC_{Vec}$ (vector component correctness).
- Result Aggregation (`aggregate_results.py`): Aggregates the JSON reports from multiple experiments into an easy-to-compare CSV file.
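
As an illustration of the set-based metrics, treating each execution result as a set of rows (a minimal sketch; the framework's own implementation may normalize rows differently):

```python
def set_metrics(predicted: list[tuple], gold: list[tuple]) -> dict[str, float]:
    """Precision, Recall, and F1 over execution results treated as row sets."""
    pred, ref = set(predicted), set(gold)
    if not pred or not ref:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    tp = len(pred & ref)  # rows present in both result sets
    precision = tp / len(pred)
    recall = tp / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: predicted rows vs. gold rows from executing the two queries.
print(set_metrics([("Cats",), ("Evita",)], [("Cats",), ("Wicked",)]))
# -> {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```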
If you find this work helpful, please cite our paper:
```bibtex
@article{wang2025text2vectorsql,
  title={Text2VectorSQL: Towards a Unified Interface for Vector Search and SQL Queries},
  author={Wang, Zhengren and Yao, Dongwen and Li, Bozhou and Ma, Dongsheng and Li, Bo and Li, Zhiyu and Xiong, Feiyu and Cui, Bin and Tang, Linpeng and Zhang, Wentao},
  journal={arXiv preprint arXiv:2506.23071},
  year={2025}
}
```

This repo benefits from LLaMA-Factory, vLLM, ChatData, PostgreSQL, ClickHouse, pgvector, and sqlite-vec. Thanks for these wonderful works.
For any questions or feedback, please reach out to us at wzr@stu.pku.edu.cn.

