RubikSQL Paper (LaTeX Source)

This repository contains the ACM LaTeX source of the paper:

Rubik: Bridging the NL2SQL Research-to-Production Gap via Lifelong Learning Agentic Knowledge Base
SIGMOD 2026 Industrial Track (IN PROGRESS)

Compiled PDF:


Abstract

Deploying NL2SQL systems in real-world enterprises often presents significant challenges, including domain-specific terminology, implicit user intent, wide table schemas, and contextual sensitivity. We present RubikSQL, a novel system that redefines NL2SQL as a lifelong learning task requiring continuous Knowledge Base (KB) maintenance. RubikSQL emphasizes the KB construction and evolution, integrating various database context engineering and user query augmentation techniques such as database profiling, structured information extraction, agentic context mining, and Chain-of-Thought (CoT)-enhanced SQL profiling. To utilize diverse knowledge sources within the KB, RubikSQL proposes the Unified Knowledge Format (UKF) as a semantic layer. Finally, RubikSQL utilizes Knowledge Distillation as building blocks and assembles a multi-agent workflow tailored for enterprise NL2SQL. RubikSQL achieves SOTA performance on both the KaggleDBQA and BIRD Mini-Dev datasets. To bridge the research-to-production gap, we also release RubikBench, an enterprise NL2SQL benchmark that captures the vital traits of industrial scenarios.


Result Highlights

BIRD Mini-Dev / Dev

Method            TTS    Dev EX (%)
gemini-2.5-flash  n=1    59.4 (Mini-Dev)
CSC-SQL           n=72   71.33
XiYan-SQL         n=5    73.34
Contextual-SQL    n=32   73.50
CHASE-SQL         n=21   74.90
AskData           n=3    73.0 / 75.36
Rubik (Ours)      n=1    75.9 (Mini-Dev)
Rubik (Ours)      n=8    77.3 (Mini-Dev)

Rows marked (Mini-Dev) were evaluated on Mini-Dev, a subset of BIRD Dev; all other rows report results on the full Dev set.
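
For readers unfamiliar with the metric, EX is assumed here to mean execution accuracy in the usual BIRD sense: a prediction counts as correct when its result set matches that of the gold query. The following is a minimal sketch under that assumption, not the official evaluator and not the evaluation code used in the paper. The helper names (`execution_match`, `execution_accuracy`) and the toy `sales` table are hypothetical and exist only for illustration.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same set of rows (order-insensitive)."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a predicted query that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    return set(predicted_rows) == set(gold_rows)


def execution_accuracy(conn, pairs):
    """Percentage of (predicted, gold) query pairs whose results match."""
    hits = sum(execution_match(conn, predicted, gold) for predicted, gold in pairs)
    return 100.0 * hits / len(pairs)


# Toy in-memory database, made up for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('north', 10), ('south', 20), ('north', 5);
""")

pairs = [
    # Same rows, different ordering: counted as a match.
    ("SELECT region, SUM(amount) FROM sales GROUP BY region",
     "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"),
    # Row count vs. total amount: results differ.
    ("SELECT COUNT(*) FROM sales",
     "SELECT SUM(amount) FROM sales"),
    # Predicted query references a table that does not exist.
    ("SELECT amount FROM missing_table",
     "SELECT amount FROM sales"),
]

score = execution_accuracy(conn, pairs)
print(f"EX = {score:.1f}%")
```

Comparing rows as sets ignores row order and duplicate rows, which is a simplification; a stricter evaluator might treat ordered queries differently.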

KaggleDBQA

Method          TTS   Test EX (%)
RAT-SQL         n=1   26.8
DIN-SQL         n=1   27.0
ZeroNL2SQL      n=1   44.9
ODIS-Codex      n=1   54.8
Rubik (Ours)    n=1   54.1
Rubik (Ours)    n=8   58.9

Conference Information

  • Venue: ACM SIGMOD 2026, Industrial Track (IN PROGRESS)
  • Area: NL2SQL, agentic systems, knowledge bases, database systems

Repository Layout

  • main.tex – main ACM LaTeX file.
  • src/ – section-wise LaTeX sources (intro, method, benchmark, experiments, etc.).
  • bib/ – bibliography files.
  • fig/ – figures used in the paper.

To build the paper locally:

pdflatex main.tex
bibtex main
pdflatex main.tex
pdflatex main.tex
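
The four commands above can also be wrapped in a small script that stops at the first failing step. This is a sketch rather than part of the repository: the function names `build_commands` and `build` are hypothetical, and the `-interaction=nonstopmode` and `-halt-on-error` flags are additions not present in the commands listed above (they make pdflatex exit on an error instead of waiting for input).

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(main_tex="main.tex"):
    """Return the pdflatex -> bibtex -> pdflatex -> pdflatex sequence as argument lists."""
    stem = Path(main_tex).stem  # bibtex takes the job name without the .tex suffix
    latex = ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", main_tex]
    return [latex, ["bibtex", stem], latex, latex]


def build(main_tex="main.tex"):
    """Run each step in order; raises CalledProcessError on the first failure."""
    for tool in ("pdflatex", "bibtex"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH; install a TeX distribution first")
    for cmd in build_commands(main_tex):
        subprocess.run(cmd, check=True)
```

Calling `build()` from the repository root runs the sequence against main.tex. The two trailing pdflatex passes are kept because the references and citations produced by bibtex typically need more than one pass to resolve.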

Citation

If you find Rubik or RubikBench useful, please cite:

@misc{chen2025rubiksqllifelonglearningagentic,
      title={Rubik: Bridging the NL2SQL Research-to-Production Gap via Lifelong Learning Agentic Knowledge Base}, 
      author={Zui Chen and Han Li and Xinhao Zhang and Xiaoyu Chen and Chunyin Dong and Yifeng Wang and Xin Cai and Su Zhang and Ziqi Li and Chi Ding and Jinxu Li and Shuai Wang and Dousheng Zhao and Sanhai Gao and Guangyi Liu},
      year={2025},
      eprint={2508.17590},
      archivePrefix={arXiv},
      primaryClass={cs.DB},
      url={https://arxiv.org/abs/2508.17590}, 
}
