Database Homepage: RubikBench
RubikBench is an enterprise-scale financial database designed for realistic Natural Language to SQL (NL2SQL) research and evaluation.
The RubikBench database contains the financial data of APEX, a hypothetical international automobile manufacturing and sales company. As a financial database, it is designed to support various analytical queries related to the company's operations, sales, and financial performance. This (imaginary) company operates mainly in China, the United States, and Europe. Therefore the database is bilingual, with both English and Chinese values, and uses three currencies: CNY, USD, and EUR.
While the data values are synthesized, the schemas and structural patterns are closely modeled after actual enterprise financial databases, ensuring practical relevance for NL2SQL system development and evaluation. The database is specifically designed to reflect the complexities encountered in real-world enterprise environments, including wide table schemas, domain-specific knowledge, diverse metrics and data calibers, etc.
Meanwhile, we provide a full NL2SQL evaluation toolkit, the `rubikbench` Python package and CLI, which helps researchers easily evaluate their systems on RubikBench as well as other popular benchmarks such as BirdSQL (MINIDEV) and KaggleDBQA.
# Clone the repository
git clone https://github.com/Magolor/RubikBench.git
cd RubikBench
# Install dependencies
pip install -e .
# AgentHeaven Init
ahvn setup

It is recommended to install fzf for a better CLI experience.
# Download RubikBench from HuggingFace (default)
rubikbench setup
# Download BirdSQL MINIDEV from Google Drive
rubikbench setup -B BirdSQL
# Download KaggleDBQA from Google Drive
rubikbench setup -B KaggleDBQA
# Download to custom location
rubikbench setup --data <your_data_path>
# Alternatively, directly git clone from HuggingFace to get full dataset (including parquet files and stats)
# git lfs install
git clone https://huggingface.co/datasets/Magolor/RubikBench ./data/

Note: The setup command automatically downloads and extracts the archive. Use --remove-zip to delete the archive after extraction.
# Generate a submission template (uses default queries path)
rubikbench template -out submission.json
# It is recommended to test against a certain subset of queries (e.g., lang-english)
rubikbench template -q ./data/RubikBench/queries/RubikBench.json --tags lang-english -out submissions/template.json
# You can also specify a custom SQL placeholder (default is empty string)
rubikbench template -q ./data/RubikBench/queries/RubikBench.json --tags lang-english -out submissions/template.json --sql "SELECT 1+1;"
# Now fill in your SQL predictions in submission.json
# For example:
# {
# "Q00001": "SELECT SUM(ptd_amt_a_cny) FROM INCOME_ALL WHERE period = '202506'",
# "Q00002": "SELECT ...",
# ...
# }
# Then evaluate
rubikbench eval submission.json
# Evaluate specific queries or difficulty levels
rubikbench eval submission.json --ids Q00001 --ids Q00002
rubikbench eval submission.json --split simple
# Use unordered evaluation
rubikbench eval submission.json --unordered
# Enable verbose output (progress bar)
rubikbench eval submission.json --verbose
# Save detailed results
rubikbench eval submission.json --output-file results.json

Note: `rubikbench eval` only evaluates queries that exist in the submission file. If you want to evaluate a subset of queries, include only those keys in the submission file. Keys with empty or null SQL are marked as failed. It is generally recommended to generate the template on a specific subset of queries (e.g., by tags or difficulty levels) to get more meaningful results. For example, the default `submissions/template.json` is obtained via: `rubikbench template --tags lang-english -out submissions/template.json`.
from ahvn.utils.db import Database
from rubikbench import RubikBenchEvaluator, QuerySet
db = Database(provider="duckdb", database="./data/RubikBench/databases/RubikBench.duckdb")
queries = QuerySet("./data/RubikBench/queries/RubikBench.json")
# Create a full template (empty SQL placeholders)
# queries.create_template("submission_full.json") # Uncomment to create full template
# Sample 10 queries and create a template with placeholder SQL
sample_queries = queries.sample(n=10, seed=42)
sample_queries.create_template("./submissions/sample.json", placeholder="SELECT 1+1;")
# Evaluate submissions (default: ordered evaluation)
evaluator = RubikBenchEvaluator(
db=db,
    queries=QuerySet("./data/RubikBench/queries/RubikBench.json"),
bf_beta=2.0,
sf_beta=1.0,
    src="duckdb"  # Your SQL dialect; if not DuckDB, SQL will be automatically converted to DuckDB via SQLGlot
)
report = evaluator.evaluate_submission(submission="./submissions/sample.json")
# Access results by difficulty
print(f"Overall BF2 Ordered: {report.scores['overall']['bfo']:5.2%}")
print(f"Simple BF2 ordered: {report.scores['simple']['bfo']:5.2%}")
print(f"Moderate BF2 ordered: {report.scores['moderate']['bfo']:5.2%}")
print(f"Challenging BF2 ordered: {report.scores['challenging']['bfo']:5.2%}")
# Access summary statistics
print(f"Total queries: {report.N['overall']}")
print(f"Compilable: {report.C['overall']}")
print(f"Compilable rate: {report.compilable['overall']:5.2%}")

Currently this feature is only tested for RubikBench; other databases may have issues.
When fzf is installed, you can use:
rubikbench browse [--split/-s simple/moderate/challenging/nightmare/unknown] [--tags/-t tag1] [--tags/-t tag2] ...
to interactively browse and filter queries via a terminal UI. It supports filtering by:
- Difficulty levels
- Tags
- Keywords in questions (partial match)
Use the mouse to scroll through long questions.
Type keywords to filter questions in real time.
Use TAB to toggle a preview containing query information, SQL, and execution results (rubikbench exec).
Use CTRL+E to open a pager, which makes copying easy.
Use CTRL+D to copy the currently selected query display to the clipboard.
- Total Size: ~ 38 GB (DuckDB) / ~ 11 GB (parquet)
- Total Tables: 20 tables
- Total Records: ~ 901.53 million rows
- Time Coverage: Monthly from January 2020 to December 2025 (72 months)
- Languages: Bilingual (English and Chinese)
- Database Format: DuckDB Native (recommended), parquet (for huggingface)
- Formats like SQLite are not supported, as they do not compress the data; the full, uncompressed dataset can exceed 500 GB.
| Table Name | Rows | Columns | Period | Region | Customer | Dealer | Product | Contract | Project | Revenue | Expense |
|---|---|---|---|---|---|---|---|---|---|---|---|
| INCOME_ALL | 12.41M | 61 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| INCOME_ACCESSORY | 3.37M | 61 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| INCOME_CYBVEHSYS | 0.99M | 61 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| INCOME_GALMOTGRO | 1.72M | 61 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| INCOME_NOVDYN | 1.42M | 61 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| INCOME_SERVICE | 1.48M | 61 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| INCOME_SHIMOTCOR | 1.13M | 61 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| INCOME_TYRAUTGRO | 1.04M | 61 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| INCOME_WEYMOTCOR | 1.25M | 61 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| BUDGET_AND_FORECAST_ALL | 96.58M | 37 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| BUDGET_AND_FORECAST_ACCESSORY | 30.18M | 37 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| BUDGET_AND_FORECAST_CYBVEHSYS | 8.50M | 37 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| BUDGET_AND_FORECAST_GALMOTGRO | 11.23M | 37 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| BUDGET_AND_FORECAST_NOVDYN | 10.46M | 37 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| BUDGET_AND_FORECAST_SERVICE | 9.48M | 37 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| BUDGET_AND_FORECAST_SHIMOTCOR | 8.86M | 37 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| BUDGET_AND_FORECAST_TYRAUTGRO | 8.51M | 37 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| BUDGET_AND_FORECAST_WEYMOTCOR | 9.35M | 37 | 72 | ✓ | ✓ | ✓ | ✓ | ||||
| PROFIT_AND_LOSS | 161.74M | 67 | 72 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| SALES_LEDGER | 521.81M | 55 | 72 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Dimension | Hierarchy Levels | Total Entities |
|---|---|---|
| Period | 1 | 72 months (202001-202512) |
| Region | 6 | 2 overseas/domestic, 11 sales regions, 22 countries, 37 national areas, 46 provinces, 47 cities |
| Product | 4 | 3 (Lv0), 19 (Lv1), 93 (Lv2), 353 (Lv3) |
| Customer | 4 | 4 (Lv1), 9 (Lv2), 14 (Lv3), 22 (Lv4) |
| Dealer | 3 | 3 (Lv1), 5 (Lv2), 9 (Lv3) including 'Direct Sales' |
| Report Item | 4 | 2 (Lv0), 6 (Lv1), 10 (Lv2), 22 (Lv3) |
| Contract/Project | 2 | 40,756 distinct contracts, 46,895 distinct projects |
| Caliber | 1 | 2 calibers (A, B) |
| Currency | 1 | 3 currencies (CNY, EUR, USD) |
RubikBench contains the financial data of APEX, an imaginary international automobile manufacturing and sales company that operates mainly in China, the United States, and Europe. The database is therefore bilingual, with both English and Chinese values, and uses three currencies: CNY, USD, and EUR.
Specifically, there are 6 key dimensions: Period (time, monthly), Product, Region, Customer, Dealer, and Report Item (revenues and expenses). Also, there are extra dimensions including: Contract, Project, Currency, and Caliber.
The minimal granularity of the data is a payment of a project. Each project happens between APEX and a customer at a specific region, with an optional dealer, over a period of time. Payments of a project can be distributed over multiple months in that time period. Each project may contain multiple products. Each contract may contain multiple projects.
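The contract → project → payment hierarchy described above can be sketched in code. This is a purely illustrative model; the class and field names below are hypothetical and are NOT part of the RubikBench schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Project:
    """A project between APEX and one customer, at a region, with an optional dealer."""
    customer: str
    region: str
    dealer: Optional[str]        # dealer is optional ("Direct Sales" otherwise)
    products: list               # a project may contain multiple products
    payments: dict               # period "YYYYMM" -> payment amount (the minimal granularity)

@dataclass
class Contract:
    """A contract may contain multiple projects."""
    contract_id: str
    projects: list = field(default_factory=list)

contract = Contract("C001", [
    Project("Acme Corp", "China", None, ["Sedan-X", "SUV-Y"],
            {"202501": 1000.0, "202502": 500.0}),
])
# Summing payments across projects mimics aggregating the (hidden) fact table:
total = sum(amt for p in contract.projects for amt in p.payments.values())
```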
RubikBench contains 20 tables in 4 major categories, all of which are aggregated views over the fact table (which is not exposed directly):
- The `INCOME` tables, which contain data aggregated over project and dealer, and include only revenues. The `INCOME` tables are the smallest tables in RubikBench, aiming for quick analytical queries.
- The `BUDGET_AND_FORECAST` tables, which contain data aggregated over project, customer, and dealer. These tables contain only amt/forecast/budget/target values. Notice that the semantics of these values can be counter-intuitive: while the target value is the target for monthly revenues and expenses in `YYYYMM`, as one would expect, the forecast value of `YYYYMM` is the forecast of `YYYY12` (yearly) based on the information available at the end of `YYYYMM`, and the budget of `YYYYMM` is a constant value indicating the yearly budget, duplicated for each month in the year.
- The `PROFIT_AND_LOSS` table, which contains data aggregated only over project. It covers the most comprehensive dimensions and measures, including both revenues and expenses. It is the largest table in RubikBench, aiming to support detailed financial analysis.
- The `SALES_LEDGER` table, which contains the lowest-granularity data, i.e., payment-level data. It is designed to support detailed audit and traceability of sales. However, it is limited to sales-related revenues and expenses only.
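The counter-intuitive target/forecast/budget semantics can be pinned down with a small helper. This is a hypothetical sketch; `value_refers_to` is not part of the rubikbench package:

```python
def value_refers_to(period: str, kind: str) -> str:
    """Return the period that a BUDGET_AND_FORECAST value stored at `period` describes.

    - target:   the monthly target for that very month
    - forecast: a full-year forecast (YYYY12) made with information up to `period`
    - budget:   the yearly budget, duplicated into every month of the year
    """
    year = period[:4]
    if kind == "target":
        return period            # monthly value for YYYYMM itself
    if kind in ("forecast", "budget"):
        return year + "12"       # both describe the full year YYYY
    raise ValueError(f"unknown kind: {kind}")
```

So a row with period 202506 holds the June target, but its forecast and budget columns both describe the year 2025 as a whole.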
The products of APEX are divided into 3 major divisions: Automobiles, Accessories, and Services.
- Automobiles are produced by enterprise brand groups, which are 6 sub-brands under APEX. Each brand group has its own `INCOME` and `BUDGET_AND_FORECAST` tables.
- Accessories and Services each have their own `INCOME` and `BUDGET_AND_FORECAST` tables. Accessories have only equipment revenues and costs, while Services have only service revenues and costs.
The default Caliber is code A, which clearly separates equipment and service report items, reflecting the real financial statistics. However, to facilitate cooperation between different divisions, there is also Caliber B, which moves 5% of service revenue to equipment revenue.
Notice that due to historical reasons as well as query efficiency expectations, currencies and calibers are organized differently across different tables. For example, for the INCOME and PROFIT_AND_LOSS tables, different currencies and calibers are stored in different columns as column name suffixes (e.g., _cny_a, _usd_a, _eur_b, etc.); while for the SALES_LEDGER and BUDGET_AND_FORECAST tables, caliber and currency are separate columns that MUST be filtered on in the query predicates to avoid duplicated results.
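As a concrete sketch of the Caliber B adjustment, the arithmetic below uses illustrative values only (the actual data lives in suffixed columns like `_cny_a` / `_cny_b`, or behind a caliber filter column, as described above):

```python
# Caliber A separates equipment and service revenue cleanly.
equipment_rev_a = 800.0
service_rev_a = 200.0

# Caliber B moves 5% of service revenue into equipment revenue.
shift = 0.05 * service_rev_a
equipment_rev_b = equipment_rev_a + shift
service_rev_b = service_rev_a - shift
```

The total revenue is unchanged between calibers; only the split between equipment and service differs.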
Financial amounts present both ptd (monthly) values and ytd (year-to-date) values. For example, the ptd of YYYYMM means the amount for that month, while the ytd of YYYYMM means the cumulative amount from YYYY01 to YYYYMM (inclusive). Furthermore, _py columns contain previous-year data: for example, ytd _py columns with period='YYYYMM' contain cumulative amounts from month 01 through month MM of year YYYY-1.
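The ptd/ytd/_py relationships can be made explicit with two small helpers (hypothetical illustrations, not part of the rubikbench package):

```python
def ytd_from_ptd(ptd_by_period: dict, period: str) -> float:
    """Cumulative (year-to-date) amount from YYYY01 through YYYYMM inclusive."""
    year, month = period[:4], int(period[4:])
    return sum(ptd_by_period.get(f"{year}{m:02d}", 0.0) for m in range(1, month + 1))

def py_period(period: str) -> str:
    """Same month of the previous year, as referenced by the _py columns."""
    return f"{int(period[:4]) - 1}{period[4:]}"

ptd = {"202501": 10.0, "202502": 20.0, "202503": 30.0}
```

For instance, the ytd value at period 202503 is the sum of the ptd values for 202501 through 202503, and the ytd _py value at 202506 covers 202401 through 202406.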
RubikBench v0.9.2 contains 3198 fully human-verified queries covering diverse financial analysis scenarios, with comprehensive annotations including difficulty levels, query tags, and user context profiles. The queries are written in both English (1,689) and Chinese (1,509), reflecting the bilingual nature of the database.
| Difficulty | Count | % | Avg. SQL Length | Min SQL Length | Max SQL Length |
|---|---|---|---|---|---|
| Simple | 1,047 | 32.7% | 218 chars | 11 chars | 1,240 chars |
| Moderate | 1,531 | 47.9% | 641 chars | 213 chars | 3,492 chars |
| Challenging | 415 | 13.0% | 1,755 chars | 339 chars | 3,635 chars |
| Nightmare | 205 | 6.4% | 1,881 chars | 579 chars | 4,183 chars |
Queries span all 4 major table categories (a query may reference multiple tables):
| Table Category | Queries | Individual Tables |
|---|---|---|
| INCOME_* | 1,615 | INCOME_ALL (581), INCOME_WEYMOTCOR (153), INCOME_SHIMOTCOR (142), INCOME_TYRAUTGRO (132), INCOME_ACCESSORY (130), INCOME_NOVDYN (122), INCOME_SERVICE (121), INCOME_GALMOTGRO (120), INCOME_CYBVEHSYS (118) |
| BUDGET_AND_FORECAST_* | 724 | BUDGET_AND_FORECAST_ALL (124), BUDGET_AND_FORECAST_WEYMOTCOR (97), BUDGET_AND_FORECAST_TYRAUTGRO (85), BUDGET_AND_FORECAST_SERVICE (81), BUDGET_AND_FORECAST_SHIMOTCOR (79), BUDGET_AND_FORECAST_NOVDYN (77), BUDGET_AND_FORECAST_ACCESSORY (72), BUDGET_AND_FORECAST_CYBVEHSYS (54), BUDGET_AND_FORECAST_GALMOTGRO (53) |
| PROFIT_AND_LOSS | 408 | — |
| SALES_LEDGER | 367 | — |
| SQL Feature | Queries | % |
|---|---|---|
| `GROUP BY` | 2,296 | 71.8% |
| `ORDER BY` | 2,139 | 66.9% |
| `CASE WHEN` | 1,409 | 44.1% |
| `COALESCE` | 876 | 27.4% |
| Subquery | 768 | 24.0% |
| Window Function (`OVER`) | 746 | 23.3% |
| `LIMIT` | 635 | 19.9% |
| CTE (`WITH`) | 438 | 13.7% |
| `HAVING` | 334 | 10.4% |
| `DISTINCT` | 260 | 8.1% |
| `JOIN` | 94 | 2.9% |
| `CAST` | 86 | 2.7% |
| `UNION` | 28 | 0.9% |
Each query includes a simulated user profile with an occupation field, representing different roles that would typically interact with financial data:
| Occupation | Count | % | Description |
|---|---|---|---|
| Sales | 707 | 22.1% | Sales representatives and managers |
| Finance | 675 | 21.1% | Financial analysts and accountants |
| Management | 559 | 17.5% | Executive and strategic decision-makers |
| Guest | 363 | 11.4% | External users with limited access |
| Developer | 356 | 11.1% | Technical users and system developers |
| Unspecified | 538 | 16.8% | No specific occupation context |
Each query in RubikBench.json follows this structure:
{
"benchmark": "RubikBench",
"database": "RubikBench",
"id": "Q00001",
"question": "Show revenue for China, Q2 2025.",
"context": {
"query_time": 202509,
"user_profile": {
"occupation": null,
"caliber": "A",
"currency": null,
"region": {},
"department": {},
"preferences": []
}
},
"schema": null,
"dialect": "duckdb",
"sql": "SELECT ... FROM ... WHERE ...",
"sql.1": "SELECT ... FROM ... WHERE ...",
...,
"metadata": {
"difficulty": "simple",
"query_tags": [
"level-simple",
"lang-english",
"type-basic",
"value-amount",
"period-quarterly"
],
"order-relevant": false,
"verified": true
}
}

Explanation of fields:
- benchmark: Name of the benchmark (e.g., `RubikBench`)
- database: Name of the database (e.g., `RubikBench`)
- id: Unique query identifier
- question: Natural language question
- context: Contextual information (e.g., query time, user profile)
- schema: Optional schema information (null if not provided). If provided, the SQL output should use the table/column names as given in the schema.
- dialect: SQL dialect of the ground-truth SQL (currently only `duckdb` is supported)
- sql, sql.1, ...: One or more ground-truth SQL queries (all acceptable; the maximum matching score across variants is used during evaluation)
- metadata: Additional metadata including:
  - difficulty: Difficulty level (`simple`, `moderate`, `challenging`, `nightmare`)
  - query_tags: List of tags describing query characteristics; see `query_tags.yaml` for all possible tags. Notice that the query tags are annotated with the aid of LLMs and may contain errors. Contributions to improve the annotations are welcome.
  - order-relevant: Whether the query result is order-relevant (true/false, or null if unlabeled, in which case the evaluator auto-detects based on the `ORDER BY` clause in the standard ground-truth SQL)
  - verified: Whether the ground-truth SQL has been human-verified; always true for RubikBench official queries. Contributions adding more verified queries, alternative SQLs, or corrections to existing queries are welcome.
During NL2SQL evaluation, only database, question, context, schema, and dialect should be used as input to the NL2SQL system. The benchmark, id, ground-truth SQL(s) and metadata are only used for evaluation and analysis.
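This input/evaluation split can be enforced with a small helper. The sketch below is hypothetical; `nl2sql_input` is not part of the rubikbench package:

```python
# Fields an NL2SQL system is allowed to see, per the evaluation protocol above.
NL2SQL_INPUT_FIELDS = ("database", "question", "context", "schema", "dialect")

def nl2sql_input(query: dict) -> dict:
    """Strip a query record down to the fields usable as NL2SQL input."""
    return {k: query[k] for k in NL2SQL_INPUT_FIELDS if k in query}

query = {
    "benchmark": "RubikBench", "database": "RubikBench", "id": "Q00001",
    "question": "Show revenue for China, Q2 2025.",
    "context": {"query_time": 202509},
    "schema": None, "dialect": "duckdb",
    "sql": "SELECT ...", "metadata": {"difficulty": "simple"},
}
visible = nl2sql_input(query)
```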
Submit predictions as a JSON file mapping query IDs to SQL strings:
{
"Q00001": "SELECT SUM(...) FROM INCOME_ALL WHERE ...",
"Q00002": "SELECT ... FROM ..."
}

from rubikbench.queries import QuerySet
queries = QuerySet("./data/RubikBench/queries/RubikBench.json")
# Filter by difficulty
simple = queries.filter(difficulty="simple")
moderate = queries.filter(difficulty=["moderate", "challenging"])
# Filter by specific IDs
subset = queries.filter(ids=["Q00001", "Q00002", "Q00003"])
# Filter by tags
monthly = queries.filter(tags=["period-monthly"])
# Combine filters
subset = queries.filter(difficulty="simple", verified_only=True)
# Stable random sampling:
# Randomly sample n queries.
# We use AgentHeaven's hash-based sampling for reproducibility. This means that for a fixed seed,
# if you sample n queries and then sample m > n queries, the n sampled queries will
# always be included in the m sampled queries. This is also robust with respect to changes
# in the query set (e.g., adding new queries will not drastically change the sampled queries).
# To run sampling with filter conditions, first filter the queries, then manually call:
# `ahvn.utils.basic.rnd_utils.StableRNG(seed=seed).hash_sample(filtered_queries_subset, k=n_samples)`
sample = queries.sample(n=10, seed=42)
for query in sample:
print(query["id"], query["question"])
print()
sample = queries.sample(n=20, seed=42)
for query in sample:
print(query["id"], query["question"])
print()

RubikBench computes 5 metric variants for each query (unordered SF is deprecated):
| Metric | Description |
|---|---|
| EX | Exact Match - binary (0 or 1) |
| BF | Bipartite F-beta - continuous score [0, 1] |
| SF | Soft F-beta (ordered only) - continuous score [0, 1] |
Each metric has Ordered (o) and Unordered (u) variants where applicable:
- Ordered (`o`): Respects row order (only for queries that are order-relevant)
- Unordered (`u`): Ignores row order (multiset comparison). Note: unordered SF is not supported, as it is not well-defined.
The evaluation report shows:
- N: Total number of queries evaluated
- C: Number of compilable SQLs (no execution error)
- compilable: C/N rate
- scores: Success rate (queries with score ≥ 99.99%) / N
Notice that all the ground-truth results in the built-in benchmarks are guaranteed to have at most 1000 rows. If a prediction or ground truth result exceeds 1000 rows, a warning will be printed and evaluation will proceed (but may be slow or out-of-memory, and BF falls back to SF for slow cases). For custom datasets, ground truth queries returning more than 1000 rows are discouraged due to evaluation efficiency — consider adding LIMIT clauses where appropriate.
Though we attempt to imitate BIRD's evaluation, the original implementation of BIRD's EX and SF metrics is buggy: it deduplicated rows by treating a table as a set of rows. We corrected this behavior by treating a table as a multiset of rows instead (--no-dedup in CLI), which applies to all evaluation metrics, including EX, SF, and BF, and can lead to a lower evaluation score in certain special cases. Nevertheless, we provide the --dedup CLI argument to mimic BIRD's buggy behavior if needed.
Binary 0/1 metric checking if predicted results exactly match ground truth.
The unordered EX score behaves similarly to BIRD-SQL EX Accuracy, which only compares whether the two sets of rows are identical (each row is considered an ordered tuple, with column names ignored). As RubikBench mainly involves numerical computation, all floating-point numbers are rounded to 3 decimal places before comparison, to reduce floating-point precision issues.
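A minimal sketch of the rounding and set-vs-multiset behaviour described above (this simplified `ex_unordered` is hypothetical; the real `ex_match` in rubikbench.metrics has more options):

```python
from collections import Counter

def _norm(row):
    # Round floats to 3 decimal places to sidestep precision noise.
    return tuple(round(v, 3) if isinstance(v, float) else v for v in row)

def ex_unordered(pred, gold, dedup=True):
    """Binary exact match on rows, ignoring row order."""
    p, g = [_norm(r) for r in pred], [_norm(r) for r in gold]
    if dedup:
        return int(set(p) == set(g))        # BIRD-style set comparison
    return int(Counter(p) == Counter(g))    # corrected multiset comparison

# Duplicated rows are exactly where the two behaviours diverge:
pred = [(1.23456,), (1.23456,)]
gold = [(1.2351,)]
```

After rounding, both tables contain only the row `(1.235,)`, so the set comparison accepts the prediction while the multiset comparison rejects the duplicated row.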
from rubikbench.metrics import ex_match
# Unordered comparison (default, with dedup)
score = ex_match(pd_rows, gt_rows) # Returns 0 or 1
# Ordered comparison (respects row order)
score = ex_match(pd_rows, gt_rows, ordered=True)
# Strict evaluation: preserve duplicates
score = ex_match(pd_rows, gt_rows, dedup=False)
# Mimic BIRD's buggy behavior for unordered comparison (legacy)
score = ex_match(pd_rows, gt_rows, ordered=False, fixed=False, dedup=False)

- ordered=False (default): Multiset comparison; ignores row order
- ordered=True: Row-by-row positional comparison
- dedup=True (default): Drop duplicate rows before comparison
- dedup=False: Preserve duplicates for strict evaluation
The soft F-beta score is a more lenient metric that gives partial credit for partially correct rows. Only the ordered variant is supported.
The ordered SF (β=1.0) score behaves similarly to the BIRD-SQL Mini-Dev Soft F1-Score, which computes cell-level precision and recall between the predicted and ground-truth tables, where a cell is only allowed to match cells in the same row. Again, all floating-point numbers are rounded to 3 decimal places before comparison, to reduce floating-point precision issues.
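The cell-level matching idea can be sketched as follows (a simplified, hypothetical `soft_f1_ordered`; the real `soft_fbeta_score` supports arbitrary β and dedup options):

```python
def soft_f1_ordered(pred, gold):
    """Simplified ordered soft-F1: cells may only match within the same row index."""
    matched = 0
    for p_row, g_row in zip(pred, gold):
        g_cells = list(g_row)
        for cell in p_row:
            c = round(cell, 3) if isinstance(cell, float) else cell
            for i, g in enumerate(g_cells):
                gv = round(g, 3) if isinstance(g, float) else g
                if c == gv:
                    del g_cells[i]       # each gold cell matches at most once
                    matched += 1
                    break
    n_pred = sum(len(r) for r in pred)
    n_gold = sum(len(r) for r in gold)
    if matched == 0:
        return 0.0
    precision, recall = matched / n_pred, matched / n_gold
    return 2 * precision * recall / (precision + recall)
```

With one wrong cell out of four, e.g. `pred = [(1, "a"), (2, "b")]` against `gold = [(1, "a"), (3, "b")]`, three cells match and the score is 0.75: partial credit instead of the 0 that EX would give.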
from rubikbench.metrics import soft_fbeta_score
# Ordered matching (only supported mode)
score = soft_fbeta_score(pd_rows, gt_rows, beta=1.0, ordered=True)
# Strict evaluation: preserve duplicates
score = soft_fbeta_score(pd_rows, gt_rows, beta=1.0, ordered=True, dedup=False)

- β=1.0 (default): Balanced precision/recall
- Partial credit for partially correct rows
- ordered=True: Match by position; unordered calls raise an error
- dedup=True (default): Drop duplicate rows before scoring
- dedup=False: Preserve duplicates for strict evaluation
The Bipartite F-beta score (BF) is the recommended primary evaluation metric of RubikBench.
This metric is motivated by practical NL2SQL applications where:
- Recall matters more than precision (retrieving all required information is more important than avoiding minor extra content)
- Formatting variations are acceptable (e.g., different column aliases, additional rank columns in sorted results)
- Row order should be conditionally enforced (only when the ground truth contains `ORDER BY` or the query is labelled as order-relevant)
BF is an even more flexible version of SF that uses optimal bipartite matching to align predicted and ground-truth rows instead of computing cell-level precision/recall. This allows BF to better handle cases with extra/missing rows and columns, as well as unordered results.
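To make the bipartite-matching idea concrete, here is a toy sketch using a brute-force search over row assignments (the real implementation uses the Hungarian algorithm and DP, as described below; `row_sim` and `bf_beta` here are hypothetical and only practical for tiny tables):

```python
from itertools import permutations

def row_sim(p_row, g_row):
    """Toy row-similarity weight: fraction of positionally matching cells."""
    if not p_row or not g_row:
        return 0.0
    hits = sum(1 for a, b in zip(p_row, g_row) if a == b)
    return hits / max(len(p_row), len(g_row))

def bf_beta(pred, gold, beta=2.0):
    """Toy unordered BF: find the row assignment maximizing total similarity."""
    if not pred or not gold:
        return 0.0
    small, large = (pred, gold) if len(pred) <= len(gold) else (gold, pred)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        w = sum(row_sim(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, w)
    precision, recall = best / len(pred), best / len(gold)
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

A fully correct but reordered result scores 1.0, while a prediction missing one of two gold rows gets partial credit that β=2.0 weights toward recall.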
from rubikbench.metrics import bfbeta_score
# Auto-detect order mode from SQL
score = bfbeta_score(pd_rows, gt_rows, beta=2.0)
# Explicit order mode
score = bfbeta_score(pd_rows, gt_rows, beta=2.0, ordered=True)
# Detect order mode from ground truth SQL
gt_sql = "SELECT ... ORDER BY ..."
score = bfbeta_score(pd_rows, gt_rows, sql=gt_sql, beta=2.0, ordered=None)
# Strict evaluation: preserve duplicates
score = bfbeta_score(pd_rows, gt_rows, beta=2.0, dedup=False)

Parameters:
- β=2.0 (default): Favors recall over precision (use β=1.0 for balanced precision/recall)
- ordered=True: Uses non-intersecting DP matching (for `ORDER BY` queries)
- ordered=False: Uses the Hungarian algorithm for bipartite matching (unordered queries)
- ordered=None (default): Auto-detect based on the `ORDER BY` clause in the ground-truth SQL
- dedup=True (default): Drop duplicate rows before scoring
- dedup=False: Preserve duplicates for strict evaluation
Mathematical Formulation
Given predicted result rows $P = \{p_1, \dots, p_m\}$ and ground-truth rows $G = \{g_1, \dots, g_n\}$, a similarity weight $w_{ij} \in [0, 1]$ is computed for each row pair $(p_i, g_j)$.

This yields a weighted bipartite graph between predicted and ground-truth rows, on which a maximum-weight matching of total weight $W$ is computed:

- Unordered queries ($Q_{\text{unordered}}$): Weighted Bipartite Matching ($WBM$) via the Hungarian algorithm
- Ordered queries ($Q_{\text{ordered}}$): Non-Intersecting WBM ($WBM_{NI}$) via dynamic programming

Finally, with per-query precision $\mathrm{P}_q = W / m$ and recall $\mathrm{R}_q = W / n$, the overall Bipartite F-beta score across all queries $Q$ is computed as:

$$\mathrm{BF}_\beta = \frac{1}{|Q|} \sum_{q \in Q} \frac{(1 + \beta^2)\,\mathrm{P}_q \mathrm{R}_q}{\beta^2\,\mathrm{P}_q + \mathrm{R}_q}$$
Queries can have multiple valid SQLs: e.g., when asking for two different values, should they be organized as rows or as columns? When asking for quarterly trends, should the periods be named 202503, 202506, 202509, 202512 or Q1, Q2, Q3, Q4? For such cases, RubikBench allows multiple ground-truth SQLs per query, labelled sql, sql.1, sql.2, etc. Typically these SQLs produce different outputs, though sometimes different SQLs share the same output.
When evaluating, the submission SQL is compared against all ground-truth SQL variants, and the best score is taken as the final score for that query.
{
"id": "Q00001",
"sql": "SELECT SUM(ytd_amt) FROM SALES_LEDGER WHERE sales_region_name_en = 'China'",
"sql.1": "SELECT sum(ytd_amt) AS 'YTD Amount' FROM SALES_LEDGER WHERE sales_region_name_zh = '中国'",
"sql.2": "SELECT SUM(ytd_amt) FROM SALES_LEDGER WHERE overseas_flag_name_en = 'Domestic'"
}

The evaluator takes the maximum score across all SQL variants for a query with multiple ground truths.
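Taking the best score across variants can be sketched as follows (`best_variant_score` and the toy string-match `score_fn` are hypothetical stand-ins for the execution-based scoring the evaluator actually performs):

```python
def best_variant_score(pred_sql, gt_variants, score_fn):
    """Score a prediction against every ground-truth variant; keep the best."""
    return max(score_fn(pred_sql, gt) for gt in gt_variants)

# Toy score_fn: 1.0 on exact string match, else 0.0.
score = best_variant_score(
    "SELECT 1",
    ["SELECT 2", "SELECT 1", "SELECT 3"],
    lambda p, g: 1.0 if p == g else 0.0,
)
```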
Warning: Though multiple SQL ground truths could exist, there may still be many valid SQLs not covered in the ground truth. Therefore, we recommend using the Bipartite F-beta (BF) metric as the primary evaluation metric, as it is more tolerant to minor variations in result formatting compared to Exact Match (EX) or Soft F-beta (SF).
RubikBench provides a comprehensive CLI for evaluation:
# Show all commands
rubikbench --help
# Show command-specific help
rubikbench eval --help
rubikbench exec --help
rubikbench template --help
rubikbench info --help
rubikbench setup --help

# Download RubikBench from HuggingFace (default)
rubikbench setup
# Download BirdSQL MINIDEV from Google Drive
rubikbench setup -B BirdSQL
# Download KaggleDBQA from Google Drive
rubikbench setup -B KaggleDBQA
# Download to custom location
rubikbench setup --data <your_data_path>
# Force re-download even if files exist
rubikbench setup --force
# Remove archive after extraction
rubikbench setup --remove-zip

Default paths (RubikBench):
- Data directory: `./data/RubikBench/`
- Queries file: `./data/RubikBench/queries/RubikBench.json`
- Database file: `./data/RubikBench/databases/RubikBench.duckdb`

Default paths (BirdSQL):
- Data directory: `./data/BirdSQL/`
- Queries file: `./data/BirdSQL/queries/BirdSQL.json`
- Databases directory: `./data/BirdSQL/databases/*.sqlite`

Default paths (KaggleDBQA):
- Data directory: `./data/KaggleDBQA/`
- Queries file: `./data/KaggleDBQA/queries/KaggleDBQA.json`
- Databases directory: `./data/KaggleDBQA/databases/*.sqlite`
# Test default RubikBench database
rubikbench test
# Test any database file (provider auto-detected from extension)
rubikbench test -db ./data/BirdSQL/databases/financial.sqlite

# Show overall statistics (uses default queries path)
rubikbench info
# Show statistics for custom queries file
rubikbench info -q ./data/RubikBench/queries/RubikBench.json

# Generate template for all queries (uses default paths)
rubikbench template
# Generate to custom output file
rubikbench template -out my_submission.json

# Only simple queries
rubikbench template --split simple -out simple_only.json
# Filter by tags (any tag matches)
rubikbench template --tags lang-english -out english_only.json
# Multiple tags
rubikbench template -t lang-english -t type-basic -out filtered.json
# Specific queries
rubikbench template --ids Q00001 --ids Q00002 -out subset.json
# Combine filters (multiple difficulty levels and tags)
rubikbench template -s simple -s moderate -t lang-english -out combined.json
# Use custom queries file
rubikbench template -q ./data/RubikBench/queries/RubikBench.json --split challenging
# Specify a custom SQL placeholder (default is empty string)
rubikbench template --split simple --sql "SELECT 1+1;" -out template_with_default.json
# Combine all features
rubikbench template -q ./data/RubikBench/queries/RubikBench.json -t lang-english -s simple -out submission.json --sql "SELECT * FROM INCOME_ALL LIMIT 1;"

# Evaluate with default paths (ordered mode, no progress)
rubikbench eval submission.json
# Evaluate with verbose output (progress bar)
rubikbench eval submission.json --verbose

# Evaluate only specific queries
rubikbench eval submission.json --ids Q00001 --ids Q00002 --ids Q00003
# Evaluate by difficulty (using --split)
rubikbench eval submission.json --split simple
# Multiple difficulty levels
rubikbench eval submission.json -s simple -s moderate
# Use custom queries or database paths
rubikbench eval submission.json -q ./data/RubikBench/queries/RubikBench.json -db ./data/RubikBench/databases/RubikBench.duckdb

# Use unordered evaluation
rubikbench eval submission.json --unordered
# Custom beta parameters (using new abbreviations)
rubikbench eval submission.json --bf-beta 3.0 --sf-beta 0.5
rubikbench eval submission.json -bfb 3.0 -sfb 0.5
# Strict evaluation: preserve duplicate rows
rubikbench eval submission.json --no-dedup
# Specify SQL dialect (auto-convert to DuckDB)
rubikbench eval submission.json --dialect postgres
# Save detailed results
rubikbench eval submission.json --output-file detailed_results.json
# Enable verbose output (progress bar)
rubikbench eval submission.json --verbose

CLI Option Abbreviations:
| Full Option | Abbreviation | Description |
|---|---|---|
| `--input-file` | `-in` | Submission file path (alternative to positional argument) |
| `--queries` | `-q` | Queries file path |
| `--database` | `-db` | Database file path |
| `--output-file` | `-out` | Output JSON path |
| `--ids` | `-i` | Specific query IDs |
| `--split` | `-s` | Filter by difficulty |
| `--bf-beta` | `-bfb` | Beta for BF score |
| `--sf-beta` | `-sfb` | Beta for SF score |
| `--ordered` | `-o` | Use ordered evaluation (default) |
| `--unordered` | `-u` | Use unordered evaluation |
| `--dedup` | | Drop duplicate rows before scoring (default) |
| `--no-dedup` | | Preserve duplicates for strict evaluation |
| `--verbose` | `-v` | Enable progress bar |
Evaluate and debug individual queries:
# Evaluate with default paths (ordered mode)
rubikbench exec Q00001 \
"SELECT SUM(ptd_amt_a_cny) FROM INCOME_ALL WHERE period = '202506'"
# Use unordered evaluation
rubikbench exec Q00001 \
"SELECT SUM(ptd_amt_a_cny) FROM INCOME_ALL WHERE period = '202506'" --unordered
# Use custom database path
rubikbench exec Q00001 \
"SELECT SUM(ptd_amt_a_cny) FROM INCOME_ALL WHERE period = '202506'" \
-db /path/to/RubikBench.duckdb
# Execute the ground truth SQL and display the question only
rubikbench exec Q00001
# Execute a SQL only
rubikbench exec "SELECT COUNT(*) AS cnt FROM SALES_LEDGER;"

For convenience, RubikBench integrates SQLGlot to convert SQL queries from other dialects to DuckDB.
# CLI: Specify source dialect
rubikbench eval submission.json -q ./data/RubikBench/queries/RubikBench.json --dialect postgres
# Python API
evaluator = RubikBenchEvaluator(
db=db,
queries=queries,
src="oracle" # Your SQL dialect
)Or, use the conversion function directly:
from rubikbench.dialect import convert_sql
# Convert Oracle SQL to DuckDB
duckdb_sql = convert_sql(
"SELECT * FROM users WHERE ROWNUM <= 10",
source_dialect="oracle",
target_dialect="duckdb" # default
)
# Result: "SELECT * FROM users LIMIT 10"

For SQL formatting and pretty-printing:
from ahvn.utils.db import prettify_sql
# Prettify a SQL query for better readability
sql = "/* Comments */ SELECT name, COUNT(*) as cnt FROM users WHERE age>18 GROUP BY name ORDER BY cnt DESC"
pretty_sql = prettify_sql(sql, dialect="sqlite", comments=False)
print(pretty_sql)
# Result:
# SELECT
#   "name",
#   COUNT(*) AS "cnt"
# FROM "users"
# WHERE
#   "age" > 18
# GROUP BY
#   "name"
# ORDER BY
#   "cnt" DESC

Parameters:
- query (str): The SQL query to prettify
- dialect (str): The SQL dialect to use (default: "sqlite")
- comments (bool): Whether to keep comments in the output (default: True)
The prettify_sql function uses SQLGlot to format queries with proper indentation, keyword casing, and structural organization. If formatting fails, it returns the original query stripped of whitespace.
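The fall-back-on-failure behavior can be seen in a minimal sketch. This assumes SQLGlot's `transpile` API and is not the actual `prettify_sql` implementation, which may differ in details:

```python
def safe_prettify(sql: str, dialect: str = "sqlite") -> str:
    """Format SQL with SQLGlot; fall back to the stripped original on failure."""
    try:
        import sqlglot
        # transpile() returns a list of formatted statements; take the first.
        return sqlglot.transpile(sql, read=dialect, pretty=True)[0]
    except Exception:
        # Formatting failed (or sqlglot is unavailable): return the input
        # stripped of surrounding whitespace.
        return sql.strip()

print(safe_prettify("   SELECT 1   "))
```

The try/except keeps formatting strictly cosmetic: a query that SQLGlot cannot parse is still passed through unchanged rather than causing an error.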
Warning: SQLGlot may not perfectly convert all SQL constructs between dialects, especially for complex queries. Always verify the converted SQL against the target database to ensure correctness.
RubikBench supports a pluggable benchmark architecture. Each benchmark lives under src/rubikbench/benchmarks/ and exposes a common API so the CLI can treat all benchmarks identically.
| Benchmark | Database | Source | Setup Command |
|---|---|---|---|
| RubikBench | DuckDB | GitHub, HuggingFace | rubikbench setup |
| BirdSQL (MINIDEV) | SQLite | GitHub, Google Drive, Corrections | rubikbench setup -B BirdSQL |
| KaggleDBQA | SQLite | GitHub, Google Drive | rubikbench setup -B KaggleDBQA |
All benchmark modules follow the same pattern:
- Module location: `src/rubikbench/benchmarks/<name>.py`
- Constants: Each module defines `DEFAULT_DATA_DIR`, a default queries path, and any source-specific identifiers.
- Unified API: Each module exposes a `setup(data_dir, *, force, remove_zip)` function that handles downloading, extracting, and converting queries.
- Query conversion: External benchmarks convert their native query format into the unified JSON format used by RubikBench (see section 5 for the query schema).
- Registry: `benchmarks/__init__.py` registers all benchmarks in a `BENCHMARKS` dict, so the CLI dispatches purely by name.
After setup, all downstream commands (eval, exec, info, template, browse) work the same way — just point --queries and --database at the appropriate paths.
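The registry pattern described above can be sketched as follows. This is a simplified illustration whose names mirror the description, not the exact source code:

```python
from pathlib import Path

# Each benchmark module contributes a setup() callable with the same signature.
def setup_rubikbench(data_dir: Path, *, force: bool = False, remove_zip: bool = False) -> None:
    print(f"Setting up RubikBench in {data_dir}")

def setup_birdsql(data_dir: Path, *, force: bool = False, remove_zip: bool = False) -> None:
    print(f"Setting up BirdSQL in {data_dir}")

# benchmarks/__init__.py maps benchmark names to their setup functions,
# so the CLI (`rubikbench setup -B <name>`) can dispatch purely by name.
BENCHMARKS = {
    "RubikBench": setup_rubikbench,
    "BirdSQL": setup_birdsql,
}

def cli_setup(name: str, data_dir: Path) -> None:
    if name not in BENCHMARKS:
        raise ValueError(f"Unknown benchmark: {name}")
    BENCHMARKS[name](data_dir)

cli_setup("BirdSQL", Path("./data/BirdSQL"))  # dispatches to setup_birdsql
```

Because every entry shares one signature, adding a benchmark never requires touching the dispatch logic, only the dict.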
# 1. Download and prepare BirdSQL MINIDEV
rubikbench setup -B BirdSQL
# Or use the original BirdSQL queries without corrections
# rubikbench setup -B BirdSQL --birdsql-no-corrections
# 2. Generate a submission template for BirdSQL queries
rubikbench template -q ./data/BirdSQL/queries/BirdSQL.json --dialect sqlite -out submissions/birdsql.json
# 3. (Fill in your predicted SQL in submissions/birdsql.json)
# 4. Evaluate
rubikbench eval submissions/birdsql.json -q ./data/BirdSQL/queries/BirdSQL.json --dialect sqlite --verbose
# To imitate BirdSQL's original evaluation behavior more closely, use `--dedup` (the default) to deduplicate rows before scoring.
# rubikbench eval submissions/birdsql.json -q ./data/BirdSQL/queries/BirdSQL.json --dialect sqlite --dedup --verbose
# Or evaluate against a specific BirdSQL database
# rubikbench eval submissions/birdsql.json \
# -q ./data/BirdSQL/queries/financial.json \
# -db ./data/BirdSQL/databases/financial.sqlite \
# --dialect sqlite \
# --verbose
# Test connectivity to any SQLite database if you want
rubikbench test -db ./data/BirdSQL/databases/financial.sqlite

Note: Certain queries in BIRD MINIDEV return very large result sets, which can make scoring slow; a full evaluation typically takes 5 to 10 minutes.
# 1. Download and prepare KaggleDBQA
rubikbench setup -B KaggleDBQA
# 2. Generate a submission template for KaggleDBQA queries
rubikbench template -q ./data/KaggleDBQA/queries/KaggleDBQA.json --dialect sqlite -out submissions/kaggledbqa.json
# 3. (Fill in your predicted SQL in submissions/kaggledbqa.json)
# 4. Evaluate
rubikbench eval submissions/kaggledbqa.json -q ./data/KaggleDBQA/queries/KaggleDBQA.json --dialect sqlite --verbose

To add support for a new benchmark:
- Create `src/rubikbench/benchmarks/<name>.py` with:
  - `DEFAULT_DATA_DIR = Path("./data/<Name>")`
  - `def setup(data_dir: Path, *, force: bool = False, remove_zip: bool = False) -> None`
- Register it in `src/rubikbench/benchmarks/__init__.py`:
  - Import the `setup` function
  - Add an entry to the `BENCHMARKS` dict
  - Add the display name to `BENCHMARK_NAMES`
- The `setup()` function should:
  - Download the raw data into `data_dir/download/`
  - Extract and organise files into `data_dir/raw/`
  - Convert queries to the unified JSON format in `data_dir/queries/`
- [2026-02] RubikBench v0.9.2:
  - Added 3198 queries and query statistics (schema-grounded queries to be added).
  - Integrated the BirdSQL MINIDEV and KaggleDBQA benchmarks into the RubikBench CLI.
  - Released the RubikBench evaluation toolkit as an open-source Python package: rubikbench.
- [2026-02] Updated the RubikBench v0.9.1 database on HuggingFace: RubikBench.
  - Added support for the `.parquet` format.
  - More realistic business logic; reduced the number of products per project.
  - Diversified between SALES_LEDGER and PROFIT_AND_LOSS.
  - Fixed some geographical region errors.
  - Re-generated all projects and contracts.
  - Expanded to 9 dealers.
  - ytd/ptd fix.
- [2026-01] Initial release of the RubikBench v0.9 database on HuggingFace: RubikBench.
If you find Rubik or RubikBench useful, please cite:
@misc{chen2025rubiksqllifelonglearningagentic,
title={Rubik: Bridging the NL2SQL Research-to-Production Gap via Lifelong Learning Agentic Knowledge Base},
author={Zui Chen and Han Li and Xinhao Zhang and Xiaoyu Chen and Chunyin Dong and Yifeng Wang and Xin Cai and Su Zhang and Ziqi Li and Chi Ding and Jinxu Li and Shuai Wang and Dousheng Zhao and Sanhai Gao and Guangyi Liu},
year={2025},
eprint={2508.17590},
archivePrefix={arXiv},
primaryClass={cs.DB},
url={https://arxiv.org/abs/2508.17590},
}