Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panel of BIRD Annotation Issues. #39

Open
huybery opened this issue Jun 11, 2023 · 30 comments
Open

Panel of BIRD Annotation Issues. #39

huybery opened this issue Jun 11, 2023 · 30 comments
Labels
BIRD help wanted Extra attention is needed

Comments

@huybery
Copy link
Collaborator

huybery commented Jun 11, 2023

Hi all,

Although BIRD has incurred significant annotation costs, we still cannot guarantee that all the data is accurately labeled. We hope that the community can assist us in building BIRD together! You can continuously report any errors you find under this issue, and we will perform a dataset update at a designated time.

Thanks a lot!

Best,
Binyuan

@huybery huybery added BIRD help wanted Extra attention is needed labels Jun 11, 2023
@tshu-w
Copy link

tshu-w commented Jun 27, 2023

Should the db_id movies_4 in train/train.json be movie_4 as there is not movies_4 database in train_databases?

@tshu-w
Copy link

tshu-w commented Jun 27, 2023

There is a typo of this sql in train.json (REL -> REAL)

SELECT CAST(SUM(CASE WHEN T2.DESCRIPTION = 'Allergy to grass pollen' THEN 1 ELSE 0 END) AS REL) * 100 / COUNT(T1.patient) FROM patients AS T1 INNER JOIN allergies AS T2 ON T1.patient = T2.PATIENT WHERE T1.birthplace = 'Pembroke MA US'

@tshu-w
Copy link

tshu-w commented Jun 28, 2023

These two queries also cannot be parsed by sqlglot, but I am not sure if it is the wrong SQL statement or the problem of sqlglot.

SELECT T1.first_name, T1.last_name FROM current AS T1 INNER JOIN `current-terms` AS T2 ON T1.bioguide_id = T2.bioguide WHERE T2.party = 'Republican' AND T1.gender_bio = 'F' AND T2.END > DATE() GROUP BY T1.bioguide_id
SELECT max_temperature_f, date FROM weather WHERE max_temperature_f = (SELECT MAX(max_temperature_f) FROM weather WHERE max_temperature_f IS NOT NULL AND max_temperature_f IS NOT '')

@ronch99
Copy link

ronch99 commented Jun 28, 2023

Incorrect SQLite database for retail_world. It doesn't match the annotated SQL queries, e.g.:

  • Table "Order Details" in SQL queries doesn't match "OrderDetails" (no space) in the SQLite file
  • Table "Territories" and "Region" are missing in the SQLite file

@22842219
Copy link

also I have found there exists corrupted database, for example, coinmarketcap.sqlite

@wbbeyourself
Copy link

wbbeyourself commented Aug 4, 2023

Incorrect evidence in dev set.

在 california_schools/frpm.csv 中,eligible free rate = Free Meal Count / Enrollment,同时 column 【Percent (%) Eligible Free (K-12)】确实是 【Free Meal Count (K-12)】 / 【Enrollment (K-12)】。

但是在 dev set中第一条测试数据就存在错误:

原始数据:

db_id: california_schools
question:
What is the highest eligible free rate for K-12 students in the schools in Alameda County?
evidence:
Eligible free rate for K-12 = FRPM Count (K-12) / Enrollment (K-12)
Gold SQL:
SELECT FRPM Count (K-12) / Enrollment (K-12)
FROM frpm
WHERE County Name = 'Alameda'
ORDER BY (CAST(FRPM Count (K-12) AS REAL) / Enrollment (K-12)) DESC LIMIT 1

两处错误:
(1) evidence 错误,正确的应该为 : Eligible free rate for K-12 = Free Meal Count (K-12) / Enrollment (K-12)
(2) Gold SQL 写得复杂了,除法是多此一举。直接写成如下的即可,无需除法。因为 【Percent (%) Eligible Free (K-12)】就代表了 Eligible free rate for K-12

SELECT MAX(Percent (%) Eligible Free (K-12))
FROM frpm
WHERE County Name = 'Alameda';

或者

SELECT Free Meal Count (K-12) / Enrollment (K-12)
FROM frpm
WHERE County Name = 'Alameda'
ORDER BY (CAST(Free Meal Count (K-12) AS REAL) / Enrollment (K-12)) DESC LIMIT 1

类似的问题可能还有不少。

@accpatrick
Copy link

@tshu-w @wbbeyourself @ronch99 @22842219
Thank you for your interest in our work! We are pleased to announce the release of an updated and cleaner version of the development set. To access the new dev dataset, you can either download it from our website or use the following direct link: https://bird-bench.oss-cn-beijing.aliyuncs.com/dev.zip.

The latest update includes the following improvements:

  1. Thoroughly reviewed and corrected all gold SQL errors.
  2. Revised Question/SQLs with null results for better clarity and less false positives.
  3. Fixed errors in some columns of the database description files.
  4. Completed the contents of two previously blank tables.

As a result of these enhancements, the ChatGPT (gpt-3.5-turbo) and GPT4 (gpt-4-32k) EX scores have improved to 42.24 (from 37.22) and 49.15 (from 46.35), respectively. We will update the leaderboard with all new results shortly.

Please note that the test datasets remain unchanged at this time since we have already devoted much more considerable effort in examining errors, and human evaluation during the construction of test datasets. This can ensure its quality.

BTW, our paper has been accepted by NeurIPS 2023 as a Spotlight. Thanks for your support and suggestions! We will keep optimizing our work, thanks!

@josem7
Copy link

josem7 commented Oct 16, 2023

In train datast I think this example is wrong

{
    "db_id": "movie_platform",
    "question": "What is the name of the longest movie title? When was it released?",
    "evidence": "longest movie title refers to MAX(LENGTH(movie_title)); when it was released refers to movie_release_year;",
    "SQL": "SELECT movie_title, movie_release_year FROM movies ORDER BY LENGTH(movie_popularity) DESC LIMIT 1"
},

Should be

"SQL": "SELECT movie_title, movie_release_year FROM movies ORDER BY LENGTH(movie_title) DESC LIMIT 1"

@josem7
Copy link

josem7 commented Oct 16, 2023

{
    "db_id": "movie_platform",
    "question": "Name the movie with the most ratings.",
    "evidence": "movie with the most rating refers to MAX(SUM(rating_score));",
    "SQL": "SELECT movie_title FROM movies GROUP BY movie_title ORDER BY COUNT(movie_title) DESC LIMIT 1"
},

This example also seems wrong as it should be counting the amount of ratings in the ratings table joined with the movies table (to extract the movie title)

@josem7
Copy link

josem7 commented Oct 21, 2023

question 24 in dev dataset might be missing the evidence

{
    "question_id": 24,
    "db_id": "california_schools",
    "question": "Give the names of the schools with the percent eligible for free meals in K-12 is more than 0.1 and test takers whose test score is greater than or equal to 1500?",
    "evidence": "",
    "SQL": "SELECT T2.`School Name` FROM satscores AS T1 INNER JOIN frpm AS T2 ON T1.cds = T2.CDSCode WHERE CAST(T2.`Free Meal Count (K-12)` AS REAL) / T2.`Enrollment (K-12)` > 0.1 AND T1.NumGE1500 > 0",
    "difficulty": "moderate"
},

@tshu-w
Copy link

tshu-w commented Feb 28, 2024

https://arxiv.org/abs/2402.12243

We found that noise in questions and gold queries are preva- lent in the dataset, with varying amounts across domains, and with an uneven distribution be- tween noise types. The presence of incorrect gold SQL queries, which then generate incor- rect gold answers, has a significant impact on the benchmark’s reliability. Surprisingly, when evaluating models on corrected SQL queries, zero-shot baselines surpassed the performance of state-of-the-art prompting methods. We con- clude that informative noise labels and reliable benchmarks are crucial to developing new Text- to-SQL methods that can handle varying types of noise.

@accpatrick
Copy link

@tshu-w @wbbeyourself @josem7 @ronch99
Thanks for your kind reminder. We are currently starting the third round of reviews for the development set. This time, we also pay more attention to questions, evidence sentences and database description files instead of just SQLs.

Of course, we will thoroughly consider your findings and make the necessary corrections during this review.

After this, we will also review the training set when our expert students and professors are available. We sincerely appreciate your support and suggestions throughout this process.

It is indeed a challenging and time-consuming task to meticulously go through all examples and ensure their quality especially our data is more complex. We aim to release this version around late March or early April. Thanks for your understanding.

@accpatrick
Copy link

@tshu-w, thank you for sharing this insightful paper. We highly value such research that judges the quality of benchmarks. A similar paper from EMNLP (https://arxiv.org/pdf/2310.18538.pdf) also provides a comprehensive examination of Spider, BIRD, and Spider-DK. These papers are crucial for advancing our field.

The paper (https://arxiv.org/abs/2402.12243) presents an important research problem on how to develop robust models to deal with noise in the benchmark. However, some details are unclear, making it difficult for us to fully understand. For instance, 1) the authors mention random sampling of examples and a 49% error rate in the california_schools set. Upon our careful review, we just found 11 out of 89 errors, which we plan to refine. We believe that multiple random samplings, as suggested in this paper (https://arxiv.org/pdf/2305.14210.pdf), would provide more reliable insights due to the large standard deviation in random sampling. Otherwise there may be a biased sampling.

  1. The paper doesn't clarify under what conditions questions are ambiguous, i.e., with or without knowledge evidence. Normally, kg evidence originates from question annotators and is collected by experts to reflect user requirements. For example, in Table 4, example 2, the evidence clearly states "A11 refers to average salary", which is already included in the database description file. One feature of our work is to create a more realistic environment for database agent development, since humans also need column descriptions and data value samples to generate precise SQLs. Therefore, with certain knowledge and files, most questions are answerable. We deem that most humans cannot find out the correct columns about average salary among anonymous columns names (A1, A2, ..., A11...). They are usually provided with database files.

  2. We mentioned in our paper that synonyms are one form of external knowledge that models should map them to identify corresponding schema items. This is a significant challenge in real text-to-SQL models. The ability to access database values and output valid values is also crucial. Actually, Our goal is to shift the focus of text-to-SQL models from only semantic parsing to understanding values. One easy way is to inspect value samples of databases as (https://arxiv.org/pdf/2204.00498.pdf).

  3. The conclusions about GPT 3.5 outperforming other advanced methods after "fixing" via their rules are not very illustrative to us, because GPT models are constantly evolving. It's unclear whether the new GPT 3.5 has memorized our data or if additional reasoning is hard for weaker models 3.5 instead of GPT-4 to understand. We found that GPT-3.5 and Codellama-34B-instruct perform well with clear and less complex prompts in our recent work. However, their performance drops significantly with even additional explanations such as COT or synonyms. Therefore, more models should be involved in experiments. Also, MAC-SQL has the relative improvements on Spider, which contains very few or even no noise in questions. Why its performance is dropped in BIRD after they "fix" these noise.

  4. We are not clear about that the distribution of errors are uneven or even, because most models in the leaderboard maintain the same order across dev and test. And Mac-SQL and DIN-SQL has improvements in both datasets.

  5. Finally, 100 % clear questions without reference to the database environment including docs do not exist in the world, and can be considered a new method to deal with the realistic questions. Real DB AGIs are expected to search for the related information by observation among dirty values, complex shemas, database files and even external tools like web search by themselves to make the user question intents more clear in the real scenarios. That's why RAG and Tool-augmented research is important in AGI.

Even these, we sincerely appreciate their highlighting of the join conditions issue as Example 3 in which we did pay less attention on this (FK-->PK) during checking. We will pay more attention to in this review round.

In conclusion, we have learned a lot from these benchmark judgement papers and understand what we need to focus on next. We welcome more such papers or feedbacks to promote this field together. Also, we appreciate your support and feedback and will continue to refine our work until the real AGIs in DB are born. Thanks.

@nnarodytska
Copy link

nnarodytska commented Mar 15, 2024

Updated (one more inconsistency was found):
"question_id": 99 has a possible incorrect ground truth SQL query :

Q: Among the accounts who have loan validity more than 12 months, list out the accounts that have the highest approved amount and have account opening date in 1993."

A (SQL):
SELECT T1.account_id FROM loan AS T1 INNER JOIN disp AS T2 ON T1.account_id = T2.account_id WHERE STRFTIME('%Y', T1.date) = '1993' AND T1.duration = 12 ORDER BY T1.amount DESC LIMIT 1

I think the join should be on the "account" table rather than  "disp" and T1.duration > 12
Alos, the question asks to filter on "have account opening date". However, SQL query filters on load date: "STRFTIME('%Y', T1.date) = '1993'".

Suggested correction:
SELECT T1.account_id FROM loan AS T1 INNER JOIN account AS T2 ON T1.account_id = T2.account_id WHERE STRFTIME('%Y', T2.date) = '1993' AND T1.duration > 12 ORDER BY T1.amount DESC LIMIT 1

Thanks!

nina


From dev.json

  "question_id": 99,
        "db_id": "financial",
        "question": "Among the accounts who have loan validity more than 12 months, list out the accounts that have the highest approved amount and have account opening date in 1993.",
        "evidence": "Loan validity more than 12 months refers to duration > 12",
        "SQL": "SELECT T1.account_id FROM loan AS T1 INNER JOIN disp AS T2 ON T1.account_id = T2.account_id WHERE STRFTIME('%Y', T1.date) = '1993' AND T1.duration = 12 ORDER BY T1.amount DESC LIMIT 1",
        "difficulty": "moderate"

@nnarodytska
Copy link

"question_id": 95 has an incorrect ground truth SQL query:

Q: "List out the account numbers of clients who are youngest and have highest average salary?",

A (SQL): "SELECT T1.account_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T2.district_id = ( SELECT district_id FROM client ORDER BY birth_date DESC LIMIT 1 ) ORDER BY T2.A11 DESC LIMIT 1"

Multiple accounts are linked with the district that has the highest A11
However, SQL query limits it to 1 (ORDER BY T2.A11 DESC LIMIT 1) which is incorrect.

Suggested correction to output all accounts with the highest average salary:

SELECT T1.account_id, MAX(T2.A11) FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T2.district_id = ( SELECT district_id FROM client ORDER BY birth_date DESC LIMIT 1 ) GROUP BY T2.A11, T1.account_id


From dev.json

{
"question_id": 95,
"db_id": "financial",
"question": "List out the account numbers of clients who are youngest and have highest average salary?",
"evidence": "If the person A's birthdate < B's birthdate, it means that person B is younger than person A; A11 refers to average salary",
"SQL": "SELECT T1.account_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T2.district_id = ( SELECT district_id FROM client ORDER BY birth_date DESC LIMIT 1 ) ORDER BY T2.A11 DESC LIMIT 1",
"difficulty": "moderate"
},

Thanks!

nina

@nnarodytska
Copy link

nnarodytska commented Mar 15, 2024

Incorrect description of database "financial".
Issue: A4-A7 are specified as text but their types are integers (as A8's type).
The type makes a difference when we order by any of these columns as text and integer columns are sorted differently.
For example, in "question_id": 115, "db_id": "financial", there is an ordering over A4:
"SELECT CAST(SUM(T1.gender = 'M') AS REAL) * 100 / COUNT(T1.client_id) FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id WHERE T2.A3 = 'south Bohemia' GROUP BY T2.A4 ORDER BY T2.A4 DESC LIMIT 1",

File: llm/data/dev_databases/financial/database_description/district.csv
A4,number of inhabitants ,,text,
A5,no. of municipalities with inhabitants < 499,municipality < district < region,text,
A6,no. of municipalities with inhabitants 500-1999,municipality < district < region,text,
A7,no. of municipalities with inhabitants 2000-9999,municipality < district < region,text,
A8,no. of municipalities with inhabitants > 10000,municipality < district < region,integer,

Suggested corrections:

A4,number of inhabitants ,,integer,
A5,no. of municipalities with inhabitants < 499,municipality < district < region,integer,
A6,no. of municipalities with inhabitants 500-1999,municipality < district < region,integer,
A7,no. of municipalities with inhabitants 2000-9999,municipality < district < region,integer,
A8,no. of municipalities with inhabitants > 10000,municipality < district < region,integer,

@nnarodytska
Copy link

"question_id": 118 has an incorrect ground truth SQL query:

Q: "For loan amount less than USD100,000, what is the percentage of accounts that is still running with no issue.",

E (evidence): "Status = 'C' stands for running contract, ok so far; Percentage of accounts by condition = [(total(amount) & condition) / (total amount)] * 100%.",

A (SQL): 'SELECT CAST(SUM(status = 'C') AS REAL) * 100 / COUNT(amount) FROM loan WHERE amount < 100000',

There are a few issues with SQL:

According to the evidence, we need to divide the total amount of running contract accounts by the total amount (" [(total(amount) & condition) / (total amount)] * 100%."). However, SQL incorrectly

  • counts accounts with running contracts rather than computes the total amount on these accounts ("SUM(status = 'C') ")
  • counts amounts ("COUNT(amount)") instead of computing total amount

Suggested correction (discussion):
It's not clear what the best way to proceed is. The simplest fix would be to correct the SQL. However, I think the actual issue is that the evidence doesn't really correspond to the question. The question asks about the percentage of accounts, but the evidence suggests that we will need to find the percentage of account amounts.


From dev.json

{
    "question_id": 118,
    "db_id": "financial",
    "question": "For loan amount less than USD100,000, what is the percentage of accounts that is still running with no issue.",
    "evidence": "Status = 'C' stands for running contract, ok so far; Percentage of accounts by condition = [(total(amount) & condition) / (total amount)] * 100%.",
    "SQL": "SELECT CAST(SUM(status = 'C') AS REAL) * 100 / COUNT(amount) FROM loan WHERE amount < 100000",
    "difficulty": "moderate"
},

@nnarodytska
Copy link

nnarodytska commented Mar 19, 2024

"question_id": 141 has an incorrect ground truth SQL query:

Q: "Which districts have transactions greater than USS$10,000 in 1997?",
A (SQL): "SELECT T1.district_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T1.account_id = T3.account_id WHERE STRFTIME('%Y', T3.date) = '1997' GROUP BY T1.district_id HAVING SUM(T3.amount) > 10000",

An issue with SQL:

The question asks about all transactions that satisfy conditions. However, SQL computes the total amount of transactions per district (GROUP BY T1.district_id HAVING SUM(T3.amount) > 10000) in the WHERE clause.
Namely, "HAVING SUM(T3.amount)" computes the sum of transaction amounts per district.

Suggested correction :

"SELECT T1.district_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T1.account_id = T3.account_id WHERE STRFTIME('%Y', T3.date) = '1997' and T3.amount > 10000 GROUP BY T1.district_id"


From dev.json

{
    "question_id": 141,
    "db_id": "financial",
    "question": "Which districts have transactions greater than USS$10,000 in 1997?",
    "evidence": "",
    "SQL": "SELECT T1.district_id FROM account AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T1.account_id = T3.account_id WHERE STRFTIME('%Y', T3.date) = '1997' GROUP BY T1.district_id HAVING SUM(T3.amount) > 10000",
    "difficulty": "simple"
},

@nnarodytska
Copy link

"question_id": 143 has an incorrect ground truth SQL query:

Q: "What are the accounts that have both gold and junior credit cards?",
A (SQL): "SELECT T2.account_id FROM card AS T1 INNER JOIN disp AS T2 ON T1.disp_id = T2.disp_id WHERE T1.type IN ('gold', 'junior')",

Issues with SQL:

The question asks about accounts that have both gold and junior credit cards.
However, SQL returns accounts that have at least one of them ( IN ('gold', 'junior')) .

Suggested correction (one of many):

"SELECT T2.account_id FROM disp AS T2 
INNER JOIN card AS T1 ON T1.disp_id = T2.disp_id 
INNER JOIN card AS T3 ON T3.disp_id = T2.disp_id 
WHERE T1.card_id = T3.card_id and T1.type = 'gold' and T3.type = 'junior'"

From dev.json
{
"question_id": 143,
"db_id": "financial",
"question": "What are the accounts that have both gold and junior credit cards?",
"evidence": "",
"SQL": "SELECT T2.account_id FROM card AS T1 INNER JOIN disp AS T2 ON T1.disp_id = T2.disp_id WHERE T1.type IN ('gold', 'junior')",
"difficulty": "simple"
},

@nnarodytska
Copy link

"question_id": 145 has an incorrect ground truth SQL query:

Q: "Who are the account holder identification numbers whose spent per month on the credit card is less than the average, in 1998?"
A (SQL): "SELECT T1.Account_id FROM Trans AS T1 INNER JOIN Account AS T2 ON T1.Account_id = T2.Account_id WHERE DATE_FORMAT(T1.date, '%Y') = '1998' AND T1.operation = 'VYBER KARTOU' AND T1.amount< (SELECT AVG(amount) FROM Trans WHERE DATE_FORMAT(date, '%Y') = '1998')",

Issues with SQL:

The question asks about accounts whose spent per month on the credit card is less than the average in 1998
However, SQL returns accounts with credit cards that have transactions with an amount less than the average in 1998

Suggested correction discussion:
We need to compute "spent per month on the credit card" to answer this question. Can you please confirm that it was intended to return "spent per month" on the credit card? Thanks!


From dev.json

{
    "question_id": 145,
    "db_id": "financial",
    "question": "Who are the account holder identification numbers whose spent per month on the credit card is less than the average, in 1998?",
    "evidence": "Operation = 'VYBER KARTOU' refers to credit card withdrawal",
    "SQL": "SELECT T1.account_id FROM trans AS T1 INNER JOIN account AS T2 ON T1.account_id = T2.account_id WHERE STRFTIME('%Y', T1.date) = '1998' AND T1.operation = 'VYBER KARTOU' AND T1.amount < (SELECT AVG(amount) FROM trans WHERE STRFTIME('%Y', date) = '1998')",
    "difficulty": "moderate"
},

@nnarodytska
Copy link

"question_id": 162 has an incorrect ground truth SQL query:

Q: "What is the region of the client with the id 3541 from?"
A (SQL): "SELECT T2.district_id, T1.A3 FROM district AS T1 INNER JOIN client AS T2 ON T1.district_id = T2.district_id WHERE T2.client_id = 3541",

Issues with SQL:

The question asks about regions only.
SQL provides information about region and district id (which is not part of the question)

Suggested correction :
"SELECT T1.A3 FROM district AS T1 INNER JOIN client AS T2 ON T1.district_id = T2.district_id WHERE T2.client_id = 3541",


From dev.json

{
    "question_id": 162,
    "db_id": "financial",
    "question": "What is the region of the client with the id 3541 from?",
    "evidence": "A3 refers to region",
    "SQL": "SELECT T2.district_id, T1.A3 FROM district AS T1 INNER JOIN client AS T2 ON T1.district_id = T2.district_id WHERE T2.client_id = 3541",
    "difficulty": "simple"
},

@nnarodytska
Copy link

"question_id": 163 has a possible incorrect ground truth SQL query:

Q: "Which district has the most accounts with loan contracts finished with no problems?"
A (SQL): "SELECT T1.district_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN loan AS T3 ON T2.account_id = T3.account_id WHERE T3.status = 'A' GROUP BY T1.district_id ORDER BY COUNT(T2.account_id) DESC LIMIT 1",

Issues with SQL:

The question asks about a district that satisfies a condition ("Which district").
District table has an A2 column which is a district_name.
It is logical to output the district name when we asked "Which district..."
However, SQL outputs district_id.

Suggested correction :
"SELECT T1.A2 FROM District AS T1 INNER JOIN Account AS T2 ON T1.District_id = T2.District_id INNER JOIN Loan AS T3 ON T2.Account_id = T3.Account_id WHERE T3.status = 'A' GROUP BY T1.District_id ORDER BY COUNT(T2.Account_id) DESC LIMIT 1"


From dev.json
{
"question_id": 163,
"db_id": "financial",
"question": "Which district has the most accounts with loan contracts finished with no problems?",
"evidence": "status = 'A' refers to loan contracts finished with no problems",
"SQL": "SELECT T1.district_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN loan AS T3 ON T2.account_id = T3.account_id WHERE T3.status = 'A' GROUP BY T1.district_id ORDER BY COUNT(T2.account_id) DESC LIMIT 1",
"difficulty": "moderate"
},

@nnarodytska
Copy link

"question_id": 180 has an incorrect ground truth SQL query:

Q: "Please provide a list of clients who were born between 1983 and 1987 and whose account branch is in East Bohemia, along with their IDs."
A (SQL):
"SELECT T1.client_id, T3.account_id
FROM
client AS T1
INNER JOIN district AS T2 ON T1.district_id = T2.district_id
INNER JOIN account AS T3 ON T2.district_id = T3.district_id
WHERE T2.A3 = 'east Bohemia' AND STRFTIME('%Y', T1.birth_date) BETWEEN '1983' AND '1987'",

Issues with SQL:
The question asks about clients who have accounts with conditions.
However, SQL joins
a) client with district on district_id to obtain client-to-district mapping
b) with account on district_id. However, it includes accounts that might belong to a different client!

If one runs the ground truth SQL against the database you will get the following pairs (clint_id, account_id)
for client 379, for example:

"(379, 56), (379, 237), (379, 311), (379, 351), (379, 446), (379, 562), (379, 590), (379, 757), (379, 1002), (379, 1158), (379, 1199), (379, 1296), (379, 1406), (379, 1437), (379, 1441), (379, 1501), (379, 1528), (379, 1641), (379, 1645), (379, 1810), (379, 1859), (379, 1902), (379, 2162), (379, 2266), (379, 2334), (379, 2584), (379, 2674), (379, 2926), (379, 3175), (379, 3562), (379, 3616), (379, 3661), (379, 3691), (379, 3699), (379, 3835), (379, 3865), (379, 3890), (379, 3945), (379, 4127), (379, 4240), (379, 4503), (379, 5952), (379, 6461), (379, 7713), (379, 8225), (379, 10857), (379, 10940), (379, 11317)"

However, client 379 has only 1 account "311"., so only one pair is correct and others are incorrect.

Suggested correction :

We need to ensure that only valid client_id, account_id pairs are joined.
To do so, we need to make sure that we check that a client with client_id does have an account with account_id in each output.
Here is a suggested correction.

"""SELECT T1.client_id, T3.account_id FROM client AS T1
INNER JOIN district AS T2 ON T1.district_id = T2.district_id
INNER JOIN disp AS T4 ON T1.client_id = T4.client_id
INNER JOIN account AS T3 ON T2.district_id = T3.district_id and T4.account_id = T3.account_id
WHERE T2.A3 = 'east Bohemia' AND STRFTIME('%Y', T1.birth_date) BETWEEN '1983' AND '1987'"""


From dev.json

{
    "question_id": 180,
    "db_id": "financial",
    "question": "Please provide a list of clients who were born between 1983 and 1987 and whose account branch is in East Bohemia, along with their IDs.",
    "evidence": "",
    "SQL": "SELECT T1.client_id, T3.account_id FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN account AS T3 ON T2.district_id = T3.district_id WHERE T2.A3 = 'east Bohemia' AND STRFTIME('%Y', T1.birth_date) BETWEEN '1983' AND '1987'",
    "difficulty": "moderate"
},

@nnarodytska
Copy link

"question_id": 181 has an incorrect ground truth SQL query:

Q: "Please provide the IDs of the 3 female clients with the largest loans."
A (SQL): "SQL": "SELECT T1.client_id FROM client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN loan AS T3 ON T2.account_id = T3.account_id WHERE T1.gender = 'F' ORDER BY T3.amount DESC LIMIT 3"

Issues with SQL:

Conceptually, the same issue as in 180. First, SQL query finds clients that have accounts using
"client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id"
However, joining these two tables (client and account) on district_id only is incorrect. Namely, a client who has an account with district_id = X will be associated with all accounts with district_id = X (including accounts of other clients)

Suggested correction :

The same as in 180. Using the table disp to find valid client account associations.


From dev.json

    {
        "question_id": 181,
        "db_id": "financial",
        "question": "Please provide the IDs of the 3 female clients with the largest loans.",
        "evidence": "Female refers to gender = 'F'",
        "SQL": "SELECT T1.client_id FROM client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN loan AS T3 ON T2.account_id = T3.account_id WHERE T1.gender = 'F' ORDER BY T3.amount DESC LIMIT 3",
        "difficulty": "simple"
    },

@nnarodytska
Copy link

"question_id": 186 has an incorrect ground truth SQL query:

Q: "What percentage of male clients request for weekly statements to be issued?",
A (SQL): "SELECT CAST(SUM(T1.gender = 'M') AS REAL) * 100 / COUNT(T1.client_id) FROM client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id WHERE T2.frequency = 'POPLATEK TYDNE'",

Issues with SQL:
The same issue as 180, 181: joining client and account in district_id only
"client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id"


From dev.json

    {
        "question_id": 186,
        "db_id": "financial",
        "question": "What percentage of male clients request for weekly statements to be issued?",
        "evidence": "Percentage of male clients = [count(male clients who requested weekly statements / count(clients who requested weekly statements)] * 100%; Male means gender = 'M'; 'POPLATEK TYDNE' stands for weekly issuance",
        "SQL": "SELECT CAST(SUM(T1.gender = 'M') AS REAL) * 100 / COUNT(T1.client_id) FROM client AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id WHERE T2.frequency = 'POPLATEK TYDNE'",
        "difficulty": "moderate"
    },

@nnarodytska
Copy link

"question_id": 189 has an incorrect ground truth SQL query:
Q: "Name the account numbers of female clients who are oldest and have lowest average salary?",
A (SQL): "SELECT T3.account_id FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN account AS T3 ON T2.district_id = T3.district_id WHERE T1.gender = 'F' ORDER BY T1.birth_date ASC, T2.A11 ASC LIMIT 1",

Issues with SQL:
The same issue as above. joining client and account in district_id only


From dev.json

{
    "question_id": 189,
    "db_id": "financial",
    "question": "Name the account numbers of female clients who are oldest and have lowest average salary?",
    "evidence": "Female refers to 'F' in the gender; A11 contains information about average salary",
    "SQL": "SELECT T3.account_id FROM client AS T1 INNER JOIN district AS T2 ON T1.district_id = T2.district_id INNER JOIN account AS T3 ON T2.district_id = T3.district_id WHERE T1.gender = 'F' ORDER BY T1.birth_date ASC, T2.A11 ASC LIMIT 1",
    "difficulty": "moderate"
},

@nnarodytska
Copy link

nnarodytska commented Mar 20, 2024

"question_id": 192 has an incorrect ground truth SQL query:

Q: "What is the average amount of loan which are still on running contract with statement issuance after each transaction?"
A (SQL): "SELECT AVG(T2.payments) FROM account AS T1 INNER JOIN loan AS T2 ON T1.account_id = T2.account_id WHERE T2.status IN ('C', 'D') AND T1.frequency = 'POPLATEK PO OBRATU'"

Issues with SQL:
The question asks about the average amount of loan.
SQL outputs average payments

Suggested correction :

"SELECT AVG(T2.amount) FROM account AS T1 INNER JOIN loan AS T2 ON T1.account_id = T2.account_id WHERE T2.status IN ('C', 'D') AND T1.frequency = 'POPLATEK PO OBRATU'"


From dev.json

    {
        "question_id": 192,
        "db_id": "financial",
        "question": "What is the average amount of loan which are still on running contract with statement issuance after each transaction?",
        "evidence": "status = 'C' stands for running contract, OK so far; status = 'D' stands for running contract, client in debt. 'POPLATEK PO OBRATU' stands for issuance after transaction",
        "SQL": "SELECT AVG(T2.payments) FROM account AS T1 INNER JOIN loan AS T2 ON T1.account_id = T2.account_id WHERE T2.status IN ('C', 'D') AND T1.frequency = 'POPLATEK PO OBRATU'",
        "difficulty": "moderate"
    },

@nnarodytska
Copy link

"question_id": 129 has an incorrect ground truth SQL query:

"question": "Which are the top ten withdrawals (non-credit card) by district names for the month of January 1996?",
"SQL": "SELECT T1.district_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T2.account_id = T3.account_id WHERE T3.type = 'VYDAJ' AND T2.date LIKE '1996-01%' ORDER BY A2 ASC LIMIT 10",

Issues with SQL:

Issue 1
SQL (T2.date LIKE '1996-01%') filters on accounts that were opened in Jan, 1996
Question asks about transations for the month of January 1996 (should be filtered on T3.date)

Issue 2
SQL (SELECT T1.district_id) output district_ids with repetitions.
Question asks about output ordered district names without repetitions.

Suggested correction :
"SQL": "SELECT DISTINCT T1.A2 FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T2.account_id = T3.account_id WHERE T3.type = 'VYDAJ' AND T3.date LIKE '1996-01%' ORDER BY A2 ASC LIMIT 10",


From dev.json

{
    "question_id": 129,
    "db_id": "financial",
    "question": "Which are the top ten withdrawals (non-credit card) by district names for the month of January 1996?",
    "evidence": "Non-credit card withdraws refers to type = 'VYDAJ'; January 1996 can be found by date LIKE '1996-01%' in the database; A2 means district names",
    "SQL": "SELECT T1.district_id FROM district AS T1 INNER JOIN account AS T2 ON T1.district_id = T2.district_id INNER JOIN trans AS T3 ON T2.account_id = T3.account_id WHERE T3.type = 'VYDAJ' AND T2.date LIKE '1996-01%' ORDER BY A2 ASC LIMIT 10",
    "difficulty": "moderate"
},

@nnarodytska
Copy link

Hi @huybery and @accpatrick,

Thank you for developing and maintaining BIRD!

I went over many of the question/SQL pairs for the financial dataset. Can you please have a look at a few issues I have reported with ground truth SQLs? Most of them would be easy to fix if you agree with them and fixes. However, one issue occurs in many question/ground-truth-SQL pairs and would require more attention. This is a logical issue related to the structure of the database relationships.

There is a M:M relationship between clients and accounts in the database. It is modeled using the 'disp' table, which has columns client_id and account_id that connect clients and accounts that are related.

In many questions, we need to find information about clients and their accounts. To answer these questions correctly, we have to join clients and accounts with 'disp' table to find valid client-account pairs. Many ground truth SQLs answer these questions by joining clients and accounts with the 'district' table on the column 'district_id'. However, this logic is incorrect as it returns records containing unrelated clients and accounts. Please see the example above (for "question_id": 180 ). I have reported a few more questions above; the issue is also occurring in questions 94, 95, 113, 132, 176, and 177 as far as I can see (There might be more questions with this issue).

Please let me know what you think.

Thanks!

nina

@freiz
Copy link

freiz commented Apr 21, 2024

Hi BIRD team,

I just wanted to reach out and say a huge thank you for all the amazing work you've put into creating and maintaining the standard dataset.

So, I recently built a solution and decided to sample 30 questions from the dev set to evaluate it. As I was going through the failing questions, I noticed that there were 6 incorrect SQL queries that I'm pretty confident about. That's a bit higher than I expected, to be honest (around 20% or more, However, the significance is limited by the small sample size.).

<style type="text/css"></style>

Id DB Question Evidence Golden Comment
70 california_schools How many active and closed District Community Day Schools are there in the county of Alpine?   SELECT COUNT(School) FROM schools WHERE (StatusType = 'Closed' OR StatusType = 'Active') AND County = 'Alpine' Missed filtering option: 'District Community Day School'
484 card_games Please list the Italian names of the cards in the set Coldsnap with the highest converted mana cost. card set Coldsnap refers to name = 'Coldsnap'; Italian refers to language = 'Italian' SELECT T2.name FROM foreign_data AS T1 INNER JOIN cards AS T2 ON T2.uuid = T1.uuid INNER JOIN sets AS T3 ON T3.code = T2.setCode WHERE T3.name = 'Coldsnap' AND T1.language = 'Italian' ORDER BY T2.convertedManaCost DESC LIMIT 1 1. the selected name is not italian name (not from the foreign_data table)2. should not only limit 1
592 codebase_community How many users are awarded with more than 5 badges? more than 5 badges refers to Count (Name) > 5; user refers to UserId SELECT UserId FROM ( SELECT UserId, COUNT(Name) AS num FROM badges GROUP BY UserId ) T WHERE T.num > 5 Not answering the question, should return COUNT
927 formula_1 Which driver created the fastest lap speed in a Formula_1 race? Please give both his forename and surname.   SELECT T1.forename, T1.surname FROM drivers AS T1 INNER JOIN results AS T2 ON T2.driverId = T1.driverId WHERE T2.fastestLapTime IS NOT NULL ORDER BY T2.fastestLapSpeed DESC LIMIT 1 fastestLapSpeed is string instead of number, sorting without casting to number is incorrectMinor: filtering fastestLapTime IS NOT NULL does not make a lot sense, however there is no missing data anyway
1029 european_football_2 What are the speed in which attacks are put together of the top 4 teams with the highest build Up Play Speed? speed in which attacks are put together refers to buildUpPlaySpeed;highest build up play speed refers to MAX(buildUpPlaySpeed) SELECT t1.buildUpPlaySpeed FROM Team_Attributes AS t1 INNER JOIN Team AS t2 ON t1.team_api_id = t2.team_api_id ORDER BY t1.buildUpPlayDribbling ASC LIMIT 4 Should sort by buildUpPlaySpeed instead of buildUpPlayDribbling based on the Evidence
1259 thrombosis_prediction Please list the diseases of the patients born after 1995-1-1 and have a normal Rhuematoid Factor. diseases refers to Diagnosis; born after 1995/1/1 refers to YEAR(Birthday) > = 1995; normal Rhuematoid Factor refers to RA IN('-', '+-'); SELECT T1.Diagnosis FROM Patient AS T1 INNER JOIN Laboratory AS T2 ON T1.ID = T2.ID WHERE (T2.RA = '-' OR T2.RA = '+-') AND T1.Birthday > 1995-01-01 date 1995-01-01 is not quoted, so it compares a date with number 1993

Just to be sure, I double-checked that I'm using the latest data and compared the SQL in dev.json and dev.gold.sql to make sure everything lines up. But hey, if I've made any silly mistakes or wrong assumptions while working with the data, please give me a heads up!

@huybery , @accpatrick

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BIRD help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

9 participants