New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Panel of BIRD Annotation Issues. #39
Comments
Should the db_id |
There is a typo of this sql in SELECT CAST(SUM(CASE WHEN T2.DESCRIPTION = 'Allergy to grass pollen' THEN 1 ELSE 0 END) AS REL) * 100 / COUNT(T1.patient) FROM patients AS T1 INNER JOIN allergies AS T2 ON T1.patient = T2.PATIENT WHERE T1.birthplace = 'Pembroke MA US' |
These two queries also cannot be parsed by SELECT T1.first_name, T1.last_name FROM current AS T1 INNER JOIN `current-terms` AS T2 ON T1.bioguide_id = T2.bioguide WHERE T2.party = 'Republican' AND T1.gender_bio = 'F' AND T2.END > DATE() GROUP BY T1.bioguide_id SELECT max_temperature_f, date FROM weather WHERE max_temperature_f = (SELECT MAX(max_temperature_f) FROM weather WHERE max_temperature_f IS NOT NULL AND max_temperature_f IS NOT '') |
Incorrect SQLite database for
|
also I have found there exists corrupted database, for example, coinmarketcap.sqlite |
Incorrect evidence in dev set. 在 california_schools/frpm.csv 中,eligible free rate = Free Meal Count / Enrollment,同时 column 【Percent (%) Eligible Free (K-12)】确实是 【Free Meal Count (K-12)】 / 【Enrollment (K-12)】。 但是在 dev set中第一条测试数据就存在错误: 原始数据: db_id: california_schools 两处错误: SELECT MAX(Percent (%) Eligible Free (K-12)) 或者 SELECT 类似的问题可能还有不少。 |
@tshu-w @wbbeyourself @ronch99 @22842219 The latest update includes the following improvements:
As a result of these enhancements, the ChatGPT (gpt-3.5-turbo) and GPT4 (gpt-4-32k) EX scores have improved to 42.24 (from 37.22) and 49.15 (from 46.35), respectively. We will update the leaderboard with all new results shortly. Please note that the test datasets remain unchanged at this time since we have already devoted much more considerable effort in examining errors, and human evaluation during the construction of test datasets. This can ensure its quality. BTW, our paper has been accepted by NeurIPS 2023 as a Spotlight. Thanks for your support and suggestions! We will keep optimizing our work, thanks! |
In train datast I think this example is wrong
Should be "SQL": "SELECT movie_title, movie_release_year FROM movies ORDER BY LENGTH(movie_title) DESC LIMIT 1" |
|
question 24 in dev dataset might be missing the evidence
|
https://arxiv.org/abs/2402.12243
|
@tshu-w @wbbeyourself @josem7 @ronch99 Of course, we will thoroughly consider your findings and make the necessary corrections during this review. After this, we will also review the training set when our expert students and professors are available. We sincerely appreciate your support and suggestions throughout this process. It is indeed a challenging and time-consuming task to meticulously go through all examples and ensure their quality especially our data is more complex. We aim to release this version around late March or early April. Thanks for your understanding. |
@tshu-w, thank you for sharing this insightful paper. We highly value such research that judges the quality of benchmarks. A similar paper from EMNLP (https://arxiv.org/pdf/2310.18538.pdf) also provides a comprehensive examination of Spider, BIRD, and Spider-DK. These papers are crucial for advancing our field. The paper (https://arxiv.org/abs/2402.12243) presents an important research problem on how to develop robust models to deal with noise in the benchmark. However, some details are unclear, making it difficult for us to fully understand. For instance, 1) the authors mention random sampling of examples and a 49% error rate in the california_schools set. Upon our careful review, we just found 11 out of 89 errors, which we plan to refine. We believe that multiple random samplings, as suggested in this paper (https://arxiv.org/pdf/2305.14210.pdf), would provide more reliable insights due to the large standard deviation in random sampling. Otherwise there may be a biased sampling.
Even these, we sincerely appreciate their highlighting of the join conditions issue as Example 3 in which we did pay less attention on this (FK-->PK) during checking. We will pay more attention to in this review round. In conclusion, we have learned a lot from these benchmark judgement papers and understand what we need to focus on next. We welcome more such papers or feedbacks to promote this field together. Also, we appreciate your support and feedback and will continue to refine our work until the real AGIs in DB are born. Thanks. |
Updated (one more inconsistency was found): Q: Among the accounts who have loan validity more than 12 months, list out the accounts that have the highest approved amount and have account opening date in 1993." A (SQL): I think the join should be on the "account" table rather than "disp" and T1.duration > 12 Suggested correction: Thanks! nina From dev.json "question_id": 99, |
"question_id": 95 has an incorrect ground truth SQL query: Q: "List out the account numbers of clients who are youngest and have highest average salary?", A (SQL): Multiple accounts are linked with the district that has the highest A11 Suggested correction to output all accounts with the highest average salary:
From dev.json { Thanks! nina |
Incorrect description of database "financial". File: llm/data/dev_databases/financial/database_description/district.csv Suggested corrections: A4,number of inhabitants ,,integer, |
"question_id": 118 has an incorrect ground truth SQL query: Q: "For loan amount less than USD100,000, what is the percentage of accounts that is still running with no issue.", E (evidence): "Status = 'C' stands for running contract, ok so far; Percentage of accounts by condition = [(total(amount) & condition) / (total amount)] * 100%.", A (SQL): There are a few issues with SQL: According to the evidence, we need to divide the total amount of running contract accounts by the total amount (" [(total(amount) & condition) / (total amount)] * 100%."). However, SQL incorrectly
Suggested correction (discussion): From dev.json
|
"question_id": 141 has an incorrect ground truth SQL query: Q: "Which districts have transactions greater than USS$10,000 in 1997?", An issue with SQL: The question asks about all transactions that satisfy conditions. However, SQL computes the total amount of transactions per district (GROUP BY T1.district_id HAVING SUM(T3.amount) > 10000) in the WHERE clause. Suggested correction : " From dev.json
|
"question_id": 143 has an incorrect ground truth SQL query: Q: "What are the accounts that have both gold and junior credit cards?", Issues with SQL: The question asks about accounts that have both gold and junior credit cards. Suggested correction (one of many):
From dev.json |
"question_id": 145 has an incorrect ground truth SQL query: Q: "Who are the account holder identification numbers whose spent per month on the credit card is less than the average, in 1998?" Issues with SQL: The question asks about accounts whose spent per month on the credit card is less than the average in 1998 Suggested correction discussion: From dev.json
|
"question_id": 162 has an incorrect ground truth SQL query: Q: "What is the region of the client with the id 3541 from?" Issues with SQL: The question asks about regions only. Suggested correction : From dev.json
|
"question_id": 163 has a possible incorrect ground truth SQL query: Q: "Which district has the most accounts with loan contracts finished with no problems?" Issues with SQL: The question asks about a district that satisfies a condition ("Which district"). Suggested correction : From dev.json |
"question_id": 180 has an incorrect ground truth SQL query: Q: "Please provide a list of clients who were born between 1983 and 1987 and whose account branch is in East Bohemia, along with their IDs." Issues with SQL: If one runs the ground truth SQL against the database you will get the following pairs (clint_id, account_id)
However, client 379 has only 1 account "311"., so only one pair is correct and others are incorrect. Suggested correction : We need to ensure that only valid client_id, account_id pairs are joined.
From dev.json
|
"question_id": 181 has an incorrect ground truth SQL query: Q: "Please provide the IDs of the 3 female clients with the largest loans." Issues with SQL: Conceptually, the same issue as in 180. First, SQL query finds clients that have accounts using Suggested correction : The same as in 180. Using the table disp to find valid client account associations. From dev.json
|
"question_id": 186 has an incorrect ground truth SQL query: Q: "What percentage of male clients request for weekly statements to be issued?", Issues with SQL: From dev.json
|
"question_id": 189 has an incorrect ground truth SQL query: Issues with SQL: From dev.json
|
"question_id": 192 has an incorrect ground truth SQL query: Q: "What is the average amount of loan which are still on running contract with statement issuance after each transaction?" Issues with SQL: Suggested correction :
From dev.json
|
"question_id": 129 has an incorrect ground truth SQL query: "question": "Which are the top ten withdrawals (non-credit card) by district names for the month of January 1996?", Issues with SQL: Issue 1 Issue 2 Suggested correction : From dev.json
|
Hi @huybery and @accpatrick, Thank you for developing and maintaining BIRD! I went over many of the question/SQL pairs for the There is a M:M relationship between In many questions, we need to find information about clients and their accounts. To answer these questions correctly, we have to join Please let me know what you think. Thanks! nina |
Hi BIRD team, I just wanted to reach out and say a huge thank you for all the amazing work you've put into creating and maintaining the standard dataset. So, I recently built a solution and decided to sample 30 questions from the dev set to evaluate it. As I was going through the failing questions, I noticed that there were 6 incorrect SQL queries that I'm pretty confident about. That's a bit higher than I expected, to be honest (around 20% or more, However, the significance is limited by the small sample size.). <style type="text/css"></style>
Just to be sure, I double-checked that I'm using the latest data and compared the SQL in |
Hi all,
Although
BIRD
has incurred significant annotation costs, we still cannot guarantee that all the data is accurately labeled. We hope that the community can assist us in building BIRD together! You can continuously report any errors you find under this issue, and we will perform a dataset update at a designated time.Thanks a lot!
Best,
Binyuan
The text was updated successfully, but these errors were encountered: