Low accuracy in predicting SQL using RESDSQL on my dataset #45

VirendraSttl · 2023-07-12T07:22:52Z

Hello everyone,

I hope you're doing well. I encountered an issue while using RESDSQL for predicting SQL on my dataset. Despite following all the recommended steps, I'm observing an accuracy range of only 30-40%. I would greatly appreciate any suggestions or insights on increasing the predictions' accuracy.

Thank you in advance for your assistance!

lihaoyang-ruc · 2023-07-12T11:44:19Z

Hi!
Have you fine-tuned RESDSQL on your dataset? Or did you only use the checkpoints we provided to perform inference on your dataset?

VirendraSttl · 2023-07-12T12:26:05Z

I have been utilizing the provided checkpoints in RESDSQL to enhance my work. However, I am uncertain about the process of fine-tuning RESDSQL on my own dataset. Could you kindly provide guidance on how to proceed with fine-tuning RESDSQL using my specific dataset?

lihaoyang-ruc · 2023-07-17T11:16:51Z

RESDSQL has been fine-tuned on Spider. Therefore, you should prepare your dataset in the same format as Spider (home page: https://yale-lily.github.io/spider).

In fact, most Text-to-SQL datasets organize their data in Spider's format (e.g., Dr. Spider, CSpider, BIRD, Kaggle-DBQA, etc.).

VirendraSttl · 2023-07-17T11:40:23Z

I have already put my dataset in Spider's format (i.e., tables.json).

lihaoyang-ruc · 2023-07-18T09:07:49Z

Just tables.json is not enough.

To train RESDSQL on your dataset, you have to prepare at least three files (taking Spider's files as an example):

database: a folder where the SQLite databases are saved.
train_spider.json: a JSON file containing the training examples; each example should contain three fields: db_id, query, and question.
tables.json: a JSON file that describes the schema of all databases.
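
As a quick sanity check before training, the requirements above could be verified with a small script. This is only a sketch under assumptions: the helper names (`check_examples`, `check_files`) are made up for illustration and are not part of RESDSQL, and the `database/<db_id>/<db_id>.sqlite` layout is assumed from how Spider ships its databases.

```python
import json
from pathlib import Path

# The three fields each training example should contain (from the list above).
REQUIRED_FIELDS = ("db_id", "query", "question")


def check_examples(examples, schema_db_ids, database_dir=None):
    """Return a list of human-readable problems found in the examples.

    Hypothetical helper, not part of RESDSQL.
    """
    problems = []
    for i, ex in enumerate(examples):
        missing = [f for f in REQUIRED_FIELDS if not ex.get(f)]
        if missing:
            problems.append(f"example {i}: missing {missing}")
            continue
        db_id = ex["db_id"]
        if db_id not in schema_db_ids:
            problems.append(f"example {i}: db_id {db_id!r} not in tables.json")
        if database_dir is not None:
            # Assumed Spider-style layout: database/<db_id>/<db_id>.sqlite
            sqlite_path = Path(database_dir) / db_id / f"{db_id}.sqlite"
            if not sqlite_path.exists():
                problems.append(f"example {i}: {sqlite_path} not found")
    return problems


def check_files(train_path, tables_path, database_dir):
    """Load the JSON files from disk and run the same checks."""
    examples = json.loads(Path(train_path).read_text(encoding="utf-8"))
    tables = json.loads(Path(tables_path).read_text(encoding="utf-8"))
    return check_examples(examples, {db["db_id"] for db in tables}, database_dir)


# Tiny in-memory demo; the second example lacks a "query" field.
demo_examples = [
    {"db_id": "shop", "question": "How many customers are there?",
     "query": "SELECT count(*) FROM customer"},
    {"db_id": "shop", "question": "List all product names."},
]
demo_problems = check_examples(demo_examples, schema_db_ids={"shop"})
print(demo_problems)
```

On real data, something like `check_files("train_spider.json", "tables.json", "database")` should return an empty list when the three pieces are consistent with each other.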

To run inference and evaluation, you should also prepare a separate dev_gold.sql file containing each gold SQL query and its corresponding db_id.
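
A gold file can be derived from the dev examples. The sketch below assumes the convention used by Spider's own dev_gold.sql, one `query<TAB>db_id` pair per line; please verify that against the evaluation script you actually run. The function names are hypothetical.

```python
from pathlib import Path


def to_gold_lines(examples):
    """Build one 'query<TAB>db_id' line per example (assumed Spider convention)."""
    lines = []
    for ex in examples:
        # Collapse whitespace so each query stays on a single line.
        # Caveat: this also collapses repeated spaces inside string literals.
        query = " ".join(ex["query"].split())
        lines.append(f"{query}\t{ex['db_id']}")
    return lines


def write_gold_file(examples, out_path="dev_gold.sql"):
    """Write the gold lines to disk, one per line."""
    Path(out_path).write_text("\n".join(to_gold_lines(examples)) + "\n",
                              encoding="utf-8")


# Tiny in-memory demo with a query that spans two lines.
demo_dev = [
    {"db_id": "shop", "question": "List all product names.",
     "query": "SELECT name\nFROM product"},
]
gold_lines = to_gold_lines(demo_dev)
print(gold_lines)
```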

VirendraSttl · 2023-07-26T14:51:19Z

Okay, I'll try this solution.

By the way, I have a question: do I need to retrain the model every time I change my dataset?
Is there any way for RESDSQL to generate SQL queries for a hidden test set?

lihaoyang-ruc · 2023-07-27T13:16:43Z

No. If your training set and test set have the same (or a similar) distribution, the model can generalize naturally to the hidden test set without additional training.

VirendraSttl · 2023-07-28T11:15:55Z

Okay.

I followed your training steps and noticed that both train_spider.json and dev.json were required. However, I am a bit confused about their differences. Are they essentially the same file with different names, or do they serve distinct purposes in the training process?

lihaoyang-ruc · 2023-07-28T12:52:22Z

They are different files. train_spider.json is the training set, and dev.json is the development set.

lihaoyang-ruc · 2023-07-28T12:53:04Z

We use dev.json to select the best checkpoint during fine-tuning.

VirendraSttl · 2023-07-28T13:39:33Z

So can I use the same dev.json file, or do I need to create one separately for my dataset?

lihaoyang-ruc · 2023-07-28T13:45:20Z

My suggestion would be to create a separate dev.json so that you can evaluate the performance of the model on unseen data.

lihaoyang-ruc · 2023-07-28T13:47:26Z

If you're training and evaluating your model on the training set, I don't think it makes sense because the model will memorize your training data to quickly reach (close to) 100% accuracy.

VirendraSttl · 2023-07-28T13:52:21Z

So you mean train_spider.json and dev.json correspond to splitting our data into two sets, i.e., a train set and a test set?

lihaoyang-ruc · 2023-07-28T13:53:19Z

Yes
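
The split discussed above could be sketched roughly as follows. This is a minimal illustration, not the authors' procedure: the function name, the 10% default dev fraction, and the fixed seed are arbitrary assumptions.

```python
import random


def split_examples(examples, dev_fraction=0.1, seed=42):
    """Shuffle the examples and hold out a fraction as the dev set.

    Returns (train_examples, dev_examples). The two lists are disjoint,
    so dev accuracy is measured on data the model has not been trained on.
    """
    shuffled = list(examples)              # copy; leave the input untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Tiny in-memory demo with 20 made-up examples.
demo_all = [
    {"db_id": "shop", "question": f"question {i}", "query": f"SELECT {i}"}
    for i in range(20)
]
train_set, dev_set = split_examples(demo_all, dev_fraction=0.2)
print(len(train_set), len(dev_set))
```

The two lists can then be saved with `json.dump` as train_spider.json and dev.json respectively.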

lihaoyang-ruc mentioned this issue Jul 27, 2023

If I want to prepare my own dataset for training or testing, are there any format requirements? #51

Closed

lihaoyang-ruc closed this as completed Sep 4, 2023

lihaoyang-ruc mentioned this issue May 15, 2024

Hello, how can I convert my own dataset into the CSpider format? #72

Open
