Low accuracy in predicting SQL using RESDSQL on my dataset #45

Closed
VirendraSttl opened this issue Jul 12, 2023 · 15 comments

Comments

@VirendraSttl

Hello everyone,

I hope you're doing well. I encountered an issue while using RESDSQL for predicting SQL on my dataset. Despite following all the recommended steps, I'm observing an accuracy range of only 30-40%. I would greatly appreciate any suggestions or insights on increasing the predictions' accuracy.

Thank you in advance for your assistance!

@lihaoyang-ruc
Contributor

Hi!
Have you fine-tuned RESDSQL on your dataset? Or did you only use the checkpoints we provided to perform inference on your dataset?

@VirendraSttl
Author

I have been utilizing the provided checkpoints in RESDSQL to enhance my work. However, I am uncertain about the process of fine-tuning RESDSQL on my own dataset. Could you kindly provide guidance on how to proceed with fine-tuning RESDSQL using my specific dataset?

@lihaoyang-ruc
Contributor

RESDSQL has been fine-tuned on Spider. Therefore, you should prepare your dataset in the same format as Spider (home page: https://yale-lily.github.io/spider).

In fact, most Text-to-SQL datasets organize their data in Spider's format (e.g., Dr. Spider, CSpider, BIRD, Kaggle-DBQA, etc.).

@VirendraSttl
Author

I have already put my dataset into Spider's format (i.e., tables.json).

@lihaoyang-ruc
Contributor

Just tables.json is not enough.

To train RESDSQL on your dataset, you have to prepare at least three files (taking Spider's files as an example):

  • database, a folder that holds the SQLite databases.
  • train_spider.json, a JSON file that contains the training examples (question-SQL pairs); each example should contain three fields: db_id, query, and question.
  • tables.json, a JSON file that describes the schema of all databases.
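
As a rough illustration of the train_spider.json requirement above, the following sketch checks that every example carries the three fields. The function name `find_incomplete_examples` and the sample records are hypothetical and not part of RESDSQL; real Spider files also carry extra fields, which this check ignores.

```python
REQUIRED_FIELDS = ("db_id", "query", "question")


def find_incomplete_examples(examples):
    """Return (index, missing_fields) for each example lacking a required field."""
    problems = []
    for index, example in enumerate(examples):
        # A field counts as missing if it is absent or empty.
        missing = [field for field in REQUIRED_FIELDS if not example.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical records, for illustration only.
sample_examples = [
    {
        "db_id": "shop",
        "question": "How many orders are there?",
        "query": "SELECT count(*) FROM orders",
    },
    {"db_id": "shop", "question": "List all customer names."},
]

print(find_incomplete_examples(sample_examples))
```

For a real file, `examples` would be the list returned by `json.load` on train_spider.json.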

To run inference and evaluation, you should prepare a separate dev_gold.sql file containing the gold SQL query and its corresponding db_id.
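
A minimal sketch of building such a file from the dev examples, assuming the layout used by Spider's own dev_gold.sql: one line per example, with the SQL query and the db_id separated by a tab. The helper names `format_gold_lines` and `write_gold_file` are hypothetical; check the layout against the evaluation script you actually run.

```python
def format_gold_lines(examples):
    """Build one 'SQL<TAB>db_id' line per example (assumed Spider layout)."""
    lines = []
    for example in examples:
        # Collapse newlines and repeated whitespace so each query stays on one line.
        # Note: this also collapses whitespace inside SQL string literals.
        sql = " ".join(example["query"].split())
        lines.append(f"{sql}\t{example['db_id']}")
    return lines


def write_gold_file(path, examples):
    """Write the gold lines to `path`, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(format_gold_lines(examples)) + "\n")


# Hypothetical dev example, for illustration only.
dev_examples = [
    {
        "db_id": "shop",
        "question": "How many orders are there?",
        "query": "SELECT count(*)\nFROM orders",
    },
]

print(format_gold_lines(dev_examples))
```

Calling `write_gold_file("dev_gold.sql", dev_examples)` would then produce the file itself.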

@VirendraSttl
Author

Okay, I'll try this solution.

By the way, do I need to retrain the model every time I change my dataset?
Is there any way for RESDSQL to generate SQL queries for a hidden test set?

@lihaoyang-ruc
Contributor

No. If your training set and test set have the same (or a similar) distribution, the model generalizes naturally to the hidden test set without additional training.

@VirendraSttl
Author

Okay.

I followed your training steps and noticed that both train_spider.json and dev.json were required. However, I am a bit confused about their differences. Are they essentially the same file with different names, or do they serve distinct purposes in the training process?

@lihaoyang-ruc
Contributor

They are different files. train_spider.json is the training set, and dev.json is the development set.

@lihaoyang-ruc
Contributor

We use dev.json to select the best checkpoint during fine-tuning.

@VirendraSttl
Author

So can I use the same dev.json file, or do I need to create a separate one for my dataset?

@lihaoyang-ruc
Contributor

My suggestion would be to create a separate dev.json so that you can evaluate the performance of the model on unseen data.

@lihaoyang-ruc
Contributor

I don't think it makes sense to train and evaluate your model on the same training set, because the model will memorize the training data and quickly reach (close to) 100% accuracy.
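
One way to get a held-out dev.json is to split your own examples before training. The sketch below is only an illustration: the function name `split_examples`, the 10% dev fraction, and the seed are assumptions, not RESDSQL settings.

```python
import random


def split_examples(examples, dev_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and split it into (train, dev)."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    dev_size = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]


# Hypothetical examples, for illustration only.
all_examples = [
    {"db_id": "shop", "question": f"question {i}", "query": f"SELECT {i}"}
    for i in range(20)
]

train_examples, dev_examples = split_examples(all_examples)
print(len(train_examples), len(dev_examples))
```

The two lists can then be written with `json.dump` to train_spider.json and dev.json. If the model also has to handle databases it has never seen, splitting by db_id rather than by example may be the better choice.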

@VirendraSttl
Author

You mean train_spider.json and dev.json correspond to the two sets we get when we split our data, i.e., a train set and a test set?

@lihaoyang-ruc
Contributor

Yes


2 participants