Adding a notebook to make use of the TSSB-3M bugs dataset #1425
❌ pre-commit failed.
I have now run the pre-commit script, and the check passes.
Hi @RiccardoRiglietti, would you like this to be considered for merging now, or should it wait until the "to do" comment in the notebook regarding commit messages has been addressed as well?
@olliestanley The commit messages are not in the dataset; they must be fetched from GitHub through the API, by web scraping, or by cloning the repositories with their history, which is a large amount of work (especially given the limited API quotas). In my opinion, the pairs of wrong/correct code are already a good addition to the training data, as there are over 3 million such pairs, so I think it is best to ask for this to be considered for merging now. As for progress on adding commit messages, you could ask @zirui, who mentioned in #1395 that he had started working on this.
Ok great! Could you add an [Open in Colab] badge to the top of the notebook, as in the other notebooks in the repo, e.g. here? And maybe delete the empty cell at the end with outdated output? Then I think we can merge this.
@olliestanley I cleaned up the notebook, removed the outdated cell, added the Open in Colab badge, and added section titles and a short description at the start.
Thanks! Looks like the Colab link still points to the essay revision dataset, so it will need to be tweaked to point to the path your notebook will be at once it's merged. After that I think this will be ready. I'm off to bed now, so I will confirm it tomorrow :)
@olliestanley I have temporarily pointed the link to the online version of the notebook. Good night.
Thank you!
ManySStuBs4J Dataset
The ManySStuBs4J corpus is a collection of simple fixes to Java bugs, designed for evaluating program repair techniques. We collect all bug-fixing changes using the SZZ heuristic, and then filter these to obtain a data set of small bug fix changes.
It contains over 3 million bugs, so the benefit from it would be substantial.
It is easy to put it into dialogue form:
User: Find the bug in the following code:
{CODE}
Reply: The bugfix can be described as follows:
{COMMIT_MESSAGE}
The fixed code is:
{FIXED-CODE}
It would be a substantial boost to our dataset as far as code is concerned.
For now, I create prompts using only the broken and fixed code, without the commit messages, which would have to be scraped from GitHub. Even so, it should be very useful.
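As a rough illustration of the template above, here is a minimal sketch of how one bug/fix pair could be turned into a dialogue. The function name `build_dialogue` and its parameters are hypothetical and are not taken from the notebook, and the shortened reply used when no commit message is available is an assumption about how the template degrades, so the notebook's actual wording may differ.

```python
from typing import Optional


def build_dialogue(
    code: str, fixed_code: str, commit_message: Optional[str] = None
) -> str:
    """Format one bug/fix pair as a User/Reply exchange (hypothetical helper)."""
    user = f"User: Find the bug in the following code:\n{code}\n"
    if commit_message:
        # Full template: describe the fix first, then show the fixed code.
        reply = (
            "Reply: The bugfix can be described as follows:\n"
            f"{commit_message}\n"
            "The fixed code is:\n"
            f"{fixed_code}"
        )
    else:
        # Assumed fallback for pairs without a commit message.
        reply = f"Reply: The fixed code is:\n{fixed_code}"
    return user + reply


if __name__ == "__main__":
    print(build_dialogue("total = a + a", "total = a + b"))
```

Keeping the commit message optional means the same helper could be reused unchanged if commit messages are added later.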
Issue link
#1395
Notebook on Colab
Here is the Google Colab code to generate the prompts from the dataset: https://colab.research.google.com/drive/16Fr4iW0x1JdPC8x6GiBb6b4YtEr1CDGH?authuser=4#scrollTo=GFt0SJ1vpMPR
I also added it to the notebooks folder.
Contributing
Adding a way to incorporate commit messages into the prompts would be a great contribution. This can be done by querying the GitHub API for each commit message by its commit hash, or by cloning the repository with its full history and extracting the commit messages from there.
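For anyone picking this up, the API route might look roughly like the sketch below. It assumes the GitHub REST endpoint `GET /repos/{owner}/{repo}/commits/{ref}`, whose JSON response carries the message under `commit.message`. The helper names (`commit_url`, `extract_message`, `fetch_commit_message`) are hypothetical, and how the owner, repository name, and hash are read from the dataset is left out because it depends on the dataset's columns.

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def commit_url(owner: str, repo: str, sha: str) -> str:
    """Build the REST URL for a single commit."""
    return f"{API_ROOT}/repos/{owner}/{repo}/commits/{sha}"


def extract_message(payload: dict) -> str:
    """Pull the commit message out of a decoded commit response."""
    return payload["commit"]["message"]


def fetch_commit_message(
    owner: str, repo: str, sha: str, token: Optional[str] = None
) -> str:
    """Fetch one commit message; a token is optional but raises the quota."""
    request = urllib.request.Request(
        commit_url(owner, repo, sha),
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_message(json.load(response))
```

With over 3 million pairs, one request per commit will run into the API quotas mentioned in the conversation, so caching results and batching by repository would likely be needed. For the cloning route, `git show -s --format=%B <hash>` prints the message of a given commit from a local clone.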