Adding a notebook to make use of the TSSB-3M bugs dataset #1425

Merged
merged 6 commits into from Feb 21, 2023

RiccardoRiglietti
Contributor

@RiccardoRiglietti RiccardoRiglietti commented Feb 10, 2023

ManySStuBs4J Dataset

The ManySStuBs4J corpus is a collection of simple fixes to Java bugs, designed for evaluating program repair techniques. We collect all bug-fixing changes using the SZZ heuristic, and then filter these to obtain a data set of small bug fix changes.

It contains over 3 million bugs, so the benefit from it would be substantial.

It is easy to put it into dialogue form:

User: Find the bug in the following code:
{CODE}
Reply: The bugfix can be described as follows:
{COMMIT_MESSAGE}
The fixed code is:
{FIXED-CODE}
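
A minimal sketch of how the template above could be filled in. The function name `make_dialogue` and its parameters are hypothetical and are not taken from the notebook; the commit message is optional because the current prompts do not include it.

```python
def make_dialogue(buggy_code, fixed_code, commit_message=None):
    """Render one bug-fix pair in the User/Reply form shown above.

    Hypothetical helper for illustration only; the notebook may
    build its prompts differently.
    """
    user_turn = f"User: Find the bug in the following code:\n{buggy_code}"

    reply_parts = []
    if commit_message:
        # Only include the description when a commit message is available.
        reply_parts.append(
            f"The bugfix can be described as follows:\n{commit_message}"
        )
    reply_parts.append(f"The fixed code is:\n{fixed_code}")

    reply_turn = "Reply: " + "\n".join(reply_parts)
    return user_turn + "\n" + reply_turn


if __name__ == "__main__":
    print(make_dialogue("x = 1", "x = 2", "Use the right constant"))
```

Leaving `commit_message` as `None` produces the shorter form with only the broken and fixed code.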

It would be a substantial boost to our dataset as far as code is concerned.

For now, I create prompts with only the broken and fixed code, without commit messages, which would have to be scraped from GitHub. It should still be very useful.

Issue link

#1395

Notebook on Colab

Here is the Google Colab code to generate the prompts from the dataset: https://colab.research.google.com/drive/16Fr4iW0x1JdPC8x6GiBb6b4YtEr1CDGH?authuser=4#scrollTo=GFt0SJ1vpMPR

I also added it into the notebooks folder.

Contributing

Adding a way to incorporate commit messages into the prompts would be a great contribution. This can be done by scraping the GitHub API for the commit messages based on the commit hash, or by downloading the repository with the full history and extracting the commit messages from there.
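
As a rough starting point for the API route, here is a sketch that looks up one commit message through the GitHub REST endpoint `GET /repos/{owner}/{repo}/commits/{ref}`. It assumes the owner, repository name and commit hash can be derived from each dataset row; the function names are hypothetical, and nothing here is part of the notebook. Unauthenticated requests are tightly rate limited, so a token is likely needed for any real run.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def commit_url(owner, repo, sha):
    """Build the REST URL for a single commit."""
    return f"{API_ROOT}/repos/{owner}/{repo}/commits/{sha}"


def extract_message(payload):
    """Pull the commit message out of the decoded JSON response."""
    return payload["commit"]["message"]


def fetch_commit_message(owner, repo, sha, token=None):
    """Fetch one commit message; performs a network request.

    `token` is an optional personal access token (assumption: the caller
    supplies one to get a higher request quota).
    """
    request = urllib.request.Request(
        commit_url(owner, repo, sha),
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=30) as response:
        return extract_message(json.load(response))
```

For millions of commits, cloning each repository once and reading messages from the local history may be cheaper than one API call per commit.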

@github-actions

pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md

@RiccardoRiglietti
Contributor Author

Now I have run the pre-commit script and the check passes.

@olliestanley
Collaborator

Hi @RiccardoRiglietti, would you like this to be considered for merging now or to wait until the "to do" comment in the notebook regarding commit messages is worked on as well?

@RiccardoRiglietti
Contributor Author

@olliestanley The commit messages are not in the dataset; they have to be fetched from GitHub through the API, by web scraping, or by cloning the repositories with their full history, so it is a large amount of work (also considering the limited API quotas). In my opinion the pairs of wrong/correct code already represent a good addition to the training data, as there are over 3 million such pairs, so I think it is best to ask for this to be considered for merging now. As for the progress on adding commit messages, you could ask @zirui, who mentioned that he started working on this at #1395.

@olliestanley
Collaborator

Ok great! Could you add an [Open in Colab] sticker to the top of the notebook, as in the other notebooks in the repo, e.g. here? And maybe delete the empty cell at the end with outdated output? Then I think we can merge this.

@RiccardoRiglietti
Contributor Author

@olliestanley I cleaned up the notebook: I removed the outdated cell, added the Open in Colab badge, and added section titles and a short description at the start.

@olliestanley
Collaborator

Thanks! Looks like the Colab link still points to the essay revision dataset, so it will need to be tweaked to point to the path your notebook will be at once it's merged. After that I think this will be ready. I'm off to bed now, so I will confirm it tomorrow :)

@RiccardoRiglietti
Contributor Author

@olliestanley I have temporarily pointed the link to the online version of the notebook. Good night.

@olliestanley olliestanley enabled auto-merge (squash) February 21, 2023 16:36
Collaborator

@olliestanley olliestanley left a comment

Thank you!

@olliestanley olliestanley merged commit ac41901 into LAION-AI:main Feb 21, 2023

3 participants