Using the huge ManySStuBs4J code bugs dataset #1395
Comments
We need someone to help with this. It looks very straightforward.
@ontocord I have the wrong-to-fixed code part done. Getting the commit messages is going to be hard because they have to be looked up online on GitHub. This generates prompts_{num}.txt files. In my opinion it is already very useful in this state, but if someone good at using the GitHub API or web scraping in general could improve it with the commit messages, it would be even better. Here is the same code as the notebook in plain text:
Can you push the generated bugfix_prompts to HF as parquet, please? Please see the guide we have in the data directory.
And check in your notebook.
@ontocord This is the notebook: Colab Notebook. If you run it, you can download the prompts generated so far; it only takes around 20 minutes.
This is the pull request: #1410. Let me know if it is correct; it is my first pull request.
I think I can help get the commit messages from GitHub, but it may not go quickly, as the GitHub API imposes a rate limit.
@andreaskoepf and the others who review PRs will take a look. Thank you!
@ontocord I managed to run pre-commit by installing it with conda rather than with snap.
I'm working on getting the commit messages from GitHub. Because the GitHub API limits authenticated clients to 5,000 requests per hour, the process is slow, and I am trying to find a way to speed it up. Please assign this issue to me. @caridorc-tergiliti @ontocord
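For anyone picking this up, here is a minimal sketch of fetching one commit message through the GitHub REST API while respecting the rate limit. It assumes an authenticated personal access token and that owner, repository and fix-commit SHA are already known for each bug; the function names are hypothetical and this is not the code from the notebook or from any PR in this thread. It sleeps once the remaining quota reported in the response headers reaches zero, and does not handle 403/429 error responses.

```python
import json
import time
import urllib.request

API_ROOT = "https://api.github.com"


def commit_url(owner, repo, sha):
    """Build the REST endpoint for a single commit."""
    return f"{API_ROOT}/repos/{owner}/{repo}/commits/{sha}"


def seconds_until_reset(headers, now=None):
    """Return how long to wait if the quota is used up, else 0.0.

    `headers` is any mapping holding GitHub's X-RateLimit-* headers.
    X-RateLimit-Reset is an epoch timestamp in seconds.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0.0
    reset_at = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset_at - now)


def fetch_commit_message(owner, repo, sha, token):
    """Fetch one commit message, pausing afterwards if the quota hit zero."""
    request = urllib.request.Request(
        commit_url(owner, repo, sha),
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
        wait = seconds_until_reset(response.headers)
    if wait:
        time.sleep(wait)  # wait for the hourly window to reset
    return payload["commit"]["message"]
```

Since many bug fixes can share one commit, caching results by SHA would likely cut the number of requests, though I have not measured by how much.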
@zirui Thanks for your effort. If you are querying the GitHub API for commits, it might be worth also getting more context for the code, i.e. more lines before and after the bug, as the dataset only contains a few lines near the bug (but I am not sure this is worth the extra effort, your call). Another possibility is downloading the GitHub repositories and using the python github library to get the commit data from them. There are only 8266 repositories, even though there are over 3 million bug fixes. So it may be possible to download each repository with its whole history and query it locally for the bug fixes. Link to the PR: #1425
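A rough sketch of the local approach, in case it helps. It swaps the library mentioned above for the plain `git` command line called through `subprocess`, and assumes each repository has already been cloned with full history into `repo_dir`. All function names and the `radius` parameter are hypothetical.

```python
import subprocess


def commit_message(repo_dir, sha):
    """Read the full message of one commit from a local clone."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "-1", "--format=%B", sha],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def file_at_commit(repo_dir, sha, path):
    """Return the contents of `path` as it was at commit `sha`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "show", f"{sha}:{path}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def context_window(source, line_number, radius=5):
    """Return the lines around a 1-based line number.

    Up to `radius` lines before and after are included, clipped
    at the start and end of the file.
    """
    lines = source.splitlines()
    start = max(0, line_number - 1 - radius)
    end = min(len(lines), line_number + radius)
    return "\n".join(lines[start:end])
```

With this, the extra context for a bug would be `context_window(file_at_commit(repo_dir, parent_sha, path), bug_line)`, where the parent SHA, file path and bug line come from the dataset record. No API quota is involved, at the cost of disk space and clone time.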
Thanks @RiccardoRiglietti
Reopening to track adding the commit messages on top of the work in #1425.
Looks like this issue is moving along nicely, @zirui and @caridorc-tergiliti.
Hi all, @zirui and @caridorc-tergiliti: are we good with this issue? Any results?
Sorry for the late reply. After further filtering and checking, I will create a pull request for the OA repository this week.
The ManySStuBs4J dataset is a huge open source dataset of bugs based on GitHub.
It is easy to put it into dialogue form:
User: Find the bug in the following code:
{INITIAL_CODE}
Reply: The bugfix can be described as follows:
{COMMIT_MESSAGE}
The fixed code is:
{FIXED_CODE}
It would be a substantial boost to our dataset as far as code is concerned.
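As a rough illustration of the dialogue form above, the template can be filled per dataset record like this. The field names `sourceBeforeFix` and `sourceAfterFix` are an assumption about the dataset's JSON keys and should be checked against the actual files; `build_prompt` is a hypothetical helper, and the commit message has to come from elsewhere since the dataset does not include it.

```python
PROMPT_TEMPLATE = (
    "User: Find the bug in the following code:\n"
    "{initial_code}\n"
    "Reply: The bugfix can be described as follows:\n"
    "{commit_message}\n"
    "The fixed code is:\n"
    "{fixed_code}\n"
)


def build_prompt(record, commit_message):
    """Fill the dialogue template from one bug record.

    `record` is assumed to be a dict with the buggy and fixed
    snippets under 'sourceBeforeFix' and 'sourceAfterFix'.
    """
    return PROMPT_TEMPLATE.format(
        initial_code=record["sourceBeforeFix"].strip(),
        commit_message=commit_message.strip(),
        fixed_code=record["sourceAfterFix"].strip(),
    )


example = {"sourceBeforeFix": "a > b", "sourceAfterFix": "a >= b"}
print(build_prompt(example, "Fix off-by-one comparison"))
```

Braces inside the Java snippets are safe here because `str.format` only interprets braces in the template itself, not in the substituted values.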