
Using the huge ManySStuBs4J code bugs dataset #1395

Open
caridorc-tergiliti opened this issue Feb 9, 2023 · 19 comments

@caridorc-tergiliti

The ManySStuBs4J dataset is a huge open source dataset of bugs based on GitHub.

It is easy to put it into dialogue form:

User: Find the bug in the following code:
{INITIAL_CODE}
Reply: The bugfix can be described as follows:
{COMMIT_MESSAGE}
The fixed code is:
{FIXED_CODE}

It would be a substantial boost to our dataset as far as code is concerned.
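The template above can be filled in mechanically once the three fields are available. A minimal sketch (the record keys `initial_code`, `commit_message`, and `fixed_code` are hypothetical placeholders, not the dataset's actual column names):

```python
# Hypothetical record keys; the real ManySStuBs4J columns differ.
TEMPLATE = (
    "User: Find the bug in the following code:\n"
    "{initial_code}\n"
    "Reply: The bugfix can be described as follows:\n"
    "{commit_message}\n"
    "The fixed code is:\n"
    "{fixed_code}\n"
)

def make_prompt(record):
    # Fill the dialogue template from one dataset record.
    return TEMPLATE.format(**record)

example = {
    "initial_code": "if (x = 1) { ... }",
    "commit_message": "Fix assignment used instead of comparison",
    "fixed_code": "if (x == 1) { ... }",
}
print(make_prompt(example))
```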

@huu4ontocord
Collaborator

We need someone to help with this. Looks very straightforward.

@caridorc-tergiliti
Author

caridorc-tergiliti commented Feb 9, 2023

@ontocord I got the buggy-code-to-fixed-code part done. Getting the commit messages is going to be hard because they have to be looked up online on GitHub.

This generates prompts_{num}.txt files in the generated_bugfix_prompts directory, which you can download directly or convert into another, more specialized format: Colab Notebook

It is already very useful in this state, in my opinion, but if someone good at using the GitHub API or at web scraping in general could add the commit messages, it would be even better.

Here is the same code as the notebook in plain text:

# -*- coding: utf-8 -*-
"""bugs_dataset.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/16Fr4iW0x1JdPC8x6GiBb6b4YtEr1CDGH
"""

#!wget https://zenodo.org/record/5845439/files/tssb_data_3M.zip?download=1

#!mv tssb_data_3M.zip?download=1 tssb_data_3M.zip

#!unzip tssb_data_3M.zip

import pandas

FILENUM = 32
table = pandas.read_json(f"tssb_data_3M/file-{FILENUM}.jsonl.gz", lines=True)
table

import re

TEMPLATE = \
"""User: Find the bug in the following code:
{}
Reply: The fixed code is:
```
{}
```


"""

def remove_starting_plus_minus(text):
  if text.startswith("+") or text.startswith("-"):
    return text[1:]
  else:
    return text

def remove_extraneous_diff_info(text):
  pattern = "@@.*@@"
  return re.sub(pattern, "", text)

def clean(text):
  return remove_extraneous_diff_info(remove_starting_plus_minus(text))

def write_prompts(num, table):
  with open(f"generated_bugfix_prompts/prompts_{num}.txt", "w") as f:
    for index, row in table.iterrows():
      # In the diff, lines starting with "+" belong to the fixed version
      # and lines starting with "-" to the buggy one, so drop each in turn.
      buggy = '\n'.join(clean(line) for line in row['diff'].split("\n") if not line.startswith("+"))
      fixed = '\n'.join(clean(line) for line in row['diff'].split("\n") if not line.startswith("-"))
      f.write(TEMPLATE.format(buggy, fixed))

#!mkdir generated_bugfix_prompts

for num in range(33):
  table = pandas.read_json(f"tssb_data_3M/file-{num}.jsonl.gz", lines=True)
  write_prompts(num, table)

#!zip -r generated_bugfix_prompts.zip generated_bugfix_prompts

@huu4ontocord
Collaborator

Can you push the generated bugfix prompts to HF as Parquet, please? Please see the guide we have in the data directory.
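Converting the generated text files to Parquet could look roughly like this (a sketch, assuming the files live in the `generated_bugfix_prompts` directory the notebook creates and that prompts are separated by the blank lines the template emits; pushing to the Hub would additionally need the `huggingface_hub` or `datasets` library and an access token):

```python
import glob
import pandas as pd

# Collect individual prompts from the generated text files.
prompts = []
for path in sorted(glob.glob("generated_bugfix_prompts/prompts_*.txt")):
    with open(path) as f:
        # Assumption: the template separates prompts with blank lines.
        prompts.extend(p for p in f.read().split("\n\n\n") if p.strip())

df = pd.DataFrame({"text": prompts})
try:
    df.to_parquet("bugfix_prompts.parquet", index=False)
except ImportError:
    pass  # writing Parquet needs pyarrow or fastparquet installed
```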

@huu4ontocord
Collaborator

And check in your notebook.
We will find someone to finish it :)

@caridorc-tergiliti
Author

@ontocord This is the notebook: Colab Notebook

If you run it, you can download the prompts generated so far; it only takes around 20 minutes.

@caridorc-tergiliti
Author

This is the pull request: #1410

Please tell me if it is correct; it is the first pull request I have made.

@zirui
Contributor

zirui commented Feb 10, 2023

I think I can help with getting the commit messages from GitHub, but it may take a while, as the GitHub API imposes a rate limit.

@RiccardoRiglietti
Contributor

@zirui @ontocord here is the pull asking to add the notebook: #1425

It says something weird about pre-commit, but I cannot run pre-commit locally because of a version incompatibility.

@huu4ontocord
Collaborator

@andreaskoepf and the others who are handling PRs will review. Thank you!

@RiccardoRiglietti
Contributor

@ontocord I managed to run pre-commit by installing it with conda rather than with snap.

@zirui
Contributor

zirui commented Feb 12, 2023

I'm working on getting the commit messages from GitHub. Due to the GitHub API's limit of 5,000 requests per hour, the process is slow, and I am trying to find a way to speed it up.

please assign this issue to me. @caridorc-tergiliti @ontocord
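For reference, fetching a single commit message goes through the documented `GET /repos/{owner}/{repo}/commits/{sha}` endpoint of the GitHub REST API; authenticating with a token is what grants the 5,000 requests per hour quota. A stdlib-only sketch:

```python
import json
import urllib.request

API = "https://api.github.com"

def commit_url(owner, repo, sha):
    # Documented endpoint: GET /repos/{owner}/{repo}/commits/{sha}
    return f"{API}/repos/{owner}/{repo}/commits/{sha}"

def fetch_commit_message(owner, repo, sha, token=None):
    req = urllib.request.Request(commit_url(owner, repo, sha))
    req.add_header("Accept", "application/vnd.github+json")
    if token:
        # Authenticated requests count against the 5,000/hour quota
        # instead of the much lower unauthenticated one.
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["commit"]["message"]
```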

@RiccardoRiglietti
Contributor

RiccardoRiglietti commented Feb 12, 2023

@zirui Thanks for your effort. If you are querying the GitHub API for commits, it might be worth also getting more context for the code, i.e. more lines before and after the bug, since the dataset only contains a few lines near the bug (but I am not sure whether this is worth the extra effort; your call).

Another possibility is downloading the GitHub repositories and using a Python Git library to read the commit data locally. There are only 8,266 repositories, even though there are over 3 million bugfixes. So it may be possible to download each repository with its full history and query it locally for the bugfixes.

Link to the PR: #1425
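Once a repository is cloned locally, a commit message can be read offline with plain `git`, avoiding the API rate limit entirely. A sketch under that assumption (repository paths are whatever the download step produced):

```python
import subprocess

def local_commit_message(repo_path, sha):
    # `git show -s --format=%B <sha>` prints only the commit message.
    result = subprocess.run(
        ["git", "-C", repo_path, "show", "-s", "--format=%B", sha],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```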

@zirui
Contributor

zirui commented Feb 13, 2023

Thanks @RiccardoRiglietti
Your suggestion of downloading all the GitHub repositories seems like a good idea; I will look into whether this method is more efficient (it may be affected by the network environment, and in my area downloading a full Git history can sometimes take a long time...).

@olliestanley
Copy link
Collaborator

@zirui I have assigned you to this. Soon we should be able to get #1425 merged and then you may be able to work on top of that existing notebook

@olliestanley
Collaborator

Reopening to track the commit issue addition on top of the work in #1425

@olliestanley olliestanley reopened this Feb 21, 2023
@zirui
Contributor

zirui commented Feb 22, 2023

@zirui I have assigned you to this. Soon we should be able to get #1425 merged and then you may be able to work on top of that existing notebook

Ok, after finishing retrieving the COMMIT_MESSAGE for each code sample, I will create a new HF dataset and update the existing notebook.

@huu4ontocord
Collaborator

Looks like this issue is moving along nicely @zirui and @caridorc-tergiliti

@huu4ontocord
Collaborator

Hi all @zirui and @caridorc-tergiliti

Are we good with this issue? Any results?

@zirui
Contributor

zirui commented Apr 12, 2023

Hi all @zirui and @caridorc-tergiliti

Are we good with this issue? Any results?

Sorry for the late reply.
I have created two HF datasets following the OA README:
The ManySStuBs4J dataset extended with commit info: TSSB-3M-ext
An instruction dataset: TSSB-3M-instructions

After further filtering and checking, I will create a pull request to the OA repository this week.

sedthh pushed a commit that referenced this issue May 11, 2023
add the TSSM-3M code bugs dataset
issue: [#1395](#1395)

---------

Co-authored-by: 张子锐 <zirui.zhang@yiducloud.cn>
Co-authored-by: Oliver Stanley <olivergestanley@gmail.com>