
Using the huge ManySStuBs4J code bugs dataset #1395

Open
caridorc-tergiliti opened this issue Feb 9, 2023 · 19 comments

@caridorc-tergiliti

The ManySStuBs4J dataset is a huge open source dataset of bugs based on GitHub.

It is easy to put it into dialogue form:

User: Find the bug in the following code:
{INITIAL_CODE}
Reply: The bugfix can be described as follows:
{COMMIT_MESSAGE}
The fixed code is:
{FIXED_CODE}

It would be a substantial boost to our dataset as far as code is concerned.
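The template above can be filled in mechanically once the three fields are available. A minimal sketch (the record keys `initial_code`, `commit_message`, and `fixed_code` are hypothetical placeholders, not the dataset's actual column names):

```python
# Hypothetical record keys; the real ManySStuBs4J columns differ.
TEMPLATE = (
    "User: Find the bug in the following code:\n"
    "{initial_code}\n"
    "Reply: The bugfix can be described as follows:\n"
    "{commit_message}\n"
    "The fixed code is:\n"
    "{fixed_code}\n"
)

def make_prompt(record):
    # Fill the dialogue template from one dataset record.
    return TEMPLATE.format(**record)

example = {
    "initial_code": "if (x = 1) { ... }",
    "commit_message": "Fix assignment used instead of comparison",
    "fixed_code": "if (x == 1) { ... }",
}
print(make_prompt(example))
```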

@huu4ontocord
Collaborator

We need someone to help with this. Looks very straightforward.

@caridorc-tergiliti
Author

caridorc-tergiliti commented Feb 9, 2023

@ontocord I got the buggy-code-to-fixed-code part done. Getting the commit messages is going to be hard because they have to be looked up online on GitHub.

This generates prompts_{num}.txt files in the generated_bugfix_prompts directory, which you can download directly or convert into another, more specialized format: Colab Notebook

It is already very useful in this state, in my opinion, but if someone good at using the GitHub API or at web scraping in general could add the commit messages, it would be even better.

Here is the same code as the notebook in plain text:

# -*- coding: utf-8 -*-
"""bugs_dataset.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/16Fr4iW0x1JdPC8x6GiBb6b4YtEr1CDGH
"""

#!wget https://zenodo.org/record/5845439/files/tssb_data_3M.zip?download=1

#!mv tssb_data_3M.zip?download=1 tssb_data_3M.zip

#!unzip tssb_data_3M.zip

import pandas

FILENUM = 32
table = pandas.read_json(f"tssb_data_3M/file-{FILENUM}.jsonl.gz", lines=True)
table

import re

TEMPLATE = \
"""User: Find the bug in the following code:
{}
Reply: The fixed code is:
```
{}
```


"""

def remove_starting_plus_minus(text):
  if text.startswith("+") or text.startswith("-"):
    return text[1:]
  else:
    return text

def remove_extraneous_diff_info(text):
  pattern = "@@.*@@"
  return re.sub(pattern, "", text)

def clean(text):
  return remove_extraneous_diff_info(remove_starting_plus_minus(text))

def write_prompts(num, table):
  with open(f"generated_bugfix_prompts/prompts_{num}.txt", "w") as f:
    for index, row in table.iterrows():
      # In the diff, lines starting with "+" belong to the fixed version
      # and lines starting with "-" to the buggy one, so drop each in turn.
      buggy = '\n'.join(clean(line) for line in row['diff'].split("\n") if not line.startswith("+"))
      fixed = '\n'.join(clean(line) for line in row['diff'].split("\n") if not line.startswith("-"))
      f.write(TEMPLATE.format(buggy, fixed))

#!mkdir generated_bugfix_prompts

for num in range(33):
  table = pandas.read_json(f"tssb_data_3M/file-{num}.jsonl.gz", lines=True)
  write_prompts(num, table)

#!zip -r generated_bugfix_prompts.zip generated_bugfix_prompts

@huu4ontocord
Collaborator

Can you push the generated bugfix prompts to HF as Parquet, please? Please see the guide we have in the data directory.
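Converting the generated text files to Parquet could look roughly like this (a sketch, assuming the files live in the `generated_bugfix_prompts` directory the notebook creates and that prompts are separated by the blank lines the template emits; pushing to the Hub would additionally need the `huggingface_hub` or `datasets` library and an access token):

```python
import glob
import pandas as pd

# Collect individual prompts from the generated text files.
prompts = []
for path in sorted(glob.glob("generated_bugfix_prompts/prompts_*.txt")):
    with open(path) as f:
        # Assumption: the template separates prompts with blank lines.
        prompts.extend(p for p in f.read().split("\n\n\n") if p.strip())

df = pd.DataFrame({"text": prompts})
try:
    df.to_parquet("bugfix_prompts.parquet", index=False)
except ImportError:
    pass  # writing Parquet needs pyarrow or fastparquet installed
```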

@huu4ontocord
Collaborator

And check in your notebook.
We will find someone to finish it :)

@caridorc-tergiliti
Author

@ontocord This is the notebook: Colab Notebook

If you run it, you can download the prompts generated so far; it only takes around 20 minutes.

@caridorc-tergiliti
Author

This is the pull request: #1410

Please tell me if it is correct; it is the first pull request I have made.

@zirui
Contributor

zirui commented Feb 10, 2023

I think I can help with getting the commit messages from GitHub, but it may take a while, as the GitHub API imposes a rate limit.

@RiccardoRiglietti
Contributor

@zirui @ontocord here is the pull asking to add the notebook: #1425

It says something weird about pre-commit, but I cannot run pre-commit locally because of a version incompatibility.

@huu4ontocord
Collaborator

@andreaskoepf and the others who are handling PRs will review. Thank you!

@RiccardoRiglietti
Contributor

@ontocord I managed to run pre-commit by installing it with conda rather than with snap.

@zirui
Contributor

zirui commented Feb 12, 2023

I'm working on getting the commit messages from GitHub. Due to the GitHub API's limit of 5,000 requests per hour, the process is slow, and I am trying to find a way to speed it up.

please assign this issue to me. @caridorc-tergiliti @ontocord
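For reference, fetching a single commit message goes through the documented `GET /repos/{owner}/{repo}/commits/{sha}` endpoint of the GitHub REST API; authenticating with a token is what grants the 5,000 requests per hour quota. A stdlib-only sketch:

```python
import json
import urllib.request

API = "https://api.github.com"

def commit_url(owner, repo, sha):
    # Documented endpoint: GET /repos/{owner}/{repo}/commits/{sha}
    return f"{API}/repos/{owner}/{repo}/commits/{sha}"

def fetch_commit_message(owner, repo, sha, token=None):
    req = urllib.request.Request(commit_url(owner, repo, sha))
    req.add_header("Accept", "application/vnd.github+json")
    if token:
        # Authenticated requests count against the 5,000/hour quota
        # instead of the much lower unauthenticated one.
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["commit"]["message"]
```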

@RiccardoRiglietti
Contributor

RiccardoRiglietti commented Feb 12, 2023

@zirui Thanks for your effort. If you are querying the GitHub API for commits, it might be worth also getting more context for the code, i.e. more lines before and after the bug, since the dataset only contains a few lines near the bug (but I am not sure whether this is worth the extra effort; your call).

Another possibility is downloading the GitHub repositories and using a Python Git library to read the commit data locally. There are only 8,266 repositories, even though there are over 3 million bugfixes. So it may be possible to download each repository with its full history and query it locally for the bugfixes.

Link to the PR: #1425
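Once a repository is cloned locally, a commit message can be read offline with plain `git`, avoiding the API rate limit entirely. A sketch under that assumption (repository paths are whatever the download step produced):

```python
import subprocess

def local_commit_message(repo_path, sha):
    # `git show -s --format=%B <sha>` prints only the commit message.
    result = subprocess.run(
        ["git", "-C", repo_path, "show", "-s", "--format=%B", sha],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```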

@zirui
Contributor

zirui commented Feb 13, 2023

Thanks @RiccardoRiglietti
Your suggestion of downloading all the GitHub repositories seems like a good idea; I will look into whether this method is more efficient (it may be affected by the network environment, and in my area downloading a full Git history can sometimes take a long time...).

@olliestanley
Copy link
Collaborator

@zirui I have assigned you to this. Soon we should be able to get #1425 merged and then you may be able to work on top of that existing notebook

@olliestanley
Collaborator

Reopening to track the commit issue addition on top of the work in #1425

@olliestanley olliestanley reopened this Feb 21, 2023
@zirui
Contributor

zirui commented Feb 22, 2023

@zirui I have assigned you to this. Soon we should be able to get #1425 merged and then you may be able to work on top of that existing notebook

Ok, after finishing retrieving the COMMIT_MESSAGE for each code sample, I will create a new HF dataset and update the existing notebook.

@huu4ontocord
Collaborator

Looks like this issue is moving along nicely @zirui and @caridorc-tergiliti

@huu4ontocord
Collaborator

Hi all @zirui and @caridorc-tergiliti

Are we good with this issue? Any results?

@zirui
Contributor

zirui commented Apr 12, 2023

Hi all @zirui and @caridorc-tergiliti

Are we good with this issue? Any results?

Sorry for the late reply.
I have created two HF datasets following the OA README:
The ManySStuBs4J dataset extended with commit info: TSSB-3M-ext
An instruction dataset: TSSB-3M-instructions

After further filtering and checking, I will create a pull request to the OA repository this week.

sedthh pushed a commit that referenced this issue May 11, 2023
add the TSSM-3M code bugs dataset
issue: [#1395](#1395)

---------

Co-authored-by: 张子锐 <zirui.zhang@yiducloud.cn>
Co-authored-by: Oliver Stanley <olivergestanley@gmail.com>