PTM4SE

In recent years, deep learning has achieved excellent performance in Software Engineering (SE) tasks. However, this performance relies on large-scale training sets, which hinders the application of deep learning techniques in practice. With the release of pre-trained models (PTMs) in deep learning, SE researchers have begun to pay attention to PTMs and have introduced them into SE tasks. PTMs have brought a qualitative leap in SE tasks, ushering intelligent software engineering into a new era. However, no existing study has distilled the successes, failures, and opportunities of pre-trained models in SE. To clarify the work in this cross-field (PTM4SE: Pre-trained Models for Software Engineering), we systematically review the current studies related to PTM4SE. Specifically, we first describe the framework of intelligent software engineering methods based on pre-trained models. We then analyze and discuss the pre-trained models commonly used in SE. Next, we introduce in detail the downstream SE tasks that use pre-trained models, and compare and analyze the performance of pre-trained model techniques on these tasks. We also present the datasets used in SE for training and fine-tuning PTMs. Finally, we discuss the challenges and opportunities for PTM4SE.

1. Research Framework

To address the problem that intelligent SE methods based on deep learning (DL) require large amounts of labeled data, SE researchers have proposed many PTM-based methods for SE-related tasks (i.e., pre-trained model-based intelligent software engineering methods). These methods use a small amount of labeled data from an SE downstream task to adapt an existing PTM, yielding a PTM4SE method that solves the downstream task (e.g., code generation, program repair, or issue report classification).

The construction process mainly comprises four parts: collecting and processing SE downstream-task data, building the intelligent method on top of a pre-trained model, model training, and model evaluation (a minimal sketch follows Fig. 1).


Fig. 1 Research framework of intelligent software engineering methods based on pre-trained models
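
As an illustration of this pipeline, the sketch below fine-tunes an off-the-shelf BERT checkpoint for a binary issue-report classification task with the Hugging Face transformers and datasets libraries. It is a minimal sketch under stated assumptions: the CSV file names and their expected text/label columns are placeholders, not files shipped in this repository.

```python
# Minimal sketch of the four-part PTM4SE pipeline (data -> model -> training -> evaluation).
# Assumes the Hugging Face `transformers` and `datasets` libraries; the CSV files are
# placeholders and are expected to contain "text" and "label" columns.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# 1. Collect and process SE downstream-task data (here: labeled issue reports in CSV form).
raw = load_dataset("csv", data_files={"train": "issues_train.csv", "test": "issues_test.csv"})

# 2. Build the intelligent method on top of an existing PTM (here: BERT plus a classification head).
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

encoded = raw.map(tokenize, batched=True)

# 3. Model training: fine-tune on the small labeled downstream dataset.
args = TrainingArguments(output_dir="ptm4se-issue-clf", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"], eval_dataset=encoded["test"])
trainer.train()

# 4. Model evaluation on the held-out split.
print(trainer.evaluate())
```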

2. PTMs in SE

Since 2018, SE researchers have introduced different types of PTMs into SE-related tasks. We therefore collected the intelligent software engineering studies that use PTMs and divided the models they use into three types: off-the-shelf models, domain-specific models, and source code models.


Fig. 2 Distribution of pre-trained models used in software engineering

2.1 General Pre-trained Models

Off-the-shelf models are pre-trained models trained on general-domain datasets in the DL field, e.g., BERT, GPT, and XLNet, which are trained on English Wikipedia and general news corpora in natural language processing (NLP), and ResNet and VGG, which are pre-trained on ImageNet in computer vision (CV). We therefore divide off-the-shelf models into two categories: off-the-shelf models in NLP and off-the-shelf models in CV.
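
As a minimal illustration, the sketch below loads one off-the-shelf model from each category; it assumes the Hugging Face transformers library for the NLP checkpoint and a recent torchvision release for the ImageNet-pre-trained ResNet.

```python
# Loading off-the-shelf PTMs from both categories; any general-domain checkpoint
# could be substituted for the ones shown here.
from transformers import AutoModel, AutoTokenizer   # off-the-shelf NLP models
import torchvision.models as models                 # off-the-shelf CV models

# Off-the-shelf NLP model: BERT pre-trained on general English corpora.
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

# Off-the-shelf CV model: ResNet pre-trained on ImageNet (torchvision >= 0.13 weights API).
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
```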

2.2 Domain-specific Models

Domain-specific models are pre-trained models trained on SE-specific data sources (e.g., GitHub, Stack Overflow, and JIRA). In recent years, SE researchers have collected a large number of SE-specific datasets to re-train DL models, producing models such as seBERT, the Text-To-Text Transfer Transformer (T5), Word2Vec-SO, BERT-reviews, BERT-SO-1M, BERT-SO-1M-Large, and RoBERTa-SO shown in Fig. 2.
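
Using a domain-specific model follows the same pattern as a general one; only the checkpoint changes. The sketch below is illustrative only: the SE_CHECKPOINT identifier is a placeholder, not a published model id, so substitute the released seBERT, BERT-SO, or RoBERTa-SO weights you actually use.

```python
# Swapping in a domain-specific checkpoint only changes the model identifier.
from transformers import AutoModelForMaskedLM, AutoTokenizer

SE_CHECKPOINT = "path/or/hub-id-of-a-domain-specific-model"  # placeholder, not a real id
tokenizer = AutoTokenizer.from_pretrained(SE_CHECKPOINT)
model = AutoModelForMaskedLM.from_pretrained(SE_CHECKPOINT)

# The SE-specific vocabulary matters: terms such as "NullPointerException" are less likely
# to be shattered into many sub-tokens than with a general-domain tokenizer.
print(tokenizer.tokenize("Fix NullPointerException in the issue tracker parser"))
```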

2.3 Source Code Models

Source code models are pre-trained models trained on source code to capture the syntactic and semantic information it contains. To date, SE researchers have collected source code in different programming languages to re-train DL models, producing models such as Code2Vec, CodeT5, CodeBERT, GraphCodeBERT, C-BERT, CuBERT, PLBART, OSCAR, InferCode, and DOBF shown in Fig. 2.
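
As an example of using a source code model, the sketch below encodes a small function with CodeBERT via the Hugging Face transformers library (the microsoft/codebert-base checkpoint); GraphCodeBERT or CodeT5 checkpoints can be swapped in the same way.

```python
# Encoding a code snippet with a source code PTM (CodeBERT is used here as an example).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "def max(a, b): return a if a > b else b"
inputs = tokenizer(code, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per code token; these embeddings feed downstream SE tasks.
print(outputs.last_hidden_state.shape)   # (1, num_tokens, 768)
```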

3. Commonly Available SE Datasets

Datasets, as a key component of PTMs, affect the performance of PTMs on SE-related tasks. To achieve higher performance with intelligent software engineering methods, SE researchers have collected different types of SE datasets to train or fine-tune the models. To present and understand the current SE datasets, we summarized and analyzed them and divided them into PTM datasets and SE-related downstream task datasets.

3.1 PTM Datasets

PTM datasets are the datasets used to train a DL model from scratch. The PTM datasets frequently used in SE are listed in the table below; they are also collected in the dataset files of this repository.
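
For instance, the Python portion of CodeSearchNet (one of the corpora in the table below) can be pulled through the Hugging Face datasets library, assuming the hub copy of the dataset is still available; the snippet is a sketch, not part of this repository.

```python
# Pulling one of the PTM pre-training corpora (CodeSearchNet, Python split) from the
# Hugging Face hub; recent `datasets` versions require trust_remote_code for
# script-based datasets, and the hub copy is assumed to be available.
from datasets import load_dataset

csn = load_dataset("code_search_net", "python", split="train", trust_remote_code=True)
print(csn[0]["func_code_string"][:200])   # raw function text used for pre-training
```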

| Type | Dataset | Programming Language | Source | Scale | Open Time | PTMs |
| --- | --- | --- | --- | --- | --- | --- |
| PL | CodeSearchNet + C/C# datasets | Ruby/JavaScript/Go/Python/Java/PHP/C/C# | GitHub + BigQuery | 8.35 G | 2021 | CodeT5 |
|  | GitHub C language repositories | C | GitHub | 5.8 G | 2020 | C-BERT |
|  | Java and TypeScript datasets | Java/TypeScript | GitHub |  | 2020 | CugLM |
|  | Java datasets | Java | GitHub |  | 2021 | SynFix |
|  | CLCDSA dataset | Java/C/C++ | AtCoder + CodeJam | 17.6 M | 2019 | IR-BERT |
|  | Java datasets | Java | GitHub | 32 G | 2020 | InferCode, Code2vec |
|  | ETH Py150 Open corpus | Python | GitHub | 190 M | 2020 | CodeTrek |
|  | Unique Python files | Python | GitHub | 159 GB | 2021 | CodeX |
|  | JavaSmall and JavaMed datasets | Java | GitHub | 4.7 M | 2020 | Coder |
|  | Python and Java pre-training corpus | Java/Python | GitHub | 21.3 M | 2021 | CuBERT, TreeBERT |
| NL | SE textual data | English | Stack Overflow + GitHub + Jira | 119.7 G | 2021 | seBERT |
|  | CoNLL-2003 | English | Stack Overflow | 3.16 M | 2020 | BERTOverflow, CosSensBERT |
| NL+PL | Java datasets from CodeSearchNet + SO posts | Java/English | GitHub + SO | 52.5 M | 2022 | T5 |
|  | Java datasets from CodeSearchNet + SO posts | Java/English | GitHub | 1.5 M | 2021 | T5 |
|  | Java and Python from BigQuery + SO posts | Java/Python/English | BigQuery + GitHub | 655 G | 2021 | PLBART |
|  | CodeSearchNet | Ruby/JavaScript/Go/Python/Java/PHP/English | GitHub | 3.5 G | 2019 | T-BERT, GraphCodeBERT, CodeBERT |
|  | Python corpus of CodeSearchNet dataset | Python | GitHub | 1.6 M | 2019 | CLAWSAT, CODE-MVP |
|  | Java corpus of CodeSearchNet dataset | Java | GitHub | 2.0 M | 2019 | CLAWSAT |
|  | AnghaBench | C | GitHub | 0.53 M | 2020 | COMBO |

3.2 SE-related Downstream Task Datasets

SE-related downstream task datasets are the datasets used to fine-tune the intelligent DL models for SE-related downstream tasks. Commonly used downstream task datasets are listed in the table below.
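
As a hedged example, the sketch below loads the Devign defect-detection split distributed through CodeXGLUE from the Hugging Face datasets hub; the dataset id and field names are assumptions about that hub copy, so adjust them to whichever mirror you use.

```python
# Loading a downstream fine-tuning dataset (Devign defect detection, as packaged in CodeXGLUE).
# The hub id and field names ("func", "target") are assumptions about the hub copy.
from datasets import load_dataset

devign = load_dataset("code_x_glue_cc_defect_detection", split="train")
example = devign[0]
print(example["target"], example["func"][:120])   # label + the C function to classify
```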

| Type | Task | Dataset | Programming Language | Scale | Open Time |
| --- | --- | --- | --- | --- | --- |
| PL | Code Classification | Java patches | Java | 102041 | 2020 |
|  |  | POJ104 | C/C++ | 30815 | 2021 |
|  |  | CodeCloneBench | Java | 901028 | 2014 |
|  |  | SATD dataset |  |  | 2016 |
|  |  | SmartBugs Wild Dataset |  | 47398 | 2020 |
|  |  | SPI | C | 298917 | 2021 |
|  |  | QEMU | C/C++ | 13600 | 2005 |
|  |  | FFmpeg | C/C++ | 4919 | 2006 |
|  |  | Devign | C | 27318 | 2021 |
|  |  | Merge Conflicts Dataset | C#/JavaScript/TypeScript/Java | 219934 | 2022 |
|  |  | Multi-language Commit Message Dataset (MCMD) | Java/C#/C++/Python/JavaScript | 2250000 | 2022 |
|  |  | Vuldeepecker | C/C++ | 61638 | 2018 |
|  |  | Draper | C/C++ | 1274366 | 2018 |
|  |  | REVEAL | C/C++ | 18169 | 2020 |
|  |  | muVuldeepecker (MVD) | C/C++ | 181641 | 2019 |
|  |  | D2A | C/C++ | 1295623 | 2021 |
|  | Program Repair | ManySStuBs4J | Java | 63923 | 2021 |
|  |  | Automatic Bug Fixing | Java | 46680 | 2019 |
|  |  | TFix-dataset | JavaScript | 104804 | 2021 |
|  |  | QuixBugs | Python/Java |  | 2017 |
|  |  | CoCoNut | Java/Python/C/JavaScript | 9675342 | 2020 |
|  |  | BugAID | JavaScript |  | 2016 |
|  |  | ManyBugs | C | 10468 | 2015 |
|  | Code Completion | Java and TypeScript datasets | Java/TypeScript |  | 2020 |
|  |  | ETH Py150 corpus | Python | 74749 | 2020 |
|  | API Recommendation | Req2Lib-dataset | Java | 5625 | 2020 |
|  | Code Translation | Code-code (CodeTrans) | Java/C# | 10300 | 2021 |
|  |  | Python800 dataset | Python | 240000 | 2021 |
| NL | Text Classification | Herzig's issue report datasets | English |  | 2012 |
|  |  | Commit messages | English | 1793 | 2021 |
|  |  | Issue reports from GitHub | English |  | 2021 |
|  |  | SEntiMoji dataset |  | 10096 | 2019 |
|  | Review Responses Automatic Generation | Review-response pairs datasets | English | 570881 | 2020 |
|  | Link Prediction | Traceability dataset | English | 1834 | 2021 |
| PL+NL | Code Generation | Concode data | Java | 100000 | 2018 |
|  |  | DJANGO | Python | 18805 | 2015 |
|  |  | JUICE-10K | Python | 13946 | 2019 |
|  |  | MBPP | Python | 974 | 2021 |
|  |  | Spider | SQL | 5693 | 2018 |
|  |  | APPS | Python | 232421 | 2021 |
|  |  | CodeContests | C++/Python/Java | 13610 | 2022 |
|  |  | HumanEval | Python |  | 2021 |
|  | Code Summarization | Code review comments (CR) |  | 1600 | 2017 |
|  |  | Code Summarization (CS) | Java | 1953940 | 2020 |
|  |  | CodeSearchNet | Ruby/JavaScript/Go/Python/Java/PHP |  | 2019 |
|  |  | Java projects from GitHub | Java | 134239 |  |
|  |  | PY150 | Python | 30000 | 2016 |
|  | Code Search | AdvTest dataset | Python | 280634 | 2021 |
|  |  | CoNaLa | Python/Java | 79809 | 2018 |
|  |  | SO-DS | Python | 13250 | 2020 |
|  |  | StaQC | Python | 147546 | 2018 |
|  |  | CoSQA | Python | 20604 | 2021 |
|  | Code Review | CodeReview data | Python/Java/Go/C++/JavaScript/C/C#/PHP/Ruby |  | 2022 |
|  | Synthesis | CodeXGLUE |  |  | 2021 |
| CV | UML Diagram Classification | UML Diagram |  | 14815 | 2016 |

4. SE-related Tasks Using PTMs and Their Performance

Because of the powerful learning ability of PTMs, SE researchers have applied them to various SE-related tasks. We summarized and analyzed these SE-related tasks and, based on the type of input data, divided them into four types: programming language (PL) related tasks, natural language (NL) related tasks in the SE domain, tasks on the interaction between PL and NL, and image-related tasks in the SE domain.


Fig. 3 Distribution of downstream tasks with pre-trained models in software engineering

4.1 PL-related Tasks

PL-related tasks solve SE problems by studying the syntactic and semantic feature representations of source code. The main tasks and their reported performance are listed in the table below.
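
The classification metrics reported in the table (Accuracy, Precision, Recall, F1, MCC) can be reproduced from a fine-tuned model's predictions in the standard way; the sketch below uses scikit-learn with placeholder labels, so the printed scores are illustrative only.

```python
# Computing the classification columns of the table from a fine-tuned model's predictions.
from sklearn.metrics import (accuracy_score, precision_recall_fscore_support,
                             matthews_corrcoef)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # gold labels of a downstream test set (placeholder)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # labels predicted by the fine-tuned PTM (placeholder)

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
mcc = matthews_corrcoef(y_true, y_pred)
print(f"Accuracy={acc:.2f} Precision={prec:.2f} Recall={rec:.2f} F1={f1:.2f} MCC={mcc:.2f}")
```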

| Task | Sub-task | PTM | Accuracy | Precision | Recall | F1 | MAP | BLEU | EM | CodeBLEU | MCC | EditSIM | Number of fixed bugs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Code classification | Commit classification | BERT | 0.80 | 0.84 | 0.75 | 0.79 |  |  |  |  |  |  |  |
|  |  | seBERT |  | 0.87 | 0.85 | 0.84 |  |  |  |  |  |  |  |
|  |  | BERTOverflow |  | 0.84 | 0.81 | 0.81 |  |  |  |  |  |  |  |
|  |  | BERT-BASE |  | 0.77 | 0.73 | 0.75 |  |  |  |  |  |  |  |
|  | Algorithm classification | CodeBERT |  | 0.85 |  |  | 0.83 |  |  |  |  |  |  |
|  |  | RoBERTa |  | 0.83 |  |  | 0.80 |  |  |  |  |  |  |
|  |  | COMBO |  |  |  |  | 0.74 |  |  |  |  |  |  |
|  |  | ResNet18 | 0.864 | 0.858 | 0.847 | 0.822 |  |  |  |  |  |  |  |
|  |  | ResNet50 | 0.90 | 0.86 | 0.87 | 0.86 |  |  |  |  |  |  |  |
| Technical Debt Detection |  | BERT |  |  |  | 0.82 |  |  |  |  |  |  |  |
|  |  | BERT-SO-1M |  |  |  | 0.82 |  |  |  |  |  |  |  |
|  |  | StackOBERTflow |  |  |  | 0.81 |  |  |  |  |  |  |  |
|  |  | BERT-comments |  |  |  | 0.81 |  |  |  |  |  |  |  |
| Vulnerability Detection |  | GraphCodeBERT |  | 0.92 | 0.92 | 0.92 |  |  |  |  |  |  |  |
|  |  | CuBERT | 0.72 |  |  |  |  |  |  |  |  |  |  |
|  |  | COMBO |  |  |  | 0.67 |  |  |  |  |  |  |  |
|  |  | C-BERT | 0.62 |  |  |  |  |  |  |  |  |  |  |
|  |  | CodeBERT | 0.64 |  |  | 0.54 |  |  |  |  |  |  |  |
|  |  | PLBART | 0.63 |  |  |  |  |  |  |  |  |  |  |
|  |  | ResNet18 |  | 0.89 | 0.89 | 0.89 |  |  |  |  |  |  |  |
|  |  | ResNet50 |  | 0.91 | 0.91 | 0.91 |  |  |  |  |  |  |  |
| Defect Detection |  | CodeT5-base | 0.66 |  |  |  |  |  |  |  |  |  |  |
|  |  | CodeT5-small | 0.63 |  |  |  |  |  |  |  |  |  |  |
|  |  | CodeT5 | 0.64 |  |  | 0.60 |  |  |  |  | 0.27 |  |  |
|  |  | PLBART | 0.63 |  |  |  |  |  |  |  |  |  |  |
|  |  | CuBERT | 0.95 |  |  |  |  |  |  |  |  |  |  |
|  |  | RoBERTa (code) | 0.61 |  |  |  |  |  |  |  |  |  |  |
|  |  | BERT | 0.76 |  |  |  |  |  |  |  |  |  |  |
|  |  | CodeBERT | 0.68 |  |  | 0.54 |  |  |  |  | 0.27 |  |  |
|  |  | CodeBERTa | 0.70 |  |  | 0.59 |  |  |  |  | 0.27 |  |  |
|  |  | GraphCodeBERT | 0.71 |  |  |  |  |  |  |  |  |  |  |
|  |  | CODE-MVP | 0.89 |  |  |  |  |  |  |  |  |  |  |
|  |  | SynCoBERT | 0.65 |  |  |  |  |  |  |  |  |  |  |
| Clone Detection |  | RoBERTa |  | 0.97 | 0.96 | 0.96 |  |  |  |  |  |  |  |
|  |  | CodeBERT | 0.97 | 0.96 | 0.96 | 0.96 | 0.10 |  |  |  |  |  |  |
|  |  | GraphCodeBERT | 0.97 | 0.97 | 0.97 | 0.97 |  |  |  |  |  |  |  |
|  |  | CodeT5-small |  |  |  | 0.97 |  |  |  |  |  |  |  |
|  |  | CodeT5-base |  |  | 0.95 | 0.97 |  |  |  |  |  |  |  |
|  |  | RoBERTa (code) |  |  |  | 0.95 |  |  |  |  |  |  |  |
|  |  | PLBART |  |  | 0.95 | 0.97 |  |  |  |  |  |  |  |
|  |  | code2vec |  | 0.82 | 0.40 | 0.60 |  |  |  |  |  |  |  |
|  |  | T5 |  |  |  |  | 0.70 |  |  |  |  |  |  |
|  |  | OSCAR |  |  |  |  | 0.49 |  |  |  |  |  |  |
|  |  | COMBO |  |  | 0.64 |  |  |  |  |  |  |  |  |
|  |  | InferCode |  | 0.90 | 0.56 | 0.75 |  |  |  |  |  |  |  |
|  |  | SynCoBERT |  | 0.97 | 0.98 | 0.97 | 0.88 |  |  |  |  |  |  |
|  |  | SCodeR |  | 0.95 | 0.96 | 0.95 | 0.92 |  |  |  |  |  |  |
|  |  | UniXcoder |  | 0.98 | 0.93 | 0.95 | 0.91 |  |  |  |  |  |  |
| Program Repair |  | CodeT5-base |  |  |  |  |  | 0.77 | 0.22 |  |  |  |  |
|  |  | CodeT5-small |  |  |  |  |  | 0.76 | 0.19 |  |  |  |  |
|  |  | RoBERTa | 0.75 |  |  |  |  |  |  |  |  |  |  |
|  |  | RoBERTa (code) |  |  |  |  |  | 0.77 | 0.16 |  |  |  |  |
|  |  | CodeBERT | 0.72 |  |  |  |  |  |  |  |  |  |  |
