TreeBERT

This is an implementation of the model described in the paper TreeBERT: A Tree-Based Pre-Trained Model for Programming Language.
In this paper, we propose TreeBERT, a tree-based pre-trained model for programming languages. TreeBERT follows the Transformer encoder-decoder architecture. To enable the Transformer to exploit the tree structure, we represent the AST corresponding to a code snippet as the set of paths from the root node to the terminal nodes, and we introduce a node position embedding to encode each node's position in the tree. We propose a hybrid objective suited to ASTs for learning both syntactic and semantic knowledge: tree masked language modeling (TMLM) and node order prediction (NOP). TreeBERT can be applied to a wide range of PL-oriented generation tasks by fine-tuning, without extensive modifications to the task-specific architecture.
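
To make the path-set representation concrete, here is an illustration (not the repo's code) that enumerates root-to-terminal paths for a one-line snippet using Python's built-in ast module; the actual pipeline parses with tree-sitter, as described under pre-training data preparation below.

import ast

def root_to_terminal_paths(node, prefix=()):
    # One path is emitted per terminal node; every node contributes its type name.
    label = type(node).__name__
    children = list(ast.iter_child_nodes(node))
    if not children:
        yield prefix + (label,)
    for child in children:
        yield from root_to_terminal_paths(child, prefix + (label,))

for path in root_to_terminal_paths(ast.parse("x = y + 1")):
    print(" -> ".join(path))
# Module -> Assign -> Name -> Store
# Module -> Assign -> BinOp -> Name -> Load
# ... one line per terminal node in the AST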

Requirements

  • python3
  • numpy
  • tree-sitter
  • torch
  • tqdm

Pre-training Data Preparation

The pre-training dataset we use is the Python and Java pre-training corpus published by CuBERT.
Running dataset/ParseTOASTPath.py transforms each code snippet into an AST, extracts the paths from the root node to the terminal nodes, and normalizes the format of the target code snippet.
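
That script is the authoritative implementation; as a rough sketch of the extraction step, assuming py-tree-sitter's Language.build_library API and a locally cloned tree-sitter-python grammar (both assumptions, not taken from the repo):

from tree_sitter import Language, Parser

# Assumption: the Python grammar was cloned beforehand, e.g.
#   git clone https://github.com/tree-sitter/tree-sitter-python vendor/tree-sitter-python
Language.build_library("build/langs.so", ["vendor/tree-sitter-python"])
PY_LANGUAGE = Language("build/langs.so", "python")

parser = Parser()
parser.set_language(PY_LANGUAGE)

def extract_paths(code: bytes):
    # Depth-first walk collecting one path per terminal (leaf) node;
    # leaves are rendered as their source text, inner nodes as their type.
    tree = parser.parse(code)
    paths, stack = [], [(tree.root_node, [])]
    while stack:
        node, prefix = stack.pop()
        if node.children:
            stack.extend((child, prefix + [node.type]) for child in node.children)
        else:
            paths.append(prefix + [code[node.start_byte:node.end_byte].decode("utf8")])
    return paths

for p in extract_paths(b"x = a + 1"):
    print(" -> ".join(p))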

Fine-tuning Data Preparation

In method name prediction, we evaluate TreeBERT on two languages, Python and Java: the Python dataset is py150, and the Java datasets are java-small, java-med, and java-large. These datasets can be processed into the form required by the method name prediction task by running dataset/Get_FunctionDesc.py:

{
    "function": "<function body, with its own name replaced by '__'>",
    "label": "<function name>"
}
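
The exact preprocessing lives in dataset/Get_FunctionDesc.py; a minimal sketch of producing a record of this shape (make_record is a hypothetical helper, not a function from the repo) might look like:

import json
import re

def make_record(function_source: str, function_name: str) -> str:
    # Mask the function's own name with "__" and keep it as the label.
    masked = re.sub(r"\b" + re.escape(function_name) + r"\b", "__",
                    function_source, count=1)
    return json.dumps({"function": masked, "label": function_name})

print(make_record("def add(a, b):\n    return a + b", "add"))
# {"function": "def __(a, b):\n    return a + b", "label": "add"}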

In code summarization, we fine-tune our model on the Java dataset provided by DeepCom.

Pre-training

1. Create the vocabulary:

python dataset/vocab.py -c /home/pretrain_data_AST/ -o data/vocab.large -f 2 -m 32000
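
The flag semantics are not documented here; a plausible reading (an assumption, not a repo fact) is that -c points at the corpus directory, -o at the output file, -f sets a minimum token frequency, and -m caps the vocabulary size. Under those assumptions, the core of such a vocabulary builder would look roughly like:

from collections import Counter

def build_vocab(corpus_lines, min_freq=2, max_size=32000):
    # Count whitespace-separated tokens, reserve special symbols, then keep
    # the most frequent tokens that clear the frequency threshold.
    counts = Counter(tok for line in corpus_lines for tok in line.split())
    specials = ["<pad>", "<unk>", "<mask>", "<sos>", "<eos>"]
    kept = [tok for tok, c in counts.most_common() if c >= min_freq]
    kept = kept[: max_size - len(specials)]
    return {tok: idx for idx, tok in enumerate(specials + kept)}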

2. Train TreeBERT using a GPU:

python __main__.py -td /home/pretrain_data_AST_train/ -vd /home/pretrain_data_AST_test/ -v data/vocab.large -o output/treebert.model --with_cuda True
