This is an implementation of the model described in: TreeBERT: A Tree-Based Pre-Trained Model for Programming Language.
In this paper, we propose TreeBERT, a tree-based pre-trained model for programming language. TreeBERT follows the Transformer encoder-decoder architecture. To enable the Transformer to exploit the tree structure, we represent the AST corresponding to a code snippet as the set of paths from the root node to the terminal nodes and take this path set as the input, and we introduce node position embedding to encode each node's position in the tree. We propose a hybrid objective applicable to ASTs to learn syntactic and semantic knowledge: tree masked language modeling (TMLM) and node order prediction (NOP). TreeBERT can be applied to a wide range of PL-oriented generation tasks by fine-tuning, without extensive modifications to the task-specific architecture.
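To make the path-set input concrete, here is a minimal sketch of extracting root-to-terminal-node paths from a toy snippet. It is an illustration only: it uses Python's built-in `ast` module, whereas this repository's preprocessing (`dataset/ParseTOASTPath.py`) is tree-sitter based and produces its own node/token format.

```python
# Illustration of the "AST as a set of root-to-terminal-node paths" idea.
# Not the exact representation produced by dataset/ParseTOASTPath.py.
import ast

def root_to_leaf_paths(node, prefix=()):
    prefix = prefix + (type(node).__name__,)
    children = list(ast.iter_child_nodes(node))
    if not children:            # terminal node: emit the full path
        yield prefix
        return
    for child in children:
        yield from root_to_leaf_paths(child, prefix)

code = "def add(a, b):\n    return a + b"
for path in root_to_leaf_paths(ast.parse(code)):
    print(" -> ".join(path))
# e.g. Module -> FunctionDef -> arguments -> arg
#      Module -> FunctionDef -> Return -> BinOp -> Name -> Load
```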
- python3
- numpy
- tree-sitter
- torch
- tqdm
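The dependencies can typically be installed with pip; exact versions are not pinned here, so treat this as a starting point rather than a tested environment:

```bash
pip install numpy torch tqdm tree_sitter
```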
The pre-training dataset we use is the Python and Java pre-training corpus published by CuBERT.
By running dataset/ParseTOASTPath.py
you can transform each code snippet into an AST, extract the paths from the root node to the terminal nodes of that AST, and standardize the format of the target code snippet.
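As a rough sketch of the parsing step (not the actual contents of dataset/ParseTOASTPath.py), a snippet can be parsed into a tree-sitter AST as shown below; the grammar paths are placeholders you would need to adapt, and they assume a local clone of the tree-sitter-python grammar.

```python
from tree_sitter import Language, Parser

# Build and load the Python grammar (placeholder paths, not shipped with this repo).
Language.build_library("build/langs.so", ["vendor/tree-sitter-python"])
PY_LANGUAGE = Language("build/langs.so", "python")

parser = Parser()
parser.set_language(PY_LANGUAGE)

tree = parser.parse(b"def add(a, b):\n    return a + b\n")
# The preprocessing walks a tree like this to collect root-to-terminal-node paths.
print(tree.root_node.sexp())
```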
In method name prediction, we evaluate TreeBERT on Python and Java: the Python dataset is py150, and the Java datasets are java-small, java-med, and java-large.
These datasets can be processed into the form required by this task by running dataset/Get_FunctionDesc.py.
{
    "function": <the function, with its function name replaced by "__">,
    "label": <the function name>
}
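For example, a processed record might look like this (illustrative values, not taken from the actual datasets):

```json
{
    "function": "def __(a, b):\n    return a + b",
    "label": "add"
}
```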
In the code summarization (comment generation) task, we use the Java dataset provided by DeepCom to fine-tune our model.
python dataset/vocab.py -c /home/pretrain_data_AST/ -o data/vocab.large -f 2 -m 32000
python __main__.py -td /home/pretrain_data_AST_train/ -vd /home/pretrain_data_AST_test/ -v data/vocab.large -o output/treebert.model --with_cuda True
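The command-line flags are not documented in this section; from the commands above, `-c` appears to be the directory of AST-path pre-training data, `-f` the minimum token frequency, `-m` the maximum vocabulary size, `-td`/`-vd` the processed training and test data directories, `-v` the vocabulary file built in the previous step, `-o` the output path for the pre-trained model, and `--with_cuda` the GPU toggle.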