apcl-research/funcom-useloss

Code for replicating the results of the paper "Semantic Similarity Loss for Neural Source Code Summarization"

Preparation

  • Please create a directory named outdir with three subdirectories named histories, models, and predictions.
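
    For example, on a Unix-like shell:

    mkdir -p outdir/histories outdir/models outdir/predictions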

  • Please download the model and config files from our Hugging Face profile. Put the files from the config directory into your local histories directory, and put the files from the funcom-java-long/funcom-java/funcom-python directory into your local models directory if you want to fine-tune models with SIMILE or BLEU.
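
    For example, the files can be fetched with git; the repository URL below is a placeholder, so substitute the actual repository from our Hugging Face profile:

    git lfs install
    git clone https://huggingface.co/apcl/funcom-useloss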

  • Note that the files from the config directory must be placed in the same directory as the outdir argument of train.py.
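
    Assuming outdir is the directory you pass as the outdir argument, the expected layout is:

    outdir/
    ├── histories/     (files from the config directory)
    ├── models/        (files from funcom-java-long/funcom-java/funcom-python)
    └── predictions/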

  • To set up your environment, run the following command. We recommend using a virtual environment, as shown below.

    pip install -r requirements.txt
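
    To create and activate a virtual environment first:

    python3 -m venv venv
    source venv/bin/activate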
    

Step 0 Dataset

  • We use three datasets for our experiments.
    • funcom-java: LeClair et al. [Arxiv]
    • funcom-java-long: Bansal et al. [Data]. Please download the q90 data and extract it to /nfs/projects/funcom/data/javastmt/q90 or the same directory as your --data argument in train.py.
    • funcom-python: We provide this dataset on our Hugging Face profile. Please download all files and put them in the same directory as your --data argument in train.py.
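
    For example, assuming the q90 download is a gzipped tarball named q90.tar.gz (the filename and archive format here are illustrative):

    mkdir -p /nfs/projects/funcom/data/javastmt/q90
    tar -xzf q90.tar.gz -C /nfs/projects/funcom/data/javastmt/q90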

Step 1 Training

  • To train a use-seq model, run the following command with your data and GPU options. Note that transformer-base refers to the Transformer model.

    python3 train.py --model-type={transformer-base | ast-attendgru | codegnngru | setransformer}-use-seq  --gpu=0 --batch-size=50 --data={your data path}
    

    For example, to train the transformer-base model on ./mydata:

    python3 train.py --model-type=transformer-base-use-seq  --gpu=0 --batch-size=50 --data=./mydata
    
  • To fine-tune a baseline model with SIMILE or BLEU for one epoch, run the following command.

    python3 train.py --model-type={transformer-base | ast-attendgru | codegnngru | setransformer}-{simile | bleu-base}  --gpu=0 --batch-size=50 --epochs=1 --data={your data path} --load-model --model-file={your model path}
    

    For example, to fine-tune the transformer-base model with SIMILE on ./mydata, using the model file transformer.h5 in ./mymodel:

    python3 train.py --model-type=transformer-base-simile  --gpu=0 --batch-size=50 --epochs=1 --data=./mydata --load-model --model-file=./mymodel/transformer.h5
    

Step 2 Predictions

  • Once training is done, the validation accuracy for each epoch is printed to the screen. Pick the epoch just before the biggest drop in validation accuracy. After you choose a model, run the following command to generate the prediction files.

    python3 predict.py {path to your model} --gpu=0 --data={your data path}
    

    For example, if your model path is outdir/models/transformer-base.h5 and your data path is ./mydata, run the following command.

    python3 predict.py outdir/models/transformer-base.h5 --gpu=0 --data=./mydata
    

Step 3 Metrics

  • We provide scripts for calculating the metrics that we report in the paper. The following commands compute the BLEU, METEOR, and USE scores, respectively.

    python3 bleu.py {path to your prediction file} --data={path to reference file}

    python3 meteor.py {path to your prediction file} --data={path to reference file}

    python3 use_score_v.py {path to your prediction file} --data={path to reference file}
    

    For example, if we want to compute the BLEU score, the prediction file is outdir/predictions/transformer-base.txt, and the reference file is in the ./mydata directory, the command is as follows.

    python3 bleu.py outdir/predictions/transformer-base.txt --data=./mydata
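
    The METEOR and USE score commands follow the same pattern:

    python3 meteor.py outdir/predictions/transformer-base.txt --data=./mydata

    python3 use_score_v.py outdir/predictions/transformer-base.txt --data=./mydata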
    

Citation

If you use this work in an academic paper, please cite the following:

@misc{su2023semantic,
      title={Semantic Similarity Loss for Neural Source Code Summarization}, 
      author={Chia-Yi Su and Collin McMillan},
      year={2023},
      eprint={2308.07429},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

PDF available here: https://arxiv.org/abs/2308.07429
