You can find the paper here: https://arxiv.org/abs/2103.11626
Note: If you are facing Git LFS bandwidth issues, you can download the dataset from Zenodo instead: https://zenodo.org/record/6802730.
The `data` folder contains multiple folders and files:

- `repetition`: MSR datasets WITH <buggy code, fixed code> duplicate pairs
- `unique`: MSR datasets WITHOUT <buggy code, fixed code> duplicate pairs
- `sstubs(Large|Small).json`: the dataset in JSON format
- `sstubs(Large|Small)-(train|test|val).json`: the dataset splits in JSON format
- `split/(large|small)`: the dataset in text format (the format CodeBERT works with)
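The JSON files above hold the dataset as a plain JSON array, so a download can be sanity-checked from the shell. A minimal sketch, using a hypothetical sample file whose field names are assumptions for illustration, not taken from the dataset:

```shell
# Minimal sketch: a sstubs-style JSON file is a plain JSON array, so it can
# be inspected with Python from the shell. The sample file and its field
# names below are assumptions for illustration only.
cat > /tmp/sstubs_sample.json <<'EOF'
[{"bugType": "CHANGE_IDENTIFIER", "sourceBeforeFix": "int i = 0;", "sourceAfterFix": "int i = 1;"}]
EOF
python3 -c "import json; print(len(json.load(open('/tmp/sstubs_sample.json'))))"
```

The last command prints the number of <buggy code, fixed code> records in the file (here, 1).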
- Clone the repository

```shell
git lfs install
git clone https://github.com/EhsanMashhadi/MSR2021-ProgramRepair.git
```
- Download the CodeBERT model

```shell
cd MSR2021-ProgramRepair
git clone https://huggingface.co/microsoft/codebert-base
```
- Use the downloaded model's directory path as the `pretrained_model` variable in the script files
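For example, if CodeBERT was cloned into the repository root as above, the variable might be set like this (the exact path is an assumption; use the directory you cloned into):

```shell
# Sketch: point the training/evaluation scripts at the local CodeBERT clone.
# The path below is an example placeholder, not a real location.
pretrained_model="/path/to/MSR2021-ProgramRepair/codebert-base"
echo "$pretrained_model"
```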
- Install dependencies

```shell
pip install torch==1.4.0
pip install transformers==2.5.0
```
- Train the model with the MSR data

```shell
bash ./scripts/codebert/train.sh
```

- Evaluate the model

```shell
bash ./scripts/codebert/test.sh
```
- Install OpenNMT-py

```shell
pip install OpenNMT-py==2.2.0
```
- If you face conflicts between the PyTorch and CUDA versions, you can follow this link
- Preprocess the MSR data

```shell
bash ./scripts/simple-lstm/build_vocab.sh
```

- Train the model

```shell
bash ./scripts/simple-lstm/train.sh
```

- Evaluate the model

```shell
bash ./scripts/simple-lstm/test.sh
```
(This is the original version used to run the simple LSTM experiments in the paper.)
- Install the legacy OpenNMT-py

```shell
pip install OpenNMT-py==1.2.0
```

- Preprocess the MSR data

```shell
bash ./scripts/simple-lstm/legacy/preprocess.sh
```

- Train the model

```shell
bash ./scripts/simple-lstm/legacy/train.sh
```

- Evaluate the model

```shell
bash ./scripts/simple-lstm/legacy/test.sh
```
- You can change the `size` and `type` variables in the script files to run different experiments (`large` | `small`, `unique` | `repetition`).
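Switching an experiment amounts to editing two assignments in a script. A sketch with one possible combination (variable names from the note above; the chosen values are just an example):

```shell
# Sketch: one possible combination of the experiment variables.
size=small   # or: large
type=unique  # or: repetition
echo "running the ${size}/${type} experiment"
```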
- Check the CUDA and PyTorch compatibility.
- Assign the correct values for `CUDA_VISIBLE_DEVICES`, `gpu_rank`, and `world_size` based on your GPU numbers in all scripts.
- Run on CPU by removing the `gpu_rank` and `world_size` options in all scripts.
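For instance, a machine with two GPUs might be configured as follows (the specific values are assumptions; adjust them to your hardware):

```shell
# Sketch: example GPU configuration for a machine with two GPUs.
export CUDA_VISIBLE_DEVICES=0,1   # GPUs visible to the process
world_size=2                      # total number of GPUs used
gpu_rank=0                        # rank of this process's GPU
echo "GPUs=$CUDA_VISIBLE_DEVICES world_size=$world_size rank=$gpu_rank"
```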