C4: Contrastive Cross-Language Code Clone Detection

Code and dataset for the paper "C4: Contrastive Cross-Language Code Clone Detection".

pair_train.jsonl contains the training dataset.

pair_valid.jsonl contains the validation dataset.

pair_test.jsonl contains the test dataset.
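The three files use the JSON Lines layout (one JSON value per line). A minimal sketch for inspecting them before training follows. It makes no assumption about the field names inside each record, since this README does not document them; the helper names `read_jsonl` and `summarize` are hypothetical and not part of this repository.

```python
import json
from pathlib import Path
from typing import Any, Iterator, List, Tuple


def read_jsonl(path: str) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line of the file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_number}: invalid JSON ({err})") from err


def summarize(path: str) -> Tuple[int, List[str]]:
    """Return the record count and the sorted keys of the first record.

    The key list is empty if the file is empty or the first record is not an object.
    """
    count = 0
    keys: List[str] = []
    for record in read_jsonl(path):
        if count == 0 and isinstance(record, dict):
            keys = sorted(record)
        count += 1
    return count, keys


if __name__ == "__main__":
    # File names are taken from this README; files that are missing are skipped.
    for name in ("pair_train.jsonl", "pair_valid.jsonl", "pair_test.jsonl"):
        if Path(name).exists():
            print(name, summarize(name))
```

Printing the keys of the first record is a quick way to confirm the schema that run_con.py expects before starting a long training run.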

The code is in run_con.py. To run the full pipeline (training, validation, and testing), run the following commands.

cd code
output=test 
lr=5e-5
batch_size=36
source_length=512
data_dir=./
output_dir=model/$output
train_file=./pair_train.jsonl
dev_file=./pair_valid.jsonl
test_file=./pair_test.jsonl
epochs=10
pretrained_model=microsoft/codebert-base  # for RoBERTa, use roberta-base

python run_con.py --do_train --do_eval --do_test --model_type roberta --model_name_or_path $pretrained_model --train_filename $train_file --dev_filename $dev_file --output_dir $output_dir --max_source_length $source_length --train_batch_size $batch_size --eval_batch_size $batch_size --learning_rate $lr --num_train_epochs $epochs --test_filename $test_file
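For launching the same run from Python (for example, to sweep over learning rates), the command can be assembled as an argument list. This is only a sketch: the helper names `build_command` and `run` are hypothetical, and the flags and default values are copied from the command above without checking them against run_con.py.

```python
import shlex
import subprocess
import sys
from typing import List


def build_command(
    output: str = "test",
    lr: str = "5e-5",
    batch_size: int = 36,
    source_length: int = 512,
    epochs: int = 10,
    pretrained_model: str = "microsoft/codebert-base",
    train_file: str = "./pair_train.jsonl",
    dev_file: str = "./pair_valid.jsonl",
    test_file: str = "./pair_test.jsonl",
) -> List[str]:
    """Build the run_con.py invocation shown in this README as an argv list."""
    return [
        sys.executable, "run_con.py",
        "--do_train", "--do_eval", "--do_test",
        "--model_type", "roberta",
        "--model_name_or_path", pretrained_model,
        "--train_filename", train_file,
        "--dev_filename", dev_file,
        "--output_dir", f"model/{output}",
        "--max_source_length", str(source_length),
        "--train_batch_size", str(batch_size),
        "--eval_batch_size", str(batch_size),
        "--learning_rate", lr,
        "--num_train_epochs", str(epochs),
        "--test_filename", test_file,
    ]


def run(command: List[str], workdir: str = "code") -> None:
    """Execute the command in workdir (mirrors the `cd code` step above)."""
    subprocess.run(command, check=True, cwd=workdir)


if __name__ == "__main__":
    # Only print the command; call run(...) explicitly to start training.
    print(shlex.join(build_command()))
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed training run raise an exception instead of failing silently.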

We ran the above code on 3 RTX 3090 GPUs; each epoch takes about 60 minutes on average.

Python environment

pytorch 1.9.0
transformers 4.4.0
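A small sketch for checking an existing environment against the versions listed above. It assumes Python 3.8 or later (for `importlib.metadata`) and that PyTorch is installed under its usual distribution name `torch`; the helper name `check_environment` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Versions taken from this README; "torch" is the pip distribution name of PyTorch.
EXPECTED = {"torch": "1.9.0", "transformers": "4.4.0"}


def check_environment(
    expected: Dict[str, str] = EXPECTED,
) -> Dict[str, Tuple[Optional[str], str, bool]]:
    """Map each package to (installed version or None, expected version, matches)."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed: Optional[str] = version(package)
        except PackageNotFoundError:
            installed = None
        # Drop a local build suffix such as "+cu111" before comparing.
        matches = installed is not None and installed.split("+")[0] == wanted
        report[package] = (installed, wanted, matches)
    return report


if __name__ == "__main__":
    for package, (installed, wanted, matches) in check_environment().items():
        status = "ok" if matches else "mismatch"
        print(f"{package}: installed={installed} expected={wanted} [{status}]")
```

A mismatch is not necessarily an error; the versions above are simply the ones this README lists.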

Repository: Chenning-Tao/C4