Identifying Redundancies in Fork-based Development


IdeNTifying RedUndancies in Fork-based DEvelopment

Python library dependencies:

sklearn, numpy, SciPy, matplotlib, gensim, nltk, bs4, flask, GitHub-Flask

Configuration:

LOCAL_DATA_PATH (for storing fetched data locally)
access_token (for using the GitHub API to fetch data)
model_path (for storing the model locally)
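
A minimal sketch of how these three settings might be supplied, assuming they are read from environment variables. The environment-variable names, the default paths, and the `load_config` helper are hypothetical; the project's actual configuration mechanism may differ.

```python
import os


def load_config(env=None):
    """Return the three settings as a dict (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # Where data fetched from GitHub is cached locally.
        "LOCAL_DATA_PATH": env.get("LOCAL_DATA_PATH", "./local_data"),
        # Token for the GitHub API; an empty string means unauthenticated.
        "access_token": env.get("GITHUB_ACCESS_TOKEN", ""),
        # Where the trained model is stored locally.
        "model_path": env.get("MODEL_PATH", "./model"),
    }
```

Keeping the token in the environment rather than in a tracked file avoids committing it by accident.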


[dupPR]: Reference paper: Yu, Yue, et al. "A dataset of duplicate pull-requests in github." Proceedings of the 15th International Conference on Mining Software Repositories. ACM, 2018. (including: 2323 duplicate PR pairs in 26 repos)

dupPR for training set

dupPR for testing set

Non-duplicate PRs for training set

Non-duplicate PRs for testing set

labeled results for RQ1 precision evaluation


  1. python data/random_sample_select_pr.txt 400

    (It will generate data/random_sample_select_pr.txt using random sampling)

  2. python

    (It will take data/random_sample_select_pr.txt & data/clf/second_msr_pairs.txt as input, and write the output into files: evaluation/random_sample_select_pr_result.txt & evaluation/msr_second_part_result.txt)

  3. Manually label the output file evaluation/random_sample_select_pr_result.txt by adding Y/N/Unknown at the end (see evaluation/random_sample_select_pr_result_example.txt for an example)

  4. python

    (It will print precision & recall at different thresholds to stdout.)
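
The last step above can be sketched as follows. This is an illustration only: it assumes each labeled result reduces to a (score, label) pair and that recall is measured on the scores given to known duplicate pairs. The function names and data shapes are hypothetical, not the project's own.

```python
def precision_at(labeled, threshold):
    """labeled: iterable of (score, label) with label in {'Y', 'N', 'Unknown'}."""
    # 'Unknown' labels are ignored, since they count neither for nor against.
    kept = [lab for score, lab in labeled if score >= threshold and lab in ("Y", "N")]
    if not kept:
        return None  # nothing labeled at or above this threshold
    return kept.count("Y") / len(kept)


def recall_at(known_dup_scores, threshold):
    """known_dup_scores: scores assigned to pairs known to be duplicates."""
    if not known_dup_scores:
        return None
    hits = sum(1 for s in known_dup_scores if s >= threshold)
    return hits / len(known_dup_scores)


labeled = [(0.9, "Y"), (0.8, "N"), (0.7, "Y"), (0.4, "Unknown"), (0.3, "N")]
known = [0.95, 0.6, 0.2, 0.85]
for t in (0.5, 0.75):
    print(t, precision_at(labeled, t), recall_at(known, t))
```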


  1. python data/clf/second_msr_pairs.txt

    python data/clf/second_nondup.txt

    (It will take data/clf/second_msr_pairs.txt & data/clf/second_nondup.txt as input, and write the output into files: evaluation/second_msr_pairs_history.txt & evaluation/second_nondup_history.txt.)

  2. python

    (It will print precision, FPR, and saved commits at different thresholds to stdout.)
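
A rough sketch of the precision and false positive rate (FPR) computation, assuming the two history files reduce to one list of scores for duplicate pairs and one for non-duplicate pairs. The function name and input shape are hypothetical. The "saved commits" metric is not sketched because its definition is not given here.

```python
def precision_and_fpr(dup_scores, nondup_scores, threshold):
    """Treat a pair as predicted-duplicate when its score reaches the threshold."""
    tp = sum(1 for s in dup_scores if s >= threshold)     # true positives
    fp = sum(1 for s in nondup_scores if s >= threshold)  # false positives
    precision = tp / (tp + fp) if (tp + fp) else None
    fpr = fp / len(nondup_scores) if nondup_scores else None
    return precision, fpr
```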


  1. python new

    python old

    (It will take data/clf/second_msr_pairs.txt as input, and write the output into files: result_on_topk_new.txt & result_on_topk_old.txt)

  2. python new

    python old

    (It will print topK recall for our method and another method to stdout.)
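
For reference, top-K recall can be sketched as below, assuming each query PR comes with a ranked candidate list and one known true duplicate. The function name and data shape are hypothetical.

```python
def topk_recall(rankings, k):
    """rankings: list of (true_pr, ranked_candidates), best candidate first."""
    if not rankings:
        return None
    # A query counts as a hit when its true duplicate is among the first k candidates.
    hits = sum(1 for true_pr, cands in rankings if true_pr in cands[:k])
    return hits / len(rankings)
```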


  1. python data/small_sample_for_precision.txt 70

    (It will generate data/small_sample_for_precision.txt using random sampling)

    python data/clf/second_msr_pairs.txt data/small_sample_for_recall.txt 200

    (It will generate data/small_sample_for_recall.txt using random sampling)

  2. python

    (It will take data/small_sample_for_precision.txt & data/small_sample_for_recall.txt as input, and write the output into files: evaluation/small_sample_for_precision.txt_XXXX.out.txt & evaluation/small_sample_for_recall.txt_XXXX.out.txt)

  3. Manually label all the output files evaluation/small_sample_for_precision.txt_XXXX.out.txt by adding Y/N/Unknown at the end (see evaluation/small_sample_for_precision.txt_new_example.out for an example)

  4. python

    (It will print precision for all the leave-one-out models under a fixed recall to stdout.)
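
One way to read "precision under a fixed recall" is sketched below: for each model, pick the highest threshold that still recovers the target share of known duplicates, then measure precision on the labeled sample at that threshold. This is an assumption about the procedure, and all names and data shapes are hypothetical.

```python
import math


def threshold_for_recall(known_dup_scores, target_recall):
    """Highest threshold at which recall on known duplicates >= target_recall."""
    if not known_dup_scores:
        raise ValueError("need at least one known duplicate score")
    ranked = sorted(known_dup_scores, reverse=True)
    # round() guards against float noise such as 0.7 * 10 == 7.000000000000001
    needed = max(1, math.ceil(round(target_recall * len(ranked), 9)))
    return ranked[needed - 1]


def precision_at_fixed_recall(models, target_recall):
    """models: {name: (known_dup_scores, labeled)}, labeled = [(score, 'Y'/'N'/'Unknown')]."""
    result = {}
    for name, (known, labeled) in models.items():
        t = threshold_for_recall(known, target_recall)
        kept = [lab for s, lab in labeled if s >= t and lab in ("Y", "N")]
        result[name] = kept.count("Y") / len(kept) if kept else None
    return result
```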

Main API:

python repo # detect all the PRs of repo
python repo pr_num # detect one PR

python repo # detect all the open PRs of repo

python repo1 repo2 # detect the PRs between repo1 and repo2

python result_file # print html for the PR pairs

Classification Model using Machine Learning.

# Set up the input dataset
c = classify()

init_model_with_repo(repo) # prepare for prediction

Natural Language Processing model for calculating the text similarity.

m = Model(texts)
text_sim = query_sim_tfidf(tokens1, tokens2)

Calculate the similarity for feature extraction.
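
To illustrate the kind of TF-IDF text similarity that `query_sim_tfidf(tokens1, tokens2)` computes, here is a minimal standard-library sketch of TF-IDF weighting with cosine similarity. It is not the project's implementation. The helper names, the smoothed IDF formula, and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def build_idf(corpus):
    """corpus: list of token lists. Returns a smoothed IDF weight per token."""
    n = len(corpus)
    df = Counter(tok for doc in corpus for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; tokens unseen in the corpus are dropped."""
    tf = Counter(tokens)
    return {tok: cnt * idf[tok] for tok, cnt in tf.items() if tok in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


corpus = [["fix", "typo"], ["fix", "crash"], ["add", "docs"]]
idf = build_idf(corpus)
sim = cosine(tfidf_vector(corpus[0], idf), tfidf_vector(corpus[1], idf))
```

Here `sim` falls strictly between 0 and 1 because the two token lists share only the common word "fix".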

# Set up the params of compare (different metrics).
# Check for init NLP model.
feature_vector = get_pr_sim_vector(pull1, pull2)

Detection on (open) pull requests.

detect.detect_one(repo, pr_num)

Detection on pull requests of cross-projects.

detect_on_cross_forks.detect_on_pr(repo_name)

Compare on granularity of commits.

About GitHub API setting and fetching.

              'fork' / 'pull' / 'issue' / 'commit' / 'branch',

get_pull(repo, num, renew)
get_pull_commit(pull, renew)
fetch_file_list(pull, renew)
get_another_pull(pull, renew)
check_too_big(pull)

Get data from API, parse the raw diff.

parse_diff(file_name, diff) # parse raw diff
fetch_raw_diff(url) # fetch raw diff from GitHub API
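
As a rough illustration of what parsing a raw diff involves, the sketch below splits unified-diff text into added and removed lines per file. It is not the project's `parse_diff`. The function name and the returned structure are hypothetical, and edge cases (deleted files shown as /dev/null, content lines that themselves begin with "-- " or "++ ") are not handled.

```python
def parse_unified_diff(diff_text):
    """Return {file_path: {"added": [...], "removed": [...]}} from unified-diff text."""
    files = {}
    current = None
    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            # New-file header, e.g. "+++ b/src/app.py"; strip the "b/" prefix.
            path = line[4:].strip()
            if path.startswith("b/"):
                path = path[2:]
            current = files.setdefault(path, {"added": [], "removed": []})
        elif line.startswith(("--- ", "diff --git", "@@")):
            continue  # old-file header, git header, or hunk header
        elif current is not None and line.startswith("+"):
            current["added"].append(line[1:])
        elif current is not None and line.startswith("-"):
            current["removed"].append(line[1:])
    return files
```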
