D2A Leaderboard

Vulnerability or defect detection is a major problem in software engineering, and many new ML models have recently been proposed to solve it. To support research in this area, we released the D2A dataset, which is built from Infer static analyzer bug reports on real-world C programs. D2A goes one step further and performs a differential analysis: it compares the before (potential vulnerability) and after (fix of the vulnerability) versions of the code and labels the before version as vulnerable if the analysis results change relative to the after version. Through this leaderboard we explore multiple ways to solve the vulnerability detection problem by identifying the real vulnerabilities among the many candidates generated by static analysis.

Data

All the data is derived from the original D2A dataset, which contains many kinds of data to support program analysis. To understand the impact of different data on predictions, we use them both in isolation and in combination. The D2A dataset contains labels derived from two sources, the auto-labeler and the after-fix extractor. The leaderboard uses both kinds of labels, where each label indicates whether a sample is a real (1) or false (0) bug report.

Details of the different kinds of data and the corresponding labels are given below. The D2A dataset contains much more information that is not used in the leaderboard tasks.

  1. Infer Bug Reports (Trace): This dataset consists of Infer bug reports, which combine English text and C source code snippets.
  2. Bug function source code (Function)
  3. Bug function source code, trace functions source code and bug function file URL (Code)

Tasks:

  1. Trace: A bug trace (bug report) contains both natural language and code; the code is limited to snippets from different functions and files. Models are expected to combine natural language and code snippets to make the prediction. The fields in the dataset are:
    1. id
    2. trace
    3. label (not present in test split)
  2. Code: Models can use source code from 3 fields (bug_function, functions and bug_url) to make the prediction. The file pointed to by bug_url must be downloaded in order to be used. The fields in the dataset are:
    1. id
    2. bug_url
    3. bug_function
    4. functions
    5. label (not present in test split)
  3. Trace + Code: Models can use all the fields from the previous two tasks to make the prediction. The fields in the dataset are:
    1. id
    2. bug_url
    3. bug_function
    4. functions
    5. trace
    6. label (not present in test split)
  4. Function: Models can use only the source code of the bug function to make the prediction. The functions are drawn from a different subset of the full D2A dataset, chosen to achieve a more balanced dataset. The fields in the dataset are:
    1. id
    2. code
    3. label (not present in test split)

All the data is in the d2a_leaderboard_dat.tar.gz tar file on the DAX data download page. Within the tar file, the data for each task is in a separate folder; for example, the data for the Function task is in a folder named function. You can find more information on each column in the data section.
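For orientation, the snippet below is a minimal sketch of loading one task's data after extracting the archive. The extraction path and the split file names (train.csv, dev.csv, test.csv) are assumptions and should be checked against the actual contents of the extracted folders.

    # Minimal sketch (assumed layout): extract the archive and load one task's splits.
    import tarfile
    import pandas as pd

    # Extract the archive downloaded from the DAX page (destination path is illustrative).
    with tarfile.open("d2a_leaderboard_dat.tar.gz") as tar:
        tar.extractall("data")

    # Each task has its own folder, e.g. "function" for the Function task.
    # The split file names below are assumptions; check the extracted folders.
    train = pd.read_csv("data/function/train.csv")   # expected columns: id, code, label
    dev = pd.read_csv("data/function/dev.csv")
    test = pd.read_csv("data/function/test.csv")     # the test split has no label column

    print(len(train), train.columns.tolist())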

Task and Data Summary

| Task | Data | Summary |
| --- | --- | --- |
| Trace | id, trace, label | Infer bug report combining natural language and code snippets |
| Code | id, bug_url, bug_function, functions, label | Bug function, trace functions and bug file source code |
| Trace + Code | id, bug_url, bug_function, functions, trace, label | All fields from the Trace and Code tasks |
| Function | id, code, label | Bug function source code only, from a more balanced subset |

Metrics

The datasets for the Trace + Code, Trace and Code tasks are derived from the auto-labeler generated samples and are quite unbalanced, with a 0:1 ratio of about 40:1. The Function dataset contains the functions marked 1 by the auto-labeler plus the matching after-fix 0 functions, so it is well balanced, with a 0:1 ratio of about 8:9. Because of these different distributions we use different metrics to measure the performance of models on the tasks.

  • Balanced Data: For the balanced Function dataset we use Accuracy to measure model performance.
  • Unbalanced Data: Because these datasets are so heavily unbalanced, Accuracy is not informative: a model that always predicts 0 would already achieve about 98% accuracy. Instead we use the two metrics described below (see the sketch after this list).
    • AUROC: Many open source project datasets are huge, with hundreds of thousands of examples but only thousands of label-1 examples, and the cost of verifying every label is high. It is therefore important that models rank samples by their confidence in the label, so we report the AUROC percentage.
    • F1 - 5% FPR: The macro-average F1 score is generally considered a good metric for unbalanced datasets. Since we want the ROC curve to rise as early as possible, we report the macro-average F1 score percentage at a 5% false positive rate (FPR).
  • Overall: To get the overall model performance, we compute the simple average of the percentage scores across all the tasks.
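For concreteness, the sketch below shows one way these metrics could be computed with scikit-learn. It is not the official evaluation.py; in particular, it assumes the F1 score is taken at the decision threshold where the ROC curve reaches 5% FPR, which may differ from the script's exact procedure.

    # Assumed metric implementation; the official evaluation.py may differ in details.
    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve, f1_score, accuracy_score

    def unbalanced_metrics(labels, probs, max_fpr=0.05):
        # AUROC percentage measures how well predicted probabilities rank real bugs first.
        auroc = roc_auc_score(labels, probs) * 100.0
        # Macro F1 percentage at the threshold where the ROC curve reaches ~5% FPR.
        fpr, tpr, thresholds = roc_curve(labels, probs)
        idx = np.where(fpr <= max_fpr)[0][-1]            # last ROC point with FPR <= 5%
        preds = (np.asarray(probs) >= thresholds[idx]).astype(int)
        f1 = f1_score(labels, preds, average="macro") * 100.0
        return {"AUC_SCORE": auroc, "F1_SCORE": f1}

    def balanced_metric(labels, preds):
        # Accuracy percentage for the balanced Function task.
        return {"ACCURACY": accuracy_score(labels, preds) * 100.0}

The overall score would then be the simple average of these percentages across the four tasks.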

Baselines

  • Augmented Static Analyzer: An ML-based ensemble of models that uses hand-crafted features (a generic sketch follows this list).
    • Voting
    • Linear Regression based Stacking

  • C-BERT: This is a BERT-based model pre-trained on C source code.

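As a rough illustration of what voting and stacking over hand-crafted features can look like, here is a generic scikit-learn sketch. The base models, features and the use of logistic regression as the linear meta-model are assumptions, not the actual baseline implementation.

    # Generic ensemble sketch over hand-crafted feature vectors X with labels y.
    # Base models and meta-model are illustrative choices, not the real baseline.
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression

    base_models = [
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("lr", LogisticRegression(max_iter=1000)),
    ]

    # Voting: average the base models' predicted probabilities.
    voting = VotingClassifier(estimators=base_models, voting="soft")

    # Stacking: a linear meta-model (logistic regression here) trained on base predictions.
    stacking = StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
    )

    # Usage: voting.fit(X_train, y_train); probs = voting.predict_proba(X_dev)[:, 1]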

Evaluation

A Python script is provided to compute the metrics presented above. For tasks 1, 2 and 3 the script outputs the AUROC (area under the ROC curve) and the macro F1 score; for task 4 it outputs accuracy. While test set labels are not provided, you can check your model's performance on the dev set using the evaluation script. Test set predictions for the submission should be formatted as shown below.

For all four tasks the predictions file must be in txt format, with one row per dev/test set example. The format of each row is:

  • <ID> <PROBABILITY> for tasks 1, 2 and 3.

     	1   0.67891
     	2   0.98765
     	3   0.54321
     	4   0.12345
    

    Example

     cd evaluation
     # provide as arguments the CSV file with the labels and the prediction file 
     python evaluation.py  ../data/sample_trace_dev.csv sample_predictions_trace.txt
    

    {'AUC_SCORE': 75.0, 'F1_SCORE': 73.33333333333334}

  • <ID> <BINARY PREDICTION> for task 4.

     	1   0
     	2   1
     	3   0
     	4   1
    

    Example

     cd evaluation
     # provide as arguments the CSV file with the labels, the prediction file and a parameter to specify the task
     python evaluation.py  ../data/sample_function_dev.csv sample_predictions_function.txt --single_function
    

    {'ACCURACY': 75.0}

Naive examples of splits can be found inside the folder data, while an example predictions file is available inside the folder evaluation. A sketch of writing prediction files in these formats follows.
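The sketch below shows one way to produce prediction files in the two formats above; the function names, output file names and the ids/probs/preds variables are placeholders.

    # Sketch: write prediction files in the two formats described above.
    def write_probability_predictions(ids, probs, path="predictions_trace.txt"):
        # Tasks 1, 2 and 3: one "<ID> <PROBABILITY>" row per dev/test example.
        with open(path, "w") as f:
            for example_id, prob in zip(ids, probs):
                f.write(f"{example_id}\t{prob:.5f}\n")

    def write_binary_predictions(ids, preds, path="predictions_function.txt"):
        # Task 4: one "<ID> <BINARY PREDICTION>" row per dev/test example.
        with open(path, "w") as f:
            for example_id, pred in zip(ids, preds):
                f.write(f"{example_id}\t{int(pred)}\n")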

Submission

The expected output is a probability score for each example in the test/dev set. The probability score is the probability that the example has label 1. Once your model is fully trained, you can check its performance on the dev set using the evaluation script. To get the performance on the test set, follow the steps below:

  1. Generate your prediction output for the dev set.
  2. Evaluate dev set predictions using the evaluation script described above.
  3. Generate your prediction output for the test set.
  4. Submit the following information by emailing saurabh.pujar@ibm.com.

Your email must include:

  1. Prediction results on test set.
  2. Individual/Team Name: Name of the individual or the team to appear in the leaderboard.
  3. Model information: Name of the model/technique to appear in the leaderboard.

We recommend that your email also include:

  1. Prediction results on dev set.
  2. Individual/Team Institution: Name of the institution of the individual or the team to appear in the leaderboard.
  3. Model code: Training code for the model.
  4. Publication Information: Name, citation and URL of the paper (if the model is from a published work) to appear in the leaderboard.

Cite

Please cite the D2A paper.