diff --git a/README.md b/README.md
index 4c562fd9f..7403f2b50 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@
  * [Benchmark](#benchmark)
  * [Main Requirements](#main-requirements)
  * [Installation](#installation)
+ * [How to extend the dataset](#how-to-extend-the-dataset)
  * [How to run](#how-to-run)
  * [Benchmark Result](#benchmark-result)
  * [Used Tools for Benchmarking](#used-tools-for-benchmarking)
@@ -265,6 +266,19 @@
 $ source venv/bin/activate
 $ pip install -qr requirements.txt
 ```
+### How to extend the dataset
+
+1. Find an interesting repository and commit.
+2. Add the following entry to snapshot.json:
+   ``` json
+   "{commit_hash}{any_padding_hex_symbols_to_64}": "https://github.com/org/repo",
+   ```
+3. Run download_data.py twice: the first run creates a meta file, the second downloads all files from the commit.
+4. Run CredSweeper on the downloaded data to obtain a report (preferably with the ``--ml_threshold 0`` argument).
+5. Run the benchmark on the report with the ``--fix`` option; all found values will be inserted into the meta.
+6. Review and correct the markup if necessary, produce an empty benchmark report for CI, and commit the changes.
+
+
 ### How to run
 ``` bash
 usage: python -m benchmark [-h] --scanner [SCANNER]
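
Step 2 of the added section asks for a key made of the commit hash followed by arbitrary hex padding up to 64 symbols. A minimal Python sketch of that step is shown below. It assumes that snapshot.json is a flat JSON object mapping such keys to repository URLs, which is inferred only from the entry shown in the diff. The helper names `make_snapshot_key` and `add_snapshot_entry` are hypothetical and are not part of the repository.

```python
import json
import string
from pathlib import Path

KEY_LENGTH = 64  # total key length required by step 2


def make_snapshot_key(commit_hash: str, pad_char: str = "0") -> str:
    """Pad a hex commit hash with one hex symbol up to KEY_LENGTH characters."""
    commit_hash = commit_hash.strip().lower()
    if not commit_hash or any(c not in string.hexdigits for c in commit_hash):
        raise ValueError("commit hash must be a non-empty hex string")
    if len(commit_hash) > KEY_LENGTH:
        raise ValueError(f"commit hash is longer than {KEY_LENGTH} symbols")
    if len(pad_char) != 1 or pad_char not in string.hexdigits:
        raise ValueError("pad_char must be a single hex symbol")
    return commit_hash.ljust(KEY_LENGTH, pad_char.lower())


def add_snapshot_entry(snapshot_path: str, commit_hash: str, repo_url: str) -> str:
    """Insert the padded key into the snapshot file and return the key.

    Assumption: the file holds one flat JSON object (key -> repository URL).
    """
    path = Path(snapshot_path)
    snapshot = json.loads(path.read_text(encoding="utf-8"))
    key = make_snapshot_key(commit_hash)
    snapshot[key] = repo_url
    path.write_text(json.dumps(snapshot, indent=4) + "\n", encoding="utf-8")
    return key
```

For example, `add_snapshot_entry("snapshot.json", commit_hash, "https://github.com/org/repo")` would write the padded key for a 40-symbol SHA-1 hash. Padding with zeros is only one choice, since the section allows any hex symbols.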