Make the deduplication configurable per parser #1500

Closed
ptrovatelli opened this issue Aug 16, 2019 · 11 comments

@ptrovatelli ptrovatelli (Contributor) commented Aug 16, 2019

As discussed a bit in #1478, I believe it would be useful to have a configurable hash_code computation per parser for the deduplication algorithm.
Over the past month, the deduplication was changed several times, with a mix of changes to the way the hash_code is computed and (more often) specific rules added to the deduplication algorithm itself, often breaking the deduplication for some parsers.
I'm hoping we could put in place a basic deduplication algorithm, based on hash_code only:

  • If the hash_code is the same, the vulnerability is a duplicate
  • else, it's not

The complexity itself would reside in the hash_code computation. This would be done outside of the parser (as it is now).
In order to satisfy all use cases, it would be useful to make the set of fields used in the hash_code computation configurable, in settings.py for example.
I'm thinking of 2 levels of configuration:

  • a configuration per family of parsers (plus, for each parser, a configuration stating which family it belongs to, or should that be hardcoded?)
  • an (optional) configuration per parser, to allow parser-specific behaviour (a minimal sketch follows this list)
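To make this concrete, here is a minimal sketch of what a field-driven hash computation and the hash_code-only duplicate check could look like. The names (HASHCODE_FIELDS, compute_hash_code, is_duplicate) are hypothetical, not existing DefectDojo code; the field list would come from the configuration sketched further down.

```python
import hashlib

# Hypothetical sketch: the list of fields to hash would come from the
# per-family / per-parser configuration in settings.py (see below).
HASHCODE_FIELDS = ['title', 'cwe', 'file_path', 'line']

def compute_hash_code(finding, fields=HASHCODE_FIELDS):
    # Concatenate the configured field values; missing or empty fields
    # simply contribute an empty string.
    raw = '|'.join(str(getattr(finding, field, '') or '') for field in fields)
    return hashlib.sha256(raw.encode('utf-8')).hexdigest()

def is_duplicate(new_finding, existing_finding):
    # The whole deduplication decision boils down to hash equality.
    return new_finding.hash_code == existing_finding.hash_code
```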

The families of parsers that I know of at this point (please correct me / complete the list; a possible settings.py layout is sketched after the list):

  • dynamic parsers (e.g. ZAP)
    • hash_code should be computed based on title, endpoints, CWE
  • static parsers (e.g. Sonar, Checkmarx)
    • hash_code should be computed based on title, CWE, file_path, line number (or not)
  • dependency analysis (e.g. Dependency Check)
    • hash_code should be computed based on title, CWE, file_path (or CVE and library, to be determined)
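A possible shape for that two-level configuration in settings.py, using the field sets above (the setting names and the family assignments are assumptions for illustration, not existing settings):

```python
# settings.py (sketch) -- illustrative names only.

# Level 1: one field list per family of parsers.
HASHCODE_FIELDS_PER_FAMILY = {
    'dynamic': ['title', 'endpoints', 'cwe'],
    'static': ['title', 'cwe', 'file_path', 'line'],
    'dependency': ['title', 'cwe', 'file_path'],  # or CVE + library, see below
}

# Family of each scan type (could also be hardcoded in each parser).
FAMILY_PER_SCANNER = {
    'ZAP Scan': 'dynamic',
    'SonarQube Scan': 'static',
    'Checkmarx Scan': 'static',
    'Dependency Check Scan': 'dependency',
}

# Level 2: optional per-parser override, e.g. a static parser for which we
# want to drop the line number so findings survive line shifts between scans.
HASHCODE_FIELDS_PER_SCANNER = {
    'Checkmarx Scan': ['title', 'cwe', 'file_path'],
}

def hashcode_fields_for(scan_type):
    if scan_type in HASHCODE_FIELDS_PER_SCANNER:
        return HASHCODE_FIELDS_PER_SCANNER[scan_type]
    return HASHCODE_FIELDS_PER_FAMILY[FAMILY_PER_SCANNER[scan_type]]
```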

Another use case that this logic could handle is tunable / better precision for static parsers.
Including the line number in the dedup fields for static parsers makes tracking a unique vuln very fragile: say we have a vuln at line 100; before the next scan we add some code above line 100, so our vuln is now at line 101 -> it will not be detected as a duplicate but as a new vulnerability.
Some people may want a more "aggressive" dedup system in order to attach information to a finding and make sure it persists across scans (they will want to remove the line number from the computation); other people may want a more exact count of findings. Hence the need for a configurable solution.
Also, some SAST tools already have advanced techniques to track a unique vulnerability even after code changes, using data flow graphs. Checkmarx, for example, can do this: I believe one of the fields pathId or SimilarityId can be used to track unique vulnerabilities. We could update the parser to extract those fields and put them in the dedup key in order to have unique-vuln tracking as clever as Checkmarx's (when analysing Checkmarx reports).
To accommodate the various fields that will exist in parsers without changing the model every time, we could add fields:
- custom_1, custom_2, custom_3
- and another table to associate each value with a label
(on this point I'll need to check how the current custom fields work; I may be able to reuse that functionality)

Edit 26/08: another (better) solution, I think, to avoid adding custom fields:

  • add a new field in dojo_finding that would hold the technical identifier from the source tool, for example unique_id_from_tool (for Checkmarx filled with "pathId", for Sonar filled with "key", ...)
  • add a configuration parameter that states, for each parser, whether the dedup should be done based on hash_code or based on the unique_id_from_tool (see the sketch after this list)
    That will allow keeping a hash_code filled with functional information in the database, in case we ever want to go back to deduplication based on functional fields.
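A minimal sketch of what that per-parser switch could look like (the setting name and values are illustrative, not existing code):

```python
# settings.py (sketch) -- illustrative names only.
DEDUPE_ALGO_HASH_CODE = 'hash_code'
DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL = 'unique_id_from_tool'

DEDUPLICATION_ALGORITHM_PER_PARSER = {
    'ZAP Scan': DEDUPE_ALGO_HASH_CODE,
    'Checkmarx Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL,   # pathId
    'SonarQube Scan': DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL,   # key
}

def is_duplicate(new_finding, existing_finding, scan_type):
    algo = DEDUPLICATION_ALGORITHM_PER_PARSER.get(scan_type, DEDUPE_ALGO_HASH_CODE)
    if algo == DEDUPE_ALGO_UNIQUE_ID_FROM_TOOL:
        # Only meaningful when the tool actually provides an identifier.
        return (new_finding.unique_id_from_tool is not None
                and new_finding.unique_id_from_tool == existing_finding.unique_id_from_tool)
    return new_finding.hash_code == existing_finding.hash_code
```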

A difficulty to tackle is that, as @Maffooch pointed out, there is already a "kind of" deduplication done at (some) parsers' level when importing vulnerabilities, in order not to import duplicates from the same file. I think this specific deduplication might be removed altogether in order to have a single deduplication system (if, per the rules of the "global deduplication", two vulns in the same file are duplicates, only one is imported), but I'm not sure right now -> more on that in a later post.

Edit 26/08: as discussed below (my post of August 18th), for SAST tools I think this problem should be alleviated by enforcing that one vulnerability in the source tool is imported into one and only one vulnerability in DefectDojo (thus, no pseudo-deduplication / aggregation done by the parser).

For debugging purposes, it would also be useful to add another field at the finding level with the key values that generated the hash (the value of the concatenated keys before applying the hash or whatever operation we apply before hashing). Otherwise it will be impossible to understand what's going on when the configuration changes and the hash value changes.

Edit 26/08: not too sure about this: keeping the whole value defeats the gain we get from having a hash_code (in terms of space savings in the database).
However, I think we'll need a mass recalculation of hash_codes: when changing the hash_code configuration, the user would run a specific command that re-computes all hash_codes.

@DefectDojo/defectdojo-moderators what do you think?
I can work on it but don't expect to have this right away :)

Any feedback is welcome.

@ptrovatelli ptrovatelli (Contributor Author) commented Aug 18, 2019

Good news: SonarQube is also able to track the same vulnerability across scans, at least across simple line-number shifts. The parameter we're going to want to use for deduplication is key from the JSON returned by /api/issues/search, which seems to uniquely identify a given vulnerability.
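For reference, a minimal sketch of pulling that key from the SonarQube web API (the endpoint and parameters are the standard /api/issues/search ones; the server URL, project key and token are placeholders):

```python
import requests

# Sketch: fetch the open vulnerabilities of a project and keep each issue's
# "key", which is the stable identifier we would store as unique_id_from_tool.
resp = requests.get(
    'https://sonarqube.example.com/api/issues/search',
    params={'componentKeys': 'my-project', 'types': 'VULNERABILITY'},
    auth=('my_api_token', ''),  # placeholder token
)
resp.raise_for_status()
issue_keys = [issue['key'] for issue in resp.json()['issues']]
```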

@ptrovatelli ptrovatelli (Contributor Author) commented Aug 18, 2019

Regarding Checkmarx, this is the current situation: if you've imported some Checkmarx reports, you must have noticed that the number of vulnerabilities in Checkmarx is higher than in DefectDojo (see for example the unit test "multiple_findings.xml"). This is because one vuln in Checkmarx, for each type of vulnerability, is for a given:

  • source (start of attack vector)
    • filename
    • line number
    • object (variable)
  • sink (end of attack vector)
    • filename
    • line number
    • object (variable)

In DefectDojo there is less information: we aggregate those multiple vulnerabilities from Checkmarx into a single one, based on sink line number (into line number) and sink filename (into file_path) only.
So if you have a typical SQL injection with a query built from 5 parameters, you'll have 5 vulns in Checkmarx but a single one in DefectDojo.
This creates misunderstanding for DefectDojo users and makes it almost impossible to verify that all vulnerabilities were imported correctly into DefectDojo, because the counts always differ.
It would also make it very complicated to track unique vulnerabilities using the Checkmarx pathId (or SimilarityId), because you'd have 5 distinct pathIds in Checkmarx but only one finding in DefectDojo. Which pathId are you going to import? To have something robust, you would need to import all pathIds, attach them to the finding, and when comparing 2 findings consider them duplicates if at least one of their pathIds is the same. That doesn't seem very efficient.

Conclusion: I think the prerequisite for a proper deduplication system for SAST parsers is to make sure that each vulnerability from the source tool is imported into one distinct finding in DefectDojo (and only one).
I'll see what I can do for Checkmarx, with existing fields or by adding new ones: for example we could concatenate all source and sink information into a "source" and a "sink" field on the finding (we need the details because having 5 findings in DefectDojo that differ only by a technical id used for deduplication wouldn't help).
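To illustrate, a rough sketch of a parser that emits one finding per Checkmarx path, concatenating the source and sink node details. The XML element and attribute names (Query, Path, PathNode, SimilarityId, PathId, FileName, Line, Name) reflect the Checkmarx report structure as I understand it, so treat them as assumptions to be verified against real reports:

```python
from xml.etree import ElementTree

def parse_checkmarx_paths(report_file):
    # Yields one finding dict per <Path>, i.e. one per Checkmarx vulnerability.
    tree = ElementTree.parse(report_file)
    for query in tree.getroot().iter('Query'):
        for path in query.iter('Path'):
            nodes = list(path.iter('PathNode'))
            if not nodes:
                continue
            source, sink = nodes[0], nodes[-1]
            yield {
                'title': query.get('name'),
                # Assumed attribute names on <Path>.
                'unique_id_from_tool': path.get('SimilarityId') or path.get('PathId'),
                'file_path': sink.findtext('FileName'),
                'line': sink.findtext('Line'),
                'sast_source': '{}:{} ({})'.format(
                    source.findtext('FileName'), source.findtext('Line'),
                    source.findtext('Name')),
                'sast_sink': '{}:{} ({})'.format(
                    sink.findtext('FileName'), sink.findtext('Line'),
                    sink.findtext('Name')),
            }
```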

@wurstbrot wurstbrot (Contributor) commented Aug 26, 2019

dependency analysis (ex dependency check)
hash_code should be computed based on title, cwe, file_path

Please don't! What happens if you adjust the title or correct the cwe (which, right now, is always the same anyway)? I recommend deduplication based on CVE and library (hashed with e.g. sha1).
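For example, a minimal sketch of that variant, assuming the finding carries a cve field and a library/component name (component_name is an assumed field name for illustration):

```python
import hashlib

def dependency_check_hash(finding):
    # Sketch: deduplicate on the vulnerability identifier plus the affected
    # library, so editing the title or correcting the CWE doesn't break dedup.
    cve = finding.cve or ''
    library = getattr(finding, 'component_name', '') or ''  # assumed field name
    return hashlib.sha1('{}|{}'.format(cve, library).encode('utf-8')).hexdigest()
```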

@ptrovatelli ptrovatelli (Contributor Author) commented Aug 26, 2019

@wurstbrot noted :) I threw those together very quickly; the point is that they differ for each kind of parser. I'll check this more thoroughly based on the current behavior and the ideal behavior!

@ptrovatelli ptrovatelli (Contributor Author) commented Aug 26, 2019

I have added a couple of adjustments to my initial post; see the paragraphs starting with "Edit 26/08".

@Maffooch Maffooch (Collaborator) commented Aug 26, 2019

Currently, the hash code is computed after a finding is initially saved to the db. It would be pretty easy to rig it up to compute on every save/modification. This way, we could debug by just looking at the database instead of relying on the saved-key idea to decode the hash. I don't believe it would be too labor intensive to compute hash codes on every save either.
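For illustration, a minimal sketch of what recomputing on every save could look like on the Finding model (compute_hash_code is the hypothetical helper sketched in the issue description; the real model obviously has many more fields):

```python
from django.db import models

class Finding(models.Model):
    # ... existing fields ...
    hash_code = models.CharField(max_length=64, null=True, blank=True, editable=False)

    def save(self, *args, **kwargs):
        # Recompute on every save so the stored hash always reflects the
        # current field values and the current dedup configuration.
        self.hash_code = compute_hash_code(self)  # hypothetical helper
        super(Finding, self).save(*args, **kwargs)
```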

@ptrovatelli ptrovatelli (Contributor Author) commented Aug 27, 2019

@Maffooch re-computing the hash_code upon save wouldn't be enough: when changing the configuration, you'll want the deduplication to work right away. Most of the findings imported in the past week, month or year won't ever be modified, so they would keep their outdated hash_code. You need to re-compute the whole database; maybe we could add a filter on product and/or dates and/or scanner in order to be a bit faster when the user knows that other products/dates/scanners don't matter.
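Something along these lines, as a sketch of such a management command (the command name, the options, the compute_hash_code helper and its import location are all assumptions for illustration, built on the Finding -> Test -> Engagement -> Product relations):

```python
# dojo/management/commands/recompute_hash_codes.py (sketch)
from django.core.management.base import BaseCommand

from dojo.models import Finding
from dojo.utils import compute_hash_code  # hypothetical helper, not an existing function


class Command(BaseCommand):
    help = 'Recompute hash_code for all findings, optionally filtered.'

    def add_arguments(self, parser):
        parser.add_argument('--product', type=int, help='restrict to one product id')
        parser.add_argument('--scanner', help='restrict to one scan type')

    def handle(self, *args, **options):
        findings = Finding.objects.all()
        if options.get('product'):
            findings = findings.filter(test__engagement__product_id=options['product'])
        if options.get('scanner'):
            findings = findings.filter(test__test_type__name=options['scanner'])
        for finding in findings.iterator():
            finding.hash_code = compute_hash_code(finding)
            finding.save(update_fields=['hash_code'])
```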

@ptrovatelli ptrovatelli (Contributor Author) commented Aug 27, 2019

FYI, I'm prioritizing work on reliable tracking of unique vulns for SAST scanners:

  • removing the aggregation at the Checkmarx scanner level (I'm keeping a "checkmarx legacy scan" and adding a new scanner without the aggregation)
  • adding that unique_id_from_tool and other missing fields in dojo_finding (I'm thinking: SAST_SOURCE and SAST_SINK, both holding, for Checkmarx, a concatenation of filename / line number / object for either source or sink)
  • checking the same for the sonar parser
  • adding the configuration per parser that says whether we want to deduplicate on hash_code or unique_id_from_tool

I'll look at the hash_code configuration per parser at a later time (dedup based on hash_code was more or less working not long ago; we should be able to put it back into a working state...)
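A sketch of the model-side additions this implies (the field names and sizes are my assumptions and still up for discussion):

```python
from django.db import models

class Finding(models.Model):
    # ... existing fields ...
    # Technical identifier from the source tool (Checkmarx pathId,
    # SonarQube issue key, ...); null for parsers that don't provide one.
    unique_id_from_tool = models.CharField(max_length=500, null=True, blank=True)
    # Human-readable source and sink of the attack vector for SAST findings,
    # e.g. "file:line (object)"; assumed names, could also be split into
    # separate file / line / object columns.
    sast_source = models.CharField(max_length=1000, null=True, blank=True)
    sast_sink = models.CharField(max_length=1000, null=True, blank=True)
```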

@ptrovatelli ptrovatelli (Contributor Author) commented Aug 28, 2019

ptrovatelli pushed a commit to ptrovatelli/django-DefectDojo that referenced this issue Sep 5, 2019
…ort technical id for checkmarx and sonar. default checkmarx scanner doesn't aggregate anymore
ptrovatelli pushed a commit to ptrovatelli/django-DefectDojo that referenced this issue Nov 25, 2019
  Create a new detailed scanner for checkmarx.
  Import technical id for checkmarx and sonar.
  Add nb_occurences to count aggregates
ptrovatelli added a commit to ptrovatelli/django-DefectDojo that referenced this issue Nov 25, 2019
ptrovatelli pushed a commit to ptrovatelli/django-DefectDojo that referenced this issue Nov 25, 2019
…ow the default one)

  Tune the configuration to allow cross-parser dedupe sonar vs checkmarx
  Configure dedupe for: dependency check, npm audit scan (DefectDojo#1640), whitesource scan(DefectDojo#1654)
  fix full database hash_code recompute and deduplication
ptrovatelli pushed a commit to ptrovatelli/django-DefectDojo that referenced this issue Nov 26, 2019
  Create a new detailed scanner for checkmarx.
  Import technical id for checkmarx and sonar.
  Add nb_occurences to count aggregates
ptrovatelli added a commit to ptrovatelli/django-DefectDojo that referenced this issue Nov 26, 2019
ptrovatelli pushed a commit to ptrovatelli/django-DefectDojo that referenced this issue Nov 26, 2019
…ow the default one)

  Tune the configuration to allow cross-parser dedupe sonar vs checkmarx
  Configure dedupe for: dependency check, npm audit scan (DefectDojo#1640), whitesource scan(DefectDojo#1654)
  fix full database hash_code recompute and deduplication
@stale stale bot commented Dec 4, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 4, 2019
@madchap madchap removed the stale label Dec 4, 2019
madchap added a commit to madchap/django-DefectDojo that referenced this issue Dec 23, 2019
  Create a new detailed scanner for checkmarx.
  Import technical id for checkmarx and sonar.
  Add nb_occurences to count aggregates
madchap added a commit to madchap/django-DefectDojo that referenced this issue Dec 23, 2019
…te unit tests
madchap added a commit to madchap/django-DefectDojo that referenced this issue Dec 23, 2019
…ow the default one)

  Tune the configuration to allow cross-parser dedupe sonar vs checkmarx
  Configure dedupe for: dependency check, npm audit scan (DefectDojo#1640), whitesource scan(DefectDojo#1654)
  fix full database hash_code recompute and deduplication
ptrovatelli pushed a commit to ptrovatelli/django-DefectDojo that referenced this issue Jan 10, 2020
  Create a new detailed scanner for checkmarx.
  Import technical id for checkmarx and sonar.
  Add nb_occurences to count aggregates
ptrovatelli added a commit to ptrovatelli/django-DefectDojo that referenced this issue Jan 10, 2020
ptrovatelli pushed a commit to ptrovatelli/django-DefectDojo that referenced this issue Jan 10, 2020
…ow the default one)

  Tune the configuration to allow cross-parser dedupe sonar vs checkmarx
  Configure dedupe for: dependency check, npm audit scan (DefectDojo#1640), whitesource scan(DefectDojo#1654)
  fix full database hash_code recompute and deduplication
@ptrovatelli ptrovatelli (Contributor Author) commented Jan 25, 2020

fixed by #1665
