
Reimport feature: use the configurable deduplication for matching new findings to existing findings #3753

Merged: 7 commits, Mar 11, 2021

Conversation

@ptrovatelli (Collaborator) commented Jan 30, 2021:


This PR includes:

  • use the deduplication algorithm configured for each parser to identify findings during re-import (instead of title + severity)
  • remove specific code that was added for Veracode Scan and Arachni Scan.
    • It's no longer needed now that we can correctly identify findings through the dedupe configuration (it's kept only when using the legacy dedupe algorithm, in case someone has an old configuration without a dedupe configuration for those parsers)
  • removed the endpoints from the dedupe configuration for the dynamic parsers that had one (Nessus, ZAP, Qualys):
    • we agreed on Slack that it's better to identify findings without endpoints (and then handle endpoint status during re-import, if using the re-import feature).
    • Note that this will have an impact for those not using re-import. I believe it's better to identify findings without endpoints even without re-import, because slight variations in endpoints will always create "new" findings when it's really the same finding, and for people trying to track what happens with a finding this will cause problems. The drawback is that one needs to look at the endpoints to know whether there was a change or whether the finding is really the same.
  • removed the endpoints from the legacy hash_code computation algorithm
  • change the way we compute hash_code based on endpoints (a rough sketch of field-based hash_code computation follows this list):
    • (In most cases we advise not to include endpoints in the hash_code, and they are no longer included in the default shipped configuration, but it is still possible to include them by configuration)
    • before, we used host + port, which was not sufficient to identify an endpoint. For example, the endpoints https://mainsite.com and https://mainsite.com/dashboard were both hashed as https://mainsite.com:443https://mainsite.com:443 (i.e. the value https://mainsite.com:443 concatenated twice), which didn't make sense
    • Note that the legacy deduplication algorithm (the default algorithm) still uses endpoints for finding identification (even if the endpoints are not in the hash_code, the endpoints are examined specifically)
    • normalize endpoint URLs when computing hash_code: with the new endpoint value used, we had diffs where we shouldn't have, due to lack of standardization
  • handled the endpoint status in order to mitigate endpoints that are no longer present when re-importing
    • adding new endpoints was already working, but the mitigation was missing
  • improve dedupe.py stability by:
    • allowing deduplication of only a subset of parsers
    • allowing hash_code computation only, or deduplication only
  • fix the logic for the unique_id_from_tool_or_hash_code algorithm in the existing dedupe: we need to be sure that either unique_id_from_tool or hash_code is not None when we use it. Matching on a None value may match loads of findings (although it's really a borderline scenario, because these fields shouldn't be null for the parsers that use them)
  • use a separate volume for integration tests (to avoid login error when switching from unit tests to integration tests)
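To make the field-based idea above concrete, here is a minimal sketch of per-parser hash_code computation. The dictionary name mirrors the HASHCODE_FIELDS_PER_SCANNER setting in settings.dist.py, but the field lists and the helper itself are illustrative assumptions, not the actual DefectDojo implementation:

import hashlib

# Illustrative per-parser configuration; the shipped settings.dist.py uses a
# similar mapping, but the field lists here are examples only.
HASHCODE_FIELDS_PER_SCANNER = {
    'ZAP Scan': ['title', 'cwe', 'severity'],                  # endpoints intentionally excluded
    'Nessus Scan': ['title', 'severity', 'vuln_id_from_tool'],
}
DEFAULT_HASHCODE_FIELDS = ['title', 'cwe', 'line', 'file_path', 'description']

def compute_hash_code(scan_type, finding):
    # Concatenate the configured fields for this scan type and hash the result.
    fields = HASHCODE_FIELDS_PER_SCANNER.get(scan_type, DEFAULT_HASHCODE_FIELDS)
    fields_to_hash = ''.join(str(finding.get(field, '')) for field in fields)
    return hashlib.sha256(fields_to_hash.encode('utf-8')).hexdigest()

# Two ZAP findings that differ only by endpoint now get the same hash_code,
# so re-import can match them and reconcile endpoint status separately.
finding = {'title': 'X-Frame-Options Header Not Set', 'cwe': 1021, 'severity': 'Medium'}
print(compute_hash_code('ZAP Scan', finding))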

Important notice to include in the release note:
Due to:

  1. the change in hash_code formula for some parsers in settings.dist.py
  2. the removal of the endpoints from the hash_code in the default hash_code computation
  3. the change of hashing formula for hashing endpoints

If you're using deduplication or the re-import feature, the hash_code needs to be recomputed for the following parsers.
Due to 1):

  • Nessus
  • ZAP
  • Qualys

Due to 2):

  • Any dynamic parser without a hash_code configuration in settings.dist.py that you have been using

Due to 3):

  • Any dynamic parser for which you have specifically configured the hash_code to use endpoints

With the new syntax:

./manage.py dedupe --parser "Nessus Scan" --parser "ZAP Scan" --parser "Qualys Scan"

To recompute hash_code without deduplicating (sufficient if you're using re-import but not deduplication):
./manage.py dedupe --parser "Nessus Scan" --parser "ZAP Scan" --parser "Qualys Scan" --hash_code_only

Note that if deduplication is not enabled in the system settings, it will not occur regardless of the arguments passed to manage.py.

Comment on lines 29 to 36
elif deduplication_algorithm == 'unique_id_from_tool' or deduplication_algorithm == 'unique_id_from_tool_or_hash_code':
    # processing 'unique_id_from_tool_or_hash_code' as 'unique_id_from_tool' because when using 'unique_id_from_tool_or_hash_code'
    # we usually want to use 'hash_code' for cross-parser matching,
    # while it makes more sense to use the 'unique_id_from_tool' for same-parser matching
    return Finding.objects.filter(
        test=test,
        unique_id_from_tool=item.unique_id_from_tool).exclude(
        unique_id_from_tool=None).all()
@valentijnscholten (Member) commented:

should you not include the findings with the same hash code if there is no match on unique_id_from_tool?

@ptrovatelli (Collaborator, Author) replied:

i think i'd go for either:

  1. only unique_id_from_tool (because as explained in the comments, for the use cases I had in mind, when this configuration is on, it doesn't make much sense to match same parser findings on hash_code)
  2. both unique_id_from_tool and hash_code (more coherent with the actual configuration)

#2 is probably more logical.

But matching on hash_code only when the match on unique_id_from_tool doesn't find anything seems a bit unpredictable. I don't have a strong opinion on this right now; I need to throw the code at more tests.

@ptrovatelli (Collaborator, Author) replied:

@valentijnscholten I'll go for #2, both unique_id_from_tool and hash_code, because this is how it's done when deduplicating outside of re-import, so it's better to keep the same logic.

Will push that ASAP.
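A minimal sketch of option #2, matching on either identifier while guarding against None values (hypothetical helper name; the merged code may differ in details):

from django.db.models import Q
from dojo.models import Finding  # DefectDojo's finding model

def match_existing_findings(item, test):
    # Match within the same test on unique_id_from_tool and/or hash_code,
    # but never on a None value, since matching on None could pull in
    # unrelated findings.
    query = Q()
    if item.unique_id_from_tool is not None:
        query |= Q(unique_id_from_tool=item.unique_id_from_tool)
    if item.hash_code is not None:
        query |= Q(hash_code=item.hash_code)
    if not query:
        # Neither identifier is available: no safe way to match.
        return Finding.objects.none()
    return Finding.objects.filter(query, test=test)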

@damiencarol damiencarol marked this pull request as draft February 7, 2021 21:49
@damiencarol damiencarol changed the title WIP - reimport feature: use the configurable deduplication for matching new… Reimport feature: use the configurable deduplication for matching new findings to existing findings Feb 7, 2021
@ptrovatelli (Collaborator, Author) commented:

just pushed:

  • normalize endpoints urls when computing hash_code (with the new endpoint value used, we had diffs where we shouldn't have due to lack of standardization)
  • fix logic for the unique_id_from_tool_or_hash_code algorithm (existing dedupe + re-import, after Valentijn's remark)

@ptrovatelli (Collaborator, Author) commented Feb 22, 2021:

OK, I'm almost done but I have a dilemma. It's probably already like that in the current code: when we import a report which itself includes duplicates and then re-import the same report, the duplicate gets mitigated.

  • first import:
    • 1 active, verified
    • 1 inactive, duplicate
  • after reimport:
    • 1 active, verified
    • 1 inactive, mitigated, duplicate

It doesn't seem correct to me. I'd expect to have:

  • 1 active, verified
  • 1 inactive, duplicate (same status as before the reimport as we have just re-imported the same report)

The problem is that when matching findings, we only take the first one; the duplicates are assumed not to be in the report, when they actually still are.

What do you think? Maybe keep that for another PR? This one is big enough as it is...

Sample data with the Checkmarx parser attached:
checkmarx_duplicate_in_same_report.zip

@Brightside56 commented:

Can anyone tell me what to do if I want to have separate findings with the same fields for different endpoints in a project? Right now I'm importing scan results via API v2 (nmap scan results), but when I have the same ports open on different hosts, those open ports should be completely different findings; instead they're considered duplicates.

@ptrovatelli (Collaborator, Author) commented:

@Brightside56 it seems that you have already posted the question in Slack, which is the proper channel for that discussion. Let's move to Slack for the rest of the discussion :)

@Brightside56 replied:

I've updated the thread. To be honest, it's more convenient to track issue progress here than in a Slack thread =/

@valentijnscholten (Member) commented:

Added the breaking_changes label, as there is a risk that this might break some use cases and people will need to adjust to the new re-import logic, as well as the fact that endpoints are no longer taken into account in hash_code calculation.

@valentijnscholten valentijnscholten added the settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR label Feb 27, 2021
@valentijnscholten (Member) commented Feb 27, 2021:

Should there be some instructions in the "Upgrade notes" to tell people to also remove endpoints from any custom hash_code configuration in a local_settings.py or some (bind-)mounted settings.py or similar?

I think you mentioned there were still some known issues / corner cases. Maybe we should think about documenting them in a GitHub issue, or maybe transparently in the docs? (re-import of duplicates, endpoints still used by legacy, ...)

@ptrovatelli (Collaborator, Author) commented Feb 28, 2021:

@valentijnscholten it's always possible to use endpoints in the hash_code. I just removed them from the default ZAP, Nessus and Qualys configurations. For legacy (the default hash_code algorithm) they are still there; the hash_code is just computed a bit differently than before, so that we really extract what's specific to each endpoint, and not just host + port.

OK, there is a catch: the re-import will not work for dynamic findings if the hash_code algorithm includes endpoints. So I need to remove the endpoints from the default legacy hash_code computation algorithm, and I need to document somewhere that when the hash_code includes endpoints, re-import will not work properly for dynamic parsers.

I can create an issue regarding the specific use case I mentioned about re-importing a report that includes several duplicates: issue created (#3958).
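For deployments that do want endpoints in the hash_code despite the caveat above, an override along these lines in local_settings.py (or a mounted settings file) should still work; the key and field names mirror settings.dist.py, but verify them against the version you run:

# local_settings.py -- illustrative override, not shipped configuration.
HASHCODE_FIELDS_PER_SCANNER = {
    # Opting endpoints back in for ZAP; re-import will then treat endpoint
    # variations as distinct findings for this parser (see the caveat above).
    'ZAP Scan': ['title', 'cwe', 'severity', 'endpoints'],
}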

@valentijnscholten (Member) commented:

I don't want to generate extra work, but I think a small change to the dedupe.py command is needed. The default ordering for Finding is ordering = ('numerical_severity', '-date', 'title'). This means findings are not processed in chronological order, which means that older findings can be marked as duplicates of newer findings. Which might be confusing and is something we try to avoid in the deduplication usually. Usually the oldest finding is the "first" one (original) and might contain important notes, impact, mitigation info. So we probably should order the findings by id.

I also think that the current dedupe.py and do_dedupe_finding will perform deduplication even for findings that are already marked as duplicates.

Maybe the dedupe.py command should have a parameter to indicate that only the hash_code should be recalculated, but no deduplication should be performed. If you have 1 or 2 years' worth of findings, maybe you don't want the duplicate clusters that currently exist to be changed, and only want to update the hash_code to get good deduplication for future findings.

@ptrovatelli (Collaborator, Author) commented:

@valentijnscholten I believe you have two valid points, with the sorting and the possibility to recompute hash_code without deduplicating. However, I think the ordering should be by descending id: we need to examine the highest ids first, so that they are marked as duplicates of (older) lower-id findings. Shouldn't be too hard to change; I'll have a look after I fix the unit tests.

Regarding deduplicating already duplicated findings, it may be required in order to "unduplicate" a finding that is no longer a duplicate after the algorithm change. I haven't tested this use case though.

@valentijnscholten (Member) commented:

Ordering by id descending won't work that way, as the older findings will not have their hash code recalculated yet. We could take a two-step approach: first recalculate all hash codes, then (optionally) deduplicate.

@ptrovatelli (Collaborator, Author) commented Mar 1, 2021:

OK, it needs more thinking. Maybe we should use two separate scripts.

Turns out the endpoints are all over the place in the legacy dedupe algorithm, not only in the hash_code. Actually it's a good thing: it won't prevent re-import from correctly matching the findings, and the changes won't have that much impact on legacy dedupe when not using re-import.

@ptrovatelli (Collaborator, Author) commented Mar 2, 2021:

For dedupe.py, if we want to be really safe (a rough sketch follows this list):

  • first pass: update the hash_code without deduplicating. That won't be easy, though, because dedupe is automatic upon saving; we need a special flag on the save function to inhibit the dedupe event sent to Celery (and if we don't save, I think the hash_code recomputation will be lost)
  • second pass: by descending id, trigger the deduplication synchronously. Currently it's asynchronous, which will always introduce randomness in execution.
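A rough sketch of what those two passes could look like. The dedupe_option flag, the compute_hash_code model helper, the do_dedupe_finding location, and the parser filter are all assumptions taken from this thread, not a transcript of the merged dedupe.py:

from dojo.models import Finding            # DefectDojo's finding model
from dojo.utils import do_dedupe_finding   # assumed location of the synchronous dedupe helper

def recompute_hash_codes(parser_names):
    # Pass 1: recompute hash_code without triggering the async dedupe task.
    for finding in Finding.objects.filter(test__test_type__name__in=parser_names).order_by('id'):
        finding.hash_code = finding.compute_hash_code()  # assumed model helper
        finding.save(dedupe_option=False)                # assumed flag inhibiting the Celery dedupe event

def deduplicate(parser_names):
    # Pass 2: deduplicate from newest to oldest, so newer findings end up
    # marked as duplicates of older ones.
    for finding in Finding.objects.filter(test__test_type__name__in=parser_names).order_by('-id'):
        do_dedupe_finding(finding)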

@ptrovatelli ptrovatelli force-pushed the reimport-configurable-dedupe branch 3 times, most recently from eeafc4e to 671c737 Compare March 2, 2021 08:35
github-actions bot (Contributor) commented Mar 2, 2021:

Conflicts have been resolved. A maintainer will review the pull request shortly.

@ptrovatelli ptrovatelli force-pushed the reimport-configurable-dedupe branch 3 times, most recently from d206174 to f6c0880 Compare March 2, 2021 16:25
github-actions bot (Contributor) commented Mar 4, 2021:

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions bot (Contributor) commented Mar 5, 2021:

Conflicts have been resolved. A maintainer will review the pull request shortly.

@valentijnscholten (Member) commented:

Thanks @ptrovatelli, big PR. Let's hope nothing breaks ;-). I approved it because I would like to see it merged. There are two things to consider:

These can all be future PRs, but the first item (docs for 1.14.0) might be a "must have" for the release at the end of March.

@damiencarol (Contributor) commented:

@ptrovatelli you did awesome work! Many modifications in this PR are good and fix design problems in the current core features.
I have one concern/question regarding the new Hyperlink lib dependency. We have a few libs that are able to parse/modify URLs.
It would be good to double-check before adding a new one.
By the way, I analyzed hyperlink and it's a well-done lib and an active project.
Don't get me wrong, I don't want to block your PR, but we currently have problems with abandoned third-party libs and we are moving to remove dependencies, so we should analyze further whether we want to add another external lib.

@ptrovatelli (Collaborator, Author) commented:

@damiencarol no problem, you're right. I don't like to add dependencies either, but re-writing the URL normalization code seemed silly.
Amongst what we have:

  • urllib3 seems to only normalize upper/lower case (see https://pypi.org/project/urllib3/)
  • packageurl-python may normalize, but I can't find information on what it does exactly or how to use it

At first I came across "url_normalize", which seemed to do the job, but hyperlink seems to really be the mainstream project at the moment for this kind of task.
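A quick sketch of the kind of normalization hyperlink provides (assuming a recent hyperlink version; the normalization actually applied in the PR may do more than this, e.g. dot-segment and percent-encoding handling per RFC 3986):

import hyperlink

# hyperlink's normalize() lowercases the scheme and host, so equivalent
# spellings of the same endpoint compare (and therefore hash) equal.
url = hyperlink.URL.from_text(u"HTTPS://MainSite.com/Dashboard")
print(url.normalize().to_text())  # expected: https://mainsite.com/Dashboard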

@ptrovatelli (Collaborator, Author) commented:

@valentijnscholten OK, good to know. That is a syntax for bulk-updating the same value, though. I don't think it can be used for the hash_code, which is different and needs a specific computation for each row.

Regarding the performance of the reverse order, I don't think it should change much: in any case we'll be doing full table scans, reading all the disk blocks, so it shouldn't matter much whether we go in reverse or not.

I'll have a look at the doc.

@valentijnscholten (Member) commented:

It's not so much about the bulk update, but more about how to iterate over all findings.

@ptrovatelli (Collaborator, Author) commented:

@valentijnscholten OK. I don't know whether this syntax without bulk update will be faster; it'll need to be tested, maybe in a separate issue/PR?
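For reference, a sketch of walking all findings in chronological (id) order without materializing the whole queryset in memory; not necessarily what dedupe.py ended up doing, and the loop body is a placeholder for whatever per-finding work is needed:

from dojo.models import Finding  # DefectDojo's finding model

# .iterator() streams rows instead of caching the full queryset, which keeps
# memory flat even with years' worth of findings.
for finding in Finding.objects.order_by('id').iterator():
    print(finding.id, finding.hash_code)  # placeholder for the per-finding work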

@valentijnscholten (Member) commented:

Yes, the PR is good to go now; I already approved it.

Labels: Breaking Changes, enhancement, settings_changes (needs changes to settings.py based on changes in settings.dist.py included in this PR)