Reimport feature: use the configurable deduplication for matching new findings to existing findings #3753
Conversation
dojo/reimport_utils.py
```python
elif deduplication_algorithm == 'unique_id_from_tool' or deduplication_algorithm == 'unique_id_from_tool_or_hash_code':
    # processing 'unique_id_from_tool_or_hash_code' as 'unique_id_from_tool' because when using 'unique_id_from_tool_or_hash_code'
    # we usually want to use 'hash_code' for cross-parser matching,
    # while it makes more sense to use 'unique_id_from_tool' for same-parser matching
    return Finding.objects.filter(
        test=test,
        unique_id_from_tool=item.unique_id_from_tool).exclude(
        unique_id_from_tool=None).all()
```
Should you not include the findings with the same hash_code if there is no match on unique_id_from_tool?
I think I'd go for either:
- only unique_id_from_tool (because, as explained in the comments, for the use cases I had in mind, when this configuration is on it doesn't make much sense to match same-parser findings on hash_code)
- both unique_id_from_tool and hash_code (more coherent with the actual configuration)
#2 is probably more logical. But matching on hash_code only if the match on unique_id_from_tool doesn't find anything seems a bit unpredictable. I don't have a strong opinion on this right now, though; I need to throw the code at more tests.
@valentijnscholten I'll go for #2, both unique_id_from_tool and hash_code, because this is how it's done when deduplicating outside of re-import, so it's better to keep the same logic.
Will push that ASAP.
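For reference, a rough sketch of what option #2 could look like — this is my reading of the discussion, not the exact code from the PR; the function name and queryset shape are illustrative, reusing the identifiers from the snippet above:

```python
from django.db.models import Q

from dojo.models import Finding

# Hypothetical sketch of option #2: match on unique_id_from_tool OR hash_code,
# mirroring how the 'unique_id_from_tool_or_hash_code' algorithm behaves in
# regular (non-reimport) deduplication. Null values are excluded on both sides.
def match_new_finding_to_existing_finding(test, item):
    return Finding.objects.filter(
        Q(test=test),
        (Q(unique_id_from_tool=item.unique_id_from_tool) & ~Q(unique_id_from_tool=None)) |
        (Q(hash_code=item.hash_code) & ~Q(hash_code=None)))
```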
Just pushed:
OK, I'm almost done, but I have a dilemma (it's probably already like that in the current code): when we import a report which itself includes duplicates, then re-import the same report, the duplicate gets mitigated.
That doesn't seem correct to me. I'd expect to have:
The problem is that when matching findings we only take the first one, so the duplicates are assumed not to be in the report when they still are. What do you think? Maybe keep that for another PR? This one is big enough as it is... Sample data with checkmarx parser attached.
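One hedged idea for handling in-report duplicates (purely illustrative, not code from this PR; pick_unmatched and matched_ids are hypothetical names) would be to track which existing findings have already been consumed during the re-import, so a second occurrence in the report doesn't collapse onto the same existing finding:

```python
# Illustrative sketch only: during one re-import run, keep a set of finding
# ids that have already been matched, so in-report duplicates each get their
# own match (or are treated as new) instead of all hitting the first result.
def pick_unmatched(candidates, matched_ids):
    """Return the first candidate finding not yet matched in this re-import.

    'candidates' is the queryset returned by the matching function above;
    'matched_ids' is a set of finding ids already consumed this run.
    """
    for finding in candidates:
        if finding.id not in matched_ids:
            matched_ids.add(finding.id)
            return finding
    return None  # nothing left to match: treat the item as a new finding
```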
Can anyone tell me what to do if I want separate findings with the same fields for different endpoints in a project? Right now I'm importing scan results via API v2 (nmap scan results), but when the same ports are open on different hosts, these open ports should be completely different findings; instead they're considered duplicates.
@Brightside56 it seems you have already posted the question in Slack, which is the proper channel for that discussion. Let's move to Slack for the rest of the discussion :)
I've updated the thread. To be honest, it's more convenient to track issue progress here than in a Slack thread =/
Added.
Should there be some instructions in the "Upgrade notes" telling people to also remove endpoints from any custom hash_code configuration? I think you mentioned there were still some known issues / corner cases. Maybe we should think about documenting them in a GitHub issue, or maybe transparently in the docs? (re-import of duplicates, endpoints still used by legacy, ...)
@valentijnscholten it's always possible to use endpoints in the hash_code; I just removed it from the default ZAP, Nessus and Qualys configurations. For legacy (the default hash_code algorithm) they are still there; the hash_code is just computed a bit differently than before, so that we really extract what's specific to each endpoint, and not just host + port.

OK, there is a catch: the re-import will not work for dynamic findings if the hash_code algorithm includes endpoints. So I need to remove the endpoints from the default legacy hash_code computation, and I need to document somewhere that when hash_code includes endpoints, the re-import will not work properly for dynamic parsers.

I can create an issue regarding the specific use case I mentioned about re-importing a report that includes several duplicates: issue created #3958.
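For illustration, a custom per-parser hash_code configuration without endpoints might look roughly like this in settings — the setting name HASHCODE_FIELDS_PER_SCANNER and the field lists below are my assumptions based on this PR, not authoritative values:

```python
# settings.py (sketch): per-parser hash_code fields, with 'endpoints' removed
# so that re-import keeps working for dynamic parsers.
HASHCODE_FIELDS_PER_SCANNER = {
    'ZAP Scan': ['title', 'cwe', 'severity'],
    'Nessus Scan': ['title', 'severity', 'vuln_id_from_tool'],
}
```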
I don't want to generate extra work, but I think a small change to the … I also think that the current dedupe.py and do_dedupe_finding will also perform deduplication even for findings that are already marked as duplicate. Maybe the …
@valentijnscholten I believe you have two valid points, with the sorting and the possibility to recompute hash_code without deduplicating. However, I think the ordering should be by id descending: we need to examine the highest id first, so that newer findings are marked as duplicates of older (smaller id) findings. Shouldn't be too hard to change; I'll have a look after I fix the unit tests. Regarding deduplicating already-duplicated findings, it may be required in order to "unduplicate" a finding that is no longer a duplicate after the algorithm change. I haven't tested this use case though.
Ordering by id descending won't work that way, as the older findings will not have their hash code recalculated yet. We could take a two-step approach: first recalculate all hash codes, then (optionally) deduplicate.
OK, it needs more thinking. Maybe we should use two separate scripts. It turns out the endpoints are all over the place in the legacy dedupe algorithm, not only in the hash_code. Actually that's a good thing: it won't prevent the re-import from correctly matching the findings, and the changes won't have that much impact on legacy dedupe when not using re-import.
For dedupe.py, if we want to be really safe (see the two-step sketch below):
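A sketch of the two-step approach mentioned above — the method and flag names (compute_hash_code, dedupe_option) and the import of do_dedupe_finding from dojo.utils are my assumptions about the codebase, so treat this as pseudocode under those assumptions rather than the actual dedupe.py:

```python
from dojo.models import Finding
from dojo.utils import do_dedupe_finding

# Step 1: recompute every hash_code first, oldest findings first,
# without triggering deduplication on save.
for finding in Finding.objects.order_by('id').iterator():
    finding.hash_code = finding.compute_hash_code()
    finding.save(dedupe_option=False)

# Step 2 (optional): deduplicate newest first, so newer findings end up
# marked as duplicates of older (smaller id) ones, whose hash codes are
# now guaranteed to be up to date from step 1.
for finding in Finding.objects.order_by('-id').iterator():
    do_dedupe_finding(finding)
```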
Conflicts have been resolved. A maintainer will review the pull request shortly.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Conflicts have been resolved. A maintainer will review the pull request shortly.
Thanks @ptrovatelli, big PR. Let's hope nothing breaks ;-). I approved it because I would like to see it merged. There are two things to consider:
These can all be future PRs, but the first item (docs for 1.14.0) might be a "must have" for the release at the end of March.
@ptrovatelli you did a awesome work! Many modifications in this PR are good and fix design pb in the current core features. |
@damiencarol no problem, you're right. I don't like to add dependencies either, but re-writing the URL normalization code seemed silly.
At first I came across url_normalize, which seemed to do the job, but hyperlink seemed to be the mainstream project at the moment for this kind of task.
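As a quick illustration of what hyperlink brings (a minimal sketch; the exact normalization applied in this PR may differ):

```python
from hyperlink import URL

# RFC 3986 syntax-based normalization: among other things, the scheme
# and the host are lowercased, and percent-encodings are normalized.
url = URL.from_text(u"HTTP://MainSite.com/Dashboard")
print(url.normalize().to_text())  # expected: http://mainsite.com/Dashboard
```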
@valentijnscholten OK, good to know. That is a syntax for bulk updating to the same value, though; I don't think it can be used for the hash_code, which is different and needs a specific computation for each row. Regarding the performance of the reverse order, I don't think it should change much: in any case we'll be doing full table scans, reading all the disk blocks, so it shouldn't matter much whether we do it in reverse or not. I'll have a look at the doc.
It's not so much about the bulk update, but more about how to iterate over all findings.
@valentijnscholten OK. I don't know if this syntax without bulk update will be faster. It'll need to be tested, maybe in a separate issue/PR?
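To make the bulk-update limitation being discussed concrete (a hedged sketch; the field names are illustrative): Django's update() issues a single SQL UPDATE, so it can only assign one value or SQL expression to all matched rows, which is why the per-row hash_code recomputation sketched earlier cannot use it.

```python
from dojo.models import Finding

# Bulk update: one SQL UPDATE statement, same value for every matched row.
Finding.objects.filter(duplicate=True).update(active=False)

# hash_code, by contrast, needs an arbitrary per-row Python computation
# (compute_hash_code()), which a single SQL UPDATE cannot express, so a
# Python-side loop over the queryset remains necessary.
```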
Yes, the PR is good to go now; I already approved it.
This PR includes:

https://mainsite.com and https://mainsite.com/dashboard were both hashed to https://mainsite.com:443https://mainsite.com:443 (i.e. the concatenated value of https://mainsite.com:443 twice), which didn't make sense.

Important notice to include in the release note:
Due to:
If you're using the deduplication or the re-import feature, the hash_code needs recomputing for:
Due to 1):
Due to 2):
Due to 3):
With the new syntax:
./manage.py dedupe --parser "Nessus Scan" --parser "ZAP Scan" --parser "Qualys Scan"
To recompute hash_code without deduplicating (enough if you're using the re-import but not the deduplication):
./manage.py dedupe --parser "Nessus Scan" --parser "ZAP Scan" --parser "Qualys Scan" --hash_code_only
Note that if deduplication is not enabled in the system settings, it will not occur no matter what arguments are passed to manage.py.