
Reimport feature: use the configurable deduplication for matching new findings to existing findings #3753

Merged: 7 commits, Mar 11, 2021

Conversation

@ptrovatelli (Collaborator) commented Jan 30, 2021:


This PR includes:

  • use the deduplication algorithm configured for each parser to identify findings during re-import (instead of title + severity)
  • remove specific code that was added for Veracode Scan and Arachni Scan.
    • It's no longer needed now that we can correctly identify findings through the dedupe configuration (it's kept only when using the legacy dedupe algorithm, in case someone has an old configuration without a dedupe configuration for those parsers)
  • removed the endpoints from the dedupe configuration for the dynamic parsers that had one (Nessus, ZAP, Qualys):
    • we agreed on Slack that it's better to identify findings without endpoints (and then handle endpoint status during re-import, if using the re-import feature).
    • Note that this will have an impact for those not using re-import. I believe it's better to identify findings without endpoints even without re-import, because slight variations in endpoints will always create "new" findings when it's really the same finding, and for people trying to track what happens with a finding this will cause problems. The drawback is that one needs to look at the endpoints to know whether there was a change or whether the finding is really the same.
  • removed the endpoints from the legacy hash_code computation algorithm
  • change the way we compute hash_code based on endpoints (a rough sketch of field-based hash_code computation follows this list):
    • (In most cases we advise not to include endpoints in the hash_code, and they are no longer included in the default shipped configuration, but it is still possible to include them by configuration)
    • before, we used host + port, which was not sufficient to identify an endpoint. For example, the endpoints https://mainsite.com and https://mainsite.com/dashboard were both hashed as https://mainsite.com:443https://mainsite.com:443 (i.e. the value https://mainsite.com:443 concatenated twice), which didn't make sense
    • Note that the legacy deduplication algorithm (the default algorithm) still uses endpoints for finding identification (even if the endpoints are not in the hash_code, the endpoints are examined specifically)
    • normalize endpoint URLs when computing hash_code: with the new endpoint value used, we had diffs where we shouldn't have, due to lack of standardization
  • handled the endpoint status in order to mitigate endpoints that are no longer present when re-importing
    • adding new endpoints was already working, but the mitigation was missing
  • improve dedupe.py stability by:
    • allowing deduplication of only a subset of parsers
    • allowing hash_code computation only, or deduplication only
  • fix the logic for the unique_id_from_tool_or_hash_code algorithm in the existing dedupe: we need to be sure that either unique_id_from_tool or hash_code is not None when we use it. Matching on a None value may match loads of findings (although it's really a borderline scenario, because these fields shouldn't be null for the parsers that use them)
  • use a separate volume for integration tests (to avoid login error when switching from unit tests to integration tests)
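To make the field-based idea above concrete, here is a minimal sketch of per-parser hash_code computation. The dictionary name mirrors the HASHCODE_FIELDS_PER_SCANNER setting in settings.dist.py, but the field lists and the helper itself are illustrative assumptions, not the actual DefectDojo implementation:

import hashlib

# Illustrative per-parser configuration; the shipped settings.dist.py uses a
# similar mapping, but the field lists here are examples only.
HASHCODE_FIELDS_PER_SCANNER = {
    'ZAP Scan': ['title', 'cwe', 'severity'],                  # endpoints intentionally excluded
    'Nessus Scan': ['title', 'severity', 'vuln_id_from_tool'],
}
DEFAULT_HASHCODE_FIELDS = ['title', 'cwe', 'line', 'file_path', 'description']

def compute_hash_code(scan_type, finding):
    # Concatenate the configured fields for this scan type and hash the result.
    fields = HASHCODE_FIELDS_PER_SCANNER.get(scan_type, DEFAULT_HASHCODE_FIELDS)
    fields_to_hash = ''.join(str(finding.get(field, '')) for field in fields)
    return hashlib.sha256(fields_to_hash.encode('utf-8')).hexdigest()

# Two ZAP findings that differ only by endpoint now get the same hash_code,
# so re-import can match them and reconcile endpoint status separately.
finding = {'title': 'X-Frame-Options Header Not Set', 'cwe': 1021, 'severity': 'Medium'}
print(compute_hash_code('ZAP Scan', finding))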

Important notice to include in the release note:
Due to:

  1. the change in hash_code formula for some parsers in settings.dist.py
  2. the removal of the endpoints from the hash_code in the default hash_code computation
  3. the change of hashing formula for hashing endpoints

If you're using deduplication or the re-import feature, the hash_code needs to be recomputed for the following parsers.
Due to 1):

  • Nessus
  • ZAP
  • Qualys

Due to 2):

  • Any dynamic parser without a hash_code configuration in settings.dist.py that you have been using

Due to 3):

  • Any dynamic parser for which you have specifically configured the hash_code to use endpoints

With the new syntax:

./manage.py dedupe --parser "Nessus Scan" --parser "ZAP Scan" --parser "Qualys Scan"

To recompute hash_code without deduplicating (sufficient if you're using re-import but not deduplication):
./manage.py dedupe --parser "Nessus Scan" --parser "ZAP Scan" --parser "Qualys Scan" --hash_code_only

Note that if deduplication is not enabled in the system settings, it will not occur regardless of the arguments passed to manage.py.

Comment on lines 29 to 36
elif deduplication_algorithm == 'unique_id_from_tool' or deduplication_algorithm == 'unique_id_from_tool_or_hash_code':
    # processing 'unique_id_from_tool_or_hash_code' as 'unique_id_from_tool' because when using 'unique_id_from_tool_or_hash_code'
    # we usually want to use 'hash_code' for cross-parser matching,
    # while it makes more sense to use the 'unique_id_from_tool' for same-parser matching
    return Finding.objects.filter(
        test=test,
        unique_id_from_tool=item.unique_id_from_tool).exclude(
        unique_id_from_tool=None).all()
@valentijnscholten (Member) commented:

should you not include the findings with the same hash code if there is no match on unique_id_from_tool?

@ptrovatelli (Collaborator, Author) replied:

i think i'd go for either:

  1. only unique_id_from_tool (because as explained in the comments, for the use cases I had in mind, when this configuration is on, it doesn't make much sense to match same parser findings on hash_code)
  2. both unique_id_from_tool and hash_code (more coherent with the actual configuration)

#2 is probably more logical.

But matching on hash_code only when the match on unique_id_from_tool doesn't find anything seems a bit unpredictable. I don't have a strong opinion on this right now; I need to throw the code at more tests.

@ptrovatelli (Collaborator, Author) replied:

@valentijnscholten I'll go for #2, both unique_id_from_tool and hash_code, because this is how it's done when deduplicating outside of re-import, so it's better to keep the same logic.

Will push that ASAP.
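A minimal sketch of option #2, matching on either identifier while guarding against None values (hypothetical helper name; the merged code may differ in details):

from django.db.models import Q
from dojo.models import Finding  # DefectDojo's finding model

def match_existing_findings(item, test):
    # Match within the same test on unique_id_from_tool and/or hash_code,
    # but never on a None value, since matching on None could pull in
    # unrelated findings.
    query = Q()
    if item.unique_id_from_tool is not None:
        query |= Q(unique_id_from_tool=item.unique_id_from_tool)
    if item.hash_code is not None:
        query |= Q(hash_code=item.hash_code)
    if not query:
        # Neither identifier is available: no safe way to match.
        return Finding.objects.none()
    return Finding.objects.filter(query, test=test)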

@damiencarol damiencarol marked this pull request as draft February 7, 2021 21:49
@damiencarol damiencarol changed the title WIP - reimport feature: use the configurable deduplication for matching new… Reimport feature: use the configurable deduplication for matching new findings to existing findings Feb 7, 2021
@ptrovatelli (Collaborator, Author) commented:

just pushed:

  • normalize endpoints urls when computing hash_code (with the new endpoint value used, we had diffs where we shouldn't have due to lack of standardization)
  • fix logic for the unique_id_from_tool_or_hash_code algorithm (existing dedupe + re-import, after Valentijn's remark)

@ptrovatelli (Collaborator, Author) commented Feb 22, 2021:

OK, I'm almost done but I have a dilemma. It's probably already like that in the current code: when we import a report which itself includes duplicates and then re-import the same report, the duplicate gets mitigated.

  • first import:
    • 1 active, verified
    • 1 inactive, duplicate
  • after reimport:
    • 1 active, verified
    • 1 inactive, mitigated, duplicate

It doesn't seem correct to me. I'd expect to have:

  • 1 active, verified
  • 1 inactive, duplicate (same status as before the reimport as we have just re-imported the same report)

The problem is that when matching findings, we only take the first one; the duplicates are assumed not to be in the report, when they actually still are.

What do you think? Maybe keep that for another PR? This one is big enough as it is...

Sample data with the Checkmarx parser attached:
checkmarx_duplicate_in_same_report.zip

@Brightside56 commented:

Can anyone tell me what to do if I want to have separate findings with the same fields for different endpoints in a project? Right now I'm importing scan results via API v2 (nmap scan results), but when I have the same ports open on different hosts, those open ports should be completely different findings; instead they're considered duplicates.

@ptrovatelli (Collaborator, Author) commented:

@Brightside56 it seems that you have already posted the question in Slack, which is the proper channel for that discussion. Let's move to Slack for the rest of the discussion :)

@Brightside56 replied:

I've updated the thread. To be honest, it's more convenient to track issue progress here than in a Slack thread =/

@valentijnscholten (Member) commented:

Added the breaking_changes label, as there is a risk that this might break some use cases and people will need to adjust to the new re-import logic, as well as the fact that endpoints are no longer taken into account in hash_code calculation.

@valentijnscholten valentijnscholten added the settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR label Feb 27, 2021
@valentijnscholten (Member) commented Feb 27, 2021:

Should there be some instructions in the "Upgrade notes" to tell people to also remove endpoints from any custom hash_code configuration in a local_settings.py or some (bind-)mounted settings.py or similar?

I think you mentioned there were still some known issues / corner cases. Maybe we should think about documenting them in a GitHub issue, or maybe transparently in the docs? (re-import of duplicates, endpoints still used by legacy, ...)

@ptrovatelli (Collaborator, Author) commented Feb 28, 2021:

@valentijnscholten it's always possible to use endpoints in the hash_code. I just removed them from the default ZAP, Nessus and Qualys configurations. For legacy (the default hash_code algorithm) they are still there; the hash_code is just computed a bit differently than before, so that we really extract what's specific to each endpoint, and not just host + port.

OK, there is a catch: the re-import will not work for dynamic findings if the hash_code algorithm includes endpoints. So I need to remove the endpoints from the default legacy hash_code computation algorithm, and I need to document somewhere that when the hash_code includes endpoints, re-import will not work properly for dynamic parsers.

I can create an issue regarding the specific use case I mentioned about re-importing a report that includes several duplicates: issue created (#3958).
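For deployments that do want endpoints in the hash_code despite the caveat above, an override along these lines in local_settings.py (or a mounted settings file) should still work; the key and field names mirror settings.dist.py, but verify them against the version you run:

# local_settings.py -- illustrative override, not shipped configuration.
HASHCODE_FIELDS_PER_SCANNER = {
    # Opting endpoints back in for ZAP; re-import will then treat endpoint
    # variations as distinct findings for this parser (see the caveat above).
    'ZAP Scan': ['title', 'cwe', 'severity', 'endpoints'],
}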

@valentijnscholten (Member) commented:

I don't want to generate extra work, but I think a small change to the dedupe.py command is needed. The default ordering for Finding is ordering = ('numerical_severity', '-date', 'title'). This means findings are not processed in chronological order, which means that older findings can be marked as duplicates of newer findings. Which might be confusing and is something we try to avoid in the deduplication usually. Usually the oldest finding is the "first" one (original) and might contain important notes, impact, mitigation info. So we probably should order the findings by id.

I also think that the current dedupe.py and do_dedupe_finding will perform deduplication even for findings that are already marked as duplicates.

Maybe the dedupe.py command should have a parameter to indicate that only the hash_code should be recalculated, but no deduplication should be performed. If you have 1 or 2 years' worth of findings, maybe you don't want the duplicate clusters that currently exist to be changed, and only want to update the hash_code to get good deduplication for future findings.

@ptrovatelli (Collaborator, Author) commented:

@valentijnscholten I believe you have two valid points, with the sorting and the possibility to recompute hash_code without deduplicating. However, I think the ordering should be by descending id: we need to examine the highest ids first, so that they are marked as duplicates of (older) lower-id findings. Shouldn't be too hard to change; I'll have a look after I fix the unit tests.

Regarding deduplicating already duplicated findings, it may be required in order to "unduplicate" a finding that is no longer a duplicate after the algorithm change. I haven't tested this use case though.

@valentijnscholten (Member) commented:

Ordering by id descending won't work that way, as the older findings will not have their hash code recalculated yet. We could take a two-step approach: first recalculate all hash codes, then (optionally) deduplicate.

@ptrovatelli (Collaborator, Author) commented Mar 1, 2021:

OK, it needs more thinking. Maybe we should use two separate scripts.

Turns out the endpoints are all over the place in the legacy dedupe algorithm, not only in the hash_code. Actually it's a good thing: it won't prevent re-import from correctly matching the findings, and the changes won't have that much impact on legacy dedupe when not using re-import.

@ptrovatelli (Collaborator, Author) commented Mar 2, 2021:

For dedupe.py, if we want to be really safe (a rough sketch follows this list):

  • first pass: update the hash_code without deduplicating. That won't be easy, though, because dedupe is automatic upon saving; we need a special flag on the save function to inhibit the dedupe event sent to Celery (and if we don't save, I think the hash_code recomputation will be lost)
  • second pass: by descending id, trigger the deduplication synchronously. Currently it's asynchronous, which will always introduce randomness in execution.
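A rough sketch of what those two passes could look like. The dedupe_option flag, the compute_hash_code model helper, the do_dedupe_finding location, and the parser filter are all assumptions taken from this thread, not a transcript of the merged dedupe.py:

from dojo.models import Finding            # DefectDojo's finding model
from dojo.utils import do_dedupe_finding   # assumed location of the synchronous dedupe helper

def recompute_hash_codes(parser_names):
    # Pass 1: recompute hash_code without triggering the async dedupe task.
    for finding in Finding.objects.filter(test__test_type__name__in=parser_names).order_by('id'):
        finding.hash_code = finding.compute_hash_code()  # assumed model helper
        finding.save(dedupe_option=False)                # assumed flag inhibiting the Celery dedupe event

def deduplicate(parser_names):
    # Pass 2: deduplicate from newest to oldest, so newer findings end up
    # marked as duplicates of older ones.
    for finding in Finding.objects.filter(test__test_type__name__in=parser_names).order_by('-id'):
        do_dedupe_finding(finding)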

@ptrovatelli ptrovatelli force-pushed the reimport-configurable-dedupe branch 3 times, most recently from eeafc4e to 671c737 Compare March 2, 2021 08:35
github-actions bot (Contributor) commented Mar 2, 2021:

Conflicts have been resolved. A maintainer will review the pull request shortly.

@ptrovatelli ptrovatelli force-pushed the reimport-configurable-dedupe branch 3 times, most recently from d206174 to f6c0880 Compare March 2, 2021 16:25
github-actions bot (Contributor) commented Mar 4, 2021:

This pull request has conflicts, please resolve those before we can evaluate the pull request.

github-actions bot (Contributor) commented Mar 5, 2021:

Conflicts have been resolved. A maintainer will review the pull request shortly.

@valentijnscholten (Member) commented:

Thanks @ptrovatelli, big PR. Let's hope nothing breaks ;-). I approved it because I would like to see it merged. There are two things to consider:

These can all be future PRs, but the first item (docs for 1.14.0) might be a "must have" for the release at the end of March.

@damiencarol (Contributor) commented:

@ptrovatelli you did awesome work! Many modifications in this PR are good and fix design problems in the current core features.
I have one concern/question regarding the new Hyperlink lib dependency. We have a few libs that are able to parse/modify URLs.
It would be good to double-check before adding a new one.
By the way, I analyzed hyperlink and it's a well-done lib and an active project.
Don't get me wrong, I don't want to block your PR, but we currently have problems with abandoned third-party libs and we are moving to remove dependencies, so we should analyze further whether we want to add another external lib.

@ptrovatelli (Collaborator, Author) commented:

@damiencarol no problem, you're right. I don't like to add dependencies either, but re-writing the URL normalization code seemed silly.
Amongst what we have:

  • urllib3 seems to only normalize upper/lower case (see https://pypi.org/project/urllib3/)
  • packageurl-python may normalize, but I can't find information on what it does exactly or how to use it

At first I came across "url_normalize", which seemed to do the job, but hyperlink seems to really be the mainstream project at the moment for this kind of task.
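A quick sketch of the kind of normalization hyperlink provides (assuming a recent hyperlink version; the normalization actually applied in the PR may do more than this, e.g. dot-segment and percent-encoding handling per RFC 3986):

import hyperlink

# hyperlink's normalize() lowercases the scheme and host, so equivalent
# spellings of the same endpoint compare (and therefore hash) equal.
url = hyperlink.URL.from_text(u"HTTPS://MainSite.com/Dashboard")
print(url.normalize().to_text())  # expected: https://mainsite.com/Dashboard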

@ptrovatelli (Collaborator, Author) commented:

@valentijnscholten OK, good to know. That is a syntax for bulk-updating the same value, though. I don't think it can be used for the hash_code, which is different and needs a specific computation for each row.

Regarding the performance of the reverse order, I don't think it should change much: in any case we'll be doing full table scans, reading all the disk blocks, so it shouldn't matter much whether we go in reverse or not.

I'll have a look at the doc.

@valentijnscholten (Member) commented:

It's not so much about the bulk update, but more about how to iterate over all findings.

@ptrovatelli (Collaborator, Author) commented:

@valentijnscholten OK. I don't know whether this syntax without bulk update will be faster; it'll need to be tested, maybe in a separate issue/PR?
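For reference, a sketch of walking all findings in chronological (id) order without materializing the whole queryset in memory; not necessarily what dedupe.py ended up doing, and the loop body is a placeholder for whatever per-finding work is needed:

from dojo.models import Finding  # DefectDojo's finding model

# .iterator() streams rows instead of caching the full queryset, which keeps
# memory flat even with years' worth of findings.
for finding in Finding.objects.order_by('id').iterator():
    print(finding.id, finding.hash_code)  # placeholder for the per-finding work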

@valentijnscholten (Member) commented:

Yes, the PR is good to go now; I already approved it.

Labels: Breaking Changes, enhancement, settings_changes (needs changes to settings.py based on changes in settings.dist.py included in this PR)