Deduplication does not work for Whitesource Scan #1654

Closed
1 of 3 tasks
bilalk88 opened this issue Nov 15, 2019 · 8 comments

Comments

@bilalk88

bilalk88 commented Nov 15, 2019

Bug description
Deduplication does not work for Whitesource Scan

Steps to reproduce
Steps to reproduce the behavior:

  1. Run DefectDojo with Docker Compose
  2. Switch on deduplication:
    Go to System Settings
    Switch on Deduplicate findings
    Click Submit
  3. Add a product:
    Go to Add Product
    Fill the required fields.
    Click Submit
  4. Add a new engagement:
    Import scan results for the Whitesource tool multiple times

Actual behavior
All findings just have the status "Active"; none are marked as duplicates.

Expected behavior
Findings imported more than once have the status "Duplicate".

Deployment method (select with an X)

  • [ ] Kubernetes
  • [x] Docker
  • [ ] setup.bash / legacy-setup.bash

Environment information

  • Operating System: CentOS7
  • DefectDojo Commit Message: 42f1d4e: Update issue templates [2019-10-02 16:01:25 -0500]

Sample scan files (optional)

production-vulnerability-report.json.txt

bilalk88 added the bug label Nov 15, 2019
@ptrovatelli
Collaborator

Hello @bilalk88 ,
I'm glad to see that people take an interest in the deduplication system :)
I've been working on a system to make it configurable for each parser, because the current system that tries to work for all the parsers has flaws.
Your problem is similar to #1640
I have just pushed the working configuration for Whitesource in #1540: deduplicate on title + severity + description.
Severity is just added for safety, in case we have two identical titles with different severities, although that really doesn't seem probable.
We do need the description, because without it the same title sometimes appears for different libraries.
Example in your file:

  • CVE-2017-0249 | Edgeimport-Insights
    on both Microsoft.CodeDom.Providers.DotNetCompilerPlatform and Microsoft.Net.Compilers

Again, we could do a little better if the library name was put into "file_path" by the parser: we could deduplicate on cve + file_path + severity, which would allow cross-parser deduplication and would be more resilient to title and description changes due to Whitesource evolutions.
Still, it'll work in simple cases when uploading reports generated by the same version of Whitesource.
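
For reference, a minimal sketch of what this per-parser configuration looks like in the Django settings. The setting names follow DefectDojo's HASHCODE_FIELDS_PER_SCANNER mechanism discussed later in this thread; treat the exact keys and values as illustrative rather than the merged code itself:

    # Sketch of per-scanner dedup configuration (Django settings style).
    # Names follow DefectDojo's mechanism; values are illustrative.
    HASHCODE_FIELDS_PER_SCANNER = {
        # Dedupe Whitesource findings on title + severity + description.
        'Whitesource Scan': ['title', 'severity', 'description'],
    }
    DEDUPLICATION_ALGORITHM_PER_PARSER = {
        # Compare findings via the hash_code computed over the fields above.
        'Whitesource Scan': 'hash_code',
    }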

ptrovatelli pushed a commit to ptrovatelli/django-DefectDojo that referenced this issue Nov 25, 2019
…ow the default one)

  Tune the configuration to allow cross-parser dedupe sonar vs checkmarx
  Configure dedupe for: dependency check, npm audit scan (DefectDojo#1640), whitesource scan(DefectDojo#1654)
  fix full database hash_code recompute and deduplication
@ptrovatelli
Collaborator

Hello @bilalk88 ,
Can you check out the latest dev and see if that fixes your issue? #1665 was merged.

Thanks!

madchap pushed a commit to madchap/django-DefectDojo that referenced this issue Dec 23, 2019
…ow the default one)

  Tune the configuration to allow cross-parser dedupe sonar vs checkmarx
  Configure dedupe for: dependency check, npm audit scan (DefectDojo#1640), whitesource scan(DefectDojo#1654)
  fix full database hash_code recompute and deduplication
@dmeeetreee

dmeeetreee commented Mar 4, 2020

@ptrovatelli just adding in here, as we are in the process of uploading multiple scan results (Fortify, Burp, Sonar, ...) into Dojo, which we'll be integrating with JIRA as part of a full incident management pipeline. Deduplicating issues across multiple tools is a bit of a challenge.

A question I have regarding deduplication: with the new mechanism that allows specifying the fields used in the hash calculation, why do parsers (e.g. Burp) still run some deduplication (dupe_key = str(item.url) + item.severity + item.title) inside their own logic, which will conflict with different fields specified in HASHCODE_FIELDS_PER_SCANNER?

Or do I have that wrong?

@madchap
Collaborator

madchap commented Mar 4, 2020

> Again, we could do a little better if the library name was put into "file_path" by the parser: we could deduplicate on cve + file_path + severity, which would allow cross-parser deduplication and would be more resilient to title and description changes due to Whitesource evolutions.

#1854 could now help. The whitesource parser just needs to take advantage of that change.

Fyi @alles-klar
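
A minimal sketch of what taking advantage of that could look like: pull the library name out of each vulnerability entry and store it in file_path, so dedup can key on cve + file_path + severity. The JSON shape below is assumed for illustration, not the exact Whitesource report schema:

    import json

    # Assumed report shape -- the real Whitesource JSON may differ.
    report = json.loads("""
    {"vulnerabilities": [
        {"name": "CVE-2017-0249", "severity": "high",
         "library": {"filename": "Microsoft.Net.Compilers"}}
    ]}
    """)

    for vuln in report["vulnerabilities"]:
        # Store the library name as the finding's file_path so that
        # cve + file_path + severity can serve as the dedup key.
        file_path = vuln.get("library", {}).get("filename", "")
        print(vuln["name"], file_path, vuln["severity"])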

@ptrovatelli
Collaborator

@dmeeetreee that "dupe_key" logic in parsers should be understood as an aggregation of results. It is done by the parser before the deduplication is called, and the aggregation is not configurable.

The deduplication configuration should indeed be done with the aggregation in mind: if you aggregate on url + severity + title, you cannot deduplicate on fields other than those, because all the other fields will be either missing or incorrect (some parsers just pick one arbitrary value amongst the aggregated records).

I've fixed this behavior in the Checkmarx parser by building a concatenation of all the found values rather than picking an arbitrary one (if I recall, it was the last value amongst the aggregated records, which was quite misleading; that same logic is probably still present in some parsers).

I think we need to add this to the docs, as it's really not obvious.
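
To make the two stages concrete, here is a minimal, self-contained sketch of the in-parser aggregation pattern; the Item class and field names are illustrative stand-ins for a parsed Burp result, not the actual parser code:

    from dataclasses import dataclass

    @dataclass
    class Item:  # hypothetical stand-in for one parsed scan result
        url: str
        severity: str
        title: str
        description: str

    def aggregate(items):
        # Stage 1: in-parser aggregation, run before DefectDojo's
        # configurable deduplication ever sees the findings.
        dupes = {}
        for item in items:
            dupe_key = str(item.url) + item.severity + item.title
            if dupe_key not in dupes:
                # Fields outside the key keep only this record's values;
                # later records with the same key are silently collapsed.
                dupes[dupe_key] = item
        return list(dupes.values())

    items = [
        Item("https://example.com/a", "High", "XSS", "reflected"),
        Item("https://example.com/a", "High", "XSS", "stored"),
    ]
    # Prints 1: the second description is lost, which is why deduplicating
    # later on fields outside the aggregate key is unsafe.
    print(len(aggregate(items)))

The Checkmarx fix mentioned above amounts to concatenating the aggregated values (e.g. the descriptions) instead of keeping only one of them.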

@ptrovatelli
Collaborator

ptrovatelli commented Mar 4, 2020

Although, in some cases we could still use other fields in dedupe if they are present in one of the aggregate keys.
For example, if the title is built from the cve plus other data, it's valid to dedupe on cve.

Still, it would seem safer to add cve to the aggregate keys to make sure we don't lose data based on incorrect assumptions.
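
A small sketch of why that works, using an illustrative hash over the configured fields (the real computation lives in DefectDojo's Finding model; this is not its actual code):

    import hashlib

    def hash_code(finding, fields):
        # Illustrative: concatenate the configured fields and hash them.
        combined = "".join(str(finding.get(f) or "") for f in fields)
        return hashlib.sha256(combined.encode("utf-8")).hexdigest()

    f1 = {"cve": "CVE-2017-0249", "title": "CVE-2017-0249 | Edgeimport-Insights"}
    f2 = {"cve": "CVE-2017-0249", "title": "CVE-2017-0249 | Edge-Insights"}  # title drifted

    print(hash_code(f1, ["cve"]) == hash_code(f2, ["cve"]))      # True: cve still matches
    print(hash_code(f1, ["title"]) == hash_code(f2, ["title"]))  # False: a title change breaks dedupe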

@dmeeetreee

dmeeetreee commented Mar 5, 2020

Thanks @ptrovatelli for explaining. As I understand it, it's a two-stage process: aggregation in the parser, followed by deduplication.

Will have to talk to our business about integrating with JIRA as part of a full incident management pipeline: the idea was to have one JIRA ticket / Dojo Finding for each issue/location/line, so if a file had two occurrences of the same issue (e.g. CWE-798) we'd have separate incidents. (The Finding object/class inside the Dojo model has a line-number field, so this could be supported in theory.)

In practice, that would require making the line number part of the deduplication key and keeping it outside the aggregator. (Flawed in itself, as active development on a file would shift the line number of an incident, and we'd create duplicates ourselves.)

Are there any best practices that can be suggested regarding aggregation/deduplication for our requirement? That is, we are not only interested in reporting that a specific security risk was found; we also want to build a developer workflow/process around it.

@stale

stale bot commented Jun 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jun 3, 2020
stale bot closed this as completed Jun 10, 2020