Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to exclude outputs with cli and config #207

Merged
merged 8 commits into from Sep 5, 2022

Conversation

meanrin
Copy link
Contributor

@meanrin meanrin commented Aug 31, 2022

Description

Resolves #195

  • add blacklist field to the default config. Old configs without the fields will still be processed normally
  • add separate value and line blacklists. Both ignores space-like characters at left and right
  • add blacklist CLI arg that extends the list from config
  • at the moment CLI blacklist adds to both line and value blacklist for simplicity and user convenience. However, interface allows to add line and value lists separately
  • blacklist stored as Sets rather than Lists. Sets on average faster for x in y checks, and have average search time of O(1) https://wiki.python.org/moin/TimeComplexity
  • update the guide with blacklisting details

image

How has this been tested?

Please describe the tests that you ran to verify your changes.

  • Add new unit tests for both CLI arg and passing arguments directly into CredSweeper
  • Verified proper changes in the guide

@meanrin meanrin added the enhancement New feature or request label Aug 31, 2022
@meanrin meanrin requested a review from a team as a code owner August 31, 2022 12:55
@babenek babenek closed this Aug 31, 2022
@babenek babenek reopened this Aug 31, 2022
Copy link
Contributor

@babenek babenek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User may use windows files with UTF-16 encoding.

if args.blacklist_path is not None:
with open(args.blacklist_path) as f:
blacklist_text = f.read()
blacklist = [line for line in blacklist_text.split("\n") if line]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need test UTF-16 and windows line ending.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified with 0a7f3d8

@@ -198,6 +203,13 @@ def scan(args: Namespace, content_provider: FilesProvider, json_filename: Option

"""
try:
if args.blacklist_path is not None:
with open(args.blacklist_path) as f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently user might use custom config. So, probably file reading is extra, even the config is applied.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is is not overwriting config but works with it together
So person may use both custom config and custom plain text blacklist

And reading is conditional on args.blacklist_path being specified anyway

@meanrin meanrin requested a review from babenek August 31, 2022 15:36
Copy link
Contributor

@csh519 csh519 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding those excluding list import code is good!
However how about change the name of blacklist to denylist.
As you know these days, some words that using for developing is changing.. for "PC" 🙄

@babenek
Copy link
Contributor

babenek commented Sep 1, 2022

Adding those excluding list import code is good! However how about change the name of blacklist to denylist. As you know these days, some words that using for developing is changing.. for "PC" roll_eyes

"ignore_list" ???

@meanrin
Copy link
Contributor Author

meanrin commented Sep 1, 2022

"ignore_list" ???

https://english.stackexchange.com/questions/51088/alternative-terms-to-blacklist-and-whitelist

Deny list is the top comment on stackexchange, so here we go

image

@meanrin meanrin changed the title Add blacklist cli and config Add option to exclude outputs with cli and config Sep 1, 2022
@meanrin meanrin requested a review from csh519 September 1, 2022 09:24
Comment on lines +222 to +223
exclude_lines=denylist,
exclude_values=denylist)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deny_lines
deny_values

  • maybe use the 'deny'' ? It helps to keep semantic in code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we have exclude in config T-T
So it looks very nice when assigning to the config

image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with single 'denylist' we will have double work.
So, i propose use only single denylist in config

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with single 'denylist' we will have double work. So, i propose use only single denylist in config

@babenek We are taking about O(1) average difficultly at the search operation
https://wiki.python.org/moin/TimeComplexity

Copy link
Contributor

@babenek babenek Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x N lines

Copy link
Contributor

@babenek babenek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO 'denylist' is duplicated to exclude.lines and exlude.values and it breaks semantic in code (even it described in doc).
In config it use separated values (OK), but when user applies denylist - double work appears.
User may export default config, modify it with new feature and use like custom config - it will works without extra argument.

Comment on lines +95 to +98
if exclude_lines is not None:
config_dict["exclude"]["lines"] = config_dict["exclude"].get("lines", []) + exclude_lines
if exclude_values is not None:
config_dict["exclude"]["values"] = config_dict["exclude"].get("values", []) + exclude_values
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the feature is overhead

Comment on lines +48 to +49
self.exclude_lines = set(line.strip() for line in self.exclude_lines)
self.exclude_values = set(line.strip() for line in self.exclude_values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

denylist is the same

Comment on lines +222 to +223
exclude_lines=denylist,
exclude_values=denylist)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with single 'denylist' we will have double work.
So, i propose use only single denylist in config

line_data = cls.get_line_data(config, target.line, target.line_num, target.file_path, rule.patterns[0],
rule.filters)

if line_data is None:
return None
if line_data.value.strip() in config.exclude_values:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double work in case of user 'denylist'

Copy link
Contributor Author

@meanrin meanrin Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double work in case of user 'denylist'

Not the case
If we stripped values in list, we still need to strip actual lines, otherwise match cannot occur

@meanrin
Copy link
Contributor Author

meanrin commented Sep 1, 2022

User may export default config, modify it with new feature and use like custom config - it will works without extra argument.

@babenek It would be just a bad UX
I would remind you to try place yourself on the users place, and try to update json file with a new blacklist

@@ -154,11 +154,16 @@ def _get_candidate(cls, config: Config, rule: Rule, target: AnalysisTarget) -> O
remove current line. None otherwise

"""
if target.line.strip() in config.exclude_lines:
Copy link
Contributor

@babenek babenek Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then, here may be extra work when user suggests to skip values only

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, agree, will update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@babenek fixed with 3670e58

@babenek
Copy link
Contributor

babenek commented Sep 1, 2022

User may export default config, modify it with new feature and use like custom config - it will works without extra argument.

@babenek It would be just a bad UX I would remind you to try place yourself on the users place, and try to update json file with a new blacklist

then maybe there two new arguments?
--denylines
--denyvalues

@meanrin
Copy link
Contributor Author

meanrin commented Sep 1, 2022

User may export default config, modify it with new feature and use like custom config - it will works without extra argument.

@babenek It would be just a bad UX I would remind you to try place yourself on the users place, and try to update json file with a new blacklist

then maybe there two new arguments? --denylines --denyvalues

@babenek Yeah, I thought about it, but i see no case from the user side to be honest

From the dev site i see for the debug reasons OR while training ML

On the user side, collisions between full line and value should be implausible. Therefore, there are should be no random FN due to line be removed because of value or other way around

Copy link
Contributor

@babenek babenek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, let user decide what way should be used to skip credentials. config or denylist

Comment on lines +56 to +57
exclude_lines: Optional[List[str]] = None,
exclude_values: Optional[List[str]] = None) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the lines inserted before e.g. find_by_ext - then 55line is kept untouched :)

Copy link
Contributor Author

@meanrin meanrin Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, but it have a chance to brake code that passes arguments by order rather than by name

@@ -198,6 +203,11 @@ def scan(args: Namespace, content_provider: FilesProvider, json_filename: Option

"""
try:
if args.denylist_path is not None:
denylist = [line for line in Util.read_file(args.denylist_path) if line]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
denylist = [line for line in Util.read_file(args.denylist_path) if line]
denylist = [line for line in Util.read_file(args.denylist_path) if line.strip()]

skip user lines with whitespaces only?

Copy link
Contributor Author

@meanrin meanrin Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huuuuuh

@babenek
Good, point!

However, IMO it's irrelevant because:

  1. We attach denylist values to the config Before instantiating the config class
  2. Config do strip all values anyway (lines 48-49)
  3. Aaaand after that we call the set function that removes all duplicates and creates hash based data-structure
        self.exclude_lines = set(line.strip() for line in self.exclude_lines)
        self.exclude_values = set(line.strip() for line in self.exclude_values)

So after the steps 2 all space-only lines would be converted to ""
And step 3 would guarantee absence of duplicates

So not doing if line.strip() would increase search in cache by only 1 data point

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, only 1 empty check

Copy link
Contributor Author

@meanrin meanrin Sep 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, only 1 empty check

It would be 1 more check IF we would perform linear search

However, x in y for sets doing hash map based search that is O(1) in average difficulty - meaning it does not depend on the set size
So adding "" to set would increase set size by 1, but not the average number of operations during checking x in y

In worst case x in y would be O(n), so it would increase operations by 1. But it's really an insanely rare worst case

Copy link
Contributor

@csh519 csh519 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--denylist option is added.

LGTM 👍

@meanrin meanrin merged commit 710f8d8 into Samsung:main Sep 5, 2022
@meanrin meanrin deleted the add-blacklist-cli-and-config branch September 5, 2022 08:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filter out false positives
3 participants