New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to exclude outputs with cli and config #207
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
User may use windows files with UTF-16 encoding.
credsweeper/__main__.py
Outdated
if args.blacklist_path is not None: | ||
with open(args.blacklist_path) as f: | ||
blacklist_text = f.read() | ||
blacklist = [line for line in blacklist_text.split("\n") if line] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need test UTF-16 and windows line ending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modified with 0a7f3d8
credsweeper/__main__.py
Outdated
@@ -198,6 +203,13 @@ def scan(args: Namespace, content_provider: FilesProvider, json_filename: Option | |||
|
|||
""" | |||
try: | |||
if args.blacklist_path is not None: | |||
with open(args.blacklist_path) as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently user might use custom config. So, probably file reading is extra, even the config is applied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is is not overwriting config but works with it together
So person may use both custom config and custom plain text blacklist
And reading is conditional on args.blacklist_path
being specified anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding those excluding list import code is good!
However how about change the name of blacklist
to denylist
.
As you know these days, some words that using for developing is changing.. for "PC" 🙄
"ignore_list" ??? |
https://english.stackexchange.com/questions/51088/alternative-terms-to-blacklist-and-whitelist Deny list is the top comment on stackexchange, so here we go |
exclude_lines=denylist, | ||
exclude_values=denylist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deny_lines
deny_values
- maybe use the 'deny'' ? It helps to keep semantic in code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with single 'denylist' we will have double work.
So, i propose use only single denylist in config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with single 'denylist' we will have double work. So, i propose use only single denylist in config
@babenek We are taking about O(1) average difficultly at the search operation
https://wiki.python.org/moin/TimeComplexity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x N lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO 'denylist' is duplicated to exclude.lines and exlude.values and it breaks semantic in code (even it described in doc).
In config it use separated values (OK), but when user applies denylist - double work appears.
User may export default config, modify it with new feature and use like custom config - it will works without extra argument.
if exclude_lines is not None: | ||
config_dict["exclude"]["lines"] = config_dict["exclude"].get("lines", []) + exclude_lines | ||
if exclude_values is not None: | ||
config_dict["exclude"]["values"] = config_dict["exclude"].get("values", []) + exclude_values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably the feature is overhead
self.exclude_lines = set(line.strip() for line in self.exclude_lines) | ||
self.exclude_values = set(line.strip() for line in self.exclude_values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
denylist is the same
exclude_lines=denylist, | ||
exclude_values=denylist) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with single 'denylist' we will have double work.
So, i propose use only single denylist in config
line_data = cls.get_line_data(config, target.line, target.line_num, target.file_path, rule.patterns[0], | ||
rule.filters) | ||
|
||
if line_data is None: | ||
return None | ||
if line_data.value.strip() in config.exclude_values: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double work in case of user 'denylist'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double work in case of user 'denylist'
Not the case
If we stripped values in list, we still need to strip actual lines, otherwise match cannot occur
@babenek It would be just a bad UX |
@@ -154,11 +154,16 @@ def _get_candidate(cls, config: Config, rule: Rule, target: AnalysisTarget) -> O | |||
remove current line. None otherwise | |||
|
|||
""" | |||
if target.line.strip() in config.exclude_lines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then, here may be extra work when user suggests to skip values only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, agree, will update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then maybe there two new arguments? |
@babenek Yeah, I thought about it, but i see no case from the user side to be honest From the dev site i see for the debug reasons OR while training ML On the user side, collisions between full line and value should be implausible. Therefore, there are should be no random FN due to line be removed because of value or other way around |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, let user decide what way should be used to skip credentials. config or denylist
exclude_lines: Optional[List[str]] = None, | ||
exclude_values: Optional[List[str]] = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the lines inserted before e.g. find_by_ext - then 55line is kept untouched :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, but it have a chance to brake code that passes arguments by order rather than by name
@@ -198,6 +203,11 @@ def scan(args: Namespace, content_provider: FilesProvider, json_filename: Option | |||
|
|||
""" | |||
try: | |||
if args.denylist_path is not None: | |||
denylist = [line for line in Util.read_file(args.denylist_path) if line] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
denylist = [line for line in Util.read_file(args.denylist_path) if line] | |
denylist = [line for line in Util.read_file(args.denylist_path) if line.strip()] |
skip user lines with whitespaces only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huuuuuh
@babenek
Good, point!
However, IMO it's irrelevant because:
- We attach denylist values to the config Before instantiating the config class
- Config do strip all values anyway (lines 48-49)
- Aaaand after that we call the
set
function that removes all duplicates and creates hash based data-structure
self.exclude_lines = set(line.strip() for line in self.exclude_lines)
self.exclude_values = set(line.strip() for line in self.exclude_values)
So after the steps 2 all space-only lines would be converted to ""
And step 3 would guarantee absence of duplicates
So not doing if line.strip()
would increase search in cache by only 1 data point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, only 1 empty check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, only 1 empty check
It would be 1 more check IF we would perform linear search
However, x in y
for sets doing hash map based search that is O(1) in average difficulty - meaning it does not depend on the set size
So adding ""
to set would increase set size by 1, but not the average number of operations during checking x in y
In worst case x in y
would be O(n), so it would increase operations by 1. But it's really an insanely rare worst case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--denylist
option is added.
LGTM 👍
Description
Resolves #195
x in y
checks, and have average search time of O(1) https://wiki.python.org/moin/TimeComplexityHow has this been tested?
Please describe the tests that you ran to verify your changes.
CredSweeper