Add option to exclude outputs with cli and config #207

meanrin · 2022-08-31T12:55:22Z

Description

Resolves #195

Add a blacklist field to the default config. Old configs without the field are still processed normally.
Add separate value and line blacklists. Both ignore leading and trailing whitespace.
Add a blacklist CLI argument that extends the list from the config.
At the moment, the CLI blacklist is added to both the line and the value blacklists for simplicity and user convenience. However, the interface allows line and value lists to be added separately.
Blacklists are stored as sets rather than lists. Sets are faster on average for x in y checks, with an average lookup time of O(1): https://wiki.python.org/moin/TimeComplexity
Update the guide with blacklisting details.

How has this been tested?

Please describe the tests that you ran to verify your changes.

Added new unit tests for both the CLI argument and for passing arguments directly into CredSweeper.
Verified that the guide changes render properly.

babenek

Users may supply Windows files with UTF-16 encoding.

babenek · 2022-08-31T13:51:40Z

credsweeper/__main__.py

+        if args.blacklist_path is not None:
+            with open(args.blacklist_path) as f:
+                blacklist_text = f.read()
+            blacklist = [line for line in blacklist_text.split("\n") if line]


This needs a test for UTF-16 and Windows line endings.

Modified with 0a7f3d8
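For reference, a minimal sketch of the kind of reading this comment asks for. It is not the Util.read_file implementation from the commit above; read_denylist_lines is a hypothetical helper that assumes the file is either UTF-8 or UTF-16 with a BOM, and it tolerates Windows line endings.

```python
import codecs
import os
import tempfile
from typing import List


def read_denylist_lines(path: str) -> List[str]:
    """Hypothetical helper: read a UTF-8 or BOM-marked UTF-16 text file as lines."""
    with open(path, "rb") as f:
        raw = f.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = raw.decode("utf-16")  # the BOM selects the byte order and is dropped
    else:
        text = raw.decode("utf-8-sig")  # also drops a UTF-8 BOM if one is present
    # splitlines() accepts "\n", "\r\n" and "\r" (among others) as line endings
    return [line for line in text.splitlines() if line.strip()]


# Usage: a Windows-style file, UTF-16 with CRLF line endings and a blank line
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "denylist.txt")
    with open(sample, "wb") as f:
        f.write("secret_one\r\n\r\n  secret_two  \r\n".encode("utf-16"))
    entries = read_denylist_lines(sample)
```

The blank line is dropped here, while surrounding whitespace on kept entries is left for a later normalization step.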

babenek · 2022-08-31T13:55:55Z

credsweeper/__main__.py

@@ -198,6 +203,13 @@ def scan(args: Namespace, content_provider: FilesProvider, json_filename: Option

    """
    try:
+        if args.blacklist_path is not None:
+            with open(args.blacklist_path) as f:


Users can already supply a custom config, so reading a separate file is probably redundant, even when the config is applied.

It does not overwrite the config; it works together with it.
So a person may use both a custom config and a custom plain-text blacklist.

And the file is only read when args.blacklist_path is specified anyway.

csh519

Adding the code to import exclusion lists is good!
However, how about renaming blacklist to denylist?
As you know, some terms used in development are changing these days... for "PC" 🙄

babenek · 2022-09-01T08:24:34Z

Adding the code to import exclusion lists is good! However, how about renaming blacklist to denylist? As you know, some terms used in development are changing these days... for "PC" 🙄

"ignore_list" ???

meanrin · 2022-09-01T09:03:51Z

"ignore_list" ???

https://english.stackexchange.com/questions/51088/alternative-terms-to-blacklist-and-whitelist

Deny list is the top comment on stackexchange, so here we go

babenek · 2022-09-01T09:25:36Z

credsweeper/__main__.py

+                                  exclude_lines=denylist,
+                                  exclude_values=denylist)


deny_lines
deny_values

Maybe use 'deny'? It helps keep the semantics consistent in the code.

But we have exclude in the config T-T
So it looks very nice when assigning to the config.

With a single 'denylist' we will have double work.
So, I propose using only a single denylist in the config.

With a single 'denylist' we will have double work. So, I propose using only a single denylist in the config.

@babenek We are talking about O(1) average complexity for the search operation:
https://wiki.python.org/moin/TimeComplexity

babenek

IMO 'denylist' duplicates exclude.lines and exclude.values, and it breaks the semantics in the code (even though it is described in the docs).
The config uses separate values (OK), but when the user applies a denylist, double work appears.
The user can export the default config, modify it with the new feature, and use it as a custom config; it will work without an extra argument.

babenek · 2022-09-01T09:32:54Z

credsweeper/app.py

+        if exclude_lines is not None:
+            config_dict["exclude"]["lines"] = config_dict["exclude"].get("lines", []) + exclude_lines
+        if exclude_values is not None:
+            config_dict["exclude"]["values"] = config_dict["exclude"].get("values", []) + exclude_values


This feature is probably overhead.
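To make the discussed merge step concrete, here is a small standalone sketch. merge_exclusions is a hypothetical helper, not a function in credsweeper/app.py, and the "some_option" key is made up. The lines shown in the diff index config_dict["exclude"] directly; the sketch uses setdefault as one possible way to tolerate a config that has no "exclude" section at all.

```python
from typing import Any, Dict, List, Optional


def merge_exclusions(config_dict: Dict[str, Any],
                     exclude_lines: Optional[List[str]] = None,
                     exclude_values: Optional[List[str]] = None) -> Dict[str, Any]:
    """Hypothetical helper: extend the config exclusions in place and return the config."""
    # setdefault creates an empty "exclude" section when the config has none
    exclude = config_dict.setdefault("exclude", {})
    if exclude_lines is not None:
        exclude["lines"] = exclude.get("lines", []) + exclude_lines
    if exclude_values is not None:
        exclude["values"] = exclude.get("values", []) + exclude_values
    return config_dict


# An older config without an "exclude" section ("some_option" is a made-up key)
old_config = merge_exclusions({"some_option": 1}, exclude_lines=["password = example"])

# A config that already lists values; the extra entries are appended
new_config = merge_exclusions({"exclude": {"lines": [], "values": ["example"]}},
                              exclude_values=["changeme"])
```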

babenek · 2022-09-01T09:33:18Z

credsweeper/config/config.py

+        self.exclude_lines = set(line.strip() for line in self.exclude_lines)
+        self.exclude_values = set(line.strip() for line in self.exclude_values)


denylist is the same

babenek · 2022-09-01T09:35:18Z

credsweeper/__main__.py

+                                  exclude_lines=denylist,
+                                  exclude_values=denylist)


With a single 'denylist' we will have double work.
So, I propose using only a single denylist in the config.

babenek · 2022-09-01T09:37:29Z

credsweeper/scanner/scan_type/scan_type.py

        line_data = cls.get_line_data(config, target.line, target.line_num, target.file_path, rule.patterns[0],
                                      rule.filters)

        if line_data is None:
            return None
+        if line_data.value.strip() in config.exclude_values:


double work in case of user 'denylist'

double work in case of user 'denylist'

Not the case.
Even if the values in the list are stripped, we still need to strip the actual lines; otherwise a match cannot occur.
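A short illustration of the point above, as a sketch with made-up sample data: the stored entries are normalized the same way as in the config.py lines of this PR, and the scanned line only matches once it is stripped too. The normalize function name is an assumption for the example.

```python
from typing import Iterable, Set


def normalize(entries: Iterable[str]) -> Set[str]:
    """Strip each entry and collect the results in a set (duplicates collapse)."""
    return set(entry.strip() for entry in entries)


exclude_lines = normalize(["  password = 'example'  ", "token = 123\n"])

# A line as it might appear in a scanned file: indented, with a trailing newline
scanned_line = "    password = 'example'\n"

raw_match = scanned_line in exclude_lines               # compares the unstripped text
stripped_match = scanned_line.strip() in exclude_lines  # compares the normalized text
```

Here raw_match is False and stripped_match is True, so stripping the list once does not remove the need to strip each scanned line.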

meanrin · 2022-09-01T09:47:19Z

The user can export the default config, modify it with the new feature, and use it as a custom config; it will work without an extra argument.

@babenek That would just be bad UX.
Try to put yourself in the user's place and update a JSON file with a new blacklist.

babenek · 2022-09-01T09:47:35Z

credsweeper/scanner/scan_type/scan_type.py

@@ -154,11 +154,16 @@ def _get_candidate(cls, config: Config, rule: Rule, target: AnalysisTarget) -> O
            remove current line. None otherwise

        """
+        if target.line.strip() in config.exclude_lines:


Then there may be extra work here when the user wants to skip values only.

OK, agreed, will update.

@babenek fixed with 3670e58

babenek · 2022-09-01T09:50:44Z

The user can export the default config, modify it with the new feature, and use it as a custom config; it will work without an extra argument.

@babenek That would just be bad UX. Try to put yourself in the user's place and update a JSON file with a new blacklist.

Then maybe add two new arguments?
--denylines
--denyvalues

meanrin · 2022-09-01T09:58:32Z

The user can export the default config, modify it with the new feature, and use it as a custom config; it will work without an extra argument.

@babenek That would just be bad UX. Try to put yourself in the user's place and update a JSON file with a new blacklist.

Then maybe add two new arguments? --denylines --denyvalues

@babenek Yeah, I thought about it, but to be honest I see no use case on the user side.

On the dev side I see one: for debugging or while training ML.

On the user side, collisions between a full line and a value should be implausible. Therefore, there should be no random FNs (false negatives) caused by a line being removed because of a value, or the other way around.

babenek

Well, let the user decide which way to skip credentials: config or denylist.

babenek · 2022-09-01T11:06:28Z

credsweeper/app.py

+                 exclude_lines: Optional[List[str]] = None,
+                 exclude_values: Optional[List[str]] = None) -> None:


If these lines are inserted before, e.g., find_by_ext, then line 55 stays untouched :)

Agreed, but that could break code that passes arguments by position rather than by name.
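The concern about positional callers can be shown with a toy example. The three functions below are hypothetical stand-ins, not the real CredSweeper constructor; only the find_by_ext and exclude_lines names are taken from this thread.

```python
from typing import List, Optional


def old_api(path: str, find_by_ext: bool = False) -> dict:
    """Signature before the change."""
    return {"path": path, "find_by_ext": find_by_ext}


def inserted_api(path: str,
                 exclude_lines: Optional[List[str]] = None,
                 find_by_ext: bool = False) -> dict:
    """New parameter inserted before find_by_ext."""
    return {"path": path, "exclude_lines": exclude_lines, "find_by_ext": find_by_ext}


def appended_api(path: str,
                 find_by_ext: bool = False,
                 exclude_lines: Optional[List[str]] = None) -> dict:
    """New parameter appended at the end."""
    return {"path": path, "find_by_ext": find_by_ext, "exclude_lines": exclude_lines}


# An existing caller that passes find_by_ext by position
old_result = old_api("repo", True)
inserted_result = inserted_api("repo", True)  # True is now bound to exclude_lines
appended_result = appended_api("repo", True)  # True is still bound to find_by_ext
```

With the inserted variant the call still runs, but find_by_ext silently falls back to False, which is why appending new optional parameters is the safer choice for callers that rely on argument order.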

babenek · 2022-09-01T11:07:34Z

credsweeper/__main__.py

@@ -198,6 +203,11 @@ def scan(args: Namespace, content_provider: FilesProvider, json_filename: Option

    """
    try:
+        if args.denylist_path is not None:
+            denylist = [line for line in Util.read_file(args.denylist_path) if line]


Suggested change

- denylist = [line for line in Util.read_file(args.denylist_path) if line]

+ denylist = [line for line in Util.read_file(args.denylist_path) if line.strip()]

Skip user lines that contain only whitespace?

Huuuuuh

@babenek Good point!

However, IMO it's irrelevant because:

1. We attach the denylist values to the config before instantiating the Config class.

2. Config strips all values anyway (lines 48-49).

3. After that we call the set function, which removes all duplicates and creates a hash-based data structure:

self.exclude_lines = set(line.strip() for line in self.exclude_lines)
self.exclude_values = set(line.strip() for line in self.exclude_values)

So after step 2 all whitespace-only lines are converted to "".
And step 3 guarantees the absence of duplicates.

So skipping the if line.strip() check would grow the set being searched by only 1 entry.

yes, only 1 empty check

yes, only 1 empty check

It would be 1 more check IF we performed a linear search.

However, x in y for sets uses a hash-based lookup that is O(1) on average, meaning it does not depend on the set size.
So adding "" to the set increases the set size by 1, but not the average number of operations when checking x in y.

In the worst case x in y is O(n), so it would add 1 operation. But that is an extremely rare worst case.
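The argument above can be checked with a few lines. This is a sketch with made-up entries that mirrors the strip-then-set normalization quoted from config.py.

```python
# Raw denylist entries as a user might supply them, including whitespace-only lines
raw_entries = ["secret", "   ", "\t", "secret  ", ""]

# Same normalization as in the PR: strip every entry, then build a set
exclude_values = set(entry.strip() for entry in raw_entries)

size = len(exclude_values)             # whitespace-only entries collapse into one ""
hit = "secret" in exclude_values       # hash-based lookup, O(1) on average
miss = "other" in exclude_values
empty_matches = "" in exclude_values   # the collapsed empty entry is itself a member
```

All whitespace-only entries become a single "" member, so the set grows by at most one element. One side effect to keep in mind: a candidate whose stripped text is empty would then count as excluded. Whether such a candidate can reach this check in CredSweeper is not established in this thread.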

csh519

--denylist option is added.

LGTM 👍

Alex Sokol added 5 commits August 31, 2022 15:16

Update guide to include blacklist info

cffdc77

Update code to work with blacklist

23285f0

Update default cong with exclude lines, values

bf5d3ea

Update tests in related to help output and CLI

33e4a77

Add new tests related to blacklist

66a6c40

meanrin added the enhancement New feature or request label Aug 31, 2022

meanrin requested a review from a team as a code owner August 31, 2022 12:55

babenek closed this Aug 31, 2022

babenek reopened this Aug 31, 2022

babenek reviewed Aug 31, 2022

Change blacklist reading to read_file function

0a7f3d8

meanrin requested a review from babenek August 31, 2022 15:36

csh519 reviewed Sep 1, 2022

meanrin changed the title ~~Add blacklist cli and config~~ Add option to exclude outputs with cli and config Sep 1, 2022

Change wording

41b2f82

meanrin requested a review from csh519 September 1, 2022 09:24

babenek reviewed Sep 1, 2022

Add additional if related to exclude lists

3670e58

babenek approved these changes Sep 1, 2022

csh519 approved these changes Sep 5, 2022

meanrin merged commit 710f8d8 into Samsung:main Sep 5, 2022

meanrin deleted the add-blacklist-cli-and-config branch September 5, 2022 08:10
