Add option to exclude outputs with cli and config #207

Merged
merged 8 commits into from Sep 5, 2022
Changes from 7 commits
14 changes: 13 additions & 1 deletion credsweeper/__main__.py
@@ -90,6 +90,11 @@ def get_arguments() -> Namespace:
default=None,
dest="config_path",
metavar="PATH")
parser.add_argument("--denylist",
help="path to a plain text file with lines or secrets to ignore",
default=None,
dest="denylist_path",
metavar="PATH")
parser.add_argument("--find-by-ext",
help="find files by predefined extension.",
dest="find_by_ext",
@@ -198,6 +203,11 @@ def scan(args: Namespace, content_provider: FilesProvider, json_filename: Option

"""
try:
if args.denylist_path is not None:
denylist = [line for line in Util.read_file(args.denylist_path) if line]
Contributor

Suggested change
denylist = [line for line in Util.read_file(args.denylist_path) if line]
denylist = [line for line in Util.read_file(args.denylist_path) if line.strip()]

Skip user lines that contain only whitespace?

Contributor Author (@meanrin, Sep 1, 2022)

Huuuuuh

@babenek
Good point!

However, IMO it's irrelevant because:

  1. We attach the denylist values to the config before instantiating the Config class
  2. Config strips all values anyway (lines 48-49)
  3. After that, we call set(), which removes all duplicates and creates a hash-based data structure
        self.exclude_lines = set(line.strip() for line in self.exclude_lines)
        self.exclude_values = set(line.strip() for line in self.exclude_values)

So after step 2, all whitespace-only lines are converted to "",
and step 3 guarantees the absence of duplicates.

So skipping the `if line.strip()` check would add at most one extra entry ("") to the set being searched.
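
The argument above can be checked with a small sketch. It is only an illustration, not CredSweeper's code: `raw_lines` is a hypothetical stand-in for whatever `Util.read_file` returns, and the two comprehensions mirror the original filter and the suggested one.

```python
# Hypothetical denylist content; stands in for the lines read from the file.
raw_lines = ["cackle!", "   ", "\t", "", ' password = "cackle!" ', "cackle!"]

# Filter used in the PR: drops only truly empty strings.
kept_by_truthiness = [line for line in raw_lines if line]

# Filter from the suggestion: also drops whitespace-only lines.
kept_by_strip = [line for line in raw_lines if line.strip()]

# What Config does afterwards: strip every entry and deduplicate via a set.
normalized_pr = set(line.strip() for line in kept_by_truthiness)
normalized_suggestion = set(line.strip() for line in kept_by_strip)

# The two results differ only by the single "" entry.
print(normalized_pr - normalized_suggestion)
```

Under these assumptions, both whitespace-only lines collapse into one empty string, so the stricter filter changes the final set by exactly one element.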

Contributor

yes, only 1 empty check

Contributor Author (@meanrin, Sep 1, 2022)

> yes, only 1 empty check

It would be 1 more check only if we performed a linear search.

However, `x in y` for sets is a hash-map-based lookup that is O(1) on average, meaning it does not depend on the set size.
So adding "" to the set would increase the set size by 1, but not the average number of operations performed by `x in y`.

In the worst case `x in y` would be O(n), so it would add 1 operation. But that is an extremely rare worst case.
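
A rough way to see the difference is to count equality comparisons during a membership test. The sketch below is an illustration under assumptions: `CountingKey` and `count_comparisons` are hypothetical helpers written for this example, and the counts reflect CPython's behaviour of comparing hashes before calling `__eq__`.

```python
class CountingKey:
    """Hypothetical wrapper that counts how often equality is evaluated."""

    comparisons = 0

    def __init__(self, text: str) -> None:
        self.text = text

    def __hash__(self) -> int:
        return hash(self.text)

    def __eq__(self, other: object) -> bool:
        CountingKey.comparisons += 1
        return isinstance(other, CountingKey) and self.text == other.text


def count_comparisons(container, needle: CountingKey) -> int:
    """Return the number of __eq__ calls made by `needle in container`."""
    CountingKey.comparisons = 0
    _ = needle in container
    return CountingKey.comparisons


n = 10_000
as_list = [CountingKey(f"line-{i}") for i in range(n)]
as_set = set(as_list)
needle = CountingKey(f"line-{n - 1}")  # equal to the last list element

list_comparisons = count_comparisons(as_list, needle)  # linear scan
set_comparisons = count_comparisons(as_set, needle)    # hash lookup
print(list_comparisons, set_comparisons)
```

The list scan compares against every element before it finds the match, while the set lookup typically needs a single comparison regardless of `n`.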

else:
denylist = []

credsweeper = CredSweeper(rule_path=args.rule_path,
config_path=args.config_path,
api_validation=args.api_validation,
@@ -208,7 +218,9 @@ def scan(args: Namespace, content_provider: FilesProvider, json_filename: Option
ml_threshold=args.ml_threshold,
find_by_ext=args.find_by_ext,
depth=args.depth,
size_limit=args.size_limit)
size_limit=args.size_limit,
exclude_lines=denylist,
exclude_values=denylist)
Comment on lines +222 to +223
Contributor

deny_lines
deny_values

  • Maybe use 'deny' in the names? It helps to keep the semantics consistent in the code.

Contributor Author

But we have exclude in config T-T
So it looks very nice when assigning to the config

image

Contributor

With a single 'denylist' we will do double work.
So I propose using only a single denylist in the config.

Contributor Author

> with single 'denylist' we will have double work. So, i propose use only single denylist in config

@babenek We are talking about O(1) average complexity for the search operation
https://wiki.python.org/moin/TimeComplexity

Contributor (@babenek, Sep 1, 2022)

x N lines

return credsweeper.run(content_provider=content_provider)
except Exception as exc:
logger.critical(exc, exc_info=True)
10 changes: 9 additions & 1 deletion credsweeper/app.py
@@ -52,7 +52,9 @@ def __init__(self,
ml_threshold: Union[float, ThresholdPreset] = ThresholdPreset.medium,
find_by_ext: bool = False,
depth: int = 0,
size_limit: Optional[str] = None) -> None:
size_limit: Optional[str] = None,
exclude_lines: Optional[List[str]] = None,
exclude_values: Optional[List[str]] = None) -> None:
Comment on lines +56 to +57
Contributor

If the lines were inserted earlier, e.g. before find_by_ext, then line 55 would be kept untouched :)

Contributor Author (@meanrin, Sep 1, 2022)

Agreed, but it could break code that passes arguments by position rather than by name
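
The risk can be shown with a reduced example. The three functions below are hypothetical stand-ins for a constructor signature, not the real `CredSweeper.__init__`; they only contrast appending a new parameter with inserting it earlier in the list.

```python
from typing import List, Optional


def init_before(find_by_ext: bool = False, depth: int = 0) -> dict:
    """Hypothetical signature before the change."""
    return {"find_by_ext": find_by_ext, "depth": depth}


def init_appended(find_by_ext: bool = False, depth: int = 0,
                  exclude_lines: Optional[List[str]] = None) -> dict:
    """New parameter appended at the end: positional calls keep their meaning."""
    return {"find_by_ext": find_by_ext, "depth": depth, "exclude_lines": exclude_lines}


def init_inserted(exclude_lines: Optional[List[str]] = None,
                  find_by_ext: bool = False, depth: int = 0) -> dict:
    """New parameter inserted earlier: positional arguments shift by one slot."""
    return {"find_by_ext": find_by_ext, "depth": depth, "exclude_lines": exclude_lines}


# A caller written against the old signature, passing arguments by position.
old_call_args = (True, 2)

print(init_before(*old_call_args))
print(init_appended(*old_call_args))
print(init_inserted(*old_call_args))
```

With the inserted variant, `True` silently lands in `exclude_lines` and `2` in `find_by_ext`, which is the breakage the comment refers to. Callers that use keyword arguments are unaffected either way.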

"""Initialize Advanced credential scanner.

Args:
@@ -73,6 +75,8 @@ def __init__(self,
find_by_ext: boolean - files will be reported by extension
depth: int - how deep container files will be scanned
size_limit: optional string integer or human-readable format to skip oversize files
exclude_lines: lines to omit in scan. Will be added to the lines already in config
exclude_values: values to omit in scan. Will be added to the values already in config

"""
self.pool_count: int = int(pool_count) if int(pool_count) > 1 else 1
@@ -88,6 +92,10 @@ def __init__(self,
config_dict["find_by_ext"] = find_by_ext
config_dict["size_limit"] = size_limit
config_dict["depth"] = depth
if exclude_lines is not None:
config_dict["exclude"]["lines"] = config_dict["exclude"].get("lines", []) + exclude_lines
if exclude_values is not None:
config_dict["exclude"]["values"] = config_dict["exclude"].get("values", []) + exclude_values
Comment on lines +95 to +98
Contributor

Probably the feature is overhead
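
For reference, the merge in the hunk above can be isolated as follows. This is a sketch with assumptions: `merge_exclude` is a hypothetical helper that wraps the same expression, and the sample config content is invented for the example.

```python
from typing import Dict, List, Optional


def merge_exclude(config_dict: Dict, key: str, extra: Optional[List[str]]) -> None:
    """Append `extra` to config_dict["exclude"][key], tolerating a missing key."""
    if extra is not None:
        # `+` builds a new list, so neither the old config list nor `extra` is mutated.
        config_dict["exclude"][key] = config_dict["exclude"].get(key, []) + extra


# Hypothetical config that already excludes one line and has no "values" key.
config_dict = {"exclude": {"lines": ['password = "cackle!"']}}

merge_exclude(config_dict, "lines", ["token = 123"])  # extends the existing list
merge_exclude(config_dict, "values", ["cackle!"])     # creates the missing key
merge_exclude(config_dict, "values", None)            # None leaves the config as is
print(config_dict["exclude"])
```

Entries passed through the constructor are added to those from the config file rather than replacing them, which matches the docstring wording "Will be added to the lines already in config".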


self.config = Config(config_dict)
self.credential_manager = CredentialManager()
8 changes: 7 additions & 1 deletion credsweeper/config/config.py
@@ -1,4 +1,4 @@
from typing import Dict, List, Optional
from typing import Dict, List, Optional, Set

from humanfriendly import parse_size
from regex import regex
@@ -20,6 +20,8 @@ def __init__(self, config: Dict) -> None:
]
self.exclude_paths: List[str] = config["exclude"]["path"]
self.exclude_extensions: List[str] = config["exclude"]["extension"]
self.exclude_lines: Set[str] = set(config["exclude"].get("lines", []))
self.exclude_values: Set[str] = set(config["exclude"].get("values", []))
self.source_extensions: List[str] = config["source_ext"]
self.source_quote_ext: List[str] = config["source_quote_ext"]
self.find_by_ext_list: List[str] = config["find_by_ext_list"]
@@ -41,3 +43,7 @@ def __init__(self, config: Dict) -> None:
self.exclude_extensions.remove(".zip")
if ".gz" in self.exclude_extensions:
self.exclude_extensions.remove(".gz")

# Trim exclude patterns from space like characters
self.exclude_lines = set(line.strip() for line in self.exclude_lines)
self.exclude_values = set(line.strip() for line in self.exclude_values)
Comment on lines +48 to +49
Contributor

denylist is the same

5 changes: 5 additions & 0 deletions credsweeper/scanner/scan_type/scan_type.py
@@ -154,11 +154,16 @@ def _get_candidate(cls, config: Config, rule: Rule, target: AnalysisTarget) -> O
remove current line. None otherwise

"""
if target.line.strip() in config.exclude_lines:
Contributor (@babenek, Sep 1, 2022)

Then there may be extra work here when the user only wants to skip values

Contributor Author

Ok, agree, will update

Contributor Author

@babenek fixed with 3670e58

return None

line_data = cls.get_line_data(config, target.line, target.line_num, target.file_path, rule.patterns[0],
rule.filters)

if line_data is None:
return None
if line_data.value.strip() in config.exclude_values:
Contributor

Double work in the case of a user 'denylist'

Contributor Author (@meanrin, Sep 1, 2022)

> double work in case of user 'denylist'

Not the case.
Even though the values in the list are already stripped, we still need to strip the actual scanned lines, otherwise a match cannot occur
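
A minimal sketch of why both sides need stripping. It assumes the stored patterns are normalized once, as `Config` does, and uses hypothetical helpers `normalize` and `is_excluded` in place of the real `_get_candidate` logic.

```python
from typing import Iterable, Optional, Set


def normalize(patterns: Iterable[str]) -> Set[str]:
    """Strip every pattern once, at configuration time."""
    return {pattern.strip() for pattern in patterns}


def is_excluded(line: str, value: Optional[str],
                exclude_lines: Set[str], exclude_values: Set[str]) -> bool:
    """Strip the scanned line and the detected value before each lookup."""
    if line.strip() in exclude_lines:
        return True
    return value is not None and value.strip() in exclude_values


exclude_lines = normalize([' password = "cackle!" '])
exclude_values = normalize(["cackle!"])

# A scanned line usually carries indentation and a trailing newline.
scanned_line = '    password = "cackle!"\n'

matched_without_strip = scanned_line in exclude_lines
matched_with_strip = is_excluded(scanned_line, None, exclude_lines, set())
matched_by_value = is_excluded("another line", " cackle! ", set(), exclude_values)
print(matched_without_strip, matched_with_strip, matched_by_value)
```

Stripping the patterns happens once, whereas stripping the scanned text happens per line, so the two steps do not duplicate each other.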

return None

return Candidate([line_data], rule.patterns, rule.rule_name, rule.severity, config, rule.validations,
rule.use_ml)
4 changes: 3 additions & 1 deletion credsweeper/secret/config.json
@@ -69,7 +69,9 @@
"/node_modules/",
"/target/",
"/venv/"
]
],
"lines": [],
"values": []
},
"source_ext": [
".aspx",
47 changes: 44 additions & 3 deletions docs/source/guide.rst
@@ -13,9 +13,9 @@ Get all argument list:

.. code-block:: text

usage: python -m credsweeper [-h] (--path PATH [PATH ...] | --diff_path PATH [PATH ...] | --export_config [PATH]) [--rules [PATH]] [--config [PATH]] [--find-by-ext] [--depth POSITIVE_INT]
[--ml_threshold FLOAT_OR_STR] [--ml_batch_size POSITIVE_INT] [--api_validation] [--jobs POSITIVE_INT] [--skip_ignored] [--save-json [PATH]] [--save-xlsx [PATH]]
[--log LOG_LEVEL] [--size_limit SIZE_LIMIT] [--version]
usage: python -m credsweeper [-h] (--path PATH [PATH ...] | --diff_path PATH [PATH ...] | --export_config [PATH]) [--rules [PATH]] [--config [PATH]] [--denylist PATH] [--find-by-ext]
[--depth POSITIVE_INT] [--ml_threshold FLOAT_OR_STR] [--ml_batch_size POSITIVE_INT] [--api_validation] [--jobs POSITIVE_INT] [--skip_ignored]
[--save-json [PATH]] [--save-xlsx [PATH]] [--log LOG_LEVEL] [--size_limit SIZE_LIMIT] [--version]
optional arguments:
-h, --help show this help message and exit
--path PATH [PATH ...]
@@ -26,6 +26,7 @@ Get all argument list:
exporting default config to file (default: config.json)
--rules [PATH] path of rule config file (default: credsweeper/rules/config.yaml)
--config [PATH] use custom config (default: built-in)
--denylist PATH path to a plain text file with lines or secrets to ignore
--find-by-ext find files by predefined extension.
--depth POSITIVE_INT recursive search in files which are zip archives.
--ml_threshold FLOAT_OR_STR
@@ -104,6 +105,46 @@ Get CLI output only:

rule: Password / severity: medium / line_data_list: [line : 'password = "cackle!"' / line_num : 1 / path : tests/samples/password / entropy_validation: False] / api_validation: NOT_AVAILABLE / ml_validation: VALIDATED_KEY


Exclude outputs using CLI:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you want to remove some values from report (e.g. known public secrets):
create text files with lines or values you want to remove and add it using `--denylist` argument.
Space-like characters at left and right will be ignored.

.. code-block:: bash

$ python -m credsweeper --path tests/samples/password --denylist list.txt
Detected Credentials: 0
Time Elapsed: 0.07523202896118164s
$ cat list.txt
cackle!
password = "cackle!"

Exclude outputs using config:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Edit ``exclude`` part of the config file.
Default config can be generated using ``python -m credsweeper --export_config place_to_save.json``
or can be found in ``credsweeper/secret/config.json``.
Space-like characters at left and right will be ignored.

.. code-block:: json

"exclude": {
"lines": [" password = \"cackle!\" "],
"values": ["cackle!"]
}

Then specify your config in CLI:

.. code-block:: bash

$ python -m credsweeper --path tests/samples/password --config my_cfg.json
Detected Credentials: 0
Time Elapsed: 0.07152628898620605s

Use as a python library
-----------------------

63 changes: 63 additions & 0 deletions tests/test_app.py
@@ -187,6 +187,7 @@ def test_it_works_n(self) -> None:
")" \
" [--rules [PATH]]" \
" [--config [PATH]]" \
" [--denylist PATH]" \
" [--find-by-ext]" \
" [--depth POSITIVE_INT]" \
" [--ml_threshold FLOAT_OR_STR]" \
@@ -447,3 +448,65 @@ def test_zip_p(self) -> None:
assert len(report) == SAMPLES_POST_CRED_COUNT + SAMPLES_IN_DEEP_1 - SAMPLES_FILTERED_BY_POST_COUNT

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

def test_denylist_value_p(self) -> None:
target_path = str(SAMPLES_DIR / "password")
with tempfile.TemporaryDirectory() as tmp_dir:
json_filename = os.path.join(tmp_dir, f"{__name__}.json")
denylist_filename = os.path.join(tmp_dir, f"list.txt")
with open(denylist_filename, "w") as f:
f.write("cackle!")
_stdout, _stderr = self._m_credsweeper([
"--path", target_path, "--denylist", denylist_filename, "--save-json", json_filename, "--log", "silence"
])
with open(json_filename, "r") as json_file:
report = json.load(json_file)
assert len(report) == 0

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

def test_denylist_value_n(self) -> None:
target_path = str(SAMPLES_DIR / "password")
with tempfile.TemporaryDirectory() as tmp_dir:
json_filename = os.path.join(tmp_dir, f"{__name__}.json")
denylist_filename = os.path.join(tmp_dir, f"list.txt")
with open(denylist_filename, "w") as f:
f.write("abc")
_stdout, _stderr = self._m_credsweeper([
"--path", target_path, "--denylist", denylist_filename, "--save-json", json_filename, "--log", "silence"
])
with open(json_filename, "r") as json_file:
report = json.load(json_file)
assert len(report) == 1

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

def test_denylist_line_p(self) -> None:
target_path = str(SAMPLES_DIR / "password")
with tempfile.TemporaryDirectory() as tmp_dir:
json_filename = os.path.join(tmp_dir, f"{__name__}.json")
denylist_filename = os.path.join(tmp_dir, f"list.txt")
with open(denylist_filename, "w") as f:
f.write(' password = "cackle!" ')
_stdout, _stderr = self._m_credsweeper([
"--path", target_path, "--denylist", denylist_filename, "--save-json", json_filename, "--log", "silence"
])
with open(json_filename, "r") as json_file:
report = json.load(json_file)
assert len(report) == 0

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

def test_denylist_line_n(self) -> None:
target_path = str(SAMPLES_DIR / "password")
with tempfile.TemporaryDirectory() as tmp_dir:
json_filename = os.path.join(tmp_dir, f"{__name__}.json")
denylist_filename = os.path.join(tmp_dir, f"list.txt")
with open(denylist_filename, "w") as f:
f.write("abc")
_stdout, _stderr = self._m_credsweeper([
"--path", target_path, "--denylist", denylist_filename, "--save-json", json_filename, "--log", "silence"
])
with open(json_filename, "r") as json_file:
report = json.load(json_file)
assert len(report) == 1
43 changes: 41 additions & 2 deletions tests/test_main.py
@@ -164,7 +164,8 @@ def test_binary_patch_n(self, mock_get_arguments: Mock()) -> None:
ml_threshold=0.0,
depth=1,
size_limit="1G",
api_validation=False)
api_validation=False,
denylist_path=None)
mock_get_arguments.return_value = args_mock
with patch('logging.Logger.warning') as mocked_logger:
app_main.main()
@@ -192,7 +193,8 @@ def test_report_p(self, mock_get_arguments: Mock()) -> None:
depth=0,
size_limit="1G",
find_by_ext=False,
api_validation=False)
api_validation=False,
denylist_path=None)
mock_get_arguments.return_value = args_mock
app_main.main()
assert os.path.exists(xlsx_filename)
@@ -346,3 +348,40 @@ def test_zip_p(self) -> None:
cred_sweeper.config.depth = 0
cred_sweeper.run(content_provider=content_provider)
assert len(cred_sweeper.credential_manager.get_credentials()) == SAMPLES_POST_CRED_COUNT

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

def test_exclude_value_p(self) -> None:
cred_sweeper = CredSweeper(use_filters=True, exclude_values=["cackle!"])
files = [SAMPLES_DIR / "password"]
files_provider = [TextContentProvider(file_path) for file_path in files]
cred_sweeper.scan(files_provider)
assert len(cred_sweeper.credential_manager.get_credentials()) == 0

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

def test_exclude_value_n(self) -> None:
cred_sweeper = CredSweeper(use_filters=True, exclude_values=["abc"])
files = [SAMPLES_DIR / "password"]
files_provider = [TextContentProvider(file_path) for file_path in files]
cred_sweeper.scan(files_provider)
assert len(cred_sweeper.credential_manager.get_credentials()) == 1

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

@pytest.mark.parametrize("line", [' password = "cackle!" ', 'password = "cackle!"'])
def test_exclude_line_p(self, line: str) -> None:
cred_sweeper = CredSweeper(use_filters=True, exclude_lines=[line])
files = [SAMPLES_DIR / "password"]
files_provider = [TextContentProvider(file_path) for file_path in files]
cred_sweeper.scan(files_provider)
assert len(cred_sweeper.credential_manager.get_credentials()) == 0

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

def test_exclude_line_n(self) -> None:
cred_sweeper = CredSweeper(use_filters=True, exclude_lines=["abc"])
files = [SAMPLES_DIR / "password"]
files_provider = [TextContentProvider(file_path) for file_path in files]
cred_sweeper.scan(files_provider)
assert len(cred_sweeper.credential_manager.get_credentials()) == 1