Version 2.0.0 #84
Version 2.0.0 #84
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adeptex awesome changes. A lot to process in here so I've left an initial review and may give it another pass later on. Are there any particular areas you would like reviewing in more detail?
I will also try running locally and seeing if there are any bugs I can find and then link back to the code. Let me know if you need any clarifications on my comments! 💯
args_parser.add_argument("-i", "--info", action="store_true", help="show extended help and exit") | ||
args_parser.add_argument("-c", "--config", help="config file") | ||
args_parser.add_argument("-o", "--output", help="output file") | ||
args_parser.add_argument("-e", "--exitcode", default=0, type=int, help="exit code on success") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the exitcode
need to be configurable? Should it not always be 0
on success?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feature was requested in #61
It appears that when chaining/piping scripts in Unix-type shells, it is possible that a wrong exit code gets propagated down the pipeline if one of the scripts fails. So having this configurable helps ensure that the step completed successfully with the expected exit code. For instance: https://unix.stackexchange.com/questions/584990/why-is-exit-code-0-even-though-the-command-is-wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, interesting.
|
||
logging.debug(f"Loaded plugin '{plugin}' for file '{file}'") | ||
|
||
pairs = plugin().pairs(file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the plugins could inherit from a common base class, which might help with some of the Python typing and make extensibility even easier in the future.
For example, creating a BasePlugin
class which has an abstract method pairs
, alongside other common properties/functions. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially that was the idea. However, in practice not all plugins have a generic way of parsing files. The pairs
method is the most generic common thing that unites all plugins - calling that method gives you a list of KeyValuePair
objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, maybe one for a future PR / contribution 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To summarize current cases:
- Plugin parses file line by line with
enumerate
, sopair.line
can be directly set by trackinglineno
. For example - Dockerfile, .ini, .npmrc
whispers/whispers/plugins/npmrc.py
Lines 9 to 16 in 9f5f6da
for lineno, line in enumerate(filepath.open(), 1): | |
if ":_authToken=" not in line: | |
continue | |
value = line.split(":_authToken=")[-1].strip() | |
if value: | |
key = "npm authToken" | |
yield KeyValuePair(key, value, keypath=[key], line=lineno) |
- Plugin parses file as a dictionary, so
pair.line
has to be found later by reading the file as plaintext, if the pair is identified as a secret. For example - JSON, XML, YML
whispers/whispers/core/utils.py
Lines 192 to 216 in 9f5f6da
def find_line_number(pair: KeyValuePair) -> int: | |
"""Finds line number using pair keypath and value""" | |
if pair.line: | |
return pair.line # Already set | |
valuepath = pair.value.split("\n")[0][:16] | |
findpath = [*pair.keypath, valuepath] | |
foundline = 0 | |
for lineno, line in enumerate(Path(pair.file).open(), 1): | |
founditems = 0 | |
for item in findpath: | |
if item not in line: | |
break | |
founditems += 1 | |
foundline = lineno | |
findpath = findpath[founditems:] | |
if not findpath: | |
return foundline | |
return 0 |
- Plugin parses file as AST, so
pair.line
can be directly set from available AST node properties. For example - Python
whispers/whispers/plugins/python.py
Lines 122 to 126 in 9f5f6da
elif isinstance(node, astroid.nodes.Keyword): | |
key = self.node_to_str(node) | |
value = self.node_to_str(node.value) | |
if key and value and isinstance(value, (str, int)): | |
yield KeyValuePair(key, value, keypath=[key], line=node.lineno) |
if pair.line: | ||
return pair.line # Already set | ||
|
||
valuepath = pair.value.split("\n")[0][:16] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 16
here? 😄 I'm sure there's good reason, but maybe an inline comment will help
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for specific cases like the following
whispers/tests/fixtures/privatekeys.yml
Lines 6 to 10 in 9f5f6da
key: "\ | |
-----BEGIN RSA KEY----- \ | |
QyNTUxOQAAACCtrF27B/zd9DEpd38IbVBy93wSeYXKU0AGXMyO8ePu2QAAAKBSzpYEUs6W \ | |
-----END RSA KEY-----\ | |
" |
When this is loaded by whispers, key key
has the value of:
-----BEGIN RSA KEY----- \nQyNTUxOQAAACCtrF27B/zd9DEpd38IbVBy93wSeYXKU0AGXMyO8ePu2QAAAKBSzpYEUs6W \n-----END RSA KEY-----
However, if you were to look for that string in that file, you would not find it, because in plaintext its multi-line is formatted differently.
In order to find the line number where this value begins correctly, the operation is to first split the value to get -----BEGIN RSA KEY-----
, and then only take up to the first 16 chars for finding the string in a line of plaintext.
Co-authored-by: Oliver Crawford <16978487+ocrawford555@users.noreply.github.com>
from setuptools import find_packages, setup | ||
|
||
|
||
def get_version(): | ||
return import_module("whispers.__version__").__version__ | ||
|
||
|
||
install_requires = ["luhn>=0.2.0", "lxml>=4.6.2", "pyyaml>=5.3.1", "astroid>=2.4.2", "jproperties>=2.1.0", "python-levenshtein>=0.12.0", "beautifulsoup4>=4.9.3"] | ||
install_requires = ["dataclasses", "luhn", "lxml", "pyyaml", "astroid", "jproperties", "python-levenshtein", "beautifulsoup4"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why versions are not restricted like dev dependencies? 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This decision was based on the following: https://caremad.io/posts/2013/07/setup-vs-requirement/
In addition, dependabot only bumps requirements.txt, while setup.py remains outdated. Updating versions in setup.py requires manual review. Unless there is an alternative way of keeping dependencies updated that I am missing, this seems to be the most reasonable solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issue is the way we use setup.py. dependabot/dependabot-core#1475
So we don't use dependabot or we change the way we manage dependencies.
However, I would push for having pinned dependencies on setup.py
"""Ensure minimal expected config structure""" | ||
try: | ||
config["include"] = config.get("include", {"files": ["**/*"]}) | ||
config["include"]["files"] = config["include"].get("files", ["**/*"]) | ||
|
||
config["exclude"] = config.get("exclude", {"files": None, "keys": None, "values": None}) | ||
config["exclude"]["files"] = config["exclude"].get("files", None) | ||
config["exclude"]["keys"] = config["exclude"].get("keys", None) | ||
config["exclude"]["values"] = config["exclude"].get("values", None) | ||
|
||
for idx in ["files", "keys", "values"]: | ||
if not config["exclude"][idx]: | ||
continue | ||
|
||
# Create a single regex statement and compile it for efficient matching | ||
unified = "|".join(config["exclude"][idx]) | ||
config["exclude"][idx] = re.compile(unified) | ||
|
||
config["severity"] = config.get("severity", DEFAULT_SEVERITY) | ||
config["rules"] = config.get("rules", list_rule_ids(default_rules())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably for a future release, this could be changed to a dataclass or pydantic model where schema and defaults are defined in the model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good one :)
"""Checks if given data is printable text""" | ||
if isinstance(data, bytes): | ||
try: | ||
data = data.decode("utf-8") | ||
except Exception: | ||
return False | ||
|
||
if not isinstance(data, (str, int)): | ||
return False | ||
|
||
for ch in str(data): | ||
if ch not in string.printable: | ||
return False | ||
|
||
return True | ||
|
||
|
||
def is_base64(data: str) -> bool: | ||
"""Checks if given data is base64-decodable to text""" | ||
if not isinstance(data, str): | ||
return False | ||
|
||
try: | ||
b64decode(data).decode("utf-8") | ||
return True | ||
|
||
except Exception: | ||
return False | ||
|
||
|
||
def is_base64_bytes(data: str) -> bool: | ||
"""Checks if given data is base64-decodable to bytes""" | ||
try: | ||
return b64decode(data) != b"" | ||
|
||
except Exception: | ||
return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capturing bare exceptions might be dangerous, we might be hiding errors. Can we just capture the ones we expect on these cases ex binascii.Error
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are too many exceptions that could happen here. It could be wrongly guessing something as base64 and trying to decode it, or decode something that is usually base64 but not that time.
All errors are automatically logged into whispers.log
for later analysis.
Release notes
❌ Breaking changes ❌
❌ File exclusion globs are now regex ❌
In version 1 the configuration file expected file exclusion specification to be a list of globs. Whispers would then resolve included globs, resolve excluded globs, and finally subtract the two lists to get applicable scope. The entire target directory tree would be traversed twice to compute applicable files (highly resource-intensive operation!)
In version 2 file exclusions are specified as regex. Instead of resolving globs, Whispers now uses the generator directly. Every file path received from the glob generator is now checked against the file exclusion regex to determine whether the file should be excluded on-the-fly.
This highly improves performance for cases where the target directory contains a large number of files. In version 2 the tree is traversed file by file, individually checking if the file path matches a pre-compiled exclusion regex. This decreases CPU, RAM and time needed to scan directories of potentially unlimited trees and depths.
❌ Rule specification format changes ❌
In version 1 the rules were defined as a dictionary with rule ID as the key and rule config as the value. This created awkward parsing practices and unintuitive code. For example:
In version 2 the rules are defined as a list of dictionaries. The rule ID now has its own
id
key inside the rule config definition. For example:If you have any custom rule definitions, you will have to adjust them for migrating to version 2.
❌ Removed support for dynamic languages ❌
In version 1 the following language files were parsed as text and checked for common variable declaration and assignment patterns:
It is not possible to parse these languages as Abstract Syntax Trees (ASTs) in Python. The initial attempt was to detect "low hanging fruit" by parsing the files as text instead. This lead to poor functional coverage, as well as a potentially false sense of security.
In version 2 the support for these dynamic languages is dropped. This allowed bringing unit test coverage up to 100%, and in this way ensuring result reliability and true security coverage. It is recommended to rely on AST-based parsing for dynamic languages for getting reliable results. Check out Semgrep!
Python3 remains fully supported in Whispers 2.
🛠️ Non-breaking changes 🛠️
🛠️ Rule specification in config.yml 🛠️
You can now specify rules that you want to use directly in config.yml as a list. In addition, custom rules can be added directly to that list using the following format:
If you don't specify any, all built-in rules will be used be default. If you do, only those that you specify will be applicable.
🛠️ Severity specification in config.yml 🛠️
You can now specify severity levels that you want to use directly in config.yml as a list:
If you don't specify any, all built-in severity levels will be used be default - BLOCKER, CRITICAL, MAJOR, MINOR, INFO.
If you do, only those that you specify will be applicable.
✅ New features ✅
No new features were introduced in this release. The primary objective of the present release was to optimize currently implemented logic in order to make it easier to read, understand, and work with in general. This refactoring, along with the aforementioned breaking changes, have shown to increase scanning speed of up to 7-10 times (depending on conditions) in comparison with version 1. In addition, it allowed achieving 100% unit test coverage.
Complete list of arguments can be found in
whispers -h
along with documentation in README.md