Skip to content
This repository has been archived by the owner on Oct 25, 2023. It is now read-only.

Version 2.0.0 #84

Closed
wants to merge 49 commits into from
Closed

Version 2.0.0 #84

wants to merge 49 commits into from

Conversation

adeptex
Copy link
Contributor

@adeptex adeptex commented Sep 20, 2021

Release notes

❌ Breaking changes ❌

❌ File exclusion globs are now regex ❌

In version 1 the configuration file expected file exclusion specification to be a list of globs. Whispers would then resolve included globs, resolve excluded globs, and finally subtract the two lists to get applicable scope. The entire target directory tree would be traversed twice to compute applicable files (highly resource-intensive operation!)

In version 2 file exclusions are specified as regex. Instead of resolving globs, Whispers now uses the generator directly. Every file path received from the glob generator is now checked against the file exclusion regex to determine whether the file should be excluded on-the-fly.

This highly improves performance for cases where the target directory contains a large number of files. In version 2 the tree is traversed file by file, individually checking if the file path matches a pre-compiled exclusion regex. This decreases CPU, RAM and time needed to scan directories of potentially unlimited trees and depths.

❌ Rule specification format changes ❌

In version 1 the rules were defined as a dictionary with rule ID as the key and rule config as the value. This created awkward parsing practices and unintuitive code. For example:

npmrc: 
  description: Hardcoded .npmrc authToken
  message: .npmrc authToken
  severity: CRITICAL
  key:
    regex: ^npm authToken$
    ignorecase: False

In version 2 the rules are defined as a list of dictionaries. The rule ID now has its own id key inside the rule config definition. For example:

- id: npmrc
  description: Hardcoded .npmrc authToken
  message: .npmrc authToken
  severity: CRITICAL
  key:
    regex: ^npm authToken$
    ignorecase: False

If you have any custom rule definitions, you will have to adjust them for migrating to version 2.

❌ Removed support for dynamic languages ❌

In version 1 the following language files were parsed as text and checked for common variable declaration and assignment patterns:

  • JavaScript
  • Java
  • Go
  • PHP

It is not possible to parse these languages as Abstract Syntax Trees (ASTs) in Python. The initial attempt was to detect "low hanging fruit" by parsing the files as text instead. This lead to poor functional coverage, as well as a potentially false sense of security.

In version 2 the support for these dynamic languages is dropped. This allowed bringing unit test coverage up to 100%, and in this way ensuring result reliability and true security coverage. It is recommended to rely on AST-based parsing for dynamic languages for getting reliable results. Check out Semgrep!

Python3 remains fully supported in Whispers 2.

🛠️ Non-breaking changes 🛠️

🛠️ Rule specification in config.yml 🛠️

You can now specify rules that you want to use directly in config.yml as a list. In addition, custom rules can be added directly to that list using the following format:

exclude:
  files:
    - \.npmrc
    - .*coded.*
    - \.git/.*
  keys:
    - SECRET_VALUE_KEY
  values:
    - SECRET_VALUE_PLACEHOLDER

rules:
  - password
  - privatekey
  - id: starks
    message: Whispers from the North
    severity: CRITICAL
    value:
      regex: (Aria|Ned) Stark
      ignorecase: True

If you don't specify any, all built-in rules will be used be default. If you do, only those that you specify will be applicable.

🛠️ Severity specification in config.yml 🛠️

You can now specify severity levels that you want to use directly in config.yml as a list:

exclude:
  files:
    - \.npmrc
    - .*coded.*
    - \.git/.*
  keys:
    - SECRET_VALUE_KEY
  values:
    - SECRET_VALUE_PLACEHOLDER

severity:
  - BLOCKER
  - CRITICAL

If you don't specify any, all built-in severity levels will be used be default - BLOCKER, CRITICAL, MAJOR, MINOR, INFO.
If you do, only those that you specify will be applicable.

✅ New features ✅

No new features were introduced in this release. The primary objective of the present release was to optimize currently implemented logic in order to make it easier to read, understand, and work with in general. This refactoring, along with the aforementioned breaking changes, have shown to increase scanning speed of up to 7-10 times (depending on conditions) in comparison with version 1. In addition, it allowed achieving 100% unit test coverage.

Complete list of arguments can be found in whispers -h along with documentation in README.md

Copy link

@ocrawford555 ocrawford555 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adeptex awesome changes. A lot to process in here so I've left an initial review and may give it another pass later on. Are there any particular areas you would like reviewing in more detail?

I will also try running locally and seeing if there are any bugs I can find and then link back to the code. Let me know if you need any clarifications on my comments! 💯

args_parser.add_argument("-i", "--info", action="store_true", help="show extended help and exit")
args_parser.add_argument("-c", "--config", help="config file")
args_parser.add_argument("-o", "--output", help="output file")
args_parser.add_argument("-e", "--exitcode", default=0, type=int, help="exit code on success")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the exitcode need to be configurable? Should it not always be 0 on success?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature was requested in #61

It appears that when chaining/piping scripts in Unix-type shells, it is possible that a wrong exit code gets propagated down the pipeline if one of the scripts fails. So having this configurable helps ensure that the step completed successfully with the expected exit code. For instance: https://unix.stackexchange.com/questions/584990/why-is-exit-code-0-even-though-the-command-is-wrong

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, interesting.

whispers/core/args.py Outdated Show resolved Hide resolved
whispers/core/config.py Show resolved Hide resolved

logging.debug(f"Loaded plugin '{plugin}' for file '{file}'")

pairs = plugin().pairs(file)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the plugins could inherit from a common base class, which might help with some of the Python typing and make extensibility even easier in the future.

For example, creating a BasePlugin class which has an abstract method pairs, alongside other common properties/functions. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially that was the idea. However, in practice not all plugins have a generic way of parsing files. The pairs method is the most generic common thing that unites all plugins - calling that method gives you a list of KeyValuePair objects.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, maybe one for a future PR / contribution 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To summarize current cases:

  1. Plugin parses file line by line with enumerate, so pair.line can be directly set by tracking lineno. For example - Dockerfile, .ini, .npmrc

for lineno, line in enumerate(filepath.open(), 1):
if ":_authToken=" not in line:
continue
value = line.split(":_authToken=")[-1].strip()
if value:
key = "npm authToken"
yield KeyValuePair(key, value, keypath=[key], line=lineno)

  1. Plugin parses file as a dictionary, so pair.line has to be found later by reading the file as plaintext, if the pair is identified as a secret. For example - JSON, XML, YML

def find_line_number(pair: KeyValuePair) -> int:
"""Finds line number using pair keypath and value"""
if pair.line:
return pair.line # Already set
valuepath = pair.value.split("\n")[0][:16]
findpath = [*pair.keypath, valuepath]
foundline = 0
for lineno, line in enumerate(Path(pair.file).open(), 1):
founditems = 0
for item in findpath:
if item not in line:
break
founditems += 1
foundline = lineno
findpath = findpath[founditems:]
if not findpath:
return foundline
return 0

  1. Plugin parses file as AST, so pair.line can be directly set from available AST node properties. For example - Python

elif isinstance(node, astroid.nodes.Keyword):
key = self.node_to_str(node)
value = self.node_to_str(node.value)
if key and value and isinstance(value, (str, int)):
yield KeyValuePair(key, value, keypath=[key], line=node.lineno)

whispers/core/scope.py Outdated Show resolved Hide resolved
whispers/plugins/python.py Outdated Show resolved Hide resolved
whispers/plugins/shell.py Outdated Show resolved Hide resolved
tests/unit/test_pairs.py Outdated Show resolved Hide resolved
whispers/core/rules.py Outdated Show resolved Hide resolved
if pair.line:
return pair.line # Already set

valuepath = pair.value.split("\n")[0][:16]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 16 here? 😄 I'm sure there's good reason, but maybe an inline comment will help

Copy link
Contributor Author

@adeptex adeptex Sep 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for specific cases like the following

key: "\
-----BEGIN RSA KEY----- \
QyNTUxOQAAACCtrF27B/zd9DEpd38IbVBy93wSeYXKU0AGXMyO8ePu2QAAAKBSzpYEUs6W \
-----END RSA KEY-----\
"

When this is loaded by whispers, key key has the value of:

-----BEGIN RSA KEY----- \nQyNTUxOQAAACCtrF27B/zd9DEpd38IbVBy93wSeYXKU0AGXMyO8ePu2QAAAKBSzpYEUs6W \n-----END RSA KEY-----

However, if you were to look for that string in that file, you would not find it, because in plaintext its multi-line is formatted differently.

In order to find the line number where this value begins correctly, the operation is to first split the value to get -----BEGIN RSA KEY----- , and then only take up to the first 16 chars for finding the string in a line of plaintext.

@adeptex adeptex marked this pull request as draft September 22, 2021 15:10
Makefile Outdated Show resolved Hide resolved
Makefile Show resolved Hide resolved
from setuptools import find_packages, setup


def get_version():
return import_module("whispers.__version__").__version__


install_requires = ["luhn>=0.2.0", "lxml>=4.6.2", "pyyaml>=5.3.1", "astroid>=2.4.2", "jproperties>=2.1.0", "python-levenshtein>=0.12.0", "beautifulsoup4>=4.9.3"]
install_requires = ["dataclasses", "luhn", "lxml", "pyyaml", "astroid", "jproperties", "python-levenshtein", "beautifulsoup4"]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why versions are not restricted like dev dependencies? 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This decision was based on the following: https://caremad.io/posts/2013/07/setup-vs-requirement/

In addition, dependabot only bumps requirements.txt, while setup.py remains outdated. Updating versions in setup.py requires manual review. Unless there is an alternative way of keeping dependencies updated that I am missing, this seems to be the most reasonable solution.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is the way we use setup.py. dependabot/dependabot-core#1475
So we don't use dependabot or we change the way we manage dependencies.

However, I would push for having pinned dependencies on setup.py

@adeptex adeptex closed this Sep 23, 2021
@adeptex adeptex deleted the version-2 branch September 23, 2021 15:31
@adeptex adeptex restored the version-2 branch October 14, 2021 16:42
@adeptex adeptex reopened this Oct 14, 2021
@adeptex adeptex marked this pull request as ready for review October 18, 2021 10:42
Makefile Outdated Show resolved Hide resolved
Comment on lines +26 to +45
"""Ensure minimal expected config structure"""
try:
config["include"] = config.get("include", {"files": ["**/*"]})
config["include"]["files"] = config["include"].get("files", ["**/*"])

config["exclude"] = config.get("exclude", {"files": None, "keys": None, "values": None})
config["exclude"]["files"] = config["exclude"].get("files", None)
config["exclude"]["keys"] = config["exclude"].get("keys", None)
config["exclude"]["values"] = config["exclude"].get("values", None)

for idx in ["files", "keys", "values"]:
if not config["exclude"][idx]:
continue

# Create a single regex statement and compile it for efficient matching
unified = "|".join(config["exclude"][idx])
config["exclude"][idx] = re.compile(unified)

config["severity"] = config.get("severity", DEFAULT_SEVERITY)
config["rules"] = config.get("rules", list_rule_ids(default_rules()))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably for a future release, this could be changed to a dataclass or pydantic model where schema and defaults are defined in the model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good one :)

whispers/core/pairs.py Outdated Show resolved Hide resolved
whispers/core/pairs.py Outdated Show resolved Hide resolved
whispers/core/pairs.py Show resolved Hide resolved
whispers/core/rules.py Show resolved Hide resolved
Comment on lines +83 to +119
"""Checks if given data is printable text"""
if isinstance(data, bytes):
try:
data = data.decode("utf-8")
except Exception:
return False

if not isinstance(data, (str, int)):
return False

for ch in str(data):
if ch not in string.printable:
return False

return True


def is_base64(data: str) -> bool:
"""Checks if given data is base64-decodable to text"""
if not isinstance(data, str):
return False

try:
b64decode(data).decode("utf-8")
return True

except Exception:
return False


def is_base64_bytes(data: str) -> bool:
"""Checks if given data is base64-decodable to bytes"""
try:
return b64decode(data) != b""

except Exception:
return False

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capturing bare exceptions might be dangerous, we might be hiding errors. Can we just capture the ones we expect on these cases ex binascii.Error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are too many exceptions that could happen here. It could be wrongly guessing something as base64 and trying to decode it, or decode something that is usually base64 but not that time.

All errors are automatically logged into whispers.log for later analysis.

whispers/core/utils.py Outdated Show resolved Hide resolved
@adeptex adeptex closed this Oct 20, 2021
@adeptex adeptex reopened this Oct 20, 2021
@adeptex adeptex closed this Apr 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants