feat(settings): fuzzy env variable matching #4590
This is cool, left some initial minor thoughts/comments.
@@ -0,0 +1,123 @@
import os
We will want this to be a private module, maybe something in `ddtrace.internal`, or just renaming as `_matching.py`?
    def __init__(self):
        self.matcher = Trie()
        for env_name in os.environ.keys():
            if env_name.startswith(self.SCANNED_PREFIX):
so we'll only be able to fuzzy match if they have properly started with `DD_`?
I was thinking that `DD_` was probably something people wouldn't mistype (or the mistake is quite obvious), so it's not really worth taking into account variables without the prefix.
Plus, if a user forgets to put `DD_` (like `TRACE_ENABLED=true`), the matching method doesn't work that well, since the distance to the actual env variable is at least 3.
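To make the "at least 3" point concrete, here is a minimal sketch of a Damerau-Levenshtein distance (the restricted "optimal string alignment" variant) written as a plain dynamic-programming table. The function name `osa_distance` is hypothetical, and this is not the trie-based matcher from the PR, only an illustration of the metric it relies on.

```python
def osa_distance(a, b):
    """Edit distance counting insert, delete, substitute and adjacent swap as 1 each."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped count as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# A swapped pair of letters is one edit away from the real name...
print(osa_distance("DD_TRACE_ENABELD", "DD_TRACE_ENABLED"))
# ...while a missing "DD_" prefix already costs three insertions.
print(osa_distance("TRACE_ENABLED", "DD_TRACE_ENABLED"))
```

With a maximum distance capped at 2, the second case can never match, which is the argument for only scanning names that already start with `DD_`.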
ddtrace/settings/matching.py
Outdated
max_dist = min(max((len(key) - len(self.SCANNED_PREFIX)) / 3, 1), 2)
match = self.matcher.match_damreau_levhenstein(key, max_dist)
if match is not None:
    log.warning("Env variable %s not recognized, did you mean %s", key, match.match)
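As a quick worked example of the `max_dist` expression in this hunk, the sketch below evaluates it for a few key lengths. The helper name `max_distance` and the sample keys are only illustrative, and it assumes Python 3 true division; under Python 2 without `from __future__ import division`, `/` on integers would floor instead.

```python
SCANNED_PREFIX = "DD_"


def max_distance(key, prefix=SCANNED_PREFIX):
    # One allowed edit per three characters after the prefix,
    # clamped to the range [1, 2].
    return min(max((len(key) - len(prefix)) / 3, 1), 2)


for key in ("DD_ENV", "DD_TAGS", "DD_SERVICE", "DD_TRACE_ENABLED"):
    print(key, max_distance(key))
```

Short names get a budget of 1 edit, names with seven or more characters after the prefix get 2, and lengths in between produce a fractional threshold.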
log.warning("Env variable %s not recognized, did you mean %s", key, match.match)
log.warning("Env variable %s not recognized, did you mean %s", key, match.match)
Then we'll get quotes around them.
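One way to get quotes around the names is the `%r` conversion, which formats each argument with `repr()`. That this is what the suggestion intends is an assumption; the sketch below only shows the difference between the two conversions, using a made-up typo as input.

```python
args = ("DD_TRACE_ENABELD", "DD_TRACE_ENABLED")

# logging applies "msg % args" lazily, so plain %-formatting shows the same result.
plain = "Env variable %s not recognized, did you mean %s" % args
quoted = "Env variable %r not recognized, did you mean %r" % args

print(plain)
print(quoted)
```

The quoted form makes stray whitespace or an empty name visible in the log line.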
@@ -2,7 +2,7 @@
 universal=1

 [codespell]
-skip = *.json,*.cpp,*.c,.riot,.tox,.mypy_cache,.git,*ddtrace/vendor
+skip = *.json,*.cpp,*.c,.riot,.tox,.mypy_cache,.git,*ddtrace/vendor,tests/settings/test_levhenstein_distance.py
why do we need to skip the whole file here?
codespell reports an error on one of the test inputs, and from a quick look it doesn't seem to support inline ignores (codespell-project/codespell#1212).
@paullegranddc this pull request is now in conflict 😩
    return a


class Trie(object):
Can we leverage the `difflib` module here somehow?
I've taken a look at the `difflib` module, and it's based on a different algorithm (substring matching).
I think the one currently implemented works better for correcting typos, for two reasons:
- The algorithm `difflib` uses returns a score that depends on the length of the strings matched, whereas we mostly care about the absolute number of differences, and it doesn't count a character swap as a single change.
- The current implementation should be more efficient on long strings because we can prune results early by giving a max distance of 1 or 2. That makes the matching mostly O(len(text)), whereas `difflib` is O(len(text) * len(pattern) * nb(pattern)).
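The length dependence mentioned in the first point can be seen directly with `difflib.SequenceMatcher`. In this small sketch both pairs differ by exactly one substituted character, yet the similarity ratio changes with the length of the names; the variable names and sample strings are only illustrative.

```python
import difflib


def ratio(a, b):
    # SequenceMatcher.ratio() is 2 * matched_chars / (len(a) + len(b)).
    return difflib.SequenceMatcher(None, a, b).ratio()


# One substituted character in a short name and in a long name.
short_ratio = ratio("DD_ENV", "DD_ENW")
long_ratio = ratio("DD_TRACE_ENABLED", "DD_TRACE_ENABLEX")

print(round(short_ratio, 4))
print(round(long_ratio, 4))
```

A single cutoff on the ratio therefore tolerates more typos in long names than in short ones, whereas a maximum edit distance of 1 or 2 treats both cases the same.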
This is a cool POC, but closing the PR for now until we are able to come back to it!
Description
This PR replaces `os.environ.get` and `os.getenv` calls with a proxy object that, when an env variable is not set, checks whether another variable with a close name is set.
Checklist
`feat` and `fix` pull requests.
Motivation
Design
Testing strategy
Relevant issue(s)
Reviewer Checklist
`changelog/no-changelog` label added.