Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(settings): fuzzy env variable matching #4590

Closed
wants to merge 5 commits into from

Conversation

paullegranddc
Copy link

@paullegranddc paullegranddc commented Nov 18, 2022

Description

This PR replaces os.environ.get and os.getenv calls with a proxy object that checks if the env variable is not set wether another variable with a close name is set.

Checklist

Motivation

Design

Testing strategy

Relevant issue(s)

Testing strategy

Reviewer Checklist

  • Title is accurate.
  • Description motivates each change.
  • No unnecessary changes were introduced in this PR.
  • Avoid breaking API changes unless absolutely necessary.
  • Tests provided or description of manual testing performed is included in the code or PR.
  • Release note has been added for fixes and features, or else changelog/no-changelog label added.
  • All relevant GitHub issues are correctly linked.
  • Backports are identified and tagged with Mergifyio.

@mergify mergify bot mentioned this pull request Nov 18, 2022
10 tasks
@paullegranddc paullegranddc changed the title Fuzzy env variable matching feat (settings): Fuzzy env variable matching Nov 18, 2022
@paullegranddc paullegranddc changed the title feat (settings): Fuzzy env variable matching feat(settings): Fuzzy env variable matching Nov 18, 2022
@paullegranddc paullegranddc changed the title feat(settings): Fuzzy env variable matching feat(settings): fuzzy env variable matching Nov 18, 2022
Copy link
Member

@brettlangdon brettlangdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is cool, let some initial minor thoughts/comments

@@ -0,0 +1,123 @@
import os
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will want this to be a private module, maybe something in ddtrace.internal, or just renaming as _matching.py ?

def __init__(self):
self.matcher = Trie()
for env_name in os.environ.keys():
if env_name.startswith(self.SCANNED_PREFIX):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we'll only be able to fuzzy match if they have properly started with DD_?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that DD_ was probably something people wouldn't mistype (or the mistake is quite obvious) so it's not really worth taking into account variables without the prefix.

Plus if a user forgets to put DD_ (like TRACE_ENABLED=true) the matching method doesn't work that well since the distance to the actual env is at least 3.

max_dist = min(max((len(key) - len(self.SCANNED_PREFIX)) / 3, 1), 2)
match = self.matcher.match_damreau_levhenstein(key, max_dist)
if match is not None:
log.warning("Env variable %s not recognized, did you mean %s", key, match.match)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.warning("Env variable %s not recognized, did you mean %s", key, match.match)
log.warning("Env variable %s not recognized, did you mean %s", key, match.match)

Then we'll get quotes around them.

@@ -2,7 +2,7 @@
universal=1

[codespell]
skip = *.json,*.cpp,*.c,.riot,.tox,.mypy_cache,.git,*ddtrace/vendor
skip = *.json,*.cpp,*.c,.riot,.tox,.mypy_cache,.git,*ddtrace/vendor,tests/settings/test_levhenstein_distance.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to skip the whole file here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The codespell has an errror on of of the test input, and from a quick look it looks like codespell doesn't support inline ignore codespell-project/codespell#1212

@mergify
Copy link
Contributor

mergify bot commented Nov 23, 2022

@paullegranddc this pull request is now in conflict 😩

@mergify mergify bot added the conflicts label Nov 23, 2022
return a


class Trie(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we leverage the difflib module here somehow?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've taken a look at the difflib module, and it's based on another algorithm (substring matching).
I think the one currently implemented works better to correct typos for two reasons:

  • The algorithm difflib uses returns a score that depends on the length of the strings matched, whereas we mostly care about the absolute number of differences, and it doesn't consider character swap as a single change.
  • The current implementation should be more efficient on long strings because we can prune results early by giving a max distance of 1 or 2. That makes the matching mostly O(len(text)) whereas difflib is O(len(text) * len(pattern) * nb(pattern))

@github-actions github-actions bot added the stale label Dec 31, 2022
@brettlangdon
Copy link
Member

This is a cool POC, but closing the PR for now until we are able to come back to it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants