Commit
feat(tracer): add `TracerFlareSubscriber` to enable the tracer flare (#9150)

## Overview

This PR adds a remote config subscriber for the tracer flare. The tracer flare logic was already implemented in a previous PR, but was not wired up to actually react to the RC products yet. This PR is the last piece to enable the tracer flare for dd-trace-py.

A tracer flare is a tar file containing tracer logs from (approximately) the last 5 minutes, as well as a JSON file of the current tracer configurations. This pair of files will be generated per tracer instance that is connected to the agent, and the tar containing all of these files will be sent to a Zendesk ticket.

Users can trigger the tracer flare the same way they trigger agent flares. See Agent Flare documentation [here](https://docs.datadoghq.com/agent/troubleshooting/send_a_flare/?tab=agentv6v7).

For details on the flare implementation, see #8961 and #8969.

## Risks

### `AGENT_CONFIG` doesn't get cleared very frequently

Something I noticed when doing some E2E testing is that if you try to do consecutive tracer flare requests, this won't work because the `AGENT_CONFIG` is still retaining the state from the previous request. This means that there isn't an update/publish event that can get picked up on our end, so we can't trigger another flare for some amount of time (not sure what this duration is exactly). The current implementation depends on the publish event, so trying to trigger consecutive tracer flare requests will not work until the state gets cleared.

### `AGENT_CONFIG` and `AGENT_TASK` are not exclusive to tracer flare use

Currently, the tracer is listening for changes to the `AGENT_CONFIG` and `AGENT_TASK` remote config products. This is originally intended for the **agent** flare, not the tracer flare, but at this time we are piggy-backing on this signal. For this reason, it's been flagged by other tracer teams that the format/contents of the products may not be guaranteed.

In the case that we start to notice flares not being triggered/generated as expected, this may be a code fix to check for. The current expectation for the products is:

`AGENT_CONFIG`

```json
{
  "metadata": [
    {
      "id": "flare-log-level.<log-level>",
      "product_name": "AGENT_CONFIG",
      "sha256_hash": "xxx",
      "length": 63,
      "tuf_version": 3,
      "apply_state": 2,
      "apply_error": "None"
    }
  ],
  "config": [
    {
      "config": {
        "log_level": "<log-level>"
      },
      "name": "flare-log-level.<log-level>"
    }
  ],
  "shared_data_counter": 2
}
```

`AGENT_TASK`

```json
{
  "metadata": [
    {
      "id": "id1",
      "product_name": "AGENT_TASK",
      "sha256_hash": "xxx",
      "length": 139,
      "tuf_version": 4,
      "apply_state": 2,
      "apply_error": "None"
    }
  ],
  "config": [
    false,
    {
      "args": {
        "case_id": "111",
        "hostname": "myhostname",
        "user_handle": "user.name@datadoghq.com"
      },
      "task_type": "tracer_flare",
      "uuid": "yyyyyy"
    }
  ],
  "shared_data_counter": 5
}
```

## Checklist

- [x] Change(s) are motivated and described in the PR description
- [x] Testing strategy is described if automated tests are not included in the PR
- [x] Risks are described (performance impact, potential for breakage, maintainability)
- [x] Change is maintainable (easy to change, telemetry, documentation)
- [x] [Library release note guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html) are followed or label `changelog/no-changelog` is set
- [x] Documentation is included (in-code, generated user docs, [public corp docs](https://github.com/DataDog/documentation/))
- [x] Backport labels are set (if [applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))
- [x] If this PR changes the public interface, I've notified `@DataDog/apm-tees`.

## Reviewer Checklist

- [x] Title is accurate
- [x] All changes are related to the pull request's stated goal
- [x] Description motivates each change
- [x] Avoids breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes
- [x] Testing strategy adequately addresses listed risks
- [x] Change is maintainable (easy to change, telemetry, documentation)
- [x] Release note makes sense to a user of the library
- [x] Author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
- [x] Backport labels are set in a manner that is consistent with the [release branch maintenance policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Brett Langdon <brett.langdon@datadoghq.com>
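
Because the description notes that the payload format is not guaranteed, a defensive check of the payload shape can help when debugging flares that fail to trigger. The following is a minimal sketch, assuming the payloads arrive as plain dicts shaped like the JSON in the description. `classify_flare_payload` is a hypothetical helper for illustration and is not part of dd-trace-py.

```python
from typing import Any, Dict, Optional


def classify_flare_payload(data: Dict[str, Any]) -> Optional[str]:
    """Return "prepare", "generate", or None for a remote config payload.

    Hypothetical helper: it only mirrors the payload shapes documented in
    the commit message and is not part of dd-trace-py.
    """
    metadata = data.get("metadata") or []
    configs = data.get("config") or []
    if not metadata or not configs:
        return None

    product = metadata[0].get("product_name")
    for entry in configs:
        # The documented AGENT_TASK example contains a non-dict entry
        # (false), so anything that is not a dict is skipped.
        if not isinstance(entry, dict):
            continue
        name = str(entry.get("name", ""))
        if product == "AGENT_CONFIG" and name.startswith("flare-log-level"):
            return "prepare"
        if product == "AGENT_TASK" and entry.get("task_type") == "tracer_flare":
            return "generate"
    return None


# Trimmed copies of the two documented payloads.
agent_config = {
    "metadata": [{"id": "flare-log-level.debug", "product_name": "AGENT_CONFIG"}],
    "config": [{"config": {"log_level": "debug"}, "name": "flare-log-level.debug"}],
}
agent_task = {
    "metadata": [{"id": "id1", "product_name": "AGENT_TASK"}],
    "config": [
        False,
        {"args": {"case_id": "111"}, "task_type": "tracer_flare", "uuid": "yyyyyy"},
    ],
}

print(classify_flare_payload(agent_config))
print(classify_flare_payload(agent_task))
```

Returning `None` for anything unrecognized keeps the caller's decision simple: payloads that belong to other uses of `AGENT_CONFIG` or `AGENT_TASK` are ignored rather than treated as errors.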
1 parent 5897cab, commit b73738e

Showing 9 changed files with 347 additions and 33 deletions.
Empty file.
@@ -0,0 +1,78 @@
from datetime import datetime
import os
from typing import Callable  # noqa:F401
from typing import Optional  # noqa:F401

from ddtrace.internal.flare.flare import Flare
from ddtrace.internal.logger import get_logger
from ddtrace.internal.remoteconfig._connectors import PublisherSubscriberConnector  # noqa:F401
from ddtrace.internal.remoteconfig._subscribers import RemoteConfigSubscriber


log = get_logger(__name__)

DEFAULT_STALE_FLARE_DURATION_MINS = 20


class TracerFlareSubscriber(RemoteConfigSubscriber):
    def __init__(
        self,
        data_connector: PublisherSubscriberConnector,
        callback: Callable,
        flare: Flare,
        stale_flare_age: int = DEFAULT_STALE_FLARE_DURATION_MINS,
    ):
        super().__init__(data_connector, callback, "TracerFlareConfig")
        self.current_request_start: Optional[datetime] = None
        self.stale_tracer_flare_num_mins = stale_flare_age
        self.flare = flare

    def has_stale_flare(self) -> bool:
        if self.current_request_start:
            curr = datetime.now()
            flare_age = (curr - self.current_request_start).total_seconds()
            stale_age = self.stale_tracer_flare_num_mins * 60
            return flare_age >= stale_age
        return False

    def _get_data_from_connector_and_exec(self):
        if self.has_stale_flare():
            log.info(
                "Tracer flare request started at %s is stale, reverting "
                "logger configurations and cleaning up resources now",
                self.current_request_start,
            )
            self.current_request_start = None
            self._callback(self.flare, {}, True)
            return

        data = self._data_connector.read()
        metadata = data.get("metadata")
        if not metadata:
            log.debug("No metadata received from data connector")
            return

        for md in metadata:
            product_type = md.get("product_name")
            if product_type == "AGENT_CONFIG":
                # We will only process one tracer flare request at a time
                if self.current_request_start is not None:
                    log.warning(
                        "There is already a tracer flare job started at %s. Skipping new request.",
                        str(self.current_request_start),
                    )
                    return
                self.current_request_start = datetime.now()
            elif product_type == "AGENT_TASK":
                # Possible edge case where we don't have an existing flare request
                # In this case we won't have anything to send, so we log and do nothing
                if self.current_request_start is None:
                    log.warning("There is no tracer flare job to complete. Skipping new request.")
                    return
                self.current_request_start = None
            else:
                # Use %s so the logger interpolates the argument; "{}" is not a logging placeholder
                log.debug("Received unexpected product type for tracer flare: %s", product_type)
                return
            log.debug("[PID %d] %s _exec_callback: %s", os.getpid(), self, str(data)[:50])
            self._callback(self.flare, data)
            return
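
The subscriber's bookkeeping amounts to a small state machine: `AGENT_CONFIG` starts a request, `AGENT_TASK` completes it, and a request older than the stale threshold is cleaned up on the next poll. The sketch below isolates that logic so it can be exercised without ddtrace. `FlareRequestTracker` and its action strings are hypothetical names introduced for illustration, and the clock is passed in explicitly (an assumption made here for testability; the subscriber itself calls `datetime.now()`).

```python
from datetime import datetime, timedelta
from typing import Optional


class FlareRequestTracker:
    """Hypothetical stand-in for the subscriber's request bookkeeping."""

    def __init__(self, stale_after_mins: int = 20) -> None:
        self.stale_after = timedelta(minutes=stale_after_mins)
        self.started_at: Optional[datetime] = None

    def is_stale(self, now: datetime) -> bool:
        # A request is stale once it has been open for the full threshold.
        if self.started_at is None:
            return False
        return now - self.started_at >= self.stale_after

    def on_product(self, product_name: str, now: datetime) -> str:
        """Return the action taken: "prepare", "generate", "cleanup" or "skip"."""
        # The stale check runs first, before the product is even inspected.
        if self.is_stale(now):
            self.started_at = None
            return "cleanup"
        if product_name == "AGENT_CONFIG":
            if self.started_at is not None:
                return "skip"  # only one request at a time
            self.started_at = now
            return "prepare"
        if product_name == "AGENT_TASK":
            if self.started_at is None:
                return "skip"  # nothing was prepared, so nothing to send
            self.started_at = None
            return "generate"
        return "skip"  # unexpected product


t0 = datetime(2024, 1, 1, 12, 0, 0)
tracker = FlareRequestTracker()
for minutes, product in [(0, "AGENT_CONFIG"), (1, "AGENT_CONFIG"), (2, "AGENT_TASK")]:
    print(minutes, product, tracker.on_product(product, t0 + timedelta(minutes=minutes)))
```

Injecting `now` makes the stale path easy to reach in a test without sleeping for 20 minutes.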
@@ -0,0 +1,90 @@
from typing import Any
from typing import Callable
from typing import List

from ddtrace.internal.flare.flare import Flare
from ddtrace.internal.flare.flare import FlareSendRequest
from ddtrace.internal.logger import get_logger


log = get_logger(__name__)


def _tracerFlarePubSub():
    from ddtrace.internal.flare._subscribers import TracerFlareSubscriber
    from ddtrace.internal.remoteconfig._connectors import PublisherSubscriberConnector
    from ddtrace.internal.remoteconfig._publishers import RemoteConfigPublisher
    from ddtrace.internal.remoteconfig._pubsub import PubSub

    class _TracerFlarePubSub(PubSub):
        __publisher_class__ = RemoteConfigPublisher
        __subscriber_class__ = TracerFlareSubscriber
        __shared_data__ = PublisherSubscriberConnector()

        def __init__(self, callback: Callable, flare: Flare):
            self._publisher = self.__publisher_class__(self.__shared_data__, None)
            self._subscriber = self.__subscriber_class__(self.__shared_data__, callback, flare)

    return _TracerFlarePubSub


def _handle_tracer_flare(flare: Flare, data: dict, cleanup: bool = False):
    if cleanup:
        flare.revert_configs()
        flare.clean_up_files()
        return

    if "config" not in data:
        log.warning("Unexpected tracer flare RC payload %r", data)
        return
    if len(data["config"]) == 0:
        log.warning("Unexpected number of tracer flare RC payloads %r", data)
        return

    product_type = data.get("metadata", [{}])[0].get("product_name")
    configs = data.get("config", [{}])
    if product_type == "AGENT_CONFIG":
        _prepare_tracer_flare(flare, configs)
    elif product_type == "AGENT_TASK":
        _generate_tracer_flare(flare, configs)
    else:
        log.warning("Received unexpected tracer flare product type: %s", product_type)


def _prepare_tracer_flare(flare: Flare, configs: List[dict]):
    """
    Update configurations to start sending tracer logs to a file
    to be sent in a flare later.
    """
    for c in configs:
        # AGENT_CONFIG is currently being used for multiple purposes
        # We only want to prepare for a tracer flare if the config name
        # starts with 'flare-log-level'
        if not c.get("name", "").startswith("flare-log-level"):
            continue

        flare_log_level = c.get("config", {}).get("log_level").upper()
        flare.prepare(c, flare_log_level)
        return


def _generate_tracer_flare(flare: Flare, configs: List[Any]):
    """
    Revert tracer flare configurations back to original state
    before sending the flare.
    """
    for c in configs:
        # AGENT_TASK is currently being used for multiple purposes
        # We only want to generate the tracer flare if the task_type is
        # 'tracer_flare'
        if type(c) != dict or c.get("task_type") != "tracer_flare":
            continue
        args = c.get("args", {})
        flare_request = FlareSendRequest(
            case_id=args.get("case_id"), hostname=args.get("hostname"), email=args.get("user_handle")
        )

        flare.revert_configs()

        flare.send(flare_request)
        return
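
One property of the generate step worth pinning down in a test is ordering: the logger configuration is reverted before the flare is sent. The sketch below checks that with `unittest.mock`. `generate_flare` is a condensed, hypothetical copy of the generate step written for this example; it passes the request arguments as a plain dict rather than a `FlareSendRequest` so that the snippet has no ddtrace dependency.

```python
from typing import Any, List
from unittest import mock


def generate_flare(flare: Any, configs: List[Any]) -> None:
    """Condensed, hypothetical copy of the generate step for illustration."""
    for c in configs:
        # Skip non-dict entries and tasks that are not tracer flares.
        if not isinstance(c, dict) or c.get("task_type") != "tracer_flare":
            continue
        args = c.get("args", {})
        request = {
            "case_id": args.get("case_id"),
            "hostname": args.get("hostname"),
            "email": args.get("user_handle"),
        }
        flare.revert_configs()  # restore the original logger configuration first
        flare.send(request)  # then ship the collected files
        return  # only the first matching task is handled


fake_flare = mock.Mock()
generate_flare(
    fake_flare,
    [
        False,
        {
            "args": {
                "case_id": "111",
                "hostname": "myhostname",
                "user_handle": "user.name@datadoghq.com",
            },
            "task_type": "tracer_flare",
            "uuid": "yyyyyy",
        },
    ],
)

# Each recorded call unpacks into (name, args, kwargs), in call order.
call_names = [name for name, _args, _kwargs in fake_flare.mock_calls]
print(call_names)
```

A `Mock` records every method call in order, so the same pattern can be used to check that unrelated `AGENT_TASK` entries produce no calls at all.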
@@ -0,0 +1,5 @@
---
features:
  - |
    tracer: This introduces the tracer flare functionality. Currently the tracer flare includes the
    tracer logs and tracer configurations.