[core] Collect run-time metrics #819

Merged: 75 commits (Apr 11, 2019)

Commits

c29f31c
[metrics] initial implementation
Kyle-Verhoog Feb 2, 2019
8c2679f
[metrics] add gc generation metrics
Kyle-Verhoog Feb 2, 2019
d9b61b6
[metrics] clean-up
Kyle-Verhoog Feb 8, 2019
5f14909
[metrics] add thread worker, additional metrics
Kyle-Verhoog Feb 9, 2019
f0174de
[metrics] linting
Kyle-Verhoog Feb 9, 2019
4a4c9f7
[metrics] code organization
Kyle-Verhoog Mar 15, 2019
75c05c1
[metrics] add runtime_id to tracer
Kyle-Verhoog Mar 15, 2019
1c4a3b8
[metrics] resolve rebase conflicts
Kyle-Verhoog Mar 15, 2019
066df1a
[metrics] linting
Kyle-Verhoog Mar 15, 2019
b96ba74
[metrics] add runtime-id tag
Kyle-Verhoog Mar 15, 2019
e84e450
[metrics] linting
Kyle-Verhoog Mar 15, 2019
1e10667
[metrics] linting
Kyle-Verhoog Mar 15, 2019
910b83b
Add environment variable for enabling runtime metrics
majorgreys Mar 21, 2019
0abcafb
Environment configuration for dogstatsd
majorgreys Mar 21, 2019
1612214
apply brettlinter
brettlangdon Mar 22, 2019
eed154c
[metrics] remove unnecessary LazyValues
Kyle-Verhoog Mar 22, 2019
009b94f
[metrics] in-line psutil method calls
Kyle-Verhoog Mar 22, 2019
8aea85a
[metrics] use internal logger
Kyle-Verhoog Mar 22, 2019
02f2b00
[metrics] add reset method, gather services
Kyle-Verhoog Mar 22, 2019
5ba8dc1
[metrics] support multiple services properly
Kyle-Verhoog Mar 22, 2019
810ec4e
[metrics] use base test case
Kyle-Verhoog Mar 22, 2019
7f63ec9
[metrics] handle process forking
Kyle-Verhoog Mar 23, 2019
d23995c
[metrics] add runtime metrics tags to spans
Kyle-Verhoog Mar 25, 2019
d886bff
Remove LazyValue
majorgreys Mar 25, 2019
fcda216
Add dependencies for runtime metrics to library
majorgreys Mar 26, 2019
a333f70
Refactor metrics collectors and add tests
majorgreys Mar 28, 2019
321474d
Begin major refactoring of api
majorgreys Mar 28, 2019
1cc7895
Decouple dogstatsd from runtime metrics
majorgreys Mar 29, 2019
f851205
Fix constant
majorgreys Mar 29, 2019
c900dd2
Fix flake8
majorgreys Mar 29, 2019
2e807ec
Separate host/port for trace agent and dogstatsd
majorgreys Mar 29, 2019
a9999c8
Update ddtrace_run tests
majorgreys Mar 29, 2019
0308fd7
Fix integration test
majorgreys Mar 29, 2019
992c9ce
Merge branch '0.24-dev' into kyle-verhoog/metrics
majorgreys Apr 1, 2019
c78c5a0
Merge branch '0.24-dev' into kyle-verhoog/metrics
majorgreys Apr 1, 2019
a198c5f
Vendor datadogpy to fix issues with gevent+requests
majorgreys Apr 1, 2019
4e8e40e
Revert change to on import
majorgreys Apr 1, 2019
868891e
Add license for dogstatsd
majorgreys Apr 1, 2019
df7a07f
Move runtime metrics into internal
majorgreys Apr 1, 2019
c58e796
Fixes for ddtrace.internal.runtime
majorgreys Apr 1, 2019
effd59a
Wrap worker flush in try-except to log errors
majorgreys Apr 1, 2019
1ffdcb9
Flush calls gauge which is a UDP so no need to catch errors
majorgreys Apr 2, 2019
71439ac
Remove unused datadog and metrics tests
majorgreys Apr 2, 2019
86f70c8
Rename class in repr
majorgreys Apr 2, 2019
15953d0
Remove collect_fn argument from ValueCollector
majorgreys Apr 2, 2019
b1ff051
Fix flake8
majorgreys Apr 2, 2019
fbbbddf
Remove tags not called for in RFC
majorgreys Apr 2, 2019
b592566
Merge branch '0.24-dev' into kyle-verhoog/metrics
majorgreys Apr 2, 2019
3940813
Better metric names for cpu
majorgreys Apr 2, 2019
50a6ecf
Merge branch 'kyle-verhoog/metrics' of github.com:DataDog/dd-trace-py…
majorgreys Apr 2, 2019
641f9b6
Use 0-1-2 for gc collections
majorgreys Apr 5, 2019
da771e1
Merge branch '0.24-dev' into kyle-verhoog/metrics
majorgreys Apr 5, 2019
38e7f60
Comments
majorgreys Apr 5, 2019
9a8b6c7
Merge branch 'kyle-verhoog/metrics' of github.com:DataDog/dd-trace-py…
majorgreys Apr 5, 2019
156b6b4
Fix daemon for threading
majorgreys Apr 8, 2019
589a89b
Add test on metrics received by dogstatsd
majorgreys Apr 8, 2019
48d9bf2
Remove datadog dependency since we have it vendored
majorgreys Apr 8, 2019
34d5c0c
Fix cpu metrics
majorgreys Apr 8, 2019
e344085
Fix cumulative metrics
majorgreys Apr 10, 2019
a234743
Fix reset
majorgreys Apr 10, 2019
657061b
Flag check unnecessary
majorgreys Apr 10, 2019
a76e1ee
Fix runtime tag names
brettlangdon Apr 10, 2019
a9fb5c0
Merge branch 'kyle-verhoog/metrics' of github.com:DataDog/dd-trace-py…
majorgreys Apr 10, 2019
52acbb8
Only tag root span with runtime info
majorgreys Apr 10, 2019
610e8ce
Use common namespace for gc metric names
majorgreys Apr 10, 2019
94f58ad
Remove unnecessary set check
majorgreys Apr 10, 2019
5d34662
Wait for tests of metrics received
majorgreys Apr 10, 2019
af39200
Fix for constant tags and services
majorgreys Apr 10, 2019
75fb9de
Fix broken config
majorgreys Apr 10, 2019
bc560ed
Fix flake8
majorgreys Apr 11, 2019
7e26b3f
Merge branch '0.24-dev' into kyle-verhoog/metrics
majorgreys Apr 11, 2019
c467106
Fix ddtrace-run test for runtime metrics enabled
majorgreys Apr 11, 2019
667feea
Merge branch 'kyle-verhoog/metrics' of github.com:DataDog/dd-trace-py…
majorgreys Apr 11, 2019
077cad9
Update ddtrace/bootstrap/sitecustomize.py
brettlangdon Apr 11, 2019
ab0c594
Merge branch '0.24-dev' into kyle-verhoog/metrics
majorgreys Apr 11, 2019

Files changed

3 changes: 3 additions & 0 deletions ddtrace/bootstrap/sitecustomize.py
@@ -85,6 +85,7 @@ def add_global_tags(tracer):
hostname = os.environ.get('DD_AGENT_HOST', os.environ.get('DATADOG_TRACE_AGENT_HOSTNAME'))
port = os.environ.get("DATADOG_TRACE_AGENT_PORT")
priority_sampling = os.environ.get("DATADOG_PRIORITY_SAMPLING")
runtime_metrics_enabled = get_env('runtime_metrics', 'enabled')

opts = {}

@@ -97,6 +98,8 @@ def add_global_tags(tracer):
opts["port"] = int(port)
if priority_sampling:
opts["priority_sampling"] = asbool(priority_sampling)
if runtime_metrics_enabled:
Review comment (Member):

If we set a default, then we'll always have this, unless they do DD_RUNTIME_METRICS= and empty string is falsey.

Also, we don't really use "enabled" as a value for other things, do we? We should just use True as the default.

We should be able to change to:

opts['collect_metrics'] = asbool(get_env('runtime_metrics', True))

Review comment (Member):

We need to verify if we want this to be True or False by default.

Review comment (Member):

nvm, I completely forgot how get_env works, it is get_env(<integration>, <name>) so this is DD_RUNTIME_METRICS_ENABLED and the default is None.

So this is totally fine to keep as-is! sorry about any confusion!

opts["collect_metrics"] = asbool(runtime_metrics_enabled)

if opts:
tracer.configure(**opts)
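For context, a minimal sketch of how this option is expected to be used, assuming (per the review thread above) that get_env('runtime_metrics', 'enabled') reads the DD_RUNTIME_METRICS_ENABLED environment variable; the launch command and the exact set of strings asbool() accepts are illustrative assumptions, not taken from this diff:

# Hypothetical launch of an app instrumented via ddtrace-run:
#   DD_RUNTIME_METRICS_ENABLED=true ddtrace-run python app.py
import os

def runtime_metrics_opts():
    # Rough equivalent of get_env('runtime_metrics', 'enabled') in sitecustomize.py
    runtime_metrics_enabled = os.environ.get('DD_RUNTIME_METRICS_ENABLED')
    opts = {}
    if runtime_metrics_enabled:
        # asbool() is assumed here to treat 'true' / '1' (case-insensitive) as True
        opts['collect_metrics'] = runtime_metrics_enabled.lower() in ('true', '1')
    return opts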
12 changes: 12 additions & 0 deletions ddtrace/internal/runtime/__init__.py
@@ -0,0 +1,12 @@
from .runtime_metrics import (
RuntimeTags,
RuntimeMetrics,
RuntimeWorker,
)


__all__ = [
'RuntimeTags',
'RuntimeMetrics',
'RuntimeWorker',
]
85 changes: 85 additions & 0 deletions ddtrace/internal/runtime/collector.py
@@ -0,0 +1,85 @@
import importlib

from ..logger import get_logger

log = get_logger(__name__)


class ValueCollector(object):
"""A basic state machine useful for collecting, caching and updating data
obtained from different Python modules.

The two primary use-cases are
1) data loaded once (like tagging information)
2) periodically updating data sources (like thread count)

Functionality is provided for requiring and importing modules which may or
may not be installed.
"""
enabled = True
periodic = False
required_modules = []
value = None
value_loaded = False

def __init__(self, enabled=None, periodic=None, required_modules=None):
self.enabled = self.enabled if enabled is None else enabled
self.periodic = self.periodic if periodic is None else periodic
self.required_modules = self.required_modules if required_modules is None else required_modules

self._modules_successfully_loaded = False
self.modules = self._load_modules()
if self._modules_successfully_loaded:
self._on_modules_load()

def _on_modules_load(self):
"""Hook triggered after all required_modules have been successfully loaded.
"""

def _load_modules(self):
modules = {}
try:
for module in self.required_modules:
modules[module] = importlib.import_module(module)
self._modules_successfully_loaded = True
except ImportError:
# DEV: disable collector if we cannot load any of the required modules
self.enabled = False
log.warn('Could not import module "{}" for {}. Disabling collector.'.format(module, self))
return None
return modules

def collect(self, keys=None):
"""Returns metrics as collected by `collect_fn`.

:param keys: The keys of the metrics to collect.
"""
if not self.enabled:
return self.value

keys = keys or set()

if not self.periodic and self.value_loaded:
return self.value

# call underlying collect function and filter out keys not requested
self.value = self.collect_fn(keys)

# filter values for keys
if len(keys) > 0 and isinstance(self.value, list):
self.value = [
(k, v)
for (k, v) in self.value
if k in keys
]

self.value_loaded = True
return self.value

def __repr__(self):
return '<{}(enabled={},periodic={},required_modules={})>'.format(
self.__class__.__name__,
self.enabled,
self.periodic,
self.required_modules,
)
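As a usage illustration only (not part of this diff), here is a hypothetical one-shot subclass of ValueCollector; the class name and metric key are made up for the example:

from ddtrace.internal.runtime.collector import ValueCollector

class ExampleVersionCollector(ValueCollector):
    """Hypothetical collector: imports `platform` once and caches a single value."""
    required_modules = ['platform']
    periodic = False  # collect once, then serve the cached value

    def collect_fn(self, keys):
        platform = self.modules.get('platform')
        return [('example.python.version', platform.python_version())]

collector = ExampleVersionCollector()
collector.collect()  # imports succeeded, so this returns [('example.python.version', ...)]
collector.collect()  # periodic=False: the cached value is returned without re-collecting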
46 changes: 46 additions & 0 deletions ddtrace/internal/runtime/constants.py
@@ -0,0 +1,46 @@
GC_GEN0_COUNT = 'runtime.python.gc.gen0_count'
GC_GEN1_COUNT = 'runtime.python.gc.gen1_count'
GC_GEN2_COUNT = 'runtime.python.gc.gen2_count'

THREAD_COUNT = 'runtime.python.thread_count'
MEM_RSS = 'runtime.python.mem.rss'
CPU_TIME_SYS = 'runtime.python.cpu.time.sys'
CPU_TIME_USER = 'runtime.python.cpu.time.user'
CPU_PERCENT = 'runtime.python.cpu.percent'
CTX_SWITCH_VOLUNTARY = 'runtime.python.cpu.ctx_switch.voluntary'
CTX_SWITCH_INVOLUNTARY = 'runtime.python.cpu.ctx_switch.involuntary'

GC_RUNTIME_METRICS = set([
GC_GEN0_COUNT,
GC_GEN1_COUNT,
GC_GEN2_COUNT,
])

PSUTIL_RUNTIME_METRICS = set([
THREAD_COUNT,
MEM_RSS,
CTX_SWITCH_VOLUNTARY,
CTX_SWITCH_INVOLUNTARY,
CPU_TIME_SYS,
CPU_TIME_USER,
CPU_PERCENT,
])

DEFAULT_RUNTIME_METRICS = GC_RUNTIME_METRICS | PSUTIL_RUNTIME_METRICS

RUNTIME_ID = 'runtime.python.runtime-id'
SERVICE = 'runtime.python.service'
LANG_INTERPRETER = 'runtime.python.lang_interpreter'
LANG_VERSION = 'runtime.python.lang_version'

TRACER_TAGS = set([
RUNTIME_ID,
SERVICE,
])

PLATFORM_TAGS = set([
LANG_INTERPRETER,
LANG_VERSION
])

DEFAULT_RUNTIME_TAGS = TRACER_TAGS
92 changes: 92 additions & 0 deletions ddtrace/internal/runtime/metric_collectors.py
@@ -0,0 +1,92 @@
import os

from .collector import ValueCollector
from .constants import (
GC_GEN0_COUNT,
GC_GEN1_COUNT,
GC_GEN2_COUNT,
THREAD_COUNT,
MEM_RSS,
CTX_SWITCH_VOLUNTARY,
CTX_SWITCH_INVOLUNTARY,
CPU_TIME_SYS,
CPU_TIME_USER,
CPU_PERCENT,
)


class RuntimeMetricCollector(ValueCollector):
value = []
periodic = True


class GCRuntimeMetricCollector(RuntimeMetricCollector):
""" Collector for garbage collection generational counts

More information at https://docs.python.org/3/library/gc.html
"""
required_modules = ['gc']

def collect_fn(self, keys):
gc = self.modules.get('gc')

counts = gc.get_count()
metrics = [
(GC_GEN0_COUNT, counts[0]),
(GC_GEN1_COUNT, counts[1]),
(GC_GEN2_COUNT, counts[2]),
]

return metrics


class PSUtilRuntimeMetricCollector(RuntimeMetricCollector):
"""Collector for psutil metrics.

Performs batched operations via proc.oneshot() to optimize the calls.
See https://psutil.readthedocs.io/en/latest/#psutil.Process.oneshot
for more information.
"""
required_modules = ['psutil']
stored_value = dict(
CPU_TIME_SYS_TOTAL=0,
CPU_TIME_USER_TOTAL=0,
CTX_SWITCH_VOLUNTARY_TOTAL=0,
CTX_SWITCH_INVOLUNTARY_TOTAL=0,
)

def _on_modules_load(self):
self.proc = self.modules['psutil'].Process(os.getpid())

def collect_fn(self, keys):
with self.proc.oneshot():
# only return time deltas
# TODO[tahir]: better abstraction for metrics based on last value
cpu_time_sys_total = self.proc.cpu_times().system
cpu_time_user_total = self.proc.cpu_times().user
cpu_time_sys = cpu_time_sys_total - self.stored_value['CPU_TIME_SYS_TOTAL']
cpu_time_user = cpu_time_user_total - self.stored_value['CPU_TIME_USER_TOTAL']

ctx_switch_voluntary_total = self.proc.num_ctx_switches().voluntary
ctx_switch_involuntary_total = self.proc.num_ctx_switches().involuntary
ctx_switch_voluntary = ctx_switch_voluntary_total - self.stored_value['CTX_SWITCH_VOLUNTARY_TOTAL']
ctx_switch_involuntary = ctx_switch_involuntary_total - self.stored_value['CTX_SWITCH_INVOLUNTARY_TOTAL']

self.stored_value = dict(
CPU_TIME_SYS_TOTAL=cpu_time_sys_total,
CPU_TIME_USER_TOTAL=cpu_time_user_total,
CTX_SWITCH_VOLUNTARY_TOTAL=ctx_switch_voluntary_total,
CTX_SWITCH_INVOLUNTARY_TOTAL=ctx_switch_involuntary_total,
)

metrics = [
(THREAD_COUNT, self.proc.num_threads()),
(MEM_RSS, self.proc.memory_info().rss),
(CTX_SWITCH_VOLUNTARY, ctx_switch_voluntary),
(CTX_SWITCH_INVOLUNTARY, ctx_switch_involuntary),
(CPU_TIME_SYS, cpu_time_sys),
(CPU_TIME_USER, cpu_time_user),
(CPU_PERCENT, self.proc.cpu_percent()),
]

return metrics
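A standalone sketch of the delta bookkeeping used above for cumulative counters (CPU time and context switches), shown outside the collector for clarity; it mirrors the logic in collect_fn and uses only documented psutil calls:

import os
import psutil

proc = psutil.Process(os.getpid())
_last_cpu_sys = 0.0

def cpu_time_sys_delta():
    """Return system CPU time consumed since the previous call (cf. CPU_TIME_SYS)."""
    global _last_cpu_sys
    with proc.oneshot():  # batch the underlying process reads
        total = proc.cpu_times().system
    delta = total - _last_cpu_sys
    _last_cpu_sys = total
    return delta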
107 changes: 107 additions & 0 deletions ddtrace/internal/runtime/runtime_metrics.py
@@ -0,0 +1,107 @@
import threading
import time
import itertools

from ..logger import get_logger
from .constants import (
DEFAULT_RUNTIME_METRICS,
DEFAULT_RUNTIME_TAGS,
)
from .metric_collectors import (
GCRuntimeMetricCollector,
PSUtilRuntimeMetricCollector,
)
from .tag_collectors import (
TracerTagCollector,
)

log = get_logger(__name__)


class RuntimeCollectorsIterable(object):
def __init__(self, enabled=None):
self._enabled = enabled or self.ENABLED
# Initialize the collectors.
self._collectors = [c() for c in self.COLLECTORS]

def __iter__(self):
collected = (
collector.collect(self._enabled)
for collector in self._collectors
)
return itertools.chain.from_iterable(collected)

def __repr__(self):
return '{}(enabled={})'.format(
self.__class__.__name__,
self._enabled,
)


class RuntimeTags(RuntimeCollectorsIterable):
ENABLED = DEFAULT_RUNTIME_TAGS
COLLECTORS = [
TracerTagCollector,
]


class RuntimeMetrics(RuntimeCollectorsIterable):
ENABLED = DEFAULT_RUNTIME_METRICS
COLLECTORS = [
GCRuntimeMetricCollector,
PSUtilRuntimeMetricCollector,
]


class RuntimeWorker(object):
""" Worker thread for collecting and writing runtime metrics to a DogStatsd
client.
"""

FLUSH_INTERVAL = 10

def __init__(self, statsd_client, flush_interval=None):
self._stay_alive = None
self._thread = None
self._flush_interval = flush_interval or self.FLUSH_INTERVAL
self._statsd_client = statsd_client
self._runtime_metrics = RuntimeMetrics()

def _target(self):
while self._stay_alive:
self.flush()
time.sleep(self._flush_interval)

def start(self):
if not self._thread:
log.debug("Starting {}".format(self))
self._stay_alive = True
self._thread = threading.Thread(target=self._target)
self._thread.setDaemon(True)
self._thread.start()

def stop(self):
if self._thread and self._stay_alive:
log.debug("Stopping {}".format(self))
self._stay_alive = False

def _write_metric(self, key, value):
log.debug('Writing metric {}:{}'.format(key, value))
self._statsd_client.gauge(key, value)

def flush(self):
if not self._statsd_client:
log.warn('Attempted flush with uninitialized or failed statsd client')
return

for key, value in self._runtime_metrics:
self._write_metric(key, value)

def reset(self):
self._runtime_metrics = RuntimeMetrics()

def __repr__(self):
return '{}(runtime_metrics={})'.format(
self.__class__.__name__,
self._runtime_metrics,
)
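For illustration, wiring the worker to a client: RuntimeWorker only needs an object exposing gauge(key, value) (a DogStatsd-style client). The stand-in client below is a made-up stub for the example, not the vendored dogstatsd module:

from ddtrace.internal.runtime import RuntimeWorker

class PrintingStatsd(object):
    """Stand-in for a DogStatsd client; implements only the gauge() call the worker uses."""
    def gauge(self, key, value):
        print('gauge {}:{}'.format(key, value))

worker = RuntimeWorker(PrintingStatsd(), flush_interval=5)
worker.start()  # daemon thread flushes RuntimeMetrics every 5 seconds
# ... application runs ...
worker.stop()   # the loop exits after the current sleep interval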