Enabling RuntimeMetrics often results in 'Socket send would block' error #7728

Closed

Alex-Wauters opened this issue Nov 22, 2023 · 6 comments

Alex-Wauters commented Nov 22, 2023

Summary of problem

When RuntimeMetrics is enabled, we often get the following error on our kubernetes pods:

Socket send would block: [Errno 11] Resource temporarily unavailable, dropping the packet

We can see the runtime metrics in our dashboard (dogstatsd is enabled on our agent and used for other metrics), but the frequent occurrence of these errors made us roll back the change in case it was contributing to resource exhaustion. We're aware the feature is in public beta, hence this report.
https://docs.datadoghq.com/tracing/metrics/runtime_metrics/python/

Which version of dd-trace-py are you using?

2.2.0

Runtime metrics enabled via

from ddtrace.runtime import RuntimeMetrics
RuntimeMetrics.enable()

How can we reproduce your problem?

It would require running several pods with runtime metrics enabled and monitoring them over time.
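
For reference, a minimal sketch of what such a pod would run, assuming a Datadog agent with dogstatsd reachable from the pod (e.g. via DD_AGENT_HOST / DD_DOGSTATSD_PORT); this is an illustrative script, not taken from our services:

import logging
import time

from ddtrace.runtime import RuntimeMetrics

# Surface debug-level logs so the "Socket send would block" message is visible.
logging.basicConfig(level=logging.DEBUG)

RuntimeMetrics.enable()

# Keep the process alive so runtime metrics keep flushing; the message shows up
# intermittently when the agent's dogstatsd socket cannot accept a payload.
while True:
    time.sleep(60)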

What is the result that you get?

Socket send would block: [Errno 11] Resource temporarily unavailable, dropping the packet

What is the result that you expected?

Preferably no errors

@P403n1x87 (Contributor)

@Alex-Wauters is gevent one of the dependencies in use by any chance?

@mabdinur self-assigned this Nov 22, 2023
@Alex-Wauters (Author) commented Nov 22, 2023

It doesn't appear to be included directly; it could be a transitive dependency. The error occurs on all 3 services we tried. The requirements.txt files:

svc 1 - flask

gunicorn==20.1.0
numpy==1.24.3
future
ddtrace==2.2.0
Flask==3.0.0
Flask-Cors==3.0.10
joblib==1.2.0
opencv-python==4.8.1.78
pandas~=1.2.4
pillow~=10.0.1
python-json-logger==2.0.7
rapidfuzz~=1.4.1
regex~=2022.6.2
requests==2.28.2
scikit-learn==0.24.2
scikit-learn-extra==0.2.0
tldextract~=3.1.0
APScheduler==3.9.1
toolz==0.12.0
pyahocorasick==1.4.4
pytest==7.2.0
pytest-mock==3.10.0
pytest-xdist==3.0.2
deepdiff==6.2.1
approvaltests==8.2.0
dateparser~=1.1.8
-e SHARED_MODULE

shared module from all svc's

[
    "docutils",
    "setuptools",
    "setuptools-scm>=7.1.0",
    "Pillow",
    "azure-appconfiguration>=1.4.0",
    "PyPDF2",
    "requests>=2.28.2",
    "aiohttp>=3.8.4",
    "rapidfuzz",
    "regex",
    "shapely>=2.0.0",
    "toolz>=0.12.0",
    "cachetools>=5.0.0",
    "pydantic>=2.0.3",
    "deprecation>=2.1.0",
    "tenacity>=8.2.3"
]

svc 2 - fastapi based

requests~=2.28.2
regex==2020.11.13
ddtrace~=2.2.0
gunicorn==20.1.0
unidecode==1.2.0
psutil~=5.9.5
itsdangerous==2.0.1
Jinja2==3.0.3
pytest==7.2.0
rapidfuzz==2.13.3
python-json-logger==2.0.7
pydantic~=2.4.2
setuptools~=65.5.1
asyncio~=3.4.3
PyYAML~=6.0.1
fastapi~=0.103.2
uvicorn~=0.23.2
tenacity~=8.2.3
aiohttp~=3.8.6

@ZStriker19 (Contributor) commented Nov 27, 2023

Hi @Alex-Wauters, do you know when you started to see this issue? Were you using an earlier tracer version before and not seeing those logs? Two weeks ago we updated the DogStatsD code that we vendor to the latest release (0.47), which correlates with the 2.2.0 release that you're on. PR here. I'll look into this further, but that info would definitely be helpful.

@ZStriker19 self-assigned this Nov 27, 2023
@Alex-Wauters (Author)

We only just enabled Python runtime metrics with that version, we haven't used it before so I don't have any stats for earlier versions.

@ZStriker19 (Contributor)

Hi @Alex-Wauters, could you tell me how frequently you're seeing that message? If it's less than multiple times per second, it shouldn't be an issue. It happens when we try to send a payload but the agent is too busy to accept it. A non-blocking socket is a least-common-denominator way of sending metrics that works in all scenarios without negatively impacting the user application.
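
To illustrate the mechanism (a minimal sketch, not the vendored dogstatsd code): with a non-blocking socket, send() raises EAGAIN (errno 11) when the kernel buffer or the receiving agent cannot take the payload right now, and the client drops the packet instead of blocking the application. The metric payload below is hypothetical.

import errno
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)  # non-blocking: send() never stalls the application

payload = b"runtime.python.gc.gen1.count:42|g"  # hypothetical dogstatsd payload

try:
    sock.sendto(payload, ("localhost", 8125))
except OSError as exc:
    if exc.errno == errno.EAGAIN:
        # This is the "[Errno 11] Resource temporarily unavailable" case:
        # log "Socket send would block ... dropping the packet" and move on.
        pass
    else:
        raise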

That message should be a debug log; are you running in debug mode? If so, and you want to avoid seeing this log, you can grab the logger and change its level, e.g.:

import logging

# Raise the level above DEBUG so the "Socket send would block" message is suppressed.
logger = logging.getLogger("datadog.dogstatsd")
logger.setLevel(logging.INFO)
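
An alternative (a sketch that assumes the message text stays stable): attach a logging.Filter that drops only these records, so other DogStatsD logs still come through:

import logging

class DropWouldBlock(logging.Filter):
    # Drop only the "Socket send would block" records; let everything else through.
    def filter(self, record):
        return "Socket send would block" not in record.getMessage()

logging.getLogger("datadog.dogstatsd").addFilter(DropWouldBlock())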

Let me know if you have any questions about this!

@ZStriker19 (Contributor)

Closing this, but please re-open if the above does not help.
