Enabling RuntimeMetrics often results in 'Socket send would block' error #7728

Closed

Alex-Wauters opened this issue Nov 22, 2023 · 6 comments

Alex-Wauters commented Nov 22, 2023

Summary of problem

When RuntimeMetrics is enabled, we often get the following error on our kubernetes pods:

Socket send would block: [Errno 11] Resource temporarily unavailable, dropping the packet

We can see the runtime metrics in our dashboard (dogstatsd is enabled on our agent and used for other metrics), but the frequent occurrence of these errors made us roll back the change in case it was contributing to resource exhaustion. We're aware the feature is in public beta, hence this report.
https://docs.datadoghq.com/tracing/metrics/runtime_metrics/python/

Which version of dd-trace-py are you using?

2.2.0

Runtime metrics enabled via

from ddtrace.runtime import RuntimeMetrics
RuntimeMetrics.enable()

How can we reproduce your problem?

It would require running several pods with runtime metrics enabled and monitoring them over time.
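
For reference, a minimal sketch of what such a pod would run, assuming a Datadog agent with dogstatsd reachable from the pod (e.g. via DD_AGENT_HOST / DD_DOGSTATSD_PORT); this is an illustrative script, not taken from our services:

import logging
import time

from ddtrace.runtime import RuntimeMetrics

# Surface debug-level logs so the "Socket send would block" message is visible.
logging.basicConfig(level=logging.DEBUG)

RuntimeMetrics.enable()

# Keep the process alive so runtime metrics keep flushing; the message shows up
# intermittently when the agent's dogstatsd socket cannot accept a payload.
while True:
    time.sleep(60)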

What is the result that you get?

Socket send would block: [Errno 11] Resource temporarily unavailable, dropping the packet

What is the result that you expected?

Preferably no errors

@P403n1x87 (Contributor)

@Alex-Wauters is gevent one of the dependencies in use by any chance?

@mabdinur self-assigned this Nov 22, 2023
@Alex-Wauters (Author) commented Nov 22, 2023

It doesn't appear to be included directly; it could be a transitive dependency. The error occurs on all 3 services we tried. The requirements.txt files:

svc 1 - flask

gunicorn==20.1.0
numpy==1.24.3
future
ddtrace==2.2.0
Flask==3.0.0
Flask-Cors==3.0.10
joblib==1.2.0
opencv-python==4.8.1.78
pandas~=1.2.4
pillow~=10.0.1
python-json-logger==2.0.7
rapidfuzz~=1.4.1
regex~=2022.6.2
requests==2.28.2
scikit-learn==0.24.2
scikit-learn-extra==0.2.0
tldextract~=3.1.0
APScheduler==3.9.1
toolz==0.12.0
pyahocorasick==1.4.4
pytest==7.2.0
pytest-mock==3.10.0
pytest-xdist==3.0.2
deepdiff==6.2.1
approvaltests==8.2.0
dateparser~=1.1.8
-e SHARED_MODULE

shared module from all svc's

[
    "docutils",
    "setuptools",
    "setuptools-scm>=7.1.0",
    "Pillow",
    "azure-appconfiguration>=1.4.0",
    "PyPDF2",
    "requests>=2.28.2",
    "aiohttp>=3.8.4",
    "rapidfuzz",
    "regex",
    "shapely>=2.0.0",
    "toolz>=0.12.0",
    "cachetools>=5.0.0",
    "pydantic>=2.0.3",
    "deprecation>=2.1.0",
    "tenacity>=8.2.3"
]

svc 2 - fastapi based

requests~=2.28.2
regex==2020.11.13
ddtrace~=2.2.0
gunicorn==20.1.0
unidecode==1.2.0
psutil~=5.9.5
itsdangerous==2.0.1
Jinja2==3.0.3
pytest==7.2.0
rapidfuzz==2.13.3
python-json-logger==2.0.7
pydantic~=2.4.2
setuptools~=65.5.1
asyncio~=3.4.3
PyYAML~=6.0.1
fastapi~=0.103.2
uvicorn~=0.23.2
tenacity~=8.2.3
aiohttp~=3.8.6

@ZStriker19 (Contributor) commented Nov 27, 2023

Hi @Alex-Wauters, do you know when you started to see this issue? Were you using an earlier tracer version before and not seeing those logs? Two weeks ago we updated the DogStatsD code that we vendor to the latest release (0.47), which correlates with the 2.2.0 release that you're on. PR here. I'll look into this further, but that info would definitely be helpful.

@ZStriker19 self-assigned this Nov 27, 2023
@Alex-Wauters (Author)

We only just enabled Python runtime metrics with that version, we haven't used it before so I don't have any stats for earlier versions.

@ZStriker19 (Contributor)

Hi @Alex-Wauters, could you tell me how frequently you're seeing that message? If it's less than multiple times per second, it shouldn't be an issue. It happens when we try to send a payload but the agent is too busy to accept it. A non-blocking socket is a least-common-denominator way of sending metrics that works in all scenarios without negatively impacting the user application.
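
To illustrate the mechanism (a minimal sketch, not the vendored dogstatsd code): with a non-blocking socket, send() raises EAGAIN (errno 11) when the kernel buffer or the receiving agent cannot take the payload right now, and the client drops the packet instead of blocking the application. The metric payload below is hypothetical.

import errno
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setblocking(False)  # non-blocking: send() never stalls the application

payload = b"runtime.python.gc.gen1.count:42|g"  # hypothetical dogstatsd payload

try:
    sock.sendto(payload, ("localhost", 8125))
except OSError as exc:
    if exc.errno == errno.EAGAIN:
        # This is the "[Errno 11] Resource temporarily unavailable" case:
        # log "Socket send would block ... dropping the packet" and move on.
        pass
    else:
        raise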

That message should be a debug log; are you running in debug mode? If so, and you want to avoid seeing this log, you can grab the logger and change its level, e.g.:

import logging

# Raise the level above DEBUG so the "Socket send would block" message is suppressed.
logger = logging.getLogger("datadog.dogstatsd")
logger.setLevel(logging.INFO)
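
An alternative (a sketch that assumes the message text stays stable): attach a logging.Filter that drops only these records, so other DogStatsD logs still come through:

import logging

class DropWouldBlock(logging.Filter):
    # Drop only the "Socket send would block" records; let everything else through.
    def filter(self, record):
        return "Socket send would block" not in record.getMessage()

logging.getLogger("datadog.dogstatsd").addFilter(DropWouldBlock())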

Let me know if you have any questions about this!

@ZStriker19 (Contributor)

Closing this, but please re-open if the above does not help.
