"Closing the transport connection timed out." caused by race condition #1975
Comments
Thanks for the report! This isn't very high priority since it only happens on shutdown and emits an error. But we should definitely get this fixed.
Hello, I encountered the same problem when trying to monitor a Python script in Elastic APM.
In ES_APM_CONFIGURATION, I have: SERVICE_NAME, SECRET_TOKEN, SERVER_URL, SERVICE_VERSION, ENABLED, ENVIRONMENT. I tried to add the function get_import_string as suggested by @robin-mader-bis, but I got an error. I just received the message "Start job do_something" and nothing else. I don't know how to resolve this problem. Thanks,
I had the same issue at first, but make sure the Transport class you are inheriting from is elasticapm.transport.http.Transport (my first attempt was from elasticapm.transport.base.Transport and got the same error as you)
Modified the original poster message to include the import
I'm having this same issue as well. I've implemented the workaround listed in the description, but now I'm seeing some warnings coming from urllib
+1
@robin-mader-bis It has been suggested to me to set the pools to non-blocking only at shutdown. That could be something like the following untested patch.
OK, I spent some time on this today. The previous snippet will not work because the urllib3 connection pool's weakref.finalize callback is called before our atexit callback, so the pool would already be gone. Instead, we can force the creation of a new one as below:
@pquentin what do you think?
@robin-mader-bis would be great if you can take a look at #2085
Describe the bug: Occasionally, when using the `elasticapm.Client` (without a framework), during process shutdown (in the `atexit` handler), the transport thread will block forever while trying to send data to the APM server and subsequently be killed by the thread manager after the configured timeout is reached. This causes "Closing the transport connection timed out." to be printed to the command line and the messages remaining in the buffer to be lost.

This seems to be caused by a race condition involving the `atexit` handler of the `elasticapm.Client` and the `weakref.finalize` of `urllib3.connectionpool.HTTPConnectionPool` (which uses an `atexit` handler under the hood), which calls `_close_pool_connections`. A timeline causing this bug looks like this:

1. `atexit` handlers are called.
2. `_close_pool_connections` is called while all connections are in the pool. All existing connections are disposed.
3. The `elasticapm.Client` `atexit` handler is called, sending the "close" event to the transport thread.
4. `urlopen` blocks the transport thread forever while waiting to get a connection from the connection pool. Since the poolmanager uses `block=True` and no pool timeout is configured, the only way to get a connection from the pool is for someone else to put a connection into it.

The reason why this does not occur consistently is that `_close_pool_connections` will not clean up connections which are currently in use (e.g. connections being used in another thread). If a request is in progress when `_close_pool_connections` is called, the associated connection "survives" the cleanup, is added back to the pool afterwards, and can be reused by the transport thread (which may be a bug/unintended behavior of urllib3, since it claims `HTTPConnectionPool` is thread safe).

To Reproduce
The following minimal example reproduces the issue:
As is the case with race conditions, you might have to fiddle with the sleep timing a little bit. 10 seconds works quite reliably in my environment, but you may need a few seconds more or less, depending on your environment.
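The blocking behaviour from the timeline can also be modelled with the standard library alone. The sketch below is a toy stand-in, not urllib3 or elastic-apm code: a `queue.LifoQueue` plays the connection pool, `drain` plays the cleanup step, and a short timeout replaces the indefinite wait so that the script terminates. The names `drain` and `request_connection` are hypothetical.

```python
import queue
import threading


def drain(pool):
    """Discard every idle entry, like a cleanup step that runs at exit."""
    try:
        while True:
            pool.get(block=False)
    except queue.Empty:
        pass


def request_connection(pool, timeout, outcome):
    """Stand-in for the transport thread asking the pool for a connection."""
    try:
        outcome.append(pool.get(block=True, timeout=timeout))
    except queue.Empty:
        # With timeout=None this call would never return.
        outcome.append("timed out")


# Case 1: every connection is idle when the cleanup runs.
idle_pool = queue.LifoQueue(maxsize=1)
idle_pool.put("conn-1")
drain(idle_pool)

hang_outcome = []
worker = threading.Thread(
    target=request_connection, args=(idle_pool, 0.2, hang_outcome)
)
worker.start()
worker.join()

# Case 2: a connection is checked out during the cleanup and returned later.
busy_pool = queue.LifoQueue(maxsize=1)
busy_pool.put("conn-1")
in_use = busy_pool.get()  # checked out by a request in progress
drain(busy_pool)          # cleanup finds nothing to discard
busy_pool.put(in_use)     # the request finishes and returns the connection

survive_outcome = []
request_connection(busy_pool, 0.2, survive_outcome)

print(hang_outcome, survive_outcome)
```

In this model the first case only times out because a timeout was passed in; without one, the worker would wait indefinitely. The second case shows why the problem depends on timing: a connection that is in use during the cleanup is still available afterwards.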
Environment
urllib3==2.2.1
elastic-apm==6.20.0
Additional context
A workaround for my use case is to use a custom Transport class which uses a non-blocking pool. I don't know enough about the `elastic-apm` code base to know whether or not this causes issues in other parts of the package, but it seems to resolve the issue for me without causing any other major issues.
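
To illustrate why a non-blocking pool avoids the hang, here is a minimal sketch using only the standard library. `ToyPool` is a hypothetical toy class, not the urllib3 pool and not the custom Transport class itself; it assumes that a non-blocking pool falls back to creating a new connection when no idle one is available, while a blocking pool waits for one to be returned.

```python
import queue


class ToyPool:
    """Hypothetical toy pool with a `block` switch; not urllib3 code."""

    def __init__(self, block):
        self.block = block
        self.created = 0
        self._idle = queue.LifoQueue(maxsize=1)
        self._idle.put("conn-0")

    def drain(self):
        """Discard all idle connections, like a cleanup step at exit."""
        try:
            while True:
                self._idle.get(block=False)
        except queue.Empty:
            pass

    def get_conn(self, timeout=None):
        """Return an idle connection, or a new one if the pool is non-blocking."""
        try:
            return self._idle.get(block=self.block, timeout=timeout)
        except queue.Empty:
            if self.block:
                # Blocking pool: only a returned connection can satisfy the caller.
                raise
            self.created += 1
            return f"conn-{self.created}"


blocking = ToyPool(block=True)
blocking.drain()
try:
    # A timeout is used here so the sketch terminates; with timeout=None
    # this call would wait forever on the drained pool.
    blocking.get_conn(timeout=0.1)
    blocking_result = "got a connection"
except queue.Empty:
    blocking_result = "no connection available"

non_blocking = ToyPool(block=False)
non_blocking.drain()
non_blocking_result = non_blocking.get_conn()

print(blocking_result, non_blocking_result)
```

The trade-off in this model is that a non-blocking pool can create more connections than its nominal size, which is why the effect on the rest of the package is worth checking.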