Traceback (most recent call last):
  File "/usr/lib64/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
    result = coro.send(None)
             ^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/art/local/backend.py", line 436, in _monitor_openai_server
    async with session.get(
               ^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/client.py", line 1521, in __aenter__
    self._resp: _RetType = await self._coro
                           ^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/client.py", line 788, in _request
    resp = await handler(req)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/client.py", line 742, in _connect_and_send_request
    conn = await self._connector.connect(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 672, in connect
    proto = await self._create_connection(req, traces, timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 1251, in _create_connection
    _, proto = await self._create_direct_connection(req, traces, timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 1623, in _create_direct_connection
    raise last_exc
  File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 1592, in _create_direct_connection
    transp, proto = await self._wrap_create_connection(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 1333, in _wrap_create_connection
    raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 0.0.0.0:54993 ssl:default [Connect call failed ('0.0.0.0', 54993)]
Description
Two fire-and-forget asyncio tasks are never cancelled during backend shutdown:
Impact
In containerized environments (Kubeflow TrainJob, KubeRay), the non-zero exit causes the pod to fail or restart even though training completed successfully. Adapter checkpoints, metrics, and results are all persisted to disk before the error occurs.
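For illustration only, here is a minimal sketch of the general pattern that avoids leaving background tasks running past shutdown: keep a reference to every task that is spawned, then cancel and await them when the backend closes. This is not ART's actual code. The `Backend` class, the `_spawn`, `_monitor`, and `close` methods, and the `polls` counter are all hypothetical names, and the monitor loop merely stands in for a health-check poller such as `_monitor_openai_server`.

```python
import asyncio
from collections.abc import Coroutine
from typing import Any


class Backend:
    """Hypothetical stand-in for a backend that owns background tasks."""

    def __init__(self) -> None:
        # Holding references lets close() find and cancel every task.
        self._background_tasks: set[asyncio.Task[Any]] = set()
        self.polls = 0

    def _spawn(self, coro: Coroutine[Any, Any, Any]) -> asyncio.Task[Any]:
        task = asyncio.create_task(coro)
        self._background_tasks.add(task)
        # Drop the reference automatically once the task finishes.
        task.add_done_callback(self._background_tasks.discard)
        return task

    async def _monitor(self, interval: float) -> None:
        # Placeholder for a periodic health check against a server.
        while True:
            self.polls += 1
            await asyncio.sleep(interval)

    async def start(self) -> None:
        self._spawn(self._monitor(0.01))

    async def close(self) -> None:
        tasks = list(self._background_tasks)
        for task in tasks:
            task.cancel()
        # return_exceptions=True collects each CancelledError instead of
        # re-raising it, so shutdown itself does not fail.
        await asyncio.gather(*tasks, return_exceptions=True)


async def main() -> Backend:
    backend = Backend()
    await backend.start()
    await asyncio.sleep(0.05)  # simulate the training run
    await backend.close()      # stop the monitor before the loop exits
    return backend


if __name__ == "__main__":
    asyncio.run(main())
```

With this arrangement the monitor stops before the server it polls goes away, so it has no chance to raise a connection error during teardown. Whether the same approach fits the real backend depends on how its tasks are created, which this sketch does not assume.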
Steps to reproduce
Error output
Environment