Monitor and server tasks are never cancelled on shutdown, causing non-zero exit after training completes #668

@Fiona-Waters

Description

Two fire-and-forget asyncio tasks are never cancelled during backend shutdown:

  1. _monitor_openai_server in LocalBackend (src/art/local/backend.py, line ~491) — created via asyncio.create_task() but the reference is never stored. The monitor polls vLLM health every 30s. After vLLM shuts down, it hits 3 consecutive ConnectionRefusedError failures and re-raises, producing a "Task exception was never retrieved" error and a non-zero process exit.
  2. openai_server_task in UnslothService (src/art/unsloth/service.py, line ~516) — in shared mode, the returned asyncio task (the uvicorn server) is discarded. It continues running as an orphaned task after shutdown.
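One possible shape for a fix, sketched with plain asyncio and no ART code: keep a reference to every background task and cancel them all during shutdown. BackgroundTasks, spawn, cancel_all, and fake_monitor are hypothetical names used only for illustration; they are not part of LocalBackend or UnslothService, and the real fix may look different.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class BackgroundTasks:
    """Hypothetical helper: tracks spawned tasks so shutdown can cancel them."""

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> asyncio.Task:
        # Store the reference instead of discarding the create_task() result.
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._on_done)
        return task

    def _on_done(self, task: asyncio.Task) -> None:
        self._tasks.discard(task)
        # Calling task.exception() marks the exception as retrieved, so
        # asyncio does not later report it as "never retrieved".
        if not task.cancelled() and task.exception() is not None:
            logger.warning("background task failed: %r", task.exception())

    async def cancel_all(self) -> None:
        tasks = list(self._tasks)
        for task in tasks:
            task.cancel()
        # return_exceptions=True collects CancelledError instead of raising it.
        await asyncio.gather(*tasks, return_exceptions=True)


async def fake_monitor(interval: float) -> None:
    # Stand-in for a health-check loop; it never finishes on its own.
    while True:
        await asyncio.sleep(interval)


async def demo() -> bool:
    tasks = BackgroundTasks()
    monitor = tasks.spawn(fake_monitor(0.01))
    await asyncio.sleep(0.03)   # "training" runs
    await tasks.cancel_all()    # shutdown path
    return monitor.cancelled()
```

In this sketch the shutdown path awaits the cancelled tasks, so nothing is left running or unretrieved when the event loop closes.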

Impact

In containerized environments (Kubeflow TrainJob, KubeRay), the non-zero exit causes the pod to fail/restart even though training completed successfully — adapter checkpoints, metrics, and results are all persisted to disk before the error occurs.

Steps to reproduce

  1. Run a LocalBackend training loop in shared mode (in_process=True) inside a container
  2. Let training complete normally
  3. Observe the connection error (aiohttp ClientConnectorError in the output below) reported by asyncio as an unretrieved task exception
  4. Process exits non-zero
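The mechanism can also be illustrated without ART, vLLM, or a container. The sketch below is an assumption-laden stand-in, not the actual monitor: monitor, server_is_down, and the zero-argument check callback are hypothetical, and the three-failure threshold mirrors the behaviour described above. A custom loop exception handler captures the message that asyncio would otherwise log.

```python
import asyncio
import gc


async def monitor(check, max_failures: int = 3, interval: float = 0.001) -> None:
    # Hypothetical stand-in for a health monitor: re-raises after
    # max_failures consecutive connection failures.
    failures = 0
    while True:
        try:
            await check()
            failures = 0
        except ConnectionRefusedError:
            failures += 1
            if failures >= max_failures:
                raise
        await asyncio.sleep(interval)


async def server_is_down() -> None:
    # Simulates the health endpoint after the server has shut down.
    raise ConnectionRefusedError("server stopped")


async def main() -> list[str]:
    messages: list[str] = []
    loop = asyncio.get_running_loop()
    # Capture what asyncio would normally write to the log.
    loop.set_exception_handler(lambda _loop, ctx: messages.append(ctx["message"]))

    # Fire-and-forget: the task reference is discarded, as in the report.
    asyncio.create_task(monitor(server_is_down))

    await asyncio.sleep(0.05)  # give the monitor time to fail three times
    gc.collect()               # make sure the finished task is finalized
    return messages


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because nothing holds or awaits the task, its exception is only noticed when the task object is finalized, which is when asyncio reports it.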

Error output

  Traceback (most recent call last):
    File "/usr/lib64/python3.12/asyncio/tasks.py", line 314, in __step_run_and_handle_result
      result = coro.send(None)
               ^^^^^^^^^^^^^^^
    File "/opt/app-root/lib64/python3.12/site-packages/art/local/backend.py", line 436, in _monitor_openai_server
      async with session.get(
                 ^^^^^^^^^^^^
    File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/client.py", line 1521, in __aenter__
      self._resp: _RetType = await self._coro
                             ^^^^^^^^^^^^^^^^
    File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/client.py", line 788, in _request
      resp = await handler(req)
             ^^^^^^^^^^^^^^^^^^
    File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/client.py", line 742, in _connect_and_send_request
      conn = await self._connector.connect(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 672, in connect
      proto = await self._create_connection(req, traces, timeout)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 1251, in _create_connection
      _, proto = await self._create_direct_connection(req, traces, timeout)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 1623, in _create_direct_connection
      raise last_exc
    File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 1592, in _create_direct_connection
      transp, proto = await self._wrap_create_connection(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/opt/app-root/lib64/python3.12/site-packages/aiohttp/connector.py", line 1333, in _wrap_create_connection
      raise client_error(req.connection_key, exc) from exc
  aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 0.0.0.0:54993 ssl:default [Connect call failed ('0.0.0.0', 54993)]

Environment

  • openpipe-art 0.5.17
  • vLLM 0.18.x (V1 engine only)
