Director-v2 fails to parse dask-job (IndexError: list index out of range) #3000

Closed
@mrnicegyu11

Long story short

Today on aws-prod, the director-v2 slowed down considerably because it repeatedly threw an IndexError: the task.job_id could not be parsed.

  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/background_task.py", line 21, in scheduler_task
    await scheduler.schedule_all_pipelines()
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 119, in schedule_all_pipelines
    await asyncio.gather(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 313, in _schedule_pipeline
    await self._update_states_from_comp_backend(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 265, in _update_states_from_comp_backend
    await self._process_completed_tasks(user_id, cluster_id, tasks_completed)
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/dask_scheduler.py", line 130, in _process_completed_tasks
    await asyncio.gather(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/dask_scheduler.py", line 155, in _process_task_result
    ) = parse_dask_job_id(task.job_id)
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/utils/dask.py", line 77, in parse_dask_job_id
    parts[1],
IndexError: list index out of range

Manually setting the comp_runs result to FAILED in the pg-db and then restarting the director-v2 and the dask-scheduler fixed the problem.

Expected behaviour

This error should be handled more gracefully without blocking the director-v2, for example by changing the state of the task to FAILED.
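A minimal sketch of what such graceful handling could look like, not the actual director-v2 code. The job id format is not shown in this issue, so the separator, the expected number of parts, and the names `parse_job_id_safely`, `JobIdParseError`, `process_task_result` and `process_completed_tasks` are all hypothetical. The idea is to validate the id before indexing into it and to turn a malformed id into a FAILED state for that one task instead of letting the exception propagate out of `asyncio.gather`:

```python
import asyncio
import logging
from typing import List

logger = logging.getLogger(__name__)

# Assumption: the job id is a separator-delimited string with a fixed number
# of parts. Both values are illustrative placeholders, not the real format.
SEPARATOR = ":"
EXPECTED_PARTS = 3


class JobIdParseError(ValueError):
    """Raised when a job id does not have the expected shape."""


def parse_job_id_safely(job_id: str) -> List[str]:
    """Split a job id, checking its shape before any part is indexed."""
    parts = job_id.split(SEPARATOR)
    if len(parts) != EXPECTED_PARTS or not all(parts):
        raise JobIdParseError(f"malformed job id: {job_id!r}")
    return parts


async def process_task_result(job_id: str) -> str:
    """Return the state for one completed task (hypothetical state names)."""
    try:
        parse_job_id_safely(job_id)
    except JobIdParseError:
        # Mark only this task as failed; the other tasks keep being processed.
        logger.warning("Cannot parse job id %r, marking task as FAILED", job_id)
        return "FAILED"
    return "SUCCESS"


async def process_completed_tasks(job_ids: List[str]) -> list:
    # return_exceptions=True keeps one unexpected error from aborting the batch.
    return await asyncio.gather(
        *(process_task_result(job_id) for job_id in job_ids),
        return_exceptions=True,
    )
```

With this shape, a single unparsable job id would no longer make every scheduling round raise; whether FAILED is the right target state is up to the maintainers.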

Actual behaviour

The director-v2 is slowed down and overwhelmed by the repeated exceptions.

Steps to reproduce

UNCLEAR

Your environment

aws-prod

Labels

bug: buggy, it does not work as expected
