Description
Long story short
Today on aws-prod, the director-v2 was considerably slowed down because it kept raising an IndexError when task.job_id could not be parsed:
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/background_task.py", line 21, in scheduler_task
    await scheduler.schedule_all_pipelines()
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 119, in schedule_all_pipelines
    await asyncio.gather(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 313, in _schedule_pipeline
    await self._update_states_from_comp_backend(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 265, in _update_states_from_comp_backend
    await self._process_completed_tasks(user_id, cluster_id, tasks_completed)
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/dask_scheduler.py", line 130, in _process_completed_tasks
    await asyncio.gather(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/dask_scheduler.py", line 155, in _process_task_result
    ) = parse_dask_job_id(task.job_id)
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/utils/dask.py", line 77, in parse_dask_job_id
    parts[1],
IndexError: list index out of range
Manually setting the comp_runs result to FAILED in the pg-db, and then restarting the director-v2 and the dask-scheduler, fixed the problem.
Expected behaviour
This error should be handled more gracefully without blocking the director-v2, for example by changing the state of the task to FAILED.
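To make the suggestion concrete, here is a rough sketch of what graceful handling might look like. It is not the director-v2 code: `Task`, `MalformedJobIdError`, `parse_job_id_safely`, the `":"` separator and the expected part count are all hypothetical stand-ins, since the real job_id format is not shown in this report. The idea is to validate the id before indexing into it, and to mark only the offending task as FAILED instead of letting one exception abort the whole `asyncio.gather` call.

```python
import asyncio
from dataclasses import dataclass

# Assumptions for illustration only: the real separator and part count
# used by parse_dask_job_id are not stated in this issue.
SEPARATOR = ":"
EXPECTED_PARTS = 6


class MalformedJobIdError(ValueError):
    """Raised when a job id does not have the expected shape (hypothetical)."""


def parse_job_id_safely(job_id: str) -> list[str]:
    """Split the job id and validate it before any part is indexed."""
    parts = job_id.split(SEPARATOR)
    if len(parts) != EXPECTED_PARTS:
        raise MalformedJobIdError(
            f"expected {EXPECTED_PARTS} parts, got {len(parts)}: {job_id!r}"
        )
    return parts


@dataclass
class Task:
    """Hypothetical stand-in for a computational task row."""
    job_id: str
    state: str = "RUNNING"


async def process_task_result(task: Task) -> None:
    """Handle one completed task; an unparsable id fails only this task."""
    try:
        parse_job_id_safely(task.job_id)
    except MalformedJobIdError:
        task.state = "FAILED"
        return
    task.state = "SUCCESS"


async def process_completed_tasks(tasks: list[Task]) -> None:
    """Process all tasks; unexpected errors are collected, not propagated."""
    results = await asyncio.gather(
        *(process_task_result(t) for t in tasks),
        return_exceptions=True,
    )
    for task, result in zip(tasks, results):
        if isinstance(result, Exception):
            # Safety net for errors other than a malformed job id.
            task.state = "FAILED"
```

With something along these lines, a single unparsable job_id would end up in FAILED on the first scheduling pass rather than raising again on every pass.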
Actual behaviour
The director-v2 is slowed down and overwhelmed.
Steps to reproduce
UNCLEAR
Your environment
aws-prod