Long story short
Today on aws-prod, the director-v2 slowed down considerably because it repeatedly raised an IndexError: the task.job_id could not be parsed:
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/background_task.py", line 21, in scheduler_task
    await scheduler.schedule_all_pipelines()
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 119, in schedule_all_pipelines
    await asyncio.gather(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 313, in _schedule_pipeline
    await self._update_states_from_comp_backend(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 265, in _update_states_from_comp_backend
    await self._process_completed_tasks(user_id, cluster_id, tasks_completed)
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/dask_scheduler.py", line 130, in _process_completed_tasks
    await asyncio.gather(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/dask_scheduler.py", line 155, in _process_task_result
    ) = parse_dask_job_id(task.job_id)
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/utils/dask.py", line 77, in parse_dask_job_id
    parts[1],
IndexError: list index out of range
Manually setting the comp_runs result to FAILED in the pg-db and then restarting the director-v2 and the dask-scheduler fixed the problem.
Expected behaviour
This error should be handled more gracefully without blocking the director-v2, for example by changing the state of the task to FAILED.
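One possible shape for this, as a minimal sketch rather than the actual director-v2 code: validate the job id before indexing into its parts, mark only the affected task as FAILED, and gather the per-task coroutines with `return_exceptions=True` so that one bad task cannot abort the whole scheduling round. The `Task` class, the `:` delimiter, the four-field layout, and the function names below are all assumptions made for illustration; the real `parse_dask_job_id` format is not shown in this report.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

# Assumption: a job id has 4 ":"-separated fields. The real layout may differ.
_EXPECTED_PARTS = 4


@dataclass
class Task:
    """Hypothetical stand-in for a computational task row."""

    job_id: str
    state: str = "RUNNING"


def parse_job_id_safely(job_id: str) -> Optional[list[str]]:
    """Return the job-id fields, or None if the id is malformed."""
    parts = job_id.split(":")
    if len(parts) != _EXPECTED_PARTS or not all(parts):
        return None
    return parts


async def process_task_result(task: Task) -> None:
    parts = parse_job_id_safely(task.job_id)
    if parts is None:
        # Malformed id: fail this task only, instead of raising IndexError.
        task.state = "FAILED"
        return
    # Placeholder for the real result handling.
    task.state = "SUCCESS"


async def process_completed_tasks(tasks: list[Task]) -> list[Exception]:
    # return_exceptions=True keeps one failing coroutine from
    # propagating out of gather and interrupting the others.
    results = await asyncio.gather(
        *(process_task_result(t) for t in tasks), return_exceptions=True
    )
    # Hand unexpected errors back to the caller for logging.
    return [r for r in results if isinstance(r, Exception)]
```

With this shape, a task whose job id cannot be parsed ends up in the FAILED state on the first pass, so the scheduler does not hit the same exception again on every subsequent iteration.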
Actual behaviour
The director-v2 is slowed down and overwhelmed because the IndexError is raised again on every scheduling iteration.
Steps to reproduce
UNCLEAR
Your environment
aws-prod