Director-v2 fails to parse dask-job (IndexError: list index out of range) #3000

Closed
mrnicegyu11 opened this issue Apr 25, 2022 · 2 comments
Labels
bug buggy, it does not work as expected

Comments

@mrnicegyu11
Member

Long story short

Today on aws-prod, the director-v2 was considerably slowed down because it repeatedly raised an IndexError: the task.job_id could not be parsed.

  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/background_task.py", line 21, in scheduler_task
    await scheduler.schedule_all_pipelines()
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 119, in schedule_all_pipelines
    await asyncio.gather(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 313, in _schedule_pipeline
    await self._update_states_from_comp_backend(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/base_scheduler.py", line 265, in _update_states_from_comp_backend
    await self._process_completed_tasks(user_id, cluster_id, tasks_completed)
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/dask_scheduler.py", line 130, in _process_completed_tasks
    await asyncio.gather(
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/modules/comp_scheduler/dask_scheduler.py", line 155, in _process_task_result
    ) = parse_dask_job_id(task.job_id)
  File "/home/scu/.venv/lib/python3.9/site-packages/simcore_service_director_v2/utils/dask.py", line 77, in parse_dask_job_id
    parts[1],
IndexError: list index out of range

Manually setting the comp_runs result to FAILED in the pg-db, and then restarting the director-v2 and the dask-scheduler, fixed the problem.

Expected behaviour

This error should be handled more gracefully without blocking the director-v2, for example by changing the state of the task to FAILED.
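
A minimal sketch of what such handling could look like, not the actual director-v2 code. The job-id layout, the `Task` record, the state names other than FAILED, and every function name below are assumptions made for illustration; the real format used by `parse_dask_job_id` is not shown in this issue. The idea is to validate the id before indexing into its parts, mark the task as FAILED when it is malformed, and use `return_exceptions=True` so that one bad task cannot abort the whole `asyncio.gather` call.

```python
import asyncio
from dataclasses import dataclass

# Hypothetical job-id layout, assumed for illustration only:
# "<service_key>:<service_version>:<user_id>:<project_id>:<node_id>"
_SEPARATOR = ":"
_EXPECTED_PARTS = 5


class MalformedJobIdError(ValueError):
    """Raised when a job id does not match the assumed layout."""


@dataclass
class Task:
    """Hypothetical stand-in for the scheduler's task record."""

    job_id: str
    state: str = "PENDING"


def parse_job_id(job_id: str) -> list[str]:
    """Split a job id, checking the part count before anything indexes into it."""
    parts = job_id.split(_SEPARATOR)
    if len(parts) != _EXPECTED_PARTS or not all(parts):
        raise MalformedJobIdError(f"cannot parse job id {job_id!r}")
    return parts


async def process_task_result(task: Task) -> None:
    """Mark a task FAILED instead of raising when its job id is malformed."""
    try:
        parse_job_id(task.job_id)
    except MalformedJobIdError:
        task.state = "FAILED"
        return
    task.state = "SUCCESS"


async def process_completed_tasks(tasks: list[Task]) -> None:
    """Process all tasks; an unexpected error in one does not stop the others."""
    results = await asyncio.gather(
        *(process_task_result(task) for task in tasks),
        return_exceptions=True,
    )
    for task, result in zip(tasks, results):
        if isinstance(result, Exception):
            task.state = "FAILED"
```

With this shape, a single unparsable job id ends in a FAILED task rather than an exception that recurs on every scheduler iteration.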

Actual behaviour

Director-v2 is slowed down and overwhelmed by the repeated exception.

Steps to reproduce

UNCLEAR

Your environment

aws-prod

@sanderegg
Member

no more occurrences for >3 months

@sanderegg
Member

no more occurrences for >3 months
