
🐛♻️ Bugfix/handle ever running tasks - refactor director-v2 workflow scheduler #2798

Merged

Conversation

@sanderegg (Member) commented Feb 4, 2022

What do these changes do?

Context

The original issue, where a computational pipeline would be locked forever in the Running state (as reported in #2786), stems from the following problems:

  • the director-v2 workflow scheduler and the code that retrieved computational task results followed two different code paths
  • in case of an issue with the dask client (living in the director-v2) that triggers a reset of the dask client (either a crash of the director-v2 or a proper reset of the client), the task result is lost and the state of the task is never updated --> the pipeline stays in ever-running mode

Also, while analyzing these problems, I found that the dask backend has an issue with how we retrieve logs/progress/state of the tasks using the dask Pub/Sub mechanism (see the dask Pub/Sub reference).

Details

  • re-designed the workflow scheduler following the one-writer/several-readers concept:
    • the previous design used a dask.Future callback to react when a task was done
    • the new design checks the task status in the backend as part of the workflow scheduler run and updates the tasks itself; the callback is not used anymore (NOTE: getting the task status from the dask-scheduler might be resource-intensive on the dask-scheduler and will have to be monitored)
    • the computational tasks are persisted onto the dask-scheduler, therefore if the director-v2 or its dask-client is restarted the tasks are still available and can continue being scheduled (see also Notes on Persistency below)
  • implemented a workaround for the dask-backend issue with disconnecting/reconnecting the dask Pub/Sub communication system (e.g. the connection to the dask-scheduler is closed without removing the distributed.Sub instances one by one, which seems to be better handled)
  • dask tasks must return a specific Python exception (a.k.a. TaskCancelledError) when cancelled. NOTE: returning asyncio.CancelledError is not supported and actually breaks the internals of the dask-worker application
  • properly show debug logs in the dask-scheduler when LOG_LEVEL is set to DEBUG
  • re-implemented in-house task cancellation using a distributed.Event instead of the Pub/Sub mechanism (a sketch follows this list)
  • additional tests for the use-case where the director-v2 starts with tasks already running/aborted/failed/lost
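
A minimal sketch of the cancellation flow, assuming a job-scoped event name; TaskCancelledError and long_running_task below are illustrative stand-ins, not the actual director-v2/dask-sidecar code:

# Hedged sketch: cooperative cancellation with distributed.Event instead of Pub/Sub
from distributed import Event


class TaskCancelledError(Exception):
    """Stand-in for the dedicated exception a cancelled dask task must use."""


def long_running_task(job_id: str) -> dict:
    cancel_event = Event(f"cancel_{job_id}")  # event state lives on the dask-scheduler
    for chunk in range(1000):
        if cancel_event.is_set():
            # use the dedicated exception; asyncio.CancelledError would break
            # the dask-worker internals
            raise TaskCancelledError(f"task {job_id} was cancelled")
        ...  # do one chunk of actual work here
    return {"job_id": job_id, "state": "SUCCESS"}

On the director-v2 side, cancelling then boils down to Event(f"cancel_{job_id}", client=dask_client).set(), which every dask-sidecar can observe through the scheduler without keeping a Pub/Sub connection alive.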

Journey of a task

  1. A pipeline of tasks is created in the GUI or through the PublicAPI
  2. The tasks are transferred to the webserver and saved in the Projects table
  3. The tasks are transferred to the director-v2 and saved in the comp_tasks and comp_pipeline tables respectively
  4. When the user presses the Run button, an entry is created in comp_runs; the comp_tasks and comp_pipeline tables are then in partial read-only mode (only outputs/state are changeable in comp_tasks)
  5. The director-v2 workflow-scheduler runs every 5 seconds (see the sketch after this list) and takes care of:
    a. checking the state of PENDING/STARTED tasks in the selected dask backend, retrieving their results/errors and updating the comp_tasks table
    b. finding which tasks are candidates to run next by analyzing the DAG (directed acyclic graph)
    c1. starting the relevant tasks,
    c2. or stopping the pipeline if the user pressed the stop button
  6. When the pipeline is completed (success/failure/aborted), the scheduler stops monitoring it
  7. comp_tasks and comp_pipeline are released
  8. Every time the outputs/state fields of the comp_tasks table are updated, a signal is sent to the webserver, which updates the Projects table and emits a websocket signal to the webclient
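
To make step 5 concrete, here is a minimal sketch of the polling loop, assuming illustrative db/backend helpers (list_active_runs, update_task_states, etc.) that are not the real director-v2 API:

# Hedged sketch of the periodic workflow-scheduler loop described above
import asyncio
from typing import Any

SCHEDULING_INTERVAL_S = 5


async def scheduler_loop(db: Any, backend: Any) -> None:
    while True:
        for run in await db.list_active_runs():  # one entry per comp_runs row
            # 5a. refresh PENDING/STARTED tasks from the dask backend and
            #     write their results/errors back into comp_tasks
            await backend.update_task_states(run)
            # 5b. walk the DAG to find tasks whose dependencies are completed
            runnable_tasks = await db.next_runnable_tasks(run)
            if run.stop_requested:
                # 5c2. the user pressed the stop button
                await backend.stop_pipeline(run)
            else:
                # 5c1. start the tasks that are ready
                await backend.start_tasks(run, runnable_tasks)
            # 6./7. stop monitoring completed pipelines and release the tables
            if await db.is_pipeline_done(run):
                await db.release(run)
        await asyncio.sleep(SCHEDULING_INTERVAL_S)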

Notes on Persistency

  1. When a task is scheduled to run on the computational backend, it is submitted to the dask-scheduler
  2. The dask-scheduler keeps a future of the task in its local memory (thus, if the director-v2 restarts, the task is not lost); see the sketch after this list
  3. If the dask-sidecar working on the task dies, the task is moved to another worker
  4. When the work is done, the results are kept in the dask-sidecar's local memory (if that dask-sidecar dies before the result is fetched, the task will be rerun on another dask-sidecar if possible)
  5. If the dask-scheduler dies for some reason while some task results were still not recovered, then we are doomed --> the pipeline is FAILED. In all other cases, either the results are still available or the task will automatically be rerun.
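
One way to obtain this behaviour with plain dask.distributed, sketched below, is to submit each task with a known key, hand it over to the scheduler with fire_and_forget, and re-attach to the same task by key after a restart; the key format and run_computational_task are illustrative only:

# Hedged sketch of the persistency idea with plain dask.distributed
from distributed import Client, Future, fire_and_forget


def run_computational_task(job_id: str) -> dict:
    ...  # illustrative stand-in for the actual dask-sidecar work
    return {"job_id": job_id, "state": "SUCCESS"}


def submit_persistent(client: Client, job_id: str) -> None:
    future = client.submit(run_computational_task, job_id, key=f"task_{job_id}")
    # the dask-scheduler keeps the task alive even if this client disappears
    fire_and_forget(future)


def reattach_after_restart(client: Client, job_id: str) -> Future:
    # a fresh client (e.g. after a director-v2 restart) rebuilds the future
    # from the known key and can poll its status or fetch the result
    return Future(f"task_{job_id}", client=client)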

Related issue/s

fixes #2786

How to test

IMPORTANT: test in production mode, because the dask backend appears to be more resilient this way

# set-up prod build
make build up-prod

Once tasks are running:

# run some tasks and randomly do:
docker service update --force master-simcore_dask-sidecar
# or
docker service update --force master-simcore_dask-scheduler
# or 
docker service update --force master-simcore_director-v2

@sanderegg self-assigned this Feb 4, 2022
codecov bot commented Feb 4, 2022

Codecov Report

Merging #2798 (ef9a5f2) into master (8fd4e45) will increase coverage by 1.8%.
The diff coverage is 88.2%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #2798     +/-   ##
========================================
+ Coverage    77.2%   79.1%   +1.8%     
========================================
  Files         666     672      +6     
  Lines       27327   27468    +141     
  Branches     3162    3205     +43     
========================================
+ Hits        21112   21733    +621     
+ Misses       5494    4988    -506     
- Partials      721     747     +26     
Flag Coverage Δ
integrationtests 65.6% <70.0%> (-0.1%) ⬇️
unittests 74.6% <87.5%> (+2.1%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...rc/simcore_service_director_v2/core/application.py 87.5% <ø> (-0.3%) ⬇️
...e_service_director_v2/modules/dask_clients_pool.py 92.7% <ø> (ø)
...ector_v2/modules/db/repositories/comp_pipelines.py 94.2% <ø> (+8.0%) ⬆️
...e_director_v2/modules/db/repositories/comp_runs.py 97.6% <ø> (+4.6%) ⬆️
..._director_v2/modules/db/repositories/comp_tasks.py 95.7% <ø> (-0.4%) ⬇️
...ore_service_director_v2/utils/dask_client_utils.py 74.6% <74.6%> (ø)
...r-v2/src/simcore_service_director_v2/utils/dask.py 97.1% <83.3%> (+6.2%) ⬆️
...car/src/simcore_service_dask_sidecar/dask_utils.py 91.2% <91.3%> (+1.5%) ⬆️
...simcore_service_director_v2/modules/dask_client.py 91.7% <91.3%> (+9.7%) ⬆️
...rector_v2/modules/comp_scheduler/base_scheduler.py 86.7% <91.6%> (+3.4%) ⬆️
... and 58 more

@pcrespov modified the milestone: Rudolph+1 (Feb 4, 2022)
@sanderegg force-pushed the bugfix/handle_ever_running_tasks branch 2 times, most recently from d0ab9a2 to 7ff8031 (Feb 17, 2022)
@sanderegg force-pushed the bugfix/handle_ever_running_tasks branch 2 times, most recently from e9ccfd4 to e84e4fb (Feb 22, 2022)
@sanderegg force-pushed the bugfix/handle_ever_running_tasks branch from 6bdae8f to 2825dbb (Feb 23, 2022)
@sanderegg changed the title from "WIP: Bugfix/handle ever running tasks" to "WIP: Bugfix/handle ever running tasks - refactor director-v2 workflow scheduler" (Feb 24, 2022)
Member Author

@sanderegg left a comment

please fix my own errors

@sanderegg force-pushed the bugfix/handle_ever_running_tasks branch from 0eab8b9 to 08a4dab (Feb 24, 2022)
@sanderegg changed the title from "WIP: Bugfix/handle ever running tasks - refactor director-v2 workflow scheduler" to "Bugfix/handle ever running tasks - refactor director-v2 workflow scheduler" (Feb 24, 2022)
@sanderegg changed the title from "Bugfix/handle ever running tasks - refactor director-v2 workflow scheduler" to "🐛♻️ Bugfix/handle ever running tasks - refactor director-v2 workflow scheduler" (Feb 24, 2022)
@sanderegg marked this pull request as ready for review (Feb 24, 2022)
@sanderegg force-pushed the bugfix/handle_ever_running_tasks branch from 8562527 to 985fd62 (Feb 25, 2022)
@sanderegg force-pushed the bugfix/handle_ever_running_tasks branch from 5a3c1b2 to ef9a5f2 (Feb 28, 2022)
@sanderegg merged commit 99d70df into ITISFoundation:master (Mar 1, 2022)
@sanderegg deleted the bugfix/handle_ever_running_tasks branch (Mar 1, 2022)
Development

Successfully merging this pull request may close these issues.

computational service stays in "Running" state