
Race condition when suspending tasks #1024

Closed
romw opened this issue Feb 4, 2015 · 5 comments · Fixed by #5178

Comments

romw (Member) commented Feb 4, 2015

Reported by Martin Suchan:
I've just noticed this issue when suspending tasks manually in BOINC Manager. Situation:

Win7 x86, BM 6.12.15, only WCG project, Core2Duo - 2 cores
I have about 10 downloaded tasks: one is completed and reported, two are running, and the rest have not started yet but are allowed to start once another task finishes.

**I selected all not-yet-started tasks plus one running task and clicked the Suspend button** in the left command bar.

I expected that all tasks would be marked as Suspended at once, and that the running one would stop as well.

What actually happened? **One not-yet-started task ran for about one second and was then suspended.** I guess the **"change status to suspended" operation is not done in a transactional way**. My guess at the sequence: some function got the list of tasks to suspend and suspended them one at a time. First it suspended the one running task. At that moment another thread noticed there was a free run slot, found a ready task, and started it (a typical race condition). Meanwhile, the first thread finished suspending the remaining tasks, including the one the other thread had just started.
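
A minimal C++ sketch of the interleaving guessed at above: one thread suspends the selected tasks one at a time, while a scheduler thread fills any free slot with a ready task. This is a toy model with invented names, not BOINC's actual client code.

```cpp
// Toy model of the suspected race: per-task suspension is not atomic
// over the whole selection, so the scheduler can start a task in the
// gap between two suspend operations.
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

enum class State { Ready, Running, Suspended };
struct Task { const char* name; State state; };

std::mutex m;
std::vector<Task> tasks = {
    {"running_task", State::Running},
    {"ready_task_1", State::Ready},
    {"ready_task_2", State::Ready},
};

// Thread 1: suspend each selected task individually.
void suspend_selected() {
    for (auto& t : tasks) {
        std::lock_guard<std::mutex> lk(m);
        if (t.state != State::Suspended) {
            std::printf("suspending %s\n", t.name);
            t.state = State::Suspended;
        }
        // Lock released between iterations: the scheduler can run here
        // and start a task we are about to suspend.
    }
}

// Thread 2: if no task is running, start the first ready task.
void scheduler_pass() {
    std::lock_guard<std::mutex> lk(m);
    for (auto& t : tasks) if (t.state == State::Running) return;
    for (auto& t : tasks) {
        if (t.state == State::Ready) {
            std::printf("starting %s\n", t.name);
            t.state = State::Running;
            return;
        }
    }
}

int main() {
    std::thread suspender(suspend_selected);
    std::thread scheduler([] {
        for (int i = 0; i < 1000; ++i) {
            scheduler_pass();
            std::this_thread::yield();
        }
    });
    suspender.join();
    scheduler.join();
    // Depending on the interleaving, "starting ready_task_1" appears
    // between the suspend messages -- the one-second-run symptom above.
}
```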

This should be fixed, in my opinion. It could lead to bigger problems on 8+ core systems running a lot of projects.

Event log:

Task faah19421_ZINC17130909_xmdEq_1TW7_02_0 is running; task HFCC_L4_01202033_L4_0001_0 is in the group selected for suspending, but it gets started for about one second:

9.3.2011 9:50:46 |  | Suspending computation - user request
9.3.2011 9:50:50 |  | Resuming computation
9.3.2011 9:50:54 | World Community Grid | task faah19421_ZINC17130909_xmdEq_1TW7_02_0 suspended by user
9.3.2011 9:50:55 | World Community Grid | task oe781_00061_9 suspended by user
9.3.2011 9:50:55 | World Community Grid | task X0000065610008200603171636_1 suspended by user
9.3.2011 9:50:55 | World Community Grid | task X0000065621388200603241639_0 suspended by user
9.3.2011 9:50:55 | World Community Grid | Starting HFCC_L4_01202033_L4_0001_0
9.3.2011 9:50:55 | World Community Grid | Starting task HFCC_L4_01202033_L4_0001_0 using hfcc version 640
9.3.2011 9:50:56 | World Community Grid | task HFCC_L4_01202033_L4_0001_0 suspended by user
9.3.2011 9:50:56 | World Community Grid | task X0000065671034200603171856_1 suspended by user

Migrated-From: http://boinc.berkeley.edu/trac/ticket/1048

Ageless93 (Contributor) commented

The scheduler was completely changed between BOINC 6 and 7. Is this still an issue when running the same scenario on BOINC 7.6?

@Ageless93 Ageless93 added this to Backlog in Client and Manager via automation Nov 10, 2017
RichardHaselgrove (Contributor) commented

I assume this happens because the GUI RPC protocol can only suspend one task per RPC. Suspending a block of tasks (multi-select) is something I do regularly, but I don't usually include a running task in the mix. Presumably it's implemented by sending a stream of single-task RPCs in quick succession.

The behaviour will be determined by the relative speeds of the host running the Manager, the host running the client, the communication link between them, and the reaction time of the running task to a suspend message. On my network (fast modern machines with gigabit LAN), the 'suspend' RPCs are processed fast enough not to allow a new task to start until the block suspend is complete.

I can see the possibility that 'suspend task, start next task' completes before 'suspend next task' is ready to be acted upon, leaving a number of tasks suspended, waiting to run, with 1 second of elapsed time showing. That creates a large number of unnecessary slot directories, and possibly occupies extra memory, but isn't fatal. Eliminating the race condition would involve rewriting the RPC mechanism to allow batching. I think that's too much work to resolve what is in reality a minor problem, and one that can easily be worked around on the rare systems where it is a problem: suspend only the 'ready to start' tasks in the batch, then suspend the running task singly once all the other suspends have been processed.
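
A sketch of what that stream of single-task RPCs presumably looks like on the Manager side, assuming it uses RPC_CLIENT::result_op() from BOINC's lib/gui_rpc_client.h; the loop below is illustrative, not the Manager's actual code.

```cpp
// Assumed shape of a multi-select suspend: one GUI RPC round trip per
// selected task, issued in order.
#include <vector>
#include "gui_rpc_client.h"  // RPC_CLIENT, RESULT

int suspend_selection(RPC_CLIENT& rpc, std::vector<RESULT*>& selected) {
    for (RESULT* r : selected) {
        int retval = rpc.result_op(*r, "suspend");  // one RPC per task
        if (retval) return retval;
        // Between iterations the client is free to schedule new work;
        // that gap is where the race described above can bite.
    }
    return 0;
}
```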

davidpanderson (Contributor) commented

I agree that this is not high priority. However, it wouldn't actually be that hard to extend the current RPCs to handle batches (of jobs or projects).
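
A purely hypothetical sketch of what such a batched request could look like on the wire. No `<suspend_results>` (plural) operation exists in the GUI RPC protocol at the time of this comment; the element names below are invented for illustration.

```cpp
// Hypothetical wire format for a batched suspend (invented names).
const char* batched_suspend_request = R"(
<boinc_gui_rpc_request>
  <suspend_results>
    <result><project_url>URL_1</project_url><name>TASK_1</name></result>
    <result><project_url>URL_1</project_url><name>TASK_2</name></result>
  </suspend_results>
</boinc_gui_rpc_request>
)";
// The client would mark every listed task suspended in one pass of its
// state machine, closing the window in which the scheduler can start a
// not-yet-suspended task from the same selection.
```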

ghost commented Oct 18, 2018

10-18-18, BOINC 7.14.2: On a Windows 7 Pro box (Intel Core 2 Duo E8500 @ 3.16GHz, 4GB DRAM, nVIDIA GF 8400 GS as display adapter, GPU suspended in BOINC), I was able to suspend multiple combinations of running and waiting (SETI and Einstein) tasks, resume them, and re-suspend them multiple times, with an instantaneous change to all highlighted tasks as viewed in the Tasks window.

When an already suspended task was included in the list, the resume/suspend button was grayed out.

On a SuperMicro with 2 x 6-core Xeon X5650 (12 physical cores), 16 GB ECC RAM, no GPU, running Win10 Pro: I was able to suspend and resume running and waiting (Einstein, MW 12-core WU, and WCG) tasks multiple times with an instantaneous change as viewed in the Tasks window.
When an already suspended task was included in the list, the resume/suspend button was grayed out.

No evidence of the race condition was observed.

The above tests were done at the computer's console keyboard and mouse.

Vulpine05 (Contributor) commented

I just tested this a few times on my Core 2 Duo T7250 laptop running Ubuntu 18.04.6 and couldn't reproduce it, for what it's worth. Regardless, a solution may be to suspend the tasks that are not active first, then the active tasks, as sketched below.
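
A sketch of that ordering on top of the assumed single-task result_op() RPC from the earlier comment, using RESULT::active_task from lib/gui_rpc_client.h to separate executing tasks from the rest; the field usage here is an assumption.

```cpp
// Suspend not-running tasks first so no free slot opens up until the
// running tasks in the selection are suspended last.
#include <vector>
#include "gui_rpc_client.h"  // RPC_CLIENT, RESULT

int suspend_inactive_first(RPC_CLIENT& rpc, std::vector<RESULT*>& sel) {
    // Pass 1: tasks with no executing instance; suspending these
    // cannot free a CPU slot.
    for (RESULT* r : sel) {
        if (!r->active_task) {
            if (int rv = rpc.result_op(*r, "suspend")) return rv;
        }
    }
    // Pass 2: running tasks. Slots open up here, but everything else
    // in the selection is already suspended.
    for (RESULT* r : sel) {
        if (r->active_task) {
            if (int rv = rpc.result_op(*r, "suspend")) return rv;
        }
    }
    return 0;
}
```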

@AenBleidd AenBleidd removed this from Backlog in Client and Manager Apr 4, 2023
@AenBleidd AenBleidd added this to Backlog in BOINC Client/Manager via automation Apr 4, 2023
@AenBleidd AenBleidd removed this from To do in BOINC Client/Manager Apr 4, 2023
@AenBleidd AenBleidd moved this from Backlog to In Testing in BOINC Client/Manager Apr 4, 2023
BOINC Client/Manager automation moved this from In Testing to Done Apr 4, 2023