Client: loses track of new work requirements #4117

Closed
RichardHaselgrove opened this issue Dec 1, 2020 · 21 comments · Fixed by #4592

Comments

@RichardHaselgrove
Contributor

Describe the bug
The client sometimes gets into an internal state where it makes repeated requests for extra work that bear little resemblance to its real needs.

Steps To Reproduce
This bug cannot be reproduced on demand, so I've gathered as much data as I can for this occurrence.

System Information

  • OS: Windows 7/64
  • Hardware: Quad-core CPU (i5), 1x Intel_gpu, 2x NVidia GPU
  • BOINC Version: v7.16.11 release - but has been observed over many years with previous versions

Additional context
The particular episode under investigation started on 24-Nov-2020, with the machine making 108 work requests between

24-Nov-2020 09:59:40 [Rosetta@home] [sched_op] CPU work request: 65.56 seconds; 0.00 devices
24-Nov-2020 09:59:42 [Rosetta@home] [sched_op] estimated total CPU task duration: 28803 seconds

and

24-Nov-2020 11:07:56 [Rosetta@home] [sched_op] CPU work request: 4697.21 seconds; 0.00 devices
24-Nov-2020 11:07:58 [Rosetta@home] [sched_op] estimated total CPU task duration: 28803 seconds

Full list

108 tasks, estimated at 8 hours each, would have kept all four CPU cores occupied for 9 full days. Rosetta tasks have a three-day deadline: the server should never have allocated so many tasks, and the client should never have requested them. My work fetch settings are 0.25 days + 0.05 days (21600.00 + 4320.00 sec).
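
As a quick check of those figures (plain arithmetic in C++, not BOINC code):

    #include <cstdio>

    int main() {
        // 108 tasks at ~8 hours each, spread over 4 CPU cores
        double committed_days = 108 * 8.0 / 4 / 24;   // = 9.0 days
        // work-fetch settings: 0.25 days minimum + 0.05 days additional
        double buf_min = 0.25 * 86400;                // = 21600 s
        double buf_add = 0.05 * 86400;                // =  4320 s
        printf("committed %.1f days vs buffer %.0f + %.0f s\n",
               committed_days, buf_min, buf_add);
        return 0;
    }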

I completed as many as possible, but the client aborted a substantial number on the third day for "not started by deadline". At the same time, the machine had also over-fetched tasks for a second CPU project, NumberFields@Home. Work requests for GPU tasks, from GPUGrid and Einstein, remained normal. All tasks were displayed normally on the tasks tab in BOINC Manager.

In previous years, these sporadic overfetch events have stopped whenever I've attempted to gather logs or other evidence. This time, it appears to be stable, so from a test this evening I've gathered

Work request summary
Complete work fetch log
Sample scheduler request
Matching scheduler reply
Client state before (from boinccmd)
Client state after

I also loaded all relevant files into the client emulator on 28 Nov 2020, as simulation 184 (before a work fetch event) and simulation 185 (after). The simulation run didn't show the same effects. Since the derangement seems to be persistent this time, please suggest other investigations I could carry out.

@davidpanderson
Contributor

I don't know what the problem is, but it's related to the "max concurrent" mechanism,
specified in the projects/X/app_config.xml files.
It looks like you're using this for both Rosetta and NumberFields with a max of 2.
In simulation 184 it looks like you tried to include both app_config.xml files in the simulation.
The one for NumberFields made it OK, but not the one for Rosetta; maybe there was a URL typo.
In simulation 185 neither is present.

Note: I added code to the web page that shows a scenario so that it displays the app_config.xml files; see e.g.
https://boinc.berkeley.edu/dev/sim_web.php?action=show_scenario&name=184

So please try the simulations again; it's possible that there's a bug in uploading 2 app config files,
but I see other scenarios with 2 files.

In any case: whatever the basic problem is, it's in the "round-robin simulation" done by the client.
Please set the rr_sim_debug log flag and post the output for one of the bad work-fetch decisions.

@RichardHaselgrove
Contributor Author

RichardHaselgrove commented Dec 2, 2020 via email

@RichardHaselgrove
Contributor Author

Looking at the timestamps on the two app_config.xml files, I changed them both shortly after the first event on 24 November (at 11:18 and 11:27). The Rosetta change will have been from max 1 to 2, to try to burn off more tasks before the deadline. I can only run two CPU tasks on this machine, because each NVidia task requires a full CPU core for sync support.
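
For reference, an app_config.xml with an app-level max_concurrent of the kind discussed here looks roughly like this - illustrative values only, not the exact files in use (the "rosetta" app name comes from the comments below; NumberFields' app is "GetDecics"):

    <app_config>
        <app>
            <name>rosetta</name>
            <max_concurrent>2</max_concurrent>
        </app>
    </app_config>

The project-wide alternative tried later in the thread uses <project_max_concurrent> instead:

    <app_config>
        <project_max_concurrent>2</project_max_concurrent>
    </app_config>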

@RichardHaselgrove
Contributor Author

I set up an identical machine with the same app_config.xml files (max concurrent in app section), and it started to misbehave some 30 minutes later, after two normal work requests. Summary log attached:

2nd_machine_stdoutdae.txt

I'll do the rr_sim etc. from the original machine a little later.

@RichardHaselgrove
Contributor Author

Scenario 186 created from the original machine. I think this one has uploaded properly.

Log with rr_simulation attached. Only fetched from NumberFields this time. Would you like rr_sim_detail as well?

rr_sim_stdoutdae.txt

On the second machine, I tried it with project_max_concurrent instead, with the same result - overfetched again. So I've taken both app_config.xml files offline, and we can use that machine for further testing if needed.

@cwallbaum

cwallbaum commented Dec 2, 2020

Hi,

iirc, there were two significant changes to the code related to max_concurrent:

We (me and a few people from my team) were able to reproduce the infinite-fetch bug as follows:

  • set project to resource share 0
  • set both buffers ("store at least # days of work" + same for additional work) to 0
  • make use of <max_concurrent> inside an app tag (this problem is/was only related to app-level <max_concurrent>, not <project_max_concurrent>), with max_concurrent < ncpus (= cores/threads).

Assume your CPU has 4 cores and max_concurrent is set to 3 for some app: you will always have 1 free core, and the client will always ask for work for at least that core (and/or at least 1 second). I'm not sure about the case where another project has active task(s) so that there is no idle core/thread (= instance) at all...

The reason for this problem is found in work_fetch.cpp:

    // For a zero-resource-share project, ask for work for any idle
    // instances (plus a nominal 1 second), regardless of the buffer level:
    if (p->resource_share == 0) {
        req_instances = nidle_now;
        req_secs = 1;
        return;
    }

I don't remember whether we tested this with any MT-capable (multi-threaded) app, and I don't remember whether setting the buffers to 0 was really necessary, because we had another issue that caused an infinite-fetch situation under some circumstances. We haven't figured out the root cause of that one, which is also one reason why we haven't filed a bug report yet.

I wrote this down from memory and my notes on the issue, so please forgive me if it's not entirely accurate or has since changed due to bugfixes... At least I hope it can serve as a little hint for further investigation :).

walli

@davidpanderson
Contributor

I couldn't repro walli's case; I have 1 project (Rosetta) with zero resource share and zero work buffer params, and max_concurrent=2 for the "rosetta" app, and I'm not seeing unbounded work fetch.
Can anyone else repro it?

Richard, did you see the bad behavior in your latest simulation?

@RichardHaselgrove
Contributor Author

Richard, did you see the bad behavior in your latest simulation?

I couldn't see what caused it in the rr_sim log, but I found the timeline illuminating.

03:34:00 RPC to Rosetta@home: [work_fetch] request: CPU (12382.46 sec)
04:21:00 RPC to Rosetta@home: [work_fetch] request: CPU (17912.28 sec)
04:36:00 RPC to Rosetta@home: [work_fetch] request: CPU (19681.15 sec)
04:44:00 RPC to Rosetta@home: [work_fetch] request: CPU (20624.38 sec)
04:56:00 RPC to Rosetta@home: [work_fetch] request: CPU (22038.04 sec)
05:42:00 RPC to Rosetta@home: [work_fetch] request: CPU (27452.65 sec)
06:45:00 RPC to Rosetta@home: [work_fetch] request: CPU (34868.94 sec), got two tasks
07:21:00 RPC to Rosetta@home: [work_fetch] request: CPU (36987.02 sec), got two tasks
08:39:00 RPC to Rosetta@home: [work_fetch] request: CPU (41574.80 sec), got two tasks
10:09:00 Report, RPC to Rosetta@home: [work_fetch] request: CPU (20381.78 sec)
10:26:00 RPC to Rosetta@home: [work_fetch] request: CPU (21176.82 sec)
11:19:00 RPC to Rosetta@home: [work_fetch] request: CPU (23576.91 sec)
12:03:00 RPC to Rosetta@home: [work_fetch] request: CPU (25601.99 sec)
13:12:00 Report only
15:46:00 RPC to Rosetta@home: [work_fetch] request: CPU (7094.27 sec)
16:40:00 RPC to Rosetta@home: [work_fetch] request: CPU (10052.38 sec)
17:46:00 RPC to Rosetta@home: [work_fetch] request: CPU (13495.32 sec)
19:31:00 RPC to Rosetta@home: [work_fetch] request: CPU (18265.50 sec)
20:54:00 RPC to Rosetta@home: [work_fetch] request: CPU (22045.65 sec)
21:30:00 RPC to Rosetta@home: [work_fetch] request: CPU (23680.71 sec)
22:10:00 RPC to Rosetta@home: [work_fetch] request: CPU (25510.78 sec)
22:17:00 RPC to Rosetta@home: [work_fetch] request: CPU (25840.80 sec)
23:20:00 Report only

In 24 simulated hours, Rosetta started with 2 tasks, downloaded 24 tasks, reported 3 tasks, and finished with 23 tasks. NumberFields started with 51 tasks, downloaded 8, reported 23, and finished with 36. That's more reasonable.

At an estimated 8 hours per Rosetta task, and only two cores available after GPU reservations, that's 3.8 days of wall-time committed, for a project with 3-day deadlines and a host with a 7.2-hour cache setting. That matches what I'm seeing in real life, except that there it happens over an interval of seconds rather than hours.

@RichardHaselgrove
Contributor Author

Looking at the real-life rr_sim_stdoutdae.txt output from #4117 (comment), I see the simulated runtimes at 02-Dec-2020 15:53:28 as:

NumberFields: 26690.25 seconds, 6 tasks
Rosetta: 61610.66 seconds, 2 tasks
GPUGrid: 20878.41 seconds, 4 tasks
Einstein: 28904.16 seconds, 2 tasks

making 138083.48 seconds in all. Plus,

NumberFields: 31 tasks 'at app max concurrent'
Rosetta: 6 tasks 'at app max concurrent'

Then there's
02-Dec-2020 15:53:29 [NumberFields@home] [sched_op] CPU work request: 34807.28 seconds
I haven't been able to reconcile that number yet.
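
For anyone trying to reconcile numbers like that, here is a rough sketch (made-up names and signature, not the actual BOINC source) of how the CPU shortfall that sizes a work request is accumulated during rr_sim:

    // buf_end = now + work_buf_min + work_buf_additional (25920 s in this log);
    // busy_until[i] = when simulated work on CPU instance i runs out.
    double cpu_shortfall(const double* busy_until, int ncpus, double buf_end) {
        double shortfall = 0;
        for (int i = 0; i < ncpus; i++) {
            // any part of [now, buf_end] with no simulated task on this
            // instance counts towards the shortfall behind the request
            if (busy_until[i] < buf_end) {
                shortfall += buf_end - busy_until[i];
            }
        }
        // tasks skipped 'at app max concurrent' never occupy an instance in
        // the simulation, so instances look idle and the shortfall is inflated
        return shortfall;
    }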

@RichardHaselgrove
Contributor Author

Here's a log which combines both rr_sim and work_fetch in the same file

rr_sim_and_work_fetch_stdoutdae.txt

I allowed new work at close to the point where I judge that work would have been fetched normally - with just below 21,600 seconds of CPU work stored. I see no trace of max_concurrent mentioned in this file until 18:05:25, after the "18:05:20 [NumberFields@home] [work_fetch] request: CPU (8862.86 sec)" took the work cache into the 'additional' days.

@davidpanderson
Contributor

What am I supposed to look at? At what point does an error occur?

@RichardHaselgrove
Contributor Author

I think you were right to draw attention to max_concurrent: I think the problem arises in rr_sim when a project already has enough tasks to reach the work_fetch "target work buffer", but can't run all tasks immediately because of max_concurrent limits.

02-Dec-2020 15:53:28 [---] [rr_sim] doing sim: work fetch
02-Dec-2020 15:53:28 [---] [rr_sim] start: work_buf min 21600 additional 4320 total 25920 on_frac 1.000 active_frac 1.000
02-Dec-2020 15:53:28 [GPUGRID] [rr_sim] 1025.97: 4nurA02_450_4-TONI_MDADex2sn-36-50-RND3389_0 finishes (1.00 CPU + 1.00 NVIDIA GPU) (466134.24G/454.34G)
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] 1935.25: wu_sf7_DS-15x10_Grp583685of1250000_0 finishes (1.00 CPU) (8339.81G/4.31G)
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] 2320.50: wu_sf7_DS-15x10_Grp583689of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] 4255.75: wu_sf7_DS-15x10_Grp583694of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:53:28 [GPUGRID] [rr_sim] 4283.41: e33s40_e6s6p0f340-ADRIA_PBBandit100ns-0-1-RND3833_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (1946110.47G/454.34G)
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] 4641.00: wu_sf7_DS-15x10_Grp583682of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:53:28 [Einstein@Home] [rr_sim] 5492.90: LATeah1066L35_500.0_0_0.0_10872246_2 finishes (0.04 CPU + 1.00 Intel GPU) (738379.81G/134.42G)
02-Dec-2020 15:53:28 [GPUGRID] [rr_sim] 6155.79: 2j7aA02_320_1-TONI_MDADex2sj-34-50-RND5811_1 finishes (1.00 CPU + 1.00 NVIDIA GPU) (2330671.22G/454.34G)
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] at app max concurrent for GetDecics
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] 6576.25: wu_sf7_DS-15x10_Grp583687of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)
02-Dec-2020 15:53:28 [NumberFields@home] [rr_sim] 6961.50: wu_sf7_DS-15x10_Grp583690of1250000_0 finishes (1.00 CPU) (10000.00G/4.31G)

NumberFields reached a total simulated runtime of 6961.50 seconds, which is below 'work_buf min 21600'. A work fetch followed.

But NumberFields already had many more tasks queued: their runtime was excluded from the simulation by 'at app max concurrent for GetDecics'.

When a project has more tasks available than allowed by max_concurrent, we don't abort the surplus: we defer them, and run them later. Their runtime needs to be included in the simulation.

The exclusion is made in https://github.com/BOINC/boinc/blob/master/client/rr_sim.cpp#L340 ff. That code change was made in commit 40f0cb4, whose commit message was

  • work fetch: if a project is at a max concurrent limit,
    don't fetch work from it.

That was part of #2918, the first attempt at fixing the job scheduling bug. The later #3076, 'Re-enable work buffering ...', didn't re-visit that part of rr_sim.
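
As a sketch of what including them would mean (hypothetical types and names, not the code in rr_sim.cpp):

    #include <vector>

    struct SimJob {
        double remaining_secs;       // estimated runtime still to do
        bool   deferred_by_max_conc; // can't start yet because of the app limit
    };

    // current behaviour being questioned: deferred jobs vanish from the estimate
    double buffered_work_excluding_deferred(const std::vector<SimJob>& jobs) {
        double total = 0;
        for (const SimJob& j : jobs) {
            if (!j.deferred_by_max_conc) total += j.remaining_secs;
        }
        return total;
    }

    // what the argument above implies: deferred jobs run later, so count them
    double buffered_work_including_deferred(const std::vector<SimJob>& jobs) {
        double total = 0;
        for (const SimJob& j : jobs) total += j.remaining_secs;
        return total;
    }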

@RichardHaselgrove
Contributor Author

I've been doing some thinking about this one. The problem still remains, so I've been managing it by allowing work fetch for limited periods and setting 'No new work' the rest of the time. I've also been working on a patched client.

RR_sim and WF logs from stock client Windows v7.16.11:
Sim-WF_first.txt
Sim-WF_second.txt

My patched version separates the rr_sim for CPU sched from the rr_sim for work fetch, and omits the max_concurrent check for the work_fetch cycle - I've had to cut out the efficiency cycle that re-uses the previous sim if it's fresh.
Sim-WF_second_patched.txt

After saving that log, I allowed work fetch from both active CPU projects, and observed

13/12/2020 17:05:52 | NumberFields@home | [sched_op] CPU work request: 10672.98 seconds; 0.00 devices
13/12/2020 17:05:54 | NumberFields@home | Scheduler request completed: got 5 new tasks
13/12/2020 17:05:54 | NumberFields@home | [sched_op] estimated total CPU task duration: 12254 seconds

which is to be expected, but since then no further work has been requested. That's how it should be.

I'll let the patched version run overnight. If no problems are visible, I'll apply it to the other machine tomorrow.

Side comment: this machine has 4 cores. Two are reserved for GPU support, one runs NumberFields, and one runs Rosetta. Max_concurrent is set to two for both those projects: at no point is it reached in normal (not simulated) running.

@davidpanderson
Contributor

The RR sim for work fetch needs to model max concurrent as well.

If there is a simulation that exhibits the problem, or if I can repro it on my computer, then I can fix it.

@RichardHaselgrove
Contributor Author

I accept that it needs to model it, but not by reducing the measured work cache below the requested level - that's what makes it feel the need to fetch more.

My patched client has maintained a normal level of cached work overnight, but my stock client continues to over-fetch whenever given the chance:

14/12/2020 10:15:21 | NumberFields@home | [sched_op] CPU work request: 33736.82 seconds; 0.00 devices
14/12/2020 10:15:23 | NumberFields@home | Scheduler request completed: got 14 new tasks
14/12/2020 10:15:23 | NumberFields@home | [sched_op] estimated total CPU task duration: 35291 seconds
14/12/2020 10:15:59 | NumberFields@home | [sched_op] CPU work request: 33535.15 seconds; 0.00 devices
14/12/2020 10:16:01 | NumberFields@home | Scheduler request completed: got 14 new tasks
14/12/2020 10:16:01 | NumberFields@home | [sched_op] estimated total CPU task duration: 35291 seconds

The key feature is that the second request doesn't take any notice of the substantial amount of work received just 40 seconds before. Both sets of tasks have been received and display normally in BOINC Manager.


@Sandman192

BOINC gets work but can't finish it on time? What?
Plus, I won't get ANY work units from one project unless I stop another project. For example: if TN-Grid is scheduled to get work, then only TN-Grid gets work. If WCG is scheduled to get work, then only WCG gets work, and it takes charge and crowds out TN-Grid.

This is happening on 2 of my computers. One Linux and one Windows 10.

"Days overdue; you may not get credit for it. Consider aborting it". ???
"Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that". ??? Unless I stop it from getting other projects.

3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181047_Hs_T154140-OXCT1_wu-277_1614546821865_1 is 3.11 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181053_Hs_T002498-OPTN_wu-183_1614555068070_1 is 3.03 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181055_Hs_T002502-OPTN_wu-232_1614558714182_0 is 2.99 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181074_Hs_T194898-OGN_wu-176_1614588563451_1 is 2.66 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181084_Hs_T116698-OBFC2A_wu-83_1614600576731_1 is 2.50 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181162_Hs_T191705-NELF_wu-18_1614716635947_0 is 1.18 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181166_Hs_T191709-NELF_wu-65_1614721341325_1 is 1.11 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181235_Hs_T175748-MYO1G_wu-194_1614812880971_1 is 0.25 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:05 AM | TN-Grid Platform | Task 181242_Hs_T045070-MYO1E_wu-217_1614821392673_1 is 0.18 days overdue; you may not get credit for it. Consider aborting it.
3/9/2021 3:57:45 AM | SiDock@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 3:57:45 AM | SiDock@home | Project requested delay of 7 seconds
3/9/2021 3:57:52 AM | Rosetta@home | Sending scheduler request: To fetch work.
3/9/2021 3:57:52 AM | Rosetta@home | Requesting new tasks for CPU
3/9/2021 3:57:53 AM | Rosetta@home | Scheduler request completed: got 0 new tasks
3/9/2021 3:57:53 AM | Rosetta@home | No tasks sent
3/9/2021 3:57:53 AM | Rosetta@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 3:57:53 AM | Rosetta@home | Project requested delay of 31 seconds
3/9/2021 3:57:59 AM | GPUGRID | Sending scheduler request: To fetch work.
3/9/2021 4:44:11 AM | TN-Grid Platform | Aborting task 181258_Hs_T100042-MYH7B_wu-211_1614840590806_1; not started and deadline has passed
3/9/2021 4:44:11 AM | TN-Grid Platform | Aborting task 181259_Hs_T100043-MYH7B_wu-90_1614841440934_1; not started and deadline has passed
3/9/2021 9:20:21 AM | SiDock@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 9:20:21 AM | SiDock@home | Project requested delay of 7 seconds
3/9/2021 10:43:19 AM | Rosetta@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 10:43:19 AM | Rosetta@home | Project requested delay of 31 seconds
3/9/2021 2:23:04 PM | SiDock@home | Tasks won't finish in time: BOINC runs 99.6% of the time; computation is enabled 100.0% of that
3/9/2021 2:23:04 PM | SiDock@home | Project requested delay of 7 seconds

@MaxMaximus

MaxMaximus commented Jun 6, 2021

I also see this bug from time to time, usually with Rosetta@home as well, but it is not a project/server-side bug - it is a client bug. The logs show very long loops in which the client keeps requesting work from a project, usually until it reaches the daily quota or realises that many WUs will miss their deadline. The client requests (and gets) only one task per server contact, but keeps repeating such requests until it ends up with a work queue far larger than can be completed before the deadline, regardless of the client cache setting.

It is linked to the <max_concurrent> setting in app_config (if I remove it, the bug does not trigger). But some additional, unknown condition is needed, as I cannot reproduce it intentionally.

Here are my settings from one of the hosts where I see it most often:
it runs a few projects in parallel, including R@H (the other major ones are WCG and Einstein@Home)
R@H project share = 250 (not zero)
max_concurrent = 8-10 tasks at app level (e.g. <max_concurrent>10</max_concurrent> for the rosetta app)
client cache settings = 1 day + 0 days
ncpus = 15 (8 SMT cores = 16 threads, but set to use up to 95% of the CPUs)

If the client "goes crazy" it downloads up to a few hundred tasks, with average runtimes of about 8 hours each and a 3-day deadline.

So it is not a simple matter of the work-fetch calculation ignoring the max_concurrent setting: even without that setting (running R@H on all 16 threads instead of the 8 or 10 allowed by max_concurrent), the client still could not process all of the downloaded work before the deadline. And it far exceeds the specified cache size of 1 day: even running on all 16 threads, at most 24/8 * 16 = 48 tasks should have been downloaded, but the client downloads up to a few hundred.

Additional logs with work_fetch debug enabled show that the client underestimates the amount of work it has (the 'saturated' figure). The log below was taken with 85 R@H WUs in the queue (there were about 200 WUs after the previous "work fetch frenzy", but I had already manually aborted ~100 of them, as they would have missed the deadline anyway).

06/06/2021 04:50:00 |  | [work_fetch] ------- start work fetch state -------
06/06/2021 04:50:00 |  | [work_fetch] target work buffer: 86400.00 + 0.00 sec
06/06/2021 04:50:00 |  | [work_fetch] --- project states ---
06/06/2021 04:50:00 | Einstein@Home | [work_fetch] REC 7569.179 prio -1000.000 can't request work: "no new tasks" requested via Manager
06/06/2021 04:50:00 | Milkyway@Home | [work_fetch] REC 0.000 prio -0.000 can't request work: "no new tasks" requested via Manager
06/06/2021 04:50:00 | Rosetta@home | [work_fetch] REC 8121.046 prio -1.131 can request work
06/06/2021 04:50:00 | World Community Grid | [work_fetch] REC 64037.701 prio -18.384 can't request work: scheduler RPC backoff (26.87 sec)
06/06/2021 04:50:00 | WUProp@Home | [work_fetch] REC 0.011 prio -0.000 can't request work: non CPU intensive
06/06/2021 04:50:00 |  | [work_fetch] --- state for CPU ---
06/06/2021 04:50:00 |  | [work_fetch] shortfall 0.00 nidle 0.00 saturated 88003.55 busy 0.00
06/06/2021 04:50:00 | Einstein@Home | [work_fetch] share 0.000 zero resource share
06/06/2021 04:50:00 | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
06/06/2021 04:50:00 | Rosetta@home | [work_fetch] share 1.000
06/06/2021 04:50:00 | World Community Grid | [work_fetch] share 0.000
06/06/2021 04:50:00 |  | [work_fetch] --- state for AMD/ATI GPU ---
06/06/2021 04:50:00 |  | [work_fetch] shortfall 84055.11 nidle 0.00 saturated 2216.09 busy 0.00
06/06/2021 04:50:00 | Einstein@Home | [work_fetch] share 0.000 zero resource share
06/06/2021 04:50:00 | Milkyway@Home | [work_fetch] share 0.000
06/06/2021 04:50:00 | Rosetta@home | [work_fetch] share 0.000 no applications
06/06/2021 04:50:00 | World Community Grid | [work_fetch] share 0.000
06/06/2021 04:50:00 |  | [work_fetch] ------- end work fetch state -------

If I understand it correctly, the client thinks it has only 88003 seconds worth of CPU work in the queue.

The actual queue contains 85 R@H WUs (not counting other projects) at roughly 8 hours each (the client estimates this correctly; estimated runtimes hover around 8 hr).
So that is 85 * 8 / 10 = 68 hours = 244800 sec if running 10 tasks at a time (as restricted by the max_concurrent setting),
or 85 * 8 / 16 * 3600 = 153000 sec if running on all 16 threads (ignoring max_concurrent and all other projects' shares).
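
The same arithmetic as a quick check (plain C++, not BOINC code):

    #include <cstdio>

    int main() {
        double wus = 85, hours_each = 8;
        double secs_at_mc10 = wus * hours_each / 10 * 3600;   // 244800 s (~68 h)
        double secs_all_16  = wus * hours_each / 16 * 3600;   // 153000 s (~42.5 h)
        printf("%.0f s at max_concurrent=10, %.0f s on all 16 threads\n",
               secs_at_mc10, secs_all_16);
        // both are far above the 'saturated 88003.55' the client reports
        return 0;
    }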

@MaxMaximus

I aborted some more WUs from the queue (still leaving well over 1 day's worth of work) to test, and the client already thinks there is a shortage of work in the queue (but chose to grab work from WCG this time):

06/06/2021 05:36:18 |  | [work_fetch] ------- start work fetch state -------
06/06/2021 05:36:18 |  | [work_fetch] target work buffer: 86400.00 + 0.00 sec
06/06/2021 05:36:18 |  | [work_fetch] --- project states ---
06/06/2021 05:36:18 | Einstein@Home | [work_fetch] REC 7552.069 prio -1000.000 can't request work: "no new tasks" requested via Manager
06/06/2021 05:36:18 | Milkyway@Home | [work_fetch] REC 0.000 prio -0.000 can't request work: "no new tasks" requested via Manager
06/06/2021 05:36:18 | Rosetta@home | [work_fetch] REC 8121.915 prio -0.253 can request work
06/06/2021 05:36:18 | World Community Grid | [work_fetch] REC 64063.509 prio -3.257 can request work
06/06/2021 05:36:18 | WUProp@Home | [work_fetch] REC 0.011 prio -0.000 can't request work: non CPU intensive
06/06/2021 05:36:18 |  | [work_fetch] --- state for CPU ---
06/06/2021 05:36:18 |  | [work_fetch] shortfall 6000.30 nidle 0.00 saturated 82374.18 busy 0.00
06/06/2021 05:36:18 | Einstein@Home | [work_fetch] share 0.000 zero resource share
06/06/2021 05:36:18 | Milkyway@Home | [work_fetch] share 0.000 blocked by project preferences
06/06/2021 05:36:18 | Rosetta@home | [work_fetch] share 0.714
06/06/2021 05:36:18 | World Community Grid | [work_fetch] share 0.286
06/06/2021 05:36:18 |  | [work_fetch] --- state for AMD/ATI GPU ---
06/06/2021 05:36:18 |  | [work_fetch] shortfall 82729.73 nidle 0.00 saturated 3450.10 busy 0.00
06/06/2021 05:36:18 | Einstein@Home | [work_fetch] share 0.000 zero resource share
06/06/2021 05:36:18 | Milkyway@Home | [work_fetch] share 0.000
06/06/2021 05:36:18 | Rosetta@home | [work_fetch] share 0.000 no applications
06/06/2021 05:36:18 | World Community Grid | [work_fetch] share 1.000
06/06/2021 05:36:18 |  | [work_fetch] ------- end work fetch state -------
06/06/2021 05:36:18 | Milkyway@Home | choose_project: scanning
06/06/2021 05:36:18 | Milkyway@Home | skip: "no new tasks" requested via Manager
06/06/2021 05:36:18 | WUProp@Home | choose_project: scanning
06/06/2021 05:36:18 | WUProp@Home | skip: non CPU intensive
06/06/2021 05:36:18 | Rosetta@home | choose_project: scanning
06/06/2021 05:36:18 | Rosetta@home | can fetch CPU
06/06/2021 05:36:18 | Rosetta@home | can't fetch AMD/ATI GPU: no applications
06/06/2021 05:36:18 | World Community Grid | choose_project: scanning
06/06/2021 05:36:18 | World Community Grid | can fetch CPU
06/06/2021 05:36:18 | World Community Grid | CPU needs work - buffer low
06/06/2021 05:36:18 | World Community Grid | checking CPU
06/06/2021 05:36:18 | World Community Grid | [work_fetch] using MC shortfall 6000.295547 instead of shortfall 6000.295547
06/06/2021 05:36:18 | World Community Grid | [work_fetch] set_request() for CPU: ninst 15 nused_total 134.00 nidle_now 0.00 fetch share 0.29 req_inst 0.00 req_secs 6000.30
06/06/2021 05:36:18 | World Community Grid | CPU set_request: 6000.295547
06/06/2021 05:36:18 | World Community Grid | checking AMD/ATI GPU
06/06/2021 05:36:18 | World Community Grid | [work_fetch] using MC shortfall 82729.732807 instead of shortfall 82729.732807
06/06/2021 05:36:18 | World Community Grid | [work_fetch] set_request() for AMD/ATI GPU: ninst 1 nused_total 4.00 nidle_now 0.00 fetch share 1.00 req_inst 0.00 req_secs 82729.73
06/06/2021 05:36:18 | World Community Grid | AMD/ATI GPU set_request: 82729.732807
06/06/2021 05:36:18 | World Community Grid | [work_fetch] request: CPU (6000.30 sec, 0.00 inst) AMD/ATI GPU (82729.73 sec, 0.00 inst)
06/06/2021 05:36:19 | World Community Grid | Sending scheduler request: To fetch work.
06/06/2021 05:36:19 | World Community Grid | Requesting new tasks for CPU and AMD/ATI GPU
06/06/2021 05:36:20 | World Community Grid | Scheduler request completed: got 1 new tasks
06/06/2021 05:36:20 | World Community Grid | Project requested delay of 121 seconds
06/06/2021 05:36:20 | World Community Grid | [work_fetch] backing off AMD/ATI GPU 615 sec
06/06/2021 05:36:20 |  | [work_fetch] Request work fetch: RPC complete

Also - does the client think the computer has 134 CPUs/threads? This line:
[work_fetch] set_request() for CPU: ninst 15 nused_total 134.00 nidle_now 0.00 fetch share 0.29 req_inst 0.00 req_secs 6000.30
If so, that could be a reason for miscalculating the work cache size and hoarding excessive numbers of WUs.
The value jumps up and down with each request:

06-Jun-2021 05:23:47 [World Community Grid] [work_fetch] set_request() for CPU: ninst 15 nused_total 128.00 nidle_now 0.00 fetch share 0.29 req_inst 0.00 req_secs 38886.85
...................
06-Jun-2021 05:28:56 [Rosetta@home] [work_fetch] set_request() for CPU: ninst 15 nused_total 45.00 nidle_now 0.00 fetch share 1.00 req_inst 0.00 req_secs 0.00
....................
06-Jun-2021 05:27:58 [World Community Grid] [work_fetch] set_request() for CPU: ninst 15 nused_total 131.00 nidle_now 0.00 fetch share 0.29 req_inst 0.00 req_secs 17047.13
..................
06/06/2021 05:40:29 | World Community Grid | [work_fetch] set_request() for CPU: ninst 15 nused_total 135.00 nidle_now 0.00 fetch share 0.29 req_inst 0.00 req_secs 2171.49
....................
06-Jun-2021 05:41:16 [Rosetta@home] [work_fetch] set_request() for CPU: ninst 15 nused_total 36.00 nidle_now 0.00 fetch share 1.00 req_inst 0.00 req_secs 0.00

@Sandman192

All I get, if I have WCG in the mix, is just WCG. I left it running for a week and I only have WCG work. In earlier versions of BOINC I always had a little work from the 100% resource-share projects and a lot from the 1000% one; even though WCG is at 1000%, I always had some tasks from the 100% projects (just not as many WUs), ever since I started with BOINC.

I didn't change any settings when updating BOINC. It's just not working like it did before.

@MaxMaximus

I have found why it is sometimes hard to reproduce and why the bug seems to behave randomly: it depends on the other projects (those with no <max_concurrent> setting applied).
If BOINC manages to fill the work buffer with WUs from the other project(s), the <max_concurrent> bug is not triggered.
But if the supply of WUs from the other projects is limited for some time for any reason (server outages, not enough work generated, or simply the user setting the OTHER project to "no new tasks"), BOINC tries to fill the work queue ONLY from the project with the <max_concurrent> setting, and the bug is triggered (the infinite work-downloading loop begins).

P.S.
The <max_concurrent> bug occurs at BOTH the app and project levels.
The behaviour is slightly different, but it leads to the same error: an infinite loop of work downloading for that project, because the internal work-queue calculation stops counting after reaching the <max_concurrent> number of WUs and ignores all the other WUs the client already has in the queue. No matter how many it downloads, it still counts only <max_concurrent> WUs, so the calculated work-queue size never increases as new WUs arrive - hence the infinite loop.

@AenBleidd
Member

Fixed via #4592

Client Release 7.20.0 automation moved this from Backlog to Done Dec 8, 2021