Client: loses track of new work requirements #4117
Comments
I don't know what the problem is, but it's related to the "max concurrent" mechanism, specified in the projects/X/app_config.xml files. It looks like you're using this for both Rosetta and NumberFields with a max of 2.
In simulation 184 it looks like you tried to include both app_config.xml files in the simulation. The one for NumberFields made it OK, but not the one for Rosetta; maybe there was a URL typo. In simulation 185 neither is present.
Note: in the web page that shows a scenario I added code to display the app_config.xml files; see e.g. https://boinc.berkeley.edu/dev/sim_web.php?action=show_scenario&name=184
So please try the simulations again; it's possible that there's a bug in uploading 2 app config files, but I see other scenarios with 2 files. In any case: whatever the basic problem is, it's in the "round-robin simulation" done by the client. Please set the rr_sim_debug log flag and post the output for one of the bad work-fetch decisions.
Thanks. You're right - I have those two app_config.xml files. I also have two other identical machines, not currently showing this problem - I can apply the same two files to one of them, to see if it replicates. I think I know what caused the upload problem to the emulator - I'll try again. I'll burn off a few more of the overfetched tasks overnight (your time), and upload the rr_sim log ready for your morning.
Looking at the timestamps on the two app_config.xml files, I changed them both shortly after the first event on 24 November (at 11:18 and 11:27). The Rosetta change will have been from max 1 to 2, to try to burn off more tasks before the deadline. I can only run two CPU tasks on this machine, because each NVidia task requires a full CPU core for sync support.
I set up an identical machine with the same app_config.xml files (max_concurrent in the app section), and it started to misbehave some 30 minutes later, after two normal work requests. Summary log attached; I'll do the rr_sim etc. from the original machine a little later.
Scenario 186 created from the original machine - I think this one has uploaded properly. Log with rr_simulation attached. Only fetched from NumberFields this time. Would you like rr_sim_detail as well? On the second machine, I tried it with project_max_concurrent instead, with the same result - overfetched again. So I've taken both app_config.xml files offline, and we can use that machine for further testing if needed.
Hi, iirc, there were two significant changes to the code related to max_concurrent:
We (= me and some guys from my team) were able to (re)produce the infinity-bug as follows:
Let's assume your CPU has 4 cores and you have max_concurrent set to 3 for some app: you will always have 1 free core, and the client will always ask for work for at least this core (and/or at least 1 second). I'm not sure about the case when you have another project with any active task(s), so that there is no idle core/thread (= instance) at all... The reason for this problem is found in work_fetch.cpp:
I don't remember if we tested this on any MT-capable/-aware (multi-threaded) app, and I don't remember if setting the buffers to 0 was really necessary in this case, because we had another issue that caused an infinity-fetch situation under some circumstances, but we haven't figured out the root cause of this - that's also one reason why we haven't filed a bug report yet. I wrote this down from memory and my notes on this issue, so please forgive me if it's not very accurate or may have changed due to recent bugfixes... At least I hope it can serve as a little hint for further investigations :). walli
I couldn't repro walli's case; I have 1 project (Rosetta) with zero resource share and zero work buffer params, and max_concurrent=2 for the "rosetta" app, and I'm not seeing unbounded work fetch. Richard, did you see the bad behavior in your latest simulation?
I couldn't see what caused it in the rr_sim log, but I found the timeline illuminating.
03:34:00 RPC to Rosetta@home: [work_fetch] request: CPU (12382.46 sec)
In 24 simulated hours, Rosetta started with 2 tasks, downloaded 24 tasks, reported 3 tasks, and finished with 23 tasks. NumberFields started with 51 tasks, downloaded 8, reported 23, and finished with 36. That's more reasonable. At an estimated 8 hours per Rosetta task, and only two cores available after GPU reservations, that's 3.8 days of wall-time committed, for a project with 3-day deadlines and a host with a 7.2-hour cache setting. That's what I'm seeing in real life, over an interval of seconds rather than hours.
Looking at the real-life rr_sim_stdoutdae.txt output from #4117 (comment), I see the simulated runtimes at 02-Dec-2020 15:53:28 as:
NumberFields: 26690.25 seconds, 6 tasks, making 138083.48 seconds in all.
Plus: NumberFields: 31 tasks 'at app max concurrent'.
Then there's
Here's a log which combines both rr_sim and work_fetch in the same file rr_sim_and_work_fetch_stdoutdae.txt I allowed new work at close to the point where I judge that work would have been fetched normally - with just below 21,600 seconds of CPU work stored. I see no trace of max_concurrent mentioned in this file until 18:05:25, after the "18:05:20 [NumberFields@home] [work_fetch] request: CPU (8862.86 sec)" took the work cache into the 'additional' days. |
What am I supposed to look at? At what point does an error occur? |
I think you were right to draw attention to max_concurrent: I think the problem arises in rr_sim when a project already has enough tasks to reach the work_fetch "target work buffer", but can't run all tasks immediately because of max_concurrent limits.
NumberFields reached a total runtime of 6961.50 during the simulation, which is below 'work_buf min 21600'. A work fetch followed. But NumberFields already had many more tasks queued: their runtime was excluded from the simulation by 'at app max concurrent for GetDecics'. When a project has more tasks available than allowed by max_concurrent, we don't abort the surplus: we defer them and run them later. Their runtime needs to be included in the simulation. The exclusion is made in https://github.com/BOINC/boinc/blob/master/client/rr_sim.cpp#L340 ff. That code change was made in 40f0cb4, when the comment was
That was part of #2918, the first attempt at fixing the job scheduling bug. The later #3076, 'Re-enable work buffering ...', didn't re-visit that part of rr_sim.
Been doing some thinking about this one. The problem still remains, so I've been managing it by allowing work fetch for limited times, and setting 'No new work' at other times.

I've also been working on a patched client. RR_sim and WF logs from the stock client, Windows v7.16.11: My patched version separates the rr_sim for CPU sched from the rr_sim for work fetch, and omits the max_concurrent check for the work_fetch cycle - I've had to cut out the efficiency cycle that re-uses the previous sim if it's fresh.

After saving that log, I allowed work fetch from both active CPU projects, and observed
13/12/2020 17:05:52 | NumberFields@home | [sched_op] CPU work request: 10672.98 seconds; 0.00 devices
which is to be expected, but since then no further work has been requested. That's how it should be. I'll let the patched version run overnight; if no problems are visible, I'll apply it to the other machine tomorrow.

Side comment: this machine has 4 cores. Two are reserved for GPU support, one runs NumberFields, and one runs Rosetta. Max_concurrent is set to two for both those projects: at no point is it reached in normal (not simulated) running.
The RR sim for work fetch needs to model max concurrent as well. If there is a simulation that exhibits the problem, or if I can repro it on my computer, then I can fix it. |
I accept that it needs to model it, but not by reducing the measured work cache below the requested level - that's what makes it feel the need to fetch more. My patched client has maintained a normal level of cached work overnight, but my stock client continues to over-fetch whenever given the chance:
The key feature is that the second request doesn't take any notice of the substantial amount of work received just 40 seconds before. Both sets of tasks have been received and display normally in BOINC Manager. |
BOINC gets work but can't finish on time? What? This is happening on 2 of my computers. One Linux and one Windows 10. "Days overdue; you may not get credit for it. Consider aborting it". ???
I also see this bug from time to time. It is linked to the <max_concurrent> setting in app_config (if I remove it, the bug does not trigger). But some additional, unknown condition is needed, as I cannot reproduce it intentionally. Here are my settings from one of the hosts where I see it most often: If the client "goes crazy" it downloads up to a few hundred tasks with average runtimes of about 8 hours each and 3-day deadlines. So it is not a simple ignoring of the max_concurrent setting in work-fetch calculations: even without this setting (running R@H on all 16 threads instead of the 8 or 10 restricted by max_concurrent), the client still could not process all of the downloaded work before the deadline. Additional logs with work_fetch debug enabled show that the client underestimates the amount of work it has (saturated).
If I understand it correctly, the client thinks it has only 88003 seconds' worth of CPU work in the queue, while the actual queue includes 85 R@H WUs (not counting other projects) at ~8 hours each (the client calculates this part correctly - estimated runtimes float around 8 hr).
I aborted some more WUs from the queue (still leaving well over 1 day's worth of work) to test, and the client already thinks there is a shortage of work in the queue (but chose to grab work from WCG this time):
Also - does the client think the computer has 134 CPUs/threads? This line:
I have found why it is sometimes hard to reproduce, and why the bug behavior appears random.
P.S.
Fixed via #4592
Describe the bug
The client sometimes gets into an internal state where it makes repeated requests for extra work that bear little resemblance to its real needs.
Steps To Reproduce
This bug cannot be reproduced on demand, so I've gathered as much data as I can for this occurrence.
System Information
Additional context
The particular episode under investigation started on 24-Nov-2020, with the machine making 108 work requests between
24-Nov-2020 09:59:40 [Rosetta@home] [sched_op] CPU work request: 65.56 seconds; 0.00 devices
24-Nov-2020 09:59:42 [Rosetta@home] [sched_op] estimated total CPU task duration: 28803 seconds
and
24-Nov-2020 11:07:56 [Rosetta@home] [sched_op] CPU work request: 4697.21 seconds; 0.00 devices
24-Nov-2020 11:07:58 [Rosetta@home] [sched_op] estimated total CPU task duration: 28803 seconds
Full list
108 tasks, estimated at 8 hours each, would have kept all four CPU cores occupied for 9 full days. Rosetta tasks have a three day deadline: the server should never have allocated so many tasks, and the client should never have requested them. My work fetch settings are 0.25 days + 0.05 days (21600.00 + 4320.00 sec)
I completed as many as possible, but the client aborted a substantial number on the third day for "not started by deadline". At the same time, the machine had also over-fetched tasks for a second CPU project, NumberFields@Home. Work requests for GPU tasks, from GPUGrid and Einstein, remained normal. All tasks have been displayed normally on the Tasks tab in BOINC Manager.
In previous years, these sporadic overfetch events have stopped whenever I've attempted to gather logs or other evidence. This time, it appears to be stable, so from a test this evening I've gathered
Work request summary
Complete work fetch log
Sample scheduler request
Matching scheduler reply
Client state before (from boinccmd)
Client state after
I also loaded all relevant files into the client emulator on 28 Nov 2020, as simulation 184 (before a work fetch event) and simulation 185 (after). The simulation run didn't show the same effects. Since the derangement seems to be persistent this time, please suggest other investigations I could carry out.