BOINC may not use all CPUs in some cases #1775

Open
sirzooro opened this issue Jan 29, 2017 · 12 comments

@sirzooro
Contributor

sirzooro commented Jan 29, 2017

I am cleaning up my work queue before the next PrimeGrid challenge and found a case where the BOINC client does not run tasks on all available cores. Right now I have 3 rosetta@home tasks running on 3 of the 8 available CPUs. There are also some ATLAS@Home and Cosmology@Home tasks waiting, but they require 7 or 8 CPUs per WU. Most projects are currently set to not download new tasks, except for one with zero resource usage set. It looks like BOINC only checks whether other tasks are available in the queue, and does not try to download new ones from the project with zero resource usage while downloaded tasks are waiting. This is wrong: it should also check the CPU count those tasks require and compare it with the current free CPU count to eliminate cases like this.

I suspect that other similar cases may also exist, e.g. when some tasks are waiting but there is not enough memory to run them. Please take a look at those too.

Windows 10 64bit, BOINC 7.6.33

Edit: there is one more case. I suspended the rosetta project and BOINC started crunching one Cosmology WU. It finished that one and started an ATLAS WU. The ATLAS WU required more memory, so it stopped working (status is Waiting for memory). Now BOINC does not use any CPU (except for the small fraction reserved for GPU and NCI tasks), even though other Cosmology tasks are ready to start.
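
The check suggested above could look roughly like the following. This is a minimal Python sketch, not the BOINC client's actual code: the `Task` type, its fields, the function name, and the memory figures are all hypothetical and only illustrate the idea. It starts every waiting task that fits into the free CPUs and free memory, and reports how many CPUs stay idle afterwards, which is the amount the client would still need to fetch work for.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    name: str      # hypothetical label for the work unit
    ncpus: float   # CPUs the task needs in order to run
    mem_gb: float  # memory the task needs in order to run (made-up values below)


def idle_cpus_after_scheduling(total_cpus: float, total_mem_gb: float,
                               running: List[Task], waiting: List[Task]) -> float:
    """Return how many CPUs stay idle after starting every waiting task that fits."""
    free_cpus = total_cpus - sum(t.ncpus for t in running)
    free_mem = total_mem_gb - sum(t.mem_gb for t in running)
    for task in waiting:
        # A queued task only counts as usable work if it fits right now.
        if task.ncpus <= free_cpus and task.mem_gb <= free_mem:
            free_cpus -= task.ncpus
            free_mem -= task.mem_gb
    return free_cpus


# Situation from the report: 3 single-CPU tasks on 8 CPUs, while the
# waiting tasks need 7 or 8 CPUs each, so none of them can start.
running = [Task("rosetta", 1, 1.0) for _ in range(3)]
waiting = [Task("atlas", 8, 4.0), Task("cosmology", 7, 2.0)]
idle = idle_cpus_after_scheduling(8, 16.0, running, waiting)
print(idle)
```

A non-zero result while the queue is not empty is the situation described here: queued work exists, but none of it can use the free CPUs.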

@sirzooro
Contributor Author

sirzooro commented Jan 31, 2017

There are also two cases where a GPU may not have work. I am not sure whether they belong in this issue, but they look related:

  • on systems with multiple GPUs, some of them may not get work if all CPUs are busy. Details and logs from someone with 4 Titans are here. I also had a similar problem with my 2 GPUs and fixed it in the same way: I created an app_config for the GPU apps to reduce the required CPU to a small value like 0.01;
  • a similar problem also exists with GPU tasks that need multiple GPUs. The Moo! Wrapper project sends such WUs; it sent me ones that needed both of my 2 GPUs. For some reason the presence of such tasks in the work queue was also a problem for the scheduler, and sometimes it assigned work to only 1 GPU. All other GPU apps were configured to use a small fractional CPU part, so it looks like something related to these Moo! Wrapper tasks. When I finished crunching all downloaded WUs, BOINC started working as expected again.

This was observed on a previous Windows BOINC version (I do not remember exactly which, 7.6.23?). I did not try to reproduce it on the current version.
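
For reference, an app_config.xml of the kind described in the first bullet can be produced as sketched below. The app name `example_gpu_app` is a placeholder that has to be replaced with the project's real app name, and the helper function is hypothetical. The `<gpu_versions>`, `<gpu_usage>` and `<cpu_usage>` elements are the ones app_config.xml uses to set per-task GPU and CPU usage.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, gpu_usage: float, cpu_usage: float) -> str:
    """Build app_config.xml text that sets GPU and CPU usage for one app."""
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name  # placeholder name, see above
    gpu = ET.SubElement(app, "gpu_versions")
    ET.SubElement(gpu, "gpu_usage").text = str(gpu_usage)
    ET.SubElement(gpu, "cpu_usage").text = str(cpu_usage)
    return ET.tostring(root, encoding="unicode")


# One GPU per task, and only a tiny CPU fraction reserved for it.
xml_text = build_app_config("example_gpu_app", 1, 0.01)
print(xml_text)
```

The resulting text would go into app_config.xml in the directory of the project in question.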

@davidpanderson
Contributor

davidpanderson commented Jan 31, 2017 via email

@sirzooro
Contributor Author

sirzooro commented Feb 1, 2017

Thanks for the link. I will try to play with it a bit.

@sorcrosc

sorcrosc commented Feb 3, 2017

This also happens when the <max_concurrent> option in app_config.xml is used for one project to limit the number of tasks that run simultaneously. If BOINC has plenty of workunits for that project, it doesn't request more work from the others, and some cores remain dry.
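
A rough way to picture it, as a Python sketch with made-up numbers rather than anything taken from the client: the function name is hypothetical, and it assumes every task uses exactly one CPU. However many tasks are queued, only `max_concurrent` of them can run, so the rest of the CPUs need work from somewhere else.

```python
def cpus_left_idle(total_cpus: int, queued_tasks: int, max_concurrent: int) -> int:
    """CPUs that a single capped project cannot fill (one CPU per task assumed)."""
    # Tasks beyond the cap sit in the queue but cannot be started.
    runnable = min(queued_tasks, max_concurrent)
    return max(total_cpus - runnable, 0)


# Hypothetical host: 8 CPUs, 50 queued tasks, max_concurrent set to 4.
idle = cpus_left_idle(total_cpus=8, queued_tasks=50, max_concurrent=4)
print(idle)
```

Counting all 50 queued tasks as usable work makes the queue look full, even though half of the CPUs in this example have nothing to run.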

@sirzooro
Contributor Author

sirzooro commented Mar 20, 2017

One more issue, just reported on WUProp forum:

Just in case anyone encounters the same issue. A couple of inactive NCI projects prevented me from getting any work for any hardware on one system today. BM (7.6.33 [x64]) event log:
Not requesting tasks: don't need (CPU: not highest priority project; Miner ASIC: not highest priority project; NVIDIA GPU: not highest priority project)

Had run out of Asic & GPU work and was about to run out of CPU work (only 2/7 logical cores being used). BM just kept asking for nci work (PoD style) & ignored the other projects/devices completely.
Serious scheduler bug IMO + stupid error message (CPU isn't a project, even if the code deludes itself into thinking otherwise).

@ChristianBeer ChristianBeer added this to the Client/Manager 8.0 milestone Apr 12, 2017
@Toby-Broom

I see the same as sorcrosc on LHC, as they have a job limit of 24. If I set this project to unlimited for the Sixtrack app, it queues tasks based on BOINC's cache settings. If I set it to 24, it queues no tasks: it just runs up to that limit, and when one task finishes it gets another.

@sirzooro
Contributor Author

sirzooro commented Jun 23, 2017

One more case (maybe a duplicate of one already mentioned): DENIS is performing some maintenance work now. It sends WUs, but the input files cannot be downloaded, so the WUs end with "download error". This somehow prevents downloading WUs from Asteroids, my backup project. I saw this in the log when I tried to manually update the project to download new WUs:

300324 Asteroids@home 2017-06-23 07:24:28 Sending scheduler request: Requested by user.
300325 Asteroids@home 2017-06-23 07:24:28 Not requesting tasks: don't need (not highest priority project)

It looks like these faulty DENIS WUs prevented downloads of other ones from the backup project. I had 16 of them in the queue. The remaining 16 CPUs were getting WUs from Asteroids as expected. This was on BOINC 7.6.22 for Linux.

@sirzooro
Contributor Author

sirzooro commented Jul 27, 2017

One more case, and this one is interesting. I am crunching "GFN-13 Prime Search" from "PRIVATE GFN SERVER" (run by stream, https://www.primegrid.com/forum_thread.php?id=6511). One of the results for a completed WU could not be uploaded, and somehow this prevented downloading new WUs from this project, so the BOINC client switched to the backup project. This is what I found in the log:

225112	PRIVATE GFN SERVER	2017-07-27 17:32:53	Requesting new tasks for CPU	
225113	PRIVATE GFN SERVER	2017-07-27 17:32:59	Scheduler request completed: got 0 new tasks	
225114	PRIVATE GFN SERVER	2017-07-27 17:32:59	Result gfn13_72132256_1499672386_1 is no longer usable	
225115	PRIVATE GFN SERVER	2017-07-27 17:32:59	No tasks sent	

I aborted this upload and requested a project update. After that, new WUs were downloaded without problems:

226443	PRIVATE GFN SERVER	2017-07-27 20:04:54	update requested by user	
226444	PRIVATE GFN SERVER	2017-07-27 20:04:56	Sending scheduler request: Requested by user.	
226445	PRIVATE GFN SERVER	2017-07-27 20:04:56	Reporting 1 completed tasks	
226446	PRIVATE GFN SERVER	2017-07-27 20:04:56	Requesting new tasks for CPU	
226447	PRIVATE GFN SERVER	2017-07-27 20:05:01	Scheduler request completed: got 15 new tasks	

I am not sure whether this is a problem with the client or the server; it may be on either side.

@Toby-Broom

Another example: if a task goes into the "VM unmanageable" state, the client depletes the queued tasks and then just sits there with 1 bad task until you abort it, after which it reloads n tasks.

@sirzooro
Contributor Author

sirzooro commented Sep 15, 2017

And the next one: I configured one project via app_config.xml to use 22 out of 32 cores. The remaining 10 were left for another project with very short tasks. That 2nd project also has a very limited WU supply, so BOINC was not able to build a buffer for it. As a result, BOINC kept downloading tasks from the 1st project until it filled the work queue. At that point it stopped trying to download tasks from the 2nd project because the queue was full, so the 10 cores reserved for it were idle.

@davidpanderson
Contributor

Can you reproduce this on the client emulator?
https://boinc.berkeley.edu/dev/sim_web.php
That makes it easier for me to fix the problem.

@Toby-Broom

Toby-Broom commented Sep 18, 2017

My PC became VM unmanageable. Here is a sim with the required files; I didn't check whether the sim was blocked: https://boinc.berkeley.edu/dev/sim_web.php?action=show_simulation&scen=154&sim=0

Here is one with the 24-job limit:
https://boinc.berkeley.edu/dev/sim_web.php?action=simulation_form&scen=155

@Ageless93 Ageless93 added this to Backlog in Client and Manager via automation Nov 11, 2017
@AenBleidd AenBleidd removed this from Backlog in Client and Manager Aug 14, 2023
Projects
Status: Backlog
Development

No branches or pull requests

5 participants