2nd GPU WU overallocated CPUs #106

Closed
alexatkinuk opened this issue Jan 31, 2023 · 5 comments


@alexatkinuk

With default settings, 1 GPU and 4x CPU (it's only a 4-core CPU), a GPU job started with 1xCPU while a CPU job was already processing using 3xCPU.

When the GPU job finished, the client was assigned a new GPU WU with 4xCPU, causing the existing CPU WU to pause with "Resources not available".

Surely a WU should be chosen based on available resources?
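
To make the expectation concrete, here is a minimal sketch of the accounting I would expect. It is not the client's actual scheduler: `WorkUnit`, `cpus_available` and `cpus_for_new_gpu_wu` are hypothetical names made up for illustration, and the numbers are the ones from this report (4 cores, a CPU WU holding 3).

```python
from dataclasses import dataclass


@dataclass
class WorkUnit:
    # Hypothetical record of a running work unit; not a real client type.
    name: str
    cpus: int


def cpus_available(total_cpus, running):
    """CPUs not already claimed by a running work unit."""
    used = sum(wu.cpus for wu in running)
    return max(total_cpus - used, 0)


def cpus_for_new_gpu_wu(total_cpus, running, wanted=1):
    """Grant the new GPU WU no more CPUs than are currently free."""
    return min(wanted, cpus_available(total_cpus, running))


# Scenario from this report: 4 cores, one CPU WU already using 3 of them.
running = [WorkUnit("cpu-wu", cpus=3)]
print(cpus_for_new_gpu_wu(total_cpus=4, running=running))
```

Under this accounting the second GPU WU would get 1 CPU even if it asked for 4, and the CPU WU would keep running.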

@jcoffland
Member

Can you post logs for this event?

@jcoffland changed the title from "GPU WU being assigned more than 1 CPU core when existing CPU job is already using all spare cores." to "2nd GPU WU overallocated CPUs" on Jan 31, 2023
@alexatkinuk
Author

alexatkinuk commented Jan 31, 2023

Strange, I can't tell what is going on here, other than that for some reason it's shutting down the other WU:

02:11:31:I1::Added new work unit: cpus:1 gpus:gpu:01:00:00
02:11:31:I1::WU20:Uploading WU results
02:11:31:I1::WU19:Caught signal SIGINT(2) on PID 41573
02:11:31:I1::WU19:Exiting, please wait. . .
02:11:31:I1:OUT31828:> POST https://vav19.fah.temple.edu/api/results HTTP/1.1
02:11:31:I3:Connecting to vav19.fah.temple.edu:443
02:11:31:I1::WU21:Requesting WU assignment
02:11:31:I1:OUT31829:> POST https://assign4.foldingathome.org/api/assign HTTP/1.1
02:11:31:I3:Connecting to assign4.foldingathome.org:443
02:11:31:I1::WU19:Folding@home Core Shutdown: INTERRUPTED
02:11:32:I1:OUT31829:< assign4.foldingathome.org:443 HTTP/1.1 200 HTTP_OK
02:11:32:I1::WU21:Received WU assignment 3Gp4m3GG7M4LDXvEm21MjA4nrDxcF-NY6NjLh3_UCaA
02:11:32:I1::WU21:Downloading WU
02:11:32:I1:OUT31830:> POST https://fah1.innovatr.ca/api/assign HTTP/1.1
02:11:32:I3:Connecting to fah1.innovatr.ca:443
02:11:32:I1::WU19:Core returned INTERRUPTED (102)
02:11:34:I1::WU20:UPLOAD 25%
02:11:35:I1::WU20:UPLOAD 44%
02:11:36:I1::WU20:UPLOAD 65%
02:11:37:I1::WU20:UPLOAD 87%
02:11:41:I1:OUT31828:< vav19.fah.temple.edu:443 HTTP/1.1 200 HTTP_OK
02:11:41:I1::WU20:Credited
02:11:45:I1::WU21:DOWNLOAD 12%
02:11:47:I1::WU21:DOWNLOAD 26%
02:11:48:I1::WU21:DOWNLOAD 34%
02:11:49:I1::WU21:DOWNLOAD 41%
02:11:50:I1::WU21:DOWNLOAD 49%
02:11:51:I1::WU21:DOWNLOAD 57%
02:11:52:I1::WU21:DOWNLOAD 65%
02:11:53:I1::WU21:DOWNLOAD 72%
02:11:54:I1::WU21:DOWNLOAD 80%
02:11:55:I1::WU21:DOWNLOAD 89%
02:11:56:I1::WU21:DOWNLOAD 97%
02:11:56:I1:OUT31830:< fah1.innovatr.ca:443 HTTP/1.1 200 HTTP_OK
02:11:56:I1::WU21:Received WU
02:11:57:I1::WU21:CORE 100%
02:11:57:I3::WU21:Running FahCore: /home/fah/fah-client_8.1.11-64bit-release/cores/openmm-core-22/fahcore-22-linux-64bit-release-0.0.20/FahCore_22 -dir 3Gp4m3GG7M4LDXvEm21MjA4nrDxcF-NY6NjLh3_UCaA -suffix 01 -version 8.1.11 -lifeline 38487 -gpu-vendor nvidia -cuda-platform 0 -cuda-device 0 -gpu -1
02:11:57:I3::WU21:Started FahCore on PID 41726
02:11:57:I1::WU21:*********************** Log Started 2023-01-31T02:11:57Z ***********************
02:11:57:I1::WU21:*************************** Core22 Folding@home Core ***************************
02:11:57:I1::WU21: Core: Core22
02:11:57:I1::WU21: Type: 0x22
02:11:57:I1::WU21: Version: 0.0.20
02:11:57:I1::WU21: Author: Joseph Coffland joseph@cauldrondevelopment.com
02:11:57:I1::WU21: Copyright: 2020 foldingathome.org
02:11:57:I1::WU21: Homepage: https://foldingathome.org/
02:11:57:I1::WU21: Date: Jan 20 2022
02:11:57:I1::WU21: Time: 00:57:52
02:11:57:I1::WU21: Revision: 3f211b8a4346514edbff34e3cb1c0e0ec951373c
02:11:57:I1::WU21: Branch: HEAD
02:11:57:I1::WU21: Compiler: GNU 9.4.0
02:11:57:I1::WU21: Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
02:11:57:I1::WU21: -fdata-sections -O3 -funroll-loops -fno-pie
02:11:57:I1::WU21: -DOPENMM_VERSION=""7.7.0""
02:11:57:I1::WU21: Platform: linux 5.11.0-1025-azure
02:11:57:I1::WU21: Bits: 64
02:11:57:I1::WU21: Mode: Release
02:11:57:I1::WU21:Maintainers: John Chodera john.chodera@choderalab.org and Peter Eastman
02:11:57:I1::WU21: peastman@stanford.edu
02:11:57:I1::WU21: Args: -dir 3Gp4m3GG7M4LDXvEm21MjA4nrDxcF-NY6NjLh3_UCaA -suffix 01
02:11:57:I1::WU21: -version 8.1.11 -lifeline 38487 -gpu-vendor nvidia -cuda-platform 0
02:11:57:I1::WU21: -cuda-device 0 -gpu -1
02:11:57:I1::WU21:************************************ libFAH ************************************
02:11:57:I1::WU21: Date: Jan 20 2022
02:11:57:I1::WU21: Time: 00:57:22
02:11:57:I1::WU21: Revision: 9f4ad694e75c2350d4bb6b8b5b769ba27e483a2f
02:11:57:I1::WU21: Branch: HEAD
02:11:57:I1::WU21: Compiler: GNU 9.4.0
02:11:57:I1::WU21: Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
02:11:57:I1::WU21: -fdata-sections -O3 -funroll-loops -fno-pie
02:11:57:I1::WU21: Platform: linux 5.11.0-1025-azure
02:11:57:I1::WU21: Bits: 64
02:11:57:I1::WU21: Mode: Release
02:11:57:I1::WU21:************************************ CBang *************************************
02:11:57:I1::WU21: Date: Jan 20 2022
02:11:57:I1::WU21: Time: 00:57:00
02:11:57:I1::WU21: Revision: ab023d155b446906d55b0f6c9a1eedeea04f7a1a
02:11:57:I1::WU21: Branch: HEAD
02:11:57:I1::WU21: Compiler: GNU 9.4.0
02:11:57:I1::WU21: Options: -faligned-new -std=c++11 -fsigned-char -ffunction-sections
02:11:57:I1::WU21: -fdata-sections -O3 -funroll-loops -fno-pie -fPIC
02:11:57:I1::WU21: Platform: linux 5.11.0-1025-azure
02:11:57:I1::WU21: Bits: 64
02:11:57:I1::WU21: Mode: Release
02:11:57:I1::WU21:************************************ System ************************************
02:11:57:I1::WU21: CPU: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz
02:11:57:I1::WU21: CPU ID: GenuineIntel Family 6 Model 60 Stepping 3
02:11:57:I1::WU21: CPUs: 4
02:11:57:I1::WU21: Memory: 15.57GiB
02:11:57:I1::WU21:Free Memory: 11.65GiB
02:11:57:I1::WU21: Threads: POSIX_THREADS
02:11:57:I1::WU21: OS Version: 6.0
02:11:57:I1::WU21:Has Battery: false
02:11:57:I1::WU21: On Battery: false
02:11:57:I1::WU21: UTC Offset: 0
02:11:57:I1::WU21: PID: 41726
02:11:57:I1::WU21: CWD: /home/fah/fah-client_8.1.11-64bit-release/work
02:11:57:I1::WU21:************************************ OpenMM ************************************
02:11:57:I1::WU21: Version: 7.7.0
02:11:57:I1::WU21:********************************************************************************
02:11:57:I1::WU21:Project: 18213 (Run 16498, Clone 1, Gen 0)
02:11:57:I1::WU21:Reading tar file core.xml
02:11:57:I1::WU21:Reading tar file integrator.xml
02:11:57:I1::WU21:Reading tar file state.xml
02:11:57:I1::WU21:Reading tar file system.xml
02:11:58:I1::WU21:Digital signatures verified
02:11:58:I1::WU21:Folding@home GPU Core22 Folding@home Core
02:11:58:I1::WU21:Version 0.0.20
02:11:58:I1::WU21: Checkpoint write interval: 25000 steps (2%) [50 total]
02:11:58:I1::WU21: JSON viewer frame write interval: 12500 steps (1%) [100 total]
02:11:58:I1::WU21: XTC frame write interval: 20000 steps (1.6%) [62 total]
02:11:58:I1::WU21: Global context and integrator variables write interval: disabled
02:11:58:I1::WU21:No -opencl-device specified; using deprecated -gpu argument as an alias for -opencl-device.
02:11:58:I1::WU21:Please consider upgrading your client version.
02:11:58:I1::WU21:There are 4 platforms available.
02:11:58:I1::WU21:Platform 0: Reference
02:11:58:I1::WU21:Platform 1: CPU
02:11:58:I1::WU21:Platform 2: OpenCL
02:11:58:I1::WU21: opencl-device -1 specified
02:11:58:I1::WU21:Platform 3: CUDA
02:11:58:I1::WU21: cuda-device 0 specified
02:12:04:I1::WU21:Attempting to create CUDA context:
02:12:04:I1::WU21: Configuring platform CUDA
02:12:09:I1::WU21: Using CUDA and gpu 0

According to the log, the GPU WU is only assigned 1 CPU, but the UI says 4. And why did it shut down the CPU WU in the first place? This seems to happen every time a new WU is requested: it shuts down the existing running ones, although they usually restart again.
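
For anyone who wants to check their own log for the same pattern, here is a rough sketch that pulls out the CPU allocation of each newly added work unit and any work unit that caught a signal. The two regular expressions are assumptions based only on the lines in the excerpt above (`Added new work unit: cpus:… gpus:…` and `WU19:Caught signal SIGINT(2)`); other client versions may log these events differently.

```python
import re

# Patterns inferred from the log excerpt in this comment, not from client docs.
ADDED = re.compile(r"Added new work unit: cpus:(\d+) gpus:(\S*)")
SIGNAL = re.compile(r":(WU\d+):Caught signal (\w+)\(\d+\)")


def summarize(lines):
    """Return (allocations, interrupted) found in the given log lines."""
    allocations, interrupted = [], []
    for line in lines:
        added = ADDED.search(line)
        if added:
            # (number of CPUs, GPU id string) for the new work unit
            allocations.append((int(added.group(1)), added.group(2)))
            continue
        signal = SIGNAL.search(line)
        if signal:
            # (work unit label, signal name)
            interrupted.append((signal.group(1), signal.group(2)))
    return allocations, interrupted


sample = [
    "02:11:31:I1::Added new work unit: cpus:1 gpus:gpu:01:00:00",
    "02:11:31:I1::WU19:Caught signal SIGINT(2) on PID 41573",
]
print(summarize(sample))
```

On the two sample lines this reports one allocation of 1 CPU on `gpu:01:00:00` and one interrupted unit, WU19, which matches what I see: the log says 1 CPU while the running CPU WU is still stopped.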

@jcoffland
Member

I found the problem. It will be fixed in the v8.1.12 release. Thanks for the report.

@alexatkinuk
Author

Glad I could help with something.

@jcoffland
Member

I believe this is fixed.
