
Update on hzdr-hemera requires update to hemera profile #2860

Closed · steindev opened this issue Jan 22, 2019 · 8 comments

Labels: documentation (regarding documentation or wiki discussions), machine/system (machine & HPC system specific issues)

Comments

@steindev (Member):

After the update on hemera, I resubmitted a simulation that ran on hemera before the update. Now I get the error:

sbatch: error: Batch job submission failed: Requested node configuration is not available

My job cfg file looks like:

TBG_gpu_x=1
TBG_gpu_y=8
TBG_gpu_z=4

TBG_gridSize="768 2304 768"
TBG_steps="150000"
TBG_gridDist=" --gridDist '768{1}' '288{8}' '288{1}, 96{2}, 288{1}' "

Is this related to the deactivation of Hyperthreading and how we request cores/assign tasks?

Btw, the same happens for a 4-GPU job I try to run.
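For reference, a rough sketch of the per-node request that a gpu.tpl with these settings would hand to sbatch; the directive names and values below are an assumption for illustration, not copied from the actual hemera template. This is the kind of request sbatch reports as "not available":

# Assumed shape of the generated per-node request: 4 GPUs per node,
# 6 cores requested per GPU (illustrative, not the real hemera gpu.tpl)
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4             # 4 GPUs per node
#SBATCH --ntasks-per-node=4      # one MPI rank per GPU
#SBATCH --cpus-per-task=6        # cores per GPU, i.e. 4 x 6 = 24 cores per node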

@ax3l ax3l added this to To do in 0.4.3 Backports via automation Jan 22, 2019
@ax3l ax3l added the documentation (regarding documentation or wiki discussions) and machine/system (machine & HPC system specific issues) labels Jan 22, 2019
@ax3l (Member) commented Jan 22, 2019:

Likely, we also have to pass the project with -A now, as the mail announced.
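If an account were required, it would be a one-line addition; the project name below is a placeholder:

# Hypothetical: pass the project/account to Slurm in the template ...
#SBATCH -A <project>
# ... or directly on the command line when submitting:
sbatch -A <project> <generated batch script>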

@steindev (Member, Author) commented:

> Likely, we also have to pass the project with -A now, as the mail announced.

The email did not announce a special account/contingent for us, so I would not think so.

@sbastrakov (Member) commented Jan 23, 2019:

I've just run into the same problem, also on the GPU partition on hemera, and could not figure it out. It seems to be unrelated to the -A option (that was indeed mentioned only for some other institutes). As a temporary hack, setting .TBG_coresPerGPU=2 in the gpu.tpl file seems to work for me with a 4-GPU setup. Requesting just 1 or 2 GPUs with the original .tpl file also works. So there is probably some configuration issue (likely on the cluster side) with the number of cores per node.
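The workaround as a sketch; the variable name follows the comment above, the rest of the gpu.tpl is assumed to stay unchanged:

# Temporary hack in the hemera gpu.tpl (sketch)
.TBG_coresPerGPU=2   # was 6; with 2 cores per GPU a 4-GPU job gets scheduled again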

@sbastrakov (Member) commented Jan 23, 2019:

Btw, setting .TBG_coresPerGPU=3 also works with 4 GPUs, but 4 cores per GPU already does not. So it seems the GPU nodes are somehow exposed to Slurm as if they had only 12 cores.
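One way to check what Slurm actually reports for the GPU nodes; node and partition names below are placeholders:

# Core/thread layout Slurm has configured for a single GPU node
scontrol show node <gpu-node-name> | grep -E 'CPUTot|CoresPerSocket|ThreadsPerCore'

# Summary for the whole partition: hostname, CPUs, sockets, cores/socket, threads/core
sinfo -p <gpu-partition> -o "%n %c %X %Y %Z"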

@ax3l (Member) commented Jan 23, 2019:

As they turned off hyperthreading, we basically have to reduce the number of CPU "cores" we request per node and per GPU by a factor of two.

@sbastrakov (Member) commented Jan 23, 2019:

This of course makes sense. However, after a rather brief look, I did not see where we are accounting for the hyperthreading: the original gpu.tpl for Hemera has 6 cores per GPU, which is 24 per node with 4 GPUs, exactly the same as the number of physical cores.

@ax3l (Member) commented Jan 23, 2019:

Just halve TBG_coresPerGPU=6 to 3.

The exact details are in how Slurm handles ntasks-per-node, cpus-per-task and ntasks (manual). For our queue update, just change it as described. Also, that update is a brutal work-around for some scheduling issue in Slurm, since disabling HT on a machine easily wastes 30-40% of potential CPU performance.

You can also verify that it does the right placement; it could be that we now accidentally stay on one package, etc. On Davide we also request with the combination ntasks-per-socket + ntasks-per-node in the interactive case (see the sketch below). Please feel free to try changing the Hemera .tpl in case the 6->3 update does not place the GPU-controlling processes well.
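A sketch of the halved request plus a quick placement check; all directive values and the ntasks-per-socket addition are illustrative assumptions, not taken from the actual template:

# Per-node request after halving the cores per GPU (illustrative)
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4      # one GPU-controlling rank per GPU
#SBATCH --cpus-per-task=3        # halved TBG_coresPerGPU: 4 x 3 = 12 cores per node
#SBATCH --ntasks-per-socket=2    # optional, Davide-style: spread ranks over both packages

# Inside the allocation, show where each rank's CPUs end up
srun --cpu-bind=verbose hostname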

@ax3l (Member) commented Feb 13, 2019:

@psychocoderHPC @steindev can this be closed, since #2862 could be closed?

@ax3l ax3l closed this as completed Feb 13, 2019
@ax3l ax3l removed this from To do in 0.4.3 Backports Feb 13, 2019