
Update on hzdr-hemera requires update to hemera profile #2860

Closed · steindev opened this issue Jan 22, 2019 · 8 comments

Labels: documentation (regarding documentation or wiki discussions), machine/system (machine & HPC system specific issues)

Comments

@steindev (Member):

After the update on hemera, I resubmitted a simulation that ran on hemera before the update. Now I get the error:

sbatch: error: Batch job submission failed: Requested node configuration is not available

My job cfg file looks like:

TBG_gpu_x=1
TBG_gpu_y=8
TBG_gpu_z=4

TBG_gridSize="768 2304 768"
TBG_steps="150000"
TBG_gridDist=" --gridDist '768{1}' '288{8}' '288{1}, 96{2}, 288{1}' "

Is this related to the deactivation of Hyperthreading and how we request cores/assign tasks?

Btw, the same happens for a 4-GPU job I try to run.
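For reference, a rough sketch of the per-node request that a gpu.tpl with these settings would hand to sbatch; the directive names and values below are an assumption for illustration, not copied from the actual hemera template. This is the kind of request sbatch reports as "not available":

# Assumed shape of the generated per-node request: 4 GPUs per node,
# 6 cores requested per GPU (illustrative, not the real hemera gpu.tpl)
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4             # 4 GPUs per node
#SBATCH --ntasks-per-node=4      # one MPI rank per GPU
#SBATCH --cpus-per-task=6        # cores per GPU, i.e. 4 x 6 = 24 cores per node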

@ax3l ax3l added this to To do in 0.4.3 Backports via automation Jan 22, 2019
@ax3l ax3l added the documentation (regarding documentation or wiki discussions) and machine/system (machine & HPC system specific issues) labels Jan 22, 2019
@ax3l (Member) commented Jan 22, 2019:

Likely, we also have to pass the project with -A now, as the mail announced.
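If an account were required, it would be a one-line addition; the project name below is a placeholder:

# Hypothetical: pass the project/account to Slurm in the template ...
#SBATCH -A <project>
# ... or directly on the command line when submitting:
sbatch -A <project> <generated batch script>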

@steindev (Member, Author) commented:

> Likely, we also have to pass the project with -A now, as the mail announced.

The email did not announce a special account/contingent for us, so I would not think so.

@sbastrakov (Member) commented Jan 23, 2019:

I've just run into the same problem, also on the GPU partition on hemera, and could not figure it out. It seems to be unrelated to the -A option (that was indeed mentioned only for some other institutes). As a temporary hack, setting .TBG_coresPerGPU=2 in the gpu.tpl file seems to work for me with a 4-GPU setup. Requesting just 1 or 2 GPUs with the original .tpl file also works. So there is probably some configuration issue (likely on the cluster side) with the number of cores per node.
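The workaround as a sketch; the variable name follows the comment above, the rest of the gpu.tpl is assumed to stay unchanged:

# Temporary hack in the hemera gpu.tpl (sketch)
.TBG_coresPerGPU=2   # was 6; with 2 cores per GPU a 4-GPU job gets scheduled again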

@sbastrakov (Member) commented Jan 23, 2019:

Btw, setting .TBG_coresPerGPU=3 also works with 4 GPUs, but 4 cores per GPU already does not. So it seems the GPU nodes are somehow exposed to Slurm as if they had only 12 cores.
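One way to check what Slurm actually reports for the GPU nodes; node and partition names below are placeholders:

# Core/thread layout Slurm has configured for a single GPU node
scontrol show node <gpu-node-name> | grep -E 'CPUTot|CoresPerSocket|ThreadsPerCore'

# Summary for the whole partition: hostname, CPUs, sockets, cores/socket, threads/core
sinfo -p <gpu-partition> -o "%n %c %X %Y %Z"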

@ax3l (Member) commented Jan 23, 2019:

As they turned off hyperthreading, we basically have to reduce the number of CPU "cores" we request per node and per GPU by a factor of two.

@sbastrakov (Member) commented Jan 23, 2019:

This of course makes sense. However, after a rather brief look, I did not see where we are accounting for the hyperthreading: the original gpu.tpl for Hemera has 6 cores per GPU, which is 24 per node with 4 GPUs, exactly the same as the number of physical cores.

@ax3l (Member) commented Jan 23, 2019:

Just halve TBG_coresPerGPU=6 to 3.

The exact details are in how Slurm handles ntasks-per-node, cpus-per-task and ntasks (manual). For our queue update, just change it as described. Also, that update is a brutal work-around for some scheduling issue in Slurm, since disabling HT on a machine easily wastes 30-40% of potential CPU performance.

You can also verify that it does the right placement; it could be that we now accidentally stay on one package, etc. On Davide we also request with the combination ntasks-per-socket + ntasks-per-node in the interactive case (see the sketch below). Please feel free to try changing the Hemera .tpl in case the 6->3 update does not place the GPU-controlling processes well.
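A sketch of the halved request plus a quick placement check; all directive values and the ntasks-per-socket addition are illustrative assumptions, not taken from the actual template:

# Per-node request after halving the cores per GPU (illustrative)
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4      # one GPU-controlling rank per GPU
#SBATCH --cpus-per-task=3        # halved TBG_coresPerGPU: 4 x 3 = 12 cores per node
#SBATCH --ntasks-per-socket=2    # optional, Davide-style: spread ranks over both packages

# Inside the allocation, show where each rank's CPUs end up
srun --cpu-bind=verbose hostname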

@ax3l (Member) commented Feb 13, 2019:

@psychocoderHPC @steindev can this be closed, since #2862 could be closed?

@ax3l ax3l closed this as completed Feb 13, 2019
@ax3l ax3l removed this from To do in 0.4.3 Backports Feb 13, 2019