Residuals are very large for certain numbers of nodes #23
Comments
Hello Martin,
Can you post your executable arguments here, especially the QPhiX-specific ones, i.e. By, Bz, Sy, MinCt, etc.? I have seen these issues when the block layout does not make sense.
Best
Thorsten
On 28 Jan 2017, 10:06 -0800, Martin Ueding <notifications@github.com> wrote:
I run Chroma with the QPhiX clover solvers on an Intel Xeon Haswell (AVX2) architecture. Each node has two Xeons with 12 physical cores each (24 virtual cores). I do not use SMT and run a single MPI rank per node, so that is 24 threads per node.
The 16³×32 lattice works just fine on 1, 2, 4, 8, and 32 nodes. A 32³×96 lattice works fine on 8, 64, or 128 nodes. The 24³×96 lattice, however, fails on 64 nodes:
QPHIX_RESIDUUM_REPORT: Red_Final[0]=2.35420099376685e-310 Red_Final[1]=2.35420099376685e-310 Red_Final[2]=0 Red_Final[3]=0 Red_Final[4]=2.21341409336878e-321 Red_Final[5]=6.32404026676796e-322 Red_Final[6]=0 Red_Final[7]=1.48219693752374e-323 Red_Final[8]=1.06527781423771e-316 Red_Final[9]=0 Red_Final[10]=0 Red_Final[11]=3.44237511497523e-316 Red_Final[12]=0 Red_Final[13]=0 Red_Final[14]=3.44235930487457e-316 Red_Final[15]=7.83598624667517e-12
QPHIX_CLOVER_MULTI_SHIFT_CG_MDAGM_SOLVER: Residua Check: shift[0] Actual || r || / || b || = 72301.7437396402
QMP ***@***.*** error: abort: 1
SOLVE FAILED: rel_resid=72301.7437396402 target=1e-09 tolerance_factor=10 max tolerated=1e-08
On 8 nodes it works; on 128 it fails, I think. I now have it running on 81 nodes. The problem is probably caused by all the factors of 3 in the lattice volume.
It would be nice if there were some error message when the number of nodes does not make sense for QPhiX. At least it fails early with the residuals, but it still took me a while to figure out a working number of nodes, especially since the queueing time for jobs with more than 8 nodes can be several days for my account.
What is this condition? What has to be divisible by what? Then I would attempt to implement this warning and suggest a number of nodes that the user should try instead.
Hi Martin,
For blocking to work, we need the local (i.e. per-rank) Ny and Nz to be divisible by By and Bz respectively. By and Bz of 8 are good for large volumes, but in the multi-node case you may try values of 4 or 6 that divide Ny and Nz. In your case, can you please try -by 6 -bz 6 (or -by 6 -bz 4)? Either of these blockings may work.
I thought we had these checks somewhere but maybe we lost those. We certainly need to add these sanity checks at the time of lattice setup in QPhiX.
Thanks,
Dhiraj
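A minimal sketch of the condition Dhiraj describes, in plain C++ with made-up names (this is not the QPhiX API; the local extents and block sizes are just integers here):

// Dhiraj's condition: the local (per-rank) Ny and Nz must be divisible
// by the block sizes By and Bz.
#include <cstdio>

static bool blockingDivides(int localNy, int localNz, int by, int bz)
{
    return (localNy % by == 0) && (localNz % bz == 0);
}

int main()
{
    // The failing 64-node run quoted below has a local subgrid of 24 6 12 12 with -by 8 -bz 8.
    std::printf("by=8 bz=8 on Ny=6 Nz=12: %s\n", blockingDivides(6, 12, 8, 8) ? "ok" : "bad blocking");
    // Dhiraj's suggestion of -by 6 -bz 6 divides the same subgrid cleanly.
    std::printf("by=6 bz=6 on Ny=6 Nz=12: %s\n", blockingDivides(6, 12, 6, 6) ? "ok" : "bad blocking");
    return 0;
}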
On Saturday, January 28, 2017, 11:55 PM, Martin Ueding wrote:
The QPhiX arguments are the following:
-by 8 -bz 8 -c 24 -sy 1 -sz 1 -pxy 1 -pxyz 0 -minct 2
Below is the beginning of the standard output from the job; the lines starting with + are the shell commands executed (I ran bash -x).
+ export OMP_NUM_THREADS=24
+ OMP_NUM_THREADS=24
+ export KMP_AFFINITY=compact,0
+ KMP_AFFINITY=compact,0
+ mkdir -p cfg
+ mkdir -p hmc-out
+ srun ./hmc -i hmc.ini.xml -o hmc-out/hmc.slurm-2831834.out.xml -l hmc-out/hmc.slurm-2831834.log.xml -by 8 -bz 8 -c 24 -sy 1 -sz 1 -pxy 1 -pxyz 0 -minct 2
QDP use OpenMP threading. We have 24 threads
Affinity reporting not implemented for this architecture
Initialize done
Initializing QPhiX CLI Args
QPhiX CLI Args Initialized
QPhiX: By=8
QPhiX: Bz=8
QPhiX: Pxy=1
QPhiX: Pxyz=0
QPhiX: NCores=24
QPhiX: Sy=1
QPhiX: Sz=1
QPhiX: MinCt=2
---%<---
Lattice initialized:
problem size = 24 24 24 96
layout size = 12 24 24 96
logical machine size = 1 4 2 8
subgrid size = 24 6 12 12
total number of nodes = 64
total volume = 1327104
subgrid volume = 20736
Hi Martin,
Your local volume is 24x6x12x12.
After checkerboarding (CB) this is 12x6x12x12.
I am presuming this is AVX/AVX2 on a 2-socket x 12-core system?
Your block sizes of 8x8 do not divide this well, which may be the source of the trouble.
I would:
a) Run 2 MPI ranks per node (bind each to a socket) and run with minct=1. With Intel MPI you can use I_MPI_PIN=1 I_MPI_PIN_DOMAIN=socket. With MinCt=2 the face buffers may get communicated over the socket interconnect (QPI?) and this can be a drag.
b) Assuming that, after you divide out another factor of 2 (because you go to 1 MPI rank per socket), your volume becomes 12x6x6x12 after checkerboarding, you should be able to run -by 6 -bz 6. The Xeon has a huge L3 cache, so this should hopefully be OK. This will give you 1 block per thread/core, since you are not using SMT. NB: at this point you will probably be affected by strong-scaling issues rather than node-level issues; we will see.
c) For these kinds of dimensions you don't need padding; set -pxy 0 -pxyz 0.
Let me know if this helps,
Best,
B
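To make the arithmetic behind option (b) concrete, here is a small sketch; the assumption that the extra factor of 2 comes out of the Z direction is taken from the 12x6x6x12 example above and is not prescribed by QPhiX:

// Option (b): 2 MPI ranks per node on the 64-node run, with -by 6 -bz 6.
#include <cstdio>

int main()
{
    int lx = 24, ly = 6, lz = 12, lt = 12;  // per-node subgrid from the log above
    lz /= 2;                                // one rank per socket; assume the split lands on Z
    int lxCb = lx / 2;                      // checkerboarding halves X
    std::printf("per-rank volume after CB: %d x %d x %d x %d\n", lxCb, ly, lz, lt);

    int by = 6, bz = 6;                     // the suggested blocking
    std::printf("Y %% By = %d, Z %% Bz = %d (both must be 0)\n", ly % by, lz % bz);
    return 0;
}

This prints 12 x 6 x 6 x 12 and remainders of 0, i.e. the suggested 6x6 blocking divides the per-rank volume cleanly.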
The system I run on is JURECA in Jülich, which is a dual-socket Xeon machine.

Divisibility

So the X and Y components of the subgrid size have to be divisible by Bx and By after checkerboarding? This would certainly explain why it works with 81 but not with 64 nodes:
• 64 nodes: subgrid size = 24 6 12 12; 24/8 works, but 6/8 is not an integer.
• 81 nodes: subgrid size = 8 8 8 32; 8/8 is an integer.
But it must be before checkerboarding, right? Otherwise it would not have worked on 81 nodes, since then 4/8 would have become a problem.

Two MPI processes per node

I_MPI_PIN=1 and I_MPI_PIN_DOMAIN=socket are environment variables to set?
I had tried two MPI processes but went from some minutes per trajectory to 4 hours. Perhaps I did something wrong with the OpenMP thread binding or SMT. I will have to look into that again, since a performance test on 8 nodes showed that although the solver performance is a bit lower, the time to solution is improved. So far I have not tuned the performance extensively; most of my time has been spent on getting the Delta H under control. I will have to look into performance again and make a couple of tests with the larger lattices.
Hi Martin,
So the X and Y components of the subgrid size have to be divisible by Bx and By after checkerboarding? This would certainly explain why it works with 81 but not with 64.
There is no Bx currently, only By and Bz. So:
- The X subgrid dimension has to be divisible by the SOALEN after checkerboarding.
- The Y and Z dimensions need to be divisible by By and Bz respectively (and are not affected by checkerboarding).
- By needs to be divisible by VECLEN/SOALEN.
Here VECLEN is the hardware vector length and SOALEN is something you choose at compile time. E.g. AVX and AVX2 allow SOALEN=4 and SOALEN=8 in single precision, for a VECLEN=8, and SOALEN=2 and SOALEN=4 in double precision, for a VECLEN=4.
Suppose you are in double precision (vector length 4) and choose SOALEN=4. Then each SOALEN load is a full vector load using one Y coordinate, so By has to be > 1 and must divide Y.
Suppose you are in single precision (vector length 8) and choose SOALEN=4. Then each vector load will be 2 half-vector loads of length 4. These will come from two Y coordinates, y and y+1, so you will want By divisible by 2 AND By must divide Y.
On KNL, the vector length in single precision is 16. If you have SOALEN=4, this will load from 4 successive Y coordinates; in that case By needs to be divisible by 4 as well as dividing the Y dimension.
Bz is not affected by VECLEN and SOALEN as it is not involved in the vectorization.
Having an X dimension of only SOALEN (i.e. one SOALEN-length block) after checkerboarding will hurt your strong scaling, since the face reconstructions from the +/- X neighbors will hit the same vector and will need to be serialized to avoid conflicts. If there is more than one SOALEN block in X, the forward and backward faces may be able to update the forward and backward blocks simultaneously. For best results it may be worth having at least 2 SOALENs in X. With SOALEN=4 this would mean local checkerboarded X lengths of 8, 12, etc., so local uncheckerboarded lengths of 16, 24, etc. (this last part would apply if you use the nesap_hacklatt_strongscale branch; I can't remember whether I merged that into devel yet).
Best,
B
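Putting the three rules above into a small, self-contained check (all names are made up and SOALEN/VECLEN are ordinary integers here; in QPhiX they are compile-time template parameters, so this is only an illustration):

// Balint's rules as a plain predicate. Not the real QPhiX code.
#include <cstdio>

static bool layoutIsValid(int lx, int ly, int lz, int by, int bz, int soalen, int veclen)
{
    int lxCb  = lx / 2;                              // X extent after checkerboarding
    bool xOk  = (lxCb % soalen) == 0;                // X_cb divisible by SOALEN
    bool yzOk = (ly % by == 0) && (lz % bz == 0);    // Y, Z divisible by By, Bz
    bool byOk = (by % (veclen / soalen)) == 0;       // By divisible by VECLEN/SOALEN
    return xOk && yzOk && byOk;
}

int main()
{
    // Assume double precision on AVX2: VECLEN=4 with SOALEN=4.
    // 64-node subgrid 24 6 12 (times T=12) with -by 8 -bz 8: fails because 6 % 8 != 0.
    std::printf("64 nodes: %s\n", layoutIsValid(24, 6, 12, 8, 8, 4, 4) ? "ok" : "bad layout");
    // 81-node subgrid 8 8 8 (times T=32): passes.
    std::printf("81 nodes: %s\n", layoutIsValid(8, 8, 8, 8, 8, 4, 4) ? "ok" : "bad layout");
    return 0;
}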
Every QPhiX operation has to construct a `Geometry` object at some point, so that seems to be the best place for the check. At startup there would then be an uncaught exception that makes the calling program (Chroma, tmLQCD) crash right away. Currently Chroma crashes because the residuals become too high; for the user it is probably nicer to have a hard error message that points to the cause of the problem.
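For illustration, the kind of hard failure proposed here could look roughly like the sketch below. The function name and message are hypothetical; the real check would live in QPhiX's lattice setup code:

// Hypothetical early check at lattice setup time, throwing a descriptive error.
#include <stdexcept>
#include <string>

static void checkBlocking(int localNy, int localNz, int by, int bz)
{
    if (localNy % by != 0 || localNz % bz != 0) {
        throw std::runtime_error(
            "QPhiX blocking does not divide the local lattice: Ny=" + std::to_string(localNy) +
            ", Nz=" + std::to_string(localNz) + ", By=" + std::to_string(by) +
            ", Bz=" + std::to_string(bz) + ". Adjust By/Bz or the number of nodes.");
    }
}

An uncaught exception like this stops Chroma or tmLQCD at startup with a message naming the actual problem, instead of a solve that fails much later with huge residuals.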
Ensure that blocking sizes can work out (fixes GH-23)
That should be fixed now.