
Residuals are very large for certain number of nodes #23

Closed

martin-ueding opened this issue Jan 28, 2017 · 7 comments

@martin-ueding
Contributor

I run Chroma with the QPhiX clover solvers on an Intel Xeon Haswell (AVX2) architecture. Each node has two Xeons, each with 12 physical cores (24 virtual cores with Hyper-Threading). I do not use SMT and run a single MPI rank per node, so that is 24 threads per node.

The 16³×32 lattice works just fine on 1, 2, 4, 8, and 32 nodes. A 32³×96 lattice works fine on 8, 64, or 128 nodes. The 24³×96 lattice, however, fails on 64 nodes:

QPHIX_RESIDUUM_REPORT:
         Red_Final[0]=2.35420099376685e-310
         Red_Final[1]=2.35420099376685e-310
         Red_Final[2]=0
         Red_Final[3]=0
         Red_Final[4]=2.21341409336878e-321
         Red_Final[5]=6.32404026676796e-322
         Red_Final[6]=0
         Red_Final[7]=1.48219693752374e-323
         Red_Final[8]=1.06527781423771e-316
         Red_Final[9]=0
         Red_Final[10]=0
         Red_Final[11]=3.44237511497523e-316
         Red_Final[12]=0
         Red_Final[13]=0
         Red_Final[14]=3.44235930487457e-316
         Red_Final[15]=7.83598624667517e-12
QPHIX_CLOVER_MULTI_SHIFT_CG_MDAGM_SOLVER: Residua Check: 
         shift[0]  Actual || r || / || b || = 72301.7437396402
QMP m7,n64@jrc0384 error: abort: 1
SOLVE FAILED: rel_resid=72301.7437396402 target=1e-09 tolerance_factor=10 max tolerated=1e-08

On 8 nodes it works; on 128 it fails, I think. I now have it running on 81 nodes. The problem is probably caused by all the factors of 3 in the lattice volume.

It would be nice if there were an error message when the number of nodes does not make sense for QPhiX. At least it fails early with the residuals, but it still took me a while to figure out a working number of nodes, especially since the queueing time for jobs with more than 8 nodes can be several days for my account.

What is this condition? What has to be divisible by what? Then I would attempt to implement such a warning and have it suggest a number of nodes that the user should try instead.

@azrael417
Contributor

azrael417 commented Jan 28, 2017 via email

@martin-ueding
Contributor Author

The QPhiX arguments are the following:

-by 8 -bz 8 -c 24 -sy 1 -sz 1 -pxy 1 -pxyz 0 -minct 2

Below is the beginning of the standard output from the job; the lines starting with + are the shell commands that were executed (I ran bash -x).

+ export OMP_NUM_THREADS=24
+ OMP_NUM_THREADS=24
+ export KMP_AFFINITY=compact,0
+ KMP_AFFINITY=compact,0
+ mkdir -p cfg
+ mkdir -p hmc-out
+ srun ./hmc -i hmc.ini.xml -o hmc-out/hmc.slurm-2831834.out.xml -l hmc-out/hmc.slurm-2831834.log.xml -by 8 -bz 8 -c 24 -sy 1 -sz 1 -pxy 1 -pxyz 0 -minct 2
QDP use OpenMP threading. We have 24 threads
Affinity reporting not implemented for this architecture
Initialize done
Initializing QPhiX CLI Args
QPhiX CLI Args Initialized
 QPhiX: By=8
 QPhiX: Bz=8
 QPhiX: Pxy=1
 QPhiX: Pxyz=0
 QPhiX: NCores=24
 QPhiX: Sy=1
 QPhiX: Sz=1
 QPhiX: MinCt=2
---%<---
Lattice initialized:
  problem size = 24 24 24 96
  layout size = 12 24 24 96
  logical machine size = 1 4 2 8
  subgrid size = 24 6 12 12
  total number of nodes = 64
  total volume = 1327104
  subgrid volume = 20736
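
For reference, the subgrid size printed above is just the problem size divided component-wise by the logical machine size: 24/1 = 24, 24/4 = 6, 24/2 = 12, 96/8 = 12. Each of the 64 nodes therefore holds a local lattice of 24×6×12×12 sites, matching the reported subgrid volume of 24·6·12·12 = 20736.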

@ddkalamk
Contributor

ddkalamk commented Jan 28, 2017 via email

@bjoo
Contributor

bjoo commented Jan 28, 2017 via email

@martin-ueding
Contributor Author

The system I run on is JURECA in Jülich, which is a dual-socket Xeon machine:

Two Intel Xeon E5-2680 v3 Haswell CPUs per node

  • 2 x 12 cores, 2.5 GHz
  • Intel Hyperthreading Technology (Simultaneous Multithreading)
  • AVX 2.0 ISA extension

Divisibility

So the X and Y components of the subgrid size have to be divisible by Bx and By after checkerboarding? This would certainly explain why it works with 81 nodes but not with 64.

  • 64 nodes: subgrid size = 24 6 12 12, 24/8 works, 6/8 is not integer.
  • 81 nodes: subgrid size = 8 8 8 32, 8/8 is integer.

But the condition must apply before checkerboarding, right? Otherwise it would not have worked on 81 nodes either, since then 4/8 would have been a problem.
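
To make the hypothesis concrete, here is a tiny standalone check of the remainders used above, with block size 8 as in my arguments. This is only my guess at the condition, not what QPhiX actually enforces:

#include <cstdio>

// Quick numeric check of the divisibility hypothesis, using the two
// subgrid sizes reported in this thread and the block size 8 from my
// -by 8 -bz 8 arguments.  Which extents QPhiX really blocks on, and
// whether the check applies before or after checkerboarding in X, is
// exactly the guess I would like to have confirmed.
int main() {
    const int block = 8;
    const int sub64[4] = {24, 6, 12, 12};  // 64 nodes: solve fails
    const int sub81[4] = {8, 8, 8, 32};    // 81 nodes: solve works
    const int *subgrids[2] = {sub64, sub81};
    const char *labels[2] = {"64 nodes", "81 nodes"};
    for (int i = 0; i < 2; ++i) {
        const int *s = subgrids[i];
        std::printf("%s: X %% 8 = %d, Y %% 8 = %d, checkerboarded X/2 %% 8 = %d\n",
                    labels[i], s[0] % block, s[1] % block, (s[0] / 2) % block);
    }
    return 0;
}

Of the X and Y remainders, only Y is nonzero on 64 nodes, while on 81 nodes the checkerboarded X would fail even though the run works, which is consistent with the two bullets above.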

Two MPI processes per node

Are I_MPI_PIN=1 and I_MPI_PIN_DOMAIN=socket environment variables that I should set?

I had tried two MPI processes per node but went from a few minutes per trajectory to 4 hours. Perhaps I did something wrong with the OpenMP thread binding or SMT. I'll have to look into that again, since a performance test on 8 nodes showed that although the solver performance is somewhat lower, the time to solution improves.

So far I have not tuned the performance extensively; most of my time has been spent on getting the Delta H under control. I will have to look into performance again and run a couple of tests with the larger lattices.

@bjoo
Contributor

bjoo commented Jan 28, 2017 via email

martin-ueding added a commit to HISKP-LQCD/qphix that referenced this issue Mar 11, 2017
Every QPhiX operation has to construct a `Geometry` object at some time.
Therefore the check seems to be best placed there.

At startup, there will be an uncaught exception that causes the calling
program (Chroma, tmLQCD) to crash right away. Currently Chroma will
crash because the residuals become too high. For the user it is probably
nicer to have a hard error message that points to the cause of the
problem.
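
A minimal sketch of the kind of constructor-time check this commit message describes; the class, member, and parameter names are only illustrative, not the actual QPhiX Geometry interface, and the exact condition enforced by the fix may differ:

#include <sstream>
#include <stdexcept>

// Illustrative sketch only: the real QPhiX Geometry class is a template
// with more parameters and members than shown here.  The idea from the
// commit message is to validate the blocking when the Geometry is
// constructed and throw, so that Chroma or tmLQCD stops with a clear
// error instead of producing huge residuals later.
struct GeometrySketch {
    GeometrySketch(int localY, int localZ, int By, int Bz) {
        if (localY % By != 0 || localZ % Bz != 0) {
            std::ostringstream msg;
            msg << "Blocking does not fit the local lattice: Y=" << localY
                << " Z=" << localZ << " with By=" << By << " Bz=" << Bz
                << "; choose a different node count or blocking.";
            throw std::runtime_error(msg.str());
        }
    }
};
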
bjoo added a commit that referenced this issue Mar 11, 2017
Ensure that blocking sizes can work out (fixes GH-23)
@martin-ueding
Contributor Author

That should be fixed now.

kostrzewa pushed a commit to kostrzewa/qphix that referenced this issue Apr 29, 2017 (same commit message as above)
martin-ueding added a commit to HISKP-LQCD/qphix that referenced this issue Apr 29, 2017 (same commit message as above)
@martin-ueding martin-ueding self-assigned this Apr 29, 2017