DMC LocalECP incorrect in GPU code on titan #1440

Open

jtkrogel opened this Issue Mar 13, 2019 · 38 comments

jtkrogel (Contributor) commented Mar 13, 2019

Disagreement between CPU and GPU DMC total energies was observed for a water molecule in periodic boundary conditions (8 Å cubic cell, CASINO pseudopotentials, Titan at OLCF, QMCPACK 3.6.0). Issue originally reported by Andrea Zen. Original inputs and outputs: TEST_DMC.zip

From the attached outputs, the VMC energies agree, while the DMC energies differ by about 0.3 Ha:

#VMC
>qmca -q e *s001*scalar*
dmc_cpu  series 1  LocalEnergy           =  -17.183577 +/- 0.007486 
dmc_gpu  series 1  LocalEnergy           =  -17.152789 +/- 0.018592 

#DMC
>qmca -q e *s002*scalar*
dmc_cpu  series 2  LocalEnergy           =  -17.220971 +/- 0.000968 
dmc_gpu  series 2  LocalEnergy           =  -16.869061 +/- 0.003256 

The difference is entirely attributable to the local part of the ECP:

#DMC
>qmca -q l *s002*scalar*
dmc_cpu  series 2  LocalECP              =  -41.436580 +/- 0.021199 
dmc_gpu  series 2  LocalECP              =  -41.026695 +/- 0.028982 

Note: the DMC error bars are not statistically meaningful here (10 blocks), but the difference is large enough to support this conclusion.

The oddity here is that the error is only seen in DMC and is limited to a single potential energy term. This may indicate a bug in LocalECP that surfaces with increased walker count on the GPU (1 walker/GPU in VMC, 320 walkers/GPU in DMC). A series of VMC runs with an increasing number of walkers would likely expose this.
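
Such a scan could be set up as consecutive <qmc> sections in the input file, each becoming its own series in the output. A minimal sketch; the blocks/steps/timestep values are illustrative, not taken from the attached inputs:

  <!-- sketch of a walker-count scan: identical VMC runs, walkers varied -->
  <qmc method="vmc" move="pbyp" gpu="yes">
    <parameter name="blocks">    40 </parameter>
    <parameter name="steps">    100 </parameter>
    <parameter name="timestep"> 0.3 </parameter>
    <parameter name="walkers">    1 </parameter>
  </qmc>
  <qmc method="vmc" move="pbyp" gpu="yes">
    <parameter name="blocks">    40 </parameter>
    <parameter name="steps">    100 </parameter>
    <parameter name="timestep"> 0.3 </parameter>
    <parameter name="walkers">  320 </parameter>
  </qmc>

Comparing LocalECP across the resulting series with qmca would show whether the term drifts with walker count.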

jtkrogel added the bug label Mar 13, 2019

prckent (Contributor) commented Mar 13, 2019

The local ECP kernel is one that is known not to be reproducible between runs, i.e. it is buggy; something to do with walker and GPU thread/block counts. Previously the differences have been small enough to ignore, but this problem indicates the kernel must be fixed. There are a couple of open issues on this.

You don't state it explicitly, but is the non-local ECP term correct?

jtkrogel (Contributor, Author) commented Mar 13, 2019

The non-local ECP term appears to be correct.

prckent added this to the V3.7.0 Release milestone Mar 15, 2019

prckent (Contributor) commented Mar 19, 2019

To save time debugging this, for the next 3 weeks the necessary pwscf file is at
https://ftp.ornl.gov/filedownload?ftp=e;dir=WATER
Replace WATER with uP24qpBh6M3N

prckent (Contributor) commented Mar 19, 2019

I did some VMC experimentation. On a single Kepler GPU with a fixed seed and either 1 or 320 walkers, I was able to reproduce the previously noticed non-determinism with just a few moves, i.e. multiple runs of the executable generate slightly different results. From this short run and my current inputs we can't say whether the energies are "bad", but the local electron-ion and electron-electron terms are not repeatable. The much harder to compute kinetic energy and non-local electron-ion terms are repeatable (?!).
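
A minimal sketch of that determinism check, assuming an input file named vmc.xml with a fixed <random seed="..."/> element and a project id of vmc (both names hypothetical); a deterministic code would produce bit-identical scalar files:

./bin/qmcpack vmc.xml && mv vmc.s001.scalar.dat run1.scalar.dat
./bin/qmcpack vmc.xml && mv vmc.s001.scalar.dat run2.scalar.dat
diff run1.scalar.dat run2.scalar.dat   # any output here exposes the non-determinism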

prckent (Contributor) commented Mar 20, 2019

VMC runs with 320 walkers are essentially the same, i.e. no 0.3 Ha shift.

All inputs and outputs from test including wavefunction: https://ftp.ornl.gov/filedownload?ftp=e;dir=ICE
Replace ICE with uP21fJWh6csV

  <qmc method="vmc" move="pbyp" gpu="yes">
    <parameter name="blocks">      40 </parameter>
    <parameter name="substeps">    1 </parameter>
    <parameter name="steps">       100 </parameter>
    <parameter name="warmupSteps">  500 </parameter>
    <parameter name="usedrift">     no </parameter>
    <parameter name="timestep">    0.3 </parameter>
    <parameter name="walkers">    320 </parameter>
  </qmc>
qmca -e 0 vmc*.dat

vmc_cuda  series 1
  LocalEnergy           =          -17.1638 +/-           0.0011
  Variance              =            0.4991 +/-           0.0063
  Kinetic               =            13.508 +/-            0.018
  LocalPotential        =           -30.672 +/-            0.018
  ElecElec              =           11.1265 +/-           0.0097
  LocalECP              =           -41.409 +/-            0.019
  NonLocalECP           =           -1.3970 +/-           0.0095
  IonIon                =              1.01 +/-             0.00
  LocalEnergy_sq        =           295.097 +/-            0.036
  BlockWeight           =          32000.00 +/-             0.00
  BlockCPU              =             1.248 +/-            0.018
  AcceptRatio           =           0.47567 +/-          0.00017
  Efficiency            =           1908.34 +/-             0.00
  TotalTime             =             49.91 +/-             0.00
  TotalSamples          =           1280000 +/-                0

vmc_omp  series 1
  LocalEnergy           =          -17.1718 +/-           0.0012
  Variance              =            0.5031 +/-           0.0092
  Kinetic               =            13.510 +/-            0.016
  LocalPotential        =           -30.682 +/-            0.016
  ElecElec              =           11.1155 +/-           0.0087
  LocalECP              =           -41.408 +/-            0.017
  NonLocalECP           =           -1.3964 +/-           0.0094
  IonIon                =              1.01 +/-             0.00
  LocalEnergy_sq        =           295.375 +/-            0.039
  BlockWeight           =          32000.00 +/-             0.00
  BlockCPU              =            1.0728 +/-           0.0024
  AcceptRatio           =           0.47613 +/-          0.00015
  Efficiency            =           1885.79 +/-             0.00
  TotalTime             =             42.91 +/-             0.00
  TotalSamples          =           1280000 +/-                0

prckent (Contributor) commented Mar 20, 2019

@jtkrogel where and how were you able to produce the cpu-gpu energy shift? machine, qmcpack version, software versions, node/mpi/thread counts etc.

In my DMC tests so far I have not found such a sizable shift.

jtkrogel (Contributor, Author) commented Mar 20, 2019

The results are from runs performed by Andrea Zen (@zenandrea) on Titan with QMCPACK 3.6.0 on 4 nodes, 1 MPI task per node, 1 thread per MPI task (see files job_qmcpack_gpu-titan, input_dmcgpu.xml, and out_dmcgpu in TEST_DMC.zip).

The build details, as far as I know, follow our build_olcf_titan.sh script, but with the boost and fftw libraries changed to boost/1.62.0 and fftw/3.3.4.11. Presumably with the real AoS code.

@zenandrea, please check if I have missed something.

zenandrea commented Mar 21, 2019

Dear @jtkrogel and @prckent,
almost everything is as you said, but I used fftw/3.3.4.8, which is loaded by default.
I confirm that I compiled the real AoS code.

In particular, this is my compilation script:

export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-pgi PrgEnv-gnu
module load cudatoolkit/9.1.85_3.10-1.0502.df1cc54.3.1
module load cray-hdf5-parallel
module load cmake3
module load fftw
export FFTW_HOME=$FFTW_DIR/..
module load boost/1.67.0
# use the Cray compiler wrappers
export CC=cc
export CXX=CC
mkdir build_titan_gpu
cd build_titan_gpu
# cmake is intentionally run twice (a common QMCPACK convention on Cray systems)
cmake -DQMC_CUDA=1 ..
cmake -DQMC_CUDA=1 ..
make -j 8
ls -l bin/qmcpack

prckent (Contributor) commented Mar 21, 2019

Thanks. Nothing unreasonable in the above. It should work without problems.

FFTW would not cause the failures. If FFTW were wrong - and I don't recall a single case ever where it has been - the kinetic energy and Monte Carlo walk in general would also be wrong.

prckent (Contributor) commented Mar 22, 2019

I have reproduced this problem using the current develop version, with builds that pass the unit tests and the diamond and LiH integration tests. I used the updated build script (#1472), i.e. nothing out of the ordinary.

Using 1 MPI task, 16 OMP threads, and 0/1 GPUs, I have a 0.6 Hartree (!) difference in the DMC energies (series 2 & 3 below), while the VMC energies agree. The difference is in the local part of the pseudopotential. The analysis below is not done carefully, but it is interesting that the kinetic energy and acceptance ratio appear to match between CPU and GPU.

A 4 node run shows a slightly smaller disagreement between the codes.

qmca -q ev ../titan_orig*/*.scalar.dat

                            LocalEnergy               Variance           ratio
../titan_orig_1mpi/qmc_cpu  series 1  -17.176063 +/- 0.016221   0.595062 +/- 0.154097   0.0346
../titan_orig_1mpi/qmc_cpu  series 2  -17.219573 +/- 0.002273   0.461457 +/- 0.003292   0.0268
../titan_orig_1mpi/qmc_cpu  series 3  -17.220429 +/- 0.001601   0.490561 +/- 0.007181   0.0285

../titan_orig_1mpi/qmc_gpu  series 1  -17.155363 +/- 0.025336   0.467373 +/- 0.056839   0.0272
../titan_orig_1mpi/qmc_gpu  series 2  -16.647208 +/- 0.000720   1.010610 +/- 0.005110   0.0607
../titan_orig_1mpi/qmc_gpu  series 3  -16.639882 +/- 0.001205   1.026227 +/- 0.007102   0.0617

pk7@titan-ext4:/lustre/atlas/ ... /Zen_water_problem/titan_orig_1mpi> qmca ../titan_orig_1mpi/qmc_cpu.s003.scalar.dat

../titan_orig_1mpi/qmc_cpu  series 3
  LocalEnergy           =          -17.2187 +/-           0.0020
  Variance              =            0.4878 +/-           0.0063
  Kinetic               =            13.587 +/-            0.024
  LocalPotential        =           -30.805 +/-            0.025
  ElecElec              =            11.115 +/-            0.015
  LocalECP              =           -41.502 +/-            0.031
  NonLocalECP           =            -1.425 +/-            0.016
  IonIon                =              1.01 +/-             0.00
  LocalEnergy_sq        =           296.972 +/-            0.073
  BlockWeight           =         634774.40 +/-          1923.92
  BlockCPU              =            302.38 +/-             1.12
  AcceptRatio           =          0.993562 +/-         0.000029
  Efficiency            =              0.93 +/-             0.00
  TotalTime             =           1511.88 +/-             0.00
  TotalSamples          =           3173872 +/-                0
pk7@titan-ext4:/lustre/atlas/ ... /Zen_water_problem/titan_orig_1mpi> qmca ../titan_orig_1mpi/qmc_gpu.s003.scalar.dat

../titan_orig_1mpi/qmc_gpu  series 3
  LocalEnergy           =          -16.6399 +/-           0.0012
  Variance              =            1.0262 +/-           0.0071
  Kinetic               =            13.533 +/-            0.019
  LocalPotential        =           -30.173 +/-            0.019
  ElecElec              =            11.032 +/-            0.012
  LocalECP              =           -40.787 +/-            0.025
  NonLocalECP           =           -1.4246 +/-           0.0066
  IonIon                =              1.01 +/-             0.00
  LocalEnergy_sq        =           277.912 +/-            0.042
  BlockWeight           =         638124.30 +/-          1124.31
  BlockCPU              =            26.026 +/-            0.039
  AcceptRatio           =          0.993609 +/-         0.000016
  Efficiency            =             14.94 +/-             0.00
  TotalTime             =            260.26 +/-             0.00
  TotalSamples          =           6381243 +/-                0

prckent (Contributor) commented Mar 22, 2019

Also worth noting that the DMC energy is above the VMC one, which should not happen: fixed-node DMC is variational and should lower the energy relative to VMC for the same trial wavefunction.

prckent (Contributor) commented Mar 25, 2019

Attempting to bracket the problem:

  1. QMCPACK v3.1.1 (August 2017) also has the error, i.e. it is not a recently introduced bug in our source code.
  2. Using the latest develop version but with no Jastrow in the wavefunction, the bug persists.

Still puzzling is why our existing carbon diamond or LiH tests don't trigger this bug.

prckent (Contributor) commented Mar 25, 2019

  3. Using the BFD potentials from examples/molecules/H2O, the problem persists. This rules out the handling of CASINO-format potentials. Again the DMC energy is above the VMC energy on the GPU, while the CPU result appears OK.
                           LocalEnergy               Variance           ratio
../titan_orig_1mpi_noj_bfd/qmc_cpu  series 1  -17.017532 +/- 0.053990   3.475117 +/- 0.377453   0.2042
../titan_orig_1mpi_noj_bfd/qmc_cpu  series 2  -17.257461 +/- 0.003199   3.439663 +/- 0.020524   0.1993
../titan_orig_1mpi_noj_bfd/qmc_cpu  series 3  -17.271529 +/- 0.003633   3.671973 +/- 0.031433   0.2126

../titan_orig_1mpi_noj_bfd/qmc_gpu  series 1  -16.898081 +/- 0.064148   3.766366 +/- 0.306030   0.2229
../titan_orig_1mpi_noj_bfd/qmc_gpu  series 2  -16.694704 +/- 0.005017   4.001500 +/- 0.038960   0.2397
../titan_orig_1mpi_noj_bfd/qmc_gpu  series 3  -16.687953 +/- 0.002943   4.170178 +/- 0.020878   0.2499

prckent (Contributor) commented Mar 25, 2019

  4. Persists with no MPI, -DQMC_MPI=0

prckent (Contributor) commented Mar 25, 2019

By varying the number of walkers I was able to break VMC (good suggestion by @jtkrogel). The bug is back to looking like a bad kernel.

prckent (Contributor) commented Mar 26, 2019

The linked VMC test gives incorrect results on titan.
titan_vmc_only.zip 146.46 MB https://ftp.ornl.gov/filedownload?ftp=e;dir=FRUIT
Replace FRUIT with uP10HwMh8qGU

Puzzlingly, these same files give correct results on oxygen (currently Intel Xeon + Kepler + clang6 + CUDA 10.0). A naively incorrect kernel would give reproducible errors.

atillack (Contributor) commented Mar 26, 2019

@prckent I can reproduce your numbers on Titan.

atillack (Contributor) commented Mar 26, 2019

@prckent When I go back to Cuda 7.5 (using GCC 4.9.3 and an older version of QMCPACK) I get the correct results:

qmc_gpu series 1
LocalEnergy = -17.1716 +/- 0.0021
Variance = 0.490 +/- 0.017
Kinetic = 13.481 +/- 0.025
LocalPotential = -30.652 +/- 0.025
ElecElec = 11.129 +/- 0.013
LocalECP = -41.424 +/- 0.029
NonLocalECP = -1.364 +/- 0.014
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.354 +/- 0.074
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.310562 +/- 0.000093
AcceptRatio = 0.47525 +/- 0.00029
Efficiency = 16660.91 +/- 0.00
TotalTime = 19.57 +/- 0.00
TotalSamples = 161280 +/- 0

So this could be an issue with the Cuda installation on Titan...

prckent (Contributor) commented Mar 26, 2019

@atillack Interesting. If you are using a standalone workstation with CUDA 7.5 (!), the question is whether you can break VMC by e.g. varying the number of walkers, or if running Andrea's original DMC case still breaks.

jtkrogel (Contributor, Author) commented Mar 27, 2019

@atillack Is there a specific build config + QMCPACK version you can recommend that does not display the problem on Titan? This may represent a practical way @zenandrea can get correct production runs sooner.

atillack (Contributor) commented Mar 27, 2019

@jtkrogel QMCPACK 3.5.0.

Here are the modules I have loaded (for gcc/4.9.3, "module unload gcc; module load gcc/4.9.3" after "module swap PrgEnv-pgi PrgEnv-gnu" works):

Currently Loaded Modulefiles:

  1. eswrap/1.3.3-1.020200.1280.0
  2. craype-network-gemini
  3. craype/2.5.13
  4. cray-mpich/7.6.3
  5. craype-interlagos
  6. lustredu/1.4
  7. xalt/0.7.5
  8. git/2.13.0
  9. module_msg/0.1
  10. modulator/1.2.0
  11. hsi/5.0.2.p1
  12. DefApps
  13. cray-libsci/16.11.1
  14. udreg/2.3.2-1.0502.10518.2.17.gem
  15. ugni/6.0-1.0502.10863.8.28.gem
  16. pmi/5.0.12
  17. dmapp/7.0.1-1.0502.11080.8.74.gem
  18. gni-headers/4.0-1.0502.10859.7.8.gem
  19. xpmem/0.1-2.0502.64982.5.3.gem
  20. dvs/2.5_0.9.0-1.0502.2188.1.113.gem
  21. alps/5.2.4-2.0502.9774.31.12.gem
  22. rca/1.0.0-2.0502.60530.1.63.gem
  23. atp/2.1.1
  24. PrgEnv-gnu/5.2.82
  25. cray-hdf5/1.10.0.3
  26. cmake3/3.9.0
  27. fftw/3.3.4.8
  28. boost/1.62.0
  29. subversion/1.9.3
  30. cudatoolkit/7.5.18-1.0502.10743.2.1
  31. gcc/4.9.3

atillack (Contributor) commented Mar 27, 2019

@prckent @jtkrogel I just looked into the Cuda 9 changelog and found this wonderful snippet:

The compiler has transitioned to a new code-generation back end for Kepler GPUs.
PTXAS now includes a new option --new-sm3x-opt=false that allows developers to continue using the legacy back end. Use ptxas --help to get more information about these command-line options.

This at least may explain what is going on. I am not sure how to pass down this parameter to ptxas though ...

Edit: Testing now.

atillack (Contributor) commented Mar 27, 2019

@prckent @jtkrogel Cuda 7.5 is still the temporary solution. The ptxas flag (-Xptxas --new-sm3x-opt=false, which can be put in CUDA_NVCC_FLAGS) only gets results halfway to the correct number with Cuda 9.1 on Titan:

qmc_gpu series 1
LocalEnergy = -16.9815 +/- 0.0021
Variance = 0.797 +/- 0.015
Kinetic = 13.483 +/- 0.022
LocalPotential = -30.465 +/- 0.022
ElecElec = 11.125 +/- 0.012
LocalECP = -41.235 +/- 0.025
NonLocalECP = -1.362 +/- 0.013
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 289.167 +/- 0.073
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.302379 +/- 0.000059
AcceptRatio = 0.47550 +/- 0.00025
Efficiency = 12570.17 +/- 0.00
TotalTime = 24.49 +/- 0.00
TotalSamples = 207360 +/- 0
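
For reference, a sketch of passing that flag at configure time through the legacy FindCUDA variable CUDA_NVCC_FLAGS (a semicolon-separated list; exact handling may vary with the CMake version):

# append the ptxas option to nvcc's flags when configuring the build
cmake -DQMC_CUDA=1 -DCUDA_NVCC_FLAGS="-Xptxas;--new-sm3x-opt=false" ..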

atillack (Contributor) commented Mar 27, 2019

@prckent @jtkrogel After talking with our Nvidia representatives: there is a code generation regression in 9.1 which is fixed in 9.2. So on Titan, it seems the only workaround is to use 7.5 for the time being.

If a version newer than QMCPACK 3.5.0 is needed, some (minor) code changes are required in order to compile with Cuda 7.5 (a sketch follows the list):

  • lines containing cudaMemAdvise need to be commented out in QMCWaveFunctions/EinsplineSetCuda.cpp
  • "#include <nvml.h>" needs to be commented out in Platforms/devices.h
  • CMake/GNUCompilers.cmake needs to be changed to accept compilers after 4.8 (the 5.0 on the second line needs changing to 4.8, as in older versions of QMCPACK)
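
For illustration, a sketch of the first two edits; the code lines shown are assumed forms, not copied from the actual source, and exact locations vary by QMCPACK version:

// QMCWaveFunctions/EinsplineSetCuda.cpp: memory-advise hints do not exist
// in the CUDA 7.5 toolkit, so lines of this form get commented out:
// cudaMemAdvise(ptr, bytes, cudaMemAdviseSetReadMostly, device);

// Platforms/devices.h: NVML is likewise unavailable with 7.5:
// #include <nvml.h>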

atillack (Contributor) commented Mar 27, 2019

@prckent @jtkrogel Another data point: I also get correct results if the Cuda 9.1 toolkit is loaded when executing a QMCPACK binary that was compiled with Cuda 7.5. This does seem to point to code generation being the issue.

atillack (Contributor) commented Mar 28, 2019

@prckent @jtkrogel On Summit, using Cuda 9.2, the correct results are also obtained:

qmc_gpu series 1
LocalEnergy = -17.1707 +/- 0.0020
Variance = 0.489 +/- 0.016
Kinetic = 13.480 +/- 0.025
LocalPotential = -30.651 +/- 0.025
ElecElec = 11.128 +/- 0.013
LocalECP = -41.421 +/- 0.029
NonLocalECP = -1.364 +/- 0.014
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.321 +/- 0.073
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.179291 +/- 0.000020
AcceptRatio = 0.47529 +/- 0.00029
Efficiency = 29197.52 +/- 0.00
TotalTime = 11.47 +/- 0.00
TotalSamples = 163840 +/- 0

prckent changed the title from "DMC LocalECP incorrect in GPU code" to "DMC LocalECP incorrect in GPU code on titan" Mar 28, 2019

zenandrea commented Mar 28, 2019

Dear @atillack @prckent @jtkrogel,
it seems very likely that the source of the issue is cudatoolkit version 9.1.
Shall we ask the OLCF system administrators if they can install the 9.2 version?

Packages other than qmcpack may also be affected by this kind of problem!

prckent (Contributor) commented Mar 28, 2019

@zenandrea Please ask - I am not sure that 9.2 will be installed given that Titan has only a few more months of accessibility, but other packages are certainly at risk. Are you able to move to Summit or is your time only on Titan?

This is a scary problem and I am not keen on recommending use of older software.

zenandrea commented Mar 28, 2019

@prckent I have half the resources on titan and half on summit.
I'm going to ask straight away.

atillack (Contributor) commented Mar 28, 2019

@prckent @zenandrea Since Cuda 9.1's behavior was seen as mostly a performance regression, the Nvidia folks are looking at our kernel giving bad numbers under 9.1 to see if there is a possible workaround.

@zenandrea It's a good idea to ask, but like Paul I am uncertain this will happen in time to be useful. In the interim, with small code changes (see the post above) it is possible to compile a current version of QMCPACK on Titan with Cuda 7.5, but this only works with GCC 4.9.3, as otherwise modules are missing.
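
A sketch of the corresponding module sequence, using the versions quoted earlier in this thread (the unload/load order is an assumption; adjust to what module avail reports):

module swap PrgEnv-pgi PrgEnv-gnu
module unload gcc
module load gcc/4.9.3
module load cudatoolkit/7.5.18-1.0502.10743.2.1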

prckent (Contributor) commented Mar 28, 2019

I am still open to the idea that we have illegal/buggy code, and that different CUDA versions, GPUs, etc. expose the problem in different ways. However "bad generated code" is the best explanation given the established facts. What is so strange still is that all the difficult and costly parts of the calculation involving the wavefunction are correct.

ye-luo (Contributor) commented Mar 28, 2019

I have a solution to use 7.5 with the current QMCPACK. Will PR soon.

atillack (Contributor) commented Mar 28, 2019

@ye-luo Thanks!

ye-luo (Contributor) commented Mar 28, 2019

I failed to find a clean solution through the source because I would need to hack CMake.
To cover our production needs, I'm making all the build variants and will put them in a place anyone can access.

prckent (Contributor) commented Mar 29, 2019

I'll note that an initialization bug similar to #1518 could explain these problems.

ye-luo (Contributor) commented Mar 29, 2019

I checked. Unfortunately #1518 is not related to this bug.

atillack (Contributor) commented Mar 29, 2019

@prckent The problem seems confined to Titan. Cuda 9.1 on Summit also gives the correct results:

qmc_gpu series 1
LocalEnergy = -17.1703 +/- 0.0020
Variance = 0.496 +/- 0.017
Kinetic = 13.479 +/- 0.024
LocalPotential = -30.650 +/- 0.024
ElecElec = 11.128 +/- 0.012
LocalECP = -41.420 +/- 0.028
NonLocalECP = -1.365 +/- 0.013
IonIon = 1.01 +/- 0.00
LocalEnergy_sq = 295.316 +/- 0.071
BlockWeight = 2560.00 +/- 0.00
BlockCPU = 0.182407 +/- 0.000021
AcceptRatio = 0.47538 +/- 0.00028
Efficiency = 27168.89 +/- 0.00
TotalTime = 12.40 +/- 0.00
TotalSamples = 174080 +/- 0

ye-luo (Contributor) commented Mar 29, 2019

I put both v3.6 and v3.7 binaries at
/lustre/atlas/world-shared/mat189/qmcpack_binaries_titan
They should last until the retirement of Titan.

To work around the bug in CUDA 9.1 that gives wrong results, the following steps compile CudaCoulomb.cu with CUDA 7.5 (a scripted sketch follows the list). After building the QMCPACK CUDA version:

  1. From the build folder, cd src/QMCHamiltonians
  2. find -name qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake and open it with an editor.
  3. touch ./CMakeFiles/qmcham.dir/qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake
  4. Modify CUDA_HOST_COMPILER from /opt/cray/craype/2.5.13/bin/cc to /opt/gcc/4.9.3/bin/gcc
  5. Replace all instances of cudatoolkit9.1/9.1.85_3.10-1.0502.df1cc54.3.1 with cudatoolkit7.5/7.5.18-1.0502.10743.2.1
  6. Type make -j32 and you should see "Built target qmcham". If CMake is triggered, repeat steps 2-4 because CMake overwrites qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake.
  7. cd ../QMCApp ; sh CMakeFiles/qmcpack.dir/link.txt
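
A minimal shell sketch of the steps above, assuming the Titan paths and toolkit versions quoted in this thread:

# run from the build folder after building the CUDA version
cd src/QMCHamiltonians
f=CMakeFiles/qmcham.dir/qmcham_generated_CudaCoulomb.cu.o.RELEASE.cmake
touch "$f"
# point the host compiler at gcc 4.9.3 instead of the Cray wrapper
sed -i 's|/opt/cray/craype/2.5.13/bin/cc|/opt/gcc/4.9.3/bin/gcc|g' "$f"
# switch the toolkit paths from 9.1 to 7.5
sed -i 's|cudatoolkit9.1/9.1.85_3.10-1.0502.df1cc54.3.1|cudatoolkit7.5/7.5.18-1.0502.10743.2.1|g' "$f"
make -j32    # expect "Built target qmcham"; redo the sed edits if CMake regenerates the file
cd ../QMCApp
sh CMakeFiles/qmcpack.dir/link.txt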