Some afqmc test fail #4206

Closed
correaa opened this issue Aug 25, 2022 · 10 comments

Contributor

correaa commented Aug 25, 2022

Some AFQMC tests fail on my system.

To Reproduce

Clone the latest version of QMCPACK.

commit 719a09ef8f6c93449d026c12cd8f4475d5af403a (HEAD -> develop, upstream/develop, origin/develop, origin/HEAD)
Merge: 4d9e4b481 e88319f56
Author: Paul R. C. Kent <kentpr@ornl.gov>
Date:   Wed Aug 24 12:54:22 2022 +0200

    Merge pull request #4166 from kgasperich/rmg-conv-fix
    
    RMG converter/workflow test fixes

Run cmake:

cmake .. -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DBUILD_AFQMC=1 -DQMC_CXX_STANDARD=17 -DENABLE_CUDA=1 -DCMAKE_CUDA_HOST_COMPILER=g++-9 -DQMC_MIXED_PRECISION=1 -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-Werror" -DMPIEXEC_PREFLAGS="--allow-run-as-root;--bind-to;none"

Run the tests, specifically ctest -R afqmc --output-on-failure

Expected behavior

No test failures.

System:

My laptop: Ubuntu 22.04, NVCC Build cuda_11.7.r11.7/compiler.31294372_0, g++ 11.2
GPU GeForce GTX 3060m
OpenMPI 4.1.2

Contributor

prckent commented Aug 25, 2022

Please list the tests that fail for you in the issue.

Contributor

prckent commented Aug 25, 2022

Note that there is an earlier report from NVIDIA of test problems with AFQMC (an issue was filed sometime in the last ~2 months; I couldn't find it just now). If you can fix it, great; otherwise it will be up to the original authors to fix it or not.

prckent added the bug label Aug 25, 2022
Contributor Author

correaa commented Aug 25, 2022

These are the ones that fail.

The following tests FAILED:
	 41 - deterministic-unit_test_afqmc_matrix (Failed)
	 42 - deterministic-unit_test_afqmc_numerics (Failed)
	 43 - deterministic-unit_test_afqmc_slaterdeterminantoperations (Failed)
	 44 - deterministic-unit_test_afqmc_walkers (Failed)
	 46 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_dense_wfn_rhf (Failed)
	 48 - deterministic-unit_test_afqmc_prop_factory_ham_chol_dense_wfn_rhf (Failed)
	 49 - deterministic-unit_test_afqmc_estimators_ham_chol_dense_wfn_rhf (Failed)
Errors while running CTest
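
A dump from ctest --output-on-failure gets long quickly, so it can help to condense it before posting. The following is a minimal sketch, not part of QMCPACK: the function name `summarize_ctest_log` and the `sample` text are hypothetical, and the patterns assume the summary and `cudaGetErrorName:` line formats seen in this report.

```python
import re
from collections import Counter

# Matches ctest summary lines such as "\t 41 - deterministic-unit_test_afqmc_matrix (Failed)".
FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\(Failed\)\s*$")
# Matches lines such as " cudaGetErrorName: cudaErrorNoKernelImageForDevice".
CUDA_ERROR_LINE = re.compile(r"cudaGetErrorName:\s*(\w+)")


def summarize_ctest_log(text):
    """Return (failed_tests, cuda_error_counts) for a ctest log (hypothetical helper)."""
    failed = []
    errors = Counter()
    for line in text.splitlines():
        m = FAILED_LINE.match(line)
        if m:
            failed.append((int(m.group(1)), m.group(2)))
            continue
        m = CUDA_ERROR_LINE.search(line)
        if m:
            errors[m.group(1)] += 1
    return failed, errors


# Shortened sample in the same format as the output in this issue.
sample = """\
 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
The following tests FAILED:
\t 41 - deterministic-unit_test_afqmc_matrix (Failed)
\t 42 - deterministic-unit_test_afqmc_numerics (Failed)
Errors while running CTest
"""
failed, errors = summarize_ctest_log(sample)
print(failed)
print(dict(errors))
```

If every failing test reports the same CUDA error name, that points to one build-level cause rather than many independent bugs.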

Details below (sorry for dumping all the output).
I don't think it is a problem with my GPU; I can run other GPU code.
Maybe it is an architecture problem.
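
The error name in the output is consistent with that guess: the CUDA runtime reports cudaErrorNoKernelImageForDevice when the binary contains no device code compatible with the GPU it runs on. One way to test the idea is to rebuild with the GPU's architecture requested explicitly. The sketch below only formats the value; the helper name `cmake_cuda_arch` is hypothetical, the compute capability "8.6" is an assumed example for an Ampere-class laptop GPU, and whether QMCPACK at this commit honors CMake's standard CMAKE_CUDA_ARCHITECTURES variable rather than a project-specific option is an assumption to check against its build documentation.

```python
def cmake_cuda_arch(compute_capability):
    """Convert a compute capability like "8.6" into an architecture value like "86"."""
    major, minor = compute_capability.strip().split(".")
    if not (major.isdigit() and minor.isdigit()):
        raise ValueError(f"unexpected compute capability: {compute_capability!r}")
    return f"{int(major)}{int(minor)}"


# Assumed value for illustration; query the real one on the target machine.
arch = cmake_cuda_arch("8.6")
print(f"-DCMAKE_CUDA_ARCHITECTURES={arch}")
```

The printed flag would be appended to the cmake command from the reproduction steps, followed by a clean rebuild before rerunning the tests.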

$ ctest -R afqmc --output-on-failure

Test project /home/correaa/qmcpack/build
    Start 41: deterministic-unit_test_afqmc_matrix
1/9 Test #41: deterministic-unit_test_afqmc_matrix ................................***Failed    2.55 sec
QMCPACK printout is suppressed. Use --turn-on-printout to see all the printout.

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_afqmc_matrix is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
csr_matrix_serial
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Matrix/tests/test_csr_matrix.cpp:631
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Matrix/tests/test_csr_matrix.cpp:631: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
csr_matrix_shm
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Matrix/tests/test_csr_matrix.cpp:650
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Matrix/tests/test_csr_matrix.cpp:650: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
   Error code returned by cuda. 

===============================================================================
test cases:   2 |   0 passed | 2 failed
assertions: 200 | 198 passed | 2 failed

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21328,1],0]
  Exit code:    2
--------------------------------------------------------------------------

    Start 42: deterministic-unit_test_afqmc_numerics
2/9 Test #42: deterministic-unit_test_afqmc_numerics ..............................***Failed    3.38 sec
QMCPACK printout is suppressed. Use --turn-on-printout to see all the printout.

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_afqmc_numerics is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
dense_ma_operations_device_double
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_dense_numerics.cpp:740
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_dense_numerics.cpp:740: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
dense_ma_operations_device_complex
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_dense_numerics.cpp:754
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_dense_numerics.cpp:754: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
sparse_ma_operations
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_sparse_numerics.cpp:185
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_sparse_numerics.cpp:185: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
Tab_to_Kl
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:85
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:85: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
batched_Tab_to_Klr
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:108
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:108: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
Tanb_to_Kl
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:135
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:135: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
batched_dot_wabn_wban
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:166
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:166: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
batched_dot_wanb_wbna
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:190
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:190: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
dot_wabn
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:214
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:214: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
dot_wanb
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:231
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:231: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
Auwn_Bun_Cuw
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:250
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:250: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
Awiu_Biu_Cuw
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:267
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:267: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
Aijk_Bkj_Cik
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:287
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:287: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
viwj_vwij
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:307
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:307: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
element_wise_Aij_Bjk_Ckij
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:323
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:323: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
element_wise_Aij_Bjk_Ckji
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:368
...............................................................................


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:368: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
inplace_product
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:376
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:376: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
vbias_from_v1
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:425
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_batched_operations.cpp:425: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
adotpby
  double
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:87
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:87: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
adotpby
  complex
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:104
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:104: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
axty
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:126
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:126: FAILED:
due to unexpected exception with message:
  parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is
  available for execution on the device

-------------------------------------------------------------------------------
axty2D
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:138
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:138: FAILED:
due to unexpected exception with message:
  parallel_for failed: cudaErrorNoKernelImageForDevice: no kernel image is
  available for execution on the device


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
acAxpbB
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:151
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:151: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
adiagApy
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:168
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:168: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
sum
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:183
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:183: FAILED:
due to unexpected exception with message:
  after reduction step 1: cudaErrorInvalidDeviceFunction: invalid device
  function

-------------------------------------------------------------------------------
sum2D
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:192
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:192: FAILED:
due to unexpected exception with message:
  after reduction step 1: cudaErrorInvalidDeviceFunction: invalid device
  function

-------------------------------------------------------------------------------
sum3D
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:204
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:204: FAILED:
due to unexpected exception with message:
  after reduction step 1: cudaErrorInvalidDeviceFunction: invalid device
  function

-------------------------------------------------------------------------------
sum4D
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:216
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:216: FAILED:
due to unexpected exception with message:
  after reduction step 1: cudaErrorInvalidDeviceFunction: invalid device
  function


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
zero_complex_part
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:228
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:228: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
set_identity2D
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:239
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:239: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
set_identity3D
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:252
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:252: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
fill2D
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:265
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:265: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
get_diagonal_strided
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:277
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_ma_blas_extensions.cpp:277: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
KaKjw_to_KKwaj
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:84
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:84: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
KaKjw_to_QKwaj
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:111
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:111: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
term_by_term_matrix_vector
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:156
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:156: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
transpose_wabn_to_wban
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:207
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:207: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
vKKwij_tovwKiKj
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:229
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_tensor_operations.cpp:229: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
determinant_from_getrf
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:66
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:66: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
strided_determinant_from_getrf
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:86
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:86: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
batched_determinant_from_getrf
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:114
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:114: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
batched_determinant_from_getrf_complex
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:142
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:142: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
determinant
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:170
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_determinant.cpp:170: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
axpyBatched
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_misc_kernels.cpp:71
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_misc_kernels.cpp:71: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
construct_X
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_misc_kernels.cpp:93
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_misc_kernels.cpp:93: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
batchedDot
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_misc_kernels.cpp:118
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Numerics/tests/test_misc_kernels.cpp:118: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

===============================================================================
test cases:  48 |   3 passed | 45 failed
assertions: 609 | 563 passed | 46 failed

 Error from calling cudaFree: driver shutting down
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21140,1],0]
  Exit code:    46
--------------------------------------------------------------------------

    Start 43: deterministic-unit_test_afqmc_slaterdeterminantoperations
3/9 Test #43: deterministic-unit_test_afqmc_slaterdeterminantoperations ...........***Failed    3.01 sec
QMCPACK printout is suppressed. Use --turn-on-printout to see all the printout.

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_afqmc_slaterdeterminantoperations is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
SDetOps_complex_serial
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/SlaterDeterminantOperations/tests/test_sdet_ops.cpp:973
...............................................................................

/home/correaa/qmcpack/src/AFQMC/SlaterDeterminantOperations/tests/test_sdet_ops.cpp:973: FAILED:
due to unexpected exception with message:
   Error code returned by cuda. 

===============================================================================
test cases:   3 |   2 passed | 1 failed
assertions: 663 | 662 passed | 1 failed

 Error from calling cudaFree: driver shutting down
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[21098,1],0]
  Exit code:    1
--------------------------------------------------------------------------

    Start 44: deterministic-unit_test_afqmc_walkers
4/9 Test #44: deterministic-unit_test_afqmc_walkers ...............................***Failed    2.95 sec
QMCPACK printout is suppressed. Use --turn-on-printout to see all the printout.

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_afqmc_walkers is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
swset_test_serial
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Walkers/tests/test_sharedwset.cpp:506
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Walkers/tests/test_sharedwset.cpp:506: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
   Error code returned by cuda. 


 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device
-------------------------------------------------------------------------------
walker_io
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Walkers/tests/test_sharedwset.cpp:522
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Walkers/tests/test_sharedwset.cpp:522: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
   Error code returned by cuda. 

===============================================================================
test cases: 2 | 2 failed
assertions: 4 | 2 passed | 2 failed

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20911,1],0]
  Exit code:    2
--------------------------------------------------------------------------

    Start 45: deterministic-unit_test_afqmc_hamiltonians_ham_chol
5/9 Test #45: deterministic-unit_test_afqmc_hamiltonians_ham_chol .................   Passed    0.99 sec
    Start 46: deterministic-unit_test_afqmc_wfn_factory_ham_chol_dense_wfn_rhf
6/9 Test #46: deterministic-unit_test_afqmc_wfn_factory_ham_chol_dense_wfn_rhf ....***Failed    3.06 sec
QMCPACK printout is suppressed. Use --turn-on-printout to see all the printout.

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_afqmc_wfn_factory is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
wfn_fac_sdet
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Wavefunctions/tests/test_wfn_factory.cpp:1107
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Wavefunctions/tests/test_wfn_factory.cpp:1107: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
wfn_fac_distributed
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Wavefunctions/tests/test_wfn_factory.cpp:1127
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Wavefunctions/tests/test_wfn_factory.cpp:1127: FAILED:
due to unexpected exception with message:
  Error: Incorrect global state in require (found initialized).

===============================================================================
test cases: 2 | 2 failed
assertions: 5 | 3 passed | 2 failed

 Error from calling cudaFree: driver shutting down
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20935,1],0]
  Exit code:    2
--------------------------------------------------------------------------

    Start 47: deterministic-unit_test_afqmc_phmsd
7/9 Test #47: deterministic-unit_test_afqmc_phmsd .................................   Passed    0.94 sec
    Start 48: deterministic-unit_test_afqmc_prop_factory_ham_chol_dense_wfn_rhf
8/9 Test #48: deterministic-unit_test_afqmc_prop_factory_ham_chol_dense_wfn_rhf ...***Failed    3.10 sec
QMCPACK printout is suppressed. Use --turn-on-printout to see all the printout.

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_afqmc_prop_factory is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
propg_fac_shared
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Propagators/tests/test_propagator_factory.cpp:419
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Propagators/tests/test_propagator_factory.cpp:419: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
   Error code returned by cuda. 

-------------------------------------------------------------------------------
propg_fac_distributed
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Propagators/tests/test_propagator_factory.cpp:435
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Propagators/tests/test_propagator_factory.cpp:435: FAILED:
due to unexpected exception with message:
  Error: Incorrect global state in require (found initialized).

===============================================================================
test cases: 2 | 2 failed
assertions: 5 | 3 passed | 2 failed

 Error from calling cudaFree: driver shutting down
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20767,1],0]
  Exit code:    2
--------------------------------------------------------------------------

    Start 49: deterministic-unit_test_afqmc_estimators_ham_chol_dense_wfn_rhf
9/9 Test #49: deterministic-unit_test_afqmc_estimators_ham_chol_dense_wfn_rhf .....***Failed    3.12 sec
QMCPACK printout is suppressed. Use --turn-on-printout to see all the printout.

 cudaGetErrorName: cudaErrorNoKernelImageForDevice
 cudaGetErrorString: no kernel image is available for execution on the device

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
test_afqmc_estimators is a Catch v2.13.6 host application.
Run with -? for options

-------------------------------------------------------------------------------
reduced_density_matrix
-------------------------------------------------------------------------------
/home/correaa/qmcpack/src/AFQMC/Estimators/tests/test_estimators.cpp:246
...............................................................................

/home/correaa/qmcpack/src/AFQMC/Estimators/tests/test_estimators.cpp:246: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
   Error code returned by cuda. 

===============================================================================
test cases: 1 | 1 failed
assertions: 4 | 3 passed | 1 failed

 Error from calling cudaFree: driver shutting down
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[20829,1],0]
  Exit code:    1
--------------------------------------------------------------------------


22% tests passed, 7 tests failed out of 9

Label Time Summary:
afqmc              =  23.11 sec*proc (9 tests)
deterministic      =  23.11 sec*proc (9 tests)
quality_unknown    =  23.11 sec*proc (9 tests)
unit               =  23.11 sec*proc (9 tests)

Total Test time (real) =  28.01 sec

The following tests FAILED:
	 41 - deterministic-unit_test_afqmc_matrix (Failed)
	 42 - deterministic-unit_test_afqmc_numerics (Failed)
	 43 - deterministic-unit_test_afqmc_slaterdeterminantoperations (Failed)
	 44 - deterministic-unit_test_afqmc_walkers (Failed)
	 46 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_dense_wfn_rhf (Failed)
	 48 - deterministic-unit_test_afqmc_prop_factory_ham_chol_dense_wfn_rhf (Failed)
	 49 - deterministic-unit_test_afqmc_estimators_ham_chol_dense_wfn_rhf (Failed)
Errors while running CTest

@correaa
Contributor Author

correaa commented Aug 25, 2022

> Note that there is an earlier report from NVIDIA (issue exists sometime in the last ~2 months, couldn't find right now) of test problems with AFQMC. If you can fix it great, otherwise it will be up to the original authors to fix or not.

Yes, now I realize I filed basically the same issue a few months ago.

@prckent
Contributor

prckent commented Aug 25, 2022

Check the architectures used in the build. This looks like an install or compilation problem rather than a bug, since no kernels are running at all.
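One way to check which architecture a build directory was configured with is to read the cached value out of `CMakeCache.txt`. The sketch below is only an illustration: the helper name `cached_cuda_arch` is hypothetical, and it assumes the project stores `CMAKE_CUDA_ARCHITECTURES` as a cache entry (if the default is set as a plain variable, nothing will be printed).

```shell
# Hypothetical helper: print the CUDA architecture list recorded in a
# CMake build directory. Prints nothing if the entry is not cached.
# Usage: cached_cuda_arch path/to/build   (the path is a placeholder)
cached_cuda_arch() {
  # Cache lines look like NAME:TYPE=VALUE; strip everything up to '='.
  sed -n 's/^CMAKE_CUDA_ARCHITECTURES:[A-Z]*=//p' "$1/CMakeCache.txt"
}
```

If the printed value is higher than what the GPU supports, `cudaErrorNoKernelImageForDevice` is the expected symptom.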

@prckent
Contributor

prckent commented Aug 25, 2022

Try Summit or Perlmutter?

@correaa
Contributor Author

correaa commented Aug 25, 2022

This one: #4038

Yes, it must be a compilation/architecture error, although the same cmake line worked before.
Is it possible to set the architecture (e.g. 61) from the CMake command line?

@correaa
Contributor Author

correaa commented Aug 25, 2022

That must be it: the default CUDA architecture of 70 doesn't work with my card, which needs 61.

@correaa
Contributor Author

correaa commented Aug 25, 2022

All solved by adding -DCMAKE_CUDA_ARCHITECTURES=61 to the cmake line.

cmake .. -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DBUILD_AFQMC=1 -DQMC_CXX_STANDARD=17 -DENABLE_CUDA=1 -DCMAKE_CUDA_HOST_COMPILER=g++-9 -DQMC_MIXED_PRECISION=1 -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="-Werror" -DMPIEXEC_PREFLAGS="--allow-run-as-root;--bind-to;none" -DCMAKE_CUDA_ARCHITECTURES=61

make ppconvert afqmc test_afqmc_matrix test_afqmc_numerics test_afqmc_slaterdeterminantoperations test_afqmc_walkers test_afqmc_hamiltonians test_afqmc_hamiltonian_operations test_afqmc_phmsd test_afqmc_wfn_factory test_afqmc_prop_factory test_afqmc_estimators qmc-afqmc-performance VERBOSE=1 -j 5

ctest -R ppconvert --output-on-failure
ctest -R afqmc     --output-on-failure
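For anyone hitting the same error on a different card, here is a minimal sketch of how the flag value can be derived from a compute capability written as "major.minor". The function names are hypothetical. The `nvidia-smi` query mentioned in the comment is an assumption about the installed driver and may not be available on older versions.

```shell
# Hypothetical helpers, not part of QMCPACK.

# Convert a compute capability such as "6.1" into the value CMake expects ("61").
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '.'
}

# Build the cmake flag for a given compute capability.
cuda_arch_flag() {
  printf '%s%s\n' '-DCMAKE_CUDA_ARCHITECTURES=' "$(cap_to_arch "$1")"
}

# The capability can be looked up in NVIDIA's documentation or, on drivers
# that support the query (assumption), with:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
cuda_arch_flag 6.1   # prints -DCMAKE_CUDA_ARCHITECTURES=61
```

The printed flag is then appended to the existing cmake line, as in the command above.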

@correaa correaa closed this as completed Aug 25, 2022