Methane Oxidation Case Study Fails to train via "dp train methane_param.json" #4769
Replies: 2 comments
This looks like a TensorFlow bug. Does rerunning the training work?
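The log in the original post shows a checkpoint being saved at batch 15000, shortly before the abort, so one way to rerun without starting from scratch may be to resume from that checkpoint with `dp train --restart`. The following is a minimal sketch, not a confirmed fix. The helper name `training_command` is hypothetical, and the check for `model.ckpt.index` assumes the checkpoint files sit in the working directory under the `model.ckpt` prefix seen in the log. The function only prints the command it would choose and does not run it.

```shell
# Hypothetical helper: print the training command to use, without running it.
training_command() {
    # Assumption: a saved TensorFlow checkpoint leaves a model.ckpt.index file
    # in the current directory.
    if [ -e model.ckpt.index ]; then
        echo "dp train --restart model.ckpt methane_param.json"
    else
        echo "dp train methane_param.json"
    fi
}

training_command
```

If the crash comes back at a similar batch after resuming, that would suggest the problem is not a one-off.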
Since my original post, I reinstalled deepmd and horovod, downloaded a fresh copy of the tutorial (Chapter13-tutorial), and again ran the command "dp train methane_param.json > logfile 2> logfile". Note: this time I forgot to set the environment variables (OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS), but for the run in the original post I had set them correctly.
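For reference, a sketch of how the three variables could be set before training. The thread counts are placeholders rather than recommended values, and the wrapper name `run_training` is hypothetical. The sketch also uses `2>&1` instead of `2> logfile`: pointing stdout and stderr at the same file with two separate redirections opens the file twice, so the two streams can overwrite each other's output, whereas `2>&1` makes both share one file handle.

```shell
# Placeholder thread counts (assumptions): tune them to the machine.
export OMP_NUM_THREADS=3
export DP_INTRA_OP_PARALLELISM_THREADS=3
export DP_INTER_OP_PARALLELISM_THREADS=2

# Hypothetical wrapper: run training with stdout and stderr sharing one
# file handle, so neither stream clobbers the other in logfile.
run_training() {
    dp train methane_param.json > logfile 2>&1
}
```

Calling `run_training` from the directory that contains `methane_param.json` would then start the run with all three variables set.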
Installation of deepmd went smoothly via
conda create -n deepmd deepmd-kit lammps horovod -c conda-forge
Download of the methane oxidation case study dataset from github.com/tongzhugroup/Chapter13-tutorial went smoothly via
git clone https://github.com/tongzhugroup/Chapter13-tutorial
But "dp train methane_param.json > logfile 2> logfile" has failed with errors each of the 2 or 3 times I've tried it.
Here are the final 100 lines of the log file:
[2025-05-26 20:28:09,140] DEEPMD INFO batch 14600: total wall time = 7.12 s
[2025-05-26 20:28:16,083] DEEPMD INFO batch 14700: trn: rmse = 8.38e+00, rmse_e = 1.01e-01, rmse_f = 2.70e-01, lr = 9.64e-04
[2025-05-26 20:28:16,084] DEEPMD INFO batch 14700: total wall time = 6.94 s
[2025-05-26 20:28:23,089] DEEPMD INFO batch 14800: trn: rmse = 9.34e+00, rmse_e = 1.18e-01, rmse_f = 3.01e-01, lr = 9.63e-04
[2025-05-26 20:28:23,089] DEEPMD INFO batch 14800: total wall time = 7.01 s
[2025-05-26 20:28:30,309] DEEPMD INFO batch 14900: trn: rmse = 2.54e+01, rmse_e = 3.58e-01, rmse_f = 8.18e-01, lr = 9.63e-04
[2025-05-26 20:28:30,309] DEEPMD INFO batch 14900: total wall time = 7.22 s
[2025-05-26 20:28:37,339] DEEPMD INFO batch 15000: trn: rmse = 2.64e+01, rmse_e = 2.52e-01, rmse_f = 8.51e-01, lr = 9.63e-04
[2025-05-26 20:28:37,339] DEEPMD INFO batch 15000: total wall time = 7.03 s
[2025-05-26 20:28:37,433] DEEPMD INFO saved checkpoint model.ckpt
[2025-05-26 20:28:44,569] DEEPMD INFO batch 15100: trn: rmse = 7.51e+00, rmse_e = 1.53e-01, rmse_f = 2.42e-01, lr = 9.63e-04
[2025-05-26 20:28:44,569] DEEPMD INFO batch 15100: total wall time = 7.23 s
2025-05-26 20:28:49.670824: F external/local_xla/xla/tsl/lib/monitoring/counter.h:205] Check failed: 0 <= step (0 vs. -31781)Must not decrement cumulative metrics.
[Mac:07228] *** Process received signal ***
[Mac:07228] Signal: Abort trap: 6 (6)
[Mac:07228] Signal code: (0)
[Mac:07228] [ 0] 0 libsystem_platform.dylib 0x000000018f053624 _sigtramp + 56
[Mac:07228] [ 1] 0 libsystem_pthread.dylib 0x000000018f01988c pthread_kill + 296
[Mac:07228] [ 2] 0 libsystem_c.dylib 0x000000018ef22c60 abort + 124
[Mac:07228] [ 3] 0 libtensorflow_framework.2.dylib 0x00000001117ea3a8 _ZN3tsl8internal15LogMessageFatalD2Ev + 36
[Mac:07228] [ 4] 0 libtensorflow_framework.2.dylib 0x00000001117ea3c4 _ZTv0_n24_N3tsl8internal15LogMessageFatalD1Ev + 0
[Mac:07228] [ 5] 0 libtensorflow_framework.2.dylib 0x00000001112e822c _ZN10tensorflow7metrics19UpdateGraphExecTimeEy + 284
[Mac:07228] [ 6] 0 libtensorflow_cc.2.dylib 0x000000030e2b764c _ZN10tensorflow13DirectSession11RunInternalExRKNS_10RunOptionsEPNS_18CallFrameInterfaceEPNS0_16ExecutorsAndKeysEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE + 4044
[Mac:07228] [ 7] 0 libtensorflow_cc.2.dylib 0x000000030e2b84cc _ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKNSt3__16vectorINS4_4pairINS4_12basic_stringIcNS4_11char_traitsIcEENS4_9allocatorIcEEEENS_6TensorEEENSA_ISE_EEEERKNS5_ISC_NSA_ISC_EEEESM_PNS5_ISD_NSA_ISD_EEEEPNS_11RunMetadataERKN3tsl6thread17ThreadPoolOptionsE + 1280
[Mac:07228] [ 8] 0 libtensorflow_cc.2.dylib 0x000000030e2b7fa4 _ZN10tensorflow13DirectSession3RunERKNS_10RunOptionsERKNSt3__16vectorINS4_4pairINS4_12basic_stringIcNS4_11char_traitsIcEENS4_9allocatorIcEEEENS_6TensorEEENSA_ISE_EEEERKNS5_ISC_NSA_ISC_EEEESM_PNS5_ISD_NSA_ISD_EEEEPNS_11RunMetadataE + 48
[Mac:07228] [ 9] 0 _pywrap_tensorflow_internal.so 0x000000010c2fecd0 _ZN10tensorflow10SessionRef3RunERKNS_10RunOptionsERKNSt3__16vectorINS4_4pairINS4_12basic_stringIcNS4_11char_traitsIcEENS4_9allocatorIcEEEENS_6TensorEEENSA_ISE_EEEERKNS5_ISC_NSA_ISC_EEEESM_PNS5_ISD_NSA_ISD_EEEEPNS_11RunMetadataE + 316
[Mac:07228] [10] 0 libtensorflow_cc.2.dylib 0x0000000302089d5c _ZL13TF_Run_HelperPN10tensorflow7SessionEPKcPK9TF_BufferRKNSt3__16vectorINS7_4pairINS7_12basic_stringIcNS7_11char_traitsIcEENS7_9allocatorIcEEEENS_6TensorEEENSD_ISH_EEEERKNS8_ISF_NSD_ISF_EEEEPP9TF_TensorSP_PS4_P10TSL_Status + 1508
[Mac:07228] [11] 0 libtensorflow_cc.2.dylib 0x0000000302094f84 TF_SessionRun + 908
[Mac:07228] [12] 0 _pywrap_tensorflow_internal.so 0x000000010c2fbfd4 _ZN10tensorflow28TF_SessionRun_wrapper_helperEP10TF_SessionPKcPK9TF_BufferRKNSt3__16vectorI9TF_OutputNS7_9allocatorIS9_EEEERKNS8_IP7_objectNSA_ISG_EEEESE_RKNS8_IP12TF_OperationNSA_ISM_EEEEPS4_P10TSL_StatusPSI_ + 1132
[Mac:07228] [13] 0 _pywrap_tensorflow_internal.so 0x000000010c2fc8ec _ZN10tensorflow21TF_SessionRun_wrapperEP10TF_SessionPK9TF_BufferRKNSt3__16vectorI9TF_OutputNS5_9allocatorIS7_EEEERKNS6_IP7_objectNS8_ISE_EEEESC_RKNS6_IP12TF_OperationNS8_ISK_EEEEPS2_P10TSL_StatusPSG_ + 56
[Mac:07228] [14] 0 _pywrap_tf_session.so 0x0000000116067080 _ZNO8pybind116detail15argument_loaderIJP10TF_SessionP9TF_BufferRKNS_6handleERKNSt3__16vectorI9TF_OutputNS9_9allocatorISB_EEEERKNSA_IP12TF_OperationNSC_ISI_EEEES5_EE4callINS_6objectENS0_9void_typeERZL32pybind11_init__pywrap_tf_sessionRNS_7module_EE4$46EENS9_9enable_ifIXntsr3std7is_voidIT_EE5valueESW_E4typeEOT1 + 728
[Mac:07228] [15] 0 _pywrap_tf_session.so 0x0000000116066c80 _ZZN8pybind1112cpp_function10initializeIZL32pybind11_init__pywrap_tf_sessionRNS_7module_EE4$46NS_6objectEJP10TF_SessionP9TF_BufferRKNS_6handleERKNSt3__16vectorI9TF_OutputNSD_9allocatorISF_EEEERKNSE_IP12TF_OperationNSG_ISM_EEEES9_EJNS_4nameENS_5scopeENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE_8__invokeES17 + 172
[Mac:07228] [16] 0 _pywrap_tf_session.so 0x000000011601e160 _ZN8pybind1112cpp_function10dispatcherEP7_objectS2_S2_ + 4508
[Mac:07228] [17] 0 python3.11 0x0000000100422938 cfunction_call + 124
[Mac:07228] [18] 0 python3.11 0x00000001003cabc0 _PyObject_MakeTpCall + 332
[Mac:07228] [19] 0 python3.11 0x00000001004ce228 _PyEval_EvalFrameDefault + 45376
[Mac:07228] [20] 0 python3.11 0x00000001004d2708 _PyEval_Vector + 184
[Mac:07228] [21] 0 python3.11 0x00000001004d005c _PyEval_EvalFrameDefault + 53108
[Mac:07228] [22] 0 python3.11 0x00000001004d2708 _PyEval_Vector + 184
[Mac:07228] [23] 0 python3.11 0x00000001003ce6a8 method_vectorcall + 172
[Mac:07228] [24] 0 python3.11 0x00000001003cb358 _PyVectorcall_Call + 132
[Mac:07228] [25] 0 python3.11 0x00000001004d005c _PyEval_EvalFrameDefault + 53108
[Mac:07228] [26] 0 python3.11 0x00000001004d2708 _PyEval_Vector + 184
[Mac:07228] [27] 0 python3.11 0x00000001003cb358 _PyVectorcall_Call + 132
[Mac:07228] [28] 0 python3.11 0x00000001004d005c _PyEval_EvalFrameDefault + 53108
[Mac:07228] [29] 0 python3.11 0x00000001004c210c PyEval_EvalCode + 204
[Mac:07228] *** End of error message ***
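To compare repeated failures, it may help to pull out just the fatal check line and the last batch reached from each log. The following is a small sketch under the assumption that the log has the same layout as the excerpt above. The helper name `summarize_crash` is hypothetical, and it uses only standard `grep` and `tail`.

```shell
# Hypothetical helper: summarize where a training log stopped.
summarize_crash() {
    log="${1:-logfile}"
    # First fatal-check line written by TensorFlow, if any.
    grep -m 1 'Check failed' "$log"
    # Last batch number reported before the log ends.
    grep -o 'batch [0-9]*' "$log" | tail -n 1
}
```

Running `summarize_crash logfile` after each attempt would show whether the abort always carries the same "Must not decrement cumulative metrics" message and whether it happens at a similar batch.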