Fix rocm alias #23

Closed
reza-amd wants to merge 2 commits into main from fix_rocm_alias

@reza-amd

No description provided.

@reza-amd
Author

jenkins: retest Ubuntu-GPU-single please

Ruturaj4 force-pushed the main branch 2 times, most recently from ec7d625 to 1f71f84 on July 26, 2024 01:05
mrodden closed this on Sep 6, 2024
JehandadKhan pushed a commit that referenced this pull request Oct 9, 2024
Typo, Python parens

GitOrigin-RevId: 9e739fb
rocm-repo-management-api-2 bot pushed a commit that referenced this pull request Apr 10, 2025
When run under an optimized build and Python 3.13.2t, I saw the
following high-probability crash in lax_control_flow_test:

```
                Stack trace of thread 3526917:
                #0  0x00007f0898c4bf91 dump_frame (libpython3.13t.so.1.0 + 0x24bf91)
                #1  0x00007f0898c4b73f dump_traceback (libpython3.13t.so.1.0 + 0x24b73f)
                #2  0x00007f0898c4b86f _Py_DumpTracebackThreads (libpython3.13t.so.1.0 + 0x24b86f)
                #3  0x00007f0898cd4fe0 faulthandler_dump_traceback (libpython3.13t.so.1.0 + 0x2d4fe0)
                #4  0x00007f0898cd4f44 faulthandler_fatal_error (libpython3.13t.so.1.0 + 0x2d4f44)
                #5  0x00007f0898849e20 __restore_rt (libc.so.6 + 0x3fe20)
                #6  0x00007f07eb80e493 _ZNSt8__detail16_Hashtable_allocISaINS_10_Hash_nodeISt4pairIKN3jax15WeakrefLRUCache15WeakrefCacheKeyENS4_17WeakrefCacheValueEELb1EEEEE18_M_deallocate_nodeEPS9_ (libjax_common.so + 0x2c0e493)
                #7  0x00007f07eb80e13e _ZN3jax15WeakrefLRUCache5ClearEv (libjax_common.so + 0x2c0e13e)
                #8  0x00007f07eb812e37 _ZZN8nanobind6detail11func_createILb0ELb1EZNS_16cpp_function_defIN3jax15WeakrefLRUCacheEvS4_JEJNS_5scopeENS_4nameENS_9is_methodENS_9lock_selfEEEEvMT1_FT0_DpT2_EDpRKT3_EUlPS4_E_vJSJ_EJLm0EEJS5_S6_S7_S8_EEEP>
                #9  0x00007f07eb7fff70 _ZN8nanobind6detailL25nb_func_vectorcall_simpleEP7_objectPKS2_mS2_ (libjax_common.so + 0x2bfff70)
                #10 0x00007f0898dbbdee _PyObject_VectorcallTstate (libpython3.13t.so.1.0 + 0x3bbdee)
                #11 0x00007f0898d1d4db _PyEval_EvalFrame (libpython3.13t.so.1.0 + 0x31d4db)
                #12 0x00007f0898d1ee78 _PyObject_VectorcallTstate (libpython3.13t.so.1.0 + 0x31ee78)
                #13 0x00007f0898dc0054 _PyVectorcall_Call (libpython3.13t.so.1.0 + 0x3c0054)
                #14 0x00007f0898d1d4db _PyEval_EvalFrame (libpython3.13t.so.1.0 + 0x31d4db)
                #15 0x00007f0898d1e02c _PyObject_VectorcallDictTstate (libpython3.13t.so.1.0 + 0x31e02c)
                #16 0x00007f0898ed8e35 slot_tp_call (libpython3.13t.so.1.0 + 0x4d8e35)
                #17 0x00007f0898dbc312 _PyObject_MakeTpCall (libpython3.13t.so.1.0 + 0x3bc312)
                #18 0x00007f0898d1d4db _PyEval_EvalFrame (libpython3.13t.so.1.0 + 0x31d4db)
                #19 0x00007f0898d1ef54 _PyObject_VectorcallTstate (libpython3.13t.so.1.0 + 0x31ef54)
                #20 0x00007f0899094c1f thread_run (libpython3.13t.so.1.0 + 0x694c1f)
                #21 0x00007f0898fa0c58 pythread_wrapper (libpython3.13t.so.1.0 + 0x5a0c58)
                #22 0x00007f089889c103 start_thread (libc.so.6 + 0x92103)
                #23 0x00007f089891a7b8 __clone3 (libc.so.6 + 0x1107b8)
```

It appears that this is due to freeing Python objects during
unordered_map::clear(), which may release the enclosing critical section
(`nb::lock_self()` on the method). Fix this by deferring destruction of
both the keys and the values until after the map's destruction.
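
The deferral described above can be illustrated with a minimal, standalone sketch. It is not the actual `WeakrefLRUCache` code and uses neither nanobind nor Python: `Cache`, `Value`, `Insert`, and `Size` are hypothetical names, and the "Python object" is modeled as a value whose destructor calls back into the cache. The idea shown is only that `Clear()` first moves every node out of the map, so the map is already empty and consistent by the time any key or value destructor runs.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

class Cache;

// Hypothetical stand-in for a value that owns a Python reference: its
// destructor runs code that re-enters the cache.
struct Value {
  Value(Cache* owner, std::vector<std::size_t>* log) : owner(owner), log(log) {}
  ~Value();
  Cache* owner;
  std::vector<std::size_t>* log;  // records the cache size seen on destruction
};

class Cache {
 public:
  using Map = std::unordered_map<std::string, std::unique_ptr<Value>>;

  void Insert(const std::string& key, std::vector<std::size_t>* log) {
    entries_.emplace(key, std::make_unique<Value>(this, log));
  }

  std::size_t Size() const { return entries_.size(); }

  void Clear() {
    // Move every node (key and value together) out of the map without
    // destroying it. extract() transfers ownership to the node handle.
    std::vector<Map::node_type> deferred;
    deferred.reserve(entries_.size());
    while (!entries_.empty()) {
      deferred.push_back(entries_.extract(entries_.begin()));
    }
    assert(entries_.empty());
    // Only now are the keys and values destroyed. Any code their destructors
    // run observes an empty, fully consistent map.
    deferred.clear();
  }

 private:
  Map entries_;
};

// Re-enters the cache, as a Python finalizer might in the real scenario.
Value::~Value() { log->push_back(owner->Size()); }
```

In this sketch every destructor observes `Size() == 0`, because no destructor runs while the map is being modified. With a plain `entries_.clear()`, the destructors would instead run in the middle of the map's own teardown loop.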
charleshofer pushed a commit that referenced this pull request Apr 30, 2025
charleshofer pushed a commit that referenced this pull request May 1, 2025
gulsumgudukbay deleted the fix_rocm_alias branch on June 20, 2025 15:14