Skip to content

[Help] How to use tools to debug the performance of communication kernels #274

@the-strawhat

Description

@the-strawhat

Hello,
I am currently developing a communication kernel and would like to use Thread Trace to debug and analyze its performance at the ISA (assembly) instruction level.I tried running the following command: rocprofv3 --att -d <output_dir> -- <application_path>However, I encountered an error during execution (see below).

Details

Unidirectional LOAD bandwidth GiB/s [Remote read]
SRC\DST GPU 00 GPU 01
GPU 00 -> 986.54 42.27
GPU 01 -> 42.07 984.74
W20251030 09:30:51.570256 139906822775424 generateRocpd.cpp:580] writing SQL database for process 142107 on node 3638630046
E20251030 09:30:51.571632 139906822775424 generateRocpd.cpp:603] Opened result file: /home/tmp/iris/pass_2/normandy-mlflow-debug-j6iqur-041086-2xwbd/142107_results.db (UUID=00049be3-1c39-7c39-8ea0-168663cde420)
W20251030 09:30:51.617263 139793821794944 generateRocpd.cpp:580] writing SQL database for process 142108 on node 3638630046
E20251030 09:30:51.618204 139793821794944 generateRocpd.cpp:603] Opened result file: /home/tmp/iris/pass_2/normandy-mlflow-debug-j6iqur-041086-2xwbd/142108_results.db (UUID=00049be3-1c3a-7c3a-8f87-75d03bc0f13d)
W20251030 09:30:51.679299 139906822775424 simple_timer.cpp:55] SQLite3 generation :: rocpd_string :: 0.045479 sec
W20251030 09:30:51.679628 139906822775424 simple_timer.cpp:55] SQLite3 generation :: rocpd_info_node :: 0.000309 sec
F20251030 09:30:51.680204 139906822775424 common.cpp:117] SQLite3 error 1: unrecognized token: "'/opt/venv/bin/python -c frommultiprocessing.spawnimportspawn_main;"
SQLite3 error: common.cpp:830
SQL Statement: INSERT INTO rocpd_info_process_00049be3_1c39_7c39_8ea0_168663cde420 (id, nid, ppid, pid, init, fini, start, end, command, environment, extdata) VALUES (142107, 3638630046, 142014, 142107, 19795221561709861, 19795233094016172, 19795221561709861, 19795233094016172, '/opt/venv/bin/python -c frommultiprocessing.spawnimportspawn_main;
W20251030 09:30:51.689028 139793821794944 simple_timer.cpp:55] SQLite3 generation :: rocpd_string :: 0.045302 sec
W20251030 09:30:51.689389 139793821794944 simple_timer.cpp:55] SQLite3 generation :: rocpd_info_node :: 0.000341 sec
F20251030 09:30:51.689972 139793821794944 common.cpp:117] SQLite3 error 1: unrecognized token: "'/opt/venv/bin/python -c frommultiprocessing.spawnimportspawn_main;"
SQLite3 error: common.cpp:830
SQL Statement: INSERT INTO rocpd_info_process_00049be3_1c3a_7c3a_8f87_75d03bc0f13d (id, nid, ppid, pid, init, fini, start, end, command, environment, extdata) VALUES (142108, 3638630046, 142014, 142108, 19795221562653435, 19795233140494068, 19795221562653435, 19795233140494068, '/opt/venv/bin/python -c frommultiprocessing.spawnimportspawn_main;
@ 0x7f3ea41dad40 google::LogMessage::Fail()
@ 0x7f3ea41dbe98 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f3ea3f783bc rocprofiler::tool::sql::execute_raw_sql_statements_impl(sqlite3*, std::basic_string_view<char, std::char_traits >, int ()(void, int, char**, char**), void*, int)
@ 0x7f3ea3f2be8f rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f3ea3f3dde8 rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f3ea3e57ec6 tool_fini(void*)
@ 0x7f2454bb4d40 google::LogMessage::Fail()
@ 0x7f3ea357e333 rocprofiler::registration::(anonymous namespace)::invoke_client_finalizer(rocprofiler_client_id_t)
@ 0x7f3ea357e571 std::once_flag::_Prepare_execution::_Prepare_execution<std::call_oncerocprofiler::registration::finalize()::{lambda()#1}(std::once_flag&, rocprofiler::registration::finalize()::{lambda()#1}&&)::{lambda()#1}>(rocprofiler::registration::finalize()::{lambda
@ 0x7f2454bb5e98 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f3ea3099ee8 (unknown)
@ 0x7f3ea3579737 rocprofiler::registration::finalize()
@ 0x7f3ea3e3f461 finalize_rocprofv3(std::basic_string_view<char, std::char_traits >)
@ 0x7f24549523bc rocprofiler::tool::sql::execute_raw_sql_statements_impl(sqlite3*, std::basic_string_view<char, std::char_traits >, int ()(void, int, char**, char**), void*, int)
@ 0x7f3ea3045a56 __cxa_finalize
@ 0x7f2454905e8f rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f3ea3e1d997 (unknown)
W20251030 09:30:51.805797 139906822775424 tool.cpp:2776] [PPID=142014][PID=142107][TID=142107][rocprofv3_error_signal_handler] rocprofv3 caught signal 6...
W20251030 09:30:51.805820 139906822775424 tool.cpp:2799] [PPID=142014][PID=142107][TID=142107][rocprofv3_error_signal_handler] rocprofv3 will wait for 0 children to exit
W20251030 09:30:51.805826 139906822775424 tool.cpp:2814] [PPID=142014][PID=142107][TID=142107][rocprofv3_error_signal_handler] rocprofv3 finalizing after signal 6...
W20251030 09:30:51.805837 139906822775424 tool.cpp:2855] [PPID=142014][PID=142107][TID=142107][rocprofv3_error_signal_handler] rocprofv3 found chained signal handler for 6... executing chained sigaction (SIGINFO)
*** Aborted at 1761816651 (unix time) try "date -d @1761816651" if you are using GNU date ***
PC: @ 0x7f3ea30969fc pthread_kill
*** SIGABRT (@0x22b1b) received by PID 142107 (TID 0x7f3e98792280) from PID 142107; stack trace: ***
@ 0x7f3ea3099ee8 (unknown)
@ 0x7f3ea41e55f0 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0x7f3ea3e44795 rocprofv3_error_signal_handler::{lambda()#1}::operator()() const
@ 0x7f2454917de8 rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f3ea3099ee8 (unknown)
@ 0x7f2454831ec6 tool_fini(void*)
@ 0x7f3ea3e3b59d rocprofv3_error_signal_handler
@ 0x7f2453f58333 rocprofiler::registration::(anonymous namespace)::invoke_client_finalizer(rocprofiler_client_id_t)
@ 0x7f3ea3042520 (unknown)
@ 0x7f2453f58571 std::once_flag::_Prepare_execution::_Prepare_execution<std::call_oncerocprofiler::registration::finalize()::{lambda()#1}(std::once_flag&, rocprofiler::registration::finalize()::{lambda()#1}&&)::{lambda()#1}>(rocprofiler::registration::finalize()::{lambda
@ 0x7f3ea30969fc pthread_kill
@ 0x7f2453a99ee8 (unknown)
@ 0x7f3ea3042476 raise
@ 0x7f2453f53737 rocprofiler::registration::finalize()
@ 0x7f3ea30287f3 abort
@ 0x7f3ea3e1a93d google::DumpStackTraceAndExit() [clone .cold]
@ 0x7f2454819461 finalize_rocprofv3(std::basic_string_view<char, std::char_traits >)
@ 0x7f2453a45a56 __cxa_finalize
@ 0x7f3ea41dad40 google::LogMessage::Fail()
@ 0x7f3ea41dbe98 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f24547f7997 (unknown)
W20251030 09:30:51.816354 139793821794944 tool.cpp:2776] [PPID=142014][PID=142108][TID=142108][rocprofv3_error_signal_handler] rocprofv3 caught signal 6...
W20251030 09:30:51.816383 139793821794944 tool.cpp:2799] [PPID=142014][PID=142108][TID=142108][rocprofv3_error_signal_handler] rocprofv3 will wait for 0 children to exit
W20251030 09:30:51.816389 139793821794944 tool.cpp:2814] [PPID=142014][PID=142108][TID=142108][rocprofv3_error_signal_handler] rocprofv3 finalizing after signal 6...
W20251030 09:30:51.816398 139793821794944 tool.cpp:2855] [PPID=142014][PID=142108][TID=142108][rocprofv3_error_signal_handler] rocprofv3 found chained signal handler for 6... executing chained sigaction (SIGINFO)
*** Aborted at 1761816651 (unix time) try "date -d @1761816651" if you are using GNU date ***
@ 0x7f3ea3f783bc rocprofiler::tool::sql::execute_raw_sql_statements_impl(sqlite3*, std::basic_string_view<char, std::char_traits >, int ()(void, int, char**, char**), void*, int)
PC: @ 0x7f2453a969fc pthread_kill
*** SIGABRT (@0x22b1c) received by PID 142108 (TID 0x7f2449170280) from PID 142108; stack trace: ***
@ 0x7f3ea3f2be8f rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f2453a99ee8 (unknown)
@ 0x7f2454bbf5f0 google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*)
@ 0x7f3ea3f3dde8 rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f245481e795 rocprofv3_error_signal_handler::{lambda()#1}::operator()() const
@ 0x7f3ea3e57ec6 tool_fini(void*)
@ 0x7f2453a99ee8 (unknown)
@ 0x7f3ea357e333 rocprofiler::registration::(anonymous namespace)::invoke_client_finalizer(rocprofiler_client_id_t)
@ 0x7f245481559d rocprofv3_error_signal_handler
@ 0x7f3ea357e571 std::once_flag::_Prepare_execution::_Prepare_execution<std::call_oncerocprofiler::registration::finalize()::{lambda()#1}(std::once_flag&, rocprofiler::registration::finalize()::{lambda()#1}&&)::{lambda()#1}>(rocprofiler::registration::finalize()::{lambda
@ 0x7f2453a42520 (unknown)
@ 0x7f3ea3099ee8 (unknown)
@ 0x7f2453a969fc pthread_kill
@ 0x7f3ea3579737 rocprofiler::registration::finalize()
@ 0x7f2453a42476 raise
@ 0x7f3ea3e3f461 finalize_rocprofv3(std::basic_string_view<char, std::char_traits >)
@ 0x7f2453a287f3 abort
@ 0x7f3ea3045a56 __cxa_finalize
@ 0x7f24547f493d google::DumpStackTraceAndExit() [clone .cold]
@ 0x7f3ea3e1d997 (unknown)
@ 0x7f2454bb4d40 google::LogMessage::Fail()
@ 0x7f2454bb5e98 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f24549523bc rocprofiler::tool::sql::execute_raw_sql_statements_impl(sqlite3*, std::basic_string_view<char, std::char_traits >, int ()(void, int, char**, char**), void*, int)
@ 0x7f2454905e8f rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f2454917de8 rocprofiler::tool::write_rocpd(rocprofiler::tool::output_config const&, rocprofiler::tool::metadata const&, std::vector<rocprofiler::tool::agent_info, std::allocatorrocprofiler::tool::agent_info > const&, rocprofiler::tool::generator<rocprofiler::tool::t
@ 0x7f2454831ec6 tool_fini(void*)
@ 0x7f2453f58333 rocprofiler::registration::(anonymous namespace)::invoke_client_finalizer(rocprofiler_client_id_t)
@ 0x7f2453f58571 std::once_flag::_Prepare_execution::_Prepare_execution<std::call_oncerocprofiler::registration::finalize()::{lambda()#1}(std::once_flag&, rocprofiler::registration::finalize()::{lambda()#1}&&)::{lambda()#1}>(rocprofiler::registration::finalize()::{lambda
@ 0x7f2453a99ee8 (unknown)
@ 0x7f2453f53737 rocprofiler::registration::finalize()
@ 0x7f2454819461 finalize_rocprofv3(std::basic_string_view<char, std::char_traits >)
@ 0x7f2453a45a56 __cxa_finalize
@ 0x7f24547f7997 (unknown)
W1030 09:31:16.230000 142014 torch/multiprocessing/spawn.py:160] Terminating process 142107 via signal SIGTERM
Traceback (most recent call last):
File "/home/tmp/iris/examples/00_load/load_bench.py", line 296, in
main()
File "/home/tmp/iris/examples/00_load/load_bench.py", line 287, in main
mp.spawn(
File "/opt/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/opt/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/opt/venv/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT
W20251030 09:31:16.359694 139813742502528 simple_timer.cpp:55] [rocprofv3] 'python examples/00_load/load_bench.py --num_ranks 2' :: 37.664389 sec
W20251030 09:31:16.361439 139813742502528 simple_timer.cpp:55] [rocprofv3] tool finalization :: 0.001590 sec
Fatal error: Command '['python', 'examples/00_load/load_bench.py', '--num_ranks', '2']' returned non-zero exit status 1.
W20251030 09:31:16.570020 139751244759680 simple_timer.cpp:55] [rocprofv3] '/opt/venv/bin/python -c from multiprocessing.resource_tracker import main;main(3)' :: 36.195495 sec
W20251030 09:31:16.571160 139751244759680 simple_timer.cpp:55] [rocprofv3] tool finalization :: 0.001032 sec

The same command works fine when profiling compute kernels.Since communication kernels usually require multiple processes (multiple ranks) to run cooperatively, I suspect the issue may be related to using Thread Trace in a multi-process environment.

Questions
Does Thread Trace support usage in multi-process (multi-rank) communication kernels?
If yes, how should it be correctly configured or enabled?
If not, what are the recommended methods to debug or analyze the performance of multi-process communication kernels?

Metadata

Metadata

Assignees

No one assigned

    Labels

    examplesExamples showcasing Iris APIs and usageirisIris project issue

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions