Exp callback 0.12.0 #87
base: master
Conversation
Signed-off-by: jfedorov <julia.fedorova@intel.com>
- improve enum
- add comments
- fix API on getting external IDs
Signed-off-by: jfedorov <julia.fedorova@intel.com>
…mand queue execute Signed-off-by: jfedorov <julia.fedorova@intel.com>
 */
typedef enum _pti_internal_event_type {
  PTI_INTERNAL_EVENT_TYPE_INFO = 0,
  PTI_INTERNAL_EVENT_TYPE_WARNING = 1,  // one or a few records have data inconsistencies, or other
This could use some info about how it would be used.
Knowing that something is inconsistent, but not what, doesn't seem very useful.
As PTI depends on many lower-level components, unexpected situations can arise, due either to bugs (in PTI or elsewhere) or to some missing "synchronization" between components.
This can result in two major issues:
- PTI crashes the application
- PTI delivers unreliable data
We absolutely have to avoid 1), and avoid 2) as much as possible: if PTI knows that something went wrong and it affects the validity of the data, it can report it.
We don't want the user to look at unreliable data, so we should at least mark it.
Underneath, PTI traces a lot of "paths"; some paths are better tested, simpler, or more frequently used than others.
So while data on the "frequent" and "reliable" paths is OK, other (less well tested) paths might experience issues.
But we don't want to throw away all collected data; rather, we mark the records that PTI knows are invalid and also provide context about where they were generated.
So it will be a WARNING, and the message will have the details.
A recent example: in some situations the expected events for memory operations are not signaled, so their timestamps are not available. I think in this situation we can use this mechanism to notify a user (a tool) - of course, only if it subscribes.
Or something could go completely out of control - say, PTI is starting GPU sampling data collection and tries to open a temp file, but there is a permission or other issue...
That would be CRITICAL: PTI shuts down all of its operations, and each PTI call returns an error (hopefully descriptive enough). The CRITICAL event will also carry a message with more details, as well as some info on the context.
Does it make any sense?
typedef struct _pti_gpu_op_launch_detail {
  pti_operation_kind _operation_kind;  // kind of the operation: kernel, mem op
  uint32_t _correlation_id;            // correlation IDs of operation append
  const char* _kernel_name;            // type of the memcpy operation
The field is _kernel_name and the comment says type of memcpy operation. These seem inconsistent.
Rather than a kernel name, I would prefer something like a kernel handle that can be more readily correlated with machine code.
Returning the L0 kernel handle would be OK.
Would the tool then try to get the kernel binary?
I will look at how to do that in a safe way.
The comment // type of memcpy.. is definitely a typo - thanks for spotting it!
I would want a callback when the binary was loaded and unloaded.
At the time a kernel is launched, I only want to look up the kernel handle and associate the launch with the kernel in the binary, turning it into a (load module id, offset) pair. Post-mortem, HPCToolkit uses a recorded copy of the binary to interpret all of the measurement data.
typedef struct _pti_apicall_callback_data {
  pti_callback_domain _domain;           // domain of the callback
  pti_backend_submit_type _submit_type;  // Immediate command list, Command Queue execute,..
  pti_callback_site _site;               // ENTER or EXIT
I prefer the name phase rather than site. I realize that NVIDIA uses site; AMD uses phase. Phase is more descriptive, unless the enter/exit callbacks occur in different calling contexts in user code.
Makes sense - will use the name phase.
Question - do you suggest keeping both ENTER and EXIT phases?
Asking because some internal feedback was to simplify the APIs that set callbacks. E.g.
pti_result PTI_EXPORT
ptiCallbackEnableDomain(pti_callback_subscriber subscriber,
                        pti_callback_domain domain,
                        uint32_t enter_cb,
                        uint32_t exit_cb);
avoiding passing the enter_cb and exit_cb phases, having maybe only the ENTER phase enabled by default, and maybe having a separate API for each phase?
typedef struct _pti_internal_callback_data {
  pti_callback_domain _domain;  // domain of the callback
  pti_callback_site _site;      // THREAD START/END or INTERNAL EVENT
I prefer phase rather than site, as described in prior comment.
Will make it phase.
@@ -0,0 +1,265 @@
//============================================================== |
In general, what is described in the header file seems fine. I have a few questions though:
- How would GPU hardware counters be configured? Would there be callbacks before/after kernel launches (as in AMD's interface) to set counter configurations determined previously?
- There don't seem to be callbacks reporting code loading/unloading. Is that just one of the driver operations reported, or is something different required?
- Is there a guarantee that the callbacks occur in the thread initiating the operation (e.g. a callstack unwind there is meaningful)?
- There is another API in PTI-GPU for reporting the completion of GPU operations, which also requires callbacks. Perhaps this file should be named pti_sync_callbacks.h, since the callbacks are synchronous in the thread initiating an operation (or so I believe).
A concern about using it is that there is no API for tool initialization. With OpenMP target programs, static constructors load code into a GPU, requiring the tool to be on at that point. When would one register these callbacks?
@jmellorcrummey, thank you very much for your review! we will work to address your comments in the next couple of days and push the update.
Let me start from the end:
A concern about using it is that there is no API for tool initialization. With OpenMP target programs, static constructors load code into a GPU, requiring the tool to be on at that point. When would one register these callbacks?
Today's approach is that PTI initializes lazily, on the first call into it. So far that is working, but we have not tested OpenMP programs yet. We will look into it.
If you think that lazy initialization would not work, please describe why.
There is another API in PTI-GPU for reporting the completion of GPU operations. That also requires callbacks. Perhaps this file should be named the pti_sync_callbacks.h, since the callbacks are synchronous in the thread initiating an operation (or so I believe).
Makes sense. Will discuss with the team and will probably rename it.
Is there a guarantee that the callbacks occur in the thread initiating the operation (e.g. a callstack unwind there is meaningful)?
Yes. So far we have exactly this synchronous implementation in mind (we even removed thread_id from all callback structures).
There don't seem to be callbacks reporting code loading/unloading. Is that just one of the driver operations reported, or is something different required?
Such a callback makes sense. Let us look into it.
How would GPU hardware counters be configured? Would there be callbacks before/after kernel launches (as in AMD's interface) to set counter configurations determined previously?
We have not yet put a lot of thought into how HW counters would "cooperate" with the Callback APIs. The thinking was to make a first experimental version of the Callback API and, along the way, figure out how to make them work together. Does that make sense?
Do we understand correctly that your primary interest is GPU EU stall sampling and not time-based streaming counters?
Today's approach is that PTI initializes lazily, on the first call into it. So far that is working, but we have not tested OpenMP programs yet. We will look into it. If you think that lazy initialization would not work, please describe why.
I am not concerned about initialization of the PTI-GPU monitoring interface itself. I am concerned about initialization of a tool using PTI-GPU. When launching an application with HPCToolkit, before we collect ANY performance data, we initialize HPCToolkit's measurement subsystem for all of the desired data collection.
In working with AMD, we helped them design two APIs to support initialization of client tools. We would like essentially the same capabilities in PTI-GPU.
- if PTI-GPU initialization is triggered before HPCToolkit's initialization, we want an upcall from PTI-GPU where we can trigger HPCToolkit's initialization. Look at the design for AMD's rocprofiler_configure function (https://github.com/ROCm/rocprofiler-sdk/blob/a7f96dde29389b77b1f056953a962ca1fb11eb62/source/include/rocprofiler-sdk/registration.h#L221). This is a function that their rocprofiler-sdk will invoke to enable a tool to initialize itself.
- if HPCToolkit's initialization is triggered before PTI-GPU, e.g. a DPC++ program may only trigger PTI-GPU initialization the first time it tries to touch a GPU, then there needs to be an API that HPCToolkit can invoke to force initialization of PTI-GPU. Look at the design for AMD's rocprofiler_force_configure function (https://github.com/ROCm/rocprofiler-sdk/blob/a7f96dde29389b77b1f056953a962ca1fb11eb62/source/include/rocprofiler-sdk/registration.h#L255).
The implementation using such primitives is a bit more subtle than it appears at first glance. Basically, the initialization for HPCToolkit and PTI-GPU need to rendezvous and both initialize at the same time.
Do we understand correctly that your primary interest is GPU EU stall sampling and not time-based streaming counters?
Our current interests are threefold:
- we want to collect information about GPU operations using the PTI-View activity trace
- we want to collect GPU EU stall samples
- we want to collect hardware counter metrics for individual kernels (not using the streaming sampling)
In the future, we expect that we would like to collect time-based streaming metrics as well if we are not collecting HW counter metrics for individual kernels.
 * @return pti_result
 */
pti_result PTI_EXPORT
ptiCallbackGetExternalCorrelationIds(uint32_t* number,
Is this to retrieve correlation ids set by the runtime system? If so, what is external about these ids?
From HPCToolkit's perspective, I want to be able to specify a 64-bit correlation id that represents a (thread id, sequence number) pair that I can use to direct monitoring information back to the initiating thread. I am not sure what your vision is for how tools will work with correlation ids using this API.
Both NVIDIA and AMD APIs provide the ability to either push/pop a correlation id for an operation before/after launch, or receive a singleton callback where a tool can specify an external correlation id.
There are also external IDs that the user can set up; with their help one can correlate the calling context with low-level APIs and GPU operations. Please see the API below.
It looked useful if, in the callback, PTI could report the external IDs from the top of the stack, so that a low-level API/operation could immediately be correlated with the calling context.
Of course this means the tool is doing some instrumentation of the traced application.
If such a mechanism is not required at the moment, we can postpone exposing/implementing it.
pti-gpu/sdk/include/pti/pti_view.h
Lines 98 to 108 in 2688439
/**
 * @brief External correlation kinds
 */
typedef enum _pti_view_external_kind {
  PTI_VIEW_EXTERNAL_KIND_INVALID = 0,   //!< Invalid external kind
  PTI_VIEW_EXTERNAL_KIND_UNKNOWN = 1,   //!< Unknown external kind
  PTI_VIEW_EXTERNAL_KIND_CUSTOM_0 = 2,  //!< Custom external kind
  PTI_VIEW_EXTERNAL_KIND_CUSTOM_1 = 3,  //!< Custom external kind
  PTI_VIEW_EXTERNAL_KIND_CUSTOM_2 = 4,  //!< Custom external kind
  PTI_VIEW_EXTERNAL_KIND_CUSTOM_3 = 5,  //!< Custom external kind
} pti_view_external_kind;
pti-gpu/sdk/include/pti/pti_view.h
Lines 289 to 298 in 2688439
 * @brief External Correlation View record type
 */
typedef struct pti_view_record_external_correlation {
  pti_view_record_base _view_kind;  //!< Base record
  uint32_t _correlation_id;         //!< ID that correlates this record with records
                                    //!< of other Views
  uint64_t _external_id;            //!< ID provided by user, marking an external
                                    //!< to PTI operation
  pti_view_external_kind _external_kind;
} pti_view_record_external_correlation;
pti-gpu/sdk/include/pti/pti_view.h
Lines 415 to 432 in 2688439
 * @brief Pushes ExternelCorrelationId kind and id for generation of external correlation records
 *
 * @param external_kind
 * @param external_id
 * @return pti_result
 */
pti_result PTI_EXPORT
ptiViewPushExternalCorrelationId(pti_view_external_kind external_kind, uint64_t external_id);

/**
 * @brief Pops ExternelCorrelationId kind and id for generation of external correlation records
 *
 * @param external_kind
 * @param external_id
 * @return pti_result
 */
pti_result PTI_EXPORT
ptiViewPopExternalCorrelationId(pti_view_external_kind external_kind, uint64_t* p_external_id);
Creating this pull request so we can discuss the Callback API here.
Anyone - feedback is very welcome!
After we more or less agree on the API, I will publish the implementation in this branch.
Thank you!