Skip to content

Exp callback 0.12.0 #87

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 3 commits into
base: master
Choose a base branch
from
Draft

Exp callback 0.12.0 #87

wants to merge 3 commits into from

Conversation

jfedorov
Copy link
Contributor

Creating this pull request - so we can discuss CallBack API here.

Anyone, the feedback is very welcome!

After we more or less agree on the API - I will publish implementation in this branch.

thank you

Signed-off-by: jfedorov <julia.fedorova@intel.com>
- improve enum
- add comments
- fix api on getting external ids

Signed-off-by: jfedorov <julia.fedorova@intel.com>
…mand queue execute

Signed-off-by: jfedorov <julia.fedorova@intel.com>
*/
typedef enum _pti_internal_event_type {
PTI_INTERNAL_EVENT_TYPE_INFO = 0,
PTI_INTERNAL_EVENT_TYPE_WARNING = 1, // one or few records data inconsistences, or other

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could use some info about how it would be used.

Knowing that something is inconsistent but not what doesn't seem very useful.

Copy link
Contributor Author

@jfedorov jfedorov Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As PTI depends on many lower level components.. there could be unexpected situations - due to either bugs (in PTI and others) or some missing "synchronizations" between components.

This might result in two major issues

  1. PTI crashes application
  2. PTI delivers unreliable data
    We have to absolutely avoid 1). As well as much as possible avoid 2) - if PTI knows that something went wrong and it effects validity of data - it can report about it.
    We don't want user to look into unreliable data but we better mark them.

Underneath PTI traces a lot of "paths" - some paths might be better tested/ simpler/ more frequently used than the others.
So when data on "frequent" and "reliable" paths are OK, some other (not tested well) might experience issues.

But we don't want to throw all collected data - but rather mark those that PTI knows that are invalid and also provide context where they were generated.
So it will be WARNING. And message will have details.

The recent example: in some situations expected events for memory operations are not signaled - so their Timestamps are not available. I think that in this situation - we can use this mechanism and notify a user (a tool). Of course, if it wants (subscribes).

Or it could be that something went completely out of control - say, PTI is starting GPU Sampling data collection, trying to open a temp file but there is no permission or other issue...
That could be CRITICAL. PTI shuts down its all operations. Each PTI call would return error (hopefully descriptive enough). But CRITICAL event will have a message with more details as well some info on the context.

Does it make any sense?

typedef struct _pti_gpu_op_launch_detail {
pti_operation_kind _operation_kind; // kind of the operation: kernel, mem op
uint32_t _correlation_id; // correlation IDs of operation append
const char* _kernel_name; // type of the memcpy operation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The field is _kernel_name and the comment says type of memcpy operation. These seem inconsistent.

Rather than a kernel name, I would prefer something like a kernel handle that can be more readily correlated with machine code.

Copy link
Contributor Author

@jfedorov jfedorov Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning L0 kernel handle would be OK.
Would the tool then try to get the kernel binary?
I will look how to do it in a safe way.

The comment // type of memcpy.. is definitely a typo - thanks for spotting it!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would want a callback when the binary was loaded and unloaded.

At the time a kernel is launched, I only want to look up the kernel handle and associate the launch with the kernel in the binary, turning it into a (load module id, offset) pair. Post-mortem, HPCToolkit uses a recorded copy of the binary to interpret all of the measurement data.

typedef struct _pti_apicall_callback_data {
pti_callback_domain _domain; // domain of the callback
pti_backend_submit_type _submit_type; // Immediate command list, Command Queue execute,..
pti_callback_site _site; // ENTER or EXIT

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the name phase rather than site. I realize that NVIDIA uses site. AMD uses phase. This is more descriptive since unless the enter/exit callbacks occur in different calling contexts in user code.

Copy link
Contributor Author

@jfedorov jfedorov Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense - will use the name phase.

Question - do you suggest to keep both ENTER/EXIT phase-s ?
Asking because some internal feedback was to simplify the APIs setting callbacks..E.g.

pti_result PTI_EXPORT
ptiCallbackEnableDomain(pti_callback_subscriber subscriber,
                        pti_callback_domain  domain,
                        uint32_t enter_cb,
                        uint32_t exit_cb);

avoiding passing phases enter_cb and exit_cb ..
and probably having by default only one Enter phase enabled..
and having may be separate API for each phase..
?


typedef struct _pti_internal_callback_data {
pti_callback_domain _domain; // domain of the callback
pti_callback_site _site; // THREAD START/END or INTERNAL EVENT

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer phase rather than site, as described in prior comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will make phase.

@@ -0,0 +1,265 @@
//==============================================================

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, what is described in the header file seems fine. I have a few questions though:

  • How would GPU hardware counters be configured? Would there be callbacks before/after kernel launches (as in AMD's interface) to set counter configurations determined previously?
  • There don't seem to be callbacks reporting code loading/unloading. Is that just one of the driver operations reported, or is something different required?
  • Is there a guarantee that the callbacks occur in the thread initiating the operation (e.g. a callstack unwind there is meaningful)?
  • There is another API in PTI-GPU for reporting the completion of GPU operations. That also requires callbacks. Perhaps this file should be named the pti_sync_callbacks.h, since the callbacks are synchronous in the thread initiating an operation (or so I believe).

A concern about using it is that there is no API for tool initialization. With OpenMP target programs, static constructors load code into a GPU, requiring the tool to be on at that point. When would one register these callbacks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmellorcrummey, thank you very much for your review! we will work to address your comments in the next couple of days and push the update.
Let me start from the end:

A concern about using it is that there is no API for tool initialization. With OpenMP target programs, static constructors load code into a GPU, requiring the tool to be on at that point. When would one register these callbacks?

Today's approach that PTI initializes lazily on the first call to it. So far it is working.. But we have not tested on OpenMP programs yet. We will look into it.
if you think that lazy initialization would not work - please, describe why.

There is another API in PTI-GPU for reporting the completion of GPU operations. That also requires callbacks. Perhaps this file should be named the pti_sync_callbacks.h, since the callbacks are synchronous in the thread initiating an operation (or so I believe).

Make sense. Will discuss in the team and, probably, will rename this one.

Is there a guarantee that the callbacks occur in the thread initiating the operation (e.g. a callstack unwind there is meaningful)?

Yes. So far we have exactly this synchronous implementation in mind. (even removed thread_id from all Callback structures)

There don't seem to be callbacks reporting code loading/unloading. Is that just one of the driver operations reported, or is something different required?

Such callback make sense. Let us look into it.

How would GPU hardware counters be configured? Would there be callbacks before/after kernel launches (as in AMD's interface) to set counter configurations determined previously?

We have not put yet a lot of thoughts into how HW counters would "cooperate" with Callback APIs. The thoughts were that we will make the first experimental version of Callback and on the way - will think about how make them work together. Does it make sense?

Do we understand correctly that your primary interest GPU EU Stall sampling and not time-based streaming counters?

Copy link

@jmellorcrummey jmellorcrummey Apr 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today's approach that PTI initializes lazily on the first call to it. So far it is working.. But we have not tested on OpenMP programs yet. We will look into it. if you think that lazy initialization would not work - please, describe why.

I am not concerned about initialization of the PTI-GPU monitoring interface itself. I am concerned about initialization of a tool using PTI-GPU. When launching an application with HPCToolkit, before we collect ANY performance data, we initialize HPCToolkit's measurement subsystem for all of the desired data collection.

In working with AMD, we helped them design two APIs to support initialization of client tools. We would like essentially the same capabilities in PTI-GPU.

The implementation using such primitives is a bit more subtle than it appears at first glance. Basically, the initialization for HPCToolkit and PTI-GPU need to rendezvous and both initialize at the same time.

Do we understand correctly that your primary interest GPU EU Stall sampling and not time-based streaming counters?

Our current interests are threefold

  • we want to collect information about GPU operations using the PTI-View activity trace
  • we want to collect GPU EU stall samples
  • we want to collect hardware counter metrics for individual kernels (not using the streaming sampling)

In the future, we expect that we would like to collect time-based streaming metrics as well if we are not collecting HW counter metrics for individual kernels.

* @return pti_result
*/
pti_result PTI_EXPORT
ptiCallbackGetExternalCorrelationIds(uint32_t* number,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to retrieve correlation ids set by the runtime system? If so, what is external about these ids?

From HPCToolkit's perspective, I want to be able to specify a 64-bit correlation id that represents a (thread id, sequence number) pair that I can use to direct monitoring information back to the initiating thread. I am not sure what your vision is for how tools will work with correlation ids using this API.

Both NVIDIA and AMD APIs provide the ability to either push/pop a correlation id for an operation before/after launch, or receive a singleton callback where a tool can specify an external correlation id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also external IDs that user can set up and with their help make a correlation of calling context with low level APIs and GPU operations. Please, see the API below.
It looked useful if in the Callback PTI can report external IDs from the top of a stack.
So that this could immediately correlate low-level API/operation with the calling context.
Of course it means that a tool is doing some instrumentation of a traced application.
if such mechanism is not required at the moment - we can postpone exposing/implementing it.

/**
* @brief External correlation kinds
*/
typedef enum _pti_view_external_kind {
PTI_VIEW_EXTERNAL_KIND_INVALID = 0, //!< Invalid external kind
PTI_VIEW_EXTERNAL_KIND_UNKNOWN = 1, //!< Unknown external kind
PTI_VIEW_EXTERNAL_KIND_CUSTOM_0 = 2, //!< Custom external kind
PTI_VIEW_EXTERNAL_KIND_CUSTOM_1 = 3, //!< Custom external kind
PTI_VIEW_EXTERNAL_KIND_CUSTOM_2 = 4, //!< Custom external kind
PTI_VIEW_EXTERNAL_KIND_CUSTOM_3 = 5, //!< Custom external kind
} pti_view_external_kind;

* @brief External Correlation View record type
*/
typedef struct pti_view_record_external_correlation {
pti_view_record_base _view_kind; //!< Base record
uint32_t _correlation_id; //!< ID that correlates this record with records
//!< of other Views
uint64_t _external_id; //!< ID provided by user, marking an external
//!< to PTI operation
pti_view_external_kind _external_kind;
} pti_view_record_external_correlation;

* @brief Pushes ExternelCorrelationId kind and id for generation of external correlation records
*
* @param external_kind
* @param external_id
* @return pti_result
*/
pti_result PTI_EXPORT
ptiViewPushExternalCorrelationId(pti_view_external_kind external_kind, uint64_t external_id);
/**
* @brief Pops ExternelCorrelationId kind and id for generation of external correlation records
*
* @param external_kind
* @param external_id
* @return pti_result
*/
pti_result PTI_EXPORT
ptiViewPopExternalCorrelationId(pti_view_external_kind external_kind, uint64_t* p_external_id);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants