Exp callback 0.12.0 #87
base: master
Conversation
Signed-off-by: jfedorov <julia.fedorova@intel.com>
- improve enum
- add comments
- fix API on getting external IDs
Signed-off-by: jfedorov <julia.fedorova@intel.com>
…mand queue execute Signed-off-by: jfedorov <julia.fedorova@intel.com>
 */
typedef enum _pti_internal_event_type {
  PTI_INTERNAL_EVENT_TYPE_INFO = 0,
  PTI_INTERNAL_EVENT_TYPE_WARNING = 1,  // one or a few records have data inconsistencies, or other
This could use some info about how it would be used.
Knowing that something is inconsistent, but not what, doesn't seem very useful.
As PTI depends on many lower-level components, unexpected situations can arise, due either to bugs (in PTI or elsewhere) or to some missing "synchronization" between components.
This can result in two major issues:
- PTI crashes the application
- PTI delivers unreliable data
We absolutely have to avoid 1), and avoid 2) as much as possible: if PTI knows that something went wrong and it affects the validity of the data, it can report it.
We don't want the user to look at unreliable data, so we should at least mark it.
Underneath, PTI traces a lot of "paths"; some paths are better tested, simpler, or more frequently used than others.
So while data on the "frequent" and "reliable" paths is OK, other (less well tested) paths might experience issues.
But we don't want to throw away all collected data; rather, we mark the records that PTI knows are invalid and also provide context about where they were generated.
So it will be a WARNING, and the message will have the details.
A recent example: in some situations the expected events for memory operations are not signaled, so their timestamps are not available. I think in this situation we can use this mechanism to notify a user (a tool) - of course, only if it subscribes.
Or something could go completely out of control - say, PTI is starting GPU sampling data collection and tries to open a temp file, but there is a permission or other issue...
That would be CRITICAL: PTI shuts down all of its operations, and each PTI call returns an error (hopefully descriptive enough). The CRITICAL event will also carry a message with more details, as well as some info on the context.
Does it make any sense?
typedef struct _pti_gpu_op_launch_detail {
  pti_operation_kind _operation_kind;  // kind of the operation: kernel, mem op
  uint32_t _correlation_id;            // correlation IDs of operation append
  const char* _kernel_name;            // type of the memcpy operation
The field is _kernel_name and the comment says type of memcpy operation. These seem inconsistent.
Rather than a kernel name, I would prefer something like a kernel handle that can be more readily correlated with machine code.
Returning the L0 kernel handle would be OK.
Would the tool then try to get the kernel binary?
I will look at how to do that in a safe way.
The comment // type of memcpy.. is definitely a typo - thanks for spotting it!
I would want a callback when the binary was loaded and unloaded.
At the time a kernel is launched, I only want to look up the kernel handle and associate the launch with the kernel in the binary, turning it into a (load module id, offset) pair. Post-mortem, HPCToolkit uses a recorded copy of the binary to interpret all of the measurement data.
typedef struct _pti_apicall_callback_data {
  pti_callback_domain _domain;           // domain of the callback
  pti_backend_submit_type _submit_type;  // Immediate command list, Command Queue execute,..
  pti_callback_site _site;               // ENTER or EXIT
I prefer the name phase rather than site. I realize that NVIDIA uses site; AMD uses phase. Phase is more descriptive, unless the enter/exit callbacks occur in different calling contexts in user code.
Makes sense - will use the name phase.
Question - do you suggest keeping both ENTER and EXIT phases?
Asking because some internal feedback was to simplify the APIs that set callbacks. E.g.
pti_result PTI_EXPORT
ptiCallbackEnableDomain(pti_callback_subscriber subscriber,
                        pti_callback_domain domain,
                        uint32_t enter_cb,
                        uint32_t exit_cb);
avoiding passing the enter_cb and exit_cb phases, having maybe only the ENTER phase enabled by default, and maybe having a separate API for each phase?
typedef struct _pti_internal_callback_data {
  pti_callback_domain _domain;  // domain of the callback
  pti_callback_site _site;      // THREAD START/END or INTERNAL EVENT
I prefer phase rather than site, as described in prior comment.
Will make it phase.
@@ -0,0 +1,265 @@
//============================================================== |
In general, what is described in the header file seems fine. I have a few questions though:
- How would GPU hardware counters be configured? Would there be callbacks before/after kernel launches (as in AMD's interface) to set counter configurations determined previously?
- There don't seem to be callbacks reporting code loading/unloading. Is that just one of the driver operations reported, or is something different required?
- Is there a guarantee that the callbacks occur in the thread initiating the operation (e.g. a callstack unwind there is meaningful)?
- There is another API in PTI-GPU for reporting the completion of GPU operations, which also requires callbacks. Perhaps this file should be named pti_sync_callbacks.h, since the callbacks are synchronous in the thread initiating an operation (or so I believe).
A concern about using it is that there is no API for tool initialization. With OpenMP target programs, static constructors load code into a GPU, requiring the tool to be on at that point. When would one register these callbacks?
@jmellorcrummey, thank you very much for your review! we will work to address your comments in the next couple of days and push the update.
Let me start from the end:
A concern about using it is that there is no API for tool initialization. With OpenMP target programs, static constructors load code into a GPU, requiring the tool to be on at that point. When would one register these callbacks?
Today's approach is that PTI initializes lazily, on the first call into it. So far that is working, but we have not tested OpenMP programs yet. We will look into it.
If you think that lazy initialization would not work, please describe why.
There is another API in PTI-GPU for reporting the completion of GPU operations. That also requires callbacks. Perhaps this file should be named the pti_sync_callbacks.h, since the callbacks are synchronous in the thread initiating an operation (or so I believe).
Makes sense. Will discuss with the team and will probably rename it.
Is there a guarantee that the callbacks occur in the thread initiating the operation (e.g. a callstack unwind there is meaningful)?
Yes. So far we have exactly this synchronous implementation in mind (we even removed thread_id from all callback structures).
There don't seem to be callbacks reporting code loading/unloading. Is that just one of the driver operations reported, or is something different required?
Such a callback makes sense. Let us look into it.
How would GPU hardware counters be configured? Would there be callbacks before/after kernel launches (as in AMD's interface) to set counter configurations determined previously?
We have not yet put a lot of thought into how HW counters would "cooperate" with the Callback APIs. The thinking was to make a first experimental version of the Callback API and, along the way, figure out how to make them work together. Does that make sense?
Do we understand correctly that your primary interest is GPU EU stall sampling and not time-based streaming counters?
Today's approach is that PTI initializes lazily, on the first call into it. So far that is working, but we have not tested OpenMP programs yet. We will look into it. If you think that lazy initialization would not work, please describe why.
I am not concerned about initialization of the PTI-GPU monitoring interface itself. I am concerned about initialization of a tool using PTI-GPU. When launching an application with HPCToolkit, before we collect ANY performance data, we initialize HPCToolkit's measurement subsystem for all of the desired data collection.
In working with AMD, we helped them design two APIs to support initialization of client tools. We would like essentially the same capabilities in PTI-GPU.
- if PTI-GPU initialization is triggered before HPCToolkit's initialization, we want an upcall from PTI-GPU where we can trigger HPCToolkit's initialization. Look at the design for AMD's rocprofiler_configure function (https://github.com/ROCm/rocprofiler-sdk/blob/a7f96dde29389b77b1f056953a962ca1fb11eb62/source/include/rocprofiler-sdk/registration.h#L221). This is a function that their rocprofiler-sdk will invoke to enable a tool to initialize itself.
- if HPCToolkit's initialization is triggered before PTI-GPU, e.g. a DPC++ program may only trigger PTI-GPU initialization the first time it tries to touch a GPU, then there needs to be an API that HPCToolkit can invoke to force initialization of PTI-GPU. Look at the design for AMD's rocprofiler_force_configure function (https://github.com/ROCm/rocprofiler-sdk/blob/a7f96dde29389b77b1f056953a962ca1fb11eb62/source/include/rocprofiler-sdk/registration.h#L255).
The implementation using such primitives is a bit more subtle than it appears at first glance. Basically, the initialization for HPCToolkit and PTI-GPU need to rendezvous and both initialize at the same time.
Do we understand correctly that your primary interest is GPU EU stall sampling and not time-based streaming counters?
Our current interests are threefold:
- we want to collect information about GPU operations using the PTI-View activity trace
- we want to collect GPU EU stall samples
- we want to collect hardware counter metrics for individual kernels (not using the streaming sampling)
In the future, we expect that we would like to collect time-based streaming metrics as well if we are not collecting HW counter metrics for individual kernels.
 * @return pti_result
 */
pti_result PTI_EXPORT
ptiCallbackGetExternalCorrelationIds(uint32_t* number,
Is this to retrieve correlation ids set by the runtime system? If so, what is external about these ids?
From HPCToolkit's perspective, I want to be able to specify a 64-bit correlation id that represents a (thread id, sequence number) pair that I can use to direct monitoring information back to the initiating thread. I am not sure what your vision is for how tools will work with correlation ids using this API.
Both NVIDIA and AMD APIs provide the ability to either push/pop a correlation id for an operation before/after launch, or receive a singleton callback where a tool can specify an external correlation id.
There are also external IDs that the user can set up; with their help one can correlate the calling context with low-level APIs and GPU operations. Please see the API below.
It looked useful if, in the callback, PTI could report the external IDs from the top of the stack, so that a low-level API/operation could immediately be correlated with the calling context.
Of course this means the tool is doing some instrumentation of the traced application.
If such a mechanism is not required at the moment, we can postpone exposing/implementing it.
pti-gpu/sdk/include/pti/pti_view.h
Lines 98 to 108 in 2688439
/**
 * @brief External correlation kinds
 */
typedef enum _pti_view_external_kind {
  PTI_VIEW_EXTERNAL_KIND_INVALID = 0,   //!< Invalid external kind
  PTI_VIEW_EXTERNAL_KIND_UNKNOWN = 1,   //!< Unknown external kind
  PTI_VIEW_EXTERNAL_KIND_CUSTOM_0 = 2,  //!< Custom external kind
  PTI_VIEW_EXTERNAL_KIND_CUSTOM_1 = 3,  //!< Custom external kind
  PTI_VIEW_EXTERNAL_KIND_CUSTOM_2 = 4,  //!< Custom external kind
  PTI_VIEW_EXTERNAL_KIND_CUSTOM_3 = 5,  //!< Custom external kind
} pti_view_external_kind;
pti-gpu/sdk/include/pti/pti_view.h
Lines 289 to 298 in 2688439
 * @brief External Correlation View record type
 */
typedef struct pti_view_record_external_correlation {
  pti_view_record_base _view_kind;  //!< Base record
  uint32_t _correlation_id;         //!< ID that correlates this record with records
                                    //!< of other Views
  uint64_t _external_id;            //!< ID provided by user, marking an external
                                    //!< to PTI operation
  pti_view_external_kind _external_kind;
} pti_view_record_external_correlation;
pti-gpu/sdk/include/pti/pti_view.h
Lines 415 to 432 in 2688439
 * @brief Pushes ExternelCorrelationId kind and id for generation of external correlation records
 *
 * @param external_kind
 * @param external_id
 * @return pti_result
 */
pti_result PTI_EXPORT
ptiViewPushExternalCorrelationId(pti_view_external_kind external_kind, uint64_t external_id);

/**
 * @brief Pops ExternelCorrelationId kind and id for generation of external correlation records
 *
 * @param external_kind
 * @param external_id
 * @return pti_result
 */
pti_result PTI_EXPORT
ptiViewPopExternalCorrelationId(pti_view_external_kind external_kind, uint64_t* p_external_id);
Creating this pull request so we can discuss the Callback API here.
Anyone - feedback is very welcome!
After we more or less agree on the API, I will publish the implementation in this branch.
Thank you!