Skip to content

[MIGraphx EP] Sync AMD changes upstream #25338

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 33 commits into
base: main
Choose a base branch
from

Conversation

TedThemistokleous
Copy link
Contributor

Set of changes that were added to generate ROCm 7.0 branch pertaining to the MIGraphX Execution provider. Upstreams a bunch of changes

Adds the following features

Contrib Operator support:
groupnorm
nhwcconv
attention
skip_layernorm

Changes to EP flags
save/load has been removed for model_cache_dir
bf16 quantizer support + appropriate EP flags
hashing of saved .mxr file names based on gpu target, model name, and IO

Additional fixes involving renaming, tidy and other checks

Motivation and Context
Fixes, changes and updates to MIGraphX EP thats been done for ROCm development. Pushing this back upstream to ensure mainline onnxruntime is using the latest changes. Moving forward MIGraphX EP will be cut from the latest official release tag as a base point while also adding additional features that will be contributed back.

TedThemistokleous and others added 20 commits July 8, 2025 17:38
Needed to maintain API compatability for ROCm 6.3.3 builds
* update

* update

* Fix build error CK_BUFFER_RESOURCE_3RD_DWORD

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

* Add all archs back + gfx950

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>

---------

Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
Co-authored-by: kithumma <kiran.thumma@amd.com>
MIGraphX has enabled this support as part of the latest release. This allows Onnxruntime MIGraphX EP to parse in Attention and SkipLayernormalization operators and pass them to the MIGraphX API for further processing
* provider_options

* use migraphx_provider_option contants

* use migraphx_* constants for printing flags

* add kDeviceId
* Adding support for mxr caching similar to MGX EP on Windows

* Adding GenerateGraphId method

* Fixing more issues

* Fixing test

* Adding version file

* Adding charconv

* Adding session input names

* Add mxr filename prefix

* Fixing issues with empty model cache path

* Deleting previous approach of saving mxr files

* use option.second directly for OrtMIGraphXProviderOptions

* Revert "use option.second directly for OrtMIGraphXProviderOptions"

This reverts commit 88f13e7.

* Reordering headers and adding includes

* Changing order include

* Reordering

* another fix

* Adding gfx version hash to the mxr file name

---------

Co-authored-by: Uros Petkovic <urpektov@amd.com>
Co-authored-by: kjayapra <karthik.jayaprakash@amd.com>
Co-authored-by: Artur Wojcik <artur.wojcik@amd.com>
This fixes an issue we were seeing with calibration able using a spare reference for input. Caused us to try to use a stale reference which intermttently referenced a non existent or incorrect input causing calibration to fail or undefined behavior due to trying to use a reference address as a c_str() directly.

Added a check using the .good() call for the ifstream when reading calibration tables as well to ensure we're not only getting a valid no null stream but no other errors are present on open
* Add null check for node arguments.

* Add null check to Shape property of node arg.
…ft#117)

* Add in BF16 datatype and quantization support to MIGraphX EP

- Enable BF16 datatype in MIGraphX (migraphx_type_bf16_type)
- Leverage MIGraphX API to do BF16 quantization (migraphx::quantize_bf16)
- Add environment variable (MIGRAPHX_BF16_ENABLE)
- Add provider session options variable (migraphx_bf16_enable)
- Add python pybind API variable (migraphx_bf16_enable)

* Lintrunner pass
Signed-off-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
The onnxruntime_add_shared_library_module() command places generated DLLs in the lib directory instead of the bin on Windows. The PR changes to use onnxruntime_add_shared_library(). I checked most of the essential EPs, and they all use onnxruntime_add_shared_library() as the method to create execution provider targets. The difference between the module and non-module versions of the command is that the module version does not install .lib files for targets it creates.
@TedThemistokleous
Copy link
Contributor Author

TedThemistokleous commented Jul 9, 2025

ping @tianleiwu changes that AMD wants to upstream that are going into ROCm 7.0 release. Likely more incoming in another pR

@tianleiwu
Copy link
Contributor

@TedThemistokleous, please merge main since there is a recent change in lintrunner settting. Need lintrunner init and lintrunner -a --all-files to format the code after merge.

@TedThemistokleous
Copy link
Contributor Author

@tianleiwu done. Please let me know if there are anymore issues.

1 similar comment
@TedThemistokleous
Copy link
Contributor Author

@tianleiwu done. Please let me know if there are anymore issues.

@@ -711,15 +711,13 @@ typedef struct OrtTensorRTProviderOptions {
typedef struct OrtMIGraphXProviderOptions {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a breaking change. ONNX Runtime maintains ABI backwards compatibility. Certainly, this change breaks that. As it will only impact the MIGraphX EP, I could accept this change. So, ORT's ABI backwards compatibility promise becomes if ORT provides ABI backwards compatibility if you don't use the functions that are not backwards compatible. (A little bit funny)

The correct way of doing this is to move the whole data structure out of this header file and make it opaque(invisible), and use getters/setters to access the fields.

https://community.kde.org/Policies/Binary_Compatibility_Issues_With_C%2B%2B

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was added in an earlier commit and I assume moving forward anyone using OnnxRT will be using the latest release supported with their ROCm install since we have a mapping in the docs.

https://onnxruntime.ai/docs/execution-providers/MIGraphX-ExecutionProvider.html#requirements

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is part of ORT's public API that everyone uses, no matter if MIGRAPHX EP is used or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pad this instead if this is a concern then? We're already using this internally for our builds and release, adding two padding vars should keep things consistent since we're missing a char* and int no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is it better to have these items but tie them to nothing in the EP then and have a /*depricated */ comment for sunset flags.?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Padding is also a breaking change. Because clients built with older ORT versions will end-up with garbage data for the new fields. Please try to minimize the number of changes of this struct. Going forward you should move it to an opaque type like what CUDA EP did.

@@ -23,11 +23,11 @@ void MIGraphXAllocator::CheckDevice() const {
#endif
}

void* MIGraphXAllocator::Alloc(size_t size) {
void* MIGraphXAllocator::Alloc(const size_t size) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to add const here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We ran this through clang-tidy

Copy link
Member

@snnn snnn Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our clang-tidy configs need to be improve. Our coding style is inherited from Google's coding style See https://github.com/microsoft/onnxruntime/blob/main/docs/Coding_Conventions_and_Standards.md. The document is the golden standard.

@@ -11,27 +11,27 @@ namespace onnxruntime {

class MIGraphXAllocator : public IAllocator {
public:
MIGraphXAllocator(int device_id, const char* name)
MIGraphXAllocator(const OrtDevice::DeviceId device_id, const char* name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your linter and tidy said otherwise. These were ran through clang-tidy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I think we do not have good clang-tidy config here.

auto use_compute_command_list = (feature_levels.MaxSupportedFeatureLevel <= D3D_FEATURE_LEVEL_1_0_CORE);
// Use compute queue whenever possible on supported hardware to avoid TDR and maintain UI QoS
// Core and generic devices only have compute queues, DX11 has "immediate" submission, DX12 has both
auto use_compute_command_list = (feature_levels.MaxSupportedFeatureLevel <= D3D_FEATURE_LEVEL_1_0_CORE) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does it relate to MIGRAPHX?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a bunch of changes that are being integrated on the ROCm side for our EPs we maintain.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who made this change? Was someone from AMD or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From AMD

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to split this change to a separate PR? I need to find someone from the directml team to review this change.

file(GLOB_RECURSE onnxruntime_providers_migraphx_cc_srcs CONFIGURE_DEPENDS
"${ONNXRUNTIME_ROOT}/core/providers/migraphx/*.h"
"${ONNXRUNTIME_ROOT}/core/providers/migraphx/*.cc"
"${ONNXRUNTIME_ROOT}/core/providers/shared_library/*.h"
"${ONNXRUNTIME_ROOT}/core/providers/shared_library/*.cc"
)
source_group(TREE ${ONNXRUNTIME_ROOT}/core FILES ${onnxruntime_providers_migraphx_cc_srcs})
onnxruntime_add_shared_library_module(onnxruntime_providers_migraphx ${onnxruntime_providers_migraphx_cc_srcs})
onnxruntime_add_shared_library(onnxruntime_providers_migraphx ${onnxruntime_providers_migraphx_cc_srcs})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change it from onnxruntime_add_shared_library_module to onnxruntime_add_shared_library ? A Module Library is a plugin that may not be linked by other targets, but may be dynamically loaded at runtime using dlopen-like functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed due to development for our windows side run. Is there one way or the other preferred here? let me know so I can flow this back to the ROCm internal branch

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change should not be made, since nobody would link to this library.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a use case which does link to this. Hence why this was added and integrated back on the ROCm side. Let me know whats the optimal way to handle this. Looks like I'll need to handle the integration work on this. Appreciate your comments sand insights btw @snnn

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then where are the header files of this library? I need to know more background.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, different link type results different construction/destruction order for global vars, which is common source of crashes. It's not just a build type thing.

std::filesystem::path model_cache_file;
// empty cache path means the MXR caching is disabled - always compile
if (!model_cache_path_.empty()) {
model_cache_file = mgx_state->model_cache_dir / (mxr_filename_prefix + make_hash(input_shapes) + ".mxr");
Copy link
Member

@snnn snnn Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part won't work good on Windows if the model_cache_dir contains non-ASCII chars.
First, in your code you need to put a comment to clarify the encoding of mgx_state->model_cache_dir. Is it ANSI encoding(that maps to a Windows codepage) or UTF-8?
Usually, most multibyte strings in ORT are UTF-8 strings. Then please convert the string to a wide string and do all the concatenation/modification in UTF-16.
On Linux all these strings should be UTF-8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should all be UTF-8 I assume. We shouldn't be using non ascii characters here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that simple. You need to carefully examine the code. Add a comment to the definition of the variable.

…soft#148)

* Correct tidy warnings for MIGraphX factory

* add #pragma once

* lintrunner

* cpplint

* add mising namespace
static const char kExhaustiveTune[] = "ORT_MIGRAPHX_EXHAUSTIVE_TUNE";

}; // namespace migraphx_env_vars
constexpr auto kFP16Enable = "ORT_MIGRAPHX_FP16_ENABLE";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to follow this guideline: https://abseil.io/tips/168 You may directly use std::string_view instead of absl::string_view since we have C++17.
Or, I think you need to declare them as inline static constexpr const char*.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't that be covered by the auto here though? We're enforcing style checks that aren't in the liner?

Just want some clarity on this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, auto is discouraged to use except in the cases where the type is very obvious to see. https://google.github.io/styleguide/cppguide.html#Type_deduction
"Use type deduction only if it makes the code clearer to readers who aren't familiar with the project, or if it makes the code safer. Do not use it merely to avoid the inconvenience of writing an explicit type."

I think there it is not that obvious. The keyword inline is needed.

Copy link
Contributor Author

@TedThemistokleous TedThemistokleous Jul 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure but again, why is this style comment not part of your linter profile checks then? This isn't a technical comment but style and if you want to enforce that, make it part of your lint runner checks/lint stage.

save_compiled_path_ = info.save_model_file;
load_compiled_model_ = info.load_compiled_model;
load_compiled_path_ = info.load_model_file;
model_cache_path_ = info.model_cache_dir;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you are welcomed to use std::filesystem::path, every time when you assign a value to this type you need carefully check the source string's encoding.
On Windows we expect the source string is in native code page encoding. Otherwise if the string is encoded as UTF-8 you must manually convert it to std::wstring before assign it to a path.
Or you may change type of model_cache_dir to const ORTCHAR_T*.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd, these came from windows side changes and we haven't seen this effecting our runs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if you have a std::wstring or ORT_CHART* string, you have no problem. Otherwise try to explicitly state the encoding. Otherwise people will assume the string is a UTF-8 string.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Windows You should never assign a UTF-8 string to a std::filesystem::path.

if (load_compiled_model_ && !load_model_path_env.empty()) {
load_compiled_path_ = load_model_path_env;
LOGS_DEFAULT(WARNING) << "\nORT_MIGRAPHX_LOAD_COMPILED_PATH: " << load_compiled_path_;
const auto model_cache_path_env = GetEnvironmentVar(migraphx_env_vars::kModelCachePath);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the type explicit, and check the encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked, onnxruntime::GetEnvironmentVar returns an ANSI string, not a UTF-8 string. As you are define a new var, it's better to state it there this string is not a UTF-8 string(while most std::string variables are).

@@ -379,13 +375,16 @@ static bool IsTypeSupported(const NodeArg* node_arg) {
}
}

static bool getMIGraphXType(ONNXTensorElementDataType type,
static bool getMIGraphXType(const ONNXTensorElementDataType type,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the const

@@ -14,6 +14,23 @@

namespace onnxruntime {

namespace migraphx_provider_option {
constexpr auto kDeviceId = "device_id";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the above. Consider using std::string_view instead.

return MIGraphXExecutionProvider::CreateMIGraphXAllocator(device_id, migx_mem_limit, arena_extend_strategy, external_allocator_info, default_memory_arena_cfg);
}
} g_info;

struct MIGraphX_Provider : Provider {
void* GetInfo() override { return &g_info; }

std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(int device_id) override {
std::shared_ptr<IExecutionProviderFactory> CreateExecutionProviderFactory(const int device_id) override {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the const.

migx_options.migraphx_save_model_path = internal_options.save_model_file.c_str();
migx_options.migraphx_load_compiled_model = internal_options.load_compiled_model;
migx_options.migraphx_load_model_path = internal_options.load_model_file.c_str();
migx_options.migraphx_cache_dir = internal_options.model_cache_dir.string().c_str();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.string().c_str() returns a temporary pointer I believe. You need to store the result of internal_options.model_cache_dir.string() at somewhere.

&module) != 0) {
char buffer[MAX_PATH];
if(GetModuleFileName(module, buffer, sizeof(buffer)) != 0) {
PathRemoveFileSpec(buffer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to see if std::filesystem has a function for doing so. Then you will not need to use the low level Windows APIs.

GET_MODULE_HANDLE_EX_FLAG_UNCHANGED_REFCOUNT,
static_cast<LPCSTR>(static_cast<void*>(InitializeRegistry)),
&module) != 0) {
char buffer[MAX_PATH];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always use wide Windows APIs when possible.

Copy link
Contributor Author

@TedThemistokleous TedThemistokleous Jul 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be split between windows/linux builds then around #ifdefs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean when a Windows API has an ANSI version and Wide version, you need to use the Wide version.
The buffer here should be std::wstring.

@@ -145,6 +145,19 @@ struct MIGraphX_Provider : Provider {
}

void Initialize() override {
#ifdef _WIN32
HMODULE module = nullptr;
if(GetModuleHandleEx(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change usually is not needed. Usually it prefers the folder of the executable module that LoadLibraryEx is loading

See: https://learn.microsoft.com/en-us/windows/win32/dlls/dynamic-link-library-search-order#alternate-search-order-for-unpackaged-apps

@TedThemistokleous
Copy link
Contributor Author

@snnn rolled back latest commit since that was the cause of a few issues. Let me know if you have any more review comments.

@snnn
Copy link
Member

snnn commented Jul 16, 2025

Thanks. I don't have other comments.

@snnn
Copy link
Member

snnn commented Jul 16, 2025

Sorry, one more thing: onnxruntime/tools/ci_build/github/linux/docker/migraphx-ci-pipeline-env.Dockerfile needs to be updated:

  1. The docker file uses Ubuntu, which I could not get an approval from our internal. We need to use almalinux8 or AzureLinux3. This restriction only applies to the pipelines that runs on Microsoft's first-party machines(which do not include the public build machines that AzureDevOps or Github offers).
  2. The docker file uses conda, which is not free. Please consider using system python, or at least you need to disable the default channels and only use conda-forge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants