[Performance] CudaPinned allocator uses the wrong backend device_allocator when shared environment allocators are activated #25211

Closed
@AndreyOrb

Description

Describe the issue

The correct CudaPinned allocator from the CUDA EP (backed by the correct CUDAPinnedAllocator) is replaced with a "buggy" CudaPinned allocator backed by a CPUAllocator!

The replacement happens at this line in the ORT source:

session_state_ = std::make_unique<SessionState>(

SessionState before applying the environment shared allocators:

[screenshot]

SessionState right after applying the environment shared allocators:

[screenshot]

This happens because the environment holds a CudaPinned allocator whose backend device_allocator is a CPUAllocator.

Here's the code I use to register the CudaPinned allocator:

void RegisterCudaPinnedEnvAllocator(OrtApi& api, OrtEnv* env)
{
	nvtxRangePush("RegisterCudaPinnedEnvAllocator");

	// Describe the CudaPinned memory: arena allocator, device 0, CPU-output memory type.
	OrtMemoryInfo* cudaPinnedMemoryInfo;
	ASSERT_ORT_STATUS(api.CreateMemoryInfo("CudaPinned", OrtArenaAllocator, 0, OrtMemTypeCPUOutput, &cudaPinnedMemoryInfo));

	// Arena config: default max_mem, kNextPowerOfTwo extend strategy, default chunk settings.
	OrtArenaCfg* cudaPinnedArenaConfig;
	ASSERT_ORT_STATUS(api.CreateArenaCfg(0, ArenaExtendStrategy::kNextPowerOfTwo, -1, -1, &cudaPinnedArenaConfig));

	// This creates an ORT-internal allocator instance and registers it in the environment for sharing.
	std::vector<const char*> keys, values;
	ASSERT_ORT_STATUS(api.CreateAndRegisterAllocatorV2(env, "CPUExecutionProvider", cudaPinnedMemoryInfo, cudaPinnedArenaConfig, keys.data(), values.data(), 0));

	api.ReleaseArenaCfg(cudaPinnedArenaConfig);
	api.ReleaseMemoryInfo(cudaPinnedMemoryInfo);
	nvtxRangePop();
}
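
For context, this is roughly how the helper might be driven against the C API (a hypothetical driver, not part of the report; status checks omitted, and the const_cast is only needed because the helper takes a non-const OrtApi& while the API object itself is never modified):

#include <onnxruntime_c_api.h>

int main()
{
	const OrtApi* api = OrtGetApiBase()->GetApi(ORT_API_VERSION);

	OrtEnv* env = nullptr;
	api->CreateEnv(ORT_LOGGING_LEVEL_WARNING, "cuda_pinned_demo", &env);

	// Register the shared CudaPinned allocator on the environment.
	RegisterCudaPinnedEnvAllocator(const_cast<OrtApi&>(*api), env);

	// ... create sessions that opt into the shared environment allocators ...

	api->ReleaseEnv(env);
	return 0;
}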

This is how I revealed the issue:

[screenshot]

This is without shared environment allocators:

[screenshot]
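
Since the screenshots are not reproduced here, a runtime check can serve the same purpose. Below is a minimal sketch (a hypothetical check, not from the report; status checks omitted) that asks the session for its CudaPinned allocator and uses cudaPointerGetAttributes to tell whether the returned buffer is really CUDA-pinned host memory or plain CPU memory:

#include <cuda_runtime.h>
#include <onnxruntime_c_api.h>
#include <cstdio>

void CheckCudaPinnedAllocation(const OrtApi& api, OrtSession* session)
{
	OrtMemoryInfo* memInfo = nullptr;
	api.CreateMemoryInfo("CudaPinned", OrtArenaAllocator, 0, OrtMemTypeCPUOutput, &memInfo);

	// Resolve the allocator the session actually uses for this memory info.
	OrtAllocator* allocator = nullptr;
	api.CreateAllocator(session, memInfo, &allocator);

	void* p = nullptr;
	api.AllocatorAlloc(allocator, 1024, &p);

	// A real CUDAPinnedAllocator hands out cudaHostAlloc'd memory (cudaMemoryTypeHost);
	// a plain CPUAllocator hands out ordinary memory (cudaMemoryTypeUnregistered).
	cudaPointerAttributes attrs{};
	cudaPointerGetAttributes(&attrs, p);
	std::printf("CudaPinned allocation memory type: %d (host = %d, unregistered = %d)\n",
	            static_cast<int>(attrs.type),
	            static_cast<int>(cudaMemoryTypeHost),
	            static_cast<int>(cudaMemoryTypeUnregistered));

	api.AllocatorFree(allocator, p);
	api.ReleaseAllocator(allocator);
	api.ReleaseMemoryInfo(memInfo);
}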

To reproduce

Register the CudaPinned allocator in the environment with the RegisterCudaPinnedEnvAllocator function shown above, then create a session with shared environment allocators activated. The session's CudaPinned allocator then points to a CPUAllocator backend device_allocator instead of the CUDAPinnedAllocator.
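
For completeness, here is the step that activates the shared environment allocators on a session, as a minimal sketch (assuming the standard session.use_env_allocators config key, the same api and env as above, and a placeholder model path; status checks omitted):

OrtSessionOptions* sessionOptions = nullptr;
api.CreateSessionOptions(&sessionOptions);

// Opt the session into the allocators registered on the environment.
api.AddSessionConfigEntry(sessionOptions, "session.use_env_allocators", "1");

// Enable the CUDA EP so the session also gets the CUDA/CudaPinned device allocators.
OrtCUDAProviderOptionsV2* cudaOptions = nullptr;
api.CreateCUDAProviderOptions(&cudaOptions);
api.SessionOptionsAppendExecutionProvider_CUDA_V2(sessionOptions, cudaOptions);

OrtSession* session = nullptr;
api.CreateSession(env, L"model.onnx", sessionOptions, &session);

api.ReleaseCUDAProviderOptions(cudaOptions);
api.ReleaseSessionOptions(sessionOptions);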

I searched the ORT repo for clues or similar issues but found nothing.

Urgency

Urgent! This directly impacts performance when shared environment allocators are activated.

Platform

Windows

OS Version

11

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

849eee8

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.8

Model File

No response

Is this a quantized model?

Unknown

Metadata

    Labels

    performance (issues related to performance regressions)
