Add a check_doubles rule and documentation to run the double-precision requiring tests separately if target has doubles #135

Open
pjaaskel opened this issue Sep 8, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@pjaaskel
Collaborator

pjaaskel commented Sep 8, 2022

Are these failures known? They happen with my Iris Xe mobile GPU and LLVM 15.

OpenCL:

The following tests FAILED:
	108 - Unit_hipHostGetFlags_Basic - int (Failed)
	109 - Unit_hipHostGetFlags_Basic - float (Failed)
	110 - Unit_hipHostGetFlags_Basic - double (Failed)
	111 - Unit_hipMallocManaged_MultiChunkSingleDevice (Failed)
	112 - Unit_hipMallocManaged_MultiChunkMultiDevice (Failed)
	115 - Unit_hipMallocManaged_TwoPointers - int (Failed)
	116 - Unit_hipMallocManaged_TwoPointers - float (Failed)
	117 - Unit_hipMallocManaged_TwoPointers - double (Failed)
	118 - Unit_hipMallocManaged_DeviceContextChange - unsigned char (Failed)
	119 - Unit_hipMallocManaged_DeviceContextChange - int (Failed)
	120 - Unit_hipMallocManaged_DeviceContextChange - float (Failed)
	121 - Unit_hipMallocManaged_DeviceContextChange - double (Failed)
	187 - Unit_hipMemcpy_KernelLaunch - int (Failed)
	188 - Unit_hipMemcpy_KernelLaunch - float (Failed)
	189 - Unit_hipMemcpy_KernelLaunch - double (Failed)
	193 - Unit_hipMemcpy_MultiThreadWithSerialization (Subprocess aborted)
	197 - Unit_hipMemcpyAsync_KernelLaunch - int (Failed)
	198 - Unit_hipMemcpyAsync_KernelLaunch - float (Failed)
	199 - Unit_hipMemcpyAsync_KernelLaunch - double (Failed)
	204 - Unit_hipMemcpyAsync_hipMultiMemcpyMultiThread - int (Subprocess aborted)
	205 - Unit_hipMemcpyAsync_hipMultiMemcpyMultiThread - float (SEGFAULT)
	206 - Unit_hipMemcpyAsync_hipMultiMemcpyMultiThread - double (Subprocess aborted)
	215 - Unit_ldg (Failed)
	450 - Unit_deviceFunctions_CompileTest_modf_double (Failed)
	454 - Unit_deviceFunctions_CompileTest_norm_double (Failed)
	463 - Unit_deviceFunctions_CompileTest_rhypot_double (Failed)
	465 - Unit_deviceFunctions_CompileTest_rnorm_double (Failed)
	466 - Unit_deviceFunctions_CompileTest_rnorm3d_double (Failed)
	467 - Unit_deviceFunctions_CompileTest_rnorm4d_double (Failed)
	474 - Unit_deviceFunctions_CompileTest_sincos_double (Failed)
	475 - Unit_deviceFunctions_CompileTest_sincospi_double (Failed)
	528 - Unit_hipGetDeviceProperties_ArchPropertiesTst (Failed)
	540 - Unit_hipStreamPerThread_MultiThread (Subprocess aborted)
	541 - Unit_hipStreamPerThread_DeviceReset_1 (Subprocess aborted)
	626 - cuda-reduction (Failed)
Errors while running CTest

Level0:

The following tests FAILED:
	108 - Unit_hipHostGetFlags_Basic - int (Failed)
	109 - Unit_hipHostGetFlags_Basic - float (Failed)
	110 - Unit_hipHostGetFlags_Basic - double (Failed)
	111 - Unit_hipMallocManaged_MultiChunkSingleDevice (Failed)
	112 - Unit_hipMallocManaged_MultiChunkMultiDevice (Failed)
	115 - Unit_hipMallocManaged_TwoPointers - int (Failed)
	116 - Unit_hipMallocManaged_TwoPointers - float (Failed)
	117 - Unit_hipMallocManaged_TwoPointers - double (Failed)
	118 - Unit_hipMallocManaged_DeviceContextChange - unsigned char (Failed)
	119 - Unit_hipMallocManaged_DeviceContextChange - int (Failed)
	120 - Unit_hipMallocManaged_DeviceContextChange - float (Failed)
	121 - Unit_hipMallocManaged_DeviceContextChange - double (Failed)
	187 - Unit_hipMemcpy_KernelLaunch - int (Failed)
	188 - Unit_hipMemcpy_KernelLaunch - float (Failed)
	189 - Unit_hipMemcpy_KernelLaunch - double (Failed)
	193 - Unit_hipMemcpy_MultiThreadWithSerialization (Subprocess aborted)
	197 - Unit_hipMemcpyAsync_KernelLaunch - int (Failed)
	198 - Unit_hipMemcpyAsync_KernelLaunch - float (Failed)
	199 - Unit_hipMemcpyAsync_KernelLaunch - double (Failed)
	204 - Unit_hipMemcpyAsync_hipMultiMemcpyMultiThread - int (Subprocess aborted)
	205 - Unit_hipMemcpyAsync_hipMultiMemcpyMultiThread - float (Subprocess aborted)
	206 - Unit_hipMemcpyAsync_hipMultiMemcpyMultiThread - double (Subprocess aborted)
	215 - Unit_ldg (Failed)
	450 - Unit_deviceFunctions_CompileTest_modf_double (Failed)
	454 - Unit_deviceFunctions_CompileTest_norm_double (Failed)
	463 - Unit_deviceFunctions_CompileTest_rhypot_double (Failed)
	465 - Unit_deviceFunctions_CompileTest_rnorm_double (Failed)
	466 - Unit_deviceFunctions_CompileTest_rnorm3d_double (Failed)
	467 - Unit_deviceFunctions_CompileTest_rnorm4d_double (Failed)
	474 - Unit_deviceFunctions_CompileTest_sincos_double (Failed)
	475 - Unit_deviceFunctions_CompileTest_sincospi_double (Failed)
	528 - Unit_hipGetDeviceProperties_ArchPropertiesTst (Failed)
	570 - hipKernelLaunchIsNonBlocking (Subprocess terminated)
	626 - cuda-reduction (Failed)
Errors while running CTest

Some errors from logs (LZ):

Command: "/home/pjaaskel/src/chip-spv/build/catch/unit/memory/hipMemcpy2DToArray" "Unit_hipMemcpy2DToArray_Negative"
Directory: /home/pjaaskel/src/chip-spv/build/catch/hipTestMain
"Unit_hipMemcpy2DToArray_Negative" start time: Sep 08 14:28 EEST
Output:
----------------------------------------------------------
CHIP error [TID 13776] [1662636513.907324677] : hipErrorInvalidHandle (passed in nullptr) in /home/pjaaskel/src/chip-spv/src/CHIPException.hh:91:checkIfNullptr

CHIP error [TID 13776] [1662636513.907437162] : Caught Error: hipErrorInvalidHandle
CHIP error [TID 13776] [1662636513.907551852] : hipErrorInvalidHandle (passed in nullptr) in /home/pjaaskel/src/chip-spv/src/CHIPException.hh:91:checkIfNullptr

CHIP error [TID 13776] [1662636513.907561840] : Caught Error: hipErrorInvalidHandle
Filters: Unit_hipMemcpy2DToArray_Negative
===============================================================================
...

"Unit_hipMemcpyParam2D_Negative" start time: Sep 08 14:28 EEST
Output:
----------------------------------------------------------
CHIP error [TID 13900] [1662636519.092989120] : hipErrorTbd (Source Device pointer is null) in /home/pjaaskel/src/chip-spv/src/CHIPBindings.cc:700:hipMemcpyParam2DAsync

CHIP error [TID 13900] [1662636519.093107437] : Caught Error: hipErrorTbd
CHIP error [TID 13900] [1662636519.096081339] : hipErrorTbd (Source Device pointer is null) in /home/pjaaskel/src/chip-spv/src/CHIPBindings.cc:702:hipMemcpyParam2DAsync

CHIP error [TID 13900] [1662636519.096100443] : Caught Error: hipErrorTbd
CHIP error [TID 13900] [1662636519.099407248] : hipErrorTbd (Source and Destination Device pointer is null) in /home/pjaaskel/src/chip-spv/src/CHIPBindings.cc:697:hipMemcpyParam2DAsync

CHIP error [TID 13900] [1662636519.099423555] : Caught Error: hipErrorTbd
CHIP error [TID 13900] [1662636519.102441277] : hipErrorTbd (Width > src/dest pitches) in /home/pjaaskel/src/chip-spv/src/CHIPBindings.cc:706:hipMemcpyParam2DAsync

CHIP error [TID 13900] [1662636519.102455602] : Caught Error: hipErrorTbd
Filters: Unit_hipMemcpyParam2D_Negative
===============================================

Command: "/home/pjaaskel/src/chip-spv/build/catch/unit/memory/hipHostRegister" "Unit_hipHostRegister_Memcpy - int"
Directory: /home/pjaaskel/src/chip-spv/build/catch/hipTestMain
"Unit_hipHostRegister_Memcpy - int" start time: Sep 08 14:28 EEST
Output:
----------------------------------------------------------
CHIP error [TID 14142] [1662636531.471413075] : ZE Build Log:
error: Double type is not supported on this platform.
in kernel: 'void Inc<double>(double*)'
error: backend compiler failed build.

error: Double type is not supported on this platform.
in kernel: 'void Inc<double>(double*)'
error: backend compiler failed build.

error: Double type is not supported on this platform.
in kernel: 'void Inc<double>(double*)'
error: backend compiler failed build.

error: Double type is not supported on this platform.
in kernel: 'void Inc<double>(double*)'
error: backend compiler failed build.

error: Double type is not supported on this platform.
in kernel: 'void Inc<double>(double*)'
error: backend compiler failed build.

error: Double type is not supported on this platform.
in kernel: 'void Inc<double>(double*)'
error: backend compiler failed build.

...
Command: "/home/pjaaskel/src/chip-spv/build/catch/unit/memory/hipMemset2DAsyncMultiThreadAndKernel" "Unit_hipMemset2DAsync_MultiThread"
Directory: /home/pjaaskel/src/chip-spv/build/catch/hipTestMain
"Unit_hipMemset2DAsync_MultiThread" start time: Sep 08 14:30 EEST
Output:
----------------------------------------------------------
CHIP error [TID 15912] [1662636608.615095007] : hipErrorTbd (ZE_RESULT_ERROR_INVALID_ARGUMENT ) in /home/pjaaskel/src/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:979:memFillAsyncImpl

601/627 Test: cuda-asyncAPI
Command: "/home/pjaaskel/src/chip-spv/build/samples/cuda_samples/cuda-asyncAPI"
Directory: /home/pjaaskel/src/chip-spv/build/samples/cuda_samples
"cuda-asyncAPI" start time: Sep 08 14:35 EEST
Output:
----------------------------------------------------------
CHIP error [TID 19505] [1662636943.558795687] : hipErrorNotReady (Event Not Ready) in /home/pjaaskel/src/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:379:updateFinishStatus

CHIP error [TID 19505] [1662636943.559533885] : Caught Error: hipErrorNotReady
CHIP error [TID 19505] [1662636943.559581970] : hipErrorNotReady (Event Not Ready) in /home/pjaaskel/src/chip-spv/src/backend/Level0/CHIPBackendLevel0.cc:379:updateFinishStatus

The "Double type is not supported on this platform." error is a HW/driver limitation: OpenCL/SPIR-V requires neither double support nor a SW emulation of it (see https://registry.khronos.org/OpenCL/sdk/1.0/docs/man/xhtml/cl_khr_fp64.html). Can we fail more gracefully at runtime? Should we add a separate test suite for the double tests, since they are not supposed to work on OpenCL/LZ devices that do not provide double support?
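One way to detect the missing capability before any kernel is built is to look for the `cl_khr_fp64` token in the device's extension list. The sketch below is only an illustration: `deviceAdvertisesFp64` is a hypothetical helper, not an existing function in this project, and it assumes the extension string has already been obtained from `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper (not part of this project): returns true when the
// space-separated OpenCL extension list contains the exact token
// "cl_khr_fp64". The list is assumed to come from
// clGetDeviceInfo(..., CL_DEVICE_EXTENSIONS, ...).
bool deviceAdvertisesFp64(const std::string &Extensions) {
  std::istringstream Stream(Extensions);
  std::string Token;
  // Compare whole tokens so that a longer name merely containing
  // "cl_khr_fp64" as a prefix is not treated as a match.
  while (Stream >> Token) {
    if (Token == "cl_khr_fp64")
      return true;
  }
  return false;
}
```

For example, `deviceAdvertisesFp64("cl_khr_fp16 cl_khr_fp64")` returns true, while an empty extension list returns false. On OpenCL 1.2 and later, querying `CL_DEVICE_DOUBLE_FP_CONFIG` and checking for a zero result is likely an alternative.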

@pjaaskel pjaaskel added the "opencl" label (Issues affecting only the OpenCL backend) Sep 8, 2022
@pjaaskel pjaaskel added this to the 0.9 - the first release milestone Sep 8, 2022
@pvelesko
Collaborator

pvelesko commented Sep 8, 2022 via email

@pvelesko
Collaborator

pvelesko commented Sep 8, 2022

> The "Double type is not supported on this platform." is a HW/driver issue

I haven't seen this error before. Are you using an outdated driver? Trying to understand what could cause a failure to emulate.

> OpenCL/SPIR-V doesn't require double support or its SW emulation

I thought we enable double support by using #pragma OPENCL EXTENSION cl_khr_fp64 : enable?

> Can we fail more gracefully at runtime?

If we fail to compile we throw an error and print the compilation log:

    CHIPERR_CHECK_LOG_AND_THROW(Status, ZE_RESULT_SUCCESS, hipErrorTbd);
    logError("ZE Build Log: {}", std::string(LogStr).c_str());

Besides changing the error from hipErrorTbd to something more specific, how do you propose we make JIT failure more graceful?
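As a rough sketch of what "something more specific" could look like, the build log could be scanned for the known message and turned into a clearer diagnosis. Everything here is hypothetical: `BuildFailureKind`, `classifyBuildLog` and `describeBuildFailure` are illustrative names, not existing project code, and the only grounded input is the message text from the ZE build log quoted earlier in this issue.

```cpp
#include <string>

// Hypothetical classification of a JIT build failure; not an existing type.
enum class BuildFailureKind { MissingFp64, Other };

// Looks for the message seen in the Level Zero build log quoted in this issue.
BuildFailureKind classifyBuildLog(const std::string &BuildLog) {
  const std::string Fp64Message =
      "Double type is not supported on this platform.";
  if (BuildLog.find(Fp64Message) != std::string::npos)
    return BuildFailureKind::MissingFp64;
  return BuildFailureKind::Other;
}

// Produces a short, user-facing explanation for the failure kind.
std::string describeBuildFailure(const std::string &BuildLog) {
  if (classifyBuildLog(BuildLog) == BuildFailureKind::MissingFp64)
    return "Kernel uses double precision, which this device does not support.";
  return "Kernel compilation failed; see the build log.";
}
```

Matching on log text is fragile, since the wording may differ between drivers and backends; a capability query made before compilation would probably be more robust.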

@pjaaskel
Collaborator Author

pjaaskel commented Sep 8, 2022

cl_khr_fp64 is an extension, meaning that supporting it is not mandatory. I do not know if we can make the failure more graceful. As long as the user sees that it is caused by missing capabilities and the program does not just crash, I think there's not much else we can do.

@pvelesko
Collaborator

pvelesko commented Sep 8, 2022

> I do not know if we can make it more graceful, as long as the user sees it's because of missing capabilities and not just crash, I think there's not much else we can do.

OK, so it sounds like we don't have an issue in that regard, since we already print the exact error.

> Should we add a separate test suite for the double tests since they are not supposed to work on OpenCL/LZ devices which do not provide double support?

Sounds like a good idea and something we can consider for the next release. I haven't come across a device that doesn't support double precision (so I have no way of testing), so it doesn't seem like a high-priority issue at this time.

@pjaaskel pjaaskel changed the title from "Failing tests with the OpenCL BE" to "Failing tests with Iris Xe iGPU and LLVM 15" Sep 15, 2022
@pjaaskel
Collaborator Author

All of these cases, except the one reported in Issue #134, seem to be caused by the double-precision issue. I'll change this ticket so that it gets fixed for 0.9 by adding a double-precision test tag and a separate test suite for the tests that are not supposed to work on HW without double support. Let's leave Issue #137 open for later consideration.
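The tagging idea can be sketched as a simple filter over tagged tests. This is only an illustration of the selection logic, not the planned CTest setup: `TestCase`, `NeedsDouble` and `selectTests` are hypothetical names. Note that the tag has to be assigned explicitly, because the log above shows "Unit_hipHostRegister_Memcpy - int" hitting the `Inc<double>` build error, so the test name alone does not tell whether doubles are required.

```cpp
#include <string>
#include <vector>

// Hypothetical test descriptor. NeedsDouble must be set by whoever registers
// the test, since it cannot be derived from the test name.
struct TestCase {
  std::string Name;
  bool NeedsDouble;
};

// Returns the names of the tests that should run on the current target.
// Tests tagged NeedsDouble are kept only if the device supports doubles.
std::vector<std::string> selectTests(const std::vector<TestCase> &Tests,
                                     bool DeviceHasDouble) {
  std::vector<std::string> Selected;
  for (const TestCase &Test : Tests) {
    if (Test.NeedsDouble && !DeviceHasDouble)
      continue; // Skip double-requiring tests on non-double-capable HW.
    Selected.push_back(Test.Name);
  }
  return Selected;
}
```

With such a split, the default check would run only the untagged tests, and a separate rule would add the tagged ones when the target has doubles.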

@pjaaskel pjaaskel changed the title from "Failing tests with Iris Xe iGPU and LLVM 15" to "Add a check_doubles rule and documentation to run the double-precision requiring tests separately if target has doubles" Sep 15, 2022
@pjaaskel pjaaskel removed the "opencl" label (Issues affecting only the OpenCL backend) Sep 15, 2022
@pjaaskel
Collaborator Author

Not sure if we really need this one as there's a workaround for the only known target where the issue appears.

@pvelesko
Collaborator

pvelesko commented Oct 11, 2022 via email

@pvelesko pvelesko added the "enhancement" label (New feature or request) May 4, 2023