OpenCL compilation failures on Intel HD4600 on Windows #280

Closed
CNugteren opened this issue May 10, 2018 · 6 comments

Comments

@CNugteren
Owner

One of the automated build tests on an Intel HD4600 GPU on Windows fails due to OpenCL compilation failures. The test and its results can be found here, for example:
http://ci.arrayfire.org:8010/#/builders/16/builds/5

All tests failing use the GEMM kernel, so that seems to be the issue.

	 34 - clblast_test_xgemm (Failed)
	 35 - clblast_test_xsymm (Failed)
	 41 - clblast_test_xtrmm (Failed)
	 42 - clblast_test_xtrsm (Failed)
	 47 - clblast_test_xgemmbatched (Failed)
	 48 - clblast_test_xgemmstridedbatched (Failed)
	 49 - clblast_test_override_parameters (Failed)
	 51 - clblast_test_preprocessor (Failed)

There are some regular incorrectness errors as well, but also some compilation-related ones. Examples of output from ./clblast_test_xgemm include a regular -11 compilation error, but also output like:

fcl build 1 succeeded.
fcl build 2 succeeded.
Error: parse error.
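
For readers unfamiliar with the numeric codes: -11 is the value that the Khronos OpenCL header (CL/cl.h) assigns to CL_BUILD_PROGRAM_FAILURE. A minimal sketch of a lookup for a few such codes is shown below. The helper name describe_opencl_status and the table name are hypothetical and not part of CLBlast; only the code values themselves come from the OpenCL header.

```python
# Subset of OpenCL status codes; values as defined in the Khronos CL/cl.h header.
OPENCL_STATUS_NAMES = {
    0: "CL_SUCCESS",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -45: "CL_INVALID_PROGRAM_EXECUTABLE",
}


def describe_opencl_status(code: int) -> str:
    """Return a readable label for an OpenCL status code (hypothetical helper)."""
    name = OPENCL_STATUS_NAMES.get(code, "unknown status")
    return f"{name} ({code})"


# The "-11" seen in the test output corresponds to a program build failure.
print(describe_opencl_status(-11))
```

Note that messages such as "fcl build 1 succeeded." are not OpenCL status codes, so a table like this does not help in decoding them.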

Not sure how to debug this. Is the latest driver installed, and do other OpenCL programs work well? Since the tests catch errors, perhaps more info could be found by running a non-test program, e.g. ./clblast_client_xgemm?

Or perhaps the kernels need to be tuned specifically for this device, although I'm not sure, since some cases do seem to pass...?

Perhaps @umar456 can help out a bit as well?

@umar456
Contributor

umar456 commented May 10, 2018

I can take a look for a couple of hours tonight. I am not super familiar with the codebase. Could you give me a couple of tips about how you would go about debugging this? Are there files/flags which I can modify to get more information about the kernels, launch parameters, etc?

@CNugteren
Owner Author

Sure, I'll give you some pointers soon; I'm not at a laptop right now. One thing I would double-check is whether you have the latest drivers.

@umar456
Contributor

umar456 commented May 10, 2018

Updating from x.x.x.4889 to x.x.x.4963. I will run the tests again once this is done.

@umar456
Contributor

umar456 commented May 10, 2018

It seems I have to update the OS to Windows 10 in order to install that driver for that specific chip. I don't have physical access to the machine so I will have to do it at a later time.

@CNugteren
Owner Author

Re-installing Windows seems a bit drastic, perhaps; CLBlast should work on older versions of Windows as well. I've also taken another look at the output and observed the following:

  • If there is a regular OpenCL kernel compilation error (-11), running the tests will output what that error is. However, in this case the error is only something like fcl build 1 succeeded. This indicates some issue with the Intel driver and could be a bug on their side.

  • CLBlast uses a cache to store compiled kernels. Thus, once a specific kernel is compiled for a specific device, it will be read from the cache the next time. Each group of tests needs the same kernels every time, so compilation should only happen at the start. However, if you look at the first output (SGEMM from clblast_test_xgemm) you see 4 compilation errors (-11) followed by a lot of successful tests. What I conclude from this is that it tried to compile the kernel 4 times without success, followed by a 5th, successful compilation of the same kernel. Afterwards, the kernel is stored in the cache and re-used for all following experiments. Other tests show similar behaviour. Here I can only conclude that there is some undefined behaviour in the Intel OpenCL compiler.

  • It seems there are also some tests failing with actual incorrect results. An example is clblast_test_xsymm. I've seen similar issues with 4th/5th generation Intel GPUs in the past as well, but I never managed to solve those.
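
The reasoning in the second bullet can be illustrated with a toy model. This is only a sketch under stated assumptions: KernelCache, make_flaky_compiler and run_tests are hypothetical names, the "compiler" is a stand-in that fails a fixed number of times before succeeding, and none of it reflects CLBlast's actual cache implementation or the real Intel compiler's behaviour.

```python
class KernelCache:
    """Toy cache keyed by (device, kernel); only successful builds are stored."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._binaries = {}

    def get(self, device, kernel):
        key = (device, kernel)
        if key in self._binaries:
            return self._binaries[key]  # cache hit: no compilation needed
        binary = self._compile(device, kernel)  # may raise on build failure
        self._binaries[key] = binary
        return binary


def make_flaky_compiler(failures_before_success):
    """Stand-in compiler that fails N times, then succeeds (an assumption)."""
    calls = {"n": 0}

    def compile_fn(device, kernel):
        calls["n"] += 1
        if calls["n"] <= failures_before_success:
            raise RuntimeError("OpenCL error -11 (build program failure)")
        return f"binary:{device}:{kernel}"

    return compile_fn, calls


def run_tests(cache, device, kernel, num_tests):
    """Each test needs the same kernel; record whether it could be obtained."""
    results = []
    for _ in range(num_tests):
        try:
            cache.get(device, kernel)
            results.append("pass")
        except RuntimeError:
            results.append("compile error")
    return results


compile_fn, calls = make_flaky_compiler(failures_before_success=4)
cache = KernelCache(compile_fn)
results = run_tests(cache, "HD4600", "Xgemm", num_tests=8)
print(results)
print("compiler invocations:", calls["n"])
```

In this model the first four tests report a compile error, the fifth triggers the one successful compilation, and every later test is served from the cache without invoking the compiler again, which matches the pattern described for the SGEMM output.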

Given all of the above, perhaps we should file a bug with Intel or just ignore these failures. I am not sure what I can do about it from the CLBlast side.

@CNugteren
Owner Author

I'm closing this issue because I believe it is beyond CLBlast's scope and actually an issue with the Intel software (see above). New Intel efforts like Intel's NEO don't seem to support old hardware anymore, so I'm afraid we have to consider this deprecated hardware.
