[Bounty $50] Inconsistent results between GPU and CPU when integers overflow. #38
Comments
It should be noted this is not simply an issue with rolling over. If I change the calculate line to
|
Hi, thanks for opening this issue here. I think there is something wrong in OpenCL itself. I am sending the kernel and host code in C. It STILL gives wrong results on the GPU. |
What OpenCL implementation are you using, and what CPU/GPU does the MacBook have? |
It's the default implementation in macOS. I'm running 10.12 (16A320) on a MacBook Pro (Retina, 13-inch, Late 2013). It has an i5-4258U CPU and an Iris 5100 GPU |
Are you using the AMD APP SDK? |
Also, in calculate, can you try it with brackets around the cast, putting the multiplication outside of the brackets? |
@CC007 The issue seems to occur on Linux as well, and with the AMD APP SDK, so it appears this bug is not platform specific. If you run the code in the original post on your local computer you will probably see the bug as well. Did you try? The bug behaves very oddly for me: for example, even though id is always 0, if you remove the "+ id" part in the calculate call it won't break anymore. |
Has anyone checked the C code I posted? |
@savaskoc Not yet, but I will give it a go this evening. If you're saying it produces the same incorrect results, however, then I expect you are correct that it is an OpenCL issue directly. |
I think this is an OpenCL issue because the GPU produces incorrect results independently of platform and/or language (C, Java, Python, etc.). |
@savaskoc You realize the bug is producing the result but clipping it to 32 bits? Using the test I posted above as the example (the same is true for the numbers you posted), here is the breakdown.

Expected result: 214748364700, which in binary is 0011 0001 1111 1111 1111 1111 1111 1111 1001 1100
Actual result: 4294967196, which in binary is 0000 0000 1111 1111 1111 1111 1111 1111 1001 1100

Basically it drops all but the last 32 bits. So while this is a legitimate bug, it is definitely occurring due to mishandling of 64-bit variables. |
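To make the clipping concrete, here is a small host-side sketch in plain Java (not Aparapi) that reproduces the buggy value by masking the expected result down to its low 32 bits; the numbers are taken from the comment above:

```java
public class Truncation {
    public static void main(String[] args) {
        long expected = 2147483647L * 100;   // 214748364700, needs 38 bits
        long low32 = expected & 0xFFFFFFFFL; // keep only the low 32 bits
        System.out.println(expected);        // 214748364700
        System.out.println(low32);           // 4294967196, the value the GPU returned
    }
}
```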
I'm aware of that. I tried types other than long, but it still gives incorrect results. Maybe some GPUs can't process 64-bit data types? |
@savaskoc I think there is more to it than that. When I was testing, I tried some variants that produced some very odd results. I need to test this again to make sure I'm remembering correctly, but when I removed the id variable (which in this test is always 0 anyway, so it shouldn't make a difference) it actually caused the correct results to be produced. Once I saw that behavior it was apparent that we were talking about a legitimate bug; I just can't confirm yet whether the bug is isolated to Aparapi or an OpenCL issue in general (I need to do more testing). |
@savaskoc Also, as I stated in the OP, if you change the operation to addition but change the value of the operand to make it mathematically equivalent, it magically works. So the problem seems to occur only on multiplication, not addition. This leads me to believe it is a genuine error rather than a hardware compatibility issue or something. |
Does OpenCL provide software 64-bit support or hardware 64-bit support (or both)? |
My guess is that savaskoc is correct: the GPU's OpenCL runtime does not support long. Aparapi will detect this for doubles; I'm surprised it does not detect this for long. |
@grfrost If I am correct, why |
Actually, I was about to retract ;) From the 1.0 spec https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/scalarDataTypes.html it does not look like cl_long is optional. So now I think I will blame Aparapi. My guess is that as the AST is built we end up with the cast applied after the multiplication, instead of:

        *
       / \
  (long)  100
     |
     tc

Can you try return 100 * (long) tc; and/or return (long)(tc + 0L) * 100; (hope my diagrams make it unscathed)
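As a host-side illustration of why the position of the cast matters (plain Java, not Aparapi's actual generated OpenCL; 2147483647 is the start value from the test case):

```java
public class CastOrder {
    public static void main(String[] args) {
        int tc = 2147483647;                // Integer.MAX_VALUE
        long castAfter = (long) (tc * 100); // 32-bit multiply wraps first: -100
        long castBefore = (long) tc * 100;  // widen to 64 bits first: 214748364700
        System.out.println(castAfter);
        System.out.println(castBefore);
    }
}
```

If the generated kernel effectively computes the castAfter form, the multiplication overflows in 32 bits before the widening ever happens.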
I doubt it. Even plain OpenCL does not produce correct results. Check my kernel above. |
I ran the code on my computer (the code from freemo) and it runs as it should. Using: |
@CC007 You're saying that you get the same results in both GPU and CPU mode, right? Can you try my kernel in C? |
I will try that one next. I also tested freemo's code by setting it up to use the cpu opencl device |
@savaskoc The problem seems to be that I don't have any OpenCL SDK, only the driver and implementation AFAIK, as I am not using the AMD APP SDK |
I have a pc :) |
I think you misunderstand; it wasn't the kernel.cl.h that caused a problem. |
Ok, now that that is installed, it seems that there are types used that don't come from openCL |
On my MacBook Pro the following yields the same error:

clang++ -framework OpenCL longtst.cpp -o longtst

#include <iostream>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/opencl.h>
#endif
#define DATA_SIZE 1
#define LONG_DATA_SIZE (DATA_SIZE*sizeof(cl_long))
int main(int argc, char **argv){
long out[DATA_SIZE];
out[0]=0L;
// How many platforms are there ?
cl_uint platformc = 0;
clGetPlatformIDs(0, NULL, &platformc);
if (platformc >0){
// Extract a list of available platforms
cl_platform_id *platforms = new cl_platform_id[platformc];
clGetPlatformIDs(platformc, platforms, NULL);
cl_device_id device_id=0;
// loop through platforms until we have a valid GPU device
for (unsigned int i = 0; !device_id && i < platformc; ++i) {
clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 1, &device_id, nullptr);
}
delete[] platforms;
// only device_id context and command queue needed below
if (device_id){
cl_int err;
// Create a context
cl_context context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
// Create command queue for this context
cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &err);
// Here is our OpenCL kernel source for
const char *source =
"__constant int start = 2147483647;\n"
"static long calculate(int tc){\n"
" return (long)tc * 100;\n"
"}\n"
"__kernel void longtst(__global long *result){\n"
" int id = get_global_id(0);\n"
" result[id] = calculate(start + id);\n"
"}\n";
// Compile source
cl_program program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &err);
err = clBuildProgram(program, 1, &(device_id), NULL, NULL, NULL);
// Extract and show any compile errors or warnings
if (err != CL_SUCCESS){
size_t len;
err = clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
if (len >0){
len++; // for '\0'
char *compile_log = (char *) malloc(len);
clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, len, (void *)compile_log, NULL);
std::cerr <<"log{"<<std::endl<< compile_log << std::endl<<"}"<<std::endl;
free (compile_log);
}
}
// A program can have more than one kernel, select the kernel we want to call
cl_kernel kernel = clCreateKernel(program, "longtst", &err);
// Create buffers which 'wrap' the host data
cl_mem outBuf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_WRITE_ONLY, LONG_DATA_SIZE, (void*)out, &err);
// Set any kernel args
err = clSetKernelArg(kernel, 0 , sizeof(cl_mem), &outBuf);
// An event list helps us dispatch efficiently
#define EVENTS 2
cl_event *events = new cl_event[EVENTS];
// Decide how to partition the execution (we choose 1 group 1 threads)
size_t globalRange = DATA_SIZE;
size_t localRange = DATA_SIZE;
// Enqueue the execution
err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalRange, &localRange, 0, NULL, &events[1]);
// Enqueue a read of 'out' data to the command queue
err = clEnqueueReadBuffer(command_queue, outBuf, CL_TRUE, 0, LONG_DATA_SIZE, out, 1, &events[1], &events[0]);
// Wait for all the dispatches to complete.
err = clWaitForEvents(EVENTS, events);
// Release and delete the events
for (int i=0; i<EVENTS; i++){
err = clReleaseEvent(events[i]);
}
delete[] events;
// Release mem objects
clReleaseMemObject(outBuf);
// Release Kernel
clReleaseKernel(kernel);
// Release Program
clReleaseProgram(program);
// Release Context
clReleaseContext(context);
// Release Command Queue
clReleaseCommandQueue(command_queue);
// Note that we don't release any type ending in _id
std::cout << out[0] << std::endl;
}
}
} |
"This error is most likely a result of a 32-bit (x86) executable trying to load a 64-bit (x64) DLL. You might have to adjust your PATH or copy DLLs to avoid this. For example, my PATH is set up to find the x64 version of d3dcompiler_46.dll (for DX11.1)." Maybe your PATH has 32-bit DLLs in it. I don't really use Windows these days; there is a great tool for debugging this sort of thing called Dependency Walker (http://www.dependencywalker.com/). If you install it and point it at your executable, it will show you the DLLs it is trying to load, and thus the error. |
Ok, so a few things. @grfrost The DLL for Aparapi itself no longer needs to be installed; it is loaded into the path dynamically by the aparapi-jni project, which is a dependency of aparapi. So if this particular DLL were the problem (loading the 32-bit version on a 64-bit machine), then the problem would still be with aparapi rather than the user. However, in the past on Windows we have seen an error where, if you try to load the 32-bit DLL on a 64-bit system, it refuses to load and throws an exception. According to @CC007, however, the error discussed in this bug doesn't occur on Windows, but it has been observed on both Mac and Linux. The Mac dylib is 64-bit only. Note that OpenCL still needs to be installed manually, of course; it is just the aparapi DLL that no longer needs to be installed manually.

Based on these details, I suspect it isn't the aparapi native library that is the issue. That doesn't rule out that a 32-bit OpenCL was installed or something, of course. But the fact that all of us were able to get the error on Mac and Linux seems suspicious at this point (unless we all made the same mistake when setting up our environments?). Another point I'm going to explore in a few minutes is to confirm @CC007's comment that it works on Windows. This seems like an important detail that might help us to debug.

Another point to consider, if this is a 32-bit DLL issue, is how the DLL to be loaded is chosen by aparapi-jni. It uses the arch property to determine whether the system is 32-bit or 64-bit. One caveat I have not been able to test yet, however, is a 32-bit JVM on a 64-bit system; in that case the arch would be reported as 32-bit on Windows, so this may be part of the problem. But since old aparapi (before aparapi-jni was written) still had this problem, I am not convinced this is the issue. Anyway, just things to consider. I am investigating this as we speak to see if I can find any more clues. |
So I just spent over an hour tinkering with this problem again, and my results this time are more perplexing than last time. First I went over to my Windows box and ran the above code, and it did not produce the error we had been seeing. Excited that this might be a clue, I wanted to double-check my work, so I headed back over to Linux and ran the original code I pasted in the first post of this issue. To my disbelief, it no longer produced the erroneous results either; it appears to have magically started working. This was odd, as the bug was consistently reproduced when I ran this code yesterday. Thinking this may have been a problem with my test case (perhaps I made a mistake when I copied and pasted it into the issue), I loaded the original code supplied by @savaskoc again to re-run it. Again, to my amazement, it no longer produced the bug on either Linux or Windows for me. The following is the code I just ran on Linux that is now working but previously was not:

public class Main {
public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
int num = 406816900 - 406816880;
TCKernel kernel = new TCKernel(406816880, num);
kernel.execute(num);
PrintWriter writer = new PrintWriter("numbers.txt");
kernel.saveResults(writer);
writer.flush();
}
public static class TCKernel extends Kernel {
long[] result;
int start;
public TCKernel(int start, int num) {
this.result = new long[num];
this.start = start;
}
@Override
public void run() {
result[getGlobalId()] = calculate(start + getGlobalId());
}
public long calculate(int tc) {
int num = tc;
int n9 = num % 10;
num /= 10;
int n8 = num % 10;
num /= 10;
int n7 = num % 10;
num /= 10;
int n6 = num % 10;
num /= 10;
int n5 = num % 10;
num /= 10;
int n4 = num % 10;
num /= 10;
int n3 = num % 10;
num /= 10;
int n2 = num % 10;
num /= 10;
int n1 = num % 10;
int odds = n1 + n3 + n5 + n7 + n9;
int evens = n2 + n4 + n6 + n8;
int n10 = (odds * 7 - evens) % 10;
int n11 = (odds + evens + n10) % 10;
return (long) tc * 100 + (n10 * 10 + n11);
}
public void saveResults(PrintWriter writer) {
writer.println("Result\t\tNum\t\tExpected");
for (int i = 0; i < result.length; i++) {
int tc = start + i;
writer.printf("%d\t%d\t%d%s", result[i], tc, calculate(tc), System.lineSeparator());
}
}
}
}
So I'm not sure what to do now; the bug appears to be intermittent somehow. The code yesterday was consistently giving me a bad result, and now it is consistently giving me the correct result. Since I can no longer reproduce the error, I was unable to debug the problem much to arrive at a solution. Needless to say, this has become a very frustrating bug for me. @grfrost Can you think of any reason in Aparapi that might produce intermittent results like this? Have either of you ever witnessed it spontaneously start working from one execution to the next?

UPDATE: After some reflection, I think I was originally testing this on Mac and not Linux after all. So my conclusion is that it isn't as weird as I first thought. It simply is, and always has been, a Mac-only issue. |
@CC007 Do you have an Ubuntu box going somewhere you could run it on? If possible, I'd be curious to see what sort of results you get when you run it on that box. |
Ok, so here is my personal conclusion; tell me if you guys agree. I have tested locally: it works on Windows and Linux, but the bug occurs on Mac. Both @grfrost and @savaskoc saw the error, but only on Mac. Therefore, unless anyone has experienced this bug on non-Mac systems, I am going to conclude this is a problem that only occurs on Mac. Since this is a Mac-only issue, it seems most likely the bug is in the OSX implementation of OpenCL and not in Aparapi itself. Any reason for anyone to suspect this isn't the case? |
I think so too; it's a bug in OSX's OpenCL implementation. |
@savaskoc Ok, that means we need to see whether a bug was already filed or not. If it was, we can reference it here until it is fixed. If not, we should file one. Did you file a bug report with them yet? |
I did, but it would be good if you filed another one |
@savaskoc It isn't usually good practice to file the same bug twice, but I might be able to add useful comments to the bug you filed. Do you have a link to your bug? |
The bug reporter doesn't provide a link. I can attach your comments to the bug report, or you can file them yourself |
@savaskoc Well, link me to the site where you reported it so I can poke around at least. |
@grfrost It seems that the executable does use the 64-bit OpenCL, msvcrt, and kernel libraries, but the libstdc++ library is 32-bit. I think I read that this was an issue with Code::Blocks in combination with MinGW, but using any of the other compilers gives the errors I posted previously. Thanks for the help though |
@grfrost This is what made it run without error: http://stackoverflow.com/a/6405064, so probably I don't have a 64-bit libstdc++ dynamic library installed. Also, should that code only output -100? |
@CC007 Glad you sorted it out. BTW, GCC/MinGW on Windows is responsible for me having to drink many, many beers ;)
|
@grfrost Also, your kernel.zip contains Mac-specific code (libdispatch and its use of blocks), so I can't run that. |
FYI, as of 1.3.4 the Aparapi native library was compiled using Microsoft Visual Studio instead of GCC or MinGW as with previous versions.
|
I think you are referring to Jeffrey's code. I was enviously looking at those blocks ;) My code should have been straight C++.
Gary
|
Isn't this a clear case of undefined behaviour? As far as I know, OpenCL C signed integer arithmetic overflow is only defined for atomic operations, so |
Addition is atomic, but multiplication is indeed not atomic in the OpenCL specification, for either 32- or 64-bit numbers: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/atomicFunctions.html |
I was referring to the https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/atomic_fetch_key.html which guarantees defined overflow for atomic_fetch_add. Aparapi is not using that function for addition, and for good reason. In general, overflow of signed integers is undefined in C, and I think OpenCL C follows C in that area. |
@TPolzer "For signed integer types, arithmetic is defined to use two's complement representation with silent wrap-around on overflow; there are no undefined results." I don't see what you mean. This might, however, explain why multiplication can cause issues |
The problem is that it does not apply in this situation, only to an explicit atomic fetch-and-add. |
I have noticed the same problem: inconsistent results between GPU and CPU, and between different GPUs. This kernel calculates the product of two big integers, 256 bits each, using the product-scanning algorithm. This kernel gives different results on CPU (correct) / GPU (Intel Iris) (wrong), and different results on GPU (Nvidia Tesla M40) (correct) / GPU (Intel Iris) (wrong).

public void multiplyProductScanning(final byte[] a, final byte[] b, final byte[] output) {
    int UV = 0;
    int U = 0;
    int V = 0;
    for (int k = 62; k >= 0; k--) {
        UV = 0;
        for (int i = MAX(0, k - 31); i <= MIN(k, 31); i++)
            UV += (a[i] & 0xFF) * (b[k - i] & 0xFF);
        UV = UV + U;
        U = (UV & 0xFFFFFF00) >> 8;
        V = UV & 0xFF;
        output[k + 1] = (byte) V;
    }
    output[0] = (byte) U;
}

Notice that we cannot have an overflow in UV: a[i] & 0xFF < 256 and b[k-i] & 0xFF < 256, so UV < 32 x 256 x 256. When I declare UV as long, the results are correct on the CPU and on 4 different GPUs. |
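The "no overflow in UV" claim can be sanity-checked with a rough host-side bound in plain Java (a sketch assuming at most 32 byte-by-byte products per column plus the carry U; it says nothing about what a particular GPU actually does):

```java
public class OverflowBound {
    public static void main(String[] args) {
        long termBound = 255L * 255;      // each (a[i] & 0xFF) * (b[k-i] & 0xFF)
        long innerBound = 32 * termBound; // at most 32 terms per column: 2,080,800
        long carryBound = innerBound >> 8; // roughly the largest carry U can hold
        // In exact arithmetic UV stays far below Integer.MAX_VALUE (2,147,483,647),
        // so a 32-bit int should be wide enough for this kernel on a conforming device.
        System.out.println(innerBound + carryBound < Integer.MAX_VALUE); // true
    }
}
```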
I'm not entirely sure this can be solved. Aparapi works by converting Java bytecode into OpenCL's C-like language and running that on the GPU. Primitive operations like basic arithmetic are governed by the GPU, and for efficiency reasons it would be infeasible to try to override that functionality in any meaningful way. We could of course add some sort of add, subtract, divide, and multiply method that guarantees consistency at the expense of speed, but I'm not sure we would want to address this directly on the primitive math operations for fear of losing significant performance.
|
The following code produces different results when run on the GPU vs the CPU.
The output from the above code snippet is:
expected: 214748364700 result: 4294967196
I tested this on my MacBook Pro, but others noticed the problem as well on other, unspecified platforms. Also, changing the calculate function so that 100 is a long rather than an int, with

return (long) tc * 100L;

(notice the letter L at the end of the 100) produces the exact same incorrect results as above.
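The reason the L suffix changes nothing can be seen in plain Java (a host-side note, independent of Aparapi): once the left operand is cast to long, binary numeric promotion widens the int literal 100 to long anyway, so both forms multiply in 64 bits.

```java
public class LiteralPromotion {
    public static void main(String[] args) {
        int tc = 2147483647;
        // The literal's type is irrelevant once one operand is already long.
        System.out.println((long) tc * 100 == (long) tc * 100L); // true
    }
}
```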