Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bounty $50] Inconsistent results between GPU and CPU when integers overflow. #38

Open
freemo opened this issue Dec 30, 2016 · 62 comments
Labels
bounty $$$ Cash reward! bug Fix something that is broken

Comments

@freemo
Copy link
Member

freemo commented Dec 30, 2016

The following code produces different results when run on the GPU vs the CPU.

import com.aparapi.*;

public class Main {
    public static void main(String[] args) {
        int num = 1;

        final long[] result = new long[num];
        final int start = Integer.MAX_VALUE;

        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                final int id = getGlobalId();
                result[id] = calculate(start + id);
            }
        };
        kernel.execute(num);

        System.out.println( "expected: " +  calculate(start) + " result: " + result[0]);
    }

    public static long calculate(int tc) {
        return (long) tc * 100;
    }
}

The output from the above code snippet is:

expected: 214748364700 result: 4294967196

I tested this on my Macbook pro but others noticed the problem as well on other unspecified platforms. Also changin the calculate function such that 100 is a long rather than an integer with return (long) tc * 100l; (notice the letter l at the end of the 100) will produce the exact same incorrect results as above.

@freemo
Copy link
Member Author

freemo commented Dec 30, 2016

IT should be noted this is not simply an issue with rolling over. if I change the calculate line to return (long) tc + 212600881053l; instead it should produce the same result mathematically using addition rather than multiplication. Despite this the program actually runs successfully with this new edit producing the following result:

expected: 214748364700 result: 214748364700

@savaskoc
Copy link

Hi, thanks for opening this issue to here.

I think there is something wrong in OpenCl. I am sending kernel and host code in C language. It STILL gives wrong results on GPU.

kernel.zip

@freemo freemo added the bug Fix something that is broken label Jan 2, 2017
@CC007
Copy link

CC007 commented Jan 2, 2017

What opencl implementation are you using and what cpu/gpu does the macbook have?

@savaskoc
Copy link

savaskoc commented Jan 2, 2017

It's default implementation in macOS. I'm using 10.12 (16A320) MacBook Pro (Retina, 13-inch, Late 2013). It has i5 4258U CPU and Iris 5100 GPU

@CC007
Copy link

CC007 commented Jan 2, 2017

Are you using the amd app sdk?

@CC007
Copy link

CC007 commented Jan 2, 2017

Also in the calculate can you try it wi brackets around the cast, putting the multiplication outside of the brackets?

@freemo
Copy link
Member Author

freemo commented Jan 2, 2017

@CC007 The issue seems to occur on linux as well and on AMD App SDK. It appears this bug is not platform specific. IF you runt he code in the original post on your local computer will probably see the bug as well. Did you try? the bug behaves very oddly for me. for example even though id is always 0 if you remove the "+ id" part int he calculate call it wont break anymore.

@savaskoc
Copy link

savaskoc commented Jan 2, 2017

Anyone check the c code i posted?

@freemo
Copy link
Member Author

freemo commented Jan 2, 2017

@savaskoc Not yet but i will give it a go this evening. If your saying it produces the same incorrect results however then I expect you to be correct that it is an opencl issue directly.

@savaskoc
Copy link

savaskoc commented Jan 2, 2017

I think this is an OpenCl issue because GPU produces incorrect results independently from platform and/or language (C, Java, Python etc.).

@freemo
Copy link
Member Author

freemo commented Jan 2, 2017

@savaskoc You realize the bug is producing the result but clipping it to 32 bits? To use the test I posted above as the example (the same is true for the numbers you posted) here is the breakdown.

Expected result is 214748364700 which in binary would be:

0011 0001 1111 1111 1111 1111 1111 1111 1001 1100

actual result we get is 4294967196 which in binary is:

0000 0000 1111 1111 1111 1111 1111 1111 1001 1100

Basically just drops all but the last 32 bits.

So while this is a legitimate bug, it is definitely occurring due to mishandling of 64bit variables.

@savaskoc
Copy link

savaskoc commented Jan 2, 2017

I'm aware of that. I tried another types than long but still gives incorrect results. Maybe some GPU's can't process 64 bit data types?

@freemo
Copy link
Member Author

freemo commented Jan 2, 2017

@savaskoc I think there is more to it than that. I think when i was testing i tried some variants that produced some very odd results. I need to test this again to make sure I'm remembering correctly but when I removed the id variable (which in this test is always 0 anyway so shouldnt make a difference) it actually caused the correct results to be produced. Once I saw that behavior it was apparent that we were talking about a legitimate bug, I just cant confirm yet if the bug is isolated to aparapi or an OpenCL issue in general yet (I need to do more testing).

@freemo
Copy link
Member Author

freemo commented Jan 2, 2017

@savaskoc also as I stated in the OP if you change the operation to addition but change the value of the operand to make it mathematically equivalent, it magically works. So the problem seems to occur only on multiplication but not addition. this leads me to believe it is a genuine error rather than a hardware compatibility issue or something.

@CC007
Copy link

CC007 commented Jan 2, 2017

Does opencl provide software 64bit support or hardware 64bit support (or both)?

@grfrost
Copy link
Contributor

grfrost commented Jan 2, 2017

My guess is that savaskoc is correct. I think that the GPU OpenCL runtime does not support long.

Aparapi will detect this for doubles, surprised it does not detect this for long.

@savaskoc
Copy link

savaskoc commented Jan 2, 2017

@grfrost If I am correct, why (long) tc + 212600881053l; line works but (long) tc * 100; not?

@grfrost
Copy link
Contributor

grfrost commented Jan 2, 2017

Actually, I was about to retract ;) from the 1.0 spec https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/scalarDataTypes.html it does not look like cl_long is optional. So now I think I will blame Aparapi. My guess is that as the AST is built we end up with

     (long)
        | 
        *
 tc         100

Instead of

             *
         /        \
    (long)         100
        |
       tc

Can you try

return 100 * (long) tc ;

and/or

return (long)(tc+0L) * 100;

(hope my diagrams make it unscathed)

@savaskoc
Copy link

savaskoc commented Jan 2, 2017

I doubt. even original OpenCl does not produce correct results. Check my kernel above.

@CC007
Copy link

CC007 commented Jan 2, 2017

I ran the code on my computer (the code from freemo) and it runs as it should.

Using:
CPU: 4670k
GPU: 1070GTX
Nvidia and intel openCL implementation

@savaskoc
Copy link

savaskoc commented Jan 2, 2017

@CC007 You're saying that you get same results both GPU and CPU mode right? Can you try my kernel in C?

@CC007
Copy link

CC007 commented Jan 2, 2017

I will try that one next. I also tested freemo's code by setting it up to use the cpu opencl device

@savaskoc
Copy link

savaskoc commented Jan 2, 2017

It generated within build phase. Did you add kernel.cl to compile sources?
screen shot 2017-01-02 at 22 57 56

@CC007
Copy link

CC007 commented Jan 2, 2017

@savaskoc The problem seems to be that I don't have any opencl sdk, only the driver and implementation afaik, as I am not using the AMD app SDK

@CC007
Copy link

CC007 commented Jan 2, 2017

I have a pc :)

@CC007
Copy link

CC007 commented Jan 2, 2017

I think you misunderstand, It wasn't the kernel.cl.h that caused a problem.
I dont have the OpenCL/opencl.h, because I don't have an opencl SDK. I'm installing one now (Intel openCL sdk)

@CC007
Copy link

CC007 commented Jan 2, 2017

Ok, now that that is installed, it seems that there are types used that don't come from openCL

@grfrost
Copy link
Contributor

grfrost commented Jan 2, 2017

On my macbook pro the following yields the same error.

clang++ -framework OpenCL longtst.cpp -o longtst
#include <iostream>
#ifdef __APPLE__
#include <opencl/opencl.h>
#else
#include <CL/opencl.h>
#endif

#define DATA_SIZE 1
#define LONG_DATA_SIZE DATA_SIZE*sizeof(cl_long)
int main(int argc, char **argv){
   long out[DATA_SIZE];
   out[0]=0L;

   // How many platforms are there ?
   cl_uint platformc = 0;
   clGetPlatformIDs(0, NULL, &platformc);

   if (platformc >0){
      // Extract a list of available platforms
      cl_platform_id *platforms = new cl_platform_id[platformc];
      clGetPlatformIDs(platformc, platforms, NULL);
   
      cl_device_id device_id=0;
      // loop through platforms until we have a valid GPU device 
      for (unsigned int i = 0; !device_id && i < platformc; ++i) {
         clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 1, &device_id, nullptr);
      }
      delete[] platforms;
   
      // only device_id context and command queue needed below
   
      if (device_id){
         cl_int err;

         // Create a context
         cl_context context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

         // Create command queue for this context
         cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &err);

         // Here is our OpenCL kernel source for
         const char *source = 
            "__constant int start = 2147483647;\n"
            "static long calculate(int tc){\n"
            "    return (long)tc * 100;\n"
            "}\n"
            "__kernel void longtst(__global long *result){\n"
            "   int id = get_global_id(0);\n"
            "   result[id] = calculate(start + id);\n"
            "}\n";

         // Compile source 
         cl_program program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &err);
         err = clBuildProgram(program, 1, &(device_id), NULL, NULL, NULL);

         // Extract and show any compile errors or warnings
         if (err != CL_SUCCESS){
            size_t len;
            err = clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
            if (len >0){
              len++; // for '\0'
              char *compile_log = (char *) malloc(len);
              clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, len, (void *)compile_log, NULL);
              std::cerr <<"log{"<<std::endl<< compile_log << std::endl<<"}"<<std::endl;
              free (compile_log);
            }
         }

         // A program can have more than one kernel, select the kernel we want to call
         cl_kernel kernel = clCreateKernel(program, "longtst", &err);

         // Create buffers which 'wrap' the host data
         cl_mem outBuf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_WRITE_ONLY, LONG_DATA_SIZE, (void*)out, &err);

         // Set any kernel args
         err = clSetKernelArg(kernel, 0 , sizeof(cl_mem), &outBuf);

         // An event list helps us dispatch efficiently
#define EVENTS 2
         cl_event *events = new cl_event[EVENTS];

         // Decide how to partition the execution (we choose 1 group 1 threads)
         size_t globalRange = DATA_SIZE;
         size_t localRange = DATA_SIZE;

         // Enqueue the execution
         err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalRange, &localRange, 0, NULL, &events[1]);
   
         // Enqueue a read of 'out' data to the command queue
         err = clEnqueueReadBuffer(command_queue, outBuf, CL_TRUE, 0, LONG_DATA_SIZE, out, 1, &events[1], &events[0]);  

         // Wait for all the dispatches to complete.
         err = clWaitForEvents(EVENTS, events);
   
         // Release and delete the events 
         for (int i=0; i<EVENTS; i++){
            err = clReleaseEvent(events[i]);
         }
         delete[] events;

         // Release mem objects
         clReleaseMemObject(outBuf);

         // Release Kernel
         clReleaseKernel(kernel);

         // Release Program
         clReleaseProgram(program);

         // Release Context
         clReleaseContext(context);

         // Release Command Queue
         clReleaseCommandQueue(command_queue);

         // Note that we don't releaese any type ending in _id
   
         std::cout << out[0] << std::endl;
      }
   }
}

@grfrost
Copy link
Contributor

grfrost commented Jan 3, 2017

From this page -
https://answers.microsoft.com/en-us/windows/forum/windows_7-performance/error-the-application-was-unable-to-start/05a2b904-3f61-4d08-94d6-e2ff92161111?auth=1

"This error is most likely a result of a 32-bit (x86) executable trying to load a 64-bit (x64) DLL. You might have to adjust your PATH or copy DLLs to avoid this. For example, my PATH is set up to find the x64 version of d3dcompiler_46.dll (for DX11.1)."

Maybe your path has 32 dll's in it.

I don't really use Windows these days, there is a great tool for debugging this sort of thing called http://www.dependencywalker.com/.

If you install it and then point it at your executable, it will show you the dll's it is trying to load and thus the error.

@freemo
Copy link
Member Author

freemo commented Jan 3, 2017

Ok so few things.

@grfrost The DLL for Aparapi itself no longer needs to be installed. It is loaded into the path dynamically by the aparapi-jni project which is a dependency on aparapi. So if this particular DLL were the problem (loading the 32bit version on a 64 bit machine) then the problem would still be with aparapi rather than the user. However in the past on windows we have seen an error where if you try to load the 32 bit dll on a 64 bit system it refuses to load and throws an exception. According to @CC007 however the error discussed in this bug doesnt occur on windows but it has been observed on both Mac and Linux. The Mac dylib is 64bit only.

Note that OpenCL still needs to be installed manually of course. It is just the aparapi dll that no longer needs to be installed manually.

So based on these details I suspect it isnt the aparapi native library that is the issue. It doesnt rule out that a 32bit of opencl was installed or something of course. But the fact that all of us were able to get the error on mac and linux seems suspicious to this point (unless we all made the same mistake when setting up our environment?).

Another point I'm going to explore in a few minutes is to confirm @CC007 comment that it works on windows. This seems like an important detail that might help us to debug.

Another point to consider if this is a 32bit dll issue is how the dll to be loaded is chosen by aparapi-jni. It uses the arch property to determine if the system is 32bit or 64bit. However one caveat I have not been able to test yet is a 32bit JVM on a 64bit system. In this case the arch would be reported as 32bit on windows. Ergo this may be part of the problem. However since old aparapi (before aparapi-jni was written) still had this problem I am not convinced this is the issue.

Anyway, just all things to consider. I am investigating this as we speak to see if i can find any more clues.

@freemo
Copy link
Member Author

freemo commented Jan 3, 2017

So I just spent over an hour trying to tinker with this problem again and my results this time are more perplexing than the last.

First I went over to my windows box and ran the above code and it did not produce the error we had been seeing. Excited this might be a clue I wanted to double check my work and headed back over to linux and ran the original code I pasted in the first post of this issue. To my disbelief it no longer produced the erroneous results either, it appears to have magically started working. This was odd as the bug was consistently produced when I tried running this code yesterday.

Thinking this may have been an a problem with my use case; perhaps I made a mistake when i copy and pasted it to the issue. So I loaded the original code supplied by @savaskoc again to re-rerun it. Again to my amazement it no longer produced the bug on either linux or windows for me. The following is the code I just ran on linux that is now working but previously was not:

public class Main {
    public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
        int num = 406816900 - 406816880;

        TCKernel kernel = new TCKernel(406816880, num);
        kernel.execute(num);


        PrintWriter writer = new PrintWriter("numbers.txt");
        kernel.saveResults(writer);
        writer.flush();
    }

    public static class TCKernel extends Kernel {
        long[] result;
        int start;

        public TCKernel(int start, int num) {
            this.result = new long[num];
            this.start = start;
        }

        @Override
        public void run() {
            result[getGlobalId()] = calculate(start + getGlobalId());
        }

        public long calculate(int tc) {
            int num = tc;
            int n9 = num % 10;
            num /= 10;
            int n8 = num % 10;
            num /= 10;
            int n7 = num % 10;
            num /= 10;
            int n6 = num % 10;
            num /= 10;
            int n5 = num % 10;
            num /= 10;
            int n4 = num % 10;
            num /= 10;
            int n3 = num % 10;
            num /= 10;
            int n2 = num % 10;
            num /= 10;
            int n1 = num % 10;

            int odds = n1 + n3 + n5 + n7 + n9;
            int evens = n2 + n4 + n6 + n8;

            int n10 = (odds * 7 - evens) % 10;
            int n11 = (odds + evens + n10) % 10;

            return (long) tc * 100 + (n10 * 10 + n11);
        }

        public void saveResults(PrintWriter writer) {
            writer.println("Result\t\tNum\t\tExpected");
            for (int i = 0; i < result.length; i++) {
                int tc = start + i;
                writer.printf("%d\t%d\t%d%s", result[i], tc, calculate(tc), System.lineSeparator());
            }
        }
    }
}
Result		Num		Expected
40681688012	406816880	40681688012
40681688180	406816881	40681688180
40681688258	406816882	40681688258
40681688326	406816883	40681688326
40681688494	406816884	40681688494
40681688562	406816885	40681688562
40681688630	406816886	40681688630
40681688708	406816887	40681688708
40681688876	406816888	40681688876
40681688944	406816889	40681688944
40681689002	406816890	40681689002
40681689170	406816891	40681689170
40681689248	406816892	40681689248
40681689316	406816893	40681689316
40681689484	406816894	40681689484
40681689552	406816895	40681689552
40681689620	406816896	40681689620
40681689798	406816897	40681689798
40681689866	406816898	40681689866
40681689934	406816899	40681689934

So I'm not sure what to do now, the bug appears to be intermittent somehow. The code yesterday was consistently giving me a bad result and now it is consistently giving me the correct result. Since I can no longer reproduce the error I was unable to debug the problem much to arrive at a solution.

Needless to say this has become a very frustrating bug for me now. @grfrost Can you think of any reason in aparapi that might produce intermittent results like this?

Have either of you ever witnessed it spontaneously start working during one execution or more?

UPDATE: After some reflection I think I was originally testing this on Mac and not linux after all. So my conclusion is that it isnt as weird as i first though. It simply is, and always has been, a mac only issue.

@freemo
Copy link
Member Author

freemo commented Jan 3, 2017

@CC007 Did you have an ubuntu box going somewhere you could run it on? If possible I'd be curious to see what sort of results you get when you run it on that box?

@freemo
Copy link
Member Author

freemo commented Jan 3, 2017

Ok. So here is my personal conclusion, tell me if you guys agree.

I have tested locally, it works on windows and linux but the bug occurs on mac. Both @grfrost and @savaskoc saw the error but only on Mac. Therefore unless anyone has experienced this bug on non-mac systems, I am going to conclude this is a problem that only occurs on mac.

Since this is a mac only issue it seems most likely the bug is in the OSX implementation of OpenCL and not in aparapi itself.

Any reason for anyone to suspect this isnt the case?

@savaskoc
Copy link

savaskoc commented Jan 3, 2017

I think so that's a bug about osx's OpenCl implementation.

@freemo
Copy link
Member Author

freemo commented Jan 3, 2017

@savaskoc ok that means we need to see if we can find if a bug was already filed or not. I fit was we can reference it here until it is fixed. If not we should file one. Did you file a bug report with them yet?

@savaskoc
Copy link

savaskoc commented Jan 3, 2017

I did but it would be good if you file another one

@freemo
Copy link
Member Author

freemo commented Jan 3, 2017

@savaskoc It isnt usually good practice to file the same bug twice. But I might be able to add useful comments to the bug you filed. Do you have a link to your bug?

@savaskoc
Copy link

savaskoc commented Jan 3, 2017

Bug reporter doesn't provide a link. I can attach your comments to bug file, or you can file them

@freemo
Copy link
Member Author

freemo commented Jan 3, 2017

@savaskoc well link me to the site where you reported it so I can poke around at least.

@savaskoc
Copy link

savaskoc commented Jan 3, 2017

@CC007
Copy link

CC007 commented Jan 6, 2017

@grfrost It seems that the executable does use 64bit opencl, msvcrt and kernel libraries, but the libstdC++ library is 32bit. I think that I read that this was an issue with Code::Blocks in combination with mingw, but using any of the other compilers gives the errors I posted previously. Thanks for the help though

@CC007
Copy link

CC007 commented Jan 6, 2017

@grfrost This is what made it run without error http://stackoverflow.com/a/6405064, so probably I don't have a 64bit libstdC++ dynamic library installed.

Also should that code only output -100?

@grfrost
Copy link
Contributor

grfrost commented Jan 6, 2017 via email

@CC007
Copy link

CC007 commented Jan 6, 2017

@grfrost Also, your kernel.zip contains Mac specific code (libdispatch and its use of blocks), so I can't run that.

@freemo
Copy link
Member Author

freemo commented Jan 6, 2017 via email

@grfrost
Copy link
Contributor

grfrost commented Jan 6, 2017 via email

@CC007
Copy link

CC007 commented Jan 6, 2017

@grfrost ah I see now that it was @savaskoc who posted the code. Btw there is a libdispatch library for windows, but the block notation doesn't seem to be supported by mingw gcc.

@TPolzer
Copy link

TPolzer commented Jan 10, 2017

Isn't this a clear case of undefined behaviour? As far as I know OpenCL C signed integer arithmetic overflow is only defined for atomic operations, so start + id is undefined if id is a positive int and start is Integer.MAX_VALUE.

@CC007
Copy link

CC007 commented Jan 10, 2017

Addition is atomic but multiplication is indeed not atomic in the opencl specification for neither 32 nor 64 bit numbers: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/atomicFunctions.html

@TPolzer
Copy link

TPolzer commented Jan 10, 2017

I was referring to the https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/atomic_fetch_key.html which guarantees defined overflow for atomic_fetch_add. Aparapi is not using that function for addition, and for good reason. In general, overflow of signed integers is undefined in C, and I think OpenCL C follows C in that area.

@CC007
Copy link

CC007 commented Jan 10, 2017

@TPolzer "For signed integer types, arithmetic is defined to use two’s complement representation with silent wrap-around on overflow; there are no undefined results." I don't see what you mean.

This might however explain why multiplication can cause issues

@TPolzer
Copy link

TPolzer commented Jan 10, 2017

The problem is, that it does not apply in this situation, but only on explicitly atomic fetch and add.

@nejeoui
Copy link

nejeoui commented Oct 18, 2017

I have noticed the same problem : Inconsistent results between GPU and CPU and between different GPUs

this kernel calculate the product of two Big Integers 256 bits each using the product scanning algorithm.

This Kernel gives different results on CPU (correct) / GPU (Intel Iris) (wrong)
This Kernel gives different results on GPU (Nvidia Tesla m40)(correct) / GPU (Intel Iris)(wrong)

public void multiplyProductScanning(final byte[] a,final byte[] b,final byte[] output) {

int UV=0;
int U=0;
int V=0;

for(int k=62;k>=0;k--) {
	UV=0;
	for(int i=MAX(0,k-31);i<=MIN(k,31);i++) 
		UV+=(a[i]&0xFF)*(b[k-i]&0xFF);
	UV=UV+U;
	U=(UV&0xFFFFFF00)>>8;
	V=UV&0xFF;
	output[k+1]=(byte)V;
	
}
output[0]=(byte)U;

}

Notice that we canot have an overflow in UV :
a[i]&0xFF < 256
(b[k-i]&0xFF <256
UV < 32 x 256 x 256

when i declare UV as long the results are correct in CPU and 4 different GPU.

@freemo
Copy link
Member Author

freemo commented Oct 18, 2017 via email

@freemo freemo added help wanted Have a question? Need assistance? bounty $$$ Cash reward! labels Apr 19, 2018
@freemo freemo changed the title Inconsistent results between GPU and CPU when integers overflow. [Bounty $50] Inconsistent results between GPU and CPU when integers overflow. Apr 19, 2018
@freemo freemo removed the help wanted Have a question? Need assistance? label Apr 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bounty $$$ Cash reward! bug Fix something that is broken
Projects
None yet
Development

No branches or pull requests

6 participants