[Bounty $50] Inconsistent results between GPU and CPU when integers overflow. #38

freemo · 2016-12-30T15:41:17Z

The following code produces different results when run on the GPU vs the CPU.

import com.aparapi.*;

public class Main {
    public static void main(String[] args) {
        int num = 1;

        final long[] result = new long[num];
        final int start = Integer.MAX_VALUE;

        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                final int id = getGlobalId();
                result[id] = calculate(start + id);
            }
        };
        kernel.execute(num);

        System.out.println( "expected: " +  calculate(start) + " result: " + result[0]);
    }

    public static long calculate(int tc) {
        return (long) tc * 100;
    }
}

The output from the above code snippet is:

expected: 214748364700 result: 4294967196

I tested this on my Macbook pro but others noticed the problem as well on other unspecified platforms. Also changin the calculate function such that 100 is a long rather than an integer with return (long) tc * 100l; (notice the letter l at the end of the 100) will produce the exact same incorrect results as above.

The text was updated successfully, but these errors were encountered:

freemo · 2016-12-30T15:57:45Z

IT should be noted this is not simply an issue with rolling over. if I change the calculate line to return (long) tc + 212600881053l; instead it should produce the same result mathematically using addition rather than multiplication. Despite this the program actually runs successfully with this new edit producing the following result:

expected: 214748364700 result: 214748364700

savaskoc · 2016-12-30T22:32:14Z

Hi, thanks for opening this issue to here.

I think there is something wrong in OpenCl. I am sending kernel and host code in C language. It STILL gives wrong results on GPU.

kernel.zip

CC007 · 2017-01-02T16:36:53Z

What opencl implementation are you using and what cpu/gpu does the macbook have?

savaskoc · 2017-01-02T16:40:41Z

It's default implementation in macOS. I'm using 10.12 (16A320) MacBook Pro (Retina, 13-inch, Late 2013). It has i5 4258U CPU and Iris 5100 GPU

CC007 · 2017-01-02T17:52:22Z

Are you using the amd app sdk?

CC007 · 2017-01-02T17:58:00Z

Also in the calculate can you try it wi brackets around the cast, putting the multiplication outside of the brackets?

freemo · 2017-01-02T18:00:54Z

@CC007 The issue seems to occur on linux as well and on AMD App SDK. It appears this bug is not platform specific. IF you runt he code in the original post on your local computer will probably see the bug as well. Did you try? the bug behaves very oddly for me. for example even though id is always 0 if you remove the "+ id" part int he calculate call it wont break anymore.

savaskoc · 2017-01-02T18:05:57Z

Anyone check the c code i posted?

freemo · 2017-01-02T18:07:42Z

@savaskoc Not yet but i will give it a go this evening. If your saying it produces the same incorrect results however then I expect you to be correct that it is an opencl issue directly.

savaskoc · 2017-01-02T18:11:29Z

I think this is an OpenCl issue because GPU produces incorrect results independently from platform and/or language (C, Java, Python etc.).

freemo · 2017-01-02T18:20:37Z

@savaskoc You realize the bug is producing the result but clipping it to 32 bits? To use the test I posted above as the example (the same is true for the numbers you posted) here is the breakdown.

Expected result is 214748364700 which in binary would be:

0011 0001 1111 1111 1111 1111 1111 1111 1001 1100

actual result we get is 4294967196 which in binary is:

0000 0000 1111 1111 1111 1111 1111 1111 1001 1100

Basically just drops all but the last 32 bits.

So while this is a legitimate bug, it is definitely occurring due to mishandling of 64bit variables.

savaskoc · 2017-01-02T18:27:25Z

I'm aware of that. I tried another types than long but still gives incorrect results. Maybe some GPU's can't process 64 bit data types?

freemo · 2017-01-02T18:30:18Z

@savaskoc I think there is more to it than that. I think when i was testing i tried some variants that produced some very odd results. I need to test this again to make sure I'm remembering correctly but when I removed the id variable (which in this test is always 0 anyway so shouldnt make a difference) it actually caused the correct results to be produced. Once I saw that behavior it was apparent that we were talking about a legitimate bug, I just cant confirm yet if the bug is isolated to aparapi or an OpenCL issue in general yet (I need to do more testing).

freemo · 2017-01-02T18:33:12Z

@savaskoc also as I stated in the OP if you change the operation to addition but change the value of the operand to make it mathematically equivalent, it magically works. So the problem seems to occur only on multiplication but not addition. this leads me to believe it is a genuine error rather than a hardware compatibility issue or something.

CC007 · 2017-01-02T18:44:31Z

Does opencl provide software 64bit support or hardware 64bit support (or both)?

grfrost · 2017-01-02T18:44:43Z

My guess is that savaskoc is correct. I think that the GPU OpenCL runtime does not support long.

Aparapi will detect this for doubles, surprised it does not detect this for long.

savaskoc · 2017-01-02T18:45:58Z

@grfrost If I am correct, why (long) tc + 212600881053l; line works but (long) tc * 100; not?

grfrost · 2017-01-02T18:52:39Z

Actually, I was about to retract ;) from the 1.0 spec https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/scalarDataTypes.html it does not look like cl_long is optional. So now I think I will blame Aparapi. My guess is that as the AST is built we end up with

     (long)
        | 
        *
 tc         100

Instead of

             *
         /        \
    (long)         100
        |
       tc

Can you try

return 100 * (long) tc ;

and/or

return (long)(tc+0L) * 100;

(hope my diagrams make it unscathed)

savaskoc · 2017-01-02T18:55:04Z

I doubt. even original OpenCl does not produce correct results. Check my kernel above.

CC007 · 2017-01-02T19:41:50Z

I ran the code on my computer (the code from freemo) and it runs as it should.

Using:
CPU: 4670k
GPU: 1070GTX
Nvidia and intel openCL implementation

savaskoc · 2017-01-02T19:43:54Z

@CC007 You're saying that you get same results both GPU and CPU mode right? Can you try my kernel in C?

CC007 · 2017-01-02T19:52:33Z

I will try that one next. I also tested freemo's code by setting it up to use the cpu opencl device

savaskoc · 2017-01-02T19:58:19Z

It generated within build phase. Did you add kernel.cl to compile sources?

CC007 · 2017-01-02T20:01:26Z

@savaskoc The problem seems to be that I don't have any opencl sdk, only the driver and implementation afaik, as I am not using the AMD app SDK

CC007 · 2017-01-02T20:16:11Z

I have a pc :)

CC007 · 2017-01-02T20:47:45Z

I think you misunderstand, It wasn't the kernel.cl.h that caused a problem.
I dont have the OpenCL/opencl.h, because I don't have an opencl SDK. I'm installing one now (Intel openCL sdk)

CC007 · 2017-01-02T21:09:54Z

Ok, now that that is installed, it seems that there are types used that don't come from openCL

grfrost · 2017-01-02T21:34:20Z

On my macbook pro the following yields the same error.

clang++ -framework OpenCL longtst.cpp -o longtst

#include <iostream>
#ifdef __APPLE__
#include <opencl/opencl.h>
#else
#include <CL/opencl.h>
#endif

#define DATA_SIZE 1
#define LONG_DATA_SIZE DATA_SIZE*sizeof(cl_long)
int main(int argc, char **argv){
   long out[DATA_SIZE];
   out[0]=0L;

   // How many platforms are there ?
   cl_uint platformc = 0;
   clGetPlatformIDs(0, NULL, &platformc);

   if (platformc >0){
      // Extract a list of available platforms
      cl_platform_id *platforms = new cl_platform_id[platformc];
      clGetPlatformIDs(platformc, platforms, NULL);
   
      cl_device_id device_id=0;
      // loop through platforms until we have a valid GPU device 
      for (unsigned int i = 0; !device_id && i < platformc; ++i) {
         clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_GPU, 1, &device_id, nullptr);
      }
      delete[] platforms;
   
      // only device_id context and command queue needed below
   
      if (device_id){
         cl_int err;

         // Create a context
         cl_context context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

         // Create command queue for this context
         cl_command_queue command_queue = clCreateCommandQueue(context, device_id, 0, &err);

         // Here is our OpenCL kernel source for
         const char *source = 
            "__constant int start = 2147483647;\n"
            "static long calculate(int tc){\n"
            "    return (long)tc * 100;\n"
            "}\n"
            "__kernel void longtst(__global long *result){\n"
            "   int id = get_global_id(0);\n"
            "   result[id] = calculate(start + id);\n"
            "}\n";

         // Compile source 
         cl_program program = clCreateProgramWithSource(context, 1, (const char **) &source, NULL, &err);
         err = clBuildProgram(program, 1, &(device_id), NULL, NULL, NULL);

         // Extract and show any compile errors or warnings
         if (err != CL_SUCCESS){
            size_t len;
            err = clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
            if (len >0){
              len++; // for '\0'
              char *compile_log = (char *) malloc(len);
              clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, len, (void *)compile_log, NULL);
              std::cerr <<"log{"<<std::endl<< compile_log << std::endl<<"}"<<std::endl;
              free (compile_log);
            }
         }

         // A program can have more than one kernel, select the kernel we want to call
         cl_kernel kernel = clCreateKernel(program, "longtst", &err);

         // Create buffers which 'wrap' the host data
         cl_mem outBuf = clCreateBuffer(context, CL_MEM_USE_HOST_PTR|CL_MEM_WRITE_ONLY, LONG_DATA_SIZE, (void*)out, &err);

         // Set any kernel args
         err = clSetKernelArg(kernel, 0 , sizeof(cl_mem), &outBuf);

         // An event list helps us dispatch efficiently
#define EVENTS 2
         cl_event *events = new cl_event[EVENTS];

         // Decide how to partition the execution (we choose 1 group 1 threads)
         size_t globalRange = DATA_SIZE;
         size_t localRange = DATA_SIZE;

         // Enqueue the execution
         err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL, &globalRange, &localRange, 0, NULL, &events[1]);
   
         // Enqueue a read of 'out' data to the command queue
         err = clEnqueueReadBuffer(command_queue, outBuf, CL_TRUE, 0, LONG_DATA_SIZE, out, 1, &events[1], &events[0]);  

         // Wait for all the dispatches to complete.
         err = clWaitForEvents(EVENTS, events);
   
         // Release and delete the events 
         for (int i=0; i<EVENTS; i++){
            err = clReleaseEvent(events[i]);
         }
         delete[] events;

         // Release mem objects
         clReleaseMemObject(outBuf);

         // Release Kernel
         clReleaseKernel(kernel);

         // Release Program
         clReleaseProgram(program);

         // Release Context
         clReleaseContext(context);

         // Release Command Queue
         clReleaseCommandQueue(command_queue);

         // Note that we don't releaese any type ending in _id
   
         std::cout << out[0] << std::endl;
      }
   }
}

grfrost · 2017-01-03T03:52:02Z

From this page -
https://answers.microsoft.com/en-us/windows/forum/windows_7-performance/error-the-application-was-unable-to-start/05a2b904-3f61-4d08-94d6-e2ff92161111?auth=1

"This error is most likely a result of a 32-bit (x86) executable trying to load a 64-bit (x64) DLL. You might have to adjust your PATH or copy DLLs to avoid this. For example, my PATH is set up to find the x64 version of d3dcompiler_46.dll (for DX11.1)."

Maybe your path has 32 dll's in it.

I don't really use Windows these days, there is a great tool for debugging this sort of thing called http://www.dependencywalker.com/.

If you install it and then point it at your executable, it will show you the dll's it is trying to load and thus the error.

freemo · 2017-01-03T06:52:31Z

Ok so few things.

@grfrost The DLL for Aparapi itself no longer needs to be installed. It is loaded into the path dynamically by the aparapi-jni project which is a dependency on aparapi. So if this particular DLL were the problem (loading the 32bit version on a 64 bit machine) then the problem would still be with aparapi rather than the user. However in the past on windows we have seen an error where if you try to load the 32 bit dll on a 64 bit system it refuses to load and throws an exception. According to @CC007 however the error discussed in this bug doesnt occur on windows but it has been observed on both Mac and Linux. The Mac dylib is 64bit only.

Note that OpenCL still needs to be installed manually of course. It is just the aparapi dll that no longer needs to be installed manually.

So based on these details I suspect it isnt the aparapi native library that is the issue. It doesnt rule out that a 32bit of opencl was installed or something of course. But the fact that all of us were able to get the error on mac and linux seems suspicious to this point (unless we all made the same mistake when setting up our environment?).

Another point I'm going to explore in a few minutes is to confirm @CC007 comment that it works on windows. This seems like an important detail that might help us to debug.

Another point to consider if this is a 32bit dll issue is how the dll to be loaded is chosen by aparapi-jni. It uses the arch property to determine if the system is 32bit or 64bit. However one caveat I have not been able to test yet is a 32bit JVM on a 64bit system. In this case the arch would be reported as 32bit on windows. Ergo this may be part of the problem. However since old aparapi (before aparapi-jni was written) still had this problem I am not convinced this is the issue.

Anyway, just all things to consider. I am investigating this as we speak to see if i can find any more clues.

freemo · 2017-01-03T14:01:08Z

So I just spent over an hour trying to tinker with this problem again and my results this time are more perplexing than the last.

First I went over to my windows box and ran the above code and it did not produce the error we had been seeing. Excited this might be a clue I wanted to double check my work and headed back over to linux and ran the original code I pasted in the first post of this issue. To my disbelief it no longer produced the erroneous results either, it appears to have magically started working. This was odd as the bug was consistently produced when I tried running this code yesterday.

Thinking this may have been an a problem with my use case; perhaps I made a mistake when i copy and pasted it to the issue. So I loaded the original code supplied by @savaskoc again to re-rerun it. Again to my amazement it no longer produced the bug on either linux or windows for me. The following is the code I just ran on linux that is now working but previously was not:

public class Main {
    public static void main(String[] args) throws FileNotFoundException, UnsupportedEncodingException {
        int num = 406816900 - 406816880;

        TCKernel kernel = new TCKernel(406816880, num);
        kernel.execute(num);


        PrintWriter writer = new PrintWriter("numbers.txt");
        kernel.saveResults(writer);
        writer.flush();
    }

    public static class TCKernel extends Kernel {
        long[] result;
        int start;

        public TCKernel(int start, int num) {
            this.result = new long[num];
            this.start = start;
        }

        @Override
        public void run() {
            result[getGlobalId()] = calculate(start + getGlobalId());
        }

        public long calculate(int tc) {
            int num = tc;
            int n9 = num % 10;
            num /= 10;
            int n8 = num % 10;
            num /= 10;
            int n7 = num % 10;
            num /= 10;
            int n6 = num % 10;
            num /= 10;
            int n5 = num % 10;
            num /= 10;
            int n4 = num % 10;
            num /= 10;
            int n3 = num % 10;
            num /= 10;
            int n2 = num % 10;
            num /= 10;
            int n1 = num % 10;

            int odds = n1 + n3 + n5 + n7 + n9;
            int evens = n2 + n4 + n6 + n8;

            int n10 = (odds * 7 - evens) % 10;
            int n11 = (odds + evens + n10) % 10;

            return (long) tc * 100 + (n10 * 10 + n11);
        }

        public void saveResults(PrintWriter writer) {
            writer.println("Result\t\tNum\t\tExpected");
            for (int i = 0; i < result.length; i++) {
                int tc = start + i;
                writer.printf("%d\t%d\t%d%s", result[i], tc, calculate(tc), System.lineSeparator());
            }
        }
    }
}

Result		Num		Expected
40681688012	406816880	40681688012
40681688180	406816881	40681688180
40681688258	406816882	40681688258
40681688326	406816883	40681688326
40681688494	406816884	40681688494
40681688562	406816885	40681688562
40681688630	406816886	40681688630
40681688708	406816887	40681688708
40681688876	406816888	40681688876
40681688944	406816889	40681688944
40681689002	406816890	40681689002
40681689170	406816891	40681689170
40681689248	406816892	40681689248
40681689316	406816893	40681689316
40681689484	406816894	40681689484
40681689552	406816895	40681689552
40681689620	406816896	40681689620
40681689798	406816897	40681689798
40681689866	406816898	40681689866
40681689934	406816899	40681689934

So I'm not sure what to do now, the bug appears to be intermittent somehow. The code yesterday was consistently giving me a bad result and now it is consistently giving me the correct result. Since I can no longer reproduce the error I was unable to debug the problem much to arrive at a solution.

Needless to say this has become a very frustrating bug for me now. @grfrost Can you think of any reason in aparapi that might produce intermittent results like this?

Have either of you ever witnessed it spontaneously start working during one execution or more?

UPDATE: After some reflection I think I was originally testing this on Mac and not linux after all. So my conclusion is that it isnt as weird as i first though. It simply is, and always has been, a mac only issue.

freemo · 2017-01-03T14:03:55Z

@CC007 Did you have an ubuntu box going somewhere you could run it on? If possible I'd be curious to see what sort of results you get when you run it on that box?

freemo · 2017-01-03T16:49:45Z

Ok. So here is my personal conclusion, tell me if you guys agree.

I have tested locally, it works on windows and linux but the bug occurs on mac. Both @grfrost and @savaskoc saw the error but only on Mac. Therefore unless anyone has experienced this bug on non-mac systems, I am going to conclude this is a problem that only occurs on mac.

Since this is a mac only issue it seems most likely the bug is in the OSX implementation of OpenCL and not in aparapi itself.

Any reason for anyone to suspect this isnt the case?

savaskoc · 2017-01-03T17:35:56Z

I think so that's a bug about osx's OpenCl implementation.

freemo · 2017-01-03T17:38:04Z

@savaskoc ok that means we need to see if we can find if a bug was already filed or not. I fit was we can reference it here until it is fixed. If not we should file one. Did you file a bug report with them yet?

savaskoc · 2017-01-03T18:26:30Z

I did but it would be good if you file another one

freemo · 2017-01-03T18:27:47Z

@savaskoc It isnt usually good practice to file the same bug twice. But I might be able to add useful comments to the bug you filed. Do you have a link to your bug?

savaskoc · 2017-01-03T18:30:11Z

Bug reporter doesn't provide a link. I can attach your comments to bug file, or you can file them

freemo · 2017-01-03T18:31:21Z

@savaskoc well link me to the site where you reported it so I can poke around at least.

savaskoc · 2017-01-03T18:31:51Z

@freemo https://bugreport.apple.com/

CC007 · 2017-01-06T18:08:56Z

@grfrost It seems that the executable does use 64bit opencl, msvcrt and kernel libraries, but the libstdC++ library is 32bit. I think that I read that this was an issue with Code::Blocks in combination with mingw, but using any of the other compilers gives the errors I posted previously. Thanks for the help though

CC007 · 2017-01-06T18:34:22Z

@grfrost This is what made it run without error http://stackoverflow.com/a/6405064, so probably I don't have a 64bit libstdC++ dynamic library installed.

Also should that code only output -100?

grfrost · 2017-01-06T19:03:08Z

@CC007 Glad you sorted it out. BTW gcc/mingw on windows is responsible for me having to drink many many beers ;)

…

On Fri, Jan 6, 2017 at 10:34 AM, CC007 ***@***.***> wrote: @grfrost <https://github.com/grfrost> This is what made it run without error http://stackoverflow.com/a/6405064, so probably I don't have a 64bit libstdC++ dynamic library installed. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#38 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEKiN0Jcke4YXm5Rjmaf5iJ1ocL-j4ePks5rPokvgaJpZM4LYPxt> .

CC007 · 2017-01-06T19:10:07Z

@grfrost Also, your kernel.zip contains Mac specific code (libdispatch and its use of blocks), so I can't run that.

freemo · 2017-01-06T19:13:34Z

FYI as of 1.3.4 the aparapi native library was compiled using Microsoft visual studios instead of GCC or minge as with previous versions.

…

On Jan 6, 2017 2:03 PM, "grfrost" ***@***.***> wrote: @CC007 Glad you sorted it out. BTW gcc/mingw on windows is responsible for me having to drink many many beers ;) On Fri, Jan 6, 2017 at 10:34 AM, CC007 ***@***.***> wrote: > @grfrost <https://github.com/grfrost> This is what made it run without > error http://stackoverflow.com/a/6405064, so probably I don't have a > 64bit libstdC++ dynamic library installed. > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#38 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/ AEKiN0Jcke4YXm5Rjmaf5iJ1ocL-j4ePks5rPokvgaJpZM4LYPxt> > . > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#38 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AC5JAn_VKEikY2MmunA9zh4bKA3BAxj1ks5rPo_sgaJpZM4LYPxt> .

grfrost · 2017-01-06T19:17:32Z

I think you are referring to Jeffreys code. I was enviously looking at those blocks ;) My code should have been straight C++ Gary

…

On Fri, Jan 6, 2017 at 11:10 AM, CC007 ***@***.***> wrote: @grfrost <https://github.com/grfrost> Also, your kernel.zip contains Mac specific code (libdispatch and its use of blocks), so I can't run that. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#38 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AEKiN06VSl7R7q9Oiy0yOytdzDtp2aUyks5rPpGQgaJpZM4LYPxt> .

CC007 · 2017-01-06T19:28:56Z

@grfrost ah I see now that it was @savaskoc who posted the code. Btw there is a libdispatch library for windows, but the block notation doesn't seem to be supported by mingw gcc.

TPolzer · 2017-01-10T18:30:00Z

Isn't this a clear case of undefined behaviour? As far as I know OpenCL C signed integer arithmetic overflow is only defined for atomic operations, so start + id is undefined if id is a positive int and start is Integer.MAX_VALUE.

CC007 · 2017-01-10T21:04:21Z

Addition is atomic but multiplication is indeed not atomic in the opencl specification for neither 32 nor 64 bit numbers: https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/atomicFunctions.html

TPolzer · 2017-01-10T21:10:47Z

I was referring to the https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/atomic_fetch_key.html which guarantees defined overflow for atomic_fetch_add. Aparapi is not using that function for addition, and for good reason. In general, overflow of signed integers is undefined in C, and I think OpenCL C follows C in that area.

CC007 · 2017-01-10T21:44:26Z

@TPolzer "For signed integer types, arithmetic is defined to use two’s complement representation with silent wrap-around on overflow; there are no undefined results." I don't see what you mean.

This might however explain why multiplication can cause issues

TPolzer · 2017-01-10T21:45:34Z

The problem is, that it does not apply in this situation, but only on explicitly atomic fetch and add.

nejeoui · 2017-10-18T10:06:39Z

I have noticed the same problem : Inconsistent results between GPU and CPU and between different GPUs

this kernel calculate the product of two Big Integers 256 bits each using the product scanning algorithm.

This Kernel gives different results on CPU (correct) / GPU (Intel Iris) (wrong)
This Kernel gives different results on GPU (Nvidia Tesla m40)(correct) / GPU (Intel Iris)(wrong)

public void multiplyProductScanning(final byte[] a,final byte[] b,final byte[] output) {

int UV=0;
int U=0;
int V=0;

for(int k=62;k>=0;k--) {
	UV=0;
	for(int i=MAX(0,k-31);i<=MIN(k,31);i++) 
		UV+=(a[i]&0xFF)*(b[k-i]&0xFF);
	UV=UV+U;
	U=(UV&0xFFFFFF00)>>8;
	V=UV&0xFF;
	output[k+1]=(byte)V;
	
}
output[0]=(byte)U;

}

Notice that we canot have an overflow in UV :
a[i]&0xFF < 256
(b[k-i]&0xFF <256
UV < 32 x 256 x 256

when i declare UV as long the results are correct in CPU and 4 different GPU.

freemo · 2017-10-18T12:41:13Z

I'm not entirely sure this can be solved. Aparapi works by converting java byte code into opencl's C-like language and running that on the gpu. Primitive operations like basic arithmitic is governed by the gpu and for Effiency reasons it would be infeasable to try to override that functionality in any meaningful way. We could of course add some sort of a add, subtract, divide, and multiply method that garuntees consistency at the expense of speed. But I'm not sure we would want to address this directly on the primitive math operations for fear of loosing significant performance

…

On Oct 18, 2017 6:06 AM, "Nejeoui Abderrazzak" ***@***.***> wrote: I have noticed the same problem : Inconsistent results between GPU and CPU and between different GPUs this kernel calculate the product of two Big Integers 256 bits each using the product scanning algorithm. This Kernel gives different results on CPU (correct) / GPU (Intel Iris) (wrong) This Kernel gives different results on GPU (Nvidia Tesla m40)(correct) / GPU (Intel Iris)(wrong) public void multiplyProductScanning(final byte[] a,final byte[] b,final byte[] output) { int UV=0; int U=0; int V=0; for(int k=62;k>=0;k--) { UV=0; for(int i=MAX(0,k-31);i<=MIN(k,31);i++) UV+=(a[i]&0xFF)*(b[k-i]&0xFF); UV=UV+U; U=(UV&0xFFFFFF00)>>8; V=UV&0xFF; output[k+1]=(byte)V; } output[0]=(byte)U; } Notice that we canot have an overflow in UV : a[i]&0xFF < 256 (b[k-i]&0xFF <256 UV < 32 x 256 x 256 when i declare UV as long the results are correct in CPU and 4 different GPU. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#38 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AC5JApl7NJyauQ8uXJGnhZWKB6p-tAsvks5stc2xgaJpZM4LYPxt> .

freemo added the bug Fix something that is broken label Jan 2, 2017

freemo mentioned this issue Jan 2, 2017

Is there any issue about data types? #14

Closed

freemo added help wanted Have a question? Need assistance? bounty $$$ Cash reward! labels Apr 19, 2018

freemo changed the title ~~Inconsistent results between GPU and CPU when integers overflow.~~ [Bounty $50] Inconsistent results between GPU and CPU when integers overflow. Apr 19, 2018

freemo removed the help wanted Have a question? Need assistance? label Apr 19, 2018

[Bounty $50] Inconsistent results between GPU and CPU when integers overflow. #38

[Bounty $50] Inconsistent results between GPU and CPU when integers overflow. #38

Comments

freemo commented Dec 30, 2016 • edited

freemo commented Dec 30, 2016

savaskoc commented Dec 30, 2016

CC007 commented Jan 2, 2017

savaskoc commented Jan 2, 2017

CC007 commented Jan 2, 2017

CC007 commented Jan 2, 2017

freemo commented Jan 2, 2017

savaskoc commented Jan 2, 2017

freemo commented Jan 2, 2017

savaskoc commented Jan 2, 2017

freemo commented Jan 2, 2017 • edited

savaskoc commented Jan 2, 2017

freemo commented Jan 2, 2017

freemo commented Jan 2, 2017 • edited

CC007 commented Jan 2, 2017

grfrost commented Jan 2, 2017

savaskoc commented Jan 2, 2017

grfrost commented Jan 2, 2017 • edited

savaskoc commented Jan 2, 2017

CC007 commented Jan 2, 2017

savaskoc commented Jan 2, 2017

CC007 commented Jan 2, 2017

savaskoc commented Jan 2, 2017

CC007 commented Jan 2, 2017

CC007 commented Jan 2, 2017

CC007 commented Jan 2, 2017 • edited

CC007 commented Jan 2, 2017 • edited

grfrost commented Jan 2, 2017

grfrost commented Jan 3, 2017

freemo commented Jan 3, 2017 • edited

freemo commented Jan 3, 2017 • edited

freemo commented Jan 3, 2017

freemo commented Jan 3, 2017

savaskoc commented Jan 3, 2017

freemo commented Jan 3, 2017

savaskoc commented Jan 3, 2017

freemo commented Jan 3, 2017

savaskoc commented Jan 3, 2017

freemo commented Jan 3, 2017

savaskoc commented Jan 3, 2017

CC007 commented Jan 6, 2017

CC007 commented Jan 6, 2017 • edited

grfrost commented Jan 6, 2017 via email

CC007 commented Jan 6, 2017

freemo commented Jan 6, 2017 via email

grfrost commented Jan 6, 2017 via email

CC007 commented Jan 6, 2017 • edited

TPolzer commented Jan 10, 2017

CC007 commented Jan 10, 2017

TPolzer commented Jan 10, 2017

CC007 commented Jan 10, 2017

TPolzer commented Jan 10, 2017

nejeoui commented Oct 18, 2017

freemo commented Oct 18, 2017 via email

freemo commented Dec 30, 2016 •

edited

freemo commented Jan 2, 2017 •

edited

freemo commented Jan 2, 2017 •

edited

grfrost commented Jan 2, 2017 •

edited

CC007 commented Jan 2, 2017 •

edited

CC007 commented Jan 2, 2017 •

edited

freemo commented Jan 3, 2017 •

edited

freemo commented Jan 3, 2017 •

edited

CC007 commented Jan 6, 2017 •

edited

CC007 commented Jan 6, 2017 •

edited