fmod producing the wrong results on the GPU #1384

timgrant · 2021-06-24T23:47:21Z

Problem

We have received reports of fmod giving the wrong answer on the GPU when the denominator is negative. I was able to confirm this behavior in testshade using a simple shader, given below.

Interestingly, fmod does produce the correct result when the argument is a local variable or a non-varying parameter, but not with varying parameters.

Steps to Reproduce

shader test (float a = -0.666667 [[ int lockgeom=0 ]],
             float b = -0.666667 [[ int lockgeom=1 ]],
             output color Cout = 0)
{
    float denom = 0.5;
    printf("fmod(%f, %f) = %f\n", a, denom, fmod(a, denom));
    printf("fmod(%f, %f) = %f\n", b, denom, fmod(b, denom));
}

$ testshade --optix -g 1 1 test
fmod(-0.666667, 0.500000) = 0.333333
fmod(-0.666667, 0.500000) = -0.166667

$ testshade -g 1 1 test
fmod(-0.666667, 0.500000) = -0.166667
fmod(-0.666667, 0.500000) = -0.166667

Versions

OSL branch/version: master (b0ed183)
OS: Ubuntu 20.04
C++ compiler: clang 12.0.0
LLVM version: LLVM 12.0.0
OIIO version: master (a07efab5)

The text was updated successfully, but these errors were encountered:

lgritz · 2021-06-24T23:56:15Z

fmod(-0.666667, 0.500000) = 0.333333
fmod(-0.666667, 0.500000) = -0.166667

Whoa, it's different with the same input values and only differing in lockgeom?

timgrant · 2021-06-25T00:30:19Z

It looks like the values are getting constant folded in the non-varying case, based on the PTX. Chaging the optimization level for testshade to O1 produces the correct results, but it also changes the remainder calculation significantly. I'm not quite sure what to make of that. Peeking at the LLVM IR might be illuminating.

lgritz · 2021-06-25T00:39:14Z

It means we're doing the calculation differently for the constant-folded case versus the runtime case.

timgrant · 2021-06-25T01:55:11Z

The LLVM IR is pretty straightforward, especially for O2 :

  ;; O2
  %23 = load float, float* %20, align 4
  %24 = frem float %23, 5.000000e-01

  ;; O1
  %27 = load float, float* %20, align 4
  %28 = fmul fast float %27, 2.000000e+00; -0.666667 * 2.0    = -1.333334
  %29 = fptosi float %28 to i32;           sint(-1.333334)    = -1
  %30 = sitofp i32 %29 to float;           float(-1)          = -1.0
  %31 = fmul fast float %30, 5.000000e-01; -1.0 * 0.5         = -0.5
  %32 = fsub fast float %27, %31;          -0.666667 - (-0.5) = -0.16667

If I'm reading the PTX correctly, this is the remainder calculation from the O2 case:

  ;; %f8 holds the value of float a (-0.666667)
  div.rn.f32 %f4, %f8, 0f3F000000; -0.666667 / 0.5    = -1.333334
  cvt.rmi.f32.f32 %f5, %f4;        floor( -1.333334 ) = -2.0
  mul.f32 %f6, %f5, 0f3F000000;    2.0 * 0.5          = -1.0;
  sub.f32 %f7, %f8, %f6;           -0.666667 - (-1.0) = 0.333333

The key difference is in the rounding operation. cvt.rmi rounds to the nearest integer, towards negative infinity (from the PTX ISA).

I guess this is the purview of the NVPTX backend. It's possible that using different options in LLVM_Util::ptx_compile_group() can produce the desired behavior, but I'm not sure at this point.

FWIW, I tested fmod in a simple CUDA program and it did the right thing.

timgrant · 2021-06-25T15:43:44Z

It should come as no surprise, but the PTX instructions follow the pattern defined in the NVPTXInstrInfo.td. Given that, I don't think we can leave it up to NVPTX to handle frem lowering. I suppose we want something more like the O1 case?

fpsunflower · 2021-06-25T17:13:27Z

Is it possible to submit a fix to the NVPTX backend in LLVM? Am I understanding correctly that LLVM has its own IR level simplification of fmod that is competing with the one from the NVPTX backend? In that case - is there a way to force LLVM to always do what it does in the O1 case?

timgrant · 2021-06-25T22:53:12Z

Chris, I think your understanding is correct. It would be great to patch this in NVPTX, but I couldn't begin to put a time frame on that.

In the O1 case, we get the right behavior on the GPU because it forces use of safe_fmod. In terms of edit distance, this might be the minimal patch to achieve that result. But it's worth finding out if there is an NVPTX option that affects this if it can help us avoid a special case.

diff --git a/src/liboslexec/llvm_gen.cpp b/src/liboslexec/llvm_gen.cpp
index a93faeba..95d414f4 100644
--- a/src/liboslexec/llvm_gen.cpp
+++ b/src/liboslexec/llvm_gen.cpp
@@ -789,7 +789,7 @@ LLVMGEN (llvm_gen_modulus)
         if (!a || !b)
             return false;
         llvm::Value *r;
-        if (B.is_constant() && ! rop.is_zero(B))
+        if (! rop.use_optix() && B.is_constant() && ! rop.is_zero(B))
             r = rop.ll.op_mod (a, b);
         else
             r = rop.ll.call_function (safe_mod, a, b);

lgritz · 2021-06-25T23:00:55Z

Even if we get a fix in NVPTX in the very next major version (13), it will probably be years before that's the minimum required LLVM version for OSL. So we need a workaround no matter what.

I'm fine with this suggested workaround, unless there is already some NVPTX option that does what we need.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fmod producing the wrong results on the GPU #1384

fmod producing the wrong results on the GPU #1384

timgrant commented Jun 24, 2021

lgritz commented Jun 24, 2021

timgrant commented Jun 25, 2021

lgritz commented Jun 25, 2021

timgrant commented Jun 25, 2021

timgrant commented Jun 25, 2021

fpsunflower commented Jun 25, 2021

timgrant commented Jun 25, 2021

lgritz commented Jun 25, 2021

Navigation Menu

fmod producing the wrong results on the GPU #1384

fmod producing the wrong results on the GPU #1384

Comments

timgrant commented Jun 24, 2021

Problem

Steps to Reproduce

Versions

lgritz commented Jun 24, 2021

timgrant commented Jun 25, 2021

lgritz commented Jun 25, 2021

timgrant commented Jun 25, 2021

timgrant commented Jun 25, 2021

fpsunflower commented Jun 25, 2021

timgrant commented Jun 25, 2021

lgritz commented Jun 25, 2021