Navigation Menu

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fmod producing the wrong results on the GPU #1384

Open
timgrant opened this issue Jun 24, 2021 · 8 comments
Open

fmod producing the wrong results on the GPU #1384

timgrant opened this issue Jun 24, 2021 · 8 comments

Comments

@timgrant
Copy link
Contributor

Problem

We have received reports of fmod giving the wrong answer on the GPU when the denominator is negative. I was able to confirm this behavior in testshade using a simple shader, given below.

Interestingly, fmod does produce the correct result when the argument is a local variable or a non-varying parameter, but not with varying parameters.

Steps to Reproduce

shader test (float a = -0.666667 [[ int lockgeom=0 ]],
             float b = -0.666667 [[ int lockgeom=1 ]],
             output color Cout = 0)
{
    float denom = 0.5;
    printf("fmod(%f, %f) = %f\n", a, denom, fmod(a, denom));
    printf("fmod(%f, %f) = %f\n", b, denom, fmod(b, denom));
}
$ testshade --optix -g 1 1 test
fmod(-0.666667, 0.500000) = 0.333333
fmod(-0.666667, 0.500000) = -0.166667

$ testshade -g 1 1 test
fmod(-0.666667, 0.500000) = -0.166667
fmod(-0.666667, 0.500000) = -0.166667

Versions

  • OSL branch/version: master (b0ed183)
  • OS: Ubuntu 20.04
  • C++ compiler: clang 12.0.0
  • LLVM version: LLVM 12.0.0
  • OIIO version: master (a07efab5)
@lgritz
Copy link
Collaborator

lgritz commented Jun 24, 2021

fmod(-0.666667, 0.500000) = 0.333333
fmod(-0.666667, 0.500000) = -0.166667

Whoa, it's different with the same input values and only differing in lockgeom?

@timgrant
Copy link
Contributor Author

It looks like the values are getting constant folded in the non-varying case, based on the PTX. Chaging the optimization level for testshade to O1 produces the correct results, but it also changes the remainder calculation significantly. I'm not quite sure what to make of that. Peeking at the LLVM IR might be illuminating.

@lgritz
Copy link
Collaborator

lgritz commented Jun 25, 2021

It means we're doing the calculation differently for the constant-folded case versus the runtime case.

@timgrant
Copy link
Contributor Author

The LLVM IR is pretty straightforward, especially for O2 :

  ;; O2
  %23 = load float, float* %20, align 4
  %24 = frem float %23, 5.000000e-01

  ;; O1
  %27 = load float, float* %20, align 4
  %28 = fmul fast float %27, 2.000000e+00; -0.666667 * 2.0    = -1.333334
  %29 = fptosi float %28 to i32;           sint(-1.333334)    = -1
  %30 = sitofp i32 %29 to float;           float(-1)          = -1.0
  %31 = fmul fast float %30, 5.000000e-01; -1.0 * 0.5         = -0.5
  %32 = fsub fast float %27, %31;          -0.666667 - (-0.5) = -0.16667

If I'm reading the PTX correctly, this is the remainder calculation from the O2 case:

  ;; %f8 holds the value of float a (-0.666667)
  div.rn.f32 %f4, %f8, 0f3F000000; -0.666667 / 0.5    = -1.333334
  cvt.rmi.f32.f32 %f5, %f4;        floor( -1.333334 ) = -2.0
  mul.f32 %f6, %f5, 0f3F000000;    2.0 * 0.5          = -1.0;
  sub.f32 %f7, %f8, %f6;           -0.666667 - (-1.0) = 0.333333

The key difference is in the rounding operation. cvt.rmi rounds to the nearest integer, towards negative infinity (from the PTX ISA).

I guess this is the purview of the NVPTX backend. It's possible that using different options in LLVM_Util::ptx_compile_group() can produce the desired behavior, but I'm not sure at this point.

FWIW, I tested fmod in a simple CUDA program and it did the right thing.

@timgrant
Copy link
Contributor Author

It should come as no surprise, but the PTX instructions follow the pattern defined in the NVPTXInstrInfo.td. Given that, I don't think we can leave it up to NVPTX to handle frem lowering. I suppose we want something more like the O1 case?

@fpsunflower
Copy link
Contributor

Is it possible to submit a fix to the NVPTX backend in LLVM? Am I understanding correctly that LLVM has its own IR level simplification of fmod that is competing with the one from the NVPTX backend? In that case - is there a way to force LLVM to always do what it does in the O1 case?

@timgrant
Copy link
Contributor Author

Chris, I think your understanding is correct. It would be great to patch this in NVPTX, but I couldn't begin to put a time frame on that.

In the O1 case, we get the right behavior on the GPU because it forces use of safe_fmod. In terms of edit distance, this might be the minimal patch to achieve that result. But it's worth finding out if there is an NVPTX option that affects this if it can help us avoid a special case.

diff --git a/src/liboslexec/llvm_gen.cpp b/src/liboslexec/llvm_gen.cpp
index a93faeba..95d414f4 100644
--- a/src/liboslexec/llvm_gen.cpp
+++ b/src/liboslexec/llvm_gen.cpp
@@ -789,7 +789,7 @@ LLVMGEN (llvm_gen_modulus)
         if (!a || !b)
             return false;
         llvm::Value *r;
-        if (B.is_constant() && ! rop.is_zero(B))
+        if (! rop.use_optix() && B.is_constant() && ! rop.is_zero(B))
             r = rop.ll.op_mod (a, b);
         else
             r = rop.ll.call_function (safe_mod, a, b);

@lgritz
Copy link
Collaborator

lgritz commented Jun 25, 2021

Even if we get a fix in NVPTX in the very next major version (13), it will probably be years before that's the minimum required LLVM version for OSL. So we need a workaround no matter what.

I'm fine with this suggested workaround, unless there is already some NVPTX option that does what we need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants