Optimizing pow with a literal exponent #1244

Closed
TheRealMJP opened this Issue Apr 24, 2018 · 5 comments

TheRealMJP commented Apr 24, 2018

Hey guys! The old FXC compiler would always optimize certain cases of pow with a literal exponent. The classic case is something like this, with an exponent of 2.0:

float PSMain(in float x : X) : SV_Target0
{
    return pow(x, 2.0f);
}

FXC produces the following DXBC output for this:

ps_5_0
dcl_globalFlags refactoringAllowed
dcl_input_ps linear v0.x
dcl_output o0.x
mul o0.x, v0.x, v0.x
ret

As you can see, it removed the pow and instead multiplied x with itself, which I would assume is always going to be cheaper than the log2/mul/exp2 sequence that you get with a non-constant exponent. It looks like dxc doesn't do this transformation:

define void @PSMain() {
entry:
  %0 = call float @dx.op.loadInput.f32(i32 4, i32 0, i32 0, i8 0, i32 undef)  ; LoadInput(inputSigId,rowIndex,colIndex,gsVertexAxis)
  %Log = call float @dx.op.unary.f32(i32 23, float %0)  ; Log(value)
  %1 = fmul fast float %Log, 2.000000e+00
  %Exp = call float @dx.op.unary.f32(i32 21, float %1)  ; Exp(value)
  call void @dx.op.storeOutput.f32(i32 5, i32 0, i32 0, i8 0, float %Exp)  ; StoreOutput(outputSigId,rowIndex,colIndex,value)
  ret void
}

Do you guys know if this is intentional? If not, I think it would be nice to include that optimization for cases where people use pow(x, 1), pow(x, 2) or pow(x, 4) for convenience.
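To make the difference between the two lowerings concrete, here is a minimal CPU-side sketch in C++ (not HLSL, and not taken from either compiler). The function names `pow_via_log_exp`, `pow2_via_mul` and `check_pow_models` are made up for illustration, and it assumes the host's `std::log2`/`std::exp2` behave like the shader Log/Exp ops for these inputs, which is only approximately true on real hardware.

```cpp
#include <cassert>
#include <cmath>

// CPU model of the log2/mul/exp2 sequence in the DXIL above (assumption:
// host math functions stand in for the shader Log/Exp ops).
float pow_via_log_exp(float x, float y) {
    return std::exp2(y * std::log2(x));
}

// CPU model of the single multiply that FXC emits for pow(x, 2.0f).
float pow2_via_mul(float x) {
    return x * x;
}

// The two forms agree for positive x but diverge for negative x:
// log2 of a negative number is NaN, while x * x is simply positive.
void check_pow_models() {
    assert(std::fabs(pow_via_log_exp(3.0f, 2.0f) - pow2_via_mul(3.0f)) < 1e-4f);
    assert(std::isnan(pow_via_log_exp(-3.0f, 2.0f)));
    assert(pow2_via_mul(-3.0f) == 9.0f);
}
```

So the multiply form is not just cheaper. It can also return a different value for negative inputs, which matters when deciding whether the transformation is always valid.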

Member

hekota commented Apr 25, 2018

Thank you for the suggestion! It would certainly be nice to optimize these cases. I'll see what we can do!

@hekota hekota added the performance label May 15, 2018

gwihlidal commented Sep 11, 2018

This would be a welcome optimization; it is very common to see pow(x, 2), pow(x, 3), etc. in a variety of shaders, and this is a perf and numerical regression compared to fxc.

Getting power expansion up to 16 would likely cover what fxc was doing.
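For a rough sense of what "expansion up to 16" costs, the sketch below counts the multiplies used by plain square-and-multiply on the CPU. This is an assumption about one possible expansion strategy, not a description of what fxc actually emits; `PowResult` and `pow_by_squaring` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: the computed power plus the multiply count.
struct PowResult {
    float value;
    int multiplies;
};

// x^n for n >= 1 via left-to-right square-and-multiply.
PowResult pow_by_squaring(float x, std::uint32_t n) {
    assert(n >= 1);
    int top = 31;
    while (((n >> top) & 1u) == 0u) --top;   // index of the highest set bit
    PowResult r{x, 0};                        // the top bit contributes x^1 for free
    for (int bit = top - 1; bit >= 0; --bit) {
        r.value *= r.value;                   // square for every lower bit
        ++r.multiplies;
        if ((n >> bit) & 1u) {                // extra multiply when the bit is set
            r.value *= x;
            ++r.multiplies;
        }
    }
    return r;
}
```

With this scheme an exponent of 2 takes 1 multiply, 4 takes 2, 16 takes 4, and 15 takes 6. Square-and-multiply is not always the shortest chain (15 can be done in 5), so a compiler or driver may do better than this count.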

Member

hekota commented Sep 11, 2018

We are considering adding pow and a few other routines to core intrinsics, meaning they would be provided to the driver for custom optimization and enabling the use of hardware implementation when available.

Contributor

tex3d commented Nov 1, 2018

For backwards compatibility, the referenced changes will do what fxc did, as long as you pass -HV 2016 (HLSL Version 2016). This should cover existing code you don't want to edit. The issue is that this expansion isn't always correct according to the spec (and IEEE safe mode doesn't correct it on fxc either).

Currently, when we lower (and expand to muls or log/mul/exp), we don't yet know whether things are marked precise, so we can't decide whether mul expansion is ok. Later, when we have precise marking, we would have to match the log-mul-exp pattern without precise and replace it with a mul expansion. Since doing this is extra work and the ideal optimization will be dependent on the target device, it's probably best left to the driver to decide whether to do this optimization.

Additionally, in future shader models, we plan on having a native pow intrinsic in DXIL, so this can be more easily matched for optimization. But the ideal expansion should still be performed by the driver compiler.

For new/modified HLSL, you can use your own manual multiply expansion if that's really what you want. Then you don't have to worry about the spec issues. Here's a function to do the expansion, with overloads for vector sizes, that should optimize to the code you want when using a literal uint up to 15.

// Should be called with literal uint power < 16.
float pow_mul(float value, uint power) {
  if (power >= 16) {
    // Fallback for larger powers: take pow of |value|, then put the sign
    // bit back for odd powers, because pow of a negative base is NaN.
    uint neg_bit = asuint(value) & (1 << 31);
    value = asfloat(asuint(value) & ~neg_bit);
    return asfloat(asuint(pow(value, power)) |
      ((power & 1) ? neg_bit : 0));
  }
  // Square-and-multiply over the low 4 bits of power. With a literal
  // power, each bit test below folds to a constant at compile time.
  uint bit = 0;
  float result = (power & (1 << bit++)) ? value : 1.0F;
  float value2 = value * value; // value ^ 2
  if (power & (1 << (bit++))) result *= value2;
  value2 = value2 * value2; // value ^ 4
  if (power & (1 << (bit++))) result *= value2;
  value2 = value2 * value2; // value ^ 8
  if (power & (1 << (bit++))) result *= value2;
  return result;
}
float2 pow_mul(float2 value, uint power) {
  return float2(
    pow_mul(value.x, power),
    pow_mul(value.y, power)
  );
}
float3 pow_mul(float3 value, uint power) {
  return float3(
    pow_mul(value.x, power),
    pow_mul(value.y, power),
    pow_mul(value.z, power)
  );
}
float4 pow_mul(float4 value, uint power) {
  return float4(
    pow_mul(value.x, power),
    pow_mul(value.y, power),
    pow_mul(value.z, power),
    pow_mul(value.w, power)
  );
}
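If you want to sanity-check the scalar `pow_mul` logic without a GPU, a CPU-side C++ port along these lines should behave the same way for ordinary finite inputs. This is a sketch, not part of the referenced change: `as_uint` and `as_float` are hypothetical stand-ins for the HLSL `asuint`/`asfloat` intrinsics, and `std::pow` stands in for the HLSL `pow` in the fallback path.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins for HLSL asuint/asfloat (bitwise reinterpretation).
std::uint32_t as_uint(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
float as_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// CPU port of the scalar pow_mul above.
float pow_mul(float value, std::uint32_t power) {
    if (power >= 16) {
        // Fallback: pow of |value|, with the sign restored for odd powers.
        std::uint32_t neg_bit = as_uint(value) & (1u << 31);
        value = as_float(as_uint(value) & ~neg_bit);
        std::uint32_t bits = as_uint(std::pow(value, static_cast<float>(power)));
        return as_float(bits | ((power & 1u) ? neg_bit : 0u));
    }
    float result = (power & 1u) ? value : 1.0f;  // bit 0: value ^ 1
    float value2 = value * value;                // value ^ 2
    if (power & 2u) result *= value2;
    value2 *= value2;                            // value ^ 4
    if (power & 4u) result *= value2;
    value2 *= value2;                            // value ^ 8
    if (power & 8u) result *= value2;
    return result;
}
```

For example, `pow_mul(-2.0f, 3)` gives `-8.0f` and `pow_mul(-3.0f, 2)` gives `9.0f`, where the log/mul/exp form would produce NaN for a negative base.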

@vcsharma vcsharma self-assigned this Nov 2, 2018

Contributor

vcsharma commented Nov 4, 2018

Thanks @tex3d for the additional explanation!

Given that we have added support to match fxc's behavior for pow with a literal exponent in #1564, which is available with the -HV 2016 flag, I am closing this issue.

@vcsharma vcsharma closed this Nov 4, 2018
