Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OptiX PTX pipeline overhaul #1680

Merged

Conversation

tgrant-nv
Copy link
Contributor

Description

This PR is an overhaul of PTX compilation in OSL. The goal is to minimize the size of the generated PTX to help reduce time spent in subsequent stages (e.g., module creation, pipeline linking, etc.).

rend_lib

The main functional change is in how the functions defined in rend_lib.cu are handled. In the current setup, rend_lib.cu is compiled to a standalone PTX file using nvcc. The functions are considered extern in generated shaders, and the dependencies are satisfied when the final pipeline is linked together. Since the functions are extern, they are not available for optimization when each shader is generated. This prevents the inlining of small functions, which incurs a lot of function call overhead in terms of PTX size and run-time cost.

So instead of compiling rend_lib.cu separately, it is compiled to LLVM bitcode along with the rest of the "shadeops" sources. This bitcode is used to seed the LLVM Module for each shader. This makes the definitions available when each shader is being generated and optimized, which allows the optimizer to make better decisions. It also allows for better control of inlining decisions which affect the size and quality of the generated PTX.

shadeops

Another significant functional change is in how the shadeops functions (including those defined in rend_lib.cu) are treated. In the current pipeline, each shader must carry along with it definitions for all non-inlined shadeops functions that are used. This bloats the size of the PTX for each shader, and results in many duplicate copies of those functions existing in the final linked pipeline.

In the new pipeline, the unified shadeops+rend_lib bitcode is compiled to PTX using the offline llc tool. This PTX is used by the renderer to create a single "shadeops" module that provides the definitions for those functions. This makes it possible to drop the definitions for those functions from the PTX for each shader, which can significantly reduce the size of the generated PTX.

quirks

This PR contains some interesting "quirks" which are largely fallout from compiling rend_lib.cu using clang instead of nvcc:

  • The clang-generated PTX is processed by a Python script to transform some function & symbol declarations to better match what was generated by nvcc. It might be possible to add the right decorators to the C++/CUDA code to get the correct visibility for these symbols, but I was not successful in doing so.
  • It was necessary to change the function signatures for some of the rend_lib functions to use void* instead of the actual OSL types (e.g., OSL::Color3*). Trying to use the actual types produces an LLVM assertion failure, for reasons that I don't understand.
  • I have enabled flush-to-zero (FTZ) for clang-compiled code, to better match the behavior of nvcc. I added a CMake variable to enable or disable FTZ in the event that somebody wants to disable it.
  • I have added some alignment hints on pointer accesses in rend_lib.cu. clang does not appear to be able to deduce the alignment of pointers in many cases, which turns things like memcpy and memset into long series of byte operations. Functions like osl_get_matrix are bloated substantially if the hints aren't provided.

Tests

No new tests added, but no new test failures were added either. 😉

Checklist:

  • I have read the contribution guidelines.
  • I have previously submitted a Contributor License Agreement.
  • I have updated the documentation, if applicable.
  • I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
  • My code follows the prevailing code style of this project.

@@ -720,6 +718,12 @@ class ShadingSystemImpl {
return m_closure_registry.get_entry(id);
}

/// Attributes to control GPU optimization
bool gpu_no_inline() const { return m_gpu_no_inline; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is intent here for these attributes to apply to all GPU's?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question. We could specify cuda or optix rather than gpu. There might be things we'd want to do differently depending on whether OptiX or CUDA was the target. I'm sure there will be greater differences between hardware vendors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lgritz for ShadingSystem attributes, would you like to specify a vendor specific "extension" (well probably best as prefix) for attributes.
@tgrant-nv not sure which applies best in your case pxt_no_inline vs. cuda_no_inline, optix_no_inline, nvidia_no_inline.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's default to "optix_blah" for the things that are specific to the OptiX path. (I'll let Tim make the call about whether, when, or where it might be better to distinguish cuda vs optix. I'm not sure we currently have a working "pure Cuda" path at the moment, so it may be a moot point at present.)

Alex's point is well taken -- in addition to not assuming one "vendor" is equivalent to all GPU, we can easily imagine a future code path that is for traditional GPU rasterization (e.g. glsl or the like) that is quite different from the current optix path (even for Nvidia cards) and could lead to confusion among future developers. GPU is probably just too generic, since there are so many different and mutually exclusive GPU approaches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent points. I have updated the prefixes for those attributes.

@chellmuth
Copy link
Collaborator

This LGTM, thanks for addressing all the offline feedback we discussed!

@lgritz
Copy link
Collaborator

lgritz commented Aug 23, 2023

@chellmuth Do we still like this? Is it ready to merge?

I think there are some conflicts. @timgrant, can you rebase on top of the current main?

@tgrant-nv
Copy link
Contributor Author

I will try to rebase, but I was hesitant to do so with main in its current state since I won't be able to confirm the correctness of the merge.

@chellmuth
Copy link
Collaborator

@chellmuth Do we still like this? Is it ready to merge?

Yeah, we've tested/benchmarked this code and saw some nice speedups.

@tgrant-nv
Copy link
Contributor Author

Okay, I have rebased on main. It was surprisingly smooth.

These are the failures I'm seeing among the optix tests:

The following tests FAILED:
	179 - function-overloads.optix (Failed)
	471 - testoptix-reparam.optix.opt (Failed)
	515 - spline-boundarybug.optix (Failed)
	734 - testoptix.optix.opt (Failed)
	735 - testoptix-noise.optix.opt (Failed)

I believe this is the same set of failing tests from main.

@lgritz
Copy link
Collaborator

lgritz commented Aug 23, 2023

I think 4 of those 5 are addressed by my PR yesterday, which I just merged into main. Chasing after that last one next.

@tgrant-nv
Copy link
Contributor Author

I see the same set of failures in a current build of main with #1715. I'm using LLVM 14.0.3 and CUDA 11.8, FWIW.

@lgritz
Copy link
Collaborator

lgritz commented Aug 23, 2023

Sorry, my bad, I hadn't actually done the merge yet.
Coming shortly.

@tgrant-nv
Copy link
Contributor Author

Thanks, Larry. I just tested with the 535.98 driver, and I'm still seeing three failures after #1718:

	471 - testoptix-reparam.optix.opt (Failed)
	734 - testoptix.optix.opt (Failed)
	735 - testoptix-noise.optix.opt (Failed)

It's expected for testoptix.optix.opt to fail, since there are some printf sub-tests that require the string contents (which we should get rid of). But the image diff sub-tests are also failing, which is not expected.

When I run testoptix-reparam.optix.opt, there is a blue background but the three spheres are all black. For the other two tests the images are all black.

Are these tests passing for you?

I also noticed that although spline-boundarybug.optix is not reported as failing, there is a slight precision difference in the output:

-(0.600000,0.306263)
+(0.600000,0.306262)

@tgrant-nv
Copy link
Contributor Author

I found the source of the black shader output. There is a separate definition of ShaderGlobals in testshade/cuda/rend_lib.h, which does not include two new members added in #1702. Adding those fields clears up the newly failing tests. Larry, would you like to commit that as a separate fix? Or I can roll it into this PR.

diff --git a/src/testrender/cuda/rend_lib.h b/src/testrender/cuda/rend_lib.h
index e66f4e33..4ad59208 100644
--- a/src/testrender/cuda/rend_lib.h
+++ b/src/testrender/cuda/rend_lib.h
@@ -99,6 +99,8 @@ struct ShaderGlobals {
     float3 Ps, dPsdx, dPsdy;
     void* renderstate;
     void* shadingStateUniform;
+    int thread_index;
+    int shade_index;
     void* tracedata;
     void* objdata;
     void* context;

@lgritz
Copy link
Collaborator

lgritz commented Aug 25, 2023

Would you mind sending that off as a separate PR and I'll just merge it right away?

@tgrant-nv tgrant-nv force-pushed the optix-ptx-pipeline-redux branch 2 times, most recently from 3d34f96 to f5fa793 Compare August 28, 2023 17:22
Signed-off-by: Tim Grant <tgrant@nvidia.com>
 * Link the shadeops BC and rend lib BC and emit PTX for it; create an OptiX module from that PTX.
 * Pad the OptiX direct callable stack to account for calls to the shadeops functions.
 * Tweak the LLVM optimization passes.
 * Don't clone the Module before lowering to PTX.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
… of creating one each time it is needed.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
… match the built-in DECL.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
… and bitcode.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
…on group and entry functions before lowering to PTX.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
Signed-off-by: Tim Grant <tgrant@nvidia.com>
…gpu_' to 'optix_'. Also, add entries for the new attributes to `getstats`.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
…in shadeops are embedded as both LLVM bitcode and PTX in liboslexec; the BC is used to seed the module for each shader, and the PTX can be retrieved at run-time through the shading system.

WIP: continued.

WIP: continued.

WIP: continued.

WIP: continued.

WIP: continued.

WIP: continued.
Signed-off-by: Tim Grant <tgrant@nvidia.com>
…/CUDA are not in use.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
Signed-off-by: Tim Grant <tgrant@nvidia.com>
Signed-off-by: Tim Grant <tgrant@nvidia.com>
…ng in the shadeops and rendlib. * Drop the bodies of noinline functions before optimization.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
Signed-off-by: Tim Grant <tgrant@nvidia.com>
…/CUDA. It doesn't seem to affect codegen, but it slows down optimization considerably.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
… should not be inlined.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
…ome additional ShadingSystem attributes to control inlining.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
Signed-off-by: Tim Grant <tgrant@nvidia.com>
…ions to the ShadingSystem.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
…entical definition in llvm_ops.cpp instead.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
… older LLVM versions.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
Signed-off-by: Tim Grant <tgrant@nvidia.com>
Signed-off-by: Tim Grant <tgrant@nvidia.com>
…spected.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
…dule before linking in the Shadeops for OptiX/CUDA.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
…dd some documentation for the additional OptiX/CUDA inlining options.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
…r testshade.cpp.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
…sh type.

Signed-off-by: Tim Grant <tgrant@nvidia.com>
@lgritz lgritz merged commit f449749 into AcademySoftwareFoundation:main Sep 8, 2023
21 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants