Optimize Compilation Times #1805

Draft
wants to merge 6 commits into base: release
Conversation

@christophercrouzet
Contributor

I thought I'd just leave this experiment here in case it can be of any help.

That being said, I understand that the code diff can look a bit overwhelming and scary, so no worries if it doesn't land! 😅

What?

After seeing USD's codebase take around 90 minutes to compile on my laptop (Intel i7-7700HQ, Ubuntu 20.04, single thread), I was curious to understand what was going on and whether the compilation times could be somewhat improved. This pull request is a first iteration of that work.

As it stands, this pull request is still lacking (see the to-do list below) but I'm happy to collaborate if there is any interest in merging it at all.

Why?

With USD being so widespread in the film and tech industries, it seemed like optimizing compilation times could help many developers around the world speed up iterations and save computing resources.

How?

When compiling USD on a single thread using Clang's -ftime-trace compiler flag, here is what the profiling output looks like (after being slightly pruned; see the full log):

**** Time summary:
Compilation (2953 times):
  Parsing (frontend):         3805.9 s
  Codegen & opts (backend):   1056.0 s

**** Files that took longest to parse (compiler frontend):
  9006 ms: build/usd/pxr/usd/usd/CMakeFiles/usd.dir/crateFile.cpp.o
  8529 ms: build/usd/pxr/imaging/hioOpenVDB/CMakeFiles/hioOpenVDB.dir/vdbTextureData.cpp.o
  8399 ms: build/usd/pxr/usd/sdf/CMakeFiles/_sdf.dir/wrapTypes.cpp.o
  7068 ms: build/usd/pxr/usd/usd/CMakeFiles/usd.dir/stage.cpp.o
  6341 ms: build/usd/pxr/base/vt/CMakeFiles/_vt.dir/wrapArrayVec.cpp.o

**** Files that took longest to codegen (compiler backend):
 53870 ms: build/usd/pxr/imaging/hioOpenVDB/CMakeFiles/hioOpenVDB.dir/vdbTextureData.cpp.o
 47413 ms: build/usd/pxr/usd/usd/CMakeFiles/usd.dir/stage.cpp.o
 26069 ms: build/usd/pxr/usd/usd/CMakeFiles/usd.dir/crateFile.cpp.o
 25842 ms: build/usd/pxr/usd/sdf/CMakeFiles/_sdf.dir/wrapTypes.cpp.o
 22792 ms: build/usd/pxr/base/vt/CMakeFiles/_vt.dir/wrapArrayIntegral.cpp.o

**** Templates that took longest to instantiate:
  8600 ms: pxrInternal_v0_22__pxrReserved__::VtValue::_Init<std::basic_string<c... (892 times, avg 9 ms)
  8540 ms: std::unique_ptr<boost::python::objects::py_function_impl_base> (1140 times, avg 7 ms)
  7713 ms: std::unique_ptr<pxrInternal_v0_22__pxrReserved__::TfType::FactoryBase> (1070 times, avg 7 ms)
  7610 ms: pxrInternal_v0_22__pxrReserved__::VtValue::GetTypeInfo<std::basic_st... (892 times, avg 8 ms)
  7327 ms: std::unordered_set<pxrInternal_v0_22__pxrReserved__::TraceDynamicKey... (669 times, avg 10 ms)

**** Template sets that took longest to instantiate:
 95822 ms: std::unique_ptr<$> (12958 times, avg 7 ms)
 76665 ms: std::__uniq_ptr_data<$> (12958 times, avg 5 ms)
 75712 ms: std::__uniq_ptr_impl<$> (12958 times, avg 5 ms)
 44574 ms: std::_Hashtable<$> (8784 times, avg 5 ms)
 43430 ms: std::unordered_map<$> (7579 times, avg 5 ms)

**** Functions that took longest to compile:
  1962 ms: pxrInternal_v0_22__pxrReserved__::internal::GLApi::loadSymbols() (deps/USD/pxr/imaging/garch/glApi.cpp)
   773 ms: textFileFormatYyparse(pxrInternal_v0_22__pxrReserved__::Sdf_TextPars... (deps/USD/pxr/usd/sdf/textFileFormat.tab.cpp)
   448 ms: pxrInternal_v0_22__pxrReserved__::HdSt_ResourceBinder::ResolveBindin... (deps/USD/pxr/imaging/hdSt/resourceBinder.cpp)
   424 ms: pxrInternal_v0_22__pxrReserved__::PcpChanges::DidChange(pxrInternal_... (deps/USD/pxr/usd/pcp/changes.cpp)
   385 ms: stbi__load_and_postprocess_8bit(stbi__context*, int*, int*, int*, int) (deps/USD/pxr/imaging/hio/stbImage.cpp)

**** Function sets that took longest to compile / optimize:
  6783 ms: boost::python::detail::signature_arity<$>::impl<$>::elements() (4070 times, avg 1 ms)
  4244 ms: bool pxrInternal_v0_22__pxrReserved__::UsdStage::_GetGeneralMetadata... (64 times, avg 66 ms)
  4006 ms: boost::python::objects::caller_py_function_impl<$>::signature() const (3047 times, avg 1 ms)
  3525 ms: boost::python::converter::shared_ptr_from_python<$>::construct(_obje... (1452 times, avg 2 ms)
  2588 ms: void std::vector<$>::_M_range_insert<$>(__gnu_cxx::__normal_iterator... (146 times, avg 17 ms)

*** Expensive headers:
972063 ms: /opt/boost/1.70.0/include/boost/preprocessor/iteration/detail/iter/forward1.hpp (included 13959 times, avg 69 ms), included via:
  wrapTimestamp.cpp.o pyUtils.h pyInterpreter.h object.hpp object_core.hpp call.hpp arg_to_python.hpp function_handle.hpp caller.hpp  (320 ms)
  ...

904090 ms: build/usd/include/pxr/usd/usd/prim.h (included 369 times, avg 2450 ms), included via:
  modelAPI.cpp.o modelAPI.h apiSchemaBase.h schemaBase.h  (3114 ms)
  ...

887681 ms: build/usd/include/pxr/base/tf/pyObjWrapper.h (included 993 times, avg 893 ms), included via:
  aov.cpp.o aov.h types.h value.h  (1567 ms)
  ...

838714 ms: build/usd/include/pxr/usd/usd/object.h (included 374 times, avg 2242 ms), included via:
  wrapUtils.cpp.o wrapUtils.h  (3054 ms)
  ...

835586 ms: build/usd/include/pxr/base/vt/value.h (included 894 times, avg 934 ms), included via:
  aov.cpp.o aov.h types.h  (1702 ms)
  ...

We can see that the number one bottleneck clearly comes from the compiler spending around 15 minutes (900,000 ms) in total on each of the most expensive headers, each of which gets included in hundreds (if not thousands) of other files.

One solution that is a good fit for tackling this issue is unity builds.
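
In a nutshell—and as a hypothetical sketch where the file list is illustrative rather than this PR's actual generated output—a unity build compiles a single generated source that #includes a batch of translation units, so that the expensive headers shared by those files get parsed only once per batch:

// library_unit.cpp: a generated unity source (hypothetical file list).
// Headers shared by the files below, e.g. pxr/usd/usd/prim.h, are now
// parsed once per batch instead of once per translation unit.
#include "stage.cpp"
#include "crateFile.cpp"
#include "object.cpp"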

Why Not Use Precompiled Headers?

USD seems to already support PCH for MSVC, so it could have been natural to try extending that approach to other compilers, but then I looked at the list of conditions that need to be met for GCC and it seemed... too constraining? So I didn't really look into this.

Furthermore, I prefer a solution that is not compiler-specific, so I tried to address the issue with what seemed like a simple approach to reason about and to implement (although simple ain't always easy).

Methodology

The obvious requirement for a codebase to support unity builds is to ensure that all symbols are uniquely identified.
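
For instance, two translation units can each define a file-local helper with the same name—fine when compiled separately, but a redefinition error once both files are #included into the same unity source (the names below are illustrative):

// a.cpp
namespace {
bool _IsValid() { return true; }
}

// b.cpp
namespace {
bool _IsValid() { return false; } // redefinition error once a.cpp and
}                                 // b.cpp share a single unity source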

Additionally, it helps to have symbols fully namespaced whenever possible (e.g.: when not relying on an ADL idiom); otherwise, symbols relying on using directives can become ambiguous, for example:

using namespace boost;
using namespace std;

tuple<int> foo; // is it “boost::tuple<int>” or “std::tuple<int>”?
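
Fully qualifying the symbol removes the ambiguity, which matters once unrelated translation units get merged into a single unity source:

boost::tuple<int> foo; // unambiguous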

I could have performed all the required refactoring manually, first by updating the project's CMake files to use a unity build approach, and then by fixing all the compilation errors one by one. But that would have been:

  • fairly tedious.
  • possibly more difficult to rebase, if needed.
  • a missed opportunity to promote consistency in the codebase.
  • less reliable, since a human manually making so many edits could easily introduce errors.
  • not a really interesting/fun challenge to tackle that'd give me a chance to learn a new API 🤓

Instead, I went (mostly) with a programmatic approach through Clang's AST API.
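
To give a rough idea of the approach, here is a minimal LibTooling sketch—assuming Clang 12+, and not taken from the actual tools described below—that lists every anonymous namespace in the translation units it is given:

#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"
#include "clang/Tooling/CommonOptionsParser.h"
#include "clang/Tooling/Tooling.h"
#include "llvm/Support/CommandLine.h"

using namespace clang;
using namespace clang::ast_matchers;
using namespace clang::tooling;

namespace {

// Print the location of each anonymous namespace that the matcher finds.
class AnonNamespacePrinter : public MatchFinder::MatchCallback {
public:
    void run(const MatchFinder::MatchResult &result) override {
        const auto *ns = result.Nodes.getNodeAs<NamespaceDecl>("anonNs");
        if (ns != nullptr) {
            ns->getBeginLoc().print(llvm::outs(), *result.SourceManager);
            llvm::outs() << ": anonymous namespace\n";
        }
    }
};

} // namespace

static llvm::cl::OptionCategory toolCategory("find-anon-namespaces options");

int main(int argc, const char **argv) {
    auto parser = CommonOptionsParser::create(argc, argv, toolCategory);
    if (!parser) {
        llvm::errs() << parser.takeError();
        return 1;
    }
    ClangTool tool(parser->getCompilations(), parser->getSourcePathList());

    AnonNamespacePrinter printer;
    MatchFinder finder;
    finder.addMatcher(namespaceDecl(isAnonymous()).bind("anonNs"), &printer);
    return tool.run(newFrontendActionFactory(&finder).get());
}

The actual tools go further by rewriting the matched declarations, but the matching part looks essentially like this.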

Caveats

“Surely it should be easy to find all the symbols in a codebase and prefix them with a namespace using the AST”, or so I thought. Alas, C++ is a complex language and it turns out that this complexity is fairly well reflected in Clang's AST API.

Because of that, the refactoring tools that I built do a good chunk of the work, but they're not 100% there—after running them, some things still need to be patched/fixed manually here and there.

The most obvious (and unfortunate) limitation: the refactoring tools work at the AST level, after the C++ preprocessor has run. This means that code wrapped in #if/#ifdef directives might be discarded and left untouched by the tools—since I've run these tools under Ubuntu, the code paths specific to Windows, macOS, Metal, PRMan, Python 2, and others might require some further attention.
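
A contrived illustration of that limitation (all identifiers are hypothetical): with the tools running under Ubuntu, only the non-Windows branch exists in the AST, so only it gets refactored:

#if defined(_WIN32)
namespace {                          // invisible to the AST on Linux,
void _FlushOsBuffers() { /* ... */ } // so left untouched by the tools
}
#else
namespace pxrBaseArchFileSystem {    // named by the tools (hypothetical name)
void _FlushOsBuffers() { /* ... */ }
}
#endif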

Results

[timings chart: build times with and without unity builds, per compiler and thread count]

Note: the timings for g++ 11.1.0 (unity=ON) with 8 threads are only an estimate since my laptop runs out of memory for that configuration.

As for the results from -ftime-trace, they look a bit more nuanced (full log):

**** Time summary:
Compilation (204 times):
  Parsing (frontend):          519.0 s
  Codegen & opts (backend):    841.2 s

**** Files that took longest to parse (compiler frontend):
 16071 ms: build/usd/pxr/usd/usd/CMakeFiles/usd.dir/library_unit.cpp.o
 14580 ms: build/usd/pxr/base/vt/CMakeFiles/_vt.dir/python_module_unit.cpp.o
 13973 ms: build/usd/pxr/usd/sdf/CMakeFiles/_sdf.dir/python_module_unit.cpp.o
 13333 ms: build/usd/pxr/imaging/hdSt/CMakeFiles/hdSt.dir/library_unit.cpp.o
 11849 ms: build/usd/pxr/imaging/hd/CMakeFiles/hd.dir/library_unit.cpp.o

**** Files that took longest to codegen (compiler backend):
 92835 ms: build/usd/pxr/usd/usd/CMakeFiles/usd.dir/library_unit.cpp.o
 75512 ms: build/usd/pxr/base/vt/CMakeFiles/_vt.dir/python_module_unit.cpp.o
 55157 ms: build/usd/pxr/imaging/hioOpenVDB/CMakeFiles/hioOpenVDB.dir/library_unit.cpp.o
 50730 ms: build/usd/pxr/usd/sdf/CMakeFiles/_sdf.dir/python_module_unit.cpp.o
 49143 ms: build/usd/pxr/usd/sdf/CMakeFiles/sdf.dir/library_unit.cpp.o

**** Templates that took longest to instantiate:
  2499 ms: pxrInternal_v0_22__pxrReserved__::pxrImagingHioOpenVDBVdbTextureData... (1 times, avg 2499 ms)
  2493 ms: pxrInternal_v0_22__pxrReserved__::pxrImagingHioOpenVDBVdbTextureData... (1 times, avg 2493 ms)
  2484 ms: pxrInternal_v0_22__pxrReserved__::pxrImagingHioOpenVDBVdbTextureData... (1 times, avg 2484 ms)
  2133 ms: openvdb::v8_1::tools::resampleToMatch<openvdb::v8_1::tools::BoxSampl... (1 times, avg 2133 ms)
  2133 ms: openvdb::v8_1::tools::resampleToMatch<openvdb::v8_1::tools::BoxSampl... (1 times, avg 2133 ms)

**** Template sets that took longest to instantiate:
 26101 ms: boost::python::detail::make_function_aux<$> (5959 times, avg 4 ms)
 26066 ms: boost::python::class_<$>::def<$> (5125 times, avg 5 ms)
 25910 ms: boost::python::make_function<$> (5623 times, avg 4 ms)
 20621 ms: boost::python::class_<$>::def_impl<$> (4223 times, avg 4 ms)
 19269 ms: boost::python::objects::py_function::py_function<$> (6090 times, avg 3 ms)

**** Functions that took longest to compile:
  1987 ms: pxrInternal_v0_22__pxrReserved__::internal::GLApi::loadSymbols() (build/usd/pxr/imaging/garch/library_unit.cpp)
   771 ms: textFileFormatYyparse(pxrInternal_v0_22__pxrReserved__::Sdf_TextPars... (deps/USD/pxr/usd/sdf/textFileFormat.tab.cpp)
   517 ms: pxrInternal_v0_22__pxrReserved__::HdSt_ResourceBinder::ResolveBindin... (build/usd/pxr/imaging/hdSt/library_unit.cpp)
   466 ms: pxrInternal_v0_22__pxrReserved__::PcpChanges::DidChange(pxrInternal_... (build/usd/pxr/usd/pcp/library_unit.cpp)
   442 ms: pxrInternal_v0_22__pxrReserved__::Pcp_BuildPrimIndex(pxrInternal_v0_... (build/usd/pxr/usd/pcp/library_unit.cpp)

**** Function sets that took longest to compile / optimize:
  6214 ms: boost::python::detail::signature_arity<$>::impl<$>::elements() (3571 times, avg 1 ms)
  4038 ms: bool pxrInternal_v0_22__pxrReserved__::UsdStage::_GetGeneralMetadata... (64 times, avg 63 ms)
  3584 ms: boost::python::converter::shared_ptr_from_python<$>::construct(_obje... (1430 times, avg 2 ms)
  3468 ms: boost::python::objects::caller_py_function_impl<$>::signature() const (2529 times, avg 1 ms)
  2370 ms: bool pxrInternal_v0_22__pxrReserved__::UsdStage::_GetMetadataImpl<$>... (64 times, avg 37 ms)

*** Expensive headers:
87569 ms: /opt/boost/1.70.0/include/boost/preprocessor/iteration/detail/iter/forward1.hpp (included 1284 times, avg 68 ms), included via:
  wrapAuthoring.cpp def.hpp make_function.hpp args.hpp object_core.hpp call.hpp arg_to_python.hpp function_handle.hpp caller.hpp  (249 ms)
  ...

57173 ms: build/usd/include/pxr/usd/sdf/layer.h (included 49 times, avg 1166 ms), included via:
  sdfdump.cpp.o  (2044 ms)
  ...

51186 ms: build/usd/include/pxr/base/tf/scriptModuleLoader.h (included 39 times, avg 1312 ms), included via:
  moduleDeps.cpp  (1522 ms)
  ...

45756 ms: /opt/boost/1.70.0/include/boost/preprocessor/iteration/detail/local.hpp (included 4871 times, avg 9 ms), included via:
  type.cpp pyObjectFinder.h pyIdentity.h class.hpp data_members.hpp make_function.hpp args.hpp  (29 ms)
  ...

43292 ms: build/usd/include/pxr/base/tf/pyObjWrapper.h (included 85 times, avg 509 ms), included via:
  rendererPlugin.cpp rendererPlugin.h rendererPlugin.h renderDelegate.h aov.h types.h value.h  (1424 ms)
  ...

The Actual Refactoring Tools

There are two of them:

  • inline-namespaces: removes the using directives and fully qualifies the symbols that were relying on them.
  • disambiguate-symbols: gives a name to every anonymous namespace found and qualifies the symbols that were relying on it.
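
To illustrate (a hypothetical before/after in the spirit of these descriptions, not the tools' literal output), disambiguate-symbols would turn:

namespace {
bool _IsCompatible() { return true; }
}

bool UsdObject::IsValid() const { return _IsCompatible(); }

into something like:

namespace pxrUsdUsdObject {
bool _IsCompatible() { return true; }
}

bool UsdObject::IsValid() const { return pxrUsdUsdObject::_IsCompatible(); }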

For reference, I've made these tools available on this repository: https://github.com/christophercrouzet/pxr-usd-unity-build.

To emphasize this again: this was developed on Linux—it might run on macOS if the stars align, but it probably won't on Windows without a few touch-ups.

The Refactoring Steps

Several steps were involved in applying the required changes, and the first commits reflect these steps:

  • make usd-inline-namespaces -> commit “Inline namespaces using Clang's AST API”.
  • make usd-patch-inline-namespaces -> commit “Apply manual changes to the namespaces inlining step”.
  • make usd-disambiguate-symbols -> commit “Disambiguate symbols using Clang's AST API”.
  • make usd-patch-disambiguate-symbols -> commit “Apply manual changes to the symbols disambiguation step”.
  • make usd-patch-misc -> commit “Apply some miscellaneous manual fixes”.

Build Configuration Used

See https://github.com/christophercrouzet/pxr-usd-unity-build/blob/3da4696dc2bd3212e27c57eee65d00ec56f1e914/Makefile#L47-L110.

Additional Goals

  • no breaking changes.
  • avoid further cluttering the already massive diff by not moving code around (e.g.: moving free functions into an existing namespace scope), and by not attempting to format the code (e.g.: splitting long statements into multiple lines).

To-Do

  • ensure that the CI succeeds with environments and compilation options other than mine (Ubuntu 20.04, Python 3).
  • update the template generation code in pxr/usd/usd/codegenTemplates.
  • improve the unity build implementation in CMake so that code changes don't trigger a recompilation of the whole codebase.
  • wrap these lengthy unique namespaces into macros and refer to these macros within the code? (See the sketch after this list.)
  • rebase this PR onto the latest commit from the develop branch—I refrained from even checking what the merge conflicts look like and will keep doing so until necessary 😁
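
As for the macro idea in the last to-do item, a minimal sketch of what it could look like (the macro names are hypothetical, not part of this PR):

#define PXR_UNIQUE_NS_OPEN(name) namespace name {
#define PXR_UNIQUE_NS_CLOSE }

PXR_UNIQUE_NS_OPEN(pxrImagingHioOpenVDBVdbTextureData)
// File-local helpers that previously lived in an anonymous namespace...
PXR_UNIQUE_NS_CLOSE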

Notes

It should be possible to build some linters on top of Clang's AST API, to be run as part of the CI, to enforce certain rules such as flagging free private functions that don't belong to any named namespace.

Also, compiling USD with Clang requires the changes described in this other pull request: #1696.

Credits

This work was only possible thanks to @aras-p who implemented the -ftime-trace compiler flag for Clang (see https://aras-p.info/blog/2019/01/16/time-trace-timeline-flame-chart-profiler-for-Clang).


  • I have submitted a signed Contributor License Agreement

@jilliene

Filed as internal issue #USD-7270

@spiffmon
Member

Hi @christophercrouzet - just wanted to say thanks, this is really interesting work and impressive numbers! It may be a while before we're able to act on it, but it'd be cool if others are able to take these experiments even further.

@FlorianZ
Contributor

Totally agree! Even though I don't yet understand all of it, this is totally awesome, and I am really enjoying learning from your write-up and PRs!

@christophercrouzet
Contributor Author

Thanks for the kind words @spiffmon and @FlorianZ!

[...] it'd be cool if others are able to take these experiments even further.

As in exploring other approaches?

@spiffmon
Member

spiffmon commented Apr 1, 2022

As in exploring other approaches?

@meshula, @sunyab, and I were discussing earlier this week, and wondering if some further digging could help identify where the biggest bangs for the buck are... e.g. we know some of the boost stuff is heavy, so what do we get if we:
a) eliminate our use of BOOST_PP
b) are more selective in our wrap files about which includes we bring in, rather than just dumping boost/python.hpp in so many places

And then also seeing whether there is some smaller set of modules that would have the greatest impact on build time when switched to unity builds... and/or, more generally, other strategies for deploying improvements in smaller bites.

Thanks again, @christophercrouzet - really thought-provoking work!

@christophercrouzet
Contributor Author

Awesome! It would be interesting to see the results of these investigations—thank you for sharing and for looking into it!

I don't know how much work it would require, but maybe another thing worth considering would be to experiment with converting the Python bindings from Boost.Python to nanobind?

nanobind's author claims that it can be ~2-5x faster to compile than Boost.Python, and ~2x more performant at runtime.

Not only that, but it also seems to come with a ~8-9x reduction in binary size compared to Boost.Python, which is something to take into consideration if there is an intention to push USD towards runtime environments, as @mirror2mask mentioned during the recent panel at GTC titled “Exploring USD: The HTML for 3D Virtual World [S42112]”.

@spiffmon
Member

spiffmon commented Apr 3, 2022

Thanks for the pointer to nanobind - it looks pretty fantastic! One of our engineers pushed pretty hard a couple years back to try to get USD to use pybind11 without causing the ripple effect of needing to switch the entire rest of our million+ loc codebase built on top of USD from boost to pybind11 also. Alas, it did not seem possible. Given how many of boost::python's features we use, switching to nanobind seems like it would be even more involved than switching to pybind11, and since we can't just upgrade USD without upgrading all of Presto and our vendor DCC plugins, it's going to take some mountain moving to fund such a project. We definitely would like to, though!

@christophercrouzet
Contributor Author

Good to know, thanks for the explanation @spiffmon!

@meshula
Member

meshula commented Jul 8, 2022

@christophercrouzet, this continues to be such an interesting PR :) I was wondering if you could say a little bit about how you configured your build to work with -ftime-trace and Aras' profiling tools? Did you patch the usd build scripts to insert the flag, and was there anything tricky involved in getting Aras' tools to use the results? I'd like to be able to reproduce a timing setup myself to generate the same kinds of reports you are showing here and in your blog post.

@christophercrouzet
Contributor Author

Hi @meshula!

Aras' tool is only shipped as part of Clang 9.0+ so the main thing was to get USD to compile using Clang, which required a tiny change in the codebase as described in #1696.

The rest is basically only a matter of following what is described in https://github.com/aras-p/ClangBuildAnalyzer:

  • add the -ftime-trace flag to the compiler.
  • compile.
  • run the ClangBuildAnalyzer --all and ClangBuildAnalyzer --analyze commands on the generated JSON files.

If it can be of any help, I streamlined these steps in the following Makefile: https://github.com/christophercrouzet/pxr-usd-unity-build/blob/main/Makefile. More specifically, look at the usd-build and usd-analyze-trace phony targets.

@meshula
Member

meshula commented Jul 9, 2022

Ah, thanks for the pointers to your Makefile, that's very helpful!
