Optimize typed coordinate conversion functions for faster Windows unsafe_get_submap_at performance #75376
Summary
Performance "Optimize typed coordinate conversion functions for widespread savings"
Purpose of change
I accidentally left a performance profiling run open for a couple of hours. Interestingly, I noticed that what should be a basic math function, `project_coord`, was top of the list in self CPU at a shockingly high % of total CPU. This tickled my performance hotspot spidey sense, so I went digging.

There's a variety of things going on, but the basic point is that the compiler was not inlining `project_coord`, which by design is supposed to be inlined to take advantage of dead code stripping, because the struct has more members than usually needed. This meant wasted computation in a very, very hot code path.

The net win is that `map::ter` is about 33% faster, `map::furn` is 25% faster, and many other things called less often probably have similar wins. `map::ter` in particular is called in the render codepath, so that makes everything just a bit snappier.

There were, unfortunately, no measurable wins with clang-cl. The generated code was identical (and fairly optimal) before/after my changes.
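As a rough illustration of why inlining matters here, the sketch below uses a hypothetical result struct and a hypothetical `project_sketch` helper (neither is the real `project_coord`, and the `SKETCH_FORCE_INLINE` macro and the scale of 12 are assumptions made for the example). When the helper is inlined into a caller that reads only one member, the optimizer is free to drop the arithmetic for the other three. If the call is left out of line, all four members are computed and returned every time.

```cpp
// Hypothetical stand-in for a conversion result that carries more
// members than most callers actually read.
struct projected {
    int quotient_x;
    int quotient_y;
    int remainder_x;
    int remainder_y;
};

// Assumed portability macro for this sketch: request inlining even when
// the compiler's own heuristics would decline.
#if defined(_MSC_VER)
#define SKETCH_FORCE_INLINE __forceinline
#elif defined(__GNUC__)
#define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
#define SKETCH_FORCE_INLINE inline
#endif

// Assumed scale factor, chosen only for illustration.
constexpr int sketch_scale = 12;

// Computes all four members. Inputs are assumed non-negative so that
// plain integer division and modulo give the intended result.
SKETCH_FORCE_INLINE projected project_sketch( int x, int y )
{
    return { x / sketch_scale, y / sketch_scale,
             x % sketch_scale, y % sketch_scale };
}

// Reads a single member. Once project_sketch is inlined, the work for
// the three unused members is dead code the optimizer can remove.
int coarse_x_of( int x, int y )
{
    return project_sketch( x, y ).quotient_x;
}
```

Comparing the generated assembly for `coarse_x_of` with and without the forced inline is one way to confirm whether the unused members are actually being stripped on a given compiler.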
Describe the solution
Describe alternatives you've considered
Testing
Hacked up the `map_bounds_checking` test to run 100 times in a loop. Instrumented it enough times to get 3 results within noise of each other, before & after.

I also hand expanded the template code into `unsafe_get_submap_at` to strip all the dead code manually and the generated assembly was functionally identical.
Additional context
There's nontrivial bulk added by two conditionals because at the bottom of the math stack are calls to `divide_xy_round_to_minus_infinity`. If we can get the inputs to be `_ib` points, then the math instead uses `divide_xy_round_to_minus_infinity_non_negative`, which is branch free and reduces the asm from 54 lines to 35 lines with no conditional jumps. That's a longer term and harder effort though, and my immediate attempts to do this did not get any significant wins off the bat.
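To make the source of those conditionals concrete, here is a minimal one-dimensional sketch of the two rounding strategies. The function names `floor_div_sketch` and `floor_div_non_negative_sketch` are hypothetical and are not the bodies of the real `divide_xy_round_to_minus_infinity` helpers. C++ integer division truncates toward zero, so rounding toward minus infinity needs a sign check on the dividend; when the dividend is known to be non-negative, truncation and flooring coincide and the check disappears.

```cpp
// Floor division for a positive divisor b (assumption of this sketch).
// The sign test on a is the kind of conditional that a two-axis version
// would need once per axis.
constexpr int floor_div_sketch( int a, int b )
{
    return a >= 0 ? a / b : ( a - b + 1 ) / b;
}

// Valid only when a >= 0 and b > 0: truncation already equals floor,
// so no sign test is needed.
constexpr int floor_div_non_negative_sketch( int a, int b )
{
    return a / b;
}

// Truncation and flooring disagree for negative dividends...
static_assert( -1 / 12 == 0, "C++ division truncates toward zero" );
static_assert( floor_div_sketch( -1, 12 ) == -1, "floor rounds down" );
static_assert( floor_div_sketch( -12, 12 ) == -1, "exact multiple" );
static_assert( floor_div_sketch( -13, 12 ) == -2, "just past a multiple" );
// ...and agree for non-negative ones.
static_assert( floor_div_sketch( 13, 12 ) ==
               floor_div_non_negative_sketch( 13, 12 ), "same when a >= 0" );
```

Whether the sign test compiles to a conditional jump or to a branchless sequence depends on the compiler and optimization flags, so the generated assembly is still worth checking.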