[llvm] Lower latency bonus threshold in function specialization. #143954

Merged
1 commit merged into llvm:main on Jun 17, 2025

Conversation

vzakhari
Contributor

Related to #143219.

Function specialization does not kick in if flang sets noalias
attributes on the function arguments of digits_2, because PRE
optimizes several srem instructions and other memory accesses
from the inner loops causing the latency bonus to be lower than
the current 40% threshold.

While looking at this, I did not really understand why we compute the latency
bonus as the ratio between the latency of the "eliminated" instructions
and the code size of the whole function. It did not make much sense
to me.
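
(For context, here is a minimal sketch of the ratio in question; the names
are illustrative and this is not the actual FunctionSpecialization.cpp code.)

#include <cstdint>

// Latency bonus as currently computed: the latency of the instructions that
// the specialization would make dead, divided by the code size of the whole
// function, expressed as a percentage.
static unsigned currentLatencyBonus(uint64_t EliminatedLatency,
                                    uint64_t FuncCodeSize) {
  return FuncCodeSize ? unsigned(EliminatedLatency * 100 / FuncCodeSize) : 0;
}

// A candidate specialization is rejected when this bonus falls below the
// funcspec-min-latency-savings threshold (40 before this patch).
static bool passesLatencyCheck(uint64_t EliminatedLatency,
                               uint64_t FuncCodeSize,
                               unsigned MinLatencySavings) {
  return currentLatencyBonus(EliminatedLatency, FuncCodeSize) >=
         MinLatencySavings;
}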

I tried computing the total latency as the sum of the latencies
of the instructions that belong to non-dead code (including
the instructions that would have been executed had they not been
"eliminated" due to the constant propagation). This total
latency should represent the total cost of executing the function
when the given argument is dynamically equal to the constant
value being tried. The latency bonus would then be computed
as the ratio between the latency of the "eliminated" instructions
and the total latency. Unfortunately, this did not give me a good
heuristic either. The bonus was close to 0% on some targets,
and as large as 3-5% on others. This does not match the performance
gain actually achieved by function specialization for exchange2,
so it seemed like just another artificial heuristic, no better
than the current one.
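
(In code form, the alternative described above would only change the
denominator; again a sketch with illustrative names, not actual LLVM code.)

#include <cstdint>

// Alternative denominator: the summed latency of all live (non-dead)
// instructions, i.e. an estimate of the cost of executing the function when
// the argument is dynamically equal to the constant being tried. The
// numerator stays the same: the latency of the "eliminated" instructions.
static unsigned triedLatencyBonus(uint64_t EliminatedLatency,
                                  uint64_t TotalLiveLatency) {
  return TotalLiveLatency
             ? unsigned(EliminatedLatency * 100 / TotalLiveLatency)
             : 0;
}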

It seems that GCC uses a set of different heuristics for function
specialization, but I am not an expert here and I cannot say
if we can match them in LLVM.

With all that said, I decided to try to lower the threshold
to avoid the regression and be able to re-enable the generally
good change for noalias attribute.

With this patch, I was able to reduce the effect of noalias,
so that -force-no-alias=true is only ~10% slower than
-force-no-alias=false code on neoverse-v1 and neoverse-v2.
On neoverse-n1, -force-no-alias=true is >2x faster than
-force-no-alias=false regardless of this patch.

This threshold has been changed before, also due to improved
alias information: 2fb51fb#diff-066363256b7b4164e66b28a3028b2cb9e405c9136241baa33db76ebd2edb87cd

Please let me know what testing I should run to make sure this change
is safe. As I understand it, it may affect compile-time performance,
and I would appreciate it if someone could point out which benchmarks
need to be checked before merging this.

@llvmbot
Member

llvmbot commented Jun 12, 2025

@llvm/pr-subscribers-function-specialization

@llvm/pr-subscribers-llvm-transforms

Author: Slava Zakharin (vzakhari)

Full diff: https://github.com/llvm/llvm-project/pull/143954.diff

1 file affected:

  • (modified) llvm/lib/Transforms/IPO/FunctionSpecialization.cpp (+1-1)
diff --git a/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp b/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
index 1034ce9582152..45fa9d57e4862 100644
--- a/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
@@ -71,7 +71,7 @@ static cl::opt<unsigned> MinCodeSizeSavings(
              "much percent of the original function size"));
 
 static cl::opt<unsigned> MinLatencySavings(
-    "funcspec-min-latency-savings", cl::init(40), cl::Hidden,
+    "funcspec-min-latency-savings", cl::init(20), cl::Hidden,
     cl::desc("Reject specializations whose latency savings are less than this "
              "much percent of the original function size"));
 
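
Side note for anyone experimenting locally: funcspec-min-latency-savings is an
ordinary cl::opt, so the default can be overridden without rebuilding the
compiler, for example (illustrative invocation; the exact driver flags depend
on your setup):

  clang -O3 -mllvm -funcspec-min-latency-savings=30 test.c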

@labrinea
Collaborator

I agree: while CodeSize savings is a reliable metric, Latency is inaccurate. We use it in an attempt to model the hotness of code, i.e. loop nests etc. Compilation time will increase if more specializations trigger. The benchmark used to monitor such regressions is the compile-time tracker (repository: https://github.com/nikic/llvm-compile-time-tracker, website: http://llvm-compile-time-tracker.com), which compiles the LLVM test suite. The interesting configurations for this change would be non-LTO -O3 and LTO -O3. I would measure the number of specializations before/after, and the instruction-count increase while compiling.

> I tried computing the total latency as a sum of latencies
> of the instructions that belong to non-dead code (including
> the instructions that would be executed had they not been
> "eliminated" due to the constant propagation). This total
> latency should identify the total cost of executing the function
> with the given argument being dynamically equal to the tried
> constant value. Then the latency bonus would be computed
> as the ratio between the latency of the "eliminated" instructions
> and the total latency.

> While looking at this, I did not really get why we compute the latency
> bonus as a ratio of the latency of the "eliminated" instructions
> and the code-size of the whole function. It did not make much sense
> to me.

The latency bonus is just an approximation; neither of the above gives a good answer. In fact, they both give similarly inaccurate answers, but the second one (currently implemented) is cheaper to compute.

Another idea would be to check whether there are any unhandled instructions we could teach the instruction cost visitor to handle. Last time I checked there wasn't anything except memory stores, which are very difficult to track.

@nikic
Contributor

nikic commented Jun 16, 2025

I don't see any compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=402c376daa659c0c3a477ad038a415079ffa0a48&to=332ad57967be0ade84cdbc1b3bdaca3cbb00009c&stat=instructions%3Au

@vzakhari
Contributor Author

I think this is ready for review. I am not sure what kind of LIT test I should add (if any). Please advise.

@labrinea
Collaborator

At first I thought that a 20% latency threshold is too low, but Nikita's comment seems reassuring. Looking at the related issue #143219, I see that digits_2 needs 34%; is that right? Do we perhaps prefer a higher number like 30%, or is it likely that we will need to decrease it again in the future? Out of curiosity, what is the compile-time impact if we completely disregard this metric (MinLatencySavings = 0)?

@vzakhari
Contributor Author

> At first I thought that a 20% latency threshold is too low, but Nikita's comment seems reassuring. Looking at the related issue #143219, I see that digits_2 needs 34%; is that right? Do we perhaps prefer a higher number like 30%, or is it likely that we will need to decrease it again in the future? Out of curiosity, what is the compile-time impact if we completely disregard this metric (MinLatencySavings = 0)?

The 34% bonus was reported only for the first iteration. The following iterations (for the other constant values) reported less bonus. I believe the lowest was about 21-22%, so I picked 20% to make sure all the specializations kick in and I get the performance back.

@nikic would it be too much trouble to re-measure with the threshold set to 0%? If not, I will create a separate branch and send you the link. Thank you in advance!

@nikic
Contributor

nikic commented Jun 17, 2025

Here are the results for setting it to 0: https://llvm-compile-time-tracker.com/compare.php?from=402c376daa659c0c3a477ad038a415079ffa0a48&to=0dbf17247153e5261059aa3fc272088ce1f9123f&stat=instructions%3Au

@labrinea
Collaborator

Thanks for the data @nikic. Alright, so it seems a 0% latency threshold does have a significant effect on compilation times for LTO configurations. @vzakhari this suggests that the metric serves a purpose; let's not remove it entirely. I will approve this patch now.

@vzakhari
Contributor Author

@kiranchandramohan , @tblah this change does not affect LTO performance on aarch64 in my testing, but please let me know if you see any discrepancies. Also feel free to revert it while I am out of the office from 19th to 23rd of June.

vzakhari merged commit bec9ac2 into llvm:main on Jun 17, 2025
10 checks passed