[llvm] Lower latency bonus threshold in function specialization. #143954

Merged
1 commit merged into llvm:main on Jun 17, 2025

Conversation

vzakhari
Contributor

Related to #143219.

Function specialization does not kick in if flang sets noalias
attributes on the function arguments of digits_2, because PRE
optimizes several srem instructions and other memory accesses
from the inner loops causing the latency bonus to be lower than
the current 40% threshold.

While looking at this, I did not really understand why we compute the latency
bonus as the ratio between the latency of the "eliminated" instructions
and the code size of the whole function. It did not make much sense
to me.
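
(For context, here is a minimal sketch of the ratio in question; the names
are illustrative and this is not the actual FunctionSpecialization.cpp code.)

#include <cstdint>

// Latency bonus as currently computed: the latency of the instructions that
// the specialization would make dead, divided by the code size of the whole
// function, expressed as a percentage.
static unsigned currentLatencyBonus(uint64_t EliminatedLatency,
                                    uint64_t FuncCodeSize) {
  return FuncCodeSize ? unsigned(EliminatedLatency * 100 / FuncCodeSize) : 0;
}

// A candidate specialization is rejected when this bonus falls below the
// funcspec-min-latency-savings threshold (40 before this patch).
static bool passesLatencyCheck(uint64_t EliminatedLatency,
                               uint64_t FuncCodeSize,
                               unsigned MinLatencySavings) {
  return currentLatencyBonus(EliminatedLatency, FuncCodeSize) >=
         MinLatencySavings;
}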

I tried computing the total latency as the sum of the latencies
of the instructions that belong to non-dead code (including
the instructions that would have been executed had they not been
"eliminated" due to the constant propagation). This total
latency should represent the total cost of executing the function
when the given argument is dynamically equal to the constant
value being tried. The latency bonus would then be computed
as the ratio between the latency of the "eliminated" instructions
and the total latency. Unfortunately, this did not give me a good
heuristic either. The bonus was close to 0% on some targets,
and as large as 3-5% on others. This does not match the performance
gain actually achieved by function specialization for exchange2,
so it seemed like just another artificial heuristic, no better
than the current one.
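
(In code form, the alternative described above would only change the
denominator; again a sketch with illustrative names, not actual LLVM code.)

#include <cstdint>

// Alternative denominator: the summed latency of all live (non-dead)
// instructions, i.e. an estimate of the cost of executing the function when
// the argument is dynamically equal to the constant being tried. The
// numerator stays the same: the latency of the "eliminated" instructions.
static unsigned triedLatencyBonus(uint64_t EliminatedLatency,
                                  uint64_t TotalLiveLatency) {
  return TotalLiveLatency
             ? unsigned(EliminatedLatency * 100 / TotalLiveLatency)
             : 0;
}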

It seems that GCC uses a set of different heuristics for function
specialization, but I am not an expert here and I cannot say
if we can match them in LLVM.

With all that said, I decided to try to lower the threshold
to avoid the regression and be able to re-enable the generally
good change for noalias attribute.

With this patch, I was able to reduce the effect of noalias,
so that -force-no-alias=true is only ~10% slower than
-force-no-alias=false code on neoverse-v1 and neoverse-v2.
On neoverse-n1, -force-no-alias=true is >2x faster than
-force-no-alias=false regardless of this patch.

This threshold has been changed before, also due to improved
alias information: 2fb51fb#diff-066363256b7b4164e66b28a3028b2cb9e405c9136241baa33db76ebd2edb87cd

Please let me know what testing I should run to make sure this change
is safe. As I understand it, it may affect compile-time performance,
and I would appreciate it if someone could point out which benchmarks
need to be checked before merging this.

@llvmbot
Member

llvmbot commented Jun 12, 2025

@llvm/pr-subscribers-function-specialization

@llvm/pr-subscribers-llvm-transforms

Author: Slava Zakharin (vzakhari)

Full diff: https://github.com/llvm/llvm-project/pull/143954.diff

1 file affected:

  • (modified) llvm/lib/Transforms/IPO/FunctionSpecialization.cpp (+1-1)
diff --git a/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp b/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
index 1034ce9582152..45fa9d57e4862 100644
--- a/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
@@ -71,7 +71,7 @@ static cl::opt<unsigned> MinCodeSizeSavings(
              "much percent of the original function size"));
 
 static cl::opt<unsigned> MinLatencySavings(
-    "funcspec-min-latency-savings", cl::init(40), cl::Hidden,
+    "funcspec-min-latency-savings", cl::init(20), cl::Hidden,
     cl::desc("Reject specializations whose latency savings are less than this "
              "much percent of the original function size"));
 
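
Side note for anyone experimenting locally: funcspec-min-latency-savings is an
ordinary cl::opt, so the default can be overridden without rebuilding the
compiler, for example (illustrative invocation; the exact driver flags depend
on your setup):

  clang -O3 -mllvm -funcspec-min-latency-savings=30 test.c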

@labrinea
Collaborator

I agree: while CodeSize savings is a reliable metric, Latency is inaccurate. We use it in an attempt to model the hotness of code, i.e. loop nests etc. Compilation time will increase if more specializations trigger. The benchmark used to monitor such regressions is the compile-time tracker (repository: https://github.com/nikic/llvm-compile-time-tracker, website: http://llvm-compile-time-tracker.com), which compiles the LLVM test suite. The interesting configurations for this change would be non-LTO -O3 and LTO -O3. I would measure the number of specializations before/after, and the instruction-count increase while compiling.

> I tried computing the total latency as a sum of latencies
> of the instructions that belong to non-dead code (including
> the instructions that would be executed had they not been
> "eliminated" due to the constant propagation). This total
> latency should identify the total cost of executing the function
> with the given argument being dynamically equal to the tried
> constant value. Then the latency bonus would be computed
> as the ratio between the latency of the "eliminated" instructions
> and the total latency.

> While looking at this, I did not really get why we compute the latency
> bonus as a ratio of the latency of the "eliminated" instructions
> and the code-size of the whole function. It did not make much sense
> to me.

The latency bonus is just an approximation; neither of the above gives a good answer. In fact, they both give similarly inaccurate answers, but the second one (currently implemented) is cheaper to compute.

Another idea would be to check whether there are any unhandled instructions we could teach the instruction cost visitor to handle. Last time I checked there wasn't anything except memory stores, which are very difficult to track.

@nikic
Contributor

nikic commented Jun 16, 2025

I don't see any compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=402c376daa659c0c3a477ad038a415079ffa0a48&to=332ad57967be0ade84cdbc1b3bdaca3cbb00009c&stat=instructions%3Au

@vzakhari
Contributor Author

I think this is ready for review. I am not sure what kind of LIT test I should add (if any). Please advise.

@labrinea
Collaborator

At first I thought that a 20% latency threshold is too low, but Nikita's comment seems reassuring. Looking at the related issue #143219, I see that digits_2 needs 34%; is that right? Do we perhaps prefer a higher number like 30%, or is it likely that we will need to decrease it again in the future? Out of curiosity, what is the compile-time impact if we completely disregard this metric (MinLatencySavings = 0)?

@vzakhari
Contributor Author

> At first I thought that a 20% latency threshold is too low, but Nikita's comment seems reassuring. Looking at the related issue #143219, I see that digits_2 needs 34%; is that right? Do we perhaps prefer a higher number like 30%, or is it likely that we will need to decrease it again in the future? Out of curiosity, what is the compile-time impact if we completely disregard this metric (MinLatencySavings = 0)?

The 34% bonus was reported only for the first iteration. The following iterations (for the other constant values) reported less bonus. I believe the lowest was about 21-22%, so I picked 20% to make sure all the specializations kick in and I get the performance back.

@nikic would it be too much trouble to re-measure with the threshold set to 0%? If not, I will create a separate branch and send you the link. Thank you in advance!

@nikic
Contributor

nikic commented Jun 17, 2025

Here are the results for setting it to 0: https://llvm-compile-time-tracker.com/compare.php?from=402c376daa659c0c3a477ad038a415079ffa0a48&to=0dbf17247153e5261059aa3fc272088ce1f9123f&stat=instructions%3Au

@labrinea
Collaborator

Thanks for the data @nikic. Alright, so it seems a 0% latency threshold does have a significant effect on compilation times for LTO configurations. @vzakhari this suggests that the metric serves a purpose; let's not remove it entirely. I will approve this patch now.

@vzakhari
Contributor Author

@kiranchandramohan , @tblah this change does not affect LTO performance on aarch64 in my testing, but please let me know if you see any discrepancies. Also feel free to revert it while I am out of the office from 19th to 23rd of June.

vzakhari merged commit bec9ac2 into llvm:main on Jun 17, 2025
10 checks passed