[llvm] Lower latency bonus threshold in function specialization. #143954
Conversation
Related to llvm#143219.

Function specialization does not kick in if flang sets `noalias` attributes on the function arguments of `digits_2`, because PRE optimizes several `srem` instructions and other memory accesses out of the inner loops, causing the latency bonus to fall below the current 40% threshold.

While looking at this, I did not really understand why we compute the latency bonus as a ratio of the latency of the "eliminated" instructions to the code size of the whole function. It did not make much sense to me.

I tried computing the total latency as a sum of the latencies of the instructions that belong to non-dead code (including the instructions that would have been executed had they not been "eliminated" due to constant propagation). This total latency should represent the total cost of executing the function when the given argument is dynamically equal to the tried constant value. The latency bonus would then be computed as the ratio between the latency of the "eliminated" instructions and this total latency. Unfortunately, this did not give me a good heuristic either. The bonus was close to 0% on some targets, and as big as 3-5% on others. This does not match the performance gain actually achieved by function specialization for exchange2, so it seemed like another artificial heuristic, no better than the current one.

It seems that GCC uses a set of different heuristics for function specialization, but I am not an expert here and I cannot say whether we can match them in LLVM.

With all that said, I decided to lower the threshold to avoid the regression and be able to re-enable the generally beneficial change for the `noalias` attribute. With this patch, I was able to reduce the effect of `noalias`, so that `-force-no-alias=true` code is only ~10% slower than `-force-no-alias=false` code on neoverse-v1 and neoverse-v2. On neoverse-n1, `-force-no-alias=true` is >2x faster than `-force-no-alias=false` regardless of this patch.

This threshold has been changed before, also due to improved alias information: llvm@2fb51fb#diff-066363256b7b4164e66b28a3028b2cb9e405c9136241baa33db76ebd2edb87cd

Please let me know what testing I should run to make sure this change is safe. As I understand it, the change may affect compile-time performance, and I would appreciate it if someone could point out which benchmarks need to be checked before merging this.
@llvm/pr-subscribers-function-specialization @llvm/pr-subscribers-llvm-transforms
Author: Slava Zakharin (vzakhari)
Full diff: https://github.com/llvm/llvm-project/pull/143954.diff (1 file affected)
diff --git a/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp b/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
index 1034ce9582152..45fa9d57e4862 100644
--- a/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
+++ b/llvm/lib/Transforms/IPO/FunctionSpecialization.cpp
@@ -71,7 +71,7 @@ static cl::opt<unsigned> MinCodeSizeSavings(
"much percent of the original function size"));
static cl::opt<unsigned> MinLatencySavings(
- "funcspec-min-latency-savings", cl::init(40), cl::Hidden,
+ "funcspec-min-latency-savings", cl::init(20), cl::Hidden,
cl::desc("Reject specializations whose latency savings are less than this "
"much percent of the original function size"));
I agree: while the CodeSize savings is a reliable metric, Latency is inaccurate. We use it in an attempt to model the hotness of code, i.e. loop nests, etc. Compilation time will increase if more specializations trigger. The benchmark used to monitor such regressions is the compile time tracker (repository: https://github.com/nikic/llvm-compile-time-tracker, website: http://llvm-compile-time-tracker.com), which compiles the llvm test suite. The interesting configurations for this change would be non-LTO -O3 and LTO -O3. I would measure the number of specializations before/after, and the instruction count increase while compiling.
The latency bonus is just an approximation; neither of the above gives a good answer. Actually, they both give similarly inaccurate answers, but the second (the currently implemented one) is cheaper to compute. Another idea would be to check whether there are any unhandled instructions we could teach the instruction cost visitor to handle. Last time I checked, there wasn't anything except for memory stores, which are very difficult to track.
I think this is ready for review. I am not sure what kind of LIT test I should add (if any). Please advise.
At first I thought that a 20% latency threshold is too low, but Nikita's comment seems reassuring. Looking at the related issue #143219 I am seeing that
The 34% bonus was reported only for the first iteration. The following iterations (for the other constant values) reported a smaller bonus. I believe the lowest was about 21-22%, so I picked 20% to make sure all the specializations kick in and I get the performance back. @nikic, would it be too much trouble to remeasure with the 0% threshold change? If not, I will create a separate branch and send you the link. Thank you in advance!
Here are the results for setting it to 0: https://llvm-compile-time-tracker.com/compare.php?from=402c376daa659c0c3a477ad038a415079ffa0a48&to=0dbf17247153e5261059aa3fc272088ce1f9123f&stat=instructions%3Au |
@kiranchandramohan, @tblah this change does not affect LTO performance on aarch64 in my testing, but please let me know if you see any discrepancies. Also feel free to revert it while I am out of the office from the 19th to the 23rd of June.