-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPECIALIZE needed despite no warnings by -Wall-missed-specialisations #14
Comments
Update re GHC 9.2.2:
GHC 9.2.2 seems to spam just as much, which makes |
I've reproduced exactly the 2 scenarios with 9.2.2, with the same results as with 9.2.1. The fix for too verbose https://gitlab.haskell.org/ghc/ghc/-/issues/20744#note_412503 and when I looked more closely, indeed the spam is much less. However, I could not find the problematic functions, those for which SPECIALIZE was necessary, in the log. OTOH, |
I got stuck on teh
Not sure what to do next. I guess I have to install something -- but what? |
If you're on an Debian-based system (e.g. Ubuntu) then
|
Later on there are also harder deps to fill described here: https://github.com/Mikolaj/horde-ad/blob/master/README.md#compilation-from-source |
BTW, the second commit (with the two SPECIALIZE) produces such crazy results (best to just do
I'm just trying to unravel that. It may be related to laziness and the dirty tricks Edit: GHC 8.10.7 gives almost the same results as GHC 9.0.2. Edit2: and the fact that I can only use GHC 9.2 in the recent code (due to type-level pattern matching that is only in 9.2. see #13), makes it so much harder to compare perf and discover particular GHC version flukes vs. consistent perf changes. Edit3: perhaps what's going is that GHC 9.2.2 (or some newer library versions enabled by GHC 9.2.2) fail to fuse, e.g., vectors. That can cause much more allocation (allocation done not by GC, to be clear) and so much higher mutation time (worse cache locality), despite less time spent in GC. However, why failed fusion would cause so much less time spent in GC eludes me. Unless this has an independent cause. Edit4: this ticket is important also because our benchmarks vs other Haskell libraries are worth nothing if we have such huge unexplained anomalies. Benchmarks against Python/C libraries may also give a wrong impression if the Haskell side is handicapped for no fundamental reason. Edit5: the
makes the |
The result of Profiling output (on master branch, commit 1f7926f, performance is similar, but profile looks quite different to branch ghc-report-specialize, after both diverged a bit lately, but that's probably just a difference in how profiling skews results) for the 9.2.2 run above (the one that is twice slower and allocates 3 times more than 9.0.2, but does almost no GC) does not confirm the failed fusion hypothesis. It assigns most of allocations to this code that should not permanently allocate anything (so perhaps there are some transient allocations? or is horde-ad/src/HordeAd/Core/Delta.hs Lines 341 to 343 in b026848
Eventlog profiling and heap profiles show no relevant thunks (on master branch I squelched most innocent thunks just to be sure). Profiling output from branch ghc-report-specialize with GHC 9.2.2 shows a more believable story, but still no explanation for the difference vs 9.0.2 and still no support for the broken vector fusion hypothesis:
Analogous profile for GHC 9.0.2 is quite similar. |
@simonpj: A remark about the code that needs the two SPECIALIZE pragmas (BENCH1): this is a thick code path, where most of the code is not used (gradient and object function result are computed, the gradient is thrown away and only the result is kept). So perhaps the specialization is lost when some parts of the code are simplified away? Or some types are not instantiated, because the gradient is thrown away, and this prevents specialization somehow? Another hint: a part of the problem goes away (the performance is improve partially) when an associated type ( |
@simonpj: Might this be related: https://gitlab.haskell.org/ghc/ghc/-/issues/20709 ? Or is there a way to inhibit worker/wrapper or inlining of the wrapper? |
It would be great if you could do that. I still have no idea if it is related thought |
First results of BENCH2 with HEAD and head.hackage: the runtime and allocation regression in 9.2.2 is mostly gone [Edit: this is correct; only 14% regression in time and alloc remains]]:
This is much closer to 9.2.2 results now. One thing that didn't change is Edit: this is solved most probably by the fix to https://gitlab.haskell.org/ghc/ghc/-/issues/20364 / https://gitlab.haskell.org/ghc/ghc/-/issues/20709 that is in HEAD. |
And with GHC HEAD there is no difference in The instructions to reproduce the HEAD experiment [the instructions are for BENCH1 and I should have used that, but I used BENCH2 instead];
|
Do you mean: with HEAD the problem is solved? |
[Edit: I was wrong, because I used a wrong test (BENCH2, the one for the 9.2.2 regression). The problem is only partially solved.] The problem that |
So HEAD is still 14% slower than 9.0? Can you give specific repro instructions? Thanks |
Yes, I just repeated the tests. The instructions are as above, but run
|
BTW, in some similar tests earlier on I had 20% speedup in 9.2.2 vs 9.0.2, so this may be a problem that eats up the 20% improvement (e.g., from the low GC copying) and adds 14% degradation on top. |
Actually, for some reason I'm running a different test (BENCH2 instead of BENCH1). Let me retry with the one from the description of this ticket (BENCH1). |
I apologise. I forgot there are two different tests, one for the 2 SPECIALIZE pragmas problem (BENCH1, present in 8.10.7, 9.0.2, 9.2.2, partially in HEAD), another for the slowdown and allocation blowup problem with the pragmas in place (BENCH2, regression in 9.2.2, fixed in HEAD, except for 14%). I was right HEAD mostly fixes the latter problem (14% slowdown and blowup remains). I was wrong that it fixes the former, because I used the wrong test to verify that. The 2 SPECIALIZE pragmas problem in BENCH1 is only partially mitigated in HEAD. Also some new problems seem to be present in HEAD that were not in 9.2.2. Here are the results for BENCH1 from branch
And here are results with both SPECIALIZE pragmas removed (not other change):
|
I've repeated the SPECIALIZE test BENCH1 in GHC 9.2.2 with |
@simonpj: may I help with this one any more? |
Notes after a call with Mikolaj. (@Mikolaj please feel free to edit) [MK: edited; and again] Current status
[1] /home/simonpj/code/HEAD-5/inplace/bin/ghc-stage2 -c -fbuilding-cabal-package -O -static -dynamic-too -dynosuf dyn_o -dynhisuf dyn_hi -outputdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -odir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -hidir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -stubdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -i -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -isrc -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/autogen -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build -optP-include -optP/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/build/autogen/cabal_macros.h -this-unit-id horde-ad-0.1.0.0-inplace -hide-all-packages -Wmissing-home-modules -no-user-package-db -package-db /home/simonpj/.cabal/store/ghc-9.3.20220316/package.db -package-db /home/simonpj/code/horde-ad/dist-newstyle/packagedb/ghc-9.3.20220316 -package-db /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/package.conf.inplace -package-id assert-failure-0.1.2.5-ea675a91e1e142052b8c09a6b71455a91a11c690fd2d01a36ad7f4cb68430354 -package-id base-4.16.0.0 -package-id bytestring-0.11.2.0 -package-id hmatrix-0.20.2-10eb67f511637ef41e2613b440db02f1222c9526116b87f7594e67978da7419b -package-id mnist-idx-0.1.3.0-93e628f7c64901a8a0e6eca70046a8335a120ecf78872be7a9055068328d629f -package-id pretty-show-1.10-ec297dfb3c4a097ca84c313cd9d74af54aa9c25eebb805b0f274cb7b06664ee1 -package-id random-1.2.1-dff0ac74d3869fa8e6c81ef2173a538040d7f12e33a327a869fd889135fe0d0a -package-id strict-containers-0.1-39349c55f55f295c5b921956ce232b6925f1851889adcb02cbb2cf17a2a8c34a -package-id transformers-0.5.6.2 -package-id vector-0.12.3.1-29f10f436fed96d2e5db20c8b101f6ed402d245d4282e430571bfefe8a438d14 -package-id zlib-0.6.2.3-c4451d988dbea589a8ceb6952bdc1d6e539a0b6208ae97abe04d017e7edb9c13 -XHaskell2010 -XMonoLocalBinds -XScopedTypeVariables -XOverloadedStrings -XBangPatterns -XRecordWildCards -XNamedFieldPuns -XMultiWayIf -XLambdaCase -XDefaultSignatures -XInstanceSigs -XPatternSynonyms -XStrictData -XTypeApplications -XFlexibleContexts src/HordeAd/Tool/MnistFcnnScalar.hs -Wall -Wcompat -Worphans -Wincomplete-uni-patterns -Wincomplete-record-updates -Wimplicit-prelude -Wmissing-home-modules -Widentities -Wredundant-constraints -Wmissing-export-lists -Wpartial-fields -Wunused-packages -fno-ignore-asserts -fexpose-all-unfoldings -fslpecialise-aggressively '-fsimpl-tick-factor=200' -Wall-missed-specialisations -hide-all-packages -fforce-recomp -ddump-simpl >& fcnn-scalar-no-spec.stg [2] /home/simonpj/code/HEAD-5/inplace/bin/ghc-stage2 -c -no-link -fbuilding-cabal-package -O -static -dynamic-too -dynosuf dyn_o -dynhisuf dyn_hi -outputdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -odir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -hidir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -stubdir /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -i -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -ibench -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/autogen -i/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/global-autogen -I/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/mnist-tmp -optP-include -optP/home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/build/mnist/autogen/cabal_macros.h -hide-all-packages -Wmissing-home-modules -no-user-package-db -package-db /home/simonpj/.cabal/store/ghc-9.3.20220316/package.db -package-db /home/simonpj/code/horde-ad/dist-newstyle/packagedb/ghc-9.3.20220316 -package-db /home/simonpj/code/horde-ad/dist-newstyle/build/x86_64-linux/ghc-9.3.20220316/horde-ad-0.1.0.0/b/mnist/package.conf.inplace -package-id base-4.16.0.0 -package-id criterion-1.5.13.0-5c0f15f9ba93e89d1ffb286bc5c4858ea41ec14be80fab0303ea7a74d39e171e -package-id deepseq-1.4.7.0 -package-id horde-ad-0.1.0.0-inplace -package-id random-1.2.1-dff0ac74d3869fa8e6c81ef2173a538040d7f12e33a327a869fd889135fe0d0a -package-id vector-0.12.3.1-29f10f436fed96d2e5db20c8b101f6ed402d245d4282e430571bfefe8a438d14 -XHaskell2010 -XMonoLocalBinds -XScopedTypeVariables -XOverloadedStrings -XBangPatterns -XRecordWildCards -XNamedFieldPuns -XMultiWayIf -XLambdaCase -XDefaultSignatures -XInstanceSigs -XPatternSynonyms -XStrictData -XTypeApplications -XFlexibleContexts bench/BenchMnistTools.hs -Wall -Wcompat -Worphans -Wincomplete-uni-patterns -Wincomplete-record-updates -Wimplicit-prelude -Wmissing-home-modules -Widentities -Wredundant-constraints -Wmissing-export-lists -Wpartial-fields -Wunused-packages -fno-ignore-asserts -fexpose-all-unfoldings -fspecialise-aggressively '-fsimpl-tick-factor=200' -Wall-missed-specialisations -rtsopts -hide-all-packages -fforce-recomp -ddump-rule-firings -fforce-recomp >& foo-rule-fired |
Notes from today's meeting (more in the google doc): the current idea is "never unbox dictionaries in w/w" and the agreed branch is GHC 9.2 (but branch 9.0 would not be a bad choice, either, because it's only one on which the original workaround fails after recent changes to the codebase, even with extra pragmas and proxies for them (can't compile on HEAD); or perhaps I can backport from 9.2 to 9.0 for my tests). |
Unfortunately, no difference. Probably all relevant SPECIALIZE have been added. The test is BENCH2 from #14 Num methods had to be inlined, because proxies could not be added, which was necessary to enable adding SPECIALIZE.
Unfortunately, no difference in GHC 9.2.2 and not a dramatic difference in GHC 9.0.2 and 8.10.7. Probably all relevant SPECIALIZE have been added in this commit. The test is BENCH2 from #14 Num methods had to be inlined, because proxies could not be added, which was necessary to enable adding SPECIALIZE.
If backporting the fix to 9.2 (and 9.0?) is too hard, an alternative is to port get horde-ad buildable on HEAD, which requires porting https://hackage.haskell.org/package/ghc-tcplugins-extra (which may be enough or not to get ghc-typelits-knownnat and ghc-typelits-natnormalise ported) to HEAD and submiting a merge request to https://gitlab.haskell.org/ghc/head.hackage (and to the ghc-tcplugins-extra repo to let them know). |
I wasn't able to port https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847 to GHC 9.2 and the type-level plugins we use are not ported to 9.4 nor HEAD yet, so I'm testing with the same old branch I fully reproduced with GHC 9.2.3 the recorded results with GHC 9.2.2 and BENCH1 and BENCH2 (the recorded data was for 10 runs of BENCH2 on a slightly older commit, but it matches well enough, considering). I also reproduced that removing the two SPECIALIZE pragmas from the program wreaks performance with BENCH1, but does not affect BENCH2, just as before. Here are the full results with the pragmas (BENCH1, then BENCH2):
Then I tried GHC 9.5.20220606 that contains https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7985 and https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7997, but not https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847. The results recorded for ghc-9.3.20220316 match very closely (within 1% plus error) both in BENCH1 and BENCH2. I also reproduced that removing the two SPECIALIZE pragmas from the program wreaks performance with BENCH1, but does not affect BENCH2, exactly as before. Here are the full results with the pragmas (BENCH1, then BENCH2):
Then I tried the same master branch of GHC but with cherry-picked all 3 commits from https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847, called 9.5.20220608. The results are the same as for GHC 9.5.20220606 with both BENCH1 and BENCH2, except that after removing the two SPECIALIZE pragmas from the program, the results do not change at all [edit: unfortunately, I can't reproduce this now, however hard I try; the lack of specialization seems to persist]. Therefore, the main issue from this ticket, the need for the pragmas despite |
Summarizing, main issue from this ticket, the need for the pragmas despite Given that 9.5.20220608 does not have better performance than 9.2.3 with the two workaround SPECIALIZE pragmas, it seems the workaround is completely effective and the only effect of the 7847 fix is making the workaround spurious (TODO: remove the workaround once we switch to GHC >= 9.6 and till then run our benchmarks periodically with HEAD to ensure we don't miss any SPECIALIZE pragmas that need adding as the library changes; [edit: and actually add a few dozen pragmas right now that it's clear GHC 9.2 won't get the fix]). There are minor issues visible in the data above and both are filed together as https://gitlab.haskell.org/ghc/ghc/-/issues/21715. BENCH1 compiled with HEAD (with the 7847 fix, or with workaround, it's all the same) allocates 18 times more than 9.2.3, copies 35 times more, but has only 5% more maximum residency and uses 9% more total memory. It runs 54% slower. (GHC 9.0.2 is close to 9.2.3, but takes somewhat less total memory still, as seen in older comments.) In BENCH2, 9.2.3 produces a binary that is is 65% slower (MUT time) than from HEAD but, based on older data above, GHC 9.0.2 produces a binary that is 18% faster than HEAD. |
These don't sound like "minor issues"! Are the precisely the issues you raise in GHC ticket 21715? There it says "54% slower whereas here you say "18 times slower". That's a big difference. I'm clearly confused |
Compared to 18 times slowdown fixed by 7847, 54% and 18% slowdown is minor. Also, I reported it here (e.g., at #14 (comment)) three months ago, so that's old news.
Yes, precisely.
I'm saying "allocates 18 times more", unless I'm missing something that I wrote. |
I've repeated the logging of GHC compilation output from #14 (comment). The difference is I used branch The very problem spotted there seems to be fixed (and
that doesn't seem to unbox dictionaries (unless I'm looking at a wrong thing) and from second invocation, the one with
so the specialization seems to work fine. When the pragma is enabled (and so performance is fine), I'm getting
instead of the
Diffs of the whole logs are unreadable due to too much randomness. |
After some more investigation, it seems !7847 really fixes the specialization, after all. See https://gitlab.haskell.org/ghc/ghc/-/merge_requests/7847#note_436933. The corollary is that the workarounds we use are probably sufficient and so we can work with older GHCs by adding workarounds and verifying with HEAD (comparing performance with and without !7847) that the workarounds suffice. After #11 is implemented, the outrageous BENCH1 performance problems without SPECIALIZE pragmas should vanish and I will try INLINEABLE pragmas as an extra or alternative workaround to try to make benchmarking reliable even for such fragile code as BENCH1, to make sure the #11 optimization results in a real performance improvement and not a measurement error due to GHC quirks. Another corollary is that other recorded performance oddities (here and in #20), especially those with BENCH2, are most probably real and not caused/obscured by missed specialization, so they can be investigated separately (though perhaps using !7847 to save on workaround pragmas and/or verify none are missing). It is possible some of them are caused by the worker/wrapper clash with |
[Edit: I've swapped the order of test result snippets and thaks to that noticed that I've just performed an experiment suggested by @simonpj: I've added
to In short, in BENCH1 this pair of pragmas is exactly exchangeable with the BENCH1sad baseline:
happy baseline:
SPECIALIZE out, INLINABLE in; the performance is probably the same as with SPECIALIZE, the differences being noise
BENCH2baseline (
INLINABLE pragmas in
|
To help with https://gitlab.haskell.org/ghc/ghc/-/issues/21715, where GHC 9.0.2 is reported to be 18% faster than HEAD in BENCH2, I tested the baseline (SPECIALIZE testMnist2 pragma is irrelevant for this benchmark); the speedup and lower allocation over HEAD is reproduced:
INLINABLE pragmas in, leading to a disaster
Edit: For completeness, just as with other GHC versions, in the presence of |
Since GHC 9.0.2 reacts badly to |
This is no longer a blocker, because we understand what's going and we've removed the code that was particularly susceptible to the performance swings. What's going on is
We've removed the code that was particularly susceptible to the performance swings in #11. The code was recursively accumulating delta expressions in the second component of dual numbers and then ignoring them, because it was only computing the value of the objective function, not a gradient. The new code consistently and repeatedly puts However, note that this code is no longer naive and exotic, so it's likely a similar potential 20% speedup comes up in the wild often enough and, regardless of speed gains, performance consistency is very valuable, because 20% may easily mask a good algorithmic decision elsewhere and lead to performance-harming decisions due to the noise in benchmarks. BTW, 9.2.3 is a further 20% faster in exactly the same benchmark, see also 21715, which reports similar results with the fragile version of the code. Anyway, let's keep this ticket open to remember to extend the workarounds to other benchmarks and maintain them as the code changes and avoid performance-wise naive code in order to prevent benchmarking noise. |
I can no longer reproduce these problems with GHC 9.6.1, though 9.4.4 is still heavily affected (most recently, these SPECIALIZE pragmas were needed horde-ad/simplified/HordeAd/Core/AstInterpret.hs Lines 321 to 426 in a361d26
Success. Closing. |
In short: despite
-fexpose-all-unfoldings -fspecialise-aggressively
, twoSPECIALIZE
pragmas are required to significantly speed up and lower allocation for the code. Confirmed with GHC 8.10.7, 9.0.2 and 9.2.1.That would imply that there is a specialization that is not performed by GHC without the pragmas. However, despite
-Wall-missed-specialisations
, no missed specialization is reported in the code withoutSPECIALIZE
(with 8.10.7 and 9.0.2;OTOH 9.2.1 spams so much that it's hard to tell; we should try 9.2.2 once it's out[edit: 9.2.2 repro done, see below; in short: less spam, but no relevant warning]).It would be great to understand if these are GHC bugs that needs to be reported and whether we may have other not applied specilizations that
-Wall-missed-specialisations
is not telling us about. Is there a cheap way to find them? Can we fix them or rule them out somehow even without finding them individually? Any there any ideas emerging from this case about how to speed up this particular part of code (which is computing the value of the objective function, ignoring derivatives; #11)?Reproducing BENCH1 with ghc-9.0.2:
This provides the baseline, in which things are apparently not specialized. Note that there is no warning about missing specialization. For me, the results are
The following only adds a commit that introduces the two SPECIALIZE pragmas. [Edit: and then two others that I added only to eliminate thunks that would obscure what's going on, No significant change of results due to their elimination.]
My results are
The real story, the one recorded on master branch, is that I originally applied a workaround that apparently specialized some things without needing SPECIALIZE. It sits halfway between the two extremes above performance-wise. Probably not important, but it may indicate where the lack of specialization is located. Commit 229a9b8 from master branch, while adding the two SPECIALIZE, removes the workaround (and the commit message fails to explain what's going on).
The text was updated successfully, but these errors were encountered: