DebugSymbolsTest.debugSymbolsQuarkus fails with latest graal/mandrel main #160

Closed

jerboaa opened this issue Jun 1, 2023 · 47 comments

jerboaa (Collaborator) commented Jun 1, 2023

Fails locally with:

java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.Net.pollConnect(Native Method)
	at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
	at java.base/sun.nio.ch.NioSocketImpl.timedFinishConnect(NioSocketImpl.java:542)
	at java.base/sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:597)
	at java.base/java.net.Socket.connect(Socket.java:633)
	at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:178)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:531)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:636)
	at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:279)
	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:384)
	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:406)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1309)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1242)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1128)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1057)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1665)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1589)
	at org.graalvm.tests.integration.utils.WebpageTester.getUrlContents(WebpageTester.java:99)
	at org.graalvm.tests.integration.DebugSymbolsTest.lambda$carryOutGDBSession$3(DebugSymbolsTest.java:395)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 237.224 s <<< FAILURE! - in org.graalvm.tests.integration.DebugSymbolsTest
[ERROR] debugSymbolsQuarkus{TestInfo}  Time elapsed: 236.932 s  <<< FAILURE!
org.opentest4j.AssertionFailedError: 
There were errors in the GDB session. Note that commands in the session might depend on each other. Errors: 
Command 'b ConfigTestController.java:33' did not match the expected pattern '.*Breakpoint 1 at .*: file com/example/quarkus/config/ConfigTestController.java, line 33.*'.
Output was:
, 
Command 'run&' did not match the expected pattern '.*Installed features:.*'.
Output was:
, 
Command 'bt' did not match the expected pattern '.*at.*com/example/quarkus/config/ConfigTestController.java:33.*'.
Output was:
, 
Command 'list' did not match the expected pattern '.*String value = config.getValue\("value", String.class\);.*'.
Output was:
, 
Command 'c&' did not match the expected pattern '.*Continuing.*'.
Output was:
 ==> expected: <true> but was: <false>
	at org.graalvm.tests.integration.DebugSymbolsTest.carryOutGDBSession(DebugSymbolsTest.java:423)
	at org.graalvm.tests.integration.DebugSymbolsTest.debugSymbolsQuarkus(DebugSymbolsTest.java:251)

[INFO] 
[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   DebugSymbolsTest.debugSymbolsQuarkus:251->carryOutGDBSession:423 There were errors in the GDB session. Note that commands in the session might depend on each other. Errors: 
Command 'b ConfigTestController.java:33' did not match the expected pattern '.*Breakpoint 1 at .*: file com/example/quarkus/config/ConfigTestController.java, line 33.*'.
Output was:
, 
Command 'run&' did not match the expected pattern '.*Installed features:.*'.
Output was:
, 
Command 'bt' did not match the expected pattern '.*at.*com/example/quarkus/config/ConfigTestController.java:33.*'.
Output was:
, 
Command 'list' did not match the expected pattern '.*String value = config.getValue\("value", String.class\);.*'.
Output was:
, 
Command 'c&' did not match the expected pattern '.*Continuing.*'.
Output was:
 ==> expected: <true> but was: <false>
[INFO] 
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0

jerboaa (Collaborator Author) commented Jun 1, 2023

First failure from ~1.5 weeks ago (May 23):
https://github.com/graalvm/mandrel/actions/runs/5054845481/jobs/9070889829

jerboaa (Collaborator Author) commented Jun 1, 2023

The test seems just to time out. So it looks like a perf problem?

When debugging locally, gdb just seems to hang when trying to set a breakpoint, but if you wait long enough it'll eventually give you your prompt back:

$ cd target/quarkus-native-image-source-jar/
$ ls
lib  quarkus-runner-build-output-stats.json  quarkus-runner.jar  sources
$ gdb ../quarkus-runner
GNU gdb (GDB) Fedora Linux 13.1-3.fc37
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../quarkus-runner...
Reading symbols from /disk/graal/upstream-sources/mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-runner.debug...
(gdb) !date
Thu Jun  1 08:30:55 PM CEST 2023
(gdb) b ConfigTestController.java:33
Breakpoint 1 at 0x40e716: file com/example/quarkus/config/ConfigTestController.java, line 33.
(gdb) !date
Thu Jun  1 08:34:16 PM CEST 2023

So it takes on the order of 3 minutes to set the breakpoint.

jerboaa (Collaborator Author) commented Jun 1, 2023

The test times out after 10 seconds, which is reasonable.

zakkak (Collaborator) commented Jun 2, 2023

So it takes on the order of 3 minutes to set the breakpoint.

Wow! That's a lot... And it didn't use to take that long in the past...

zakkak (Collaborator) commented Jun 2, 2023

The issue was introduced by quarkusio/quarkus#33203, i.e. the addition of -H:+DebugCodeInfoUseSourceMappings to Quarkus builds...

Building quarkus-full-microprofile with:

mvn package -Pnative -Dquarkus.native.debug.enabled=true -Dquarkus.version=999-SNAPSHOT -Dquarkus.native.additional-build-args=-H:Log=registerResource:,--trace-object-instantiation=java.util.Random,--initialize-at-run-time=io.vertx.ext.auth.impl.jose.JWT,-H:-DebugCodeInfoUseSourceMappings

instead of the current:

mvn package -Pnative -Dquarkus.native.debug.enabled=true -Dquarkus.version=999-SNAPSHOT -Dquarkus.native.additional-build-args=-H:Log=registerResource:,--trace-object-instantiation=java.util.Random,--initialize-at-run-time=io.vertx.ext.auth.impl.jose.JWT

Results in the breakpoint being set in seconds.

zakkak (Collaborator) commented Jun 2, 2023

Now the question is whether this still makes sense to keep as the default. Not sure what's a better default, waiting a few minutes for gdb to load the symbols and get highly accurate mappings or getting slightly worse accuracy but only waiting a few seconds? Probably worth discussing with the Quarkus devs.

Karm (Owner) commented Jun 2, 2023

I don't recall experiencing this 😱

jerboaa (Collaborator Author) commented Jun 2, 2023

Now the question is whether this still makes sense to keep as the default. Not sure what's a better default, waiting a few minutes for gdb to load the symbols and get highly accurate mappings or getting slightly worse accuracy but only waiting a few seconds? Probably worth discussing with the Quarkus devs.

Hmm, with -H:-DebugCodeInfoUseSourceMappings it still takes about 15 seconds for me, which seems bad enough. Also, debuginfo shrinks to ~50MB (vs ~200MB+ with the source mappings).

zakkak (Collaborator) commented Jun 2, 2023

It always used to take some time but I don't have exact numbers... I noticed (while looking for the cause of this) that oracle/graal#6419 seems to result in an increase in gdb start-up times.

adinn commented Jun 2, 2023

This is down to two factors: a lot of extra debug info records for local var data and the recent change to merge all debug info into one big compile unit. So gdb start-up requires a big slurp of all the debug info at once. We need to think about maybe switching off source mappings but that's not so good for the eventual debug experience.

jerboaa (Collaborator Author) commented Jun 2, 2023

As to the numbers, here are some (there seems to be an order-of-magnitude difference). Time it takes to set b ConfigTestController.java:33, in seconds, for this Quarkus microprofile app:

-H:-DebugCodeInfoUseSourceMappings: 14.603474617004395
-H:+DebugCodeInfoUseSourceMappings: 150.33116626739502

Details -H:-DebugCodeInfoUseSourceMappings:

$ gdb -ex 'set breakpoint pending on' -ex 'set debuginfod enabled off' -ex "set \$starttime=$(date +%s.%N)" -ex 'b ConfigTestController.java:33' -ex 'shell echo set \$endtime=$(date +%s.%N) > ~/gdbtmp.txt' -ex 'source ~/gdbtmp.txt' -ex 'print $endtime-$starttime'  quarkus-runner
GNU gdb (GDB) Fedora Linux 13.1-3.fc37
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from quarkus-runner...
Reading symbols from /disk/graal/upstream-sources/mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-native-image-source-jar/quarkus-runner.debug...
Breakpoint 1 at 0x40e73a: file com/example/quarkus/config/ConfigTestController.java, line 33.
$1 = 14.603474617004395

Details -H:+DebugCodeInfoUseSourceMappings:

$ gdb -ex 'set breakpoint pending on' -ex 'set debuginfod enabled off' -ex "set \$starttime=$(date +%s.%N)" -ex 'b ConfigTestController.java:33' -ex 'shell echo set \$endtime=$(date +%s.%N) > ~/gdbtmp.txt' -ex 'source ~/gdbtmp.txt' -ex 'print $endtime-$starttime'  quarkus-runner
GNU gdb (GDB) Fedora Linux 13.1-3.fc37
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from quarkus-runner...
Reading symbols from /disk/graal/upstream-sources/mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-native-image-source-jar/quarkus-runner.debug...
Breakpoint 1 at 0x40e716: file com/example/quarkus/config/ConfigTestController.java, line 33.
$1 = 150.33116626739502

adinn commented Jun 2, 2023

Do you have the size of the debuginfo files for those two cases?

Also, if you dump the info section as text, what are the relative line counts? (objdump --dwarf=info | wc -l). That gives a rough measure of how many DIE records there are to be processed in each case.
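
For concreteness, a sketch of both measurements, assuming the image is quarkus-runner with its split debug info sitting next to it in quarkus-runner.debug (point objdump at the .debug file instead if the DWARF sections live there):

$ ls -l quarkus-runner quarkus-runner.debug
$ objdump --dwarf=info quarkus-runner | wc -l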

jerboaa (Collaborator Author) commented Jun 2, 2023

Do you have the size of the debuginfo files for those two cases?

According to the native-image build output:

-H:-DebugCodeInfoUseSourceMappings: 51.65MB
-H:+DebugCodeInfoUseSourceMappings: 224.88MB

Also, if you dump the info section as text what are the relative line counts? (objdump --dwarf=info | wc -l). That gives a rough measure of how many DIE records there are to be processed in each case.

-H:-DebugCodeInfoUseSourceMappings: 5486358
-H:+DebugCodeInfoUseSourceMappings: 51195520

That also seems to be an order-of-magnitude difference.

jerboaa (Collaborator Author) commented Jun 2, 2023

For reference, when using a 22.3.2.0-Final Mandrel build, setting the breakpoint takes ~1 second. Edit: debuginfo size: 51.41MB; objdump --dwarf=info quarkus-runner | wc -l => 3247557

$ gdb -ex 'set breakpoint pending on' -ex 'set debuginfod enabled off' -ex "set \$starttime=$(date +%s.%N)" -ex 'b ConfigTestController.java:33' -ex 'shell echo set \$endtime=$(date +%s.%N) > ~/gdbtmp.txt' -ex 'source ~/gdbtmp.txt' -ex 'print $endtime-$starttime'  quarkus-runner
GNU gdb (GDB) Fedora Linux 13.1-3.fc37
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from quarkus-runner...
Breakpoint 1 at 0x40dda6: file com/example/quarkus/config/ConfigTestController.java, line 33.
$1 = 1.0526514053344727

adinn commented Jun 2, 2023

Yeah, the on-disk size disparity is bad but the difference in the DIE (info record) count is the real measure of how much more work needs to be done.

There may also be some supra-linearity in the growth even when measured in DIEs because local var info needs to be organized into a tree structure and cross-referenced to the type info for the var type.

Still, that is an order of magnitude (actually more) of extra information about what data the program is handling, which makes all the difference when you are trying to debug optimized code. I'd hate to not have that info when a real production bug turns up.

adinn commented Jun 2, 2023

For reference when using a 22.3.2.0-Final mandrel build, setting the break point takes ~1 second.

Just to be sure: are you certain that is built to include all local var info?

I'm not implying otherwise, nor saying that this result is particularly surprising. This result could well just be down to collapsing all that info into one CU. The extra var info may multiply a CU's size by 10, but with 22.3 there will be at least 1000 per-class CUs in even a small program. So, even if you had to load 10 CUs in order to resolve the main class, that would reduce the loaded record count by 100.

One other thing I might be able to do to speed up the reading is to add sibling links to some of the top-level DIEs that have a lot of children, in particular the method definition DIEs that include a tree of var info DIEs as children. That would allow gdb to do a quick traversal of the top-level DIEs when looking for a given method without having to slurp the whole tree. I'll experiment with that to see if it helps.

jerboaa (Collaborator Author) commented Jun 2, 2023

For reference when using a 22.3.2.0-Final mandrel build, setting the break point takes ~1 second.

Just to be sure: are you certain that is built to include all local var info?

No, I'm not sure at all. How can I make sure? Is that in 22.3? I don't know the state of debuginfo between 22.3.2.0 and revision 041a2193d893b312dae4e14f1112d930615679ec from oracle/graal master branch 23.0/23.1-ish (which I've tested). So it'll depend on which features are in 22.3.2 as compared to the later versions.

adinn commented Jun 5, 2023

@jerboaa I just checked: 22.3 is like 23.0, i.e. it relies on you setting SourceLevelDebug to enable generation of full infopoints (local var info is derived from Infopoint and SourceMapping instances). So, if you set that for the 22.3 build then this is an apples-to-apples comparison.

jerboaa (Collaborator Author) commented Jun 5, 2023

@adinn Hmm, we don't seem to set that option anywhere. -H:+SourceLevelDebug that is. Previous runs of 23+ didn't have it set either. Example 23.1.0-dev invocation of native-image:

$ native-image -J-Dlogging.initial-configurator.min-level=500 -J-Djava.util.logging.manager=org.jboss.logmanager.LogManager -J-Dsun.nio.ch.maxUpdateArraySize=100 -J-Dio.netty.leakDetection.level=DISABLED -J-Dio.netty.allocator.maxOrder=3 -J-Dvertx.logger-delegate-factory-class-name=io.quarkus.vertx.core.runtime.VertxLogDelegateFactory -J-Dvertx.disableDnsResolver=true -J-Duser.language=en -J-Duser.country=US -J-Dfile.encoding=UTF-8 --features=io.quarkus.runner.Feature,io.quarkus.runtime.graal.DisableLoggingFeature -H:+TrackNodeSourcePosition -H:+DebugCodeInfoUseSourceMappings -J--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED -J--add-opens=java.base/java.text=ALL-UNNAMED -J--add-opens=java.base/java.io=ALL-UNNAMED -J--add-opens=java.base/java.lang.invoke=ALL-UNNAMED -J--add-opens=java.base/java.util=ALL-UNNAMED -H:BuildOutputJSONFile=quarkus-runner-build-output-stats.json -H:Log=registerResource: --initialize-at-run-time=io.vertx.ext.auth.impl.jose.JWT -H:+AllowFoldMethods -J-Djava.awt.headless=true --no-fallback --link-at-build-time -H:+ReportExceptionStackTraces -H:GenerateDebugInfo=1 -H:-AddAllCharsets --enable-url-protocols=http,https -H:NativeLinkerOption=-no-pie -H:-UseServiceLoaderFeature -H:+StackTrace -J--add-exports=org.graalvm.sdk/org.graalvm.nativeimage.impl=ALL-UNNAMED --exclude-config io\.netty\.netty-codec /META-INF/native-image/io\.netty/netty-codec/generated/handlers/reflect-config\.json --exclude-config io\.netty\.netty-handler /META-INF/native-image/io\.netty/netty-handler/generated/handlers/reflect-config\.json quarkus-runner -jar quarkus-runner.jar

adinn commented Jun 5, 2023

@jerboaa Well, a comparison where -H:+SourceLevelDebug differs between the two cases is not really fair. At the very least it needs to be set in both or omitted in both.

Option -H:+SourceLevelDebug is relatively new (I'm not sure it existed in 22.1). It determines whether or not a method compile is done using full (+) or partial (-) infopoint generation and setting it is critical if you want local var info, both for the amount that is generated and its reliability.

The +SourceLevelDebug setting operates in conjunction with a method filter SourceLevelDebugFilter that restricts the + setting to apply only to methods matching a list of patterns (which defaults to all methods).

I think -H:+DebugCodeInfoUseSourceMappings has been around a little longer but I cannot remember for sure. It is used to decide whether or not to derive local var info from SourceMapping objects -- another source of position info that is provided by the compiler. There are fewer of these and they are less reliable than Infopoint objects.

However, that is not to suggest this is the full story. Other flags include: -H:GenerateDebugInfo=1, -H:-DeleteLocalSymbols and -H:+TrackNodeSourcePosition.

The first one must be supplied for debug info to work at all.

The second one is needed in order for perf and valgrind to be able to connect mangled method names to code addresses (they need linker symbols as well as debug info to do so).

The third one ensures that file and line number info is attached to nodes during compiler graph building and retained as far as possible during transformation. Omitting it can result in less, or less accurate, local var info.
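
Putting those flags together, a full debug info build would be invoked along these lines (a sketch only; app.jar and the rest of the command line are illustrative and application-specific arguments are omitted):

$ native-image -H:GenerateDebugInfo=1 -H:+SourceLevelDebug \
    -H:+DebugCodeInfoUseSourceMappings -H:+TrackNodeSourcePosition \
    -H:-DeleteLocalSymbols -jar app.jar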

jerboaa (Collaborator Author) commented Jun 5, 2023

At the very least it needs to be set in both or omitted in both.

OK. FWIW, the above runs were "omitted in both" comparisons as the default is false.

I think -H:+DebugCodeInfoUseSourceMappings has been around a little longer but I cannot remember for sure. It is used to decide whether or not to derive local var info from SourceMapping objects -- another source of position info that is provided by the compiler. There are fewer of these and they are less reliable than Infopoint objects.

It's the other way around. 22.3 doesn't have DebugCodeInfoUseSourceMappings, while 22.3 and 23+ both have SourceLevelDebug:

$ /disk/graal/builds/mandrel-java17-22.3.2.0-Final/bin/native-image -J-Dlogging.initial-configurator.min-level=500 -J-Djava.util.logging.manager=org.jboss.logmanager.LogManager -J-Dsun.nio.ch.maxUpdateArraySize=100 -J-Dio.netty.leakDetection.level=DISABLED -J-Dio.netty.allocator.maxOrder=3 -J-Dvertx.logger-delegate-factory-class-name=io.quarkus.vertx.core.runtime.VertxLogDelegateFactory -J-Dvertx.disableDnsResolver=true -J-Duser.language=en -J-Duser.country=US -J-Dfile.encoding=UTF-8 --features=io.quarkus.runner.Feature,io.quarkus.runtime.graal.DisableLoggingFeature -H:+TrackNodeSourcePosition -H:+DebugCodeInfoUseSourceMappings -J--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED -J--add-opens=java.base/java.text=ALL-UNNAMED -J--add-opens=java.base/java.io=ALL-UNNAMED -J--add-opens=java.base/java.lang.invoke=ALL-UNNAMED -J--add-opens=java.base/java.util=ALL-UNNAMED -H:BuildOutputJSONFile=quarkus-runner-build-output-stats.json -H:Log=registerResource: --initialize-at-run-time=io.vertx.ext.auth.impl.jose.JWT -H:+AllowFoldMethods -J-Djava.awt.headless=true --no-fallback --link-at-build-time -H:+ReportExceptionStackTraces -H:GenerateDebugInfo=1 -H:-AddAllCharsets --enable-url-protocols=http,https -H:NativeLinkerOption=-no-pie -H:-UseServiceLoaderFeature -H:+StackTrace -J--add-exports=org.graalvm.sdk/org.graalvm.nativeimage.impl=ALL-UNNAMED --exclude-config io\.netty\.netty-codec /META-INF/native-image/io\.netty/netty-codec/generated/handlers/reflect-config\.json --exclude-config io\.netty\.netty-handler /META-INF/native-image/io\.netty/netty-handler/generated/handlers/reflect-config\.json quarkus-runner -jar quarkus-runner.jar
Error: Could not find option 'DebugCodeInfoUseSourceMappings'. Use -H:PrintFlags= to list all available options.
Error: Image build request failed with exit status 1

jerboaa (Collaborator Author) commented Jun 5, 2023

Another run with -H:+SourceLevelDebug and DebugCodeInfoUseSourceMappings omitted:

GraalVM version   Time to set breakpoint   objdump --dwarf=info | wc -l
22.3.2.0-Final    1.1244122982025146       3447022
23.x (master)     37.615013122558594       8257770

jerboaa (Collaborator Author) commented Jun 6, 2023

Worst case seems to be -H:+SourceLevelDebug and -H:+DebugCodeInfoUseSourceMappings:

GraalVM version   Time to set breakpoint                objdump --dwarf=info | wc -l   Debuginfo size
23.x (master)     193.565 (i.e. 3 minutes 13 seconds)   54091355                       278.32MB

jerboaa (Collaborator Author) commented Jun 6, 2023

We've increased the GDB timeout for the quarkus debuginfo test to 5 minutes (#163) as an interim measure, so as not to overlook other issues in CI in the meantime.

zakkak (Collaborator) commented Jun 7, 2023

It looks like 5 minutes is not enough for GitHub Actions; we will need to increase it further:

Error:  debugSymbolsQuarkus{TestInfo}  Time elapsed: 474.501 s  <<< FAILURE!

adinn commented Jun 7, 2023

I'd really like to look at an example program that manifests this behaviour, then see what the debug info looks like and what gdb is doing with it. Most important of all, I'd like to be able to build the native image myself using whatever modified version of the debug info generator, native-image command line args and gdb command line args I need to test out what is going on (in particular I will likely need to run gdb itself under gdb). I cannot do that as things currently stand because the test setup does not allow me to drive the build or the gdb run. Is there a simpler setup I can use?

zakkak (Collaborator) commented Jun 7, 2023

I don't know about "simpler" but you can try the following:

git clone git@github.com:Karm/mandrel-integration-tests.git
cd mandrel-integration-tests/apps/quarkus-full-microprofile
git apply quarkus_3.x.patch
export GRAALVM_HOME=/path/to/graalvm
mvn package -Pnative \
  -Dquarkus.version=3.0.1.Final \
  -Dquarkus.native.container-build=false \
  -Dquarkus.native.additional-build-args=-H:Log=registerResource:,--trace-object-instantiation=java.util.Random,--initialize-at-run-time=io.vertx.ext.auth.impl.jose.JWT
# at this point you can add as many additional-build-args as you need/want
gdb target/quarkus-runner
(gdb) b ConfigTestController.java:33

If you still want to run native-image manually, you can copy-paste the command printed by Maven and adjust it to your needs. You will still need to run through Maven once to get the application jar built.

HTH

adinn commented Jun 7, 2023

Thanks Foivos. There is one point that is still unclear in your explanation. You say:

export GRAALVM_HOME=/path/to/graalvm

What needs to be at the end of that path? (I need it to be something I have built from my dev tree).

zakkak (Collaborator) commented Jun 7, 2023

You can either use the output of mx graalvm-home (where you need to use exactly the same flags that you pass to mx build) or just use /path/to/graal-repo/sdk/latest_graalvm_home. With the latter, GRAALVM_HOME will always point to your latest build of graalvm (extremely useful when bisecting or trying out things).
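
For example (a sketch; the checkout path is illustrative):

$ cd /path/to/graal-repo/substratevm
$ mx build
$ export GRAALVM_HOME=$(mx graalvm-home)   # pass the same mx flags here as to mx build
# or use the symlink that always tracks the latest build:
$ export GRAALVM_HOME=/path/to/graal-repo/sdk/latest_graalvm_home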

jerboaa (Collaborator Author) commented Jun 7, 2023

For the manual native-image invocation, do it from the mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-native-image-source-jar folder. Quarkus moves the native-image build inputs into that folder under target.
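
That is, something like:

$ cd mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-native-image-source-jar
$ ls
lib  quarkus-runner-build-output-stats.json  quarkus-runner.jar  sources
# now re-run the native-image command exactly as printed in the maven build log,
# adjusting the -H: flags as needed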

adinn commented Jun 8, 2023

Thanks. I am now generating full debug info and full log output for the generator (after I worked out how to escape the commas in the comma-separated list of options to the -H:Log=... setting that I needed to include in property quarkus.native.additional-build-args).

adinn commented Jun 8, 2023

I am not seeing any significant delay in the debug image I just built.

Here is the build step:

[INFO] [io.quarkus.deployment.pkg.steps.NativeImageBuildRunner] /home/adinn/redhat/openjdk/graal/graal/sdk/latest_graalvm_home/bin/native-image -J-Djava.util.logging.manager=org.jboss.logmanager.LogManager -J-Dsun.nio.ch.maxUpdateArraySize=100 -J-Dlogging.initial-configurator.min-level=500 -J-Dvertx.logger-delegate-factory-class-name=io.quarkus.vertx.core.runtime.VertxLogDelegateFactory -J-Dvertx.disableDnsResolver=true -J-Dio.netty.leakDetection.level=DISABLED -J-Dio.netty.allocator.maxOrder=3 -J-Duser.language=en -J-Duser.country=GB -J-Dfile.encoding=UTF-8 --features=io.quarkus.runner.Feature,io.quarkus.runtime.graal.DisableLoggingFeature -J--add-exports=java.security.jgss/sun.security.krb5=ALL-UNNAMED -J--add-opens=java.base/java.text=ALL-UNNAMED -J--add-opens=java.base/java.io=ALL-UNNAMED -J--add-opens=java.base/java.lang.invoke=ALL-UNNAMED -J--add-opens=java.base/java.util=ALL-UNNAMED -H:BuildOutputJSONFile=quarkus-runner-build-output-stats.json -H:GenerateDebugInfo=1 -H:+SourceLevelDebug -H:-DeleteLocalSymbols -H:+TrackNodeSourcePosition -H:Log=DebugTypeInfo:4,DebugCodeInfo:4,dwarf:4 -Dgraal.LogFile=out.log --trace-object-instantiation=java.util.Random --initialize-at-run-time=io.vertx.ext.auth.impl.jose.JWT -H:+AllowFoldMethods -J-Djava.awt.headless=true --no-fallback --link-at-build-time -H:+ReportExceptionStackTraces -H:-AddAllCharsets --enable-url-protocols=http,https -H:NativeLinkerOption=-no-pie -H:-UseServiceLoaderFeature -H:+StackTrace -J--add-exports=org.graalvm.sdk/org.graalvm.nativeimage.impl=ALL-UNNAMED --exclude-config io\.netty\.netty-codec /META-INF/native-image/io\.netty/netty-codec/generated/handlers/reflect-config\.json --exclude-config io\.netty\.netty-handler /META-INF/native-image/io\.netty/netty-handler/generated/handlers/reflect-config\.json quarkus-runner -jar quarkus-runner.jar
===========================================================================================
GraalVM Native Image: Generating 'quarkus-runner' (executable)...
===========================================================================================
[1/8] Initializing...                                                      (11.1s @ 0.24GB)
 Java version: 20+34, vendor: GraalVM Community
 Graal compiler: optimization level: '2', target machine: 'x86-64-v3'
 C compiler: gcc (redhat, x86_64, 12.3.1)
 Garbage collector: Serial GC (max heap size: 80% of RAM)
 4 user-specific feature(s)
 - com.oracle.svm.thirdparty.gson.GsonFeature
 - io.quarkus.runner.Feature: Auto-generated class by Quarkus from the existing extensions
 - io.quarkus.runtime.graal.DisableLoggingFeature: Disables INFO logging during the analysis phase
 - org.eclipse.angus.activation.nativeimage.AngusActivationFeature
09:01:23,090 INFO  [org.jbo.res.res.i18n] RESTEASY002225: Deploying jakarta.ws.rs.core.Application: class com.example.quarkus.QuarkusRestApplication
[2/8] Performing analysis...  [*****]                                      (85.2s @ 2.72GB)
  16,773 (89.48%) of 18,745 types reachable
  25,653 (63.03%) of 40,701 fields reachable
  83,195 (57.97%) of 143,512 methods reachable
   5,045 types,   272 fields, and 3,263 methods registered for reflection
      61 types,    59 fields, and    55 methods registered for JNI access
       4 native libraries: dl, pthread, rt, z
[3/8] Building universe...                                                 (19.0s @ 4.65GB)
[4/8] Parsing methods...      [***]                                        (10.9s @ 3.08GB)
[5/8] Inlining methods...     [***]                                         (8.5s @ 1.98GB)
[6/8] Compiling methods...    [**********]                                (101.3s @ 3.08GB)
[7/8] Layouting methods...    [***]                                         (6.6s @ 4.49GB)
[8/8] Creating image...       [******************]                        (320.7s @ 6.06GB)
  37.00MB (24.62%) for code area:    53,321 compilation units
  34.18MB (22.74%) for image heap:  397,039 objects and 86 resources
  71.86MB (47.81%) for debug info generated in 226.6s
   7.25MB ( 4.82%) for other data
 150.29MB in total
-------------------------------------------------------------------------------------------
Top 10 origins of code area:                 Top 10 object types in image heap:
  13.81MB java.base                             8.13MB byte[] for code metadata
   3.90MB java.xml                              4.02MB java.lang.Class
   1.95MB c.f.j.c.j.14.2.jar                    3.72MB java.lang.String
   1.39MB svm.jar (Native Image)                3.18MB byte[] for java.lang.String
   1.39MB quarkus-runner.jar                    3.08MB byte[] for general heap data
 952.97KB m.vertx.vertx-core-4.4.1.jar          1.41MB c.o.s.core.hub.DynamicHubCompanion
 902.35KB o.j.r.r.2.1.Final.jar                 1.04MB byte[] for embedded resources
 650.38KB o.a.h.httpclient-4.5.14.jar         892.08KB byte[] for reflection metadata
 534.20KB c.f.j.core.jackson-core-2.14.2.jar  762.38KB java.lang.String[]
 483.07KB c.s.okhttp3.okhttp-3.14.9.jar       596.52KB c.o.s.c.h.DynamicHub$~tionMetadata
  10.77MB for 140 more packages                 7.00MB for 4095 more object types
-------------------------------------------------------------------------------------------
Recommendations:
 HEAP: Set max heap for improved and more predictable memory usage.
 CPU:  Enable more CPU features with '-march=native' for improved performance.
-------------------------------------------------------------------------------------------
         27.6s (4.9% of total time) in 119 GCs | Peak RSS: 8.10GB | CPU load: 3.41
-------------------------------------------------------------------------------------------
Produced artifacts:
 /home/adinn/redhat/openjdk/graal/mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-native-image-source-jar/quarkus-runner (executable)
 /home/adinn/redhat/openjdk/graal/mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-native-image-source-jar/quarkus-runner-build-output-stats.json (build_info)
 /home/adinn/redhat/openjdk/graal/mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-native-image-source-jar/quarkus-runner.debug (debug_info)
 /home/adinn/redhat/openjdk/graal/mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-native-image-source-jar/sources (debug_info)

Note the debug info-specific arguments I provided:

. . . -H:GenerateDebugInfo=1 -H:+SourceLevelDebug -H:-DeleteLocalSymbols -H:+TrackNodeSourcePosition -H:Log=DebugTypeInfo:4,DebugCodeInfo:4,dwarf:4 -Dgraal.LogFile=out.log . . .

and the reported debug info generation stats:

  . . .
  71.86MB (47.81%) for debug info generated in 226.6s
  . . .

The generated image and debug info files are as follows:

[adinn@sputstik quarkus-native-image-source-jar]$ ls -l ../quarkus-runner ../quarkus-runner.debug 
-rwxr-xr-x. 1 adinn adinn 82025760 Jun  8 09:10 ../quarkus-runner
-rwxr-xr-x. 1 adinn adinn 82787808 Jun  8 09:10 ../quarkus-runner.debug

i.e. roughly 80MB of image and likewise of debug info.

The output log shows full local var info being added, e.g.:

    . . .
    R[0..1295], span 1296, method Encoder.<init>(WritableByteChannel, Encoder$Parameters, int) at Encoder.java:38 locals: li(this=stack:32|QWORD[.]), li(destination=stack:40|QWORD[.]), li(params=stack:48|QWORD[.]), li(inputBufferSize=stack:12|DWORD)
        -[42..42], span 1, bci 0, method Encoder.<init>(WritableByteChannel, Encoder$Parameters, int) at Encoder.java:38 locals: li(this=rdi|QWORD[.]), li(destination=rsi|QWORD[.]), li(params=rdx|QWORD[.]), li(inputBufferSize=rcx|DWORD)
        -[43..43], span 1, bci 0, method Encoder.<init>(WritableByteChannel, Encoder$Parameters, int) at Encoder.java:38 locals: li(this=rdi|QWORD[.]), li(destination=rsi|QWORD[.]), li(params=rdx|QWORD[.]), li(inputBufferSize=rcx|DWORD)
        -[44..44], span 1, bci 4, method Encoder.<init>(WritableByteChannel, Encoder$Parameters, int) at Encoder.java:39 locals: li(this=rdi|QWORD[.]), li(destination=rsi|QWORD[.]), li(params=rdx|QWORD[.]), li(inputBufferSize=rcx|DWORD)
        -[45..53], span 9, bci 5, method Encoder.<init>(WritableByteChannel, Encoder$Parameters, int) at Encoder.java:39 locals: 
        -[54..54], span 1, bci 18, method Encoder.<init>(WritableByteChannel, Encoder$Parameters, int) at Encoder.java:43 locals: li(this=rdi|QWORD[.]), li(destination=rsi|QWORD[.]), li(params=rdx|QWORD[.]), li(inputBufferSize=rcx|DWORD)
        -[55..55], span 1, bci 32, method Encoder.<init>(WritableByteChannel, Encoder$Parameters, int) at Encoder.java:46 locals: li(this=rdi|QWORD[.]), li(destination=rsi|QWORD[.]), li(params=rdx|QWORD[.]), li(inputBufferSize=rcx|DWORD)
    . . .

It also shows a very large number of types, methods and fields being registered as present in the debug info.

[adinn@sputstik quarkus-native-image-source-jar]$ egrep 'Register .* type .*' out.log | wc -l
18746
[adinn@sputstik quarkus-native-image-source-jar]$ egrep 'typename .* adding .* method .*' out.log | wc -l
143573
[adinn@sputstik quarkus-native-image-source-jar]$ egrep 'typename .* adding .* field .*' out.log | wc -l
38821
[adinn@sputstik quarkus-native-image-source-jar]$ egrep 'typename .* adding .* field .*' out.log | grep -v 'at offset 0xfffffff6' | wc -l
26646

n.b. the last count omits fields that are declared but not defined

I ran this up in gdb as per Severin's example above and got these timings:

[adinn@sputstik quarkus-native-image-source-jar]$ gdb ../quarkus-runner
GNU gdb (GDB) Fedora Linux 13.1-3.fc37
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../quarkus-runner...
Reading symbols from /home/adinn/redhat/openjdk/graal/mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-runner.debug...
(gdb) set debuginfod enabled on
(gdb) !date
Thu  8 Jun 12:52:28 BST 2023
(gdb) b ConfigTestController.java:33
Downloading source file /home/adinn/redhat/openjdk/graal/mandrel-integration-tests/apps/quarkus-full-microprofile/target/quarkus-native-image-source-jar/sources/com/example/quarkus/config/ConfigTestController.java
Breakpoint 1 at 0x40eb68: file com/example/quarkus/config/ConfigTestController.java, line 33.
(gdb) !date
Thu  8 Jun 12:52:37 BST 2023
(gdb) q

Allowing for the time I took to type the commands, the actual startup was a matter of a few (2-3?) seconds.

n.b. This is fedora 37 on bare metal:

[adinn@sputstik quarkus-native-image-source-jar]$ uname -a
Linux sputstik 6.3.5-100.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 30 15:43:51 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

So, I am confused. Anyone have any idea what is going on?

n.b. When I step through the program in gdb under emacs, source files are being located. However, the sources directory appears to be missing a fair number of source files that correspond to jars downloaded by Maven. I am wondering if that might be an issue. Is there a simple way to ensure all source jars for the project (including dependency sources) are downloaded alongside the class jars?

zakkak (Collaborator) commented Jun 8, 2023

Note the debug info-specific arguments I provided

@adinn I think you are missing -H:+DebugCodeInfoUseSourceMappings, see #160 (comment)

adinn commented Jun 9, 2023

Yes, but that's not the real problem (a slowdown was observed even without that flag being set). The real problem was that I was using a substrate build (of native-image) that was built from an older branch to generate the quarkus app (doh!).

I rebuilt substrate from head and the resulting quarkus app image reproduced the problem. I'm debugging gdb now to see where all the time is going.

adinn commented Jun 9, 2023

Ok, so the generated image and debug info with all flags set (including -H:+DebugCodeInfoUseSourceMappings) is:

[adinn@sputstik quarkus-full-microprofile]$ ls -l target/quarkus-runner*
-rwxr-xr-x. 1 adinn adinn  81073784 Jun  9 09:57 target/quarkus-runner
-rwxr-xr-x. 1 adinn adinn 308679552 Jun  9 09:57 target/quarkus-runner.debug

When I ran up gdb the VMEM went up to 6.9GB and RSS to 5.8GB. The CPU was at about 12%, so it looks like it is spending an enormous amount of time and space reading and modelling the 300MB of DWARF info.

adinn commented Jun 9, 2023

I have identified that the delay that has been introduced by the move to one CU occurs both when reading file+line info and when reading type/code info. So, that means it is not all to do with the inclusion of local vars.

I also believe I have pinned down the major cause of the cost difference as far as line info is concerned. I'm not sure it accounts for all the extra time taken but it is certainly a large part of it.

My current native image for the quarkus app includes 18626 classes, 1131 source file dir paths and 10122 source file names. That's a lot more than the 4382 in a simple hello world app.

In the old model there was a dir and line number table for each CU, i.e. for each class. So the file index for an owning class C included an entry for C itself plus an entry for every class C' which has a method C'.m' that gets inlined into a top-level compiled method C.m. For many classes the file count was zero or 1 (a lot of classes have no top-level compiled methods). For a lot of classes it was small (1-10). For a few classes it might rise up to maybe 100. So, gdb creates a name table array for each class with up to maybe 100 entries and populates a list of file info records which grows to at most around 100 entries.

In the new model all dir paths and files are listed in the one dir path and file index belonging to the line number table for the one shared CU. So, gdb populates its name table with an array of 10122 strings and then builds a list of file info records which grows up to 10122 entries.

The problem happens when gdb begins processing the line number table data. This encodes line info for all code addresses in compiled code in a state machine representation. The important thing to bear in mind is that the amount of state machine data is unchanged between old and new model.

The encoding works through code addresses from the start of the first top-level compiled method to the end of the last top-level compiled method. Previously it was split into multiple sections by class/CU each with its own small dir path and file index and gdb could read each section one at a time. It now occurs in one big lump with one of each type of index and gdb has to read it all in one go.

The slowdown is not simply (or even primarily) having to read the data in one big slurp. The biggest problem is the algorithm that reads an individual line table. Every time the address info progresses into or out of a new top-level compiled method or inlined method, the current file index is updated and gdb has to locate the line info record for the associated file. It translates the index to the corresponding string name and then linearly searches its current list of line info records for one whose name string compares equal. So, this algorithm is O(N^2) in the new model, where N = the total file count. In the old model it is essentially O(N), i.e. even if you have to read all the line info in the old model you execute instructions roughly proportional to 10000 * K for K probably somewhere well below 10, whereas with the new model you get something proportional to 10000 * 10000.

I'll have a think about what to do about this but it is not obvious to me that there is a solution that does not involve re-introducing multiple CUs because there is a 1-1 map between CUs and line number tables.

Meantime I will also see if I can work out why the reading the code/type info is also taking longer.

jerboaa (Collaborator Author) commented Jun 12, 2023

@adinn Thanks for the analysis. Please use this gdb snippet for the timing stats though (over manual date runs). That should be more accurate:

$ gdb -ex 'set breakpoint pending on' -ex 'set debuginfod enabled off' -ex "set \$starttime=$(date +%s.%N)" -ex 'b ConfigTestController.java:33' -ex 'shell echo set \$endtime=$(date +%s.%N) > ~/gdbtmp.txt' -ex 'source ~/gdbtmp.txt' -ex 'print $endtime-$starttime'  quarkus-runner

It should print the time in seconds (including fractionals) it takes to set the breakpoint.

adinn commented Jun 12, 2023

Resulting output is:

Breakpoint 1 at 0x40eb68: file com/example/quarkus/config/ConfigTestController.java, line 33.
$1 = 213.95615172386169

jerboaa (Collaborator Author) commented Jun 12, 2023

Thanks. That roughly matches my measurements in #160 (comment)

adinn commented Jun 13, 2023

I looked into this further to get some feel for how costly the whole process is, not just processing the line data but also processing info section entries, and where the balance lies between these and other operations.

After parsing the Java CU for the info section and then processing the line number table for that CU located in the line section, gdb goes on to process the full tree of records (DIEs) that the CU contains. This includes recursively indexing, for every compiled method DIE, the tree of inlined subroutine sub-records. That latter operation appears to be by far the largest part of the remaining processing work. That's because this tree lists, for every code address in the method at which call info is present, the inlined or top-level method the code originates from, plus an associated file and line number. Because of the depth of Java inlining these inline subrecord trees are often very large.

The initial parsing seems to be very quick. The subsequent two steps, line info processing then DIE processing, are where almost all the time is spent processing debug info (something like 99%). I measured the times when running with my (debug) build of gdb and it was roughly 240 secs processing line info and 100 secs processing DIEs. The absolute times and the ratio might well turn out to be different when debugging with a non-debug build of gdb but the ratio does at least give a rough indication of the proportion of time spent on each task.

Also, during processing, the scratch data used by gdb to track the info it was building up from these line and info sections increased the VMEM to 6GB and the resident memory (RSS) to about 2.5GB. Most of this memory was freed at the end of DIE processing dropping back to about 2.0 GB and 0.9 GB.

The difficulty here is that the extra line processing time grows in part because of having to check each file against the list of all files, which is determined by the number of files actually referenced in the line section. However, it also grows because there are more file checks when there are more inlined method traces. The DIE processing ought to grow linearly with the number of inlined method traces, although the depth of tree/node count for the tree will imply more work.

In an attempt to isolate these two effects I tweaked the full debug options to always include the settings -H:GenerateDebugInfo=1, -H:+SourceLevelDebug and -H:-DeleteLocalSymbols, but vary the following flags:

Config   Options
1        -H:+DebugCodeInfoUseSourceMappings, -H:+TrackNodeSourcePosition
2        -H:+DebugCodeInfoUseSourceMappings, -H:-TrackNodeSourcePosition
3        -H:-DebugCodeInfoUseSourceMappings, -H:+TrackNodeSourcePosition
4        -H:-DebugCodeInfoUseSourceMappings, -H:-TrackNodeSourcePosition

The idea was to successively reduce the inline trace count. However, it turns out that configs 2, 3 and 4 performed identically (modulo a tiny amount of noise).

The following tables show some key stats for the first 2 configs:

Config   image size (MB)   debug info size (MB)
1        78                295
2        78                100

If both UseSourceMappings and TrackNodeSourcePosition are enabled the debug info size is nearly tripled.

Config   #Classes   #Files   #File switches   #File refs
1        18626      10122    2085064          6677
2        18626      10122    519909           6661

Although there are 18,000 classes with 10,000 associated source files, only a third of the files are actually referenced from the Java CU's line number table. However, enabling both options means that the number of file switch operations during line info processing increases 4-fold.

Config   #Classes   #Inline Nodes   #Inline Trees   Avg Tree Depth
1        18626      3652938         678075          5.3
2        18626      392533          218423          1.8

The number of inlined subroutine trees in the info section grows by a factor of just over 3 when both options are enabled, while the number of nodes in those trees grows by a factor of roughly 9. So, the average tree depth goes up by a factor of 3.

Config   process lines (secs)   process DIEs (secs)
1        240                    100
2        57                     9

Overall the line processing time increases by a factor of just over 4 and the DIE processing by around 11.

Config   Peak VMem (GB)   Peak RSS (GB)
1        6.0              2.5
2        1.9              0.9

The most alarming issue is the amount of excess scratch memory needed to process the line info when both options are enabled. This memory is released after DIE processing, taking the VMem and RSS back to around 1.8GB and 0.9GB. With either option omitted, the difference between peak VMem/RSS and the values after processing is almost non-existent.

I think this is probably because the full config involves many repeated visits to files, each of which creates scratch data that later needs to be freed, while the lower configs involve few repeat visits. In particular the code I identified that traverses the file lists appears to be creating a lot of copies of strings formed by concatenating a dir path and a file name. It also looks like it is reallocating a lot of temporary array structures that are used to track visited locations in each file and only freeing these arrays at the end of processing. So, it seems the growth here is distinctly non-linear.

I think the only way to bring these numbers down is to find some way to return to using multiple CUs that also allows out-of-(class)-order generation of compiled method code. That may well allow the inline tree info to be loaded incrementally rather than all at once during startup. It should also allow processing of each line table to be much quicker, as per-table file ref counts will be substantially lower, ensuring that each individual file switch only involves traversal of a comparably reduced file record list. I will investigate this after reporting the problem to the upstream GraalVM team.

jerboaa (Collaborator Author) commented Jun 19, 2023

It looks like 5 minutes are not enough for github actions, we will need to further increase it:

Error:  debugSymbolsQuarkus{TestInfo}  Time elapsed: 474.501 s  <<< FAILURE!

It doesn't look like the 10-minute timeout solves it reliably either:

Error:  Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 626.09 s <<< FAILURE! - in org.graalvm.tests.integration.DebugSymbolsTest
Error:  debugSymbolsQuarkus{TestInfo}  Time elapsed: 518.154 s  <<< FAILURE!

https://github.com/graalvm/mandrel/actions/runs/5295670650/jobs/9586454998#step:11:8896

adinn commented Jun 28, 2023

I prepared a fix for this that factors out compiled method details for each class into a single compiled method CU (separate from the types CU) when its methods occupy a single range, or a group of compiled method CUs when they are factored into subsets with discontiguous subranges thanks to interleaving of code from other classes' methods. That gives each compiled method CU its own line table, which means that for full debug info the time to set a breakpoint comes down from > 200 seconds to 3 seconds. Unfortunately, it also means that setting a method breakpoint b Foo:bar will fail unless you have previously executed some command like ptype Foo that forces gdb to cross-index the info in the main CU about type Foo with the info in the compiled methods CU that describes the compiled code for Foo:bar. By default gdb reads the main type info CU, finds a method declaration but no method definition in that CU, and assumes Foo:bar was never compiled into the program.

That is clearly an unacceptable problem when it comes to usability (as an example, it breaks the debug info test when it tries to break a method used to test foreign types). I have not been able to come up with a way of making sure gdb automatically cross-indexes the declaration in the types CU with the definition in the separate compiled method CU. I believe I am going to have to revert to using a CU per class in order to ensure that the declarations and definitions stay together while still keeping CU file counts down to a manageable size.

I checked the DWARF spec and found out that there is a way to deal with the CUs having discontinuous subranges (the problem that led to merging all the info). Instead of a lo and hi watermark, an info CU can point to a table of address ranges listed in a separate .debug_ranges section. So, if I add that I believe we can still support code interleaving for methods from independent classes while addressing the scaling problem that led to this performance issue.
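
Whether a build has picked that up should be checkable with binutils, along these lines (a sketch; a CU that uses the range table carries a DW_AT_ranges attribute in place of a single DW_AT_low_pc/DW_AT_high_pc pair):

$ objdump --dwarf=info quarkus-runner.debug | grep -m 5 DW_AT_ranges
$ objdump --dwarf=Ranges quarkus-runner.debug | head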

adinn commented Jul 5, 2023

I have patched the generator to split the info into a compile unit per class. This brings down the line processing overhead enormously -- to around a second. Most CUs have only a small number of files. There are occasionally CUs with a few hundred files but even that is not enough to create a large processing load.

Unfortunately, the patch does not remove all the overhead in processing inline method hierarchies. The overall time to set the breakpoint is now down to 60 seconds. So, this patch has reduced the inline method processing cost (was ~100 seconds) but it is still large.

I ran gdb under perf to identify where the time is spent and I don't think I can improve on this result. Processing an inline tree involves a recursive traversal of the relevant inline nodes. perf shows most of the time spent in a hierarchy of recursive calls from process_die to read_func_scope. The recorded depth of recursion is 11 (i.e. 22 frames), with the percentage of time in that chain as follows:

     - 83.22% process_die                                                                                                                                                                            
       - 81.98% read_func_scope
         - 81.53% process_die
           - 81.26% read_func_scope
             - 73.29% process_die
               - 68.19% read_func_scope
                 - 60.04% process_die
                   - 53.57% read_func_scope
                     - 46.44% process_die
                       - 39.80% read_func_scope
                         - 32.44% process_die
                           - 26.55% read_func_scope
                             - 20.59% process_die
                               - 16.11% read_func_scope
                                 - 11.89% process_die
                                   - 8.55% read_func_scope
                                     - 6.21% process_die
                                       - 3.80% read_func_scope
                                         - 2.74% process_die
                                           - 1.72% read_func_scope
                                             - 1.19% process_die
                                               - 0.73% read_func_scope

The first call to read_func_scope is for the compiled method info (i.e. inline depth 0). The recursive entries represent processing of an inline call at the equivalent depth in the inline tree. So, the figure of 0.73% for the 11th call represents time spent processing nodes at depth 10 in an inline tree (there may be trees with greater depths but perf has not spotted them).

There is no way to make processing this info run faster other than dropping inline nodes from the generated info, for example by limiting the inline tree depth. Even then you need a very low limit for this to cut the time significantly, e.g. to halve the time you would need to stop at inline depth 3.

jerboaa (Collaborator Author) commented Jul 20, 2023

Fix is in the upstream merge queue: oracle/graal#6984. We shall see...

zakkak (Collaborator) commented Jul 21, 2023

Upstream fix is now merged. Hopefully the CI will succeed tomorrow :)

jerboaa (Collaborator Author) commented Jul 21, 2023

Upstream fix is now merged. Hopefully the CI will succeed tomorrow :)

Fingers crossed.

zakkak added a commit that referenced this issue Jul 24, 2023

oracle/graal#6984 reduced the time spent reading debug symbols but it's
still noticeable. This PR re-adjusts the timeouts to 60 seconds.

Relates to #160

zakkak (Collaborator) commented Jul 24, 2023

Closed by oracle/graal#6984 and #170

zakkak closed this as completed Jul 24, 2023