
docker on EC2 aarch64 dynamic images restarting after ~30 minutes (during build) #1630

Closed
andrew-m-leonard opened this issue Oct 19, 2020 · 8 comments


@andrew-m-leonard
Contributor

https://ci.adoptopenjdk.net/view/Failed%20Builds/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-linux-aarch64-openj9-linuxXL/397/console

09:07:25  [ 59%] Building CXX object runtime/gc_modron_standard/CMakeFiles/j9modronstandard.dir/StandardAccessBarrier.cpp.o
09:15:43  wrapper script does not seem to be touching the log file in /home/ubuntu/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-aarch64-openj9-linuxXL@tmp/durable-63690d87
09:15:43  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

Node:
08:42:28 All nodes of label ‘build&&linux&&aarch64&&dockerBuild’ are offline
08:43:32 Running on EC2 (adopt_aws) - Dynamic Linux aarch64 VM provisioned from AWS (i-0fbd7d9f07a7b18f3) in /home/ubuntu/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-aarch64-openj9-linuxXL
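For reference, the JENKINS-48300 workaround quoted in the log is a JVM system property read by the durable-task plugin on the Jenkins controller. A minimal sketch of where it could be set, assuming Debian/Ubuntu-style Jenkins packaging (the file path and variable name are assumptions about the local setup, not taken from this issue):

```shell
# Sketch: raise the durable-task heartbeat check interval to 24 hours.
# The property must be on the Jenkins controller JVM; with Debian/Ubuntu
# packaging the JVM options typically live in /etc/default/jenkins.
JAVA_ARGS="$JAVA_ARGS -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400"
```

Note this only changes how long Jenkins waits before declaring the wrapper script dead; it does not address whatever stopped the script from touching the log file in the first place.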

@andrew-m-leonard
Contributor Author

It looks like these failures started on 17th Oct...

@karianna karianna added this to TODO in infrastructure via automation Oct 19, 2020
@karianna karianna added the bug label Oct 19, 2020
@andrew-m-leonard
Contributor Author

The common theme seems to be that the job fails after running for exactly 30 minutes.

@sxa sxa added this to the October 2020 milestone Oct 19, 2020
@sxa sxa added the systemdown label Oct 19, 2020
@sxa
Member

sxa commented Oct 19, 2020

At some point in the middle of this afternoon this seems to have cleared up. All affected builds during the time period appeared to be aborting after 30 minutes, with no obvious reason why, and we have not knowingly taken any remedial action. I will close for now and reopen if it recurs.

Sample error:

12:13:00  [ 94%] Building C object runtime/verbose/CMakeFiles/j9vrb.dir/__/codert_vm/jswalk.c.o
12:13:01  [ 94%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-aarch64-openj9/workspace/build/src/omr/compiler/optimizer/LoopReplicator.cpp.o
12:13:02  [ 94%] Building CXX object runtime/verbose/CMakeFiles/j9vrb.dir/__/jit_vm/ctsupport.cpp.o
12:20:50  wrapper script does not seem to be touching the log file in /home/ubuntu/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-aarch64-openj9@tmp/durable-11ec4905
12:20:50  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
12:20:50  $ docker stop --time=1 097306ba93429796813a823ded1742ee23e1560424baf50d8fa5fcfe1ff40f31
12:20:51  $ docker rm -f 097306ba93429796813a823ded1742ee23e1560424baf50d8fa5fcfe1ff40f31
[Pipeline] // withDockerContainer
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // stage
[Pipeline] echo
12:20:52  Execution error: hudson.AbortException: script returned exit code -1
[Pipeline] echo
12:20:52  hudson.AbortException: script returned exit code -1
12:20:52  	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.handleExit(DurableTaskStep.java:659)
12:20:52  	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:605)
12:20:52  	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:549)
12:20:52  	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
12:20:52  	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
12:20:52  	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
12:20:52  	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
12:20:52  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
12:20:52  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
12:20:52  	at java.lang.Thread.run(Thread.java:748)
12:20:52  

@sxa sxa closed this as completed Oct 19, 2020
infrastructure automation moved this from TODO to Done Oct 19, 2020
@andrew-m-leonard
Contributor Author

@sxa
Member

sxa commented Oct 20, 2020

The above PR made no difference - https://ci.adoptopenjdk.net/job/build-scripts-pr-tester/job/build-test/job/jobs/job/jdk/job/jdk-linux-aarch64-openj9/317/consoleFull showed the same failure.

I have added the appropriate labels to https://ci.adoptopenjdk.net/computer/docker-packet-ubuntu1604-armv8-1/ and my proposal would be that, unless we can determine the cause of the failures, we disable the dynamically provisioned EC2 aarch64 systems for the GA and single-thread all builds through that Packet machine. I am running a build at https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-linux-aarch64-hotspot/285/console to verify whether it is able to run the dockerBuilds successfully; if that works, the only potential issue with the proposal is a recurrence of adoptium/temurin-build#1804.

@sxa sxa reopened this Oct 20, 2020
infrastructure automation moved this from Done to In Progress Oct 20, 2020
@sxa
Member

sxa commented Oct 20, 2020

(NOTE: @gdams had increased the capacity on the systems yesterday but that did not resolve the issue)

@sxa
Member

sxa commented Oct 20, 2020

After seeing that the docker container was no longer running during the period when the logs stopped, I wondered if we were hitting an issue with the docker subsystem on the dynamically provisioned hosts being updated and restarted while the builds were taking place. Looking at the Jenkins job log and the package logs on the host system, they lined up almost exactly an hour apart at the time the docker.io package was updated (likely just a time zone difference in how they are presented):


15:37:28  [ 88%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/andrew-aarch-test/workspace/build/src/omr/compiler/optimizer/ReachingDefinitions.cpp.o
15:37:28  [ 88%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/andrew-aarch-test/workspace/build/src/omr/compiler/optimizer/OMRRecognizedCallTransformer.cpp.o
15:37:29  [ 88%] Building CXX object runtime/gc_modron_startup/CMakeFiles/j9modronstartup.dir/mminit.cpp.o
15:37:29  [ 88%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/andrew-aarch-test/workspace/build/src/omr/compiler/optimizer/RedundantAsyncCheckRemoval.cpp.o
15:37:29  [ 88%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/andrew-aarch-test/workspace/build/src/omr/compiler/optimizer/RegisterCandidate.cpp.o
15:45:02  wrapper script does not seem to be touching the log file in /home/ubuntu/workspace/andrew-aarch-test@tmp/durable-90579abf
15:45:02  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

vs the update logs:

2020-10-20 14:37:19 upgrade docker.io:arm64 19.03.6-0ubuntu1~18.04.1 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:19 status half-configured docker.io:arm64 19.03.6-0ubuntu1~18.04.1
2020-10-20 14:37:31 status unpacked docker.io:arm64 19.03.6-0ubuntu1~18.04.1
2020-10-20 14:37:31 status half-installed docker.io:arm64 19.03.6-0ubuntu1~18.04.1
2020-10-20 14:37:35 status half-installed docker.io:arm64 19.03.6-0ubuntu1~18.04.1
2020-10-20 14:37:35 status unpacked docker.io:arm64 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:35 status unpacked docker.io:arm64 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:35 configure docker.io:arm64 19.03.6-0ubuntu1~18.04.2 <none>
2020-10-20 14:37:35 status unpacked docker.io:arm64 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:35 status half-configured docker.io:arm64 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:37 status installed docker.io:arm64 19.03.6-0ubuntu1~18.04.2
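This kind of correlation can be checked directly on a host by grepping the dpkg log for docker.io package events and comparing the timestamps against the Jenkins console. A minimal sketch, with two of the log lines above inlined as sample input in place of the real /var/log/dpkg.log:

```shell
# Sketch: pull docker.io upgrade events out of a dpkg log so their
# timestamps can be compared against the Jenkins console timestamps.
# Sample lines are inlined; on a real host, read /var/log/dpkg.log.
log=$(mktemp)
cat > "$log" <<'EOF'
2020-10-20 14:37:19 upgrade docker.io:arm64 19.03.6-0ubuntu1~18.04.1 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:37 status installed docker.io:arm64 19.03.6-0ubuntu1~18.04.2
EOF
grep ' upgrade docker\.io' "$log"
rm -f "$log"
```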

I therefore believe these failures were caused by automatic updates trying to upgrade docker on machines created from a one-off template image used for provisioning the dynamic instances. If we rebuild the template with a more up-to-date docker.io package, I believe it will resolve the problem.
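Aside from rebuilding the template (which is what was done here), an alternative mitigation for this class of failure would be to stop automatic upgrades from touching docker on build hosts at all. A hedged sketch, assuming Ubuntu's stock apt and unattended-upgrades tooling:

```shell
# Option 1: mark the package held so apt will not auto-upgrade it
sudo apt-mark hold docker.io

# Option 2: exclude it from unattended upgrades by adding to
# /etc/apt/apt.conf.d/50unattended-upgrades:
#   Unattended-Upgrade::Package-Blacklist {
#       "docker.io";
#   };
```

The trade-off is that held or blacklisted packages no longer receive security updates automatically, so the template would still need periodic manual rebuilds.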

@sxa sxa changed the title aarch64 builds fail: extremely laggy filesystem docker on EC2 aarch64 dynamic images restarting after ~30 minutes (during build) Oct 20, 2020
@sxa
Member

sxa commented Oct 20, 2020

Looks ok after rebuilding the image (and increasing the disk space again, as the first rebuild was too small to run a build on). Closing, as I'm now reasonably confident that the issue has been identified and resolved.

@sxa sxa closed this as completed Oct 20, 2020
infrastructure automation moved this from In Progress to Done Oct 20, 2020