
docker on EC2 aarch64 dynamic images restarting after ~30 minutes (during build) #1630

Closed
andrew-m-leonard opened this issue Oct 19, 2020 · 8 comments


@andrew-m-leonard
Contributor

https://ci.adoptopenjdk.net/view/Failed%20Builds/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-linux-aarch64-openj9-linuxXL/397/console

09:07:25  [ 59%] Building CXX object runtime/gc_modron_standard/CMakeFiles/j9modronstandard.dir/StandardAccessBarrier.cpp.o
09:15:43  wrapper script does not seem to be touching the log file in /home/ubuntu/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-aarch64-openj9-linuxXL@tmp/durable-63690d87
09:15:43  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

Node:
08:42:28 All nodes of label ‘build&&linux&&aarch64&&dockerBuild’ are offline
08:43:32 Running on EC2 (adopt_aws) - Dynamic Linux aarch64 VM provisioned from AWS (i-0fbd7d9f07a7b18f3) in /home/ubuntu/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-aarch64-openj9-linuxXL
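For reference, the JENKINS-48300 workaround quoted in the log is a JVM system property read by the durable-task plugin on the Jenkins controller. A minimal sketch of where it could be set, assuming Debian/Ubuntu-style Jenkins packaging (the file path and variable name are assumptions about the local setup, not taken from this issue):

```shell
# Sketch: raise the durable-task heartbeat check interval to 24 hours.
# The property must be on the Jenkins controller JVM; with Debian/Ubuntu
# packaging the JVM options typically live in /etc/default/jenkins.
JAVA_ARGS="$JAVA_ARGS -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400"
```

Note this only changes how long Jenkins waits before declaring the wrapper script dead; it does not address whatever stopped the script from touching the log file in the first place.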

@andrew-m-leonard
Contributor Author

It looks like these failures started on 17th Oct...

@karianna karianna added this to TODO in infrastructure via automation Oct 19, 2020
@karianna karianna added the bug label Oct 19, 2020
@andrew-m-leonard
Contributor Author

The common theme seems to be that the job fails after running for exactly 30 minutes.

@sxa sxa added this to the October 2020 milestone Oct 19, 2020
@sxa sxa added the systemdown label Oct 19, 2020
@sxa
Member

sxa commented Oct 19, 2020

At some point in the middle of this afternoon this seems to have cleared up. All affected builds during the time period appeared to be aborting after 30 minutes, with no obvious reason why, and we have not knowingly taken any remedial action. I will close for now and reopen if it recurs.

Sample error:

12:13:00  [ 94%] Building C object runtime/verbose/CMakeFiles/j9vrb.dir/__/codert_vm/jswalk.c.o
12:13:01  [ 94%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-aarch64-openj9/workspace/build/src/omr/compiler/optimizer/LoopReplicator.cpp.o
12:13:02  [ 94%] Building CXX object runtime/verbose/CMakeFiles/j9vrb.dir/__/jit_vm/ctsupport.cpp.o
12:20:50  wrapper script does not seem to be touching the log file in /home/ubuntu/workspace/build-scripts/jobs/jdk11u/jdk11u-linux-aarch64-openj9@tmp/durable-11ec4905
12:20:50  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)
[Pipeline] }
[Pipeline] // withEnv
[Pipeline] }
[Pipeline] // stage
[Pipeline] }
12:20:50  $ docker stop --time=1 097306ba93429796813a823ded1742ee23e1560424baf50d8fa5fcfe1ff40f31
12:20:51  $ docker rm -f 097306ba93429796813a823ded1742ee23e1560424baf50d8fa5fcfe1ff40f31
[Pipeline] // withDockerContainer
[Pipeline] }
[Pipeline] // node
[Pipeline] }
[Pipeline] // stage
[Pipeline] echo
12:20:52  Execution error: hudson.AbortException: script returned exit code -1
[Pipeline] echo
12:20:52  hudson.AbortException: script returned exit code -1
12:20:52  	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.handleExit(DurableTaskStep.java:659)
12:20:52  	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.check(DurableTaskStep.java:605)
12:20:52  	at org.jenkinsci.plugins.workflow.steps.durable_task.DurableTaskStep$Execution.run(DurableTaskStep.java:549)
12:20:52  	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
12:20:52  	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
12:20:52  	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
12:20:52  	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
12:20:52  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
12:20:52  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
12:20:52  	at java.lang.Thread.run(Thread.java:748)
12:20:52  

@sxa sxa closed this as completed Oct 19, 2020
infrastructure automation moved this from TODO to Done Oct 19, 2020
@andrew-m-leonard
Contributor Author

@sxa
Member

sxa commented Oct 20, 2020

The above PR made no difference - https://ci.adoptopenjdk.net/job/build-scripts-pr-tester/job/build-test/job/jobs/job/jdk/job/jdk-linux-aarch64-openj9/317/consoleFull showed the same failure.

I have added the appropriate labels to https://ci.adoptopenjdk.net/computer/docker-packet-ubuntu1604-armv8-1/ and my proposal would be that, unless we can determine the cause of the failures, we disable the dynamically provisioned EC2 aarch64 systems for the GA and single-thread all builds through that Packet machine. I am running a build at https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk/job/jdk-linux-aarch64-hotspot/285/console to verify whether it is able to run the dockerBuilds successfully; if that works, the only potential issue with the proposal is a recurrence of adoptium/temurin-build#1804.

@sxa sxa reopened this Oct 20, 2020
infrastructure automation moved this from Done to In Progress Oct 20, 2020
@sxa
Member

sxa commented Oct 20, 2020

(NOTE: @gdams had increased the capacity on the systems yesterday but that did not resolve the issue)

@sxa
Member

sxa commented Oct 20, 2020

After seeing that the docker container was no longer running during the period when the logs stopped, I wondered if we were hitting an issue with the docker subsystem on the dynamically provisioned hosts being updated and restarted while the builds were taking place. Looking at the Jenkins job log and the package logs on the host system, they lined up almost exactly an hour apart at the time the docker.io package was updated (likely just a time zone difference in how they are presented):


15:37:28  [ 88%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/andrew-aarch-test/workspace/build/src/omr/compiler/optimizer/ReachingDefinitions.cpp.o
15:37:28  [ 88%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/andrew-aarch-test/workspace/build/src/omr/compiler/optimizer/OMRRecognizedCallTransformer.cpp.o
15:37:29  [ 88%] Building CXX object runtime/gc_modron_startup/CMakeFiles/j9modronstartup.dir/mminit.cpp.o
15:37:29  [ 88%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/andrew-aarch-test/workspace/build/src/omr/compiler/optimizer/RedundantAsyncCheckRemoval.cpp.o
15:37:29  [ 88%] Building CXX object runtime/compiler/CMakeFiles/j9jit.dir/home/ubuntu/workspace/andrew-aarch-test/workspace/build/src/omr/compiler/optimizer/RegisterCandidate.cpp.o
15:45:02  wrapper script does not seem to be touching the log file in /home/ubuntu/workspace/andrew-aarch-test@tmp/durable-90579abf
15:45:02  (JENKINS-48300: if on an extremely laggy filesystem, consider -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.HEARTBEAT_CHECK_INTERVAL=86400)

vs the update logs:

2020-10-20 14:37:19 upgrade docker.io:arm64 19.03.6-0ubuntu1~18.04.1 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:19 status half-configured docker.io:arm64 19.03.6-0ubuntu1~18.04.1
2020-10-20 14:37:31 status unpacked docker.io:arm64 19.03.6-0ubuntu1~18.04.1
2020-10-20 14:37:31 status half-installed docker.io:arm64 19.03.6-0ubuntu1~18.04.1
2020-10-20 14:37:35 status half-installed docker.io:arm64 19.03.6-0ubuntu1~18.04.1
2020-10-20 14:37:35 status unpacked docker.io:arm64 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:35 status unpacked docker.io:arm64 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:35 configure docker.io:arm64 19.03.6-0ubuntu1~18.04.2 <none>
2020-10-20 14:37:35 status unpacked docker.io:arm64 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:35 status half-configured docker.io:arm64 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:37 status installed docker.io:arm64 19.03.6-0ubuntu1~18.04.2
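This kind of correlation can be checked directly on a host by grepping the dpkg log for docker.io package events and comparing the timestamps against the Jenkins console. A minimal sketch, with two of the log lines above inlined as sample input in place of the real /var/log/dpkg.log:

```shell
# Sketch: pull docker.io upgrade events out of a dpkg log so their
# timestamps can be compared against the Jenkins console timestamps.
# Sample lines are inlined; on a real host, read /var/log/dpkg.log.
log=$(mktemp)
cat > "$log" <<'EOF'
2020-10-20 14:37:19 upgrade docker.io:arm64 19.03.6-0ubuntu1~18.04.1 19.03.6-0ubuntu1~18.04.2
2020-10-20 14:37:37 status installed docker.io:arm64 19.03.6-0ubuntu1~18.04.2
EOF
grep ' upgrade docker\.io' "$log"
rm -f "$log"
```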

I therefore believe these failures were caused by automatic updates trying to upgrade docker on machines created from a one-off template image used for provisioning the dynamic instances. If we rebuild the template with a more up-to-date docker.io package, I believe it will resolve the problem.
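Aside from rebuilding the template (which is what was done here), an alternative mitigation for this class of failure would be to stop automatic upgrades from touching docker on build hosts at all. A hedged sketch, assuming Ubuntu's stock apt and unattended-upgrades tooling:

```shell
# Option 1: mark the package held so apt will not auto-upgrade it
sudo apt-mark hold docker.io

# Option 2: exclude it from unattended upgrades by adding to
# /etc/apt/apt.conf.d/50unattended-upgrades:
#   Unattended-Upgrade::Package-Blacklist {
#       "docker.io";
#   };
```

The trade-off is that held or blacklisted packages no longer receive security updates automatically, so the template would still need periodic manual rebuilds.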

@sxa sxa changed the title aarch64 builds fail: extremely laggy filesystem docker on EC2 aarch64 dynamic images restarting after ~30 minutes (during build) Oct 20, 2020
@sxa
Member

sxa commented Oct 20, 2020

Looks ok after rebuilding the image (and increasing the disk space again, as the first rebuild was too small to run a build on). Closing, as I'm now reasonably confident that the issue has been identified and resolved.

@sxa sxa closed this as completed Oct 20, 2020
infrastructure automation moved this from In Progress to Done Oct 20, 2020