aarch64 build failure, Hash of java.rmi (...) differs to expected hash #1804

Closed
andrew-m-leonard opened this issue Jun 1, 2020 · 48 comments
Labels
aarch Issues that affect or relate to the aarch ARCHITECTURE bug Issues that are problems in the code as reported by the community openj9 Issues that are enhancements or bugs raised against the OpenJ9 group x-linux Issues that affect or relate to the x64/x32 LINUX OS
Milestone

Comments

@andrew-m-leonard
Contributor

andrew-m-leonard commented Jun 1, 2020

Platform:
aarch64
https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-linux-aarch64-openj9-linuxXL/223/consoleFull

01:05:10  * For target support_images_jre:
01:05:10  Error: Hash of java.rmi (5fe5f05db30334fea7371af06d42a7f7e5417acae4022df3ddec5c8d8aaf5e02) differs to expected hash (24f0620202e6705b53c79e3827358915385f2c46b4c91b1b65443e954b4df58f) recorded in java.base
01:05:10  java.lang.module.FindException: Hash of java.rmi (5fe5f05db30334fea7371af06d42a7f7e5417acae4022df3ddec5c8d8aaf5e02) differs to expected hash (24f0620202e6705b53c79e3827358915385f2c46b4c91b1b65443e954b4df58f) recorded in java.base
01:05:10  	at java.base/java.lang.module.Resolver.findFail(Resolver.java:877)
01:05:10  	at java.base/java.lang.module.Resolver.checkHashes(Resolver.java:461)
01:05:10  	at java.base/java.lang.module.Resolver.finish(Resolver.java:360)
01:05:10  	at java.base/java.lang.module.Configuration.<init>(Configuration.java:141)
01:05:10  	at java.base/java.lang.module.Configuration.resolve(Configuration.java:424)
01:05:10  	at java.base/java.lang.module.Configuration.resolve(Configuration.java:256)
01:05:10  	at jdk.jlink/jdk.tools.jlink.internal.Jlink$JlinkConfiguration.resolve(Jlink.java:220)
01:05:10  	at jdk.jlink/jdk.tools.jlink.internal.JlinkTask.createImageProvider(JlinkTask.java:486)
01:05:10  	at jdk.jlink/jdk.tools.jlink.internal.JlinkTask.createImage(JlinkTask.java:396)
01:05:10  	at jdk.jlink/jdk.tools.jlink.internal.JlinkTask.run(JlinkTask.java:272)
01:05:10  	at jdk.jlink/jdk.tools.jlink.internal.Main.run(Main.java:54)
01:05:10  	at jdk.jlink/jdk.tools.jlink.internal.Main.main(Main.java:33)

I suspect this may be an OpenJDK build concurrency issue.
A workaround may be to reduce the aarch64 concurrency; it is currently using 29 jobs:

00:43:14  * Cores to use:   29
@M-Davies
Contributor

M-Davies commented Jun 1, 2020

Previously seen as an intermittent failure: #1450 (comment)

@sxa
Member

sxa commented Jun 1, 2020

That machine has far more than 29 cores so that number shouldn't be a problem (unless there is a problem in the makefile that this happens to exacerbate). Do we know if this is only showing up on one machine?

@andrew-m-leonard
Contributor Author

I suspect this might be exposing a jdk11 makefile issue. It is really hard to tell, but 29 jobs is a large number.
What I am saying is that, with a bit of luck, the problem may disappear with, say, 10 jobs, as I suspect java.base got its hashes generated from an incomplete java.rmi.

@karianna karianna added the bug Issues that are problems in the code as reported by the community label Jun 1, 2020
@karianna karianna added this to TODO in temurin-build via automation Jun 1, 2020
@karianna karianna added this to the June 2020 milestone Jun 10, 2020
@karianna karianna moved this from TODO to In Progress in temurin-build Jun 10, 2020
@knn-k

knn-k commented Jun 29, 2020

I wrote a small program that calculates the hash value of a jmod file for reproducing the problem.
But I haven't been able to reproduce the wrong hash value.
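
For anyone wanting to try the same kind of reproduction, a small hashing program along these lines might look like the sketch below. It is not the program mentioned above (that one was not posted). It assumes the hash in question is a plain SHA-256 over the bytes of the jmod file, and `sha256_of_file` is a name made up for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    Assumption: the module hash is a plain SHA-256 over the file bytes.
    The file is read in chunks so that large jmod files are not loaded at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_of_file("build/linux-aarch64-normal-server-release/images/jmods/java.rmi.jmod")` on the build machine would give a value to compare against the two hashes printed in the jlink error.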

@knn-k

knn-k commented Jun 30, 2020

Similar error with Hotspot on amd64: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=944738

@Akira1Saitoh

I could reproduce the problem on cent7-aarch64-1. I was also able to reproduce it by manually launching the jmod command used by the build process, but the probability is very low (1 in 200 runs so far).

@andrew-m-leonard
Contributor Author

andrew-m-leonard commented Jun 30, 2020

I started to think about this one, as it's causing regular build breaks, and I was wondering what reasons there could be for the hash "recorded" in java.base being different from the hash of the dependent jmod. I can think of the following:

  1. Crypto "hash" logic has some odd timing bug
  2. Concurrency in the GNU make workers causes the java.base recorded hash to be on an incomplete java.rmi jmod? (Discounted this, by running a test build --with-jobs=1)
  3. GNU makefile has a basic dependency error such that recorded hash is done at wrong time, prior to java.rmi being completed?
  4. ???

Any others, anyone?
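
To make hypotheses 2 and 3 concrete: if the hash were taken while the jmod was still being written, it would differ from the hash of the finished file. The sketch below only illustrates that effect with a temporary file. The payload, the split point and the function names are made up, and nothing here shows that the build actually behaves this way.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path):
    """Hex SHA-256 of whatever bytes are in the file right now."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def hash_before_and_after(payload, split_at):
    """Hash a file after a partial write, then again once the write completes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload[:split_at])
            f.flush()                      # make the partial content visible to readers
            early = sha256_of_file(path)   # a reader that arrives too early
            f.write(payload[split_at:])
        final = sha256_of_file(path)       # a reader that arrives after the writer is done
        return early, final
    finally:
        os.remove(path)


# Hypothetical stand-in for jmod contents; any byte string longer than split_at works.
early, final = hash_before_and_after(b"stand-in for jmod contents" * 1000, split_at=4096)
print("early:", early)
print("final:", final)
print("match:", early == final)
```

The two digests differ whenever the early reader sees fewer bytes than the final file contains, which is the symptom a missing makefile dependency would produce.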

@karianna karianna modified the milestones: June 2020, July 2020 Jul 1, 2020
@andrew-m-leonard
Contributor Author

@Akira1Saitoh

  1. Concurrency in the GNU make workers causes the java.base recorded hash to be on an incomplete java.rmi jmod? (Discounted this, by running a test build --with-jobs=1)
  2. GNU makefile has a basic dependency error such that recorded hash is done at wrong time, prior to java.rmi being completed?

I think this is not a problem with GNU make or the makefile, because I can reproduce the hash mismatch error by simply running the jmod command.

@knn-k

knn-k commented Jul 2, 2020

@andrew-m-leonard What is the version of OpenSSL on the build server?

@Akira1Saitoh

Surprisingly, I saw that the output of the sha256sum command was sometimes incorrect on cent7-aarch64-1.

...
fa13eb1483a0fbc03164e113d4d7d23525888d8607b2fc4472c2ff4dd9bce55f build/linux-aarch64-normal-server-release/images/jmods/java.rmi.jmod
hash of java.rmi is fa13eb1483a0fbc03164e113d4d7d23525888d8607b2fc4472c2ff4dd9bce55f
run 47: ok
fa13eb1483a0fbc03164e113d4d7d23525888d8607b2fc4472c2ff4dd9bce55f build/linux-aarch64-normal-server-release/images/jmods/java.rmi.jmod
hash of java.rmi is fa13eb1483a0fbc03164e113d4d7d23525888d8607b2fc4472c2ff4dd9bce55f
run 48: ok
fa13eb1483a0fbc03164e113d4d7d23525888d8607b2fc4472c2ff4dd9bce55f build/linux-aarch64-normal-server-release/images/jmods/java.rmi.jmod
hash of java.rmi is fa13eb1483a0fbc03164e113d4d7d23525888d8607b2fc4472c2ff4dd9bce55f
run 49: ok
fa13eb1483a0fbc03164e113d4d7d23525888d8607b2fc4472c2ff4dd9bce55f build/linux-aarch64-normal-server-release/images/jmods/java.rmi.jmod
hash of java.rmi is fa13eb1483a0fbc03164e113d4d7d23525888d8607b2fc4472c2ff4dd9bce55f
run 50: ok
264ac9684a28d03af37aaedf683a52d02405b239c8911b74ec780c9a02bd0288 build/linux-aarch64-normal-server-release/images/jmods/java.rmi.jmod
hash of java.rmi is 264ac9684a28d03af37aaedf683a52d02405b239c8911b74ec780c9a02bd0288
hash does not match
[jenkins@cent7-arm8-1 openj9-openjdk-jdk11]$ sha256sum build/linux-aarch64-normal-server-release/images/jmods/java.rmi.jmod
fa13eb1483a0fbc03164e113d4d7d23525888d8607b2fc4472c2ff4dd9bce55f build/linux-aarch64-normal-server-release/images/jmods/java.rmi.jmod

The sha256sum command uses libcrypto.so, which is part of OpenSSL.

[jenkins@cent7-arm8-1 build]$ ldd /usr/bin/sha256sum
linux-vdso.so.1 => (0x0000ffffb08a0000)
libcrypto.so.10 => /lib64/libcrypto.so.10 (0x0000ffffb0670000)
libc.so.6 => /lib64/libc.so.6 (0x0000ffffb04e0000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffb08b0000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000ffffb04b0000)
libz.so.1 => /lib64/libz.so.1 (0x0000ffffb0470000)
[jenkins@cent7-arm8-1 build]$ ls -l /lib64/libcrypto.so.10
lrwxrwxrwx 1 root root 19 Sep 18 2019 /lib64/libcrypto.so.10 -> libcrypto.so.1.0.2k

@andrew-m-leonard
Contributor Author

That is interesting. I've just created a job doing a sha256sum loop, and in one instance out of 10000 it got a different hash:
HASH=cfdab6087d46a53a4de3a7f2e42791d7e44cfa1e553206127c487541c0e207f3
ERROR hash=7b9d06d1412b71840e2576b2dcccfce7f57c5e5427a576507551ac7b5a9e3db1

However, having run the job numerous times since, it hasn't failed again... weird!
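
A loop of this kind can be sketched roughly as follows. This is not the script from the Jenkins job (which is not shown in this thread): it uses Python's `hashlib` instead of the `sha256sum` command, the function names are invented, and treating the most common digest as the reference is an assumption. Note that CPython's `hashlib` is typically backed by OpenSSL as well, so it does not rule OpenSSL out.

```python
import hashlib
from collections import Counter


def sha256_of_file(path, chunk_size=1 << 20):
    """Hex SHA-256 of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeat_hash(path, runs=10000, hash_file=sha256_of_file):
    """Hash `path` `runs` times (runs >= 1) and count how often each digest occurs."""
    return Counter(hash_file(path) for _ in range(runs))


def report(counts):
    """Print the most common digest as the reference and list any deviations.

    Returns the number of distinct deviating digests.
    """
    (reference, _), *others = counts.most_common()
    print("HASH=" + reference)
    for digest, n in others:
        print("ERROR hash=%s (%d runs)" % (digest, n))
    return len(others)
```

On a healthy machine `report(repeat_hash(path))` should print only the `HASH=` line. The `hash_file` parameter is there so that a different checksum function can be swapped in for comparison.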

@andrew-m-leonard
Contributor Author

build-packet-centos74-armv8-1, openssl version :
OpenSSL 1.0.2k-fips 26 Jan 2017

@sxa
Member

sxa commented Jul 2, 2020

@andrew-m-leonard What is the version of OpenSSL on the build server?

1.0.2k as supplied by RedHat

@andrew-m-leonard
Contributor Author

@sxa can we upgrade OpenSSL? It's looking like a bug in OpenSSL.

@andrew-m-leonard
Contributor Author

or a file system issue, returning different file data...?

@sxa
Member

sxa commented Jul 2, 2020

Well we could build our own one but we SHOULD be building with the one supplied with the OS so arguably RedHat should fix it. I'm tempted to suggest it might be file system related. I guess https://ci.adoptopenjdk.net/job/andrew-aarch-hash-debug/10/console is the run where you had one failure out of ... a lot?

@andrew-m-leonard
Contributor Author

Well we could build our own one but we SHOULD be building with the one supplied with the OS so arguably RedHat should fix it. I'm tempted to suggest it might be file system related. I guess https://ci.adoptopenjdk.net/job/andrew-aarch-hash-debug/10/console is the run where you had one failure out of ... a lot?

Correct job 10

@knn-k

knn-k commented Jul 2, 2020

or a file system issue, returning different file data...?

Yes, that is another possibility.
I suspect OpenSSL because there are some other known intermittent issues for AArch64 OpenJ9 that may be related to OpenSSL.

@Akira1Saitoh

If this is a file system issue, the error can also happen with other checksum commands that do not use OpenSSL, such as the cksum command. I think doing a cksum loop instead of sha256sum might help narrow down the problem.
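
One way to sketch that comparison is to compute a CRC and a SHA-256 over the same bytes on every run and see which of the two ever changes. The code below is only an illustration of the idea, with invented names. `zlib.crc32` stands in for `cksum` (it does not produce the same value as the POSIX `cksum` command), and the first run is used as the baseline, which is an assumption since that run could itself be the bad one.

```python
import hashlib
import zlib


def read_and_checksum(path):
    """Read the file once and return (crc32, sha256) computed over the same bytes."""
    with open(path, "rb") as f:
        data = f.read()
    return zlib.crc32(data) & 0xFFFFFFFF, hashlib.sha256(data).hexdigest()


def classify(path, runs=1000):
    """Compare each run against the first one and count which checksum deviated."""
    base_crc, base_sha = read_and_checksum(path)
    tally = {"both_ok": 0, "sha_only": 0, "crc_changed": 0}
    for _ in range(runs):
        crc, sha = read_and_checksum(path)
        if crc != base_crc:
            tally["crc_changed"] += 1   # the bytes read differ: suggests I/O or memory
        elif sha != base_sha:
            tally["sha_only"] += 1      # same CRC, different SHA-256: suggests the hash code
        else:
            tally["both_ok"] += 1
    return tally
```

A non-zero `sha_only` count with `crc_changed` at zero would point away from the file system and towards the SHA-256 implementation. A non-zero `crc_changed` count would suggest the data itself is being read differently.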

@sxa
Member

sxa commented Jul 15, 2020

This is not a file system issue - I've been able to replicate it when testing against a file on a ramdisk on the machine (/dev/shm)

@tmancill

This is not a file system issue - I've been able to replicate it when testing against a file on a ramdisk on the machine (/dev/shm)

Is this possibly related to #1214? That issue occurs on x86-64 and could be the same as Debian #944738, where it cropped up ~8 months ago. I haven't looked at it in any depth, but it seems like it could be an issue with a common toolchain component. The reporters of the Debian issue note that it doesn't occur with Oracle's builds.

@andrew-m-leonard
Contributor Author

andrew-m-leonard commented Jul 15, 2020

@tmancill @sxa #1214 may be related, but FYI I have been able to replicate this using a basic loop of sha256sum alone. It fails very occasionally on this particular machine, but always works on another machine...
https://ci.adoptopenjdk.net/view/work%20in%20progress/job/andrew-aarch-hash-debug/

@tmancill

@andrew-m-leonard Ah, in that case these sound like distinct issues. Thanks!

@sxa
Member

sxa commented Jan 4, 2021

Similar issues are happening within Docker images on the test-packet-ubuntu1604-armv8-1 host. Attempting to upgrade to a later Ubuntu may cause issues due to:

Leaving 'diversion of /etc/init/ureadahead.conf to /etc/init/ureadahead.conf.disabled by cloud-init'
Processing triggers for libc-bin (2.23-0ubuntu11.2) ...
Processing triggers for initramfs-tools (0.122ubuntu8.17) ...
update-initramfs: Generating /boot/initrd.img-4.10.0-26-generic
W: Possible missing firmware /lib/firmware/ast_dp501_fw.bin for module ast
W: mdadm: /etc/mdadm/mdadm.conf defines no arrays.
Unsupported platform on EFI system, doing nothing.

on an apt-get upgrade .. Hopefully this isn't a problem ...

Current plan (since the machine is unusable for most practical purposes just now) is to attempt a release upgrade on it.

@sxa
Member

sxa commented Jan 5, 2021

Sample errors seen on some of the test jobs on the same systems:

Receiving objects:  19% (2186/11502), 13.37 MiB | 13.37 MiB/s
Receiving objects:  20% (2301/11502), 13.37 MiB | 13.37 MiB/s
error: RPC failed; curl 56 GnuTLS recv error (-24): Decryption has failed.
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

or

Receiving objects:  12% (1381/11502)
error: RPC failed; curl 56 GnuTLS recv error (-12): A TLS fatal alert has been received.
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

@M-Davies M-Davies added aarch Issues that affect or relate to the aarch ARCHITECTURE openj9 Issues that are enhancements or bugs raised against the OpenJ9 group x-linux Issues that affect or relate to the x64/x32 LINUX OS labels Jan 5, 2021
@sxa
Member

sxa commented Jan 5, 2021

OK ...

For now I have enabled the dockerBuild tag on docker-packet-ubuntu1604-armv8-1 (Not a ThunderX system) with three executors to see how it works - initial testing has shown that it's safe on those but we'll see ...

Separately I have enabled multiple docker containers under build-packet-ubuntu1804-armv8l-1 (Five Ubuntu, four Fedora, all limited to use 8 of the 64-cores each) which are to be used for testing - this is also not a ThunderX system and has so far not exhibited the crypto issues. I'll enable further distributions if this looks stable

@sxa
Member

sxa commented Jan 5, 2021

Ref OpenJ9 issue: eclipse-openj9/openj9#9046

@sxa
Member

sxa commented Feb 5, 2021

Closing; we will pursue any subsequent remediation work we can identify under adoptium/infrastructure#1897

@sxa sxa closed this as completed Feb 5, 2021
temurin-build automation moved this from In Progress to Done Feb 5, 2021