Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BusyBox+hostid fix-up #540

Merged
merged 1 commit into from
Apr 16, 2021
Merged

Conversation

benjaminmordaunt
Copy link
Contributor

The version of busybox currently shipped with devlib for x86_64 produces a segfault in the hostid applet for certain IDs.
Looking through the commit history, they seem to have made a bit of a mess with format specifiers.

I've bumped the version of BB to v1.31.0 which allows LocalLinuxTarget to progress past the hostid invocation.
The behaviour of the applet is still incorrect, however, returning all zeroes for my system's hostid. Better an incorrect ID than a crash, however...

@marcbonnici
Copy link
Collaborator

That's interesting and we recently had a report of a similar issue for aarch64 so perhaps it might be time to update all our arm binaries as well, although I can't see anything in the release notes about this.

In your PR you've updated to v1.31.0 however when looking at https://busybox.net/ it appears v1.31.0 is marked as unstable and they're now up to v1.32.1 (stable).
Does the latest version work for you as well or is there some later behavior that you've seen that also causes issues?

@benjaminmordaunt
Copy link
Contributor Author

benjaminmordaunt commented Apr 13, 2021

Ah, interesting - didn't expect it to affect aarch64 too.

Yes, it seems the issue is most common when your hostid starts with 0. I believe they're still using the wrong underlying integer type for the ID, which is causing the issue.

It's likely that a new compiler check in the toolchain they're using to build is avoiding the segfault for them (it's a format spec. bug), which likely explains why you're not seeing any explicit mention of a fix (it's not fixed).

I'll look into pushing a fix upstream.

Honestly, I just went with the most recent prebuilt version on their binaries page - it doesn't look like 1.32.1 is available there yet.

I'll update this PR with at least a bump to the Arm BB version too.

@benjaminmordaunt
Copy link
Contributor Author

Retrieved binaries from the Arch distribution for arm64 and x86_64 at 1.32.1 (stable).
Note that it still returns all zeroes for my hostid on x86_64.

@marcbonnici
Copy link
Collaborator

Thanks for updating these, interestingly I've just tried the new version on my desktop and my hostid also started reporting all zeros..

I've just tried building busybox from source and this seems to work correctly, even though they both report the same version number.

./busybox_github
BusyBox v1.32.1 () multi-call binary.
./busybox_github hostid
00000000

vs

./busybox_src
BusyBox v1.32.1 (2021-04-13 12:10:59 BST) multi-call binary.
./busybox_src hostid
007f0101

Would you be able to try running this version on your system to see if that allows it to display the hostid correctly?
busybox.tar.gz

@benjaminmordaunt
Copy link
Contributor Author

That build yields a segfault again on my system.
If it's easy to do on your end, can you build this again using musl libc? If not, I'll setup a build this evening.

@benjaminmordaunt
Copy link
Contributor Author

benjaminmordaunt commented Apr 13, 2021

Just built against static musl and the issue of 00000000 hostid returned. sigh

I've found that both static and shared musl builds produce the all-zero hostid (* see below - this is expected).
My guess is that the Linux kernel header version has a role to play in this. In the build @marcbonnici provided, file reports:

busybox: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), statically linked, BuildID[sha1]=f6bd8f3578cd3abd88b06000853908a2810bae57, for GNU/Linux 3.2.0, stripped

Note the disparity between GNU/Linux 3.2.0 and my kernel version: 5.11.10. In this case, a segfault seems to occur.

The Arch package linux-headers-musl which I used to build BusyBox with musl, is only updated to 4.19.88. In this case, the all-zero hostid bug occurs.

Compiling natively on my machine with glibc and updated headers produces the correct hostid, as you observed on your own machine.

The issue here is that the "all-zero hostid" bug may very well transpose to a segfault on later versions of the Linux kernel. I'll need to do more digging with this to find a robust solution.

EDIT:
So the behaviour of musl isn't actually surprising. As /etc/hostid doesn't exist on my system, glibc does the following:

In the glibc implementation, if gethostid() cannot open the file
       containing the host ID, then it obtains the hostname using
       gethostname(2), passes that hostname to gethostbyname_r(3) in
       order to obtain the host's IPv4 address, and returns a value
       obtained by bit-twiddling the IPv4 address.  (This value may not
       be unique.)

However, hopefully the following revelation will produce some giggles: https://github.com/occlum/musl/blob/master/src/misc/gethostid.c

That's musl's implementation... 😆

My recommendation is that devlib moves away from using the hostid altogether. glibc uses arbitrary bit-fiddling to invent an ID it thinks should be unique. Other implementations just return 0...

@benjaminmordaunt
Copy link
Contributor Author

@marcbonnici For the sake of completion, could you recompile your busybox build as-is but with the -g flag. I'll debug it in GDB on my end and see what's causing the segmentation fault.

@benjaminmordaunt benjaminmordaunt changed the title Update x86_64 busybox BusyBox+hostid fix-up Apr 13, 2021
@marcbonnici
Copy link
Collaborator

That's musl's implementation...

Right... thanks for looking into this, that would certainly explain that one then :D

So it looks like we either go with musl and don't crash but always get zeros or we stick without and either segfaults or manage to obtain a usable value.
It seems like without is the better option and this becomes a best effort value where we no longer always expect the call to succeed which returns 0 ourselves for devices that cannot provide this value rather than crashing?

For the sake of completion, could you recompile your busybox build as-is but with the -g flag.

Sure busybox_unstripped.tar.gz

@benjaminmordaunt
Copy link
Contributor Author

Okay, here's the segfault:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7ef5414 in __nss_readline () from /usr/lib/libc.so.6

The NSS facility requires dynamic library loading and matching glibc versions at runtime, so it's no surprise this is failing, even if compiled as static. There is an --enable-static-nss option (or something similar) when compiling glibc.

That being said, I need libcrypt.a and I don't really fancy waiting 1+ hours to build the static glibc libs on Arch (frustrating they decided not to bundle them by default...).
Also, compiling a static glibc-based BusyBox appears frowned upon for many good reasons (this is why BusyBox uses musl by default)

Here is my proposal for the time being - use musl for max peace of mind (i.e. the binaries that are already in this PR) and emulate glibc gethostid behaviour in devlib. I'll begin work on this and attach a commit here.

@benjaminmordaunt
Copy link
Contributor Author

glibc emulation added. This will need testing.

@marcbonnici
Copy link
Collaborator

Thanks for taking a stab at this however unfortunately I don't think this approach is going to work very well. For example if we take an android device connected via usb it is unlikely that we would be able to resolve the hostname, there is no guarantee that they are connected to the same network, especially due to the fact that a lot of the devices I've tested with have their hostname set as localhost (or other non unique values) which would be more confusing than not supplying a hostid.

Here is my proposal for the time being - use musl for max peace of mind

Please correct me if I've misunderstood, but by using musl unless /etc/hostid exists on a particular device, we are guaranteed to get 0 as the returned value and this is more likely to occur rather than facing an error caused by a mis-match of libc which would otherwise would return a useable value. Although there doesn't seem to be a reliable workaround for either issue?

Also out of curiosity how many / what sort of devices have you observed this issue on?

@benjaminmordaunt
Copy link
Contributor Author

benjaminmordaunt commented Apr 13, 2021

The value returned by hostid is almost useless as a means of uniquely identifying a device and it has had a skeleton implementation in most libcs since their inception,

For example, you and I receive the same value on our test machines: 007f0101, which reverse engineered using the glibc algorithm, represents an IPv4 of 127.0.1.1. (btw newlib uses the same algo)

I don't know the origins of why hostid is regarded as useful in the case of devlib, but the changes I've proposed here are intended to maximise compatibility by reducing the risk of glibc issues (which when linked statically are frequent) while maintaining the behaviour we expect from glibc when possible.

You're correct in saying that localhost poses a problem, as the following would occur:

  1. Target reports hostname of localhost
  2. IP resolution is performed on master, yielding the host's IP.
  3. hostid ends up being that of the host, not the target.

This edge-case would require a check, and would likely end up in returning 0 again.

You're also correct in saying that, if target and host don't share the same network, IP resolution will fail, and 0 will once again be returned.

The only con to this solution is missing out on a mostly-useless value in the case that the host and target are not networked. Arguably, as the hostid is derived from the IP, it has even less meaning when the devices aren't networked together.
The reason resolution isn't performed on the target is because I'm not aware of a BusyBox facility to do that.

I suppose this comes down to preference... but statically linking glibc should not be an option.

Also out of curiosity how many / what sort of devices have you observed this issue on?

Mainly modern x86_64 Linux desktops. There's no reason the problems this PR intends to solve shouldn't occur on most other targets too, however. The issue you linked highlights this is an issue on arm too.

@marcbonnici
Copy link
Collaborator

I don't know the origins of why hostid is regarded as useful in the case of devlib,

It was actually only very recently added to devlib itself as a result of a workaround for this issue, this value is just one of many items collected from WA as we try to collect as much information about a target as possible to store as the target info, so as far as I'm aware, this is not being widely used.

The only con to this solution is missing out on a mostly-useless value in the case that the host and target are not networked. Arguably, as the hostid is derived from the IP, it has even less meaning when the devices aren't networked together.
The reason resolution isn't performed on the target is because I'm not aware of a BusyBox facility to do that.
I suppose this comes down to preference... but statically linking glibc should not be an option.

As you say, this does not seem like a reliable metric so I'm wary of adding this relativity complicated workaround for something that is unlikely to be widely used, which also changes the previously reported value, even if the new version is arguably more useful as it attempts to use the devices actual IP rather than apparently picking up the loopback device.

Do you see a downside to leaving the existing implementation and simply ignoring the error if unable to execute the command? E.g. something like:
marcbonnici@d89c454

Mainly modern x86_64 Linux desktops. There's no reason the problems this PR intends to solve shouldn't occur on most other targets too, however.

Ok thanks, I haven't been able to recreate this on my devices so was just looking for some more data points.

@benjaminmordaunt
Copy link
Contributor Author

Do you see a downside to leaving the existing implementation and simply ignoring the error if unable to execute the command? E.g. something like:
marcbonnici/devlib@d89c454

Yes, this works too. The only qualm I have with this is that it leaves a broken BusyBox in place, which is likely to present issues in the future, as glibc makes extensive use of dlopen internally.

@marcbonnici
Copy link
Collaborator

I agree having an reliable option would definitely be the best outcome here.

On that topic I noticed on the busybox site that there are 3 options to statically link against glibc, musl and uclibc, we've explored the initial 2 but I hadn't looked at uclibc yet which apparently does have an implementation for gethostid [1]

Testing the pre provided uclibc binary [2] worked (and provided consistent output) on all of my devices therefore I wondering if you could/had tested this on yours and if so which behavior you saw?

[1] https://git.uclibc.org/uClibc/tree/libc/inet/hostid.c
[2] https://busybox.net/downloads/binaries/1.31.0-i686-uclibc/

@benjaminmordaunt
Copy link
Contributor Author

Testing the pre provided uclibc binary [2] worked (and provided consistent output) on all of my devices therefore I wondering if you could/had tested this on yours and if so which behavior you saw?

I think I spotted that there was no x86_64 prebuilt version and sidestepped uclibc-ng, only to totally forget of its existence. I'll setup an x86_64 build with uclibc-ng and post it here for you to test.

@benjaminmordaunt
Copy link
Contributor Author

benjaminmordaunt commented Apr 14, 2021

@marcbonnici

Took twice as long as expected - forgot IPv6 support in libc! 🙄

Here: busybox-ulibc.zip
hostid appears to work correctly, as you indicated.

@marcbonnici
Copy link
Collaborator

That's great thank you, and yes that seems to do the trick on my devices as well.
Does it look like we have a winner?

@benjaminmordaunt
Copy link
Contributor Author

Looks good - I imagine the aarch64 side will need the same treatment.

@marcbonnici
Copy link
Collaborator

Cool thanks, I'll create another PR to update the remaining binaries.

Would you prefer for me to include the x86_64 binary from your PR when I update the others directly or would you like to clean up this PR and we can merge?

@benjaminmordaunt
Copy link
Contributor Author

benjaminmordaunt commented Apr 15, 2021

I'll roll-back the other patches here and am currently doing an aarch64 build. When complete, I'll patch and then this can be squashed and merged.

Would be nice to have a GitHub Actions-based CI/CD for these builds. I'm using crosstool-ng to generate these toolchains, will have a play and see what makes sense when I have time.

@benjaminmordaunt
Copy link
Contributor Author

Ready to merge. Requires test on an aarch64 device and another PR to bring these archs in line:

  • ppc64le
  • armeabi

@marcbonnici
Copy link
Collaborator

Great thanks, will give them a test.

Would you also be able to rebase your branch so that we don't have the unnecessary commits to try and keep the history clean and add a bit more detail into the commit messages about the updated binaries?

The original BusyBox binaries were statically linked
against glibc, which caused segmentation faults to occur
on __nss_* calls. This switches the libc implementation
to uClibc in both cases.

busybox ver bump to 1.32.1 for arm64 and x86_64

update x86_64-linux-uclibc busybox

update aarch64-linux-uclibc busybox
@benjaminmordaunt
Copy link
Contributor Author

Rewrote history on this PR for you, it's now a single commit with a message. Thanks.

@marcbonnici marcbonnici merged commit 5bfeae0 into ARM-software:master Apr 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants