New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

.data with relocations not linked in correctly #73

Closed
pqwy opened this Issue Aug 14, 2016 · 12 comments

Comments

Projects
None yet
3 participants
@pqwy

pqwy commented Aug 14, 2016

This is probably equally a bug in mirage, but raising here as it concerns only solo5.

If I compile a C source that contains, at the top level, something like

char *p = "xxx";

with the flags given by pkg-config --static --cflags ocaml-freestanding[1], and then link it in using mirage -t virtio, which passes pkg-config --variable=ldflags solo5-kernel-virtio[2] and pkg-config --static --libs mirage-solo5 ocaml-freestanding[3] to ld, the generated unikernel ends up with p == 0. I can see "xxx" in the binary, but p's value is never set to its address.

I suspect the problem is in the linker script, but so far cannot figure out how to solve it.

This becomes more interesting when the pointer is supposed to point to functions, as it breaks OCaml's custom primitive data, like zarith bignums, which in turn breaks cryptography.

[1] -nostdinc -isystem /usr/lib/gcc/x86_64-pc-linux-gnu/6.1.1/include -ffreestanding -mno-red-zone -mno-3dnow -I/home/self/.opam/system+solo5/lib/pkgconfig/../../include/ocaml-freestanding/include -I/home/self/.opam/system+solo5/lib/pkgconfig/../../include/solo5-kernel-virtio/include

[2] -nostdlib -z max-page-size=0x1000 -static -T /home/self/.opam/system+solo5/lib/pkgconfig/../../lib/solo5-kernel-virtio/solo5.lds /home/self/.opam/system+solo5/lib/pkgconfig/../../lib/solo5-kernel-virtio/solo5.o

[3] /home/self/.opam/system+solo5/lib/pkgconfig/../../lib/mirage-solo5/libsolo5camlbindings.a /home/self/.opam/system+solo5/lib/pkgconfig/../../lib/ocaml-freestanding/libasmrun.a /home/self/.opam/system+solo5/lib/pkgconfig/../../lib/ocaml-freestanding/libotherlibs.a /home/self/.opam/system+solo5/lib/pkgconfig/../../lib/ocaml-freestanding/libnolibc.a /home/self/.opam/system+solo5/lib/pkgconfig/../../lib/ocaml-freestanding/libopenlibm.a

@djwillia

This comment has been minimized.

Member

djwillia commented Aug 15, 2016

I'm wondering if this has something to do with position independent code... maybe -fno-PIC should be somewhere as a linker option?

@mato

This comment has been minimized.

Member

mato commented Aug 16, 2016

I can't reproduce this with a plain C test case using ocaml-freestanding.

The following code:

#include <stdio.h>

char *p = "This is initialised";

int solo5_app_main(int argc, char *argv[])
{
    printf ("Hello, world!\n");
    printf ("p: %p = \"%s\"\n", p, p);
    return 0;
}

Built with this script:

#!/bin/sh

if [ $# -ne 1 ]; then
    echo usage: build.sh ukvm\|virtio
    exit 1
fi

PLATFORM=$1

PREFIX=$(opam config var prefix)
export PKG_CONFIG_PATH=${PREFIX}/lib/pkgconfig
set -xe
gcc $(pkg-config --static --cflags ocaml-freestanding) -c hello.c
ld $(pkg-config --variable=ldflags solo5-kernel-${PLATFORM}) -o hello.${PLATFORM} hello.o ${PREFIX}/lib/ocaml-freestanding/libnolibc.a ${PREFIX}/lib/ocaml-freestanding/libopenlibm.a

always gives me a *p which is initialised, for both virtio and ukvm, tested against GCC 4.9.2 (Debian 4.9.2-10) and 6.1.1 20160802 (Debian 6.1.1-11). Example:

$ qemu-system-x86_64 -vga none -nographic -kernel ./hello.virtio 
            |      ___|
  __|  _ \  |  _ \ __ \
\__ \ (   | | (   |  ) |
____/\___/ _|\___/____/
multiboot: Using memory: 0x100000 - 0x7fe0000
TSC frequency estimate is 3093620440 Hz
Hello, world!
p: 0x112408 = "This is initialised"
solo5_app_main() returned with 0
Kernel done.
Goodbye!
QEMU: Terminated

@pqwy Do you have a reproducible test case that hits the problem?

@mato

This comment has been minimized.

Member

mato commented Aug 16, 2016

Regarding -fPIC: Unfortunately adding -fno-PIC on the linker command line does not guarantee that no PIC code will get linked into the unikernel. Looking at the output of nm and readelf -S on mir-console.virtio, the presence of a .got.plt section and _GLOBAL_OFFSET_TABLE_ suggests that some PIC code is indeed getting linked in. These are not present in unikernels with no OCaml runtime linked in (such as the above example, or the standalone tests incliuded with Solo5).

I remember investigating this some time ago and deciding to ignore the problem since I could see the same happening for mirage-xen, so I assumed the code was getting linked in but never used.

@mato

This comment has been minimized.

Member

mato commented Aug 16, 2016

See my comment in #74 -- I have working crypto (static_website_tls) on the ukvm target, so I think -fPIC is not relevant here (not that we shouldn't investigate where PIC code is coming from). Will go and hunt in the virtio code to see if there's something broken in the low-level memory layout / setup.

@djwillia

This comment has been minimized.

Member

djwillia commented Aug 16, 2016

Back when we had the "loader" in the virtio boot, the 64-bit ELF wasn't getting loaded properly and there was a weird hack to clear the BSS very late in the boot. I don't think it still exists after you cleaned up the virtio boot, but I thought I'd mention it if it gives any ideas.

@mato

This comment has been minimized.

Member

mato commented Aug 16, 2016

@djwillia The multiboot specification explicitly states that the boot loader will zero BSS if those fields are filled in in the multiboot header (which they are). Out of paranoia, I added in some code to do it ourselves, but nothing changed (crypto still fails on virtio). So the problem is elsewhere...

@pqwy

This comment has been minimized.

pqwy commented Aug 16, 2016

Sorry, a few more details:

  • It happens only when the code is compiled as PIC. But ocamlopt compiles everything as PIC by default, and adds -fPIC to the invocations of the C compiler.
  • Interestingly, it happens only on virtio, but not ukvm.

Here's a fleshed-out repro: https://github.com/pqwy/abstract-type. A couple of ways to see the problem:

  • Comparing ts throws.
  • inspect_ops lets you peek into the associated structure; it's all-zeros for t and zarith's Z.t, but not for stdlib's int32, int64 and nativeint. libasmrun these come from seems not to be compiled as PIC in ocaml-freestanding.
  • canary should give you the address of a global char *.

This is the thing that's confusing me to no end: the same object file, compiled PIC, works on ukvm and doesn't work on virtio. But the link scripts are almost the same (modulo bootstrap code), dynamic linker does not get invoked in either case, and I can't find any linking code in ukvm-bin. Where does the difference come from? Why is ukvm working?

And yes, this is what's breaking nocrypto on virtio.

@pqwy

This comment has been minimized.

pqwy commented Aug 16, 2016

Test-drive:

config.ml:

open Mirage
let main =
  foreign "Unikernel.Main"
    ~libraries:["abstract-type"; "zarith"]
    (console @-> job)
let () = register "uk" [main $ default_console]

unikernel.ml:

module A = Abstract_type
module Main (C: V1_LWT.CONSOLE) = struct
  let start c =
    A.inspect_ops 1l;
    A.inspect_ops 1L;
    A.inspect_ops Z.(~$1 lsl 63);
    A.inspect_ops A.(create 20);
    C.log_s c (Format.sprintf "ptr: %d\n%!" (A.canary ()))
end

-t virtio:

$ qemu-system-x86_64 -enable-kvm -cpu host -nographic -vga none -kernel ./mir-uk.virtio
            |      ___|
  __|  _ \  |  _ \ __ \
\__ \ (   | | (   |  ) |
____/\___/ _|\___/____/
multiboot: Using memory: 0x100000 - 0x7fe0000
Initializing the KVM Paravirtualized clock.
Solo5: new bindings
STUB: getenv() called
+ ops: ops: 3070848, id: 2432524, compare: 2242944, hash: 2242976
+ ops: ops: 3070784, id: 2432521, compare: 2243056, hash: 2242992
+ ops: ops: 3071488, id: 0, compare: 0, hash: 0
+ ops: ops: 3071424, id: 0, compare: 0, hash: 0
ptr: 0

solo5_app_main() returned with 0
Kernel done.
Goodbye!

-t unix:

+ ops: ops: 8095360, id: 5212504, compare: 5141970, hash: 5141993
+ ops: ops: 8095296, id: 5212501, compare: 5141998, hash: 5142024
+ ops: ops: 8094976, id: 5209491, compare: 5070608, hash: 5071024
+ ops: ops: 8091968, id: 5205676, compare: 5011600, hash: 5011616
ptr: 5205670

-t ukvm:

Setting the CPUID.
            |      ___|
  __|  _ \  |  _ \ __ \
\__ \ (   | | (   |  ) |
____/\___/ _|\___/____/
mem_size=20000000, kernel_end=2f6000
Initializing the KVM Paravirtualized clock.
Solo5: new bindings
STUB: getenv() called
+ ops: ops: 3037952, id: 2402977, compare: 2212096, hash: 2212128
+ ops: ops: 3037888, id: 2402974, compare: 2212208, hash: 2212144
+ ops: ops: 3038592, id: 2402415, compare: 1763712, hash: 1764144
+ ops: ops: 3038528, id: 2402303, compare: 1762704, hash: 1762720
ptr: 2402297

solo5_app_main() returned with 0
Kernel done.
Goodbye!
KVM_EXIT_HLT
@pqwy

This comment has been minimized.

pqwy commented Aug 16, 2016

... and lastly, adding -fPIC to gcc in the build script yields:

p: 0 = "(null)"

... but only on virtio.

@mato

This comment has been minimized.

Member

mato commented Aug 16, 2016

@pqwy Thanks a lot for the test cases. The non-working behaviour with -fPIC code is what I'd expect. Why it works with ukvm (and presumably also xen) is a mystery. I'll keep digging.

One pointer I found which may be useful is that it turns out that on OCaml 4.03.0+ it's possible to globally override and disable all use of -fPIC by ocamlopt using OCAMLPARAM: http://caml.inria.fr/mantis/print_bug_page.php?bug_id=6167. So, we're not the first to run into this problem.

@mato

This comment has been minimized.

Member

mato commented Aug 17, 2016

I've managed to figure out why things are not working on virtio. Using the C-only test as an example, from the readelf -S output on a PIC-compiled copy of the test:

  [ 9] .data             PROGBITS         0000000000115000  00016000
       00000000000000b8  0000000000000000  WA       0     0     4096
  [10] .got.plt          PROGBITS         00000000001150b8  000160b8
       0000000000000018  0000000000000008  WA       0     0     8
  [11] .data.rel.local   PROGBITS         00000000001150d0  000160d0
       0000000000000008  0000000000000000  WA       0     0     8
  [12] .bss              NOBITS           0000000000116000  000160d8
       0000000000083874  0000000000000000  WA       0     0     4096

Note the offset of the .got.plt and .data.rel.local sections -- these are placed after .data starting at 0x1150b8. These sections are defined as follows:

    /* Read-write data (initialized) */
    .data BLOCK(4K) : ALIGN(4K)
    {
        *(.data)
    }
    _edata = .;

    /* Read-write data (uninitialized) */
    .bss BLOCK(4K) : ALIGN(4K)
    {
        *(COMMON)
        *(.bss)
    }
    _ebss = .;

Given that we don't explicitly ask the linker to place .got.plt or .data.rel.local it implicitly chooses to put these in the data segment after the explicitly defined .data section and _edata symbol.

What happens next is: Based on our multiboot header, the boot loader will (correctly) zero everything between 0x1150b8 (_edata) and 0x199874 (_ebss). This includes the content of the .got.plt [10] and .data_rel.local [11] sections, hence the value of *p in the test is NULL.

The reason ukvm is not affected is that it does not use any embedded symbols and instead zeroes BSS based on the difference between p_memsz and p_filesz in the ELF program header (see https://github.com/Solo5/solo5/blob/master/ukvm/ukvm-core.c#L289).

The fix is to explicitly add (.data.*) to the .data output section and adding an extra output section for the GOT before .data makes the C test work for me. I'll do some more testing a submit a PR to clean up / consolidate the linker scripts as part of this once I've verified that everything works as expected.

As for the question of "why does -fPIC-compiled code work in a unikernel in the first place", I think that the answer is "the linker knows it is creating an executable (as opposed to a shared object), and performs all the relocations at link time". Combined with the fact that RIP-relative addressing can be used on amd64 and arm (for the xen backend) this results in working code.

mato added a commit to mato/solo5 that referenced this issue Aug 18, 2016

New linker scripts for ukvm and virtio
- Ensure all input sections are placed in the correct output section.
  This addresses Solo5#73 by placing .data.*, .got and .got.plt in the .data
  output section.
- Unify symbols used as section delimiters to _stext, _etext, _erodata,
  _edata, _ebss and _end.
- Rename ukvm kernel_main() to _start(). The ukvm and virtio linker
  scripts are now identical with the exception of .bootstrap for virtio.

@mato mato referenced this issue Aug 18, 2016

Merged

Memory layout fixes #76

@mato

This comment has been minimized.

Member

mato commented Aug 18, 2016

Fixed via #76.

@mato mato closed this Aug 18, 2016

This was referenced Nov 21, 2016

mato added a commit to mato/solo5 that referenced this issue Nov 23, 2016

Add simple test for Solo5#73
Not much of a test, but also verifies that the memory mappings are
(somewhat) correct.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment