Native/bytecode executables segfault on Linux when running on a Wayland compositor #74

Closed
zploskey opened this issue Mar 21, 2018 · 43 comments

Comments

@zploskey
Contributor

zploskey commented Mar 21, 2018

This happens every time. Here's the log from a clean build of the example repo:

zach@znbk:~/src/reprocessing-example$ npm run build

> reprocessing-example@ build /home/zach/src/reprocessing-example
> bsb -make-world

[2/2] Building fake_src/sdl_index.mlast.d
[2/2] Building lib.cma
[4/4] Building run_build_script
[3/3] Building lib.cma
[4/4] Building run_build_script
[3/3] Building lib.cma
[18/18] Building run_build_script
[10/10] Building lib.cma
[42/42] Building src/Reprocessing_Internal.mlast.d
[22/22] Building lib.cma
ninja: Entering directory `lib/bs/bytecode'
[4/4] Building src/IndexHot.mlast.d
[3/3] Building indexhot.byte
zach@znbk:~/src/reprocessing-example$ npm run start

> reprocessing-example@ start /home/zach/src/reprocessing-example
> ./lib/bs/bytecode/indexhot.byte

Rebuilding hotloaded module
Succesfully changed functions
Segmentation fault (core dumped)

Note this happens almost immediately, no chance to edit anything.

@Schmavery
Owner

If I understand correctly, this is just happening for hotreloading and not for a "regular" build?
I'll try to see if I can repro on any of my machines.

@zploskey
Contributor Author

zploskey commented Mar 21, 2018

It also segfaults on native builds.

$ npm run start:native

> reprocessing-example@ start:native /home/zach/src/reprocessing-example
> ./lib/bs/native/index.native

Segmentation fault (core dumped)
$ uname -a
Linux znbk.ploskey.com 4.15.9-300.fc27.x86_64 #1 SMP Mon Mar 12 17:07:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

@Schmavery
Owner

😭

@zploskey
Contributor Author

Evidently I'm generating coredumps when this happens. This is the backtrace:

{   "signal": 11
,   "executable": "/home/zach/src/reprocessing-example/lib/bs/native/index.native"
,   "stacktrace":
      [ {   "crash_thread": true
        ,   "frames":
              [ {   "address": 0
                ,   "build_id_offset": 0
                }
              , {   "address": 4554802
                ,   "build_id": "6a31abd6bbc250214c3c4942e0e9b110545afd88"
                ,   "build_id_offset": 360498
                ,   "file_name": "/home/zach/src/reprocessing-example/lib/bs/native/index.native"
                } ]
        }
      , {   "frames":
              [ {   "address": 140430986743142
                ,   "build_id": "b097b427ace57ac70bb636f7a41af7f10a69a851"
                ,   "build_id_offset": 961894
                ,   "function_name": "ppoll"
                ,   "file_name": "/lib64/libc.so.6"
                }
              , {   "address": 140430756520241
                ,   "build_id": "d8eac16837bacf679ba9f307bcf71db7bd931b33"
                ,   "build_id_offset": 151857
                ,   "function_name": "pa_mainloop_poll"
                ,   "file_name": "/lib64/libpulse.so.0"
                }
              , {   "address": 140430756521792
                ,   "build_id": "d8eac16837bacf679ba9f307bcf71db7bd931b33"
                ,   "build_id_offset": 153408
                ,   "function_name": "pa_mainloop_iterate"
                ,   "file_name": "/lib64/libpulse.so.0"
                }
              , {   "address": 140430756521936
                ,   "build_id": "d8eac16837bacf679ba9f307bcf71db7bd931b33"
                ,   "build_id_offset": 153552
                ,   "function_name": "pa_mainloop_run"
                ,   "file_name": "/lib64/libpulse.so.0"
                }
              , {   "address": 5712525
                ,   "build_id": "6a31abd6bbc250214c3c4942e0e9b110545afd88"
                ,   "build_id_offset": 1518221
                ,   "file_name": "/home/zach/src/reprocessing-example/lib/bs/native/index.native"
                } ]
        }
      , {   "frames":
              [ {   "address": 140430986743142
                ,   "build_id": "b097b427ace57ac70bb636f7a41af7f10a69a851"
                ,   "build_id_offset": 961894
                ,   "function_name": "ppoll"
                ,   "file_name": "/lib64/libc.so.6"
                }
              , {   "address": 140430756520241
                ,   "build_id": "d8eac16837bacf679ba9f307bcf71db7bd931b33"
                ,   "build_id_offset": 151857
                ,   "function_name": "pa_mainloop_poll"
                ,   "file_name": "/lib64/libpulse.so.0"
                }
              , {   "address": 140430756521792
                ,   "build_id": "d8eac16837bacf679ba9f307bcf71db7bd931b33"
                ,   "build_id_offset": 153408
                ,   "function_name": "pa_mainloop_iterate"
                ,   "file_name": "/lib64/libpulse.so.0"
                }
              , {   "address": 5712287
                ,   "build_id": "6a31abd6bbc250214c3c4942e0e9b110545afd88"
                ,   "build_id_offset": 1517983
                ,   "file_name": "/home/zach/src/reprocessing-example/lib/bs/native/index.native"
                } ]
        } ]
}

@Schmavery
Owner

😕 Looks like something dying in pulse audio... Wonder if it's a version issue of some kind. Currently at a bit of a loss but I'll give it some thought.
Maybe we could run it through a debugger of some kind and try to figure out where the segfault is happening.

One thing that seems reassuring is that it's not the actual build that's failing, it's the executable (ie our ocaml compiler binaries are probably fine)

@Schmavery
Owner

@bsansouci do you know if there's any way that we should be building this that will give us more symbols in the ocaml code?

@zploskey zploskey changed the title Native/bytecode builds segfault on Linux Native/bytecode executables segfault on Linux Mar 21, 2018
@zploskey
Contributor Author

Looking at bsc -help, I see these flags that might be relevant:

  -bs-D  Define conditional variable e.g, -D DEBUG=true
  -g  Save debugging information

The -g flag comes from the OCaml compiler. It describes it as

Add debugging information while compiling and linking. This option is required in order to be able to debug the program with ocamldebug (see chapter 17), and to produce stack backtraces when the program terminates on an uncaught exception (see section 11.2).

@Schmavery
Owner

I believe that bsb-native builds with -g by default (looking at lib/bs/bytecode/build.ninja)
@bsansouci can correct me if I'm missing something.

@zploskey
Contributor Author

Ah you're right. Attaching the debugger I see:

zach@znbk:~/src/reprocessing-example$ ocamldebug lib/bs/bytecode/indexhot.byte 
	OCaml Debugger version 4.02.3+BS

(ocd) run
Loading program... done.
Rebuilding hotloaded module
Succesfully changed functions
Lost connection with process 10955 (active process)
between time 50000 and time 60000
Restart from time 50000 and try to get closer of the problem ? (y or n) y
Lost connection with process 10965 (active process)
between time 53000 and time 54000
Lost connection with process 11043 (active process)
between time 53000 and time 53100
Lost connection with process 11046 (active process)
between time 53000 and time 53010
Lost connection with process 11058 (active process)
between time 53007 and time 53008
Time: 53007 - pc: 525040 - module Reasongl_native
486   let viewport = (~context as _, ~x, ~y, ~width, ~height) => <|b|>Gl.viewport(~x, ~y, ~width, ~height);
(ocd) print x
x: int = -1
(ocd) print y
y: int = -1
(ocd) print width
width: int = 200
(ocd) print height
height: int = 200
(ocd) bt
Backtrace:
#0 Reasongl_native /home/zach/src/reprocessing-example/node_modules/Reasongl/src/native/reasongl_native.re:486:62
#1 Reprocessing_Internal /home/zach/src/reprocessing-example/node_modules/Reprocessing/src/Reprocessing_Internal.re:53:59
#2 Reprocessing /home/zach/src/reprocessing-example/node_modules/Reprocessing/src/Reprocessing.re:124:7
(Encountered a function with no debugging information)

Are x and y ever supposed to be negative? Other than that not sure what's wrong. Let me know if there's any other info I can extract from this.

@Schmavery
Owner

I find this a little confusing as it looked like your code dump was very pulse-audio-related :/
Negative width and height passed to viewport will result in an error but afaik the same isn't true of x and y. Will give it some thought.

@zploskey
Contributor Author

The presence of pulse in the stack traces I'm getting from the core dumps might just be a coincidence (because it runs in its own thread? Not sure.). When I run it in the debugger now I don't see any mention of pulse in the stack trace from the core dump.

{   "signal": 11
,   "executable": "/home/zach/src/reprocessing-example/lib/bs/bytecode/indexhot.byte"
,   "stacktrace":
      [ {   "crash_thread": true
        ,   "frames":
              [ {   "address": 0
                ,   "build_id_offset": 0
                }
              , {   "address": 5368396
                ,   "build_id": "a9363eba5fed588881aae7ee01a5fc3da026e016"
                ,   "build_id_offset": 1174092
                ,   "file_name": "/home/zach/src/reprocessing-example/lib/bs/bytecode/indexhot.byte"
                } ]
        } ]
}

It's definitely failing as soon as I try to step into this function call to Gl.viewport. Changing the default values of x and y to 0 makes no discernible difference.

let viewport = (~context as _, ~x, ~y, ~width, ~height) => Gl.viewport(~x, ~y, ~width, ~height);

@Schmavery
Owner

Schmavery commented Mar 21, 2018

Ah, I see, I was looking at the wrong part of the stacktrace in the core dump. I think you're right that the audio stuff runs in its own thread.

I wonder if some GL function is failing quietly and whether we need to think about adding checks to glGetError in more places...

There's also a slight chance that glad (our gl loader) isn't kicking in properly and so the call to viewport itself is the problem... Kind of spitballing here.

@Schmavery
Owner

@zploskey this means reprocessing is totally dead on linux except for when building to web right? Seems pretty bad. Any idea of when this started happening for you? I'll try to look into it more tonight.

@zploskey
Contributor Author

If anyone knows when (or if) native builds were ever working on Linux please let us know.

I don't know when the problem might have been introduced since I only started trying to use this in the last few weeks (when I started filing issues). It may be a bit difficult to bisect due to the build being broken for other reasons, but I can make an attempt. Without a known working version I'll have to pick an arbitrary commit on Reasongl, I guess. Open to suggestions on where that should be. Do you have any intuition about what commits might have been a problem?

I have my eye on these commits in particular: https://github.com/bsansouci/reasongl/commits/master/src/native

It honestly might be easier to just follow up where things are going wrong when calling in to outside libs.

@bsansouci
Collaborator

Oh if there's a segfault on Gl.viewport that means GL isn't being loaded correctly.

The way GL is loaded is through glad which is a cross platform little pile of C that dynamically loads OpenGL and all of the functions you ask it to load. If you get a segfault right at glViewport that's most likely because the function itself is null. That's a symptom of misconfigured GL on load.

This can be debugged by printing in here which is the entry point to dynamically loading GL. Also maybe making sure SDL loads correctly by doing printf("fuck: %s", SDL_error()); here.
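
To illustrate the null-function-pointer failure mode described above, here is a minimal sketch in plain C. It does not use glad or OpenGL: every name in it (generic_fn, load_proc, viewport_ptr, failing_loader, stub_loader, load_viewport, safe_viewport) is hypothetical and only models the pattern of a loader filling in a function pointer at runtime. The frame at address 0 in the crashing thread of the core dumps earlier in this thread would be consistent with calling through such a pointer while it is still NULL.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Generic function-pointer type returned by a loader (hypothetical). */
typedef void (*generic_fn)(void);
/* Hypothetical loader: maps a symbol name to its address, or NULL. */
typedef generic_fn (*load_proc)(const char *name);
/* Signature of the entry point we want, modeled on glViewport. */
typedef void (*viewport_fn)(int x, int y, int width, int height);

/* Stays NULL until a loader resolves it. */
static viewport_fn viewport_ptr = NULL;
static int last_width = 0;

/* Stand-in implementation so the sketch runs without a GL driver. */
static void stub_viewport(int x, int y, int width, int height) {
    (void)x; (void)y; (void)height;
    last_width = width;
}

/* Loader that models a broken setup: nothing can be resolved. */
static generic_fn failing_loader(const char *name) {
    (void)name;
    return NULL;
}

/* Loader that models a healthy setup. */
static generic_fn stub_loader(const char *name) {
    if (strcmp(name, "glViewport") == 0) return (generic_fn)stub_viewport;
    return NULL;
}

/* Returns 1 if the entry point was resolved, 0 otherwise. */
static int load_viewport(load_proc load) {
    viewport_ptr = (viewport_fn)load("glViewport");
    return viewport_ptr != NULL;
}

/* Guarded call: reports an error instead of jumping to address 0. */
static int safe_viewport(int x, int y, int width, int height) {
    if (viewport_ptr == NULL) {
        fprintf(stderr, "viewport: entry point was never loaded\n");
        return -1;
    }
    viewport_ptr(x, y, width, height);
    return 0;
}
```

With failing_loader, load_viewport returns 0 and safe_viewport returns -1 with a message on stderr; calling viewport_ptr directly in that state is what would crash. With stub_loader the guarded call goes through normally.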

@Schmavery
Owner

Ah sorry @zploskey I misunderstood and didn't realize that this never worked for you (I had assumed that it was running back when you were fixing the audio stuff).
Thanks for writing up those pointers, Ben.

@zploskey
Contributor Author

zploskey commented Mar 29, 2018

This is probably unrelated, but trying to npm install in my tgls clone gives me this:

Building a local version of the OCaml compiler failed, check the output above for more information. A possible problem is that you don't have a compiler installed.
/home/zach/src/tgls/node_modules/bs-platform/scripts/install.js:112
            throw e;
            ^

Error: Command failed: /home/zach/src/tgls/node_modules/bs-platform/scripts/buildocaml.sh
File "_none_", line 1:
Error: I/O error: compilerlibs/ocamlbytecomp.cma: No such file or directory
make[4]: *** [Makefile:408: compilerlibs/ocamlbytecomp.cma] Error 2
make[4]: *** Waiting for unfinished jobs....
make[3]: *** [Makefile:223: coreall] Error 2
make[2]: *** [Makefile:219: core] Error 2
make[1]: *** [Makefile:287: opt.opt] Error 2
make: *** [Makefile:160: world.opt] Error 2

    at checkExecSyncError (child_process.js:601:13)
    at Object.execFileSync (child_process.js:621:13)
    at tryToProvideOCamlCompiler (/home/zach/src/tgls/node_modules/bs-platform/scripts/install.js:106:27)
    at non_windows_npm_release (/home/zach/src/tgls/node_modules/bs-platform/scripts/install.js:157:9)
    at Object.<anonymous> (/home/zach/src/tgls/node_modules/bs-platform/scripts/install.js:180:5)
    at Module._compile (module.js:652:30)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)
    at Function.Module._load (module.js:497:3)

I've tried with a compiler available through OPAM and without and still get this. May be due to a recent change in bsb-native since I'm pretty sure I could build this previously.

@Schmavery
Owner

Schmavery commented Mar 29, 2018

@zploskey hmm, not sure why you're getting this, I just tried on my mac and couldn't repro. We'll try to see if there's anything obviously wrong there.

You don't need an opam compiler installed to install bsb-native.

In the meantime, it looks like tgls uses bsb-native master, so as a workaround you can rely on bsansouci/bsb-native#2.1.1 instead for now to get the prebuilt version.

@zploskey
Contributor Author

Looks to be a regression in bsb-native. If I specify 2.1.1 it builds ok.

@zploskey
Contributor Author

On bsb-native 2.1.1, having changed all the deps to use my local clones, I get this build failure on the example project on npm run build and build:native:

/home/zach/src/reprocessing-example/node_modules/Reprocessing/node_modules/Reasongl/node_modules/Tgls/lib/SOIL.o: In function `query_DXT_capability':
SOIL.c:(.text+0x2d92): undefined reference to `gladGetProcAddressPtr'
collect2: error: ld returned 1 exit status

Also apparently it needed to be printf("fuck: %s", SDL_GetError()); lol.

@zploskey
Contributor Author

zploskey commented Mar 29, 2018

These two lines were removed from glad.h in bsansouci/tgls@8b75c25.

typedef void* (APIENTRYP PFNWGLGETPROCADDRESSPROC_PRIVATE)(const char*);
static PFNWGLGETPROCADDRESSPROC_PRIVATE gladGetProcAddressPtr;

but are needed here, at least on Linux. Forward declaring them in the IFDEF for the Linux stuff in SOIL.c fixes the build. Should these lines be in the glad header or not?

@Schmavery
Owner

Ah darn. I think what happened here is that @bsansouci regenerated the glad files but we had made some edits that were then lost. I'm fairly sure I added those back when we started using glad and that there's no issue with them being in their previous location. Thanks for tracking that one down.

@zploskey
Contributor Author

Any interest in setting up CI to catch things like this and have an automated test for Linux stuff? Preparing a PR for that now.

@Schmavery
Owner

Schmavery commented Mar 29, 2018

100% interest but haven't had a chance to do it :)
Edit: Created #77 to track the suggestion, thanks

@bsansouci
Collaborator

bsansouci commented Mar 29, 2018

re SDL_GetError my bad! Thanks for trying :)

@zploskey
Contributor Author

So it's either segfaulting before either of the suggested points where I tried to print things or the printing is somehow not making it to stdout...

@bsansouci
Collaborator

Mmmh did you call fflush(stdout) after the print? It might do some buffering...
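
For reference, a small sketch of a debug print that flushes immediately, so the message is not left sitting in a stdio buffer when the process segfaults a moment later. The helper name debug_log is made up for this example; vfprintf and fflush are standard C.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper: format a message to `out` and flush right away.
   Returns the number of characters written, or -1 if writing or flushing failed. */
static int debug_log(FILE *out, const char *fmt, ...) {
    va_list args;
    va_start(args, fmt);
    int written = vfprintf(out, fmt, args);
    va_end(args);
    if (written < 0 || fflush(out) != 0) {
        return -1;
    }
    return written;
}
```

Usage would look like debug_log(stdout, "SDL error: %s\n", message). Printing to stderr is another option, since stderr is typically not fully buffered.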

@zploskey
Contributor Author

Good call, it prints things now. Investigating.

@zploskey
Contributor Author

In gladLoadGl() and TglViewport(), SDL_GetError() returns "Invalid window".

@bsansouci
Collaborator

Mmmmh this error seems to be because the window object is null, as if there was an issue creating it.
Can you print window here?
Also could you try passing 0 instead of Int_val(flags) and see if SDL can load something? Also make sure that the demo you're building is super simple, like just calls Draw.background or something.

@Schmavery
Owner

Great job narrowing it down guys, you give me hope for the world <3

@zploskey
Contributor Author

zploskey commented Apr 19, 2018

TSDL_CreateWindow_native() doesn't seem to be getting called. Never mind, that project was not getting rebuilt properly with the changes.

@zploskey
Contributor Author

Ok, so as suspected window is null. After attempting to create the window in TSDL_CreateWindow_native, SDL_GetError returns "Couldn't find matching GLX visual".

@zploskey
Contributor Author

Good news, it works in X11. This problem only crops up when running the Wayland display server. We need to support Wayland, though, since X11 is on its way out.

@zploskey zploskey changed the title Native/bytecode executables segfault on Linux Native/bytecode executables segfault on Linux when running on a Wayland compositor Apr 20, 2018
@Schmavery
Owner

Ahh, that explains why it was working on my one linux machine... Good stuff

@bsansouci
Collaborator

Could you try running the built executable with SDL_VIDEODRIVER=wayland in front?
I remember someone figuring out that this was the only way to run an sdl2 app on Wayland :/

@Schmavery
Owner

Schmavery commented Apr 20, 2018

I'm assuming you have libwayland-dev installed?

This Linux README for SDL seems to mention libwayland-dev, libxkbcommon-dev and wayland-protocols as being necessary (Ubuntu) packages for Wayland support.
https://github.com/emscripten-ports/SDL2/blob/master/docs/README-linux.md

@zploskey
Contributor Author

Hey alright! On Fedora 27 I was able to get this working by doing this:

# Install build dependencies of SDL2-devel package (currently 2.0.7 on F27)
# List is in the rpm spec: https://src.fedoraproject.org/rpms/SDL2/blob/f27/f/SDL2.spec
sudo dnf builddep SDL2-devel
git clone https://github.com/bsansouci/reprocessing-example.git
cd reprocessing-example
npm install
npm run build:native
SDL_VIDEODRIVER=wayland ./lib/bs/native/index.native

I see at least a couple of outcomes here:

  • This should be documented, particularly setting SDL_VIDEODRIVER=wayland.
  • We should never segfault, and so should check if window == NULL after trying to create it. I already wrote this code while debugging this, so I'll open a PR for this.

@Schmavery
Copy link
Owner

Having the requirement of setting an env var to be able to run the program makes me sad 😢
Thanks for the PR/debugging

@Schmavery
Owner

Schmavery commented Apr 20, 2018

My only guess would be to look into what path the configure script is running on a wayland machine... https://github.com/bsansouci/SDL-mirror/blob/master/configure
It has several mentions of wayland and maybe something is getting confused or isn't specified correctly. I'll do some googling too.

We might want to try just explicitly enabling wayland support in the configure script???

@Schmavery
Owner

@bsansouci @zploskey ...can we detect wayland-iness in c/ocaml? We could do the niiiice and hacky solution of setting the env var manually before starting up the sdl code...
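
One way this could look in C, as a rough sketch only. The heuristic here is an assumption for illustration (checking the WAYLAND_DISPLAY and XDG_SESSION_TYPE environment variables), and the function names looks_like_wayland and prefer_wayland_video_driver are made up; it is not necessarily how the eventual fix detects Wayland. It relies on POSIX getenv/setenv and would need to run before SDL is initialized.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for setenv */
#endif
#include <stdlib.h>
#include <string.h>

/* Assumed heuristic: treat the session as Wayland when WAYLAND_DISPLAY is
   non-empty or XDG_SESSION_TYPE equals "wayland". */
static int looks_like_wayland(void) {
    const char *display = getenv("WAYLAND_DISPLAY");
    const char *session = getenv("XDG_SESSION_TYPE");
    if (display != NULL && display[0] != '\0') {
        return 1;
    }
    return session != NULL && strcmp(session, "wayland") == 0;
}

/* Hypothetical helper, meant to be called before SDL starts up.
   Returns 1 if it set SDL_VIDEODRIVER=wayland, 0 if it left the
   environment alone, -1 if setenv failed. */
static int prefer_wayland_video_driver(void) {
    if (!looks_like_wayland()) {
        return 0;
    }
    if (getenv("SDL_VIDEODRIVER") != NULL) {
        return 0; /* respect an explicit choice made by the user */
    }
    return setenv("SDL_VIDEODRIVER", "wayland", 0) == 0 ? 1 : -1;
}
```

Leaving an already-set SDL_VIDEODRIVER untouched keeps a manual override (for example forcing x11) working.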

@zploskey
Contributor Author

zploskey commented Apr 20, 2018

@Schmavery
Owner

This should be fixed now by bsansouci/reasongl#9!
If anyone still has problems with wayland, let us know and we can try to refine the detection heuristic :)
Thanks so much to @zploskey for all the help debugging, never would have been able to solve this one otherwise <3
