Crash in a scheduler exhaustion example #2441

Closed
d-led opened this issue Dec 22, 2017 · 16 comments
Labels
bug (Something isn't working) · help wanted (Extra attention is needed) · needs investigation (This needs to be looked into before its "ready for work")

Comments

@d-led

d-led commented Dec 22, 2017

Hi,

I was writing an example of how monopolizing the scheduler threads with tight loops could lead to standstill. Expected behavior: some output, then full CPU load, but no output. However, the program seems to crash pretty soon (in release mode). The same program does not crash in debug mode:

use "time"
use "collections"

class Tick is TimerNotify
  let _env : Env
  var _counter: U64 = 0

  new iso create(env: Env) =>
    _env = env

  fun ref _next(): String =>
    _counter = _counter + 1
    _counter.string()

  fun ref apply(timer: Timer, count: U64): Bool =>
    _env.out.print("Tick "+_next())
    true

actor Loader
  var _counter: U64 = 0
  let _env : Env

  new create(env: Env) =>
    _env = env

  be spin() =>
    _spin()

  fun ref _spin() =>
    while _counter >=  0 do
      _counter = _counter + 1
    end
    _env.out.print(_counter.string())


class Load is TimerNotify
  let _env : Env
  let _loaders: List[Loader] = List[Loader]
  var _counter: U64 = 0

  new iso create(env: Env) =>
    _env = env

  fun ref _next() =>
    _counter = _counter + 1
    _env.out.print("Started task "+_counter.string())
    let l = Loader(_env)
    _loaders.push(l)
    l.spin()

  fun ref apply(timer: Timer, count: U64): Bool =>
    _next()
    true


actor Main
  new create(env: Env) =>
    let timers = Timers
    let timer = Timer(Tick(env), 0, 1_000_000_000 /*1s*/)
    timers(consume timer)
    env.out.print("Started the ticker")
    let load_timer = Timer(Load(env), 0, 2_000_000_000 /*2s*/)
    timers(consume load_timer)

on Windows: ponyc && test.exe → crash, but ponyc --debug && test.exe → no crash

0.21.0-acd811b [release]
compiled with: llvm 3.9.1 -- msvc-15-x64

same on OSX:

0.21.0 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 9.0.0 (clang-900.0.39.2)

For a cross-check, 0.20 crashes too:

0.20.0-0b2a2d2 [release]
compiled with: llvm 3.9.1 -- msvc-14-x64
@d-led
Author

d-led commented Dec 22, 2017

P.S. No rush, by the way. I'm reporting it since Pony's stability might be affected.

@d-led
Author

d-led commented Dec 22, 2017

Curiously, adding some calculation in the tight loop does make the program run without a crash:

      if (_counter % 1000000000) == 0 then
        _env.out.print("ping")
      end

https://gist.github.com/d-led/a92c1e4aa51b87f587976995d397dbcb

I'd expect that the existing tight loops would continue to create output, though
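
For reference, a minimal sketch of how that check could sit inside Loader._spin() from the example above (the exact placement is in the linked gist; putting it inside the loop body is an assumption):

  // Sketch only: the modulus and "ping" message come from the snippet above;
  // the placement inside the loop body is an assumption.
  fun ref _spin() =>
    while _counter >= 0 do
      _counter = _counter + 1
      if (_counter % 1000000000) == 0 then
        _env.out.print("ping")
      end
    end
    _env.out.print(_counter.string())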

@SeanTAllen
Member

SeanTAllen commented Dec 23, 2017

With the debug runtime but a release build, it looks like a GC bug.

(lldb) run
Process 8042 launched: './issue-2441' (x86_64)
Started the ticker
Tick 1
Started task 1
Process 8042 stopped

* thread #2: tid = 0x60e437, 0x0000000100000e7b issue-2441`Env_Trace + 11, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
    frame #0: 0x0000000100000e7b issue-2441`Env_Trace + 11
issue-2441`Env_Trace:
->  0x100000e7b <+11>: movq   0x10(%rbx), %rsi
    0x100000e7f <+15>: leaq   0x283ea(%rip), %rdx
    0x100000e86 <+22>: movl   $0x2, %ecx
    0x100000e8b <+27>: callq  0x100015c10    ; pony_traceknown at trace.c:109

(lldb) bt
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.

* thread #2: tid = 0x60e437, 0x0000000100000e7b issue-2441`Env_Trace + 11, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
  * frame #0: 0x0000000100000e7b issue-2441`Env_Trace + 11
    frame #1: 0x0000000100013a2e issue-2441`ponyint_gc_handlestack(ctx=0x00000001097ff7c8) + 94 at gc.c:619
    frame #2: 0x0000000100015999 issue-2441`pony_recv_done(ctx=0x00000001097ff7c8) + 89 at trace.c:65
    frame #3: 0x0000000100001433 issue-2441`Loader_Dispatch + 51
    frame #4: 0x00000001000094fb issue-2441`handle_message(ctx=0x00000001097ff7c8, actor=0x00000001298f4400, msg=0x00000001298f4200) + 635 at actor.c:164
    frame #5: 0x0000000100008fbe issue-2441`ponyint_actor_run(ctx=0x00000001097ff7c8, actor=0x00000001298f4400, batch=100) + 174 at actor.c:226
    frame #6: 0x000000010001d4c5 issue-2441`run(sched=0x00000001097ff780) + 261 at scheduler.c:527
    frame #7: 0x000000010001cc19 issue-2441`run_thread(arg=0x00000001097ff780) + 57 at scheduler.c:568
    frame #8: 0x00007fff9db1899d libsystem_pthread.dylib`_pthread_body + 131
    frame #9: 0x00007fff9db1891a libsystem_pthread.dylib`_pthread_start + 168
    frame #10: 0x00007fff9db16351 libsystem_pthread.dylib`thread_start + 13

@SeanTAllen
Member

The bug is triggered by the _spin() method.

Not sure why though.

@SeanTAllen
Member

I don't have time to investigate further but... this happens in release but not debug. I'm pretty sure this is the result of an LLVM optimization of:

    while _counter >=  0 do
      _counter = _counter + 1
    end

Given that the condition will always be true, I'm interested to see what LLVM is producing in release mode as an optimization.

This, for example, has no segfaulting issues:

  fun ref _spin() =>
    _counter = 0
    while _counter < U64.max_value() do
      _counter = _counter + 1
    end
    _env.out.print(_counter.string())

@SeanTAllen
Member

@d-led you should be able to keep working and get your desired effect with this:

  fun ref _spin() =>
    while _counter < U64.max_value() do
      if _counter == (U64.max_value() - 1) then
        _counter = 0
      end
      _counter = _counter + 1
    end
    _env.out.print(_counter.string())

@d-led
Author

d-led commented Dec 23, 2017

@SeanTAllen thanks for your investigation! Will continue poking around + use your proposal

@SeanTAllen
Member

Sylvan reports this does not segfault for him on Windows with LLVM 3.9.1

@mfelsche
Contributor

I was able to reproduce this on Ubuntu Linux with Pony:

0.21.2-28b67dd22 [debug]
compiled with: llvm 3.9.1 -- cc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609

@sylvanc
Contributor

sylvanc commented Jan 31, 2018

I changed spin to start with _counter at 1 and test for _counter > 0, and there was no segfault. This is on master, using Windows + LLVM 3.9.1.

It also works if some work that can't be optimised away is done before the loop (for example, printing something to env.out).

It appears that, otherwise, Loader_Dispatch optimises away the possibility of receiving a spin message: the _spin function becomes a jmp to its own label, at some point we get a ud2 during LLVM optimisation, and then LLVM is free to assume the spin message is never received (because it would result in reaching unoptimised code). The spin message is therefore handled as if it were create, which is why it crashes in GC (it tries to deal with the env argument, which is not present).

As to why LLVM is turning that "jmp to self" into ud2, I'm not sure.
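
For concreteness, a minimal sketch of the variant described above (hypothetical, since the exact code isn't shown in the thread):

  // Sketch of the variant described above: start _counter at 1 and loop
  // while it stays above zero. Since U64 addition wraps, the loop
  // eventually terminates when _counter wraps to zero, so the optimiser
  // can no longer treat it as a side-effect-free infinite loop.
  fun ref _spin() =>
    _counter = 1
    while _counter > 0 do
      _counter = _counter + 1
    end
    _env.out.print(_counter.string())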

@Praetonus
Member

As to why LLVM is turning that "jmp to self" into ud2, I'm not sure.

Many parts of the LLVM optimiser treat infinite loops with no side-effects as causing undefined behaviour, based on the C++ standard. The LLVM IR itself doesn't define infinite loops very well. This is a recurrent topic on the LLVM mailing list. See for example here and here.

SeanTAllen added the needs investigation and bug labels and removed the bug: 1 - needs investigation label on May 12, 2020
@SeanTAllen
Member

FYI, as of 0.37.0 this is still a problem with optimized builds.

However, the error is a little different.

~/pony-scratch/2441 ➜ lldb -- ./2441
(lldb) target create "./2441"
Current executable set to '/home/sean/pony-scratch/2441/2441' (x86_64).
(lldb) run
Process 6653 launched: '/home/sean/pony-scratch/2441/2441' (x86_64)
Started the ticker
Tick 1
Started task 1
Process 6653 stopped
* thread #3, name = '2441', stop reason = signal SIGILL: illegal instruction operand
    frame #0: 0x0000000000403191 2441`Loader_Dispatch + 81
2441`Loader_Dispatch:
->  0x403191 <+81>: ud2
    0x403193:       nopw   %cs:(%rax,%rax)
    0x40319d:       nopl   (%rax)
2441`Loader_Trace:
    0x4031a0 <+0>:  movq   0x108(%rsi), %rsi
(lldb) bt
* thread #3, name = '2441', stop reason = signal SIGILL: illegal instruction operand
  * frame #0: 0x0000000000403191 2441`Loader_Dispatch + 81
    frame #1: 0x0000000000409ad6 2441`ponyint_actor_run + 214
    frame #2: 0x000000000041255f 2441`run_thread + 2287
    frame #3: 0x00007ffff7f9b609 libpthread.so.0`start_thread + 217
    frame #4: 0x00007ffff7d46103 libc.so.6`__clone + 67

@d-led
Author

d-led commented Sep 14, 2020

A side remark: the Go runtime has managed to implement tight-loop preemption.

@SeanTAllen
Member

I am no longer able to reproduce this as of ponyc 0.46.0.

@ergl
Member

ergl commented Jan 27, 2022

This probably got fixed when we upgraded to LLVM 12, which made it possible to have infinite loops at the IR level.

@d-led
Author

d-led commented Jan 27, 2022

Can confirm, nice. Although the scheduler exhaustion is not that nice - it's a trade-off :)
