Crash in a scheduler exhaustion example #2441

Closed
d-led opened this issue Dec 22, 2017 · 16 comments
Labels
bug (Something isn't working) · help wanted (Extra attention is needed) · needs investigation (This needs to be looked into before its "ready for work")

Comments

@d-led

d-led commented Dec 22, 2017

Hi,

I was writing an example of how monopolizing the scheduler threads with tight loops could lead to standstill. Expected behavior: some output, then full CPU load, but no output. However, the program seems to crash pretty soon (in release mode). The same program does not crash in debug mode:

use "time"
use "collections"

class Tick is TimerNotify
  let _env : Env
  var _counter: U64 = 0

  new iso create(env: Env) =>
    _env = env

  fun ref _next(): String =>
    _counter = _counter + 1
    _counter.string()

  fun ref apply(timer: Timer, count: U64): Bool =>
    _env.out.print("Tick "+_next())
    true

actor Loader
  var _counter: U64 = 0
  let _env : Env

  new create(env: Env) =>
    _env = env

  be spin() =>
    _spin()

  fun ref _spin() =>
    while _counter >=  0 do
      _counter = _counter + 1
    end
    _env.out.print(_counter.string())


class Load is TimerNotify
  let _env : Env
  let _loaders: List[Loader] = List[Loader]
  var _counter: U64 = 0

  new iso create(env: Env) =>
    _env = env

  fun ref _next() =>
    _counter = _counter + 1
    _env.out.print("Started task "+_counter.string())
    let l = Loader(_env)
    _loaders.push(l)
    l.spin()

  fun ref apply(timer: Timer, count: U64): Bool =>
    _next()
    true


actor Main
  new create(env: Env) =>
    let timers = Timers
    let timer = Timer(Tick(env), 0, 1_000_000_000 /*1s*/)
    timers(consume timer)
    env.out.print("Started the ticker")
    let load_timer = Timer(Load(env), 0, 2_000_000_000 /*2s*/)
    timers(consume load_timer)

on Windows: ponyc && test.exe → crash, but ponyc --debug && test.exe → no crash

0.21.0-acd811b [release]
compiled with: llvm 3.9.1 -- msvc-15-x64

same on OSX:

0.21.0 [release]
compiled with: llvm 3.9.1 -- Apple LLVM version 9.0.0 (clang-900.0.39.2)

For a cross-check, 0.20 crashes too:

0.20.0-0b2a2d2 [release]
compiled with: llvm 3.9.1 -- msvc-14-x64
@d-led
Author

d-led commented Dec 22, 2017

P.S. No rush, by the way. I'm reporting it since Pony's stability might be affected.

@d-led
Author

d-led commented Dec 22, 2017

Curiously, adding some calculation in the tight loop does make the program run without a crash:

      if (_counter % 1000000000) == 0 then
        _env.out.print("ping")
      end

https://gist.github.com/d-led/a92c1e4aa51b87f587976995d397dbcb

I'd expect that the existing tight loops would continue to create output, though
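
For reference, a minimal sketch of how that check could sit inside Loader._spin() from the example above (the exact placement is in the linked gist; putting it inside the loop body is an assumption):

  // Sketch only: the modulus and "ping" message come from the snippet above;
  // the placement inside the loop body is an assumption.
  fun ref _spin() =>
    while _counter >= 0 do
      _counter = _counter + 1
      if (_counter % 1000000000) == 0 then
        _env.out.print("ping")
      end
    end
    _env.out.print(_counter.string())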

@SeanTAllen
Member

SeanTAllen commented Dec 23, 2017

With the debug runtime but a release build, it looks like a GC bug.

(lldb) run
Process 8042 launched: './issue-2441' (x86_64)
Started the ticker
Tick 1
Started task 1
Process 8042 stopped

* thread #2: tid = 0x60e437, 0x0000000100000e7b issue-2441`Env_Trace + 11, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
    frame #0: 0x0000000100000e7b issue-2441`Env_Trace + 11
issue-2441`Env_Trace:
->  0x100000e7b <+11>: movq   0x10(%rbx), %rsi
    0x100000e7f <+15>: leaq   0x283ea(%rip), %rdx
    0x100000e86 <+22>: movl   $0x2, %ecx
    0x100000e8b <+27>: callq  0x100015c10    ; pony_traceknown at trace.c:109

(lldb) bt
warning: could not load any Objective-C class information. This will significantly reduce the quality of type information available.

* thread #2: tid = 0x60e437, 0x0000000100000e7b issue-2441`Env_Trace + 11, stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
  * frame #0: 0x0000000100000e7b issue-2441`Env_Trace + 11
    frame #1: 0x0000000100013a2e issue-2441`ponyint_gc_handlestack(ctx=0x00000001097ff7c8) + 94 at gc.c:619
    frame #2: 0x0000000100015999 issue-2441`pony_recv_done(ctx=0x00000001097ff7c8) + 89 at trace.c:65
    frame #3: 0x0000000100001433 issue-2441`Loader_Dispatch + 51
    frame #4: 0x00000001000094fb issue-2441`handle_message(ctx=0x00000001097ff7c8, actor=0x00000001298f4400, msg=0x00000001298f4200) + 635 at actor.c:164
    frame #5: 0x0000000100008fbe issue-2441`ponyint_actor_run(ctx=0x00000001097ff7c8, actor=0x00000001298f4400, batch=100) + 174 at actor.c:226
    frame #6: 0x000000010001d4c5 issue-2441`run(sched=0x00000001097ff780) + 261 at scheduler.c:527
    frame #7: 0x000000010001cc19 issue-2441`run_thread(arg=0x00000001097ff780) + 57 at scheduler.c:568
    frame #8: 0x00007fff9db1899d libsystem_pthread.dylib`_pthread_body + 131
    frame #9: 0x00007fff9db1891a libsystem_pthread.dylib`_pthread_start + 168
    frame #10: 0x00007fff9db16351 libsystem_pthread.dylib`thread_start + 13

@SeanTAllen
Member

The bug is triggered by the _spin() method.

Not sure why though.

@SeanTAllen
Member

I don't have time to investigate further but... this happens in release but not debug. I'm pretty sure this is the result of an LLVM optimization of:

    while _counter >=  0 do
      _counter = _counter + 1
    end

Given that the condition will always be true, I'm interested to see what LLVM is producing in release mode as an optimization.

This, for example, has no segfaulting issues:

  fun ref _spin() =>
    _counter = 0
    while _counter < U64.max_value() do
      _counter = _counter + 1
    end
    _env.out.print(_counter.string())

@SeanTAllen
Member

@d-led you should be able to keep working and get your desired effect with this:

  fun ref _spin() =>
    while _counter < U64.max_value() do
      if _counter == (U64.max_value() - 1) then
        _counter = 0
      end
      _counter = _counter + 1
    end
    _env.out.print(_counter.string())

@d-led
Author

d-led commented Dec 23, 2017

@SeanTAllen thanks for your investigation! Will continue poking around + use your proposal

@SeanTAllen
Member

Sylvan reports this does not segfault for him on Windows with LLVM 3.9.1

@mfelsche
Contributor

I was able to reproduce this on Ubuntu Linux with Pony:

0.21.2-28b67dd22 [debug]
compiled with: llvm 3.9.1 -- cc (Ubuntu 5.4.0-6ubuntu1~16.04.5) 5.4.0 20160609

@sylvanc
Contributor

sylvanc commented Jan 31, 2018

I changed spin to start with _counter at 1 and test for _counter > 0, and there was no segfault. This is on master, using Windows + LLVM 3.9.1.

It also works if some work that can't be optimised away is done before the loop (for example, printing something to env.out).

It appears that, otherwise, Loader_Dispatch optimises away the possibility of receiving a spin message: the _spin function becomes a jmp to its own label, at some point we get a ud2 during LLVM optimisation, and then LLVM is free to assume the spin message is never received (because it would result in reaching unoptimised code). The spin message is therefore handled as if it were create, which is why it crashes in GC (it tries to deal with the env argument, which is not present).

As to why LLVM is turning that "jmp to self" into ud2, I'm not sure.
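
For concreteness, a minimal sketch of the variant described above (hypothetical, since the exact code isn't shown in the thread):

  // Sketch of the variant described above: start _counter at 1 and loop
  // while it stays above zero. Since U64 addition wraps, the loop
  // eventually terminates when _counter wraps to zero, so the optimiser
  // can no longer treat it as a side-effect-free infinite loop.
  fun ref _spin() =>
    _counter = 1
    while _counter > 0 do
      _counter = _counter + 1
    end
    _env.out.print(_counter.string())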

@Praetonus
Member

As to why LLVM is turning that "jmp to self" into ud2, I'm not sure.

Many parts of the LLVM optimiser treat infinite loops with no side-effects as causing undefined behaviour, based on the C++ standard. The LLVM IR itself doesn't define infinite loops very well. This is a recurrent topic on the LLVM mailing list. See for example here and here.

SeanTAllen added the needs investigation and bug labels and removed the bug: 1 - needs investigation label on May 12, 2020
@SeanTAllen
Member

FYI, as of 0.37.0 this is still a problem with optimized builds.

However, the error is a little different.

~/pony-scratch/2441 ➜ lldb -- ./2441
(lldb) target create "./2441"
Current executable set to '/home/sean/pony-scratch/2441/2441' (x86_64).
(lldb) run
Process 6653 launched: '/home/sean/pony-scratch/2441/2441' (x86_64)
Started the ticker
Tick 1
Started task 1
Process 6653 stopped
* thread #3, name = '2441', stop reason = signal SIGILL: illegal instruction operand
    frame #0: 0x0000000000403191 2441`Loader_Dispatch + 81
2441`Loader_Dispatch:
->  0x403191 <+81>: ud2
    0x403193:       nopw   %cs:(%rax,%rax)
    0x40319d:       nopl   (%rax)
2441`Loader_Trace:
    0x4031a0 <+0>:  movq   0x108(%rsi), %rsi
(lldb) bt
* thread #3, name = '2441', stop reason = signal SIGILL: illegal instruction operand
  * frame #0: 0x0000000000403191 2441`Loader_Dispatch + 81
    frame #1: 0x0000000000409ad6 2441`ponyint_actor_run + 214
    frame #2: 0x000000000041255f 2441`run_thread + 2287
    frame #3: 0x00007ffff7f9b609 libpthread.so.0`start_thread + 217
    frame #4: 0x00007ffff7d46103 libc.so.6`__clone + 67

@d-led
Author

d-led commented Sep 14, 2020

A side remark: the Go runtime has managed to implement tight-loop preemption.

@SeanTAllen
Member

I am no longer able to reproduce this as of ponyc 0.46.0.

@ergl
Member

ergl commented Jan 27, 2022

This probably got fixed when we upgraded to LLVM 12, which made it possible to have infinite loops at the IR level.

@d-led
Author

d-led commented Jan 27, 2022

Can confirm, nice. Although the scheduler exhaustion is not that nice - it's a trade-off :)
