Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Linux - deadlock between alsa and xcb #93

Closed
OpenDUNE opened this Issue Dec 3, 2012 · 24 comments

Comments

Projects
None yet
7 participants
@ghost

ghost commented Dec 3, 2012

http://bugs.opendune.org/view.php?id=93

svuorela wrote:
...according to #opendune it is related to alsa and xcb with one doing free and the other malloc at the same time.

The backtrace is here:

gdb) bt
#0 0xb7796424 in __kernel_vsyscall ()

0000001 0xb753adb1 in __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/i386/i686/../i486/lowlevellock.S:95
0000002 0xb74d1ba7 in _L_lock_9661 () from /lib/i386-linux-gnu/i686/cmov/libc.so.6
0000003 0xb74d03d6 in *__GI___libc_free (mem=0xb201efd8) at malloc.c:3736
0000004 0xb72e4959 in dequeue_pending_request (dpy=, req=) at ../../src/xcb_io.c:195
0000005 0xb72e5ccf in _XReply (dpy=0x8890a40, rep=0xbfaf7d80, extra=0, discard=1) at ../../src/xcb_io.c:638
0000006 0xb72e1236 in XSync (dpy=0x8890a40, discard=0) at ../../src/Sync.c:44
0000007 0xb76f1410 in ?? () from /usr/lib/libSDL-1.2.so.0
0000008 0xb76e3a6e in SDL_UpdateRects () from /usr/lib/libSDL-1.2.so.0
0000009 0xb76e3bd1 in SDL_UpdateRect () from /usr/lib/libSDL-1.2.so.0
0000010 0x0809adfc in Video_Tick () at /home/pusling/games/opendune/opendune-0.7-source/src/video/video_sdl.c:524
0000011 0x080944b6 in Timer_InterruptRun () at /home/pusling/games/opendune/opendune-0.7-source/src/timer.c:103
0000012
0000013 _int_malloc (av=, bytes=) at malloc.c:4348
0000014 0xb74cfc1b in __libc_calloc (n=1, elem_size=36) at malloc.c:4065
0000015 0xb741a1ee in pa_xmalloc0 () from /usr/lib/libpulse.so.0
0000016 0xb7401681 in ?? () from /usr/lib/libpulse.so.0
0000017 0xb70dec79 in pa_pdispatch_register_reply () from /usr/lib/libpulsecommon-0.9.23.so
0000018 0xb740ccdf in pa_stream_update_timing_info () from /usr/lib/libpulse.so.0
0000019 0xb740cf4d in ?? () from /usr/lib/libpulse.so.0
0000020 0xb7410586 in pa_stream_cork () from /usr/lib/libpulse.so.0
0000021 0xb7764480 in ?? () from /usr/lib/i386-linux-gnu/alsa-lib/libasound_module_pcm_pulse.so
0000022 0xb7650b69 in ?? () from /usr/lib/i386-linux-gnu/libasound.so.2
0000023 0xb76045df in snd_pcm_start () from /usr/lib/i386-linux-gnu/libasound.so.2
0000024 0xb760c777 in ?? () from /usr/lib/i386-linux-gnu/libasound.so.2
0000025 0xb76503bf in ?? () from /usr/lib/i386-linux-gnu/libasound.so.2
0000026 0xb76052a4 in snd_pcm_writei () from /usr/lib/i386-linux-gnu/libasound.so.2
0000027 0x080562aa in DSP_Play (data=0x88cad72 "\001("") at /home/pusling/games/opendune/opendune-0.7-source/src/audio/dsp_alsa.c:134
0000028 0x08055994 in Driver_Voice_Play (data=0x88cad58 "Creative Voice File\032\032", arg0A=11) at /home/pusling/games/opendune/opendune-0.7-source/src/audio/driver.c:272
0000029 0x08059006 in Voice_PlayAtTile (voiceID=42, position=...) at /home/pusling/games/opendune/opendune-0.7-source/src/audio/sound.c:128
0000030 0x08098e13 in Unit_CreateBullet (position=..., type=UNIT_MISSILE_TURRET, houseID=1 '\001', damage=30, target=16420) at /home/pusling/games/opendune/opendune-0.7-source/src/unit.c:1936
0000031 0x08089cf0 in Script_Structure_Fire (script=0x80c9074) at /home/pusling/games/opendune/opendune-0.7-source/src/script/structure.c:534
0000032 0x08088512 in Script_Run (script=0x80c9074) at /home/pusling/games/opendune/opendune-0.7-source/src/script/script.c:445
0000033 0x0808fe5f in GameLoop_Structure () at /home/pusling/games/opendune/opendune-0.7-source/src/structure.c:333
0000034 0x08082c8c in GameLoop_Main () at /home/pusling/games/opendune/opendune-0.7-source/src/opendune.c:2281
0000035 0x08082eeb in main (argc=1, argv=0xbfaf8ea4) at /home/pusling/games/opendune/opendune-0.7-source/src/opendune.c:2396

@ghost ghost closed this Dec 3, 2012

Owner

TrueBrain commented Dec 14, 2012

Problem is rare, but does happen. A perfect solution would to remove the signal, and use an event loop. The in-between solution would be to temporary disable signals when ALSA is called, as it rarely calls (c)alloc.

@TrueBrain TrueBrain reopened this Dec 14, 2012

Owner

TrueBrain commented Dec 14, 2012

johan wrote:
I'm not so sure about rare. I just found out about opendune today and the game has hung twice, both within 0000001:0000005 minutes of starting level 1/Atreides.

I'm using Xubuntu 12.10 (amd64) on an Atom processor. Maybe the slow processor makes it more likely to occur?

In any event, I'll try to confirm that this is the issue via gdb.

Owner

TrueBrain commented Dec 14, 2012

The problem comes from the fact that malloc/free are not reentrant. X11 does a free, signal handler gets called, ALSA calls malloc, deadlock.

Owner

TrueBrain commented Feb 8, 2013

Three branches of solution exists in the minds of the devs at the moment, with the latest fixes for MPU:

  1. put the DSP in its own little thread. This is not trivial, as the code is not written with that in mind. But the MPU is a nice template how it can be done, so it might be easy enough.

  2. move the video-tick to the main thread, by checking all the loops that need to call it. This is one of the more poor solutions, as there is absolutely no control over how often the screen is redrawn etc, which can redraws to be missed, or to be done too often. Either way, we would lose the control (where we have it now).

  3. put all the while(true) {} loops of the whole game in a single function, like any modern game, and apply 2) after that. This is the cleanest solution of them all, as it would make the game like any other event-based game. It would also allow us to limit fps (means less CPU wasted etc), handle keyboard and mouse how it should be done (not by putting them ready, and letting the signal-handler pick up on them and handle them after that), and many more code-wise ugly parts. Sadly, this solution would take the longest to implement, as there are roughly 60+ while (true) {} loops in the code. On a positive side, initial work on this has started.

This to keep everyone up-to-date on the progress. It is an annoying issue :(

Contributor

wangds commented Feb 8, 2013

In the original game, carryalls' shadows would flicker a little bit. I'm not exactly sure how it does that, but I think it is probably just a quirk due to how the drawing was implemented. Will we lose that if 2 or 3 are implemented?

Contributor

rofl0r commented Feb 9, 2013

  1. will be very hard - you'd need to protect all usages of global variables/objects using locks (proper locks, that means mutexes and not global variables like it's currently done - those are subject to race conditions as well)
Contributor

miniupnp commented Mar 21, 2013

Is someone actively working on 3) ?

Owner

TrueBrain commented Mar 21, 2013

glx22 did make a start; but no, I would guess not. It also turned out to be not THAT big of a job, just something you need to sit for and do.

Contributor

rofl0r commented Mar 31, 2013

my pull request more or less does 2) and it works pretty well
it's basically like Application.ProcessMessages() in Delphi.
there's no design problem with that at all, you just need to do the call whenever the screen contents are updated.
there seems to be one spot that my pull request misses currently, that is when a sandworm moves there will be relics on the position it left (unless you scroll around or another event that redraws the entire screen happens after it moves). however this is not a big issue and seems to be quite easy to fix.

Owner

Xaroth commented Mar 31, 2013

I recall several discussions regarding your 'idea' on how to fix it.. to summarize it (again).. just because it works, doesn't mean it's the right sollution.. I really don't want code to be included in the codebase that gets close to VB's "On Error Resume Next" ...

Contributor

rofl0r commented Mar 31, 2013

where's the relation with VB, and with exception handling ?

anyway: my point here is to clarify that my pull request implements point 2) as specified above by @TrueBrain and it works well (contrary to the current code in this repo, which is unplayable on linux).

since @wangds and @miniupnp are now in charge, i'd be interested to hear their opinion on the matter.

Owner

TrueBrain commented Mar 31, 2013

Well, for once you are polite, I give you credit for that. What I don't give you credit for, is that you keep trying over and over and over and over again to make us commit code that is bad, no matter how many problems it fixes. You do not make half-baked solutions and hope for the best. That is not progress, that is short-sight-solution-solving. If you go that road, you end up with something like Windows ME. You remember that OS? It solved a lot of problems, but was TERRIBLE.

So once again: sure, your solution works, but it is not a solution: it is a hack, a workaround, a quick-fix, a look-I-made-the-code-more-ugly-but-now-at-least-it-works.

There the VB reference comes in: "On Error Resume Next" was the go-to solution of Microsoft for problems. It crashed many many many MANY IISes for months at end. It was a hack, and it became mainstream because nobody made a proper solution.

So if instead of wasting your own time, and ours, in trying over and over again to get a solution accepted which we declined over and over for the same reason, and you put that time to good use and work on a real proper solution, we might get somewhere.

So please, stop this badgering, and either make a proper solution (which means: not a hack), or use your own fork.

Contributor

rofl0r commented Mar 31, 2013

so you're stating that

  1. the code is bad
  2. the code is half-baked
  3. the code is short-sighted
  4. the code is ugly
  5. the code is a hack
    please provide arguments for your claims.
    did you even read the code ?
    i have the feeling that you have a personal problem with me and thus whatever i do or say is automatically bad in your eyes.
    from my point of view, my patch is elegant, it even reduces the code needed, and solves the problem.
    since 1) is completely infeasible due to all the globals, so there is only 2) and 3) left, of which theoretically 3 would be nicer, but it would require insane amounts of refactoring, so 2) is the only viable way.
Owner

TrueBrain commented Mar 31, 2013

If I would have a personal problem with you, would we be having this conversation? You are getting close to me having a person problem with you btw ... as you cannot take no for an answer. All you read is: my patch is awesome, and I don't want to hear anything different. Even if we present why it is bad, you bluntly ignore it, weeks later come back, telling the patch is still awesome. For the same reasons, it is still not. But we go round and round and round ... getting rather tired of it.

Contributor

rofl0r commented Mar 31, 2013

you're still not explaining why the code is ugly, bad, a hack, etc.
in the pull request #157 you only nagged about introducing braces for one liners (resulting from debug code where i needed to insert another line) and doxygen stuff.
this can of course be fixed easily to meet your expectations.
i'm interested in a technical review, not a code-style review.

if you actually look at the code: https://github.com/OpenDUNE/OpenDUNE/pull/174/files
"Showing 9 changed files with 67 additions and 147 deletions"
the pull request more or less only removes code and changes at most 20 lines (if we subtract the whitespace changes)
so there's not really much there that could be "hackish", unless you think it's hackish to remove broken code.

Owner

TrueBrain commented Mar 31, 2013

It is sad that you fail to see the argumentation presented and given to you through-out all the tickets, PRs, and IRC talk. You seem either ignorant or blind for why that is not a real solution, but what is called a hack, in programming terms. But for the sanity of this ticket, as this is getting out of control already as it is, let me sum it up once more for you. I do promise you it is the last time I waste any of my time on this, as it useless to debate with you; you are right, we are wrong, we are the bad guys, and I don't know what more already came our way. Of course, from your point of view that is most likely true, as we don't like something you wrote and you consider "perfect". That is fine, we all have our own opinion. Don't care. My (and Xaroth's) project, you can happily fork away what ever you want.

So, here it comes (again), worded differently a few times:

Your patch is a hack, as you hack yourself into a layer you should not be hacking into. Doing any processing like that in a while (true) {} loop, in the hope you got them all (and you already indicated, you missed one, which just proofs the point more), is bad. You are just doing the bare minimum to get something to work, instead of making a real solution, which doesn't hide the code away like this.

To explain it another way: as already indicated in 2), even 2) is a poor solution. Your solution, which does more than 2) (it pulls out the whole interrupt, including all the other timers, which are basically fine to be there), is just an extension of that already poor solution. It was merely mentioned in the list for documentation purposes; not as a direction we would like to go. (in fact, the idea was meant differently, but I agree that the wording is poor, so I can understand it can be seen as a direction to go; it shouldn't be.)

Lastly, the proper solution should be in terms of reducing while (true) {} loops, as that is not a modern approach to the problem they tried to solve. The current code would, in which ever solution, never allow any real refactoring or creating any base for modding in terms of maintainability and error-free code; the control flow is basically fucked to hell as it is. Your solution is just an extension of it: I rather avoid building on top of an already flawed design, just "so it works".

To put this last one differently, I have no interest (at all) in applying a fix just because it fixes something. It should be clean. And as stated earlier, 3) is not as much work as expected, and should be done. So applying your hack over a proper solution, which merely takes time, would be silly at best. Your only argumentation why 3) should not happen, is because it "require insane amounts of refactoring" and because of that only 2) is viable. This of course is a non-sense argument; OpenDUNE would not have existed if anything "insane" would not be done. It is basically code for: I am to lazy. You are trying to postpone something that should happen, no matter what. The only real and clean solution flows from 3), not 2). How ever much it would solve an issue. And that we call a hack; if something solves a symptoms, not the problem.

I like how someone on the interwebz explained it:

"(A hack is) applying a temporary band aid to a large gaping wound. Its fixed for now, buts going to cause even more problems later." -- JD Isaacks

Bring real solution; not hacks.

So, to come back to your list:

  1. the code is bad
    Hacks are bad. Period.

  2. the code is half-baked
    Hacks are half-baked. Period.

  3. the code is short-sighted
    Hacks are short-sighted. Period.

  4. the code is ugly
    Hacks are ugly. Period.

  5. the code is a hack
    Hacks are .. hacks.

To sum it up: you see a lot of tickets (and PRs) being closed unaccepted, because presented was a hack. It is the default goto of programmers: fix the symptom you experience, and the world is well again. Any decent project cannot run on that. Soon your code is not maintainable, and spins out of control. We will reject now and in the future any patch that is a hack, as long as a proper solution is viable.

To close it up: feel free to disagree, we live in a free world (thank god for that). But we are not going to accept hacks if we can ("easily") avoid it, because hacks are, by definition I guess, all the above mentioned things.

EDIT: hmm, I should nuance that statement: there are a few cases where hacks in fact are viable and the only proper solution. They remain hacks, of course, and those are rare cases. For sure this is not one of these cases, even more as another viable solution is presented. This edit is just to make sure I don't go into history as the dude that claimed all hacks are bad :D

Contributor

miniupnp commented Mar 31, 2013

I hope my contribution will only calm things down :)
The fact is that current code crashes because it calls forbidden functions into a signal handler.
That makes the games almost unplayable under linux or OS X, because it is difficult to play more than 5 minutes without a frustrating crash. Under linux, the problem is almost gone if sound is disabled.
POSIX defined a list of C functions which are explicitly allowed in signal handlers, but malloc/free are not part of them :)

"Replacing" the signal handler with a thread is a solution to avoid the crash, but just looking 1 minute at the code, one sees that it is NOT thread safe. I think that is why the rofl0r's PR are hacks...

Personnally I really see the solution 3) (have one even loop) as the cleanest one. I would like to contribute to it, if glx22 shows what he started it would help me.

Member

glx22 commented Mar 31, 2013

I did nothing on it since 7a5ba8d

Contributor

rofl0r commented Mar 31, 2013

@miniupnp you've looked at #174 ? if you look sharply you will see that I removed the signal handler/thread code. so which part there is not threadsafe when there are no more threads running ?

Contributor

rofl0r commented Mar 31, 2013

@TrueBrain
"(A hack is) applying a temporary band aid to a large gaping wound. Its fixed for now, buts going to cause even more problems later." -- JD Isaacks
nice explanation. this explains your signal handler code quite accurately.

Owner

TrueBrain commented Mar 31, 2013

I think I have been more than patient enough. Lets see if Github blocks work as advertised.

Contributor

miniupnp commented Apr 1, 2013

@rofl0r maybe I'm mixing your two PR... I need to spend time on this to understand everything. I think we all need time to find the best solution.

miniupnp added a commit to miniupnp/OpenDUNE that referenced this issue Oct 21, 2015

-Fix(#93) : Don't use signal() to do the work
Doesn't change anything to WIN32 code which works perfectly !

It is unsafe to call most function in a signal handler : So
we only write to a volatile sig_atomic_t variable, and then
use SleepIdle() to do the real work.
Warnings are issued when some "Interrupts" are lost.

pause() replaces usleep() but waits for the signal to be
triggered. That should ensure that Timer code is executed
as soon as the signal handler returns.
When this is not the case, Warnings are issued.

Code changes are minimal, I hope that makes them easily understandable.
Contributor

miniupnp commented Oct 21, 2015

More than 2 years after, I made my mind and found a solution using pause() to wait for the SIGALRM in the "main thread" to execute Timers with the right timing.
Only very rarely there are some "interrupts" lost (ie sleepIdle() is not called before the timer is triggered). It happens mostly during startup, not during game.
I need to play some more on my old and slow Mac to see how it behaves

Contributor

miniupnp commented Jan 2, 2016

see #261

@miniupnp miniupnp closed this Jan 2, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment