Tests (and Lint?) freeze on Win64 #7942

Closed
tkelman opened this issue Aug 10, 2014 · 46 comments
Labels
system:windows (Affects only Windows), test (This change adds or pertains to unit tests)

Comments

@tkelman
Contributor

tkelman commented Aug 10, 2014

I've gotten this to happen on 2 different Win64 computers, one Sandy Bridge, one Haswell. Happens when running runtests.jl all in parallel, seemingly more often the more cores I use for the tests. One of the workers gets stuck on its first test - so usually linalg, but I just got it to happen even on the strings test - while the rest of the workers happily finish everything else, waiting right before running parallel at the very end like they're supposed to.

This isn't just the usual linalg slowness, I've left these going on multiple computers for half an hour or longer. The offending processes are stuck at 100% of a single core, but the memory consumption isn't changing at all.
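
The symptom described above (one core pegged, memory flat) can be turned into a rough scripted check. This is only a sketch: the `Sample` type, the `looks_stuck` name, and the thresholds are illustrative assumptions, and the samples would have to come from whatever process monitor is available.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One observation of a single process (hypothetical type for illustration)."""
    cpu_percent: float  # 100.0 means one full core
    rss_bytes: int      # resident memory at the time of the sample


def looks_stuck(
    samples: Sequence[Sample],
    min_samples: int = 5,
    cpu_threshold: float = 95.0,
    rss_tolerance: int = 0,
) -> bool:
    """Return True if every sample is CPU-bound and memory never moved.

    The defaults are guesses, not measured values; tune them per machine.
    """
    if len(samples) < min_samples:
        return False  # too little data to call it a freeze
    busy = all(s.cpu_percent >= cpu_threshold for s in samples)
    rss = [s.rss_bytes for s in samples]
    flat = max(rss) - min(rss) <= rss_tolerance
    return busy and flat
```

A worker that is merely slow (the usual linalg case) would normally show changing memory use, so it should not trip this check.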

This doesn't happen on Win32, or when JULIA_CPU_CORES=1. Any ideas how to narrow this down? OpenBLAS interaction? Win64 codegen problem? Something to do with libuv and task spawning? Ignore it and hope it doesn't show up in normal code?

@JeffBezanson JeffBezanson added this to the 0.3.1 milestone Aug 10, 2014
@vtjnash
Sponsor Member

vtjnash commented Aug 11, 2014

Can you force a backtrace (either from gdb or WinDbg)?

@tkelman
Contributor Author

tkelman commented Aug 11, 2014

I can try, but will have to get better at working gdb and/or windbg to figure out how.

@vtjnash
Sponsor Member

vtjnash commented Aug 11, 2014

gdb -p <pid> to attach gdb (in WinDbg, there is a GUI option in the File menu)
gdb> interrupt (pauses execution; in WinDbg this is a GUI option)
gdb> bt (short for "backtrace"; in WinDbg, there's a GUI option)

@tkelman
Contributor Author

tkelman commented Aug 11, 2014

Will give that a try. I'm clearly not in the habit of running the tests with julia-debug.exe: it fails the backtrace test, and I couldn't tell you (until I start bisecting) how long that's been the case.

Edit: only in win64, and still happens after deleting sys.dll which I keep forgetting to do.

@tkelman
Contributor Author

tkelman commented Aug 12, 2014

All I see with julia.exe in either debugger is ntdll!DbgBreakPoint and its 3 expected parents, nothing Julia-related.

@vtjnash
Sponsor Member

vtjnash commented Aug 12, 2014

Oh, right. It usually takes the backtrace on the wrong thread, so you need to switch back to thread 1 first (in WinDbg, this is ~0 or ~1).
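
One way to sidestep the wrong-thread problem is to dump every thread at once instead of switching manually. The sketch below only builds a gdb command line; it swaps in gdb's `thread apply all bt` for the manual switch to thread 1, and the helper name `gdb_backtrace_command` is made up for illustration. Whether a given Windows gdb build (cygwin's, for instance) behaves the same way is an assumption.

```python
from typing import List


def gdb_backtrace_command(pid: int, gdb: str = "gdb") -> List[str]:
    """Build a non-interactive gdb invocation that backtraces all threads.

    -batch : run the given commands, then exit (detaching from the process)
    -p     : attach to the running process with this pid
    -ex    : execute one gdb command after attaching
    """
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return [
        gdb,
        "-batch",
        "-p", str(pid),
        "-ex", "info threads",         # list threads so the main one can be identified
        "-ex", "thread apply all bt",  # backtrace every thread, not just the current one
    ]
```

The resulting list can be handed to `subprocess.run(...)` with the pid of the stuck worker, and the output saved for a gist.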

@tkelman
Contributor Author

tkelman commented Aug 12, 2014

Is this at all meaningful to you? https://gist.github.com/9985064c916005f0cb04

In this case there were 2 processes stuck each using 100% of a core.

@tkelman
Contributor Author

tkelman commented Aug 13, 2014

@staticfloat staticfloat modified the milestones: 0.3.2, 0.3.1 Sep 22, 2014
@pao pao changed the title from "Tests freeze on Win64" to "Tests (and Lint?) freeze on Win64" Oct 13, 2014
@pao
Member

pao commented Oct 13, 2014

(Copied from JuliaStrings/utf8proc#18)

I have managed to have a single process get stuck here, now. Dominant call stack:

[image: call_stack]

The second-place finisher has VirtualQuery at the top of the call stack.

Full Very Sleepy profile: https://dl.dropboxusercontent.com/u/16873321/capture.sleepy

@yuyichao
Contributor

So after a number of crashes (of VirtualBox) I finally managed to compile Julia on 64-bit Windows 7 in VirtualBox. @tkelman How do you usually reproduce this? (Which commit? Which test?)

@tkelman
Contributor Author

tkelman commented Jul 19, 2015

master, just run all the tests in a loop until something freezes. Watch in task manager to see whether a julia process looks stuck without memory consumption changing at all.
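
The "run in a loop until something freezes" recipe could be scripted roughly as follows. This is a sketch under assumptions: `run_until_hang` is a made-up helper, the timeout is an arbitrary stand-in for "much longer than a normal run", and the exact test command line is left to the caller. On timeout the process is deliberately left running so a debugger can still attach to it.

```python
import subprocess
import sys
from typing import Optional, Sequence, Tuple


def run_until_hang(
    cmd: Sequence[str],
    timeout_s: float,
    max_runs: int,
) -> Optional[Tuple[int, subprocess.Popen]]:
    """Run cmd repeatedly until one run exceeds timeout_s.

    Returns (run_number, process) for the first run that timed out, with the
    process still alive, or None if every run finished in time.
    """
    for run in range(1, max_runs + 1):
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Do not kill it: the point is to inspect the frozen process.
            print(
                f"run {run} exceeded {timeout_s}s; pid {proc.pid} left running",
                file=sys.stderr,
            )
            return run, proc
    return None
```

Note that the pid reported is the parent's; with parallel tests the stuck process is usually one of the workers, so Task Manager (or a process listing) is still needed to find the right pid.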

@yuyichao
Contributor

And gdb -p should work on Windows?

@tkelman
Contributor Author

tkelman commented Jul 19, 2015

Depends where you got gdb from, but that's what I was using above (usually with cygwin's gdb IIRC).

@yuyichao
Contributor

OK. I'll see if I can get it to work... BTW thanks for the Windows compilation manual, it's super clear.

@tkelman
Contributor Author

tkelman commented Jul 19, 2015

Before you start anything too time-intensive, did you build with override LLVM_DEBUG = 1 ? I caught a freeze in gdb last week but wasn't able to get much info out of it (ref https://github.com/JuliaLang/julia/pull/12109/files#r34421928) without debug symbols in llvm.

@yuyichao
Contributor

@tkelman Have you seen the msys terminal freeze when the window is resized while the Julia tests are running? Is it related to #11017?

@yuyichao
Contributor

did you build with override LLVM_DEBUG = 1 ?

I realized that I hadn't, after the LLVM compilation was done... If you have gotten it to freeze with a debug build of LLVM, I guess I'll just abort the current test and build LLVM again with debug on, especially since the terminal is frozen for the test right now...

@tkelman
Contributor Author

tkelman commented Jul 19, 2015

I haven't yet caught a freeze in gdb with a debug llvm, but I'm trying that now, and presumably if you want to watch llvm local variables and step through llvm code you'll need the debug info anyway.

@yuyichao
Contributor

I see.

Anyway, I'm building LLVM with debug on now and will see if it helps.

Given how long it took me to get the well-documented compilation working, it might take even longer to get the debugging working = =....

@yuyichao
Contributor

@tkelman Interestingly, it seems to hang the whole system (the host!!) when the freeze happens.

@tkelman
Contributor Author

tkelman commented Jul 19, 2015

Oh my. I wonder whether that teaches us anything other than that LLVM and VirtualBox are having a bad interaction? I had some similar wackiness when trying to run the cross-compiled build under wine under docker. Apparently we're venturing a bit beyond the design parameters, or something.

@yuyichao
Contributor

Actually, to be precise, I'm not sure if it is the freeze we expect. I just saw the system become very slow for a minute while the vbox display was stuck at the spawn test, after which the vbox guest terminated abnormally...

@tkelman
Contributor Author

tkelman commented Jul 19, 2015

That could possibly be related to the broken pipe warning we've been seeing since #12144? You could try checking out the immediate predecessor on master right before that was merged.

@yuyichao
Contributor

A broken pipe warning crashes vbox? Hmm...

Running the full test suite is a little too resource-consuming when I want to work on something else at the same time. I saw one instance of freezing in versioninfo() and am running that in a loop right now to see if I can get lucky.

@yuyichao
Contributor

Related question: is the LLVM build used on AppVeyor a debug build, or could it be switched to one?

@tkelman
Contributor Author

tkelman commented Jul 19, 2015

It's a release+assertions build IIRC, but I could upload a debug build if you want to test things out. Would need to tweak a couple lines in contrib/windows/msys_build.sh.

@yuyichao
Contributor

No need for now. Maybe if I couldn't get anything out of a local test (and if I still want to spend more time on it at that point).

@tkelman
Contributor Author

tkelman commented Jul 19, 2015

Thanks for trying, it's much appreciated to have anyone else looking at these long-standing serious bugs.


8 participants