Sporadic julia error #12
Comments
I'm worried this bug is buried deep inside Julia; haven't we evaluated the same code snippet with the same inputs dozens of times before?
Sadly, I haven't been able to make this code deterministic no matter what I do (at least with BayesianRidge, which appears to be non-deterministic because of the SVD), so the fact that it isn't reproducible isn't surprising. [edited - Cas pointed out where the attachment is]
I saw a bug like this before when I accidentally redefined one of the functions as a variable that contained a vector. @JPDarby, how often is this happening?
It happened on two separate HAL runs, both around 40-50 HAL iterations in... Annoyingly, restarting from the same database and selected basis params did not reproduce the bug.
Yeah, exactly as Cas said.
@cortner Do you have any thoughts on this issue? We're evaluating this exact code snippet 40-50 times, and then the next iteration leads to this error?
2 out of how many? If I run it 5 times, can I expect it to happen once? 10 times? 100 times?
I did 2 runs, both 40-50ish iterations, and they both ended with this error. I've restarted one of them and will see if it happens a 3rd time...
The only thing I can think of that could have happened here is what Noam said above: that some function we are trying to call has been overwritten by a variable that is an array.
Unfortunately the LOG.txt doesn't give the Julia stack trace, so I don't have a way of tracking down where the exception was thrown. Is it possible to reproduce this in pure Julia? I'm afraid I don't have the time and energy to start digging into how it is called from Python and how that might affect the results ...
I have a hard time imagining how we can reproduce this in pure julia, since it's deep into a long run. @cortner do you know where that log message is generated? julip?
definitely not julip. First time I've seen such a message. |
I guess we can tell from the python stack trace that it's just python's julia module. I'll try to see if I can find a way to add more details. I may follow up here with questions about julia's exception objects, but probably I'll be able to find the docs. |
I have a simpler idea for debugging, at least for now. @JPDarby if it's at all reproducible, I'll send you the patch so you can test it and we can get more info about what's happening. |
Yes - I figured out how to extract the julia line number where the error happens by catching the exception inside the julia code block. It just requires a patch to
See also JuliaPy/pyjulia#525 |
Basically, you just need to add
The julia code line will be reported as part of the python exception message, although keep in mind that the line numbers will be relative to the source code including the "try" line, and depending on where you start the julia relative to the python
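For reference, here's a minimal sketch of the idea in plain Julia (not the actual pyjulia patch): wrap the evaluated block in a try/catch and rethrow with the rendered backtrace, so the Julia-side location survives into the Python exception text.

```julia
# Minimal sketch (not the actual pyjulia patch): wrap the Julia block that
# Python evaluates, and rethrow with the rendered error + backtrace so the
# failing line is reported in the Python-side exception message.
try
    # ... the original Julia code block goes here ...
    error("placeholder failure")   # stand-in so the sketch is runnable
catch err
    bt = catch_backtrace()
    # showerror renders the exception together with its stack trace;
    # line numbers are relative to this wrapped block (including the try).
    error(sprint(showerror, err, bt))
end
```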
Thank you, this is the entire stacktrace (had to include to add some
It's the (very) long line above. @cortner, could this be related to the warnings?
Formatting the relevant line here again
Can you paste your
Here's my
I think line 10 is the
Interesting. I've never run more than 20 iterations in a single run, so maybe that's why I haven't run into this. I think we have to take this up with the ACE/julia experts. |
Is it happening right after some particular choice of basis parameters (e.g. new ones chosen by the optimizer)? |
Ok I understand the cause and can fix it. Well not really the cause but I have a rough idea what might have happened. How did you transfer the model / basis to different processes? |
I don't think we're doing anything active (about multiple processes) in python. Just calling
I'm setting
I'm doing this too, might this be the problem?
No they're different.
As @bernstei described above, I don't think we're using multiple process calls to Julia. It seemingly breaks after exactly 536 total calls of that function in serial.
no - threads shouldn't cause an issue. I've seen it before when we tried to copy an ACE basis to a new process. Then the anonymous function that defines the Agnesi(p, q) transform gets lost along the way. What I will do now is implement a raw Agnesi(p, q) struct without anonymous functions. I bet this will solve your problem, for now at least. But I still don't understand why the problem occurred in the first place.
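To illustrate the difference (a rough sketch with a placeholder formula, not the actual ACE1.jl Agnesi transform): a transform held as an anonymous function is tied to a compiler-generated closure type, whereas a plain struct carries only its parameters as data, and its call method is defined by the package code on every process.

```julia
# Closure-based transform (the fragile pattern): the behaviour lives in an
# anonymous function whose type is generated where it was defined.
make_transform(p, q) = r -> 1 / (1 + (r / q)^p)   # placeholder formula

# Struct-based ("raw") transform: only p and q are carried as data, and the
# call method exists wherever the package is loaded, so the object can be
# moved to another process without losing anything.
struct RawAgnesi
    p::Int
    q::Float64
end
(t::RawAgnesi)(r) = 1 / (1 + (r / t.q)^t.p)       # placeholder formula
```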
Thank you very much, regarding the |
New python process, or new julia process? Just trying to understand how this could be happening, given that we're not (as far as I know) doing anything with multiple processes. |
New julia process. We've seen a (possibly) related issue for distributed assembly, where the core problem is serializing the basis and reconstructing it elsewhere, so maybe the Python interface does something similar. Doesn't yet explain the intermittency though. |
I've seen it when copying to a new Julia process.
Yes, I can remove them. It will just fail by throwing an error.
That'd be great, thanks |
can you please try ACE1.jl v0.11.4 - see also this PR |
Thank you, running a job now. Warnings have disappeared, and I'll get back once I get to 36 HAL iterations; should be a few hours. Which is also pretty much exactly how long it takes to generate a stable ACE potential for a small molecule starting from 1 config :). Including running the DFT.
This seems resolved now, thank you! |
log.txt
I have seen this error twice now and have attached the full log. It doesn't seem specific to the basis chosen, and when I restarted HAL from the same configurations I couldn't reproduce it.