Infinite core INTERRUPTED loop #1157

Closed
jchodera opened this issue Dec 20, 2015 · 21 comments

Comments

@jchodera
Member

One Slack donor (hayesk) had his core crash repeatedly in an infinite loop while reading the checkpoint file:

11:52:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
11:52:21:WU02:FS02:Started FahCore on PID 4700
11:52:21:WU02:FS02:Core PID:4704
11:52:21:WU02:FS02:FahCore 0x21 started
11:52:21:WU02:FS02:0x21:*********************** Log Started 2015-12-08T11:52:21Z ***********************
11:52:21:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
11:52:21:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
11:52:21:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
11:52:21:WU02:FS02:0x21:Machine: 2
11:52:21:WU02:FS02:0x21:Digital signatures verified
11:52:21:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
11:52:21:WU02:FS02:0x21:Version 0.0.14
11:52:21:WU02:FS02:0x21:  Found a checkpoint file
11:52:28:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
11:53:21:WU02:FS02:Starting
11:53:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
11:53:21:WU02:FS02:Started FahCore on PID 4710
11:53:21:WU02:FS02:Core PID:4714
11:53:21:WU02:FS02:FahCore 0x21 started
11:53:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T11:53:21Z ***********************
11:53:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
11:53:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
11:53:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
11:53:22:WU02:FS02:0x21:Machine: 2
11:53:22:WU02:FS02:0x21:Digital signatures verified
11:53:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
11:53:22:WU02:FS02:0x21:Version 0.0.14
11:53:22:WU02:FS02:0x21:  Found a checkpoint file
11:53:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
11:54:21:WU02:FS02:Starting
11:54:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
11:54:21:WU02:FS02:Started FahCore on PID 4719
11:54:21:WU02:FS02:Core PID:4723
11:54:21:WU02:FS02:FahCore 0x21 started
11:54:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T11:54:21Z ***********************
11:54:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
11:54:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
11:54:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
11:54:22:WU02:FS02:0x21:Machine: 2
11:54:22:WU02:FS02:0x21:Digital signatures verified
11:54:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
11:54:22:WU02:FS02:0x21:Version 0.0.14
11:54:22:WU02:FS02:0x21:  Found a checkpoint file
11:54:24:WU00:FS01:0x18:Completed 4550000 out of 5000000 steps (91%)
11:54:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
11:55:21:WU02:FS02:Starting
11:55:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
11:55:21:WU02:FS02:Started FahCore on PID 4728
11:55:21:WU02:FS02:Core PID:4732
11:55:21:WU02:FS02:FahCore 0x21 started
11:55:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T11:55:21Z ***********************
11:55:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
11:55:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
11:55:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
11:55:22:WU02:FS02:0x21:Machine: 2
11:55:22:WU02:FS02:0x21:Digital signatures verified
11:55:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
11:55:22:WU02:FS02:0x21:Version 0.0.14
11:55:22:WU02:FS02:0x21:  Found a checkpoint file
11:55:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
11:56:21:WU02:FS02:Starting
11:56:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
11:56:21:WU02:FS02:Started FahCore on PID 4737
11:56:21:WU02:FS02:Core PID:4741
11:56:21:WU02:FS02:FahCore 0x21 started
11:56:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T11:56:21Z ***********************
11:56:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
11:56:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
11:56:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
11:56:22:WU02:FS02:0x21:Machine: 2
11:56:22:WU02:FS02:0x21:Digital signatures verified
11:56:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
11:56:22:WU02:FS02:0x21:Version 0.0.14
11:56:22:WU02:FS02:0x21:  Found a checkpoint file
11:56:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
11:56:43:WU00:FS01:0x18:Completed 4600000 out of 5000000 steps (92%)
11:57:21:WU02:FS02:Starting
11:57:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
11:57:21:WU02:FS02:Started FahCore on PID 4746
11:57:21:WU02:FS02:Core PID:4750
11:57:21:WU02:FS02:FahCore 0x21 started
11:57:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T11:57:21Z ***********************
11:57:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
11:57:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
11:57:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
11:57:22:WU02:FS02:0x21:Machine: 2
11:57:22:WU02:FS02:0x21:Digital signatures verified
11:57:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
11:57:22:WU02:FS02:0x21:Version 0.0.14
11:57:22:WU02:FS02:0x21:  Found a checkpoint file
11:57:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
11:58:21:WU02:FS02:Starting
11:58:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
11:58:21:WU02:FS02:Started FahCore on PID 4755
11:58:21:WU02:FS02:Core PID:4759
11:58:21:WU02:FS02:FahCore 0x21 started
11:58:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T11:58:21Z ***********************
11:58:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
11:58:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
11:58:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
11:58:22:WU02:FS02:0x21:Machine: 2
11:58:22:WU02:FS02:0x21:Digital signatures verified
11:58:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
11:58:22:WU02:FS02:0x21:Version 0.0.14
11:58:22:WU02:FS02:0x21:  Found a checkpoint file
11:58:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
11:59:04:WU00:FS01:0x18:Completed 4650000 out of 5000000 steps (93%)
11:59:21:WU02:FS02:Starting
11:59:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
11:59:21:WU02:FS02:Started FahCore on PID 4764
11:59:21:WU02:FS02:Core PID:4768
11:59:21:WU02:FS02:FahCore 0x21 started
11:59:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T11:59:21Z ***********************
11:59:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
11:59:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
11:59:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
11:59:22:WU02:FS02:0x21:Machine: 2
11:59:22:WU02:FS02:0x21:Digital signatures verified
11:59:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
11:59:22:WU02:FS02:0x21:Version 0.0.14
11:59:22:WU02:FS02:0x21:  Found a checkpoint file
11:59:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
12:00:21:WU02:FS02:Starting
12:00:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
12:00:21:WU02:FS02:Started FahCore on PID 4773
12:00:21:WU02:FS02:Core PID:4777
12:00:21:WU02:FS02:FahCore 0x21 started
12:00:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T12:00:21Z ***********************
12:00:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
12:00:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
12:00:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
12:00:22:WU02:FS02:0x21:Machine: 2
12:00:22:WU02:FS02:0x21:Digital signatures verified
12:00:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
12:00:22:WU02:FS02:0x21:Version 0.0.14
12:00:22:WU02:FS02:0x21:  Found a checkpoint file
12:00:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
12:01:21:WU02:FS02:Starting
12:01:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
12:01:21:WU02:FS02:Started FahCore on PID 4782
12:01:21:WU02:FS02:Core PID:4786
12:01:21:WU02:FS02:FahCore 0x21 started
12:01:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T12:01:21Z ***********************
12:01:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
12:01:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
12:01:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
12:01:22:WU02:FS02:0x21:Machine: 2
12:01:22:WU02:FS02:0x21:Digital signatures verified
12:01:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
12:01:22:WU02:FS02:0x21:Version 0.0.14
12:01:22:WU02:FS02:0x21:  Found a checkpoint file
12:01:23:WU00:FS01:0x18:Completed 4700000 out of 5000000 steps (94%)
12:01:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
12:02:21:WU02:FS02:Starting
12:02:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
12:02:21:WU02:FS02:Started FahCore on PID 4791
12:02:21:WU02:FS02:Core PID:4795
12:02:21:WU02:FS02:FahCore 0x21 started
12:02:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T12:02:21Z ***********************
12:02:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
12:02:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
12:02:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
12:02:22:WU02:FS02:0x21:Machine: 2
12:02:22:WU02:FS02:0x21:Digital signatures verified
12:02:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
12:02:22:WU02:FS02:0x21:Version 0.0.14
12:02:22:WU02:FS02:0x21:  Found a checkpoint file
12:02:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
12:03:21:WU02:FS02:Starting
12:03:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
12:03:21:WU02:FS02:Started FahCore on PID 4802
12:03:21:WU02:FS02:Core PID:4806
12:03:21:WU02:FS02:FahCore 0x21 started
12:03:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T12:03:21Z ***********************
12:03:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
12:03:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
12:03:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
12:03:22:WU02:FS02:0x21:Machine: 2
12:03:22:WU02:FS02:0x21:Digital signatures verified
12:03:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
12:03:22:WU02:FS02:0x21:Version 0.0.14
12:03:22:WU02:FS02:0x21:  Found a checkpoint file
12:03:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
12:03:42:WU00:FS01:0x18:Completed 4750000 out of 5000000 steps (95%)
12:04:21:WU02:FS02:Starting
12:04:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
12:04:21:WU02:FS02:Started FahCore on PID 4811
12:04:21:WU02:FS02:Core PID:4815
12:04:21:WU02:FS02:FahCore 0x21 started
12:04:22:WU02:FS02:0x21:*********************** Log Started 2015-12-08T12:04:21Z ***********************
12:04:22:WU02:FS02:0x21:Project: 10496 (Run 83, Clone 1, Gen 7)
12:04:22:WU02:FS02:0x21:Unit: 0x000000098ca304f555d12140ccbb628f
12:04:22:WU02:FS02:0x21:CPU: 0x00000000000000000000000000000000
12:04:22:WU02:FS02:0x21:Machine: 2
12:04:22:WU02:FS02:0x21:Digital signatures verified
12:04:22:WU02:FS02:0x21:Folding@home GPU Core21 Folding@home Core
12:04:22:WU02:FS02:0x21:Version 0.0.14
12:04:22:WU02:FS02:0x21:  Found a checkpoint file
12:04:29:WU02:FS02:FahCore returned: INTERRUPTED (102 = 0x66)
12:05:21:WU02:FS02:Starting
12:05:21:WU02:FS02:Running FahCore: /usr/bin/FAHCoreWrapper /var/lib/fahclient/cores/web.stanford.edu/~pande/Linux/AMD64/NVIDIA/Fermi/beta/Core_21.fah/FahCore_21 -dir 02 -suffix 01 -version 704 -lifeline 1087 -checkpoint 15 -gpu 1 -gpu-vendor nvidia -gpu-vendor=nvidia -noclean
12:05:21:WU02:FS02:Started FahCore on PID 4821
12:05:21:WU02:FS02:Core PID:4825
12:05:21:WU02:FS02:FahCore 0x21 started

We may want to have the client detect repeated crash cycles and abort the WU if some number is exceeded.
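
A minimal sketch of that idea, assuming the client sees every core exit and every progress report for a work unit. The class name, method names, and restart limit are hypothetical and are not taken from the client source; only the value 102 for INTERRUPTED comes from the log above.

```python
from collections import defaultdict

INTERRUPTED = 102  # from the log above: "INTERRUPTED (102 = 0x66)"


class CrashLoopDetector:
    """Counts consecutive INTERRUPTED exits per work unit (hypothetical helper)."""

    def __init__(self, max_restarts=5):
        # Assumed limit; the thread does not propose a specific number.
        self.max_restarts = max_restarts
        self.consecutive = defaultdict(int)

    def on_progress(self, wu_id):
        # Any reported progress means the core did real work, so start over.
        self.consecutive[wu_id] = 0

    def on_core_exit(self, wu_id, exit_code):
        """Return True if the work unit should be aborted rather than restarted."""
        if exit_code != INTERRUPTED:
            # Other exit codes are assumed to be handled elsewhere.
            self.consecutive[wu_id] = 0
            return False
        self.consecutive[wu_id] += 1
        return self.consecutive[wu_id] > self.max_restarts
```

Resetting the counter on progress keeps a genuine pause and resume from counting against the work unit; only back-to-back INTERRUPTED exits with no progress in between accumulate.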

(@jcoffland: You may want to restore the internal fah-client issue tracker so we can report issues like this internally.)

@jcoffland
Member

We have run into this before. When the core returns INTERRUPTED, the client assumes that it was interrupted, i.e. by a CTRL-C or other signal from the client. I believe this was fixed in the latest client, but I haven't had the time to push it through testing and get it released.

@jchodera
Member Author

Is this due to an erroneous exit code returned by the core then? If so, we can fix that.

When you say you think this is fixed, do you mean the Client has some protection from falling into an infinite loop?

@jcoffland
Member

> Is this due to an erroneous exit code returned by the core then? If so, we can fix that.

Yes. That would be a good idea anyway.

> When you say you think this is fixed, do you mean the Client has some protection from falling into an infinite loop?

Yes, that is what I meant, but I just checked the client code and it does not count this as an error. However, it will eventually detect the core as stalled, since it will fail to make progress, so it's not an infinite loop.

I think the best course of action is to fix the core return code. It should not return this code in this situation.

@nhstanley

I've run into this on vsp-fah. Is the current solution to upgrade the client?

@jcoffland
Member

It's a problem if the core is still returning INTERRUPTED. I thought this was fixed.

@nhstanley

So is there currently any solution? We already have the latest FAHClient on that machine (not from source, but the 7.4.4 release).

@jcoffland
Member

This is a bug in the OpenMM core, not the client.

@bb30994

bb30994 commented Apr 20, 2016

John: The client was advanced from 7.4.4, but those versions were never released. Fixing the bug in the FahCore is the best approach if somebody can do that.

Joseph: Are you thinking that from the perspective of the client, the problem was fixed after 7.4.4 but prior to or in 7.4.9?

@jcoffland
Member

The 7.4.9 client has a workaround for this problem, but it has never been fully beta tested, and releasing this client is not currently on the radar. The client only stops the loop eventually anyway. The best solution is to fix the core.

@jchodera
Member Author

jchodera commented Apr 21, 2016 via email

@jcoffland
Member

Please do take a stab at it. INTERRUPTED is the exit code. I'm not sure what code paths cause this. That's the problem that needs to be solved. It's an interaction between libfah and OpenMM.

@jchodera
Member Author

jchodera commented Apr 21, 2016

Sorry---what's the numerical exit code corresponding to INTERRUPTED? I can never find the header buried deep inside libfah that defines these...

@jcoffland
Member

All of the FAH related stuff is in libfah. https://github.com/FoldingAtHome/libfah/blob/master/src/fah/core/ExitCode.h

@nhstanley

nhstanley commented Apr 27, 2016

FYI, we (@cxhernandez) got around this issue on VSP-FAH by upgrading the NVIDIA driver, so that might give you a hint as to why this is happening. The current driver version is 340.65. I don't know what it was before. Carlos, please correct me if I've got any of this wrong.

@bb30994

bb30994 commented Apr 27, 2016

Wouldn't there be a number of things (like SEGFAULTs) that would cause the Core to be INTERRUPTED that might be indistinguishable from CTRL-C without additional code to differentiate between those interruptions?

Updating the NVidia driver might have helped, but it would also have temporarily cleared, say, a slow memory leak.
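
As a rough illustration of the distinction being asked about: at the OS level, a parent process can tell a child that was killed by an unhandled signal apart from one that exited with a code of its own choosing. The sketch below uses the convention of Python's `subprocess` module, which on POSIX reports a child killed by signal N as return code -N. It assumes nothing about how FAHCoreWrapper or the core actually handles signals; if the core catches a fault and then exits with 102 itself, this kind of check cannot see the difference. The function name is hypothetical.

```python
import signal

INTERRUPTED = 102  # from the client log above: "INTERRUPTED (102 = 0x66)"


def describe_exit(returncode):
    """Describe a subprocess-style return code.

    On POSIX, Python's subprocess module uses a negative return code -N
    to mean the child was terminated by signal N.
    """
    if returncode < 0:
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            # Signal number with no symbolic name on this platform.
            name = f"signal {-returncode}"
        return f"killed by {name}"
    if returncode == INTERRUPTED:
        return "core reported INTERRUPTED"
    if returncode == 0:
        return "clean exit"
    return f"exit code {returncode}"
```

For example, `describe_exit(subprocess.run(cmd).returncode)` would report a segfault that killed the process outright as "killed by SIGSEGV", while an exit with code 102 stays ambiguous without more information from the core itself.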

@cxhernandez

> Wouldn't there be a number of things (like SEGFAULTs) that would cause the Core to be INTERRUPTED that might be indistinguishable from CTRL-C without additional code to differentiate between those interruptions?

Yeah, this definitely isn't a general fix for the INTERRUPTED error, but it is potentially a case where we can create a more descriptive error code.

> Updating the NVidia driver might have helped, but it would also have temporarily cleared, say, a slow memory leak.

How could we test this?

@jchodera
Member Author

Any chance you can also look at the core logs, rather than the client logs? Those will probably contain the exact error (e.g. that there is an NVIDIA driver error).

@jcoffland
Member

There are two questions that need answers here:

  1. Why is the core failing?
  2. Which code path causes the core to return INTERRUPTED when it should return some other code?
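
To make question 2 concrete, here is a sketch of the intended behaviour: INTERRUPTED is returned only when an interrupt was really requested, and every failure maps to its own code. This is not libfah's `ExitCode.h`. Only `INTERRUPTED = 102` appears in this thread; the other names and numeric values are placeholders, and `CheckpointError` and `run_core` are hypothetical.

```python
import enum


class ExitCode(enum.IntEnum):
    # Only INTERRUPTED = 102 comes from this thread; the rest are placeholders.
    OK = 0
    INTERRUPTED = 102
    CHECKPOINT_ERROR = 200  # hypothetical value
    UNKNOWN_ERROR = 201     # hypothetical value


class CheckpointError(Exception):
    """Hypothetical error raised when a checkpoint cannot be restored."""


def run_core(step, interrupt_requested):
    """Run `step` and translate the outcome into an exit code.

    `interrupt_requested` is a callable that returns True only when the
    client really asked the core to stop.
    """
    try:
        step()
    except CheckpointError:
        return ExitCode.CHECKPOINT_ERROR
    except Exception:
        # Any other failure gets its own code instead of INTERRUPTED.
        return ExitCode.UNKNOWN_ERROR
    if interrupt_requested():
        return ExitCode.INTERRUPTED
    return ExitCode.OK
```

With a mapping like this, a failed checkpoint restore would reach the client as an error instead of looking like a requested stop.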

@jchodera
Member Author

jchodera commented Apr 28, 2016

I can address (2) by taking a pass through to clean up all of the return codes this weekend. This has been long planned as issue https://github.com/FoldingAtHome/openmm-core/issues/57

Addressing (1) requires that maintainers of projects experiencing this issue take the returned error result packets, unzip them, and look at the core logs. These result packets are in directories that begin with 0x and look like 0x347e2b5a5654cb20-20160329-131301. I believe the core log file is logfile_01.txt, and it should contain some information if a CUDA driver error is occurring.
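
A small helper along these lines could speed up that search. It is a sketch under a few assumptions: the result packets have already been unpacked into their 0x... directories, the core log really is named logfile_01.txt, and a driver failure would mention "CUDA" somewhere in the log. The function name and the search string are illustrative.

```python
from pathlib import Path


def find_cuda_errors(results_root, log_name="logfile_01.txt", needle="CUDA"):
    """Yield (packet_dir_name, line) for matching lines in unpacked core logs."""
    for packet_dir in sorted(Path(results_root).iterdir()):
        # Result packet directories are described as beginning with "0x".
        if not (packet_dir.is_dir() and packet_dir.name.startswith("0x")):
            continue
        log_path = packet_dir / log_name
        if not log_path.is_file():
            continue
        with log_path.open(errors="replace") as handle:
            for line in handle:
                # Case-insensitive substring match on the assumed keyword.
                if needle.lower() in line.lower():
                    yield packet_dir.name, line.rstrip("\n")
```

Calling `list(find_cuda_errors("/path/to/results"))` on a work server's results directory would list every packet whose core log mentions the keyword; the path here is a placeholder.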

@bb30994

bb30994 commented Apr 30, 2016

> How could we test this?

IMHO, we can't, since we don't know which of the various possible causes of INTERRUPTED produced the loop. All we can do is fix the THROW macro calls to provide more information the next time it happens.

If somebody wants to unzip the results for Project: 10496 (Run 83, Clone 1, Gen 7) they might find something useful, but I'm guessing you'll only find the Core log for somebody else completing the same WU -- not the log corresponding to the client-log posted above.

@jcoffland
Member

I believe this was solved in the core.
