Opening partly written HDF5 files with happi after simulation crash #655
Comments
Admittedly the `flush_every` setting helps reduce the chance of the HDF5 file being corrupted during a write while the simulation is running, so part of what I said is not quite right, but the problem still exists: some bits of data at the end of the HDF5 files are sometimes miswritten during a crash. |
Thank you for suggesting this. I actually had the same comment a few days ago from a colleague. |
Sorry, bad code in the previous comment — this recreates the probes HDF5 file (at least I think it does) and seems to work all of the time as far as I have tested: `
` |
Just to note, when I use the above script it doesn't always work. The individual attributes must also be tested before the key is written to the new H5 file, as sometimes a key can be created correctly but not filled with attributes correctly. |
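The per-attribute testing described above can be sketched like this. This is a hypothetical recovery script, not the one exchanged in the thread; the function name `recover_probes` and the file paths are assumptions:

```python
import h5py

def recover_probes(src_path, dst_path):
    """Copy every readable dataset (and its attributes) from a possibly
    corrupted probes file into a fresh HDF5 file, skipping anything that
    fails to read. Returns the list of recovered keys."""
    recovered = []
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        # Copy file-level attributes that can be read.
        for name, value in src.attrs.items():
            try:
                dst.attrs[name] = value
            except Exception:
                pass  # skip unreadable attributes
        for key in src.keys():
            try:
                data = src[key][()]           # force a full read; fails on bad chunks
                attrs = dict(src[key].attrs)  # read each attribute individually
            except Exception:
                continue                      # skip half-written entries
            dset = dst.create_dataset(key, data=data)
            for name, value in attrs.items():
                dset.attrs[name] = value
            recovered.append(key)
    return recovered
```

The key point is that both the dataset read and the attribute reads sit inside the `try`, so a key whose data exists but whose attributes are broken is skipped rather than half-copied.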
Do you have an idea of how to reproduce this? I cannot get a corrupted file |
I just ran a simulation that got cut off in the middle. If you time it right so that the walltime falls midway through a large output, it can happen. The most reliable way I found to trigger it is with very frequent dumps: I had a probe diagnostic that was recording every 4 timesteps and then dumping every 250 timesteps. If the simulation gets cut off during a write, there was about a 50% chance the file had a couple of bad outputs that hadn't been written properly.
e.g.:
DiagProbe(
    every=4,
    number=[10],
    origin=[0.0 + 5.0 * dx],
    corners=[[Lsim - 5.0 * dx]],
    fields=["Ex", "Ey", "Ez", "Bx", "By", "Bz", "Rho_ion", "Rho_eon",
            "Jx_eon", "Jy_eon", "Jz_eon", "Jx_ion", "Jy_ion", "Jz_ion",
            "Jx", "Jy", "Jz"],
    flush_every=outputtime / 10,
)
|
It does not happen on my system for some reason. Would you be able to produce a small example and send it with dropbox or equivalent? |
Sure, I'll try finding one tomorrow
|
I've sent you a link in element chat |
I have not received it. My name on element is fredpz |
Dammit
|
I'll send again tomorrow, I can't get access right now
|
ok, sent again, hopefully right person this time :) |
I made a change for happi in the develop branch. Could you test it? |
Yes, that seems to allow me to access files which I couldn't before, thanks! That's eliminated a step which was quite annoying and will save me some time too, really appreciated! |
Just to be clear, with the old version I tried to access a broken probes0.h5 file and got this error: Traceback (most recent call last): Now I get no error and successfully build up my probe signals so I can process them properly! |
Unfortunately some of the clusters I am using occasionally crash. This can cause errors in writing to HDF5 files. Using happi I sometimes can't open HDF5 files with incompletely written lines of data, i.e. in a probe data file there are dumptimes that are half written and some bit of data is missing.
For example, I recently ran a simulation with a probe diagnostic which writes an HDF5 data set every 4 timesteps. It is quite likely that if the simulation hits the walltime before a write completes, or if the simulation crashes during a write, then there will be half-written data in the file. But the previously written data could still be useful, and the simulation may not need to be rerun.
When you use happi to open these files the following error occurs:
When I look at the file in something like HDFCompass I can see that there are two empty data lines, but the rest of the data is intact:
I think it would be relatively simple to use a try-except statement somewhere to recover this data in this scenario. It would likely also be easy to simulate the problem by taking a correctly written HDF5 file and adding a couple of unexpected data sets at the end.
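As a minimal sketch of that try-except idea (not happi's actual implementation; the file name and the assumption that each dump is a top-level dataset keyed by timestep are mine), something like this would keep whatever dump times read cleanly:

```python
import h5py

def read_valid_dumps(path):
    """Return {key: array} for every top-level dataset that can be fully
    read, silently skipping half-written or otherwise unreadable entries."""
    dumps = {}
    with h5py.File(path, "r") as f:
        for key in f.keys():
            try:
                dumps[key] = f[key][()]  # full read; raises on truncated data
            except Exception:
                print(f"skipping unreadable dump {key}")
    return dumps
```

Forcing a full read with `[()]` inside the `try` is what surfaces truncated chunks at read time, so only the broken dumptimes are lost rather than the whole file.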