Opening partly written HDF5 files with happi after simulation crash #655
Comments
Admittedly the `flush_every` setting helps reduce the chance of the HDF5 file being corrupted during a write while the simulation is running, so part of what I said is not quite right, but the problem still exists: some bits of data at the end of the HDF5 files are sometimes miswritten during a crash. |
Thank you for suggesting this. I actually had the same comment a few days ago from a colleague. |
Sorry, bad code in the previous comment — this recreates the probes HDF5 file (at least I think it does) and seems to work all of the time as far as I have tested: `
` |
Just to note, when I use the above script it doesn't always work. The individual attributes must also be tested before the key is written to the new H5 file, as sometimes a key can be created correctly but not filled with attributes correctly. |
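The per-attribute testing described above can be sketched like this. This is a hypothetical recovery script, not the one exchanged in the thread; the function name `recover_probes` and the file paths are assumptions:

```python
import h5py

def recover_probes(src_path, dst_path):
    """Copy every readable dataset (and its attributes) from a possibly
    corrupted probes file into a fresh HDF5 file, skipping anything that
    fails to read. Returns the list of recovered keys."""
    recovered = []
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        # Copy file-level attributes that can be read.
        for name, value in src.attrs.items():
            try:
                dst.attrs[name] = value
            except Exception:
                pass  # skip unreadable attributes
        for key in src.keys():
            try:
                data = src[key][()]           # force a full read; fails on bad chunks
                attrs = dict(src[key].attrs)  # read each attribute individually
            except Exception:
                continue                      # skip half-written entries
            dset = dst.create_dataset(key, data=data)
            for name, value in attrs.items():
                dset.attrs[name] = value
            recovered.append(key)
    return recovered
```

The key point is that both the dataset read and the attribute reads sit inside the `try`, so a key whose data exists but whose attributes are broken is skipped rather than half-copied.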
Do you have an idea of how to reproduce this? I cannot get a corrupted file |
I just ran a simulation that got cut off in the middle. If you time it right so that the walltime falls midway through a large output, it can happen. The most reliable way I found to trigger it is with very frequent dumps: I had a probe diagnostic that was recording every 4 timesteps and then dumping every 250 timesteps. If the simulation gets cut off during a write, there was about a 50% chance the file had a couple of bad outputs that hadn't been written properly.
e.g.:
DiagProbe(
    every=4,
    number=[10],
    origin=[0.0 + 5.0 * dx],
    corners=[[Lsim - 5.0 * dx]],
    fields=["Ex", "Ey", "Ez", "Bx", "By", "Bz", "Rho_ion", "Rho_eon",
            "Jx_eon", "Jy_eon", "Jz_eon", "Jx_ion", "Jy_ion", "Jz_ion",
            "Jx", "Jy", "Jz"],
    flush_every=outputtime / 10,
)
|
It does not happen on my system for some reason. Would you be able to produce a small example and send it with dropbox or equivalent? |
Sure, I'll try finding one tomorrow
|
I've sent you a link in element chat |
I have not received it. My name on element is fredpz |
Dammit
|
I'll send again tomorrow, I can't get access right now
|
ok, sent again, hopefully right person this time :) |
I made a change for happi in the develop branch. Could you test it? |
Yes, that seems to allow me to access files which I couldn't before, thanks! That's eliminated a step which was quite annoying and will save me some time too, really appreciated! |
Just to be clear, with the old version I tried to access a broken probes0.h5 file and got this error: Traceback (most recent call last): Now I get no error and successfully build up my probe signals so I can process them properly! |
Unfortunately some of the clusters I am using occasionally crash. This can cause errors in writing to HDF5 files. Using happi I sometimes can't open HDF5 files with incompletely written lines of data, i.e. in a probe data file there are dumptimes that are half written and some bit of data is missing.
For example, I recently ran a simulation with a probe diagnostic which writes an HDF5 data set every 4 timesteps. It is quite likely that if the simulation hits the walltime before a write completes, or if the simulation crashes during a write, then there will be half-written data in the file. But the previously written data could still be useful, and the simulation may not need to be rerun.
When you use happi to open these files the following error occurs:
When I look at the file in something like HDFCompass I can see that there are two empty data lines, but the rest of the data is intact:
I think it would be relatively simple to use a try-except statement somewhere to recover this data in this scenario. It would likely also be easy to simulate the problem by taking a correctly written HDF5 file and adding a couple of unexpected data sets at the end.
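As a minimal sketch of that try-except idea (not happi's actual implementation; the file name and the assumption that each dump is a top-level dataset keyed by timestep are mine), something like this would keep whatever dump times read cleanly:

```python
import h5py

def read_valid_dumps(path):
    """Return {key: array} for every top-level dataset that can be fully
    read, silently skipping half-written or otherwise unreadable entries."""
    dumps = {}
    with h5py.File(path, "r") as f:
        for key in f.keys():
            try:
                dumps[key] = f[key][()]  # full read; raises on truncated data
            except Exception:
                print(f"skipping unreadable dump {key}")
    return dumps
```

Forcing a full read with `[()]` inside the `try` is what surfaces truncated chunks at read time, so only the broken dumptimes are lost rather than the whole file.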