Handling corrupt or bad-header MRC files #544

Open
chris-langfield opened this issue Jan 18, 2022 · 0 comments
Labels
cleanup invalid This doesn't seem right

Comments

@chris-langfield
Collaborator

Some datasets contain .mrc and .mrcs files that the Python library mrcfile will not open for various reasons. Currently, we allow mrcfile to throw whatever error it decides when trying to load such files.

Not all of these errors mean that the mrc is unusable though. We may want to report in more detail to the user what the problem is, and/or provide some kind of utility. Alternatively, we could open the file in "permissive mode" and work from there, only failing if the problem is critical.
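A rough sketch of the "permissive mode" fallback, for discussion only. The helper name open_mrc_lenient and its opener parameter are hypothetical (the parameter just makes the function easy to exercise without real files). It assumes that strict mode raises ValueError and that permissive mode reports problems through Python warnings, which is how the mrcfile documentation describes it:

```python
import warnings


def open_mrc_lenient(path, opener=None):
    """Try a strict open first, then fall back to permissive mode.

    Returns (mrc, problems), where problems is a list of message strings.
    Raises ValueError only if the permissive open yields no usable data.
    """
    if opener is None:
        import mrcfile  # imported lazily so the sketch can be tested with a stub

        opener = mrcfile.open

    try:
        return opener(path, mode="r"), []
    except ValueError as e:
        strict_message = str(e)

    # Permissive mode issues warnings instead of raising; collect them.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        mrc = opener(path, mode="r", permissive=True)
    problems = [str(w.message) for w in caught] or [strict_message]

    # data is None is treated here as the "critical" case (an assumption).
    if mrc.data is None:
        mrc.close()
        raise ValueError(f"{path}: data is None ({'; '.join(problems)})")
    return mrc, problems
```

The caller gets the list of problems back and can decide whether to log them, attempt a header fix, or skip the file.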

See ComputationalCryoEM/ASPIRE-Data#2

Some common messages:

Unrecognised machine stamp: 0x00 0x00 0x00 0x00

This means the header does not contain information about endianness.

Map ID string not found - not an MRC file, or file is corrupt

This simply means the header's map ID field is not set to its standard value, which is a constant (mrcfile.constants.MAP_ID).
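Both conditions can be checked without mrcfile by reading the raw header. The sketch below is a minimal illustration and assumes the MRC2014 layout: a 1024-byte header with the map ID at byte offset 208 and the machine stamp at offset 212. The function name inspect_header_ids is hypothetical, and the stamp values are the conventional ones (0x44 0x44 or 0x44 0x41 for little-endian, 0x11 0x11 for big-endian):

```python
MAP_ID_OFFSET = 208   # assumed MRC2014 offset of the 4-byte map ID
MACHST_OFFSET = 212   # assumed MRC2014 offset of the 4-byte machine stamp
HEADER_SIZE = 1024

# Only the first two stamp bytes are compared.
KNOWN_STAMPS = {
    b"\x44\x44": "little-endian",
    b"\x44\x41": "little-endian",
    b"\x11\x11": "big-endian",
}


def inspect_header_ids(path):
    """Report whether the map ID is standard and which byte order is declared."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE:
        raise ValueError(f"{path}: file is shorter than an MRC header")

    map_id = header[MAP_ID_OFFSET:MAP_ID_OFFSET + 4]
    stamp = header[MACHST_OFFSET:MACHST_OFFSET + 2]
    return {
        "map_id_ok": map_id == b"MAP ",
        # None means the stamp is unrecognised, e.g. 0x00 0x00 0x00 0x00.
        "byte_order": KNOWN_STAMPS.get(stamp),
    }
```

A file with an all-zero stamp and a missing map ID would come back as map_id_ok False and byte_order None, which matches the two messages above.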

The mrcfile library provides a permissive read mode which will attempt to open the file anyway. Once this is done, sometimes the header can be fixed and the file re-saved with the update. For example, the following code attempts to fix the two example issues above:

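import sys

import mrcfile

fp = sys.argv[1]

# Permissive mode opens the file despite header problems; "r+" allows in-place edits.
with mrcfile.open(fp, "r+", permissive=True) as mrc:
    # Restore the constant map ID if the header holds something else.
    if mrc.header.map != mrcfile.constants.MAP_ID:
        mrc.header.map = mrcfile.constants.MAP_ID
    if mrc.data is not None:
        # Resets dimensions, mode, byte order and machine stamp to match the data.
        mrc.update_header_from_data()
    else:
        print(f"ERROR with {fp}: data is None!")

# Reopen in strict mode to confirm that the file now loads.
try:
    with mrcfile.open(fp, "r") as mrc:
        pass
except ValueError as e:
    print(f"ERROR with {fp}: {e}")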

In EMPIAR 10005, the files can be fixed with the above script. The code can and should be fleshed out into a fix-it script that suits our purposes.

However, there can be more complex cases. There are two situations in which an MrcFile object will be returned, but its data field, typically a NumPy array containing the image data, is None:

The mode number is not recognised. Currently accepted modes are 0, 1, 2, 4 and 6.

or

The data block is not large enough for the specified data type and dimensions.

(see: https://mrcfile.readthedocs.io/en/latest/usage_guide.html#permissive-read-mode)
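To tell these two cases apart before deciding what to do with a file, something like the following could work. It is only a sketch: diagnose_data_block is a hypothetical helper, it assumes the MRC2014 header layout (nx, ny, nz and mode as the first four 32-bit integers, nsymbt at byte offset 92), and it assumes little-endian byte order unless told otherwise. The bytes-per-value table covers just the modes listed above:

```python
import os
import struct

HEADER_SIZE = 1024
# Bytes per value for the modes named above (mode 4 taken as complex64).
BYTES_PER_VALUE = {0: 1, 1: 2, 2: 4, 4: 8, 6: 2}


def diagnose_data_block(path, byte_order="<"):
    """Return a message describing why data would be None, or None if it looks fine."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE:
        return "file is shorter than an MRC header"

    nx, ny, nz, mode = struct.unpack_from(byte_order + "4i", header, 0)
    (nsymbt,) = struct.unpack_from(byte_order + "i", header, 92)

    if mode not in BYTES_PER_VALUE:
        return f"unrecognised mode {mode}"

    expected = nx * ny * nz * BYTES_PER_VALUE[mode]
    # Whatever follows the header and extended header is the data block.
    actual = os.path.getsize(path) - HEADER_SIZE - nsymbt
    if actual < expected:
        return f"data block too small: expected {expected} bytes, found {actual}"
    return None
```

A wrong mode number might be repairable if the true data type is known from elsewhere, whereas a short data block usually means a truncated file, so the two probably deserve different handling.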

We should figure out how to fix these last two categories and/or at what point to decide that an .mrc file we have received is truly "corrupted" and unusable. On resolution we should store the above information as well as any new methods that we discover.

Note that the mrcfile.validate() function and the mrcfile-validate CLI tool will flag even usable MRC files as invalid (validate() returns False). The warnings given are generally not critical, and mrcfile will open such files without complaint. These tools are also slow, taking 3-5 seconds per MRC file.

e.g.

python
>>> mrcfile.validate("patch/10028/data/Particles/MRC_0601/037_particles_shiny_nb50_new.mrcs")
Error in header labels: nlabl is 10 but 0 labels contain text
File does not declare MRC format version 20140: nversion = 0
Error in data statistics: RMS deviation is 0.9954950213432312 but the value in the header is 0.9954612255096436
False
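Since validate() returns False for all of these, one option is to sort its messages ourselves. The sketch below is a hypothetical triage helper; the patterns are copied from the messages quoted in this issue, and the split into "cosmetic" and "fixable" is an assumption that would need to be agreed on. The messages could be captured by passing an io.StringIO as print_file to mrcfile.validate():

```python
import re

# Messages seen on files that mrcfile opens without complaint.
COSMETIC_PATTERNS = [
    r"^Error in header labels:",
    r"^File does not declare MRC format version",
    r"^Error in data statistics:",
]
# Messages that block a strict open but that a header fix can address.
FIXABLE_PATTERNS = [
    r"^Unrecognised machine stamp",
    r"^Map ID string not found",
]


def triage_messages(lines):
    """Group validation messages into cosmetic, fixable and unknown."""
    result = {"cosmetic": [], "fixable": [], "unknown": []}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if any(re.search(p, line) for p in COSMETIC_PATTERNS):
            result["cosmetic"].append(line)
        elif any(re.search(p, line) for p in FIXABLE_PATTERNS):
            result["fixable"].append(line)
        else:
            result["unknown"].append(line)
    return result
```

Anything landing in "unknown" would be a candidate for manual inspection and, eventually, a new pattern.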