inconsistent handling of PDB Insertion Codes and resid #2308
I am using a data set of ~16k PDB files, and many of them contain insertion codes (see PDB File Format). I found that some codes just ignore (or forget) them, and therefore I'm being extra careful with my analysis. I noticed that MDAnalysis also shows some strange behavior.
Consider the following PDB file, where there are two different residues with the same residue number but with and without insertion code:
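For readers unfamiliar with the format: in ATOM/HETATM records the PDB specification puts the chain identifier in column 22, the residue sequence number in columns 23-26 and the insertion code in column 27. Below is a minimal sketch of reading these fields by hand. The two records and the helper name `parse_residue_id` are made up for illustration; they are not the file from this issue and not MDAnalysis code.

```python
# Two hypothetical ATOM records: residue 30 and residue 30A in chain A.
LINES = [
    "ATOM      1  N   GLY A  30      10.000  10.000  10.000  1.00  0.00           N",
    "ATOM      2  N   ALA A  30A     11.000  10.000  10.000  1.00  0.00           N",
]


def parse_residue_id(line):
    """Return (chain ID, residue number, insertion code) from an ATOM record."""
    chain = line[21]            # column 22
    resseq = int(line[22:26])   # columns 23-26
    icode = line[26].strip()    # column 27, empty string if blank
    return chain, resseq, icode


for line in LINES:
    print(parse_residue_id(line))
```

Both records share the residue number 30; only the insertion code tells them apart.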
If I load this PDB file
import MDAnalysis as mda

u = mda.Universe("test.pdb")
print(u.residues)
for res in u.residues:
    print(res.resid, res.resnum, res.resname, res.icode)
all the three residues are correctly identified:
We see that
u.select_atoms("resid 30") <AtomGroup with 19 atoms>
u.select_atoms("resnum 30") <AtomGroup with 33 atoms>
u.select_atoms("resid 30A") <AtomGroup with 14 atoms>
which is a nice and short alternative to
u.select_atoms("resnum 30 and icode A") <AtomGroup with 14 atoms>
The alternative is particularly useful with f-strings:
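A minimal sketch of what building such selection strings with f-strings could look like. The helper names are hypothetical; the selection syntax is the one shown in the examples above, and the resulting strings would be passed to `u.select_atoms(...)`.

```python
def resid_selection(resnum, icode=""):
    """Short form: residue number and insertion code in a single token."""
    return f"resid {resnum}{icode}"


def resnum_icode_selection(resnum, icode):
    """Long form: residue number and insertion code as separate keywords."""
    return f"resnum {resnum} and icode {icode}"


# Hypothetical residue of interest
resnum, icode = 30, "A"
print(resid_selection(resnum, icode))
print(resnum_icode_selection(resnum, icode))
```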
I have the following questions:
Personally, I find it odd that
Additionally, in the MDAnalysis Documentation - Selection Commands there is no mention of the
Again, I would be happy to provide an improvement.
Credits: Useful discussions with @IAlibay.
Thanks for the useful summary.
We need to better define which quantities are "tags" that we just read from the file (e.g., icode is used in this way, resnum ought to behave in this way but I am not sure if it always does) and which ones have a universal meaning in MDAnalysis that is independent from the original file format. I think we were moving towards making the internal "ix" indices (such as
Summary of my ramblings: Make a suggestion for how you think it should work, ideally with examples of how it will work for data coming from different file formats (PDB, GRO, PSF, TPR, ...).
I completely understand that different formats have different needs (and agree that the documentation could be clearer about universal quantities and tags). The only thing I found quite strange is that I can select a single residue with a particular
As you can see, the residue 30A has been selected correctly, but then the
But this is even more confusing, since looking at
I guess my point is that the fact that the
My suggestion would be to store full residue information in
Hey @RMeli I'm the one responsible for most of this behaviour, but I'm not really an expert on all these corner cases, so I'm happy to be corrected! Btw if you do something like
WRT resnum - I don't actually know what this is, it's something that existed before my time on MDA, and I kept it around for backwards compat reasons... Maybe we should remove it (ie alias it to resid).
Currently ICodes are treated as a completely separate attribute, so as independent as charge. Reading this has made me think maybe we should instead lump it together with Resids...
So currently resids are created by the file parsing creating one of these
And passing it to the Universe.
But maybe PDB files (and anything else that provides both Resid and ICode) should instead create a single
So things that should get changed
If you want to work on any of those I'm happy to help you with it.
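To make the idea of lumping resid and icode together a bit more concrete, here is a rough sketch of a combined identifier. The class `ResidueId` is purely hypothetical and is not an MDAnalysis class or attribute; it only illustrates that "30" and "30A" would compare as different residues once the two fields travel together.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ResidueId:
    """Hypothetical combined identifier (not part of MDAnalysis)."""

    resid: int
    icode: str = ""

    @classmethod
    def from_string(cls, text):
        """Parse strings such as '30' or '30A'."""
        if text and text[-1].isalpha():
            return cls(int(text[:-1]), text[-1])
        return cls(int(text))

    def __str__(self):
        return f"{self.resid}{self.icode}"


plain = ResidueId.from_string("30")
inserted = ResidueId.from_string("30A")
print(plain == inserted)     # the two residues are distinct
print(str(inserted))         # round-trips to the short selection token
```

Ordering here is simply by number and then by insertion code, which is an assumption of the sketch and need not match the order in which residues appear in a given file.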
Hi @richardjgowers. I just tested
and it works as expected (even if the first residue has no insertion code), which is great! I'm more and more convinced that
I didn't know
The downside of this approach is that
I would be happy to work on this improvement with some guidance, but at the moment I'm quite busy with the last GSoC leg with another organisation and therefore it will have to wait until September.
As a weird hack you could change the
FWIW resids are far from unique, we often see files where these have looped around.
I think what you're describing for
Yes, I think such a change would break a lot of code and could be introduced only in a major revision. However, to fix the broken codes it would be sufficient to switch from
I don't need to add the change in a hacky way at the moment since I'm only using
If I understand correctly,
The latter is not actually true: you also need a chain identifier because you can have the same resid in multiple chains. Furthermore, everything goes to hell in a handbasket when we deal with PDB files from simulations where resID wraps around after 9999, as @richardjgowers alluded to in #2308 (comment).
MDAnalysis has to deal with the problem that formats like PDB are used in different contexts and everybody would like their files to behave the way that they expect. What's difficult for us is figuring out what these expectations are (especially when they differ from the published standard). Therefore the points that you're raising are very valuable!
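A small sketch of why neither resid alone nor resid plus icode is enough, using made-up per-atom records (chain ID, resid, insertion code). The data and variable names are illustrative assumptions; this is not how MDAnalysis necessarily builds its residues.

```python
from itertools import groupby

# Hypothetical per-atom records: (chain ID, resid, insertion code)
atoms = [
    ("A", 30, ""), ("A", 30, ""),   # two atoms of residue 30, chain A
    ("A", 30, "A"),                 # residue 30A, chain A
    ("B", 30, ""),                  # residue 30 again, but in chain B
    ("W", 1, ""),                   # a water, resid 1
    ("W", 9999, ""),                # ... numbering reaches 9999 ...
    ("W", 1, ""),                   # ... and wraps around to 1
]

# Counting "residues" with increasingly complete keys
by_resid = len({resid for _, resid, _ in atoms})
by_resid_icode = len({(resid, icode) for _, resid, icode in atoms})
by_full_key = len(set(atoms))

# Counting runs of consecutive identical keys also separates
# residues whose numbering has wrapped around.
by_consecutive_runs = sum(1 for _ in groupby(atoms))

print(by_resid, by_resid_icode, by_full_key, by_consecutive_runs)
```

Each step up distinguishes more residues: the chain ID separates the two residues numbered 30, and only the position in the file separates the two waters that share resid 1 after the wrap.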
Yes, there can be multiple chains of course. And now that you point this out, I find the wording on the PDB Format Specification somewhat misleading...
I completely understand that there are many different contexts for PDB files alone, this is why I opened this discussion as Question and not as Issue. ; )
I wanted to raise the awareness that the following code is somewhat problematic:
and find out if this behaviour is expected/wanted or not.
Maybe it's just a matter of updating the documentation to make clear that