-
Notifications
You must be signed in to change notification settings - Fork 648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
inconsistent handling of PDB Insertion Codes and resid #2308
Comments
Thanks for the useful summary. We need to better define which quantities are "tags" that we just read from the file (e.g., icode is used in this way, resnum ought to behave in this way but I am not sure if it always does) and which ones have a universal meaning in MDAnalysis that is independent from the original file format. I think we were moving towards making the internal "ix" indices (such as Summary of my ramblings: Make a suggestion how you think it should work, ideally with examples how it will work for data coming from different file formats (PDB, GRO, PSF, TPR, ...). |
I completely understand that different formats have different needs (and agree that the documentation could be clearer about universal quantities and tags). The only thing I found quite strange is that I can select a single residue with a particular
As you can see, the residue 30A has been selected correctly, but then the
But this is even more confusing, since looking at
I guess my point is that the fact that the My suggestion would be to store full residue information in |
Hey @RMeli I'm the one responsible for most of this behaviour, but I'm not really an expert on all these corner cases, so I'm happy to be corrected! Btw if you do something like WRT resnum - I don't actually know what this is, it's something that existed before my time on MDA, and I kept around for backwards compat reasons... Maybe we should remove it (ie alias it to resid). Currently ICodes are treated as a completely separate attribute, so as independents as charge. Reading this has made me think maybe we should instead lump it together with Resids... So currently resids are created by the file parsing creating one of these https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/core/topologyattrs.py#L1322 And passing it to the Universe. But maybe PDB files (and anything else that provides both Resid and ICode) should instead create a single So things that should get changed
If you want to work on any of those I'm happy to help you with it. |
Hi @richardjgowers. I just tested
and it works as expected (even if the first residue has no insertion code), which is great! I'm more and more convinced that I didn't know The downside of this approach is that I would be happy to work on this improvement with some guidance, but at the moment I'm quite busy with the last GSoC leg with another organisation and therefore it will have to wait until September. |
As a weird hack you could change the FWIW resids are far from unique, we often see files where these have looped around. I think what you're describing for |
Yes, I think such a change would break a lot of code and could be introduced only in a major revision. However, to fix the broken codes it would be sufficient to switch from I don't need to add the change in an hacky way at the moment since I'm only using If I understand correctly, |
How does VMD handle resid vs resnum? EDIT: And PyMOL, Chimera, ... ? |
The latter is not actually true: you also need a chain identifier because you can have the same resid in multiple chains. Furthermore, everything goes to hell in a hand basket when we deal with PDB files from simulations where resID wraps around after 9999, as @richardjgowers alluded to in #2308 (comment). MDAnalysis has to deal with the problem that formats like PDB are used in different contexts and everbody would like their files to behave the way that they expect. What's difficult for us is figuring out what these expectations are (especially when they differ from the published standard). Therefore the points that you're raising are very valuable! |
Yes, there can be multiple chains of course. And now that you point this out, I find the wording on the PDB Format Specification somewhat misleading... I completely understand that there are many different contexts for PDB files alone, this is why I opened this discussion as Question and not as Issue. ; ) I wanted to raise the awareness that the following code is somewhat problematic:
and find out if this behaviour is expected/wanted or not. Maybe it's just a matter of updating the documentation to make clear that |
@RMeli thanks – not much I could do since then but I am adding this issue to a growing list of interconnected problems. Your comments are much appreciated, sorry that we haven't been doing anything since. |
No pronlem @orbeckst, I've been following a bit the related issue and completely understand that there is much to be done and little time. I too have been quite busy lately, but I'll be happy to help where I can! |
Replying to @RMeli #2672 (comment) here because it seems more relevant: I think suddenly making If a new topology attribute is added, it's probably best that the special resid-icode selection get moved to that token to avoid the mismatch here between the selection string and the actual attribute. Would it ever be an issue if the new ResLabels didn't match str(Resids)+(Icodes)? Currently all other topology attributes are independent of each other. For example someone tweaks their icodes attribute and expects that to show up in the reslabels and selection results. Miscellaneous notes:
|
I agree changing I think issues #2669 and #2672 can be solved by using PS: I usually compare my selections with PyMol, which seems to act as I expect. |
Ah, we may have been talking past each other a little. From my perspective then the |
Yes, I think I might be interpreting I interpret it as it is used in They way you interpret |
More general question: do we have a comparison of how different commonly used (bio) comp chem software does selections? What’s common, whats’s different? IIRC there was also a list of topology attributes on the wiki (and probably now in the User Guide?) that explained what they were supposed to mean. Eg resnum vs resid. It would be important to get the semantics defined for once and all. |
@orbeckst Below I compare MDAnalysis, PyMol and ProDy. I will try to add other comparisons in the coming days, but I already outlined the problems I see with MDAnalysis. I edited earlier today the description of this issue to note that @lilyminium already added information about using insertion codes with the In the User Guide - Canonical Attributes it is clearly stated that Test File
MDAnalysisStable
PR #2673
Problem?
In In [3]: s1 = u.select_atoms("same resnum as (around 2 resname LIG)")
In [4]: s2 = u.select_atoms("same resid as (around 2 resname LIG)")
In [5]: s1 == s2
Out[5]: True
In [6]: s3 = u.select_atoms("resid 1")
In [7]: s4 = u.select_atoms("resnum 1")
In [8]: s3 == s4
Out[8]: False PyMol
ProDyProDy has an interesting take:
Question: is there a way to select an empty insertion code using the |
@RMeli no, I don't think there's a way to select an empty string. Thanks for the thorough summary of the issue! Edit: I didn't see residue 1B in the test file there. The potential residues for selection in every point below are [Gly1, Lys1A, Lig1]. To add (all but mdtraj are from skimming google, so feel free to correct):
Bonus
Summing up
|
It sounds like we need a special dtype for the resid array that then allows
comparison and ordering of both integers (26) and strings (“26A”). All
working on resids get coerced to the special dtype so it works as expected
and all resids include icodes by default. I’ll see if I can make the struct
in numpy...
…On Sun, Apr 26, 2020 at 08:32, Rocco Meli ***@***.***> wrote:
Reopened #2308 <#2308>.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2308 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACGSGB4QEDWHQIZ2KDL6DJLROPPPXANCNFSM4IIWJPPQ>
.
|
I using a data set of ~16k PDB files and many of them contain insertion codes (see PDB File Format). I found that some codes just ignore (or forget) them and therefore I'm being extra careful whit my analysis. I noticed that MDAnalysis also shows some strange behavior.
Consider the following PDB file, where there are two different residues with the same residue number but with and without insertion code:
If I load this PDB file
all the three residues are correctly identified:
We see that
resid
andresnum
have the same value. However, a selection usingresid
andresnum
gives different results:The
resid
keyword only selects the residue 30 without insertion code, while theresnum
keyword select both residues with residue number 30. This is because,resid
appears to be sensitive to insertion codes:which is a nice and short alternative to
The alternative is particularly useful with f-strings:
resid {res.num}{res.icode}
woks even ificode == ""
.The questions I have the following:
res.resid
return the full residue name (residue number and insertion code)?Personally, I find odd that
u.select_atoms("resid {res.resid}")
might or might not return the correct residue when if insertion codes are involved. I would be happy to provide a fix if this is the case.Additionally, in the MDAnalysis Documentation - Selection Commands there is no mention of theicode
keyword and theresid
is described as a single residue number or a range of numbers (without mention to the insertion code functionality).Should the documentation be clearer about the difference betweenresid
andresnum
, and explicitly mention theicode
selection keyword as well?Again, I would be happy to provide an improvement.EDIT 25/04/2020: This is still the case in the MDAnalysis Documentation - Selection Commands, but the new MDAnalysis User Guide now mentions insertion codes explicitly. Thanks @lilyminium !
Credits: Useful discussions with @IAlibay.
The text was updated successfully, but these errors were encountered: