Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inconsistent handling of PDB Insertion Codes and resid #2308

Open
RMeli opened this issue Aug 1, 2019 · 19 comments
Open

inconsistent handling of PDB Insertion Codes and resid #2308

RMeli opened this issue Aug 1, 2019 · 19 comments

Comments

@RMeli
Copy link
Member

RMeli commented Aug 1, 2019

I using a data set of ~16k PDB files and many of them contain insertion codes (see PDB File Format). I found that some codes just ignore (or forget) them and therefore I'm being extra careful whit my analysis. I noticed that MDAnalysis also shows some strange behavior.

Consider the following PDB file, where there are two different residues with the same residue number but with and without insertion code:

ATOM    392  N   LEU A  29      26.918 -15.013  -3.000  1.00 18.00           N
ATOM    393  H   LEU A  29      26.291 -15.692  -3.477  1.00  0.00           H
ATOM    394  CA  LEU A  29      26.474 -13.636  -2.851  1.00 17.76           C
ATOM    395  HA  LEU A  29      27.360 -13.024  -2.682  1.00  0.00           H
ATOM    396  C   LEU A  29      25.563 -13.475  -1.650  1.00 17.80           C
ATOM    397  O   LEU A  29      25.026 -12.400  -1.444  1.00 17.77           O
ATOM    398  CB  LEU A  29      25.771 -13.127  -4.152  1.00 16.97           C
ATOM    399  HB1 LEU A  29      24.803 -13.622  -4.236  1.00  0.00           H
ATOM    400  HB2 LEU A  29      25.621 -12.051  -4.064  1.00  0.00           H
ATOM    401  CG  LEU A  29      26.567 -13.401  -5.431  1.00 17.75           C
ATOM    402  HG  LEU A  29      26.698 -14.483  -5.455  1.00  0.00           H
ATOM    403  CD1 LEU A  29      25.796 -12.941  -6.688  1.00 16.58           C
ATOM    404 1HD1 LEU A  29      24.849 -13.478  -6.749  1.00  0.00           H
ATOM    405 2HD1 LEU A  29      25.604 -11.870  -6.623  1.00  0.00           H
ATOM    406 3HD1 LEU A  29      26.392 -13.152  -7.576  1.00  0.00           H
ATOM    407  CD2 LEU A  29      27.971 -12.768  -5.434  1.00 17.24           C
ATOM    408 1HD2 LEU A  29      27.881 -11.686  -5.333  1.00  0.00           H
ATOM    409 2HD2 LEU A  29      28.549 -13.165  -4.600  1.00  0.00           H
ATOM    410 3HD2 LEU A  29      28.473 -13.006  -6.372  1.00  0.00           H
ATOM    411  N   LEU A  30      25.385 -14.539  -0.852  1.00 19.19           N
ATOM    412  H   LEU A  30      25.849 -15.442  -1.076  1.00  0.00           H
ATOM    413  CA  LEU A  30      24.520 -14.423   0.357  1.00 19.64           C
ATOM    414  HA  LEU A  30      23.626 -13.869   0.070  1.00  0.00           H
ATOM    415  C   LEU A  30      25.272 -13.661   1.477  1.00 20.27           C
ATOM    416  O   LEU A  30      26.235 -14.195   2.067  1.00 21.08           O
ATOM    417  CB  LEU A  30      24.083 -15.795   0.874  1.00 19.89           C
ATOM    418  HB1 LEU A  30      23.672 -16.353   0.033  1.00  0.00           H
ATOM    419  HB2 LEU A  30      24.968 -16.307   1.251  1.00  0.00           H
ATOM    420  CG  LEU A  30      23.021 -15.790   2.004  1.00 20.22           C
ATOM    421  HG  LEU A  30      23.443 -15.166   2.792  1.00  0.00           H
ATOM    422  CD1 LEU A  30      21.673 -15.190   1.518  1.00 23.01           C
ATOM    423 1HD1 LEU A  30      21.832 -14.163   1.189  1.00  0.00           H
ATOM    424 2HD1 LEU A  30      21.291 -15.784   0.688  1.00  0.00           H
ATOM    425 3HD1 LEU A  30      20.954 -15.204   2.337  1.00  0.00           H
ATOM    426  CD2 LEU A  30      22.754 -17.189   2.582  1.00 22.25           C
ATOM    427 1HD2 LEU A  30      22.393 -17.845   1.790  1.00  0.00           H
ATOM    428 2HD2 LEU A  30      23.678 -17.592   2.996  1.00  0.00           H
ATOM    429 3HD2 LEU A  30      22.002 -17.119   3.368  1.00  0.00           H
ATOM    430  N   ASN A  30A     24.832 -12.436   1.742  1.00 21.44           N
ATOM    431  H   ASN A  30A     24.010 -12.091   1.207  1.00  0.00           H
ATOM    432  CA  ASN A  30A     25.424 -11.537   2.735  1.00 21.84           C
ATOM    433  HA  ASN A  30A     26.512 -11.534   2.668  1.00  0.00           H
ATOM    434  C   ASN A  30A     24.963 -12.022   4.124  1.00 22.73           C
ATOM    435  O   ASN A  30A     23.765 -12.131   4.358  1.00 23.27           O
ATOM    436  CB  ASN A  30A     24.896 -10.119   2.493  1.00 21.37           C
ATOM    437  HB1 ASN A  30A     25.152  -9.818   1.477  1.00  0.00           H
ATOM    438  HB2 ASN A  30A     23.812 -10.124   2.607  1.00  0.00           H
ATOM    439  CG  ASN A  30A     25.477  -9.123   3.452  1.00 23.63           C
ATOM    440  OD1 ASN A  30A     24.901  -8.853   4.524  1.00 25.25           O
ATOM    441  ND2 ASN A  30A     26.651  -8.605   3.113  1.00 19.86           N
ATOM    442 1HD2 ASN A  30A     27.131  -7.942   3.754  1.00  0.00           H
ATOM    443 2HD2 ASN A  30A     27.091  -8.862   2.206  1.00  0.00           H

If I load this PDB file

import MDAnalysis as mda
u = mda.Universe("test.pdb")
print(u.residues)
for res in u.residues:
    print(res.resid, res.resnum, res.resname, res.icode)

all the three residues are correctly identified:

<ResidueGroup [<Residue LEU, 29>, <Residue LEU, 30>, <Residue ASN, 30>]>
29 29 LEU
30 30 LEU
30 30 ASN A

We see that resid and resnum have the same value. However, a selection using resid and resnum gives different results:

u.select_atoms("resid 30")
<AtomGroup with 19 atoms>
u.select_atoms("resnum 30")
<AtomGroup with 33 atoms>

The resid keyword only selects the residue 30 without insertion code, while the resnum keyword select both residues with residue number 30. This is because, resid appears to be sensitive to insertion codes:

u.select_atoms("resid 30A")
<AtomGroup with 14 atoms>

which is a nice and short alternative to

u.select_atoms("resnum 30 and icode A")
<AtomGroup with 14 atoms>

The alternative is particularly useful with f-strings: resid {res.num}{res.icode} woks even if icode == "".

The questions I have the following:

  • Shouldn't res.resid return the full residue name (residue number and insertion code)?

Personally, I find odd that u.select_atoms("resid {res.resid}") might or might not return the correct residue when if insertion codes are involved. I would be happy to provide a fix if this is the case.

Additionally, in the MDAnalysis Documentation - Selection Commands there is no mention of the icode keyword and the resid is described as a single residue number or a range of numbers (without mention to the insertion code functionality).

  • Should the documentation be clearer about the difference between resid and resnum, and explicitly mention the icode selection keyword as well?

Again, I would be happy to provide an improvement.

EDIT 25/04/2020: This is still the case in the MDAnalysis Documentation - Selection Commands, but the new MDAnalysis User Guide now mentions insertion codes explicitly. Thanks @lilyminium !


Credits: Useful discussions with @IAlibay.

@orbeckst
Copy link
Member

orbeckst commented Aug 1, 2019

Thanks for the useful summary.

We need to better define which quantities are "tags" that we just read from the file (e.g., icode is used in this way, resnum ought to behave in this way but I am not sure if it always does) and which ones have a universal meaning in MDAnalysis that is independent from the original file format. I think we were moving towards making the internal "ix" indices (such as atoms.ix and residues.ix) also properly available for the user, but at the moment, they are not accessible via the selection language. That leaves the selections with quantities that can have subtly different meanings in different file formats.

Summary of my ramblings: Make a suggestion how you think it should work, ideally with examples how it will work for data coming from different file formats (PDB, GRO, PSF, TPR, ...).

@RMeli
Copy link
Member Author

RMeli commented Aug 2, 2019

I completely understand that different formats have different needs (and agree that the documentation could be clearer about universal quantities and tags). The only thing I found quite strange is that I can select a single residue with a particular resid but then the resid attribute of that selection is different:

import MDAnalysis as mda

u = mda.Universe("test.pdb")
ag = u.select_atoms("resid 30A")

print(ag)
print(ag.resids)
<AtomGroup [<Atom 39: N of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 40: H of type H of resname ASN, resid 30 and segid A and altLoc >, <Atom 41: CA of type C of resname ASN, resid 30 and segid A and altLoc >, ..., <Atom 50: ND2 of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 51: 1HD2 of type H of resname ASN, resid 30 and segid A and altLoc >, <Atom 52: 2HD2 of type H of resname ASN, resid 30 and segid A and altLoc >]>

[30 30 30 30 30 30 30 30 30 30 30 30 30 30]

As you can see, the residue 30A has been selected correctly, but then the resids of the atom group are different. Interestingly enough (since I didn't dig into the source code), the information is not completely lost:

ag.select_atoms("resid 30A")
<AtomGroup with 14 atoms>

But this is even more confusing, since looking at print(ag) one could be tempted to do the following:

ag.select_atoms("resid 30")
<AtomGroup with 0 atoms>

I guess my point is that the fact that the resid used for selection and the one printed out do not match is confusing and potentially error prone.

My suggestion would be to store full residue information in resid. Such information should allow to uniquely identify a single residue. This will of course depend on the format (I'm not knowledgeable in many of them...); for some formats this will simply be the equal to resnum but for other formats it could be slightly different. For PDB files, for example, it would be the residue number plus the insertion code (from the PDB File Format: The combination of residue numbering and insertion code defines the unique residue.)

@richardjgowers
Copy link
Member

Hey @RMeli I'm the one responsible for most of this behaviour, but I'm not really an expert on all these corner cases, so I'm happy to be corrected! Btw if you do something like select_atoms('resid 30A-30C') it should select the range too.

WRT resnum - I don't actually know what this is, it's something that existed before my time on MDA, and I kept around for backwards compat reasons... Maybe we should remove it (ie alias it to resid).

Currently ICodes are treated as a completely separate attribute, so as independents as charge. Reading this has made me think maybe we should instead lump it together with Resids...

So currently resids are created by the file parsing creating one of these TopologyAttr objects:

https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/core/topologyattrs.py#L1322

And passing it to the Universe. Universe then monkeypatches AtomGroups to let them know that they have resids. (This is why if you load an XYZ file, the AtomGroups won't even have a .charge attribute). A Group can then query the TopologyAttr object via the get_x method.

But maybe PDB files (and anything else that provides both Resid and ICode) should instead create a single ResidAndICode attribute, which then could return 30A as .resid attribute. So the TopologyAttr would instead hold 2 numpy arrays.

So things that should get changed

  • remove resnum
  • document selecting w/ icodes
  • rework Resid attribute to also handle icodes since they're linked.

If you want to work on any of those I'm happy to help you with it.

@RMeli
Copy link
Member Author

RMeli commented Aug 2, 2019

Hi @richardjgowers. I just tested

u.select_atoms("resid 30-30B")

and it works as expected (even if the first residue has no insertion code), which is great! I'm more and more convinced that resids behave as expected, but the residue.resid attribute is simply not a faithful representation.

I didn't know resnum was there only for backward compatibility. I thought the idea was that resid is an unique residue identifier while resnum is the residue number (which is some formats, such as PDB, does not uniquely identify a residue). What do you think of having them both, but with a clear distinction? resid could be defined as an unique identifier of a single residue (or a group of mutually exclusive residues, if there are alternative locations...) and resnum as the residue number (equal to resid in some formats). resnum could be useful to select all the residues with same residue number but with different (or no) insertion code.

The downside of this approach is that resid could be a string or a number, depending on the format; how do you feel about that?

I would be happy to work on this improvement with some guidance, but at the moment I'm quite busy with the last GSoC leg with another organisation and therefore it will have to wait until September.

@richardjgowers
Copy link
Member

As a weird hack you could change the get method of resids to check if there's icodes and return a version with icodes appended. It gets messy because then the return type has to be strings, whereas I think a lot of people might expect an integer array. This is going to break a lot of code where people are doing arithmetic with their resids.

FWIW resids are far from unique, we often see files where these have looped around.

I think what you're describing for resnum is what we currently have for resindex. Ie a number generated by MDAnalysis which enumerates the residues we think the topology represents.

@RMeli
Copy link
Member Author

RMeli commented Aug 2, 2019

Yes, I think such a change would break a lot of code and could be introduced only in a major revision. However, to fix the broken codes it would be sufficient to switch from resid to resnum when storing such attributes (during selection they work as expected).

I don't need to add the change in an hacky way at the moment since I'm only using select_atoms which appears to work correctly from my tests (reported here). However, problems will arise if one extracts resids for later use and therefore I wanted to raise the problem.

If I understand correctly, resindex is an internal representation and therefore is different from resnum in the sense that resnum should still correspond to the numbering in the original file. But my understanding or assumptions could be completely wrong!

@orbeckst
Copy link
Member

orbeckst commented Aug 2, 2019

How does VMD handle resid vs resnum?

EDIT: And PyMOL, Chimera, ... ?

@orbeckst
Copy link
Member

orbeckst commented Aug 2, 2019

#2308 (comment)

For PDB files, for example, it would be the residue number plus the insertion code (from the PDB File Format: The combination of residue numbering and insertion code defines the unique residue.)

The latter is not actually true: you also need a chain identifier because you can have the same resid in multiple chains. Furthermore, everything goes to hell in a hand basket when we deal with PDB files from simulations where resID wraps around after 9999, as @richardjgowers alluded to in #2308 (comment).

MDAnalysis has to deal with the problem that formats like PDB are used in different contexts and everbody would like their files to behave the way that they expect. What's difficult for us is figuring out what these expectations are (especially when they differ from the published standard). Therefore the points that you're raising are very valuable!

@orbeckst orbeckst changed the title PDB Insertion Codes inconsistent handling of PDB Insertion Codes and resid Aug 2, 2019
@RMeli
Copy link
Member Author

RMeli commented Aug 2, 2019

Yes, there can be multiple chains of course. And now that you point this out, I find the wording on the PDB Format Specification somewhat misleading...

I completely understand that there are many different contexts for PDB files alone, this is why I opened this discussion as Question and not as Issue. ; )

I wanted to raise the awareness that the following code is somewhat problematic:

ATOM    428 2HD2 LEU A  30      23.678 -17.592   2.996  1.00  0.00           H
ATOM    429 3HD2 LEU A  30      22.002 -17.119   3.368  1.00  0.00           H
ATOM    430  N   ASN A  30A     24.832 -12.436   1.742  1.00 21.44           N
ATOM    431  H   ASN A  30A     24.010 -12.091   1.207  1.00  0.00           H
ATOM    432  CA  ASN A  30B     25.424 -11.537   2.735  1.00 21.84           C
ATOM    433  HA  ASN A  30B     26.512 -11.534   2.668  1.00  0.00           H
import MDAnalysis as mda

u = mda.Universe("test.pdb")

ag = u.select_atoms("resid 30A")
print(ag)

resids = set(ag.resids)

for resid in resids:
    agw = u.select_atoms(f"resid {resid}")
    print(agw)
<AtomGroup [<Atom 3: N of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 4: H of type H of resname ASN, resid 30 and segid A and altLoc >]>
<AtomGroup [<Atom 1: 2HD2 of type H of resname LEU, resid 30 and segid A and altLoc >, <Atom 2: 3HD2 of type H of resname LEU, resid 30 and segid A and altLoc >]>

and find out if this behaviour is expected/wanted or not.

Maybe it's just a matter of updating the documentation to make clear that resid can be used to select residues with insertion codes but the resid attribute does not contain that information and it has to be accessed explicitly with icode?

@orbeckst
Copy link
Member

@RMeli thanks – not much I could do since then but I am adding this issue to a growing list of interconnected problems. Your comments are much appreciated, sorry that we haven't been doing anything since.

@RMeli
Copy link
Member Author

RMeli commented Oct 19, 2019

No pronlem @orbeckst, I've been following a bit the related issue and completely understand that there is much to be done and little time. I too have been quite busy lately, but I'll be happy to help where I can!

@lilyminium
Copy link
Member

lilyminium commented Apr 25, 2020

Replying to @RMeli #2672 (comment) here because it seems more relevant:

I think suddenly making resid string-type would break a lot of code out there, including mine. 🙃

If a new topology attribute is added, it's probably best that the special resid-icode selection get moved to that token to avoid the mismatch here between the selection string and the actual attribute.

Would it ever be an issue if the new ResLabels didn't match str(Resids)+(Icodes)? Currently all other topology attributes are independent of each other. For example someone tweaks their icodes attribute and expects that to show up in the reslabels and selection results.

Miscellaneous notes:

  • Currently resnum is an alias for resid in (iirc) all topology readers.
  • I don't think VMD has a concept of insertion codes. "resid 14" and "resid 14A" selects both above residues.
  • Pymol and Chimera both treat residue ids as number+insertion code. "resi 14" only selects Asp14, "resi 14-14A" selects both in pymol.
  • the only time .resnum or .resnums shows up in the MDAnalysis codebase is once in a docstring in analysis.rms, and then in some tests. Every other reference to numeric residue labels uses .resids. Changing the type and content of this attribute should really be considered very carefully.

@RMeli
Copy link
Member Author

RMeli commented Apr 25, 2020

I agree changing resid to a string is not a viable option (and will have consequences on performance). The problem I was rising here is that resid works correctly on selection, but the attribute stored does not match the selection (resid 1A will correctly select residue 1A, but the resid attribute will only contain 1).

I think issues #2669 and #2672 can be solved by using resindices internally, but the problem of resid not being bijective remains.

PS: I usually compare my selections with PyMol, which seems to act as I expect.

@lilyminium
Copy link
Member

lilyminium commented Apr 25, 2020

Ah, we may have been talking past each other a little. From my perspective then the resid attribute values are fine as is. Instead, it's ResidSelection that has unexpected behaviour, and selecting "resid 1A" should give an error; you should be required to select "resid 1 and icode A" for the 1A residue. Of course this will annoy anyone who uses the resid shorthand...

@RMeli
Copy link
Member Author

RMeli commented Apr 25, 2020

Yes, I think I might be interpreting resid in a different way... =)

I interpret it as it is used in select_atoms where if you have residues 1, 1A and 1B the selection resid 1 will only select 1 (in contrast, resnum 1 will select all three). That is, for me resid should be an unique identifier.

They way you interpret resid is the way I interpret resnum; selecting resnum 1A will indeed throw a SelectionError: Selection failed: 'Failed to parse number: 1A' while resnum 1 and icode A works as expected.

@orbeckst
Copy link
Member

More general question: do we have a comparison of how different commonly used (bio) comp chem software does selections? What’s common, whats’s different?

IIRC there was also a list of topology attributes on the wiki (and probably now in the User Guide?) that explained what they were supposed to mean. Eg resnum vs resid. It would be important to get the semantics defined for once and all.

@RMeli
Copy link
Member Author

RMeli commented Apr 25, 2020

@orbeckst Below I compare MDAnalysis, PyMol and ProDy. I will try to add other comparisons in the coming days, but I already outlined the problems I see with MDAnalysis.

I edited earlier today the description of this issue to note that @lilyminium already added information about using insertion codes with the resid selection in the User Guide - Simple Selections, which partially address the problem.

In the User Guide - Canonical Attributes it is clearly stated that resid is the residue number and resnum is an alias for resid.


Test File

ATOM   1     N   GLY A   1      14.785 -12.720  25.303  1.00 21.06           N
ATOM   2     H   GLY A   1      15.188 -13.267  24.516  1.00  0.00           H
ATOM   3     CA  GLY A   1      13.751 -13.323  26.187  1.00 21.45           C
ATOM   4     HA1 GLY A   1      13.969 -13.130  27.237  1.00  0.00           H
ATOM   5     HA2 GLY A   1      12.758 -12.944  25.947  1.00  0.00           H
ATOM   6     C   GLY A   1      13.869 -14.830  25.859  1.00 21.97           C
ATOM   7     O   GLY A   1      14.414 -15.179  24.801  1.00 22.20           O
ATOM   8     H   LYS A   1A     25.735 -14.238  12.187  1.00 28.77           N
ATOM   9     A   LYS A   1A     25.698 -14.467  13.201  1.00  0.00           H
ATOM   10    CA  LYS A   1A     25.215 -12.920  11.783  1.00 28.80           C
ATOM   11    HA  LYS A   1A     25.025 -12.972  10.711  1.00  0.00           H
ATOM   12    C   LYS A   1A     26.223 -11.798  12.006  1.00 29.74           C
ATOM   13    O   LYS A   1A     26.730 -11.602  13.135  1.00 28.72           O
ATOM   14    CB  LYS A   1A     23.937 -12.618  12.497  1.00 30.64           C
ATOM   15    HB1 LYS A   1A     23.285 -13.448  12.225  1.00  0.00           H
ATOM   16    HB2 LYS A   1A     24.203 -12.697  13.551  1.00  0.00           H
ATOM   17    CG  LYS A   1A     23.015 -11.397  12.444  1.00 31.59           C
ATOM   18    HG1 LYS A   1A     23.519 -10.518  12.845  1.00  0.00           H
ATOM   19    HG2 LYS A   1A     22.700 -11.198  11.420  1.00  0.00           H
ATOM   20    CD  LYS A   1A     21.796 -11.779  13.326  1.00 31.47           C
ATOM   21    HD1 LYS A   1A     21.317 -12.644  12.868  1.00  0.00           H
ATOM   22    HD2 LYS A   1A     22.176 -12.055  14.310  1.00  0.00           H
ATOM   23    CE  LYS A   1A     20.754 -10.743  13.523  1.00 29.40           C
ATOM   24    HE1 LYS A   1A     20.275 -10.535  12.566  1.00  0.00           H
ATOM   25    HE2 LYS A   1A     21.224  -9.834  13.898  1.00  0.00           H
ATOM   26    NZ  LYS A   1A     19.715 -11.194  14.500  1.00 30.69           N
ATOM   27    HZ1 LYS A   1A     19.259 -12.059  14.145  1.00  0.00           H
ATOM   28    HZ2 LYS A   1A     20.165 -11.389  15.417  1.00  0.00           H
ATOM   29    HZ3 LYS A   1A     19.001 -10.446  14.615  1.00  0.00           H
ATOM   1     N   GLN A   1B     30.861  12.781  18.536  1.00 44.73           N
ATOM   2     H   GLN A   1B     30.423  12.110  19.199  1.00  0.00           H
ATOM   3     CA  GLN A   1B     32.085  12.372  17.843  1.00 45.43           C
ATOM   4     HA  GLN A   1B     32.589  13.273  17.493  1.00  0.00           H
ATOM   5     C   GLN A   1B     31.893  11.457  16.657  1.00 46.22           C
ATOM   6     O   GLN A   1B     31.622  11.923  15.525  1.00 46.88           O
TER
HETATM    1  C   LIG     1      17.735 -17.178  22.612  1.00  0.00           C  
HETATM    2  C   LIG     1      18.543 -17.350  21.462  1.00  0.00           C  
HETATM    3  C   LIG     1      19.877 -17.046  21.558  1.00  0.00           C  
HETATM    4  C   LIG     1      20.401 -16.566  22.766  1.00  0.00           C  
HETATM    5  C   LIG     1      19.613 -16.358  23.893  1.00  0.00           C  
HETATM    6  C   LIG     1      20.084 -15.791  25.046  1.00  0.00           C  
HETATM    7  C   LIG     1      19.241 -15.568  26.114  1.00  0.00           C  
HETATM    8  C   LIG     1      17.893 -15.971  26.028  1.00  0.00           C  
HETATM    9  C   LIG     1      17.370 -16.510  24.879  1.00  0.00           C  
HETATM   10  C   LIG     1      18.205 -16.689  23.809  1.00  0.00           C  
HETATM   11  N   LIG     1      21.383 -15.285  25.104  1.00  0.00           N  
HETATM   12  C   LIG     1      21.898 -14.217  24.214  1.00  0.00           C  
HETATM   13  C   LIG     1      22.635 -16.213  25.496  1.00  0.00           C  
HETATM   14  S   LIG     1      15.909 -17.581  22.417  1.00  0.00           S  
HETATM   15  O   LIG     1      15.988 -18.013  21.010  1.00  0.00           O  
HETATM   16  O   LIG     1      14.998 -18.191  23.402  1.00  0.00           O  
HETATM   17  N   LIG     1      14.893 -15.970  22.485  1.00  0.00           N  
HETATM   18  C   LIG     1      15.193 -15.114  21.296  1.00  0.00           C  
HETATM   19  C   LIG     1      16.307 -14.167  21.752  1.00  0.00           C  
HETATM   20  O   LIG     1      16.212 -13.781  22.914  1.00  0.00           O  
HETATM   21  C   LIG     1      13.942 -14.354  20.879  1.00  0.00           C  
HETATM   22  C   LIG     1      13.232 -13.686  22.058  1.00  0.00           C  
HETATM   23  C   LIG     1      11.892 -13.133  21.595  1.00  0.00           C  
HETATM   24  N   LIG     1      11.218 -12.647  22.780  1.00  0.00           N  
HETATM   25  C   LIG     1      11.457 -11.554  23.483  1.00  0.00           C  
HETATM   26  N   LIG     1      12.359 -10.674  23.045  1.00  0.00           N  
HETATM   27  N   LIG     1      10.723 -11.336  24.566  1.00  0.00           N1+
HETATM   28  N   LIG     1      17.280 -13.692  20.830  1.00  0.00           N  
HETATM   29  C   LIG     1      17.043 -14.033  19.327  1.00  0.00           C  
HETATM   30  C   LIG     1      18.421 -14.472  18.809  1.00  0.00           C  
HETATM   31  C   LIG     1      19.610 -13.417  19.194  1.00  0.00           C  
HETATM   32  C   LIG     1      19.696 -13.083  20.765  1.00  0.00           C  
HETATM   33  C   LIG     1      18.351 -12.680  21.310  1.00  0.00           C  
HETATM   34  C   LIG     1      16.593 -13.014  18.550  1.00  0.00           C  
HETATM   35  C   LIG     1      15.941 -13.425  17.179  1.00  0.00           C  
HETATM   36  S   LIG     1      15.783 -14.904  14.458  1.00  0.00           S  
HETATM   37  O   LIG     1      17.488 -11.922  16.287  1.00  0.00           O  
HETATM   38  C   LIG     1      17.059 -15.583  13.508  1.00  0.00           C  
HETATM   39  C   LIG     1      16.864 -13.556  14.739  1.00  0.00           C  
HETATM   40  C   LIG     1      16.825 -12.903  16.107  1.00  0.00           C  
HETATM   41  C   LIG     1      18.146 -14.734  13.451  1.00  0.00           C  
HETATM   42  N   LIG     1      18.049 -13.554  14.106  1.00  0.00           N  
END

MDAnalysis

Stable

  • u.select_atoms("resid 1") selects A/GLY/1 and /LIG/1
  • u.select_atoms("resid 1A") selects only A/LYS/1A
  • u.select_atoms("resnum 1") selects everything
  • u.select_atoms("resnum 1A") fails with SelectionError
  • u.select_atoms("byres (around 2 resname LIG)") selects everything
    • The selection is expanded to atoms not in residues selected by (around 2 resname LIG)
    • This is fixed in Fix #2669 and #2672 #2673
  • u.select_atoms("same residue as (around 2 resname LIG)") selects everything
    • The selection is expanded to atoms not in residues selected by (around 2 resname LIG)
    • This is fixed in Fix #2669 and #2672 #2673
  • u.select_atoms("same resid as (around 2 resname LIG)") selects everything
    • The selection is expanded to atoms in residues not selected by (around 2 resname LIG)
    • This is inconsistent with the fact that resid 1 only selects A/GLY/1 and /LIG/1
    • This is somewhat unexpected since the intent of (around 2 resname LIG) is to select atoms near /LIG/1
  • u.select_atoms("same resnum as (around 2 resname LIG)") selects everything

PR #2673

  • u.select_atoms("resid 1") selects A/GLY/1 and /LIG/1
  • u.select_atoms("resid 1A") selects only A/LYS/1A
  • u.select_atoms("resnum 1") selects everything
  • u.select_atoms("resnum 1A") fails with SelectionError
  • u.select_atoms("byres (around 2 resname LIG)") selects A/GLY/1 and A/LYS/1A
    • The selection is expanded only to atoms in residues selected by (around 2 resname LIG)
  • u.select_atoms("same residue as (around 2 resname LIG)") selects A/GLY/1 and A/LYS/1A
    • The selection is expanded only to atoms in residues selected by (around 2 resname LIG)
  • u.select_atoms("same resid as (around 2 resname LIG)") selects everything
    • The selection is expanded to atoms in residues not selected by (around 2 resname LIG)
    • This is inconsistent with the fact that resid 1 only selects A/GLY/1 and /LIG/1
    • This is somewhat unexpected since the intent of (around 2 resname LIG) is to select atoms near /LIG/1
  • u.select_atoms("same resnum as (around 2 resname LIG)") selects everything

Problem?

resid behaves differently in u.select_atoms("resid 1") (where resid takes the insertion codes into account) and u.select_atoms("same resid as (around 2 resname LIG)") (where insertion codes are ignored).

In u.select_atoms("resid 1"), resid is not interchangeable with resnum; in u.select_atoms("same resid as (around 2 resname LIG)"), resid is interchangeable with resnum.

In [3]: s1 = u.select_atoms("same resnum as (around 2 resname LIG)")

In [4]: s2 = u.select_atoms("same resid as (around 2 resname LIG)")

In [5]: s1 == s2
Out[5]: True

In [6]: s3 = u.select_atoms("resid 1")

In [7]: s4 = u.select_atoms("resnum 1")

In [8]: s3 == s4
Out[8]: False

PyMol

  • select residue 1 selects both A/GLY/1 and /LIG/1
    • A/GLY/1 and /LIG/1 have the same residue number an no insertion code
    • A/LYS/1A and A/GLN/1B are not selected
  • select residue 1A selects only A/LYS/1A
  • select byres resname LIG around 2.0 selects A/GLY/1 and A/LYS/1A
    • The selection is expanded only to atoms in residues near /LIG/1

ProDy

ProDy has an interesting take:

  • p.select("resnum 1") selects everything
    • This is equivalent to resnum in MDanalysis
  • p.select("resnum 1A") selects only A/LYS/1A
  • p.select("resnum 1_") selects A/GLY/1 and /LIG/1
    • _ represent explicitly the absence of insertion code
  • p.select('same residue as exwithin 2.0 of resname LIG') selects A/GLY/1 and A/LYS/1A
  • p.select('same residue as within 2.0 of resname LIG') selects A/GLY/1, A/LYS/1A and /LIG/1

Question: is there a way to select an empty insertion code using the icode selection keyword in MDAnalysis?

@lilyminium
Copy link
Member

lilyminium commented Apr 26, 2020

@RMeli no, I don't think there's a way to select an empty string. Thanks for the thorough summary of the issue!

Edit: I didn't see residue 1B in the test file there. The potential residues for selection in every point below are [Gly1, Lys1A, Lig1].

To add (all but mdtraj are from skimming google, so feel free to correct):

  • mdtraj does not keep insertion codes, but it does differentiate residues based on them. Its "residue" is equivalent to MDA's resnum, and its "resid" is equivalent to MDA's resindex.
    • "select residue 1" selects everything
    • "select resid 2" selects the ligand
  • Phenix has two different properties "resid" (=> Pymol's resid) and "resseq" (=> MDA's resnum)
    • "resid 1" and "resseq 1 and icode ' '" (note space) selects all but 1A
    • "resseq 1" selects all
    • "resid 1A" selects 1A
  • CCP4 maintains insertion codes separately and allows you to select by them, but the default is '*'.
    • "1" selects all the residues
    • "1." selects all but 1A (I think)
    • "1.A" selects 1A
  • Bio3D separates residue numbers (called "resno") and insertion codes. The default value of NULL selects all possible icode values.
    • atom.select(resno=[1]) selects all
    • atom.select(resno=[1], insert=['A']) selects 1A
    • atom.select(resno=[1], insert=['']) selects all but 1A
  • PMV combines insertion codes with numbers into the one selector field. I'm not clear a selection w/o a code would select all, or select all but 1A.
  • coot keeps insertion codes separate from residue numbers in different selector fields.
  • Chime cannot distinguish insertion codes in atom selection (or altLocs for that matter)
  • biopython treats the residue label as a tuple of three elements: (hetero_flag: str, res_seq: int, ins_code: str). A glucose residue could be ('H_GLC', 100, 'A'). Empty flags/insertion codes are ' ' (spaces), not ''.

Bonus
The keyword resid variously refers to:

  • residue number (VMD)
  • residue index (MDTraj)
  • residue label, i.e. number+icode (Phenix)
  • residue name (Bio3D)

Summing up

  • Packages that don't support icode selection: VMD, MDTraj, Chime
  • Packages that require you to specify resnum and icode separately in residue selection: CCP4, Bio3D, coot, biopython
  • Packages that require you to consider icodes in residue selection and you cannot separate: Chimera, PyMol, Prody, PMV
  • Packages that can handle both ✨: Phenix

@RMeli RMeli closed this as completed Apr 26, 2020
@RMeli RMeli reopened this Apr 26, 2020
@richardjgowers
Copy link
Member

richardjgowers commented Apr 26, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants