Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inconsistent handling of PDB Insertion Codes and resid #2308

Open
RMeli opened this issue Aug 1, 2019 · 9 comments

Comments

@RMeli
Copy link
Member

commented Aug 1, 2019

I using a data set of ~16k PDB files and many of them contain insertion codes (see PDB File Format). I found that some codes just ignore (or forget) them and therefore I'm being extra careful whit my analysis. I noticed that MDAnalysis also shows some strange behavior.

Consider the following PDB file, where there are two different residues with the same residue number but with and without insertion code:

ATOM    392  N   LEU A  29      26.918 -15.013  -3.000  1.00 18.00           N
ATOM    393  H   LEU A  29      26.291 -15.692  -3.477  1.00  0.00           H
ATOM    394  CA  LEU A  29      26.474 -13.636  -2.851  1.00 17.76           C
ATOM    395  HA  LEU A  29      27.360 -13.024  -2.682  1.00  0.00           H
ATOM    396  C   LEU A  29      25.563 -13.475  -1.650  1.00 17.80           C
ATOM    397  O   LEU A  29      25.026 -12.400  -1.444  1.00 17.77           O
ATOM    398  CB  LEU A  29      25.771 -13.127  -4.152  1.00 16.97           C
ATOM    399  HB1 LEU A  29      24.803 -13.622  -4.236  1.00  0.00           H
ATOM    400  HB2 LEU A  29      25.621 -12.051  -4.064  1.00  0.00           H
ATOM    401  CG  LEU A  29      26.567 -13.401  -5.431  1.00 17.75           C
ATOM    402  HG  LEU A  29      26.698 -14.483  -5.455  1.00  0.00           H
ATOM    403  CD1 LEU A  29      25.796 -12.941  -6.688  1.00 16.58           C
ATOM    404 1HD1 LEU A  29      24.849 -13.478  -6.749  1.00  0.00           H
ATOM    405 2HD1 LEU A  29      25.604 -11.870  -6.623  1.00  0.00           H
ATOM    406 3HD1 LEU A  29      26.392 -13.152  -7.576  1.00  0.00           H
ATOM    407  CD2 LEU A  29      27.971 -12.768  -5.434  1.00 17.24           C
ATOM    408 1HD2 LEU A  29      27.881 -11.686  -5.333  1.00  0.00           H
ATOM    409 2HD2 LEU A  29      28.549 -13.165  -4.600  1.00  0.00           H
ATOM    410 3HD2 LEU A  29      28.473 -13.006  -6.372  1.00  0.00           H
ATOM    411  N   LEU A  30      25.385 -14.539  -0.852  1.00 19.19           N
ATOM    412  H   LEU A  30      25.849 -15.442  -1.076  1.00  0.00           H
ATOM    413  CA  LEU A  30      24.520 -14.423   0.357  1.00 19.64           C
ATOM    414  HA  LEU A  30      23.626 -13.869   0.070  1.00  0.00           H
ATOM    415  C   LEU A  30      25.272 -13.661   1.477  1.00 20.27           C
ATOM    416  O   LEU A  30      26.235 -14.195   2.067  1.00 21.08           O
ATOM    417  CB  LEU A  30      24.083 -15.795   0.874  1.00 19.89           C
ATOM    418  HB1 LEU A  30      23.672 -16.353   0.033  1.00  0.00           H
ATOM    419  HB2 LEU A  30      24.968 -16.307   1.251  1.00  0.00           H
ATOM    420  CG  LEU A  30      23.021 -15.790   2.004  1.00 20.22           C
ATOM    421  HG  LEU A  30      23.443 -15.166   2.792  1.00  0.00           H
ATOM    422  CD1 LEU A  30      21.673 -15.190   1.518  1.00 23.01           C
ATOM    423 1HD1 LEU A  30      21.832 -14.163   1.189  1.00  0.00           H
ATOM    424 2HD1 LEU A  30      21.291 -15.784   0.688  1.00  0.00           H
ATOM    425 3HD1 LEU A  30      20.954 -15.204   2.337  1.00  0.00           H
ATOM    426  CD2 LEU A  30      22.754 -17.189   2.582  1.00 22.25           C
ATOM    427 1HD2 LEU A  30      22.393 -17.845   1.790  1.00  0.00           H
ATOM    428 2HD2 LEU A  30      23.678 -17.592   2.996  1.00  0.00           H
ATOM    429 3HD2 LEU A  30      22.002 -17.119   3.368  1.00  0.00           H
ATOM    430  N   ASN A  30A     24.832 -12.436   1.742  1.00 21.44           N
ATOM    431  H   ASN A  30A     24.010 -12.091   1.207  1.00  0.00           H
ATOM    432  CA  ASN A  30A     25.424 -11.537   2.735  1.00 21.84           C
ATOM    433  HA  ASN A  30A     26.512 -11.534   2.668  1.00  0.00           H
ATOM    434  C   ASN A  30A     24.963 -12.022   4.124  1.00 22.73           C
ATOM    435  O   ASN A  30A     23.765 -12.131   4.358  1.00 23.27           O
ATOM    436  CB  ASN A  30A     24.896 -10.119   2.493  1.00 21.37           C
ATOM    437  HB1 ASN A  30A     25.152  -9.818   1.477  1.00  0.00           H
ATOM    438  HB2 ASN A  30A     23.812 -10.124   2.607  1.00  0.00           H
ATOM    439  CG  ASN A  30A     25.477  -9.123   3.452  1.00 23.63           C
ATOM    440  OD1 ASN A  30A     24.901  -8.853   4.524  1.00 25.25           O
ATOM    441  ND2 ASN A  30A     26.651  -8.605   3.113  1.00 19.86           N
ATOM    442 1HD2 ASN A  30A     27.131  -7.942   3.754  1.00  0.00           H
ATOM    443 2HD2 ASN A  30A     27.091  -8.862   2.206  1.00  0.00           H

If I load this PDB file

import MDAnalysis as mda
u = mda.Universe("test.pdb")
print(u.residues)
for res in u.residues:
    print(res.resid, res.resnum, res.resname, res.icode)

all the three residues are correctly identified:

<ResidueGroup [<Residue LEU, 29>, <Residue LEU, 30>, <Residue ASN, 30>]>
29 29 LEU
30 30 LEU
30 30 ASN A

We see that resid and resnum have the same value. However, a selection using resid and resnum gives different results:

u.select_atoms("resid 30")
<AtomGroup with 19 atoms>
u.select_atoms("resnum 30")
<AtomGroup with 33 atoms>

The resid keyword only selects the residue 30 without insertion code, while the resnum keyword select both residues with residue number 30. This is because, resid appears to be sensitive to insertion codes:

u.select_atoms("resid 30A")
<AtomGroup with 14 atoms>

which is a nice and short alternative to

u.select_atoms("resnum 30 and icode A")
<AtomGroup with 14 atoms>

The alternative is particularly useful with f-strings: resid {res.num}{res.icode} woks even if icode == "".

The questions I have the following:

  • Shouldn't res.resid return the full residue name (residue number and insertion code)?

Personally, I find odd that u.select_atoms("resid {res.resid}") might or might not return the correct residue when if insertion codes are involved. I would be happy to provide a fix if this is the case.

Additionally, in the MDAnalysis Documentation - Selection Commands there is no mention of the icode keyword and the resid is described as a single residue number or a range of numbers (without mention to the insertion code functionality).

  • Should the documentation be clearer about the difference between resid and resnum, and explicitly mention the icode selection keyword as well?

Again, I would be happy to provide an improvement.

Credits: Useful discussions with @IAlibay.

@orbeckst

This comment has been minimized.

Copy link
Member

commented Aug 1, 2019

Thanks for the useful summary.

We need to better define which quantities are "tags" that we just read from the file (e.g., icode is used in this way, resnum ought to behave in this way but I am not sure if it always does) and which ones have a universal meaning in MDAnalysis that is independent from the original file format. I think we were moving towards making the internal "ix" indices (such as atoms.ix and residues.ix) also properly available for the user, but at the moment, they are not accessible via the selection language. That leaves the selections with quantities that can have subtly different meanings in different file formats.

Summary of my ramblings: Make a suggestion how you think it should work, ideally with examples how it will work for data coming from different file formats (PDB, GRO, PSF, TPR, ...).

@orbeckst orbeckst added the proposal label Aug 1, 2019

@RMeli

This comment has been minimized.

Copy link
Member Author

commented Aug 2, 2019

I completely understand that different formats have different needs (and agree that the documentation could be clearer about universal quantities and tags). The only thing I found quite strange is that I can select a single residue with a particular resid but then the resid attribute of that selection is different:

import MDAnalysis as mda

u = mda.Universe("test.pdb")
ag = u.select_atoms("resid 30A")

print(ag)
print(ag.resids)
<AtomGroup [<Atom 39: N of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 40: H of type H of resname ASN, resid 30 and segid A and altLoc >, <Atom 41: CA of type C of resname ASN, resid 30 and segid A and altLoc >, ..., <Atom 50: ND2 of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 51: 1HD2 of type H of resname ASN, resid 30 and segid A and altLoc >, <Atom 52: 2HD2 of type H of resname ASN, resid 30 and segid A and altLoc >]>

[30 30 30 30 30 30 30 30 30 30 30 30 30 30]

As you can see, the residue 30A has been selected correctly, but then the resids of the atom group are different. Interestingly enough (since I didn't dig into the source code), the information is not completely lost:

ag.select_atoms("resid 30A")
<AtomGroup with 14 atoms>

But this is even more confusing, since looking at print(ag) one could be tempted to do the following:

ag.select_atoms("resid 30")
<AtomGroup with 0 atoms>

I guess my point is that the fact that the resid used for selection and the one printed out do not match is confusing and potentially error prone.

My suggestion would be to store full residue information in resid. Such information should allow to uniquely identify a single residue. This will of course depend on the format (I'm not knowledgeable in many of them...); for some formats this will simply be the equal to resnum but for other formats it could be slightly different. For PDB files, for example, it would be the residue number plus the insertion code (from the PDB File Format: The combination of residue numbering and insertion code defines the unique residue.)

@richardjgowers

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

Hey @RMeli I'm the one responsible for most of this behaviour, but I'm not really an expert on all these corner cases, so I'm happy to be corrected! Btw if you do something like select_atoms('resid 30A-30C') it should select the range too.

WRT resnum - I don't actually know what this is, it's something that existed before my time on MDA, and I kept around for backwards compat reasons... Maybe we should remove it (ie alias it to resid).

Currently ICodes are treated as a completely separate attribute, so as independents as charge. Reading this has made me think maybe we should instead lump it together with Resids...

So currently resids are created by the file parsing creating one of these TopologyAttr objects:

https://github.com/MDAnalysis/mdanalysis/blob/develop/package/MDAnalysis/core/topologyattrs.py#L1322

And passing it to the Universe. Universe then monkeypatches AtomGroups to let them know that they have resids. (This is why if you load an XYZ file, the AtomGroups won't even have a .charge attribute). A Group can then query the TopologyAttr object via the get_x method.

But maybe PDB files (and anything else that provides both Resid and ICode) should instead create a single ResidAndICode attribute, which then could return 30A as .resid attribute. So the TopologyAttr would instead hold 2 numpy arrays.

So things that should get changed

  • remove resnum
  • document selecting w/ icodes
  • rework Resid attribute to also handle icodes since they're linked.

If you want to work on any of those I'm happy to help you with it.

@RMeli

This comment has been minimized.

Copy link
Member Author

commented Aug 2, 2019

Hi @richardjgowers. I just tested

u.select_atoms("resid 30-30B")

and it works as expected (even if the first residue has no insertion code), which is great! I'm more and more convinced that resids behave as expected, but the residue.resid attribute is simply not a faithful representation.

I didn't know resnum was there only for backward compatibility. I thought the idea was that resid is an unique residue identifier while resnum is the residue number (which is some formats, such as PDB, does not uniquely identify a residue). What do you think of having them both, but with a clear distinction? resid could be defined as an unique identifier of a single residue (or a group of mutually exclusive residues, if there are alternative locations...) and resnum as the residue number (equal to resid in some formats). resnum could be useful to select all the residues with same residue number but with different (or no) insertion code.

The downside of this approach is that resid could be a string or a number, depending on the format; how do you feel about that?

I would be happy to work on this improvement with some guidance, but at the moment I'm quite busy with the last GSoC leg with another organisation and therefore it will have to wait until September.

@richardjgowers

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

As a weird hack you could change the get method of resids to check if there's icodes and return a version with icodes appended. It gets messy because then the return type has to be strings, whereas I think a lot of people might expect an integer array. This is going to break a lot of code where people are doing arithmetic with their resids.

FWIW resids are far from unique, we often see files where these have looped around.

I think what you're describing for resnum is what we currently have for resindex. Ie a number generated by MDAnalysis which enumerates the residues we think the topology represents.

@RMeli

This comment has been minimized.

Copy link
Member Author

commented Aug 2, 2019

Yes, I think such a change would break a lot of code and could be introduced only in a major revision. However, to fix the broken codes it would be sufficient to switch from resid to resnum when storing such attributes (during selection they work as expected).

I don't need to add the change in an hacky way at the moment since I'm only using select_atoms which appears to work correctly from my tests (reported here). However, problems will arise if one extracts resids for later use and therefore I wanted to raise the problem.

If I understand correctly, resindex is an internal representation and therefore is different from resnum in the sense that resnum should still correspond to the numbering in the original file. But my understanding or assumptions could be completely wrong!

@orbeckst

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

How does VMD handle resid vs resnum?

EDIT: And PyMOL, Chimera, ... ?

@orbeckst

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

#2308 (comment)

For PDB files, for example, it would be the residue number plus the insertion code (from the PDB File Format: The combination of residue numbering and insertion code defines the unique residue.)

The latter is not actually true: you also need a chain identifier because you can have the same resid in multiple chains. Furthermore, everything goes to hell in a hand basket when we deal with PDB files from simulations where resID wraps around after 9999, as @richardjgowers alluded to in #2308 (comment).

MDAnalysis has to deal with the problem that formats like PDB are used in different contexts and everbody would like their files to behave the way that they expect. What's difficult for us is figuring out what these expectations are (especially when they differ from the published standard). Therefore the points that you're raising are very valuable!

@orbeckst orbeckst changed the title PDB Insertion Codes inconsistent handling of PDB Insertion Codes and resid Aug 2, 2019

@RMeli

This comment has been minimized.

Copy link
Member Author

commented Aug 2, 2019

Yes, there can be multiple chains of course. And now that you point this out, I find the wording on the PDB Format Specification somewhat misleading...

I completely understand that there are many different contexts for PDB files alone, this is why I opened this discussion as Question and not as Issue. ; )

I wanted to raise the awareness that the following code is somewhat problematic:

ATOM    428 2HD2 LEU A  30      23.678 -17.592   2.996  1.00  0.00           H
ATOM    429 3HD2 LEU A  30      22.002 -17.119   3.368  1.00  0.00           H
ATOM    430  N   ASN A  30A     24.832 -12.436   1.742  1.00 21.44           N
ATOM    431  H   ASN A  30A     24.010 -12.091   1.207  1.00  0.00           H
ATOM    432  CA  ASN A  30B     25.424 -11.537   2.735  1.00 21.84           C
ATOM    433  HA  ASN A  30B     26.512 -11.534   2.668  1.00  0.00           H
import MDAnalysis as mda

u = mda.Universe("test.pdb")

ag = u.select_atoms("resid 30A")
print(ag)

resids = set(ag.resids)

for resid in resids:
    agw = u.select_atoms(f"resid {resid}")
    print(agw)
<AtomGroup [<Atom 3: N of type N of resname ASN, resid 30 and segid A and altLoc >, <Atom 4: H of type H of resname ASN, resid 30 and segid A and altLoc >]>
<AtomGroup [<Atom 1: 2HD2 of type H of resname LEU, resid 30 and segid A and altLoc >, <Atom 2: 3HD2 of type H of resname LEU, resid 30 and segid A and altLoc >]>

and find out if this behaviour is expected/wanted or not.

Maybe it's just a matter of updating the documentation to make clear that resid can be used to select residues with insertion codes but the resid attribute does not contain that information and it has to be accessed explicitly with icode?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.