Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

residue numbering jumps unexpectedly: PDB file 1A02 #2251

Open
fabiotrovato opened this issue Apr 16, 2019 · 10 comments

Comments

@fabiotrovato
Copy link

commented Apr 16, 2019

Hi,

it seems that I am not able to select atoms as I want, using the pdbfile attached. My script is below, if you would like to give it a try.

The output that I see is the following:

['N' 'A' 'B' 'F' 'J']
<AtomGroup []>

which is clearly not what I want, since in the posted pdb file there is a chain named F and there is a residue numbered 140 in that chain (I also verified in VMD that things worked as expected). Am I mistaken with the selection syntax or is there a bug?

import MDAnalysis as mda

u = mda.Universe('1a02_1.pdb')
sele_str = "protein and segid F and resnum 140"
print(u.segments.segids)
print(u.select_atoms(sele_str))

1a02_1.pdb.zip

@fabiotrovato

This comment has been minimized.

Copy link
Author

commented Apr 16, 2019

I just found out that if I print the residue numbers of chain F with:

sel = u.select_atoms(sele_str)
print(sel.residues.resnums)

I get:

[10140 10141 10142 10143 10144 10145 10146 10147 10148 10149 10150 10151
 10152 10153 10154 10155 10156 10157 10158 10159 10160 10161 10162 10163
 10164 10165 10166 10167 10168 10169 10170 10171 10172 10173 10174 10175
 10176 10177 10178 10179 10180 10181 10182 10183 10184 10185 10186 10187
 10188 10189 10190 10191 10192]

These numbers correspond to the pdb residue numbers from 140 to 192 + 10000. Same for other chains.

What is this exactly???

@ezavod

This comment has been minimized.

Copy link

commented Apr 16, 2019

I think you're running into

# Wrapping
while resid - resid_prev < -5000:
resid += 10000

because of the last atom of chain B has a resid of 6088 and the first atom of chain F has the resid 140. Therefore this triggers the wrapping.

@fabiotrovato

This comment has been minimized.

Copy link
Author

commented Apr 16, 2019

@orbeckst

This comment has been minimized.

Copy link
Member

commented Apr 17, 2019

There are different (competing) views how to deal with PDB files that are used for different purposes. Storing/reading coordinates is fine but trying to deduce topology information is difficult.

For a start, PDB files are a terrible format for big systems because of the limitations on sizes of numbers. Many simulation codes nevertheless use PDB files to dump systems into them, with the understanding that atom numbers and residue numbers are kind of meaningless or at least with the understanding that they are sequentially numbered and eventually wrap around. chainIDs (or segIDs) are not necessarily used (and would not be useful as "molecule" identifiers because of size limitations). For real topology information, all MD codes have their own specialized formats. Because of MDAnalysis' roots in "analyzing MD trajectories", this is pretty much the approach taken here.

With protein databank PDBs, the view is more that resnums are labels that, together with chainID, are sufficient to locate a specific residue in a structure. This works for small systems but not very well for big systems. And it often does not contain enough information to build a topology, so some guesswork is involved.

Dealing with "the" PDB format is really difficult, see all the open issues on Format-PDB.

One approach for your and related issues would be to just store the "labels" from the file and don't care about them except for making them available for selections. But this requires a very careful view across all topology parsers in MDAnalysis and what we currently consider as standard attributes of groups (such as resids).

If you have an opinion of what should happen, please let us know.

@orbeckst

This comment has been minimized.

Copy link
Member

commented Apr 17, 2019

@fabiotrovato thanks for finding PDB files that break our PDBParser. The feedback is very welcome, it's just not that easy to solve all problems right away. If you have suggestions (or want to work on improving the PDBparser) then please jump in!

@orbeckst orbeckst changed the title Atom selection not working as expected residue numbering jumps unexpectedly: PDB file 1A02 Apr 17, 2019

@fabiotrovato

This comment has been minimized.

Copy link
Author

commented Apr 17, 2019

@orbeckst you're very welcome. I try to report as many cases as I can so that the package might improve in the future. The reason why I have so many examples to show is because I am analyzing a large dataset of pdb structures and I started doing it with MDAnalysis.
I am happy to give you my opinion and suggestions and, why not, even to work on improving the PDBparser. I am quite familiar with the messy (but extremely popular) pdb format and I use it often.

Because of MDAnalysis' roots in "analyzing MD trajectories", this is pretty much the approach taken here.

This is an understandable philosophy. However, in many cases a simulation starts exactly from a pdb structure. So, better to deal with this format despite it's many disadvantages.

One approach for your and related issues would be to just store the "labels" from the file and don't care about them except for making them available for selections.

My take on this is, yes it would be an important step forward in terms of usability and flexibility. Not to mention that the manipulation of a pdb file is often necessary also as a preliminary step in simulating biomolecules. Another example, directly from my personal experience, is about biomolecule modeling, specifically coarse-grained model creation. Flexibility in this case would mean make it easier to build such simplified models from a pdb structure.

Another point is how to make the improvements on the PDBparser compatible with the core philosophy of MDAnalysis. I have just started using it, so my view is a bit limited, but I am willing to give my support and contribution.

@richardjgowers

This comment has been minimized.

Copy link
Member

commented Apr 18, 2019

@jbarnoud

This comment has been minimized.

Copy link
Contributor

commented May 31, 2019

Maybe we should have different topology attributes: one that just count from 0, one that tries to be smart, and one that is just what the file says. For many formats the two last one will be the same, for PDB and a few others they will differ.

@orbeckst

This comment has been minimized.

Copy link
Member

commented May 31, 2019

I like @jbarnoud 's suggestion. One downside is that users that load from PDB files will have to choose how to analyze and then there might be a PDB_resid and resid, PDB_serial and index or something like that.

@orbeckst

This comment has been minimized.

Copy link
Member

commented May 31, 2019

The broader discussion is then to

  1. list our canonical attributes
  2. define what these attributes should mean

Then we should also describe format-specific attributes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.