Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GROParser cannot maintain original resids #2212

Open
xiki-tempula opened this issue Feb 28, 2019 · 11 comments

Comments

@xiki-tempula
Copy link
Contributor

commented Feb 28, 2019

I have a gro file which has 36 atoms has the residue id 1 and name BB. However, if I do

import MDAnalysis as mda
u = mda.Universe('test.gro')
start = u.select_atoms('resid 1 and name BB')

only 27 atoms are selected.

Please find the gro file in the link.
https://drive.google.com/file/d/1GMwUx17IxSi-metPxNDQPvXR5c1243_J/view?usp=sharing

Currently version of MDAnalysis

  • Which version are you using? 0.19.3-dev
  • Which version of Python (python -V)? py36
  • Which operating system? osx 10.13.6
@zemanj

This comment has been minimized.

Copy link
Member

commented Feb 28, 2019

Wow, I have never seen such a scrambled gro file... and the first line says "Written by MDAnalysis" 😱
There are a couple (2?) of possibilities leading to the problem:

  1. Residue- and atom numbering is not consecutive, that causes problems for systems exceeding 100,000 atoms
  2. Atom numbers (4th column) contain a lot of duplicates, not only in the case of numbers which would normally exceed 99,999 but also below.
    Example:
Line      3:     1MET     BB    1   0.000   0.000   0.000 -0.1697 -0.2925  0.0722
Line 100003: 62944ION     NA    1   0.000   0.000   0.000 -0.0759  0.4400  0.4969
Line 200003: 25680WN      WN    1   0.000   0.000   0.000  0.1612 -0.3843  0.1464

I'm not an expert concerning the gro reader, but I could imagine that duplicate atom indices are either simply overwritten or duplicates discarded.

Furthermore, why are all positions zero?

@zemanj

This comment has been minimized.

Copy link
Member

commented Feb 28, 2019

I also think that the concept of residue numbers is completely abused here since I'm pretty sure that each of your residues should have a unique number. It should be the residue name that is used to identify similar residues, not the number. But maybe I'm mistaken here...

@zemanj

This comment has been minimized.

Copy link
Member

commented Feb 28, 2019

By the way... VMD also seems to have trouble with that file:

Info)    Atoms: 212040
Info)    Bonds: 0
Info)    Angles: 0  Dihedrals: 0  Impropers: 0  Cross-terms: 0
Info)    Bondtypes: 0  Angletypes: 0  Dihedraltypes: 0  Impropertypes: 0
Info)    Residues: 212040
Info)    Waters: 0
Info)    Segments: 1
Info)    Fragments: 212040   Protein: 0   Nucleic: 0

As you can see, it detects identical numbers of atoms/residues/fragments...

@zemanj

This comment has been minimized.

Copy link
Member

commented Feb 28, 2019

Could you provide information about how that gro file was exactly generated?

@zemanj

This comment has been minimized.

Copy link
Member

commented Feb 28, 2019

I identified the problem:
Residue numbers coming after a residue with number 99999 are treated as 100000 + x, where x is the residue number in the gro file. That makes perfect sense if one assumes that each residue has a unique number and the file is not scrambled.

Therefore, the first BB atom that is not detected using your selection string is this one:

u.atoms[159030]                                                                                                                                                                                                                                                        
<Atom 159031: BB of type B of resname MET, resid 100001 and segid SYSTEM>
@xiki-tempula

This comment has been minimized.

Copy link
Contributor Author

commented Mar 2, 2019

Thank you for your help. Since I'm interested in protein aggregation, There are 36 copies of the same protein in the box. Thus, it might be desirable to know what is the first residue of all 36 copies of the same proteins doing and, thus, the residues are not numbered consecutively.
I'm not quite sure what is the right way to pursue as the gro file format is notorious for not able to number anything beyond 99999 correctly and there is a demand to preserve the residue id.
Positions are set to zero because I don't want my data to be scoped despite the fact that no one is interested in my data.

@xiki-tempula

This comment has been minimized.

Copy link
Contributor Author

commented Mar 2, 2019

Since there is an option in the gro output options to turn on and turn off the reindexing of the gro file #1781. Should we also add an option to the gro input to preserve the residue id? Interface might be

import MDAnalysis as mda
u = mda.Universe('test.gro, preserve_resid=True')
@zemanj

This comment has been minimized.

Copy link
Member

commented Mar 3, 2019

Such an option sounds useful to me. In the meantime, your particular problem could also be solved by reordering atoms such that the 36 proteins come first, followed by all solvent molecules and then all ions. But it's probably quicker and easier to just implement the preserve_resid option.

@jbarnoud

This comment has been minimized.

Copy link
Contributor

commented Mar 19, 2019

Can you try with the resnum selection keyword? If I recall correctly, "resid" is a continuous serial number that we set, and "resnum" is the number directly read from the file.

@orbeckst

This comment has been minimized.

Copy link
Member

commented Apr 3, 2019

If I recall correctly, "resid" is a continuous serial number that we set, and "resnum" is the number directly read from the file.

That's what I thought, too, but I don't think that this is the case, at least not for GRO as topology:

Resids(new_resids),
Resnums(new_resids.copy()),
because we just use the resid from the file and copy it into resnum.

The same happens for PDBParser

attrs.append(Resids(resids))
attrs.append(Resnums(resids.copy()))

I don't think that we are currently discriminating between resid and resnum. This might have been a corollary of #363 (maybe @richardjgowers or @dotsdl can comment) when we started using our own residx (Residue.ix (or maybe rix somewhere?)).

@orbeckst

This comment has been minimized.

Copy link
Member

commented Apr 3, 2019

I think preserve_resid=True #2212 (comment) should be something that all parsers should have.

We try to do the right thing and create a sensible representation of the data but sometimes we have to infer such as "oh, these residue numbers wrap around". For power users this might be the wrong assumption so we should give a means to those users to switch off any guessing.

@orbeckst orbeckst changed the title select_atoms is not able to select all the atoms GROParser cannot maintain original resids Apr 3, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.