PDB files with >=100k atoms #226

GiovanniBussi · 2017-04-12T07:28:19Z

PDB format does not allow atom numbers >=100k.

Currently we are bound to PDB format in two places:

MOLINFO
Reference structures for RMSD and similar variables

For MOLINFO we need to access residue numbers, chains, and atom types, so we are probably forced to stay with the PDB format. A possible solution would be to allow multiple PDBs (see also #134).

For reference structures, instead, splitting a system across multiple files could be complicated. Perhaps we could consider a different format?

We could look for an existing extension of the PDB format, for instance the hybrid format mentioned here.

GiovanniBussi · 2017-04-12T14:13:07Z

On the openmm GitHub they discussed this issue:
openmm/openmm#659
openmm/openmm#664

It looks like they chose the VMD format, which switches to hexadecimal from 100k onwards. This is what I obtained from an xyz file read and written with VMD:

 ....
 ATOM  99997  Ar      X   1      45.349  38.631  15.116  0.00  0.00
 ATOM  99998  Ar      X   1      46.189  38.631  15.956  0.00  0.00
 ATOM  99999  Ar      X   1      46.189  39.471  15.116  0.00  0.00
 ATOM  186a0  Ar      X   1      45.349  39.471  15.956  0.00  0.00
 ATOM  186a1  Ar      X   1      45.349  38.631  16.796  0.00  0.00
 ATOM  186a2  Ar      X   1      46.189  38.631  17.636  0.00  0.00
 ....

This however would not be compatible with our needs, since we need the serial to be unique when we store non-consecutive atoms for RMSD calculation. Indeed, in the VMD representation, past 100k some hex numbers contain only decimal digits and so look identical to decimal numbers. E.g., 18700 in hex equals 100096 in decimal. Since VMD does not mark hex numbers in any special way, there is no way to distinguish them. For instance, there would be no way to tell whether this line

 ATOM  18700  Ar      X   1      45.349  41.150   5.879  0.00  0.00

corresponds to atom 18700 or to atom 100096.
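A minimal Python sketch of this ambiguity, assuming the writer behaves as in the listing above (decimal up to 99999, lowercase hex afterwards). `vmd_style_serial` is a hypothetical helper written for illustration, not a VMD or PLUMED function:

```python
def vmd_style_serial(index):
    """Format a serial as in the listing above (assumed behaviour):
    decimal up to 99999, lowercase hexadecimal from 100000 onwards."""
    if index < 100000:
        return f"{index:5d}"
    return f"{index:5x}"


# Two different atoms end up with the same 5-character field.
low = vmd_style_serial(18700)     # decimal branch
high = vmd_style_serial(100096)   # hex branch, 0x18700 == 100096
print(repr(low), repr(high), low == high)

# A reader sees only the string, so both interpretations are plausible.
field = "18700"
print(int(field), int(field, 16))
```

In a file listing all atoms consecutively the position of the line would resolve the ambiguity, but a reference structure with non-consecutive atoms has no such context.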

The hybrid 36 format mentioned above seems more consistent, since there is a 1-to-1 map between numbers and strings. The sequence of strings would be something like

etc. In addition, it uses base-36 numbers with a leading letter, giving a maximum atom number of ~87M for 5-character strings (with the VMD syntax it is ~1M, so much less). However, I am not sure how easy it would be to produce such files. They provide small C and Python tools to manipulate these numbers.
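To make the 1-to-1 mapping and the ~87M limit concrete, here is a rough Python sketch of the scheme for 5-character serials as I understand it: plain decimal up to 99999, then base-36 strings starting with an uppercase letter, then the same with lowercase letters. The names `encode_serial` and `decode_serial` are made up for this illustration; this is neither the reference tool nor the code added to PLUMED, and negative numbers are not handled.

```python
DIGITS_UPPER = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS_LOWER = DIGITS_UPPER.lower()

WIDTH = 5
DECIMAL_SPAN = 10 ** WIDTH             # 100000 plain decimal values
LETTER_SPAN = 26 * 36 ** (WIDTH - 1)   # strings whose first character is a letter
OFFSET = 10 * 36 ** (WIDTH - 1)        # base-36 value of "A0000"
CAPACITY = DECIMAL_SPAN + 2 * LETTER_SPAN


def _to_base36(value, digits):
    """Convert a positive integer to base 36 using the given digit set."""
    chars = []
    while value:
        value, rem = divmod(value, 36)
        chars.append(digits[rem])
    return "".join(reversed(chars))


def encode_serial(value):
    """Map a non-negative integer to a 5-character serial field."""
    if not 0 <= value < CAPACITY:
        raise ValueError("value out of range")
    if value < DECIMAL_SPAN:
        return f"{value:>{WIDTH}d}"
    value -= DECIMAL_SPAN
    if value < LETTER_SPAN:
        return _to_base36(value + OFFSET, DIGITS_UPPER)
    return _to_base36(value - LETTER_SPAN + OFFSET, DIGITS_LOWER)


def decode_serial(field):
    """Inverse of encode_serial; the first character selects the range."""
    text = field.strip()
    first = text[0]
    if first.isdigit():
        return int(text)
    if first.isupper():
        return int(text, 36) - OFFSET + DECIMAL_SPAN
    return int(text, 36) - OFFSET + DECIMAL_SPAN + LETTER_SPAN


for n in (99998, 99999, 100000, 100001):
    print(n, encode_serial(n))
print("capacity:", CAPACITY)
```

Because the first character of the field tells the reader which range it belongs to, a string made only of digits is always decimal, which removes the ambiguity of the VMD syntax.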

Actually, including in PLUMED either the VMD format or this hybrid 36 would be super easy and backward compatible (with the exception of the problem in the VMD format mentioned above).

Addresses #226

GiovanniBussi self-assigned this Apr 20, 2017

GiovanniBussi added this to the Version 2.4 milestone Apr 20, 2017

GiovanniBussi added a commit that referenced this issue Apr 25, 2017

Added hybrid36 to pdb

0b42847

Addresses #226

GiovanniBussi added a commit that referenced this issue Apr 25, 2017

Regtest

e7c9e02

Addresses #226

GiovanniBussi closed this as completed Apr 27, 2017

GiovanniBussi mentioned this issue Jan 31, 2018

Selecting atoms in systems with > 100000 atoms #335

Closed

orbeckst mentioned this issue May 15, 2018

PDB hybrid36 format (Packmol PDB files with hexadecimal numbers) MDAnalysis/mdanalysis#1897

Closed

GiovanniBussi mentioned this issue Jul 2, 2018

Tool to renumber pdb files #371

Closed
