New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow read in of VERY large pdb files #1978

Merged
merged 14 commits into from Aug 10, 2018

Conversation

Projects
None yet
5 participants
@arm61
Copy link
Contributor

arm61 commented Jul 9, 2018

Fixes #1897

Changes made in this Pull Request:

PR Checklist

  • Tests?
  • Docs? (N/A)
  • CHANGELOG updated?
  • Issue raised/referenced?
@arm61

This comment has been minimized.

Copy link
Contributor Author

arm61 commented Jul 9, 2018

Not clear why the travis failed. Will investigate later.

@richardjgowers

This comment has been minimized.

Copy link
Member

richardjgowers commented Jul 9, 2018

Hello @arm61 , welcome to MDAnalysis!

I think this is failing because not all indices are base 36, just the first 99,999 or so. I think you can add another middle layer to the try/except block that exists...

try:
    idx = int(thing)
except:
    try:
        idx = int(thing, 36)
    except:
        # wrapped serials case  

Fixes

* Introduced compatibility for packmol (and hopefully generally) for pbd files with

This comment has been minimized.

@richardjgowers

richardjgowers Jul 9, 2018

Member

insert this inside the existing chunk below, also add yourself to the AUTHORS file as it's your first contribution

@orbeckst

This comment has been minimized.

Copy link
Member

orbeckst commented Jul 10, 2018

FYI, PROPKA contains an implementation hybrid36.py and we can use the code because it is published under LGPL.

@richardjgowers

This comment has been minimized.

Copy link
Member

richardjgowers commented Jul 10, 2018

Looks a little slow for calling on every single line of a PDB file

@richardjgowers richardjgowers self-assigned this Jul 12, 2018

@arm61 arm61 closed this Jul 23, 2018

@arm61 arm61 reopened this Jul 23, 2018

arm61 added some commits Jul 23, 2018

@codecov

This comment has been minimized.

Copy link

codecov bot commented Jul 23, 2018

Codecov Report

Merging #1978 into develop will increase coverage by 0.01%.
The diff coverage is 96.55%.

Impacted file tree graph

@@            Coverage Diff             @@
##           develop   #1978      +/-   ##
==========================================
+ Coverage    88.59%   88.6%   +0.01%     
==========================================
  Files          143     143              
  Lines        17361   17386      +25     
  Branches      2658    2665       +7     
==========================================
+ Hits         15381   15405      +24     
  Misses        1379    1379              
- Partials       601     602       +1
Impacted Files Coverage Δ
package/MDAnalysis/topology/PDBParser.py 99.41% <96.55%> (-0.59%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 10fb62d...6277126. Read the comment docs.


def test_PDB_hex():
u = mda.Universe(StringIO(PDB_hex), format='PDB')
assert len(u.atoms) == 5

This comment has been minimized.

@richardjgowers

richardjgowers Jul 23, 2018

Member

Looks good, can you add a test that checks what the atom.id is to make sure we're correctly doing the base 36 conversion

@richardjgowers

This comment has been minimized.

Copy link
Member

richardjgowers commented Jul 23, 2018

And add yourself to the AUTHORS file

@@ -65,6 +65,8 @@ Fixes
pack_into_box() (Issue #1911)
* Fixed format of MODEL number in PDB file writing (Issue #1950)
* PDBWriter now properly sets start value
* Introduced compatibility for packmol (and hopefully generally) for pbd files with
greater than 100 000 atoms (Issue #1897)

This comment has been minimized.

@kain88-de

kain88-de Jul 23, 2018

Member

We do read those files already now. You added the specific hybrid36 format that wasn't supported. It would be nice if you can name it to be precise in the changelog.

Test atom[i].id is now present. Design this test showed that the int(…
…n, 36) method was not sufficient instead the more involved process used in PHENIX (http://cci.lbl.gov/hybrid_36/) (http://cci.lbl.gov/cctbx_sources/iotbx/pdb/hybrid_36.py) has been used
@arm61

This comment has been minimized.

Copy link
Contributor Author

arm61 commented Jul 23, 2018

It turns out that the int(n, 36) does not decode the hybrid36 format correctly. I have used the implementation found in PHENIX instead (see links in the commit message)

@arm61 arm61 closed this Jul 24, 2018

@arm61 arm61 reopened this Jul 24, 2018

@arm61

This comment has been minimized.

Copy link
Contributor Author

arm61 commented Jul 27, 2018

Not super clear why only one of the ci instances is failing, any input?

@kain88-de

This comment has been minimized.

Copy link
Member

kain88-de commented Jul 27, 2018

You can have a look at the log on travis.

package/MDAnalysis/topology/PDBParser.py:92: [W1638(range-builtin-not-iterating), ] range built-in referenced when not iterating
package/MDAnalysis/topology/PDBParser.py:93: [W1638(range-builtin-not-iterating), ] range built-in referenced when not iterating

That means you are using a normal range. On Python2 this will return a list and a generator on python3. The solution is to add from six.moves import range below the future import.
The linter helps us to keep the codebase python2/3 compatible.

@richardjgowers

This comment has been minimized.

Copy link
Member

richardjgowers commented Jul 27, 2018

TBH the linter is wrong there, the range call is inside a zip, so it is iterating it. But yeah if you change to use six.moves.range it will stop complaining

@@ -87,6 +87,44 @@ def float_or_default(val, default):
except ValueError:
return default

digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
digits_lower = digits_upper.lower()
digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])

This comment has been minimized.

@richardjgowers

richardjgowers Jul 27, 2018

Member

rather than having two dicts, and two code paths below, could you not create a dict with both upper and lower case in it? Ie e and E both map to whatever value?

This comment has been minimized.

@arm61

arm61 Aug 3, 2018

Author Contributor

I am not sure this would necessarily work the same as the upper case are treated differently from the lower case (line 118 vs 124). Unless I am not seeing something you are.

arm61 added some commits Aug 3, 2018

@arm61

This comment has been minimized.

Copy link
Contributor Author

arm61 commented Aug 3, 2018

Sorry about the delay. other things got in the way.

arm61 added some commits Aug 7, 2018

@richardjgowers

This comment has been minimized.

Copy link
Member

richardjgowers commented Aug 8, 2018

WRT upper/lower case, if this is base 36 surely it doesn’t matter which case and we can mangle the input into either?

@arm61

This comment has been minimized.

Copy link
Contributor Author

arm61 commented Aug 9, 2018

From a bit of reading, I don't think this is real base36, it is referred to as hybrid-36.

It is traditional base 36 (using upper case) until that is exhausted, then it uses the lower case to basically make more numbers available. It is a weird monstrosity (as with all pdb formatting) that is a pseudo-base62 almost.

Going to put some tests for the decode_pure() and hy36decode() functions today.

@richardjgowers

This comment has been minimized.

Copy link
Member

richardjgowers commented Aug 9, 2018

@arm61 ewww ok. But yeah, if you can add some tests for values that hit all the different possibilities. You can use a pytest.mark.parametrize which lets you write a list of values and it turns into lots of individual tests, eg here it loops over different residue name parsing tests: https://github.com/MDAnalysis/mdanalysis/blob/develop/testsuite/MDAnalysisTests/lib/test_util.py#L88

@arm61

This comment has been minimized.

Copy link
Contributor Author

arm61 commented Aug 10, 2018

I think those tests are pretty comprehensive. Also I agree, the pdb format is gross.

digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
digits_lower = digits_upper.lower()
digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])
digits_lower_values = dict([pair for pair in zip(digits_lower, range(36))])

This comment has been minimized.

@jbarnoud

jbarnoud Aug 10, 2018

Contributor

These are constants in the global name space, they should be CAPITALIZED_WITH_UNDERSCORES.

This comment has been minimized.

@arm61

arm61 Aug 10, 2018

Author Contributor

Resolved in 02bf546

@richardjgowers

This comment has been minimized.

Copy link
Member

richardjgowers commented Aug 10, 2018

Looking through the coverage diff it looks like we can't reach the exceptions in the decode function (probably because we're handling them before the function). I'd just remove them

@richardjgowers richardjgowers merged commit dbad72c into MDAnalysis:develop Aug 10, 2018

4 checks passed

codecov/patch 96.55% of diff hit (target 88.59%)
Details
codecov/project 88.6% (+0.01%) compared to 10fb62d
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@richardjgowers

This comment has been minimized.

Copy link
Member

richardjgowers commented Aug 10, 2018

Awesome, thanks @arm61 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment