Allow read in of VERY large pdb files #1978
arm61
added some commits
Jul 9, 2018
It is not clear why the Travis build failed. I will investigate later.
Hello @arm61, welcome to MDAnalysis! I think this is failing because not all indices are base 36: the first 99,999 or so are plain decimal. I think you can add another middle layer to the existing try/except block:

    try:
        idx = int(thing)
    except ValueError:
        try:
            idx = int(thing, 36)
        except ValueError:
            # wrapped serials case
            pass  # keep the existing handling here
richardjgowers
reviewed
Jul 9, 2018
Fixes
  * Introduced compatibility for packmol (and hopefully generally) for pbd files with
richardjgowers
Jul 9, 2018
Member
Insert this inside the existing chunk below. Also add yourself to the AUTHORS file, as this is your first contribution.
FYI, PROPKA contains an implementation, hybrid36.py, and we can use the code because it is published under the LGPL.
It looks a little slow to call on every single line of a PDB file.
richardjgowers
self-assigned this
Jul 12, 2018
arm61
closed this
Jul 23, 2018
arm61
reopened this
Jul 23, 2018
arm61
added some commits
Jul 23, 2018
codecov
bot
commented
Jul 23, 2018
Codecov Report
@@            Coverage Diff             @@
##           develop    #1978     +/-   ##
==========================================
+ Coverage    88.59%    88.6%   +0.01%
==========================================
  Files          143      143
  Lines        17361    17386     +25
  Branches      2658     2665      +7
==========================================
+ Hits         15381    15405     +24
  Misses        1379     1379
- Partials       601      602      +1
richardjgowers
requested changes
Jul 23, 2018
def test_PDB_hex():
    u = mda.Universe(StringIO(PDB_hex), format='PDB')
    assert len(u.atoms) == 5
richardjgowers
Jul 23, 2018
Member
Looks good. Can you add a test that checks what the atom.id is, to make sure we're doing the base 36 conversion correctly?
And add yourself to the AUTHORS file.
kain88-de
reviewed
Jul 23, 2018
@@ -65,6 +65,8 @@ Fixes
    pack_into_box() (Issue #1911)
  * Fixed format of MODEL number in PDB file writing (Issue #1950)
  * PDBWriter now properly sets start value
  * Introduced compatibility for packmol (and hopefully generally) for pbd files with
    greater than 100 000 atoms (Issue #1897)
kain88-de
Jul 23, 2018
Member
We already read those files. What you added is support for the specific hybrid36 format, which wasn't supported before. It would be nice if you could name it in the changelog, to be precise.
It turns out that int(n, 36) does not decode the hybrid36 format correctly. I have used the implementation found in PHENIX instead (see the links in the commit message).
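A minimal sketch of the mismatch, using only the Python built-in int. It assumes the hybrid-36 convention in which the five-character field "A0000" directly follows "99999" and so stands for 100000; that convention is not spelled out in this thread.

```python
# Plain base-36 parsing of the first serial after "99999".
plain = int("A0000", 36)

# "A" is digit 10 in base 36, so plain parsing yields 10 * 36**4.
assert plain == 10 * 36**4

# Assumption: in hybrid-36, "A0000" is meant to be read as 100000.
expected = 100000

# The two values differ, so int(n, 36) alone cannot be the decoder.
print(plain, expected, plain == expected)
```

The gap between the two numbers is a constant offset for the upper-case range, which is why a dedicated decoder is needed rather than a bare int(n, 36) call.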
arm61
closed this
Jul 24, 2018
arm61
reopened this
Jul 24, 2018
It is not clear why only one of the CI instances is failing. Any input?
You can have a look at the log on Travis.
That means you are using a normal
TBH the linter is wrong there: the range call is inside a zip, so it is being iterated. But yeah, if you change it to use six.moves.range, it will stop complaining.
richardjgowers
reviewed
Jul 27, 2018
@@ -87,6 +87,44 @@ def float_or_default(val, default):
     except ValueError:
         return default


 digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
 digits_lower = digits_upper.lower()
 digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])
richardjgowers
Jul 27, 2018
Member
Rather than having two dicts and two code paths below, could you not create a single dict with both upper and lower case in it? I.e. e and E both map to the same value?
arm61
Aug 3, 2018
Author
Contributor
I am not sure this would work the same way, as upper-case characters are treated differently from lower-case ones (line 118 vs 124), unless I am missing something that you are seeing.
arm61
added some commits
Aug 3, 2018
Sorry about the delay; other things got in the way.
arm61
added some commits
Aug 7, 2018
WRT upper/lower case: if this is base 36, surely it doesn’t matter which case is used and we can mangle the input into either?
From a bit of reading, I don't think this is real base 36; it is referred to as hybrid-36. It is traditional base 36 (using upper case) until that is exhausted, then it uses lower case to make more numbers available. It is a weird monstrosity (as with all PDB formatting) that is almost a pseudo-base62. Going to put some tests for the
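The scheme described above can be sketched as follows. This is an illustrative decoder only, not the function merged in this PR: the name hy36decode_sketch is hypothetical, and the offsets are an assumption based on the published hybrid-36 convention ("A0000" follows "99999", and "a0000" follows "ZZZZZ").

```python
DIGITS_UPPER = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS_LOWER = DIGITS_UPPER.lower()


def hy36decode_sketch(field, width=5):
    """Decode a fixed-width hybrid-36 field (hypothetical helper)."""
    if len(field) != width:
        raise ValueError("field must be exactly {} characters".format(width))
    first = field[0]
    if first.isdigit() or first in " -":
        # Plain decimal range; int() tolerates leading spaces.
        return int(field)
    if all(c in DIGITS_UPPER for c in field):
        # Upper-case range: shift so that "A0000" maps to 10**width.
        return int(field, 36) - 10 * 36 ** (width - 1) + 10 ** width
    if all(c in DIGITS_LOWER for c in field):
        # Lower-case range: continues directly after "ZZZZZ".
        return int(field, 36) + 16 * 36 ** (width - 1) + 10 ** width
    # Mixed case or other characters are rejected.
    raise ValueError("invalid hybrid-36 field: {!r}".format(field))


for text in ("99999", "A0000", "ZZZZZ", "a0000", "zzzzz"):
    print(text, hy36decode_sketch(text))
```

Because the upper-case and lower-case ranges use different offsets, folding both cases into one lookup would make "A0000" and "a0000" collide, which is the distinction raised earlier in the review.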
@arm61 ewww ok. But yeah, if you can add some tests for values that hit all the different possibilities. You can use a
arm61
added some commits
Aug 10, 2018
I think those tests are pretty comprehensive. Also I agree, the pdb format is gross.
jbarnoud
reviewed
Aug 10, 2018
digits_upper = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
digits_lower = digits_upper.lower()
digits_upper_values = dict([pair for pair in zip(digits_upper, range(36))])
digits_lower_values = dict([pair for pair in zip(digits_lower, range(36))])
jbarnoud
Aug 10, 2018
Contributor
These are constants in the global namespace; they should be CAPITALIZED_WITH_UNDERSCORES.
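One way the renamed constants might look, as a sketch only: the upper-case names simply mirror the lower-case ones in the diff above, and the dict comprehensions are a stylistic assumption rather than the code that was merged.

```python
# Module-level constants, named in upper case with underscores.
DIGITS_UPPER = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
DIGITS_LOWER = DIGITS_UPPER.lower()

# Map each character to its base-36 digit value (0..35).
DIGITS_UPPER_VALUES = {char: value for value, char in enumerate(DIGITS_UPPER)}
DIGITS_LOWER_VALUES = {char: value for value, char in enumerate(DIGITS_LOWER)}
```

The comprehensions build the same mappings as dict(zip(digits, range(36))) without calling range at all.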
Looking through the coverage diff, it looks like we can't reach the exceptions in the decode function (probably because we're handling them before the function is called). I'd just remove them.
richardjgowers
approved these changes
Aug 10, 2018
richardjgowers
merged commit dbad72c
into
MDAnalysis:develop
Aug 10, 2018
Awesome, thanks @arm61!
arm61 commented Jul 9, 2018 (edited)
Fixes #1897
Changes made in this Pull Request:
PR Checklist