Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expansion of dihedrals analysis module #2033

Merged
merged 21 commits into from
Aug 13, 2018

Conversation

hfmull
Copy link
Contributor

@hfmull hfmull commented Aug 8, 2018

Changes made in this Pull Request:

  • Added Dihedral and Janin classes to dihedral.py
  • Ramachandran class changed to account for selection errors
  • Added tests to test_dihedral.py to account for changes
  • Added reference plots for Ramachandran and Janin classes

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • [n/a] Issue raised/referenced?

@@ -147,6 +235,16 @@ def __init__(self, atomgroup, **kwargs):

phi_sel = [res.phi_selection() for res in residues]
psi_sel = [res.psi_selection() for res in residues]
if any(sel is None for sel in phi_sel):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added because phi_selection() and psi_selection() do not account for altloc selections. They will return None in the middle of the protein if that is the case, so this removes the None and the corresponding phi or psi angle. The amount of residues for which this is the case is very small so it seemed easier to just ignore them instead of rewriting phi_selection() and psi_selection(). If a user is interested in those specific residues, they can use the Dihedral class to look at them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to the code, summarizing what you explained above.


if any(len(self.ag1) != len(ag) for ag in [self.ag2, self.ag3,
self.ag4, self.ag5]):
raise ValueError("Too many or too few atoms selected. Check for "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to Ramachandran, altloc will make the selections fail, so here the class simply fails. Unlike Ramachandran, the user can then select for only on altloc and the analysis will run. Also I've found some topologies that were incomplete and did not have all the atoms it was supposed to so that's why it says 'too few'.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

altloc (and anything else that specifies as "dual topology") is a problem.

@richardjgowers is there a sensible way how a user can switch between different topologies? The problem is that at the moment, ag.residues will contain all atoms, including altloc ones, even if my selection for ag excluded them. This is problematic for the residue.phi_selection(), residue.psi_selection() and residue.chi1_selection() methods, which then see groups that are not 4 atoms.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, not really. This would have to be something done when the file is parsed. It sounds like maybe Residue.phi_selection() should take an altloc kwarg or something to tell it what to identify?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MMTF switches between topologies... somehow, so we have that.

We could/should use the fact that the residue.X_selections() are methods. Perhaps indeed use a default Residue.phi_selection(selection="altloc ' '") or similar – with the issue that we also have to catch the case when this selection makes no sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mmtf does the topologies as frames, (or rather we read models as frames, like how PDB is abused (or at least one of the various ways..)).

Maybe naïve question, is altloc just a different position? Or can it be different functionalisation? If the former, I guess you could write an altloc aware reader/parser which expands out the different possibilities.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that altloc just refers to different positions, but that those positions correspond to separate atoms that are both present in the topology, so they are treated just like extra atoms that are part of the residue. One of the issues with altloc selection is that there doesn't seem to be a way to select the atoms that have no altloc (at least I haven't figured it out). So to get the atoms with no altloc and those with altloc A, you have to say 'not altloc B', assuming there is only A and B for options. I think there can be more altlocs but I have yet to see any that go into C or higher.

Copy link
Member

@orbeckst orbeckst Aug 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have the PDB ID of a file with altloc handy?

If it has altlocs then we assign a default, which is either '' (empty string) or ' ' (one space). I don't know if our selection language can deal with it. But there's always the object-oriented/pandas-style approach

ag[ag.altloc == '']

where you use boolean indexing. (Assuming that we expose an .altloc attribute somewhere... @richardjgowers would know.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been using PDB ID 19hc, but that method returns an AttributeError and says that AtomGroup has no altloc attribute

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@richardjgowers was right

>>> import MDAnalysis as mda
>>> u = mda.fetch_mmtf('19hc')
>>> u.atoms.altLocs
array(['', '', '', ..., '', '', ''], dtype=object)
>>> set(u.atoms.altLocs)
{'', 'A', 'B'}
# select no A or B
>>> u.atoms[u.atoms.altLocs == '']
<AtomGroup with 5944 atoms>
>>> u.atoms
<AtomGroup with 6098 atoms>

The default altLoc indicator is the empty string.

" have been removed from the selection.")
residues = residues.difference(remove)

self.ag1 = residues.atoms.select_atoms("name N")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided not to use chi1_selection() for this one because it does not capture all of the atoms it should, specifically the residues with multiple branches or non-carbon side chains.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the standard residue.chi1_selection() be updated?

If so, just open an issue so that it isn't forgotten.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes please!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also add your comment above as a comment in the code.

It will be helpful to anyone else coming after you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I raised #2044. @hfmull if you can add anything to the issue report (e.g., example PDB ID where the current chi1_angles() gives wrong results) then that would be great!

@codecov
Copy link

codecov bot commented Aug 9, 2018

Codecov Report

Merging #2033 into develop will increase coverage by 0.03%.
The diff coverage is 96.2%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2033      +/-   ##
===========================================
+ Coverage     88.9%   88.93%   +0.03%     
===========================================
  Files          143      144       +1     
  Lines        17386    17462      +76     
  Branches      2665     2684      +19     
===========================================
+ Hits         15457    15530      +73     
  Misses        1321     1321              
- Partials       608      611       +3
Impacted Files Coverage Δ
package/MDAnalysis/analysis/data/filenames.py 100% <100%> (ø)
package/MDAnalysis/analysis/dihedrals.py 96.52% <95.83%> (-1.31%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 388346d...80533cc. Read the comment docs.

@orbeckst
Copy link
Member

orbeckst commented Aug 9, 2018

@hfmull I merged develop into your branch because Appveyor was complaining about it not being in a conflict-free state. I think just adding this new commit kicked Travis CI into doing something, too – sometimes one just has to add another commit...

@richardjgowers
Copy link
Member

richardjgowers commented Aug 9, 2018 via email

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice code and cool new features!

See comments – main issue is that datafiles for MDAnalysis cannot reside in MDAnalysisTests.

Ramachandran analysis
~~~~~~~~~~~~~~~~~~~~~

The :class:`~MDAnalysis.analysis.dihedrals.Ramachandran` class allows for the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add references to the papers (Ramachandran and Janin) and add a section References.

R.plot(ax=ax, color='k', marker='s')

Alternatively, if you wanted to plot the data yourself, the angles themselves
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still mention that the data is in :attr:angles – most people will want to use it.

for ts in R.angles:
ax.scatter(ts[:,0], ts[:,1], color='k', marker='s')
Reference plots can be added to the axes for both the Ramachandran and Janin
classes using the kwarg ``ref=True``. These were made using data obtained from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

Can you generate example plots for Ramachandran and Janin so that we can (eventually) add them to the docs? I can't quite remember how best to do it... I did it for the MDA tutorial.

Just create a directory package/doc/sphinx/source/images, add them there, and we'll figure out the rest later.

.. attribute:: angles

Contains the time steps of the angles for each atomgroup in the list as
an n_frames×len(atomgroups) :class:`numpy.ndarray` with content
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooooh... proper typography!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put the variables in double-back ticks

``n_frames×len(atomgroups)``

so that they get code formatting.

(Still ❤️ ing the ×)


.. attribute:: angles

Contains the time steps of the phi and psi angles for each residue as
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chi1 and chi2

Btw, you can use proper Greek characters with LaTeX-like formatting

contains :math:`\chi_` and :math:`\chi_2` angles

LYSJaninArray = resource_filename(__name__, 'data/adk_oplsaa_LYS_janin.npy')

Rama_ref = resource_filename(__name__, 'data/rama_ref_data.npy')
Janin_ref = resource_filename(__name__, 'data/janin_ref_data.npy')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to go into MDAnalysis (see above).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you also prepare references for each amino acid separately? I think that this would be more interesting than all sidechains averaged. The peptide backbone is not strongly dependent on the chemical identity of the residue (GLY, PRO are exceptions) but the Janin plots should depend strongly on the nature of the sidechain.

@@ -71,7 +109,61 @@ def test_protein_ends(self, universe):
with pytest.warns(UserWarning):
rama = Ramachandran(universe.select_atoms("protein")).run()

def test_None_removal(self):
with pytest.warns(UserWarning):
u = mda.coordinates.MMTF.fetch_mmtf('19hc')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if fetch_mmtf() cannot download the PDB? You should either XFAIL the test if this happens or include a gzipped 19hc.pdb.gz as a data file.


def test_janin(self, universe):
janin = Janin(universe.select_atoms("protein")).run()
test_janin = np.load(JaninArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the test_janin a class-level fixture so that it only gets loaded once:

class ....:

   @pytest.fixture
   def janin_ref_array(self):
         return np.load(JaninArray)

   def test_janin(self, universe, janin_ref_array):
         janin = Janin(universe.select_atoms("protein")).run()
         assert_almost_equal(janin.angles, janin_ref_array, 4, ...)

@@ -40,7 +78,7 @@ def universe(self):

def test_ramachandran(self, universe):
rama = Ramachandran(universe.select_atoms("protein")).run()
test_rama = np.load(DihedralsArray)
test_rama = np.load(RamaArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it a fixture (see comments below for JaninArray – this is the common way to treat "reference data" in pytest

janin = Janin(universe.select_atoms("protein")).run()
test_janin = np.load(JaninArray)

assert_almost_equal(janin.angles, test_janin, 4,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is decimal=4 sufficient to let the tests pass on Linux and OSX?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a one line comment for the justification for reducing the test precision.

`atomgroup` for each time step in the trajectory. A :class:`~MDAnalysis.ResidueGroup`
is generated from `atomgroup` which is compared to the protein to determine
if it is a legitimate selection.
:math:`\phi` and :math:`\psi` angles will be calculated for each residue \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backslash not needed at end of line

for each residue as an ``n_frames×n_residues×2`` :class:`numpy.ndarray`
with content ``[[[chi1, chi2], [residue 2], ...], [time step 2], ...]``.

References
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

@@ -0,0 +1,35 @@
# -*- Mode: python; tab-width: 4; indent-tabs-mode:nil; coding:utf-8 -*-
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data directory also needs an empty file __init__.py to mark it as an importable module. Just add it with

touch analysis/data/__init__.py
git add analysis/data/__init__.py

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The absence of the file is the reason for the test failures.

@@ -64,7 +63,8 @@ Enhancements
generated with gromacs -noappend (PR #1728)
* MDAnalysis.lib.mdamath now supports triclinic boxes and rewrote in Cython (PR #1965)
* AtomGroup.write can write a trajectory of selected frames (Issue #1037)
* Added analysis.dihedrals with Ramachandran class to analysis module (PR #1997)
* Added dihedrals.py with Dihedral, Ramachandran, and Janin classes to
analysis module (PR #1997, PR #2033)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add entry under Enhancements

*  added the analysis/data module for reference data used in analysis

# MDAnalysis: A Toolkit for the Analysis of Molecular Dynamics Simulations.
# J. Comput. Chem. 32 (2011), 2319--2327, doi:10.1002/jcc.21787
#

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add brief documentation:

Analysis data files
------------------

.. data:: Rama_ref
   Reference Ramachandran histogram for :class:`MDAnalysis.analysis.dihedrals.Ramachandran`. 
   The data were calculated on a data set of 500 PDB structures taken from [Lovell2003]_.

.. data:: Janin_ref
   ...

You are starting this data module so you should set a good example for documentation in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Insert the docs for analysis.docs.data.filenames into the sphinx docs as you did for dihedrals. Or ping me an I will add it when you have done the docs inside filenames.py and dihedrals.py.

for ts in R.angles:
ax.scatter(ts[:,0], ts[:,1], color='k', marker='s')
Reference plots can be added to the axes for both the Ramachandran and Janin
classes using the kwarg ``ref=True``. These were made using data obtained from
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace "These were made using" with

The Ramachandran reference data (:data:`~MDAnalysis.analysis.data.filenames.Rama_ref`) and 
Janin reference data (:data:`~MDAnalysis.analysis.data.filenames.Janin_ref`) were made ...

It's good to put in cross links between the docs.

Alternatively, if you wanted to plot the data yourself, the angles themselves
can be accessed using :attr:`Ramachandran.angles`::
The Janin class works in the same way, only needing a list of residues. To plot
the data yourself, the angles can be accessed using :attr:`Ramachandran.angles`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or :attr:Janin.angles

@@ -85,7 +85,7 @@
"XTC_sub_sol",
"XYZ", "XYZ_psf", "XYZ_bz2",
"XYZ_mini", "XYZ_five", # 3 and 5 atoms xyzs for an easy topology
"TXYZ", "ARC", "ARC_PBC", # Tinker files
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why ARC_PBC was removed. This might have been a merge accident because it is present in current develop. Please add it back! The line should be

 "TXYZ", "ARC", "ARC_PBC",        # Tinker files

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure that you pull from your branch – maybe you didn't get the merge commit?

@@ -294,7 +297,6 @@
XYZ_five = resource_filename(__name__, 'data/five.xyz')
TXYZ = resource_filename(__name__, 'data/coordinates/test.txyz')
ARC = resource_filename(__name__, 'data/coordinates/test.arc')
ARC_PBC = resource_filename(__name__, 'data/coordinates/new_hexane.arc')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add this line back in

ARC_PBC = resource_filename(__name__, 'data/coordinates/new_hexane.arc')

@orbeckst
Copy link
Member

One test failed but not because of anything you did (but more likely #1988). I restarted the test and hopefully everything will light up "passing".

I am approving the PR and then it can be merged whenever Travis is happy.

Excellent work, @hfmull !

@orbeckst orbeckst self-assigned this Aug 12, 2018
@orbeckst
Copy link
Member

@orbeckst orbeckst merged commit 85ec60a into MDAnalysis:develop Aug 13, 2018
@orbeckst
Copy link
Member

@hfmull FYI: I benchmarked your Dihedral analysis class against the naive approach in Gist https://gist.github.com/orbeckst/26081375f3ea3152f08bbcc90c14c5eb and your approach is between 80 and 100 times faster for a typical protein. Good job!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants