Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial refactor and test added [Fixes #33] #34

Merged
merged 6 commits into from
Jun 25, 2021
Merged

Initial refactor and test added [Fixes #33] #34

merged 6 commits into from
Jun 25, 2021

Conversation

ojeda-e
Copy link
Member

@ojeda-e ojeda-e commented Jun 24, 2021

Helps to fix #33.
In this first version of the refactored code, the calculation of curvature doesn't evaluate per leaflet, and instead, it runs for a surface defined by selected atoms. (More general)

Changes in this PR:

  • Surface derived from direct selection using MDAnalysis.
  • Deleted functions def_all_beads, curvature and core_fast.
  • Function grid_map to map coordinates to grid added.
  • In grid_map arguments as np.array / tuples instead of topologies.
  • Replaced MDtraj by MDanalysis 100%.

As suggested here, tests with toy model using pytest.mark.parametrize added:

  • test_grid_map_small_9grid, using toy model in small grid of 9 lipids in a 3x3 grid, with x values of 0, 1, 2, and y values of 0, 1, 2.
  • test_grid_map_25grid, using toy model in small grid of 25 lipids in a 5x5 grid, with x values of 0, 1, 2,3,4 and y values of 0, 1, 2,3,4.

This PR may also fix

Edit: This PR also fixes #16 .

@codecov
Copy link

codecov bot commented Jun 24, 2021

Codecov Report

Merging #34 (8569cf0) into main (bf6019b) will increase coverage by 9.39%.
The diff coverage is 73.33%.

@ojeda-e ojeda-e requested a review from orbeckst June 24, 2021 22:59
Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress! I have some general comments.

My main question is why there are so many changes to the tests in this PR. Once you start doing big refactoring like changing from mdtraj to mda, I would have expected the previous tests to be still in place and passing. (I might not be fully up-to-date on your changes to the tests but I'd still expect any changes to be tests before this PR.)

topology = md.load(grofile).topology

# 6. Populate universe with coordinates and trajectory
# 2. Populate universe with coordinates and trajectory
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MDAnalysis is not restricted to GRO format files. I'd just call it topology and trajectory.

u = mda.Universe(grofile, trjfile)

# 6.1 Set grid: Extract box dimension from MD sim,
# 3 Set grid: Extract box dimension from MD sim,
# set grid max width, set number of unit cells
box_size = u.dimensions[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This gives you Lx only. Do you assume square X-Y ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe at least

box_size = max(u.dimensions[0], u.dimensions[1])

If you assume orthorhombic boxes then consider just failing here if you encounter a triclinic box.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is one of the limitations and one of the next issues to solve. Issue #35 added.

Comment on lines 10 to 13
x: float
Value of x coordinate
y: float
Value of y coordinate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs (x,y) do not seem to agree with function signature (coords, factor)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right and thanks for this catch. I'll update it.

Comment on lines +21 to +22
index_grid_l = int(abs(coords[0]) * factor)
index_grid_m = int(abs(coords[1]) * factor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is coords a single coordinate (essentially a 3-tuple)?

If so then you should keep in mind for later that you can almost certainly gain quite a bit of performance by doing the grid indexing operation on all coordinates at once (there's code related to np.histogramdd that you can probably use... but not in this PR!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test with negative coordinates? It might fail for now but that's ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @orbeckst's comment -- it would be good to start thinking about expected inputs and output. For example, the most common way that coordinates tend to be passed around is in an (N x 3) numpy array. Even without np.histogramdd you could then simplify this function to return np.abs(coords * factor).astype(int).

@@ -160,27 +54,24 @@ def core_fast_leaflet(z_Ref, leaflet, traj, jump, n_cells, lipid_types,

grid_count = np.zeros([n_cells, n_cells])

for frame in range(0, traj.n_frames, jump):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the jump argument good for?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jump was to skip frames. I was using jump = 1 for all the tests so I decided it was not very useful at this point either.

membrane_curvature/lib/mods.py Outdated Show resolved Hide resolved
membrane_curvature/lib/mods.py Outdated Show resolved Hide resolved
membrane_curvature/lib/mods.py Show resolved Hide resolved
membrane_curvature/lib/mods.py Outdated Show resolved Hide resolved
@ojeda-e
Copy link
Member Author

ojeda-e commented Jun 25, 2021

Thanks for your comments @orbeckst
There are two main reasons for all the changes here introduced.

  • The first one and most relevant, is that in the previous version of my code, topologies and trajectories fromMDtraj were parsed in three of the functions: def_all_beads, core_fast_leaflets, and core_fast.
    With the change from MDtraj to MDAnalysis, the arguments of def_all_beads disappeared, core_fast core_fast_leaflets significantly changed.
  • The second reason is that in previous PRs (see PR test core_def_all_beads added. [Issue #27] #31 and test core_fast added [Issue #28] #30) the tests I was asked to provide were with dummy coordinates, to be more readable. Which is completely ok, but makes more sense only after refactoring.
    As I mentioned above, In the previous version of the code, the arguments parsed in 3 of the functions were MDtraj topologies and trajectories that I can't build with numpy arrays. Only after this refactoring, and having MDAnalysis instead, that type of test makes more sense.
  • Additionally, since the previous PRs I was asked to have tests using dummy coordinates and the ones I provided with grofiles were getting too complicated to read, I simplify the tests in this new version of the code.

In the spirit of moving forward, I added changes in this PR that wrap up the previous limitations and allows me to progress. If further tests are needed, now with this refactored version providing dummy coordinates will be possible. I feel I haven't made progress because of the tests that have been asked previously, which were going to change no matter what with this PR.
If it's better, I can put the big systems tests for two of the functions, mean_curvature and gaussian_curvature, which are essentially the only two functions that remain the same after refactoring.

@pep8speaks
Copy link

pep8speaks commented Jun 25, 2021

Hello @ojeda-e! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-06-25 03:37:02 UTC

@ojeda-e ojeda-e requested a review from orbeckst June 25, 2021 17:30
Copy link
Member

@lilyminium lilyminium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start to refactoring, @ojeda-e -- the new grid_map tests are much more digestible!

I have a few notes on the refactor. I think it's worth spending some time thinking about how you want to break your MDTraj code and put it back together in an MDAnalysis format. You've swapped MDTraj trajectories for MDAnalysis universes in core_fast_leaflet, which is a good start, but IMO an iterative kind of refactor will constrain potential designs because of trying to be similar to the previous iteration.

Instead, this seems to be a good time to reduce complexity and start separating your functions into more modular ones, where each function does only one main thing. Then it will be easier to put them together in convenient ways as we try to work out the best design for the MembraneCurvature AnalysisBase class. (and easier to test!)

Currently core_fast_leaflet does three things:

  • gets positions for each atom in the atom group for each frame
  • identifies the grid cell for each coordinate (currently grid_map)
  • calculates the average z for the atom group

Could you please break this down into three functions, and write a new function that simply calls them (just for the regression tests?) I would be keen to have these as divorced from either MDTraj or MDAnalysis as possible and work only with numpy arrays, which are generally more versatile. The overall function could grab the positions from the AtomGroup and feed it into the functions instead.

In addition, some unofficial principles of design from personal experience:

Side-effecting functions

Currently core_fast_leaflet is a side-effecting function. This means that it modifies a variable, z_Ref, outside its local environment. In functional programming, where you write functions (as opposed to class/objects in object-oriented programming), it's generally good practice to avoid side-effecting functions. This StackOverflow post summarises it better than I could, but in general side-effect free functions are easier to test, easier to parallelize, and easier to cobble together into larger functions.

To write a side-effect free function, core_fast_leaflet would be given input arguments that it doesn't modify, and return an output value. So the test_core_fast_lealets function could look like:

def test_core_fast_leaflets():
    n_cells, max_width = 3, 3
---    z_calc = np.zeros([n_cells, n_cells])
    u = mda.Universe(GRO_PO4_SMALL, XTC_PO4_SMALL)
    selection = u.select_atoms('index 0:3')
---    core_fast_leaflet(u, z_calc, n_cells, selection, max_width)
+++    z_calc = core_fast_leaflet(u, n_cells, selection, max_width)
    for z, z_test in zip(MEMBRANE_CURVATURE_DATA['grid']['small']['upper'], z_calc):
        print(z, z_test)
        assert_almost_equal(z, z_test)

The principle of least astonishment

The principle of least astonishment basically advises that you use your code to create user expectations, and then try your best to fulfil them. This comes in many forms, such as naming your variables and functions clearly (more on that below). It also means trying to follow the conventions of other packages as best you can. This is a bit nitpicky, but one example is the input arguments to np.zeros. The documentation actually specifies that the shape should be an integer or a tuple of integers. In most contexts, code that looks like np.func([x, y]) means that np.func is operating on a list of values that will be converted to an array. In contrast, shape is commonly specified as a tuple (x, y) so it's much less likely to be misunderstood.

The type of input arguments to np.zeros is of course a very small issue. I just bring it up now because as I said above, refactoring is a good time to start thinking about what kind of inputs and outputs you want your functions to take, as that is a key part of API design.

Even more nitpicky: parameter ordering is also part of designing a function. There isn't really much formal advice on how to order parameters in a function (although I found this Stackoverflow post). This is not important right now, but for example with core_fast_leaflet, as a user, I would be surprised at the ordering of universe, z_Ref, n_cells, selection, max_width. That's because I think of universes and AtomGroups as very similar things (in fact, you don't necessarily need universe as an argument, as you could get the universe = selection.universe). In addition, n_cells and max_width both relate to the same thing: factor. It feels more intuitive to me to call core_fast_leaflet(universe, selection, z_Ref, n_cells, max_width) and have those arguments grouped together.

Naming functions

Part of PLOA is naming things clearly, so future developers (or you in 2 months!) can read your code with minimal effort. PEP8 has a whole section on naming conventions. In general, it's typically a good idea to have a verb in your function name. For example, I am not sure what grid_map does just by looking at its name. If it were more explicit and map were a verb, e.g. map_coordinates_to_grid, that's much more immediately obvious. It's up to you how verbose you would like to make it; other candidates could be map_coordinates_to_array, map_coordinates_to_grid_array, get_coordinate_cell.

The kind of verb can hint at the expected output, such as get_coordinate_cell above. For example, I would expect functions that start with is_ or has_ (e.g. has_atoms, is_negative) to return either True or False.

Comment on lines +21 to +22
index_grid_l = int(abs(coords[0]) * factor)
index_grid_m = int(abs(coords[1]) * factor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a test with negative coordinates? It might fail for now but that's ok.

# should map to
lambda xy: (xy[0], xy[1]),
5, 5),
])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parametrization is just testing one set of arguments, do you plan to add more? You could combine the 9-grid and the 25-grid into the one function with parametrize though.

Comment on lines +21 to +22
index_grid_l = int(abs(coords[0]) * factor)
index_grid_m = int(abs(coords[1]) * factor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to @orbeckst's comment -- it would be good to start thinking about expected inputs and output. For example, the most common way that coordinates tend to be passed around is in an (N x 3) numpy array. Even without np.histogramdd you could then simplify this function to return np.abs(coords * factor).astype(int).

membrane_curvature/lib/mods.py Show resolved Hide resolved
def test_grid_map_25grid(dummy_coordinates, test_mapper, n_cells, max_width):
factor = np.float32(n_cells / max_width)
for dummy_coord in dummy_coordinates:
assert test_mapper(dummy_coord) == grid_map(dummy_coord, factor)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that test_mapper is not strictly needed here -- I think you could do the below?

Suggested change
assert test_mapper(dummy_coord) == grid_map(dummy_coord, factor)
assert grid_map(dummy_coord, factor) == dummy_coord

In addition, could you please add a test where the output of grid_map is different from the input dummy_coord? Otherwise this will pass even with grid_map = lambda x, *args: x :)

if grid_count[i, j] > 0:
z_Ref[i, j] /= grid_count[i, j]
if grid_count_frames[i, j] > 0:
z_Ref[i, j] /= grid_count_frames[i, j]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you're first creating an average surface before computing the curvature. On a scientific level, maybe we should consider treating each surface for each frame separately? Otherwise, if you have a membrane that undulates such that a patch has ~50% positive curvature and 50% negative curvature over the frames, the Gaussian curvature is ultimately computed on a flat surface?

@ojeda-e
Copy link
Member Author

ojeda-e commented Jun 25, 2021

Hi @lilyminium Lily, thanks for your review. Regarding this comment (and below)

Currently core_fast_leaflet does three things:

  • gets positions for each atom in the atom group for each frame
  • identifies the grid cell for each coordinate (currently grid_map)
  • calculates the average z for the atom group

Could you please break this down into three functions, and write a new function that simply calls them (just for the regression tests?) I would be keen to have these as divorced from either MDTraj or MDAnalysis as possible and work only with numpy arrays, which are generally more versatile. The overall function could grab the positions from the AtomGroup and feed it into the functions instead.

If it works for you, I'll add these remarks as new issues instead of adding them in this PR. The changes are getting a bit too long. Would that work?
If yes, then it'll be happy to submit a PR with the requested changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor core_fast_leaflet and core_fast and add test using toy model. Replace mdtraj by MDAnalysis
4 participants