-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial refactor and test added [Fixes #33] #34
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good progress! I have some general comments.
My main question is why there are so many changes to the tests in this PR. Once you start doing big refactoring like changing from mdtraj to mda, I would have expected the previous tests to be still in place and passing. (I might not be fully up-to-date on your changes to the tests but I'd still expect any changes to be tests before this PR.)
membrane_curvature/core.py
Outdated
topology = md.load(grofile).topology | ||
|
||
# 6. Populate universe with coordinates and trajectory | ||
# 2. Populate universe with coordinates and trajectory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MDAnalysis is not restricted to GRO format files. I'd just call it topology and trajectory.
u = mda.Universe(grofile, trjfile) | ||
|
||
# 6.1 Set grid: Extract box dimension from MD sim, | ||
# 3 Set grid: Extract box dimension from MD sim, | ||
# set grid max width, set number of unit cells | ||
box_size = u.dimensions[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gives you Lx only. Do you assume square X-Y ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe at least
box_size = max(u.dimensions[0], u.dimensions[1])
If you assume orthorhombic boxes then consider just failing here if you encounter a triclinic box.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is one of the limitations and one of the next issues to solve. Issue #35 added.
membrane_curvature/lib/mods.py
Outdated
x: float | ||
Value of x coordinate | ||
y: float | ||
Value of y coordinate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs (x,y) do not seem to agree with function signature (coords, factor)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right and thanks for this catch. I'll update it.
index_grid_l = int(abs(coords[0]) * factor) | ||
index_grid_m = int(abs(coords[1]) * factor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is coords a single coordinate (essentially a 3-tuple)?
If so then you should keep in mind for later that you can almost certainly gain quite a bit of performance by doing the grid indexing operation on all coordinates at once (there's code related to np.histogramdd
that you can probably use... but not in this PR!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a test with negative coordinates? It might fail for now but that's ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @orbeckst's comment -- it would be good to start thinking about expected inputs and output. For example, the most common way that coordinates tend to be passed around is in an (N x 3) numpy array. Even without np.histogramdd
you could then simplify this function to return np.abs(coords * factor).astype(int)
.
@@ -160,27 +54,24 @@ def core_fast_leaflet(z_Ref, leaflet, traj, jump, n_cells, lipid_types, | |||
|
|||
grid_count = np.zeros([n_cells, n_cells]) | |||
|
|||
for frame in range(0, traj.n_frames, jump): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the jump
argument good for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jump
was to skip frames. I was using jump = 1
for all the tests so I decided it was not very useful at this point either.
Thanks for your comments @orbeckst
In the spirit of moving forward, I added changes in this PR that wrap up the previous limitations and allows me to progress. If further tests are needed, now with this refactored version providing dummy coordinates will be possible. I feel I haven't made progress because of the tests that have been asked previously, which were going to change no matter what with this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start to refactoring, @ojeda-e -- the new grid_map tests are much more digestible!
I have a few notes on the refactor. I think it's worth spending some time thinking about how you want to break your MDTraj code and put it back together in an MDAnalysis format. You've swapped MDTraj trajectories for MDAnalysis universes in core_fast_leaflet
, which is a good start, but IMO an iterative kind of refactor will constrain potential designs because of trying to be similar to the previous iteration.
Instead, this seems to be a good time to reduce complexity and start separating your functions into more modular ones, where each function does only one main thing. Then it will be easier to put them together in convenient ways as we try to work out the best design for the MembraneCurvature AnalysisBase class. (and easier to test!)
Currently core_fast_leaflet
does three things:
- gets positions for each atom in the atom group for each frame
- identifies the grid cell for each coordinate (currently grid_map)
- calculates the average z for the atom group
Could you please break this down into three functions, and write a new function that simply calls them (just for the regression tests?) I would be keen to have these as divorced from either MDTraj or MDAnalysis as possible and work only with numpy arrays, which are generally more versatile. The overall function could grab the positions from the AtomGroup and feed it into the functions instead.
In addition, some unofficial principles of design from personal experience:
Side-effecting functions
Currently core_fast_leaflet
is a side-effecting function. This means that it modifies a variable, z_Ref
, outside its local environment. In functional programming, where you write functions (as opposed to class/objects in object-oriented programming), it's generally good practice to avoid side-effecting functions. This StackOverflow post summarises it better than I could, but in general side-effect free functions are easier to test, easier to parallelize, and easier to cobble together into larger functions.
To write a side-effect free function, core_fast_leaflet
would be given input arguments that it doesn't modify, and return an output value. So the test_core_fast_lealets
function could look like:
def test_core_fast_leaflets():
n_cells, max_width = 3, 3
--- z_calc = np.zeros([n_cells, n_cells])
u = mda.Universe(GRO_PO4_SMALL, XTC_PO4_SMALL)
selection = u.select_atoms('index 0:3')
--- core_fast_leaflet(u, z_calc, n_cells, selection, max_width)
+++ z_calc = core_fast_leaflet(u, n_cells, selection, max_width)
for z, z_test in zip(MEMBRANE_CURVATURE_DATA['grid']['small']['upper'], z_calc):
print(z, z_test)
assert_almost_equal(z, z_test)
The principle of least astonishment
The principle of least astonishment basically advises that you use your code to create user expectations, and then try your best to fulfil them. This comes in many forms, such as naming your variables and functions clearly (more on that below). It also means trying to follow the conventions of other packages as best you can. This is a bit nitpicky, but one example is the input arguments to np.zeros
. The documentation actually specifies that the shape should be an integer or a tuple of integers. In most contexts, code that looks like np.func([x, y])
means that np.func
is operating on a list of values that will be converted to an array. In contrast, shape is commonly specified as a tuple (x, y)
so it's much less likely to be misunderstood.
The type of input arguments to np.zeros
is of course a very small issue. I just bring it up now because as I said above, refactoring is a good time to start thinking about what kind of inputs and outputs you want your functions to take, as that is a key part of API design.
Even more nitpicky: parameter ordering is also part of designing a function. There isn't really much formal advice on how to order parameters in a function (although I found this Stackoverflow post). This is not important right now, but for example with core_fast_leaflet
, as a user, I would be surprised at the ordering of universe, z_Ref, n_cells, selection, max_width
. That's because I think of universes and AtomGroups as very similar things (in fact, you don't necessarily need universe as an argument, as you could get the universe = selection.universe
). In addition, n_cells
and max_width
both relate to the same thing: factor
. It feels more intuitive to me to call core_fast_leaflet(universe, selection, z_Ref, n_cells, max_width)
and have those arguments grouped together.
Naming functions
Part of PLOA is naming things clearly, so future developers (or you in 2 months!) can read your code with minimal effort. PEP8 has a whole section on naming conventions. In general, it's typically a good idea to have a verb in your function name. For example, I am not sure what grid_map
does just by looking at its name. If it were more explicit and map were a verb, e.g. map_coordinates_to_grid
, that's much more immediately obvious. It's up to you how verbose you would like to make it; other candidates could be map_coordinates_to_array
, map_coordinates_to_grid_array
, get_coordinate_cell
.
The kind of verb can hint at the expected output, such as get_coordinate_cell
above. For example, I would expect functions that start with is_
or has_
(e.g. has_atoms
, is_negative
) to return either True or False.
index_grid_l = int(abs(coords[0]) * factor) | ||
index_grid_m = int(abs(coords[1]) * factor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a test with negative coordinates? It might fail for now but that's ok.
# should map to | ||
lambda xy: (xy[0], xy[1]), | ||
5, 5), | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parametrization is just testing one set of arguments, do you plan to add more? You could combine the 9-grid and the 25-grid into the one function with parametrize though.
index_grid_l = int(abs(coords[0]) * factor) | ||
index_grid_m = int(abs(coords[1]) * factor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to @orbeckst's comment -- it would be good to start thinking about expected inputs and output. For example, the most common way that coordinates tend to be passed around is in an (N x 3) numpy array. Even without np.histogramdd
you could then simplify this function to return np.abs(coords * factor).astype(int)
.
def test_grid_map_25grid(dummy_coordinates, test_mapper, n_cells, max_width): | ||
factor = np.float32(n_cells / max_width) | ||
for dummy_coord in dummy_coordinates: | ||
assert test_mapper(dummy_coord) == grid_map(dummy_coord, factor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that test_mapper
is not strictly needed here -- I think you could do the below?
assert test_mapper(dummy_coord) == grid_map(dummy_coord, factor) | |
assert grid_map(dummy_coord, factor) == dummy_coord |
In addition, could you please add a test where the output of grid_map
is different from the input dummy_coord
? Otherwise this will pass even with grid_map = lambda x, *args: x
:)
if grid_count[i, j] > 0: | ||
z_Ref[i, j] /= grid_count[i, j] | ||
if grid_count_frames[i, j] > 0: | ||
z_Ref[i, j] /= grid_count_frames[i, j] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you're first creating an average surface before computing the curvature. On a scientific level, maybe we should consider treating each surface for each frame separately? Otherwise, if you have a membrane that undulates such that a patch has ~50% positive curvature and 50% negative curvature over the frames, the Gaussian curvature is ultimately computed on a flat surface?
Hi @lilyminium Lily, thanks for your review. Regarding this comment (and below)
If it works for you, I'll add these remarks as new issues instead of adding them in this PR. The changes are getting a bit too long. Would that work? |
Helps to fix #33.
In this first version of the refactored code, the calculation of curvature doesn't evaluate per leaflet, and instead, it runs for a surface defined by selected atoms. (More general)
Changes in this PR:
MDAnalysis
.def_all_beads
,curvature
andcore_fast
.grid_map
to map coordinates to grid added.grid_map
arguments as np.array / tuples instead of topologies.MDtraj
byMDanalysis
100%.As suggested here, tests with toy model using
pytest.mark.parametrize
added:test_grid_map_small_9grid
, using toy model in small grid of 9 lipids in a 3x3 grid, with x values of 0, 1, 2, and y values of 0, 1, 2.test_grid_map_25grid
, using toy model in small grid of 25 lipids in a 5x5 grid, with x values of 0, 1, 2,3,4 and y values of 0, 1, 2,3,4.This PR may also fix
def_all_beads
#27 since functiondef_all_beads
was deleted after replacingMDtraj by
MDAnalysis`.core_fast
inmods.py
#28 since functioncore_fast
was deleted after refactoring.def_all_beads
and add test using toy model. #32 since functiondef_all_beads
was deleted after refactoring and replaced by direct selection using MDAnalysis`.Edit: This PR also fixes #16 .