PMDA with refactored _single_frame
#128
base: master
Conversation
Hello @yuxuanzhuang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-07-16 10:59:39 UTC
It would be nice to unify the two _single_frame() methods. In my opinion, returning values as in PMDA is the better design compared to storing state in MDAnalysis 1.x, though. I am almost more tempted to break AnalysisBase in 2.0, but this would be a wider discussion and require broad consensus.
Some initial comments/questions below.
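For concreteness, a schematic contrast (not actual library code) of the two styles being discussed; some_per_frame_quantity and the results attribute are placeholders:

```python
# MDAnalysis 1.x AnalysisBase style: _single_frame stores state on self
def _single_frame(self):
    self.results[self._frame_index] = some_per_frame_quantity(self._ts)

# PMDA style: _single_frame returns a value that is later combined by _reduce
def _single_frame(self, ts, atomgroups):
    return some_per_frame_quantity(ts)
```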
pmda/parallel.py
Outdated
for i, ts in enumerate(self._universe._trajectory[bslice]):
    self._frame_index = i
    # record io time per frame
    with timeit() as b_io:
        # explicit instead of 'for ts in u.trajectory[bslice]'
        # so that we can get accurate timing.
        ts = u.trajectory[i]
    # record compute time per frame
Where did the per-frame timing information go?
@staticmethod
def _reduce(res, result_single_frame):
    """ 'append' action for a time series"""
    res.append(result_single_frame)
    return res
How are you going to deal with aggregations as in DensityAnalysis or complicated reductions as in RMSF?
It's still doable, with some workaround: we can do the aggregations and further calculations inside _conclude, but it also means we have to transfer more data back to the main process. (New examples are updated in the gist Jupyter notebook.)
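A minimal sketch of that workaround, assuming _single_frame stores one histogram per frame so that self._results ends up holding one histogram per frame (attribute names mirror the density example discussed further down, not this PR's final API):

```python
import numpy as np

def _conclude(self):
    # all per-frame histograms have already been shipped back to the main
    # process in self._results; aggregate them only now
    self._grid = np.sum(self._results, axis=0) / self.n_frames
```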
pmda/parallel.py
Outdated
n_frames = len(range(start, stop, step))

self.start, self.stop, self.step = start, stop, step

self.n_frames = n_frames

# in case _prepare has not set an array.
self._results = np.zeros(n_frames)
Not sure this is always something we want to do. Check the other analysis classes in mda to see if such a default would make sense.
This should normally be overridden by the definition inside _prepare to suit what _single_frame saves.
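For example, a hypothetical subclass would do something like this in _prepare (n_bins is an assumed attribute for illustration, not part of this PR):

```python
import numpy as np

def _prepare(self):
    # one histogram row per frame instead of the default scalar per frame,
    # matching what this subclass's _single_frame returns
    self._results = np.zeros((self.n_frames, self.n_bins))
```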
I do not like this reasoning. It means I always have to know about the default and where I can overwrite it. This just looks weird to me. I would rather just set this up in one single place.
This is pretty cool. I do not expect this to work for everything, but it could make a neat little wrapper for some analysis functions. It could definitely work for AnalysisFromFunction.
I am not sold on how to unify the two _single_frame() methods.
I'd rather have a simple PR that just upgrades PMDA to use serializable universes but leaves everything else intact. Then we can worry about massive API changes. First make it work in a simple manner where you immediately reap the benefits of the serialization.
# self._results[self._frame_index][0] = self._ts.frame
# the actual trajectory is at self._trajectory
# self._results[self._frame_index][1] = self._trajectory.time
self._results[self._frame_index] = h
Storing a full histogram for each frame is bad – you can easily run out of memory. I think it is important that the aggregation is done every step and not just in _conclude.
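A rough, hypothetical back-of-envelope number to illustrate the scale (assumed grid size and frame count, not taken from this PR):

```python
import numpy as np

# assumed 100 x 100 x 100 float64 density grid, 10,000 frames
bytes_per_frame = np.prod((100, 100, 100)) * 8
n_frames = 10_000
print(bytes_per_frame * n_frames / 1e9)  # ~80 GB if nothing is reduced per step
```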
But isn't that what also happens with _reduce? It won't pass the full histogram back to the main process, but only the calculated frames in _dask_helper.
No, the density _reduce accumulates the histogram at every step:
Lines 326 to 332 in 13fa3b5
def _reduce(res, result_single_frame):
    """ 'accumulate' action for a time series"""
    if isinstance(res, list) and len(res) == 0:
        res = result_single_frame
    else:
        res += result_single_frame
    return res
and _conclude then only sums the per-block grids and normalizes by the number of frames:
Lines 305 to 306 in 13fa3b5
self._grid = self._results[:].sum(axis=0)
self._grid /= float(self.n_frames)
Btw, the PMDA paper has a discussion on that topic.
@staticmethod
def _reduce(res, result_single_frame):
    """ 'accumulate' action for a time series"""
    if isinstance(res, list) and len(res) == 0:
        res = result_single_frame
    else:
        res += result_single_frame
    return res
I don't like the design that gets rid of reduce.
subgraphs = nx.connected_components(graph)
comp = [g for g in subgraphs]
return comp
#raise TypeError(data)
Sorry, I think I packed too much into this PR. I intended to explore the possibility of parallelizing LeafletFinder both among frames and within a single frame, because for now it only starts working on the next frame after all the jobs are done in the current one.
So I made the tasks coarser instead of passing hundreds of jobs per frame times hundreds of frames to the dask graph.
The problem is twofold (after AtomGroup and everything else are implemented):
- XDR and DCD format files fail to pickle the last frame:
u = mda.Universe(GRO_MEMPROT, XTC_MEMPROT)
u.trajectory[4]  # last frame
pickle.loads(pickle.dumps(u.trajectory))
EOFError: Trying to seek over max number of frames
The major problem is that trajectory._xdr.current_frame == 5 (1-based). I might need to add an extra fix (and test?) to https://github.com/MDAnalysis/mdanalysis/pull/2723/files, or maybe in an individual PR, since the pickling is handled there on its own.
- The algorithm itself gets different results (for the same ts) with different n_jobs (maybe because I made some mistakes).
The "last frame" thing is a real issue. Oops!
Don't worry about LeafletFinder at the moment, it's not really your job to fix it, and it has lots of issues. (If you need it for your own research and you have an interest in getting it working then that's a bit different but I'd still say, focus on the serialization core problem for now.)
Just pushed a fix for the "last frame" issue.
Not LeafletFinder per se, but maybe a general framework to suit all conditions, e.g.
- parallelism is worthwhile within each frame.
- run multiple analyses on one single universe.
- run one analysis on multiple universes.
- complex dependencies between jobs.
A solution I can think of is to let ParallelAnalysisBase inherit the basic methods of DaskMethodsMixin as a dask custom collection, so that we can build complex dask graphs (a sketch follows below). But I am not sure how well we can build a legitimate API where users do not need to care too much about the underlying implementation.
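A minimal sketch of that idea, using dask's custom-collection interface (DaskMethodsMixin); the class name, the single_frame callable, and the per-frame-index task layout are hypothetical, not part of PMDA:

```python
import dask
import dask.threaded
from dask.base import DaskMethodsMixin, tokenize


class PerFrameTasks(DaskMethodsMixin):
    """Hypothetical: expose one task per frame as a dask custom collection."""

    def __init__(self, single_frame, frame_indices):
        # single_frame(frame_index) is assumed to return a per-frame result
        self._single_frame = single_frame
        self._frames = list(frame_indices)
        self._name = "per-frame-" + tokenize(single_frame, self._frames)

    def __dask_graph__(self):
        # one task per frame; cross-frame dependencies could be added here
        return {(self._name, i): (self._single_frame, frame)
                for i, frame in enumerate(self._frames)}

    def __dask_keys__(self):
        return [(self._name, i) for i in range(len(self._frames))]

    def __dask_tokenize__(self):
        return self._name

    @staticmethod
    def __dask_optimize__(dsk, keys, **kwargs):
        return dsk  # no graph optimisation in this sketch

    # default scheduler; .compute(scheduler=...) can still override it
    __dask_scheduler__ = staticmethod(dask.threaded.get)

    def __dask_postcompute__(self):
        # how the per-frame results are assembled after compute()
        return list, ()


# usage sketch: results = PerFrameTasks(my_single_frame, range(100)).compute()
```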
That's a good analysis of use cases and it would be useful to write this down somewhere. With PMDA so far (except LeafletFinder) we have been focusing on the simple split-apply-combine because that can be put in a simple "framework". Beyond that, it becomes difficult to do a "one size fits all" and it becomes a serious research project in CS.
I would be happy if we had a library that allows users to easily write their own split/apply/combine type analyses and where we provide a few additional parallelized analyses that might not fit into this scheme (such as LeafletFinder).
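For reference, a bare-bones sketch of that split/apply/combine pattern with dask.delayed; make_universe and analyse_frame are placeholder names, not PMDA API:

```python
import dask
import numpy as np


def split_apply_combine(make_universe, analyse_frame, n_frames, n_blocks):
    """Split frames into blocks, analyse each block in a worker, combine."""
    blocks = np.array_split(np.arange(n_frames), n_blocks)

    @dask.delayed
    def apply_block(frame_indices):
        u = make_universe()  # each worker builds/opens its own Universe
        return [analyse_frame(u.trajectory[i]) for i in frame_indices]

    # combine: flatten the per-block lists into one per-frame result list
    per_block = dask.compute([apply_block(b) for b in blocks])[0]
    return [res for block in per_block for res in block]
```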
An interesting idea that has been coming up repeatedly is to "stack" multiple analyses, i.e., run multiple _single_frame() methods sequentially on each frame so that the expensive time for loading data into memory is amortized.
Finally, running one analysis on multiple universes seems to be a standard pleasingly parallel job that can make use of existing workflow management tools – I don't see what we can do directly to support it.
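A minimal sketch of that stacking idea, assuming each analysis exposes a per-frame function that takes a timestep and returns a value (as in this PR's design); run_stacked is a hypothetical helper, not existing API:

```python
def run_stacked(universe, analyses, start=None, stop=None, step=None):
    """One pass over the trajectory, several per-frame analyses."""
    results = [[] for _ in analyses]
    for ts in universe.trajectory[start:stop:step]:
        # the frame is read from disk once and shared by all analyses
        for res, analysis in zip(results, analyses):
            res.append(analysis._single_frame(ts))
    return results
```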
@@ -200,12 +204,13 @@ def _single_frame(self, ts, atomgroups, scheduler_kwargs, n_jobs,
     # Distribute the data over the available cores, apply the map function
     # and execute.
     parAtoms = db.from_sequence(arranged_coord,
-                                npartitions=len(arranged_coord))
+                                npartitions=n_jobs)
LeafletFinder is not parallelized over frames... I am not sure that choosing n_jobs is the correct choice here. Need to look at the original paper/algorithm.
# record time to generate universe and atom groups
with timeit() as b_universe:
    u = mda.Universe(top, traj)
    agroups = [u.atoms[idx] for idx in indices]

res = []
Getting rid of this kludge is great.
Yeah, makes sense.
Regarding the refactored _single_frame: it makes sense to have this PR as a test case and to try out what might be possible. To make immediate progress, I'll continue with your PR #132 for now.
Fix #131
Still on-going, showing some possible simplifications after Universe can be serialized.
PR Checklist