Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an AnalysisCollection class #4017

Open
wants to merge 9 commits into
base: develop
Choose a base branch
from

Conversation

PicoCentauri
Copy link
Contributor

@PicoCentauri PicoCentauri commented Jan 30, 2023

Fixes #3569

The AnalysisCollection follows the discussion of #3569 resetting the timestep object for each analysis by default. If requested the user can set the reset_timestep variable to False allowing altering the timestep object.

Within this PR I moved the run method from AnalysisBase to AnalysisCollection to avoid code a lot of duplication. These changes interface of Github might make this a bit difficult to follow. The core changes the run method only that instead of running the _prepare, _single_frame and _conclude methods only once it is looped over all provided analysis instances.

Changes made in this Pull Request:

  • Add an AnalysisCollection class to perform multiple analyses on the same trajectory
  • Use assert_equal in test_base.py

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Disclaimer: This PR was written with inspiration of OpenAI's ChatGPT.

@codecov
Copy link

codecov bot commented Jan 30, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.17%. Comparing base (347a0c0) to head (e762e6a).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #4017      +/-   ##
===========================================
- Coverage    93.59%   93.17%   -0.43%     
===========================================
  Files          168       12     -156     
  Lines        21104     1069   -20035     
  Branches      3919        0    -3919     
===========================================
- Hits         19752      996   -18756     
+ Misses         894       73     -821     
+ Partials       458        0     -458     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@PicoCentauri PicoCentauri force-pushed the analysiscollection branch 3 times, most recently from 924d05c to a6d9ef2 Compare January 30, 2023 14:52
@PicoCentauri PicoCentauri changed the title Add a AnalayisCollection class Add an AnalayisCollection class Jan 30, 2023
@orbeckst orbeckst changed the title Add an AnalayisCollection class Add an AnalysisCollection class Jan 30, 2023
Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this. Can I be cheeky and inquire what sort of speed up you get running something like 2 separate rdfs in tandem?

analysis_object.times[i] = ts.time
analysis_object._single_frame()

if reset_timestep:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ts is never unassigned, so I'm not sure this is necessary here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, if an instance is changing the ts object we have to restore it from the stored one.

ts_original = ts.copy()

for analysis_object in self._analysis_objects:
analysis_object._frame_index = i
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is messy but understandable. In hindsight, maybe passing these variables through the single frame method would have been cleaner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, maybe be, but changing this might break a lot. I can put a comment so that these setters might become clearer.

@PicoCentauri
Copy link
Contributor Author

If I use the code given in the example of class I got a speedup of ~30% compared to running each class individually. Quite good for such a simple change 🙂

@orbeckst
Copy link
Member

@richardjgowers I made you the responsible adult in the room ;-) — please shepherd the PR to completion

@yuxuanzhuang
Copy link
Contributor

Up for discussion, would it be a good idea to inherit abc.Collection class so that one can do e.g.

for analysis in collection:
    print(analysis.results)

Besides, it feels a bit weird to me that AnalysisBase is inherited from AnalysisCollection which is a collection of AnalysisBases...

Copy link
Member

@orbeckst orbeckst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like turning the core of AnalysisBase inside out and into AnalysisCollection (ie run()).

It's semantically confusing and it also makes it even harder to understand how to write your own analysis.

I think the logic needs to be rethought.

return self


class AnalysisBase(AnalysisCollection):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @yuxuanzhuang : it feels very weird and circular to have AnalysisBase inherited from AnalysisCollection. It just does not make sense semantically.

Can this be changed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing why I did this to do not copy code of the run method between AnalysisBase and AnalysisCollection. We can remove the inheritance by explicitly leaving the run method in AnalysisBase untouched. This would be fine for me!

@PicoCentauri
Copy link
Contributor Author

We could wither duplicate the logic, which I don't like. Or, create en external class that both the Base and the Collection use.

@yuxuanzhuang
Copy link
Contributor

We could wither duplicate the logic, which I don't like. Or, create en external class that both the Base and the Collection use.

Would it be a good idea to create an external Runner class for the run task and have e.g. SingleAnalysisRunner, CollectionAnalysisRunner (I'm not totally sure what the best design pattern is here) ? I'm also thinking about if we plan to implement parallel analysis based on it, it can be simply creating a new ParallelAnalysisRunner (or even ParallelCollectionAnalysisRunner) instead of doing complex inheritance.

@PicoCentauri
Copy link
Contributor Author

PicoCentauri commented Feb 27, 2023

Would it be a good idea to create an external Runner class for the run task and have e.g. SingleAnalysisRunner, CollectionAnalysisRunner (I'm not totally sure what the best design pattern is here) ? I'm also thinking about if we plan to implement parallel analysis based on it, it can be simply creating a new ParallelAnalysisRunner (or even ParallelCollectionAnalysisRunner) instead of doing complex inheritance.

I am fine with creating these two classes. To avoid code duplication the SingleAnalysisRunner will inherit from the CollectionAnalysisRunner and just call the run method for a single instance.

Also, should they rather be private classes? I think users will not use them because these runners will require an already prepared analysis class.

@hmacdope
Copy link
Member

hmacdope commented Mar 5, 2023

We could wither duplicate the logic, which I don't like. Or, create en external class that both the Base and the Collection use.

I think this solution is the cleanest and doesn't require even more classes which I don't like the idea of. IMO there is already enough of a barrier to entry without every new analysis having to handle 2x new API points.

Personally I would just have an abc abstact base class that they can both inherent from.

Copy link
Member

@hmacdope hmacdope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry should have left a review.

@PicoCentauri
Copy link
Contributor Author

Yes, we can do an ABC class if the others also agree.

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking at it, it initially seems strange that Single inherits from Collection, but for the code it does make sense.

@PicoCentauri have you thought about how results could be accessed from the AnalysisCollection object? Or is the intent to always access results from the individual classes again?

Comment on lines 335 to 339
reset_timestep : bool, optional
Reset the timestep object after for each ``analysis_object``.
Setting this to ``False`` can be useful if an ``analysis_object``
is performing a trajectory manipulation which is also useful for the
subsequent ``analysis_instances`` e.g. unwrapping of molecules.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the basis that there should be one way to do anything, I don't like the suggestion that you could use an analysis class to transform data for downstream classes (use a transformation instead). Instead make reset_timestep=True the only option and remove this kwarg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree and I will change this.

Comment on lines 342 to 334
# Ensure compatibility with API of version 0.15.0
if not hasattr(self, "_analysis_instances"):
self._analysis_instances = (self,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is confusing, instead this is making a single analysis class work as a subclass of a Collection right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is just a leftover from the code before. We can also remove it I think, if we do not support 0.15.0 anymore.

@@ -316,6 +489,7 @@ def __init__(self, trajectory, verbose=False, **kwargs):
self._trajectory = trajectory
self._verbose = verbose
self.results = Results()
super(AnalysisBase, self).__init__(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ugly super call signature was Python 2.x, I think we can use super().__init__(self) now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree

@@ -220,7 +224,176 @@ def __setstate__(self, state):
self.data = state


class AnalysisBase(object):
class AnalysisCollection(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to inherit from object any more either (it's implicit)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

@@ -26,6 +25,8 @@ Fixes
(Issue #3336)

Enhancements
* Add an `AnalayisCollection` class to perform multiple analysis on the same
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo on Analysis

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cheers

@PicoCentauri
Copy link
Contributor Author

Thanks to all of you for your comments and suggestions and sorry for my later answer.

I agree that the current design is a bit confusing. I thought about the ABC abstract base class. I think this does not really fit here. The reason is that AnalysisCollection works with and on AnalysisBase instances.

A cleaner way for reading the code would be the idea that @yuxuanzhuang suggested that we provide a (private ?) _Runner class that basically is the code of the AnalysisCollection and the setup_frames. This _Runner will analyze one or more AnalysisBase instances. What do you think?

Regarding some questions and comments:

I think this solution is the cleanest and doesn't require even more classes which I don't like the idea of. IMO there is already enough of a barrier to entry without every new analysis having to handle 2x new API points.

This was never the idea. There will only be one API point for new Analysis classes.

@PicoCentauri have you thought about how results could be accessed from the AnalysisCollection object? Or is the intent to always access results from the individual classes again?

The latter. I think accessing the results from the individual instances is already handy. An additional way would be just confusing and blowing up the code.

@PicoCentauri
Copy link
Contributor Author

@orbeckst @hmacdope I would still like to get this feature into main. Before I rebase again I would like to ask for your opinion again. I think the biggest issue to be discussed is the inheritance of AnalysisBase and AnalysisCollection.

@orbeckst
Copy link
Member

@PicoCentauri sorry for the following short comment... I've run out of "open source time" for the day: There's a longish discussion on parallel analysis in #4158 and it would be good to hear your opinion how the proposed feature could/should work with an AnalysisBase that would try to parallelize with split-apply-combine (see PMDA https://doi.org/10.25080/Majora-7ddc1dd1-013 for the simple idea).

@marinegor
Copy link
Contributor

Hi @PicoCentauri , thanks so much for the PR, it's a neat idea!

Regarding of being compatible with #4158, I feel like you'd need to change the _compute method of your AnalysisCollection class instead of run, and the rest will be more or less the same.

And I'd probably indeed subclass AnalysisBase in this case -- you'd need to re-implement __init__, _single_frame, _prepare and _conclude, and run will be then implemented automatically, with an option to be executed on multiple backends too.

@PicoCentauri
Copy link
Contributor Author

Thanks @orbeckst and @marinegor for the comments.

Regarding of being compatible with #4158, I feel like you'd need to change the _compute method of your AnalysisCollection class instead of run, and the rest will be more or less the same.

Changing to _compute is an easy fix and I could do it..

And I'd probably indeed subclass AnalysisBase in this case -- you'd need to re-implement init, _single_frame, _prepare and _conclude, and run will be then implemented automatically, with an option to be executed on multiple backends too.

I am not sure If I get your comment correctly. You would subclass AnalysisBase ? But this means we don't have reimplement the methods. Maybe you meant to NOT subclass AnalysisBase?

@marinegor
Copy link
Contributor

@PicoCentauri I indeed mean "subclass AnalysisBase" here, and re-implement the methods that need implementation in subclasses -- otherwise, you need to manually add run method to it, and make sure its semantics is maintained the same with AnalysisBase (I assume this was your intention).

But if you subclass AnalysisBase here, you only need to add _single_frame, _prepare and _conclude methods.
_single_frame would run _single_frame of all analysis objects from the collection, _prepare would initialize their _prepare methods, and _conclude would aggregate the computed results back.

@pep8speaks
Copy link

pep8speaks commented Jun 5, 2024

Hello @PicoCentauri! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2024-06-05 16:10:06 UTC

Copy link

github-actions bot commented Jun 5, 2024

Linter Bot Results:

Hi @PicoCentauri! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/9387570012/job/25850803205


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

@PicoCentauri
Copy link
Contributor Author

Sorry for the lack of updates for a while. I addressed your concerns and moved the implementation of the run method into a standalone function that is used by the AnalysisBase as well as the AnalysisCollection.

@marinegor
Copy link
Contributor

hey @PicoCentauri , the parallelization PR that we've mentioned earlier has been merged into develop (#4162), and your AnalysisCollection would be extremely helpful here. Reason is, parallel analysis doesn't provide any real speedup unless processing of a single frame is noticeably slower that reading it from the disk -- but if we're running an AnalysisCollection, it's almost guaranteed!

Could you perhaps have a look at your code after you merge develop into it? I can't yet see how your approach would interact with the current parallelization protocol (you can look at the diagram here: https://docs.mdanalysis.org/dev/documentation_pages/analysis/parallelization.html#split-apply-combine), but it might actually work well.

If this doesn't work, perhaps I'd suggest implementing AnalysisCollection as AnalysisBase subclass, and in its _single_frame running each subclass's _single_frame() and manually updating the trajectory -- this way, you will likely get the parallelization with multiprocessing as soon as you figure out how to aggregate the results. But it's more tedious, and I'm happy to discuss it in more details if you want.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Collective analysis of different analysis classes on the same trajectory
7 participants