Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Handle classes without save_results #17

Merged
merged 16 commits into from May 20, 2021
Merged

RFC: Handle classes without save_results #17

merged 16 commits into from May 20, 2021

Conversation

PicoCentauri
Copy link
Collaborator

If an analysis class have no save_results function we simply dump a
pickel file with the name of the class.

What do you think about this @joaomcteixeira?

@joaomcteixeira
Copy link
Member

Honestly, I don't see it this way. I think pickling objects will disrupt the whole purpose of the project in the first place.

In my view, MDACLI aims to bridge MDAnalysis to less programmatic users, or facilitate operations for everyone, or to allow piping sequential workflows.

Hence, once the pickle object is created, what is the user to do with ti? The user will need to go the Python prompt again, import pickle, unpickle, go to MDAnalsysis documentation, look for the needed method, extract the data, and finally use it. A user that can do all this labyrinthic process will actually scripting the MDAnalysis whole way and won't use MDACLI. Pickle does serve piping though.

This has been a recurrent thought in my mind for the last months. I see two possibilities:

  1. The MDA community agrees to have a common method interface for all classes from where we can extract the results. Note, this interface needs not to be public! If the MDA community does not want a common public interface, we, from the MDACLI project, can try to add the needed format-converting and ducktyping functionality through private methods. For example, we would look for the most common way MDA Analysis classes provide results and emulate it in all other classes through private methods -- in an adaptor-like approach. But, this has two problems:
    1. We would need to inspect mannually each and every Analysis class addition to assure format compatibility
    2. consequently, new analysis won't be available on-the-fly on MDACLI, only after revision and PR. Hence we couldn't make use of __all__.
  2. The second possibility is to do the above referred job directly in MDACLI, instead of PRing to the main MDAnalysis. The benefit here is that progress in MDACLI won't need to pass through the PR process of the MDA bigger project, inhevitable slower.

All and all, I truly believe that MDACLI's output should be a CSV file or plots. However, it can indeed output also the pickle analysis. That might be useful for some users. Yet, it shoudn't be a final goal.

So answering your question: For the long term, No. For the time being and as an additional output format, Yes.

So, I believe your PR is very valuable. But we should maintain this discussion active.

I will wait for your thoughts before merging. Let me know you think.

@orbeckst
Copy link
Member

Do you have a specification document of what you would like the MDA AnalysisBase API to look like, from MDACLI's perspective? This would help focus the discussion.

Here are some possible approaches that I can think of:

Reserved attribute names

For example, should .result always be the data structure containing computed data?

Trailing underscore attributes (sklearn-style)

In scikit-learn, computed variables get an underscore affixed (e.g. .results_). We could do that (although I dislike using variable name alterations to indicate types etc. ... and I think it looks ugly — but there's precedent in the Python eco-system and I can be convinced otherwise.)

Flexible annotation

Or should there be a more flexible approach where we might have an attribute that marks up important "Result attributes" e.g. "OUTPUT" for anything that is always computed after run() and possibly other tags such as "OPTIONAL" for something that might be computed by an auxiliary method (e.g., in HydrogenBondAnalysis):

self._annotation = {'results': 'OUTPUT', 'times': 'OUTPUT', 'count_by_type': 'OPTIONAL'}

Then MDACLI can find anything that's 'OUTPUT' and use out = getattr(analysis_instance, key) to process. If out is a numpy array, use np.save() to CSV, if it's a pd.DataFrame, use to_csv(), ... etc. MDACLI's contribution would be to make the decision for the user that the output is always CSV and then find a way to make it work. In MDA we decided (when we cut the save() methods, that the user knows best how to save data).

There's certainly a better way to do this but you get the idea.

@PicoCentauri
Copy link
Collaborator Author

I added a saving routine that uses the results of the new MDAnalysis Results class.

Currently the test will fail and one can not test the code. I will add additional test cases and streamline the code when
#23 is merged.

Please comment if the code is readable and the output structure seems reasonable.

@PicoCentauri PicoCentauri mentioned this pull request May 17, 2021
3 tasks
If an analysis class have no save_results function we simply dump a
pickel file with the name of the class.
PicoCentauri and others added 3 commits May 18, 2021 15:52
Also adds execution command to save CSV.
Only saves JSON if dict has length.
All new code was provided by Philip.

Co-authored-by: Philip Loche <ploche@physik.fu-berlin.de>
@pep8speaks
Copy link

pep8speaks commented May 18, 2021

Hello @PicoCentauri! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-05-20 22:51:30 UTC

@joaomcteixeira joaomcteixeira linked an issue May 18, 2021 that may be closed by this pull request
@PicoCentauri
Copy link
Collaborator Author

I added a test for the cli.
We should also add the header we put to the csv files to the json output file!

@joaomcteixeira
Copy link
Member

@PicoCentauri
Currently, arrays above 3 dims are ignored and non-JSON-serializable results are also ignored. What do you think we .npz higher dimension arrays and try to pickle non-serializable objects?

Improves docstring
General improvements cleaning
Saves ndim > 3 to npz

TODO:
- some refact on save_results
- some testing
TODO:
* lint
* tests
joaomcteixeira and others added 3 commits May 20, 2021 23:23
before attempting mdacli saving engine.

Co-authored-by: Philip Loche <ploche@physik.fu-berlin.de>
correct json saving command
Functional refactoring of save branch
@PicoCentauri PicoCentauri merged commit 6334ac4 into main May 20, 2021
@PicoCentauri PicoCentauri deleted the save branch May 20, 2021 22:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Visit all Analysis run/result API
4 participants