
Using Capped distance in distance selections #2035

Merged
merged 9 commits into MDAnalysis:develop from ayushsuhane:aroundbench on Aug 14, 2018

Conversation

@ayushsuhane
Contributor

commented Aug 9, 2018

Fixes #974, part of #782

Improves the performance of the 'around' selection using NSGrid.

PR Checklist

  • Tests?
  • Docs?
  • [x] CHANGELOG updated?
  • Issue raised/referenced?
@ayushsuhane

Contributor Author

commented Aug 9, 2018

As a benchmark for the 'around' selection is already present, I compared the three methods (exactly the same functions) for that specific case. The timings have improved by an order of magnitude.
https://github.com/ayushsuhane/Benchmarks_Distance/blob/master/Notebooks/AroundBenchmark.ipynb
However, bear in mind that the improvement applies to this case and not to every case.

@orbeckst mentioned adding benchmarks. While this change will be reflected when we run the existing benchmarks, should I add more cases? Maybe PBC/non-PBC as in Issue #1721. Due to lack of time, though, I will probably pick this up after the GSoC period is over.
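As a sketch of what a parametrized PBC/non-PBC case could look like in the asv benchmark suite (the class and attribute names here are hypothetical, and a real benchmark would build a Universe and time `select_atoms` rather than this NumPy stand-in):

```python
import numpy as np

class AroundSelectionBench:
    """Hypothetical asv-style benchmark: time an 'around'-like query
    with and without periodic boundary conditions (PBC)."""
    params = [True, False]
    param_names = ['pbc']

    def setup(self, pbc):
        rng = np.random.RandomState(42)
        self.box = 10.0
        self.coords = rng.uniform(0, self.box, size=(1000, 3))
        self.ref = self.coords[:10]   # reference group for 'around'
        self.cutoff = 2.0

    def time_around(self, pbc):
        # Brute-force stand-in for the real selection machinery.
        d = self.coords[:, None, :] - self.ref[None, :, :]
        if pbc:
            # minimum-image convention for a cubic box
            d -= self.box * np.round(d / self.box)
        dist = np.sqrt((d ** 2).sum(axis=-1))
        self.result = np.flatnonzero((dist < self.cutoff).any(axis=1))
```

asv would then time both parameter values, giving the PBC and non-PBC curves side by side.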

@ayushsuhane

Contributor Author

commented Aug 9, 2018

@richardjgowers @jbarnoud I am not sure how to add another method here, mostly because of the flags. Any suggestions for getting around the problem without breaking the tests (and, more specifically, the code)? I thought it better to discuss here how we want to proceed: do we want to keep the earlier code and optimize the methods, or replace them with one or two selected via flags/keywords? While we are on the topic, I don't think the capped function (in its current form) is suitable for the optimization here, because it does extra computation of pairs/distances that we don't need in this case.

@codecov


commented Aug 9, 2018

Codecov Report

Merging #2035 into develop will decrease coverage by 0.03%.
The diff coverage is 92.3%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2035      +/-   ##
===========================================
- Coverage    88.93%   88.89%   -0.04%     
===========================================
  Files          144      144              
  Lines        17490    17449      -41     
  Branches      2693     2696       +3     
===========================================
- Hits         15554    15512      -42     
+ Misses        1323     1322       -1     
- Partials       613      615       +2
| Impacted Files | Coverage Δ |
|---|---|
| package/MDAnalysis/core/selection.py | 99.31% <92.3%> (-0.05%) ⬇️ |
| package/MDAnalysis/lib/pkdtree.py | 90.1% <0%> (-1.1%) ⬇️ |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d2e22ff...ff1605c. Read the comment docs.

@orbeckst

Member

commented Aug 9, 2018

This is looking promising already.

We want to get rid of the flags (#782), so there might not remain a way for the user to choose a method. Bearing this in mind, would you recommend switching over to NSGrid (and taking the performance hit in some cases)?

@orbeckst

Member

commented Aug 9, 2018

> benchmarks: While this will be reflected if we run the benchmarks, should I add more cases? Maybe PBC/non-PBC as in here (Issue #1721).

PBC/non-PBC would be excellent.

@richardjgowers

Member

commented Aug 9, 2018

@ayushsuhane I think ideally we'd reuse capped_distance_array; then we could simplify this greatly. The only reason I can think of not to is that it calculates both indices and distances, so maybe avoiding it could be faster. Is this why you didn't use it?

@richardjgowers

Member

commented Aug 9, 2018

Ah, just reread: you could add a return_distances kwarg (default True) to capped_distance, and then it should be identical to this, right? It doesn't make sense to have so much duplicated code.
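As a sketch of the idea (plain NumPy, not the real dispatch code in MDAnalysis.lib.distances; the function name here is hypothetical), a return_distances kwarg lets callers that only need the pairs skip gathering the distances:

```python
import numpy as np

def capped_distance_sketch(reference, configuration, max_cutoff,
                           return_distances=True):
    """Brute-force stand-in: the real function would dispatch to
    bruteforce/pkdtree/nsgrid, but the kwarg behaves the same way."""
    diff = reference[:, None, :] - configuration[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pairs = np.argwhere(dist < max_cutoff)
    if return_distances:
        return pairs, dist[pairs[:, 0], pairs[:, 1]]
    # skip the distance gather when the caller only needs the pairs
    return pairs
```

With `return_distances=False` the selection code could call the one shared function instead of duplicating its internals.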

@ayushsuhane

Contributor Author

commented Aug 9, 2018

@richardjgowers Yes. While this is the case with the periodic KDTree and distance-array methods, it would not be a problem if we use only capped_distance(.., .., method='nsgrid'), as it returns all the distances and pairs, which serves the purpose here.

However, yes, adding return_distances to the kwargs is a good idea. I will look into it.

@ayushsuhane

Contributor Author

commented Aug 9, 2018

@orbeckst For this special case of around selections, it wouldn't hurt to take a hit and replace the code with nsgrid. But I think a better approach would be to use capped_distance and optimize there, so that if anyone plans to change or modify the method later, they only have to change the capped function and not worry about its dependencies.

That being said, there are certain drawbacks in capped_distance, which I wrote with different assumptions and which need to be changed. For instance, _bruteforce_capped checks the distance to every atom individually in a loop, which is a bad idea with a NumPy array, though it can handle an arbitrary number of atoms. Since we wouldn't want to use brute force for arbitrary sizes but only up to the point where it is the fastest, it would be better to change it. Similarly for _pkdtree_capped, where we have used for loops to get individual distances and pairs, which can and should be avoided.
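To illustrate the point (a toy NumPy sketch, not the actual _bruteforce_capped code), the per-atom Python loop and its broadcast equivalent return the same pairs, but the latter does the work in one vectorized pass:

```python
import numpy as np

def capped_loop(reference, configuration, cutoff):
    """The pattern being criticised: one Python-level distance check per atom."""
    pairs = []
    for i, r in enumerate(reference):
        for j, c in enumerate(configuration):
            if np.sqrt(((r - c) ** 2).sum()) < cutoff:
                pairs.append((i, j))
    return np.array(pairs)

def capped_vectorized(reference, configuration, cutoff):
    """Same result via broadcasting: the whole distance matrix in one
    C-level pass, then a single argwhere to extract the pairs."""
    diff = reference[:, None, :] - configuration[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.argwhere(dist < cutoff)
```

The broadcast version builds an N×M distance matrix, so it is only a win up to moderate sizes, which is exactly the "limit it only until it is the fastest" point above.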

Member

left a comment

@ayushsuhane Ok I think this looks like a case where you need to eat your own dog food. You've merged in some awesome changes in lib.distances and now you're rewriting/rephrasing them in this code. AroundSelection should be blissfully unaware of how distances are calculated, it should just call some function which does it.

Maybe this leads to a few more optimisations in capped later (i.e. make the distances optionally returned), but this is the thing that all the capped stuff was written for.

I'm pretty sure once we go through selections and start using the capped functions there'll be a lot less lines, but it will be a lot faster too!

@ayushsuhane

Contributor Author

commented Aug 10, 2018

@richardjgowers One more thing: it seems importing a library inside a function, for instance `from . import PeriodicKDTree` inside _pkdtree_capped, takes a large amount of time, which is reflected badly in the benchmarks. Any suggestions on how we can avoid it? Also, I guess it's bad practice to use local imports, isn't it? I thought we only did it for the time being, but if we are looking for optimizations, it shouldn't really be there.

@ayushsuhane

Contributor Author

commented Aug 10, 2018

It seems importing doesn't take any significant time. My bad.

@richardjgowers richardjgowers self-assigned this Aug 10, 2018
@ayushsuhane

Contributor Author

commented Aug 10, 2018

I think after #2041 we can move to capped_distance directly. Here is an example of choosing the best method for an around selection based on the rules specified in capped_distance.

*(screenshot of the example, not preserved)*

However, I have one concern about changing the around codebase. Tests are written for the different flags and methods of every class, i.e. apply_kdtree, apply_distmat. So there are essentially two things: one is the removal of flags, and the other is using a single method instead of multiple _apply methods. What do you think should be done first? Any suggestions on how I should go about it?
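As a rough sketch of what a single code path could look like (plain NumPy with a brute-force stand-in for capped_distance; the helper name is hypothetical), the around semantics reduce to "pairs within cutoff, then unique indices of the second set":

```python
import numpy as np

def around(group_pos, all_pos, cutoff):
    """Find indices into all_pos that lie within cutoff of any point in
    group_pos. The real code would get the pairs from capped_distance,
    which picks the best method internally."""
    diff = group_pos[:, None, :] - all_pos[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pairs = np.argwhere(dist < cutoff)
    # an atom close to several group members appears in several pairs,
    # so deduplicate before returning
    return np.unique(pairs[:, 1])
```

With the method choice pushed into capped_distance, the multiple _apply variants (and the flags that select between them) collapse into one path like this.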

@richardjgowers

Member

commented Aug 10, 2018

@ayushsuhane I think you can just rip the flags (& associated tests) out. If someone chose kdtree in the flags and we don't respect that, they'll still get the same distances out, so it's not going to hurt anyone.

@ayushsuhane ayushsuhane changed the title Using NSGrid in around selections Using Capped distance in distance selections Aug 11, 2018
@ayushsuhane ayushsuhane force-pushed the ayushsuhane:aroundbench branch 3 times, most recently from 4b6a0cb to 4904d4d Aug 11, 2018
@richardjgowers

Member

commented Aug 13, 2018

@ayushsuhane this needs a quick rebase right?

@ayushsuhane ayushsuhane force-pushed the ayushsuhane:aroundbench branch from 4904d4d to 7987084 Aug 13, 2018
@richardjgowers

Member

commented Aug 13, 2018

Looks solid; IIRC the coverage for selections was complete, so if the tests pass this is good to go.

@ayushsuhane

Contributor Author

commented Aug 13, 2018

@richardjgowers Not sure what is happening here, can you take a look? I tried restarting the build, but AppVeyor still fails due to encore. Also, I don't think I can restart the AppVeyor build.

@zemanj

Member

commented Aug 14, 2018

@ayushsuhane This looks like yet another random number issue to me... I just had the same test failing on my linux box, but it passed in a subsequent run. The failing test uses analysis.encore.bootstrap.get_ensemble_bootstrap_samples() with only 10 samples, so probably that's the culprit. Maybe just give it another try?

@orbeckst

Member

commented Aug 14, 2018

@zemanj

Member

commented Aug 14, 2018

@orbeckst Ok I restarted the Travis build for this PR, but now my noob question: In how far is that related to AppVeyor?

@zemanj

Member

commented Aug 14, 2018

@richardjgowers Ok I gotta remember that shortcut 😉

@richardjgowers

Member

commented Aug 14, 2018

@zemanj it’s unrelated :) I didn’t immediately see the rerun appveyor button on mobile so I turned the PR off then on again....

@zemanj

Member

commented Aug 14, 2018

That's what I thought... I don't see any option to rerun a build on AppVeyor either. (I'm not on mobile, though.)

@richardjgowers richardjgowers merged commit e396630 into MDAnalysis:develop Aug 14, 2018
5 checks passed

Travis CI - Pull Request: Build Passed
codecov/patch: 92.3% of diff hit (target 88.93%)
codecov/project: Absolute coverage decreased by -0.03% but relative coverage increased by +3.37% compared to d2e22ff
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@orbeckst

Member

commented Aug 14, 2018

Ok I restarted the Travis build for this PR, but now my noob question: In how far is that related to AppVeyor?

Sorry, I only saw #2035 (comment) on my phone and did not realize that this referred to AppVeyor.

@ayushsuhane ayushsuhane deleted the ayushsuhane:aroundbench branch Aug 14, 2018