Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simple selection by name is slow due to fnmatch #2751

Closed
orbeckst opened this issue Jun 13, 2020 · 6 comments · Fixed by #2755
Closed

simple selection by name is slow due to fnmatch #2751

orbeckst opened this issue Jun 13, 2020 · 6 comments · Fixed by #2755

Comments

@orbeckst
Copy link
Member

orbeckst commented Jun 13, 2020

Expected behavior

Simple atom selections such as u.select_atoms("protein") or u.select_atoms("name CA") or "resname LYS" are not slowed down by advanced selection capabilities.

Actual behavior

The benchmarks https://www.mdanalysis.org/benchmarks/#selections.SimpleSelectionBench.time_simple_selections?p-selection_string='name%20CA'&six=%5Bnone%5D&ram=16GB clearly show that performance got worse by 53.58x (from 997.988μs to 53.474ms) when PR #2551 , which introduced fnmatch, was merged. Specifically, the commit eb18a338 is the point in the benchmark when performance gets worse.

Note

This slowdown might have contributed to #2671 (tests timing out) and the need for PR #2706.

@orbeckst
Copy link
Member Author

Btw, if you want to see recent regressions, go to https://www.mdanalysis.org/benchmarks/#regressions?sort=3&dir=desc and look for dates in 2020/2019 near the top of the list.

@orbeckst orbeckst added this to the 1.0.x milestone Jun 13, 2020
@richardjgowers
Copy link
Member

I think using something like fnmatch isn't bad per se, string patterns like this should be a solved problem we import. I think instead we can tweak our data structures to go faster (see #2755)

@orbeckst
Copy link
Member Author

Maybe it’s not fnmatch itself but whatever we did in the PR that introduced it killed performance. I didn’t have time to look at the old PR but when I briefly looked at your new PR I saw that there used to be a list comprehension over all atom names to run fnmatch at each single one of them and that alone is probably already problematic when all one really wants to do is atoms[atoms.names == “CA”].

@richardjgowers
Copy link
Member

We used to use something closer to what you've said there, ie letting numpy do the loop rather than a list comprehension. You could rewrite the atom name matching to only use fnmatch when it's a pattern, but this is still fundamentally a slow way to do string matching:

In [1]: import numpy as np                                                                                                              

In [2]: names = np.array(['Ca'] * 100000, dtype=object)                                                                                 

In [3]: %timeit names == 'Ca'                                                                                                           
1.07 ms ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: ids = np.array([1] * 100000)                                                                                                    

In [5]: %timeit ids == 1                                                                                                                
37.9 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: ids = np.array([1] * 100000, dtype=np.uint16)                                                                                   

In [8]: ids                                                                                                                             
Out[8]: array([1, 1, 1, ..., 1, 1, 1], dtype=uint16)

In [9]: %timeit ids == 1                                                                                                                
6.8 µs ± 25.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It's much faster to match integers, if we "limit" ourselves to 16 bits/65,000 names, we can scan the array ~500x faster, you just have to have the str->int conversion with a dict of names.

@orbeckst
Copy link
Member Author

Thanks for the explanation. This starts looking like building a relational database.

16 bits for the int keys should be more than plenty.

@orbeckst
Copy link
Member Author

PR #2755 fixed the regression: see https://www.mdanalysis.org/benchmarks/#selections.SimpleSelectionBench.time_simple_selections?p-selection_string='resname%20LYS'&commits=eb18a338&x-axis=commit&python=3.6 : note the drop down to 207.9 µs for the last commit, which is faster than the ~1.3 ms before the regression. 🍾

image

orbeckst pushed a commit that referenced this issue Aug 26, 2020
* fix #2751
* Python 2.7/3 backport of PR #2755 (use six.str_types, update CHANGELOG and versionchanged as 1.0.1)
* modified AtomNames topologyattr to include lookup table index
* rework atom name selection to use lookup tables
* fixed test supplying integer as atom name
* Update test_topologyattrs.py
* use dict-lookup string attrs EVERYWHERERE
* made protein selection faster, 48ms -> 0.5ms on GRO testfile
* improved nucleic/backbone selections
* Added explicit tests for Resnames topologyattr
  tests now provide str types for resnames/icodes
* use fnmatchcase to be case sensitive (this was a small unreported bug in 1.0.0: the matching
  was done case-insensitive)

Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

(cherry picked from commit 45e56e8)
orbeckst pushed a commit that referenced this issue Aug 26, 2020
* fix #2751
* Python 2.7/3 backport of PR #2755 (use six.str_types, update CHANGELOG and versionchanged as 1.0.1)
* modified AtomNames topologyattr to include lookup table index
* rework atom name selection to use lookup tables
* fixed test supplying integer as atom name
* Update test_topologyattrs.py
* use dict-lookup string attrs EVERYWHERERE
* made protein selection faster, 48ms -> 0.5ms on GRO testfile
* improved nucleic/backbone selections
* Added explicit tests for Resnames topologyattr
  tests now provide str types for resnames/icodes
* use fnmatchcase to be case sensitive (this was a small unreported bug in 1.0.0: the matching
  was done case-insensitive)

Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

(cherry picked from commit 45e56e8)
orbeckst pushed a commit that referenced this issue Aug 27, 2020
* fix #2751
* Python 2.7/3 backport of PR #2755 (use six.string_types, update CHANGELOG and versionchanged as 1.0.1)
* modified AtomNames topologyattr to include lookup table index
* rework atom name selection to use lookup tables
* fixed test supplying integer as atom name
* Update test_topologyattrs.py
* use dict-lookup string attrs EVERYWHERERE
* made protein selection faster, 48ms -> 0.5ms on GRO testfile
* improved nucleic/backbone selections
* Added explicit tests for Resnames topologyattr
  tests now provide str types for resnames/icodes
* use fnmatchcase to be case sensitive (this was a small unreported bug in 1.0.0: the matching
  was done case-insensitive)

Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>

(cherry picked from commit 45e56e8)
@orbeckst orbeckst added this to Done in backport Sep 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
backport
  
Done
Development

Successfully merging a pull request may close this issue.

2 participants