simple selection by name is slow due to fnmatch #2751

orbeckst · 2020-06-13T01:33:27Z

Expected behavior

Simple atom selections such as u.select_atoms("protein") or u.select_atoms("name CA") or "resname LYS" are not slowed down by advanced selection capabilities.

Actual behavior

The benchmarks at https://www.mdanalysis.org/benchmarks/#selections.SimpleSelectionBench.time_simple_selections?p-selection_string='name%20CA'&six=%5Bnone%5D&ram=16GB clearly show that performance got 53.58x worse (from 997.988 μs to 53.474 ms) when PR #2551, which introduced fnmatch, was merged. Specifically, commit eb18a338 is the point in the benchmark where performance drops.

Note

This slowdown might have contributed to #2671 (tests timing out) and the need for PR #2706.


orbeckst · 2020-06-13T01:38:12Z

Btw, if you want to see recent regressions, go to https://www.mdanalysis.org/benchmarks/#regressions?sort=3&dir=desc and look for dates in 2020/2019 near the top of the list.

richardjgowers · 2020-06-14T13:32:33Z

I think using something like fnmatch isn't bad per se; string patterns like this should be a solved problem that we import. Instead, I think we can tweak our data structures to go faster (see #2755).

orbeckst · 2020-06-14T20:59:32Z

Maybe it's not fnmatch itself but whatever we did in the PR that introduced it that killed performance. I didn't have time to look at the old PR, but when I briefly looked at your new PR I saw that there used to be a list comprehension over all atom names that ran fnmatch on every single one of them. That alone is probably already problematic when all one really wants to do is atoms[atoms.names == "CA"].
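
To make the contrast concrete, here is a minimal sketch of the two approaches. It uses a plain NumPy array of made-up names rather than the actual MDAnalysis selection code, and the helper names select_fnmatch and select_equal are hypothetical. fnmatchcase is used instead of fnmatch so that the result does not depend on the platform.

```python
import fnmatch

import numpy as np

# Made-up atom names standing in for atoms.names (assumption for illustration).
names = np.array(["N", "CA", "C", "O", "CB"] * 4, dtype=object)


def select_fnmatch(names, pattern):
    """Call fnmatchcase once per atom name in a Python-level loop."""
    mask = [fnmatch.fnmatchcase(name, pattern) for name in names]
    return np.flatnonzero(mask)


def select_equal(names, value):
    """Let NumPy run the comparison over the whole array."""
    return np.flatnonzero(names == value)


# For a literal name without wildcards, both return the same indices.
idx_slow = select_fnmatch(names, "CA")
idx_fast = select_equal(names, "CA")
```

The per-name Python loop in the first function is the pattern described above; the second pushes the loop into NumPy.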

richardjgowers · 2020-06-16T10:28:21Z

We used to use something closer to what you've said there, i.e. letting numpy do the loop rather than a list comprehension. You could rewrite the atom name matching to only use fnmatch when it's a pattern, but this is still fundamentally a slow way to do string matching:

In [1]: import numpy as np                                                                                                              

In [2]: names = np.array(['Ca'] * 100000, dtype=object)                                                                                 

In [3]: %timeit names == 'Ca'                                                                                                           
1.07 ms ± 446 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [4]: ids = np.array([1] * 100000)                                                                                                    

In [5]: %timeit ids == 1                                                                                                                
37.9 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [7]: ids = np.array([1] * 100000, dtype=np.uint16)                                                                                   

In [8]: ids                                                                                                                             
Out[8]: array([1, 1, 1, ..., 1, 1, 1], dtype=uint16)

In [9]: %timeit ids == 1                                                                                                                
6.8 µs ± 25.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It's much faster to match integers. If we "limit" ourselves to 16 bits (65,536 distinct names), we can scan the array roughly 150x faster going by the timings above (1.07 ms vs. 6.8 µs); you just have to do the str->int conversion with a dict of names.
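
A minimal sketch of this lookup-table idea follows. It is an illustration under stated assumptions, not the code from PR #2755: the names list is made up, and name_to_key, keys and select_by_name are hypothetical names. Wildcard patterns are matched against the few unique names only, so the scan over all atoms is always an integer comparison.

```python
import fnmatch

import numpy as np

# Made-up per-atom names (assumption for illustration).
names = ["N", "CA", "C", "O", "CB", "CA", "CG", "CA"]

# Build the lookup table once: each unique name gets a small integer key.
unique_names = sorted(set(names))
name_to_key = {name: i for i, name in enumerate(unique_names)}
keys = np.array([name_to_key[n] for n in names], dtype=np.uint16)


def select_by_name(pattern):
    """Return indices of atoms whose name matches a literal name or a pattern."""
    if any(ch in pattern for ch in "*?["):
        # Pattern: run fnmatchcase over the unique names only.
        wanted = [k for n, k in name_to_key.items()
                  if fnmatch.fnmatchcase(n, pattern)]
    else:
        # Literal name: a single dict lookup.
        wanted = [name_to_key[pattern]] if pattern in name_to_key else []
    # Integer scan over the per-atom keys.
    wanted = np.array(wanted, dtype=np.uint16)
    return np.flatnonzero(np.isin(keys, wanted))
```

The string work scales with the number of distinct names, which is typically tiny compared to the number of atoms.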

orbeckst · 2020-06-16T16:15:30Z

Thanks for the explanation. This starts looking like building a relational database.

16 bits for the int keys should be more than plenty.

orbeckst · 2020-08-26T22:36:33Z

PR #2755 fixed the regression: see https://www.mdanalysis.org/benchmarks/#selections.SimpleSelectionBench.time_simple_selections?p-selection_string='resname%20LYS'&commits=eb18a338&x-axis=commit&python=3.6 : note the drop down to 207.9 µs for the last commit, which is faster than the ~1.3 ms before the regression. 🍾

* fix #2751
* Python 2.7/3 backport of PR #2755 (use six.string_types, update CHANGELOG and versionchanged as 1.0.1)
* modified AtomNames topologyattr to include lookup table index
* rework atom name selection to use lookup tables
* fixed test supplying integer as atom name
* Update test_topologyattrs.py
* use dict-lookup string attrs EVERYWHERE
* made protein selection faster, 48ms -> 0.5ms on GRO testfile
* improved nucleic/backbone selections
* Added explicit tests for Resnames topologyattr; tests now provide str types for resnames/icodes
* use fnmatchcase to be case sensitive (this was a small unreported bug in 1.0.0: the matching was done case-insensitive)

Co-authored-by: Irfan Alibay <IAlibay@users.noreply.github.com>
Co-authored-by: Oliver Beckstein <orbeckst@gmail.com>
(cherry picked from commit 45e56e8)
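
The switch to fnmatchcase mentioned in the commit message above can be illustrated with a short sketch. fnmatch.fnmatch passes both arguments through os.path.normcase, so whether it folds case depends on the operating system, whereas fnmatch.fnmatchcase always compares case-sensitively. The names list here is made up for illustration.

```python
import fnmatch

# Made-up names that differ only in case (assumption for illustration).
names = ["CA", "Ca", "ca", "CB"]

# fnmatchcase compares case-sensitively on every platform.
exact = [n for n in names if fnmatch.fnmatchcase(n, "CA")]
wildcard = [n for n in names if fnmatch.fnmatchcase(n, "C*")]
```

Here exact keeps only "CA", and wildcard keeps every name starting with an uppercase "C".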


orbeckst added performance Component-Selections labels Jun 13, 2020

orbeckst added this to the 1.0.x milestone Jun 13, 2020

richardjgowers mentioned this issue Jun 14, 2020: Faster name selections #2755 (merged)

orbeckst assigned richardjgowers Jun 24, 2020

orbeckst mentioned this issue Jun 26, 2020: Backport 1.0.1 #2798 (merged)

richardjgowers closed this as completed in #2755 Aug 25, 2020

orbeckst mentioned this issue Sep 1, 2020: Allow selection with elements attribute #2925 (closed)

orbeckst added this to Done in backport Sep 19, 2020
