More flexibility with wildcards in selection #2551

Merged: 22 commits, Mar 4, 2020

Contributor

@Iv-Hristov Iv-Hristov commented Feb 25, 2020

Fixes #2436

Changes made in this Pull Request:

  • Selection strings now use fnmatch, which allows more flexible wildcard usage, including several wildcards in a single pattern.
  • Added two new tests to match the new functionality.
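
As a rough illustration of what fnmatch-style matching allows, here is a sketch that uses only the standard library. It is not the code from this PR: the residue list and the `matching` helper are made up for the example, and `fnmatchcase` is chosen so that matching is case-sensitive on every platform.

```python
from fnmatch import fnmatchcase

# Hypothetical residue names, for illustration only.
resnames = ["ASN", "ASP", "HSD", "TYR", "THR", "LEU", "LYS", "MET", "GLN"]


def matching(pattern, names=resnames):
    """Return the names that match a shell-style pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


star_and_question = matching("*S?")  # two different wildcards in one pattern
two_stars = matching("*M*")          # more than one '*' in one pattern
single_char = matching("T?R")        # '?' matches exactly one character
```

Here `*S?` picks out ASN, ASP and HSD, which a pattern limited to a single `*` could not express.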

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?


codecov bot commented Feb 25, 2020

Codecov Report

Merging #2551 into develop will decrease coverage by 0.00%.
The diff coverage is n/a.


@@             Coverage Diff             @@
##           develop    #2551      +/-   ##
===========================================
- Coverage    90.68%   90.68%   -0.01%     
===========================================
  Files          169      169              
  Lines        22833    22828       -5     
  Branches      2940     2939       -1     
===========================================
- Hits         20707    20702       -5     
  Misses        1540     1540              
  Partials       586      586              

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc2f1a5...77443a3. Read the comment docs.

Member

@richardjgowers richardjgowers left a comment

This looks good, thanks for adding tests. One question: is there a performance difference? I.e., if you run select_atoms('name H?') on a large (~100k atoms) system, how does the timing compare with the old implementation?

        assert ag == ag_wild

    def test_wildcard_double_selection(self, universe):
        ag = universe.select_atoms('resname ASN or resname ASP or resname HSD')
Member

a shortcut here is resname ASN ASP HSD
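
The shortcut works because several values after one keyword are combined with a logical OR. A minimal sketch of that idea in plain Python follows; the `select` helper and the name list are hypothetical and are not MDAnalysis API.

```python
from fnmatch import fnmatchcase


def select(names, selection):
    """Return indices of names matching any space-separated pattern."""
    patterns = selection.split()
    mask = [False] * len(names)
    for pattern in patterns:
        # OR the matches of each pattern into the running mask.
        mask = [m or fnmatchcase(n, pattern) for m, n in zip(mask, names)]
    return [i for i, m in enumerate(mask) if m]


names = ["ASN", "GLY", "ASP", "HSD", "MET"]
explicit = select(names, "ASN ASP HSD")  # three literal values
wildcard = select(names, "*S?")          # one pattern covering the same names
```

With this toy list, both calls return the same indices, which is the property the test in the diff checks on a real universe.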

Member

@richardjgowers richardjgowers left a comment

Will also need a CHANGELOG entry and adding yourself to AUTHORS

@richardjgowers richardjgowers self-assigned this Feb 26, 2020
@Iv-Hristov
Contributor Author

Thank you for the feedback. I implemented the suggested changes and added two more corner-case tests that I could think of. As for performance, I measured it on a bigger system (340K atoms): the old version, which doesn't support the new wildcards, was slightly faster (0.6 s vs 0.9 s). I am not familiar with the underlying implementation of fnmatch, but we could also try the Python re module.
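
One way to try the `re` idea is `fnmatch.translate`, which turns a shell-style pattern into a regular expression that can be compiled once and reused. The sketch below compares the two approaches on synthetic atom names; the list, the pattern and the function names are assumptions for illustration, and no timings are quoted because they depend on the machine and Python version.

```python
import fnmatch
import re
import timeit

# Synthetic atom names standing in for a large system (160,000 entries).
names = ["H1", "H2", "HA", "CA", "CB", "N", "O", "HB1"] * 20_000

pattern = "H?"


def with_fnmatch():
    """Match every name through fnmatch.fnmatchcase."""
    return [fnmatch.fnmatchcase(n, pattern) for n in names]


# Compile the translated pattern once, outside the loop.
compiled = re.compile(fnmatch.translate(pattern))


def with_re():
    """Match every name against the precompiled regular expression."""
    return [compiled.match(n) is not None for n in names]


if __name__ == "__main__":
    print("fnmatch:", timeit.timeit(with_fnmatch, number=3))
    print("re     :", timeit.timeit(with_re, number=3))
```

Note that `fnmatchcase` already caches its compiled pattern internally, so any difference measured this way mostly reflects per-call overhead rather than repeated compilation.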

Member

@richardjgowers richardjgowers left a comment

Ok cool, I just wanted to check we weren't tanking performance by making this change. Couple tweaks needed then we'll be good to go

@@ -15,7 +15,7 @@ The rules for this file:
 ------------------------------------------------------------------------------
 mm/dd/yy richardjgowers, kain88-de, lilyminium, p-j-smith, bdice, joaomcteixeira,
          PicoCentauri, davidercruz, jbarnoud, RMeli, IAlibay, mtiberti, CCook96,
-         Yuan-Yu, xiki-tempula
+         Yuan-Yu, xiki-tempula, Iv-Hristov
Member

So because this is your first ever contribution, you also need to add your name to the AUTHORS file

@@ -56,6 +56,7 @@ Fixes
   * Added parmed to setup.py

 Enhancements
+  * Changed selection wildcards to support multiple wildcards
Member

reference the initial issue #

@@ -499,8 +500,7 @@ def apply(self, group):
 class StringSelection(Selection):
     """Selections based on text attributes

-    Supports the use of one wildcard at the start,
-    end, and middle of strings
+    Supports multiple wildcards, based on fnmatch
Member

add a .. versionchanged:: thing

mask |= np.char.startswith(values, val[:wc_pos])
mask &= np.char.endswith(values, val[wc_pos+1:])

values = getattr(group, self.field).astype(np.str_)
Member

so I think I was calling .astype(np.str_) here because we were pushing it into a np.char function. Now we're using fnmatch this likely isn't necessary, try removing this?

@Iv-Hristov
Contributor Author

Iv-Hristov commented Feb 27, 2020

@IAlibay @richardjgowers Thank you for the feedback! I have combined all the wildcard tests to eliminate code duplication as Irfan suggested.

Supports the use of one wildcard at the start,
end, and middle of strings
.. versionchanged:: 0.21
Supports multiple wildcards, based on fnmatch
Member

@IAlibay IAlibay Feb 28, 2020

@Iv-Hristov the text entry in a versionchanged needs to be indented (this is what is causing Travis to fail).

So:

.. versionchanged:: 1.0.0
   Supports...
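
In context, the directive sits inside the class docstring, with its body indented relative to the `..` marker. A sketch of the layout follows; the class here is a bare stand-in rather than the real `StringSelection`, and the version number and wording simply follow the suggestion above.

```python
import inspect


class StringSelection:
    """Selections based on text attributes

    .. versionchanged:: 1.0.0
       Supports multiple wildcards, based on fnmatch
    """


# cleandoc removes the common docstring indentation, leaving only the
# three-space indent that marks the directive body.
doc_lines = inspect.cleandoc(StringSelection.__doc__).splitlines()
```

After `cleandoc`, the directive line starts at column zero and the text line keeps its extra indentation, which is what Sphinx needs in order to treat it as the directive's content.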

Member

@orbeckst orbeckst left a comment

@Iv-Hristov this looks like an excellent contribution.

But the documentation is missing. fnmatch extends the simple *-globbing that we had before. It is very important that this is documented. Find the parts in the documentation that explained the globbing and update them (e.g. Selections should now get its own section on pattern matching).

@lilyminium will need to know this, too, for the AtomSelection Language section of the User Guide.

I know that this seems like a lot of extra work for only a few lines of code. But that's why we make you do PRs, so that you get a real idea of what it means to produce software that is used by many people.

@Iv-Hristov
Contributor Author

@orbeckst Thank you for the feedback! I will make sure to do that later today.

@Iv-Hristov
Contributor Author

I think I documented all the functionality and added two more tests. However, pytest.mark.parametrize complains when I try to use square brackets in an input, because it expects a string. I tried a few escape characters such as '', '\', '\Q... \E' but nothing seemed to work. Does anyone know which escape sequence might work, so that I can squish all the tests together?

This is what gives the error "tuple object not callable":

@pytest.mark.parametrize('selstring, wildstring', [

    ('resname TYR THR', 'resname T*R'),
    ('resname ASN GLN', 'resname *N'),
    ('resname ASN ASP', 'resname AS*'),
    ('resname TYR THR', 'resname T?R'),
    ('resname ASN ASP HSD', 'resname *S?'),
    ('resname LEU LYS', 'resname L**'),
    ('resname MET', 'resname *M*')
    ('resname GLN GLU', 'resname GL[NY]')

])
def test_wildcard_selection(self, universe, selstring, wildstring):
    ag = universe.select_atoms(selstring)
    ag_wild = universe.select_atoms(wildstring)
    assert ag == ag_wild


@orbeckst
Member

orbeckst commented Mar 3, 2020

You missed the comma after line ('resname MET', 'resname *M*').

EDIT: Good practice is to have a comma even after the last element so that you can easily add more elements to the list without the highly informative "tuple object not callable" error ;-)
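
The error arises because Python reads a tuple followed directly by another parenthesised tuple as a call on the first one. The sketch below reproduces this outside pytest by evaluating the list as source text; the `evaluate` helper and variable names are made up for the example.

```python
import warnings

# The end of the list from the comment above, as source text, with the
# comma missing after the first tuple.
broken_source = """[
    ('resname MET', 'resname *M*')
    ('resname GLN GLU', 'resname GL[NY]')
]"""

# Same list with a comma after every element, including the last.
fixed_source = """[
    ('resname MET', 'resname *M*'),
    ('resname GLN GLU', 'resname GL[NY]'),
]"""


def evaluate(source):
    """Evaluate a list literal, returning the list or the error message."""
    with warnings.catch_warnings():
        # Recent Python versions also warn about this pattern at compile time.
        warnings.simplefilter("ignore", SyntaxWarning)
        try:
            return eval(source)
        except TypeError as exc:
            return f"TypeError: {exc}"


broken_result = evaluate(broken_source)
fixed_result = evaluate(fixed_source)
```

The square brackets inside the strings play no part in the failure; with the commas in place the list evaluates to two ordinary tuples of strings.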

Member

@orbeckst orbeckst left a comment

Looks really good, just one typo in the docs.

Pattern matching
----------------

The pattern matching notation described bellow is used to specify
Member

below

----------------

The pattern matching notation described bellow is used to specify
patterns for matching strings:
Member

Suggested change
- patterns for matching strings:
+ patterns for matching strings (based on :mod:`fnmatch`):
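
For reference, the bracket forms that :mod:`fnmatch` adds on top of `*` and `?` behave as sketched below. This shows standard-library behaviour only; the names are invented (including the literal `GL*`), and whether a bracket expression reaches fnmatch unchanged through the selection parser is not something this example demonstrates.

```python
from fnmatch import fnmatchcase

# Hypothetical names; "GL*" contains a literal asterisk.
names = ["GLN", "GLY", "GLU", "GL*"]

in_set = [n for n in names if fnmatchcase(n, "GL[NY]")]       # N or Y in third position
not_in_set = [n for n in names if fnmatchcase(n, "GL[!NY]")]  # any character except N or Y
literal_star = [n for n in names if fnmatchcase(n, "GL[*]")]  # brackets make '*' literal
```

Wrapping a metacharacter in brackets, as in `[*]`, is the way fnmatch matches it literally.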

@orbeckst orbeckst merged commit eb18a33 into MDAnalysis:develop Mar 4, 2020
@orbeckst
Member

orbeckst commented Mar 4, 2020

Congratulations @Iv-Hristov on your first merged PR. Nice contribution!

Development

Successfully merging this pull request may close these issues.

Allow for more flexibility with wildcard in selections
5 participants