Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding grammar for v0.10.0 #66

Merged
merged 49 commits into from Nov 19, 2019
Merged

Adding grammar for v0.10.0 #66

merged 49 commits into from Nov 19, 2019

Conversation

fekad
Copy link
Contributor

@fekad fekad commented Oct 27, 2019

Thank for the previous work of @waychal, this is a PR for the grammar of the v0.10.0 specification. This grammar maps the original specification of the "Filter Language EBNF Grammar" into lark format. Although I applied some simplifications (like ignoring white space, used some definition from the common library...), all of them just makes the filters more robust.

Note: It also contains all the optional features, but they can be ignored during the transformation by using the __default__ method of the Transformer classes.

  • defining the grammar for the filter
  • fixing grammar version (allowing multiple variants of the same version eg.: v0.10.0.g and v0.10.0.elastic.g)
  • changing the extension of the grammar file from .g to .lark for allowing syntax highlights
  • extending the test cases
  • create a reference implementation for the Transformer class
  • creating a new PRs with the suggestions

@codecov
Copy link

codecov bot commented Oct 27, 2019

Codecov Report

Merging #66 into master will increase coverage by 2.07%.
The diff coverage is 89.21%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #66      +/-   ##
==========================================
+ Coverage   82.24%   84.31%   +2.07%     
==========================================
  Files          33       35       +2     
  Lines        1667     2098     +431     
==========================================
+ Hits         1371     1769     +398     
- Misses        296      329      +33
Impacted Files Coverage Δ
optimade/filtertransformers/json.py 0% <ø> (ø) ⬆️
optimade/filtertransformers/debug.py 0% <0%> (ø)
...imade/filtertransformers/tests/test_transformer.py 100% <100%> (ø) ⬆️
optimade/filtertransformers/tests/test_django.py 88.88% <100%> (ø) ⬆️
optimade/server/entry_collections.py 83.6% <100%> (+1.1%) ⬆️
optimade/filterparser/tests/test_filterparser.py 100% <100%> (ø) ⬆️
optimade/filterparser/__init__.py 100% <100%> (ø) ⬆️
...ade/filtertransformers/tests/test_elasticsearch.py 80.64% <100%> (ø) ⬆️
optimade/server/tests/test_server.py 87.93% <80.68%> (-12.07%) ⬇️
optimade/filtertransformers/mongo.py 76.47% <83.72%> (+5.88%) ⬆️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b13e11c...cc6685b. Read the comment docs.

@fekad
Copy link
Contributor Author

fekad commented Oct 28, 2019

Questions about the grammar

  • For the property/IDENTIFIER should we stick to the spec and allowing lowercase only or should we just use the more robust CNAME (which is: ("_" | UCASE_LETTER | LCASE_LETTER) ("_" | UCASE_LETTER | LCASE_LETTER | DIGIT)*) and using the transformer to lower the case? For example, what would be the expected behaviour/mongo filter for the NOTICE=val query string:

    • raised error because it is not valid property (according to the specification this should happen)
    • {'NOTICE': {'$eq': 'val'}}
    • {'ice': {'$not': {'$eq': 'val'}}}
    • {'ICE': {'$not': {'$eq': 'val'}}} this is the result of the current implementation

    Decision: allowing lower case characters only as it is defined in spec. So NOTICE=val raises an error , NOTice=val is equivalent to {'ice': {'$not': {'$eq': 'val'}}}, but "NOTice"=val is equivalent to {'val': {'$eq': 'NOTice'}}

  • Is the space mandatory or optional after a token like NOT or AND? optional

  • The following optional rule's effect is not obvious:

    identifier <operator> identifier
    constant <operator> constant
    

@ml-evs
Copy link
Member

ml-evs commented Oct 28, 2019

This is great stuff @fekad , please request reviews when you're ready so we can get this in!

@fekad
Copy link
Contributor Author

fekad commented Oct 28, 2019

Suggestions/ideas/questions about the specification

We can also suggest the following tiny modification(s) in the spec:

  • Handling the '_' character separately from the LowercaseLetter, by removing it from the definition of LowercaseLetter:

    LowercaseLetter =
        'a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' | 'j' | 'k' | 'l' | 
        'm' | 'n' | 'o' | 'p' | 'q' | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' |
        'y' | 'z' 
    ;
    

    and adding it explicitly where we want to use it:

    Identifier = ( '_' | LowercaseLetter ), { '_' | LowercaseLetter | Digit }, [Spaces] ;
    
    Punctuator =
        '!' | '#' | '$' | '%' | '&' | "'" | '(' | ')' | '*' | '+' | ',' |
        '-' | '.' | '/' | ':' | ';' | '<' | '=' | '>' | '?' | '@' | '[' |
        ']' | '^' | '`' | '{' | '|' | '}' | '~' | '_'
    ;
    
  • There are small discrepancies between the names used in spec and the grammar in the appendix. Eg.:
    property VS identifier at Substring Comparisons

  • Constant vs. Value

  • renaming ConstantFirstComparison to PropertyLastComparison

  • PredicateComparison = LengthComparison looks redundant.

  • using repetition instead of recursion:
    Expression = ExpressionClause, { OR, ExpressionClause } ;
    ExpressionClause = ExpressionPhrase, { AND, ExpressionPhrase } ;

  • Merging: Comparison into ExpressionPhrase:

    ExpressionPhrase = [ NOT ], ( PropertyFirstComparison | PropertyLastComparison | 
    LengthComparison | OpeningBrace, Expression, ClosingBrace );
    
  • LengthComparison = LENGTH, Property, [ Operator ], Number ;: optional Operator, Value -> Number

  • Operator = ( '<', [ '=' ] | '>', [ '=' ] | [ '!' ], '=' ), [Spaces] ;

  • (same as: LIKE operators OPTIMADE#87: adding REGEXP:

    FuzzyStringOpRhs = ( CONTAINS | STARTS, [ WITH ] | ENDS, [ WITH ] | REGEXP ), String ;
    
  • As far as I know the EBNF standard allows to define terminals as string:

    AND = 'AND', [Spaces] ;
    NOT = 'NOT', [Spaces] ;
    etc ...
    
  • Representing all the tokens with all caps in the grammar (like: Property, Operator, ) just to make the separatation of the tokens from the rules more easy for the developers.

  • The implementation of ConstantFirstComparison = Constant, ValueOpRhs ; is quite ugly because the "value" in ValueOpRhs is the property and the Constant is the value so we have to swap it. It works but it is a little bit ugly...

  • "5.2.5 Nested property names" and "5.2.6 Filtering on relationships" could be mergedinto "5.1 Lexical Tokens".

  • extra '`' character in the line: LENGTH list value`: applies the numeric comparison for the number of items in the list property.

  • typo at:

    Examples:

    • :property:_exmpl_formula_sum (a property specific to that database)
    • :property:_exmpl_band_gap
    • :property:_exmpl_supercell
    • :property:_exmpl_trajectory
    • :property:_exmpl_workflow_id
  • the definition of "value" vs "value may equal operator value" is a little bit confusing in the 5.2.4 section

    set_op_rhs: HAS ( [ OPERATOR ] value
                     | ALL value_list
                     | ANY value_list
                     | ONLY value_list )
    value_list: [ OPERATOR ] value ( "," [ OPERATOR ] value )*
    VS 
    set_op_rhs: HAS [ ALL | ANY | ONLY] value_list
    
  • Examples for "5.2.4 Comparisons of list properties":

    list HAS 3:  ...
    list HAS ALL 3: ...    
    OPTIONAL: list HAS < 3: matches all entries for which list contains at least one element that is less than three.
    OPTIONAL: list HAS ALL < 3, > 3: matches only those entries for which list simultaneously contains at least one element less than three and one element greater than three.
    
  • typo: (* OperatorComparison operator tokens: *) -> (* Comparison operator tokens: *)

  • Should we propagate negotion down in the tree? (eg: {"a": {"$not": {"$lt": 3}}} can be simplified to {"a": {"$gte": 3}}) Note: I already combined multiple ORs and ANDs when they are next to each other.

  • (Just an idea) New WIP PR to determine how we should handle/implement the queries for the list properties

    • Correlated positions in multiple arrays: As far as I know in the case of MongoDB there's no way to handle different but "correlated" arrays (I mean there is no way iterate over them simultaneously and using the requested operation separately). The "$elemMatch" can be used in special case when the property is the same (eg: list:list HAS >=2, <=5). (using the new syntax: list HAS ONE >=2, <=5). A similar problem is to filter an array by a different expression for each column.
    • Separate the cases of strings from operator + value. (?)
    • Draft (ANY is optional):
      [ ANY | EVERY ] property HAS [ ANY | ONE | ALL ] oper1 value1, oper2 value2, ...
      .
    ANY ONE / BOTH ALL
    ANY At least one element of the property matches to at least one of the expression. list HAS value list HAS ANY values At least one element of the property matches to all of the expression. list:list HAS val1:val2 Each expression matches to at least one of the element of the property. list HAS ALL values
    EVERY Each element of the property matches to at least one of the expression. list HAS ONLY values Each element of the property matches to all of the expression.

    property HAS oper1 value1, oper2 value2, ...
    property HAS ONE oper1 value1, oper2 value2, ...
    property HAS ALL oper1 value1, oper2 value2, ...

    EVERY property HAS oper1 value1, oper2 value2, ...
    EVERY property HAS BOTH oper1 value1, oper2 value2, ...

Note: Please free to modify/extend this list!

@fekad
Copy link
Contributor Author

fekad commented Oct 28, 2019

This is great stuff @fekad , please request reviews when you're ready so we can get this in!

@ml-evs Please feel free the review/modify/add stuff to this branch anytime (Eventually that was the main reason to use a branch instead of my own repo :)). There is also a new DebugTransformer class which could be useful for testing. I'm going to try to create a transformer for pymongo but I don't have experience either with django nor eleasticsearch.

@CasperWA
Copy link
Member

We can also suggest the following tiny modification(s) in the spec:

* Handling the `'_'` character separately from the `LowercaseLetter`, by removing it from the definition of `LowercaseLetter`:

This seems like a very straight-forward and logical suggestion.
Please make a PR in the spec repo for this! :) - referencing your comment here.

I would probably put @sauliusg, @merkys, and @rartino as reviewers.

fekad and others added 5 commits November 14, 2019 01:15
Co-Authored-By: Casper Welzel Andersen <43357585+CasperWA@users.noreply.github.com>
Co-Authored-By: Casper Welzel Andersen <43357585+CasperWA@users.noreply.github.com>
Co-Authored-By: Casper Welzel Andersen <43357585+CasperWA@users.noreply.github.com>
@CasperWA CasperWA mentioned this pull request Nov 14, 2019
Copy link
Member

@CasperWA CasperWA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments more :)

And by the way, for the run.py, I actually do not mind having it in. Indeed, if anyone decides to develop a server in a Windows environment, it may be prudent to not have a bash script to start the server as a standard.
But it should be one or the other. And if it's run.py, we need to update .travis.yml. Thinking more about this, I will make an issue for this and a separate PR, since it's not within the scope of the current one.

from .lark_parser import LarkParser, ParserError

__all__ = [LarkParser, ParserError]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this more dynamic, you could do a * import and specify the __all__ in lark_parser.py.
This __all__ then becomes lark_parser.__all__.
In this way, if you decide to reveal other classes from lark_parser, you'll add them to that file's __all__ instead of the __init__.py.

See, e.g., here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I don't like to usage of * in imports. There are a lot of articles in favour and against its usage (this is just the first match on google). It is similar to the usage of global variable.
Of course, I will modify it as you suggested to keep the repo consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand. We should at some point have a consensus concerning this package of what we do.

optimade/filtertransformers/mongo.py Show resolved Hide resolved
optimade/grammar/v0.10.0.lark Show resolved Hide resolved
CasperWA
CasperWA previously approved these changes Nov 15, 2019
Copy link
Member

@CasperWA CasperWA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last * imports is not crucial for the merging of this PR, only for consistency with the rest of the repository.
So I vote to just get this is now, so that we can start developing our Transformers :)

Thank you immensely for this contribution @fekad 👍

Before merging, I would like the approval of @ml-evs as well - or I will at least leave a time-window of opportunity before merging 😅

@ml-evs
Copy link
Member

ml-evs commented Nov 15, 2019

Working on it! I think you've sorted out any requested changes to the code, so I'll just focus on double checking it all works in the example server. Would you mind commenting if you get a chance @fekad? I'm sure most of them are my fault... Current issues:

  • It seems like this might break the pagination at the moment; should http://localhost:5000/structures?filter=elements%20HAS%20%22Ac\%22?page_limit=5 work?
  • What am I doing wrong with this one? http://localhost:5000/structures?filter=species_at_sites%20HAS%20ALL%20%22Ba%22,%22F%22,%22H%22,%22Mn%22,%22O%22,%22Re%22,%22Si%22
  • I'm struggling to get the list grammar LENGTH, HAS ANY and HAS ALL working, are these supposed to work? (I may have missed it in the comments above). I'm just pushing my tests now so you can actually see what is failing (the request is in the error messages)

@CasperWA
Copy link
Member

CasperWA commented Nov 15, 2019

Working on it! I think you've sorted out any requested changes to the code, so I'll just focus on double checking it all works in the example server. Would you mind commenting if you get a chance @fekad? I'm sure most of them are my fault... Current issues:

  • It seems like this might break the pagination at the moment; should http://localhost:5000/structures?filter=elements%20HAS%20%22Ac\%22?page_limit=5 work?
  • What am I doing wrong with this one? http://localhost:5000/structures?filter=species_at_sites%20HAS%20ALL%20%22Ba%22,%22F%22,%22H%22,%22Mn%22,%22O%22,%22Re%22,%22Si%22
  • I'm struggling to get the list grammar LENGTH, HAS ANY and HAS ALL working, are these supposed to work? (I may have missed it in the comments above). I'm just pushing my tests now so you can actually see what is failing (the request is in the error messages)

As far as I can see, these are all valid queries (if I infer some % terminology). The only HAS that is currently implemented is property HAS value. Nothing else. Also LENGTH is not implemented.

@ml-evs
Copy link
Member

ml-evs commented Nov 15, 2019

Working on it! I think you've sorted out any requested changes to the code, so I'll just focus on double checking it all works in the example server. Would you mind commenting if you get a chance @fekad? I'm sure most of them are my fault... Current issues:

  • It seems like this might break the pagination at the moment; should http://localhost:5000/structures?filter=elements%20HAS%20%22Ac\%22?page_limit=5 work?
  • What am I doing wrong with this one? http://localhost:5000/structures?filter=species_at_sites%20HAS%20ALL%20%22Ba%22,%22F%22,%22H%22,%22Mn%22,%22O%22,%22Re%22,%22Si%22
  • I'm struggling to get the list grammar LENGTH, HAS ANY and HAS ALL working, are these supposed to work? (I may have missed it in the comments above). I'm just pushing my tests now so you can actually see what is failing (the request is in the error messages)

As far as I can see, these are all valid queries (if I infer some % terminology). The only HAS that is currently implemented is property HAS value. Nothing else. Also LENGTH is not implemented.

Ah okay, I was trying to match the grammar tests but I guess you mean its not implemented in the transformer? In that case I'm happy to add skips to these tests until we have them in. They do currently raise Lark errors rather than the NotImplementedError I'd expect from the code.

I assume the same thing doesn't apply to the pagination though?

@ml-evs
Copy link
Member

ml-evs commented Nov 15, 2019

A slight curiosity is that IS KNOWN seems to fail for aliased fields, e.g. id IS KNOWN and chemical_formula_descriptive IS KNOWN do not work, but lattice_vectors IS KNOWN works fine. I think I've skipped all the tests that we wouldn't expect to work now, so any remaining failures need to be investigated further.

@fekad
Copy link
Contributor Author

fekad commented Nov 15, 2019

A slight curiosity is that IS KNOWN seems to fail for aliased fields, e.g. id IS KNOWN and chemical_formula_descriptive IS KNOWN do not work, but lattice_vectors IS KNOWN works fine. I think I've skipped all the tests that we wouldn't expect to work now, so any remaining failures need to be investigated further.

Hi, thanks for the test cases and the review @ml-evs @CasperWA. This bug is actually caused by the _alias_filter function and StructureMapper because it ignores the lists when there is AND or OR relations. Of course, lattice_vectors works because it doesn't need to be mapped.

More about this here: #66 (comment)

@fekad
Copy link
Contributor Author

fekad commented Nov 15, 2019

Working on it! I think you've sorted out any requested changes to the code, so I'll just focus on double checking it all works in the example server. Would you mind commenting if you get a chance @fekad? I'm sure most of them are my fault... Current issues:

  • It seems like this might break the pagination at the moment; should http://localhost:5000/structures?filter=elements%20HAS%20%22Ac\%22?page_limit=5 work?
  • What am I doing wrong with this one? http://localhost:5000/structures?filter=species_at_sites%20HAS%20ALL%20%22Ba%22,%22F%22,%22H%22,%22Mn%22,%22O%22,%22Re%22,%22Si%22
  • I'm struggling to get the list grammar LENGTH, HAS ANY and HAS ALL working, are these supposed to work? (I may have missed it in the comments above). I'm just pushing my tests now so you can actually see what is failing (the request is in the error messages)

As far as I can see, these are all valid queries (if I infer some % terminology). The only HAS that is currently implemented is property HAS value. Nothing else. Also LENGTH is not implemented.

Ah okay, I was trying to match the grammar tests but I guess you mean its not implemented in the transformer? In that case I'm happy to add skips to these tests until we have them in. They do currently raise Lark errors rather than the NotImplementedError I'd expect from the code.

Yeah, only the implementation of the Transformer is missing the grammar/parser is there.

Unfortunately Lark code catches my NotImplementedError and raises its own error.

I assume the same thing doesn't apply to the pagination though?

This was a tricky one :). You have to use & combine multiple parameters in the URL. Cannot be white spaces around the & character and finally in the test data_returned has to be tested instead of data_available

Copy link
Member

@ml-evs ml-evs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fekad , I was definitely being stupid for a few of those, but the fixes/responses look good. Happy to accept as is, but I'll leave merging to you!

Copy link
Member

@CasperWA CasperWA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify what has happened concerning the predicate_comparison, i.e., should we simply merge here, and if this PR will change (again) before being merged, we "bug-fix" the grammar/transformer? Or should we wait until the spec. PR has been merged?

In my opinion, since we do not have 1 commit per 1 concept, it doesn't matter. More concretely, since the v0.10.0 grammar addition cannot be pinpointed to a specific commit, but rather has been implemented over several commits (and will be merged in like this via this PR), another commit to alter it after this PR has been merged, will not make a difference in the git history.

What do you think @fekad and @ml-evs? If we all agree, I will approve and merge.

optimade/filtertransformers/mongo.py Outdated Show resolved Hide resolved
optimade/filtertransformers/mongo.py Show resolved Hide resolved
optimade/filtertransformers/mongo.py Show resolved Hide resolved
@CasperWA
Copy link
Member

Since there have been no response, I will merge this and take it that you agree; we can "bug-fix" when the spec. PR has been merged, if needed.

@CasperWA CasperWA merged commit 2401873 into master Nov 19, 2019
Road to optimade-python-tools 1.0 automation moved this from In progress to Done Nov 19, 2019
@CasperWA CasperWA deleted the filter_v0.10.0 branch November 19, 2019 20:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants