Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing output for optional nodes #73

Open
kevinvanleer opened this issue Sep 14, 2023 · 1 comment
Open

Missing output for optional nodes #73

kevinvanleer opened this issue Sep 14, 2023 · 1 comment

Comments

@kevinvanleer
Copy link

kevinvanleer commented Sep 14, 2023

First of all, thanks for developing this great library! I am prototyping a search grammar in javascript with the hopes of porting it to Java for uses in an android app.

I am building a search grammar that takes the form expression (operator expression)*. If I build the grammar using optional repetition, the optional nodes are missing from the output.

Example:

peg file

grammar CyclotrackSearch
  query <- (expression @" "? expr_op? @" "?)+
  expression <- negation? @" "? lvalue @" " operator @" " rvalue
  expr_op <- "and" / "or"
  lvalue <- "distance" / "date"
  rvalue <- [a-zA-Z0-9]+
  operator <- "is" / "equals" / "greater than" / "less than" / "=" / "<" / ">" / ">=" / "<="
  negation <- "not" / "!"

test app.js

const search = require('./cyclotrack-search')

let tree = search.parse(process.argv[2])

console.log(JSON.stringify(tree,null,2));

for (let node of tree) {
  console.log(node.offset, node.text);
}

If I attempt to parse the following string:

'not distance is 20 or not date is today and not date is tomorrow'

I get the following output using the URL parser sample code:

0 not distance is 20 or
22 not date is today and
44 not date is tomorrow

I expected to get:

0 not distance is 20
19 or
22 not date is today
40 and
44 not date is tomorrow

If I change the grammar such that all nodes are required I get the expected output.

grammar CyclotrackSearch
  #query <- expression @" " expr_op @" " expression @" " expr_op @" " expression
  query <- (expression @" "? expr_op? @" "?)+
  expression <- negation? @" "? lvalue @" " operator @" " rvalue
  expr_op <- "and" / "or"
  lvalue <- "distance" / "date"
  rvalue <- [a-zA-Z0-9]+
  operator <- "is" / "equals" / "greater than" / "less than" / "=" / "<" / ">" / ">=" / "<="
  negation <- "not" / "!"

If I print the entire tree I see that all required nodes are in the tree, but optional nodes are not. Here is an excerpt from an expression node:

"expression": {
        "text": "not distance is 20",
        "offset": 0,
        "elements": [
          {
            "text": "not",
            "offset": 0,
            "elements": []
          },
          {
            "text": "distance",
            "offset": 4,
            "elements": []
          },
          {
            "text": "is",
            "offset": 13,
            "elements": []
          },
          {
            "text": "20",
            "offset": 16,
            "elements": [
              {
                "text": "2",
                "offset": 16,
                "elements": []
              },
              {
                "text": "0",
                "offset": 17,
                "elements": []
              }
            ]
          }
        ],
        "lvalue": {
          "text": "distance",
          "offset": 4,
          "elements": []
        },
        "operator": {
          "text": "is",
          "offset": 13,
          "elements": []
        },
        "rvalue": {
          "text": "20",
          "offset": 16,
          "elements": [
            {
              "text": "2",
              "offset": 16,
              "elements": []
            },
            {
              "text": "0",
              "offset": 17,
              "elements": []
            }
          ]
        }
      }
    },

You can see that all nodes other than the optional negation node are listed. The negation node is parsed as a member of the elements list. This is also true for the expr_op nodes. They are only included in the tree if they are required nodes.

If I change the grammar to use a different form of repetition I get a third output:

query <- expression (@" " expr_op @" " expression)*

output:

0 not distance is 20
18  or not date is today and not date is tomorrow

However the expr_op nodes are included as nodes in the second blob.

It seems like

query <- expression (@" " expr_op @" " expression)*

and

query <- (expression @" "? expr_op? @" "?)+

should produce the same output and that the expr_op nodes and expression nodes should be at the same level in the tree. Your help is greatly appreciated!

EDIT: I was hoping the tree object would look something like this:

{
[
  {expression: {
    negation: "",
    lvalue: "",
    operator: "",
    rvalue: "",
  }},
  {expr_op: ""},
  {expression: {
    negation: "",
    lvalue: "",
    operator: "",
    rvalue: "",
  }},
  {expr_op: ""},
  {expression: {
    negation: "",
    lvalue: "",
    operator: "",
    rvalue: "",
  }}
]
}
@jcoglan
Copy link
Owner

jcoglan commented Sep 19, 2023

The key thing to remember is that the structure of the output mirrors the
structure of the grammar. In your first grammar you have these rules:

query <- (expression @" "? expr_op? @" "?)+
expression <- negation? @" "? lvalue @" " operator @" " rvalue

You try to parse this string:

'not distance is 20 or not date is today and not date is tomorrow'

And you get this sequence of elements from the root node of the tree:

not distance is 20 or
not date is today and
not date is tomorrow

This mirrors the structure of the query rule whose expression is (expression @" "? expr_op? @" "?)+ -- a repetition of one or more instances of the
expression expression @" "? expr_op? @" "?. So each element of a match for
query will contain both an expression and a expr_op.

In order to get this output:

not distance is 20
or
not date is today
and
not date is tomorrow

You would need a grammar where the query rule was a repetion of either
expression or expr_op, i.e. (expression / expr_op)+ (with spaces added
where necessary). This is probably not what you want, because it means the
grammar would match nonsense queries like or or and or.

If I change the grammar such that all nodes are required I get the expected
output.

I'm not sure what you mean by "required" here, and the grammar presented after
this paragraph is identical to the first one.

Then you give another version of the query rule:

query <- expression (@" " expr_op @" " expression)*

This rule is a sequence of two elements: it requires one expression followed
by zero or more instances of (" " expr_op " " expression). Here is the output
showing which part of the rule is responsible for each part:

not distance is 20                                  expression
or not date is today and not date is tomorrow       (@" " expr_op @" " expression)*

If you look inside the second element here you'll see multiple elements for each
match of the (...)* expression:

or not date is today
and not date is tomorrow

This rule looks to me like the correct grammar, as you seem to want your query
language to allow one or more expressions joined together by and/or
operators. Every other version of query you've shown will also match other
strings that won't make any sense, e.g. allowing the string to start or end with
an operator, or be composed entirely of operators.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants