Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding does not properly group #18

Open
Alan-Chen99 opened this issue Apr 15, 2024 · 12 comments · Fixed by #19
Open

adding does not properly group #18

Alan-Chen99 opened this issue Apr 15, 2024 · 12 comments · Fixed by #19

Comments

@Alan-Chen99
Copy link

import re

from regexfactory import *

r = str(ANCHOR_START + Or("abc", "xyz"))
print(repr(r))
print(re.search(r, "-xyz"))

results in

'^(?:abc)|(?:xyz)'
<re.Match object; span=(1, 4), match='xyz'>

which is incorrect

@GrandMoff100
Copy link
Owner

Oh interesting, I didn't know that ^ had a higher precedence than |.

I guess I can have a precedence system and surround a pattern with a non capturing group in the __radd__ method when we append a character that has a higher precedence than itself.

I'm on it.

@GrandMoff100
Copy link
Owner

Thanks for posting this btw!

@GrandMoff100 GrandMoff100 linked a pull request Apr 16, 2024 that will close this issue
@GrandMoff100
Copy link
Owner

@Alan-Chen99 try version regexfactory==1.0.1

@Alan-Chen99
Copy link
Author

Thanks!

I think "operations" also need to assign precedence. For ex

import re

from regexfactory import *

r = str(Or("a", "b") + Or("x", "y"))
print(repr(r))
print(re.search(r, "a"))

outputs

'(?:a)|(?:b)(?:x)|(?:y)'
<re.Match object; span=(0, 1), match='a'>

The precedence of a raw regex also need to be lower:

import re

from regexfactory import *

r = str(RegexPattern("^") + RegexPattern("a|b"))
print(repr(r))
print(re.search(r, "-b"))

outputs

'^a|b'
<re.Match object; span=(1, 2), match='b'>

I think two things need to be done:

  • operations (or, concat, etc) should group operands if the precedence of the operands is lower the the precendence of the operation itself
  • the default precedence of a raw regex should be the lowest (this will also account for future additions from python i think? since we will never generate them yet)

@Alan-Chen99
Copy link
Author

@GrandMoff100

@GrandMoff100
Copy link
Owner

Could you provide examples of what regex you would have it generate ideally in specific scenarios? This will help me implement the functionality you're looking for.

On another note: RegexPattern is not a class intended to be instantiated directly. Until I develop a parser system to parse raw regex strings into a tree of RegexFactory objects I don't intend to provide any support for "raw" regex strings. Without that parser system I don't want to restructure the parent class, RegexPattern, to take on a whole new functionality as a raw regex pattern class because it doesn't make sense for the children to inherit that raw string functionality.

However, the example you sent with the Or's being concatenated does look a little wonky so I will look into that in the next few days.

@GrandMoff100
Copy link
Owner

Looking at the Or example more closely it looks like the b and x patterns are getting interpreted as a merged Or option. So instead of the compiled string being a two character pattern with two options per character I think it might be interpreting the pattern as "a" or "bx" or "y" with three cases. I need to confirm this, but if I'm right then this shouldn't happen and I think I just need to implement a group around Or's specifically when they get concatted. Rather than creating a precedence system for operations which I don't entirely understand how would work.

@Alan-Chen99
Copy link
Author

I made a PR which assigns a "_precedence" to a RegexPattern, which is the precedence of the "root node" if one would to parse the underlying regex.

It generates some excessive parenthesis, for ex ANCHOR_START + "ab" now returns '(?:^)ab'. the emacs rx library appears to solve this by
(see https://github.com/emacs-mirror/emacs/blob/master/lisp/emacs-lisp/rx.el)

;; The `rx--translate...' functions below return (REGEXP . PRECEDENCE),
;; where REGEXP is a list of string expressions that will be
;; concatenated into a regexp, and PRECEDENCE is one of
;;
;;  t    -- can be used as argument to postfix operators (eg. "a")
;;  seq  -- can be concatenated in sequence with other seq or higher (eg. "ab")
;;  lseq -- can be concatenated to the left of rseq or higher (eg. "^a")
;;  rseq -- can be concatenated to the right of lseq or higher (eg. "a$")
;;  nil  -- can only be used in alternatives (eg. "a\\|b")
;;
;; They form a lattice:
;;
;;           t          highest precedence
;;           |
;;          seq
;;         /   \
;;      lseq   rseq
;;         \   /
;;          nil         lowest precedence

imo we should just use an extra group to keep stuff simpler

@Alan-Chen99
Copy link
Author

I don't intend to provide any support for "raw" regex strings

doesnt a literal str currently represent a "raw" regex?

@GrandMoff100 GrandMoff100 reopened this May 2, 2024
@GrandMoff100
Copy link
Owner

doesnt a literal str currently represent a "raw" regex?

Yeah, I suppose concatenating a literal string would represent "raw" regex, but what I meant was that I don't intend to be responsible for behavior of regexfactory when users concat literal strings because there are so many edge cases and head-aches that I don't want to deal with. Now that I think about it, we should raise exception when we try to concat a non-RegexPattern object. (i.e. ANCHOR_START + "ab").

I think then I'd also add LiteralString class as well in the case where you want to match a specific string of text. (This would let us do something like ANCHOR_START + LiteralString("ab") ---> ^ab. That would make my life easier when I parse raw regex expressions in RegexFactory components, to have a dedicated component for literal text patterns which keeps type consistency.

imo we should just use an extra group to keep stuff simpler

yeah, by that I presume you mean putting a group around the Or's when they get concatenated together? If so, that's what I wanted to do to begin with:

I think I just need to implement a group around Or's specifically when they get concatted.

@Alan-Chen99
Copy link
Author

Alan-Chen99 commented May 2, 2024

extra group to keep stuff simpler

actually i meant '(?:^)ab'.

I thought one need to use the diamond above, but actually regex in python seem to work differently then elisp

it seems that i can treat "^" just as a normal char. not sure though, its marked as having a precedence below concatenation according to https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_08, so if we follow that "^ab" is invalid and we need to make '(?:^)ab'
but turns out python just treat ^ as a char?

@GrandMoff100
Copy link
Owner

If you wanted to match the string "^ab" literally you can use the pattern \^ab you just have to escape the ^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants