adding does not properly group #18

Alan-Chen99 · 2024-04-15T18:49:44Z

import re

from regexfactory import *

r = str(ANCHOR_START + Or("abc", "xyz"))
print(repr(r))
print(re.search(r, "-xyz"))

results in

'^(?:abc)|(?:xyz)'
<re.Match object; span=(1, 4), match='xyz'>

which is incorrect
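
The mismatch can be reproduced with the standard `re` module alone, using the string printed above. This is only a minimal sketch: the `grouped` pattern is a hand-written guess at the expected form, not necessarily what the library emits after a fix.

```python
import re

# Pattern string exactly as printed in the report above.
ungrouped = r"^(?:abc)|(?:xyz)"
# Hand-written variant with the whole alternation wrapped in one group,
# so the anchor applies to both branches (an assumed "expected" form).
grouped = r"^(?:(?:abc)|(?:xyz))"

# '|' binds more loosely than '^', so the first pattern parses as
# (^abc) | (xyz) and the second branch is not anchored.
unanchored_hit = re.search(ungrouped, "-xyz")
anchored_miss = re.search(grouped, "-xyz")
anchored_hit = re.search(grouped, "xyz")

print(unanchored_hit)  # a match on 'xyz' at span (1, 4)
print(anchored_miss)   # None
print(anchored_hit)    # a match on 'xyz' at span (0, 3)
```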


GrandMoff100 · 2024-04-16T01:31:15Z

Oh interesting, I didn't know that ^ had a higher precedence than |.

I guess I can have a precedence system and surround a pattern with a non-capturing group in the __radd__ method when we prepend a pattern that has a higher precedence than the pattern itself.

I'm on it.

GrandMoff100 · 2024-04-16T01:34:58Z

Thanks for posting this btw!

GrandMoff100 · 2024-04-16T15:45:40Z

@Alan-Chen99 try version regexfactory==1.0.1

Alan-Chen99 · 2024-05-02T03:35:21Z

Thanks!

I think "operations" also need to be assigned a precedence. For example:

import re

from regexfactory import *

r = str(Or("a", "b") + Or("x", "y"))
print(repr(r))
print(re.search(r, "a"))

outputs

'(?:a)|(?:b)(?:x)|(?:y)'
<re.Match object; span=(0, 1), match='a'>

The precedence of a raw regex also needs to be lower:

import re

from regexfactory import *

r = str(RegexPattern("^") + RegexPattern("a|b"))
print(repr(r))
print(re.search(r, "-b"))

outputs

'^a|b'
<re.Match object; span=(1, 2), match='b'>

I think two things need to be done:

1. Operations (or, concat, etc.) should group an operand if the operand's precedence is lower than the precedence of the operation itself.
2. The default precedence of a raw regex should be the lowest. (I think this would also account for future syntax additions in Python, since we don't generate them yet.)
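
The two rules above could be sketched roughly as follows. Everything here is hypothetical and independent of regexfactory: the `Pat` class, the `concat` and `either` helpers, and the three precedence levels are made up for illustration, and treating "^" as an atom is a simplifying assumption.

```python
import re
from dataclasses import dataclass

# Hypothetical precedence levels; a lower number binds more loosely.
ALTERNATION = 0  # a|b, and raw strings of unknown structure
CONCAT = 1       # ab
ATOM = 2         # a single char, or an already-grouped pattern


@dataclass(frozen=True)
class Pat:
    """Toy pattern node; not regexfactory's RegexPattern."""
    regex: str
    precedence: int = ALTERNATION  # raw input defaults to the lowest level

    def for_operation(self, op_precedence: int) -> str:
        # Rule 1: group the operand only if it binds more loosely
        # than the operation it is being used in.
        if self.precedence < op_precedence:
            return f"(?:{self.regex})"
        return self.regex


def concat(*parts: Pat) -> Pat:
    return Pat("".join(p.for_operation(CONCAT) for p in parts), CONCAT)


def either(*parts: Pat) -> Pat:
    return Pat("|".join(p.for_operation(ALTERNATION) for p in parts), ALTERNATION)


a, b, x, y = (Pat(c, ATOM) for c in "abxy")
two_chars = concat(either(a, b), either(x, y))
# Rule 2: the raw string "a|b" gets the lowest precedence by default.
anchored_raw = concat(Pat("^", ATOM), Pat("a|b"))

print(two_chars.regex)                      # (?:a|b)(?:x|y)
print(anchored_raw.regex)                   # ^(?:a|b)
print(re.search(two_chars.regex, "a"))      # None
print(re.search(anchored_raw.regex, "-b"))  # None
```

With this scheme an atom inside an alternation is never wrapped, so the output stays shorter than wrapping every operand unconditionally.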

Alan-Chen99 · 2024-05-02T03:41:30Z

@GrandMoff100

GrandMoff100 · 2024-05-02T04:00:39Z

Could you provide examples of what regex you would have it generate ideally in specific scenarios? This will help me implement the functionality you're looking for.

On another note: RegexPattern is not a class intended to be instantiated directly. Until I develop a parser system to parse raw regex strings into a tree of RegexFactory objects I don't intend to provide any support for "raw" regex strings. Without that parser system I don't want to restructure the parent class, RegexPattern, to take on a whole new functionality as a raw regex pattern class because it doesn't make sense for the children to inherit that raw string functionality.

However, the example you sent with the Or's being concatenated does look a little wonky so I will look into that in the next few days.

GrandMoff100 · 2024-05-02T04:06:17Z

Looking at the Or example more closely, it looks like the b and x patterns are getting interpreted as a merged Or option. So instead of the compiled string being a two-character pattern with two options per character, I think it might be interpreting the pattern as "a" or "bx" or "y", with three cases. I need to confirm this, but if I'm right then this shouldn't happen, and I think I just need to implement a group around Or's specifically when they get concatenated, rather than creating a precedence system for operations, which I don't entirely understand.
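
The three-case reading can be checked with `re.fullmatch` against the string printed in the earlier comment. The candidate strings below are just a hand-picked sample.

```python
import re

# Pattern string exactly as printed in the earlier comment.
pattern = r"(?:a)|(?:b)(?:x)|(?:y)"

# Four two-character strings one might expect to match, plus the
# three alternatives suggested above.
candidates = ["a", "bx", "y", "ax", "ay", "by"]
accepted = [s for s in candidates if re.fullmatch(pattern, s)]

print(accepted)  # ['a', 'bx', 'y']
```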

Alan-Chen99 · 2024-05-02T16:59:02Z

I made a PR which assigns a "_precedence" to a RegexPattern, which is the precedence of the "root node" if one were to parse the underlying regex.

It generates some unnecessary groups; for example, ANCHOR_START + "ab" now returns '(?:^)ab'. The Emacs rx library appears to solve this with a precedence lattice
(see https://github.com/emacs-mirror/emacs/blob/master/lisp/emacs-lisp/rx.el):

;; The `rx--translate...' functions below return (REGEXP . PRECEDENCE),
;; where REGEXP is a list of string expressions that will be
;; concatenated into a regexp, and PRECEDENCE is one of
;;
;;  t    -- can be used as argument to postfix operators (eg. "a")
;;  seq  -- can be concatenated in sequence with other seq or higher (eg. "ab")
;;  lseq -- can be concatenated to the left of rseq or higher (eg. "^a")
;;  rseq -- can be concatenated to the right of lseq or higher (eg. "a$")
;;  nil  -- can only be used in alternatives (eg. "a\\|b")
;;
;; They form a lattice:
;;
;;           t          highest precedence
;;           |
;;          seq
;;         /   \
;;      lseq   rseq
;;         \   /
;;          nil         lowest precedence

imo we should just use an extra group to keep stuff simpler
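
For comparison, here is one possible reading of the concatenation rules in the rx.el comment above, written in Python with Python's `(?:...)` group syntax. This is an interpretation of the quoted comment only, not a port of rx.el; `concat_pair`, `bracket`, and the two sets are hypothetical names.

```python
# Precedence names as listed in the rx.el comment quoted above.
T, SEQ, LSEQ, RSEQ, NIL = "t", "seq", "lseq", "rseq", "nil"

# Precedences that may appear unbracketed on each side of a concatenation,
# according to this reading of the comment (assumption).
OK_ON_LEFT = {T, SEQ, LSEQ}
OK_ON_RIGHT = {T, SEQ, RSEQ}


def bracket(regex: str) -> str:
    """Wrap a regex in a non-capturing group."""
    return f"(?:{regex})"


def concat_pair(left: tuple, right: tuple) -> str:
    """Concatenate two (regex, precedence) pairs, bracketing where needed."""
    l_re, l_prec = left
    r_re, r_prec = right
    if l_prec not in OK_ON_LEFT:
        l_re = bracket(l_re)
    if r_prec not in OK_ON_RIGHT:
        r_re = bracket(r_re)
    return l_re + r_re


anchor_first = concat_pair(("^a", LSEQ), ("b", T))
anchor_second = concat_pair(("a", T), ("^b", LSEQ))
alt_first = concat_pair(("a|b", NIL), ("c", T))

print(anchor_first)   # ^ab
print(anchor_second)  # a(?:^b)
print(alt_first)      # (?:a|b)c
```

Under this reading, '^' followed by "ab" needs no group, which is the case where the unconditional extra group looks redundant.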

Alan-Chen99 · 2024-05-02T17:06:59Z

> I don't intend to provide any support for "raw" regex strings

Doesn't a literal str currently represent a "raw" regex?

GrandMoff100 · 2024-05-02T20:04:20Z

> Doesn't a literal str currently represent a "raw" regex?

Yeah, I suppose concatenating a literal string would represent "raw" regex, but what I meant was that I don't intend to be responsible for the behavior of regexfactory when users concat literal strings, because there are so many edge cases and headaches that I don't want to deal with. Now that I think about it, we should raise an exception when we try to concat a non-RegexPattern object (e.g. ANCHOR_START + "ab").

I think I'd then also add a LiteralString class for the case where you want to match a specific string of text. (This would let us do something like ANCHOR_START + LiteralString("ab") ---> ^ab.) That would make my life easier when I parse raw regex expressions into RegexFactory components: a dedicated component for literal text patterns keeps the types consistent.
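
A rough sketch of how the two ideas above (rejecting plain strings, and a dedicated literal class) might fit together. The classes below are standalone toys written for this illustration; they are not regexfactory's actual RegexPattern or a released LiteralString, and escaping via `re.escape` is an assumption about how such a class would behave.

```python
import re


class Pattern:
    """Toy base class; stands in for regexfactory's RegexPattern."""

    def __init__(self, regex: str):
        self.regex = regex

    def __add__(self, other: "Pattern") -> "Pattern":
        # Refuse plain strings instead of guessing whether they are raw regex.
        if not isinstance(other, Pattern):
            raise TypeError(
                f"cannot concatenate Pattern with {type(other).__name__}"
            )
        return Pattern(self.regex + other.regex)


class LiteralString(Pattern):
    """Matches the given text exactly; regex metacharacters are escaped."""

    def __init__(self, text: str):
        super().__init__(re.escape(text))


ANCHOR_START = Pattern("^")

simple = (ANCHOR_START + LiteralString("ab")).regex
with_meta = (ANCHOR_START + LiteralString("a|b")).regex

print(simple)     # ^ab
print(with_meta)  # ^a\|b
```

Because the literal text is escaped up front, a "|" inside it can never leak out as an alternation, which sidesteps the grouping question for that component.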

> imo we should just use an extra group to keep stuff simpler

Yeah, by that I presume you mean putting a group around the Or's when they get concatenated together? If so, that's what I wanted to do to begin with:

> I think I just need to implement a group around Or's specifically when they get concatted.

Alan-Chen99 · 2024-05-02T22:18:42Z

> extra group to keep stuff simpler

Actually, I meant '(?:^)ab'.

I thought one would need to use the diamond above, but regex in Python actually seems to work differently than in elisp.

It seems that I can treat "^" just like a normal char. I'm not sure, though: it is listed as having a precedence below concatenation in https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_04_08, so if we follow that, "^ab" is invalid and we need to generate '(?:^)ab'.
But it turns out Python just treats ^ as a char?
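
A quick check with `re` suggests that Python accepts "^" directly followed by other items and that the extra group makes no difference to what is matched, at least on the hand-picked strings below. Note that "^" still acts as an anchor here rather than as a literal character.

```python
import re

plain = re.compile(r"^ab")
grouped = re.compile(r"(?:^)ab")

samples = ["ab", "abc", "xab", "^ab"]
results = {s: (bool(plain.search(s)), bool(grouped.search(s))) for s in samples}

for s, (plain_hit, grouped_hit) in results.items():
    print(repr(s), plain_hit, grouped_hit)
# 'ab' True True
# 'abc' True True
# 'xab' False False
# '^ab' False False
```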

GrandMoff100 · 2024-05-02T22:29:47Z

If you want to match the string "^ab" literally, you can use the pattern \^ab; you just have to escape the ^.
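
For instance, using only the standard library (`re.escape` produces the same escaped pattern):

```python
import re

escaped = r"\^ab"

literal_hit = re.search(escaped, "^ab")
plain_miss = re.search(escaped, "ab")

print(literal_hit)                  # a match on the literal text '^ab'
print(plain_miss)                   # None
print(re.escape("^ab") == escaped)  # True
```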

GrandMoff100 mentioned this issue Apr 16, 2024

Fix pattern precedence's #19

Merged

GrandMoff100 linked a pull request Apr 16, 2024 that will close this issue

Fix pattern precedence's #19

Merged

GrandMoff100 closed this as completed in #19 Apr 16, 2024

GrandMoff100 reopened this May 2, 2024
