Regex Library #19

Open
CeleritasCelery opened this issue Jan 14, 2023 · 0 comments
Labels
design needed Items where more design help is needed

CeleritasCelery commented Jan 14, 2023

Emacs regex is similar to PCRE regex, so we could use the fancy-regex crate (which implements a backtracking engine) once #84 is fixed. However, there are still several differences that would need to be handled.

meta characters

Emacs regex meta characters are backwards from what most regex flavors use. For example, () represents literal parens, and \(\) is a capture group. Likewise, | is literal, and \| is alternation. This is easy enough to fix by pre-processing the regex.
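
The pre-processing step could look roughly like the sketch below. The function name `emacs_to_pcre` is made up for illustration, and it only covers the three characters mentioned above; bracket expressions and the other Emacs-only escapes would need extra handling.

```rust
/// Swap the escaping of `(`, `)` and `|` so an Emacs-style pattern can be
/// handed to a PCRE-style engine. Hypothetical helper: bracket expressions
/// and other Emacs-specific escapes are NOT handled here.
fn emacs_to_pcre(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => match chars.next() {
                // Escaped in Emacs means "special": emit the bare character.
                Some(n @ ('(' | ')' | '|')) => out.push(n),
                // Any other escape is passed through unchanged.
                Some(n) => {
                    out.push('\\');
                    out.push(n);
                }
                // Trailing backslash: keep it so the engine can report it.
                None => out.push('\\'),
            },
            // Bare in Emacs means "literal": escape it for the target engine.
            '(' | ')' | '|' => {
                out.push('\\');
                out.push(c);
            }
            _ => out.push(c),
        }
    }
    out
}
```

With this, `\(a\|b\)` becomes `(a|b)`, and a literal `f(x)` becomes `f\(x\)`.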

syntax aware matches

Several of the regex patterns match on the syntax definition of characters.

  • \w: word character
  • \sC: character whose syntax class is C (for example, \s_ for symbol constituents)

"Word" and "symbol" are defined by the major mode's syntax table. You could transform these into general character classes ([...]) for the Rust regex engine.
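
As a rough sketch of that transformation: the `SyntaxTable` type here is a hypothetical stand-in that only lists the ranges of word characters, not the real syntax table. Each character is written as a `\x{HEX}` escape so nothing inside the class needs special escaping.

```rust
/// Hypothetical stand-in for a major mode's syntax table: just the
/// inclusive ranges of characters that have word syntax.
struct SyntaxTable {
    word_ranges: Vec<(char, char)>,
}

/// Build a `[...]` class matching every word-constituent character.
/// Assumes at least one range; an empty `[]` is not a valid class.
fn word_class(table: &SyntaxTable) -> String {
    let mut class = String::from("[");
    for &(lo, hi) in &table.word_ranges {
        // Write the code point as a hex escape, e.g. 'a' becomes \x{61}.
        class.push_str(&format!("\\x{{{:x}}}", lo as u32));
        if hi != lo {
            class.push_str(&format!("-\\x{{{:x}}}", hi as u32));
        }
    }
    class.push(']');
    class
}
```

The resulting string could then be substituted for `\w` during pre-processing, and rebuilt whenever the active syntax table changes.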

There is also the special construct \=, which matches the empty string at point. To handle this you could split the buffer into two parts: before point and after point. Then match each half separately.
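
One way this might be sketched is to split the pattern at `\=` as well, and anchor each half to the point side of its text. Both function names are hypothetical, and this assumes `\=` appears at the top level of the pattern (not inside a group or alternation) and that the pattern has already been translated to the target syntax. The `\A` and `\z` anchors follow the syntax of the `regex` crate.

```rust
/// Split a pattern at the first unescaped `\=` marker.
/// Assumes the marker is at the top level of the pattern.
fn split_at_point(pattern: &str) -> Option<(&str, &str)> {
    let bytes = pattern.as_bytes();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'\\' {
            if bytes[i + 1] == b'=' {
                return Some((&pattern[..i], &pattern[i + 2..]));
            }
            // Skip the escaped character so `\\=` is not mistaken for `\=`.
            i += 2;
        } else {
            i += 1;
        }
    }
    None
}

/// Anchor the two halves: the first must end exactly at the end of the
/// text before point, the second must start at the text after point.
fn anchored_halves(pattern: &str) -> Option<(String, String)> {
    let (pre, post) = split_at_point(pattern)?;
    Some((format!("(?:{})\\z", pre), format!("\\A(?:{})", post)))
}
```

The first half would be searched in the text before point and the second in the text after point; the overall match succeeds only if both do.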

boundaries

Emacs defines a regex for the boundary of words and symbols.

  • \<: beginning of word
  • \>: end of word
  • \_<: beginning of symbol
  • \_>: end of symbol

These will need to be implemented with look-arounds. You can’t even build them into the regex engine because they can change per major mode.
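
A minimal sketch of the look-around expansion, assuming the word (or symbol) class has already been rendered as a `[...]` string for the current major mode. The `Boundaries` struct and `boundaries` function are hypothetical names, and the output relies on look-behind support such as fancy-regex provides.

```rust
/// Look-around replacements for one pair of Emacs boundary escapes,
/// e.g. `\<` / `\>` for words or `\_<` / `\_>` for symbols.
struct Boundaries {
    start: String,
    end: String,
}

/// `class` is a bracket expression such as "[a-z]" built from the
/// current syntax table (assumed to be well-formed).
fn boundaries(class: &str) -> Boundaries {
    Boundaries {
        // Beginning: no constituent before, a constituent after.
        start: format!("(?<!{0})(?={0})", class),
        // End: a constituent before, no constituent after.
        end: format!("(?<={0})(?!{0})", class),
    }
}
```

Because the class is passed in, the same function covers both the word and symbol variants, and the expansion can be regenerated when the major mode changes.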

Buffer Gap

Most performance-oriented regex libraries expect to operate on contiguous data. However, a gap buffer will have a gap of garbage data somewhere in the buffer. This becomes a problem when the span of the regex search crosses the gap. The simplest solution here is to move the gap outside of the range of the search. This could cause performance issues if the lines are really long. We also have to consider how to match multiline regex; I'm not sure of a good way to handle that. Here are some notes from the remacs project.
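
To make the "move the gap out of the way" idea concrete, here is a toy sketch. `GapBuffer` is a hypothetical byte-based type, not the project's actual buffer, and it only moves the gap when the requested range straddles it.

```rust
use std::ops::Range;

/// Toy gap buffer over bytes (hypothetical, for illustration only).
struct GapBuffer {
    data: Vec<u8>,
    gap_start: usize, // physical index of the first gap byte
    gap_end: usize,   // physical index one past the last gap byte
}

impl GapBuffer {
    /// Create a buffer holding `text` with a gap of `gap` bytes at the front.
    fn new(text: &str, gap: usize) -> Self {
        let mut data = vec![0; gap];
        data.extend_from_slice(text.as_bytes());
        GapBuffer { data, gap_start: 0, gap_end: gap }
    }

    /// Number of text bytes, excluding the gap.
    fn len(&self) -> usize {
        self.data.len() - (self.gap_end - self.gap_start)
    }

    /// Move the gap so that it starts at logical position `pos`.
    fn move_gap(&mut self, pos: usize) {
        assert!(pos <= self.len());
        let gap_len = self.gap_end - self.gap_start;
        if pos < self.gap_start {
            // Shift the text in [pos, gap_start) up to the end of the gap.
            self.data.copy_within(pos..self.gap_start, pos + gap_len);
        } else if pos > self.gap_start {
            // Shift the text that follows the gap down into the gap.
            self.data.copy_within(self.gap_end..pos + gap_len, self.gap_start);
        }
        self.gap_start = pos;
        self.gap_end = pos + gap_len;
    }

    /// Return a logical range as one contiguous slice, moving the gap
    /// just past the range only if the range straddles it.
    fn contiguous(&mut self, range: Range<usize>) -> &[u8] {
        assert!(range.start <= range.end && range.end <= self.len());
        if range.start < self.gap_start && range.end > self.gap_start {
            self.move_gap(range.end);
        }
        if range.end <= self.gap_start {
            // Entirely before the gap: physical equals logical.
            &self.data[range]
        } else {
            // Entirely after the gap: offset by the gap length.
            let gap_len = self.gap_end - self.gap_start;
            &self.data[range.start + gap_len..range.end + gap_len]
        }
    }
}
```

The slice returned by `contiguous` is what would be handed to the regex engine. The cost of `move_gap` is proportional to the distance the gap travels, which is where very long lines (or searches over the whole buffer) could hurt.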

@CeleritasCelery CeleritasCelery added the design needed Items where more design help is needed label Jan 14, 2023