Backslash sequences #5
-
BTW, I've bought both RegexBuddy and RegexMagic from JGS (author of the website you linked), so if you need me to test some RegExs for you, I'll happily do it. Both tools have a custom engine that includes all versions of the major RegEx engines (so that you can test backward-compatibility issues with any engine), plus the custom engine by JGS, which is very powerful (also documented at the website). One of these two programs also allows debugging a RegEx by breaking it down into its individual steps, in case you need to compare the expected behaviour in your code with the actual behaviour of other engines. As for the shorthand classes to implement, it really depends on what your engine's goals are, which I'm guessing are mostly oriented toward lexer creation? I'm not quite sure that Some other useful shorthands can be found here:
I know that the above don't all qualify as character shorthands, since some of them are more abstract in nature, but still...
-
I have now looked again at the documentation of Rust's crate. With the help of the very useful tool, I have adjusted the listing at the top and the relevant issues to use Unicode's character classes. For example,
-
A backslash sequence that matches a byte value can also be useful, so that the RegEx engine can be used to match byte sequences (besides characters) in binary data. This is no longer unusual: the question comes up again and again on the Internet (search for: regex binary data). The NFA/DFA of the RegEx engine already works on bytes, so the implementation should not be difficult. Example: Syntax:
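For comparison, Python's built-in `re` module already accepts bytes patterns, in which `\xhh` matches a single byte value. The sketch below only illustrates the idea in another engine; it is not the syntax proposed for this engine, and the sample data and the names `PNG_MAGIC` and `find_png_offsets` are made up for the example.

```python
import re

# Bytes pattern: each \xhh escape matches exactly one byte value.
# This is the 8-byte PNG file signature.
PNG_MAGIC = re.compile(rb"\x89PNG\r\n\x1a\n")


def find_png_offsets(data: bytes) -> list:
    """Return the byte offsets at which the PNG signature occurs in data."""
    return [m.start() for m in PNG_MAGIC.finditer(data)]


# Hypothetical binary blob: 3 junk bytes, a signature, 2 junk bytes,
# then a second signature.
blob = b"\x00\x01\x02" + b"\x89PNG\r\n\x1a\n" + b"\xff\xfe" + b"\x89PNG\r\n\x1a\n"
offsets = find_png_offsets(blob)
print(offsets)  # [3, 13]
```

Because the pattern and the input are both `bytes`, no text decoding happens at any point, which is the property that matters when searching binary data.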
-
In addition to character classes, there will also be shorthand character classes. However, I'm not quite sure yet which ones there should be and which characters they should cover.
According to this website, the different RegEx engines cover different characters in the shorthand character classes:
https://www.regular-expressions.info/shorthand.html
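To make the difference concrete, here is a small sketch using Python's `re` module (chosen only because it is easy to run, not because this engine should copy it). In Python 3, `\d` in a text pattern matches any Unicode decimal digit by default, while the `re.ASCII` flag restricts it to `0-9`. The sample characters are arbitrary picks for illustration.

```python
import re

# Arbitrary sample characters chosen for illustration.
samples = {
    "ASCII digit": "7",
    "Arabic-Indic digit three": "\u0663",
    "Latin letter": "x",
}

unicode_digit = re.compile(r"\d")          # text patterns are Unicode-aware by default
ascii_digit = re.compile(r"\d", re.ASCII)  # re.ASCII restricts \d to [0-9]

# For each sample: (matches Unicode-aware \d, matches ASCII-only \d)
results = {
    name: (bool(unicode_digit.fullmatch(ch)), bool(ascii_digit.fullmatch(ch)))
    for name, ch in samples.items()
}

for name, (uni, asc) in results.items():
    print(f"{name}: unicode={uni} ascii={asc}")
```

The same shorthand therefore accepts a different set of characters depending on the mode, which is the decision that has to be made for each class in the listing.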
The current listing:
\r for carriage return (Add escape sequence \r (carriage return) #8)
\n for new line (Add escape sequence \n (line feed) #9)
\t for horizontal tab character (Add escape sequence \t (horizontal tab) #10)
\f for form feed (Add escape sequence \f (form feed) #22)
\d for digit (Add predefined character class \d (digit) #14)
\D for no digit (Add predefined character class \D (no digit) #23)
\s for whitespace character (Add predefined character class \s (whitespace) #15)
\S for no whitespace character (Add predefined character class \S (no whitespace) #24)
\w for word character (Add predefined character class \w (word character) #25)
\W for no word character (Add predefined character class \W (no word character) #26)
\xhh (Add escape sequence \xhh (character with hex code hh) #17)
\uhhhh (Add escape sequence \uhhhh (character with hex code hhhh) #18)
\Q...\E (Add escape sequence \Q...\E #13)
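The listing above could be handled by a single escape-parsing routine along the following lines. This is only a rough sketch in Python, not the engine's actual code: the function name `parse_escape`, the return shape, and the ASCII-only class definitions (including the exact whitespace set for `\s`) are all assumptions made for illustration, since which characters each class should cover is the open question here.

```python
import string

# Hypothetical ASCII-only definitions; the real character coverage is undecided.
SIMPLE = {"r": "\r", "n": "\n", "t": "\t", "f": "\f"}
CLASSES = {
    "d": set(string.digits),
    "s": set(" \t\r\n\f\v"),
    "w": set(string.ascii_letters + string.digits + "_"),
}
ALL_ASCII = {chr(c) for c in range(128)}


def parse_escape(pattern, i):
    """Parse the backslash sequence whose backslash sits at pattern[i].

    Returns (kind, value, next_index): kind is 'char' (value is one
    character), 'class' (value is a set of characters) or 'literal'
    (value is the text quoted between Q and E).
    """
    if i + 1 >= len(pattern):
        raise ValueError("pattern ends with a lone backslash")
    c = pattern[i + 1]
    if c in SIMPLE:
        return "char", SIMPLE[c], i + 2
    if c in CLASSES:
        return "class", CLASSES[c], i + 2
    if c.lower() in CLASSES:  # \D, \S, \W: complement within ASCII
        return "class", ALL_ASCII - CLASSES[c.lower()], i + 2
    if c in ("x", "u"):  # \xhh takes 2 hex digits, \uhhhh takes 4
        width = 2 if c == "x" else 4
        digits = pattern[i + 2 : i + 2 + width]
        if len(digits) != width or any(d not in string.hexdigits for d in digits):
            raise ValueError(f"invalid \\{c} escape at index {i}")
        return "char", chr(int(digits, 16)), i + 2 + width
    if c == "Q":  # everything up to \E (or the end of the pattern) is literal
        end = pattern.find("\\E", i + 2)
        if end == -1:
            return "literal", pattern[i + 2 :], len(pattern)
        return "literal", pattern[i + 2 : end], end + 2
    raise ValueError(f"unknown escape \\{c} at index {i}")


print(parse_escape(r"\x41bc", 0))    # ('char', 'A', 4)
print(parse_escape(r"a\Q.*\Eb", 1))  # ('literal', '.*', 7)
```

Returning the next index lets a surrounding pattern parser resume right after the sequence. Treating the negated classes as complements of the positive ones keeps the definitions in one place, so switching from ASCII to Unicode coverage would only touch `CLASSES` and the universe set.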