Implementation Of Case-Insensitive Mode #27

SicroAtGit · 2021-12-31T14:40:27Z

SicroAtGit
Dec 31, 2021
Maintainer

I am currently thinking about how to implement enabling/disabling case-insensitive mode.

My first thought was to give the Create(regEx$) function an optional caseInsensitive parameter (#True/#False), but it probably makes more sense to change the mode within the RegEx.

https://www.regular-expressions.info/modifiers.html
According to the website, other RegEx engines use (?i) to enable and (?-i) to disable case-insensitive mode. But the other RegEx engines also support some other mode that justifies such a syntax.

Maybe this syntax would be good: RegEx between < and > is case-insensitive.

What do you think?

tajmone · 2021-12-31T23:43:24Z

tajmone
Dec 31, 2021

> My first thought was to give the Create(regEx$) function an optional caseInsensitive parameter (#True/#False), but it probably makes more sense to change the mode within the RegEx.

They are not mutually exclusive, and many RegEx libraries offer both solutions. The parameter allows setting the base case-sensitiveness mode, whereas the in-line modifier allows handling exceptions within complex RegExs. Usually the default mode is case-sensitive, so I'd suggest the parameter should be caseSensitive instead.

> According to the website, other RegEx engines use (?i) to enable and (?-i) to disable case-insensitive mode. But the other RegEx engines also support some other mode that justifies such a syntax.

I think that's the most popular modifier, so I'd stick with that. Besides switching the case-sensitivity context, this also allows creating case-sensitivity-specific group definitions, e.g.:

(?-i:some pattern)

where the modifier only applies to the group within the parentheses.

There's also a further advantage in having both a parameter option and an inline modifier: the same RegEx definition can be used in different contexts (i.e. with different parameter settings), where the inline modifiers only enforce a specific case-sensitivity in some places, leaving the main context open to reuse according to the parameter.
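As an illustration of that "parameter plus inline modifier" combination, here is a small sketch using Python's `re` module as a stand-in for "other RegEx engines" (it is not this project's API). The pattern and the helper name `matches` are made up for the example; the scoped `(?-i:...)` group keeps its own case-sensitivity while the compile flag sets the base mode.

```python
import re

# Python's `re` module is used here only as an example of an engine that
# offers both a flag parameter and scoped inline modifiers.
PATTERN = r"foo(?-i:Bar)"  # the group stays case-sensitive in every base mode

sensitive = re.compile(PATTERN)                   # base mode: case-sensitive
insensitive = re.compile(PATTERN, re.IGNORECASE)  # base mode: case-insensitive


def matches(regex, text):
    """Return True if the whole text matches the compiled pattern."""
    return regex.fullmatch(text) is not None


# The same pattern string is reused under two different base modes.
print(matches(insensitive, "FOOBar"))  # "foo" part ignores case, "Bar" does not
print(matches(insensitive, "FOObar"))  # rejected: the group is case-sensitive
print(matches(sensitive, "fooBar"))
```

The point of the sketch is only that the pattern text never changes; the caller picks the base mode, and the inline group pins the one place where case must be exact.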

> Maybe this syntax would be good: RegEx between < and > is case-insensitive.

These symbols are used for named capturing groups in many RegEx engines, so I'd rather leave them free for future uses.

Also, I'd rather use the < .. > delimiters for literal/verbatim text, i.e. no need to escape special chars, and of course a case-sensitive context since it's just literal. I don't remember which delimiters the various engines use for this. I do remember that Ruby has some very handy delimiters for defining in-line literal text, but I couldn't find a reference right now.

Case Sensitive How-to

How are you going to implement case-sensitiveness?

For the base Latin characters (i.e. those of the ASCII set, with no accents or diacritics) there's always the old bitwise trick AND 0xDF, which clears bit 5 of the character code and so maps a-z onto A-Z.
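A minimal sketch of the AND 0xDF trick, written in Python for illustration (the function names are hypothetical). The mask is only applied to a-z here, because applying it blindly to digits or punctuation would corrupt them.

```python
def ascii_upper(ch: str) -> str:
    """Upper-case a single ASCII letter by clearing bit 5 (0x20) of its code."""
    code = ord(ch)
    if ord("a") <= code <= ord("z"):
        return chr(code & 0xDF)
    return ch  # anything outside a-z is returned unchanged


def ascii_equal_nocase(a: str, b: str) -> bool:
    """Compare two strings, ignoring case for ASCII letters only."""
    return len(a) == len(b) and all(
        ascii_upper(x) == ascii_upper(y) for x, y in zip(a, b)
    )


print(ascii_upper("a"))                      # A
print(ascii_equal_nocase("RegEx", "REGEX"))  # True
print(ascii_equal_nocase("é", "É"))          # False: accented letters are not covered
```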

But if you need to take into account accented letters and special Latin letters with diacritics, etc., then things are not so simple. Not to mention the complexity of Unicode when it comes to supporting case-sensitivity across all languages that support different letter casing in Unicode.

You might want to have a look at the Unicode/ICU documentation and libraries regarding the various algorithms for implementing case-sensitiveness:

In general, case operations in Unicode are rather complex, and there are various levels of support (and corresponding flags/options) for this feature.

Even if you're not planning to add fully Unicode-aware support for case-sensitive operations, it would still be good to have a clear understanding of how Unicode classifies these operations so that you may:

- Provide clear indications of the type of case-sensitive support offered by the library (in Unicode terms).
- Leave open the possibility for future extensions of this feature (e.g. by adopting options/flags that will allow this in the future without breaking backward compatibility).

1 reply

SicroAtGit Jan 1, 2022
Maintainer Author

> They are not mutually exclusive, and many RegEx libraries offer both solutions. The parameter allows setting the base case-sensitiveness mode, whereas the in-line modifier allows handling exceptions within complex RegExs. Usually the default mode is case-sensitive, so I'd suggest the parameter should be caseSensitive instead.

Ok, I will implement both. Yes, caseSensitive also looks better.

> I think that's the most popular modifier, so I'd stick with that. Besides switching the case-sensitivity context, this also allows creating case-sensitivity-specific group definitions

Ok, I will adopt the common syntax: (?i), (?-i), (?i:regex), (?-i:regex). Maybe more modes will be needed later and so they can be implemented easily. But I don't want to make the RegEx engine too complex. At least this case mode should be sufficient for version 1.0.0.

> These symbols are used for named capturing groups in many RegEx engines, so I'd rather leave them free for future uses.

The RegEx engine will not support capturing groups. This is too much for DFAs. If I store the start/end position information of the groups in the NFA states, the information can be lost during the NFA to DFA conversion, because during this conversion sets of NFA states are merged into single DFA states. There is a special form of DFA called tagged DFA, but this then becomes much more complicated.
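To make the merging concrete, here is a rough Python sketch of the textbook subset construction (not the engine's actual code, which is PureBasic). The toy NFA, its state numbers and the function name `nfa_to_dfa` are assumptions for the example; epsilon transitions are left out to keep it short.

```python
from collections import deque


def nfa_to_dfa(nfa, start, accepting):
    """Subset construction for an NFA without epsilon transitions.

    nfa maps (state, symbol) -> set of next states.
    Every resulting DFA state is a frozenset of NFA states.
    """
    symbols = sorted({sym for (_, sym) in nfa})
    dfa_start = frozenset([start])
    dfa = {}
    seen = {dfa_start}
    queue = deque([dfa_start])
    while queue:
        current = queue.popleft()
        for sym in symbols:
            # Union of everything reachable from any NFA state in the set.
            target = frozenset(t for s in current for t in nfa.get((s, sym), ()))
            if not target:
                continue
            dfa[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa, dfa_start, dfa_accepting


# Hypothetical NFA for (a)|(ab): after 'a', state 1 ends the first group
# while state 2 is still inside the second group.
NFA = {(0, "a"): {1, 2}, (2, "b"): {3}}
DFA, DFA_START, DFA_ACCEPTING = nfa_to_dfa(NFA, start=0, accepting={1, 3})

for (state, sym), target in DFA.items():
    print(sorted(state), sym, "->", sorted(target))
```

After reading `a`, the DFA is in the single state `{1, 2}`: per-state group information attached to NFA states 1 and 2 no longer belongs to one unambiguous state, which is the loss described above.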

But you are right, the remaining free symbols should not be reserved too quickly for new features, otherwise the reserves of free symbols for future features will be used up quickly.

> Also, I'd rather use the < .. > delimiters for literal/verbatim text, i.e. no need to escape special chars, and of course a case-sensitive context since it's just literal. I don't remember which delimiters the various engines use for this. I do remember that Ruby has some very handy delimiters for defining in-line literal text, but I couldn't find a reference right now.

In #5 we specified the syntax \Q...\E for escaping strings.

> How are you going to implement case-sensitiveness?

If case-insensitive mode is activated, the character is added to the NFA/DFA in both forms.
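A small sketch of what "added in both forms" could look like, in Python for illustration only. The helper names are hypothetical, and Python's `lower()`/`upper()` stand in for whichever case tables the engine ends up using; variants that expand to more than one character or leave the `[\x01-\uFFFF]` range are skipped.

```python
def case_variants(ch: str) -> set:
    """Return the character plus its single-character lower/upper forms."""
    variants = {ch}
    for form in (ch.lower(), ch.upper()):
        # Skip multi-character results (e.g. the upper case of 'ß' is 'SS')
        # and anything above the Basic Multilingual Plane.
        if len(form) == 1 and ord(form) <= 0xFFFF:
            variants.add(form)
    return variants


def expand_literal(text: str, no_case: bool) -> list:
    """Turn a literal into one set of accepted characters per position."""
    return [case_variants(c) if no_case else {c} for c in text]


print(expand_literal("ab", no_case=True))   # each position accepts both cases
print(expand_literal("ab", no_case=False))  # each position accepts one character
```

Each set would then become the label of one transition (or a group of parallel transitions) in the automaton, so the matcher itself never needs to know about case.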

My first thought was to use the PureBasic functions LCase() and UCase() to get the character in upper and lower case.

In the PureBasic English forum, mk-soft has written a module CaseUnicode, which apparently supports more characters than the PB functions support. This module is also in my code archive, licensed with the MIT license.

Thanks for the links, I will check them out.

For the RegEx engine I only intended to support the character range [\x01-\uFFFF] which is also supported by PureBasic.

SicroAtGit · 2022-01-09T13:28:49Z

SicroAtGit
Jan 9, 2022
Maintainer Author

I have now looked at everything, made a decision and created an issue (#28) where the implementation details are described. Thanks, @tajmone.

0 replies

SicroAtGit · 2022-06-12T09:57:47Z

SicroAtGit
Jun 12, 2022
Maintainer Author

@tajmone

I thought about it again, and I think I'll leave out the syntax (?i:regex) and (?-i:regex) because it's really only needed for non-capturing groups (?:regex) in other RegEx engines and my RegEx engine doesn't support those kinds of groups.

I also wonder if a parameter is really necessary because a RegEx can be reused this way:

RegEx::AddNfa(*regEx, regEx$, #Token_1)
RegEx::AddNfa(*regEx, "(?i)" + regEx$, #Token_2)

3 replies

tajmone Jun 13, 2022

> I think I'll leave out the syntax (?i:regex) and (?-i:regex) because it's really only needed for non-capturing groups (?:regex) in other RegEx engines and my RegEx engine doesn't support those kinds of groups.

That makes sense.

> I also wonder if a parameter is really necessary because a RegEx can be reused this way:

That's a good argument in favor of leaving out the parameter. Still, there might be cases where a parameter comes in handy, e.g. in procedural code that manipulates many RegExs from raw data, where the case-sensitivity value might be stored as a Boolean value on its own. Surely, one could check whether the value is true and prepend "(?i)" to the string as in the above example. But passing the Boolean value directly as a parameter is probably cleaner and more elegant than defining a temporary string (either empty or "(?i)") and prefixing it to the RegEx string value. The latter would possibly also make the code slower, due to either the conditional checks or the temporary strings, whereas passing the value as a parameter would result in slimmer code and no intermediate values.

But you're right, the parameter is not strictly necessary and could be done without.

SicroAtGit Jun 19, 2022
Maintainer Author

Ok, I will also implement a parameter. Instead of caseSensitive I will call it regExModes. This way it is flexible in case more modes are added in the future and this way also multiple modes can be active at the same time:

AddNfa(..., #RegExMode_NoCase | #RegExMode_WhatEver)
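The same bit-flag idea, sketched in Python for illustration. `RegExMode`, `add_nfa` and the `WHATEVER` placeholder mirror the names in the line above but are otherwise made up; the real PureBasic signature is described in issue #28, not here.

```python
from enum import IntFlag


class RegExMode(IntFlag):
    NONE = 0
    NO_CASE = 1   # case-insensitive matching
    WHATEVER = 2  # placeholder for some future mode


def add_nfa(pattern: str, modes: RegExMode = RegExMode.NONE) -> dict:
    """Pretend to register a pattern; only reports which modes are active."""
    return {
        "pattern": pattern,
        "no_case": bool(modes & RegExMode.NO_CASE),
        "whatever": bool(modes & RegExMode.WHATEVER),
    }


print(add_nfa("abc"))
print(add_nfa("abc", RegExMode.NO_CASE | RegExMode.WHATEVER))
```

Because each mode owns one bit, existing callers that pass no modes (or only `NO_CASE`) keep working when a new flag is added later.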

I have updated issue #28.

tajmone Jun 20, 2022

> This way it is flexible in case more modes are added in the future and this way also multiple modes can be active at the same time:

Excellent idea! This way, new modes won't break backward compatibility.

SicroAtGit · 2022-07-31T10:34:17Z

SicroAtGit
Jul 31, 2022
Maintainer Author

When I took a closer look at Unicode's case-folding, I realized that case-folding makes the characters in the match not only independent of upper and lower case, but also of more character variants.

Example:

(?i)\u00B5 matches the characters:

\u00B5
\u039C
\u03BC
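This can be cross-checked outside the engine. The sketch below uses Python's `str.casefold()`, which applies Unicode full case-folding, and collects every character in the engine's `[\x01-\uFFFF]` range that folds to the same result as the Micro Sign. The function name `fold_class` is made up for the example.

```python
def fold_class(ch: str, max_code: int = 0xFFFF) -> list:
    """List all characters up to max_code whose case-fold equals that of ch."""
    target = ch.casefold()
    return [
        chr(cp) for cp in range(1, max_code + 1) if chr(cp).casefold() == target
    ]


MICRO_CLASS = fold_class("\u00b5")

for c in MICRO_CLASS:
    print(f"\\u{ord(c):04X}")
```

With the Unicode data shipped in current Python versions this should print the same three code points as listed above.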

The feature is now implemented.

4 replies

tajmone Jul 31, 2022

> The feature is now implemented.

Well done!

> When I took a closer look at Unicode's case-folding, I realized that case-folding makes the characters in the match not only independent of upper and lower case, but also of more character variants.
>
> Example: (?i)\u00B5 matches the characters: \u00B5 \u039C \u03BC.

I don't know the nitty-gritty of how upper-to-lower case pairing works across different alphabets, symbols, etc., simply because there are so many different languages and non-Latin alphabets out there that it's beyond my knowledge.

But I do know that there have been some controversies regarding the way Unicode handles some special letters and their lower-case variants, e.g. in German and Swedish, where there are some special letters which are supposed to exist only in lower case and are considered a ligature of two letters. I vaguely remember having read some discussions of how Unicode treats their upper-case conversion in ways which are considered grammatically incorrect (see end of comment for some references).

Having personally witnessed how badly Unicode represents Arabic, where they got a core ligature of the alphabet (the Lam-Alif determinate article) completely wrong (a mistake any seven-year-old native speaker would spot), I am fairly distrustful of how the Unicode standard works in general.

In the above Unicode points example:

| Unicode | chr | name |
|---|---|---|
| `\u00B5` | µ | Micro Sign |
| `\u03BC` | μ | Greek Small Letter Mu |
| `\u039C` | Μ | Greek Capital Letter Mu |

I'm not entirely sure whether the Micro Sign symbol (\u00B5 = µ) has an upper-case counterpart, for the symbol is usually employed in its lowercase glyph only (including at the beginning of a sentence). So, technically speaking, in a search the Micro Sign and the Greek letter Mu are different characters belonging to different domains: science borrows symbols from various alphabets, but then they become independent symbols. Therefore the Micro Sign should not match the capital Greek Mu letter in case-insensitive searches, IMO.

But of course, we might be dealing with use cases where common practices don't always overlap with what is linguistically and semiotically correct, and the former wins over the latter. In the debate over the way special German and Swedish letters were handled in Unicode, this was the argument that justified the way it's being done.

German Eszett Controversy

IIRC, the controversy over how some German letters are handled was about the ß letter:

Because ⟨ß⟩ had been treated as a ligature, rather than as a full letter of the German alphabet, it had no capital form in early modern typesetting. There were, however, proposals to introduce capital forms of ⟨ß⟩ for use in allcaps writing (where ⟨ß⟩ would otherwise usually be represented as either ⟨SS⟩ or ⟨SZ⟩). A capital was first seriously proposed in 1879, but did not enter official or widespread use.

— Wikipedia: ß » Development of a capital form

Ultimately, there exist different Unicode code points covering this letter today, and IIRC the controversy is around the normalization rules and how they handle lower- and upper-case conversion and equivalence of this letter.

Something similar happened in Swedish and some other European languages. Bear in mind that since I don't speak these languages I can't really have a well-informed opinion on any of this. But having witnessed how disgracefully the Arabic alphabet has been implemented in Unicode, I have very low esteem for the Unicode standard. This Arabic alphabet error is so bad that some Islamic scholars have issued fatwas forbidding the use of Unicode for reproducing Quranic text.

SicroAtGit Aug 6, 2022
Maintainer Author

Thanks for your detailed reply.

There are differences:

- Unicode's Case-Folding
  This is applied to two texts to eliminate case differences so that they can be compared more easily. There are two variants:
  - Unicode's Simple Case-Folding
    A single Unicode code point is always mapped to a single Unicode code point. For example, the SZ character ß remains unchanged during mapping and does not become ss.
  - Unicode's Full Case-Folding
    A single Unicode code point is mapped to a single code point or to a sequence of multiple Unicode code points. For example, the SZ character ß is mapped to ss.
- Unicode's Case-Mapping
  This is used to convert characters to upper case, lower case or title case. Default mappings were defined because a character sometimes has several upper-case or lower-case counterparts.
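The ß example can be tried in Python, whose `str.casefold()` implements full case-folding. Python has no built-in simple case-folding, so `length_preserving_fold` below is only a rough stand-in (an assumption for this sketch): it keeps a fold when the result is a single code point and otherwise leaves the character unchanged, which matches the ß behaviour described above but is not guaranteed to match the official simple-folding table for every character.

```python
def full_fold(text: str) -> str:
    """Unicode full case-folding, as provided by Python's str.casefold()."""
    return text.casefold()


def length_preserving_fold(text: str) -> str:
    """Rough stand-in for simple case-folding: one code point in, one out."""
    out = []
    for ch in text:
        folded = ch.casefold()
        # Keep the fold only if it is still a single code point.
        out.append(folded if len(folded) == 1 else ch)
    return "".join(out)


print(full_fold("Straße"))               # ß is expanded to ss
print(length_preserving_fold("Straße"))  # ß is left as it is
print(full_fold("STRASSE") == full_fold("Straße"))
print(length_preserving_fold("STRASSE") == length_preserving_fold("Straße"))
```

For a DFA-based engine that works one character at a time, the length-preserving variant is the easier one to build in, since a fold never turns one input character into two.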

There are also entries in the mapping tables that take into account special cases of the Turkish language. In addition, there are other mapping tables that cover other special cases of many other languages, but then it must be known in which language the text to be processed is written.

For more information, I recommend this page:
Character Model for the World Wide Web: String Matching

For the implementation I stick to the Unicode standard. If someone is bothered by this, they are welcome to provide alternative mapping tables to the project. Flags could then be added to AddNfa() to use alternative mapping tables.

Interesting that there is a capitalized SZ character. I had not known that. With headlines, I'm used to the SZ character always being written as SS.

tajmone Aug 8, 2022

> For more information, I recommend this page:
> Character Model for the World Wide Web: String Matching
>
> For the implementation I stick to the Unicode standard.

When we write the engine documentation, it might be worth covering this topic in a dedicated appendix, providing basic examples and reference tables.

SicroAtGit Aug 8, 2022
Maintainer Author

Yes, that would be good.
