[FR] Unicode searching with normalizing #298
Not sure why this title says "slower"; [FR] marks a feature request, and this reads like a feature request. Ugrep uses Unicode code points (including the so-called "combining diacritical marks") to determine the screen layout in the TUI edit line and in the output, but it does not normalize them when matching input.
I've implemented a regex Unicode normalizer to canonicalize Unicode combining characters/marks. For example, a regex pattern containing a decomposed e followed by the combining acute accent U+0301 is rewritten to the composed é (U+00E9). I'm implementing NFC for regex patterns, with the limitation that it does not apply decomposition before canonicalization, only canonicalization to combined characters. Canonicalizing the input text is not considered.

PS. I still had some doubts about when to normalize, because the user may want to search specifically for a non-normalized form. That can be done with a quick hack, for example by using parentheses to keep the combining character(s) separated, like (e)(\x{301}).
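To illustrate the canonicalization step, a minimal sketch (not the actual ugrep code; the composition table is a hand-picked excerpt, a real one would be generated from UnicodeData.txt):

```cpp
// Sketch of pattern canonicalization: compose adjacent
// (base, combining mark) pairs using a small static table.
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

// (base, combining mark) -> precomposed character
static const std::map<std::pair<uint32_t, uint32_t>, uint32_t> compose = {
  { { 0x0065, 0x0301 }, 0x00E9 }, // e + COMBINING ACUTE ACCENT -> é
  { { 0x0073, 0x0323 }, 0x1E63 }, // s + COMBINING DOT BELOW    -> ṣ
  { { 0x0073, 0x0307 }, 0x1E61 }, // s + COMBINING DOT ABOVE    -> ṡ
  { { 0x1E63, 0x0307 }, 0x1E69 }, // ṣ + COMBINING DOT ABOVE    -> ṩ
};

// left-to-right pairwise composition of the pattern's code points,
// so runs of combining marks collapse step by step
std::vector<uint32_t> canonicalize(const std::vector<uint32_t>& pattern)
{
  std::vector<uint32_t> out;
  for (uint32_t cp : pattern)
  {
    if (!out.empty())
    {
      auto it = compose.find(std::make_pair(out.back(), cp));
      if (it != compose.end())
      {
        out.back() = it->second; // compose in place
        continue;
      }
    }
    out.push_back(cp);
  }
  return out;
}

int main()
{
  // "Café" with a decomposed é: C a f e U+0301
  std::vector<uint32_t> pattern = { 0x0043, 0x0061, 0x0066, 0x0065, 0x0301 };
  for (uint32_t cp : canonicalize(pattern))
    std::printf("U+%04X ", (unsigned)cp);
  std::printf("\n"); // U+0043 U+0061 U+0066 U+00E9
}
```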
I've written plenty of such normalizers, without ICU. ICU is way too slow and big; you really need just tiny static tables. The problem is that the text should not be normalized; rather, the regex / substring should be expanded on the fly to include all possible legal variants.
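As a sketch of that on-the-fly expansion (the `variants` table and `expand` helper are hypothetical, not the normalizers mentioned above): map each composed character to its canonically equivalent sequences in a tiny static table and emit a regex alternation:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// composed character -> all canonically equivalent code point sequences
static const std::map<uint32_t, std::vector<std::vector<uint32_t>>> variants = {
  { 0x00E9, { { 0x00E9 }, { 0x0065, 0x0301 } } }, // é or e + combining acute
};

static std::string hex(uint32_t cp)
{
  char buf[16];
  std::snprintf(buf, sizeof buf, "\\x{%X}", (unsigned)cp);
  return buf;
}

// emit a regex alternation matching every equivalent form of cp
std::string expand(uint32_t cp)
{
  auto it = variants.find(cp);
  if (it == variants.end())
    return hex(cp); // no variants: a single escaped code point
  std::string re = "(?:";
  for (size_t i = 0; i < it->second.size(); ++i)
  {
    if (i > 0)
      re += "|";
    for (uint32_t c : it->second[i])
      re += hex(c);
  }
  return re + ")";
}

int main()
{
  std::puts(expand(0x00E9).c_str()); // (?:\x{E9}|\x{65}\x{301})
}
```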
I agree that the text should not be normalized, but rather the regex. But generating all legal variants can lead to hidden costs when the regex has to be rewritten to include them, which it does, because RE/flex and PCRE2 do not generate these variants themselves. Right now, regex conversion is minimal.

Perhaps an option to enable generating and matching all variants would be nice? Adding more options is a last resort though; I expect few will use it.

I still have doubts on how to best approach this. I suspect that most word processors and text editors insert canonical forms into the text, e.g. for accented characters (?). Nevertheless, some simple canonicalization step is nice to have, to avoid problems with regex operators applied to combining forms.

A more interesting case is U+1E69 (ṩ), which has as many as five canonically equivalent alternatives: U+1E69, U+1E63 U+0307, U+1E61 U+0323, U+0073 U+0323 U+0307, and U+0073 U+0307 U+0323.
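For illustration, a sketch that derives these variants from the canonical decomposition pairs and combining classes alone (hand-picked table excerpts; a full implementation would read UnicodeData.txt):

```cpp
// Enumerate all canonically equivalent forms of U+1E69 by decomposing,
// recomposing, and swapping adjacent marks with distinct combining classes.
#include <cstdint>
#include <cstdio>
#include <map>
#include <set>
#include <utility>
#include <vector>

// composed character -> canonical decomposition pair
static const std::map<uint32_t, std::pair<uint32_t, uint32_t>> decomp = {
  { 0x1E69, { 0x1E63, 0x0307 } }, // ṩ -> ṣ + dot above
  { 0x1E63, { 0x0073, 0x0323 } }, // ṣ -> s + dot below
  { 0x1E61, { 0x0073, 0x0307 } }, // ṡ -> s + dot above
};

// canonical combining class (0 = starter)
static int ccc(uint32_t cp)
{
  static const std::map<uint32_t, int> classes = { { 0x0323, 220 }, { 0x0307, 230 } };
  auto it = classes.find(cp);
  return it == classes.end() ? 0 : it->second;
}

// collect every sequence reachable by decompose / recompose / reorder steps
static void derive(std::vector<uint32_t> seq, std::set<std::vector<uint32_t>>& all)
{
  if (!all.insert(seq).second)
    return; // already visited
  for (size_t i = 0; i < seq.size(); ++i)
  {
    auto it = decomp.find(seq[i]); // split one composed character
    if (it != decomp.end())
    {
      std::vector<uint32_t> next(seq);
      next[i] = it->second.first;
      next.insert(next.begin() + i + 1, it->second.second);
      derive(next, all);
    }
    if (i + 1 < seq.size())
    {
      if (ccc(seq[i]) && ccc(seq[i + 1]) && ccc(seq[i]) != ccc(seq[i + 1]))
      {
        std::vector<uint32_t> next(seq); // swap reorderable marks
        std::swap(next[i], next[i + 1]);
        derive(next, all);
      }
      for (const auto& d : decomp) // recombine an adjacent pair
        if (d.second == std::make_pair(seq[i], seq[i + 1]))
        {
          std::vector<uint32_t> next(seq);
          next[i] = d.first;
          next.erase(next.begin() + i + 1);
          derive(next, all);
        }
    }
  }
}

int main()
{
  std::set<std::vector<uint32_t>> all;
  derive({ 0x1E69 }, all);
  for (const auto& seq : all) // prints the five equivalent forms
  {
    for (uint32_t cp : seq)
      std::printf("U+%04X ", (unsigned)cp);
    std::printf("\n");
  }
}
```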
Upon closer inspection, this Unicode normalization standard (TR15) was not designed by computer scientists. For one, it does not support the derivation of a confluent TRS (term rewriting system) directly from the UnicodeData file. No surprise then that NFC requires denormalization, reordering, and renormalization to obtain a normal form, or the use of additional rewrite rules. Fortunately, we can use Knuth-Bendix completion to make the system strongly normalizing without requiring denormalization, reordering, and renormalization. Strictly speaking, NFC is not a true canonical form in that sense, because the TRS based on (or described by) the UnicodeData file is not strongly normalizing without completion. NFD is strongly normalizing when combining-class sorting of code points is used, but it is not as useful.
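A small demonstration of the problem, reusing the pairwise composition loop from the earlier sketch (excerpt table; illustrative only):

```cpp
// s U+0307 U+0323 and s U+0323 U+0307 are canonically equivalent, yet
// naive left-to-right composition only reaches U+1E69 from one of them.
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>
#include <vector>

static const std::map<std::pair<uint32_t, uint32_t>, uint32_t> compose = {
  { { 0x0073, 0x0323 }, 0x1E63 }, // s + dot below -> ṣ
  { { 0x0073, 0x0307 }, 0x1E61 }, // s + dot above -> ṡ
  { { 0x1E63, 0x0307 }, 0x1E69 }, // ṣ + dot above -> ṩ
  // no rule ṡ + dot below -> ṩ: that path is a dead end
};

static std::vector<uint32_t> compose_pairs(const std::vector<uint32_t>& in)
{
  std::vector<uint32_t> out;
  for (uint32_t cp : in)
  {
    if (!out.empty())
    {
      auto it = compose.find(std::make_pair(out.back(), cp));
      if (it != compose.end())
      {
        out.back() = it->second;
        continue;
      }
    }
    out.push_back(cp);
  }
  return out;
}

static void show(const std::vector<uint32_t>& seq)
{
  for (uint32_t cp : seq)
    std::printf("U+%04X ", (unsigned)cp);
  std::printf("\n");
}

int main()
{
  show(compose_pairs({ 0x0073, 0x0307, 0x0323 })); // stuck at U+1E61 U+0323
  show(compose_pairs({ 0x0073, 0x0323, 0x0307 })); // reaches U+1E69
}
```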
I've written a Knuth-Bendix completer that normalizes pairs of Unicode characters by combining them into one character, without getting stuck when followed by more combining characters (this works for most cases). It reads UnicodeData.txt to create a C++ file with tables of rewrite rules. K-B completion takes 16 passes to converge. I find it interesting that we still end up with a system that is not strongly normalizing, because some Unicode character pairs are not composable from reordered triples of characters.
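To make the completion step concrete, a sketch of the idea on the same excerpt tables (illustrative, not the actual completer): the reorderable inputs a m1 m2 and a m2 m1 form a critical pair, and a new rule joins them:

```cpp
// For each rule (a, m1) -> d and each mark m2 with a different combining
// class, if the reordered side a m2 m1 composes to a normal form c that
// the naive path a m1 m2 gets stuck on, add the direct rule (d, m2) -> c.
#include <cstdint>
#include <cstdio>
#include <map>
#include <utility>

typedef std::map<std::pair<uint32_t, uint32_t>, uint32_t> Rules;

static const uint32_t* lookup(const Rules& rules, uint32_t x, uint32_t y)
{
  auto it = rules.find(std::make_pair(x, y));
  return it == rules.end() ? nullptr : &it->second;
}

int main()
{
  Rules rules = {
    { { 0x0073, 0x0323 }, 0x1E63 }, // s + dot below -> ṣ
    { { 0x0073, 0x0307 }, 0x1E61 }, // s + dot above -> ṡ
    { { 0x1E63, 0x0307 }, 0x1E69 }, // ṣ + dot above -> ṩ
  };
  std::map<uint32_t, int> ccc = { { 0x0323, 220 }, { 0x0307, 230 } };

  for (int pass = 1; ; ++pass)
  {
    Rules added;
    for (const auto& r : rules)
    {
      uint32_t a = r.first.first, m1 = r.first.second, d = r.second;
      auto c1 = ccc.find(m1);
      if (c1 == ccc.end())
        continue; // m1 is not a combining mark
      for (const auto& m2cc : ccc)
      {
        uint32_t m2 = m2cc.first;
        if (m2cc.second == c1->second)
          continue; // equal combining classes do not reorder
        // compose the reordered, canonically equivalent side a m2 m1
        const uint32_t* b = lookup(rules, a, m2);
        const uint32_t* c = b ? lookup(rules, *b, m1) : nullptr;
        // the naive path composes a m1 -> d and then stops at (d, m2)
        if (c && !lookup(rules, d, m2) && !added.count(std::make_pair(d, m2)))
        {
          added[std::make_pair(d, m2)] = *c; // join the critical pair
          std::printf("pass %d: add U+%04X + U+%04X -> U+%04X\n",
                      pass, (unsigned)d, (unsigned)m2, (unsigned)*c);
        }
      }
    }
    if (added.empty())
      break; // converged
    rules.insert(added.begin(), added.end());
  }
  // on this excerpt: pass 1 adds U+1E61 + U+0323 -> U+1E69, pass 2 converges
}
```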
How about supporting "u" for Unicode also?
Currently the tool does not find Unicode mark variants that deviate only in their normalization. It does not find "Café" in "Café" when the first uses the decomposed "e\x301" and the second uses the composed "\xe9" for the final small e with acute.
It should (with some new Unicode flag) find mark characters and possible non-normalized variants in the search pattern, and create the Unicode alternations to search for.
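To make the report concrete, an illustrative snippet (not part of the tool) showing that the two encodings are byte-wise different, so a plain substring search misses the match:

```cpp
// The two encodings of "Café" are canonically equivalent but differ in
// their bytes, so a byte-wise search cannot find one in the other.
#include <cstdio>
#include <string>

int main()
{
  std::string nfc = "Caf\xC3\xA9";  // é as composed U+00E9
  std::string nfd = "Cafe\xCC\x81"; // e + combining U+0301
  std::printf("equal bytes: %s\n", nfc == nfd ? "yes" : "no");       // no
  std::printf("nfd contains nfc: %s\n",
              nfd.find(nfc) != std::string::npos ? "yes" : "no");    // no
  // a normalization-aware flag would instead expand the pattern to an
  // alternation like Caf(?:\x{E9}|e\x{301})
}
```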