Samemark breaks regional indicator symbols, should it be this way? #61
Here is an example of what I am proposing:

    use Test;
    is 'hi'.samemark('é'), 'h́í', 'plain base';
    is 'ēi'.samemark('é'), 'éí', 'mark nonmark base';
    is '🇦🇬'.samemark('é').ords, '127462 127468 769';
    # Prepend + Other will add the extend
    is "\c[arabic number sign]a".samemark('é').NFD.list, '97 769', 'prepend test';
    is "a".samemark("\c[arabic number sign]é").NFD.list,
        "\c[arabic number sign]".ord ~ ' 97 769', 'prepend test';
    done-testing;
`.samemark` is currently broken. It assumes the first codepoint of a grapheme is the "base", and *anything* occurring afterward is a "mark".
(Below, `*` has the same meaning as in a regex, and each item in brackets is a single-codepoint token.)
**How samemark assumes the world exists**
`[base] [mark]*`
**The real world is actually more like**
`[mark]*[base][mark]*`
**But** then you have emoji, flags etc.
`[Regional Indicator][Regional Indicator]`
**or**
`[Base Emoji]([ZWJ][Emoji])*`
What I think is the most reasonable solution is to treat codepoints with `Grapheme_Cluster_Break=Extend` or `Grapheme_Cluster_Break=Prepend` as "marks". This would cause flags and emoji sequences to be treated as the "base", since they are not really separable the way accent marks are.
This does not address the fact that `"\c[Canada]".samemark("é")` would give you a Canadian flag with an accent mark on it, or the question of why anybody would want that. Since this is how samemark has always worked, it doesn't make sense to change it.
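The classification rule proposed above can be sketched as a per-codepoint check. This is only an illustration, not the actual Rakudo implementation: `classify-codepoints` is a hypothetical helper, and it assumes `uniprop('Grapheme_Cluster_Break')` returns the property value as a string such as `Extend` or `Prepend`.

```raku
# Hypothetical helper (not part of Rakudo): label every codepoint of a string
# as "mark" or "base" under the proposed rule.
sub classify-codepoints(Str $text) {
    # Work on the decomposed form so precomposed characters such as é
    # are split into their base letter and combining mark.
    $text.NFD.list.map: -> $cp {
        # Assumption: the property value comes back as a plain string.
        my $gcb = $cp.uniprop('Grapheme_Cluster_Break');
        # Only Extend and Prepend count as marks; everything else,
        # including Regional_Indicator and ZWJ, stays with the base.
        $cp.uniname => ($gcb eq any(<Extend Prepend>) ?? 'mark' !! 'base')
    }
}

# An accented letter: one base codepoint followed by one mark.
.say for classify-codepoints("e\c[COMBINING ACUTE ACCENT]");

# A flag: both regional indicators are treated as part of the base.
.say for classify-codepoints(
    "\c[REGIONAL INDICATOR SYMBOL LETTER A]\c[REGIONAL INDICATOR SYMBOL LETTER G]"
);
```

Under this rule a flag contributes no marks of its own, so `samemark` would only ever append the pattern's marks after the whole flag sequence rather than splitting the two regional indicators apart.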
Fixes Raku/problem-solving#61
Note this is the second version of this change. The previous one was
reverted.
Code:
Output:
I don't think samemark functionality should apply to flags, this looks like a bug to me. But maybe it's not, considering that often people use samemark to get a string with single separate codepoints without prepend/append characters.