Handling of non-breaking spaces when splitting to words #357

2colours · 2022-12-14T00:24:51Z

Hello,

The TL;DR of this issue would be: non-breaking spaces are handled differently by words and by the word quoting structures, even though the documentation for both only talks about whitespace. This also means a number of code examples in the docs are wrong about their output.

The process of discovery was the following:

https://docs.raku.org/language/traps#___top "using Set subroutines (...)" part
- it produces "a b" and returns False for both code examples
- ... for all bisectable6 versions
https://docs.raku.org/language/quoting#index-entry-quote_%3C%3C_%3E%3E-quote_%C2%AB_%C2%BB-Word_quoting_with_interpolation_and_quote_protection:_%C2%AB_%C2%BB
- seems to work the same way as <<>> but that way also doesn't match the output in the docs: ("42 b", "c ")
- reason: non-breaking space!
https://docs.raku.org/language/quoting#Word_quoting:_qw says:

> The :w form, usually written as qw, splits the string into "words". In this context, words are defined as sequences of non-whitespace characters separated by whitespace.

- words does seem to match this description and produces the documented output with non-breaking spaces as well
- both can make sense, but which one is correct?


2colours · 2022-12-14T00:29:14Z

My impression is that not taking all whitespaces into account for quoting structures was a sincere mistake and it could simply be a Rakudo bug. However, the sole fact that the word quotes don't just identify whitespaces, unlike words, might hint at something intentional. If it was intentional, the documentation needs to take that into account and the wrong examples need to be updated.

codesections · 2022-12-14T15:49:49Z

> My impression is that not taking all whitespaces into account for quoting structures was a sincere mistake and it could simply be a Rakudo bug.

I'm inclined to agree. As additional evidence, if you store a string with nonbreaking spaces in a variable and then interpolate it, those nonbreaking spaces are treated as word separators:

# NOTE: all spaces below are non-breaking
say <a b c>.raku;     # OUTPUT: "a b c"
my $str = 'a b c';
say <\qq[$str]>.raku; # OUTPUT: ("a", "b", "c")

Unless someone disagrees, I'd say it's fine to close this issue and open a Rakudo one.
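Because a non-breaking space looks exactly like a regular one, a reproduction is easier to follow when the character is written as an escape. The following is a minimal sketch, assuming a string with one NBSP (U+00A0) and one ordinary space. It shows how words treats the string and how to reveal which whitespace characters it contains. The last line builds the `< >` form with EVAL so that the NBSP can still be written as an escape; no output is shown for it on purpose, since that is the behaviour under discussion and it may differ between Rakudo versions.

```raku
use MONKEY-SEE-NO-EVAL;

# "a", NBSP, "b", regular space, "c"
my $str = "a\x[00A0]b c";

# .words treats the NBSP as a separator
say $str.words.elems;                               # 3

# Reveal which whitespace characters the string really contains
say $str.comb(/\s/).map(*.uniname).join(', ');      # NO-BREAK SPACE, SPACE

# The same text as a word-quoting literal, built at runtime;
# compare its result with the .words result above on your Rakudo version.
say EVAL("<a\x[00A0]b c>").raku;
```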

alabamenhu · 2022-12-17T00:01:54Z

Well…

The question is what IS the correct one? Nonbreaking spaces are tricky. We definitely consider 1,000,000 to be a single word. But oftentimes, thousands are separated with non-breaking spaces. They can also be used to designate strings of characters that are conceptually a single word, but for whatever reason may be split visually.

OTOH, they might be used for something as simple as preventing a break between a quote and a parenthetical notation (which isn't conceptually a single word).

There is no perfect way, and the option is either break at all whitespace, or break at all breakable whitespace (thus FS, NBSP, NNBSP, WJ and ZWNBS would not break).

codesections · 2022-12-17T00:37:23Z

> There is no perfect way, and the option is either break at all whitespace, or break at all breakable whitespace (thus FS, NBSP, NNBSP, WJ and ZWNBS would not break).

Agreed. And, given that the current implementation breaks on almost all spaces, it seems like it's better (and less breaking) to go in that direction. And that seems least-bad to me, anyway.

(Incidentally, one area that always plays havoc with word counting/word division is legal citations. How many words should ideally be in 42 U.S.C. § 405(a)? Does your answer change if I tell you that the last space (but no others) is non-breaking?)

So, yeah, no perfect way.

alabamenhu · 2022-12-17T02:46:15Z

> Agreed. And, given that the current implementation breaks on almost all spaces, it seems like it's better (and less breaking) to go in that direction. And that seems least-bad to me, anyway.
>
> (Incidentally, one area that always plays havoc with word counting/word division is legal citations. How many words should ideally be in 42 U.S.C. § 405(a)? Does your answer change if I tell you that the last space (but no others) is non-breaking?)
>
> So, yeah, no perfect way.

So there's really two alternatives, and thankfully Raku allows for a module to fill in the other:

.words and qw are effectively equivalent to .comb(/<:!Z>+/)
.words and qw are effectively equivalent to .comb(/<:!Z+[  ⁠]>+/) (where the space between brackets there is the non-breaking characters)

I'd probably personally go for the first one except I recall that Larry once said that he wanted the .words and .lines methods precisely so people weren't creating bad regexen to do the job and end up missing something. The question ultimately comes down to which of the two is the most likely to be the DWIM thing. After all, any whitespace that's not space, tab, or newline is probably inserted rather intentionally.

But I think the first one is ultimately the simplest to explain, and anyone who does need to worry about them can (at least for .words) make a very easy modification in module space, etc. (And probably soon qw could too once slangs are more robust)
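The two alternatives can be sketched side by side. This is only an illustration under a few assumptions: the sub names are hypothetical, `\S` stands in for "non-whitespace" rather than the `<:!Z>` class quoted above, and the no-break set is limited to NBSP, figure space and narrow NBSP. As far as I can tell, WJ and ZWNBSP are not whitespace to begin with, so `\S` already keeps them inside a word.

```raku
# Hypothetical helpers for illustration; neither is part of core Raku.

# Alternative 1: every whitespace character separates words.
sub words-all-whitespace(Str $s) {
    $s.comb(/\S+/).List
}

# Alternative 2: NBSP (U+00A0), figure space (U+2007) and
# narrow NBSP (U+202F) are kept inside a word; other whitespace separates.
sub words-breakable-only(Str $s) {
    $s.comb(/ [ \S || <[ \x[00A0] \x[2007] \x[202F] ]> ]+ /).List
}

# "1 000 000" grouped with NBSPs, followed by a regular space and a word
my $text = "1\x[00A0]000\x[00A0]000 items";

say words-all-whitespace($text).elems;   # 4
say words-breakable-only($text).elems;   # 2
```

The second helper is the kind of small modification that could live in module space, leaving the simpler rule as the default.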

2colours · 2022-12-17T10:23:11Z

For what it's worth, I also think it's easier to "make peace" by sticking to the "all whitespaces separate words" concept. The way I see it, a non-breaking space is usually rather about visual presentation than the number of words. For example, you wouldn't want to break a movie title or something similar that strongly represents one concept. I don't know the definition of a "word" but in my mind, such a... well, string? would still consist of multiple words, just presented in a certain way.
For numbers, this explanation may be less useful but I wouldn't call a sequence of digits a "word", whether it's separated by whitespace or something else - anyway, I wonder, does Raku parse nbsp separated numbers in the first place? If not then I don't think this is something to really take into account for the concept of words.

alabamenhu · 2022-12-17T15:57:10Z

> does Raku parse nbsp separated numbers in the first place

No, but it also doesn't parse comma separated numbers either (it only allows for underscores to space digits). It does treat comma separated numbers as a single word for the purpose of .words though. The number isn't a theoretical thing, though. Since CLDR has started using both full and narrow non-break spaces in many of its number formats for major world languages, there's quite a few numbers in the real world floating around that are spaced accordingly and that number will grow. I would think most people understand that if .words slurps up words, it would slurp up different formats of numbers (such as 123,456.789) and need to reparse those accordingly.

> I don't know the definition of a "word" but in my mind, such a... well, string? would still consist of multiple words, just presented in a certain way.

Just as a background, the definition of a word is fairly nebulous. In English, in fact, we very frequently will see a progression in terms from foo bar to foo-bar to foobar (these are called, respectively, open, hyphenated, and closed compound words), but not all words go the full path (ice cream is just as conceptually a single thing as rainbow, and personally given that the former is a trochee and the latter a spondee, I'd actually argue the former should be one word and the latter two). Different languages can display different examples of how what's one word is really multiple or multiple words are really one (Spanish gives the wonderful example of both: se lo dije vs díjeselo, where the words/affixes are written separate or together based on position, but they are still pronounced as a single word unit either way).

But that's just background, for the purpose of words that might have internal spaces, I'd agree that they should probably be expected to be split. The question might revolve more around what's going to be more common: encountering a purely formatting space with words, or encountering a

I suppose we could split the baby by enforcing number boundaries (where a non-breaker would be considered a part of the word if surrounded on both sides by a number, thus units would be split but not the numbers that make them up, as those often also have NBSPs), but that's one of those extra complexities that I'm not sure if it's better to leave it to a module (because it would induce more surprise if default) or make it a part of it (because despite the complexity, it would produce the least complexity for the typical user).
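A rough sketch of that number-boundary rule, assuming a hypothetical helper name and the same three no-break characters (NBSP, figure space, narrow NBSP): a no-break space is kept inside a word only when a digit sits directly on both sides of it, and it separates words everywhere else.

```raku
# Hypothetical helper for illustration; not part of core Raku.
sub words-keep-numbers(Str $s) {
    $s.comb(/
        [
           \S                                       # any non-whitespace character
        || <?after \d>                              # or a no-break space that has a digit before it
           <[ \x[00A0] \x[2007] \x[202F] ]>
           <?before \d>                             # and a digit after it
        ]+
    /).List
}

# NBSP inside the number, between number and unit, and between two plain words
my $text = "12\x[00A0]345\x[00A0]kg of\x[00A0]rice";

my @words = words-keep-numbers($text);
say @words.elems;      # 4
say @words[0].chars;   # 6  (the digits plus the NBSP between them)
```

Here the number stays together while the unit and the plain words are split off, which is the "units would be split but not the numbers" behaviour described above.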

2colours · 2022-12-17T18:16:35Z

> No, but it also doesn't parse comma separated numbers either (it only allows for underscores to space digits).

To be honest, I also don't "parse" comma separated numbers as one number, either. :P Or well, if it has exactly one comma, I would parse it as a fractional... anyway, I'm perfectly fine with settling on neither being one word, by this highly arbitrary and superficial definition of words.

> (...) what's going to be more common: encountering a purely formatting space with words, or encountering a

Something seems to be missing here, doesn't it?

Anyway, I think we are getting further and further from the issue of generic word-splitting. Many of these things could be addressed by providing built-in regexes/tokens - for example, I would be happy if the patterns provided by the Rakudo parser could be accessed some way - especially since it probably uses many standardized concepts, so it could easily be "more Raku than Rakudo" per se.

codesections · 2022-12-20T04:51:55Z

This is more food for thought than anything else, but here is what a few other programs make of the string foo bar baz (with nbsp):

wc: 3 words
vim: 3 words, 3 Words
emacs: 3 words,
emacs (evil mode): 3 words, 1 Word

jubilatious1 · 2023-04-05T15:21:55Z

@alabamenhu said:

> ... break at all breakable whitespace (thus FS, NBSP, NNBSP, WJ and ZWNBS would not break).

Agreed.

jubilatious1 · 2023-11-10T19:49:49Z

Maybe words which respects &NBSP, and Words which doesn't?

and/or add a named argument to words something to the effect of :nbsp-family defaulting to False?

2colours added the "fallback" (If no other label fits) label on Dec 14, 2022

2colours assigned codesections on Dec 14, 2022
