
Handling of non-breaking spaces when splitting to words #357

Open
2colours opened this issue Dec 14, 2022 · 11 comments
Labels
fallback If no other label fits


@2colours

Hello,

The TL;DR of this issue would be: non-breaking spaces are handled differently by words and by the word quoting constructs, even though the documentation for both only talks about whitespace. This also makes the stated output of a number of doc code examples wrong.

The process of discovery was the following:

@2colours added the fallback label Dec 14, 2022
@2colours
Author

My impression is that not taking all whitespaces into account for quoting structures was a sincere mistake and it could simply be a Rakudo bug. However, the mere fact that the word quotes, unlike words, don't simply split on every whitespace character might hint at something intentional. If it was intentional, the documentation needs to take that into account and the wrong examples need to be updated.

@codesections
Contributor

My impression is that not taking all whitespaces into account for quoting structures was a sincere mistake and it could simply be a Rakudo bug.

I'm inclined to agree. As additional evidence, if you store a string with nonbreaking spaces in a variable and then interpolate it, those nonbreaking spaces are treated as word separators:

# NOTE: all spaces below are non-breaking
say <a b c>.raku;     # OUTPUT: "a b c"
my $str = 'a b c';
say <\qq[$str]>.raku; # OUTPUT: ("a", "b", "c")

Unless someone disagrees, I'd say it's fine to close this issue and open a Rakudo one.

@alabamenhu

Well…

The question is: which one IS the correct one? Nonbreaking spaces are tricky. We definitely consider 1,000,000 to be a single word. But oftentimes, thousands are separated with non-breaking spaces. They can also be used to designate strings of characters that are conceptually a single word, but for whatever reason may be split visually.

OTOH, they might be used for something as simple as preventing a break between a quote and a parenthetical notation (which isn't conceptually a single word).

There is no perfect way, and the option is either break at all whitespace, or break at all breakable whitespace (thus FS, NBSP, NNBSP, WJ and ZWNBS would not break).

@codesections
Contributor

There is no perfect way, and the option is either break at all whitespace, or break at all breakable whitespace (thus FS, NBSP, NNBSP, WJ and ZWNBS would not break).

Agreed. And, given that the current implementation breaks on almost all spaces, it seems like it's better (and less breaking) to go in that direction. And that seems least-bad to me, anyway.

(Incidentally, one area that always plays havoc with word counting/word division is legal citations. How many words should ideally be in 42 U.S.C. § 405(a)? Does your answer change if I tell you that the last space (but no others) is non-breaking?)

So, yeah, no perfect way.

@alabamenhu

Agreed. And, given that the current implementation breaks on almost all spaces, it seems like it's better (and less breaking) to go in that direction. And that seems least-bad to me, anyway.

(Incidentally, one area that always plays havoc with word counting/word division is legal citations. How many words should ideally be in 42 U.S.C. § 405(a)? Does your answer change if I tell you that the last space (but no others) is non-breaking?)

So, yeah, no perfect way.

So there are really two alternatives, and thankfully Raku allows for a module to fill in the other:

  1. .words and qw are effectively equivalent to .comb(/<:!Z>+/)
  2. .words and qw are effectively equivalent to .comb(/<:!Z+[   ⁠]>+/) (where the space between brackets there is the non-breaking characters)
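To make the difference between the two alternatives concrete, here is a minimal sketch. It is written in Python only because that is easy to run; it is not Raku and says nothing about what Rakudo actually does. The function names are hypothetical, and the set of non-breaking characters (NBSP U+00A0, figure space U+2007, narrow NBSP U+202F) is an assumption based on the abbreviations used earlier in the thread. WJ and ZWNBSP are left out because Python does not treat them as whitespace in the first place.

```python
import re

# Assumed "non-breaking" whitespace: NBSP, figure space, narrow NBSP.
NON_BREAKING = "\u00a0\u2007\u202f"


def words_all_whitespace(text):
    """Alternative 1: every whitespace character separates words."""
    return re.findall(r"\S+", text)


def words_breakable_only(text):
    """Alternative 2: non-breaking spaces stay inside a word."""
    return re.findall(rf"(?:\S|[{NON_BREAKING}])+", text)


sample = "foo\u00a0bar baz"  # NBSP between foo and bar, plain space before baz
print(words_all_whitespace(sample))  # three words
print(words_breakable_only(sample))  # two words, the first contains the NBSP
```

Under the first alternative the sample yields three words; under the second, foo and bar stay glued together and the result has two.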

I'd probably personally go for the first one, except I recall that Larry once said that he wanted the .words and .lines methods precisely so people weren't creating bad regexen to do the job and ending up missing something. The question ultimately comes down to which of the two is more likely to be the DWIM thing. After all, any whitespace that's not space, tab, or newline was probably inserted rather intentionally.

But I think the first one is ultimately the simplest to explain, and anyone who does need to worry about non-breaking spaces can (at least for .words) make a very easy modification in module space, etc. (And qw probably could too, soon, once slangs are more robust.)

@2colours
Author

For what it's worth, I also think it's easier to "make peace" by sticking to the "all whitespaces separate words" concept. The way I see it, a non-breaking space is usually more about visual presentation than about the number of words. For example, you wouldn't want to break a movie title or something similar that strongly represents one concept. I don't know the definition of a "word" but in my mind, such a... well, string? would still consist of multiple words, just presented in a certain way.
For numbers, this explanation may be less useful but I wouldn't call a sequence of digits a "word", whether it's separated by whitespace or something else - anyway, I wonder, does Raku parse nbsp separated numbers in the first place? If not then I don't think this is something to really take into account for the concept of words.

@alabamenhu

does Raku parse nbsp separated numbers in the first place

No, but it also doesn't parse comma separated numbers either (it only allows for underscores to space digits). It does treat comma separated numbers as a single word for the purpose of .words though. The number case isn't just theoretical, either. Since CLDR has started using both full and narrow non-breaking spaces in many of its number formats for major world languages, there are quite a few numbers floating around in the real world that are spaced accordingly, and that number will grow. I would think most people understand that if .words slurps up words, it would also slurp up different formats of numbers (such as 123,456.789), which then need to be reparsed accordingly.

I don't know the definition of a "word" but in my mind, such a... well, string? would still consist of multiple words, just presented in a certain way.

Just as background, the definition of a word is fairly nebulous. In English, in fact, we very frequently see terms progress from foo bar to foo-bar to foobar (these are called, respectively, open, hyphenated, and closed compound words), but not all words go the full path (ice cream is just as conceptually a single thing as rainbow, and personally, given that the former is a trochee and the latter a spondee, I'd actually argue the former should be one word and the latter two). Different languages can provide different examples of how what's one word is really multiple, or how multiple words are really one (Spanish gives a wonderful example of both: se lo dije vs díjeselo, where the words/affixes are written separately or together based on position, but they are still pronounced as a single word unit either way).

But that's just background. For the purpose of words that might have internal spaces, I'd agree that they should probably be expected to be split. The question might revolve more around what's going to be more common: encountering a purely formatting space with words, or encountering a

I suppose we could split the baby by enforcing number boundaries, where a non-breaker would be considered part of the word if it is surrounded on both sides by a number. Units would then be split off from the numbers, but the numbers themselves would stay whole, as those often also contain NBSPs. But that's one of those extra complexities where I'm not sure whether it's better to leave it to a module (because it would induce more surprise as the default) or to make it part of the default (because despite the complexity, it would produce the least complexity for the typical user).
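A rough sketch of that number-boundary rule, again in Python purely for illustration and not as a proposal for how Rakudo should implement it. The function name and the set of non-breaking characters are assumptions; "number" is read here as "a decimal digit directly on each side of the non-breaking space".

```python
import re

# Assumed "non-breaking" whitespace: NBSP, figure space, narrow NBSP.
NON_BREAKING = "\u00a0\u2007\u202f"

# A word is a run of non-whitespace characters, plus any non-breaking
# space that has a digit immediately before and after it.
WORD = re.compile(rf"(?:\S|(?<=\d)[{NON_BREAKING}](?=\d))+")


def words_number_aware(text):
    """Split on all whitespace except non-breaking spaces between digits."""
    return WORD.findall(text)


# "1 000 000 km away", with NBSPs inside the number and before the unit.
sample = "1\u00a0000\u00a0000\u00a0km away"
print(words_number_aware(sample))
```

With this rule the grouped number stays one word while the unit is split off, even though the space before km is also non-breaking.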

@2colours
Author

No, but it also doesn't parse comma separated numbers either (it only allows for underscores to space digits).

To be honest, I also don't "parse" comma separated numbers as one number, either. :P Or well, if it has exactly one comma, I would parse it as a fractional... anyway, I'm perfectly fine with settling on neither being one word, by this highly arbitrary and superficial definition of words.

(...) what's going to be more common: encountering a purely formatting space with words, or encountering a

Something seems to be missing here, doesn't it?

Anyway, I think we are getting further and further from the issue of generic word-splitting. Many of these things could be addressed by providing built-in regexes/tokens. For example, I would be happy if the patterns provided by the Rakudo parser could be accessed in some way, especially since it probably uses many standardized concepts, so it could easily be "more Raku than Rakudo" per se.

@codesections
Contributor

This is more food for thought than anything else, but here is what a few other programs make of the string foo bar baz (with nbsp):

  • wc: 3 words
  • vim: 3 words, 3 Words
  • emacs: 3 words,
  • emacs (evil mode): 3 words, 1 Word

@jubilatious1

@alabamenhu said:

... break at all breakable whitespace (thus FS, NBSP, NNBSP, WJ and ZWNBS would not break).

Agreed.

@jubilatious1

Maybe words which respects &NBSP, and Words which doesn't?

and/or add a named argument to words, something to the effect of :nbsp-family, defaulting to False?
