"Collapse consecutive whitespace" operation does not collapse all possible unicode whitespace #4883
Comments
Hi, I write to let you know that I am interested in this issue and I would like to work on it |
There are some basic issues with Unicode whitespace trimming, etc., in some libraries: they don't treat UTF-8 encoding separately and instead mistakenly assume UTF-16 encoding. For instance, see these two stupidly confusing pages: Note that uC2A0 is the encoding for UTF-8 and is simultaneously listed as "invalid" Unicode...which it decidedly is not. These are UTF-16 centric sites and prove it with the C and Python code samples. I had to purposely check for this specific whitespace issue in RDF Transform since even Java's regex misses it. |
fileformat.info is not where you want to look for Unicode specifications. You should concern yourself with only those Unicode chars that fall into the And you need to understand some of the rules in the Unicode Text Segmentation annex that concern SpacingMark and https://www.unicode.org/reports/tr29/tr29-39.html
@AtesComp This issue might be best left for those that have an intimate understanding of Unicode especially concerning category Zs. |
I tried to find within Unicode.org an HTML page that shows the Zs category, but gave up after a bit. I think it's in the data downloads for sure. Anyways, here's a quick primer on a listing from another site, which I have no idea how complete it is: https://www.compart.com/en/unicode/category/Zs Well, according to Wikipedia, as of Unicode v14 there are 17 graphic chars in the Zs category:
Zs | Separator, space | Graphic | Character | 17 | Includes the space, but not [TAB](https://en.wikipedia.org/wiki/Tab_key), [CR](https://en.wikipedia.org/wiki/Carriage_return), or [LF](https://en.wikipedia.org/wiki/Newline), which are Cc
> So we should try to add all 17 into our source as a map. |
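Rather than hand-maintaining a map, the Zs members can also be enumerated from the JDK's own Unicode tables. The following is a minimal sketch, not OpenRefine code: the class name `SpaceSeparatorSketch` and the method `spaceSeparators` are hypothetical, and the number of results depends on the Unicode version bundled with the running JDK, so it may not be exactly 17.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of OpenRefine.
final class SpaceSeparatorSketch {

    /** Returns every code point whose general category is Zs (SPACE_SEPARATOR). */
    static List<Integer> spaceSeparators() {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (Character.getType(cp) == Character.SPACE_SEPARATOR) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The list reflects the Unicode version of the running JDK.
        for (int cp : spaceSeparators()) {
            System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
        }
    }
}
```

Tab, CR, and LF do not appear in the output because they are category Cc, so a whitespace pattern built from this list still needs `\s` (or an explicit list of controls) alongside it.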
For sure, fileformat.info is NOT the place for a proper look at the specification. I was just noting the awful presentation others give the subject. The fact that the Java regex failed in this issue was surprising. The standard for the latest "Unicode" is documented at https://www.unicode.org/Public/UCD/latest/
I've had quite a bit of experience with Unicode over the last 20 years. Here is a general overview of the related issues: The UTF-8 standard is documented here: I can assuredly and unequivocally report that the U+00A0 Unicode character is universally misrepresented by UTF-8 compliant processors. As seen here , the U+00A0 code point (000 1010 0000) must be converted to a 2-byte representation, as it falls in the following row of the conversion table:
U+0080 | U+07FF | 110xxxxx | 10xxxxxx
It is, therefore, 11000010 10100000, or C2 A0.
Many implementations naively and erroneously reference the U+00A0 Unicode character as 00 A0 for UTF-8...including most Java libraries and specifically the one you mentioned. I hope that is way more convincing. |
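The two-byte conversion described above can be checked with a short Java sketch. The class `NbspEncodingSketch` and its helpers are hypothetical names used only for illustration. The sketch applies the `110xxxxx 10xxxxxx` row by hand and compares the result with the JDK's own encoders. One thing it makes visible: inside a Java `String` the character is held as the single UTF-16 code unit 0x00A0, and the bytes C2 A0 only appear when the string is serialized as UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class NbspEncodingSketch {

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Applies the 110xxxxx 10xxxxxx row by hand; valid only for U+0080..U+07FF. */
    static byte[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not in the two-byte range");
        }
        return new byte[] {
            (byte) (0xC0 | (codePoint >> 6)),   // lead byte: 110 + top 5 bits
            (byte) (0x80 | (codePoint & 0x3F))  // continuation: 10 + low 6 bits
        };
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";
        System.out.println(hex(encodeTwoByte(0x00A0)));                   // C2 A0
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8)));    // C2 A0
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_16BE))); // 00 A0
    }
}
```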
I knew I had these references somewhere... You can use the following site to do some Unicode lookups: For the interested However, there are no transform references. The following link discusses the various transforms: If you really want to do a full-blown transform converter, get the ICU4J code: But why, since UTF-8 is the dominant transform? Here's the RegEx Unicode site: From GeeksforGeeks: The majority of JavaScript, as UTF-8, will need to be transcoded to UTF-16 for the Java RegEx. Which means:
Or precompile some regex patterns with CANON_EQ and you might get lucky. And to cap it all off, you can change the default encoding of the JVM using the confusingly-named property To be clear, even though I knew most of this from past experience, I've still made mistakes with my current coding. What a royal mess! |
This has convinced me of the bigger mess and security issues around JavaScript: I'm re-examining my UTF-8 / UTF-16 encoding issue.
I'm having an issue with the U+00A0 vs C2 A0 NBSP character. I am definitely getting a \uC2A0 character embedded in some OpenRefine ingestions. They appear as an extra space at the end of text in the UI (especially in the GREL expression results). I would think the \uC2A0 would show up on the UI either as a weird A with a space or as a CJK character, but it doesn't: it's a blank space. I thought this was just a UTF-8 character not getting handled properly. Now, I think it's some fundamental OpenRefine ingestion issue. On another note... |
@AtesComp Regarding the display issue of the \uC2A0 character (is it a display issue or not that you are experiencing?): are you using Lucida Sans Unicode (or one of these) as your standard font in your browser settings? Hex C2A0 = UTF-8 for NBSP. Also, @AtesComp, are you aware of UTF-8 becoming the default in the forthcoming JDK 18 ? |
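To separate the byte sequence C2 A0 from the code point U+C2A0, a small Java sketch may help. It is offered as an illustration of how the two differ, not as a diagnosis of the ingestion problem described above, and the class name `NbspDecodingSketch` is hypothetical. It decodes the same two bytes with two different charsets and looks up the Unicode block of U+C2A0.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class NbspDecodingSketch {

    /** The UTF-8 byte sequence for NBSP. */
    static final byte[] NBSP_UTF8 = {(byte) 0xC2, (byte) 0xA0};

    /** Lists the code points of a string as "U+XXXX" tokens. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Decoded as UTF-8, the two bytes are one character: NBSP.
        System.out.println(codePoints(new String(NBSP_UTF8, StandardCharsets.UTF_8)));
        // Decoded as ISO-8859-1, they become two characters: "Â" followed by NBSP.
        System.out.println(codePoints(new String(NBSP_UTF8, StandardCharsets.ISO_8859_1)));
        // The single code point U+C2A0 is something else again: a Hangul syllable.
        System.out.println(Character.UnicodeBlock.of(0xC2A0));
    }
}
```

If a cell really held the single code point U+C2A0, it would be a letter rather than whitespace, so checking the actual code points of a suspect cell with something like `codePoints` is a cheap first step before blaming fonts or encodings.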
I don't want to pop off again before I have some definite info. I will do some comprehensive tests with a data set and recreate it to see if it is related. I'll create a new issue if not. I'll look at the various font settings to see if I can get it to display any differently. It appears the same as in the spreadsheet. I'm suspicious about MS playing games with a regular space and nbsp to manage "presentation" to the user. I just don't yet understand how it could be \uC2A0 and not \u00A0 in JavaScript. I've been tracking the Java UTF issues for some time but didn't read about that. They are not going to default to the local system setting anymore? That should at least make it a little more predictable. |
Hi, thanks for all the input. I will send my PR for the issue tomorrow |
So I'm not at all an expert in encoding or Unicode / UTF8 / UTF16, so I can't comment on any of the underlying issues being raised here, but I want to draw focus to how this issue is framed and how it works: The title of this issue is:
The "Collapse consecutive whitespace" function actually just applies the GREL: which, in code, is doing:
Where str is the cell
So in terms of ensuring the "Collapse consecutive whitespace" function deals with the specific use case it would seem to me that by far the simplest approach would be to extend the regular expression used in the GREL replace to deal with additional whitespace characters - for some definition of what "whitespace characters" are included. For example we could use
To be honest this seems to me reasonably aggressive in its scope so we might want to discuss whether there are situations where this would cause issues. More conservatively we could go for something narrower like: which I think is equivalent to what @thadguidry suggests when they say:
Finally, I'd note that this is currently marked as a "good first issue", which would suggest that it's been assessed as a relatively straightforward task. I think if this is a matter of tweaking the regex used then it's very straightforward and definitely a good first issue. But if the scope is more broadly dealing with whitespace in OR then it's by no means straightforward and should be re-labelled :) |
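The regex-only approach discussed above can be sketched in Java. This is an illustration under stated assumptions, not the OpenRefine implementation: it assumes the existing behaviour is equivalent to replacing Java's default `\s+` with a single space, and the class `CollapseWhitespaceSketch`, its constants, and the `collapse` method are hypothetical names. It compares the default pattern with two wider candidates, one that adds the Zs category and one that switches `\s` to the Unicode White_Space property.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the OpenRefine implementation.
final class CollapseWhitespaceSketch {

    /** Java's default \s: space, tab, LF, VT, FF, CR only. */
    static final Pattern DEFAULT_WS = Pattern.compile("\\s+");

    /** Narrower extension: default \s plus the Zs (space separator) category. */
    static final Pattern WS_OR_ZS = Pattern.compile("[\\s\\p{Zs}]+");

    /** Wider extension: \s redefined as the Unicode White_Space property. */
    static final Pattern UNICODE_WS =
            Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    /** Replaces each run matched by the pattern with one ASCII space. */
    static String collapse(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        String cell = "hello\u00A0\u00A0world"; // two NBSPs between the words
        System.out.println(collapse(DEFAULT_WS, cell).equals(cell));          // true: unchanged
        System.out.println(collapse(WS_OR_ZS, cell).equals("hello world"));   // true
        System.out.println(collapse(UNICODE_WS, cell).equals("hello world")); // true
    }
}
```

Either wider pattern also turns non-breaking spaces into ordinary ones, which is the "reasonably aggressive" side effect worth discussing before changing the default.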
@ostephens, I believe you have the correct assessment. My issue is a parallel problem dealing with elimination, string end trimming. The only additional point is the apparent "nbsp" alignment. I'll likely create a new issue when I track down how an apparent UTF-8 encoding is getting forced into a UTF-16 encoding. Either the character should be transcoded properly to be condensed / eliminated as well or it should be displayed / managed differently. |
Let's remove the "good first issue" tag simply by virtue of the amount of text to read in this issue :) |
The "Collapse consecutive whitespace" operation does not work when applied to certain Unicode whitespace characters.
To Reproduce
Steps to reproduce the behavior:
Current Results
The cell is not edited
Expected Behavior
The cell should be edited to "hello world"
Versions
Datasets
Real world dataset where this appears: https://opendata.paris.fr/explore/dataset/lieux-de-tournage-a-paris/information/?disjunctive.type_tournage&disjunctive.nom_tournage&disjunctive.nom_realisateur&disjunctive.nom_producteur&disjunctive.ardt_lieu
Additional context
Discovered while doing a demo at Dataharvest 2022