
"Collapse consecutive whitespace" operation does not collapse all possible unicode whitespace #4883

Closed
wetneb opened this issue May 24, 2022 · 13 comments · Fixed by #4898
Labels: localization (anything to do with i18n/l10n) · Type: Bug (issues related to software defects or unexpected behavior, which require resolution)
Milestone: 3.6

Comments

@wetneb
Sponsor Member

wetneb commented May 24, 2022

The "Collapse consecutive whitespace" operation does not work when applied to certain Unicode whitespace characters.

To Reproduce

Steps to reproduce the behavior:

  1. First, import this OpenRefine project: Clipboard.openrefine.tar.gz
  2. Then, run the "Collapse consecutive whitespace" operation on the only column

Current Results

The cell is not edited

Expected Behavior

The cell should be edited to "hello world"

Versions

  • JRE or JDK Version: 11
  • OpenRefine: 3.6-SNAPSHOT

Datasets

Real world dataset where this appears: https://opendata.paris.fr/explore/dataset/lieux-de-tournage-a-paris/information/?disjunctive.type_tournage&disjunctive.nom_tournage&disjunctive.nom_realisateur&disjunctive.nom_producteur&disjunctive.ardt_lieu

Additional context

Discovered while doing a demo at Dataharvest 2022

@wetneb added the Type: Bug and Good First Issue labels May 24, 2022
@carlos-montano-hub
Contributor

Hi, I'm writing to let you know that I am interested in this issue and would like to work on it.

@AtesComp
Contributor

There are some basic issues with how certain libraries handle Unicode whitespace trimming, etc.: they don't treat UTF-8 encoding correctly and instead mistakenly assume UTF-16.

For instance, see these two confusingly presented pages:
https://www.fileformat.info/info/unicode/char/00a0/index.htm
https://www.fileformat.info/info/unicode/char/c2a0/index.htm

Note that C2 A0 is the UTF-8 encoding of U+00A0, yet the second page labels it "invalid" Unicode, which it decidedly is not. These are UTF-16-centric sites, and their C and Python code samples prove it.

I had to purposely check for this specific whitespace issue in RDF Transform, since even Java's regex misses it.
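The regex behavior mentioned above is easy to check directly: by default, Java's `\s` matches only ASCII whitespace, and only picks up Unicode whitespace (such as NBSP) when the `UNICODE_CHARACTER_CLASS` flag, or its inline form `(?U)`, is set. A minimal sketch (the class name is illustrative):

```java
import java.util.regex.Pattern;

public class WhitespaceDemo {
    public static void main(String[] args) {
        String nbsp = "\u00A0"; // NO-BREAK SPACE

        // Plain \s matches only [ \t\n\x0B\f\r], so it misses NBSP.
        System.out.println(Pattern.compile("\\s").matcher(nbsp).find());    // false

        // With (?U) / UNICODE_CHARACTER_CLASS, \s follows the Unicode
        // White_Space property, which includes NBSP.
        System.out.println(Pattern.compile("(?U)\\s").matcher(nbsp).find()); // true
    }
}
```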

@thadguidry
Member

thadguidry commented May 25, 2022

fileformat.info is not where you want to look for Unicode specifications.

You should concern yourself only with those Unicode characters that fall into the Zs (space separator) category of the Unicode standard. Chapter 2 covers code point types (https://www.unicode.org/versions/Unicode14.0.0/ch02.pdf#G286941) and Chapter 4.5 covers the General categories (https://www.unicode.org/versions/Unicode14.0.0/ch04.pdf).

And you need to understand some of the rules in the Unicode Text Segmentation annex (UAX #29) that concern SpacingMark: https://www.unicode.org/reports/tr29/tr29-39.html
such as

General_Category = Zs and not LineBreak = Glue

@AtesComp This issue might be best left for those that have an intimate understanding of Unicode especially concerning category Zs.
Or you can inspect some of the libraries that do have some maps. For instance, search for "space" within
https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/translate/EntityArrays.java
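The Zs category can also be probed directly from Java, since `java.util.regex` exposes Unicode general categories via `\p{Zs}`. A minimal sketch (the class name is illustrative):

```java
import java.util.regex.Pattern;

public class ZsDemo {
    public static void main(String[] args) {
        Pattern zs = Pattern.compile("\\p{Zs}");
        System.out.println(zs.matcher("\u00A0").find()); // NO-BREAK SPACE is Zs: true
        System.out.println(zs.matcher("\u2003").find()); // EM SPACE is Zs: true
        System.out.println(zs.matcher("\t").find());     // TAB is Cc, not Zs: false
    }
}
```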

@thadguidry
Member

thadguidry commented May 25, 2022

I tried to find an HTML page on Unicode.org that shows the Zs category, but gave up after a bit; I think it's in the data downloads for sure. Anyway, here's a quick primer listing from another site, though I have no idea how complete it is: https://www.compart.com/en/unicode/category/Zs

Well, according to Wikipedia, as of Unicode v14 there are 17 graphic characters in the Zs category:

Zs · Separator, space · Graphic Character · 17 · Includes the space, but not TAB, CR, or LF, which are Cc

So we should try to add all 17 into our source as a map.
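For reference, a sketch of such a map: the 17 Zs code points as of Unicode 14 are U+0020, U+00A0, U+1680, U+2000 through U+200A, U+202F, U+205F, and U+3000. They can be cross-checked against the JDK's own Unicode tables via `Character.getType()` (the class and field names here are illustrative):

```java
public class ZsList {
    // The 17 Zs (space separator) code points, per Unicode 14.
    static final int[] ZS = {
        0x0020, 0x00A0, 0x1680,
        0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005,
        0x2006, 0x2007, 0x2008, 0x2009, 0x200A,
        0x202F, 0x205F, 0x3000
    };

    public static void main(String[] args) {
        // Verify each code point really carries general category Zs.
        for (int cp : ZS) {
            if (Character.getType(cp) != Character.SPACE_SEPARATOR)
                throw new AssertionError("not Zs: U+" + Integer.toHexString(cp));
        }
        System.out.println(ZS.length); // 17
    }
}
```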

@AtesComp
Contributor

AtesComp commented May 25, 2022

For sure, fileformat.info is NOT the place for a proper look at the specification; I was just noting the awful presentation others give the subject. The fact that the Java regex failed in this issue was surprising.

The standard for the latest Unicode is documented at https://www.unicode.org/Public/UCD/latest/
But that is not a clear representation of UTF-8, which is by far the most widely used transformation encoding.

I've had quite a bit of experience with Unicode over the last 20 years. Here is a general overview of the related issues:
https://stackoverflow.com/questions/2241348/what-are-unicode-utf-8-and-utf-16#:~:text=A%20code%20unit%20is%20the,Unicode%20of%2021%20bits%20maximum

The UTF-8 standard is documented here:
https://www.ietf.org/rfc/rfc3629.txt

I can assuredly and unequivocally report that the U+00A0 Unicode character is widely mishandled by supposedly UTF-8-compliant processors. Per the conversion table in that RFC, the U+00A0 code point (binary 000 1010 0000) must be converted to a 2-byte representation, as it falls in the following row of the table:

U+0080 | U+07FF | 110xxxxx | 10xxxxxx

and is, therefore, 11000010 10100000, or C2 A0.

Many implementations naively and erroneously represent the U+00A0 Unicode character as the bytes 00 A0 in UTF-8...including most Java libraries, and specifically the one you mentioned.

I hope that is way more convincing.
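The two-byte claim itself is easy to verify from Java: encoding a string containing U+00A0 with `StandardCharsets.UTF_8` yields exactly the bytes C2 A0. A minimal sketch (the class name is illustrative; note the `& 0xFF` to avoid sign extension when printing bytes):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Nbsp {
    public static void main(String[] args) {
        // Encode NO-BREAK SPACE (U+00A0) as UTF-8.
        byte[] b = "\u00A0".getBytes(StandardCharsets.UTF_8);
        // Mask with 0xFF so the signed bytes print as unsigned hex.
        System.out.printf("%02X %02X%n", b[0] & 0xFF, b[1] & 0xFF); // C2 A0
    }
}
```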

@AtesComp
Contributor

I knew I had these references somewhere...

You can use the following site to do some Unicode lookups:
https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp

For the interested Zs category, use in the input box:
[\p{Zs}]
Then, press the Show Set button. It lists the 17 characters referenced above.

However, there are no transform references.

The following link discusses the various transforms:
https://unicode.org/faq/utf_bom.html

If you really want to do a full blown transform converter, get the ICU4J code:
https://icu.unicode.org/
https://github.com/unicode-org/icu

But why bother, since UTF-8 is the dominant transform?

Here's the RegEx Unicode site:
https://www.regular-expressions.info/unicode.html
(See the "Do You Need To Worry About Different Encodings?" part for some hints)

From GeekforGeeks:
https://www.geeksforgeeks.org/how-to-get-and-set-default-character-encoding-or-charset-in-java/
which indicates UTF-8 is the default encoding--wrong. Actually, it depends on your OS settings. Early Java was UTF-16 centric. The regex functions clearly work on either UTF-16 or raw Unicode code points. Why don't they document it!

Most strings arriving from JavaScript as UTF-8 will need to be transcoded to UTF-16 for the Java regex. Which means something like:

    // ...get strUTF8 from JavaScript...
    String strUTF16 = new String(strUTF8.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_16);
    // ...do your regex work, then...
    String strResult = new String(strUTF16.getBytes(StandardCharsets.UTF_16), StandardCharsets.UTF_8);
    // ...send strResult back to JavaScript...

Or precompile some regex patterns with CANON_EQ and you might get lucky.

And to cap it all off, you can change the default encoding of the JVM using the confusingly-named property file.encoding.

To be clear, even though I knew most of this from past experience, I've still made mistakes with my current coding.

What a royal mess!
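One point worth checking empirically here: a `java.lang.String` is charset-agnostic once decoded (internally it holds UTF-16 code units), so encodings only matter at the byte boundary. Decoding the incoming UTF-8 bytes once is sufficient; the regex engine then operates on the String directly, with no separate transcode step. A minimal sketch (the class name is illustrative):

```java
import java.nio.charset.StandardCharsets;

public class DecodeOnce {
    public static void main(String[] args) {
        // The raw UTF-8 bytes for NO-BREAK SPACE (U+00A0).
        byte[] utf8 = {(byte) 0xC2, (byte) 0xA0};

        // Decode once; the result is a single-char String containing \u00A0.
        String s = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s.length());           // 1
        System.out.println(s.matches("\\p{Zs}")); // true -- regex sees the char, not bytes
    }
}
```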

@AtesComp
Contributor

This has convinced me of the bigger mess and security issues around JavaScript:
https://encoding.spec.whatwg.org/

I'm re-examining my UTF-8 / UTF-16 encoding issue.

  1. Microsoft OS encodings are supposedly UTF-16 based, but experience shows that OpenRefine-ingested Excel spreadsheets often have what looks like a UTF-8 NBSP at the end of some text cells
  2. JavaScript is UTF-16 for all strings
  3. Then, the JavaScript RegEx expressions work only on UTF-16 data

I'm having an issue with the U+00A0 vs C2 A0 NBSP character. I am definitely getting a \uC2A0 character embedded in some OpenRefine ingestions. They appear as an extra space at the end of text in the UI (especially in the GREL expression results). I would think \uC2A0 would show up either as a weird A followed by a space or as a CJK character, but it doesn't--it's a blank space. I thought this was just a UTF-8 character not being handled properly. Now I think it's some fundamental OpenRefine ingestion issue.

On another note...
I consider whitespace as "\p{Cc}\p{Co}\p{Cn}\p{Z}".
Please review these and check me on it. Reject as needed.
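As a quick sanity check on that proposed class, Java's regex engine accepts all four property escapes, and the class behaves as described: it catches control characters, private-use characters, unassigned code points, and all separators, while leaving ordinary letters alone. A minimal sketch (the class name is illustrative):

```java
import java.util.regex.Pattern;

public class BroadWhitespace {
    public static void main(String[] args) {
        // The proposed "whitespace" class: controls, private use, unassigned, separators.
        Pattern broad = Pattern.compile("[\\p{Cc}\\p{Co}\\p{Cn}\\p{Z}]");

        System.out.println(broad.matcher("\t").find());     // Cc (control): true
        System.out.println(broad.matcher("\u00A0").find()); // Zs (separator): true
        System.out.println(broad.matcher("\uE000").find()); // Co (private use): true
        System.out.println(broad.matcher("A").find());      // Lu (letter): false
    }
}
```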

@thadguidry
Member

thadguidry commented May 26, 2022

@AtesComp Regarding the \uC2A0 character: is it a display issue you are experiencing, or something else? Are you using Lucida Sans Unicode (or one of these) as your standard font in your browser settings?

Hex C2A0 = UTF-8 for NBSP
\uC2A0 or U+C2A0 = 슠

Also, @AtesComp are you aware of UTF-8 being the default forthcoming in JDK 18 ?

@AtesComp
Contributor

I don't want to pop off again before I have some definite info. I will do some comprehensive tests with a data set and recreate it to see if it is related, and I'll create a new issue if not. I'll look at the various font settings to see if I can get it to display any differently; it appears the same as in the spreadsheet. I'm suspicious that MS is playing games with a regular space and NBSP to manage "presentation" to the user. I just don't yet understand how it could be \uC2A0 and not \u00A0 in JavaScript.

I've been tracking the Java UTF issues for some time but didn't read about that. They are not going to default to the local system setting anymore? That should at least make it a little more predictable.

@carlos-montano-hub
Contributor

Hi, thanks for all the input. I will send my PR for this issue tomorrow.

@ostephens
Sponsor Member

ostephens commented May 27, 2022

I'm not at all an expert in encoding or Unicode/UTF-8/UTF-16, so I can't comment on the underlying issues being raised here, but I want to draw focus to how this issue is framed and how the operation works.

The title of this issue is:

"Collapse consecutive whitespace" operation does not collapse all possible unicode whitespace

The "Collapse consecutive whitespace" function actually just applies the GREL:

    value.replace(/\s+/,' ')

which, in code, is doing:

    Pattern pattern = (Pattern) o2;
    return pattern.matcher(str).replaceAll((String) o3);

where str is the cell value, o2 is /\s+/, and o3 is " ".

So in terms of ensuring the "Collapse consecutive whitespace" function deals with the specific use case it would seem to me that by far the simplest approach would be to extend the regular expression used in the GREL replace to deal with additional whitespace characters - for some definition of what "whitespace characters" are included.

For example we could use
value.replace(/[\p{Cc}\p{Co}\p{Cn}\p{Z}]+/," ")
based on the comment from @AtesComp

I consider whitespace as "\p{Cc}\p{Co}\p{Cn}\p{Z}"

To be honest, this seems to me reasonably aggressive in its scope, so we might want to discuss whether there are situations where this would cause issues.

More conservatively we could go for something narrower like:
value.replace(/[\p{Z}\s]+/," ")

which I think is equivalent to what @thadguidry suggests when they say:

as of Unicode v14 there are 17 graphic chars in the Zs category [....] So we should try to add all 17 into our source as a map

Finally, I'd note that this is currently marked as a "good first issue", which would suggest it's been assessed as a relatively straightforward task. If this is a matter of tweaking the regex used, then it's very straightforward and definitely a good first issue. But if the scope is more broadly dealing with whitespace in OpenRefine, then it's by no means straightforward and should be re-labelled :)
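Since GREL's replace with a regex delegates to Java's `replaceAll` (per the code shown above), the narrower suggestion can be sanity-checked in plain Java. A minimal sketch (class and variable names are illustrative):

```java
public class CollapseDemo {
    public static void main(String[] args) {
        // NBSP (U+00A0), EM SPACE (U+2003), and a regular space between the words.
        String input = "hello\u00A0\u2003 world";

        // The current behavior: \s+ only touches the ASCII space,
        // so the NBSP and EM SPACE survive.
        System.out.println(input.replaceAll("\\s+", " ").equals("hello world")); // false

        // The proposed narrower fix: [\p{Z}\s]+ collapses the whole run.
        System.out.println(input.replaceAll("[\\p{Z}\\s]+", " ")); // hello world
    }
}
```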

@AtesComp
Contributor

@ostephens, I believe you have the correct assessment. My issue is a parallel problem dealing with elimination: trimming whitespace from string ends.

The only additional point is the apparent "nbsp" alignment. I'll likely create a new issue when I track down how an apparent UTF-8 encoding is getting forced into a UTF-16 encoding. Either the character should be transcoded properly to be condensed / eliminated as well or it should be displayed / managed differently.

@wetneb removed the Good First Issue label May 27, 2022
@wetneb
Sponsor Member Author

wetneb commented May 27, 2022

Let's remove the "good first issue" tag simply by virtue of the amount of text to read in this issue :)

@wetneb wetneb added this to the 3.6 milestone Jun 15, 2022
@tfmorris added the localization label Nov 9, 2022
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Nov 9, 2022
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Nov 15, 2022
wetneb pushed a commit that referenced this issue Nov 15, 2022
* Replace hidden characters with escape codes. Refs #4883

* Backspace (BS) isn't actually a Java whitespace character
6 participants