Mangled Unicode characters in yellow message after matching using "search for match" dialog #6063

nikkiwd · 2023-09-22T23:38:25Z

After matching a value, there's a yellow message shown at the top of the page. If the match was done using the "Search for match" dialog, many Unicode characters are turned into "?".

To Reproduce

Steps to reproduce the behavior:

Create a new project with the following lines:

Māori
Omaha–Ponca
Võro

Select "Start reconciling" from the menu for "Column 1"
Select the Wikidata reconciliation service
Unselect "Auto-match candidates with high confidence"
Click "Start reconciling"
Click "Search for match" for one of the entries
Select the matching item from the dropdown

Current Results

The yellow message shown after matching displays "Māori" as "M?ori" and "Omaha–Ponca" as "Omaha?Ponca", but displays "Võro" correctly.

All of the names are displayed correctly if you click on the tick to accept the match instead of using "Search for match".

Expected Behavior

The Unicode characters should not turn into question marks.

Screenshots

After using "Search for match":

After clicking on the tick:

(taken from a longer list of names, so the row numbers don't match)

Versions

Operating System: Ubuntu
Browser Version:
JRE or JDK Version: openjdk 17.0.8.1
OpenRefine: Version 3.7.5 [a04fb5f]

Datasets

Additional context

The non-ASCII characters in these three names are:

U+0101 LATIN SMALL LETTER A WITH MACRON
U+2013 EN DASH
U+00F5 LATIN SMALL LETTER O WITH TILDE

It seems to only affect characters beyond U+00FF, so something is probably trying to use ISO 8859-1 (Latin-1).
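The same pattern can be reproduced outside OpenRefine. The following is a minimal Python sketch, not OpenRefine code. It assumes the server-side writer substitutes "?" for any character the target charset cannot represent, which is emulated here with `errors="replace"`; the helper name `through_latin1` is made up for illustration.

```python
# Emulate a writer that encodes text as ISO 8859-1 and substitutes "?"
# for characters the charset cannot represent (an assumption about the
# server-side behavior, not taken from OpenRefine's source).
names = ["Māori", "Omaha–Ponca", "Võro"]


def through_latin1(text: str) -> str:
    """Encode as ISO 8859-1 with replacement, then decode again."""
    return text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")


mangled = [through_latin1(name) for name in names]
for original, result in zip(names, mangled):
    print(f"{original!r} -> {result!r}")
```

Only the characters above U+00FF are replaced, which matches the three names in the report: "ā" and "–" are lost, while "õ" survives.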

Looking at the network requests seems to confirm that:

When using "Search for match", the browser sends the data to /command/core/recon-judge-similar-cells, and when clicking on the tick, it sends it to /command/core/recon-judge-one-cell.

In both cases it uses the header Content-Type: application/x-www-form-urlencoded; charset=UTF-8.

/command/core/recon-judge-similar-cells gives a response with the header Content-Type: application/json;charset=iso-8859-1 while /command/core/recon-judge-one-cell gives a response with Content-Type: application/json;charset=utf-8.

tfmorris · 2023-12-13T18:40:37Z

@nikkiwd ~~Can you let us know what default character encoding your system is set up to use, please? i.e. in an Ubuntu shell~~

You can ignore that question. I've got a fix for this.

The response encoding was being set after the writer was created, so the writer was using the wrong (default) encoding. The fix also covers another place where UTF-8 encoding was not being set for a response.
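The ordering problem can be illustrated with a toy model. This is a Python sketch, not OpenRefine's code: `ToyResponse`, its methods, and `respond` are hypothetical stand-ins for a servlet-style response whose writer captures the character encoding at the moment it is created, so a later change to the encoding has no effect. The ISO 8859-1 default is an assumption chosen to match the response header reported above.

```python
import io


class ToyResponse:
    """Hypothetical stand-in for a servlet-style HTTP response."""

    def __init__(self) -> None:
        self.encoding = "iso-8859-1"  # assumed default, as seen in the header
        self.body = io.BytesIO()
        self._writer = None

    def set_character_encoding(self, encoding: str) -> None:
        # Once a writer exists its encoding is already fixed, so the call
        # is silently ignored.
        if self._writer is None:
            self.encoding = encoding

    def get_writer(self) -> io.TextIOWrapper:
        if self._writer is None:
            # The writer captures whatever encoding is in effect right now.
            self._writer = io.TextIOWrapper(
                self.body, encoding=self.encoding, errors="replace"
            )
        return self._writer


def respond(text: str, set_encoding_first: bool) -> bytes:
    """Write text to a fresh response, setting UTF-8 before or after the writer."""
    response = ToyResponse()
    if set_encoding_first:
        response.set_character_encoding("utf-8")
    writer = response.get_writer()
    if not set_encoding_first:
        response.set_character_encoding("utf-8")  # too late, writer exists
    writer.write(text)
    writer.flush()
    return response.body.getvalue()


buggy = respond("Māori", set_encoding_first=False)
fixed = respond("Māori", set_encoding_first=True)
print(buggy)
print(fixed)
```

In this model, setting the encoding after fetching the writer yields the mangled bytes, while setting it first yields valid UTF-8, which is the ordering the fix enforces.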

nikkiwd added the Type: Bug and Status: Pending Review labels Sep 22, 2023

elebitzero mentioned this issue Oct 20, 2023

Mangled Unicode characters in reconciliation choices due to OS default character encoding #6107

Closed

tfmorris self-assigned this Dec 13, 2023

tfmorris mentioned this issue Dec 13, 2023

Make sure encoding is set before fetching writer. Fixes #6063 #6242

Merged

tfmorris closed this as completed in #6242 Dec 14, 2023

tfmorris added this to the 3.8 milestone Dec 18, 2023
