Mangled Unicode characters in yellow message after matching using "search for match" dialog #6063

Closed
nikkiwd opened this issue Sep 22, 2023 · 1 comment · Fixed by #6242

nikkiwd commented Sep 22, 2023

After matching a value, there's a yellow message shown at the top of the page. If the match was done using the "Search for match" dialog, many Unicode characters are turned into "?".

To Reproduce

Steps to reproduce the behavior:

  1. Create a new project with the following lines:
  • Māori
  • Omaha–Ponca
  • Võro
  2. Select "Start reconciling" from the menu for "Column 1"
  3. Select the Wikidata reconciliation service
  4. Unselect "Auto-match candidates with high confidence"
  5. Click "Start reconciling"
  6. Click "Search for match" for one of the entries
  7. Select the matching item from the dropdown

Current Results

The yellow message shown after matching displays "Māori" as "M?ori" and "Omaha–Ponca" as "Omaha?Ponca", but displays "Võro" correctly.

All of the names are displayed correctly if you click on the tick to accept the match instead of using "Search for match".

Expected Behavior

The Unicode characters should not turn into question marks.

Screenshots

After using "Search for match":

screenshot

After clicking on the tick:

screenshot

(taken from a longer list of names, so the row numbers don't match)

Versions

  • Operating System: Ubuntu
  • Browser Version:
  • JRE or JDK Version: openjdk 17.0.8.1
  • OpenRefine: Version 3.7.5 [a04fb5f]

Datasets

Additional context

The non-ASCII characters in these three names are:

  • U+0101 LATIN SMALL LETTER A WITH MACRON
  • U+2013 EN DASH
  • U+00F5 LATIN SMALL LETTER O WITH TILDE

It seems to only affect characters beyond U+00FF, so something is probably trying to use ISO 8859-1 (Latin-1).

Looking at the network requests seems to confirm that:

When using "Search for match", the browser sends the data to /command/core/recon-judge-similar-cells, and when clicking on the tick, it sends it to /command/core/recon-judge-one-cell.

In both cases it uses the header Content-Type: application/x-www-form-urlencoded; charset=UTF-8.

/command/core/recon-judge-similar-cells gives a response with the header Content-Type: application/json;charset=iso-8859-1 while /command/core/recon-judge-one-cell gives a response with Content-Type: application/json;charset=utf-8.
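The observation above can be reproduced outside OpenRefine with a short sketch. It reads the charset from each response header quoted in this report and round-trips the three test names through that charset. It assumes the server substitutes "?" for characters the charset cannot represent, which matches what the yellow message shows. The helper names (`charset_of`, `as_received`, `simulate`) are made up for this illustration and are not part of OpenRefine.

```python
from email.message import Message

# The three test values from the reproduction steps.
NAMES = ["Māori", "Omaha–Ponca", "Võro"]

# Response Content-Type headers as observed in the browser's network tab.
RESPONSE_HEADERS = {
    "/command/core/recon-judge-similar-cells": "application/json;charset=iso-8859-1",
    "/command/core/recon-judge-one-cell": "application/json;charset=utf-8",
}


def charset_of(content_type):
    """Return the lower-cased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def as_received(text, charset):
    """Encode text the way a writer bound to `charset` would, then decode it.

    errors="replace" substitutes "?" for every character the charset
    cannot represent (an assumption about the server's behaviour).
    """
    return text.encode(charset, errors="replace").decode(charset)


def simulate():
    """Map each endpoint to the names as they would appear in its response."""
    return {
        endpoint: [as_received(name, charset_of(header)) for name in NAMES]
        for endpoint, header in RESPONSE_HEADERS.items()
    }


if __name__ == "__main__":
    for endpoint, shown in simulate().items():
        print(endpoint, shown)
```

Only U+0101 and U+2013 are lost under ISO 8859-1, while U+00F5 survives, which is consistent with "Võro" being displayed correctly.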

nikkiwd added the Type: Bug and Status: Pending Review labels Sep 22, 2023
wetneb added the encoding and reconciliation labels and removed the Status: Pending Review label Sep 23, 2023
Member

tfmorris commented Dec 13, 2023

@nikkiwd Can you let us know what default character encoding your system is set up to use, please? i.e. in an Ubuntu shell

You can ignore that question. I've got a fix for this.

tfmorris self-assigned this Dec 13, 2023
tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Dec 13, 2023
The response encoding was being set after the writer was
created, so it was using the wrong (default) encoding.

Also fix another place where UTF-8 encoding is not being set
for a response.
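The ordering problem described in the commit message can be illustrated with a toy model. This is not OpenRefine's code: `FakeResponse` is a hypothetical stand-in for a servlet-style response whose writer captures the encoding at the moment it is created. The ISO 8859-1 default and the "?" substitution are assumptions chosen to match the behaviour reported in this issue.

```python
import io


class FakeResponse:
    """Hypothetical response object; the writer fixes its encoding on creation."""

    def __init__(self):
        self.body = io.BytesIO()
        self.encoding = "iso-8859-1"  # assumed default encoding
        self._writer = None

    def set_character_encoding(self, encoding):
        # Has no effect once the writer exists, which is the trap in this bug.
        if self._writer is None:
            self.encoding = encoding

    def get_writer(self):
        if self._writer is None:
            self._writer = io.TextIOWrapper(
                self.body,
                encoding=self.encoding,
                errors="replace",  # unencodable characters become "?"
                write_through=True,
            )
        return self._writer


def send(message, set_encoding_first):
    """Write message to a fresh response and return the raw bytes sent."""
    response = FakeResponse()
    if set_encoding_first:
        response.set_character_encoding("utf-8")  # correct order
    writer = response.get_writer()
    if not set_encoding_first:
        response.set_character_encoding("utf-8")  # too late, ignored
    writer.write(message)
    writer.flush()
    return response.body.getvalue()


if __name__ == "__main__":
    print(send("Māori", set_encoding_first=False))
    print(send("Māori", set_encoding_first=True))
```

In this model, setting the encoding after obtaining the writer leaves the writer on the default charset, so the fix amounts to setting the encoding first.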
tfmorris added a commit that referenced this issue Dec 14, 2023
tfmorris added this to the 3.8 milestone Dec 18, 2023