Problems with special characters in project name #4402

Closed
b2m opened this issue Jan 3, 2022 · 9 comments · Fixed by #4768
Labels
encoding Selection of encoding at import time, or encoding issues in data cleaning Type: Bug Issues related to software defects or unexpected behavior, which require resolution.

Comments

@b2m

b2m commented Jan 3, 2022

This issue describes problems with special characters in the project name and other fields from metadata.json.

To Reproduce

Steps to reproduce the behavior:

  1. Create a project with special characters in the project name ("Testing äüöß").
  2. Close OpenRefine and check metadata.json (everything ok!)
  3. Reopen OpenRefine and check special characters in the project name ("Testing äüöß").
  4. Close OpenRefine and check metadata.json (not ok!)
  5. Reopen OpenRefine and check special characters in the project name ("Testing äüöß")

Current Results

The special characters are somehow completely wrong and their number doubles with each restart of OpenRefine.

Expected Behavior

Reading and writing metadata to metadata.json preserves special characters.

Screenshots

After creating the project (step 1):

openrefine_special_chars_1

First restart (step 3):

openrefine_special_chars_2

Second restart (step 5):

openrefine_special_chars_3

Versions

  • Operating System: Windows 10
  • Browser Version: Firefox 95
  • JRE: 11.0.13 (using the bundled version "openrefine-win-with-java-3.5.1.zip")
  • OpenRefine: observed in 3.5.0, 3.5.1 and 3.5.2

Additional context

  1. The run of garbled characters grows exponentially, which results in really large metadata.json files and OutOfMemory errors.
  2. This might be related or even be the same problem as described in Out of memory errors from large metadata.json files being parsed during OpenRefine Project Open #3431.
  3. I also observed this problem in metadata.json in the fields .importOptionMetadata and .preferences.entries."exporters.templating.template".
@b2m added the Type: Bug and Status: Pending Review labels Jan 3, 2022
@wetneb added the encoding label Jan 3, 2022
@ashwini-m-hub

I want to work on it.

@wetneb
Sponsor Member

wetneb commented Jan 9, 2022

Thanks for volunteering! This might be easier to reproduce on Windows (but it is potentially also doable on Linux).
We have a guide for first contributions here: https://docs.openrefine.org/technical-reference/contributing#your-first-code-pull-request
Let me know how it goes!

@ashwini-m-hub

I'm new to open source.
After setting it up in my local environment, what is the next step?

@wetneb
Sponsor Member

wetneb commented Jan 9, 2022

The steps are outlined in the link I gave you above. If you point me to a more concrete problem with any of these steps I will do my best to help you.

@elroykanye
Member

@ashwini-m-hub are you still working on this?

@WaltonG
Member

WaltonG commented Apr 15, 2022

Hi @elroykanye, are you still interested in this?

@elroykanye
Member

elroykanye commented Apr 15, 2022

Hi @WaltonG, I didn't notice @ashwini-m-hub was unassigned.
I am still interested in this.

Thanks for offering 🙏🏾 I'll appreciate any help you have to offer.

@b2m
Author

b2m commented Jul 8, 2022

As I am now receiving more reports from colleagues about this issue, I am adding some context to make it easier to find via search, along with some mitigation strategies. Colleagues report that they suddenly receive "HTTP ERROR 500 java.lang.OutOfMemoryError: Java heap space" immediately after starting OpenRefine.

The natural reaction is to increase the maximum heap size in openrefine.l4j.ini, but this is not a long-term strategy, because the number of garbled characters doubles each time OpenRefine is opened (exponential growth).

To mitigate the problem, you first have to identify the affected projects.
To do that, find out where your OpenRefine projects are stored.

Then identify unusually big metadata.json files, where "unusually big" depends on your projects.
If a metadata.json has a size that is measured in megabytes (or bigger!) I usually have a look at it.

To find big metadata.json files, a disk usage tool can help you analyze the OpenRefine projects folder.

Some tools that might do the job: for Windows, TreeSize, WinDirStat or alternatives; for Linux, Disk Usage Analyzer, QDirStat or alternatives; for Mac, Disk Inventory X, GrandPerspective or alternatives.
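
If you would rather script the search than use a GUI tool, something like the following might work. It is only a sketch: the function name `find_large_metadata` and the 1 MB default threshold are my own choices for illustration, and it simply looks for any file called metadata.json below the folder you give it, without assuming a particular workspace layout.

```python
from pathlib import Path


def find_large_metadata(workspace: Path, min_bytes: int = 1_000_000) -> list[tuple[Path, int]]:
    """Return (path, size) pairs for metadata.json files of at least min_bytes.

    The threshold is an assumption; "unusually big" depends on your projects.
    """
    hits = []
    # Search recursively so no particular folder layout is assumed.
    for meta in workspace.rglob("metadata.json"):
        size = meta.stat().st_size
        if size >= min_bytes:
            hits.append((meta, size))
    # Largest files first, since those are the most likely to be affected.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling `find_large_metadata(Path("path/to/your/workspace"))` and printing the result gives the candidate projects, largest first. The path here is a placeholder for wherever your projects are stored.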

You can then isolate the affected projects by removing their IDs from workspace.json and, if you like, moving their folders to a temporary location.
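
Removing the IDs can also be scripted. The sketch below assumes that workspace.json holds the project IDs in a top-level list under the key `projectIDs`; please check your own file first, as that key name is an assumption on my part and is therefore a parameter. The function name `drop_project_ids` is hypothetical. It writes a `.bak` copy before changing anything, and it is probably safest to run it while OpenRefine is closed.

```python
import json
from pathlib import Path


def drop_project_ids(workspace_file: Path, ids_to_drop: set[str], key: str = "projectIDs") -> list:
    """Remove the given project IDs from workspace.json and return the removed entries.

    `key` is an assumption about the file layout; verify it against your own file.
    """
    data = json.loads(workspace_file.read_text(encoding="utf-8"))
    if key not in data:
        raise KeyError(f"{key!r} not found in {workspace_file}")

    # IDs may be stored as numbers or strings, so compare their string form.
    removed = [pid for pid in data[key] if str(pid) in ids_to_drop]
    data[key] = [pid for pid in data[key] if str(pid) not in ids_to_drop]

    # Keep a backup of the untouched file next to the original.
    backup = workspace_file.with_name(workspace_file.name + ".bak")
    backup.write_bytes(workspace_file.read_bytes())

    # Read and write with the same explicit encoding.
    workspace_file.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    return removed
```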

There is a solution to fix the metadata.json files using a combination of find and jq, described in the issue Out of memory errors from large metadata.json files being parsed during OpenRefine Project Open #3431.

If using command line tools is not your thing you might open the files in a text editor and manually tidy the files up.
In my experience it is easier to copy the "clean" JSON parts into a new file instead of trying to delete megabytes of garbage.

If you are affected by this behavior you might want to switch early to OpenRefine v3.6-rc (or a later release) where this bug is fixed.

@tfmorris removed the Status: Pending Review label Oct 6, 2022
@tfmorris
Member

tfmorris commented Feb 7, 2023

As mentioned above, this was first reported in #3431. The problem was introduced by #2657, which attempted to fix #2543, #2544 and #2627 but was incomplete: it forced saving as UTF-8 without doing the same for loading (and for some reason the test didn't catch it).

I think the problem affects 3.4 & 3.5 on systems which don't use UTF-8 as their default encoding.
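
To illustrate the mechanism, here is a minimal Python sketch (not OpenRefine's actual code) of what happens when text is always saved as UTF-8 but read back with a non-UTF-8 platform default. The function names are hypothetical, and `cp1252` is only an assumed stand-in for a typical Windows default charset.

```python
def reload_cycle(text: str, platform_charset: str = "cp1252") -> str:
    """Simulate one restart: save as UTF-8, then load with the platform charset."""
    saved_bytes = text.encode("utf-8")
    # Bytes that the platform charset cannot map become U+FFFD instead of raising.
    return saved_bytes.decode(platform_charset, errors="replace")


def after_restarts(text: str, restarts: int, platform_charset: str = "cp1252") -> str:
    """Apply the mismatched save/load cycle once per restart."""
    for _ in range(restarts):
        text = reload_cycle(text, platform_charset)
    return text


if __name__ == "__main__":
    name = "Testing äüöß"
    for n in range(4):
        garbled = after_restarts(name, n)
        # The ASCII prefix survives; the non-ASCII tail grows with every restart.
        print(n, len(garbled), garbled)
```

Each non-ASCII character is written as two or more bytes and read back as that many separate characters, so the garbled part at least doubles on every cycle, while plain ASCII is unaffected. This matches the growth described in the report.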

6 participants