Skip to content

Tab character in a value in a JSON file results in a truncated data import #1164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
ostephens opened this issue Dec 9, 2016 · 11 comments
Closed
Assignees
Labels
import About importers in general - add a label for the data format if available JSON JSON as a data import/export format Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements.
Milestone

Comments

@ostephens
Copy link
Member

ostephens commented Dec 9, 2016

When importing JSON to OpenRefine through 'Create Project', if there is a tab inside a value in the JSON file, the import succeeds but only imports data up to the tab character. e.g. JSON:

{ "rows": [ { "Content": "A" }, { "Content": " B" }, { "Content": "C" } ] }
(tab character is before the 'B')

Since tab is an illegal character in this position (should be escaped as \t) it makes sense for the import to fail - but failing silently is a problem.

Not sure what the best solution is - three possibilities:

  1. Import should fail completely with an error
  2. Import should succeed with the data truncated as currently but with feedback to the user that the truncation has occurred and why
  3. Import should succeed with full data, converting illegal character to escaped version
@thadguidry
Copy link
Member

I think 3 would be most appropriate given OpenRefine is a data cleansing tool. And we should try to cleanse a bit of that. Folks expect it not to fail for a String value and we can just escape for those cases.

@thadguidry
Copy link
Member

@wetneb wetneb added Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements. import About importers in general - add a label for the data format if available labels Aug 2, 2017
@wetneb wetneb added JSON JSON as a data import/export format and removed import About importers in general - add a label for the data format if available labels Sep 18, 2017
@wetneb wetneb added the import About importers in general - add a label for the data format if available label Dec 22, 2019
@cz2h
Copy link
Member

cz2h commented Mar 17, 2020

Hi, I am now working with this issue and had make some progress.

Just double check if I understand the expected effect correct: file with { "rows": [ { "Content": "A" }, { "Content": " B" }, { " Content": "C" } ] } should be able to get loaded and there are two types of Content, one with a tab and one do not. And there is a tab before value B .

@wetneb
Copy link
Member

wetneb commented Mar 17, 2020

I am not sure what you mean by "two types of Content": I would expect A, B and C to end up in the same column.

@cz2h
Copy link
Member

cz2h commented Mar 17, 2020

I see, so unquoted control chars (like \t, \n) should be ignored during the parsing right?

@wetneb
Copy link
Member

wetneb commented Mar 17, 2020

I would not ignore these unescaped control characters - see the 3 options mentioned by Owen above.

@tfmorris
Copy link
Member

This currently does produce an error:

Illegal unquoted character ((CTRL-CHAR, code 9)): has to be escaped using backslash to be included in string value

although it doesn't happen until you make your record selection (ie not in initial preview).

Is there ever a time that we don't want to be permissive in parsing? ie Do we really need an import option for this? It's an invalid JSON file, but we're a data wrangling tool, not a validation tool.

@thadguidry
Copy link
Member

thadguidry commented May 22, 2020

We clean messy data. We DO NOT clean messy formats? I guess that's what we should adhere to since we are dependent on a large ecosystem of helper libraries for parsing and validating before importing the data.

In general, I take back my comment above that we should offer an option to cleanse unescaped values to clean messy JSON, since I was not aware that this was in fact an invalid or unknown format parsing error for JSON (Thanks to @tfmorris for helping me understand now on this)

Let's close this issue if we all agree and that our existing error message is adequate enough for this.
Or if we need to work further on a fix regarding @ostephens possible solution 3:

Import should fail completely with an error

@tfmorris
Copy link
Member

I think option 3, permissive parsing, always enabled with no option setting is the right answer here.

  1. Import should succeed with full data, converting illegal character to escaped version

More generally, I'd really like to avoid the situation where someone works on a PR that addresses an issue only to have the code review devolved into a labored discussion as to whether the original issue was valid or not. Let's get that resolved up front and not flip flop on our decisions. We have a number of open PRs where this is happening and it's got to be discouraging to new contributors.

@thadguidry
Copy link
Member

@tfmorris Ah, so you agreed with my original feeling, OK. I thought you were disagreeing. The flip-flopping happens because I read one thing instead of hearing you voice the comment. So I think about it more, because I value your opinion @tfmorris and think around it. No problem. I'm very used to code review opening up more questions back to the original design issue #. I think new contributors should be comfortable with some of that; they'll navigate that in the real world soon enough. But agree that the issues are where we should do the collaborating on the design or on the dev mailing list.

@wetneb wetneb added this to the 3.5 milestone Jun 16, 2020
@wetneb wetneb closed this as completed Jun 16, 2020
@wetneb
Copy link
Member

wetneb commented Jun 16, 2020

Closed by #2431

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
import About importers in general - add a label for the data format if available JSON JSON as a data import/export format Type: Feature Request Identifies requests for new features or enhancements. These involve proposing new improvements.
Projects
None yet
Development

No branches or pull requests

5 participants