Open XML file from URL generates lots of empty lines #1095

Open
psychemedia opened this Issue Nov 27, 2015 · 8 comments

psychemedia commented Nov 27, 2015

In 2.6-rc.2, installed on Linux from https://github.com/OpenRefine/OpenRefine/releases/download/2.6-rc.2, I get blank lines displayed when importing XML from a URL.

E.g., importing XML from http://api.worldbank.org/countries/all/indicators/SP.POP.TOTL?date=2000:2001

image

Empty lines:

image

If the parsing of this can't be fixed, would an "ignore empty rows" setting be useful?

jackyq2015 Dec 2, 2015

Contributor

Confirmed it's a bug. The extra empty lines seem to be generated by one wb:data node.

Do you see a similar issue with other XML files?

psychemedia Dec 3, 2015

@jackyq2015 I haven't tried any others recently. I'll let you know if I find other examples.

joewiz Dec 3, 2015

Contributor

This reminds me that I saw this too on all XML files when I first began using OR. I decided to pre-process my XML because I needed to move ahead on the project, so I never reported the bug. This would be a nice one to fix. If additional files are needed to reproduce the bug I could dig some up.

jackyq2015 Dec 3, 2015

Contributor

@joewiz May I ask how you pre-processed the XML to make it work?

Contributor

jackyq2015 commented Dec 3, 2015

@joewiz May I know how did you pre-process the XML in order to make it work?

@joewiz

This comment has been minimized.

Show comment
Hide comment
@joewiz

joewiz Dec 3, 2015

Contributor

@jackyq2015 Sorry, what I meant was that I converted the XML to TSV outside of OR before importing into it. I used XQuery - using a script like this one: https://gist.github.com/joewiz/48ce061423aa7d3ada28.
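For readers without an XQuery processor, the same kind of pre-processing can be sketched in Python. This is not the linked XQuery script, only a minimal illustration under assumptions: every direct child of the root element is treated as one record, each of its children becomes a column, and the element names in `SAMPLE` (`rows`, `row`, `country`, `date`) are hypothetical rather than taken from the World Bank feed.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical pretty-printed input; the element names are illustrative only.
SAMPLE = """<rows>
  <row>
    <country>Arab World</country>
    <date>2001</date>
  </row>
  <row>
    <country>Arab World</country>
    <date>2000</date>
  </row>
</rows>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def xml_to_tsv(xml_text):
    """Turn each direct child of the root into one TSV row.

    Only element text is read; the whitespace between tags (the 'tail'
    strings in ElementTree) is ignored, so it cannot become empty rows.
    """
    root = ET.fromstring(xml_text)
    columns = []
    rows = []
    for record in root:
        row = {}
        for field in record:
            name = local_name(field.tag)
            row[name] = (field.text or "").strip()
            if name not in columns:
                columns.append(name)
        rows.append(row)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row.get(name, "") for name in columns])
    return out.getvalue()


if __name__ == "__main__":
    print(xml_to_tsv(SAMPLE), end="")
```

The resulting TSV can then be imported with the tabular importer instead of the XML one, which sidesteps the whitespace handling discussed in this issue.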

ettorerizza Dec 4, 2015

Member

With OpenRefine RC2 (Windows), I have a lot of problems parsing XML
files as simple as this:
https://www.dropbox.com/s/acj3x3b8wl3bkso/4408000.xml?dl=0
I just get a white screen.
Google Refine 2.5 does not have this issue.

tfmorris Dec 4, 2015

Member

The issue is caused by whitespace between tags and a code path that the "trim whitespace" flag doesn't affect. With that fixed, turning off "preserve empty strings" (on by default) and turning on "trim whitespace" (off by default) makes the XML import generate a much more compact table.

We changed some of the import setting defaults in 2.6 to disable transformations which aren't reversible, but XML whitespace is kind of a special case, so I'll take a closer look to see if there's a better way to fix this.

@ettorerizza Your file imports for me, but most of the populated data columns are off to the far right of the screen. In other words, it behaves the same as Tony's example.
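To make the interaction of the two options concrete, here is a toy model in Python. It is not OpenRefine's importer code (which is Java); the function `text_cells` and its parameters are hypothetical names that only mirror the "trim whitespace" and "preserve empty strings" options, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical pretty-printed document: the newlines and indentation
# between tags are delivered by the parser as ordinary text.
PRETTY = "<rows>\n  <row>\n    <value> 1 </value>\n  </row>\n</rows>"


def text_cells(xml_text, trim_whitespace=False, preserve_empty_strings=True):
    """Collect every text string the parser reports, modelling two import flags."""
    root = ET.fromstring(xml_text)
    cells = []
    for elem in root.iter():
        # elem.text is the text before the first child;
        # elem.tail is the text after the element's closing tag.
        for text in (elem.text, elem.tail):
            if text is None:
                continue
            if trim_whitespace:
                text = text.strip()
            if text == "" and not preserve_empty_strings:
                continue
            cells.append(text)
    return cells


if __name__ == "__main__":
    # Defaults as described for 2.6: nothing trimmed, empty strings kept.
    print(text_cells(PRETTY))
    # Trimming on and empty strings dropped: only the real value survives.
    print(text_cells(PRETTY, trim_whitespace=True, preserve_empty_strings=False))
```

In this model, trimming alone still leaves empty strings behind, which is why both settings have to change together to get the compact table.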

tfmorris pushed a commit to tfmorris/OpenRefine that referenced this issue Dec 4, 2015

tfmorris added a commit to tfmorris/OpenRefine that referenced this issue Mar 22, 2016

tfmorris Mar 22, 2016

Member

I revisited this and have a better solution, but there is an issue that I'm unsure how to deal with. Without a DTD or XML Schema, the parsing mode that XML parsers use is the "Mixed" mode, where an element can have text, nested elements, or both. If there's a DTD/Schema and it says that an element is not a mixed element, then any text can be discarded by the parser. Without a schema, there's no way to distinguish pretty-print whitespace from element text.

We can drop all whitespace-only strings that occur in mixed elements, but I've got a nagging concern that this may cause other issues. Am I being overly paranoid or do we need yet another toggle to control this?
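The trade-off can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the proposed OpenRefine change: `drop_whitespace_only` and `flat_text` are hypothetical helpers, and both sample documents are invented. The first sample shows the intended effect on pretty-printed data; the second shows one case where a whitespace-only string between inline elements carries meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical samples: data-style XML versus genuinely mixed content.
PRETTY = "<row>\n  <value>1</value>\n</row>"
MIXED = "<p><b>Hello</b> <i>world</i></p>"


def drop_whitespace_only(elem):
    """Remove whitespace-only text and tail strings from elem, in place."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None
    return elem


def flat_text(xml_text, drop=False):
    """Concatenate all text in document order, optionally after dropping."""
    root = ET.fromstring(xml_text)
    if drop:
        drop_whitespace_only(root)
    return "".join(root.itertext())


if __name__ == "__main__":
    # Pretty-print whitespace disappears, which is the desired outcome.
    print(repr(flat_text(PRETTY, drop=True)))
    # The single space between <b> and <i> is lost as well.
    print(repr(flat_text(MIXED)), "->", repr(flat_text(MIXED, drop=True)))
```

Whether the second case matters in practice depends on how often document-style XML, as opposed to data-style XML, is imported, which is the question a toggle would answer.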

@tfmorris tfmorris added the bug label Apr 20, 2016

@wetneb wetneb added the import label Aug 2, 2017

@wetneb wetneb added import/xml and removed import labels Sep 18, 2017
