Unescape character entities in raw data #60

Closed
mdoering opened this issue Mar 6, 2018 · 3 comments

mdoering commented Mar 6, 2018

Raw source data can contain character data, especially Unicode, that is escaped in various ways.
During import these escapes need to be resolved and flagged. The verbatim data should be preserved as-is, but interpreted data stored anywhere in the db other than the verbatim table should have properly unescaped characters:

  • XML & HTML entities
    • named &amp;
    • hex &#x26;
    • decimal &#38;
  • Unicode entities
    • U+0026
  • Java Unicode entities
    • hex \x26
    • octal \046
  • CSS & ECMAScript (JavaScript)
    • Unicode escapes started by "\u": \u00A9
    • Unicode code point escapes indicated by "\u{}": \u{2F804}
    • Hexadecimal escapes started by "\x": \xA9

Apache Commons Text has a utility class for this: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html
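
To make the list above concrete, here is a minimal sketch that resolves a subset of these escapes (a few named entities, decimal and hex numeric entities, and four-digit backslash-u escapes) in a single pass. It uses only the JDK, so it is not the Commons Text implementation and not necessarily what the importer does; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class EscapeSketch {
    // Group 1: decimal entity, group 2: hex entity,
    // group 3: four hex digits after backslash-u, group 4: a few named entities.
    private static final Pattern ESCAPE = Pattern.compile(
        "&#(\\d{1,7});|&#[xX]([0-9a-fA-F]{1,6});|\\\\u([0-9a-fA-F]{4})|&(amp|lt|gt|quot|apos);");

    static String unescape(String raw) {
        if (raw == null) return null;
        Matcher m = ESCAPE.matcher(raw);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(raw, last, m.start()); // copy text before the escape
            out.append(decode(m));            // append the decoded character
            last = m.end();
        }
        out.append(raw, last, raw.length());
        return out.toString();
    }

    private static String decode(Matcher m) {
        int cp;
        if (m.group(1) != null) {
            cp = Integer.parseInt(m.group(1));
        } else if (m.group(2) != null) {
            cp = Integer.parseInt(m.group(2), 16);
        } else if (m.group(3) != null) {
            cp = Integer.parseInt(m.group(3), 16);
        } else {
            switch (m.group(4)) {
                case "amp":  return "&";
                case "lt":   return "<";
                case "gt":   return ">";
                case "quot": return "\"";
                default:     return "'";
            }
        }
        // Leave the escape untouched if it does not name a valid code point.
        return Character.isValidCodePoint(cp) ? new String(Character.toChars(cp)) : m.group();
    }
}
```

Because the input is scanned once, a doubly escaped value such as `&amp;#38;` is only unescaped one level (to `&#38;`); whether the importer should unescape repeatedly is a separate design question.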

@mdoering mdoering self-assigned this Mar 6, 2018
@mdoering mdoering added this to the Datasource Staging API milestone Mar 6, 2018

mdoering commented Mar 7, 2018

We should probably also strip XML/HTML tags, such as in this title:

A new species of <i>Neamia</i> (Perciformes: Apogonidae) from the West Pacific Ocean.
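
A rough sketch of such tag stripping, assuming a simple regular expression is acceptable for short values like titles (it is not a full HTML parser, and the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class TagStripSketch {
    // "<", optional "/", a tag name, optional attributes, optional "/", ">".
    // Requiring a letter right after "<" keeps text like "a < b" intact.
    private static final Pattern TAG =
        Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*(\\s[^<>]*)?/?>");

    static String stripTags(String value) {
        return value == null ? null : TAG.matcher(value).replaceAll("");
    }
}
```

Applied to the title above, `stripTags` removes the `<i>` and `</i>` markers and leaves the genus name Neamia in place.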


mdoering commented Mar 7, 2018

Now using an UnescapedVerbatimRedord wrapper class that does most of the job and remembers whether any values have been modified, so we can flag an issue: eee5d67
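
The idea of a wrapper that cleans values on access and remembers whether anything changed could look roughly like the sketch below. This is not the code from the commit; the class name, the map-based record, and the pluggable `cleaner` function are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical wrapper, for illustration only.
class UnescapingRecordSketch {
    private final Map<String, String> verbatim;   // raw values, never altered
    private final UnaryOperator<String> cleaner;  // e.g. unescaping plus tag stripping
    private boolean modified = false;

    UnescapingRecordSketch(Map<String, String> verbatim, UnaryOperator<String> cleaner) {
        this.verbatim = new LinkedHashMap<>(verbatim);
        this.cleaner = cleaner;
    }

    // Returns the cleaned value and records whether cleaning changed it.
    String get(String term) {
        String raw = verbatim.get(term);
        if (raw == null) return null;
        String clean = cleaner.apply(raw);
        if (!clean.equals(raw)) modified = true;
        return clean;
    }

    // Returns the untouched verbatim value.
    String getRaw(String term) {
        return verbatim.get(term);
    }

    // True once any value returned by get() differed from its raw form.
    boolean isModified() {
        return modified;
    }
}
```

After interpreting a record through `get`, the importer could check `isModified()` once and attach a single issue flag, while `getRaw` keeps the verbatim data available for the verbatim table.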


mdoering commented Mar 7, 2018

Addresses some concerns expressed in CatalogueOfLife/general#37
