Unescape character entities in raw data #60

mdoering · 2018-03-06T11:21:19Z

Raw source data can contain character data, especially Unicode, that is escaped in various ways.
During import these escapes need to be resolved and flagged. The verbatim data should be preserved, but interpreted data stored anywhere in the db other than the verbatim table should have properly unescaped characters:

xml & html entities
- named &amp;
- hex &#x26;
- decimal &#38;
unicode entities
- U+0026
java unicode entities
- hex \x26
- octal \046
CSS & ECMA Javascript
- Unicode escapes started by "\u": \u00A9
- Unicode code point escapes indicated by "\u{}": \u{2F804}
- Hexadecimal escapes started by "\x": \xA9

Apache commons has libraries for this: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html
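
For illustration, here is a minimal stdlib-only sketch of what resolving some of the forms listed above could look like. It is not the code used in the project: the class and method names (UnescapeSketch, unescape) are made up, only the five predefined XML named entities are covered, and the octal and U+ notations are deliberately left out on the assumption that they are too easily confused with ordinary text. In practice the commons-text StringEscapeUtils methods (unescapeHtml4, unescapeJava, unescapeEcmaScript) cover far more cases.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class UnescapeSketch {

    // One alternative per escape style; the capture group tells us which one matched.
    private static final Pattern ESCAPE = Pattern.compile(
        "\\\\u\\{([0-9A-Fa-f]{1,6})\\}"   // group 1: ECMAScript code point escape with braces
      + "|\\\\u([0-9A-Fa-f]{4})"          // group 2: backslash-u followed by four hex digits
      + "|\\\\x([0-9A-Fa-f]{2})"          // group 3: backslash-x followed by two hex digits
      + "|&#[xX]([0-9A-Fa-f]{1,6});"      // group 4: hexadecimal character reference
      + "|&#([0-9]{1,7});"                // group 5: decimal character reference
      + "|&(amp|lt|gt|quot|apos);");      // group 6: predefined XML named entities

    private static final Map<String, String> NAMED = Map.of(
        "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "apos", "'");

    /** Resolves the supported escapes in a single pass; anything else is copied as-is. */
    static String unescape(String raw) {
        Matcher m = ESCAPE.matcher(raw);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(raw, last, m.start());
            String name = m.group(6);
            if (name != null) {
                out.append(NAMED.get(name));
            } else {
                int cp = codePoint(m);
                if (Character.isValidCodePoint(cp)) {
                    out.appendCodePoint(cp);
                } else {
                    out.append(m.group()); // out-of-range number: keep the escape untouched
                }
            }
            last = m.end();
        }
        out.append(raw, last, raw.length());
        return out.toString();
    }

    // Groups 1 to 4 hold hex digits, group 5 holds decimal digits.
    private static int codePoint(Matcher m) {
        for (int g = 1; g <= 4; g++) {
            if (m.group(g) != null) {
                return Integer.parseInt(m.group(g), 16);
            }
        }
        return Integer.parseInt(m.group(5), 10);
    }
}
```

Because the input is scanned once, a value like &amp;#38; comes out as &#38; rather than being unescaped twice. Whether double-escaped data should be resolved further is a separate decision.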


mdoering · 2018-03-07T11:48:17Z

We should probably also strip XML/HTML tags, as in this title:

A new species of <i>Neamia</i> (Perciformes: Apogonidae) from the West Pacific Ocean.
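
A rough sketch of how such tags might be stripped with a plain regex. This is an assumption about the approach, not what the importer actually does: TagStripSketch and stripTags are hypothetical names, and a regex like this does not handle comments, CDATA or attribute values containing ">".

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class TagStripSketch {

    // '<', an optional '/', a letter, then anything up to the next '>'.
    // Requiring a letter keeps expressions such as "a < b" intact.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^<>]*>");

    /** Removes opening, closing and self-closing tags, keeping the text between them. */
    static String stripTags(String value) {
        return TAG.matcher(value).replaceAll("");
    }
}
```

Applied to the title above, this keeps the genus name and drops only the <i> and </i> markers.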

mdoering · 2018-03-07T16:07:39Z

Now using an UnescapedVerbatimRedord wrapper class that does most of the job and remembers whether any values have been modified, so we can flag an issue: eee5d67
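
To make the idea concrete, a wrapper along these lines could look roughly like the sketch below. It is not the class from the commit: the name UnescapingRecordSketch, the Map-based storage and the pluggable cleaner function are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical wrapper, for illustration only; not the class from the commit.
final class UnescapingRecordSketch {
    private final Map<String, String> verbatim;   // untouched source values by term
    private final UnaryOperator<String> cleaner;  // e.g. unescaping plus tag stripping
    private boolean modified = false;

    UnescapingRecordSketch(Map<String, String> verbatim, UnaryOperator<String> cleaner) {
        this.verbatim = verbatim;
        this.cleaner = cleaner;
    }

    /** Returns the verbatim value exactly as it appeared in the source. */
    String getRaw(String term) {
        return verbatim.get(term);
    }

    /** Returns the cleaned value and remembers whether cleaning changed anything. */
    String get(String term) {
        String raw = verbatim.get(term);
        if (raw == null) {
            return null;
        }
        String clean = cleaner.apply(raw);
        if (!clean.equals(raw)) {
            modified = true;
        }
        return clean;
    }

    /** True once any value read through get() differed from its verbatim form. */
    boolean isModified() {
        return modified;
    }
}
```

After interpreting a record, the importer would check isModified() and attach an issue flag if it returns true, while the verbatim map stays unchanged for storage in the verbatim table.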

mdoering · 2018-03-07T17:21:39Z

Addresses some concerns expressed in CatalogueOfLife/general#37

mdoering self-assigned this Mar 6, 2018

mdoering added this to the Datasource Staging API milestone Mar 6, 2018

mdoering closed this as completed Mar 7, 2018

mdoering mentioned this issue Mar 7, 2018

chinese char encoding looks wrong #56

Closed

mdoering added the issue rules label Jul 2, 2018
