Rdf4j does not parse URLs with | and/or # characters inside #43

kcmcleod · 2019-10-31T12:09:26Z

E.g., https://www.alliancegenome.org/gene/MGI:2442292 produces the following error:

[Rio error] Unexpected character U+7C at index 59: https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Lato:400,700 (42, -1)
Cannot parse triples into a model
org.eclipse.rdf4j.rio.RDFParseException: Unexpected character U+7C at index 59: https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Lato:400,700 [line 42]
	at org.eclipse.rdf4j.rio.helpers.RDFParserHelper.reportError(RDFParserHelper.java:322)
	at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.reportError(AbstractRDFParser.java:673)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.reportError(NTriplesParser.java:684)
	at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.createURI(AbstractRDFParser.java:424)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.createURI(NTriplesParser.java:623)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.parseObject(NTriplesParser.java:394)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.parseTriple(NTriplesParser.java:292)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:179)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:118)
	at org.eclipse.rdf4j.rio.Rio.parse(Rio.java:298)
	at org.eclipse.rdf4j.rio.Rio.parse(Rio.java:224)
	at hwu.elixir.scrape.scraper.ScraperCore.createModelFromNTriples(ScraperCore.java:551)
	at hwu.elixir.scrape.scraper.ScraperCore.processTriples(ScraperCore.java:427)
	at hwu.elixir.scrape.scraper.ScraperCore.scrape(ScraperCore.java:944)
	at hwu.elixir.scrape.scraper.ServiceScraper.scrape(ServiceScraper.java:44)
	at hwu.elixir.scrape.scraper.ServiceScraper.main(ServiceScraper.java:79)

AlasdairGray · 2019-11-06T09:22:05Z

Certain characters, including |, are deemed unsafe: https://www.urlencoder.io/faq/

Have you contacted the rdf4j support list about this issue? Do other URL libs also suffer?

kcmcleod · 2019-11-06T11:04:49Z

> Do other URL libs also suffer?

This is not a URL library, but an RDF one. At this stage the HTML has been parsed by Any23 and converted into triples. The triples are then passed into RDF4J in order to build a model which can be processed (i.e., triples filtered, BNodes removed, provenance added, etc.).

RDF4J refuses to parse the triples due to the illegal character in the IRI.
My solution to this is to catch the exception, do a global search and replace on the | character and then redo the parse. This works.
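
A minimal sketch of the catch-and-retry workaround described above, not the actual ScraperCore code. `TripleParser` is a hypothetical stand-in for the RDF4J parse call, and replacing `|` with its percent-encoded form `%7C` is an assumption, since the comment does not say what the character was replaced with.

```java
final class PipeRetry {

    /** Hypothetical stand-in for the real parse step (e.g. a wrapper around Rio.parse). */
    interface TripleParser<M> {
        M parse(String nTriples);
    }

    /** Global search and replace: every '|' becomes "%7C" (assumed replacement). */
    static String encodePipes(String nTriples) {
        return nTriples.replace("|", "%7C");
    }

    /** Try the input as-is; if the parser rejects it, encode the pipes and parse once more. */
    static <M> M parseWithRetry(String nTriples, TripleParser<M> parser) {
        try {
            return parser.parse(nTriples);
        } catch (RuntimeException firstFailure) {
            // A second failure is not caught and propagates to the caller.
            return parser.parse(encodePipes(nTriples));
        }
    }

    public static void main(String[] args) {
        // Toy parser that rejects any input containing '|', imitating the Rio error.
        TripleParser<String> strict = s -> {
            if (s.contains("|")) {
                throw new IllegalArgumentException("Unexpected character U+7C");
            }
            return s;
        };
        String line = "<http://example.org/s> <http://example.org/p> "
                + "<https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Lato:400,700> .";
        System.out.println(parseWithRetry(line, strict));
    }
}
```

Note that a global replace also rewrites any `|` that appears inside literals, not only inside IRIs, which is one reason this is a workaround rather than a fix.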

In terms of actual URL libs, behaviour varies. For example, Apache Commons accepts the | but rejects ( and ). Brackets are reserved characters. However, other libraries treat everything after the ? as the query string and do not parse it, so the URL is valid for them.

> Have you contacted the rdf4j support list about this issue?

They are complying with the https://tools.ietf.org/html/rfc3987 standard:

> Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", "}", "|", "\", "^", and "`", in step 2 above. If these characters are found but are not converted, then the conversion SHOULD fail.

The capitals are part of the standard, not added by me. So a system is not required to encode these characters before parsing, but if they are left unencoded the conversion should fail. This is exactly what RDF4J is doing.
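
To make the rule concrete, here is a small sketch that scans an IRI for the ten printable US-ASCII characters named in the RFC 3987 passage and reports the first offending index. It is an illustration only, not RDF4J's implementation; the class and method names are made up for this example.

```java
final class IriCharCheck {

    // The characters listed in the RFC 3987 passage:
    // < > " space { } | \ ^ `
    private static final String DISALLOWED = "<>\" {}|\\^`";

    /** Returns the index of the first disallowed character, or -1 if there is none. */
    static int firstDisallowedIndex(String iri) {
        for (int i = 0; i < iri.length(); i++) {
            if (DISALLOWED.indexOf(iri.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String iri = "https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Lato:400,700";
        // Prints 59, the same index reported in the Rio error message.
        System.out.println(firstDisallowedIndex(iri));
    }
}
```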

Of course we could point out that the use of () is against the standard and they are accepting them... so why fail over the use of a |. However this seems a bit of a minefield.

AlasdairGray · 2019-11-07T09:21:39Z

It is good to know that rdf4j is doing the right thing and that we have a workaround.

We should probably feed back to Genome Alliance that their URLs contain illegal characters and that this may cause problems for some applications consuming their data.

petrospaps · 2020-11-08T20:17:30Z

mobidb sitemap is also causing the same problem with character U+23 "#".

petrospaps · 2020-11-09T08:43:27Z

A fix has been pushed to the dev branch. It will be moved to the master branch once thoroughly tested.

petrospaps · 2020-11-20T13:33:21Z

The # character is valid in a URL; there is only a problem when the URL contains more than one # character. The code was changed back so that it no longer removes the # character.
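
A short sketch of how the problematic case could be detected, assuming the only condition of interest is "more than one #". The helper name is hypothetical, and what the scraper should then do with such a URL is left open here, as it is in the thread.

```java
final class FragmentCheck {

    /** True when the URL contains two or more '#' characters. */
    static boolean hasMultipleHashes(String url) {
        int first = url.indexOf('#');
        // Look for a second '#' strictly after the first one.
        return first >= 0 && url.indexOf('#', first + 1) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(hasMultipleHashes("http://example.org/page#section"));
        System.out.println(hasMultipleHashes("http://example.org/page#a#b"));
    }
}
```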

kcmcleod added the bug Something isn't working label Oct 31, 2019

kcmcleod self-assigned this Oct 31, 2019

kcmcleod added this to To do in Scraper via automation Oct 31, 2019

kcmcleod mentioned this issue Oct 31, 2019

Sites I cannot scrape #8

Closed

kcmcleod added the work around (Work around found for bug. Perhaps not best solution.) and bug (Something isn't working) labels and removed the bug (Something isn't working) label Nov 6, 2019

kcmcleod closed this as completed in 52bc366 Nov 18, 2019

Scraper automation moved this from To do to Done Nov 18, 2019

petrospaps self-assigned this Nov 8, 2020

petrospaps changed the title ~~Rdf4j does not parse URLs with | characters inside~~ Rdf4j does not parse URLs with | and/or # characters inside Nov 9, 2020

petrospaps reopened this Nov 9, 2020

Scraper automation moved this from Done to In progress Nov 9, 2020

petrospaps closed this as completed Nov 20, 2020

Scraper automation moved this from In progress to Done Nov 20, 2020
