Rdf4j does not parse URLs with | and/or # characters inside #43

Closed
kcmcleod opened this issue Oct 31, 2019 · 6 comments

@kcmcleod
Contributor

E.g., https://www.alliancegenome.org/gene/MGI:2442292 produces the following error:

[Rio error] Unexpected character U+7C at index 59: https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Lato:400,700 (42, -1)
Cannot parse triples into a model
org.eclipse.rdf4j.rio.RDFParseException: Unexpected character U+7C at index 59: https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Lato:400,700 [line 42]
	at org.eclipse.rdf4j.rio.helpers.RDFParserHelper.reportError(RDFParserHelper.java:322)
	at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.reportError(AbstractRDFParser.java:673)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.reportError(NTriplesParser.java:684)
	at org.eclipse.rdf4j.rio.helpers.AbstractRDFParser.createURI(AbstractRDFParser.java:424)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.createURI(NTriplesParser.java:623)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.parseObject(NTriplesParser.java:394)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.parseTriple(NTriplesParser.java:292)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:179)
	at org.eclipse.rdf4j.rio.ntriples.NTriplesParser.parse(NTriplesParser.java:118)
	at org.eclipse.rdf4j.rio.Rio.parse(Rio.java:298)
	at org.eclipse.rdf4j.rio.Rio.parse(Rio.java:224)
	at hwu.elixir.scrape.scraper.ScraperCore.createModelFromNTriples(ScraperCore.java:551)
	at hwu.elixir.scrape.scraper.ScraperCore.processTriples(ScraperCore.java:427)
	at hwu.elixir.scrape.scraper.ScraperCore.scrape(ScraperCore.java:944)
	at hwu.elixir.scrape.scraper.ServiceScraper.scrape(ServiceScraper.java:44)
	at hwu.elixir.scrape.scraper.ServiceScraper.main(ServiceScraper.java:79)
@kcmcleod kcmcleod added the bug Something isn't working label Oct 31, 2019
@kcmcleod kcmcleod self-assigned this Oct 31, 2019
@kcmcleod kcmcleod added this to To do in Scraper via automation Oct 31, 2019
@AlasdairGray
Member

Certain characters, including |, are deemed unsafe: https://www.urlencoder.io/faq/

Have you contacted the rdf4j support list about this issue? Do other URL libs also suffer?

@kcmcleod
Contributor Author

kcmcleod commented Nov 6, 2019

Do other URL libs also suffer?

This is not a URL library, but an RDF one. At this stage the HTML has been parsed by Any23 and converted into triples. The triples are then passed into RDF4J in order to build a model which can be processed (i.e. triples filtered, BNodes removed, provenance added, etc.).

RDF4J refuses to parse the triples due to the illegal character in the IRI.
My solution to this is to catch the exception, do a global search and replace on the | character and then redo the parse. This works.
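
The catch, replace and re-parse approach described above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the class `PipeRetry`, its methods and the toy parser are hypothetical names, the `Function<String, T>` argument stands in for the real RDF4J parse call, and the choice of `%7C` as the replacement text is an assumption (the comment only says that the | is replaced).

```java
import java.util.function.Function;

/** Sketch of the "catch, replace |, parse again" workaround. */
final class PipeRetry {

    /** Global replace of | with its percent-encoded form (the replacement text is an assumption). */
    static String escapePipes(String nTriples) {
        return nTriples.replace("|", "%7C");
    }

    /**
     * Tries the parser on the raw text; if it throws, escapes every | and tries once more.
     * The parser argument is a stand-in for the real RDF parser call.
     */
    static <T> T parseWithRetry(String nTriples, Function<String, T> parser) {
        try {
            return parser.apply(nTriples);
        } catch (RuntimeException firstFailure) {
            // A failure on the second attempt is not caught and reaches the caller.
            return parser.apply(escapePipes(nTriples));
        }
    }

    public static void main(String[] args) {
        // Toy parser: rejects any input containing |, otherwise returns the text length.
        Function<String, Integer> toyParser = text -> {
            if (text.indexOf('|') >= 0) {
                throw new IllegalArgumentException("Unexpected character U+7C");
            }
            return text.length();
        };
        String triple = "<http://example.org/page> <http://example.org/css> "
                + "<https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Lato:400,700> .";
        System.out.println(parseWithRetry(triple, toyParser));
    }
}
```

Note that a global replace also rewrites any | that appears inside a literal, not only those inside IRIs, which is one reason this is a workaround and not a clean fix.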

In terms of actual URL libs, behaviour varies. For example, Apache commons accepts the | but rejects the ( and ). Brackets are reserved characters. However, other libraries identify everything after the ? as the query string and do not parse it, so the URL is valid for them.
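
As one more data point on how libraries differ, the JDK's own `java.net.URI` can be probed with a few lines. The class `UriProbe` and its method are hypothetical names used only for this sketch; to the best of my knowledge `java.net.URI` rejects an unescaped | but accepts ( and ), so it is worth running against whichever JDK you use.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Small probe for what java.net.URI accepts. */
final class UriProbe {

    /** Returns true if java.net.URI parses the string as-is, false if it throws. */
    static boolean accepted(String candidate) {
        try {
            new URI(candidate);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] candidates = {
            "https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Lato:400,700",
            "https://fonts.googleapis.com/css?family=Droid+Serif:400,700%7CLato:400,700",
            "http://example.org/a(b)"
        };
        for (String candidate : candidates) {
            System.out.println(accepted(candidate) + "  " + candidate);
        }
    }
}
```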

Have you contacted the rdf4j support list about this issue?

They are complying with the https://tools.ietf.org/html/rfc3987 standard:

Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely "<", ">", '"', space, "{", "}", "|", "\", "^", and "`", in step 2 above. If these
characters are found but are not converted, then the conversion SHOULD fail.

The capitals are part of the standard, not added by me. Thus a system is not required to encode these characters before parsing the URL, but if they are left unencoded the conversion should fail. This is exactly what RDF4J is doing.
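
To make the quoted rule concrete, here is a minimal sketch that scans a string for the ten characters listed in the RFC 3987 passage above and reports the first one found. It is not how RDF4J implements its check; the class `IriCharCheck` and its method are hypothetical names for illustration only.

```java
/** Finds the printable US-ASCII characters that the quoted RFC 3987 passage lists. */
final class IriCharCheck {

    /** The listed characters: < > " space { } | \ ^ ` */
    private static final String DISALLOWED = "<>\" {}|\\^`";

    /** Returns the index of the first listed character, or -1 if there is none. */
    static int firstDisallowedIndex(String iri) {
        for (int i = 0; i < iri.length(); i++) {
            if (DISALLOWED.indexOf(iri.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String iri = "https://fonts.googleapis.com/css?family=Droid+Serif:400,700|Lato:400,700";
        System.out.println(firstDisallowedIndex(iri));
    }
}
```

For the Google Fonts URL from the error report this returns 59, the same index that the Rio error message gives for U+7C.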

Of course we could point out that the use of () is against the standard and they are accepting them... so why fail over the use of a |? However, this seems a bit of a minefield.

@kcmcleod kcmcleod added work around Work around found for bug. Perhaps not best solution. bug Something isn't working and removed bug Something isn't working labels Nov 6, 2019
@AlasdairGray
Member

It is good to know that rdf4j is doing the right thing and that we have a workaround.

We should probably feed back to Genome Alliance that their URLs contain illegal characters and that this may cause problems for some applications consuming their data.

Scraper automation moved this from To do to Done Nov 18, 2019
@petrospaps
Contributor

The mobidb sitemap is also causing the same problem, with the character U+23 "#".

@petrospaps petrospaps self-assigned this Nov 8, 2020
@petrospaps petrospaps changed the title Rdf4j does not parse URLs with | characters inside Rdf4j does not parse URLs with | and/or # characters inside Nov 9, 2020
@petrospaps
Contributor

A fix has been pushed to the dev branch; it will be moved to the master branch once thoroughly tested.

@petrospaps petrospaps reopened this Nov 9, 2020
Scraper automation moved this from Done to In progress Nov 9, 2020
@petrospaps
Contributor

The # character is valid in a URL; there is only a problem when the URL contains more than one # character. The code was changed back so that it no longer removes the # character.
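
A check for the problematic case, more than one # in a URL, could look like the sketch below. This is an assumption about how such URLs might be detected, not the code that was pushed to the dev branch; the class `FragmentCheck` and its method are hypothetical names.

```java
/** Detects URLs that contain more than one # character. */
final class FragmentCheck {

    /** Returns true only if a second # follows the first one. */
    static boolean hasMultipleHashes(String url) {
        int first = url.indexOf('#');
        return first >= 0 && url.indexOf('#', first + 1) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(hasMultipleHashes("http://example.org/page#section"));
        System.out.println(hasMultipleHashes("http://example.org/page#section#extra"));
    }
}
```

A URL with a single # (an ordinary fragment) is left alone, which matches the decision above not to remove the # character in general.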

Scraper automation moved this from In progress to Done Nov 20, 2020