Rdf4j does not parse URLs with | and/or # characters inside #43
Comments
Certain characters including
Have you contacted the RDF4J support list about this issue? Do other URL libraries also suffer?
This is not a URL library but an RDF one. At this stage the HTML has been parsed by Any23 and converted into triples. The triples are then passed into RDF4J in order to build a model which can be processed (i.e., triples filtered, BNodes removed, provenance added, etc.). RDF4J refuses to parse the triples due to the illegal character in the IRI. Among actual URL libraries, behaviour varies. For example, Apache Commons accepts the | but rejects the ( and ); brackets are reserved characters. However, other libraries treat everything after the ? as the query string and do not parse it, so the URL is valid for them.
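As an illustration of the lenient behaviour described above, here is a minimal sketch using Python's standard `urllib.parse.urlsplit`. It is not one of the libraries discussed in this thread, and the URL is a made-up example; the point is only that a splitter of this kind hands back the query string without validating its characters.

```python
from urllib.parse import urlsplit

# Hypothetical URL; the query deliberately contains '|', '(' and ')'.
url = "https://example.org/search?q=a|b(c)"

parts = urlsplit(url)

# urlsplit only splits on delimiters; it does not reject
# characters in the query that a stricter IRI parser would refuse.
print(parts.path)   # /search
print(parts.query)  # q=a|b(c)
```

A strict IRI parser, as RDF4J is described here, would instead fail on the same input.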
They are complying with the https://tools.ietf.org/html/rfc3987 standard:
The capitals are part of the standard, not added by me. Thus you do not have to encode the URL before parsing it, but if it is not encoded you must fail it. This is exactly what RDF4J is doing. Of course, we could point out that the use of () is also against the standard and yet they are accepting them, so why fail over the use of a |? However, this seems a bit of a minefield.
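The thread does not say how the URLs are made acceptable before they reach RDF4J. One possible approach, sketched here purely as an assumption, is to percent-encode the offending characters first. The helper name `encode_illegal_chars` and the example URL are hypothetical; `urllib.parse.quote` is the Python standard library function.

```python
from urllib.parse import quote

# Characters left untouched: the RFC 3986 reserved characters plus '%',
# so that existing percent-escapes are not encoded a second time.
# quote() always leaves letters, digits and "_.-~" alone.
SAFE = ":/?#[]@!$&'()*+,;=%"

def encode_illegal_chars(iri: str) -> str:
    """Hypothetical helper: percent-encode characters such as '|'."""
    return quote(iri, safe=SAFE)

print(encode_illegal_chars("https://example.org/gene/A|B"))
# https://example.org/gene/A%7CB
```

Note that this sketch leaves every `#` in place, so it does not address URLs that contain more than one `#`.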
It is good to know that RDF4J is doing the right thing and that we have a workaround. We should probably feed back to Genome Alliance that their URLs contain illegal characters and that this may cause problems for some applications consuming their data.
The mobidb sitemap is also causing the same problem, with the character U+0023 "#".
A fix has been pushed to the dev branch; it will be moved to the master branch once thoroughly tested.
The # character is valid in a URL; there is only a problem when the URL contains more than one # character. The code was changed back so that it no longer removes the # character.
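The thread does not show how the fix treats repeated `#` characters. As a sketch under that caveat, one option is to keep the first `#` as the fragment delimiter and percent-encode any later ones. The helper name `encode_extra_hashes` and the URLs are hypothetical.

```python
def encode_extra_hashes(url: str) -> str:
    """Hypothetical helper: keep the first '#', percent-encode later ones."""
    # partition() splits at the first '#'; sep is '' if there is none.
    head, sep, fragment = url.partition("#")
    return head + sep + fragment.replace("#", "%23")

print(encode_extra_hashes("https://example.org/entry#a#b"))
# https://example.org/entry#a%23b
print(encode_extra_hashes("https://example.org/entry#a"))
# https://example.org/entry#a
```

A URL with a single `#`, or none at all, passes through unchanged.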
E.g., https://www.alliancegenome.org/gene/MGI:2442292 produces the following error: