Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in 6.2.0 when URL source to be parsed contains a query string #2120

Closed
nutjob4life opened this issue Sep 21, 2022 · 3 comments · Fixed by #2304
Closed

Regression in 6.2.0 when URL source to be parsed contains a query string #2120

nutjob4life opened this issue Sep 21, 2022 · 3 comments · Fixed by #2304
Assignees
Labels
6.3.x Things that should be fixed in 6.3.x or later. bug Something isn't working regression Something stopped working

Comments

@nutjob4life
Copy link

Graph.parse in 6.2.0 is producing inconsistent statement counts when reading over HTTP versus 6.1.1 when the source URL contains a query string.

To reproduce:

$ date -u
Wed Sep 21 16:57:54 UTC 2022
$ cd /tmp
$ python3.10 --version
Python 3.10.5
$ python3.10 -m venv venv
$ cd venv
$ bin/pip install --quiet --upgrade setuptools pip wheel rdflib==6.2.0
$ bin/python
Python 3.10.5 (main, Jun 23 2022, 17:15:25) [Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rdflib
>>> a, b = rdflib.Graph(), rdflib.Graph()
>>> a.parse('https://bmdb.jpl.nasa.gov/rdf/example')
>>> b.parse('https://bmdb.jpl.nasa.gov/rdf/example?all=right')
>>> len(a), len(b), len(a) != len(b)
(2, 2, False)

This should be (2, 4, True).

Using 6.1.1 produces the correct results:

$ bin/pip install --quiet rdflib==6.1.1
$ bin/python
Python 3.10.5 (main, Jun 23 2022, 17:15:25) [Clang 13.1.6 (clang-1316.0.21.2.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rdflib
>>> a, b = rdflib.Graph(), rdflib.Graph()
>>> a.parse('https://bmdb.jpl.nasa.gov/rdf/example')
>>> b.parse('https://bmdb.jpl.nasa.gov/rdf/example?all=right')
>>> len(a), len(b), len(a) != len(b)
(2, 4, True)

Apparently in 6.2.2 the query string part of the URL ?all=right gets stripped out. It is necessary with certain web services APIs in order to select different sets of RDF statements.

nutjob4life added a commit to EDRN/edrn.bmdb that referenced this issue Sep 21, 2022
- Drops zc.buildout and uses a plain old venv like the rest of the world
- Upgrades to Python 3.10
- Downgrades rdflib to 6.1.1 thanks to regression; see RDFLib/rdflib#2120
- Changes reStructuredText to Markdown
- Ups version number to 1.1.0
- Adds an example RDF service that doesn't reveal any private data but honors the private data parameter
- Adds Maureen's latest report requests
nutjob4life added a commit to EDRN/P5 that referenced this issue Sep 22, 2022
Issue: RDF source URLs with query strings get their query strings dropped

See: RDFLib/rdflib#2120
@aucampia aucampia added regression Something stopped working bug Something isn't working labels Mar 19, 2023
@aucampia aucampia self-assigned this Mar 19, 2023
@aucampia
Copy link
Member

aucampia commented Mar 20, 2023

Regression came in with #1902, the problem is that we are using quote(), which is designed for quoting paths, for non-path elements.

The exact fix is not that simple, but not that complex either, should have this mostly-fixed for 6.3.2, though it likely will be some corner cases. If we can find a reference implementation, it will help a lot.

@aucampia aucampia added the 6.3.x Things that should be fixed in 6.3.x or later. label Mar 20, 2023
@aucampia
Copy link
Member

Okay, this is not that complex to fix: werkzeug has a valid implementation of this [ref] and I will try to align our implementation of iri2uri with theirs.

aucampia added a commit to aucampia/rdflib that referenced this issue Mar 21, 2023
The URI to IRI conversion was percentage-quoting characters that should not have
been quoted, like equals in the query string. It was also not quoting things
that should have been quoted, like the username and password components of a
URI.

This change improves the conversion by only quoting characters that are not
allowed in specific parts of the URI and quoting previously unquoted components.
The safe characters for each segment are taken from
[RFC3986](https://datatracker.ietf.org/doc/html/rfc3986).

The new behavior is heavily inspired by
[`werkzeug.urls.iri_to_uri`](https://github.com/pallets/werkzeug/blob/92c6380248c7272ee668e1f8bbd80447027ccce2/src/werkzeug/urls.py#L926-L931)
though there are some differences.

- Closes <RDFLib#2120>.
aucampia added a commit to aucampia/rdflib that referenced this issue Mar 21, 2023
The URI to IRI conversion was percentage-quoting characters that should not have
been quoted, like equals in the query string. It was also not quoting things
that should have been quoted, like the username and password components of a
URI.

This change improves the conversion by only quoting characters that are not
allowed in specific parts of the URI and quoting previously unquoted components.
The safe characters for each segment are taken from
[RFC3986](https://datatracker.ietf.org/doc/html/rfc3986).

The new behavior is heavily inspired by
[`werkzeug.urls.iri_to_uri`](https://github.com/pallets/werkzeug/blob/92c6380248c7272ee668e1f8bbd80447027ccce2/src/werkzeug/urls.py#L926-L931)
though there are some differences.

- Closes <RDFLib#2120>.
aucampia added a commit that referenced this issue Mar 23, 2023
The URI to IRI conversion was percentage-quoting characters that should not have
been quoted, like equals in the query string. It was also not quoting things
that should have been quoted, like the username and password components of a
URI.

This change improves the conversion by only quoting characters that are not
allowed in specific parts of the URI and quoting previously unquoted components.
The safe characters for each segment are taken from
[RFC3986](https://datatracker.ietf.org/doc/html/rfc3986).

The new behavior is heavily inspired by
[`werkzeug.urls.iri_to_uri`](https://github.com/pallets/werkzeug/blob/92c6380248c7272ee668e1f8bbd80447027ccce2/src/werkzeug/urls.py#L926-L931)
though there are some differences.

- Closes <#2120>.
@nutjob4life
Copy link
Author

Wow! That pull-request was a lot more involved than I expected.

Thank you for tackling this, @aucampia! 💃🕺

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.3.x Things that should be fixed in 6.3.x or later. bug Something isn't working regression Something stopped working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants