Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pagination / offset of results for the SPARQL endpoint is faulty #150

Open
dhimmel opened this issue Dec 3, 2020 · 8 comments
Open

Pagination / offset of results for the SPARQL endpoint is faulty #150

dhimmel opened this issue Dec 3, 2020 · 8 comments

Comments

@dhimmel
Copy link

dhimmel commented Dec 3, 2020

Thanks for making the MeSH RDF SPARQL API. It's been convenient for quick access to MeSH.

I'd like to do a query that returns over 1000 results, and therefore need to figure out how to use pagination with the SPARQL API at https://id.nlm.nih.gov/mesh/sparql. Here's my query to return a table of descriptors:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh/2020>
WHERE {
  ?mesh_uri a meshv:Descriptor .
  ?mesh_uri meshv:identifier ?mesh_id.
  ?mesh_uri rdfs:label ?mesh_label .
} 
ORDER BY ?mesh_uri

But I'm having trouble incrementing limit and offset to retrieve all results.

In search of a more reproducible example, I've simplified it to the this API call, generated by this python code:

import requests

query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX meshv: <http://id.nlm.nih.gov/mesh/vocab#>
SELECT *
FROM <http://id.nlm.nih.gov/mesh/2020>
WHERE {
  ?mesh_uri a meshv:Descriptor .
  ?mesh_uri meshv:identifier ?mesh_id.
  ?mesh_uri rdfs:label ?mesh_label .
} 
ORDER BY ?mesh_uri
LIMIT 5
"""

params = {
    "query": query,
    "format": "json",
    "inference": True,
    "limit": 10,
    "offset": 4,
    "year": 2020,
}
api_url = "https://id.nlm.nih.gov/mesh/sparql"
response = requests.get(api_url, params)
print(response.url)
len(response.json()["results"]["bindings"])

The expected result is to receive a single record (the 5th record), because the query should return 5 records, and the offset is 4. Instead, 5 records are returned. The returned records under results.bindings start with:

      {
        "mesh_uri": { "type": "uri" , "value": "http://id.nlm.nih.gov/mesh/2020/D000005" } ,
        "mesh_id": { "type": "literal" , "value": "D000005" } ,
        "mesh_label": { "type": "literal" , "xml:lang": "en" , "value": "Abdomen" }
      } ,

So it looks like offset was respected, but something about the SPARQL LIMIT 5 or API parameter limit=10 does not work.

@dhimmel
Copy link
Author

dhimmel commented Dec 8, 2020

@danizen any ideas here or examples of successfully using the API for multi-page results?

@danizen
Copy link
Contributor

danizen commented Dec 8, 2020

When we were in the prototype phase, I argued that SPARQL is fine as a query language, but the whole deal with federated SPARQL and paging is flawed. Virtuoso also may truncate results. I went to bat at work for another team having direct access to the Virtuoso SQL interface (over JDBC) because of this.

So, I have devoted less time to validating this than I should, and will look into it. I have long wanted to automatically add a count of results, but I don't think I have time to do that soon. If I can recreate your issue, I will do it as an automated test and keep the fix DRY.

@danizen
Copy link
Contributor

danizen commented Dec 8, 2020

I suspect that there is some string processing that would add the LIMIT and OFFSET to the query via string processing rather than paging from a temporary graph or model. If so, this could be hard to fix. To fix the results you see would be a pretty fundamental change - we would need a query cache, with a query ID based on the sessionID and the query text. A similar cache to the one I propose is described in RFC 1813 and applies to NFS. Virtuoso may also support this itself, and I would have to involve @simonjupp from https://github.com/EBISPOT/lodestar as to a proper solution.

As a shorter term alternative, you can simply drop the LIMIT 5 clause from your query and it should work. Also, by doing a COUNT() aggregation query before your query, you may be able to discover if Virtuoso has truncated the query.

@dhimmel
Copy link
Author

dhimmel commented Dec 8, 2020

I suspect that there is some string processing that would add the LIMIT and OFFSET to the query via string processing rather than paging

Ah that makes sense, the limit and offset parameters get appended to the query, such that my earlier example probably became:

LIMIT 5
LIMIT 10
OFFSET 4

It looks like limit is set to 1000 when no limit is specified or a limit > 1000 is specified. The easy solution would be to allow for higher limits, perhaps limiting queries by some other mechanism like execution time if required.

But since the limit is capped at 1000, I tried two methods of determining when pagination has exhausted all results:

  1. Using SPARQL's COUNT function (example), but that returned 1000 results like:

    [{'callret-0': {'type': 'literal',
    'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
    'value': '1'}},
    {'callret-0': {'type': 'literal',
    'datatype': 'http://www.w3.org/2001/XMLSchema#integer',
    'value': '1'}},
  2. Iterative queries that increment offset until len(results) < limit, but it seems that the API always returns 1000 results for the final call rather than the actual number of records beyond the offset.

@danizen
Copy link
Contributor

danizen commented Dec 8, 2020

Let me think about this a bit.

@danizen
Copy link
Contributor

danizen commented Dec 9, 2020

I see no problems in the code or differences with the EBISPOT upstream, so I will try to duplicate your results. One thing this won't change is whether we return more than 1000 results to a single http request/response.

The implementation of limit and offset do not use string concatenation, but the parsed Apache Jena query.
See uk.ac.ebi.fgpt.lode.impl.JenaSparqlService:207 and uk.ac.ebi.fgpt.lode.impl.JenaSparqlService:256.

@danizen
Copy link
Contributor

danizen commented Dec 9, 2020

After a review of the code, an attempt to reproduce, and a re-read of the SPARQL spec., I think this is working as designed. The LIMIT is from the OFFSET, and therefore designed to allow you to page through a large result set:

https://www.w3.org/TR/rdf-sparql-query/#modOffset

There may however be some Virtuoso setting limiting the the actual result set, and I will check on this tomorrow.

@dhimmel
Copy link
Author

dhimmel commented Dec 10, 2020

Thanks @danizen for looking into this and narrowing down where things are going wrong.

Here's a versioned link for JenaSparqlService in case that file changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants