Convert rdflib.plugins.sparql.processor.SPARQLResult to a pandas.DataFrame? #1179

Closed · dhimmel opened this issue Oct 5, 2020 · 5 comments

dhimmel commented Oct 5, 2020

Running a SPARQL query on a rdflib.graph.Graph returns a rdflib.plugins.sparql.processor.SPARQLResult object. The .bindings attribute gives access to the underlying values. Since results are tabular in nature, it would be helpful to have a quick way to convert a SPARQLResult to a pandas.DataFrame.

I originally posted at RDFLib/sparqlwrapper#125 (comment), but realized that might not be the correct repository.

What is the best way to convert a SPARQLResult to a DataFrame and would it make sense to have this utility built into rdflib?
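For concreteness, a minimal sketch of the starting point (the data file and query are placeholders, not from a real project):

from rdflib import Graph

graph = Graph()
graph.parse("example.ttl", format="turtle")  # placeholder data source
results = graph.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")

# results is a rdflib.plugins.sparql.processor.SPARQLResult;
# results.bindings is a list of dicts mapping Variable keys to URIRef/Literal/BNode values
print(type(results))
print(results.bindings[0])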

dhimmel commented Oct 5, 2020

Copying from RDFLib/sparqlwrapper#125 (comment), there are a few ways I was able to convert to a DataFrame:

import pandas as pd

# results is a rdflib.plugins.sparql.processor.SPARQLResult object

# renders properly in notebooks, but DataFrame values are rdflib objects rather than builtin python types
pd.DataFrame(results.bindings)

# converts everything to strings including missing values
pd.DataFrame(results.bindings).applymap(str).rename(columns=str)

# serialize with json and then parse (clobbers types, converting values to strings)
import json
results_json = results.serialize(format="json")
bindings = json.loads(results_json)["results"]["bindings"]
bindings = [{k: v["value"] for k, v in result.items()} for result in bindings]
pd.DataFrame(bindings)

It also looks like the gastrodon library has an implementation.

Ideally, there would be a solution that:

  1. is concise (not too many commands)
  2. converts results to python/pandas types, noting "Cast XML Datatypes in SPARQL query results to native Python types?" (#1178)
  3. handles missing values (e.g. from OPTIONAL)
  4. preserves column order

white-gecko (Member) commented

This was already pointed out in RDFLib/sparqlwrapper#125 (comment).

I think this is not technically an issue for core rdflib, but it could be some additional PandasWrapper within the RDFLib collection of projects.

dhimmel commented Oct 6, 2020

Thanks @white-gecko for the suggestion at RDFLib/sparqlwrapper#125 (comment). Copying it here:

My simple solution so far was:

from pandas import DataFrame
from rdflib import Graph, URIRef
graph = Graph()
…
result = graph.query("select * {?s ?p ?o} limit 10")
DataFrame(result, columns=result.vars)

I'm replying here because this issue is specifically about converting a rdflib.plugins.sparql.processor.SPARQLResult object to a DataFrame, whereas RDFLib/sparqlwrapper#125 seems to be focused on converting a SPARQLWrapper.Wrapper.QueryResult to a DataFrame.

The problem with the proposed solution is that it doesn't convert types. The columns have type rdflib.term.Variable rather than str, making lookups like df["s"] or df.s impossible. Furthermore, values in the DataFrame are typed as rdflib.term.URIRef and rdflib.term.Literal. The structure of the DataFrame looks good, so casting to python/pandas types as per #1178 seems to be the remaining obstacle to producing broadly-useful DataFrames.

Two of the solutions I posted at RDFLib/sparqlwrapper#125 (comment) cast every table cell and column name to str, but ideally we'd be able to use the XSD-to-Python type conversion that rdflib already provides.

dhimmel commented Oct 6, 2020

As noted above, casting to python/pandas types as per #1178 seems to be the remaining obstacle to producing broadly-useful DataFrames.

@matentzn suggested the missing piece at RDFLib/sparqlwrapper#125 (comment): the .toPython() method!
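For illustration, roughly what .toPython() gives back for a few common terms (a quick sketch, not exhaustive):

from rdflib import Literal, URIRef
from rdflib.namespace import XSD

Literal("61", datatype=XSD.integer).toPython()    # 61 (int)
Literal("true", datatype=XSD.boolean).toPython()  # True (bool)
URIRef("http://example.org/x").toPython()         # "http://example.org/x" (the plain URI string)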

Based on this, I created the following function in RDFLib/sparqlwrapper#125 (comment):

from pandas import DataFrame
from rdflib.plugins.sparql.processor import SPARQLResult

def sparql_results_to_df(results: SPARQLResult) -> DataFrame:
    """
    Export results from an rdflib SPARQL query into a `pandas.DataFrame`,
    using Python types. See https://github.com/RDFLib/rdflib/issues/1179.
    """
    return DataFrame(
        data=([None if x is None else x.toPython() for x in row] for row in results),
        columns=[str(x) for x in results.vars],
    )

sparql_results_to_df(results)

This solution is fully functional as far as I can tell.

@white-gecko wrote: "I think this is not technically an issue for core rdflib, but it could be some additional PandasWrapper within the RDFLib collection of projects."

I imagine this is a pretty common use case. Do you see anywhere specifically in the RDFLib suite where this convenience function would be a good fit?

Would a method like SPARQLResult.toPandas() be a good place for it?
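To make the idea concrete, here is a hypothetical sketch of what that could look like (toPandas is not an existing rdflib API; the monkey-patch is for experimentation only and just reuses the conversion above):

from pandas import DataFrame
from rdflib.plugins.sparql.processor import SPARQLResult

def to_pandas(self: SPARQLResult) -> DataFrame:
    # Same conversion as sparql_results_to_df above, exposed as a method.
    return DataFrame(
        data=([None if x is None else x.toPython() for x in row] for row in self),
        columns=[str(x) for x in self.vars],
    )

# Hypothetical wiring; a real implementation would live inside rdflib itself.
SPARQLResult.toPandas = to_pandas
# df = graph.query("SELECT * WHERE { ?s ?p ?o } LIMIT 10").toPandas()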

dhimmel commented Oct 6, 2020

Unit test

Here's some test code for the sparql_results_to_df function above.

import pytest
import rdflib

@pytest.fixture
def rdflib_foaf_graph() -> rdflib.Graph:
    """
    FOAF (Friend of a Friend) testing graph from rdflib.
    """
    graph = rdflib.Graph()
    return graph.parse(
        source="https://github.com/RDFLib/rdflib/raw/56dc4207ce6e7b11ed7b45fb4fd4020ba548e718/examples/foaf.n3",
        format="n3",
    )


_foaf_sparql = """\
SELECT
  ?subject
  ?subject_is_tim
  (COUNT(*) AS ?n_triples)
  (MIN(?predicate) AS ?sample_predicate)
  (SAMPLE(?missing) AS ?missing) 
WHERE {
  ?subject ?predicate ?object.
  BIND(?subject = <http://www.w3.org/People/Berners-Lee/card#i> AS ?subject_is_tim)
  OPTIONAL {?subject <this_predicate_does_not_exist> ?missing .}
}
GROUP BY ?subject ?subject_is_tim
ORDER BY DESC(?n_triples) ?subject
LIMIT 10
"""


def test_sparql_results_to_df(rdflib_foaf_graph: rdflib.Graph) -> None:
    results = rdflib_foaf_graph.query(_foaf_sparql)
    df = sparql_results_to_df(results)
    assert len(df) == 10
    # test column values (no ? prefix), type (as strings), and order
    assert list(df.columns) == [
        "subject",
        "subject_is_tim",
        "n_triples",
        "sample_predicate",
        "missing",
    ]
    first_row = next(df.itertuples())
    # test value of subject, ensuring type conversion to str
    assert first_row.subject == "http://www.w3.org/People/Berners-Lee/card#i"
    # test value of subject_is_tim, ensuring type conversion to bool
    assert first_row.subject_is_tim is True
    # test value of n_triples, ensuring type conversion to int
    assert first_row.n_triples == 61
    # test value of sample_predicate, ensuring type conversion to str
    assert (
        first_row.sample_predicate == "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    )
    # test value of missing, ensuring it's None
    assert first_row.missing is None
