Convert rdflib.plugins.sparql.processor.SPARQLResult to a pandas.DataFrame? #1179

Closed · dhimmel opened this issue Oct 5, 2020 · 5 comments

dhimmel commented Oct 5, 2020

Running a SPARQL query on a rdflib.graph.Graph returns a rdflib.plugins.sparql.processor.SPARQLResult object. The .bindings attribute gives access to the underlying values. Since results are tabular in nature, it would be helpful to have a quick way to convert a SPARQLResult to a pandas.DataFrame.

I originally posted at RDFLib/sparqlwrapper#125 (comment), but realized that might not be the correct repository.

What is the best way to convert a SPARQLResult to a DataFrame and would it make sense to have this utility built into rdflib?
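For concreteness, a minimal sketch of the starting point (the data file and query are placeholders, not from a real project):

from rdflib import Graph

graph = Graph()
graph.parse("example.ttl", format="turtle")  # placeholder data source
results = graph.query("SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5")

# results is a rdflib.plugins.sparql.processor.SPARQLResult;
# results.bindings is a list of dicts mapping Variable keys to URIRef/Literal/BNode values
print(type(results))
print(results.bindings[0])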

dhimmel commented Oct 5, 2020

Copying from RDFLib/sparqlwrapper#125 (comment), there are a few ways I was able to convert to a DataFrame:

import pandas as pd

# results is a rdflib.plugins.sparql.processor.SPARQLResult object

# renders properly in notebooks, but DataFrame values are rdflib objects rather than builtin python types
pd.DataFrame(results.bindings)

# converts everything to strings including missing values
pd.DataFrame(results.bindings).applymap(str).rename(columns=str)

# serialize with json and then parse (clobbers types, converting values to strings)
import json
results_json = results.serialize(format="json")
bindings = json.loads(results_json)["results"]["bindings"]
bindings = [{k: v["value"] for k, v in result.items()} for result in bindings]
pd.DataFrame(bindings)

It also looks like the gastrodon library has an implementation.

Ideally, there would be a solution that:

  1. is concise (not too many commands)
  2. converts results to python/pandas types, noting "Cast XML Datatypes in SPARQL query results to native Python types?" (#1178)
  3. handles missing values (e.g. from OPTIONAL)
  4. preserves column order

white-gecko (Member) commented

This was already pointed out in RDFLib/sparqlwrapper#125 (comment).

I think this is not technically an issue for core rdflib, but it could be some additional PandasWrapper within the RDFLib collection of projects.

dhimmel commented Oct 6, 2020

Thanks @white-gecko for the suggestion at RDFLib/sparqlwrapper#125 (comment). Copying it here:

My simple solution so far was:

from pandas import DataFrame
from rdflib import Graph, URIRef
graph = Graph()
…
result = graph.query("select * {?s ?p ?o} limit 10")
DataFrame(result, columns=result.vars)

I'm replying here because this issue is specifically about converting a rdflib.plugins.sparql.processor.SPARQLResult object to a DataFrame, whereas RDFLib/sparqlwrapper#125 seems to be focused on converting a SPARQLWrapper.Wrapper.QueryResult to a DataFrame.

The problem with the proposed solution is that it doesn't convert types. The columns have type rdflib.term.Variable rather than str, making lookups like df["s"] or df.s impossible. Furthermore, values in the DataFrame are typed as rdflib.term.URIRef and rdflib.term.Literal. The structure of the DataFrame looks good, so casting to python/pandas types as per #1178 seems to be the remaining obstacle to producing broadly-useful DataFrames.

Two of the solutions I posted at RDFLib/sparqlwrapper#125 (comment) cast every table cell and column name to str, but ideally we'd be able to use the XSD-to-Python type conversion that rdflib already provides.

dhimmel commented Oct 6, 2020

As noted above, casting to python/pandas types as per #1178 seems to be the remaining obstacle to producing broadly-useful DataFrames.

@matentzn suggested the missing piece at RDFLib/sparqlwrapper#125 (comment): the .toPython() method!
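For illustration, roughly what .toPython() gives back for a few common terms (a quick sketch, not exhaustive):

from rdflib import Literal, URIRef
from rdflib.namespace import XSD

Literal("61", datatype=XSD.integer).toPython()    # 61 (int)
Literal("true", datatype=XSD.boolean).toPython()  # True (bool)
URIRef("http://example.org/x").toPython()         # "http://example.org/x" (the plain URI string)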

Based on this, I created the following function in RDFLib/sparqlwrapper#125 (comment):

from pandas import DataFrame
from rdflib.plugins.sparql.processor import SPARQLResult

def sparql_results_to_df(results: SPARQLResult) -> DataFrame:
    """
    Export results from an rdflib SPARQL query into a `pandas.DataFrame`,
    using Python types. See https://github.com/RDFLib/rdflib/issues/1179.
    """
    return DataFrame(
        data=([None if x is None else x.toPython() for x in row] for row in results),
        columns=[str(x) for x in results.vars],
    )

sparql_results_to_df(results)

This solution is fully functional as far as I can tell.

@white-gecko wrote: "I think this is not technically an issue for core rdflib, but it could be some additional PandasWrapper within the RDFLib collection of projects."

I imagine this is a pretty common use case. Do you see anywhere specifically in the RDFLib suite where this convenience function would be a good fit?

Would a method like SPARQLResult.toPandas() be a good place for it?
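To make the idea concrete, here is a hypothetical sketch of what that could look like (toPandas is not an existing rdflib API; the monkey-patch is for experimentation only and just reuses the conversion above):

from pandas import DataFrame
from rdflib.plugins.sparql.processor import SPARQLResult

def to_pandas(self: SPARQLResult) -> DataFrame:
    # Same conversion as sparql_results_to_df above, exposed as a method.
    return DataFrame(
        data=([None if x is None else x.toPython() for x in row] for row in self),
        columns=[str(x) for x in self.vars],
    )

# Hypothetical wiring; a real implementation would live inside rdflib itself.
SPARQLResult.toPandas = to_pandas
# df = graph.query("SELECT * WHERE { ?s ?p ?o } LIMIT 10").toPandas()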

dhimmel commented Oct 6, 2020

Unit test

Here's some test code for the sparql_results_to_df function above.

import pytest
import rdflib

@pytest.fixture
def rdflib_foaf_graph() -> rdflib.Graph:
    """
    FOAF (Friend of a Friend) testing graph from rdflib.
    """
    graph = rdflib.Graph()
    return graph.parse(
        source="https://github.com/RDFLib/rdflib/raw/56dc4207ce6e7b11ed7b45fb4fd4020ba548e718/examples/foaf.n3",
        format="n3",
    )


_foaf_sparql = """\
SELECT
  ?subject
  ?subject_is_tim
  (COUNT(*) AS ?n_triples)
  (MIN(?predicate) AS ?sample_predicate)
  (SAMPLE(?missing) AS ?missing) 
WHERE {
  ?subject ?predicate ?object.
  BIND(?subject = <http://www.w3.org/People/Berners-Lee/card#i> AS ?subject_is_tim)
  OPTIONAL {?subject <this_predicate_does_not_exist> ?missing .}
}
GROUP BY ?subject ?subject_is_tim
ORDER BY DESC(?n_triples) ?subject
LIMIT 10
"""


def test_sparql_results_to_df(rdflib_foaf_graph: rdflib.Graph) -> None:
    results = rdflib_foaf_graph.query(_foaf_sparql)
    df = sparql_results_to_df(results)
    assert len(df) == 10
    # test column values (no ? prefix), type (as strings), and order
    assert list(df.columns) == [
        "subject",
        "subject_is_tim",
        "n_triples",
        "sample_predicate",
        "missing",
    ]
    first_row = next(df.itertuples())
    # test value of subject, ensuring type conversion to str
    assert first_row.subject == "http://www.w3.org/People/Berners-Lee/card#i"
    # test value of subject_is_tim, ensuring type conversion to bool
    assert first_row.subject_is_tim is True
    # test value of n_triples, ensuring type conversion to int
    assert first_row.n_triples == 61
    # test value of sample_predicate, ensuring type conversion to str
    assert (
        first_row.sample_predicate == "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    )
    # test value of missing, ensuring it's None
    assert first_row.missing is None
