Convert rdflib.plugins.sparql.processor.SPARQLResult to a pandas.DataFrame? #1179
Comments
Copying from RDFLib/sparqlwrapper#125 (comment), there are a few ways I was able to convert to a DataFrame:
import pandas as pd
# results is a rdflib.plugins.sparql.processor.SPARQLResult object
# renders properly in notebooks, but DataFrame values are rdflib objects rather than builtin python types
pd.DataFrame(results.bindings)
# converts everything to strings including missing values
pd.DataFrame(results.bindings).applymap(str).rename(columns=str)
# serialize with json and then parse (clobbers types, converting values to strings)
import json
results_json = results.serialize(format="json")
bindings = json.loads(results_json)["results"]["bindings"]
bindings = [{k: v["value"] for k, v in result.items()} for result in bindings]
pd.DataFrame(bindings)
It also looks like the Ideally, there would be a solution that:
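For the JSON route above, the type clobbering can be reduced by looking at each term's datatype before taking its value. The following is only a sketch under assumptions: it relies on the field names of the W3C SPARQL 1.1 Query Results JSON Format ("head", "vars", "results", "bindings", "datatype", "value"); the helper names term_to_python and bindings_to_records and the small CONVERTERS table are hypothetical and cover just three XSD datatypes; the sample document is made up.

```python
import json
from typing import Any, Callable, Dict, List, Optional, Union

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical, deliberately small mapping; anything not listed stays a string.
CONVERTERS: Dict[str, Callable[[str], Any]] = {
    XSD + "integer": int,
    XSD + "double": float,
    XSD + "boolean": lambda text: text in ("true", "1"),
}


def term_to_python(term: Optional[Dict[str, str]]) -> Any:
    """Convert one SPARQL JSON term to a Python value (None if unbound)."""
    if term is None:
        return None
    convert = CONVERTERS.get(term.get("datatype", ""))
    return convert(term["value"]) if convert else term["value"]


def bindings_to_records(results_json: Union[str, bytes]) -> List[Dict[str, Any]]:
    """Turn a SPARQL JSON results document into a list of row dicts."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    # Unbound variables are absent from a binding, so row.get() yields None.
    return [
        {var: term_to_python(row.get(var)) for var in variables}
        for row in doc["results"]["bindings"]
    ]


# Made-up sample document, standing in for results.serialize(format="json").
sample = json.dumps({
    "head": {"vars": ["subject", "n_triples", "is_tim", "missing"]},
    "results": {"bindings": [{
        "subject": {"type": "uri", "value": "http://example.org/a"},
        "n_triples": {"type": "literal", "datatype": XSD + "integer", "value": "61"},
        "is_tim": {"type": "literal", "datatype": XSD + "boolean", "value": "true"},
    }]},
})
records = bindings_to_records(sample)
```

The resulting list of dicts can be handed to pd.DataFrame(records) in the same way as the bindings list above.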
This was already pointed out in RDFLib/sparqlwrapper#125 (comment). I think this is not technically an issue for core rdflib, but it could be an additional PandasWrapper within the RDFLib collection of projects.
Thanks @white-gecko for the suggestion at RDFLib/sparqlwrapper#125 (comment). Copying it here:
I'm replying here because this issue is specifically about converting a The problem with the proposed solution is that it doesn't convert types. The columns have type Two of the solutions I posted at RDFLib/sparqlwrapper#125 (comment) cast every table cell and column name to |
@matentzn suggested the missing piece at RDFLib/sparqlwrapper#125 (comment): the Based on this, I created the following function in RDFLib/sparqlwrapper#125 (comment):
from pandas import DataFrame
from rdflib.plugins.sparql.processor import SPARQLResult


def sparql_results_to_df(results: SPARQLResult) -> DataFrame:
    """
    Export results from an rdflib SPARQL query into a `pandas.DataFrame`,
    using Python types. See https://github.com/RDFLib/rdflib/issues/1179.
    """
    return DataFrame(
        data=([None if x is None else x.toPython() for x in row] for row in results),
        columns=[str(x) for x in results.vars],
    )


sparql_results_to_df(results)
This solution is fully functional as far as I can tell.
I imagine this is a pretty common use case. Do you see anywhere specifically where this convenience function would be a good fit in the RDFlib suite? Would a function like |
unit test
Here's some test code for the function above:
import pytest
import rdflib


@pytest.fixture
def rdflib_foaf_graph() -> rdflib.Graph:
    """
    FOAF (Friend of a Friend) testing graph from rdflib.
    """
    graph = rdflib.Graph()
    return graph.parse(
        source="https://github.com/RDFLib/rdflib/raw/56dc4207ce6e7b11ed7b45fb4fd4020ba548e718/examples/foaf.n3",
        format="n3",
    )


_foaf_sparql = """\
SELECT
  ?subject
  ?subject_is_tim
  (COUNT(*) AS ?n_triples)
  (MIN(?predicate) AS ?sample_predicate)
  (SAMPLE(?missing) AS ?missing)
WHERE {
  ?subject ?predicate ?object.
  BIND(?subject = <http://www.w3.org/People/Berners-Lee/card#i> AS ?subject_is_tim)
  OPTIONAL {?subject <this_predicate_does_not_exist> ?missing .}
}
GROUP BY ?subject ?subject_is_tim
ORDER BY DESC(?n_triples) ?subject
LIMIT 10
"""


def test_sparql_results_to_df(rdflib_foaf_graph: rdflib.Graph) -> None:
    results = rdflib_foaf_graph.query(_foaf_sparql)
    df = sparql_results_to_df(results)
    assert len(df) == 10
    # test column values (no ? prefix), type (as strings), and order
    assert list(df.columns) == [
        "subject",
        "subject_is_tim",
        "n_triples",
        "sample_predicate",
        "missing",
    ]
    first_row = next(df.itertuples())
    # test value of subject, ensuring type conversion to str
    assert first_row.subject == "http://www.w3.org/People/Berners-Lee/card#i"
    # test value of subject_is_tim, ensuring type conversion to bool
    assert first_row.subject_is_tim is True
    # test value of n_triples, ensuring type conversion to int
    assert first_row.n_triples == 61
    # test value of sample_predicate, ensuring type conversion to str
    assert (
        first_row.sample_predicate == "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    )
    # test value of missing, ensuring it's None
    assert first_row.missing is None
Running a SPARQL query on a rdflib.graph.Graph returns a rdflib.plugins.sparql.processor.SPARQLResult object. The .bindings attribute gives access to the underlying values. Since results are tabular in nature, it would be helpful to be able to quickly convert a SPARQLResult to a pandas.DataFrame.
I originally posted at RDFLib/sparqlwrapper#125 (comment), but realized that might not be the correct repository.
What is the best way to convert a SPARQLResult to a DataFrame, and would it make sense to have this utility built into rdflib?