# Extracting Fruits from FoodOn using SPARQL

In this notebook, we query [FoodOn](https://foodon.org/) for all available fruit objects using a SPARQL query. Since no online SPARQL endpoint is available, we query a local .owl file.
The query we use, with additional comments and explanations, can be found [in our repository](https://github.com/Food-Ninja/FoodCutting/blob/main/Methodology/all_fruits.sparql).
To use SPARQL in Python, we employ [rdflib](https://rdflib.readthedocs.io/en/stable/). 
The result is a pandas DataFrame that consists of three columns: the (cleaned-up) label of the fruit in the ontology (e.g. apple), the IRI identifying this fruit, and a comment containing a brief description of the fruit.
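
The label clean-up mentioned above is done by the `LCASE`, `CONTAINS` and `STRBEFORE` steps of the query. The following is a minimal plain-Python sketch of that logic, not the query itself; the helper names `strbefore` and `clean_label` and the example labels are hypothetical and not taken from FoodOn.

```python
def strbefore(text, marker):
    """Mimic SPARQL STRBEFORE: text before the first marker, or "" if absent."""
    head, sep, _ = text.partition(marker)
    return head if sep else ""


def clean_label(label):
    """Return the cleaned label, or None if the FILTER would drop the row."""
    lowered = label.lower()          # LCASE(STR(?label))
    if "whole" not in lowered:       # FILTER CONTAINS(?str_label, "whole")
        return None
    return strbefore(lowered, "(")   # STRBEFORE(?str_label, "(")


# Hypothetical labels, used only for illustration.
examples = ["Apple (whole)", "apple juice", "whole grain"]
cleaned = [clean_label(label) for label in examples]
print(cleaned)
```

Cutting at the opening parenthesis keeps the space that precedes it, so a label such as "Apple (whole)" becomes "apple " with a trailing space. If that matters for later processing, stripping whitespace (for example with `str.strip()`) is one option.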

In [None]:
# imports
from rdflib import Graph, Namespace
from rdflib.plugins.sparql import prepareQuery
import pandas as pd

In [None]:
# define ontology location
loc = "your/path/here"

In [None]:
# load the (local) ontology and get the data through the SPARQL query 
g = Graph()
g.parse(loc)

# namespace prefixes
FOOD = Namespace("http://purl.obolibrary.org/obo/")
RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")

# SPARQL query
query = prepareQuery(
    """
    SELECT ?fruit_label (SAMPLE(?fruit_id) AS ?rndm_fruit_id) (SAMPLE(?def) AS ?rndm_def)
    WHERE {
        ?fruit_id rdfs:label ?label.
        ?fruit_id rdfs:subClassOf* food:PO_0009001.
        OPTIONAL { ?fruit_id food:IAO_0000115 ?def. }
        
        BIND (LCASE(STR(?label)) AS ?str_label).
        BIND (STRBEFORE(?str_label, "(") AS ?fruit_label).
        FILTER CONTAINS(?str_label, "whole").
        FILTER NOT EXISTS { ?fruit_id rdfs:subClassOf* food:PO_0030104. }
    }
    GROUP BY ?fruit_label
    ORDER BY ?fruit_label
    """,
    initNs={"food": FOOD, "rdfs": RDFS}
)

results = g.query(query)

In [None]:
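# convert the query results into a pandas DataFrame for further analysis
# (unbound OPTIONAL values are None; keep them as None instead of the string "None")
results_list = [
    (str(row[0]), str(row[1]), str(row[2]) if row[2] is not None else None)
    for row in results
]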
df = pd.DataFrame(results_list, columns=["fruit_label", "fruit_id", "fruit_comment"])
print(df)
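
To make the class selection in the query more concrete, the sketch below imitates in plain Python what the `rdfs:subClassOf*` path and the `FILTER NOT EXISTS` clause compute: keep every class that reaches the root class in zero or more subclass steps, unless it also reaches the excluded class. It is an illustration under assumptions, not how rdflib evaluates the query; the toy hierarchy and the function names are hypothetical.

```python
# Toy class hierarchy: class -> list of direct superclasses (all names hypothetical).
SUBCLASS_OF = {
    "apple (whole)": ["pome fruit"],
    "pome fruit": ["fruit"],
    "wheat grain (whole)": ["grain fruit"],
    "grain fruit": ["fruit"],
    "fruit": [],
}


def ancestors_or_self(cls, subclass_of):
    """Return cls plus every class reachable via zero or more subClassOf steps."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, []))
    return seen


def select(subclass_of, root, excluded):
    """Keep classes under root that are not also under excluded."""
    kept = []
    for cls in subclass_of:
        reachable = ancestors_or_self(cls, subclass_of)
        if root in reachable and excluded not in reachable:
            kept.append(cls)
    return sorted(kept)


selected = select(SUBCLASS_OF, root="fruit", excluded="grain fruit")
print(selected)
```

Because `*` also matches a path of length zero, the root class itself passes the subclass test; in the actual query, the additional filter on "whole" in the label narrows the result further.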