# Extracting Fruits and Vegetables from FoodOn using SPARQL

With this notebook, we query [FoodOn](https://foodon.org/) for all available fruit and vegetable classes using SPARQL. Since no online SPARQL endpoint is available, we query a local .owl file.
The fruit query, with additional comments and explanations, can be found [in our repository](https://github.com/Food-Ninja/FoodCutting/blob/main/Methodology/all_fruits.sparql).
To use SPARQL in Python, we employ [rdflib](https://rdflib.readthedocs.io/en/stable/).
The result is a pandas DataFrame with three columns: `label`, the (cleaned-up) label of the fruit or vegetable in the ontology (e.g. apple); `rndm_id`, one sampled IRI identifying a class with that label; and `rndm_def`, a comment that contains a brief description of it.

In [None]:
# imports
from rdflib import Graph, Namespace
from rdflib.plugins.sparql import prepareQuery
import pandas as pd

In [None]:
# define ontology location
loc = "your/path/here"

In [None]:
# load the (local) ontology and set the namespace prefixes
g = Graph()
g.parse(loc)

FOOD = Namespace("http://purl.obolibrary.org/obo/")
RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")

In [None]:
# get the fruit data through the SPARQL query 
query = prepareQuery(
    """
    SELECT ?label (SAMPLE(?fruit_id) AS ?rndm_id) (SAMPLE(?def) AS ?rndm_def)
    WHERE {
        ?fruit_id rdfs:label ?base_label.
        ?fruit_id rdfs:subClassOf* food:PO_0009001.
        OPTIONAL { ?fruit_id food:IAO_0000115 ?def. }
        
        BIND (LCASE(STR(?base_label)) AS ?str_label).
        BIND (STRBEFORE(?str_label, "(") AS ?label).
        FILTER CONTAINS(?str_label, "whole").
        FILTER NOT EXISTS { ?fruit_id rdfs:subClassOf* food:PO_0030104. }
    }
    GROUP BY ?label
    ORDER BY ?label
    """,
    initNs={"food": FOOD, "rdfs": RDFS}
)

fruits = g.query(query)
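
The label cleanup done by the `BIND` and `FILTER` lines of the fruit query can be mimicked in plain Python to see what ends up in `?label`. This is only a sketch of the string semantics: the helper names and the example labels are hypothetical and not taken from the ontology.

```python
def strbefore(text: str, delimiter: str) -> str:
    """Mimic SPARQL STRBEFORE: text before the first delimiter, or "" if absent."""
    head, found, _ = text.partition(delimiter)
    return head if found else ""


def clean_label(base_label: str):
    """Return the cleaned label, or None if the FILTER would drop the row."""
    str_label = base_label.lower()      # LCASE(STR(?base_label))
    if "whole" not in str_label:        # FILTER CONTAINS(?str_label, "whole")
        return None
    return strbefore(str_label, "(")    # STRBEFORE(?str_label, "(")


# Hypothetical labels, chosen to cover the different cases
for example in ["Apple (whole)", "apple (whole, raw)", "apple juice", "whole pear"]:
    print(repr(example), "->", repr(clean_label(example)))
```

The first two labels collapse to the same cleaned label, which is why the query groups by `?label` and picks one IRI and one definition per group with `SAMPLE`. Note that the cleaned label keeps the space in front of the parenthesis, and that a label without any parenthesis is reduced to an empty string.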

In [None]:
# get the vegetable data through the SPARQL query 
query = prepareQuery(
    """
    SELECT ?label (SAMPLE(?veg_id) AS ?rndm_id) (SAMPLE(?def) AS ?rndm_def)
    WHERE {
        ?veg_id rdfs:label ?base_label.
        ?veg_id rdfs:subClassOf* food:FOODON_03302008.
        OPTIONAL { ?veg_id food:IAO_0000115 ?def. }

        BIND (LCASE(STR(?base_label)) AS ?str_label).
        BIND (STRBEFORE(?str_label, "(") AS ?label).
        FILTER NOT EXISTS { ?veg_id rdfs:subClassOf* food:FOODON_03302007. }
    }
    GROUP BY ?label
    ORDER BY ?label
    """,
    initNs={"food": FOOD, "rdfs": RDFS}
)

veggies = g.query(query)

In [None]:
# convert the query results into a pandas DataFrame for further analysis
fruit_list = [(str(row[0]), str(row[1]), str(row[2])) for row in fruits]
veggie_list = [(str(row[0]), str(row[1]), str(row[2])) for row in veggies]
fruit_df = pd.DataFrame(fruit_list, columns=["label", "rndm_id", "rndm_def"])
veggie_df = pd.DataFrame(veggie_list, columns=["label", "rndm_id", "rndm_def"])
frames = [fruit_df, veggie_df]
# ignore_index=True gives the combined frame one continuous index
result = pd.concat(frames, ignore_index=True)
display(result)
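
One detail of the conversion above: when a class has no definition, the `OPTIONAL` variable stays unbound, rdflib should return `None` for it, and `str()` turns that into the literal text "None". A minimal sketch of a conversion that keeps missing values as `None` follows. The helper `to_text` and the rows are hypothetical; plain tuples stand in for the rdflib result rows.

```python
def to_text(value):
    """Convert a bound value to str, but keep unbound (None) values as None."""
    return None if value is None else str(value)


# Hypothetical rows standing in for rdflib result rows: (label, IRI, definition)
rows = [
    ("apple ", "http://example.org/A1", "A pome fruit."),
    ("pear ", "http://example.org/P1", None),  # no definition found
]

naive = [tuple(str(v) for v in row) for row in rows]
careful = [tuple(to_text(v) for v in row) for row in rows]

print(repr(naive[1][2]))    # the four-character string 'None'
print(repr(careful[1][2]))  # the value None
```

With the second variant, pandas can treat a missing definition as a missing value (for example via `isna()`) instead of as text.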