# Extracting Fruits and Vegetables from FoodOn using SPARQL

In this notebook, we query [FoodOn](https://foodon.org/) for all available fruit and vegetable classes using two SPARQL queries. Since no online SPARQL endpoint is available, we query a local .owl file.
The queries, with additional comments and explanations, can be found [in our repository](https://github.com/Food-Ninja/FoodCutting/blob/main/Methodology).
To run SPARQL in Python, we use [rdflib](https://rdflib.readthedocs.io/en/stable/).
The results are two pandas DataFrames (one for fruits, one for vegetables), each consisting of three columns: the (cleaned-up) label of the fruit/vegetable in the ontology (e.g. apple, asparagus), the IRI identifying this fruit/vegetable, and a comment with a brief description of the fruit/vegetable, where the ontology provides one.

In [None]:
# imports
from rdflib import Graph, Literal, Namespace, RDF, RDFS, URIRef
from rdflib.plugins.sparql import prepareQuery
import pandas as pd

In [None]:
# define ontology locations
foodon_loc = "your/path/here"
foodcut_loc = "../food_cutting.owl"

In [None]:
# load the (local) ontology and set the namespace prefixes
g = Graph()
g.parse(foodon_loc)

FOOD = Namespace("http://purl.obolibrary.org/obo/")
RDFS = Namespace("http://www.w3.org/2000/01/rdf-schema#")

In [None]:
# get the fruit data through the SPARQL query 
query = prepareQuery(   
    """
    SELECT ?fruit_label (SAMPLE(?fruit_id) AS ?rndm_fruit_id) (SAMPLE(?def) AS ?rndm_def)
    WHERE {
        ?fruit_id rdfs:label ?label.
        ?fruit_id rdfs:subClassOf+ food:PO_0009001.
        OPTIONAL { ?fruit_id food:IAO_0000115 ?def. }

        BIND (LCASE(STR(?label)) AS ?str_label).
        BIND (STRBEFORE(?str_label, "(") AS ?fruit_label).
        FILTER CONTAINS(?str_label, "whole").
        FILTER NOT EXISTS { ?fruit_id rdfs:subClassOf* food:PO_0030104. }
        FILTER (?fruit_id != food:FOODON_03304644).
    }
    GROUP BY ?fruit_label
    ORDER BY ?fruit_label
    """,
    initNs={"food": FOOD, "rdfs": RDFS}
)

fruits = g.query(query)

In [None]:
# get the vegetable data through the SPARQL query 
query = prepareQuery(
    """
    SELECT ?veg_label (SAMPLE(?veg_id) AS ?rndm_veg_id) (SAMPLE(?def) AS ?rndm_def)
    WHERE {
        ?veg_id rdfs:label ?label.
        ?veg_id rdfs:subClassOf+ food:FOODON_03302008.
        OPTIONAL { ?veg_id food:IAO_0000115 ?def. }

        BIND (LCASE(STR(?label)) AS ?str_label).
        BIND (STRBEFORE(?str_label, "(") AS ?veg_label).
        FILTER NOT EXISTS { ?veg_id rdfs:subClassOf* food:FOODON_03302007. }
    }
    GROUP BY ?veg_label
    ORDER BY ?veg_label
    """,
    initNs={"food": FOOD, "rdfs": RDFS}
)

veggies = g.query(query)

In [None]:
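# convert query results into pandas DataFrames for further analysis
# (?def is OPTIONAL in the queries, so an unbound definition is kept as None rather than the string "None")
fruit_list = [(str(row[0]), str(row[1]), None if row[2] is None else str(row[2])) for row in fruits]
veggie_list = [(str(row[0]), str(row[1]), None if row[2] is None else str(row[2])) for row in veggies]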

fruit_df = pd.DataFrame(fruit_list, columns=["fruit_label", "rndm_fruit_id", "rndm_def"])
veggie_df = pd.DataFrame(veggie_list, columns=["veg_label", "rndm_veg_id", "rndm_def"])

In [None]:
# add results to ontology
cut = Graph()
cut.parse(foodcut_loc)

CUT = Namespace("http://www.ease-crc.org/ont/food_cutting/")

super_fruit = URIRef('http://purl.obolibrary.org/obo/PO_0009001')
super_veggie = URIRef('http://www.ease-crc.org/ont/food_cutting#vegetable')

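for _, row in fruit_df.iterrows():
    fruit = URIRef(row['rndm_fruit_id'])
    cut.add((fruit, RDFS.subClassOf, super_fruit))
    cut.add((fruit, RDFS.label, Literal(row['fruit_label'])))
    # the definition is optional in the query: only add a comment if one was found
    if pd.notna(row['rndm_def']) and row['rndm_def'] != 'None':
        cut.add((fruit, RDFS.comment, Literal(row['rndm_def'])))

for _, row in veggie_df.iterrows():
    veggie = URIRef(row['rndm_veg_id'])
    cut.add((veggie, RDFS.subClassOf, super_veggie))
    cut.add((veggie, RDFS.label, Literal(row['veg_label'])))
    # the definition is optional in the query: only add a comment if one was found
    if pd.notna(row['rndm_def']) and row['rndm_def'] != 'None':
        cut.add((veggie, RDFS.comment, Literal(row['rndm_def'])))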
    
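# write RDF/XML explicitly: rdflib 6+ defaults to Turtle, which would not match the .owl file parsed above
cut.serialize(destination=foodcut_loc, format="xml")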