# Project Name: Food 4 Thought

This project revisits and expands on an assignment I completed for the "Data and Knowledge Engineering" course during my Bachelor's degree in Business Informatics at Wirtschaftsuniversität Wien in 2021. For further details, refer to the README.md or theory.md files included in the repository.

- Author: Kristina Chuang
- Original date: 29/05/2021
- Revisited: 12/07/2024

### Technologies:
- Python, SQL, SPARQL, Relational Databases, RDF (Resource Description Framework), Knowledge graphs, Semantic Web.

### Goals: 
1. Combine a CSV table of common generic foods and their scientific names with nutritional RDF data from DBpedia.org.
2. Output a new CSV report with the combined findings, giving, for each vegetable, the grams of each macronutrient (protein, carbohydrates and fat) and their combined caloric value.
3. Query recipes from DBpedia using SPARQL and match them with the nutritional table in a Python application.
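
The combined caloric value in goal 2 can be sketched with the general Atwater factors (roughly 4 kcal per gram of protein, 4 per gram of carbohydrates and 9 per gram of fat). This is an assumption: the text does not state which conversion factors the project uses, and `caloric_value` is a hypothetical helper name.

```python
# General Atwater factors in kcal per gram (an assumption, not taken from the project).
KCAL_PER_GRAM = {"protein": 4, "carbohydrates": 4, "fat": 9}


def caloric_value(protein_g, carbohydrates_g, fat_g):
    """Return the combined kcal for the given grams of each macronutrient."""
    return (
        protein_g * KCAL_PER_GRAM["protein"]
        + carbohydrates_g * KCAL_PER_GRAM["carbohydrates"]
        + fat_g * KCAL_PER_GRAM["fat"]
    )


# Made-up gram values, only to show the arithmetic: 2*4 + 10*4 + 1*9
print(caloric_value(2, 10, 1))
```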

## Part 1: Relational Database
- Import the dataset:
    - The original dataset can be found on the data.world portal <br>
    https://data.world/alexandra/generic-food-database/workspace/file?filename=generic-food.csv <br>
- We will first use the requests and pandas packages, as the dataset needs some cleaning and transforming to match the dbpedia.org ontologies correctly.
    

In [21]:
# importing necessary packages

import pandas as pd
import requests
import os.path

In [23]:
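fn = "data/foods.csv"
if os.path.isfile(fn):
    print("file exists.")
else:
    url = 'https://query.data.world/s/kg6g2cwvmdxaaixvxa63ix2j6w3l3t?dws=00000'
    r = requests.get(url)
    r.raise_for_status()  # stop here if the download failed
    os.makedirs(os.path.dirname(fn), exist_ok=True)  # make sure data/ exists
    # write with an explicit encoding; the file is closed automatically
    with open(fn, 'w', encoding='utf-8') as f:
        f.write(r.text)
    print("downloaded.")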

file exists.


- Inspect data in foods.csv with pandas

In [74]:
df = pd.read_csv('data/foods.csv')
df.sample(5)

Unnamed: 0,FOOD NAME,SCIENTIFIC NAME,GROUP,SUB GROUP
837,Soft drink,,Beverages,Other beverages
44,Chestnut,Castanea,Nuts,Nuts
301,Bison,Bison bison,Animal foods,Bovines
901,White cabbage,Brassica oleracea L. var. capitata L. f. alba DC.,Vegetables,Cabbages
815,Pita bread,,Cereals and cereal products,Flat breads


In [76]:
df.shape

(907, 4)

In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 907 entries, 0 to 906
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   FOOD NAME        907 non-null    object
 1   SCIENTIFIC NAME  648 non-null    object
 2   GROUP            907 non-null    object
 3   SUB GROUP        907 non-null    object
dtypes: object(4)
memory usage: 28.5+ KB


### Things to correct
- Only the first two columns, FOOD NAME and SCIENTIFIC NAME, are needed.
- Rename the columns, using an underscore "_" in place of the blank spaces.
- 259 of the 907 rows have a null SCIENTIFIC NAME (drop them).
- FOOD NAME contains one or more words:
    - use string manipulation to keep only the first two words and join them with an "_"
    - the first word must be capitalized; the second word must be in lower case.
- This maximizes the chance that the names later match the format of dbpedia.org RDF triples.

In [92]:
# 1. Keep only first 2 columns
df = df.iloc[:, :2]

# 2. Rename columns
df.columns = ['food_name', 'scientific_name']

# 3. Manipulate strings of first column with a function
# Define the function to manipulate 'food_name'
def manipulate_food_name(name):
    # remove punctuation marks
    name = name.replace('(', '').replace(')', '').replace('.', '').replace(',', '')
    words = name.split()  # split the string on whitespace

    if len(words) == 1:
        # a single word is only capitalized
        return words[0].capitalize()

    elif len(words) > 1:
        # longer names: capitalize the first word, add an underscore
        # and append the second word in lower case
        return words[0].capitalize() + "_" + words[1].lower()

    # an empty name falls through and returns None

# Apply the function to the 'food_name' column
df['food_name'] = df['food_name'].apply(manipulate_food_name)

# drop rows with Nan in "scientific_name"
df = df.dropna()

# save clean dataframe as csv for manual cross referencing with dbpedia (next part)
df.to_csv('data/foods_clean.csv', index=False)

# check
df.sample(5)

Unnamed: 0,food_name,scientific_name
118,Sweet_basil,Ocimum basilicum
573,Wheat,Triticum
620,Walnut,Juglans
558,Jew's_ear,Auricularia auricula-judae
176,Sorghum,Sorghum bicolor


In [89]:
df.shape

(648, 2)
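
The normalization rules used above can be checked in isolation on a few names. The sketch below is a stand-in with the hypothetical name `normalize`; it follows the same rules as `manipulate_food_name` but needs no DataFrame. The raw spelling "Alaska wild rhubarb" is an assumption about the original entry behind `Alaska_wild`.

```python
def normalize(name):
    """Hypothetical stand-in that applies the same rules as manipulate_food_name."""
    for mark in "().,":
        name = name.replace(mark, "")  # strip the punctuation marks
    words = name.split()
    if not words:
        return None  # nothing left to normalize
    if len(words) == 1:
        return words[0].capitalize()
    # keep only the first two words: Capitalized + "_" + lower case
    return words[0].capitalize() + "_" + words[1].lower()


for raw in ["Wheat", "White cabbage", "Alaska wild rhubarb"]:
    print(raw, "->", normalize(raw))
```

Because only the first two words are kept, different foods that share their first two words collapse to the same name, which is one reason the match with DBpedia cannot be perfect.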

- We now have a clean table with 648 rows of food_name and scientific_name.
- It is not perfect, but it maximizes the chances of matching each food_name to a Wikipedia page, and hence to a DBpedia page and its ontology (dbo:).
- We can now create a local SQLite database and load the table.

In [104]:
import sqlite3

In [106]:
# load sql extension
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


- create a local sqlite database for this project

In [109]:
%sql sqlite:///food4thought.db

In [111]:
# connection to db
conn = sqlite3.connect('food4thought.db')

In [127]:
%sql drop table if exists foods;

# send dataframe as table to db
df.to_sql('foods', conn, index = False)

 * sqlite:///food4thought.db
Done.


648

In [133]:
%%sql
SELECT * 
FROM foods 
ORDER BY RANDOM() 
LIMIT 10;

 * sqlite:///food4thought.db
Done.


food_name,scientific_name
Ceylon_cinnamon,Cinnamomum verum
Chinese_chives,Allium tuberosum
Peppermint,Mentha X piperita
Sorghum,Sorghum bicolor
Hedge_mustard,Sisymbrium
American_cranberry,Vaccinium macrocarpon
Sheepshead,Archosargus probatocephalus
Butternut,Juglans cinerea
Alaska_wild,Polygonum alpinum
Atlantic_mackerel,Scomber scombrus


## Part 2: RDF, SPARQL queries for macronutrients

- This is where the search gets interesting: because the DBpedia ontology is crowd-sourced by nature, there is no guarantee that its statements are logically correct when read as natural language.

- As a curiosity, try it yourself:
    1. Find the Wikipedia page for Food (https://en.wikipedia.org/wiki/Food). This makes it most likely that a DBpedia page for the subject exists as well.
    2. Go to https://dbpedia.org/page/ and type 'Food', or any other Wikipedia article name, after 'page/'. Make sure the name matches the ending of the Wikipedia article URL, including the capitalization of the first letter and any underscore "_" separating the words.
    3. You will find that Food is "An Entity of Type: music genre, from Named Graph: http://dbpedia.org"
    4. This means that RDF triples such as the following are valid:
        - :Food rdf:type dbo:MusicGenre
        - :Strawberry rdf:type dbo:Insect<br>
In natural language this is like saying that Food is a type of music genre and Strawberry is a type of insect. That makes no sense, but such is the state of things, and it has not changed in the 3 years up to 2024.
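
The manual lookup in the steps above can also be sketched in code. The helper names below are hypothetical, and the `format` request parameter is an assumption about the public DBpedia SPARQL endpoint at https://dbpedia.org/sparql. The sketch only builds the page URL, the query and the request URL; it does not send anything over the network.

```python
from urllib.parse import urlencode, urlparse

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"  # public DBpedia endpoint


def article_name(wikipedia_url):
    """Return the article name, i.e. the part of the URL path after /wiki/."""
    path = urlparse(wikipedia_url).path
    return path.split("/wiki/", 1)[1]


def dbpedia_page_url(wikipedia_url):
    """Build the DBpedia page URL that corresponds to a Wikipedia article."""
    return "https://dbpedia.org/page/" + article_name(wikipedia_url)


def type_query(wikipedia_url):
    """Build a SPARQL query listing every type asserted for the resource."""
    resource = "http://dbpedia.org/resource/" + article_name(wikipedia_url)
    return f"SELECT ?type WHERE {{ <{resource}> a ?type }}"


wiki = "https://en.wikipedia.org/wiki/Food"
# 'format' is assumed to select JSON results on this endpoint.
request_url = SPARQL_ENDPOINT + "?" + urlencode(
    {"query": type_query(wiki), "format": "application/sparql-results+json"}
)
print(dbpedia_page_url(wiki))
print(type_query(wiki))
```

Fetching `request_url`, for example with `requests.get`, should return the types DBpedia lists for Food, which can then be compared with what the page shows.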
    

- As a first step, we need to find a common DBpedia ontology type (dbo:type) for some of the food names in the foods table.
- This takes some manual work: entering a few food names at https://dbpedia.org/page/.
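
As a complement to the manual inspection, a single SPARQL query could count how often each dbo: type occurs across a handful of candidate resources. The sketch below only builds such a query; `common_type_query` is a hypothetical helper, and the sketch assumes the food names are simple enough to be used in a resource IRI without escaping.

```python
def common_type_query(food_names):
    """Build a SPARQL query that counts dbo: types across several resources."""
    values = " ".join(
        f"<http://dbpedia.org/resource/{name}>" for name in food_names
    )
    return (
        "SELECT ?type (COUNT(?food) AS ?n) WHERE {\n"
        f"  VALUES ?food {{ {values} }}\n"
        "  ?food a ?type .\n"
        # keep only types from the DBpedia ontology namespace (dbo:)
        '  FILTER(STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))\n'
        "}\n"
        "GROUP BY ?type\n"
        "ORDER BY DESC(?n)"
    )


print(common_type_query(["Wheat", "Sorghum", "Peppermint"]))
```

The types at the top of the result would be the candidates for a common type, though given the inconsistencies described above they still need to be checked by hand.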