In [142]:
import pandas as pd
df = pd.read_csv('data/processed_claims.csv')
df.head()

Unnamed: 0,Claim,Food,Phenotype,Relationship-effect,Food Ontology Term,foodDB_wiki
0,ALA contributes to the maintenance of normal b...,Alpha-linolenic acid (ALA),Normal blood cholesterol,Maintenance of a function,https://foodb.ca/compounds/FDB012462,['http://en.wikipedia.org/wiki/Alpha-Linolenic...
1,Activated charcoal contributes to reducing exe...,Activated charcoal,Excessive flatulence,Enhancing a function,https://foodb.ca/compounds/FDB008898,['http://en.wikipedia.org/wiki/Activated_carbon']
2,Barley grain fibre contributes to an increase ...,Barley grain fibre,Increase in faecal bulk,Enhancing a function,https://foodb.ca/foods/FOOD00088,['http://en.wikipedia.org/wiki/Barley']
3,Beta-glucans contribute to the maintenance of ...,Beta-glucans,Normal blood cholesterol,Maintenance of a function,https://foodb.ca/compounds/FDB005762,['http://en.wikipedia.org/wiki/Glucan']
4,Betaine contributes to normal homocysteine met...,Betaine,Normal blood homocysteine,Maintenance of a function,https://foodb.ca/compounds/FDB009020,['http://en.wikipedia.org/wiki/Betaine']


### ANNOTATE CLAIMS USING OPENAI GPT-3.5 TURBO MODEL

The text in the 'Claim' column is sent to the OpenAI GPT-3.5 Turbo model for annotation. The prompt asks the model to extract the entities, classify them, and extract an association between them. Each entity must be one of two types: "Food Entity" or "Phenotype". The model is asked to return the results as a YAML object with the following fields:

- entities: the list of entities in the text; each entity is an object with the fields label and type
- association: a list with the most important association between entities in the text; an association is an object with the fields "subject" for the subject entity, "predicate" for the relation (maintenance of function, enhancing a function, reducing a risk factor), and "object" for the object entity

The function saves the annotated results to 'data/annotated_claims.csv'.
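
To make the field layout concrete, here is a minimal sketch of how a parsed response with this shape could be split into per-type label lists. The `sample_response` dictionary and the helper `split_entities` are hypothetical illustrations, not output captured from the actual service.

```python
# Hypothetical response, shaped like the fields described above (not real service output).
sample_response = {
    "entities": [
        {"label": "Betaine", "type": "Food Entity"},
        {"label": "homocysteine metabolism", "type": "Phenotype"},
    ],
    "association": [
        {
            "subject": "Betaine",
            "predicate": "maintenance of function",
            "object": "homocysteine metabolism",
        },
    ],
}


def split_entities(response):
    """Return (food labels, phenotype labels, predicates) from a parsed response."""
    entities = response.get("entities", [])
    foods = [e["label"] for e in entities if e.get("type") == "Food Entity"]
    phenotypes = [e["label"] for e in entities if e.get("type") == "Phenotype"]
    predicates = [a["predicate"] for a in response.get("association", [])]
    return foods, phenotypes, predicates


foods, phenotypes, predicates = split_entities(sample_response)
print(foods, phenotypes, predicates)
```

Using `.get` with a default keeps the helper from raising when a field is missing, which is one way to tolerate incomplete model output.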

In [143]:
import requests
import json

def annotateClaimsKG():
    url = 'https://api.collaboratory.semanticscience.org/openai-extract'
    headers = {'content-type': 'application/json'}
    params = {'prompt':'From the text below, extract the entities, classify them and extract an association between those entities, Entities to extract should be of one of those types: "Food Entity", "Phenotype". Return the results as a YAML object with the following fields: - entities: <the list of entities in the text, each entity is an object with the fields: label, type> - association: <a list with the most important association between entities in the text, an association is an object with the fields: "subject" for the subject entity, "predicate" for the relation (maintenance of function, enhancing a function, reducing a risk factor), "object" for the object entity>'}

    max_retries = 5  # assumed cap; the original loop retried without limit
    annot = []
    for i in range(len(df)):
        claim = df['Claim'][i]
        for attempt in range(max_retries):
            response = requests.post(url, params=params, data=json.dumps({'text': claim}), headers=headers)
            print(response)
            try:
                result = response.json()
                print(result)
                food_entities = [entity['label'] for entity in result['entities'] if entity['type'] == 'Food Entity']
                phenotype_entities = [entity['label'] for entity in result['entities'] if entity['type'] == 'Phenotype']
                health_relationships = [association['predicate'] for association in result['association']]
                annot.append([claim, food_entities, phenotype_entities, health_relationships])
                break
            except (ValueError, KeyError, TypeError):
                # Malformed JSON or missing fields: retry the same claim
                continue
        else:
            # All retries failed: keep an empty row so the rows stay aligned with df
            annot.append([claim, [], [], []])
    annotations = pd.DataFrame(annot, columns=['Claim', 'Food','Phenotype','Health relationship'])
    annotations.to_csv('data/annotated_claims.csv', index=False)

### EVALUATE ANNOTATIONS OF OPENAI GPT3.5-TURBO MODEL ON FOOD TERMS
Next, we evaluate the model's food-term annotations using precision, recall, and F1-score. The predicted labels are compared to the ground-truth labels in the 'Food' column of the dataframe, using the following rules:
- if the true label is contained in the predicted label, or vice versa, the pair counts as a true positive
- a true label with no matching predicted label counts as a false negative
- a predicted label with no matching true label counts as a false positive
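
The rules above can be sketched on a toy example as follows. This is an illustration under stated assumptions, not the notebook's evaluation code: the helper `match_labels` is hypothetical, it pairs each predicted label with at most one true label, and it works on a copy rather than removing items from a list while iterating over it, so its counts may differ from the figures reported below. The comparison is case-sensitive.

```python
def match_labels(true_labels, predicted_labels):
    """Count (tp, fn, fp) using substring matching in either direction."""
    unmatched_pred = list(predicted_labels)  # copy so the caller's list is untouched
    tp = 0
    fn = 0
    for tlabel in true_labels:
        # First unmatched prediction that contains, or is contained in, the true label
        hit = next((p for p in unmatched_pred if tlabel in p or p in tlabel), None)
        if hit is None:
            fn += 1  # no prediction matched this true label
        else:
            tp += 1
            unmatched_pred.remove(hit)  # each prediction is used at most once
    fp = len(unmatched_pred)  # predictions left without a true label
    return tp, fn, fp


# Toy labels: "fish" matches, "Meat" is missed, the second prediction is spurious.
tp, fn, fp = match_labels(["Meat", "fish"], ["fish", "other foods containing iron"])
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(tp, fn, fp, precision, recall, f1)
```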

In [144]:
import re
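# Split multi-food entries (e.g. "A, B and C") into a list of individual food labels.
# Assigning the whole column avoids chained-indexing assignment (df['Food'][i] = ...).
df['Food'] = [
    [x for x in re.split(', | & | and | or ', food) if len(x) > 1]
    for food in df['Food']
]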
df.head(120)

Unnamed: 0,Claim,Food,Phenotype,Relationship-effect,Food Ontology Term,foodDB_wiki
0,ALA contributes to the maintenance of normal b...,[Alpha-linolenic acid (ALA)],Normal blood cholesterol,Maintenance of a function,https://foodb.ca/compounds/FDB012462,['http://en.wikipedia.org/wiki/Alpha-Linolenic...
1,Activated charcoal contributes to reducing exe...,[Activated charcoal],Excessive flatulence,Enhancing a function,https://foodb.ca/compounds/FDB008898,['http://en.wikipedia.org/wiki/Activated_carbon']
2,Barley grain fibre contributes to an increase ...,[Barley grain fibre],Increase in faecal bulk,Enhancing a function,https://foodb.ca/foods/FOOD00088,['http://en.wikipedia.org/wiki/Barley']
3,Beta-glucans contribute to the maintenance of ...,[Beta-glucans],Normal blood cholesterol,Maintenance of a function,https://foodb.ca/compounds/FDB005762,['http://en.wikipedia.org/wiki/Glucan']
4,Betaine contributes to normal homocysteine met...,[Betaine],Normal blood homocysteine,Maintenance of a function,https://foodb.ca/compounds/FDB009020,['http://en.wikipedia.org/wiki/Betaine']
...,...,...,...,...,...,...
115,Phosphorus contributes to normal energy-yieldi...,[Phosphorus],Normal energy-yielding metabolism,Maintenance of a function,https://foodb.ca/compounds/FDB003520,['http://en.wikipedia.org/wiki/Phosphorus']
116,Phosphorus contributes to normal function of c...,[Phosphorus],Normal function of cell membrane,Maintenance of a function,https://foodb.ca/compounds/FDB003520,['http://en.wikipedia.org/wiki/Phosphorus']
117,Phosphorus contributes to the maintenance of n...,[Phosphorus],Normal bone,Maintenance of a function,https://foodb.ca/compounds/FDB003520,['http://en.wikipedia.org/wiki/Phosphorus']
118,Phosphorus contributes to the maintenance of n...,[Phosphorus],Normal teeth,Maintenance of a function,https://foodb.ca/compounds/FDB003520,['http://en.wikipedia.org/wiki/Phosphorus']


In [145]:
import ast
annotations = pd.read_csv('data/annotated_claims.csv')

def evaluateAnnotations():
    confusion_matrix = [[0,0],[0,0]]

    for i in range(len(df)):
        true_labels = df['Food'][i].copy()
        predicted_labels = ast.literal_eval(annotations['Food'][i])
        for tlabel in true_labels:
            if not isinstance(tlabel,str):
                tlabel = str(tlabel)
            for plabel in predicted_labels:
                if not isinstance(plabel,str):
                    plabel = str(plabel)
                if tlabel in plabel or plabel in tlabel: # if true label is in predicted label or vice versa
                    confusion_matrix[0][0] += 1 # add to true positive
                    true_labels.remove(tlabel)
                    predicted_labels.remove(plabel)
        confusion_matrix[0][1] += len(true_labels) # add to false negative (unmatched true labels)
        confusion_matrix[1][0] += len(predicted_labels) # add to false positive (unmatched predicted labels)

    precision = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[1][0])
    recall = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[0][1])
    f1 = 2*(precision*recall)/(precision+recall)
    print(confusion_matrix)
    print('Precision: ', precision)
    print('Recall: ', recall)
    print('F1: ',f1)
evaluateAnnotations()

[[238, 56], [60, 0]]
Precision:  0.7986577181208053
Recall:  0.8095238095238095
F1:  0.8040540540540542


### EVALUATION OF ENTITY LINKING OF WIKIFIER VS FOODB WIKIPEDIA URLS (OF TRUE LABELS)


In [146]:
from wikifier import CallWikifier
def queryWikifier(dataf, path):
    data = []
    for i in range(len(dataf)):
        labels = dataf['Food'][i]
        for label in labels:
            response = CallWikifier(label)
            wiki_annotations = response["annotations"]
            dbpedia_iris = []
            wiki_urls = []
            for wannotation in wiki_annotations:
                dbpedia_iri = wannotation["dbPediaIri"]
                wiki_url = wannotation["url"]
                wiki_urls.append(wiki_url)
                dbpedia_iris.append(dbpedia_iri)
            row = [dataf['Claim'][i], label, dbpedia_iris, wiki_urls]
            data.append(row)
    wikifier_annotations = pd.DataFrame(data, columns=['Claim', 'food','dbpedia_iris', 'wiki_urls'])
    wikifier_annotations.to_csv(path, index=False)
#queryWikifier(df, 'data/wiki_iris_true.csv')

In [147]:
wiki_iris_true = pd.read_csv('data/wiki_iris_true.csv')
def evaluateEL(dataf):
    confusion_matrix = [[0,0],[0,0]]
    for i in range(len(df)):
        true_labels = df['foodDB_wiki'][i]
        if pd.isna(true_labels):
            continue
        true_labels = ast.literal_eval(true_labels)
        predicted_labels = dataf[dataf['Claim'] == df['Claim'][i]]
        for predicted_label in predicted_labels['wiki_urls']:
            predicted_label = ast.literal_eval(predicted_label)
            for tlabel in true_labels:
                for plabel in predicted_label:
                    if plabel == tlabel:
                        confusion_matrix[0][0] += 1 # add to true positive
                        true_labels.remove(tlabel)
                        predicted_label.remove(plabel)
            confusion_matrix[0][1] += len(true_labels) # add to false negative (unmatched true labels)
            confusion_matrix[1][0] += len(predicted_label) # add to false positive (unmatched predicted labels)

    precision = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[1][0])
    recall = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[0][1])
    f1 = 2*(precision*recall)/(precision+recall)
    print(confusion_matrix)
    print('Precision: ', precision)
    print('Recall: ', recall)
    print('F1: ',f1)
evaluateEL(wiki_iris_true)

[[162, 72], [209, 0]]
Precision:  0.4366576819407008
Recall:  0.6923076923076923
F1:  0.5355371900826446


### EVALUATION OF ENTITY LINKING OF WIKIFIER VS FOODB WIKIPEDIA URLS (OF PREDICTED LABELS)
First, we query the Wikifier API to get the DBpedia IRIs and Wikipedia URLs of all the predicted food entities.

In [148]:
# get the dbpedia iris of all the predicted food entities
def queryWikifier():
    data = []
    for i in range(len(annotations)):
        predicted_labels = ast.literal_eval(annotations['Food'][i])
        for plabel in predicted_labels:
            response = CallWikifier(plabel)
            wiki_annotations = response["annotations"]
            dbpedia_iris = []
            wiki_urls = []
            for wannotation in wiki_annotations:
                dbpedia_iri = wannotation["dbPediaIri"]
                wiki_url = wannotation["url"]
                wiki_urls.append(wiki_url)
                dbpedia_iris.append(dbpedia_iri)
            row = [annotations['Claim'][i], plabel, dbpedia_iris, wiki_urls]
            data.append(row)
    wikifier_annotations = pd.DataFrame(data, columns=['Claim', 'predicted_label','dbpedia_iris', 'wiki_urls'])
    wikifier_annotations.to_csv('data/wiki_iris_predicted.csv', index=False)

Then we compare the URLs returned by Wikifier to the Wikipedia URLs of the food entities in the processed_claims dataset (the 'foodDB_wiki' column). A link counts as correct when a Wikifier URL is identical to one of the reference URLs. We calculate the precision, recall, and F1-score of the entity linking.
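
As a minimal sketch of this comparison, the counts for a single entity can be derived with set operations. The helper `link_counts` is hypothetical and the second predicted URL is made up for illustration. Unlike the notebook's loop, this version treats duplicate URLs as one and relies on exact string equality, so `http` and `https` variants of the same page would not match.

```python
def link_counts(true_urls, predicted_urls):
    """Return (tp, fn, fp) for one entity using exact URL equality."""
    true_set, pred_set = set(true_urls), set(predicted_urls)
    tp = len(true_set & pred_set)  # URLs present in both
    fn = len(true_set - pred_set)  # reference URLs that were not predicted
    fp = len(pred_set - true_set)  # predicted URLs not in the reference
    return tp, fn, fp


tp, fn, fp = link_counts(
    ["http://en.wikipedia.org/wiki/Barley"],
    [
        "http://en.wikipedia.org/wiki/Barley",
        "http://en.wikipedia.org/wiki/Grain",  # illustrative extra prediction
    ],
)
print(tp, fn, fp)
```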

In [149]:
wiki_iris_predicted = pd.read_csv('data/wiki_iris_predicted.csv')
evaluateEL(wiki_iris_predicted)

[[157, 82], [181, 0]]
Precision:  0.46449704142011833
Recall:  0.6569037656903766
F1:  0.5441941074523398


### IMPROVING NER WITH WIKIFIER


With this approach, the entities extracted by the GPT model are queried against Wikifier to get the DBpedia IRI of each entity. An entity is accepted if DBpedia types it as (a subclass of) Food, Condiment, Beverage, Plant, Animal, Fungus, or ChemicalSubstance. The filtered annotations are then compared to the ground-truth annotations in the 'Food' column of the dataframe, in the same way as before.
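
One compact way to express this type check is a single `ASK` query that lists the accepted classes in a `VALUES` clause. The sketch below only builds the query string; the helper `build_type_query` is hypothetical, the class list is taken from the query used in this notebook, and the resulting query has not been run against the DBpedia endpoint here.

```python
# Classes accepted by the filter, as listed in the notebook's SPARQL query.
ACCEPTED_CLASSES = [
    "Food", "Condiment", "Beverage", "Plant", "Animal", "Fungus", "ChemicalSubstance",
]


def build_type_query(entity_iri, classes=ACCEPTED_CLASSES):
    """Build an ASK query: is the entity typed as (a subclass of) any accepted class?"""
    values = " ".join(f"dbo:{c}" for c in classes)
    return (
        "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "ASK {\n"
        f"  VALUES ?cls {{ {values} }}\n"
        f"  <{entity_iri}> rdf:type ?t .\n"
        "  ?t rdfs:subClassOf* ?cls .\n"
        "}"
    )


query = build_type_query("http://dbpedia.org/resource/Barley")
print(query)
```

The string can be passed to `sparql.setQuery(...)` in the same way as the longer `union` form; declaring the prefixes explicitly avoids depending on the endpoint's predefined namespaces.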

In [150]:
# Filter the predicted labels based on the dbpedia iris

from SPARQLWrapper import SPARQLWrapper, JSON

def queryDBpedia():
    wiki_iris = pd.read_csv('data/wiki_iris.csv')

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)

    isfoodlist = []
    for i in range(len(wiki_iris)):
        entities = ast.literal_eval(wiki_iris["dbpedia_iris"].iloc[i])
        isfoodCount = 0
        for entity in entities:
            query = f'''
                ASK {{
                {{<{entity}> rdf:type ?a.
                ?a rdfs:subClassOf* dbo:Food}}
                union
                {{<{entity}> rdf:type ?b.
                ?b rdfs:subClassOf* dbo:Condiment}}
                union
                {{<{entity}> rdf:type ?c.
                ?c rdfs:subClassOf* dbo:Beverage}}
                union
                {{<{entity}> rdf:type ?d.
                ?d rdfs:subClassOf* dbo:Plant}}
                union
                {{<{entity}> rdf:type ?e.
                ?e rdfs:subClassOf* dbo:Animal}}
                union
                {{<{entity}> rdf:type ?f.
                ?f rdfs:subClassOf* dbo:Fungus}}
                union
                {{<{entity}> rdf:type ?g.
                ?g rdfs:subClassOf* dbo:ChemicalSubstance}}
                }}
            '''
            sparql.setQuery(query)
            results = sparql.query().convert()
            isfood = results["boolean"]
            if isfood:
                isfoodCount += 1
        if isfoodCount > 0:
            isfoodlist.append(1)
        else:
            isfoodlist.append(0)
    wiki_iris.insert(3, "isFood", isfoodlist, True)
    wiki_iris.to_csv('data/wiki_iris.csv', index=False)

In [151]:
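# Load the isFood flags written by queryDBpedia(); wiki_iris is otherwise undefined in this cell
wiki_iris = pd.read_csv('data/wiki_iris.csv')
for i in range(len(annotations)):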
    predicted_labels = ast.literal_eval(annotations['Food'][i])
    for plabel in predicted_labels:
        for j in range(len(wiki_iris)):
            if plabel == wiki_iris['predicted_label'][j]:
                if wiki_iris['isFood'][j] == 0:
                    predicted_labels.remove(plabel)
                    annotations['Food'][i] = str(predicted_labels)
                    break
annotations.head(100)

Unnamed: 0,Claim,Food,Phenotype,Health relationship
0,ALA contributes to the maintenance of normal b...,[],['normal blood cholesterol levels'],['contributes to the maintenance of']
1,Activated charcoal contributes to reducing exe...,['Activated charcoal'],['exessive flatulence'],['reducing a risk factor']
2,Barley grain fibre contributes to an increase ...,['Barley grain fibre'],['faecal bulk'],['contributes to an increase in']
3,Beta-glucans contribute to the maintenance of ...,[],['normal blood cholesterol levels'],['maintenance']
4,Betaine contributes to normal homocysteine met...,[],['homocysteine metabolism'],['contributes to']
...,...,...,...,...
95,Manganese contributes to the normal formation ...,['Manganese'],['connective tissues'],['enhancing a function']
96,Manganese contributes to the protection of cel...,['Manganese'],"['cells', 'oxidative stress']","['enhances function', 'reduces risk factor']"
97,Meat or fish contributes to the improvement of...,"['fish', 'other foods containing iron']",['improvement'],"['enhancing a function', 'maintenance of funct..."
98,Melatonin contributes to the allevation of sub...,['Melatonin'],"['subjective feelings', 'jet lag']",['contributing to']


In [152]:
confusion_matrix = [[0,0],[0,0]]
for i in range(len(df)):
    true_labels = df['Food'][i]
    if len(annotations['Food'][i]) == 0:
        predicted_labels = []
        confusion_matrix[0][1] += len(true_labels) # add to false negative (unmatched true labels)
        confusion_matrix[1][0] += len(predicted_labels) # add to false positive (unmatched predicted labels)
    else:
        predicted_labels = ast.literal_eval(annotations['Food'][i])
        for tlabel in true_labels:
            if not isinstance(tlabel,str):
                tlabel = str(tlabel)
            for plabel in predicted_labels:
                if not isinstance(plabel,str):
                    plabel = str(plabel)
                if tlabel in plabel or plabel in tlabel: # if true label is in predicted label or vice versa
                    confusion_matrix[0][0] += 1 # add to true positive
                    true_labels.remove(tlabel)
                    predicted_labels.remove(plabel)
        confusion_matrix[0][1] += len(true_labels) # add to false negative (unmatched true labels)
        confusion_matrix[1][0] += len(predicted_labels) # add to false positive (unmatched predicted labels)

precision = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[1][0])
recall = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[0][1])
f1 = 2*(precision*recall)/(precision+recall)
print(confusion_matrix)
print('Precision: ', precision)
print('Recall: ', recall)
print('F1: ',f1)

[[168, 126], [36, 0]]
Precision:  0.8235294117647058
Recall:  0.5714285714285714
F1:  0.6746987951807228
