In [133]:
import pandas as pd
df = pd.read_csv('data/processed_claims.csv')
df.head()

Unnamed: 0,Claim,Food,Phenotype,Relationship-effect,Food Ontology Term,foodDB_wiki
0,ALA contributes to the maintenance of normal b...,Alpha-linolenic acid (ALA),Normal blood cholesterol,Maintenance of a function,https://foodb.ca/compounds/FDB012462,['http://en.wikipedia.org/wiki/Alpha-Linolenic...
1,Activated charcoal contributes to reducing exe...,Activated charcoal,Excessive flatulence,Enhancing a function,https://foodb.ca/compounds/FDB008898,['http://en.wikipedia.org/wiki/Activated_carbon']
2,Barley grain fibre contributes to an increase ...,Barley grain fibre,Increase in faecal bulk,Enhancing a function,https://foodb.ca/foods/FOOD00088,['http://en.wikipedia.org/wiki/Barley']
3,Beta-glucans contribute to the maintenance of ...,Beta-glucans,Normal blood cholesterol,Maintenance of a function,https://foodb.ca/compounds/FDB005762,['http://en.wikipedia.org/wiki/Glucan']
4,Betaine contributes to normal homocysteine met...,Betaine,Normal blood homocysteine,Maintenance of a function,https://foodb.ca/compounds/FDB009020,['http://en.wikipedia.org/wiki/Betaine']


# NER TASK

### ANOTATE CLAIMS USING OPENAI GPT-3.5 TURBO MODEL

The text present in the 'Claim' column is sent to the OpenAI GPT-3.5 Turbo model for annotation. The function extracts the entities, classifies them, and extracts an association between those entities. The entities to extract are of the types: "Food Entity", "Phenotype". The function returns the results as a YAML object with the following fields:

entities: the list of entities in the text, each entity is an object with the fields: label, type
association: a list with the most important association between entities in the text, an association is an object with the fields: "subject" for the subject entity, "predicate" for the relation (maintenance of function, enhancing a function, reducing a risk factor), "object" for the object entity
The function saves the annotated results in 'data/annotated_claims.csv' file.

In [134]:
import requests
import json

def annotateClaimsKG():
    url = 'https://api.collaboratory.semanticscience.org/openai-extract'
    headers = {'content-type': 'application/json'}
    params = {'prompt':'From the text below, extract the entities, classify them and extract an association between those entities, Entities to extract should be of one of those types: "Food Entity", "Phenotype". Return the results as a YAML object with the following fields: - entities: <the list of entities in the text, each entity is an object with the fields: label, type> - association: <a list with the most important association between entities in the text, an association is an object with the fields: "subject" for the subject entity, "predicate" for the relation (maintenance of function, enhancing a function, reducing a risk factor), "object" for the object entity>'}

    annot = []
    for i in range(0,len(df)):
        while True:
            data = {
                'text': df['Claim'][i]
            }
            response = requests.post(url, params=params, data=json.dumps(data), headers=headers)
            print(response)
            try:
                response = response.json()
                print(response)
                food_entities = [entity['label'] for entity in response['entities'] if entity['type'] == 'Food Entity']
                phenotype_entities = [entity['label'] for entity in response['entities'] if entity['type'] == 'Phenotype']
                health_relationships = [association['predicate'] for association in response['association']]
                row = [df['Claim'][i], food_entities, phenotype_entities, health_relationships]
                annot.append(row)
                break
            except:
                pass
    annotations = pd.DataFrame(annot, columns=['Claim', 'Food','Phenotype','Health relationship'])
    annotations.to_csv('data/annotated_claims.csv', index=False)
annotations = pd.read_csv('data/annotated_claims.csv')
annotations.head()

Unnamed: 0,Claim,Food,Phenotype,Health relationship
0,ALA contributes to the maintenance of normal b...,['ALA'],['normal blood cholesterol levels'],['contributes to the maintenance of']
1,Activated charcoal contributes to reducing exe...,"['Activated charcoal', 'eating']",['exessive flatulence'],['reducing a risk factor']
2,Barley grain fibre contributes to an increase ...,['Barley grain fibre'],['faecal bulk'],['contributes to an increase in']
3,Beta-glucans contribute to the maintenance of ...,['Beta-glucans'],['normal blood cholesterol levels'],['maintenance']
4,Betaine contributes to normal homocysteine met...,['Betaine'],['homocysteine metabolism'],['contributes to']


### EVALUATE ANNOTATIONS OF OPENAI GPT3.5-TURBO MODEL ON FOOD AND PHENOTYPES
Now we evaluate the annotations of the OpenAI GPT3.5-Turbo model on food terms using nervaluate. Full named-entity (i.e., not tag/token) evaluation metrics based on SemEval’13

In [135]:
import re
import ast
for i in range(len(df)):
    s = re.split(', | & | and | or ',df['Food'][i])
    s = [x for x in s if len(x) > 1]
    df['Food'][i] = s
    df['Phenotype'][i] = [df['Phenotype'][i]]
df.head()

Unnamed: 0,Claim,Food,Phenotype,Relationship-effect,Food Ontology Term,foodDB_wiki
0,ALA contributes to the maintenance of normal b...,[Alpha-linolenic acid (ALA)],[Normal blood cholesterol],Maintenance of a function,https://foodb.ca/compounds/FDB012462,['http://en.wikipedia.org/wiki/Alpha-Linolenic...
1,Activated charcoal contributes to reducing exe...,[Activated charcoal],[Excessive flatulence],Enhancing a function,https://foodb.ca/compounds/FDB008898,['http://en.wikipedia.org/wiki/Activated_carbon']
2,Barley grain fibre contributes to an increase ...,[Barley grain fibre],[Increase in faecal bulk],Enhancing a function,https://foodb.ca/foods/FOOD00088,['http://en.wikipedia.org/wiki/Barley']
3,Beta-glucans contribute to the maintenance of ...,[Beta-glucans],[Normal blood cholesterol],Maintenance of a function,https://foodb.ca/compounds/FDB005762,['http://en.wikipedia.org/wiki/Glucan']
4,Betaine contributes to normal homocysteine met...,[Betaine],[Normal blood homocysteine],Maintenance of a function,https://foodb.ca/compounds/FDB009020,['http://en.wikipedia.org/wiki/Betaine']


Convert the true and predicted labels to the indices that they match in the Claim text

In [136]:
from stringMatcher import longest_common_substring
# Convert "Food" column to indices
annotations_copy = annotations.copy()
df_copy = df[['Claim','Food','Phenotype']].copy()
annotations_copy['Food'] = annotations_copy.apply(lambda row: [longest_common_substring(row['Claim'],food) for food in ast.literal_eval(row['Food'])], axis=1)
df_copy['Food'] = df_copy.apply(lambda row: [longest_common_substring(row['Claim'],food) for food in row['Food']], axis=1)
# Convert "Phenotype" column to indices
annotations_copy['Phenotype'] = annotations_copy.apply(lambda row: [longest_common_substring(row['Claim'],phenotype) for phenotype in ast.literal_eval(row['Phenotype'])], axis=1)
df_copy['Phenotype'] = df_copy.apply(lambda row: [longest_common_substring(row['Claim'],phenotype) for phenotype in row['Phenotype']], axis=1)
# Display the updated DataFrame
indices_df = df_copy
indices_df['predicted_food'] = annotations_copy['Food']
indices_df['predicted_phenotype'] = annotations_copy['Phenotype']
indices_df.head()

Unnamed: 0,Claim,Food,Phenotype,predicted_food,predicted_phenotype
0,ALA contributes to the maintenance of normal b...,"[(0, 2)]","[(39, 61)]","[(0, 2)]","[(38, 68)]"
1,Activated charcoal contributes to reducing exe...,"[(0, 17)]","[(45, 61)]","[(0, 17), (69, 74)]","[(43, 61)]"
2,Barley grain fibre contributes to an increase ...,"[(0, 17)]","[(38, 59)]","[(0, 17)]","[(49, 59)]"
3,Beta-glucans contribute to the maintenance of ...,"[(0, 11)]","[(47, 69)]","[(0, 11)]","[(46, 76)]"
4,Betaine contributes to normal homocysteine met...,"[(0, 6)]","[(29, 41)]","[(0, 6)]","[(30, 52)]"


Add text length to the indices so the indices are not repeated for each row.

In [137]:
# Calculate the cumulative lengths of the Claim text
cumulative_lengths = indices_df['Claim'].apply(len).cumsum().shift(fill_value=0)
# Modify the spans in the columns based on the cumulative lengths
indices_df['Food'] = indices_df.apply(lambda row: [(start + cumulative_lengths[row.name], end + cumulative_lengths[row.name]) for (start, end) in row['Food']], axis=1)
indices_df['Phenotype'] = indices_df.apply(lambda row: [(start + cumulative_lengths[row.name], end + cumulative_lengths[row.name]) for (start, end) in row['Phenotype']], axis=1)
indices_df['predicted_food'] = indices_df.apply(lambda row: [(start + cumulative_lengths[row.name], end + cumulative_lengths[row.name]) for (start, end) in row['predicted_food']], axis=1)
indices_df['predicted_phenotype'] = indices_df.apply(lambda row: [(start + cumulative_lengths[row.name], end + cumulative_lengths[row.name]) for (start, end) in row['predicted_phenotype']], axis=1)
indices_df.head()

Unnamed: 0,Claim,Food,Phenotype,predicted_food,predicted_phenotype
0,ALA contributes to the maintenance of normal b...,"[(0, 2)]","[(39, 61)]","[(0, 2)]","[(38, 68)]"
1,Activated charcoal contributes to reducing exe...,"[(69, 86)]","[(114, 130)]","[(69, 86), (138, 143)]","[(112, 130)]"
2,Barley grain fibre contributes to an increase ...,"[(144, 161)]","[(182, 203)]","[(144, 161)]","[(193, 203)]"
3,Beta-glucans contribute to the maintenance of ...,"[(204, 215)]","[(251, 273)]","[(204, 215)]","[(250, 280)]"
4,Betaine contributes to normal homocysteine met...,"[(281, 287)]","[(310, 322)]","[(281, 287)]","[(311, 333)]"


In [138]:
food_true = [{'label': 'FOOD', 'start': span[0], 'end': span[1]} for spans in indices_df['Food'] for span in spans]
phenotype_true = [{'label': 'PHENOTYPE', 'start': span[0], 'end': span[1]} for spans in indices_df['Phenotype'] for span in spans]
food_predicted = [{'label': 'FOOD', 'start': span[0], 'end': span[1]} for spans in indices_df['predicted_food'] for span in spans]
phenotype_predicted = [{'label': 'PHENOTYPE', 'start': span[0], 'end': span[1]} for spans in indices_df['predicted_phenotype'] for span in spans]

true = [food_true, phenotype_true]
predicted = [food_predicted, phenotype_predicted]

from nervaluate import Evaluator
evaluator = Evaluator(true, predicted, tags=['FOOD', 'PHENOTYPE'])
# Returns overall metrics and metrics for each tag
results, results_per_tag = evaluator.evaluate()
print(results_per_tag)

{'FOOD': {'ent_type': {'correct': 263, 'incorrect': 0, 'partial': 0, 'missed': 30, 'spurious': 35, 'possible': 293, 'actual': 298, 'precision': 0.8825503355704698, 'recall': 0.8976109215017065, 'f1': 0.8900169204737733}, 'partial': {'correct': 188, 'incorrect': 0, 'partial': 75, 'missed': 30, 'spurious': 35, 'possible': 293, 'actual': 298, 'precision': 0.7567114093959731, 'recall': 0.7696245733788396, 'f1': 0.763113367174281}, 'strict': {'correct': 188, 'incorrect': 75, 'partial': 0, 'missed': 30, 'spurious': 35, 'possible': 293, 'actual': 298, 'precision': 0.6308724832214765, 'recall': 0.6416382252559727, 'f1': 0.6362098138747885}, 'exact': {'correct': 188, 'incorrect': 75, 'partial': 0, 'missed': 30, 'spurious': 35, 'possible': 293, 'actual': 298, 'precision': 0.6308724832214765, 'recall': 0.6416382252559727, 'f1': 0.6362098138747885}}, 'PHENOTYPE': {'ent_type': {'correct': 250, 'incorrect': 0, 'partial': 0, 'missed': 10, 'spurious': 135, 'possible': 260, 'actual': 385, 'precision': 

# EL TASK

### EVALUATION OF ENTITY LINKING OF WIKIFIER FROM PREDICTED LABELS
First we query the Wikifier API to get the dbpedia iris and wikipedia urls of all the predicted food entities.

In [139]:
# get the dbpedia iris and wikipedia urls of all the predicted food entities
from wikifier import CallWikifier
def queryWikifier():
    data = []
    for i in range(len(annotations)):
        predicted_labels = ast.literal_eval(annotations['Food'][i])
        for plabel in predicted_labels:
            response = CallWikifier(plabel)
            wiki_annotations = response["annotations"]
            dbpedia_iris = []
            wiki_urls = []
            for wannotation in wiki_annotations:
                dbpedia_iri = wannotation["dbPediaIri"]
                wiki_url = wannotation["url"]
                wiki_urls.append(wiki_url)
                dbpedia_iris.append(dbpedia_iri)
            row = [annotations['Claim'][i], plabel, dbpedia_iris, wiki_urls]
            data.append(row)
    wikifier_annotations = pd.DataFrame(data, columns=['Claim', 'predicted_label','dbpedia_iris', 'wiki_urls'])
    wikifier_annotations.to_csv('data/wiki_iris_predicted.csv', index=False)

Then we compare the urls from Wikifier to the wikipedia urls of the food entities in the processed_claims dataset. If the urls are the same, then the entity linking is correct. We calculate the precision, recall, and F1-score of the entity linking.

In [140]:
wiki_iris_predicted = pd.read_csv('data/wiki_iris_predicted.csv')
def evaluateEL(dataf):
    confusion_matrix = [[0,0],[0,0]]
    for i in range(len(df)):
        true_labels = df['foodDB_wiki'][i]
        if pd.isna(true_labels):
            continue
        true_labels = ast.literal_eval(true_labels)
        predicted_labels = dataf[dataf['Claim'] == df['Claim'][i]]
        for predicted_label in predicted_labels['wiki_urls']:
            predicted_label = ast.literal_eval(predicted_label)
            for tlabel in true_labels:
                for plabel in predicted_label:
                    if plabel == tlabel:
                        confusion_matrix[0][0] += 1 # add to true positive
                        true_labels.remove(tlabel)
                        predicted_label.remove(plabel)
            confusion_matrix[0][1] += len(true_labels) # add to false negative (unmatched true labels)
            confusion_matrix[1][0] += len(predicted_label) # add to false positive (unmatched predicted labels)

    precision = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[1][0])
    recall = confusion_matrix[0][0]/(confusion_matrix[0][0]+confusion_matrix[0][1])
    f1 = 2*(precision*recall)/(precision+recall)
    print(confusion_matrix)
    print('Precision: ', precision)
    print('Recall: ', recall)
    print('F1: ',f1)
evaluateEL(wiki_iris_predicted)

[[157, 82], [181, 0]]
Precision:  0.46449704142011833
Recall:  0.6569037656903766
F1:  0.5441941074523398


### EVALUATION OF ENTITY LINKING OF WIKIFIER FROM CLAIM
In this approach the Wikifier API is called for each claim and it annotates the relevant entities. Then the food terms are extracted by filtering the entities based on their dbepedia iris. If they are not of the type FOOD, CONDIMENT, BEVERAGE... they are removed.
Finally, the wikipedia urls of the remaining entities are compared to the true urls.

In [141]:
# Filter the predicted labels based on the dbpedia iris
from SPARQLWrapper import SPARQLWrapper, JSON
# get the dbpedia iris and wikipedia urls of claims
def queryWikifier():
    data = []
    for i in range(len(df)):
        claim = df['Claim'][i]
        response = CallWikifier(claim)
        wiki_annotations = response["annotations"]
        dbpedia_iris = []
        wiki_urls = []
        for wannotation in wiki_annotations:
            dbpedia_iri = wannotation["dbPediaIri"]
            wiki_url = wannotation["url"]
            wiki_urls.append(wiki_url)
            dbpedia_iris.append(dbpedia_iri)
        row = [annotations['Claim'][i], dbpedia_iris, wiki_urls]
        data.append(row)
    wikifier_annotations = pd.DataFrame(data, columns=['Claim','dbpedia_iris', 'wiki_urls'])
    wikifier_annotations.to_csv('data/wiki_iris.csv', index=False)

In [142]:
def queryDBpedia():
    wiki_iris = pd.read_csv('data/wiki_iris.csv')

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)

    for i in range(len(wiki_iris)):
        entities = ast.literal_eval(wiki_iris["dbpedia_iris"].iloc[i])
        wiki_urls = ast.literal_eval(wiki_iris["wiki_urls"].iloc[i])
        for entity in entities:
            query = f'''
                ASK {{
                {{<{entity}> rdf:type ?a.
                ?a rdfs:subClassOf* dbo:Food}}
                union
                {{<{entity}> rdf:type ?b.
                ?b rdfs:subClassOf* dbo:Condiment}}
                union
                {{<{entity}> rdf:type ?c.
                ?c rdfs:subClassOf* dbo:Beverage}}
                union
                {{<{entity}> rdf:type ?d.
                ?d rdfs:subClassOf* dbo:Plant}}
                union
                {{<{entity}> rdf:type ?e.
                ?e rdfs:subClassOf* dbo:Animal}}
                union
                {{<{entity}> rdf:type ?f.
                ?f rdfs:subClassOf* dbo:Fungus}}
                union
                {{<{entity}> rdf:type ?g.
                ?g rdfs:subClassOf* dbo:ChemicalSubstance}}
                }}
            '''
            sparql.setQuery(query)
            results = sparql.query().convert()
            isfood = results["boolean"]
            if  not isfood:
                index = entities.index(entity)
                entities.remove(entity)
                wiki_urls.remove(wiki_urls[index])
        wiki_iris["dbpedia_iris"].iloc[i] = str(entities)
        wiki_iris["wiki_urls"].iloc[i] = str(wiki_urls)
    wiki_iris.to_csv('data/wiki_iris.csv', index=False)

In [143]:
wiki_iris = pd.read_csv('data/wiki_iris.csv')
evaluateEL(wiki_iris)

[[104, 123], [409, 0]]
Precision:  0.20272904483430798
Recall:  0.4581497797356828
F1:  0.2810810810810811


# REL TASK

First we want to map all the relations to the relations defined in the task (maintenance of a function, reducing a risk factor, enhancing of a function)

In [150]:
health_rel = annotations.copy()
health_rel.loc[health_rel['Health relationship'].str.contains('maintenance', case=False), 'Health relationship'] = 'Maintenance of a function'
health_rel.loc[health_rel['Health relationship'].str.contains('contributes', case=False), 'Health relationship'] = 'Maintenance of a function'
health_rel.loc[health_rel['Health relationship'].str.contains('risk', case=False), 'Health relationship'] = 'Reducing a risk factor'
health_rel.loc[health_rel['Health relationship'].str.contains('enhancing', case=False), 'Health relationship'] = 'Enhancing of a function'
health_rel.head(20)

Unnamed: 0,Claim,Food,Phenotype,Health relationship
0,ALA contributes to the maintenance of normal b...,['ALA'],['normal blood cholesterol levels'],Maintenance of a function
1,Activated charcoal contributes to reducing exe...,"['Activated charcoal', 'eating']",['exessive flatulence'],Reducing a risk factor
2,Barley grain fibre contributes to an increase ...,['Barley grain fibre'],['faecal bulk'],Maintenance of a function
3,Beta-glucans contribute to the maintenance of ...,['Beta-glucans'],['normal blood cholesterol levels'],Maintenance of a function
4,Betaine contributes to normal homocysteine met...,['Betaine'],['homocysteine metabolism'],Maintenance of a function
5,Biotin contributes to normal energy-yielding m...,['Biotin'],['energy-yielding metabolism'],Enhancing of a function
6,Biotin contributes to normal functioning of th...,['Biotin'],['nervous system'],Maintenance of a function
7,Biotin contributes to normal macronutrient met...,"['Biotin', 'macronutrient']",['metabolism'],Maintenance of a function
8,Biotin contributes to normal psychological fun...,['Biotin'],['normal psychological function'],Maintenance of a function
9,Biotin contributes to the maintenance of norma...,['Biotin'],['hair'],Maintenance of a function


Now we evaluate the model

In [152]:
matches = 0
for i in range(len(df)):
    true_rel = df['Relationship-effect'][i]
    pred_rel = health_rel['Health relationship'][i]
    if true_rel == pred_rel:
        matches += 1
print('Accuracy: ', matches/len(df))

Accuracy:  0.5884615384615385
