## Evaluate Dataset from Paper "Personalized Food Recommendation as Constrained Question Answering over a Large-scale Food Knowledge Graph"

In [52]:
import pandas as pd 
import numpy as np 
from foodrec.tools.ingredient_normalizer import IngredientNormalisation
from foodrec.config.structure.paths import DATASET_PATHS
from foodrec.config.structure.dataset_enum import DatasetEnum

## Load Dataset

In [6]:
df = pd.read_json(DATASET_PATHS / "test_qas_090820.json", orient="records", lines=True)

In [7]:
df.head()

Unnamed: 0,entities,topicKey,rel_path,qOriginText,qType,multi_tag_type,origin_answers,answers,log_dishes,qText,guideline,persona,domainType,qId,explicit_nutrition
0,"[[papaya, tag]]",[http://idea.rpi.edu/heals/kb/tag/papaya],[tagged_dishes],Can you suggest papaya recipes that do not con...,constraint,none,"[Thai Curried Prawn Soup, Pineapple, Papaya & ...",[Tropical Quinoa (Ww)],"[Irish Roasted Salmon, Slammin Salmon, Salmon ...",Can you suggest papaya recipes that do not con...,"{'saturated fat': {'percentage': 'calories', '...","{'ingredient_likes': ['scallions'], 'ingredien...",in-domain,constraint-qas-test-00000,
1,"[[labor-day, tag]]",[http://idea.rpi.edu/heals/kb/tag/labor-day],[tagged_dishes],What are low protein labor-day recipes which d...,constraint,none,"[Bacon-Wrapped Tater Tots, Southwestern Roaste...","[Mandarin Chicken Pasta Salad - Pampered Chef,...",[Fassolia Gigantes Plaki (Giant Beans Baked in...,What are low protein labor-day recipes which d...,"{'carbohydrates': {'unit': 'g', 'meal': {'type...","{'ingredient_likes': ['red bell peppers'], 'in...",in-domain,constraint-qas-test-00001,"[{'nutrition': 'protein', 'level': 'low', 'ran..."
2,"[[shakes, tag]]",[http://idea.rpi.edu/heals/kb/tag/shakes],[tagged_dishes],Can you suggest shakes recipes that do not con...,constraint,none,"[Baileys Frappe, Chocolate Coconut Milkshake, ...","[Grape Juice Shake, My Great Grape and Banana ...",[Fassolia Gigantes Plaki (Giant Beans Baked in...,Can you suggest shakes recipes that do not con...,"{'sugar': {'unit': 'g', 'meal': {'type': 'rang...","{'ingredient_likes': ['grape juice'], 'ingredi...",in-domain,constraint-qas-test-00002,
3,"[[rosh-hashana, tag]]",[http://idea.rpi.edu/heals/kb/tag/rosh-hashana],[tagged_dishes],Suggest low carbohydrates rosh-hashana dishes ...,constraint,none,"[Winter Fruit Salad, Canadian \'old-Time\' Bra...",[Red Snapper Baked with Orange],"[Irish Roasted Salmon, Slammin Salmon, Salmon ...",Suggest low carbohydrates rosh-hashana dishes ...,"{'sugar': {'unit': 'g', 'meal': {'type': 'rang...","{'ingredient_likes': ['red snapper'], 'ingredi...",in-domain,constraint-qas-test-00003,"[{'nutrition': 'carbohydrates', 'level': 'low'..."
4,"[[cherries, tag]]",[http://idea.rpi.edu/heals/kb/tag/cherries],[tagged_dishes],What cherries dishes do not contain ingredient...,constraint,none,"[Italian Cherry Sauce, Cherry Berry Smoothies,...",[Chocolate-Dipped Cherries With Pistachios],"[Irish Roasted Salmon, Slammin Salmon, Salmon ...",What cherries dishes do not contain ingredient...,"{'sugar': {'unit': 'g', 'meal': {'type': 'rang...","{'ingredient_likes': ['heavy cream', 'pistachi...",in-domain,constraint-qas-test-00004,


In [15]:
df.columns

Index(['entities', 'topicKey', 'rel_path', 'qOriginText', 'qType',
       'multi_tag_type', 'origin_answers', 'answers', 'log_dishes', 'qText',
       'guideline', 'persona', 'domainType', 'qId', 'explicit_nutrition'],
      dtype='object')

In [20]:
df['persona']

0       {'ingredient_likes': ['scallions'], 'ingredien...
1       {'ingredient_likes': ['red bell peppers'], 'in...
2       {'ingredient_likes': ['grape juice'], 'ingredi...
3       {'ingredient_likes': ['red snapper'], 'ingredi...
4       {'ingredient_likes': ['heavy cream', 'pistachi...
                              ...                        
2756    {'ingredient_likes': ['white pepper'], 'ingred...
2757    {'ingredient_likes': ['apple juice'], 'ingredi...
2758    {'ingredient_likes': ['dates'], 'ingredient_di...
2759    {'ingredient_likes': ['lean beef'], 'ingredien...
2760    {'ingredient_likes': ['zucchini'], 'ingredient...
Name: persona, Length: 2761, dtype: object

## Exploratory Analysis

We are particularly interested in the personas and queries; one example is shown below. The catch is that we must adapt them to our own persona format, which also includes the variables cuisine and time.

First, let's analyse the different types of queries.

In [25]:
print("One Example of a persona and query:")
row = df.iloc[40]
print(f"Query: {row['qOriginText']}")
print(f"Persona: {row['persona']}")

One Example of a persona and query:
Query: What squid dishes can I make that do not contain toasted sesame seeds?
Persona: {'ingredient_likes': ['carrot'], 'ingredient_dislikes': ['lobsters'], 'constrained_entities': {'1': ['carrot', 'calories with desired range 100.0 calories to 800.0 calories'], '2': ['lobsters', 'toasted sesame seeds']}}
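
To get an overview of the query types beyond a single example, the `qType` and `explicit_nutrition` columns could be tallied. The following is a sketch on a toy frame: the column names come from the dataset, while the rows (including the `simple` type) are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the loaded frame; only the column names match the dataset.
toy = pd.DataFrame({
    "qType": ["constraint", "constraint", "simple"],  # "simple" is a hypothetical value
    "explicit_nutrition": [None, [{"nutrition": "protein", "level": "low"}], None],
})

# How many queries of each type are there?
type_counts = toy["qType"].value_counts()

# Which queries carry an explicit nutrition constraint?
has_nutrition = toy["explicit_nutrition"].notna()

print(type_counts)
print(f"Queries with explicit nutrition: {int(has_nutrition.sum())} of {len(toy)}")
```

On the real data, `toy` would be replaced by `df`.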


## Select Queries

"""
We need 100 queries that vary as much as possible, so we sample 100 of the 2,761 queries.
1. We manually check whether the queries match our constraints.
2. We then add a vegetarian diet to 10 percent of the 100 personas, because diet is not a variable in the dataset.
3. We also add a cuisine preference to 20 percent of the personas, because cuisine is one of our variables too.
"""

In [55]:
df_sample = df.sample(n=100, random_state=42)

In [56]:
df_filtered = df_sample[['persona', 'qOriginText']].copy()

In [57]:
def reduce_persona(persona):
    ingredients = persona.get('ingredient_likes', [])
    dislikes = persona.get('ingredient_dislikes', [])
    return {
        'likes': ingredients,
        'dislikes': dislikes,
    }

In [58]:
df_filtered['persona'] = df_filtered['persona'].apply(reduce_persona)

In [59]:
df_filtered

Unnamed: 0,persona,qOriginText
367,"{'likes': ['caramels'], 'dislikes': ['plain lo...",Recommend oatmeal recipes which do not have in...
2759,"{'likes': ['lean beef'], 'dislikes': ['ginger'...",Could you recommend czech dishes which do not ...
1330,"{'likes': ['pork back ribs'], 'dislikes': ['ri...",Suggest high fat beef-ribs dishes that do not ...
2750,"{'likes': ['chorizo sausage'], 'dislikes': ['b...",What are medium carbohydrates puerto-rican dis...
521,"{'likes': ['dill pickle'], 'dislikes': ['large...",Suggest kwanzaa dishes that do not contain onion?
...,...,...
2419,"{'likes': ['egg whites'], 'dislikes': ['countr...",What are breakfast recipes that do not contain...
2665,"{'likes': ['buttermilk'], 'dislikes': ['cornme...",What puerto-rican dishes do not contain ingred...
1601,"{'likes': ['vegetable stock'], 'dislikes': ['c...",Can you suggest simply-potatoes recipes that d...
1666,"{'likes': ['vermicelli'], 'dislikes': ['poblan...",Recommend leftovers recipes which do not have ...


In [60]:
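import re

# A level word followed by a nutrient, e.g. "high fat" or "low carbohydrates".
NUTRITION_PHRASE = re.compile(
    r"\b(?:low|medium|high)\s+(?:carbohydrates|protein|fat)\b", re.IGNORECASE
)

def replace_nutritions(text):
    # Remove whole "<level> <nutrient>" phrases only; chained str.replace calls
    # would also cut into words such as "slow" or "cauliflower".
    text = NUTRITION_PHRASE.sub("", text)
    # Collapse the whitespace left behind by the removal.
    return re.sub(r"\s{2,}", " ", text).strip()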

In [61]:
df_filtered['qOriginText'] = df_filtered['qOriginText'].apply(replace_nutritions)
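
After stripping, it may be worth confirming that no nutrition wording survives in the sampled queries. The sketch below flags any query that still contains a level word followed by a nutrient. The queries are made up for illustration, and the word boundaries are there so that words such as "slow" do not count as hits.

```python
import re
import pandas as pd

# Level + nutrient phrases that should no longer appear in the cleaned queries.
LEFTOVER = re.compile(
    r"\b(?:low|medium|high)\s+(?:carbohydrates|protein|fat)\b", re.IGNORECASE
)

# Hypothetical queries: one clean, one still carrying a constraint,
# one that only contains "low" inside another word.
queries = pd.Series([
    "Suggest beef-ribs dishes that do not contain onion?",
    "What are low protein breakfast recipes?",
    "Recommend slow-cooker recipes without garlic?",
])

# Keep only the queries in which the pattern still matches.
leftover = queries[queries.apply(lambda q: bool(LEFTOVER.search(q)))]
print(leftover.tolist())
```

An empty result on `df_filtered['qOriginText']` would indicate that the nutrition wording is gone.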

### 2. Diet

We add a vegetarian diet preference to roughly 10 percent of the personas.

In [62]:
def add_vegetarian_diet(persona):
    if np.random.rand() < 0.1:
        persona['diet'] = 'vegetarian'
    return persona

df_filtered['persona'] = df_filtered['persona'].apply(add_vegetarian_diet)      
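
Because `np.random.rand() < 0.1` decides for each persona independently and without a seed, the number of vegetarian personas varies from run to run and is 10 only on average. If an exact share and a repeatable result are wanted, one option is to draw the affected positions up front from a seeded generator. This is a minimal sketch under those assumptions; the helper name `assign_to_fraction` and the seed value are hypothetical and not part of the notebook.

```python
import copy
import numpy as np

def assign_to_fraction(personas, key, values, fraction, rng):
    """Return copies of `personas` where exactly round(fraction * n) entries get `key`."""
    n_pick = round(fraction * len(personas))
    # Draw distinct positions so the share is exact rather than approximate.
    picked = set(rng.choice(len(personas), size=n_pick, replace=False).tolist())
    result = []
    for i, persona in enumerate(personas):
        persona = copy.deepcopy(persona)  # leave the input dicts untouched
        if i in picked:
            persona[key] = str(rng.choice(values))
        result.append(persona)
    return result

rng = np.random.default_rng(42)  # arbitrary seed, only for repeatability
personas = [{"likes": [], "dislikes": []} for _ in range(100)]
with_diet = assign_to_fraction(personas, "diet", ["vegetarian"], 0.1, rng)

n_vegetarian = sum("diet" in p for p in with_diet)
print(n_vegetarian)
```

The same helper could be called with `key="cuisine"`, a list of cuisines and `fraction=0.2`.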

### 3. Add cuisine

In [63]:
def add_cuisine(persona):
    cuisines = ['central_europe', 'north_america', 'asia', 'latin_america', 'middle_east']
    if np.random.rand() < 0.2:
        persona['cuisine'] = str(np.random.choice(cuisines))

    return persona

df_filtered['persona'] = df_filtered['persona'].apply(add_cuisine)

### 4. Ingredient Normalisation

The personas contain many ingredient names, which should be normalised before use.

In [64]:
df_filtered['persona'].iloc[0]

{'likes': ['caramels'],
 'dislikes': ['plain low - fat yogurt', 'strawberry yogurt']}

In [65]:
normaliser = IngredientNormalisation(DatasetEnum.ALL_RECIPE)
def ingredient_normalisation(persona):
    likes = persona.get('likes', [])
    dislikes = persona.get('dislikes', [])
    likes_normalised = [normaliser.advanced_hybrid_search(ingredient)[0] for ingredient in likes]
    dislikes_normalised = [normaliser.advanced_hybrid_search(ingredient)[0] for ingredient in dislikes]
    persona['likes'] = likes_normalised
    persona['dislikes'] = dislikes_normalised
    return persona

df_filtered['persona'] = df_filtered['persona'].apply(ingredient_normalisation)

2025-08-16 15:02:52,212 - foodrec.data.load_ingredient_embeddings - INFO - EmbeddingLoader initialized with path: /Users/noah/Documents/github/MultiAgentBiase/system/foodrec/config/dataset/ingredient_embeddings/ingredient_embeddings_ALL_RECIPE.csv
2025-08-16 15:02:52,214 - foodrec.data.load_ingredient_embeddings - INFO - Starting embedding retrieval process...
2025-08-16 15:02:52,215 - foodrec.data.load_ingredient_embeddings - INFO - ✓ Found existing embeddings file: /Users/noah/Documents/github/MultiAgentBiase/system/foodrec/config/dataset/ingredient_embeddings/ingredient_embeddings_ALL_RECIPE.csv
2025-08-16 15:02:52,215 - foodrec.data.load_ingredient_embeddings - INFO - Loading existing embeddings...


####################Load Embeddings####################
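
The internals of `advanced_hybrid_search` are not shown here. The `[0]` indexing suggests it returns candidates with the best match first, in which case an empty result would raise an `IndexError`. The sketch below illustrates the general idea with a fallback to the original name. It uses `difflib` from the standard library as a stand-in matcher, which is not the project's method, and the vocabulary and function names are hypothetical.

```python
import difflib

def normalise_ingredient(name, vocabulary, cutoff=0.6):
    """Map `name` to its closest vocabulary entry, or keep it if nothing is close."""
    matches = difflib.get_close_matches(name.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else name

def normalise_persona(persona, vocabulary):
    """Return a copy of `persona` with normalised likes and dislikes."""
    return {
        **persona,  # keeps other keys such as diet or cuisine
        "likes": [normalise_ingredient(x, vocabulary) for x in persona.get("likes", [])],
        "dislikes": [normalise_ingredient(x, vocabulary) for x in persona.get("dislikes", [])],
    }

vocabulary = ["caramel", "yogurt", "strawberry yogurt"]  # made-up canonical names
persona = {"likes": ["caramels"], "dislikes": ["xyz"], "diet": "vegetarian"}
result = normalise_persona(persona, vocabulary)
print(result)
```

Returning the unmatched name unchanged keeps the persona usable, at the cost of leaving some ingredients unnormalised.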


In [67]:
df_filtered.insert(0, "id", range(1, len(df_filtered) + 1))
df_filtered.to_csv(DATASET_PATHS / "zw_personas.csv", index=False)
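
Note that `to_csv` writes each persona dict as its Python string representation, so the column comes back as plain strings when the file is read again. One way to restore the dicts is `ast.literal_eval`. The sketch below does a round trip through an in-memory buffer; the single row is made up for illustration.

```python
import ast
import io
import pandas as pd

# Toy frame with the same three columns as the saved file.
frame = pd.DataFrame({
    "id": [1],
    "persona": [{"likes": ["caramels"], "dislikes": ["strawberry yogurt"]}],
    "qOriginText": ["Recommend oatmeal recipes?"],  # hypothetical query
})

# Write to and read from an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

raw = loaded.loc[0, "persona"]  # a str such as "{'likes': [...], ...}"
loaded["persona"] = loaded["persona"].apply(ast.literal_eval)
restored = loaded.loc[0, "persona"]  # a dict again
print(type(raw).__name__, type(restored).__name__)
```

Saving with `to_json` instead would avoid the parsing step, at the cost of a different file format.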

## Summary

We now have 100 real queries and personas from the paper above that we can use to test our model. We randomly added cuisine and diet preferences and removed nutrition wording such as "low carbohydrates" or "high fat" from the queries, because those constraints should come from the system itself.