# Learning for Utility

This notebook trains a utility function for recipes, using the Food.com dataset: https://www.kaggle.com/datasets/realalexanderwei/food-com-recipes-with-ingredients-and-tags

## Idea behind the utility

Cluster the recipes into K categories based on the number of steps and the ingredient count (a measure of recipe complexity), together with an encoding of the most common tags. Each tag carries a reward or penalty, e.g. `less-than-4-hrs` 0.1, `less-than-1-hrs` 0.5.

## Ideas on tag encoding
Not all tags contribute positively to utility. For example:

- Positive / desirable tags: easy, 30-minutes-or-less, desserts, vegetables, main-dish
- Neutral / context tags: preparation, course, occasion, equipment
- Negative / high-effort tags: 4-hours-or-less, 3-steps-or-less (depending on your definition of convenience)

Assign weights:

- Positive tags → +1 (or normalized 0-1)
- Neutral tags → 0
- Negative tags → -1 (if you want to penalize)
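
A minimal sketch of this weighting scheme, assuming the +1 / 0 / -1 weights listed above. The `TAG_WEIGHTS` table and the `tag_score` helper are hypothetical names used only for illustration, and which tags count as positive or negative remains a judgment call.

```python
# Hypothetical weight table following the three groups above.
TAG_WEIGHTS = {
    "easy": 1.0,
    "30-minutes-or-less": 1.0,
    "desserts": 1.0,
    "vegetables": 1.0,
    "main-dish": 1.0,
    "preparation": 0.0,
    "course": 0.0,
    "occasion": 0.0,
    "equipment": 0.0,
    "4-hours-or-less": -1.0,
}


def tag_score(tags):
    """Sum the weights of a recipe's tags; unknown tags count as neutral (0)."""
    return sum(TAG_WEIGHTS.get(tag, 0.0) for tag in tags)


example_tags = ["easy", "course", "4-hours-or-less", "desserts"]
print(tag_score(example_tags))  # 1 + 0 - 1 + 1 = 1.0
```

Treating unknown tags as neutral keeps the score defined for recipes whose tags fall outside the table.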

## Approach

1. Cluster the recipes on the features (tag indicators, step count, ingredient count). This groups similar recipes together automatically.

2. Calculate cluster-level statistics from the features:
   - Average step count (`step_cnt`) → shorter recipes are better
   - Tag composition → clusters with "popular" tags score higher

3. Combine and normalize these statistics to get a proxy utility for each cluster.
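
Step 3 could look roughly like the sketch below. It assumes each cluster has already been reduced to two statistics, a mean step count and the share of recipes carrying "popular" tags. The numbers, the equal 0.5 / 0.5 weighting, and the choice of min-max normalization are all assumptions made for illustration.

```python
def min_max(values):
    """Rescale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical per-cluster statistics (not taken from the dataset).
avg_steps = [5.0, 12.0, 9.0]
popular_tag_share = [0.9, 0.2, 0.5]

steps_norm = min_max(avg_steps)
tags_norm = min_max(popular_tag_share)

# Fewer steps is better, so the normalized step count is inverted.
proxy_utility = [
    0.5 * (1.0 - s) + 0.5 * t for s, t in zip(steps_norm, tags_norm)
]
print(proxy_utility)
```

In this toy data the first cluster has both the fewest steps and the highest tag share, so it receives the highest proxy utility.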


## Data exploration

Examine the dataset

In [26]:
import pandas as pd
import os

In [27]:
csv_path = os.path.join("..","Data", "recipes_ingredients.csv")
df = pd.read_csv(csv_path)
df.columns

Index(['id', 'name', 'description', 'ingredients', 'ingredients_raw', 'steps',
       'servings', 'serving_size', 'tags'],
      dtype='object')

In [28]:
print(len(df))
df = df[df["tags"].notna()]
df = df[df["ingredients"].notna()]
df = df[df["steps"].notna()]
print(df.head(2))
columns_to_drop = ['description', 'servings', 'ingredients_raw']
df.drop(columns=columns_to_drop, inplace=True)
len(df)

500471
      id                             name  \
0  71247          Cherry Streusel Cobbler   
1  76133  Reuben and Swiss Casserole Bake   

                                         description  \
0  I haven't made this in years, so I'm just gues...   
1  I think this is even better than a reuben sand...   

                                         ingredients  \
0  ["cherry pie filling", "condensed milk", "melt...   
1  ["corned beef chopped", "sauerkraut cold water...   

                                     ingredients_raw  \
0  ["2 (21   ounce) cans   cherry pie filling","2...   
1  ["1/2-1   lb    corned beef, cooked and choppe...   

                                               steps  servings serving_size  \
0  ["Preheat oven to 375°F.", "Spread cherry pie ...       6.0    1 (347 g)   
1  ["Set oven to 350 degrees F.", "Butter a 9 x 1...       4.0    1 (207 g)   

                                                tags  
0  ["60-minutes-or-less", "time-to-make", "course...  
1 

500436

In [29]:
df.head()["steps"]

0    ["Preheat oven to 375°F.", "Spread cherry pie ...
1    ["Set oven to 350 degrees F.", "Butter a 9 x 1...
2    ["Preheat oven to 350°F  In a mixing bowl, usi...
3    ["In a large mixing bowl, combine the first 6 ...
4    ["Cream butter and sugars together.", "Blend i...
Name: steps, dtype: object

In [30]:
df.head()["ingredients"]

0    ["cherry pie filling", "condensed milk", "melt...
1    ["corned beef chopped", "sauerkraut cold water...
2    ["unsalted butter", "vegetable oil", "all - pu...
3    ["orange cake mix", "instant vanilla pudding",...
4    ["butter", "brown sugar", "granulated sugar", ...
Name: ingredients, dtype: object

In [31]:
import ast
import heapq
from collections import defaultdict

# Count how often each tag occurs across all recipes
tag_counts = defaultdict(int)
for r in df["tags"]:
    xs = ast.literal_eval(r) if isinstance(r, str) else []
    for x in xs:
        tag_counts[x] += 1

# Keep the top_n most frequent tags in a min-heap of (freq, tag) pairs
top_n = 25
pq = []
for tag, freq in tag_counts.items():
    if len(pq) < top_n:
        heapq.heappush(pq, (freq, tag))
    else:
        heapq.heappushpop(pq, (freq, tag))

# Extract top tags sorted by freq descending
top_tags = [tag for freq, tag in sorted(pq, key=lambda x: -x[0])]
print(top_tags)

['preparation', 'time-to-make', 'course', 'main-ingredient', 'dietary', 'easy', 'occasion', 'cuisine', 'low-in-something', '60-minutes-or-less', 'main-dish', 'equipment', '30-minutes-or-less', 'number-of-servings', 'meat', '4-hours-or-less', 'desserts', 'vegetables', '3-steps-or-less', 'taste-mood', 'north-american', 'low-sodium', 'low-carb', 'healthy', '15-minutes-or-less']
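
The same top-N selection can be written more compactly with `collections.Counter`. This is a sketch on a tiny made-up list standing in for `df["tags"]`; ties may come out in a different order than with the heap-based version.

```python
import ast
from collections import Counter

# Hypothetical stand-in for df["tags"]: each entry is a stringified list.
raw_tags = [
    '["easy", "desserts"]',
    '["easy", "main-dish"]',
    '["easy", "desserts", "healthy"]',
]

tag_counts = Counter()
for raw in raw_tags:
    tag_counts.update(ast.literal_eval(raw))

# most_common(n) returns the n most frequent (tag, count) pairs, highest first.
top_tags = [tag for tag, _ in tag_counts.most_common(2)]
print(top_tags)  # ['easy', 'desserts']
```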


## Data processing

How to vectorize the data?

1. Vectorize based on tag
2. Count the number of steps

### Vectorize tags

Convert the tags from a string to a list using the `ast` library.

The top 25 tags are stored in `top_tags`; add one 0/1 indicator column per tag to each recipe.

In [32]:
df['tags_list'] = df['tags'].apply(lambda x: ast.literal_eval(x))
df.head()['tags_list']

0    [60-minutes-or-less, time-to-make, course, mai...
1    [60-minutes-or-less, time-to-make, course, mai...
2    [time-to-make, course, main-ingredient, cuisin...
3    [60-minutes-or-less, time-to-make, course, pre...
4    [15-minutes-or-less, time-to-make, course, mai...
Name: tags_list, dtype: object

In [33]:
for tag in top_tags:
    df[f'tag_{tag}'] = df['tags_list'].apply(lambda x: 1 if tag in x else 0)
    
df.head()
df.drop(columns=["serving_size", "tags", "tags_list", "id"], inplace=True)

In [34]:
df.head(3)

Unnamed: 0,name,ingredients,steps,tag_preparation,tag_time-to-make,tag_course,tag_main-ingredient,tag_dietary,tag_easy,tag_occasion,...,tag_4-hours-or-less,tag_desserts,tag_vegetables,tag_3-steps-or-less,tag_taste-mood,tag_north-american,tag_low-sodium,tag_low-carb,tag_healthy,tag_15-minutes-or-less
0,Cherry Streusel Cobbler,"[""cherry pie filling"", ""condensed milk"", ""melt...","[""Preheat oven to 375°F."", ""Spread cherry pie ...",1,1,1,1,0,0,0,...,0,1,0,0,0,1,0,0,0,0
1,Reuben and Swiss Casserole Bake,"[""corned beef chopped"", ""sauerkraut cold water...","[""Set oven to 350 degrees F."", ""Butter a 9 x 1...",1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Yam-Pecan Recipe,"[""unsalted butter"", ""vegetable oil"", ""all - pu...","[""Preheat oven to 350°F In a mixing bowl, usi...",1,1,1,1,0,1,1,...,1,0,0,1,1,1,0,0,0,0
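
For a single recipe, the indicator encoding above reduces to a membership test per vocabulary tag. A minimal sketch, where `encode_tags` is a hypothetical helper and the three-tag vocabulary is a shortened stand-in for `top_tags`:

```python
def encode_tags(recipe_tags, vocabulary):
    """Return a dict of 0/1 indicators, one per tag in the vocabulary."""
    present = set(recipe_tags)
    return {f"tag_{tag}": int(tag in present) for tag in vocabulary}


vocabulary = ["easy", "desserts", "healthy"]  # hypothetical short vocabulary
row = encode_tags(["easy", "60-minutes-or-less"], vocabulary)
print(row)  # {'tag_easy': 1, 'tag_desserts': 0, 'tag_healthy': 0}
```

Tags outside the vocabulary, such as `60-minutes-or-less` here, are simply dropped from the encoding.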


### Vectorize steps count and ingredients

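The cells below approximate both counts by splitting the raw string: on `.` for steps and on `,` for ingredients. This counts punctuation rather than list items, so the results can overshoot; parsing the string into a list with `ast` first would give exact counts.

The counts are added as new columns.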

In [35]:
## Vectorize steps count

df["step_cnt"] = df["steps"].apply(lambda x: len(x.split(".")))
df.head(2)

Unnamed: 0,name,ingredients,steps,tag_preparation,tag_time-to-make,tag_course,tag_main-ingredient,tag_dietary,tag_easy,tag_occasion,...,tag_desserts,tag_vegetables,tag_3-steps-or-less,tag_taste-mood,tag_north-american,tag_low-sodium,tag_low-carb,tag_healthy,tag_15-minutes-or-less,step_cnt
0,Cherry Streusel Cobbler,"[""cherry pie filling"", ""condensed milk"", ""melt...","[""Preheat oven to 375°F."", ""Spread cherry pie ...",1,1,1,1,0,0,0,...,1,0,0,0,1,0,0,0,0,14
1,Reuben and Swiss Casserole Bake,"[""corned beef chopped"", ""sauerkraut cold water...","[""Set oven to 350 degrees F."", ""Butter a 9 x 1...",1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,0,8


In [36]:
## Vectorize ingredients count

df["ingredients_cnt"] = df["ingredients"].apply(lambda x: len(x.split(",")))
df.head(2)

Unnamed: 0,name,ingredients,steps,tag_preparation,tag_time-to-make,tag_course,tag_main-ingredient,tag_dietary,tag_easy,tag_occasion,...,tag_vegetables,tag_3-steps-or-less,tag_taste-mood,tag_north-american,tag_low-sodium,tag_low-carb,tag_healthy,tag_15-minutes-or-less,step_cnt,ingredients_cnt
0,Cherry Streusel Cobbler,"[""cherry pie filling"", ""condensed milk"", ""melt...","[""Preheat oven to 375°F."", ""Spread cherry pie ...",1,1,1,1,0,0,0,...,0,0,0,1,0,0,0,0,14,11
1,Reuben and Swiss Casserole Bake,"[""corned beef chopped"", ""sauerkraut cold water...","[""Set oven to 350 degrees F."", ""Butter a 9 x 1...",1,1,1,1,1,0,0,...,0,0,0,0,0,0,0,0,8,5
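
To see how string splitting and list parsing can disagree, here is a small sketch on two made-up strings in the same stringified-list format as the `steps` and `ingredients` columns. The strings are hypothetical, chosen to contain a decimal point and an embedded comma.

```python
import ast

# Hypothetical raw values in the same format as the dataset columns.
steps_raw = '["Preheat oven to 375 F.", "Mix 1.5 cups flour with sugar.", "Bake."]'
ingredients_raw = '["corned beef, cooked and chopped", "sauerkraut"]'

# Splitting the raw string counts punctuation marks, not list items.
steps_by_split = len(steps_raw.split("."))              # 5
ingredients_by_split = len(ingredients_raw.split(","))  # 3

# Parsing first counts the actual list elements.
steps_by_parse = len(ast.literal_eval(steps_raw))              # 3
ingredients_by_parse = len(ast.literal_eval(ingredients_raw))  # 2

print(steps_by_split, steps_by_parse)
print(ingredients_by_split, ingredients_by_parse)
```

Whether the difference matters for clustering depends on how often such punctuation appears in the real data, which this sketch does not measure.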


In [37]:
df.drop(columns=["ingredients", "steps"], inplace=True)
df.head(3)

Unnamed: 0,name,tag_preparation,tag_time-to-make,tag_course,tag_main-ingredient,tag_dietary,tag_easy,tag_occasion,tag_cuisine,tag_low-in-something,...,tag_vegetables,tag_3-steps-or-less,tag_taste-mood,tag_north-american,tag_low-sodium,tag_low-carb,tag_healthy,tag_15-minutes-or-less,step_cnt,ingredients_cnt
0,Cherry Streusel Cobbler,1,1,1,1,0,0,0,1,0,...,0,0,0,1,0,0,0,0,14,11
1,Reuben and Swiss Casserole Bake,1,1,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,8,5
2,Yam-Pecan Recipe,1,1,1,1,0,1,1,1,0,...,0,1,1,1,0,0,0,0,11,8


## Training

In [38]:
feature_cols = [col for col in df.columns if col not in ['name']]
X = df[feature_cols]

In [39]:
from sklearn.preprocessing import MinMaxScaler
from copy import deepcopy

scaled_cols = ['step_cnt', 'ingredients_cnt']

scaler = MinMaxScaler()
X_scaled = deepcopy(X)
X_scaled[scaled_cols] = scaler.fit_transform(X[scaled_cols])

In [40]:
print(len(X_scaled))
X_scaled.head()

500436


Unnamed: 0,tag_preparation,tag_time-to-make,tag_course,tag_main-ingredient,tag_dietary,tag_easy,tag_occasion,tag_cuisine,tag_low-in-something,tag_60-minutes-or-less,...,tag_vegetables,tag_3-steps-or-less,tag_taste-mood,tag_north-american,tag_low-sodium,tag_low-carb,tag_healthy,tag_15-minutes-or-less,step_cnt,ingredients_cnt
0,1,1,1,1,0,0,0,1,0,1,...,0,0,0,1,0,0,0,0,0.090909,0.232558
1,1,1,1,1,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0.048951,0.093023
2,1,1,1,1,0,1,1,1,0,0,...,0,1,1,1,0,0,0,0,0.06993,0.162791
3,1,1,1,0,1,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0.076923,0.162791
4,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0.027972,0.139535
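
With its default settings, `MinMaxScaler` maps each column to [0, 1] via `(x - min) / (max - min)`. The sketch below reproduces two of the scaled values shown above. The column ranges (1 to 144 steps, 1 to 44 ingredients) are assumptions inferred from the printed output, not values the notebook reports.

```python
def min_max_scale(x, lo, hi):
    """Map x from [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)


# Assumed column ranges, inferred from the printed scaled values.
STEP_MIN, STEP_MAX = 1, 144
ING_MIN, ING_MAX = 1, 44

print(round(min_max_scale(14, STEP_MIN, STEP_MAX), 6))  # 0.090909
print(round(min_max_scale(11, ING_MIN, ING_MAX), 6))    # 0.232558
```

If those ranges are right, a few very long recipes compress most step counts into a narrow band near zero, which limits how much the count features influence the clustering relative to the 0/1 tag columns.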


In [41]:
from sklearn.cluster import KMeans
import joblib

kmeans = KMeans(n_clusters=2000, random_state=42)
df['cluster'] = kmeans.fit_predict(X_scaled)
joblib.dump(kmeans, 'kmeans_model.joblib')

['kmeans_model.joblib']

In [42]:
df[df['cluster']==3][['name', 'cluster']]

Unnamed: 0,name,cluster
8609,Persian Chicken - Tah Cheen,3
11622,Scotch Broth,3
14686,Sinfully Delicious Indian Ginger Mutton Karahi,3
15075,Bon Bon Ribs #2,3
17927,Beef Bourguignon (Crock Pot),3
...,...,...
486467,Vietnamese Lemongrass Porkabobs - Thit Lui,3
487540,Panzanella,3
490472,Ronny's Sugar-free Fruity Spice Cake,3
491117,Mama Zuquinis Brodo,3
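
Once k-means is fitted, assigning a recipe to a cluster amounts to finding the nearest centroid in feature space. A pure-Python sketch with made-up 2-D centroids; the notebook itself relies on scikit-learn's `KMeans` for this, and `nearest_centroid` is a hypothetical helper.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


# Hypothetical centroids in a 2-D feature space.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(nearest_centroid((0.9, 0.8), centroids))  # 1
```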


## Utility for each cluster

1. Assign rewards/weights to each characteristic manually.
2. Positive tags should earn a larger reward.
3. Neutral tags should earn a smaller reward.
4. The fewer the steps and ingredients, the higher the reward (or the smaller the penalty).

In [43]:
cols_for_summary = [col for col in df.columns if col != 'name' and col != 'cluster']

In [44]:
cluster_summary = df.groupby('cluster')[cols_for_summary].mean()
cluster_summary

cluster,tag_preparation,tag_time-to-make,tag_course,tag_main-ingredient,tag_dietary,tag_easy,tag_occasion,tag_cuisine,tag_low-in-something,tag_60-minutes-or-less,...,tag_vegetables,tag_3-steps-or-less,tag_taste-mood,tag_north-american,tag_low-sodium,tag_low-carb,tag_healthy,tag_15-minutes-or-less,step_cnt,ingredients_cnt
0,1.0,1.000000,0.989362,0.172340,0.000000,1.000000,0.106383,1.000000,0.000000,0.000000,...,0.000000,0.834043,0.000000,0.000000,0.000000,0.000000,0.000000,0.997872,5.412766,6.217021
1,1.0,1.000000,1.000000,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,1.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,12.154381,8.984701
2,1.0,1.000000,1.000000,1.000000,0.847134,1.000000,1.000000,1.000000,0.038217,0.000000,...,1.000000,0.191083,0.363057,0.146497,0.000000,0.000000,0.000000,0.000000,13.235669,10.101911
3,1.0,0.945055,0.890110,1.000000,0.972527,0.054945,1.000000,0.945055,0.010989,0.082418,...,0.032967,0.000000,0.945055,0.071429,0.005495,0.000000,0.016484,0.203297,12.434066,9.170330
4,1.0,1.000000,0.848057,1.000000,1.000000,1.000000,0.130742,0.000000,1.000000,0.000000,...,0.925795,1.000000,0.000000,0.000000,1.000000,0.689046,1.000000,0.000000,6.173145,6.236749
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1.0,1.000000,0.912621,1.000000,1.000000,1.000000,0.213592,0.087379,1.000000,0.000000,...,0.305825,1.000000,0.000000,0.000000,1.000000,1.000000,0.000000,0.995146,4.466019,4.330097
1996,1.0,1.000000,1.000000,1.000000,1.000000,0.000000,0.773770,1.000000,0.042623,0.000000,...,0.000000,0.000000,0.000000,1.000000,0.000000,0.000000,0.068852,0.000000,11.550820,8.144262
1997,1.0,1.000000,1.000000,0.884422,1.000000,0.135678,0.050251,1.000000,1.000000,0.000000,...,0.000000,0.000000,0.000000,0.170854,1.000000,0.125628,0.361809,0.000000,11.613065,8.567839
1998,1.0,1.000000,0.965517,1.000000,1.000000,0.027586,0.034483,1.000000,1.000000,0.986207,...,0.800000,0.000000,0.000000,0.027586,0.096552,0.806897,1.000000,0.000000,12.006897,9.689655


In [45]:
for t in top_tags:
    print(f"tag_{t}")

tag_preparation
tag_time-to-make
tag_course
tag_main-ingredient
tag_dietary
tag_easy
tag_occasion
tag_cuisine
tag_low-in-something
tag_60-minutes-or-less
tag_main-dish
tag_equipment
tag_30-minutes-or-less
tag_number-of-servings
tag_meat
tag_4-hours-or-less
tag_desserts
tag_vegetables
tag_3-steps-or-less
tag_taste-mood
tag_north-american
tag_low-sodium
tag_low-carb
tag_healthy
tag_15-minutes-or-less


In [46]:
import numpy as np
def sample_tag_score(tag_type):
    if tag_type == 'P':  # Positive
        return np.random.normal(5, 1)
    elif tag_type == 'N':  # Negative
        return np.random.normal(-5, 1)
    elif tag_type in ('NSC', 'NIC'):  # Step / ingredient count: mild penalty
        return np.random.normal(-1, 1)
    else:  # Neutral
        return np.random.normal(0, 1)

columns = {
    'tag_easy': 'P',
    'tag_15-minutes-or-less': 'P',
    'tag_30-minutes-or-less': 'P',
    'tag_60-minutes-or-less': 'P',
    'tag_4-hours-or-less': 'N',
    'tag_3-steps-or-less': 'P',
    'tag_healthy': 'P',
    'tag_low-sodium': 'P',
    'tag_low-carb': 'P',
    'tag_low-in-something': 'N',
    'tag_main-dish': 'P',
    'tag_desserts': 'O',
    'tag_vegetables': 'P',
    'tag_meat': 'O',
    'tag_preparation': 'O',
    'tag_time-to-make': 'O',
    'tag_number-of-servings': 'O',
    'tag_equipment': 'O',
    'tag_cuisine': 'O',
    'tag_north-american': 'O',
    'tag_occasion': 'O',
    'tag_taste-mood': 'O',
    'tag_main-ingredient': 'O',
    'tag_dietary': 'O',
    'tag_course': 'O',
    'step_cnt': 'NSC',
    'ingredients_cnt': 'NIC'
}

In [50]:
cluster_summary['utility'] = cluster_summary[list(columns.keys())].apply(
    lambda row: sum(row[tag] * sample_tag_score(columns[tag]) for tag in columns),
    axis=1,
)
print(cluster_summary)

         tag_preparation  tag_time-to-make  tag_course  tag_main-ingredient  \
cluster                                                                       
0                    1.0          1.000000    0.989362             0.172340   
1                    1.0          1.000000    1.000000             1.000000   
2                    1.0          1.000000    1.000000             1.000000   
3                    1.0          0.945055    0.890110             1.000000   
4                    1.0          1.000000    0.848057             1.000000   
...                  ...               ...         ...                  ...   
1995                 1.0          1.000000    0.912621             1.000000   
1996                 1.0          1.000000    1.000000             1.000000   
1997                 1.0          1.000000    1.000000             0.884422   
1998                 1.0          1.000000    0.965517             1.000000   
1999                 1.0          1.000000    0.9538
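
The cell above draws a fresh random score for every tag in every row, so two clusters with identical statistics can end up with different utilities. One alternative, sketched here under the assumption that weights are meant to be shared across clusters, is to sample the weights once and reuse them. `sample_weights` and `utility` are hypothetical helpers, and the three-column setup is a shortened stand-in for the full `columns` dict.

```python
import random


def sample_weights(column_types, seed=0):
    """Draw one weight per column from a normal distribution chosen by its type."""
    means = {"P": 5.0, "N": -5.0, "NSC": -1.0, "NIC": -1.0}  # other types: mean 0
    rng = random.Random(seed)
    return {
        col: rng.gauss(means.get(kind, 0.0), 1.0)
        for col, kind in column_types.items()
    }


def utility(row, weights):
    """Weighted sum of a cluster's feature means."""
    return sum(row[col] * w for col, w in weights.items())


column_types = {"tag_easy": "P", "tag_4-hours-or-less": "N", "step_cnt": "NSC"}
weights = sample_weights(column_types, seed=42)

# Two hypothetical clusters with identical statistics now score identically.
cluster_a = {"tag_easy": 1.0, "tag_4-hours-or-less": 0.0, "step_cnt": 0.05}
cluster_b = {"tag_easy": 1.0, "tag_4-hours-or-less": 0.0, "step_cnt": 0.05}
print(utility(cluster_a, weights) == utility(cluster_b, weights))  # True
```

Note also that `cluster_summary` holds raw step and ingredient counts (values such as 5.4 or 12.2), so those terms sit on a much larger scale than the 0-1 tag means; rescaling them before the weighted sum may be worth considering.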

In [53]:
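utility_map = {}

# After groupby, 'cluster' is the index of cluster_summary, not a column,
# so iterate over (index, value) pairs of the utility Series.
for cluster_id, utility in cluster_summary['utility'].items():
    utility_map[int(cluster_id)] = float(utility)

print(utility_map)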
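
With a cluster-to-utility mapping in hand, scoring a recipe is a dictionary lookup on its cluster id. A small sketch with made-up utilities; `recipe_utility` and the fallback value of 0.0 are assumptions for illustration.

```python
def recipe_utility(cluster_id, utility_map, default=0.0):
    """Look up the utility of a recipe's cluster; unseen clusters fall back to default."""
    return utility_map.get(cluster_id, default)


# Hypothetical utilities for three clusters.
example_map = {0: 12.4, 1: -3.1, 2: 7.9}
print(recipe_utility(2, example_map))   # 7.9
print(recipe_utility(99, example_map))  # 0.0
```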