# Data Prep

First, download the [Kaggle dataset](https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions/metadata), unzip it, and rename the extracted directory to `food_com`.
Then load the raw recipes into a `pandas` dataframe:

In [1]:
import pandas as pd

df_raw = pd.read_csv("food_com/RAW_recipes.csv")
df_raw.head()

Unnamed: 0,name,id,minutes,contributor_id,submitted,tags,nutrition,n_steps,steps,description,ingredients,n_ingredients
0,arriba baked winter squash mexican style,137739,55,47892,2005-09-16,"['60-minutes-or-less', 'time-to-make', 'course...","[51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0]",11,"['make a choice and proceed with recipe', 'dep...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ...",7
1,a bit different breakfast pizza,31490,30,26278,2002-06-17,"['30-minutes-or-less', 'time-to-make', 'course...","[173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0]",9,"['preheat oven to 425 degrees f', 'press dough...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg...",6
2,all in the kitchen chili,112140,130,196586,2005-02-25,"['time-to-make', 'course', 'preparation', 'mai...","[269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0]",6,"['brown ground beef in large pot', 'add choppe...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato...",13
3,alouette potatoes,59389,45,68585,2003-04-14,"['60-minutes-or-less', 'time-to-make', 'course...","[368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0]",11,['place potatoes in a large pot of lightly sal...,"this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n...",11
4,amish tomato ketchup for canning,44061,190,41706,2002-10-25,"['weeknight', 'time-to-make', 'course', 'main-...","[352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0]",5,['mix all ingredients& boil for 2 1 / 2 hours ...,my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar...",8


We'll keep only a subset of the available columns, to avoid republishing the entire dataset.

In [2]:
df = df_raw[["id", "name", "minutes", "tags", "description", "ingredients"]]
df.head()

Unnamed: 0,id,name,minutes,tags,description,ingredients
0,137739,arriba baked winter squash mexican style,55,"['60-minutes-or-less', 'time-to-make', 'course...",autumn is my favorite time of year to cook! th...,"['winter squash', 'mexican seasoning', 'mixed ..."
1,31490,a bit different breakfast pizza,30,"['30-minutes-or-less', 'time-to-make', 'course...",this recipe calls for the crust to be prebaked...,"['prepared pizza crust', 'sausage patty', 'egg..."
2,112140,all in the kitchen chili,130,"['time-to-make', 'course', 'preparation', 'mai...",this modified version of 'mom's' chili was a h...,"['ground beef', 'yellow onions', 'diced tomato..."
3,59389,alouette potatoes,45,"['60-minutes-or-less', 'time-to-make', 'course...","this is a super easy, great tasting, make ahea...","['spreadable cheese with garlic and herbs', 'n..."
4,44061,amish tomato ketchup for canning,190,"['weeknight', 'time-to-make', 'course', 'main-...",my dh's amish mother raised him on this recipe...,"['tomato juice', 'apple cider vinegar', 'sugar..."


Next, let's restrict the recipes to those made entirely of common ingredients. The dictionary below maps ingredient names as they appear in the dataset to canonical labels:

In [3]:
common_ingredient = {
    "water": "Water",
    "salt": "Salt",
    "pepper": "BlackPepper",
    "black pepper": "BlackPepper",
    "dried basil": "Basil",

    "butter": "Butter",
    "milk": "CowMilk",
    "egg": "ChickenEgg",
    "eggs": "ChickenEgg",
    "bacon": "Bacon",

    "sugar": "WhiteSugar",
    "brown sugar": "BrownSugar",
    "honey": "Honey",
    "vanilla": "VanillaExtract",
    "vanilla extract": "VanillaExtract",

    "flour": "AllPurposeFlour",
    "all-purpose flour": "AllPurposeFlour",
    "whole wheat flour": "WholeWheatFlour",

    "olive oil": "OliveOil",
    "vinegar": "AppleCiderVinegar",

    "garlic": "Garlic",
    "garlic clove": "Garlic",
    "garlic cloves": "Garlic",
}
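
As a rough illustration of how a mapping like this might be applied, the sketch below normalizes one recipe's ingredient list to canonical labels. The helper name `to_labels` and the four-entry excerpt of the dictionary are assumptions made for this example; they are not part of the notebook.

```python
# Excerpt of the mapping above, repeated so the example is self-contained.
common_ingredient_excerpt = {
    "egg": "ChickenEgg",
    "eggs": "ChickenEgg",
    "flour": "AllPurposeFlour",
    "milk": "CowMilk",
}


def to_labels(ingredients, mapping):
    """Hypothetical helper: sorted canonical labels, or None if any ingredient is unmapped."""
    labels = set()
    for raw in ingredients:
        name = raw.strip()
        if name not in mapping:
            return None
        labels.add(mapping[name])
    return sorted(labels)


pancakes = to_labels(["eggs", "flour", " milk"], common_ingredient_excerpt)
unknown = to_labels(["eggs", "saffron"], common_ingredient_excerpt)
print(pancakes)
print(unknown)
```

Returning `None` for any unmapped ingredient mirrors the all-or-nothing filter used in the next cell.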

In [4]:
import ast
from collections import defaultdict

counts = defaultdict(int)
simple_recipes = set()

for index, row in df.iterrows():
    # the ingredients column holds a stringified Python list; parse it safely
    ind_list = ast.literal_eval(row["ingredients"])
    all_known = True

    for ind in ind_list:
        ingredient = ind.strip()
        counts[ingredient] += 1

        if ingredient not in common_ingredient:
            all_known = False

    # keep recipes with at least 3 ingredients, all of them common
    if all_known and len(ind_list) >= 3:
        simple_recipes.add(row["id"])
        #print(row["name"], len(ind_list))

print(len(simple_recipes))

240
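
The `ingredients` column stores each list as a string, so it has to be parsed before it can be iterated. A minimal sketch of that parsing step with `ast.literal_eval`, which accepts only Python literals and therefore refuses arbitrary expressions. The sample strings are made up for illustration.

```python
import ast

# A value shaped like the ones in the `ingredients` column (sample data).
cell = "['winter squash', 'mexican seasoning', 'honey']"
parsed = ast.literal_eval(cell)
print(parsed, len(parsed))

# Anything that is not a plain literal is rejected instead of being executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```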


In [5]:
for k, v in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:100]:
    if k not in common_ingredient:
        print(v, k)

39065 onion
17504 baking powder
15415 salt and pepper
14807 parmesan cheese
14233 lemon juice
14099 baking soda
13912 vegetable oil
12560 cinnamon
11950 tomatoes
11779 sour cream
10887 garlic powder
9925 oil
9872 onions
9827 cream cheese
9541 celery
8969 cheddar cheese
8935 unsalted butter
8856 soy sauce
8736 mayonnaise
7982 paprika
7963 chicken broth
7832 worcestershire sauce
7704 extra virgin olive oil
7656 fresh parsley
7486 cornstarch
7160 fresh ground black pepper
7023 carrots
7001 parsley
6984 chili powder
6864 ground cinnamon
6707 carrot
6507 potatoes
6299 nutmeg
6285 cayenne pepper
6254 granulated sugar
6169 ground cumin
5824 ground beef
5814 green onions
5777 red onion
5765 walnuts
5752 pecans
5599 dijon mustard
5585 green onion
5583 kosher salt
5377 powdered sugar
5311 fresh lemon juice
5201 heavy cream
5077 margarine
4980 mozzarella cheese
4882 dried oregano
4702 orange juice
4591 zucchini
4487 raisins
4450 red bell pepper
4402 tomato sauce
4360 fresh cilantro
4352 chicken s
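
The same kind of tally can be sketched with `collections.Counter`, whose `most_common()` method returns the entries already sorted by count. The toy recipes and the `known` set below are assumptions for illustration, not data from the notebook.

```python
from collections import Counter

# Toy ingredient lists standing in for the parsed `ingredients` column.
toy_recipes = [
    ["onion", "salt", "butter"],
    ["onion", "baking powder", "salt"],
    ["onion", "baking powder", "lemon juice"],
]
known = {"salt", "butter"}  # stands in for the keys of `common_ingredient`

toy_counts = Counter(
    ingredient.strip() for recipe in toy_recipes for ingredient in recipe
)

# Frequent ingredients that the mapping does not cover yet.
unmapped = [(name, n) for name, n in toy_counts.most_common() if name not in known]
for name, n in unmapped:
    print(n, name)
```

A listing like this is one way to spot candidates for extending the mapping.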

That gives us 240 recipes, each with at least three ingredients and all of them common, as a working set to build the KG examples. Let's create a subset of the dataframe and save it to the `recipes.csv` file:

In [6]:
df_simple = df[df["id"].isin(simple_recipes)]
df_simple.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 240 entries, 762 to 230696
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           240 non-null    int64 
 1   name         240 non-null    object
 2   minutes      240 non-null    int64 
 3   tags         240 non-null    object
 4   description  235 non-null    object
 5   ingredients  240 non-null    object
dtypes: int64(2), object(4)
memory usage: 13.1+ KB


In [7]:
df_simple.to_csv("recipes.csv", index=False)
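
Because list-valued columns are saved as strings, a later reader of the file has to parse them again. One possible way to do that at load time is the `converters` argument of `pd.read_csv`. The sketch below round-trips a made-up one-row dataframe through an in-memory buffer instead of touching `recipes.csv`.

```python
import ast
import io

import pandas as pd

# Made-up row with a stringified list, shaped like the saved data.
toy = pd.DataFrame(
    {
        "id": [1],
        "name": ["toy recipe"],
        "ingredients": ["['salt', 'water', 'butter']"],
    }
)

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the list column back into Python lists while loading.
loaded = pd.read_csv(buffer, converters={"ingredients": ast.literal_eval})
first = loaded.loc[0, "ingredients"]
print(first)
```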

We'll also keep a list of **all** ingredients, to use for the embedding example:

In [8]:
df_ind = df_raw[["id", "name", "minutes", "ingredients"]]
df_ind.head()

Unnamed: 0,id,name,minutes,ingredients
0,137739,arriba baked winter squash mexican style,55,"['winter squash', 'mexican seasoning', 'mixed ..."
1,31490,a bit different breakfast pizza,30,"['prepared pizza crust', 'sausage patty', 'egg..."
2,112140,all in the kitchen chili,130,"['ground beef', 'yellow onions', 'diced tomato..."
3,59389,alouette potatoes,45,"['spreadable cheese with garlic and herbs', 'n..."
4,44061,amish tomato ketchup for canning,190,"['tomato juice', 'apple cider vinegar', 'sugar..."


In [9]:
df_ind.to_csv("all_ind.csv", index=False)