# Data Prep

First, download the [Kaggle dataset](https://www.kaggle.com/shuyangli94/food-com-recipes-and-user-interactions/metadata), unzip it, and rename the extracted directory to `food_com`.
Then load the raw recipes file as a `pandas` DataFrame:

In [33]:
import pandas as pd

df_raw = pd.read_csv("food_com/RAW_recipes.csv")
df_raw.head()

|   | name | id | minutes | contributor_id | submitted | tags | nutrition | n_steps | steps | description | ingredients | n_ingredients |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arriba baked winter squash mexican style | 137739 | 55 | 47892 | 2005-09-16 | ['60-minutes-or-less', 'time-to-make', 'course... | [51.5, 0.0, 13.0, 0.0, 2.0, 0.0, 4.0] | 11 | ['make a choice and proceed with recipe', 'dep... | autumn is my favorite time of year to cook! th... | ['winter squash', 'mexican seasoning', 'mixed ... | 7 |
| 1 | a bit different breakfast pizza | 31490 | 30 | 26278 | 2002-06-17 | ['30-minutes-or-less', 'time-to-make', 'course... | [173.4, 18.0, 0.0, 17.0, 22.0, 35.0, 1.0] | 9 | ['preheat oven to 425 degrees f', 'press dough... | this recipe calls for the crust to be prebaked... | ['prepared pizza crust', 'sausage patty', 'egg... | 6 |
| 2 | all in the kitchen chili | 112140 | 130 | 196586 | 2005-02-25 | ['time-to-make', 'course', 'preparation', 'mai... | [269.8, 22.0, 32.0, 48.0, 39.0, 27.0, 5.0] | 6 | ['brown ground beef in large pot', 'add choppe... | this modified version of 'mom's' chili was a h... | ['ground beef', 'yellow onions', 'diced tomato... | 13 |
| 3 | alouette potatoes | 59389 | 45 | 68585 | 2003-04-14 | ['60-minutes-or-less', 'time-to-make', 'course... | [368.1, 17.0, 10.0, 2.0, 14.0, 8.0, 20.0] | 11 | ['place potatoes in a large pot of lightly sal... | this is a super easy, great tasting, make ahea... | ['spreadable cheese with garlic and herbs', 'n... | 11 |
| 4 | amish tomato ketchup for canning | 44061 | 190 | 41706 | 2002-10-25 | ['weeknight', 'time-to-make', 'course', 'main-... | [352.9, 1.0, 337.0, 23.0, 3.0, 0.0, 28.0] | 5 | ['mix all ingredients& boil for 2 1 / 2 hours ... | my dh's amish mother raised him on this recipe... | ['tomato juice', 'apple cider vinegar', 'sugar... | 8 |
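
The preview shows that list-valued columns such as `tags` and `ingredients` arrive as strings rather than Python lists. Here is a minimal sketch of parsing them at load time through the `converters` argument of `pd.read_csv`. The helper name `load_recipes` and the toy rows are assumptions made for illustration; they are not part of the dataset or of this notebook.

```python
import ast
import io

import pandas as pd

# Toy stand-in for RAW_recipes.csv (hypothetical rows, not taken from the dataset).
SAMPLE_CSV = '''id,name,ingredients
1,toy recipe a,"['water', 'salt']"
2,toy recipe b,"['eggs', 'butter', 'milk']"
'''


def load_recipes(source, list_columns=("ingredients",)):
    """Read a recipes CSV, turning stringified lists into real Python lists.

    Hypothetical helper: `ast.literal_eval` only accepts Python literals,
    so it cannot execute arbitrary code the way `eval` can.
    """
    converters = {col: ast.literal_eval for col in list_columns}
    return pd.read_csv(source, converters=converters)


sample = load_recipes(io.StringIO(SAMPLE_CSV))
print(sample["ingredients"].iloc[0])
```

With the real file, the same call would presumably be `load_recipes("food_com/RAW_recipes.csv", list_columns=("tags", "ingredients"))`, provided none of those cells is empty.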


We'll keep only a subset of the available columns, to avoid republishing the entire dataset.

In [34]:
df = df_raw[["id", "name", "minutes", "tags", "description", "ingredients"]]
df.head()

|   | id | name | minutes | tags | description | ingredients |
|---|---|---|---|---|---|---|
| 0 | 137739 | arriba baked winter squash mexican style | 55 | ['60-minutes-or-less', 'time-to-make', 'course... | autumn is my favorite time of year to cook! th... | ['winter squash', 'mexican seasoning', 'mixed ... |
| 1 | 31490 | a bit different breakfast pizza | 30 | ['30-minutes-or-less', 'time-to-make', 'course... | this recipe calls for the crust to be prebaked... | ['prepared pizza crust', 'sausage patty', 'egg... |
| 2 | 112140 | all in the kitchen chili | 130 | ['time-to-make', 'course', 'preparation', 'mai... | this modified version of 'mom's' chili was a h... | ['ground beef', 'yellow onions', 'diced tomato... |
| 3 | 59389 | alouette potatoes | 45 | ['60-minutes-or-less', 'time-to-make', 'course... | this is a super easy, great tasting, make ahea... | ['spreadable cheese with garlic and herbs', 'n... |
| 4 | 44061 | amish tomato ketchup for canning | 190 | ['weeknight', 'time-to-make', 'course', 'main-... | my dh's amish mother raised him on this recipe... | ['tomato juice', 'apple cider vinegar', 'sugar... |


Next, let's restrict the recipes to those made entirely of common ingredients. The dictionary below maps each raw ingredient string to a canonical label:

In [35]:
common_ingredient = {
    "water": "Water",
    "salt": "Salt",
    "pepper": "BlackPepper",
    "black pepper": "BlackPepper",
    
    "baking powder": "BakingPowder",
    "baking soda": "BakingSoda",

    "vanilla": "VanillaExtract",
    "vanilla extract": "VanillaExtract",

    "butter": "Butter",
    "milk": "CowMilk",
    "egg": "ChickenEgg",
    "eggs": "ChickenEgg",

    "sugar": "WhiteSugar",
    "brown sugar": "BrownSugar",
    "honey": "Honey",

    "flour": "AllPurposeFlour",
    "all-purpose flour": "AllPurposeFlour",
    "whole wheat flour": "WholeWheatFlour",

    "olive oil": "OliveOil",
    "vinegar": "AppleCiderVinegar",

    "onion": "Onion",
    "onions": "Onion",
    "garlic": "Garlic",
    "garlic clove": "Garlic",
    "garlic cloves": "Garlic",
    "cabbage": "Cabbage",
    "carrot": "Carrot",
    "carrots": "Carrot",
    "celery": "Celery",
    "potato": "Potato",
    "potatoes": "Potato",
    "tomato": "Tomato",
    "tomatoes": "Tomato",
}
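
Several keys in the mapping above point to the same label (for example `egg` and `eggs`), so a lookup collapses spelling variants into one canonical name. A minimal sketch of that lookup follows. The helper `canonical_label`, the excerpt dictionary, and the input `saffron` are assumptions for illustration only.

```python
from typing import Optional

# Excerpt of the mapping above; the full `common_ingredient` dict behaves the same way.
common_ingredient_excerpt = {
    "egg": "ChickenEgg",
    "eggs": "ChickenEgg",
    "garlic clove": "Garlic",
    "garlic cloves": "Garlic",
    "flour": "AllPurposeFlour",
}


def canonical_label(raw: str, mapping: dict) -> Optional[str]:
    """Return the canonical label for a raw ingredient string, or None if unknown."""
    return mapping.get(raw.strip())


# "saffron" is a hypothetical ingredient that is absent from the mapping.
labels = [
    canonical_label(s, common_ingredient_excerpt)
    for s in ["eggs", " garlic clove ", "saffron"]
]
print(labels)  # ['ChickenEgg', 'Garlic', None]
```

A `None` result is what marks a recipe as containing an ingredient outside the common set.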

In [36]:
import ast
from collections import defaultdict

counts = defaultdict(int)
simple_recipes = set()

for index, row in df.iterrows():
    # each ingredient list is stored as a string; parse it without executing code
    ind_list = ast.literal_eval(row["ingredients"])
    all_known = True

    for ind in ind_list:
        ingredient = ind.strip()
        counts[ingredient] += 1
        
        if ingredient not in common_ingredient:
            all_known = False

    if all_known:
        simple_recipes.add(row["id"])
        print(row["name"], len(ind_list))

print("---")

for k, v in sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:100]:
    if k not in common_ingredient:
        print(v, k)
    
print(len(simple_recipes))

make it your way  shortcakes 7
better  cake mix 3
1 bowl 1 person mashed potatoes 5
1 1 1 tempura batter 3
2 step pound cake  for a kitchen aide mixer 6
30 minute baked potato 2
40 second omelet 3
a 1 dumplings 4
ableskiver   danish doughnuts 6
afghan eggs and tomato  tukhum bonjan or agay bonjan 4
all purpose dinner crepes batter 5
all purpose quick mix with 28 variations 5
american style vanilla biscotti 7
amish butterscotch brownies 7
amish dumplings 4
amish friendship starter 3
ann s shortbread 4
anna s brilliant biscuits 5
another perfect poached egg 4
anya s dutch pancakes 5
anytime crepes 3
apple and banana sauce 6
astoria frosting 6
av s  biscoitos  portuguese biscotti cookies 5
baby dutchman 5
baked eggs in tomatoes 5
baked finnish pancakes 6
baked potato onion wrap ups 2
baked potatoes 2
bakes   baking powder biscuits from barbados 8
baking powder dumplings 5
bannock 5
basic biscotti recipe 7
basic biscuits 5
basic biscuits my style 5
basic crepes 5
basic crepes ii 5
basic cr
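
The filtering loop above can also be sketched more compactly with `collections.Counter` and a subset test. This is an alternative formulation under stated assumptions, not the notebook's own method: the `known` set stands in for the keys of `common_ingredient`, and the toy rows are hypothetical.

```python
import ast
from collections import Counter

import pandas as pd

# Hypothetical stand-ins for `common_ingredient` keys and the recipe dataframe.
known = {"water", "salt", "flour", "eggs"}
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "ingredients": [
        "['flour', 'water', 'salt']",
        "['eggs', 'saffron']",
        "['eggs', 'salt']",
    ],
})

# Parse each stringified list, then strip whitespace from every ingredient.
parsed = toy["ingredients"].map(ast.literal_eval)
parsed = parsed.map(lambda items: [item.strip() for item in items])

# Frequency of every ingredient across all recipes.
counts = Counter(item for items in parsed for item in items)

# A recipe is "simple" when its ingredient set is a subset of the known set.
is_simple = parsed.map(lambda items: set(items) <= known)
simple_ids = set(toy.loc[is_simple, "id"].tolist())

# Among the 100 most frequent ingredients, list those not yet in the mapping.
unknown_by_freq = [(name, n) for name, n in counts.most_common(100) if name not in known]

print(simple_ids)
print(unknown_by_freq)
```

The list of frequent but unmapped ingredients is useful for deciding which keys to add to the mapping next.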

That gives us several hundred recipes (613, per the `info()` output below) as a working set for building the knowledge graph (KG) examples. Let's create a subset of the dataframe and save it to the `recipes.csv` file:

In [37]:
df_simple = df[df["id"].isin(simple_recipes)]
df_simple.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 613 entries, 56 to 231625
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           613 non-null    int64 
 1   name         613 non-null    object
 2   minutes      613 non-null    int64 
 3   tags         613 non-null    object
 4   description  598 non-null    object
 5   ingredients  613 non-null    object
dtypes: int64(2), object(4)
memory usage: 33.5+ KB


In [38]:
df_simple.to_csv("recipes.csv", index=False)
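
As a quick sanity check, the saved file can be read back and its ingredient strings parsed again. The sketch below uses a toy dataframe and a temporary directory, both assumptions for illustration. It also shows why `index=False` matters: without it, reloading the CSV would add an extra `Unnamed: 0` column.

```python
import ast
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for `df_simple`.
toy = pd.DataFrame({
    "id": [1, 2],
    "name": ["toy a", "toy b"],
    "ingredients": ["['water', 'salt']", "['eggs']"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "recipes.csv")
    toy.to_csv(path, index=False)  # index=False keeps the row index out of the file
    restored = pd.read_csv(path)

# The ingredient lists come back as strings and need parsing again.
ingredient_lists = restored["ingredients"].map(ast.literal_eval).tolist()
print(list(restored.columns))
print(ingredient_lists)
```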

We'll also keep the ingredient lists for **all** recipes, to use for the embedding example:

In [39]:
df_ind = df_raw[["id", "ingredients"]]
df_ind.head()

|   | id | ingredients |
|---|---|---|
| 0 | 137739 | ['winter squash', 'mexican seasoning', 'mixed ... |
| 1 | 31490 | ['prepared pizza crust', 'sausage patty', 'egg... |
| 2 | 112140 | ['ground beef', 'yellow onions', 'diced tomato... |
| 3 | 59389 | ['spreadable cheese with garlic and herbs', 'n... |
| 4 | 44061 | ['tomato juice', 'apple cider vinegar', 'sugar... |


In [40]:
df_ind.to_csv("all_ind.csv", index=False)
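
Downstream code may want a flat vocabulary of ingredient names instead of one stringified list per recipe. A minimal sketch using `DataFrame.explode` is given below. The toy rows and the column name `ingredient` are assumptions, and the embedding example may well consume the file differently.

```python
import ast

import pandas as pd

# Hypothetical rows standing in for `df_ind`.
toy_ind = pd.DataFrame({
    "id": [1, 2],
    "ingredients": ["['water', 'salt']", "['salt', 'eggs']"],
})

# Parse the stringified lists, then emit one row per (recipe, ingredient) pair.
exploded = (
    toy_ind.assign(ingredients=toy_ind["ingredients"].map(ast.literal_eval))
    .explode("ingredients")
    .rename(columns={"ingredients": "ingredient"})
)

# Sorted list of distinct ingredient names across all recipes.
vocabulary = sorted(exploded["ingredient"].unique())
print(vocabulary)  # ['eggs', 'salt', 'water']
```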