# Edamam Data Exploration and Collection

This notebook explores and collects recipe data from the Edamam API. The goal is to gather a dataset for a model that predicts whether a recipe is vegan. The framing may shift toward whether a recipe has the potential to be vegan, but this is subject to change. The target will come from the vegan label, so this will likely be a binary classification task. The input features could be just the recipe name, or the recipe name and ingredients.

Some useful links:
- https://developer.edamam.com/admin/application
- https://developer.edamam.com/edamam-docs-recipe-api#/

## Testing API

In [1]:
import requests
import pandas as pd
import ast
import json

In [1]:
# Define the API keys. Add them here, then delete them when done so they don't get uploaded to the public git repo
"""
api_keys = {
    "edamam_app_id": "",
    "edamam_app_key": "",
    "gretel_api_key": ""
}

# Write the API keys to the config.json file
with open('config.json', 'w') as f:
    json.dump({"api_keys": api_keys}, f, indent=4)

"""

In [2]:
with open('config.json', 'r') as f:
    config_data = json.load(f)

# Access the API keys
api_keys = config_data['api_keys']

APP_ID = api_keys['edamam_app_id']
APP_KEY = api_keys['edamam_app_key']
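
Since the keys must stay out of the public repo, one alternative to editing and clearing the cell above is to let environment variables override the config file. This is only a sketch: the helper name `load_api_keys` and the variable names `EDAMAM_APP_ID` and `EDAMAM_APP_KEY` are assumptions made for illustration, not something the notebook or the API defines.

```python
import json
import os

def load_api_keys(path="config.json"):
    """Read keys from config.json if present, then let environment variables override them."""
    keys = {}
    if os.path.exists(path):
        with open(path) as f:
            keys = json.load(f).get("api_keys", {})
    # Hypothetical convention: the upper-cased key name is the environment variable name
    for name in ("edamam_app_id", "edamam_app_key"):
        env_value = os.environ.get(name.upper())
        if env_value:
            keys[name] = env_value
    return keys
```

With this in place, `config.json` can stay out of version control (for example through `.gitignore`) or be omitted entirely on machines where the variables are set.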

In [11]:
api_keys = config_data.get('api_keys', {})
api_keys.get('edamam_app_id', None)


'213b4d83'

Some functions to get the data into a dataframe.

In [3]:
def recipe_search(ingredient, from_index=0, to_index=10):
    """Query the Edamam search endpoint and return the list of hits."""
    if to_index > 100:
        raise ValueError("to_index must be 100 at maximum")

    # Pass the query via params so requests URL-encodes values such as "Bell peppers"
    result = requests.get(
        'https://api.edamam.com/search',
        params={
            'q': ingredient,
            'app_id': APP_ID,
            'app_key': APP_KEY,
            'from': from_index,
            'to': to_index,
        },
    )
    result.raise_for_status()  # Fail early on HTTP errors instead of on a missing 'hits' key
    data = result.json()
    return data['hits']

def get_recipe_df(recipes):
    """Unwrap the 'recipe' payload of each hit and build a dataframe with one row per recipe."""
    recipes_lst = [hit['recipe'] for hit in recipes]
    return pd.DataFrame(recipes_lst)

In [4]:
res = recipe_search('carrot', from_index=0, to_index=10)

In [9]:
res[0].keys()

dict_keys(['recipe'])
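
The single `recipe` key shows that each hit wraps its recipe in a one-key dictionary, which is why `get_recipe_df` unwraps it before building the dataframe. A minimal sketch of that step, using made-up hits (the two recipes and their fields are hypothetical stand-ins, not API output):

```python
import pandas as pd

# Hypothetical hits mimicking the shape of the response; the values are invented
mock_hits = [
    {"recipe": {"label": "Example Salad", "healthLabels": ["Vegan", "Vegetarian"]}},
    {"recipe": {"label": "Example Omelette", "healthLabels": ["Vegetarian"]}},
]

# Unwrap the "recipe" payload of each hit, then build one row per recipe
mock_df = pd.DataFrame([hit["recipe"] for hit in mock_hits])
print(mock_df.shape)
```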

In [12]:
len(res)

10

In [6]:
get_recipe_df(res)

Unnamed: 0,uri,label,image,source,url,shareAs,yield,dietLabels,healthLabels,cautions,...,calories,totalWeight,totalTime,cuisineType,mealType,dishType,totalNutrients,totalDaily,digest,tags
0,http://www.edamam.com/ontologies/edamam.owl#re...,Carrot Limeade,https://edamam-product-images.s3.amazonaws.com...,Martha Stewart,https://www.marthastewart.com/1547130/carrot-l...,http://www.edamam.com/recipe/carrot-limeade-ee...,8.0,"[Low-Fat, Low-Sodium]","[Kidney-Friendly, Vegan, Vegetarian, Pescatari...",[Sulfites],...,356.811948,1065.68474,10.0,[american],[lunch/dinner],[drinks],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
1,http://www.edamam.com/ontologies/edamam.owl#re...,Carrot Halwa,https://edamam-product-images.s3.amazonaws.com...,Food52,https://food52.com/recipes/17405-carrot-halwa,http://www.edamam.com/recipe/carrot-halwa-89f0...,4.0,[Low-Sodium],"[Vegetarian, Pescatarian, Gluten-Free, Wheat-F...",[],...,1103.223865,724.303125,66.0,[indian],[lunch/dinner],[desserts],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
2,http://www.edamam.com/ontologies/edamam.owl#re...,Carrot & tarragon purée,https://edamam-product-images.s3.amazonaws.com...,BBC Good Food,https://www.bbcgoodfood.com/recipes/carrot-tar...,http://www.edamam.com/recipe/carrot-tarragon-p...,8.0,[Low-Sodium],"[Vegetarian, Pescatarian, Gluten-Free, Wheat-F...",[],...,775.875,1052.5,50.0,[american],[lunch/dinner],[condiments and sauces],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...","[winter, make-ahead, 30-60-minutes, 200-kcal-o..."
3,http://www.edamam.com/ontologies/edamam.owl#re...,Carrot Cake Smoothie,https://edamam-product-images.s3.amazonaws.com...,Epicurious,https://www.epicurious.com/recipes/food/views/...,http://www.edamam.com/recipe/carrot-cake-smoot...,1.0,"[Balanced, High-Fiber]","[Vegetarian, Pescatarian, Gluten-Free, Wheat-F...","[Tree-Nuts, Sulfites]",...,456.793,525.2625,0.0,[american],"[breakfast, lunch/dinner]","[drinks, desserts]","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...","[Smoothie, Breakfast, Vegetarian, Drink, Orang..."
4,http://www.edamam.com/ontologies/edamam.owl#re...,Carrot Reduction,https://edamam-product-images.s3.amazonaws.com...,PBS Food,http://www.pbs.org/food/recipes/carrot-reduction/,http://www.edamam.com/recipe/carrot-reduction-...,2.0,[Low-Fat],"[Vegan, Vegetarian, Pescatarian, Dairy-Free, G...",[],...,112.211964,268.855569,14.0,[american],[lunch/dinner],[drinks],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
5,http://www.edamam.com/ontologies/edamam.owl#re...,Carrot-Orange Mimosa,https://edamam-product-images.s3.amazonaws.com...,Food Network,https://www.foodnetwork.com/recipes/bobby-flay...,http://www.edamam.com/recipe/carrot-orange-mim...,6.0,"[Low-Fat, Low-Sodium]","[Vegan, Vegetarian, Pescatarian, Mediterranean...","[Sulfites, FODMAP]",...,940.491344,1538.599201,5.0,[world],[lunch/dinner],[alcohol cocktail],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...","[Vegetable, Fruit, Brunch, Drinks, Mixed Drink..."
6,http://www.edamam.com/ontologies/edamam.owl#re...,Cheddar-Carrot Balls,https://edamam-product-images.s3.amazonaws.com...,Delish,http://www.delish.com/cooking/recipe-ideas/rec...,http://www.edamam.com/recipe/cheddar-carrot-ba...,4.0,[Low-Carb],"[Sugar-Conscious, Low Potassium, Kidney-Friend...",[Sulfites],...,650.40048,231.097139,0.0,[italian],[lunch/dinner],[desserts],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
7,http://www.edamam.com/ontologies/edamam.owl#re...,Carrot Kugel,https://edamam-product-images.s3.amazonaws.com...,Elana's Pantry,http://www.elanaspantry.com/carrot-kugel/,http://www.edamam.com/recipe/carrot-kugel-fabb...,8.0,"[Balanced, Low-Sodium]","[Vegetarian, Pescatarian, Paleo, Dairy-Free, G...","[Sulfites, FODMAP]",...,915.992,1016.6,0.0,[kosher],[lunch/dinner],[main course],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
8,http://www.edamam.com/ontologies/edamam.owl#re...,Carrot Sorbet,https://edamam-product-images.s3.amazonaws.com...,Food52,https://food52.com/recipes/12210-carrot-sorbet,http://www.edamam.com/recipe/carrot-sorbet-352...,4.0,"[Low-Fat, Low-Sodium]","[Vegan, Vegetarian, Pescatarian, Dairy-Free, G...",[Sulfites],...,854.456,785.766667,256.0,[italian],[lunch/dinner],[desserts],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
9,http://www.edamam.com/ontologies/edamam.owl#re...,Carrot Dressing,https://edamam-product-images.s3.amazonaws.com...,Whole Foods,http://www.wholefoodsmarket.com/recipes/2655,http://www.edamam.com/recipe/carrot-dressing-6...,1.0,[Low-Fat],"[Vegan, Vegetarian, Pescatarian, Dairy-Free, E...",[Sulfites],...,151.0875,176.4625,0.0,[japanese],[lunch/dinner],[condiments and sauces],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",


In [36]:
get_recipe_df(res)['healthLabels'][50]

['Vegetarian',
 'Pescatarian',
 'Gluten-Free',
 'Wheat-Free',
 'Egg-Free',
 'Peanut-Free',
 'Tree-Nut-Free',
 'Soy-Free',
 'Fish-Free',
 'Shellfish-Free',
 'Pork-Free',
 'Red-Meat-Free',
 'Crustacean-Free',
 'Celery-Free',
 'Mustard-Free',
 'Sesame-Free',
 'Lupine-Free',
 'Mollusk-Free',
 'Alcohol-Free',
 'Sulfite-Free',
 'FODMAP-Free',
 'Kosher']

## Data Collection

### Gathering Common Ingredients

The developer (free) version of the API limits each request to 100 results and allows 10 requests per minute. So we will make a custom list of common main ingredients to query and add a delay between requests to stay within these limits. The following are common ingredients generated by ChatGPT for each diet type. We will query these to build a dataset.

In [18]:
vegan_common_ingredients = [
    "Carrots",
    "Broccoli",
    "Spinach",
    "Kale",
    "Tomatoes",
    "Bell peppers",
    "Onions",
    "Garlic",
    "Mushrooms",
    "Zucchini",
    "Cauliflower",
    "Potatoes",
    "Sweet potatoes",
    "Green beans",
    "Peas",
    "Apples",
    "Bananas",
    "Oranges",
    "Berries",
    "Avocados",
    "Rice",
    "Quinoa",
    "Oats",
    "Beans",
    "Lentils",
    "Chickpeas",
    "Black beans",
    "Almonds",
    "Walnuts",
    "Cashews",
    "Peanuts",
    "Sunflower seeds",
    "Chia seeds",
    "Flaxseeds",
    "Soybeans",
    "Tofu",
    "Seitan",
    "Tempeh",
    "Whole wheat flour",
    "Whole grain bread",
    "Whole grain pasta",
    "Brown rice",
    "Coconut milk",
    "Almond milk",
    "Soy milk",
    "Olive oil",
    "Coconut oil",
    "Balsamic vinegar",
    "Maple syrup"
]

pescatarian_common_ingredients = [
    "Salmon",
    "Tuna",
    "Shrimp",
    "Cod",
    "Trout",
    "Sardines",
    "Mackerel",
    "Scallops",
    "Crab",
    "Lobster",
    "Oysters",
    "Clams",
    "Mussels",
    "Tilapia",
    "Halibut",
    "Haddock",
    "Mahi-mahi",
    "Swordfish",
    "Anchovies",
    "Seaweed",
    "Avocados",
    "Spinach",
    "Kale",
    "Tomatoes",
    "Bell peppers",
    "Onions",
    "Garlic",
    "Mushrooms",
    "Zucchini",
    "Cauliflower",
    "Potatoes",
    "Sweet potatoes",
    "Green beans",
    "Peas",
    "Apples",
    "Bananas",
    "Oranges",
    "Berries",
    "Rice",
    "Quinoa",
    "Oats",
    "Lentils",
    "Chickpeas",
    "Almonds",
    "Walnuts",
    "Cashews",
    "Peanuts",
    "Coconut milk",
    "Olive oil"
]

mediterranean_common_ingredients = [
    "Tomatoes",
    "Cucumbers",
    "Bell peppers",
    "Red onions",
    "Garlic",
    "Spinach",
    "Kale",
    "Lettuce",
    "Artichokes",
    "Eggplant",
    "Zucchini",
    "Cauliflower",
    "Broccoli",
    "Carrots",
    "Radishes",
    "Beets",
    "Chickpeas",
    "Lentils",
    "Quinoa",
    "Brown rice",
    "Whole wheat couscous",
    "Farro",
    "Bulgur",
    "Olive oil",
    "Olives",
    "Feta cheese",
    "Greek yogurt",
    "Hummus",
    "Tahini",
    "Almonds",
    "Walnuts",
    "Pistachios",
    "Hazelnuts",
    "Sunflower seeds",
    "Flaxseeds",
    "Chia seeds",
    "Salmon",
    "Tuna",
    "Mackerel",
    "Sardines",
    "Anchovies",
    "Shrimp",
    "Mussels",
    "Clams",
    "Swordfish",
    "Chicken",
    "Turkey",
    "Eggs",
    "Red wine"
]

paleo_common_ingredients = [
    "Beef",
    "Chicken",
    "Turkey",
    "Pork",
    "Lamb",
    "Bison",
    "Venison",
    "Salmon",
    "Tuna",
    "Trout",
    "Cod",
    "Shrimp",
    "Scallops",
    "Crab",
    "Lobster",
    "Oysters",
    "Clams",
    "Mussels",
    "Eggs",
    "Bacon",
    "Avocados",
    "Spinach",
    "Kale",
    "Lettuce",
    "Broccoli",
    "Cauliflower",
    "Brussels sprouts",
    "Zucchini",
    "Carrots",
    "Onions",
    "Garlic",
    "Mushrooms",
    "Tomatoes",
    "Bell peppers",
    "Sweet potatoes",
    "Butternut squash",
    "Acorn squash",
    "Pumpkin",
    "Blueberries",
    "Strawberries",
    "Raspberries",
    "Blackberries",
    "Almonds",
    "Walnuts",
    "Cashews",
    "Pecans",
    "Macadamia nuts",
    "Coconut",
    "Coconut oil",
    "Olive oil"
]

vegetarian_common_ingredients = [
    "Spinach",
    "Kale",
    "Lettuce",
    "Broccoli",
    "Cauliflower",
    "Zucchini",
    "Carrots",
    "Bell peppers",
    "Onions",
    "Garlic",
    "Mushrooms",
    "Tomatoes",
    "Cucumbers",
    "Eggplant",
    "Potatoes",
    "Sweet potatoes",
    "Beets",
    "Radishes",
    "Green beans",
    "Peas",
    "Avocados",
    "Bananas",
    "Apples",
    "Oranges",
    "Berries",
    "Grapes",
    "Lemons",
    "Limes",
    "Cherries",
    "Mangoes",
    "Pineapple",
    "Watermelon",
    "Whole wheat bread",
    "Whole wheat pasta",
    "Brown rice",
    "Quinoa",
    "Oats",
    "Lentils",
    "Chickpeas",
    "Black beans",
    "Kidney beans",
    "Tofu",
    "Tempeh",
    "Seitan",
    "Almonds",
    "Walnuts",
    "Cashews",
    "Peanuts",
    "Chia seeds",
    "Flaxseeds"
]

meat_common_ingredients = [
    "Beef",
    "Chicken",
    "Pork",
    "Lamb",
    "Turkey",
    "Venison",
    "Bison",
    "Duck",
    "Quail",
    "Rabbit",
    "Salmon",
    "Tuna",
    "Trout",
    "Cod",
    "Halibut",
    "Swordfish",
    "Shrimp",
    "Scallops",
    "Crab",
    "Lobster",
    "Oysters",
    "Clams",
    "Mussels",
    "Bacon",
    "Sausages",
    "Ham",
    "Pepperoni",
    "Prosciutto",
    "Liver",
    "Kidneys",
    "Heart",
    "Tongue",
    "Tripe",
    "Eggs",
    "Milk",
    "Cheese",
    "Yogurt",
    "Butter",
    "Beef broth",
    "Chicken broth",
    "Fish broth",
    "Lard",
    "Tallow",
    "Duck fat",
    "Goose fat",
    "Pork rinds",
    "Bone marrow",
    "Bone broth",
    "Gelatin"
]

# Concatenate the lists and use a set to keep only unique elements
# (note: paleo_common_ingredients is not part of this concatenation)
common_ingredients = vegan_common_ingredients + pescatarian_common_ingredients + mediterranean_common_ingredients + vegetarian_common_ingredients + meat_common_ingredients
common_ingredients = list(set(common_ingredients))

print("We have {} common ingredients to query".format(len(common_ingredients)))

We have 133 common ingredients to query


We save the common ingredients to a CSV file so we can add to it later.

In [23]:
common_ingredients_df = pd.DataFrame(common_ingredients, columns=['commonIngredients'])
common_ingredients_df.to_csv('common_ingredients.csv', index=False)

### Querying to Collect Data

In [48]:
query1 = recipe_search(common_ingredients[0])
query2 = recipe_search(common_ingredients[1])

In [56]:
pd.concat([get_recipe_df(query1), get_recipe_df(query2)], ignore_index=True).shape

(20, 22)

In [57]:
import time

def combine_recipe_data(ingredient_list, from_index=0, to_index=10):
    all_recipes = []

    # Iterate over each ingredient in the list
    for ingredient in ingredient_list:
        # Make API request for recipes
        recipes = recipe_search(ingredient, from_index, to_index)

        # Convert the list of recipes to a dataframe
        ingredient_df = get_recipe_df(recipes)

        # Append the dataframe to the list of all recipes
        all_recipes.append(ingredient_df)

        # Wait 6 seconds between requests to stay within the rate limit
        # (10 requests per minute, i.e. 1 request every 6 seconds)
        time.sleep(6)

    # Combine all dataframes into one dataframe
    combined_df = pd.concat(all_recipes, ignore_index=True)

    return combined_df

In [59]:
testdf = combine_recipe_data(common_ingredients[0:12], from_index=0, to_index=100)
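
As a rough check on how long collection takes under the 10-requests-per-minute limit stated earlier, the arithmetic can be sketched as below. The helper names are made up for illustration, and the estimate is a lower bound because it ignores the time each request itself takes.

```python
REQUESTS_PER_MINUTE = 10  # free-tier limit stated earlier in this notebook

def min_delay_seconds(requests_per_minute=REQUESTS_PER_MINUTE):
    """Smallest pause between requests that respects the per-minute limit."""
    return 60 / requests_per_minute

def estimated_minutes(n_requests, requests_per_minute=REQUESTS_PER_MINUTE):
    """Lower bound on wall-clock time for n_requests, ignoring request latency."""
    return n_requests * min_delay_seconds(requests_per_minute) / 60

print(min_delay_seconds())     # 6.0 seconds between requests
print(estimated_minutes(12))   # 1.2 minutes for the 12-ingredient test run
print(estimated_minutes(133))  # 13.3 minutes for all 133 ingredients
```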

In [61]:
testdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 22 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   uri              1200 non-null   object 
 1   label            1200 non-null   object 
 2   image            1200 non-null   object 
 3   source           1200 non-null   object 
 4   url              1200 non-null   object 
 5   shareAs          1200 non-null   object 
 6   yield            1200 non-null   float64
 7   dietLabels       1200 non-null   object 
 8   healthLabels     1200 non-null   object 
 9   cautions         1200 non-null   object 
 10  ingredientLines  1200 non-null   object 
 11  ingredients      1200 non-null   object 
 12  calories         1200 non-null   float64
 13  totalWeight      1200 non-null   float64
 14  totalTime        1200 non-null   float64
 15  cuisineType      1200 non-null   object 
 16  mealType         1200 non-null   object 
 17  dishType      

In [66]:
testdf.to_csv('example_recipes.csv', index=False)

### Cleaning and Exploring Example Recipe CSV

Everything will be moved to a .py script later, so this is just testing things out with a smaller dataset for now.

In [69]:
example_df = pd.read_csv('example_recipes.csv')
example_df.head(2)

Unnamed: 0,uri,label,image,source,url,shareAs,yield,dietLabels,healthLabels,cautions,...,calories,totalWeight,totalTime,cuisineType,mealType,dishType,totalNutrients,totalDaily,digest,tags
0,http://www.edamam.com/ontologies/edamam.owl#re...,Cheese Omelette,https://edamam-product-images.s3.amazonaws.com...,Epicurious,https://www.epicurious.com/recipes/food/views/...,http://www.edamam.com/recipe/cheese-omelette-f...,1.0,['Low-Carb'],"['Sugar-Conscious', 'Low Potassium', 'Kidney-F...",['Sulfites'],...,175.691029,78.902344,0.0,['french'],['lunch/dinner'],"['main course', 'egg']","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...","['Breakfast', 'Vegetarian', 'Egg', 'Cheddar', ..."
1,http://www.edamam.com/ontologies/edamam.owl#re...,Cheese straws,https://edamam-product-images.s3.amazonaws.com...,BBC,http://www.bbc.co.uk/food/recipes/cheese_straw...,http://www.edamam.com/recipe/cheese-straws-bdc...,36.0,['Low-Sodium'],"['Sugar-Conscious', 'Low Potassium', 'Kidney-F...",['Sulfites'],...,3902.660094,886.594271,60.0,['american'],['lunch/dinner'],['starter'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",


In [73]:
example_df.columns

Index(['uri', 'label', 'image', 'source', 'url', 'shareAs', 'yield',
       'dietLabels', 'healthLabels', 'cautions', 'ingredientLines',
       'ingredients', 'calories', 'totalWeight', 'totalTime', 'cuisineType',
       'mealType', 'dishType', 'totalNutrients', 'totalDaily', 'digest',
       'tags'],
      dtype='object')

In [72]:
example_df[example_df.duplicated()]

Unnamed: 0,uri,label,image,source,url,shareAs,yield,dietLabels,healthLabels,cautions,...,calories,totalWeight,totalTime,cuisineType,mealType,dishType,totalNutrients,totalDaily,digest,tags


So there are no fully duplicated rows. Let's only explore columns that are obviously relevant: label, dietLabels, healthLabels, cuisineType, mealType, dishType, and tags.

The other columns might be useful as information but won't be used for the model. Note that the totalNutrients column holds nutritional information, the totalDaily column has the same information as a percentage of a 2000-calorie diet, and the digest column has similar information as a brief overview.

In [95]:
example_df[example_df['label'].duplicated()]

Unnamed: 0,uri,label,image,source,url,shareAs,yield,dietLabels,healthLabels,cautions,...,calories,totalWeight,totalTime,cuisineType,mealType,dishType,totalNutrients,totalDaily,digest,tags
15,http://www.edamam.com/ontologies/edamam.owl#re...,Four-Cheese Mac and Cheese,https://edamam-product-images.s3.amazonaws.com...,Delish,http://www.delish.com/cooking/recipe-ideas/rec...,http://www.edamam.com/recipe/four-cheese-mac-a...,6.0,[],"['Vegetarian', 'Pescatarian', 'Egg-Free', 'Pea...",['Sulfites'],...,5146.950587,2195.099465,95.0,['american'],['lunch/dinner'],['main course'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
26,http://www.edamam.com/ontologies/edamam.owl#re...,Pimento Cheese,https://edamam-product-images.s3.amazonaws.com...,Lottie + Doof,http://www.lottieanddoof.com/2009/05/pimento-c...,http://www.edamam.com/recipe/pimento-cheese-c2...,10.0,['Low-Carb'],"['Sugar-Conscious', 'Low Potassium', 'Kidney-F...",['Sulfites'],...,2710.328075,883.965463,0.0,['south american'],['lunch/dinner'],['condiments and sauces'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
35,http://www.edamam.com/ontologies/edamam.owl#re...,Appetizer Cheese Ball,https://edamam-product-images.s3.amazonaws.com...,The Daily Meal,http://www.thedailymeal.com/appetizer-cheese-b...,http://www.edamam.com/recipe/appetizer-cheese-...,7.0,['Low-Carb'],"['Sugar-Conscious', 'Low Potassium', 'Kidney-F...",['Sulfites'],...,1358.811799,427.660467,0.0,['american'],['lunch/dinner'],['starter'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
53,http://www.edamam.com/ontologies/edamam.owl#re...,Pimento Cheese,https://edamam-product-images.s3.amazonaws.com...,Homesick Texan,http://homesicktexan.blogspot.com/2007/02/comf...,http://www.edamam.com/recipe/pimento-cheese-13...,10.0,['Low-Carb'],"['Sugar-Conscious', 'Low Potassium', 'Kidney-F...",['Sulfites'],...,2643.531349,583.979751,0.0,['south american'],['lunch/dinner'],['condiments and sauces'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
57,http://www.edamam.com/ontologies/edamam.owl#re...,Cheese Souffle recipes,https://edamam-product-images.s3.amazonaws.com...,Martha Stewart,http://www.marthastewart.com/868012/cheese-sou...,http://www.edamam.com/recipe/cheese-souffle-re...,4.0,['Low-Carb'],"['Vegetarian', 'Pescatarian', 'Peanut-Free', '...",['Sulfites'],...,4089.743528,1920.352480,65.0,['french'],['lunch/dinner'],['pancake'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...","['egg souffle', 'souffles', 'egg cheese souffl..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1094,http://www.edamam.com/ontologies/edamam.owl#re...,Lemon Sorbet,https://edamam-product-images.s3.amazonaws.com...,Martha Stewart,https://www.marthastewart.com/342370/lemon-sorbet,http://www.edamam.com/recipe/lemon-sorbet-201c...,6.0,"['Low-Fat', 'Low-Sodium']","['Low Potassium', 'Kidney-Friendly', 'Vegan', ...",['Sulfites'],...,1319.875000,658.500000,0.0,['eastern europe'],['lunch/dinner'],['desserts'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
1119,http://www.edamam.com/ontologies/edamam.owl#re...,Coconut-Lime Tilapia,https://edamam-product-images.s3.amazonaws.com...,Men's Health,https://www.menshealth.com/recipes/coconut-lim...,http://www.edamam.com/recipe/coconut-lime-tila...,6.0,"['High-Protein', 'Low-Carb']","['Sugar-Conscious', 'Keto-Friendly', 'Pescatar...","['Tree-Nuts', 'Sulfites']",...,981.249594,1129.815462,20.0,['mediterranean'],['lunch/dinner'],['starter'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...","['Fish', 'Dinner', 'Main Dishes', 'Sauteed', '..."
1128,http://www.edamam.com/ontologies/edamam.owl#re...,Almond-Crusted Tilapia,https://edamam-product-images.s3.amazonaws.com...,My Recipes,http://www.myrecipes.com/recipe/almond-crusted...,http://www.edamam.com/recipe/almond-crusted-ti...,2.0,['Low-Carb'],"['Sugar-Conscious', 'Keto-Friendly', 'Pescatar...",[],...,726.842991,425.002220,11.0,['mediterranean'],['lunch/dinner'],['main course'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...","['Fish', 'MainDishes']"
1139,http://www.edamam.com/ontologies/edamam.owl#re...,Parmesan-Crusted Tilapia,https://edamam-product-images.s3.amazonaws.com...,Delish,http://www.delish.com/cooking/recipe-ideas/rec...,http://www.edamam.com/recipe/parmesan-crusted-...,4.0,[],"['Sugar-Conscious', 'Pescatarian', 'Mediterran...",['Sulfites'],...,1855.221534,1024.500687,15.0,['mediterranean'],['lunch/dinner'],['main course'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",


In [96]:
example_df[example_df['label'] == 'Pimento Cheese']

Unnamed: 0,uri,label,image,source,url,shareAs,yield,dietLabels,healthLabels,cautions,...,calories,totalWeight,totalTime,cuisineType,mealType,dishType,totalNutrients,totalDaily,digest,tags
3,http://www.edamam.com/ontologies/edamam.owl#re...,Pimento Cheese,https://edamam-product-images.s3.amazonaws.com...,Pioneer Woman,http://thepioneerwoman.com/cooking/2014/12/pim...,http://www.edamam.com/recipe/pimento-cheese-b9...,12.0,['Low-Carb'],"['Sugar-Conscious', 'Low Potassium', 'Kidney-F...",['Sulfites'],...,2997.094739,821.771888,0.0,['south american'],['lunch/dinner'],['condiments and sauces'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
26,http://www.edamam.com/ontologies/edamam.owl#re...,Pimento Cheese,https://edamam-product-images.s3.amazonaws.com...,Lottie + Doof,http://www.lottieanddoof.com/2009/05/pimento-c...,http://www.edamam.com/recipe/pimento-cheese-c2...,10.0,['Low-Carb'],"['Sugar-Conscious', 'Low Potassium', 'Kidney-F...",['Sulfites'],...,2710.328075,883.965463,0.0,['south american'],['lunch/dinner'],['condiments and sauces'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
53,http://www.edamam.com/ontologies/edamam.owl#re...,Pimento Cheese,https://edamam-product-images.s3.amazonaws.com...,Homesick Texan,http://homesicktexan.blogspot.com/2007/02/comf...,http://www.edamam.com/recipe/pimento-cheese-13...,10.0,['Low-Carb'],"['Sugar-Conscious', 'Low Potassium', 'Kidney-F...",['Sulfites'],...,2643.531349,583.979751,0.0,['south american'],['lunch/dinner'],['condiments and sauces'],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",


So it looks like the same recipe name appears on multiple websites, which gives us duplicated labels. Let's drop duplicates based on the label.

In [99]:
cleaned_example_df = example_df.drop_duplicates(subset = ['label'])
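
To make the effect of label-based deduplication concrete, here is a small sketch on toy data (the source names are hypothetical). `drop_duplicates` keeps the first row for each label by default and discards later rows with the same name, even when their other columns differ.

```python
import pandas as pd

toy = pd.DataFrame({
    "label": ["Pimento Cheese", "Cheese straws", "Pimento Cheese"],
    "source": ["Site A", "Site B", "Site C"],  # hypothetical sources
})

# keep="first" is the default, so the "Site C" row is the one dropped
deduped = toy.drop_duplicates(subset=["label"])
print(deduped["source"].tolist())  # ['Site A', 'Site B']
```

Note that the original index values are preserved, which is why the cleaned dataframe below still shows index 1199 despite having fewer rows.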

We can also see here, and from the outputs above, that the tags column mostly repeats information from other relevant columns and has some missing data, so we can ignore this column.

In [100]:
cleaned_example_df['tags'][0]

"['Breakfast', 'Vegetarian', 'Egg', 'Cheddar', 'No Sugar Added', 'Kosher', 'Kid-Friendly', 'Peanut Free', 'Pescatarian', 'Soy Free', 'Tree Nut Free', 'Wheat/Gluten-Free', 'Sugar Conscious', 'Kidney Friendly', 'Weelicious']"

Next is the `dietLabels` column.

In [101]:
cleaned_example_df['dietLabels']

0                       ['Low-Carb']
1                     ['Low-Sodium']
2                       ['Low-Carb']
3                       ['Low-Carb']
4                                 []
                    ...             
1195    ['High-Protein', 'Low-Carb']
1196                              []
1197      ['Low-Carb', 'Low-Sodium']
1198    ['High-Protein', 'Low-Carb']
1199                              []
Name: dietLabels, Length: 1125, dtype: object

In [106]:
cleaned_example_df['dietLabels'][0]

"['Low-Carb']"
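
The quotes around the value show that it is a string and not a list: writing the dataframe to CSV stored each list as its text representation. A minimal sketch of that round trip on toy data, using an in-memory buffer in place of a file, and `ast.literal_eval` to recover the lists:

```python
import ast
import io
import pandas as pd

toy = pd.DataFrame({"dietLabels": [["Low-Carb"], [], ["High-Protein", "Low-Carb"]]})

# Round-trip through CSV text, as to_csv/read_csv do with a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

first_raw = reloaded["dietLabels"][0]                     # the string "['Low-Carb']"
parsed = reloaded["dietLabels"].apply(ast.literal_eval)   # back to Python lists
print(type(first_raw).__name__, parsed[2])
```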

In [153]:
# A few columns hold lists stored as strings; this function returns the unique values across such a column
def get_unique_values(df, column):
    # Treat missing entries as empty lists without modifying df in place
    parsed = df[column].fillna('[]').apply(ast.literal_eval)
    labels = set()
    for row in parsed:
        labels.update(row)
    return labels

# This returns the entries of the given column that carry more than one label
def check_if_multilabel(df, column):
    parsed = df[column].fillna('[]').apply(ast.literal_eval)
    return [row for row in parsed if len(row) > 1]

In [150]:
get_unique_values(cleaned_example_df, 'dietLabels')

{'Balanced', 'High-Fiber', 'High-Protein', 'Low-Carb', 'Low-Fat', 'Low-Sodium'}

In [152]:
len(check_if_multilabel(cleaned_example_df, 'dietLabels'))

361

So we have 6 potential diet labels, with a 7th option being no label. We could use this as either an input variable or just information after an output is given. It might be best as an input, or even an optional input.

We also have lots of recipes with multiple diet labels (361 of them).

Here are the unique `cuisineType` values. This could also serve as an input variable if it proves useful.

In [118]:
print(get_unique_values(cleaned_example_df, 'cuisineType'))

{'indian', 'american', 'south east asian', 'italian', 'greek', 'eastern europe', 'mexican', 'asian', 'french', 'british', 'chinese', 'mediterranean', 'south american', 'middle eastern', 'caribbean', 'central europe', 'nordic', 'world', 'japanese'}


In [143]:
check_if_multilabel(cleaned_example_df, 'cuisineType')

['mexican', 'indian']
['american', 'british']
['american', 'mexican']
['nordic', 'greek']
['mediterranean', 'greek']
['chinese', 'asian']
['mediterranean', 'greek']
['mediterranean', 'greek']
['american', 'mediterranean']
['middle eastern', 'nordic']
['italian', 'south east asian']
['south east asian', 'asian']
['chinese', 'asian']


So some dishes have more than one cuisineType, which can be dealt with during pre-processing.

There are 5 `mealType` values, and we can make this an input as well if it helps the model.

In [138]:
print(get_unique_values(cleaned_example_df, 'mealType'))

{'breakfast', 'lunch/dinner', 'brunch', 'teatime', 'snack'}


In [144]:
check_if_multilabel(cleaned_example_df, 'mealType')

['breakfast', 'snack']
['breakfast', 'snack']


So we have some multilabeled values to deal with here too - leave for pre-processing.

Finally we can check the same things for `dishType`

In [139]:
print(get_unique_values(cleaned_example_df, 'dishType'))

{'desserts', 'biscuits and cookies', 'special occasions', 'preserve', 'preps', 'egg', 'soup', 'sandwiches', 'starter', 'salad', 'alcohol cocktail', 'main course', 'christmas', 'cereals', 'bread', 'drinks', 'pancake', 'condiments and sauces'}


In [145]:
check_if_multilabel(cleaned_example_df, 'dishType')

['main course', 'egg']
['main course', 'egg']
['main course', 'egg']
['main course', 'egg']
['main course', 'egg']
['condiments and sauces', 'salad']
['biscuits and cookies', 'christmas', 'special occasions']
['main course', 'egg']
['main course', 'salad']
['main course', 'salad']
['main course', 'desserts']
['main course', 'salad']


So all these columns describing the recipe are categorical, with some having multiple categories. We can pre-process them all similarly later if needed.
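
One way the shared pre-processing could look is a multi-hot encoding, where each category becomes a 0/1 column and a recipe may have several set at once. This is a sketch under that assumption, not the notebook's chosen method; the helper name `multi_hot` and the toy rows are made up for illustration.

```python
import ast
import pandas as pd

toy = pd.DataFrame({"dishType": ["['main course', 'egg']", "['starter']", "[]"]})

def multi_hot(df, column):
    """Return one 0/1 column per category; rows with several categories get several 1s."""
    parsed = df[column].fillna("[]").apply(ast.literal_eval)
    exploded = parsed.explode()        # one row per (recipe, category) pair
    dummies = pd.get_dummies(exploded)  # one indicator column per category
    # Collapse back to one row per recipe; empty lists end up as all zeros
    return dummies.groupby(level=0).max().astype(int)

encoded = multi_hot(toy, "dishType")
print(encoded)
```

The same function would apply unchanged to `dietLabels`, `cuisineType`, and `mealType`, since they share the list-in-a-string layout.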

#### Target Variable

Now we can look at the `healthLabels` column which has our target variable.

In [155]:
print(get_unique_values(cleaned_example_df, 'healthLabels'))

{'Egg-Free', 'Sesame-Free', 'Mustard-Free', 'Dairy-Free', 'Keto-Friendly', 'Mediterranean', 'DASH', 'Low Potassium', 'Vegetarian', 'Crustacean-Free', 'Kidney-Friendly', 'Peanut-Free', 'Sulfite-Free', 'Tree-Nut-Free', 'Immuno-Supportive', 'Alcohol-Cocktail', 'Gluten-Free', 'Pork-Free', 'Vegan', 'Red-Meat-Free', 'Wheat-Free', 'No oil added', 'Paleo', 'Mollusk-Free', 'Fish-Free', 'Pescatarian', 'Alcohol-Free', 'Kosher', 'Lupine-Free', 'FODMAP-Free', 'Soy-Free', 'Celery-Free', 'Sugar-Conscious', 'Shellfish-Free'}


In [157]:
len(check_if_multilabel(cleaned_example_df, 'healthLabels'))

1119

In [158]:
cleaned_example_df.shape

(1125, 22)

So the vast majority of recipes (1119 of 1125) have multiple `healthLabels`.
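Since almost every recipe carries several health labels, the binary target has to be derived by checking whether `'Vegan'` is one of them. The following is a minimal sketch of that idea, not the final pre-processing code. The helper name `is_vegan` and the sample rows are hypothetical, and it assumes each `healthLabels` cell is already a Python list rather than a string.

```python
def is_vegan(health_labels, target='Vegan'):
    """Return 1 if `target` is one of the recipe's health labels, else 0.

    Membership is tested against the list elements, so 'Vegetarian'
    does not count as a match for 'Vegan'.
    """
    return int(target in health_labels)


# Hypothetical sample rows built from label names seen in the output above.
sample_health_labels = [
    ['Vegan', 'Vegetarian', 'Pescatarian', 'Dairy-Free'],
    ['Sugar-Conscious', 'Kidney-Friendly', 'Keto-Friendly'],
    ['Vegetarian', 'Egg-Free'],
]

targets = [is_vegan(labels) for labels in sample_health_labels]
print(targets)
```

On the dataframe this could be applied as `cleaned_example_df['healthLabels'].apply(is_vegan)`, again assuming list-valued cells.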

## Loading Full Dataset

Here we experiment with the ETL script to collect a full dataset, requesting up to 100 recipes for each ingredient.

In [5]:
from EdamamETL import RecipeETL
import pandas as pd

In [6]:
recipe_etl = RecipeETL()

In [7]:
common_ingredients_df = pd.read_csv('common_ingredients.csv')

In [11]:
ex_ingredient = common_ingredients_df['commonIngredients'][0]
ex1_ingredient = common_ingredients_df['commonIngredients'][1]
print(ex_ingredient, '\n', ex1_ingredient)

Green beans 
 Beef


In [22]:
recipes_lst_ex = recipe_etl.recipe_search(ex_ingredient, from_index=0, to_index=5)
recipes_df_ex = recipe_etl.get_recipe_df(recipes_lst_ex)
#recipes_df_ex


Unnamed: 0,uri,label,image,source,url,shareAs,yield,dietLabels,healthLabels,cautions,...,calories,totalWeight,totalTime,cuisineType,mealType,dishType,totalNutrients,totalDaily,digest,tags
0,http://www.edamam.com/ontologies/edamam.owl#re...,Green Beans,https://edamam-product-images.s3.amazonaws.com...,Martha Stewart,http://www.marthastewart.com/338543/green-beans,http://www.edamam.com/recipe/green-beans-a91ad...,4.0,[],"[Sugar-Conscious, Kidney-Friendly, Keto-Friend...",[],...,245.950111,471.932982,24.0,[american],[lunch/dinner],[main course],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
1,http://www.edamam.com/ontologies/edamam.owl#re...,Sauteed Green Beans,https://edamam-product-images.s3.amazonaws.com...,Epicurious,https://www.epicurious.com/recipes/food/views/...,http://www.edamam.com/recipe/sauteed-green-bea...,8.0,"[Balanced, Low-Sodium]","[Sugar-Conscious, Kidney-Friendly, Keto-Friend...",[],...,331.965452,699.388555,0.0,[french],[lunch/dinner],[starter],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...","[Vegetarian, Vegan, Quick & Easy, Bean, Vegeta..."
2,http://www.edamam.com/ontologies/edamam.owl#re...,Caramelized Green Beans,https://edamam-product-images.s3.amazonaws.com...,Saveur,http://www.saveur.com/article/Recipes/Carameli...,http://www.edamam.com/recipe/caramelized-green...,6.0,"[Low-Carb, Low-Sodium]","[Sugar-Conscious, Kidney-Friendly, Keto-Friend...",[],...,1025.432452,793.988555,0.0,[american],[lunch/dinner],[main course],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",
3,http://www.edamam.com/ontologies/edamam.owl#re...,Sautéed Fresh Green Beans,https://edamam-product-images.s3.amazonaws.com...,EatingWell,http://www.eatingwell.com/recipe/261341/sautee...,http://www.edamam.com/recipe/saut%C3%A9ed-fres...,4.0,"[Balanced, Low-Sodium]","[Sugar-Conscious, Kidney-Friendly, Keto-Friend...",[],...,220.173635,462.59237,5.0,[french],[lunch/dinner],[starter],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...","[Gluten-Free, Low Fat, Vegan, High Fiber, Dair..."
4,http://www.edamam.com/ontologies/edamam.owl#re...,Fancy Green Beans,https://edamam-product-images.s3.amazonaws.com...,PBS Food,http://www.pbs.org/food/recipes/fancy-green-be...,http://www.edamam.com/recipe/fancy-green-beans...,2.0,"[Balanced, High-Fiber]","[Vegan, Vegetarian, Pescatarian, Dairy-Free, G...","[Sulfites, FODMAP]",...,245.191312,472.898439,47.0,[american],[lunch/dinner],[main course],"{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","{'ENERC_KCAL': {'label': 'Energy', 'quantity':...","[{'label': 'Fat', 'tag': 'FAT', 'schemaOrgTag'...",


In [23]:
import time


def combine_recipe_data(df, column='commonIngredients', from_index=0, to_index=10):
    """Query the API once per ingredient in df[column] and return one combined dataframe."""
    all_recipes = []
    
    # Iterate over each ingredient in the df
    for ingredient in df[column]:        
        
        # Make API request for recipes
        recipes = recipe_etl.recipe_search(ingredient, from_index, to_index)
        
        # Convert the list of recipes to a dataframe
        ingredient_df = recipe_etl.get_recipe_df(recipes)
        
        # Append the dataframe to the list of all recipes
        all_recipes.append(ingredient_df)

        # Wait for 6 seconds between requests to adhere to rate limit, i.e. 10 requests per minute
        time.sleep(6)
    
    # Combine all dataframes into one dataframe
    combined_df = pd.concat(all_recipes, ignore_index=True)
    
    return combined_df
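
The 6-second pause follows from the rate limit: 60 seconds / 10 requests = 6 seconds per request. The sketch below isolates that throttling logic so it can be tried without calling the API. It is an illustration under assumptions, not the code in `EdamamETL`: `collect_with_throttle`, `search_fn`, and `sleep_fn` are hypothetical names, and unlike the function above it only pauses between requests, not after the last one.

```python
import time


def collect_with_throttle(items, search_fn, requests_per_minute=10, sleep_fn=time.sleep):
    """Call search_fn once per item, pausing between calls to respect a rate limit."""
    delay = 60 / requests_per_minute  # e.g. 60 / 10 = 6 seconds
    results = []
    for i, item in enumerate(items):
        if i > 0:
            sleep_fn(delay)  # pause before every request except the first
        results.append((item, search_fn(item)))
    return results


# Demo with a stand-in search function and a recorded (not real) sleep,
# so the example runs instantly and makes no network calls.
recorded_delays = []
demo = collect_with_throttle(
    ['Green beans', 'Beef'],
    search_fn=lambda ingredient: [ingredient.lower()],
    sleep_fn=recorded_delays.append,
)
print(demo)
print(recorded_delays)
```

Passing the sleep function as a parameter keeps the default behavior (`time.sleep`) while letting the delay logic be checked without waiting.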

In [8]:
full_df = recipe_etl.run_extract(common_ingredients_df)

In [9]:
full_df.shape

(13272, 22)