# Meal Planner Recipe Preprocessing

This notebook cleans and enriches the raw recipe dataset so it can be reused across projects. It performs the following steps:

- load the raw CSV hosted on Hugging Face (Edamam-based recipe dataset)
- extract convenient nutrient totals (fat, carbs, protein)
- normalize label and ingredient fields for easier downstream filtering
- drop obviously invalid rows and save a processed CSV ready for analysis or app ingestion

Update the configuration section below if your raw file lives elsewhere or you want a different output path.


In [None]:
from pathlib import Path
from urllib.parse import urlparse

import json
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", 120)

In [None]:
RAW_DATA_URI = "hf://datasets/datahiveai/recipes-with-nutrition/recipes-with-nutrition.csv"
PROCESSED_DATA_PATH = Path("recipes_processed.csv")

parsed = urlparse(RAW_DATA_URI)
if parsed.scheme not in {"hf", "https", "http", "s3", "gs", "file"}:
    raise ValueError(f"Unsupported RAW_DATA_URI scheme: {parsed.scheme}")

PROCESSED_DATA_PATH.parent.mkdir(parents=True, exist_ok=True)
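
If the raw file lives on local disk, note that a bare path such as `data/raw.csv` has an empty scheme and would fail the scheme check above. A minimal sketch of one workaround is to build a `file://` URI from an absolute path. The directory and file name below are hypothetical placeholders, and pandas documents `file` as a supported URL scheme for `read_csv`:

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical local location; replace with wherever your raw export lives.
relative_path = Path("data") / "recipes-with-nutrition.csv"

# A bare relative path has no scheme, so it would be rejected by the check.
bare_scheme = urlparse(relative_path.as_posix()).scheme

# as_uri() requires an absolute path, so anchor it at the working directory.
local_raw_uri = (Path.cwd() / relative_path).as_uri()
uri_scheme = urlparse(local_raw_uri).scheme

print(repr(bare_scheme), uri_scheme)
```

The resulting string can then be assigned to `RAW_DATA_URI` in place of the Hugging Face location.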


## Load Raw Data

Read the raw recipe export into a DataFrame. Update `RAW_DATA_URI` above if you want to point at a different source.


In [None]:
data = pd.read_csv(RAW_DATA_URI)
source_display_name = Path(parsed.path).name if parsed.path else RAW_DATA_URI
print(f"Loaded {len(data):,} recipes from {source_display_name}")
data.head()

Loaded 39,447 recipes from recipes-with-nutrition.csv


Unnamed: 0,recipe_name,source,url,servings,calories,total_weight_g,image_url,diet_labels,health_labels,cautions,cuisine_type,meal_type,dish_type,ingredient_lines,ingredients,total_nutrients,daily_values,digest
0,Classic Cabbage Slaw with Grandmother Shinn's Dressing,Food Network,https://www.foodnetwork.com/recipes/classic-cabbage-slaw-with-grandmother-shinns-dressing-recipe-1940391,6.0,511.28325,1239.311259,https://datahive-prod-dataset-products.s3.eu-central-1.amazonaws.com/dataset/346/0066d6f63e9b4ace21cfdb6a3579251a750...,"[""Balanced""]","[""Vegetarian"",""Gluten-Free"",""Peanut-Free"",""Tree-Nut-Free"",""Soy-Free"",""Fish-Free"",""Shellfish-Free""]",[],"[""american""]","[""lunch/dinner""]","[""salad""]","[""1 tablespoon kosher salt"",""2 cups water"",""4 cups shredded green cabbage"",""1 cup peeled and shredded carrots"",""3/4 ...","[{""food"":""kosher salt"",""text"":""1 tablespoon kosher salt"",""weight"":14.56249999975379,""measure"":""tablespoon"",""quantity...","{""K"":{""unit"":""mg"",""label"":""Potassium"",""quantity"":1344.937282852889},""P"":{""unit"":""mg"",""label"":""Phosphorus"",""quantity""...","{""K"":{""unit"":""%"",""label"":""Potassium"",""quantity"":28.6156868692104},""P"":{""unit"":""%"",""label"":""Phosphorus"",""quantity"":54...","[{""sub"":[{""tag"":""FASAT"",""unit"":""g"",""daily"":28.7006625,""label"":""Saturated"",""total"":5.7401325,""hasRDI"":true,""schemaOrg..."
1,Black Bean Soup,Cookstr,http://www.cookstr.com/recipes/black-bean-soup-4-bonnie-tandy-leblang,8.0,1850.99899,3339.58323,https://datahive-prod-dataset-products.s3.eu-central-1.amazonaws.com/dataset/346/000146940613626dc55ec857d59817f3fcd...,"[""High-Fiber""]","[""Dairy-Free"",""Gluten-Free"",""Egg-Free"",""Peanut-Free"",""Tree-Nut-Free"",""Soy-Free"",""Fish-Free"",""Shellfish-Free""]","[""Sulfites""]","[""american""]","[""lunch/dinner""]","[""soup""]","[""1 pound fully cooked bone-in ham steak"",""1 tablespoon olive oil"",""1 medium onion, chopped (about 1 cup)"",""2 garlic...","[{""food"":""ham steak"",""text"":""1 pound fully cooked bone-in ham steak"",""weight"":453.59237,""measure"":""pound"",""quantity""...","{""K"":{""unit"":""mg"",""label"":""Potassium"",""quantity"":5951.657617735301},""P"":{""unit"":""mg"",""label"":""Phosphorus"",""quantity""...","{""K"":{""unit"":""%"",""label"":""Potassium"",""quantity"":126.6310131433043},""P"":{""unit"":""%"",""label"":""Phosphorus"",""quantity"":3...","[{""sub"":[{""tag"":""FASAT"",""unit"":""g"",""daily"":77.77513185494949,""label"":""Saturated"",""total"":15.5550263709899,""hasRDI"":t..."
2,Eat for Eight Bucks: Tofu with Tomatoes and Cilantro Recipe,Serious Eats,http://www.seriouseats.com/recipes/2010/06/eat-for-eight-bucks-tofu-with-tomatoes-and-cilantro-recipe.html,4.0,1643.758565,1453.960928,https://datahive-prod-dataset-products.s3.eu-central-1.amazonaws.com/dataset/346/000273198e35300038a0b5833e294c23d74...,"[""High-Fiber"",""Low-Carb""]","[""Vegan"",""Vegetarian"",""Dairy-Free"",""Egg-Free"",""Tree-Nut-Free"",""Fish-Free"",""Shellfish-Free""]","[""Gluten"",""Wheat"",""Sulfites""]","[""asian""]","[""lunch/dinner""]","[""main course""]","[""1 pound medium to firm tofu, cut into 1-inch cubes and patted dry"",""4 tablespoons peanut or canola oil"",""2 scallio...","[{""food"":""firm tofu"",""text"":""1 pound medium to firm tofu, cut into 1-inch cubes and patted dry"",""weight"":453.59237,""...","{""K"":{""unit"":""mg"",""label"":""Potassium"",""quantity"":3055.594089732292},""P"":{""unit"":""mg"",""label"":""Phosphorus"",""quantity""...","{""K"":{""unit"":""%"",""label"":""Potassium"",""quantity"":65.01264020707004},""P"":{""unit"":""%"",""label"":""Phosphorus"",""quantity"":1...","[{""sub"":[{""tag"":""FASAT"",""unit"":""g"",""daily"":54.49940968131862,""label"":""Saturated"",""total"":10.89988193626372,""hasRDI"":..."
3,Fried Chicken Banh Mi,Food Network,https://www.foodnetwork.com/recipes/fried-chicken-banh-mi-18315478,4.0,8471.182075,2547.239375,https://datahive-prod-dataset-products.s3.eu-central-1.amazonaws.com/dataset/346/0003c1856f51c706fed64ddb5f965c26d5b...,"[""High-Fiber""]","[""Peanut-Free"",""Tree-Nut-Free"",""Fish-Free"",""Shellfish-Free""]","[""Gluten"",""Wheat"",""Sulfites""]","[""south east asian""]","[""lunch/dinner""]","[""sandwiches""]","[""Neutral oil, for frying"",""1/2 cup unseasoned rice vinegar"",""1 tablespoon granulated sugar"",""1 teaspoon kosher salt...","[{""food"":""oil"",""text"":""Neutral oil, for frying"",""weight"":30.39684999998356,""measure"":null,""quantity"":0},{""food"":""ric...","{""K"":{""unit"":""mg"",""label"":""Potassium"",""quantity"":0},""P"":{""unit"":""mg"",""label"":""Phosphorus"",""quantity"":0},""CA"":{""unit""...","{""K"":{""unit"":""%"",""label"":""Potassium"",""quantity"":0},""P"":{""unit"":""%"",""label"":""Phosphorus"",""quantity"":0},""CA"":{""unit"":""...","[{""sub"":[{""tag"":""FASAT"",""unit"":""g"",""daily"":0,""label"":""Saturated"",""total"":0,""hasRDI"":true,""schemaOrgTag"":""saturatedFa..."
4,The Macaron Frappé,French Revolution Food,http://www.frenchrevolutionfood.com/2013/07/le-macaron-frappe-milkshake-for-bastille-day/,1.0,276.243903,438.762998,https://datahive-prod-dataset-products.s3.eu-central-1.amazonaws.com/dataset/346/00040113c0b2836ed5b9f8a5036d459c20f...,[],"[""Vegetarian"",""Peanut-Free"",""Tree-Nut-Free"",""Soy-Free"",""Fish-Free"",""Shellfish-Free""]","[""Sulfites""]","[""french""]","[""snack""]","[""desserts""]","[""1 macaron (about ½ ounce), in any flavor you like"",""1/2 cup milk"",""1/2 cup vanilla ice cream"",""1 cup of ice (about...","[{""food"":""macaron"",""text"":""1 macaron (about ½ ounce), in any flavor you like"",""weight"":14.1747615625,""measure"":""ounc...","{""K"":{""unit"":""mg"",""label"":""Potassium"",""quantity"":309.814956721875},""P"":{""unit"":""mg"",""label"":""Phosphorus"",""quantity"":...","{""K"":{""unit"":""%"",""label"":""Potassium"",""quantity"":6.591807589827128},""P"":{""unit"":""%"",""label"":""Phosphorus"",""quantity"":2...","[{""sub"":[{""tag"":""FASAT"",""unit"":""g"",""daily"":47.9986353703125,""label"":""Saturated"",""total"":9.5997270740625,""hasRDI"":tru..."


### Inspect Raw Schema

Check the row count, column types, and non-null counts before transforming anything.


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39447 entries, 0 to 39446
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   recipe_name       39447 non-null  object 
 1   source            39447 non-null  object 
 2   url               39447 non-null  object 
 3   servings          39447 non-null  float64
 4   calories          39447 non-null  float64
 5   total_weight_g    39447 non-null  float64
 6   image_url         39170 non-null  object 
 7   diet_labels       39447 non-null  object 
 8   health_labels     39447 non-null  object 
 9   cautions          39447 non-null  object 
 10  cuisine_type      39447 non-null  object 
 11  meal_type         39447 non-null  object 
 12  dish_type         39447 non-null  object 
 13  ingredient_lines  39447 non-null  object 
 14  ingredients       39447 non-null  object 
 15  total_nutrients   39447 non-null  object 
 16  daily_values      39447 non-null  object

## Helper Functions

These utilities standardize the nested JSON-like fields included in the raw export.


In [5]:
def parse_json_like(value):
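    """Parse a JSON string into a Python object.

    Missing or blank values return None; strings that are not valid JSON are
    returned stripped; non-string values are returned unchanged.
    """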
    if pd.isna(value):
        return None
    if isinstance(value, (dict, list)):
        return value
    if isinstance(value, str):
        stripped = value.strip()
        if not stripped:
            return None
        try:
            return json.loads(stripped)
        except json.JSONDecodeError:
            return stripped
    return value


def extract_nutrient_quantity(nutrient_field, key, fallback_keys=None):
    """Return a nutrient's quantity rounded to 2 decimals, or NaN if it is unavailable."""
    nutrient_field = parse_json_like(nutrient_field)
    if not isinstance(nutrient_field, dict):
        return np.nan

    keys_to_try = [key]
    if fallback_keys:
        keys_to_try.extend(fallback_keys)

    for candidate in keys_to_try:
        value = nutrient_field.get(candidate)
        if isinstance(value, dict) and "quantity" in value:
            try:
                return round(float(value["quantity"]), 2)
            except (TypeError, ValueError):
                continue
    return np.nan


def ensure_list(value):
    """Return a list regardless of whether the input was JSON, scalar, or list-like."""
    parsed = parse_json_like(value)
    if parsed is None:
        return []
    if isinstance(parsed, list):
        return parsed
    return [parsed]


def ingredients_to_names(ingredients_field):
    """Return ordered, deduplicated ingredient names from the raw ingredients field."""
    items = ensure_list(ingredients_field)

    seen = set()
    names = []
    for item in items:
        if isinstance(item, dict):
            name = (item.get("food") or "").strip().lower()
        else:
            name = str(item).strip().lower()
        if not name or name in seen:
            continue
        seen.add(name)
        names.append(name)
    return names


def normalize_list_field(value):
    """Standardize label-like fields to lowercase lists regardless of raw storage format."""
    items = ensure_list(value)

    normalized = []
    for item in items:
        text = str(item).strip().lower()
        if text:
            normalized.append(text)
    return normalized


def identify_invalid_macro_rows(df):
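    """Flag rows where all macros are zero and calories are zero or ingredient text is empty.

    If the frame has no ``ingredient_text`` column, the text is treated as empty
    for every row, so every row with all-zero macros is flagged.
    """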
    ingredient_text = (
        df["ingredient_text"].astype(str).str.strip()
        if "ingredient_text" in df.columns
        else pd.Series("", index=df.index)
    )

    return (
        (df["fat_g"] == 0)
        & (df["carbs_g"] == 0)
        & (df["protein_g"] == 0)
        & ((df["calories"] == 0) | (ingredient_text == ""))
    )

### Extract Macronutrients

Pull out fat, carbohydrate, and protein totals from the nested `total_nutrients` column.


In [6]:
data["fat_g"] = data["total_nutrients"].apply(lambda x: extract_nutrient_quantity(x, "FAT"))
data["carbs_g"] = data["total_nutrients"].apply(
    lambda x: extract_nutrient_quantity(x, "CHOCDF", fallback_keys=["CHOCDF.net"])
)
data["protein_g"] = data["total_nutrients"].apply(lambda x: extract_nutrient_quantity(x, "PROCNT"))
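
For reference, each `total_nutrients` value is a JSON object keyed by nutrient code, and every entry carries a `unit`, a `label`, and a `quantity`. The sketch below walks through the lookup and fallback logic on a made-up sample string. The values and labels are illustrative rather than taken from the dataset, and `lookup_quantity` is a hypothetical stand-in that mirrors the idea of `extract_nutrient_quantity`:

```python
import json
import math

# Made-up sample in the same shape as the raw column (values are illustrative).
sample = json.dumps({
    "FAT": {"unit": "g", "label": "Fat", "quantity": 15.2561},
    "CHOCDF.net": {"unit": "g", "label": "Carbohydrates (net)", "quantity": 60.0},
    "PROCNT": {"unit": "g", "label": "Protein", "quantity": "n/a"},
})


def lookup_quantity(raw, keys):
    """Return the first usable quantity among `keys`, rounded to 2 decimals."""
    nutrients = json.loads(raw)
    for key in keys:
        entry = nutrients.get(key)
        if isinstance(entry, dict) and "quantity" in entry:
            try:
                return round(float(entry["quantity"]), 2)
            except (TypeError, ValueError):
                continue  # unusable value, try the next candidate key
    return math.nan


fat = lookup_quantity(sample, ["FAT"])                     # direct hit
carbs = lookup_quantity(sample, ["CHOCDF", "CHOCDF.net"])  # falls back to the second key
protein = lookup_quantity(sample, ["PROCNT"])              # non-numeric quantity gives NaN
print(fat, carbs, protein)
```

Returning NaN instead of raising keeps a single malformed record from stopping the whole `apply` pass; the missing values can then be counted afterwards.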


### Check for Missing Macro Values

Count missing values in the extracted macronutrient columns. A non-zero count means the nutrient could not be read from `total_nutrients` for that many recipes.


In [7]:
macro_na = data[["fat_g", "carbs_g", "protein_g"]].isna().sum()
macro_na

fat_g        1
carbs_g      0
protein_g    0
dtype: int64
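
The count above shows a missing `fat_g` value, which the notebook leaves in place. A minimal sketch of how such rows could be inspected before deciding whether to keep, fill, or drop them, shown on a small made-up frame rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; names and values are made up.
toy = pd.DataFrame({
    "recipe_name": ["a", "b", "c"],
    "fat_g": [1.5, np.nan, 0.0],
    "carbs_g": [10.0, 20.0, 0.0],
    "protein_g": [3.0, 4.0, 0.0],
})
macro_cols = ["fat_g", "carbs_g", "protein_g"]

# Select rows where at least one macro is missing (NaN is distinct from 0).
missing_rows = toy.loc[toy[macro_cols].isna().any(axis=1)]
print(missing_rows)
```

Rows with a NaN macro are not matched by the `== 0` comparisons in `identify_invalid_macro_rows`, so they pass through the invalid-row filter untouched.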

### Drop Clearly Invalid Recipes

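Some rows report zero fat, carbohydrates, and protein at the same time. Because `data` has no `ingredient_text` column at this point, `identify_invalid_macro_rows` treats the supporting text as empty and flags every such row, including recipes with non-zero calories. Drop them to avoid downstream issues.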


In [8]:
nutrient_cols = ["fat_g", "carbs_g", "protein_g"]
any_zero_count = data[nutrient_cols].eq(0).any(axis=1).sum()
all_zero_count = data[nutrient_cols].eq(0).all(axis=1).sum()
print(f"Rows with any zero macro: {any_zero_count:,}")
print(f"Rows with all three macros zero: {all_zero_count:,}")

invalid_mask = identify_invalid_macro_rows(data)
print(f"Rows flagged as invalid: {invalid_mask.sum():,}")

if invalid_mask.any():
    preview_columns = [
        col
        for col in ["recipe_name", "calories", "ingredient_lines", "ingredient_text"]
        if col in data.columns
    ]
    display(data.loc[invalid_mask, preview_columns].head(5))

# Drop invalid rows and reset index
if invalid_mask.any():
    data = data.loc[~invalid_mask].reset_index(drop=True)

print(f"Remaining rows after drop: {len(data):,}")


Rows with any zero macro: 302
Rows with all three macros zero: 18
Rows flagged as invalid: 18


Unnamed: 0,recipe_name,calories,ingredient_lines
490,Pernod (pastis) Classique,64.218,"[""1 fluid ounce Pernod"",""5 fluid ounces water"",""2 ice cubes""]"
3983,Purple Piña Colada,873.191282,"[""1 (13.5 oz) can Coconut Milk"",""1 cup frozen pineapple"",""1/4 cup mulberries"",""1 scoop Vital Proteins Mixed Berry Co..."
4401,Healthy Homemade Chocolate Milk,213.346391,"[""8 oz milk"",""1 Tbsp raw cacao"",""1 Tbsp collagen hydrolysate"",""1 Tbsp pure maple syrup""]"
6476,Oatmeal Raisin Cookie Breakfast Bowl Recipe,1074.815929,"[""2 cups unsweetened plain or vanilla almond milk"",""1 cup water"",""3/4 cup oats"",""1/4 cup quinoa"",""1/2 tsp salt"",""2 t..."
9263,Green Breakfast Smoothie,385.440714,"[""1 cup spinach"",""1 1/2 cups mixed frozen fruit"",""1/4 cup raspberries"",""1 cutie orange, peeled"",""1 Tablespoon chia s..."


Remaining rows after drop: 39,429
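
Note that the previewed rows have non-zero calories and non-empty ingredient lists. If the intent is to drop only rows that lack supporting data, one possible variant checks `calories` and the length of `ingredient_lines` directly. This is a sketch on made-up data, and `strict_invalid_mask` is a hypothetical helper, not part of the notebook:

```python
import json

import pandas as pd


def strict_invalid_mask(df):
    """Flag rows with all-zero macros AND (zero calories OR no ingredient lines)."""
    # ingredient_lines is assumed to hold JSON-encoded lists, as in the raw export.
    has_no_ingredients = df["ingredient_lines"].apply(lambda s: len(json.loads(s)) == 0)
    all_macros_zero = df[["fat_g", "carbs_g", "protein_g"]].eq(0).all(axis=1)
    return all_macros_zero & (df["calories"].eq(0) | has_no_ingredients)


# Toy rows: only the second one lacks both calories and ingredients.
toy = pd.DataFrame({
    "recipe_name": ["pastis", "empty row", "soup"],
    "calories": [64.2, 0.0, 350.0],
    "fat_g": [0.0, 0.0, 12.0],
    "carbs_g": [0.0, 0.0, 40.0],
    "protein_g": [0.0, 0.0, 9.0],
    "ingredient_lines": ['["1 fluid ounce Pernod"]', "[]", '["2 cups stock"]'],
})

mask = strict_invalid_mask(toy)
print(toy.loc[mask, "recipe_name"].tolist())
```

Whether the stricter or the broader rule is preferable depends on how much the downstream application trusts recipes whose macros are all zero despite listed ingredients.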


### Normalize Ingredient and Label Fields

Convert nested or JSON-encoded columns into consistent Python lists for downstream use.


In [9]:
data["ingredient_lines"] = data["ingredient_lines"].apply(ensure_list)
data["ingredient_names"] = data["ingredients"].apply(ingredients_to_names)

list_columns = {
    "normalized_health_labels": "health_labels",
    "normalized_diet_labels": "diet_labels",
    "normalized_meal_types": "meal_type",
    "normalized_dish_types": "dish_type",
}
for new_col, source_col in list_columns.items():
    data[new_col] = data[source_col].apply(normalize_list_field)

data[["ingredient_names", "normalized_health_labels", "normalized_diet_labels"]].head()


Unnamed: 0,ingredient_names,normalized_health_labels,normalized_diet_labels
0,"[kosher salt, water, green cabbage, carrots, scallions, eggs, apple cider vinegar, granulated sugar, dry mustard, bl...","[vegetarian, gluten-free, peanut-free, tree-nut-free, soy-free, fish-free, shellfish-free]",[balanced]
1,"[ham steak, olive oil, onion, garlic, water, carrots, cooked black beans, salt, black pepper, spinach]","[dairy-free, gluten-free, egg-free, peanut-free, tree-nut-free, soy-free, fish-free, shellfish-free]",[high-fiber]
2,"[firm tofu, canola oil, scallions, piece of ginger, garlic, button mushrooms, cilantro, plum tomatoes, soy sauce, ri...","[vegan, vegetarian, dairy-free, egg-free, tree-nut-free, fish-free, shellfish-free]","[high-fiber, low-carb]"
3,"[oil, rice vinegar, granulated sugar, kosher salt, carrots, daikon, mayonnaise, sriracha, hoisin sauce, eggs, cornst...","[peanut-free, tree-nut-free, fish-free, shellfish-free]",[high-fiber]
4,"[macaron, milk, vanilla ice cream, ice]","[vegetarian, peanut-free, tree-nut-free, soy-free, fish-free, shellfish-free]",[]
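
Once the labels are lowercase lists, downstream filtering reduces to membership tests. A minimal sketch on a made-up frame that mimics the normalized columns; the recipe names and the `required` set are illustrative:

```python
import pandas as pd

# Toy frame with a list-valued label column, mimicking the normalized output.
toy = pd.DataFrame({
    "recipe_name": ["tofu stir-fry", "black bean soup", "frappé"],
    "normalized_health_labels": [
        ["vegan", "vegetarian", "dairy-free"],
        ["dairy-free", "gluten-free"],
        ["vegetarian"],
    ],
})

# Keep recipes whose label list contains every required label.
required = {"vegetarian", "dairy-free"}
is_match = toy["normalized_health_labels"].apply(lambda labels: required.issubset(labels))
matches = toy.loc[is_match]
print(matches["recipe_name"].tolist())
```

Because normalization already lowercased and stripped the labels, the comparison needs no further case handling.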


### Final Formatting & Export

Round numeric columns for readability, review a sample, and write the processed CSV.


In [10]:
numeric_cols = ["calories", "fat_g", "carbs_g", "protein_g"]
data[numeric_cols] = data[numeric_cols].round(2)

preview_columns = [
    "recipe_name",
    "calories",
    "fat_g",
    "carbs_g",
    "protein_g",
    "normalized_diet_labels",
    "normalized_meal_types",
]
data.loc[:, preview_columns].head()

Unnamed: 0,recipe_name,calories,fat_g,carbs_g,protein_g,normalized_diet_labels,normalized_meal_types
0,Classic Cabbage Slaw with Grandmother Shinn's Dressing,511.28,15.26,74.52,19.6,[balanced],[lunch/dinner]
1,Black Bean Soup,1851.0,55.54,158.91,181.81,[high-fiber],[lunch/dinner]
2,Eat for Eight Bucks: Tofu with Tomatoes and Cilantro Recipe,1643.76,102.62,103.41,101.24,"[high-fiber, low-carb]",[lunch/dinner]
3,Fried Chicken Banh Mi,8471.18,559.33,662.33,203.31,[high-fiber],[lunch/dinner]
4,The Macaron Frappé,276.24,14.43,30.11,6.58,[],[snack]


In [11]:
data[numeric_cols].describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
calories,39429.0,2139.618606,1924.846542,0.06,866.63,1657.22,2817.68,33319.35
fat_g,39428.0,111.276985,135.031069,0.0,27.57,72.58,150.2225,3174.88
carbs_g,39429.0,207.167439,231.39432,0.0,52.12,137.41,285.01,4764.17
protein_g,39429.0,81.803151,100.756023,0.0,18.7,50.59,106.97,1510.0


In [12]:
data.to_csv(PROCESSED_DATA_PATH, index=False)
print(f"Saved processed dataset to {PROCESSED_DATA_PATH.resolve()}")
print(f"Rows in processed dataset: {len(data):,}")


Saved processed dataset to C:\Users\yefim\Self Study\Projects\meal_planner_draft\recipes_processed.csv
Rows in processed dataset: 39,429
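
Because CSV has no list type, pandas writes list-valued columns such as `ingredient_names` as their Python string representation, so a consumer has to parse them back after loading. A minimal sketch using an in-memory round trip on made-up data; `ast.literal_eval` is one option and assumes every cell holds a valid Python literal:

```python
import ast
import io

import pandas as pd

# Toy frame with a list-valued column, standing in for the processed data.
toy = pd.DataFrame({
    "recipe_name": ["frappé", "plain water"],
    "ingredient_names": [["macaron", "milk"], []],
})

# Round-trip through CSV in memory instead of touching the disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# converters runs on the raw cell text, turning "['macaron', 'milk']" back into a list.
restored = pd.read_csv(buffer, converters={"ingredient_names": ast.literal_eval})
print(restored["ingredient_names"].tolist())
```

The same converter would be needed for the other list columns (`ingredient_lines` and the `normalized_*` fields). An alternative, if non-Python consumers read the file, is to serialize those columns with `json.dumps` before export.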
