# MM-Food-100K Dataset Exploration

This notebook explores several food-related datasets to understand their structure, content, and potential for building the Query2Dish model.

## Datasets Overview

1. **MM-Food-100K**: 100,000 food images with rich metadata (dishes, ingredients, nutrition)
2. **ESCI Food Dataset**: ESCI-labeled food query-image pairs with visual similarity scores
3. **3A2M Recipe Dataset**: 2.2M recipes with ingredients, directions, and genre classifications
4. **Cooking Queries**: 29K+ natural language food queries

## Goals

- Analyze the structure and quality of each dataset
- Understand the relationships between queries, images, and recipes
- Identify opportunities for adapting the ESCI framework to the food domain
- Assess the datasets for training the Query2Dish search model

## Setup and Data Loading

### Import Libraries and Configuration

In [28]:
import os

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", 0)

### Dataset Files Discovery

Let's examine what food-related datasets are available in our data directory.

In [10]:
# path to the data folder
data_folder = "/Users/luvsuneja/Documents/repos/masala-embed/esci-dataset/data"

In [29]:
# print all files in data folder
print(os.listdir(data_folder))

['shopping_queries_dataset_examples.parquet', 'shopping_queries_dataset_products.parquet', 'shopping_queries_dataset_sources.csv', 'ESCI_MM-Food-100K_Example__preview_.csv', 'MM-Food-100K.csv', '3A2M.csv', 'cooking_queries_all.txt']


We can see several food-related datasets:

**Food-specific datasets:**
- `MM-Food-100K.csv` - Main food images dataset
- `ESCI_MM-Food-100K_Example__preview_.csv` - ESCI-labeled food examples
- `3A2M.csv` - Recipe dataset with 2M+ recipes
- `cooking_queries_all.txt` - Natural language food queries

**Original ESCI datasets:**
- `shopping_queries_dataset_*` - Original Amazon ESCI data for comparison

In [12]:
# read 'cooking_queries_all.txt'
df_queries = pd.read_csv(
    os.path.join(data_folder, "cooking_queries_all.txt"),
    sep="\t",
    header=None,
    names=["query"],
)

## 1. Cooking Queries Dataset

First, let's explore the natural language cooking queries to understand what users search for.

In [30]:
df_queries.sample(5)

Unnamed: 0,query
3157,pumpkin pie spice recipe
7836,what is pico de gallo
28978,beef stroganoff recipe
17814,KFC Fried Chicken Secret Recipe
1552,what is red velvet cake


In [14]:
# read MM-Food-100K.csv (name matches the directory listing; case matters on case-sensitive filesystems)
df_mm_food = pd.read_csv(os.path.join(data_folder, "MM-Food-100K.csv"))

## 2. MM-Food-100K Dataset

This is the main food images dataset with 100,000 entries containing rich metadata about dishes, ingredients, nutrition, and cooking methods.

In [31]:
df_mm_food.head()

Unnamed: 0,image_url,camera_or_phone_prob,food_prob,dish_name,food_type,ingredients,portion_size,nutritional_profile,cooking_method,sub_dt
0,https://file.b18a.io/7843322356500104680_443548_.jpeg,0.7,0.95,Fried Chicken,Restaurant food,"[""chicken"",""breading"",""oil""]","[""chicken:300g""]","{""fat_g"":25.0,""protein_g"":30.0,""calories_kcal"":400,""carbohydrate_g"":15.0}",Frying,20250704
1,https://file.b18a.io/7833227147700100732_674878_.jpeg,0.7,1.0,Pho,Restaurant food,"[""noodles"",""beef"",""basil"",""lime"",""green onions"",""chili""]","[""noodles:200g"",""beef:100g"",""vegetables:50g""]","{""fat_g"":15.0,""protein_g"":25.0,""calories_kcal"":450,""carbohydrate_g"":60.0}",boiled,20250702
2,https://file.b18a.io/7832600581600103585_264234_.jpg,0.8,0.95,Pan-fried Dumplings,Restaurant food,"[""dumplings"",""chili oil"",""soy sauce""]","[""dumplings:300g"",""sauce:50g""]","{""fat_g"":15.0,""protein_g"":20.0,""calories_kcal"":400,""carbohydrate_g"":50.0}",Pan-frying,20250625
3,https://file.b18a.io/7839056601700101188_985151_.jpg,0.7,1.0,Bananas,Raw vegetables and fruits,"[""Bananas""]","[""Bananas: 10 pieces (about 1kg)""]","{""fat_g"":3.0,""protein_g"":12.0,""calories_kcal"":1050,""carbohydrate_g"":270.0}",Raw,20250718
4,https://file.b18a.io/7837642737500100261_173129_.jpeg,0.8,0.9,Noodle Stir-Fry,Restaurant food,"[""noodles"",""chicken"",""vegetables"",""sauce""]","[""noodles:300g"",""chicken:100g"",""vegetables:50g""]","{""fat_g"":20.0,""protein_g"":25.0,""calories_kcal"":600,""carbohydrate_g"":80.0}",stir-fried,20250711


### Dataset Structure

Checking the overall dimensions first:

In [32]:
df_mm_food.shape

(100000, 10)

**Dataset size**: 100,000 rows (one per food image) and 10 columns

### Key Fields Analysis

- **image_url**: Links to food images
- **food_prob**: Confidence score that image contains food (0.90-1.00)
- **dish_name**: Human-readable dish names
- **ingredients**: JSON array of ingredients
- **nutritional_profile**: Structured nutrition data (calories, macros)
- **cooking_method**: Free-text cooking technique (not normalized, e.g. `boiled` vs. `boiling`)
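
The `ingredients` and `nutritional_profile` columns arrive as JSON strings, so they need decoding before any filtering or arithmetic. A minimal sketch, using two rows copied from the `head()` output above; the names `sample`, `nutrition`, and `parsed` are illustrative, not part of the notebook:

```python
import json

import pandas as pd

# Two rows copied from the df_mm_food.head() output above.
sample = pd.DataFrame(
    {
        "dish_name": ["Fried Chicken", "Pho"],
        "ingredients": [
            '["chicken","breading","oil"]',
            '["noodles","beef","basil","lime","green onions","chili"]',
        ],
        "nutritional_profile": [
            '{"fat_g":25.0,"protein_g":30.0,"calories_kcal":400,"carbohydrate_g":15.0}',
            '{"fat_g":15.0,"protein_g":25.0,"calories_kcal":450,"carbohydrate_g":60.0}',
        ],
    }
)

# Decode the JSON strings into Python lists.
sample["ingredients_list"] = sample["ingredients"].apply(json.loads)

# Decode the nutrition dicts and expand them into one numeric column per nutrient.
nutrition = pd.json_normalize(sample["nutritional_profile"].apply(json.loads).tolist())

# Both frames share the default 0..n-1 index, so a column-wise concat lines up row by row.
parsed = pd.concat([sample[["dish_name", "ingredients_list"]], nutrition], axis=1)
print(parsed)
```

The same two steps should carry over to the full `df_mm_food`, provided every row holds valid JSON; rows with malformed or missing values would need a guard around `json.loads`.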

Querying `dish_name` by substring to see example items (the pattern `Roo` matches both "Lotus Root" and "Mushroom"):

In [382]:
df_mm_food.query("dish_name.str.contains('Roo', na=False, case=False)")

Unnamed: 0,image_url,camera_or_phone_prob,food_prob,dish_name,food_type,ingredients,portion_size,nutritional_profile,cooking_method,sub_dt
266,https://file.b18a.io/7838173276400109163_913120_.jpg,0.7,0.90,Lotus Root Soup,Homemade food,"[""lotus root"",""goji berries"",""red dates"",""water""]","[""lotus root:200g"",""goji berries:30g"",""red dates:20g""]","{""fat_g"":1.5,""protein_g"":3.0,""calories_kcal"":150,""carbohydrate_g"":35.0}",boiling,20250702
403,https://file.b18a.io/7836539002300107170_887273_.png,0.8,0.95,Stir-fried Chicken with Mushrooms,Restaurant food,"[""chicken"",""mushrooms"",""green onions"",""soy sauce""]","[""chicken:200g"",""mushrooms:150g"",""sauce:50g""]","{""fat_g"":20.0,""protein_g"":30.0,""calories_kcal"":450,""carbohydrate_g"":40.0}",Stir-frying,20250716
575,https://file.b18a.io/7833149487400100112_516441_.jpg,0.8,1.00,Mushroom Soup,Homemade food,"[""mushrooms"",""green onions"",""red peppers"",""broth""]","[""mushrooms:200g"",""vegetables:100g"",""broth:150g""]","{""fat_g"":10.0,""protein_g"":15.0,""calories_kcal"":250,""carbohydrate_g"":30.0}",simmering,20250718
892,https://file.b18a.io/7832951279300109465_681960_.jpeg,0.7,0.90,Mushrooms,Raw vegetables and fruits,"[""mushrooms""]","[""mushrooms:300g""]","{""fat_g"":0.5,""protein_g"":3.1,""calories_kcal"":50,""carbohydrate_g"":10.0}",Raw,20250722
959,https://file.b18a.io/7833594142600100691_330054_.jpg,0.7,0.90,Spicy Stir-Fried Lotus Root,Homemade food,"[""lotus root"",""oil"",""spices""]","[""lotus root:300g"",""oil:20g""]","{""fat_g"":10.0,""protein_g"":5.0,""calories_kcal"":250,""carbohydrate_g"":35.0}",stir-frying,20250708
...,...,...,...,...,...,...,...,...,...,...
99421,https://file.b18a.io/7833741876900106219_712413_.jpeg,0.8,0.90,Mushroom and Goji Berry Soup,Homemade food,"[""mushrooms"",""goji berries"",""water""]","[""mushrooms:200g"",""goji berries:50g""]","{""fat_g"":2.0,""protein_g"":5.0,""calories_kcal"":150,""carbohydrate_g"":30.0}",boiling,20250711
99450,https://file.b18a.io/7855191536600102146_788001_.jpg,0.7,1.00,Mushroom Stir-Fry,Homemade food,"[""mushrooms"",""green peppers"",""red peppers"",""garlic"",""soy sauce""]","[""mushrooms:200g"",""green peppers:100g"",""red peppers:50g""]","{""fat_g"":10.0,""protein_g"":15.0,""calories_kcal"":250,""carbohydrate_g"":30.0}",stir-frying,20250707
99645,https://file.b18a.io/7840707662800104134_155163_.jpg,0.7,0.90,Stir-fried Mushrooms and Vegetables,Homemade food,"[""mushrooms"",""bell peppers"",""onions"",""soy sauce""]","[""mushrooms:300g"",""bell peppers:50g"",""onions:50g""]","{""fat_g"":5.0,""protein_g"":10.0,""calories_kcal"":200,""carbohydrate_g"":30.0}",Stir-frying,20250703
99700,https://file.b18a.io/7841749432200105201_624723_.jpeg,0.7,0.90,Lotus Root Soup,Homemade food,"[""lotus root"",""green onion"",""water""]","[""lotus root:300g"",""green onion:10g""]","{""fat_g"":0.5,""protein_g"":3,""calories_kcal"":150,""carbohydrate_g"":35}",boiled,20250717


In [None]:
df_mm_food.query("dish_name.str.contains('Pho', na=False, case=False)").iloc[0]

image_url                                   https://file.b18a.io/7833227147700100732_674878_.jpeg
camera_or_phone_prob                                                                          0.7
food_prob                                                                                     1.0
dish_name                                                                                     Pho
food_type                                                                         Restaurant food
ingredients                              ["noodles","beef","basil","lime","green onions","chili"]
portion_size                                        ["noodles:200g","beef:100g","vegetables:50g"]
nutritional_profile     {"fat_g":15.0,"protein_g":25.0,"calories_kcal":450,"carbohydrate_g":60.0}
cooking_method                                                                             boiled
sub_dt                                                                                   20250702
Name: 1, dtype: object

In [None]:
print(df_mm_food["dish_name"].nunique())

19288

There are 19,288 unique dish names in total (`nunique()` ignores missing values by default).

In [376]:
df_unique_dishes = df_mm_food.drop_duplicates(subset=["dish_name"])

In [None]:
df_unique_dishes.shape

(19289, 10)
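
The one-row gap between `nunique()` (19,288) and the deduplicated frame (19,289 rows) is consistent with a single missing `dish_name`: `nunique()` skips NaN by default, while `drop_duplicates` keeps one NaN row. That explanation is an assumption, not something verified against the full CSV. A toy frame with made-up rows shows the behavior:

```python
import numpy as np
import pandas as pd

# Toy data: two distinct names, one repeat, and one missing name.
toy = pd.DataFrame({"dish_name": ["Pho", "Pho", "Bananas", np.nan]})

n_unique = toy["dish_name"].nunique()                       # NaN excluded by default
n_unique_with_nan = toy["dish_name"].nunique(dropna=False)  # NaN counted as a value
n_dedup_rows = len(toy.drop_duplicates(subset=["dish_name"]))  # one NaN row is kept

print(n_unique, n_unique_with_nan, n_dedup_rows)
```

On the real data, `df_mm_food["dish_name"].isna().sum()` would confirm or rule out this explanation.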

In [None]:
df_unique_dishes.to_parquet(
    os.path.join(data_folder, "mm-food-100k-unique-dishes.parquet"), index=False
)

In [None]:
df_unique_dishes = pd.read_parquet(
    os.path.join(data_folder, "mm-food-100k-unique-dishes.parquet")
)

In [381]:
df_unique_dishes.sample(5)

Unnamed: 0,image_url,camera_or_phone_prob,food_prob,dish_name,food_type,ingredients,portion_size,nutritional_profile,cooking_method,sub_dt
8923,https://file.b18a.io/7843258368600103237_508632_.jpg,0.8,0.9,Fried Platter,Restaurant food,"[""fried items"",""sauce"",""lettuce"",""onion""]","[""fried items:300g"",""sauce:50g"",""salad:50g""]","{""fat_g"":35.0,""protein_g"":25.0,""calories_kcal"":600,""carbohydrate_g"":50.0}",Fried,20250706
9304,https://file.b18a.io/7839325701600105671_962855_.jpg,0.8,0.9,Burger with Fruits,Homemade food,"[""burger patty"",""bun"",""strawberries"",""blueberries""]","[""burger:200g"",""strawberries:100g"",""blueberries:50g""]","{""fat_g"":30.0,""protein_g"":25.0,""calories_kcal"":600,""carbohydrate_g"":50.0}",Grilled,20250703
678,https://file.b18a.io/7893306157900101617_936768_.jpeg,0.8,0.9,Banana,Raw vegetables and fruits,"[""banana""]","[""banana:120g""]","{""fat_g"":0.3,""protein_g"":1.3,""calories_kcal"":105,""carbohydrate_g"":27}",Raw,20250717
7453,https://file.b18a.io/7832754434500100633_135877_.jpeg,0.8,0.9,Cheese Sandwiches,Restaurant food,"[""bread"",""cheese"",""butter""]","[""bread:200g"",""cheese:100g""]","{""fat_g"":10.0,""protein_g"":12.0,""calories_kcal"":300,""carbohydrate_g"":40.0}","No cooking, assembled",20250625
9920,https://file.b18a.io/7833381352600100120_979298_.jpg,0.8,1.0,Egg and Ham Tortilla Wraps,Homemade food,"[""tortilla"",""egg"",""ham"",""avocado""]","[""tortilla:100g"",""egg:100g"",""ham:50g"",""avocado:50g""]","{""fat_g"":30.0,""protein_g"":25.0,""calories_kcal"":500,""carbohydrate_g"":40.0}",Fried,20250630


In [383]:
df_unique_dishes["food_type"].value_counts()

food_type
Homemade food                9452
Restaurant food              6505
Packaged food                2635
Raw vegetables and fruits     612
Others                         85
Name: count, dtype: int64

In [385]:
df_unique_dishes["food_type"].value_counts(normalize=True)

food_type
Homemade food                0.490020
Restaurant food              0.337239
Packaged food                0.136606
Raw vegetables and fruits    0.031728
Others                       0.004407
Name: proportion, dtype: float64

In [457]:
df_unique_dishes.query("food_type == 'Packaged food'").sample(5)

Unnamed: 0,image_url,camera_or_phone_prob,food_prob,dish_name,food_type,ingredients,portion_size,nutritional_profile,cooking_method,sub_dt
7386,https://file.b18a.io/7837100869800101655_833829_.jpg,0.7,0.9,Grilled Flavor Pork,Packaged food,"[""pork"",""seasoning""]","[""pork:30g""]","{""fat_g"":10.0,""protein_g"":12.0,""calories_kcal"":150,""carbohydrate_g"":5.0}",grilled,20250702
10543,https://file.b18a.io/7833530512900106817_362477_.jpg,0.7,0.9,Cheese-flavored snack,Packaged food,"[""cornmeal"",""cheese powder"",""oil""]","[""snack:50g""]","{""fat_g"":8.0,""protein_g"":2.0,""calories_kcal"":150,""carbohydrate_g"":18.0}",Fried,20250707
6066,https://file.b18a.io/7838857055500104955_332357_.jpeg,0.8,0.9,Green Tea Snack Bar,Packaged food,"[""green tea"",""nuts"",""sugar"",""rice""]","[""snack_bar:50g""]","{""fat_g"":5.0,""protein_g"":3.0,""calories_kcal"":150,""carbohydrate_g"":25.0}",Packaged,20250719
18026,https://file.b18a.io/7859488616900107050_795828_.jpeg,0.7,0.9,Oreo Strawberry Cream,Packaged food,"[""Oreo cookie"",""strawberry cream""]","[""Oreo: 100g""]","{""fat_g"":7.0,""protein_g"":1.0,""calories_kcal"":150,""carbohydrate_g"":22.0}",Packaged,20250718
10413,https://file.b18a.io/7954739284400108122_318853_.png,0.6,0.9,Assorted Snack Chips,Packaged food,"[""potato"",""oil"",""salt"",""flavoring""]","[""potato chips:500g""]","{""fat_g"":80.0,""protein_g"":20.0,""calories_kcal"":1500,""carbohydrate_g"":180.0}",Fried,20250721


The Pho record shown earlier illustrates:
- **Rich ingredient data**: Ingredient lists stored as JSON arrays
- **Nutritional information**: Structured macronutrient and calorie data
- **Portion sizing**: Per-ingredient weights such as `noodles:200g`
- **Cooking methods**: One technique label per dish, though spelling and tense vary across rows (`boiled`, `boiling`, `Fried`, `Frying`)
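
The `cooking_method` values in the sample outputs vary in case and tense (`Frying`, `Fried`, `boiled`, `boiling`, `stir-fried`, `Stir-frying`), so grouping by method would benefit from a cleanup pass. A minimal sketch: the `CANONICAL` mapping is a hypothetical vocabulary covering only the values seen above, and would need extending after inspecting the full `value_counts()`.

```python
import pandas as pd

# cooking_method values copied from the sample outputs above.
methods = pd.Series(
    ["Frying", "boiled", "Pan-frying", "Raw", "stir-fried", "boiling",
     "Stir-frying", "simmering", "Fried", "Grilled", "grilled"]
)

# Hypothetical canonical vocabulary (an assumption, not part of the dataset).
CANONICAL = {
    "frying": "fried",
    "fried": "fried",
    "pan-frying": "pan-fried",
    "stir-fried": "stir-fried",
    "stir-frying": "stir-fried",
    "boiled": "boiled",
    "boiling": "boiled",
    "simmering": "simmered",
    "grilled": "grilled",
    "raw": "raw",
}

# Trim whitespace and lowercase before looking up the canonical form.
cleaned = methods.str.strip().str.lower()

# Fall back to the lowercased value when no mapping is defined.
normalized = cleaned.map(CANONICAL).fillna(cleaned)
print(normalized.value_counts())
```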

### Sample Record Analysis

Let's examine a detailed record to understand the data structure:

In [356]:
# read 'ESCI_MM-Food-100K_Example__preview_.csv'
esci_df = pd.read_csv(
    os.path.join(data_folder, "ESCI_MM-Food-100K_Example__preview_.csv")
)

In [373]:
esci_df.iloc[0]

example_id                                                                                             Q001_C001
query_id                                                                                                    Q001
query_text                                                                         vegan pad thai under 600 kcal
query_type                                                                                             dish_name
query_filters                                                                  {"diet": "vegan", "max_cal": 600}
query_source                                                                                           synthetic
candidate_id                                                                                              C-0001
candidate_image_url                                                                       mmfood/images/0001.jpg
candidate_dish_name                                                                             

## 3. ESCI Food Dataset

This dataset applies the ESCI framework to the food domain, with image queries and candidate food items. It includes visual similarity scores and food-specific features.

In [357]:
esci_df.shape

(9, 26)

**Small preview dataset**: 9 examples with 26 detailed features

This appears to be a preview of the full ESCI food dataset, showing the rich feature engineering applied to food search scenarios.

In [358]:
esci_df.sample(1).iloc[0]

example_id                                                                                               Q001_C003
query_id                                                                                                      Q001
query_text                                                                           vegan pad thai under 600 kcal
query_type                                                                                               dish_name
query_filters                                                                    {"diet": "vegan", "max_cal": 600}
query_source                                                                                             synthetic
candidate_id                                                                                                C-0003
candidate_image_url                                                                         mmfood/images/0003.jpg
candidate_dish_name                                                             

**Key Food-Specific Features:**

1. **Visual Query Support**: `query_type: "image_query"` - Supports image-based food search
2. **Food-Specific Similarity Metrics**:
   - `ingredient_jaccard: 0.25` - Ingredient overlap score
   - `nutrition_distance: 0.1` - Nutritional similarity
   - `image_sim: 0.72` - Visual similarity
3. **Rich Candidate Metadata**: Full nutritional profiles, cooking methods, ingredients
4. **ESCI Label**: "Substitute" - Shows different protein but visual similarity
5. **Human + AI Labeling**: Both human annotators and confidence scores

This demonstrates how the ESCI framework can be extended to the food domain with domain-specific similarity metrics.
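
The preview file carries an `ingredient_jaccard` column but does not say how it was computed. The sketch below uses the standard Jaccard definition (intersection size over union size); the lowercasing and the helper name `ingredient_jaccard` are assumptions made for illustration, and the dataset authors may have normalized ingredients differently.

```python
import json


def ingredient_jaccard(a: list[str], b: list[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over lowercased ingredient names."""
    set_a = {x.strip().lower() for x in a}
    set_b = {x.strip().lower() for x in b}
    union = set_a | set_b
    # Two empty lists share nothing to compare; return 0.0 by convention.
    return len(set_a & set_b) / len(union) if union else 0.0


# Ingredient lists copied from two MM-Food-100K rows shown earlier.
pho = json.loads('["noodles","beef","basil","lime","green onions","chili"]')
stir_fry = json.loads('["noodles","chicken","vegetables","sauce"]')

score = ingredient_jaccard(pho, stir_fry)
print(round(score, 3))
```

The two dishes share only `noodles` out of nine distinct ingredients, so the score is 1/9. Exact string matching treats `green onion` and `green onions` as different items, which is one reason a real implementation would likely need ingredient normalization first.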

### ESCI Food Framework Analysis

This sample shows the adaptation of ESCI to the food domain:

In [359]:
df_342 = pd.read_csv(os.path.join(data_folder, "3A2M.csv"))

## 4. 3A2M Recipe Dataset

A massive dataset of 2.2M+ recipes with ingredients, cooking directions, and genre classifications.

In [360]:
df_342.sample(1).iloc[0]

Unnamed: 0                                                                                                                                                                                                               25245
title                                                                                                                                                                                                          Orange Scallops
directions    ["Combine zest, juice, oil and scallops in a plastic bag; refrigerate 1 hour.", "Broil scallops 4 inches from heat, 6 minutes until dull in color.", "Add parsley.", "(136 calories, 19 protein, 4 grams fat.)"]
NER                                                                                                                                                    ["orange zest", "olive oil", "parsley", "orange juice", "bay scallops"]
genre                                                                                                       

**Key Features:**
- **Detailed directions**: Step-by-step cooking instructions
- **Named Entity Recognition (NER)**: Extracted ingredients list
- **Genre classification**: Categorized as "drinks" (may need review)
- **Numerical labels**: Multi-class categorization system

The recipe provides rich textual data that could complement the image-based MM-Food-100K dataset.

### Recipe Structure Analysis

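These notes refer to a "Sherried Orange Chicken" recipe, whereas the unseeded `sample(1)` call above displayed "Orange Scallops"; the sampled record changes from run to run.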

In [None]:
df_342["label"].value_counts()

label
4    398677
2    353938
6    340495
8    338497
3    315828
5    177109
1    160712
9     92630
7     53257
Name: count, dtype: int64

### Label Distribution Analysis

The nine numeric labels line up closely with the genre names; label 4 and `vegetables`, for example, both count 398,677 rows:

In [None]:
df_342["genre"].value_counts()

genre
vegetables    398677
drinks        353938
cereal        340496
sides         338497
nonveg        315828
fastfood      177108
bakery        160711
fusion         84800
meal           53258
Fusion          7830
Name: count, dtype: int64

**Observations:**
- **Vegetables dominate** (398K recipes) - Good coverage for plant-based queries
- **Drinks and cereals** well represented - Broad food category coverage
- **Non-vegetarian dishes** substantial (315K) - Good protein variety
- **Genre diversity** - From fastfood to fusion cuisine
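- **Note**: Some categorizations may need review (e.g., chicken dish labeled as "drinks"), and `fusion` and `Fusion` are counted as separate genres (84,800 + 7,830 = 92,630, which matches label 9)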
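
The genre counts list `fusion` and `Fusion` separately, which looks like a casing inconsistency instead of two real categories. A small sketch of merging them, working from the counts printed above so it runs without the full CSV:

```python
import pandas as pd

# Genre counts copied from the value_counts() output above.
genre_counts = pd.Series(
    {
        "vegetables": 398677, "drinks": 353938, "cereal": 340496,
        "sides": 338497, "nonveg": 315828, "fastfood": 177108,
        "bakery": 160711, "fusion": 84800, "meal": 53258, "Fusion": 7830,
    }
)

# Lowercase the genre names and sum the counts that now share a key.
merged = genre_counts.groupby(genre_counts.index.str.lower()).sum()
merged = merged.sort_values(ascending=False)
print(merged)
```

On the full frame, the equivalent would be `df_342["genre"] = df_342["genre"].str.lower()`, after which `pd.crosstab(df_342["label"], df_342["genre"])` would show how the numeric labels map onto genres and where the few off-by-one mismatches in the counts come from.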

### Genre Distribution Analysis

The genre distribution reveals the recipe dataset's coverage:

In [25]:
df_342.shape

(2231143, 6)

**Massive scale**: 2.23M recipes providing extensive textual recipe data to complement image datasets.

In [26]:
df_queries["query"].sample(5)

5481       what is pizza in italian
25261      what is tandoori chicken
25748    what beef is used for stew
12357           what is sponge cake
21123         what is a butter cake
Name: query, dtype: object

**Query Types Observed:**

1. **Definition queries**: "what is a gazpacho", "what is red velvet cake"
2. **Translation queries**: "what is pizza in italian"  
3. **Recipe requests**: "KFC Fried Chicken Secret Recipe", "pumpkin pie spice recipe"
4. **Ingredient inquiries**: "what are baby back ribs made from"
5. **Cooking instructions**: "how to bake homemade pumpkin"

This diversity shows that food search involves multiple types of information needs beyond just finding dishes.
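
As a rough first pass at separating these intent types, a handful of ordered regular expressions can tag each query. The rules and intent names below are hypothetical, written against the example queries in this notebook; they will misfire on queries outside those patterns and are no substitute for a trained classifier.

```python
import re

# Hypothetical intent rules, checked in order; the first match wins.
INTENT_RULES = [
    ("translation", re.compile(r"^what is .+ in \w+$")),
    ("ingredient", re.compile(r"\b(made (of|from)|is used for)\b")),
    ("definition", re.compile(r"^what (is|are)\b")),
    ("instruction", re.compile(r"^how (to|do|long)\b")),
    ("recipe", re.compile(r"\brecipe\b")),
]


def classify_intent(query: str) -> str:
    """Return the first matching intent name, or 'other' if no rule fires."""
    text = query.strip().lower()
    for intent, pattern in INTENT_RULES:
        if pattern.search(text):
            return intent
    return "other"


# Example queries taken from the samples above.
examples = [
    "what is red velvet cake",
    "what is pizza in italian",
    "KFC Fried Chicken Secret Recipe",
    "what are baby back ribs made from",
    "how to bake homemade pumpkin",
]
for q in examples:
    print(f"{classify_intent(q):12s} {q}")
```

Applied to the full dataset, `df_queries["query"].map(classify_intent).value_counts()` would give a quick estimate of the intent mix and show how many queries fall into `other`.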

### Query Pattern Analysis

These sample queries show different types of food-related search intents:

## 5. Cooking Queries Analysis

Returning to our cooking queries dataset to understand user intent patterns:

In [27]:
df_queries["query"].shape

(29557,)

**Query dataset size**: 29,557 cooking-related queries providing diverse search intent patterns.

## Key Insights & Query2Dish Model Implications

### Dataset Complementarity

Our exploration reveals four complementary datasets that together provide comprehensive food search capabilities:

| Dataset | Scale | Strength | Use Case |
|---------|-------|----------|----------|
| **MM-Food-100K** | 100K images | Rich visual + metadata | Image-based food search |
| **ESCI Food** | 9 examples (preview) | ESCI framework adaptation | Relevance labeling patterns |
| **3A2M Recipes** | 2.2M recipes | Detailed cooking instructions | Recipe recommendations |
| **Cooking Queries** | 29K queries | Natural language patterns | Query understanding |

### Food-Specific ESCI Adaptations

The ESCI food dataset demonstrates key enhancements for culinary search:

1. **Multi-modal queries**: Support for both text and image queries
2. **Food-specific similarity metrics**:
   - Ingredient overlap (Jaccard similarity)
   - Nutritional distance
   - Visual appearance matching
3. **Rich candidate metadata**: Nutritional profiles, cooking methods, ingredients
4. **Domain expertise**: Human food annotators with culinary knowledge

### Query2Dish Model Architecture Recommendations

**Multi-Dataset Integration Strategy**:
- **MM-Food-100K**: Primary candidate database with visual + metadata
- **3A2M Recipes**: Extended candidate pool for recipe-specific queries  
- **Cooking Queries**: Training data for query intent classification
- **ESCI Food**: Relevance labeling methodology and feature engineering

**Search Pipeline Design**:
1. **Query Classification**: Identify intent type (recipe, definition, ingredient, etc.)
2. **Candidate Retrieval**: Search appropriate dataset(s) based on query type
3. **Multi-Modal Matching**: Combine text, visual, and nutritional similarity
4. **ESCI Ranking**: Apply food-adapted ESCI framework for relevance scoring
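
Step 3 of the pipeline could start as a plain weighted sum of the per-candidate signals. In the sketch below the feature values are the ones quoted in the ESCI preview discussion, while the weights, the function name `combined_score`, and the assumption that `nutrition_distance` lies in [0, 1] with lower meaning closer are all illustrative choices, not part of any dataset.

```python
# Feature values quoted in the ESCI preview discussion above.
features = {"image_sim": 0.72, "ingredient_jaccard": 0.25, "nutrition_distance": 0.1}

# Hypothetical weights; in practice these would be tuned on labeled data.
WEIGHTS = {"image": 0.5, "ingredient": 0.3, "nutrition": 0.2}


def combined_score(f: dict[str, float]) -> float:
    """Weighted sum of similarities; the distance is flipped so higher is better."""
    nutrition_sim = 1.0 - f["nutrition_distance"]  # assumes distance lies in [0, 1]
    return (
        WEIGHTS["image"] * f["image_sim"]
        + WEIGHTS["ingredient"] * f["ingredient_jaccard"]
        + WEIGHTS["nutrition"] * nutrition_sim
    )


score = combined_score(features)
print(round(score, 3))
```

A linear combination is easy to inspect and debug, which makes it a reasonable baseline before moving to a learned ranker trained on ESCI labels.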

### Next Steps for Implementation

1. **Data Integration**: Merge datasets with consistent schema
2. **Feature Engineering**: Implement food-specific similarity metrics
3. **Model Training**: Use cooking queries for intent classification
4. **Evaluation Framework**: Adapt ESCI metrics for food domain validation

The combination of these datasets provides a solid foundation for building a comprehensive food search system that handles diverse query types and leverages both visual and textual food information.