# MM-Food-100K Dataset Exploration

This notebook explores multiple food-related datasets to understand their structure, content, and potential for building the Query2Dish model.

## Datasets Overview

1. **MM-Food-100K**: 100,000 food images with rich metadata (dishes, ingredients, nutrition)
2. **ESCI Food Dataset**: Food query-image pairs labeled with ESCI relevance classes (Exact, Substitute, Complement, Irrelevant), plus visual similarity scores
3. **3A2M Recipe Dataset**: 2.2M recipes with ingredients, directions, and genre classifications
4. **Cooking Queries**: 29K+ natural language food queries

## Goals

- Analyze the structure and quality of each dataset
- Understand the relationships between queries, images, and recipes
- Identify opportunities to adapt the ESCI framework to the food domain
- Assess the datasets for training a Query2Dish search model

## Setup and Data Loading

### Import Libraries and Configuration

In [9]:
import os

import pandas as pd

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.width", 0)

### Dataset Files Discovery

Let's examine what food-related datasets are available in our data directory.

In [10]:
# path to the data folder
data_folder = "/Users/luvsuneja/Documents/repos/masala-embed/esci-dataset/data"

In [11]:
# print all files in data folder
print(os.listdir(data_folder))

['shopping_queries_dataset_examples.parquet', 'shopping_queries_dataset_products.parquet', 'shopping_queries_dataset_sources.csv', 'ESCI_MM-Food-100K_Example__preview_.csv', 'MM-Food-100K.csv', '3A2M.csv', 'cooking_queries_all.txt']


We can see several food-related datasets:

**Food-specific datasets:**
- `MM-Food-100K.csv` - Main food images dataset
- `ESCI_MM-Food-100K_Example__preview_.csv` - ESCI-labeled food examples
- `3A2M.csv` - Recipe dataset with 2M+ recipes
- `cooking_queries_all.txt` - Natural language food queries

**Original ESCI datasets:**
- `shopping_queries_dataset_*` - Original Amazon ESCI data for comparison

In [12]:
# read 'cooking_queries_all.txt'
df_queries = pd.read_csv(
    os.path.join(data_folder, "cooking_queries_all.txt"),
    sep="\t",
    header=None,
    names=["query"],
)

## 1. Cooking Queries Dataset

First, let's explore the natural language cooking queries to understand what users search for.

In [13]:
df_queries.sample(5)

Unnamed: 0,query
11925,what is chimichurri
1717,cooking time for baked chicken breast
10425,how to cook pork chops in gravy
28946,what is mole sauce
883,what is sponge cake


In [14]:
# read MM-Food-100K.csv (match the on-disk file name; lowercase fails on case-sensitive filesystems)
df_mm_food = pd.read_csv(os.path.join(data_folder, "MM-Food-100K.csv"))

## 2. MM-Food-100K Dataset

This is the main food images dataset with 100,000 entries containing rich metadata about dishes, ingredients, nutrition, and cooking methods.

In [15]:
df_mm_food.head()

Unnamed: 0,image_url,camera_or_phone_prob,food_prob,dish_name,food_type,ingredients,portion_size,nutritional_profile,cooking_method,sub_dt
0,https://file.b18a.io/7843322356500104680_443548_.jpeg,0.7,0.95,Fried Chicken,Restaurant food,"[""chicken"",""breading"",""oil""]","[""chicken:300g""]","{""fat_g"":25.0,""protein_g"":30.0,""calories_kcal"":400,""carbohydrate_g"":15.0}",Frying,20250704
1,https://file.b18a.io/7833227147700100732_674878_.jpeg,0.7,1.0,Pho,Restaurant food,"[""noodles"",""beef"",""basil"",""lime"",""green onions"",""chili""]","[""noodles:200g"",""beef:100g"",""vegetables:50g""]","{""fat_g"":15.0,""protein_g"":25.0,""calories_kcal"":450,""carbohydrate_g"":60.0}",boiled,20250702
2,https://file.b18a.io/7832600581600103585_264234_.jpg,0.8,0.95,Pan-fried Dumplings,Restaurant food,"[""dumplings"",""chili oil"",""soy sauce""]","[""dumplings:300g"",""sauce:50g""]","{""fat_g"":15.0,""protein_g"":20.0,""calories_kcal"":400,""carbohydrate_g"":50.0}",Pan-frying,20250625
3,https://file.b18a.io/7839056601700101188_985151_.jpg,0.7,1.0,Bananas,Raw vegetables and fruits,"[""Bananas""]","[""Bananas: 10 pieces (about 1kg)""]","{""fat_g"":3.0,""protein_g"":12.0,""calories_kcal"":1050,""carbohydrate_g"":270.0}",Raw,20250718
4,https://file.b18a.io/7837642737500100261_173129_.jpeg,0.8,0.9,Noodle Stir-Fry,Restaurant food,"[""noodles"",""chicken"",""vegetables"",""sauce""]","[""noodles:300g"",""chicken:100g"",""vegetables:50g""]","{""fat_g"":20.0,""protein_g"":25.0,""calories_kcal"":600,""carbohydrate_g"":80.0}",stir-fried,20250711


### Dataset Structure

Each record is one row; the overall dimensions are:

In [16]:
df_mm_food.shape

(100000, 10)

**Dataset size**: 100,000 rows (one per food image) and 10 columns

### Key Fields Analysis

- **image_url**: Links to food images
- **food_prob**: Confidence score that the image contains food (0.90-1.00 in the rows shown above)
- **dish_name**: Human-readable dish names
- **ingredients**: JSON array of ingredients
- **nutritional_profile**: Structured nutrition data (calories, macros)
- **cooking_method**: Cooking technique used
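
The `ingredients` and `nutritional_profile` columns arrive as JSON strings, so they need decoding before any analysis. A minimal sketch, using two rows copied from the `head()` output; `parse_json_columns` is a hypothetical helper name, not part of the notebook:

```python
import json

import pandas as pd

# Two rows copied from the df_mm_food.head() output.
sample = pd.DataFrame(
    {
        "dish_name": ["Fried Chicken", "Pho"],
        "ingredients": [
            '["chicken","breading","oil"]',
            '["noodles","beef","basil","lime","green onions","chili"]',
        ],
        "nutritional_profile": [
            '{"fat_g":25.0,"protein_g":30.0,"calories_kcal":400,"carbohydrate_g":15.0}',
            '{"fat_g":15.0,"protein_g":25.0,"calories_kcal":450,"carbohydrate_g":60.0}',
        ],
    }
)


def parse_json_columns(df):
    """Decode the JSON string columns (hypothetical helper for illustration)."""
    out = df.copy()
    # JSON array string -> Python list of ingredient names.
    out["ingredients"] = out["ingredients"].apply(json.loads)
    # JSON object string -> one numeric column per nutrient.
    nutrition = pd.json_normalize(out["nutritional_profile"].apply(json.loads).tolist())
    nutrition.index = out.index
    return pd.concat([out.drop(columns="nutritional_profile"), nutrition], axis=1)


parsed = parse_json_columns(sample)
print(parsed[["dish_name", "calories_kcal", "protein_g"]])
```

On the full frame the same call would be `parse_json_columns(df_mm_food)`, assuming every row holds valid JSON; malformed rows would raise `json.JSONDecodeError` and need separate handling.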

In [17]:
df_mm_food.loc[1]

image_url                                   https://file.b18a.io/7833227147700100732_674878_.jpeg
camera_or_phone_prob                                                                          0.7
food_prob                                                                                     1.0
dish_name                                                                                     Pho
food_type                                                                         Restaurant food
ingredients                              ["noodles","beef","basil","lime","green onions","chili"]
portion_size                                        ["noodles:200g","beef:100g","vegetables:50g"]
nutritional_profile     {"fat_g":15.0,"protein_g":25.0,"calories_kcal":450,"carbohydrate_g":60.0}
cooking_method                                                                             boiled
sub_dt                                                                                   20250702
Name: 1, dtype: object

This Pho record shows:
- **Rich ingredient data**: Detailed ingredient lists as JSON arrays
- **Nutritional information**: Structured macro and calorie data
- **Portion sizing**: Specific measurements for ingredients
- **Cooking methods**: Short technique labels, not yet normalized (the first five rows mix `Frying`, `boiled`, `Pan-frying`, `Raw`, and `stir-fried`)
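
The `cooking_method` strings in the `head()` output vary in casing and tense (`Frying`, `boiled`, `Pan-frying`, `stir-fried`), so grouping by method likely needs a normalization pass first. A minimal sketch; the `CANONICAL` mapping is a hypothetical choice that covers only the five values seen so far:

```python
import pandas as pd

# Hypothetical canonical forms for the values seen in the first five rows.
CANONICAL = {
    "frying": "fried",
    "pan-frying": "pan-fried",
    "boiled": "boiled",
    "stir-fried": "stir-fried",
    "raw": "raw",
}


def normalize_method(value):
    """Lowercase and trim a cooking method, then map it to a canonical form."""
    key = str(value).strip().lower()
    # Unknown values pass through lowercased so they can be reviewed later.
    return CANONICAL.get(key, key)


methods = pd.Series(["Frying", "boiled", "Pan-frying", "Raw", "stir-fried"])
normalized = methods.map(normalize_method)
print(normalized.tolist())
```

Running `df_mm_food["cooking_method"].map(normalize_method).value_counts()` afterwards would show which unmapped values remain.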

### Sample Record Analysis

Let's examine a detailed record to understand the data structure:

In [18]:
# read 'ESCI_MM-Food-100K_Example__preview_.csv'
esci_df = pd.read_csv(
    os.path.join(data_folder, "ESCI_MM-Food-100K_Example__preview_.csv")
)

## 3. ESCI Food Dataset

This dataset applies the ESCI framework to the food domain, pairing queries with candidate food items. It includes visual similarity scores and food-specific features.

In [19]:
esci_df.shape

(9, 26)

**Small preview dataset**: 9 examples with 26 detailed features

This appears to be a preview of the full ESCI food dataset, showing the rich feature engineering applied to food search scenarios.

In [20]:
esci_df.sample(1).iloc[0]

example_id                                                                                             Q001_C001
query_id                                                                                                    Q001
query_text                                                                         vegan pad thai under 600 kcal
query_type                                                                                             dish_name
query_filters                                                                  {"diet": "vegan", "max_cal": 600}
query_source                                                                                           synthetic
candidate_id                                                                                              C-0001
candidate_image_url                                                                       mmfood/images/0001.jpg
candidate_dish_name                                                                             

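**Key Food-Specific Features:**

The values quoted below do not match the truncated output above (a text query with `query_type: dish_name`). Because `sample(1)` draws a random row, they appear to come from a different draw.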

1. **Visual Query Support**: `query_type: "image_query"` - Supports image-based food search
2. **Food-Specific Similarity Metrics**:
   - `ingredient_jaccard: 0.25` - Ingredient overlap score
   - `nutrition_distance: 0.1` - Nutritional similarity
   - `image_sim: 0.72` - Visual similarity
3. **Rich Candidate Metadata**: Full nutritional profiles, cooking methods, ingredients
4. **ESCI Label**: "Substitute" - Shows different protein but visual similarity
5. **Human + AI Labeling**: Both human annotators and confidence scores

This demonstrates how the ESCI framework can be extended to the food domain with domain-specific similarity metrics.
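
The preview file does not document how `ingredient_jaccard` was computed. Assuming it is the standard Jaccard index over ingredient sets (intersection size divided by union size), it could be reproduced as follows, using two ingredient lists from the MM-Food-100K sample rows:

```python
def ingredient_jaccard(a, b):
    """Jaccard index of two ingredient lists, compared case-insensitively."""
    set_a = {item.strip().lower() for item in a}
    set_b = {item.strip().lower() for item in b}
    union = set_a | set_b
    if not union:
        # Two empty ingredient lists: define the similarity as 0.
        return 0.0
    return len(set_a & set_b) / len(union)


pho = ["noodles", "beef", "basil", "lime", "green onions", "chili"]
stir_fry = ["noodles", "chicken", "vegetables", "sauce"]

# One shared ingredient ("noodles") out of nine distinct ingredients.
score = ingredient_jaccard(pho, stir_fry)
print(round(score, 3))
```

Exact string matching treats `"green onions"` and `"scallions"` as different ingredients, so a synonym map would probably be needed for real data.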

### ESCI Food Framework Analysis

This sample shows the adaptation of ESCI to the food domain:

In [21]:
df_342 = pd.read_csv(os.path.join(data_folder, "3A2M.csv"))

## 4. 3A2M Recipe Dataset

A massive dataset of 2.2M+ recipes with ingredients, cooking directions, and genre classifications.

In [22]:
df_342.sample(1).iloc[0]

Unnamed: 0                                                                                                                                                                                         150728
title                                                                                                                                                                                           Oven Stew
directions    ["Combine the first 9 ingredients in a pan.", "Bring to boil on top of stove, then cover and bake at 375\u00b0 for 1 hour.", "Add vegetables and bake for 30 minutes.", "Remove bay leaf."]
NER                                                                     ["lean meat", "onion", "garlic", "tomato puree", "bay leaf", "oregano", "oil", "herb vinegar", "red wine", "potatoes", "carrots"]
genre                                                                                                                                                                                           

**Key Features:**
- **Detailed directions**: Step-by-step cooking instructions
- **Named Entity Recognition (NER)**: Extracted ingredients list
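- **Genre classification**: A genre string per recipe; a chicken recipe drawn in another run was labeled "drinks", so some labels may need review
- **Numerical labels**: Integer `label` column (1-9) that mirrors the genre classes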

The recipe provides rich textual data that could complement the image-based MM-Food-100K dataset.

### Recipe Structure Analysis

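Each row pairs a title with step-by-step directions and an extracted ingredient list. The sample shown above is "Oven Stew"; the commentary was originally written for a "Sherried Orange Chicken" draw, since `sample(1)` returns a random row.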

In [23]:
df_342["label"].value_counts()

label
4    398677
2    353938
6    340495
8    338497
3    315828
5    177109
1    160712
9     92630
7     53257
Name: count, dtype: int64

### Label Distribution Analysis

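The integer `label` column has nine classes, ranging from 53,257 rows (label 7) to 398,677 rows (label 4), and its counts track the `genre` counts almost one-to-one.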

In [24]:
df_342["genre"].value_counts()

genre
vegetables    398677
drinks        353938
cereal        340496
sides         338497
nonveg        315828
fastfood      177108
bakery        160711
fusion         84800
meal           53258
Fusion          7830
Name: count, dtype: int64

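**Observations:**
- **Vegetables is the largest genre** (398,677 recipes, about 18% of the total) - Good coverage for plant-based queries
- **Drinks and cereal** are well represented (about 354K and 340K) - Broad food category coverage
- **Non-vegetarian dishes** are substantial (about 316K) - Good protein variety
- **Genre diversity** - From fastfood to fusion cuisine
- **Inconsistent casing**: `fusion` (84,800) and `Fusion` (7,830) are counted separately; together they total 92,630, the same count as label 9
- **Note**: Some categorizations may need review (e.g., chicken dish labeled as "drinks")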
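
The `genre` counts list `fusion` and `Fusion` as separate values. A minimal sketch of merging them by lowercasing, using the counts copied from the `value_counts()` output rather than the full frame:

```python
import pandas as pd

# Counts copied from the df_342["genre"].value_counts() output.
genre_counts = pd.Series(
    {
        "vegetables": 398677,
        "drinks": 353938,
        "cereal": 340496,
        "sides": 338497,
        "nonveg": 315828,
        "fastfood": 177108,
        "bakery": 160711,
        "fusion": 84800,
        "meal": 53258,
        "Fusion": 7830,
    }
)

# Group by the lowercased genre name so "fusion" and "Fusion" collapse.
merged = (
    genre_counts.groupby(genre_counts.index.str.lower())
    .sum()
    .sort_values(ascending=False)
)
print(merged)

# On the full frame, the equivalent steps would be:
#   df_342["genre"] = df_342["genre"].str.lower()
#   pd.crosstab(df_342["label"], df_342["genre"])
```

After merging there are nine genres, matching the nine `label` values. The crosstab in the trailing comment would show whether each label maps to exactly one genre; the off-by-one differences between the two count tables suggest a handful of rows do not.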

### Genre Distribution Analysis

The genre distribution reveals the recipe dataset's coverage:

In [25]:
df_342.shape

(2231143, 6)

**Massive scale**: 2.23M recipes providing extensive textual recipe data to complement image datasets.

In [26]:
df_queries["query"].sample(5)

5481       what is pizza in italian
25261      what is tandoori chicken
25748    what beef is used for stew
12357           what is sponge cake
21123         what is a butter cake
Name: query, dtype: object

**Query Types Observed:**

1. **Definition queries**: "what is a gazpacho", "what is red velvet cake"
2. **Translation queries**: "what is pizza in italian"  
3. **Recipe requests**: "KFC Fried Chicken Secret Recipe", "pumpkin pie spice recipe"
4. **Ingredient inquiries**: "what are baby back ribs made from"
5. **Cooking instructions**: "how to bake homemade pumpkin"

This diversity shows that food search involves multiple types of information needs beyond just finding dishes.
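
The five query types listed above can be approximated with a few regular expressions as a first-pass labeling heuristic. This is a sketch under stated assumptions: the intent names and patterns are hypothetical, and a trained classifier would likely replace them.

```python
import re

# Ordered (intent, pattern) pairs; the first match wins.
# Patterns are illustrative guesses based on the sample queries.
INTENT_PATTERNS = [
    ("translation", re.compile(r"^what is .+ in [a-z]+$")),
    ("ingredient", re.compile(r"\bmade (from|of)\b|^what .+ is used for\b")),
    ("instruction", re.compile(r"^how (to|do|long)\b|\bcooking time\b")),
    ("recipe", re.compile(r"\brecipe\b")),
    ("definition", re.compile(r"^what (is|are)\b")),
]


def classify_intent(query):
    """Return the first matching intent name, or "other" if none match."""
    text = query.strip().lower()
    for intent, pattern in INTENT_PATTERNS:
        if pattern.search(text):
            return intent
    return "other"


for q in [
    "what is chimichurri",
    "what is pizza in italian",
    "KFC Fried Chicken Secret Recipe",
    "what are baby back ribs made from",
    "how to bake homemade pumpkin",
]:
    print(f"{q!r} -> {classify_intent(q)}")
```

Applying it with `df_queries["query"].map(classify_intent).value_counts()` would give a rough intent distribution. The translation pattern is deliberately loose and will mislabel some queries that merely end in "in <word>".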

### Query Pattern Analysis

These sample queries show different types of food-related search intents.

## 5. Cooking Queries Analysis

Returning to our cooking queries dataset to understand user intent patterns:

In [27]:
df_queries["query"].shape

(29557,)

**Query dataset size**: 29,557 cooking-related queries providing diverse search intent patterns.

## Key Insights & Query2Dish Model Implications

### Dataset Complementarity

Our exploration covers four complementary datasets, each addressing a different part of a food search system:

| Dataset | Scale | Strength | Use Case |
|---------|-------|----------|----------|
| **MM-Food-100K** | 100K images | Rich visual + metadata | Image-based food search |
| **ESCI Food** | 9 examples (preview) | ESCI framework adaptation | Relevance labeling patterns |
| **3A2M Recipes** | 2.2M recipes | Detailed cooking instructions | Recipe recommendations |
| **Cooking Queries** | 29K queries | Natural language patterns | Query understanding |

### Food-Specific ESCI Adaptations

The ESCI food dataset demonstrates key enhancements for culinary search:

1. **Multi-modal queries**: Support for both text and image queries
2. **Food-specific similarity metrics**:
   - Ingredient overlap (Jaccard similarity)
   - Nutritional distance
   - Visual appearance matching
3. **Rich candidate metadata**: Nutritional profiles, cooking methods, ingredients
4. **Domain expertise**: Human food annotators with culinary knowledge
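
The preview does not define how nutritional distance is calculated. One possible definition, offered only as an assumption, is the mean relative difference across the four fields of `nutritional_profile`, which yields a value between 0 (identical) and 1:

```python
NUTRIENT_KEYS = ("calories_kcal", "protein_g", "fat_g", "carbohydrate_g")


def nutrition_distance(a, b):
    """Mean relative difference over the nutrient fields (assumed definition)."""
    diffs = []
    for key in NUTRIENT_KEYS:
        x, y = float(a[key]), float(b[key])
        denom = max(abs(x), abs(y))
        # If both values are zero the nutrient contributes no distance.
        diffs.append(0.0 if denom == 0 else abs(x - y) / denom)
    return sum(diffs) / len(diffs)


# Profiles copied from the Pho and Pan-fried Dumplings sample rows.
pho = {"fat_g": 15.0, "protein_g": 25.0, "calories_kcal": 450, "carbohydrate_g": 60.0}
dumplings = {"fat_g": 15.0, "protein_g": 20.0, "calories_kcal": 400, "carbohydrate_g": 50.0}

distance = nutrition_distance(pho, dumplings)
print(round(distance, 3))
```

Relative differences keep calories (hundreds) from swamping macros (tens of grams); a different normalization may fit better once the real metric is known.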

### Query2Dish Model Architecture Recommendations

**Multi-Dataset Integration Strategy**:
- **MM-Food-100K**: Primary candidate database with visual + metadata
- **3A2M Recipes**: Extended candidate pool for recipe-specific queries  
- **Cooking Queries**: Training data for query intent classification
- **ESCI Food**: Relevance labeling methodology and feature engineering

**Search Pipeline Design**:
1. **Query Classification**: Identify intent type (recipe, definition, ingredient, etc.)
2. **Candidate Retrieval**: Search appropriate dataset(s) based on query type
3. **Multi-Modal Matching**: Combine text, visual, and nutritional similarity
4. **ESCI Ranking**: Apply food-adapted ESCI framework for relevance scoring
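
Steps 3 and 4 could be prototyped as a weighted combination of the three similarity signals. The sketch below is illustrative only: the `Candidate` class, the weights, and the second candidate's values are hypothetical, while the first candidate reuses the `image_sim`, `ingredient_jaccard`, and `nutrition_distance` values quoted from the ESCI preview.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    image_sim: float           # higher means more similar
    ingredient_jaccard: float  # higher means more similar
    nutrition_distance: float  # lower means more similar


# Hypothetical weights; in practice these would be tuned or learned.
WEIGHTS = {"image": 0.4, "ingredient": 0.4, "nutrition": 0.2}


def relevance_score(c):
    """Weighted sum of similarities; the distance is flipped to a similarity."""
    return (
        WEIGHTS["image"] * c.image_sim
        + WEIGHTS["ingredient"] * c.ingredient_jaccard
        + WEIGHTS["nutrition"] * (1.0 - c.nutrition_distance)
    )


def rank(candidates):
    """Sort candidates from most to least relevant."""
    return sorted(candidates, key=relevance_score, reverse=True)


candidates = [
    Candidate("candidate_b", image_sim=0.30, ingredient_jaccard=0.60, nutrition_distance=0.40),
    Candidate("candidate_a", image_sim=0.72, ingredient_jaccard=0.25, nutrition_distance=0.10),
]
ranked = rank(candidates)
print([(c.name, round(relevance_score(c), 3)) for c in ranked])
```

A continuous score like this only orders candidates. Mapping scores to the four ESCI classes would need labeled examples, which the 9-row preview does not yet provide.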

### Next Steps for Implementation

1. **Data Integration**: Merge datasets with consistent schema
2. **Feature Engineering**: Implement food-specific similarity metrics
3. **Model Training**: Use cooking queries for intent classification
4. **Evaluation Framework**: Adapt ESCI metrics for food domain validation
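
For step 1, a shared record shape could look like the sketch below. The target field names (`source`, `name`, `ingredients`, `image_url`) and both converter functions are hypothetical; the sketch also assumes the 3A2M `NER` column is stored as a JSON array string, as the sample output suggests.

```python
import json


def from_mm_food(row):
    """Convert one MM-Food-100K row to the shared record shape."""
    return {
        "source": "mm-food-100k",
        "name": row["dish_name"],
        "ingredients": [i.strip().lower() for i in json.loads(row["ingredients"])],
        "image_url": row.get("image_url"),
    }


def from_3a2m(row):
    """Convert one 3A2M row to the shared record shape (recipes have no image)."""
    return {
        "source": "3a2m",
        "name": row["title"],
        "ingredients": [i.strip().lower() for i in json.loads(row["NER"])],
        "image_url": None,
    }


# Rows abbreviated from the sample outputs shown earlier in the notebook.
mm_row = {
    "dish_name": "Bananas",
    "ingredients": '["Bananas"]',
    "image_url": "https://file.b18a.io/7839056601700101188_985151_.jpg",
}
recipe_row = {
    "title": "Oven Stew",
    "NER": '["lean meat", "onion", "garlic", "tomato puree", "bay leaf"]',
}

records = [from_mm_food(mm_row), from_3a2m(recipe_row)]
for record in records:
    print(record["source"], record["name"], record["ingredients"][:3])
```

Both converters accept a dict or a pandas row (`Series`), so they could be used with `df.apply(..., axis=1)`. Lowercasing ingredients at this stage keeps later set-based comparisons consistent across sources.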

Together, these datasets provide a foundation for a food search system that handles diverse query types and uses both visual and textual food information. The ESCI food data is currently only a 9-row preview, so it can guide the labeling methodology but is too small to train on.