# Food Scan Benchmark


This notebook provides a complete, end-to-end workflow for benchmarking Multimodal Large Language Models (MLLMs) on January's food image dataset (JFID).

**The process is as follows:**

1.  **Setup:** Install dependencies and configure API keys.
2.  **Define Components:** Set up Pydantic schemas, the dataset loader, the model wrapper, and evaluation metrics.
3.  **Run Evaluation:** Loop through the dataset, send images to a chosen MLLM, and collect predictions.
4.  **Analyze Results:** Calculate metrics and summarize the model's performance.

The dataset is downloaded from a public S3 bucket and cached locally.


## Setup


Add your API keys to the environment:

```
export OPENAI_API_KEY="sk-..."
export GEMINI_API_KEY="..."
```
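
Before running the benchmark it can help to confirm that both keys are visible to the Python process. The following is a minimal sketch that assumes only the two variable names shown above; `missing_keys` is a hypothetical helper and not part of the notebook.

```python
import os
from typing import List, Mapping, Sequence

# Variable names taken from the setup instructions above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "GEMINI_API_KEY")


def missing_keys(
    required: Sequence[str] = REQUIRED_KEYS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


absent = missing_keys()
if absent:
    print("Missing API keys:", ", ".join(absent))
else:
    print("All API keys are set.")
```

If a key is reported missing, set it in the shell that launches the notebook server, or in a `.env` file if you load one with `python-dotenv`.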


In [2]:
# Install packages
%pip install --upgrade litellm boto3 pandas tqdm python-dotenv pydantic tabulate openai scikit-learn scipy plotly openpyxl



## Imports


In [27]:
# Standard library
import ast
import asyncio
import base64
import hashlib
import json
import tarfile
import time
import warnings
from difflib import SequenceMatcher
from pathlib import Path
from typing import List, Optional, Tuple, Union

# Third-party
import boto3
import litellm
import numpy as np
import openai
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from botocore import UNSIGNED
from botocore.client import Config
from litellm.exceptions import APIError
from plotly.subplots import make_subplots
from pydantic import BaseModel, Field
from scipy import stats
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.pairwise import cosine_similarity
from tqdm.auto import tqdm

warnings.filterwarnings("ignore")

## Core Components


### Schema Definition


In [4]:
class Ingredient(BaseModel):
    name: str = Field(description="Name of the ingredient, e.g., 'scrambled eggs'")
    quantity: float = Field(description="Numerical quantity of the ingredient")
    unit: str = Field(description="Unit of measurement, e.g., 'cup', 'slice', 'g'")
    calories: float = Field(description="Estimated calories for this ingredient")
    carbs: float = Field(
        description="Estimated grams of carbohydrates for this ingredient"
    )
    protein: float = Field(description="Estimated grams of protein for this ingredient")
    fat: float = Field(description="Estimated grams of fat for this ingredient")


class TotalMacros(BaseModel):
    calories: float = Field(description="Total estimated calories for the entire meal")
    carbs: float = Field(
        description="Total estimated grams of carbohydrates for the entire meal"
    )
    protein: float = Field(
        description="Total estimated grams of protein for the entire meal"
    )
    fat: float = Field(description="Total estimated grams of fat for the entire meal")


class FoodAnalysis(BaseModel):
    meal_name: str = Field(
        description="A descriptive name for the meal, e.g., 'Breakfast Platter'"
    )
    ingredients: List[Ingredient] = Field(
        description="A list of all identified ingredients and their nutritional information"
    )
    total_macros: TotalMacros = Field(
        description="The sum of macros for all ingredients"
    )
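
The schema describes `total_macros` as the sum of the per-ingredient macros, but nothing in the model enforces that. Below is a minimal sketch of a consistency check on a plain dict shaped like `FoodAnalysis(...).model_dump()`. The helper name `macros_consistent`, the tolerances, and the nutrient values in the example are all illustrative assumptions.

```python
MACRO_KEYS = ("calories", "carbs", "protein", "fat")


def macros_consistent(analysis: dict, rel_tol: float = 0.05, abs_tol: float = 1.0) -> bool:
    """Return True if each total is close to the sum over ingredients."""
    for key in MACRO_KEYS:
        summed = sum(ing[key] for ing in analysis["ingredients"])
        total = analysis["total_macros"][key]
        # Accept a relative deviation, with an absolute floor for near-zero macros.
        allowed = max(rel_tol * max(abs(total), abs(summed)), abs_tol)
        if abs(summed - total) > allowed:
            return False
    return True


# Made-up values, used only to show the expected shape of the data.
example = {
    "meal_name": "Eggs and toast",
    "ingredients": [
        {"name": "scrambled eggs", "quantity": 2, "unit": "large",
         "calories": 180, "carbs": 2, "protein": 12, "fat": 14},
        {"name": "toast", "quantity": 1, "unit": "slice",
         "calories": 80, "carbs": 15, "protein": 3, "fat": 1},
    ],
    "total_macros": {"calories": 260, "carbs": 17, "protein": 15, "fat": 15},
}

print(macros_consistent(example))
```

A check like this could flag responses whose totals were estimated independently of the ingredient list, which may be worth inspecting separately when interpreting macro errors.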


### Model Costs Configuration


In [5]:
MODEL_COSTS = {
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gemini/gemini-2.5-flash-preview-05-20": {"input": 0.15, "output": 0.60},
    "gemini/gemini-2.5-pro-preview-06-05": {"input": 1.25, "output": 10.00},
}


def calculate_cost(model_name: str, input_tokens: int, output_tokens: int) -> float:
    """Calculate the cost for a model based on token usage."""
    if model_name not in MODEL_COSTS:
        return 0.0

    costs = MODEL_COSTS[model_name]
    input_cost = (input_tokens / 1_000_000) * costs["input"]
    output_cost = (output_tokens / 1_000_000) * costs["output"]
    return round(input_cost + output_cost, 6)
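
As a worked example of the arithmetic in `calculate_cost`, the sketch below reuses the `gpt-4o` prices from the table (USD per one million tokens). With 1,000 input tokens and 500 output tokens the cost is 0.0025 + 0.005 = 0.0075 USD. The token counts are made up, and `cost_usd` is a standalone stand-in for the function above.

```python
# Prices in USD per one million tokens, copied from the gpt-4o row above.
GPT_4O_PRICE = {"input": 2.50, "output": 10.00}


def cost_usd(input_tokens: int, output_tokens: int, price: dict = GPT_4O_PRICE) -> float:
    """Token counts are scaled to millions before applying the per-million price."""
    input_cost = (input_tokens / 1_000_000) * price["input"]
    output_cost = (output_tokens / 1_000_000) * price["output"]
    return round(input_cost + output_cost, 6)


print(cost_usd(1_000, 500))
```

Note that `calculate_cost` returns `0.0` for any model name missing from `MODEL_COSTS`, so a zero in the cost column can mean "unknown price" rather than "free".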

In [6]:
# helper function
def img2b64(path: Path) -> str:
    """Converts an image file to a base64 encoded string for API calls."""
    encoded = base64.b64encode(path.read_bytes()).decode()
    return f"data:image/jpeg;base64,{encoded}"
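
`img2b64` labels every file as `image/jpeg`. If the dataset ever contains other formats, the MIME type could be guessed from the file extension instead. This is a minimal sketch using the standard-library `mimetypes` module; `img2b64_typed` is a hypothetical variant, not a function from the notebook.

```python
import base64
import mimetypes
from pathlib import Path


def img2b64_typed(path: Path, default_mime: str = "image/jpeg") -> str:
    """Encode an image as a data URL, guessing the MIME type from the extension."""
    mime, _ = mimetypes.guess_type(path.name)
    # Fall back to the default when the extension is unknown or not an image.
    if mime is None or not mime.startswith("image/"):
        mime = default_mime
    encoded = base64.b64encode(path.read_bytes()).decode()
    return f"data:{mime};base64,{encoded}"
```

The guess is based on the file name only, so a mislabeled file would still be sent with the wrong type.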

### Dataset Class


In [7]:
class FoodScanDataset:
    """Handles downloading, caching, and loading the food dataset."""

    _S3_BUCKET = "january-food-image-dataset-public"
    _S3_KEY = "food-scan-benchmark-dataset.tar.gz"

    def __init__(self, root: Path):
        self.root = root.expanduser()
        self.img_dir = self.root / "food-scan-benchmark-dataset" / "fsb_images"
        self.csv_path = (
            self.root / "food-scan-benchmark-dataset" / "food_scan_bench_v1.csv"
        )

        if not self.csv_path.exists():
            self._download_and_extract()

        self.df = pd.read_csv(self.csv_path)

    def _download_and_extract(self):
        print(f"Dataset not found in {self.root}. Downloading from S3...")
        self.root.mkdir(parents=True, exist_ok=True)
        local_archive = self.root / "fsb.tar.gz"

        s3 = boto3.client(
            "s3",
            config=Config(signature_version=UNSIGNED),
        )
        with open(local_archive, "wb") as f:
            s3.download_fileobj(self._S3_BUCKET, self._S3_KEY, f)

        print("Download complete. Extracting...")
        with tarfile.open(local_archive) as tar:
            tar.extractall(path=self.root)
        local_archive.unlink()
        print("Extraction complete.")

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx: int):
        row = self.df.iloc[idx]
        img_path = self.img_dir / row.image_filename

        try:
            ingredients = ast.literal_eval(row.ingredients_list)
        except (ValueError, SyntaxError):
            ingredients = []

        return {
            "image_id": row.image_id,
            "image_path": img_path,
            "gt": {
                "meal_name": row.meal_name,
                "ingredients": ingredients,
                "macros": {
                    "calories": row.total_calories,
                    "carbs": row.total_carbs,
                    "protein": row.total_protein,
                    "fat": row.total_fat,
                },
            },
        }
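
`__getitem__` relies on `ast.literal_eval` to turn the `ingredients_list` cell, stored as text, back into a Python list. The sketch below shows that parsing step in isolation. The example string and its field names are assumptions about the CSV format, and `parse_ingredients` is a hypothetical helper.

```python
import ast


def parse_ingredients(cell) -> list:
    """Parse a Python-literal list stored as text; return [] if it cannot be parsed."""
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Covers malformed text and non-string cells such as NaN.
        return []
    return parsed if isinstance(parsed, list) else []


# Made-up cell content for illustration.
cell = "[{'name': 'white rice', 'quantity': 1.0, 'unit': 'cup'}]"
print(parse_ingredients(cell))
print(parse_ingredients("not a list ["))
```

`ast.literal_eval` only evaluates literals (strings, numbers, lists, dicts, and so on), which makes it a safer choice than `eval` for data read from a file.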

### Model Wrapper


In [28]:
class LiteModel:
    """A robust wrapper around any LiteLLM-supported vision model with prompt engineering options."""

    PROMPT_VARIANTS = {
        "detailed": "You are an expert nutritionist with 20 years of experience. Analyze this food image very carefully and provide the most accurate breakdown possible. Consider portion sizes, cooking methods, and hidden ingredients.",
        "step_by_step": "You are an expert nutritionist. Please analyze this image step by step: 1) First identify all visible food items, 2) Estimate portion sizes, 3) Calculate nutritional content for each item, 4) Sum the totals. Be precise and systematic.",
        "conservative": "You are a conservative nutritionist who prefers to underestimate rather than overestimate. Analyze this food image and provide a realistic, slightly conservative nutritional breakdown.",
        "confident": "You are a highly confident nutritionist. Analyze this food image and provide your best estimate of the nutritional content. Trust your expertise.",
    }

    def __init__(self, model_name: str, prompt_variant="detailed", **litellm_kwargs):
        self.model_name = model_name
        self.kwargs = litellm_kwargs
        self.prompt_variant = prompt_variant
        self.system_prompt = self.PROMPT_VARIANTS.get(
            prompt_variant, self.PROMPT_VARIANTS["detailed"]
        )

    async def analyse(self, img_path: Path) -> Tuple[Optional[dict], Optional[str]]:
        """
        Analyzes an image and returns a structured dict with cost info, or None and an error message on failure.

        Returns:
            Tuple[Optional[dict], Optional[str]]: A tuple of (result, error_message).
                                                  On success, result is a dict and error_message is None.
                                                  On failure, result is None and error_message is a string.
        """
        b64_img = img2b64(img_path)

        messages = [
            {"role": "system", "content": self.system_prompt},
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Analyze this food image and provide a detailed nutritional breakdown. Include: meal name, all ingredients with quantities/units, and complete macro information (calories, carbs, protein, fat) for each ingredient and the total meal.",
                    },
                    {"type": "image_url", "image_url": {"url": b64_img}},
                ],
            },
        ]

        try:
            resp = await litellm.acompletion(
                model=self.model_name,
                messages=messages,
                response_format=FoodAnalysis,
                temperature=0.0,
                **self.kwargs,
            )
            raw = resp.choices[0].message.content.strip()
            data = json.loads(raw)

            usage = resp.usage
            input_tokens = usage.prompt_tokens if usage else 0
            output_tokens = usage.completion_tokens if usage else 0
            cost = calculate_cost(self.model_name, input_tokens, output_tokens)

            result = FoodAnalysis(**data).model_dump()
            result["cost_usd"] = cost
            result["prompt_variant"] = self.prompt_variant

            return result, None

        except APIError as e:
            error_msg = f"API Error: {e}"
            print(f"{error_msg} for {img_path.name}")
            return None, error_msg
        except Exception as e:
            error_msg = f"Unexpected Error: {e}"
            print(f"{error_msg} for {img_path.name}")
            return None, error_msg
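
`analyse` reports failures through a `(result, error_message)` tuple instead of raising, so one bad response does not abort a concurrent run. The sketch below shows the same pattern on the parsing step alone, using only the standard library. `parse_model_reply` and its field check are illustrative stand-ins for the Pydantic validation done above.

```python
import json
from typing import Optional, Tuple

# Top-level fields of the FoodAnalysis schema.
REQUIRED_FIELDS = ("meal_name", "ingredients", "total_macros")


def parse_model_reply(raw: str) -> Tuple[Optional[dict], Optional[str]]:
    """Return (data, None) on success or (None, error_message) on failure."""
    try:
        data = json.loads(raw.strip())
    except json.JSONDecodeError as exc:
        return None, f"Invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "Reply is not a JSON object"
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        return None, f"Missing fields: {', '.join(missing)}"
    return data, None


result, error = parse_model_reply('{"meal_name": "Salad"}')
print(result, error)
```

Callers check `result is None` to decide whether to score the sample or record the error, which is how `_process_sample` consumes the tuple.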

### Metrics
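
The embedding-based metric in this section pairs ground-truth and predicted ingredients one-to-one so that total cosine similarity is maximal, then counts the pairs that reach a threshold. The sketch below illustrates that idea with the standard library only. An exhaustive search over permutations stands in for `linear_sum_assignment`, which is only practical for a handful of items, and toy 2-D vectors stand in for real embeddings.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_assignment(sim: List[List[float]]) -> List[Tuple[int, int]]:
    """Exhaustively find the one-to-one (row, col) pairing with the highest total similarity."""
    n_rows, n_cols = len(sim), len(sim[0])
    if n_rows <= n_cols:
        # Every row gets a distinct column.
        candidates = (
            list(zip(range(n_rows), perm))
            for perm in permutations(range(n_cols), n_rows)
        )
    else:
        # Every column gets a distinct row.
        candidates = (
            list(zip(perm, range(n_cols)))
            for perm in permutations(range(n_rows), n_cols)
        )
    return max(candidates, key=lambda pairs: sum(sim[r][c] for r, c in pairs))


def matching_recall(gt_vecs: List[Vector], pred_vecs: List[Vector], threshold: float = 0.75) -> float:
    """Fraction of ground-truth items whose assigned partner reaches the threshold."""
    sim = [[cosine(g, p) for p in pred_vecs] for g in gt_vecs]
    pairs = best_assignment(sim)
    matched = sum(1 for r, c in pairs if sim[r][c] >= threshold)
    return matched / len(gt_vecs)


# Three ground-truth items, two predictions (non-empty inputs are assumed).
gt_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pred_vecs = [[0.9, 0.1], [0.1, 0.9]]
print(f"recall = {matching_recall(gt_vecs, pred_vecs):.3f}")
```

Because the score divides by the number of ground-truth items, it behaves as a recall: extra predicted ingredients that match nothing do not lower it.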


In [19]:
class Metrics:
    """Comprehensive metrics for food analysis evaluation."""

    _embedding_cache = {}

    @staticmethod
    def _normalize_ingredient_list(ingredients: Union[str, list, None]) -> List[dict]:
        """
        Safely parse ingredient data that may arrive as a list or as the
        string representation of a list. Centralizing this avoids repeating
        the same try/except block in every metric function.
        """
        if not ingredients:
            return []
        if isinstance(ingredients, list):
            return ingredients
        if isinstance(ingredients, str):
            try:
                parsed = ast.literal_eval(ingredients)
                return parsed if isinstance(parsed, list) else []
            except (ValueError, SyntaxError):
                return []
        return []

    @staticmethod
    async def get_embedding(text, model="text-embedding-3-small"):
        """Get OpenAI embedding with caching."""
        cache_key = hashlib.md5(f"{text}_{model}".encode()).hexdigest()
        if cache_key in Metrics._embedding_cache:
            return Metrics._embedding_cache[cache_key]
        try:
            client = openai.AsyncOpenAI()
            response = await client.embeddings.create(model=model, input=text.strip())
            embedding = response.data[0].embedding
            Metrics._embedding_cache[cache_key] = embedding
            return embedding
        except Exception as e:
            print(f"Error getting embedding for '{text}': {e}")
            return [0.0] * 1536

    @staticmethod
    async def semantic_ingredient_match_embeddings(
        gt_ingredients, pred_ingredients, threshold=0.75
    ):
        """Semantic ingredient matching using OpenAI embeddings and cosine similarity."""
        # Normalize both inputs to plain lists before comparing names.
        gt_ingredients = Metrics._normalize_ingredient_list(gt_ingredients)
        pred_ingredients = Metrics._normalize_ingredient_list(pred_ingredients)

        def normalize_name(item):
            return str(item.get("name", "")).lower().strip()

        gt_names = [normalize_name(x) for x in gt_ingredients if normalize_name(x)]
        pred_names = [normalize_name(x) for x in pred_ingredients if normalize_name(x)]

        if not gt_names and not pred_names:
            return 1.0, []
        if not gt_names or not pred_names:
            return 0.0, []

        gt_embeddings = await asyncio.gather(
            *(Metrics.get_embedding(name) for name in gt_names)
        )
        pred_embeddings = await asyncio.gather(
            *(Metrics.get_embedding(name) for name in pred_names)
        )

        gt_embeddings = [emb for emb in gt_embeddings if emb and len(emb) > 1]
        pred_embeddings = [emb for emb in pred_embeddings if emb and len(emb) > 1]

        if not gt_embeddings or not pred_embeddings:
            return 0.0, []

        similarity_matrix = cosine_similarity(gt_embeddings, pred_embeddings)
        cost_matrix = 1 - similarity_matrix
        row_indices, col_indices = linear_sum_assignment(cost_matrix)

        matches = 0
        match_details = []
        for row_idx, col_idx in zip(row_indices, col_indices):
            similarity = similarity_matrix[row_idx, col_idx]
            if similarity >= threshold:
                matches += 1
                match_details.append(
                    {
                        "gt_ingredient": gt_names[row_idx],
                        "pred_ingredient": pred_names[col_idx],
                        "similarity": similarity,
                    }
                )

        recall = matches / len(gt_names)
        return recall, match_details

    @staticmethod
    def semantic_ingredient_match(gt_ingredients, pred_ingredients, threshold=0.7):
        """Fallback method using string similarity."""
        # Normalize both inputs to plain lists before comparing names.
        gt_ingredients = Metrics._normalize_ingredient_list(gt_ingredients)
        pred_ingredients = Metrics._normalize_ingredient_list(pred_ingredients)

        def normalize_name(item):
            name_str = item.get("name", "") if isinstance(item, dict) else item
            return str(name_str).lower().strip().replace("-", " ").replace("_", " ")

        gt_names = [normalize_name(x) for x in gt_ingredients if normalize_name(x)]
        pred_names = [normalize_name(x) for x in pred_ingredients if normalize_name(x)]

        if not gt_names:
            return 1.0 if not pred_names else 0.0

        matches = 0
        for gt_name in gt_names:
            if pred_names:
                best_match = max(
                    SequenceMatcher(None, gt_name, pred_name).ratio()
                    for pred_name in pred_names
                )
                if best_match >= threshold:
                    matches += 1

        return matches / len(gt_names)

    @staticmethod
    def ingredients_f1(gt, pred):
        """F-score on ingredient names (case-insensitive)."""
        # Normalize both inputs to plain lists before comparing names.
        gt_list = Metrics._normalize_ingredient_list(gt)
        pred_list = Metrics._normalize_ingredient_list(pred)

        def _name(x):
            return (
                str(x.get("name", "")).lower().strip()
                if isinstance(x, dict)
                else str(x).lower().strip()
            )

        g_names = {_name(x) for x in gt_list if _name(x)}
        p_names = {_name(x) for x in pred_list if _name(x)}

        if not g_names and not p_names:
            return 1.0
        if not g_names or not p_names:
            return 0.0

        tp = len(g_names & p_names)
        if tp == 0:
            return 0.0

        precision = tp / len(p_names)
        recall = tp / len(g_names)
        return 2 * precision * recall / (precision + recall)

    @staticmethod
    def macro_mae(gt_mac: dict, pred_mac: dict):
        """Calculates Mean Absolute Error over the four main macros."""
        keys = ["calories", "carbs", "protein", "fat"]
        errors = [abs(gt_mac.get(k, 0) - pred_mac.get(k, 0)) for k in keys]
        return float(np.mean(errors))

    @staticmethod
    def macro_percentage_error(gt_mac, pred_mac):
        """Calculate percentage error for each macro."""
        keys = ["calories", "carbs", "protein", "fat"]
        errors = {}
        for key in keys:
            gt_val = gt_mac.get(key, 0)
            pred_val = pred_mac.get(key, 0)
            if gt_val > 0:
                errors[key] = abs(gt_val - pred_val) / gt_val * 100
            else:
                errors[key] = 0 if pred_val == 0 else 100
        return errors

    @staticmethod
    def ingredient_count_accuracy(gt_ingredients, pred_ingredients):
        """How well does the model predict the number of ingredients?"""
        gt_count = len(Metrics._normalize_ingredient_list(gt_ingredients))
        pred_count = len(Metrics._normalize_ingredient_list(pred_ingredients))
        if gt_count == 0 and pred_count == 0:
            return 1.0
        if gt_count == 0:
            return 0.0
        return 1 - abs(gt_count - pred_count) / gt_count
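
To make the simpler metrics concrete, the sketch below recomputes them on a made-up sample using standalone functions that mirror the formulas above, without the input normalization and zero handling. Exact-name F1 comes out at 4/7 (about 0.571), the macro MAE at 16.25, and the count accuracy for 2 true versus 5 predicted ingredients at -0.5.

```python
MACRO_KEYS = ("calories", "carbs", "protein", "fat")


def f1_on_names(gt: set, pred: set) -> float:
    """Harmonic mean of precision and recall on exact name matches."""
    tp = len(gt & pred)
    if tp == 0:
        return 1.0 if not gt and not pred else 0.0
    precision, recall = tp / len(pred), tp / len(gt)
    return 2 * precision * recall / (precision + recall)


def macro_mae(gt: dict, pred: dict) -> float:
    """Mean absolute error over the four macros (mixes kcal and grams)."""
    return sum(abs(gt[k] - pred[k]) for k in MACRO_KEYS) / len(MACRO_KEYS)


def pct_errors(gt: dict, pred: dict) -> dict:
    """Absolute percentage error per macro; assumes every gt value is > 0."""
    return {k: abs(gt[k] - pred[k]) / gt[k] * 100 for k in MACRO_KEYS}


def count_accuracy(gt_count: int, pred_count: int) -> float:
    """1 minus the relative count error; assumes gt_count > 0."""
    return 1 - abs(gt_count - pred_count) / gt_count


# Made-up sample for illustration.
gt_names = {"white rice", "grilled chicken", "broccoli"}
pred_names = {"white rice", "grilled chicken", "carrots", "soy sauce"}
gt_macros = {"calories": 500, "carbs": 60, "protein": 30, "fat": 20}
pred_macros = {"calories": 450, "carbs": 70, "protein": 25, "fat": 20}

print(f"F1: {f1_on_names(gt_names, pred_names):.3f}")
print(f"MAE: {macro_mae(gt_macros, pred_macros):.2f}")
print({k: round(v, 1) for k, v in pct_errors(gt_macros, pred_macros).items()})
print(count_accuracy(3, 4), count_accuracy(2, 5))
```

Two properties are worth keeping in mind when reading results. The MAE averages calories (kcal) with grams, so the calorie error (50 of the 65 total here) tends to dominate. The count accuracy is not bounded below by zero: it goes negative whenever the prediction has more than twice as many ingredients as the ground truth.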

## Evaluation Function


In [32]:
async def _process_sample(
    idx: int,
    ds: FoodScanDataset,
    llm: LiteModel,
    model_name: str,
    use_embeddings: bool = True,
) -> dict:
    start_time = time.time()
    sample = ds[idx]
    pred, error_msg = await llm.analyse(sample["image_path"])
    end_time = time.time()
    gt = sample["gt"]

    item = {
        "image_id": sample["image_id"],
        "model": model_name,
        "response_time_seconds": round(end_time - start_time, 2),
    }

    if pred is None:
        item.update(
            {
                "f1_ing": 0.0,
                "semantic_match": 0.0,
                "semantic_match_embeddings": 0.0,
                "ingredient_count_acc": 0.0,
                "mae_mac": None,
                "error": error_msg or "failed",
                "cost_usd": 0.0,
                "calories_pct_error": None,
                "carbs_pct_error": None,
                "protein_pct_error": None,
                "fat_pct_error": None,
                "match_details": None,
            }
        )
    else:
        gt_ingredients = gt["ingredients"]
        pred_ingredients = pred.get("ingredients", [])
        gt_macros = gt["macros"]
        pred_macros = pred.get("total_macros", {})

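        semantic_match_embeddings, match_details = 0.0, None
        if use_embeddings:
            try:
                (
                    semantic_match_embeddings,
                    match_details,
                ) = await Metrics.semantic_ingredient_match_embeddings(
                    gt_ingredients, pred_ingredients
                )
            except Exception as e:
                print(
                    f"Error in embedding similarity for image {sample['image_id']}: {e}"
                )
                # Fall back to string similarity if the embedding call fails.
                semantic_match_embeddings = Metrics.semantic_ingredient_match(
                    gt_ingredients, pred_ingredients
                )
        else:
            # Embeddings disabled: report the string-based score instead of 0.0.
            semantic_match_embeddings = Metrics.semantic_ingredient_match(
                gt_ingredients, pred_ingredients
            )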

        # Compute per-macro percentage errors once, then store each under an explicit key.
        pct_errors = Metrics.macro_percentage_error(gt_macros, pred_macros)

        item.update(
            {
                "f1_ing": Metrics.ingredients_f1(gt_ingredients, pred_ingredients),
                "semantic_match": Metrics.semantic_ingredient_match(
                    gt_ingredients, pred_ingredients
                ),
                "semantic_match_embeddings": semantic_match_embeddings,
                "ingredient_count_acc": Metrics.ingredient_count_accuracy(
                    gt_ingredients, pred_ingredients
                ),
                "mae_mac": Metrics.macro_mae(gt_macros, pred_macros),
                "error": None,
                "cost_usd": pred.get("cost_usd", 0.0),
                # Explicit keys keep the DataFrame column names stable.
                "calories_pct_error": pct_errors.get("calories"),
                "carbs_pct_error": pct_errors.get("carbs"),
                "protein_pct_error": pct_errors.get("protein"),
                "fat_pct_error": pct_errors.get("fat"),
                "match_details": match_details,
            }
        )

    return item


async def run_evaluation(
    models: Union[str, List[str]],
    cache_dir: Path,
    max_items: Optional[int] = None,
    max_concurrent: int = 5,
    prompt_variants: Optional[List[str]] = None,
    use_embeddings: bool = True,
) -> pd.DataFrame:
    """Run evaluation with multiple models and prompt variants."""
    models = [models] if isinstance(models, str) else models
    prompt_variants = prompt_variants or ["detailed"]

    ds = FoodScanDataset(cache_dir)
    n = min(max_items, len(ds)) if max_items is not None else len(ds)
    all_results = []

    for prompt_variant in prompt_variants:
        print(f"\n=== Running with prompt variant: '{prompt_variant}' ===")
        print(
            f"Using {'OpenAI embeddings' if use_embeddings else 'string-based'} semantic similarity"
        )

        async def _run_model_with_prompt(model_name: str, position: int):
            llm = LiteModel(model_name, prompt_variant=prompt_variant)
            sem = asyncio.Semaphore(max_concurrent)
            pbar = tqdm(
                total=n,
                desc=f"{model_name} ({prompt_variant})".ljust(40),
                position=position,
                leave=True,
                dynamic_ncols=True,
                colour="green",
                bar_format="{desc}│{bar}│ {n_fmt}/{total_fmt}",
            )

            async def _worker(i: int):
                async with sem:
                    result = await _process_sample(
                        i, ds, llm, f"{model_name}_{prompt_variant}", use_embeddings
                    )
                    pbar.update(1)
                    return result

            results = await asyncio.gather(*(_worker(i) for i in range(n)))
            pbar.close()
            return results

        model_tasks = [
            _run_model_with_prompt(model, i) for i, model in enumerate(models)
        ]
        nested_results = await asyncio.gather(*model_tasks)
        all_results.extend(item for sublist in nested_results for item in sublist)

    return pd.DataFrame(all_results)
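
`run_evaluation` limits in-flight requests with an `asyncio.Semaphore`. Because a separate semaphore is created for each model and the models run concurrently, up to `len(models) * max_concurrent` requests can be in flight at once. The sketch below isolates the pattern with dummy jobs and records the peak concurrency. `bounded_gather` is a hypothetical helper, and `asyncio.sleep` stands in for the API call.

```python
import asyncio


async def bounded_gather(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks dummy jobs and return the peak number running at once."""
    sem = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def worker(i: int) -> int:
        nonlocal running, peak
        async with sem:  # At most max_concurrent workers pass this point together.
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # Stand-in for the model request.
            running -= 1
        return i

    # gather preserves input order, regardless of completion order.
    results = await asyncio.gather(*(worker(i) for i in range(n_tasks)))
    assert results == list(range(n_tasks))
    return peak


print(asyncio.run(bounded_gather(20, 5)))
```

Inside a notebook cell, where an event loop is already running, call `await bounded_gather(20, 5)` directly instead of `asyncio.run(...)`.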

## Analysis and Visualization
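
Among the analyses in this section, `statistical_significance_test` calls `stats.ttest_ind(..., equal_var=False)`, which is Welch's unequal-variance t-test. The sketch below computes the Welch statistic and its Welch-Satterthwaite degrees of freedom by hand with the standard library, to show what goes into the comparison. The p-value is omitted because it needs the t-distribution CDF, and the scores are made up.

```python
import math
from statistics import mean, variance
from typing import Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t statistic, degrees of freedom) for Welch's t-test."""
    # Squared standard error of each sample mean, using the sample variance.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up per-image scores for two hypothetical models.
model_a = [0.80, 0.90, 0.70, 0.85]
model_b = [0.60, 0.65, 0.70, 0.55]
t_stat, dof = welch_t(model_a, model_b)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

This test treats the two score lists as independent samples. Since every model here is scored on the same images, a paired test such as `scipy.stats.ttest_rel` may be a better fit, and the number of pairwise comparisons grows quickly with the number of models and prompt variants.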


In [92]:
class BenchmarkAnalyzer:
    """Comprehensive analysis and visualization of benchmark results."""

    def __init__(self, results_df):
        self.df = results_df.copy()
        if not self.df.empty:
            self.successful_df = self.df[self.df["error"].isna()].copy()
        else:
            self.successful_df = pd.DataFrame()

    def summary_statistics(self):
        """Generate comprehensive summary statistics."""
        print("=== BENCHMARK SUMMARY ===\n")
        if self.df.empty:
            print("No results to analyze.")
            return

        for model in sorted(self.df["model"].unique()):
            model_df = self.df[self.df["model"] == model]
            successful = model_df[model_df["error"].isna()]
            success_rate = (len(successful) / len(model_df)) * 100

            print(f"--- {model} ---")
            print(
                f"  Success Rate: {success_rate:.1f}% ({len(successful)}/{len(model_df)})"
            )

            if not successful.empty:
                print(
                    f"  Avg Semantic Match (Embeddings): {successful['semantic_match_embeddings'].mean():.3f}"
                )
                print(
                    f"  Avg F1 Score (Ingredients): {successful['f1_ing'].mean():.3f}"
                )
                print(f"  Avg MAE (Macros): {successful['mae_mac'].mean():.1f}")
                print(f"  Avg Cost per Image: ${successful['cost_usd'].mean():.4f}")
                print(
                    f"  Avg Response Time: {successful['response_time_seconds'].mean():.1f}s\n"
                )

        self.analyze_errors()

    def analyze_errors(self):
        """Summarize the errors encountered, by model and by error type."""
        print("--- ERROR ANALYSIS ---")
        error_df = self.df[self.df["error"].notna()]
        if error_df.empty:
            print("No errors encountered. All API calls were successful.\n")
            return

        print("Error counts by model:")
        error_summary = error_df.groupby("model")["error"].count().sort_index()
        print(error_summary.to_string())

        print("\nMost common error types:")
        common_errors = (
            error_df["error"].str.split(":").str[0].value_counts().nlargest(5)
        )
        print(common_errors.to_string())
        print()

    def analyze_ingredient_matches(self):
        """Analyze detailed ingredient matching results."""
        if (
            self.successful_df.empty
            or "match_details" not in self.successful_df.columns
        ):
            print("No match details available.")
            return

        print("=== INGREDIENT MATCHING ANALYSIS ===\n")
        all_matches = self.successful_df.explode("match_details").dropna(
            subset=["match_details"]
        )
        if all_matches.empty:
            print("No successful ingredient matches found.")
            return

        matches_df = pd.json_normalize(all_matches["match_details"])

        matches_df["model"] = all_matches["model"].values

        print(f"Total successful ingredient matches: {len(matches_df)}")
        print(
            f"Average similarity score: {matches_df['similarity'].mean():.3f} (std: {matches_df['similarity'].std():.3f})"
        )

        print("\n--- TOP 10 INGREDIENT MATCHES ---")
        print(
            matches_df.nlargest(10, "similarity")[
                ["gt_ingredient", "pred_ingredient", "similarity", "model"]
            ].to_string(index=False)
        )

        print("\n--- 10 MOST CHALLENGING MATCHES (Lowest Similarity) ---")
        challenging = matches_df[matches_df["similarity"] < 0.85].nsmallest(
            10, "similarity"
        )
        if not challenging.empty:
            print(
                challenging[
                    ["gt_ingredient", "pred_ingredient", "similarity", "model"]
                ].to_string(index=False)
            )
        else:
            print("No challenging matches found (all similarities >= 0.85).")

    def statistical_significance_test(self, metric="semantic_match_embeddings"):
        """Test statistical significance between models using t-test."""
        if self.successful_df.empty or metric not in self.successful_df.columns:
            print(f"Metric '{metric}' not available for significance testing.")
            return

        models = sorted(self.successful_df["model"].unique())
        if len(models) < 2:
            return

        print(f"\n=== STATISTICAL SIGNIFICANCE (T-TEST on {metric}) ===\n")
        for i, model1 in enumerate(models):
            for model2 in models[i + 1 :]:
                data1 = self.successful_df[self.successful_df["model"] == model1][
                    metric
                ]
                data2 = self.successful_df[self.successful_df["model"] == model2][
                    metric
                ]

                if len(data1) > 1 and len(data2) > 1:
                    stat, p_value = stats.ttest_ind(data1, data2, equal_var=False)
                    significance = (
                        "***"
                        if p_value < 0.001
                        else "**"
                        if p_value < 0.01
                        else "*"
                        if p_value < 0.05
                        else "ns"
                    )
                    print(
                        f"{model1.ljust(45)} vs. {model2.ljust(45)} | p-value: {p_value:.4f} {significance}"
                    )
        print()

    def create_performance_dashboard(self):
        """Create a comprehensive performance comparison dashboard with improved styling."""
        if self.successful_df.empty:
            print("No successful predictions to plot.")
            return

        models = sorted(self.df["model"].unique())
        short_labels = {m: m.split("/", 1)[1] if "/" in m else m for m in models}
        idx_map = {m: i + 1 for i, m in enumerate(short_labels)}

        colors = [
            "#1f77b4",
            "#ff7f0e",
            "#2ca02c",
            "#d62728",
            "#9467bd",
            "#8c564b",
            "#e377c2",
            "#7f7f7f",
            "#bcbd22",
            "#17becf",
        ]

        metric_col = "semantic_match_embeddings"

        fig = make_subplots(
            rows=2,
            cols=3,
            subplot_titles=(
                "Embedding Similarity Distribution",
                "F1 Score (Ingredients)",
                "Macro Nutritional MAE",
                "Cost vs. Performance",
                "Response Time Distribution",
                "Success Rate by Model",
            ),
            vertical_spacing=0.12,
            horizontal_spacing=0.08,
            specs=[
                [
                    {"secondary_y": False},
                    {"secondary_y": False},
                    {"secondary_y": False},
                ],
                [
                    {"secondary_y": False},
                    {"secondary_y": False},
                    {"secondary_y": False},
                ],
            ],
        )

        for i, model in enumerate(short_labels):
            m_idx = str(idx_map[model])
            color = colors[i % len(colors)]
            d_ok = self.successful_df[self.successful_df["model"] == model]

            box_style = dict(
                marker_color=color,
                marker_line_color="rgba(0,0,0,0.3)",
                marker_line_width=1,
                fillcolor=f"rgba({','.join(str(int(color[i : i + 2], 16)) for i in (1, 3, 5))}, 0.6)",
                line_color=color,
                line_width=2,
            )

            fig.add_trace(
                go.Box(
                    y=d_ok["semantic_match_embeddings"],
                    name=m_idx,
                    legendgroup=m_idx,
                    showlegend=False,
                    boxpoints="outliers",
                    jitter=0.3,
                    pointpos=-1.8,
                    **box_style,
                ),
                row=1,
                col=1,
            )

            fig.add_trace(
                go.Box(
                    y=d_ok["f1_ing"],
                    name=m_idx,
                    legendgroup=m_idx,
                    showlegend=False,
                    boxpoints="outliers",
                    jitter=0.3,
                    pointpos=-1.8,
                    **box_style,
                ),
                row=1,
                col=2,
            )

            fig.add_trace(
                go.Box(
                    y=d_ok["mae_mac"],
                    name=m_idx,
                    legendgroup=m_idx,
                    showlegend=False,
                    boxpoints="outliers",
                    jitter=0.3,
                    pointpos=-1.8,
                    **box_style,
                ),
                row=1,
                col=3,
            )

            fig.add_trace(
                go.Scatter(
                    x=d_ok["cost_usd"],
                    y=d_ok[metric_col],
                    mode="markers",
                    marker=dict(
                        color=color,
                        size=8,
                        line=dict(width=1, color="rgba(0,0,0,0.3)"),
                        opacity=0.7,
                    ),
                    name=m_idx,
                    legendgroup=m_idx,
                    showlegend=False,
                    hovertemplate="<b>%{fullData.name}</b><br>"
                    + "Cost: $%{x:.4f}<br>"
                    + "Similarity: %{y:.3f}<br>"
                    + "<extra></extra>",
                ),
                row=2,
                col=1,
            )

            fig.add_trace(
                go.Box(
                    y=d_ok["response_time_seconds"],
                    name=m_idx,
                    legendgroup=m_idx,
                    showlegend=False,
                    boxpoints="outliers",
                    jitter=0.3,
                    pointpos=-1.8,
                    **box_style,
                ),
                row=2,
                col=2,
            )

        # Success rate per model: percentage of rows with no recorded error.
        success_rates = (
            self.df.groupby("model")["error"]
            .apply(lambda x: x.isna().mean() * 100)
            .reindex(short_labels)
        )

        fig.add_trace(
            go.Bar(
                x=[idx_map[m] for m in success_rates.index],
                y=success_rates.values,
                marker=dict(
                    color=[colors[i % len(colors)] for i in range(len(short_labels))],
                    line=dict(color="rgba(0,0,0,0.3)", width=1),
                    opacity=0.8,
                ),
                showlegend=False,
                hovertemplate="<b>Model %{x}</b><br>"
                + "Success Rate: %{y:.1f}%<br>"
                + "<extra></extra>",
                text=[f"{rate:.1f}%" for rate in success_rates.values],
                textposition="outside",
            ),
            row=2,
            col=3,
        )

        # Legend-only traces: empty markers that map each numeric model index to its short label.
        for i, model in enumerate(short_labels):
            m_idx = str(idx_map[model])
            fig.add_trace(
                go.Scatter(
                    x=[None],
                    y=[None],
                    mode="markers",
                    marker=dict(
                        size=12,
                        color=colors[i % len(colors)],
                        symbol="circle",
                        line=dict(width=2, color="rgba(0,0,0,0.3)"),
                    ),
                    legendgroup=m_idx,
                    showlegend=True,
                    name=f"{m_idx}: {short_labels[model]}",
                )
            )

        axis_style = dict(
            showgrid=True,
            gridcolor="rgba(128,128,128,0.2)",
            gridwidth=1,
            zeroline=True,
            zerolinecolor="rgba(128,128,128,0.4)",
            zerolinewidth=1,
            tickfont=dict(size=11, color="#2f2f2f"),
        )

        for row in [1, 2]:
            for col in [1, 2, 3]:
                fig.update_xaxes(**axis_style, row=row, col=col)
                fig.update_yaxes(**axis_style, row=row, col=col)

        fig.update_xaxes(title_text="Cost per Image ($)", row=2, col=1)
        fig.update_yaxes(title_text="Similarity Score", row=1, col=1)
        fig.update_yaxes(title_text="F1 Score", row=1, col=2)
        fig.update_yaxes(title_text="Mean Absolute Error", row=1, col=3)
        fig.update_yaxes(title_text="Similarity Score", row=2, col=1)
        fig.update_yaxes(title_text="Response Time (sec)", row=2, col=2)
        fig.update_yaxes(title_text="Success Rate (%)", row=2, col=3)

        fig.update_layout(
            height=900,
            width=1400,
            title=dict(
                text="<b>Model Performance Dashboard</b>",
                x=0.5,
                xanchor="center",
                font=dict(size=24, color="#1f1f1f", family="Arial Black"),
            ),
            showlegend=True,
            legend=dict(
                title="<b>Models</b>",
                title_font=dict(size=14, color="#1f1f1f"),
                font=dict(size=11),
                bgcolor="rgba(255,255,255,0.8)",
                bordercolor="rgba(128,128,128,0.5)",
                borderwidth=1,
                x=1.02,
                y=1,
                xanchor="left",
                yanchor="top",
            ),
            plot_bgcolor="rgba(248,249,250,0.8)",
            paper_bgcolor="white",
            font=dict(family="Arial, sans-serif", size=11, color="#2f2f2f"),
            margin=dict(l=80, r=150, t=120, b=80),
            hovermode="closest",
        )

        config = dict(
            displayModeBar=True,
            displaylogo=False,
            modeBarButtonsToRemove=["pan2d", "lasso2d"],
            toImageButtonOptions=dict(
                format="png",
                filename="model_performance_dashboard",
                height=900,
                width=1400,
                scale=2,
            ),
        )

        fig.show(config=config)

        fig.write_html("performance_dashboard.html", config=config)

    def macro_accuracy_heatmap(self):
        """Create a heatmap of macro percentage error by model."""
        if self.successful_df.empty:
            return
        macro_cols = [
            "calories_pct_error",
            "carbs_pct_error",
            "protein_pct_error",
            "fat_pct_error",
        ]
        heatmap_data = (
            self.successful_df.groupby("model")[macro_cols].mean().sort_index()
        )
        short_labels = {
            m: m.split("/", 1)[1] if "/" in m else m for m in heatmap_data.index
        }

        heatmap_data.index = [short_labels[m] for m in heatmap_data.index]

        fig = px.imshow(
            heatmap_data,
            labels=dict(x="Macro Nutrient", y="Model", color="Avg % Error"),
            x=["Calories", "Carbs", "Protein", "Fat"],
            y=heatmap_data.index,
            text_auto=True,
            aspect="auto",
            color_continuous_scale="RdYlGn_r",
            title="<b>Average Percentage Error by Macro (Lower is Better)</b>",
        )
        # Round cell labels to one decimal so long floats do not crowd the heatmap.
        fig.update_traces(texttemplate="%{z:.1f}%")
        fig.show()

    def export_detailed_report(self, filename="benchmark_report.xlsx"):
        """Export a detailed report with raw data and summary to an Excel file."""
        print(f"Exporting detailed report to {filename}...")
        with pd.ExcelWriter(filename, engine="openpyxl") as writer:
            self.df.to_excel(writer, sheet_name="Raw Results", index=False)

            if not self.successful_df.empty:
                summary = (
                    self.df.groupby("model")
                    .agg(
                        success_rate=("error", lambda x: x.isna().mean()),
                        semantic_match_embeddings_mean=(
                            "semantic_match_embeddings",
                            "mean",
                        ),
                        f1_ing_mean=("f1_ing", "mean"),
                        mae_mac_mean=("mae_mac", "mean"),
                        cost_usd_total=("cost_usd", "sum"),
                        response_time_seconds_mean=("response_time_seconds", "mean"),
                    )
                    .reset_index()
                )
                summary.to_excel(writer, sheet_name="Summary Statistics", index=False)
        print("Export complete.")
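
The `fillcolor` expression in `create_performance_dashboard` packs a hex-to-RGBA conversion into a single f-string. Unpacked, the same idea might look like the sketch below. `hex_to_rgba` is a hypothetical helper name that does not appear in the notebook, and it assumes colors are given as six-digit `#RRGGBB` strings.

```python
def hex_to_rgba(hex_color: str, alpha: float = 0.6) -> str:
    """Convert '#RRGGBB' to an 'rgba(r,g,b,a)' string (hypothetical helper)."""
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected '#RRGGBB', got {hex_color!r}")
    # Each pair of hex digits is one 0-255 color channel.
    red, green, blue = (int(digits[i : i + 2], 16) for i in (0, 2, 4))
    return f"rgba({red},{green},{blue},{alpha})"


print(hex_to_rgba("#1f77b4"))  # rgba(31,119,180,0.6)
```

Factoring the conversion out this way would also let it reject malformed color strings early instead of failing inside the plotting call.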

## Run Benchmark


In [94]:
BENCHMARK_CONFIG = {
    "models": [
        "gpt-4o-mini",
        "gpt-4o",
        "gemini/gemini-2.5-flash-preview-05-20",
        "gemini/gemini-2.5-pro-preview-06-05",
    ],
    "max_items": 10,
    "cache_dir": Path(".cache/food_scan_bench"),
    "max_concurrent_requests": 10,
    "prompt_variants": ["detailed", "step_by_step"],
    "use_embeddings_for_matching": True,
    "report_filename": "benchmark_results.xlsx",
}

results_df = await run_evaluation(
    models=BENCHMARK_CONFIG["models"],
    cache_dir=BENCHMARK_CONFIG["cache_dir"],
    max_items=BENCHMARK_CONFIG["max_items"],
    max_concurrent=BENCHMARK_CONFIG["max_concurrent_requests"],
    prompt_variants=BENCHMARK_CONFIG["prompt_variants"],
    use_embeddings=BENCHMARK_CONFIG["use_embeddings_for_matching"],
)


=== Running with prompt variant: 'detailed' ===
Using OpenAI embeddings semantic similarity


gpt-4o-mini (detailed)                  │          │ 0/10

gpt-4o (detailed)                       │          │ 0/10

gemini/gemini-2.5-flash-preview-05-20 (detailed)│          │ 0/10

gemini/gemini-2.5-pro-preview-06-05 (detailed)│          │ 0/10


=== Running with prompt variant: 'step_by_step' ===
Using OpenAI embeddings semantic similarity


gpt-4o-mini (step_by_step)              │          │ 0/10

gpt-4o (step_by_step)                   │          │ 0/10

gemini/gemini-2.5-flash-preview-05-20 (step_by_step)│          │ 0/10

gemini/gemini-2.5-pro-preview-06-05 (step_by_step)│          │ 0/10
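
`max_concurrent_requests` caps how many model calls are in flight at once. How `run_evaluation` enforces that cap is not shown in this cell. One common pattern for such a limit is an `asyncio.Semaphore`, sketched below under that assumption. The names `run_bounded`, `fake_model_call`, and `demo` are hypothetical, and a short sleep stands in for the real API request.

```python
import asyncio


async def run_bounded(coros, max_concurrent):
    """Await all coroutines with at most `max_concurrent` running at once.

    Results are returned in the same order as the input coroutines.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(coro):
        # Each task waits here until one of the `max_concurrent` slots is free.
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))


async def fake_model_call(item_id, state):
    """Stand-in for a model request; records the peak number of active calls."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # simulated network latency
    state["active"] -= 1
    return item_id


def demo(n_items=10, max_concurrent=3):
    """Run `n_items` fake calls and report the results and peak concurrency."""
    state = {"active": 0, "peak": 0}
    coros = [fake_model_call(i, state) for i in range(n_items)]
    results = asyncio.run(run_bounded(coros, max_concurrent))
    return results, state["peak"]
```

`demo` uses `asyncio.run`, which suits a plain script. Inside a notebook, where an event loop is already running, `await run_bounded(...)` directly instead.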

In [95]:
analyzer = BenchmarkAnalyzer(results_df)

analyzer.summary_statistics()
analyzer.analyze_ingredient_matches()
analyzer.statistical_significance_test()
analyzer.create_performance_dashboard()
analyzer.macro_accuracy_heatmap()
analyzer.export_detailed_report(BENCHMARK_CONFIG["report_filename"])

display(results_df.head())

=== BENCHMARK SUMMARY ===

--- gemini/gemini-2.5-flash-preview-05-20_detailed ---
  Success Rate: 100.0% (10/10)
  Avg Semantic Match (Embeddings): 0.340
  Avg F1 Score (Ingredients): 0.224
  Avg MAE (Macros): 106.5
  Avg Cost per Image: $0.0012
  Avg Response Time: 17.9s

--- gemini/gemini-2.5-flash-preview-05-20_step_by_step ---
  Success Rate: 100.0% (10/10)
  Avg Semantic Match (Embeddings): 0.304
  Avg F1 Score (Ingredients): 0.143
  Avg MAE (Macros): 111.3
  Avg Cost per Image: $0.0013
  Avg Response Time: 18.0s

--- gemini/gemini-2.5-pro-preview-06-05_detailed ---
  Success Rate: 100.0% (10/10)
  Avg Semantic Match (Embeddings): 0.278
  Avg F1 Score (Ingredients): 0.071
  Avg MAE (Macros): 79.6
  Avg Cost per Image: $0.0234
  Avg Response Time: 32.3s

--- gemini/gemini-2.5-pro-preview-06-05_step_by_step ---
  Success Rate: 100.0% (10/10)
  Avg Semantic Match (Embeddings): 0.290
  Avg F1 Score (Ingredients): 0.168
  Avg MAE (Macros): 79.1
  Avg Cost per Image: $0.0209
  Avg Respo

Exporting detailed report to benchmark_results.xlsx...
Export complete.


,image_id,model,response_time_seconds,f1_ing,semantic_match,semantic_match_embeddings,ingredient_count_acc,mae_mac,error,cost_usd,calories_pct_error,carbs_pct_error,protein_pct_error,fat_pct_error,match_details
0,fsb_00000,gpt-4o-mini_detailed,9.83,0.5,0.6,0.6,0.6,114.15,,0.004072,95.838433,123.880597,9.634551,119.653179,"[{'gt_ingredient': 'scrambled eggs', 'pred_ing..."
1,fsb_00001,gpt-4o-mini_detailed,10.88,0.75,0.75,0.75,1.0,12.775,,0.005722,8.858859,8.247423,40.625,44.174757,"[{'gt_ingredient': 'almond cake', 'pred_ingred..."
2,fsb_00002,gpt-4o-mini_detailed,12.1,0.0,0.0,0.0,0.333333,15.925,,0.005673,9.354605,4.891304,138.095238,19.58042,[]
3,fsb_00003,gpt-4o-mini_detailed,6.12,0.333333,0.25,0.25,0.5,181.35,,0.005671,68.881974,54.894434,80.769231,80.0,"[{'gt_ingredient': 'pie crust', 'pred_ingredie..."
4,fsb_00004,gpt-4o-mini_detailed,12.88,0.0,0.0,0.0,0.666667,2.275,,0.005684,6.567993,5.395683,27.272727,25.0,[]
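
The `match_details` column holds a list of dictionaries per row. If that column has been serialized to text, for instance after a round trip through a CSV or Excel file, it can be turned back into Python objects with `ast.literal_eval`. The sketch below assumes that situation. `parse_match_details` is a hypothetical helper, and only the `gt_ingredient` key is taken from the output above.

```python
import ast


def parse_match_details(cell):
    """Return `match_details` as a list of dicts, whatever form the cell is in."""
    if isinstance(cell, list):
        return cell  # already parsed, e.g. straight from the in-memory DataFrame
    if cell is None or (isinstance(cell, float) and cell != cell):
        return []  # None or NaN: treat a missing value as "no matches"
    # literal_eval only accepts Python literals, so it is safer than eval().
    return ast.literal_eval(cell)


details = parse_match_details("[{'gt_ingredient': 'scrambled eggs'}]")
print(details[0]["gt_ingredient"])  # scrambled eggs
```

With pandas, the helper could be applied column-wide, for example `results_df["match_details"].apply(parse_match_details)`.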
