ProductBench: LLM Benchmark for Product Knowledge

ProductBench is a benchmark designed to evaluate the product knowledge of Large Language Models (LLMs). It consists of two main tasks:

Label Augmentation: The LLM is given a condensed or ambiguous product label and must output a clear, human-readable label.
Product Re-ranking: The LLM is given a search query and a list of potential product matches, and it must re-rank the products from most to least relevant.

The project also includes a web-based UI with a leaderboard to visualize the results and a price-to-performance graph.

Methodology

The goal of this benchmark is to estimate which model would be the most relevant for a product RAG system like a recommendation system, and measure both domain specific language as well as the ability to understand noisy labels (corrupted text, with brands...).

The models currently evaluated are mostly non-thinking ones because they are the one where the latency is the smallest.

Qwen 3 model have been tested in a previous iterations of the benchmark but has a weird behaviour (unreliable to benchmark), they have therefore been discarded for the time being and will certainly be reintroduced when the quirks have been solved (like the need of the /nothink mention).

The exception to the non thinking model category is gemini flash 3 which is the judge model (clearly specified in the leaderboard).

This benchmark is fully reproducible suign the "benchmark runner" script.

Limitation

This benchmark being open source on github, it it possible that that the benchmark will be added in future LLM training datasets, which they could overfit onto. Model released before the 4th of January 2026 should be contamination free.

Gemini 3 Pro has been used to get inspirations for some of the examples, especially in languages like German, Spanish and Polish. All examples of the benchmark have been human picked or adjusted. The UI and the benchmark logic have been in part vibe coded.

Setup

Clone the repository:

git clone https://github.com/your-username/productbench.git
cd productbench

Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

To run the benchmarks, execute the benchmark_runner.py script:

python benchmark_runner.py

The website is a static one, so just go on index.html to see the leaderboard at the end

Project Structure

main.py: The main entry point for running the benchmarks and launching the UI.
requirements.txt: The list of Python dependencies.
productbench/: The main project directory.
- data/: Contains the JSON data for the benchmarks.
- label_augmentation/: The logic for the label augmentation task.
- product_reranking/: The logic for the product re-ranking task.
- ui/: The Flask web application for the leaderboard.
  - app.py: The Flask application factory.
  - static/: CSS and JavaScript files.
  - templates/: HTML templates.

Name		Name	Last commit message	Last commit date
Latest commit History 31 Commits
productbench		productbench
reranker_models		reranker_models
tests		tests
.gitignore		.gitignore
BENCHMARK_RESULTS.json		BENCHMARK_RESULTS.json
BENCHMARK_RESULTS.md		BENCHMARK_RESULTS.md
README.md		README.md
RERANKER_RESULTS.json		RERANKER_RESULTS.json
RERANKER_RESULTS.md		RERANKER_RESULTS.md
benchmark_runner.py		benchmark_runner.py
index.html		index.html
pricing_map.json		pricing_map.json
requirements.txt		requirements.txt
reranker_runner.py		reranker_runner.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ProductBench: LLM Benchmark for Product Knowledge

Methodology

Limitation

Setup

Usage

Project Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ProductBench: LLM Benchmark for Product Knowledge

Methodology

Limitation

Setup

Usage

Project Structure

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages