A pipeline for merging duplicate product listings that carry inconsistent names (Hebrew, English, abbreviated, ALL-CAPS) and surfacing the lowest price per unique product.
Input CSV (id, name, price, category [, group_id])
│
▼
┌─────────────────────────────────────────────────────┐
│ Stage 0: Hebrew → English Normalization │
│ • gpt-4o-mini translates Hebrew names once │
│ • Results cached in cache/normalizations.json │
│ • Pure-English names pass through unchanged │
└──────────────────────┬──────────────────────────────┘
│ normalized names
▼
┌─────────────────────────────────────────────────────┐
│ Stage 1: Embeddings + FAISS ANN │
│ • text-embedding-3-small → 1536-dim vectors │
│ • Cached in cache/embeddings.json │
│ • FAISS IndexFlatIP on L2-normalized vectors │
│ → cosine similarity via inner product │
└──────────────────────┬──────────────────────────────┘
│ candidate pairs
▼
┌─────────────────────────────────────────────────────┐
│ Stage 2: Category-Aware Union-Find Clustering │
│ • Only products in the same category are compared │
│ • Union-Find merges pairs above threshold (0.72) │
└──────────────────────┬──────────────────────────────┘
│ raw clusters
▼
┌─────────────────────────────────────────────────────┐
│ Stage 3: LLM Cluster Refinement │
│ • Full cluster in one GPT-4o-mini call (not per-pair) │
│ • Verifies membership, splits false positives │
│ • Generates canonical English product name │
│ • Hard rule: 256GB ≠ 512GB, 55" ≠ 65", 9KG ≠ 11KG │
└──────────────────────┬──────────────────────────────┘
│ verified clusters + canonical names
▼
┌─────────────────────────────────────────────────────┐
│ Stage 4: Merge & Min Price │
│ • Picks lowest-priced listing per group │
│ → output/deduplicated_products.csv │
└──────────────────────┬──────────────────────────────┘
│ (only when group_id present)
▼
┌─────────────────────────────────────────────────────┐
│ Evaluation │
│ → output/evaluation.png │
│ → output/evaluation.json │
└─────────────────────────────────────────────────────┘
pip install -r requirements.txt
export OPENAI_API_KEY="sk-..."
python src/pipeline.pyThe pipeline reads from data/products_benchmark.csv and writes to output/ (created automatically).
On repeat runs, both caches are loaded — zero API calls unless new names appear.
├── data/
│ └── products_benchmark.csv # 432 products | 14 categories | group_id as GT
├── src/
│ └── pipeline.py # full pipeline (single file)
├── cache/
│ ├── embeddings.json # embedding cache (auto-created)
│ └── normalizations.json # Hebrew→English translation cache (auto-created)
├── output/ # auto-created
│ ├── deduplicated_products.csv
│ ├── evaluation.png
│ └── evaluation.json
├── requirements.txt
└── README.md
id,name,price,category[,group_id]
1,Samsung Galaxy S24 Ultra 256GB,4999,smartphones,SM-001
2,סמסונג גלקסי S24 אולטרה 256GB,4799,smartphones,SM-001
3,Samsung Galaxy S24 Ultra 512GB,5299,smartphones,SM-002group_id is optional — when present the pipeline runs evaluation automatically at the end.
Key invariant: every variant within a group_id must refer to the exact same product including storage/capacity/screen size. Different specs = different group_id.
CSV (output/deduplicated_products.csv) — flat table, one row per unique product:
| Column | Description |
|---|---|
canonical_name |
Normalized English product name |
category |
Product category |
min_price |
Lowest price found by the pipeline |
gt_min_price |
Ground-truth minimum price (only when group_id is present) |
best_deal_id |
ID of the listing with the lowest price |
best_deal_name |
Original name of the best-deal listing |
duplicate_count |
Number of listings merged into this product |
all_variants |
All listings, pipe-separated with prices |
The pipeline is benchmarked on 432 product listings across 14 categories (129 unique products, each with 3–4 naming variants in Hebrew, English, abbreviated, and ALL-CAPS forms).
Left — Metrics bar chart
Nine metrics grouped by method. Color bands show which group each metric belongs to:
- Blue — Pair-based (Precision / Recall / F1): The industry standard for deduplication. The unit of measurement is a pair of products. Unaffected by the large number of true-negative pairs.
- Green — Cluster-level (Purity / Coverage): Purity > 1.0 is a known artifact of the formula when products are perfectly split; Coverage measures how many true duplicates were successfully grouped.
- Orange — B-Cubed (Precision / Recall / F1): Computed per entity rather than per pair, so large clusters don't dominate the score.
- Purple — Adjusted Rand Index: A single global partition-similarity score, corrected for chance. 1.0 = perfect, 0.0 = random.
Middle — Pair Breakdown
Raw counts of True Positives (correct merges), False Positives (wrong merges), and False Negatives (missed duplicates). The dominant error type is FN — the pipeline occasionally over-splits conservative clusters.
Right — Price Quality
- Price Accuracy: fraction of deduplicated groups where the pipeline found the true minimum price.
- Avg Overcharge: mean relative overpayment when the minimum was missed (kept near 0 by the LLM stage).
| Metric | Score |
|---|---|
| Pair Precision | 1.000 |
| Pair Recall | 0.914 |
| Pair F1 | 0.955 |
| Cluster Purity | 1.002 |
| Cluster Coverage | 0.954 |
| B-Cubed F1 | 0.968 |
| Adjusted Rand Index | 0.953 |
| Price Accuracy | 95.0% |
| Avg Overcharge | 0.4% |
| Constant | Default | Effect |
|---|---|---|
SIMILARITY_THRESHOLD |
0.72 | Cosine similarity cutoff for Union-Find. Lower = wider candidate net; LLM handles more false positives. |
TOP_K |
10 | FAISS neighbors checked per product. Raise for denser catalogs. |
llm_refinemakes one API call per cluster. Use the OpenAI Batch API at scale.- Both caches (
embeddings.json,normalizations.json) make incremental runs free — only new/changed names hit the API.
