Skip to content

TalKleinBgu/Zap

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Zap Product Deduplication — Embeddings + LLM Pipeline

A pipeline for merging duplicate product listings that carry inconsistent names (Hebrew, English, abbreviated, ALL-CAPS) and surfacing the lowest price per unique product.


Pipeline Architecture

Input CSV  (id, name, price, category [, group_id])
    │
    ▼
┌─────────────────────────────────────────────────────┐
│  Stage 0: Hebrew → English Normalization             │
│  • gpt-4o-mini translates Hebrew names once          │
│  • Results cached in cache/normalizations.json       │
│  • Pure-English names pass through unchanged         │
└──────────────────────┬──────────────────────────────┘
                       │ normalized names
                       ▼
┌─────────────────────────────────────────────────────┐
│  Stage 1: Embeddings + FAISS ANN        │
│  • text-embedding-3-small → 1536-dim vectors         │
│  • Cached in cache/embeddings.json                   │
│  • FAISS IndexFlatIP on L2-normalized vectors        │
│    → cosine similarity via inner product             │
└──────────────────────┬──────────────────────────────┘
                       │ candidate pairs
                       ▼
┌─────────────────────────────────────────────────────┐
│  Stage 2: Category-Aware Union-Find Clustering       │
│  • Only products in the same category are compared   │
│  • Union-Find merges pairs above threshold (0.72)    │
└──────────────────────┬──────────────────────────────┘
                       │ raw clusters
                       ▼
┌─────────────────────────────────────────────────────┐
│  Stage 3: LLM Cluster Refinement                     │
│  • Full cluster in one GPT-4o-mini call (not per-pair)    │
│  • Verifies membership, splits false positives       │
│  • Generates canonical English product name          │
│  • Hard rule: 256GB ≠ 512GB, 55" ≠ 65", 9KG ≠ 11KG  │
└──────────────────────┬──────────────────────────────┘
                       │ verified clusters + canonical names
                       ▼
┌─────────────────────────────────────────────────────┐
│  Stage 4: Merge & Min Price                          │
│  • Picks lowest-priced listing per group             │
│  → output/deduplicated_products.csv                  │
└──────────────────────┬──────────────────────────────┘
                       │ (only when group_id present)
                       ▼
┌─────────────────────────────────────────────────────┐
│  Evaluation                                          │
│  → output/evaluation.png                             │
│  → output/evaluation.json                            │
└─────────────────────────────────────────────────────┘

Quick Start

pip install -r requirements.txt
export OPENAI_API_KEY="sk-..."
python src/pipeline.py

The pipeline reads from data/products_benchmark.csv and writes to output/ (created automatically).
On repeat runs, both caches are loaded — zero API calls unless new names appear.


File Structure

├── data/
│   └── products_benchmark.csv   # 432 products | 14 categories | group_id as GT
├── src/
│   └── pipeline.py              # full pipeline (single file)
├── cache/
│   ├── embeddings.json          # embedding cache (auto-created)
│   └── normalizations.json      # Hebrew→English translation cache (auto-created)
├── output/                      # auto-created
│   ├── deduplicated_products.csv
│   ├── evaluation.png
│   └── evaluation.json
├── requirements.txt
└── README.md

Input Format

id,name,price,category[,group_id]
1,Samsung Galaxy S24 Ultra 256GB,4999,smartphones,SM-001
2,סמסונג גלקסי S24 אולטרה 256GB,4799,smartphones,SM-001
3,Samsung Galaxy S24 Ultra 512GB,5299,smartphones,SM-002

group_id is optional — when present the pipeline runs evaluation automatically at the end.

Key invariant: every variant within a group_id must refer to the exact same product including storage/capacity/screen size. Different specs = different group_id.


Output Format

CSV (output/deduplicated_products.csv) — flat table, one row per unique product:

Column Description
canonical_name Normalized English product name
category Product category
min_price Lowest price found by the pipeline
gt_min_price Ground-truth minimum price (only when group_id is present)
best_deal_id ID of the listing with the lowest price
best_deal_name Original name of the best-deal listing
duplicate_count Number of listings merged into this product
all_variants All listings, pipe-separated with prices

Evaluation Results

The pipeline is benchmarked on 432 product listings across 14 categories (129 unique products, each with 3–4 naming variants in Hebrew, English, abbreviated, and ALL-CAPS forms).

Evaluation Report

Reading the Chart

Left — Metrics bar chart
Nine metrics grouped by method. Color bands show which group each metric belongs to:

  • Blue — Pair-based (Precision / Recall / F1): The industry standard for deduplication. The unit of measurement is a pair of products. Unaffected by the large number of true-negative pairs.
  • Green — Cluster-level (Purity / Coverage): Purity > 1.0 is a known artifact of the formula when products are perfectly split; Coverage measures how many true duplicates were successfully grouped.
  • Orange — B-Cubed (Precision / Recall / F1): Computed per entity rather than per pair, so large clusters don't dominate the score.
  • Purple — Adjusted Rand Index: A single global partition-similarity score, corrected for chance. 1.0 = perfect, 0.0 = random.

Middle — Pair Breakdown
Raw counts of True Positives (correct merges), False Positives (wrong merges), and False Negatives (missed duplicates). The dominant error type is FN — the pipeline occasionally over-splits conservative clusters.

Right — Price Quality

  • Price Accuracy: fraction of deduplicated groups where the pipeline found the true minimum price.
  • Avg Overcharge: mean relative overpayment when the minimum was missed (kept near 0 by the LLM stage).

Current Numbers

Metric Score
Pair Precision 1.000
Pair Recall 0.914
Pair F1 0.955
Cluster Purity 1.002
Cluster Coverage 0.954
B-Cubed F1 0.968
Adjusted Rand Index 0.953
Price Accuracy 95.0%
Avg Overcharge 0.4%

Key Parameters

Constant Default Effect
SIMILARITY_THRESHOLD 0.72 Cosine similarity cutoff for Union-Find. Lower = wider candidate net; LLM handles more false positives.
TOP_K 10 FAISS neighbors checked per product. Raise for denser catalogs.

Scalability Notes

  • llm_refine makes one API call per cluster. Use the OpenAI Batch API at scale.
  • Both caches (embeddings.json, normalizations.json) make incremental runs free — only new/changed names hit the API.

About

Product deduplication pipeline for Israeli price-comparison — Hebrew/English normalization, FAISS embeddings, LLM cluster refinement. Pair F1: 0.955

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages