
QurusX/Product-Matching-Engine


Product Matching Engine 🚀


High-precision E-commerce deduplication system based on CV and NLP.

7th Solo Place out of 110 teams in the Ozon Tech ML Competition. ROC-AUC Score: 0.9216


📌 Overview

This repository contains a production-ready solution for the product matching task. The engine predicts whether two products are identical (target = 1) or different (target = 0) by analyzing multi-modal data: titles, technical attributes, and images.

🏁 Key Results

| Model | ROC-AUC | Focus |
| --- | --- | --- |
| AutoGluon (Stacked) | 0.9216 | Maximum accuracy (Top-7 Leaderboard) |
| HistGradientBoosting | 0.9100 | Production-ready, lightweight & fast |
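For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen matching pair receives a higher score than a randomly chosen non-matching pair. The sketch below is a minimal pure-Python illustration of that rank-based definition; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not code or data from this repository.

```python
def roc_auc(labels, scores):
    """Rank-based ROC-AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        # Every item in the run gets the average 1-based rank of the run.
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    # Mann-Whitney U statistic, normalised by the number of positive/negative pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: target = 1 means "same product", target = 0 means "different".
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of matching and non-matching pairs.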

📦 Data Architecture

The system processes several distinct data streams provided by Ozon Tech:

  • Textual: Titles, descriptions, and pre-trained BERT embeddings.
  • Visual: ResNet embeddings of product images.
  • Technical: Product categories and structured attributes (attributes.csv).
  • Relational: Pairs of products for training (train.csv).

🔧 Engineering Pipeline

The core strength of this solution lies in advanced feature engineering:

  • NLP Features: Levenshtein distance, Jaccard similarity, string length analysis.
  • Attribute Analysis: Jaccard similarity of attribute sets, count differences, binary matching flags.
  • CV Analysis: Cosine and Euclidean distances between image embeddings, entropy-based features.
  • Categorical: Multi-level subcategory matching.
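A few of the pairwise features listed above can be sketched in pure Python as follows. This is a minimal illustration, not the code from the notebooks: the function names, the product dictionary layout (`title`, `attributes`, `image_emb`) and the sample products are assumptions made for the example, and the entropy-based and category features are left out.

```python
import math


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaccard(x, y):
    """Jaccard similarity of two collections, treated as sets."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(x & y) / len(x | y)


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return 1.0 - dot / norm if norm else 1.0


def pair_features(p1, p2):
    """Build one feature row for a candidate pair of products (hypothetical layout)."""
    t1, t2 = p1["title"].lower(), p2["title"].lower()
    a1, a2 = p1["attributes"], p2["attributes"]
    shared_keys = set(a1) & set(a2)
    return {
        "title_levenshtein": levenshtein(t1, t2),
        "title_token_jaccard": jaccard(t1.split(), t2.split()),
        "title_len_diff": abs(len(t1) - len(t2)),
        "attr_jaccard": jaccard(a1.items(), a2.items()),
        "attr_count_diff": abs(len(a1) - len(a2)),
        # Binary flag: 1 if every attribute present in both products has the same value.
        "attr_shared_equal": int(all(a1[k] == a2[k] for k in shared_keys)),
        "img_cosine_dist": cosine_distance(p1["image_emb"], p2["image_emb"]),
        "img_euclidean_dist": math.dist(p1["image_emb"], p2["image_emb"]),
    }


# Illustrative products; real embeddings would come from the provided ResNet vectors.
product_a = {
    "title": "Apple iPhone 13 128GB Black",
    "attributes": {"brand": "Apple", "memory": "128GB", "color": "black"},
    "image_emb": [1.0, 0.0, 0.0],
}
product_b = {
    "title": "Apple iPhone 13 128GB Blue",
    "attributes": {"brand": "Apple", "memory": "128GB", "color": "blue"},
    "image_emb": [0.0, 1.0, 0.0],
}
features = pair_features(product_a, product_b)
```

Each candidate pair becomes one row of numeric features like these, which is the tabular form that gradient boosting and AutoML models consume.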

🧠 Model Architectures

1. AutoGluon Tabular (The Heavy Hitter)

A sophisticated multi-model stack with automatic ensembling.

2. HistGradientBoosting + Optuna (The Fast Track)

An alternative lightweight pipeline designed for high-load environments.

📂 Repository Structure

Product-Matching-Engine/
├── assets/             # Project branding and images
├── notebooks/          # Research & Development
│   ├── 01_Feature_Extraction.ipynb
│   ├── 02_Category_Analysis.ipynb
│   ├── 03_Model_Training_HGB.ipynb
│   └── 04_AutoML_Training_AutoGluon.ipynb
├── src/                # Production scripts (Inference)
│   ├── inference_hgb.py
│   └── inference_autogluon.py
└── requirements.txt    # Project dependencies

🔍 Analysis & Future Work

The high performance was achieved through massive feature generation and AutoML. Analysis of top-3 solutions shows that further gains can be made by fine-tuning Transformers (BERT) specifically on product categories to capture deeper semantic relationships.


Developed by Nikolay Alymov. Expertise in AI, RecSys, and Enterprise Automation.
