7th Solo Place out of 110 teams in the Ozon Tech ML Competition. ROC-AUC Score: 0.9216
This repository contains a production-ready solution for the product matching task. The engine predicts whether two products are identical (target = 1) or different (target = 0) by analyzing multi-modal data: titles, technical attributes, and images.
| Model | ROC-AUC | Focus |
|---|---|---|
| AutoGluon (Stacked) | 0.9216 | Maximum accuracy (Top-7 Leaderboard) |
| HistGradientBoosting | 0.9100 | Production-ready, lightweight & fast |
The system processes five data streams provided by Ozon Tech, grouped into four modalities:
- Textual: Titles, descriptions, and pre-trained BERT embeddings.
- Visual: ResNet embeddings of product images.
- Technical: Product categories and structured attributes (attributes.csv).
- Relational: Pairs of products for training (train.csv).
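The streams above come together by joining the training pairs with per-product data. A minimal sketch of that join (the column names `variantid1`, `variantid2`, `variantid`, and `category` are assumptions for illustration; adjust to the actual `train.csv` / `attributes.csv` schema):

```python
import pandas as pd

# Tiny stand-ins for train.csv (product pairs) and attributes.csv
# (per-product data). Column names are assumptions, not the real schema.
train = pd.DataFrame({"variantid1": [1], "variantid2": [2], "target": [1]})
attrs = pd.DataFrame({"variantid": [1, 2], "category": ["phones", "phones"]})

# Attach each side's attributes to the pair: one merge per product in the pair.
pairs = (
    train
    .merge(attrs.add_suffix("_1"), left_on="variantid1", right_on="variantid_1")
    .merge(attrs.add_suffix("_2"), left_on="variantid2", right_on="variantid_2")
)
```

After this join, every row holds both products' attributes side by side, which is the shape the pairwise features below are computed from.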
The core strength of this solution lies in advanced feature engineering:
- NLP Features: Levenshtein distance, Jaccard similarity, string length analysis.
- Attribute Analysis: Jaccard similarity of attribute sets, count differences, binary matching flags.
- CV Analysis: Cosine and Euclidean distances between image embeddings, entropy-based features.
- Categorical: Multi-level subcategory matching.
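A minimal sketch of the core pairwise similarity features listed above (function names are illustrative, and the pure-Python edit distance stands in for whatever library implementation the notebooks use):

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    # Classic DP edit distance between two title strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def jaccard(s1: set, s2: set) -> float:
    # Overlap of two attribute-key (or title-token) sets.
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

def embedding_distances(e1: np.ndarray, e2: np.ndarray) -> tuple[float, float]:
    # Cosine and Euclidean distances between image (or BERT) embeddings.
    cosine = 1.0 - float(e1 @ e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    euclid = float(np.linalg.norm(e1 - e2))
    return cosine, euclid
```

Each product pair yields one row of such scalar features, which both models below consume as a flat table.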
AutoGluon: a multi-model stack with automatic ensembling.
- Presets: `best_quality` for maximum accuracy.
- Strategy: automated hyperparameter tuning and multi-layer stacking.
- File: `notebooks/04_AutoML_Training_AutoGluon.ipynb`
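The AutoGluon setup boils down to a few lines. A configuration sketch (the feature file path and `target` column name are assumptions; the notebook holds the actual training call):

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

# Hypothetical path to the engineered pair features: one row per product
# pair, with a binary `target` column (1 = identical, 0 = different).
train_features = pd.read_csv("features_train.csv")

predictor = TabularPredictor(label="target", eval_metric="roc_auc").fit(
    train_features,
    presets="best_quality",  # enables bagging and multi-layer stacking
)
```

`best_quality` trades training time for accuracy, which is what pushed the stacked model to 0.9216 on the leaderboard.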
HistGradientBoosting: an alternative lightweight pipeline designed for high-load environments.
- Optimization: Bayesian search via Optuna.
- Strengths: Handles class imbalance, minimal inference latency.
- File: `notebooks/03_Model_Training_HGB.ipynb`
```
Product-Matching-Engine/
├── assets/                # Project branding and images
├── notebooks/             # Research & development
│   ├── 01_Feature_Extraction.ipynb
│   ├── 02_Category_Analysis.ipynb
│   ├── 03_Model_Training_HGB.ipynb
│   └── 04_AutoML_Training_AutoGluon.ipynb
├── src/                   # Production scripts (inference)
│   ├── inference_hgb.py
│   └── inference_autogluon.py
└── requirements.txt       # Project dependencies
```
The high performance was achieved through extensive feature generation and AutoML. Analysis of the top-3 solutions suggests that further gains could come from fine-tuning Transformers (BERT) on product categories to capture deeper semantic relationships.
Developed by Nikolay Alymov. Expertise in AI, RecSys, and Enterprise Automation.
