7th Solo Place out of 110 teams in the Ozon Tech ML Competition. ROC-AUC Score: 0.9216
This repository contains a production-ready solution for the product matching task. The engine predicts whether two products are identical (target = 1) or different (target = 0) by analyzing multi-modal data: titles, technical attributes, and images.
| Model | ROC-AUC | Focus |
|---|---|---|
| AutoGluon (Stacked) | 0.9216 | Maximum accuracy (Top-7 Leaderboard) |
| HistGradientBoosting | 0.9100 | Production-ready, lightweight & fast |
The system processes five data streams provided by Ozon Tech, grouped into four modalities:
- Textual: Titles, descriptions, and pre-trained BERT embeddings.
- Visual: ResNet embeddings of product images.
- Technical: Product categories and structured attributes (attributes.csv).
- Relational: Pairs of products for training (train.csv).
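The streams above come together by joining the training pairs with per-product data. A minimal sketch of that join (the column names `variantid1`, `variantid2`, `variantid`, and `category` are assumptions for illustration; adjust to the actual `train.csv` / `attributes.csv` schema):

```python
import pandas as pd

# Tiny stand-ins for train.csv (product pairs) and attributes.csv
# (per-product data). Column names are assumptions, not the real schema.
train = pd.DataFrame({"variantid1": [1], "variantid2": [2], "target": [1]})
attrs = pd.DataFrame({"variantid": [1, 2], "category": ["phones", "phones"]})

# Attach each side's attributes to the pair: one merge per product in the pair.
pairs = (
    train
    .merge(attrs.add_suffix("_1"), left_on="variantid1", right_on="variantid_1")
    .merge(attrs.add_suffix("_2"), left_on="variantid2", right_on="variantid_2")
)
```

After this join, every row holds both products' attributes side by side, which is the shape the pairwise features below are computed from.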
The core strength of this solution lies in advanced feature engineering:
- NLP Features: Levenshtein distance, Jaccard similarity, string length analysis.
- Attribute Analysis: Jaccard similarity of attribute sets, count differences, binary matching flags.
- CV Analysis: Cosine and Euclidean distances between image embeddings, entropy-based features.
- Categorical: Multi-level subcategory matching.
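A minimal sketch of the core pairwise similarity features listed above (function names are illustrative, and the pure-Python edit distance stands in for whatever library implementation the notebooks use):

```python
import numpy as np

def levenshtein(a: str, b: str) -> int:
    # Classic DP edit distance between two title strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def jaccard(s1: set, s2: set) -> float:
    # Overlap of two attribute-key (or title-token) sets.
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

def embedding_distances(e1: np.ndarray, e2: np.ndarray) -> tuple[float, float]:
    # Cosine and Euclidean distances between image (or BERT) embeddings.
    cosine = 1.0 - float(e1 @ e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    euclid = float(np.linalg.norm(e1 - e2))
    return cosine, euclid
```

Each product pair yields one row of such scalar features, which both models below consume as a flat table.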
AutoGluon: a multi-model stack with automatic ensembling.
- Presets: `best_quality` for maximum accuracy.
- Strategy: automated hyperparameter tuning and multi-layer stacking.
- File: `notebooks/04_AutoML_Training_AutoGluon.ipynb`
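The AutoGluon setup boils down to a few lines. A configuration sketch (the feature file path and `target` column name are assumptions; the notebook holds the actual training call):

```python
import pandas as pd
from autogluon.tabular import TabularPredictor

# Hypothetical path to the engineered pair features: one row per product
# pair, with a binary `target` column (1 = identical, 0 = different).
train_features = pd.read_csv("features_train.csv")

predictor = TabularPredictor(label="target", eval_metric="roc_auc").fit(
    train_features,
    presets="best_quality",  # enables bagging and multi-layer stacking
)
```

`best_quality` trades training time for accuracy, which is what pushed the stacked model to 0.9216 on the leaderboard.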
HistGradientBoosting: an alternative lightweight pipeline designed for high-load environments.
- Optimization: Bayesian search via Optuna.
- Strengths: Handles class imbalance, minimal inference latency.
- File: `notebooks/03_Model_Training_HGB.ipynb`
```
Product-Matching-Engine/
├── assets/                # Project branding and images
├── notebooks/             # Research & development
│   ├── 01_Feature_Extraction.ipynb
│   ├── 02_Category_Analysis.ipynb
│   ├── 03_Model_Training_HGB.ipynb
│   └── 04_AutoML_Training_AutoGluon.ipynb
├── src/                   # Production scripts (inference)
│   ├── inference_hgb.py
│   └── inference_autogluon.py
└── requirements.txt       # Project dependencies
```
The high performance was achieved through extensive feature generation and AutoML. Analysis of the top-3 solutions suggests that further gains could come from fine-tuning Transformers (BERT) on product categories to capture deeper semantic relationships.
Developed by Nikolay Alymov. Expertise in AI, RecSys, and Enterprise Automation.
