This repository is dedicated to the Home Depot Item Relevance project, which aims to determine the relevance of an item to a search query. The project focuses on developing models using different NLP techniques: classical ML, character-based, word-based, pretrained, and combined.
- DoubleLSTMDataset.py - Dataset for DoubleLSTMSiamese Model
- DoubleLSTMSiameseLSTM.py - DoubleLSTMSiamese Model
- BartDataset.py - Dataset for BartSiamese Model
- BartSiamese.py - BartSiamese Model
- bart_utils.py - util functions for Bart
- CharDataset.py - Dataset for CharSiameseLSTM Model
- CharSiameseLSTM.py - CharSiameseLSTM Model
- char_utils.py - util functions for character-based model
- ClassicalML.py - contains all functions used for training classical ML algorithms
- GLOBALS.py - contains all global variables
- new_preproc.py - contains the new preprocessing functions
- old_preproc.py - contains the old preprocessing functions
- WordDataset.py - Dataset for WordSiameseLSTM Model
- WordSiameseLSTM.py - WordSiameseLSTM Model
- word_utils.py - util functions for word-based model
- 2LSTM.ipynb - main for training double LSTM model
- Bart.ipynb - main for training Bart-based model
- Character.ipynb - main for training character-based model
- Naive.ipynb - main for training naive model for comparison
- Word.ipynb - main for training word-based model
The dataset is composed of different features describing items and search queries. In our project we used:
- Product Descriptions
- Search Terms
- Relevance of the search term to the item description
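The three features above can be joined into one training table. A minimal sketch, assuming column names along the lines of the Kaggle Home Depot data (`product_uid`, `search_term`, `relevance`, `product_description` are illustrative stand-ins, and the tiny inline frames replace the real CSV files):

```python
import pandas as pd

# Stand-ins for the real train / product-description files; values are made up.
train = pd.DataFrame({
    "product_uid": [100001, 100002],
    "search_term": ["angle bracket", "deck paint"],
    "relevance": [3.0, 2.5],  # relevance score, e.g. 1 (irrelevant) .. 3 (perfect match)
})
descriptions = pd.DataFrame({
    "product_uid": [100001, 100002],
    "product_description": ["Angle brackets reinforce ...", "Premium deck paint ..."],
})

# Attach each product's description to its (search term, relevance) rows.
data = train.merge(descriptions, on="product_uid", how="left")
print(list(data.columns))
```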
- Model Structure:
Same as in the character-based model
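The Siamese structure shared by these models can be sketched roughly as follows: one LSTM encoder is applied to both the search term and the product description, and a small head regresses a relevance score from the pair of sequence embeddings. This is a minimal illustrative sketch, not the repo's actual CharSiameseLSTM or WordSiameseLSTM code; all sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class SiameseLSTM(nn.Module):
    """One LSTM encoder shared by both inputs (illustrative sizes)."""
    def __init__(self, vocab_size=128, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Regress a relevance score from the concatenated sentence vectors.
        self.head = nn.Linear(hidden_dim * 2, 1)

    def encode(self, token_ids):
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]  # final hidden state as the sequence embedding

    def forward(self, query_ids, desc_ids):
        q = self.encode(query_ids)  # same weights encode both inputs
        d = self.encode(desc_ids)
        return self.head(torch.cat([q, d], dim=1)).squeeze(1)

model = SiameseLSTM()
scores = model(torch.randint(1, 128, (4, 10)),   # batch of 4 queries
               torch.randint(1, 128, (4, 50)))   # batch of 4 descriptions
print(scores.shape)  # torch.Size([4])
```

For the character-based model the token ids would index characters, for the word-based model they would index a word vocabulary; the network itself is unchanged.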
- Model Structure:
Same as in the character-based model, but with two LSTMs, one per input
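A minimal sketch of this two-encoder variant: the search term and the product description each get their own LSTM instead of sharing weights. The class and parameter names are illustrative assumptions, not the repo's actual DoubleLSTMSiamese code.

```python
import torch
import torch.nn as nn

class DoubleLSTMSiamese(nn.Module):
    """Separate LSTM per input branch (illustrative sizes)."""
    def __init__(self, vocab_size=128, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Two independent encoders, one per input.
        self.query_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.desc_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim * 2, 1)

    def forward(self, query_ids, desc_ids):
        _, (hq, _) = self.query_lstm(self.embed(query_ids))
        _, (hd, _) = self.desc_lstm(self.embed(desc_ids))
        return self.head(torch.cat([hq[-1], hd[-1]], dim=1)).squeeze(1)

model = DoubleLSTMSiamese()
scores = model(torch.randint(1, 128, (4, 10)), torch.randint(1, 128, (4, 50)))
print(scores.shape)  # torch.Size([4])
```

Giving each branch its own weights lets the query encoder and the description encoder specialize, at the cost of roughly doubling the recurrent parameters.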
- The project was really challenging, especially preprocessing the data. We believe our word-based model achieved the best results because of data preprocessing tuned specifically for this task. Although BART is a very strong and complex model, it was pretrained on text very different from Home Depot item descriptions, which is a possible reason it did not outperform our model. It is worth mentioning that fine-tuning the BART model was much faster than training a model from scratch, so there is always a trade-off between model quality and training time.