MIDAS Lab Task-3 NLP
- The repo works best in Google Colab.
- Notebook1 - Cleaning, EDA & Preparation for Modelling
- Notebook2 - Modelling
- Drive Folder
  - `data.csv` - Raw dataset provided.
  - `processed_data.csv` - Processed dataset generated by Notebook1.
  - `below_thresh_index.txt` - Indices of examples whose category was rare in the dataset, generated by Notebook1. More details in Notebook1.
  - `Models/Pretrained-bert` - Saved pretrained model, generated by Notebook2 if `TRAIN = True` and used for loading and inference in Notebook2.
- Random Forest Classifier, Weighted F1 Score: 0.9764
- DistilBert Uncased, Weighted F1 Score: 0.8970
- In Notebook1 (preprocessing), lemmatization with spaCy was tried for all the text features. It was dropped: given the vocabulary it changed little, and the pipeline was too slow, taking >30 min to lemmatize ~20k samples. A sketch of what was tried follows.
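A minimal sketch of that lemmatization pass, assuming the `en_core_web_sm` model and a `description` column (both illustrative, not taken from the repo):

```python
# Minimal sketch of the spaCy lemmatization that was tried and dropped.
# The model name and the "description" column are assumptions.
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(texts, batch_size=256):
    # nlp.pipe streams documents in batches -- faster than calling
    # nlp() row by row, yet still >30 min here for ~20k samples.
    return [
        " ".join(token.lemma_ for token in doc)
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]

df = pd.read_csv("data.csv")
df["description"] = lemmatize(df["description"].fillna("").tolist())
```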
- The `description` feature was more of a specification than a description with semantic sense, so `product_specifications` was deemed more useful for fine-tuning a pretrained model for sequence classification.
- Because TF-IDF was used for the first model, around 47k features were generated. SparsePCA was tried to reduce them, but Colab crashed because the dataset was too large. Since the first model was already giving a decent score, IncrementalPCA, which could have overcome the memory issue, wasn't tried (see the sketch below).
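A hedged sketch of that untried IncrementalPCA route; the text column name, `n_components`, and `batch_size` are assumptions:

```python
# Sketch of the IncrementalPCA route that wasn't tried. Unlike SparsePCA,
# IncrementalPCA densifies the TF-IDF matrix one batch at a time (sparse
# input is supported in recent scikit-learn versions), avoiding the memory
# blow-up that crashed Colab. The "text" column is an assumed stand-in
# for the concatenated text features.
import pandas as pd
from sklearn.decomposition import IncrementalPCA
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("processed_data.csv")
X = TfidfVectorizer().fit_transform(df["text"].fillna(""))  # ~47k sparse features

ipca = IncrementalPCA(n_components=300, batch_size=1024)
X_reduced = ipca.fit_transform(X)
```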
- For the pretrained model, DistilBert was used, which gave a decent score with only ~20% of the examples (full BERT ran into memory issues). Only one feature, `product_specifications`, was used for this second model, as it had a semantic order; a fine-tuning sketch follows.
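A hedged sketch of such a fine-tuning setup with the Hugging Face `transformers` Trainer API; the toy data, label count, and hyperparameters are illustrative assumptions, not the notebook's exact code:

```python
# Sketch of fine-tuning DistilBert for sequence classification.
# Toy data and hyperparameters are illustrative assumptions.
import torch
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Toy stand-ins; in Notebook2 these would come from processed_data.csv.
train_texts = ["ram 8gb ssd 512gb", "cotton t-shirt blue", "steel water bottle 1l"]
train_labels = [0, 1, 2]

num_categories = len(set(train_labels))  # one label per retained category
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_categories
)

class SpecDataset(torch.utils.data.Dataset):
    """Wraps tokenized product_specifications text plus integer labels."""

    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_dataset = SpecDataset(train_texts, train_labels)

args = TrainingArguments(
    output_dir="Models/Pretrained-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
trainer.save_model("Models/Pretrained-bert")
```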
- For both the Random Forest and the sequence-classification model, the weighted F1 score is calculated so that the imbalance of the dataset is taken care of (see the snippet below).
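For reference, the metric call in scikit-learn; the toy labels only make the snippet runnable:

```python
# Weighted F1 averages the per-class F1 scores, weighted by each
# class's support, so rare categories still count proportionally.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 2]  # toy labels; the notebooks use the test split
y_pred = [0, 1, 1, 2, 2, 0]
print(f1_score(y_true, y_pred, average="weighted"))
```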
- It is interesting to compare the predictions of both models on the discarded examples (those which did not have a target); an inference sketch follows. It is amazing what transfer learning can do with just 20 examples for each category.
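A hedged sketch of that inference step, assuming the model saved under `Models/Pretrained-bert`; the sample text is a stand-in for the target-less rows:

```python
# Load the saved model and predict categories for discarded examples.
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("Models/Pretrained-bert")
model.eval()

discarded_texts = ["leather wallet brown bifold"]  # stand-in for target-less rows

inputs = tokenizer(discarded_texts, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1).tolist()
print(predictions)
```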
- To improve performance, hyperparameter tuning can be done and the pretrained model can be trained with more data; an illustrative search is sketched below.
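One way such tuning could look for the Random Forest baseline, sketched with scikit-learn's `RandomizedSearchCV`; the parameter ranges are assumptions, not tuned values from the repo:

```python
# Illustrative hyperparameter search for the Random Forest baseline.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 20, 40],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist,
    n_iter=10,
    scoring="f1_weighted",  # matches the metric reported above
    cv=3,
)
# search.fit(X_train, y_train); print(search.best_params_)
```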