MIDAS Lab Task-3 NLP
- The repo works best in Google Colab.
- Notebook1 - Cleaning, EDA & Preparation for Modelling
- Notebook2 - Modelling
- Drive Folder
  - `data.csv` - Raw dataset provided.
  - `processed_data.csv` - Processed dataset generated by Notebook1.
  - `below_thresh_index.txt` - Indices of examples whose category was rare in the dataset, generated by Notebook1. More details in Notebook1.
  - `Models/Pretrained-bert` - Saved pretrained model, generated by Notebook2 if `TRAIN = True` and used for loading and inference in Notebook2.
- Random Forest Classifier, Weighted F1 Score: 0.9764
- DistilBert Uncased, Weighted F1 Score: 0.8970
- In Notebook1 (preprocessing), lemmatization with spaCy was tried for all the text features. It was dropped: given the vocabulary it changed little, and the pipeline was too slow, taking >30 min to lemmatize ~20k samples. A sketch of what was tried follows.
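A minimal sketch of that lemmatization pass, assuming the `en_core_web_sm` model and a `description` column (both illustrative, not taken from the repo):

```python
# Minimal sketch of the spaCy lemmatization that was tried and dropped.
# The model name and the "description" column are assumptions.
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize(texts, batch_size=256):
    # nlp.pipe streams documents in batches -- faster than calling
    # nlp() row by row, yet still >30 min here for ~20k samples.
    return [
        " ".join(token.lemma_ for token in doc)
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]

df = pd.read_csv("data.csv")
df["description"] = lemmatize(df["description"].fillna("").tolist())
```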
- The `description` feature was more of a specification than a description with semantic sense, so `product_specifications` was deemed more useful for fine-tuning a pretrained model for sequence classification.
- Because TF-IDF was used for the first model, around 47k features were generated. SparsePCA was tried to reduce them, but Colab crashed because the dataset was too large. Since the first model was already giving a decent score, IncrementalPCA, which could have overcome the memory issue, wasn't tried (see the sketch below).
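A hedged sketch of that untried IncrementalPCA route; the text column name, `n_components`, and `batch_size` are assumptions:

```python
# Sketch of the IncrementalPCA route that wasn't tried. Unlike SparsePCA,
# IncrementalPCA densifies the TF-IDF matrix one batch at a time (sparse
# input is supported in recent scikit-learn versions), avoiding the memory
# blow-up that crashed Colab. The "text" column is an assumed stand-in
# for the concatenated text features.
import pandas as pd
from sklearn.decomposition import IncrementalPCA
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.read_csv("processed_data.csv")
X = TfidfVectorizer().fit_transform(df["text"].fillna(""))  # ~47k sparse features

ipca = IncrementalPCA(n_components=300, batch_size=1024)
X_reduced = ipca.fit_transform(X)
```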
- For the pretrained model, DistilBert was used, which gave a decent score with only ~20% of the examples (full BERT ran into memory issues). Only one feature, `product_specifications`, was used for this second model, as it had a semantic order; a fine-tuning sketch follows.
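A hedged sketch of such a fine-tuning setup with the Hugging Face `transformers` Trainer API; the toy data, label count, and hyperparameters are illustrative assumptions, not the notebook's exact code:

```python
# Sketch of fine-tuning DistilBert for sequence classification.
# Toy data and hyperparameters are illustrative assumptions.
import torch
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Toy stand-ins; in Notebook2 these would come from processed_data.csv.
train_texts = ["ram 8gb ssd 512gb", "cotton t-shirt blue", "steel water bottle 1l"]
train_labels = [0, 1, 2]

num_categories = len(set(train_labels))  # one label per retained category
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_categories
)

class SpecDataset(torch.utils.data.Dataset):
    """Wraps tokenized product_specifications text plus integer labels."""

    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_dataset = SpecDataset(train_texts, train_labels)

args = TrainingArguments(
    output_dir="Models/Pretrained-bert",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
trainer.save_model("Models/Pretrained-bert")
```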
- For both the Random Forest and the sequence-classification model, the weighted F1 score is calculated so that the imbalance of the dataset is taken care of (see the snippet below).
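For reference, the metric call in scikit-learn; the toy labels only make the snippet runnable:

```python
# Weighted F1 averages the per-class F1 scores, weighted by each
# class's support, so rare categories still count proportionally.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 2]  # toy labels; the notebooks use the test split
y_pred = [0, 1, 1, 2, 2, 0]
print(f1_score(y_true, y_pred, average="weighted"))
```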
- It is interesting to compare the predictions of both models on the discarded examples (those which did not have a target); an inference sketch follows. It is amazing what transfer learning can do with just 20 examples for each category.
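A hedged sketch of that inference step, assuming the model saved under `Models/Pretrained-bert`; the sample text is a stand-in for the target-less rows:

```python
# Load the saved model and predict categories for discarded examples.
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained("Models/Pretrained-bert")
model.eval()

discarded_texts = ["leather wallet brown bifold"]  # stand-in for target-less rows

inputs = tokenizer(discarded_texts, truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1).tolist()
print(predictions)
```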
- To improve performance, hyperparameter tuning can be done and the pretrained model can be trained with more data; an illustrative search is sketched below.
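One way such tuning could look for the Random Forest baseline, sketched with scikit-learn's `RandomizedSearchCV`; the parameter ranges are assumptions, not tuned values from the repo:

```python
# Illustrative hyperparameter search for the Random Forest baseline.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 20, 40],
    "min_samples_leaf": [1, 2, 5],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist,
    n_iter=10,
    scoring="f1_weighted",  # matches the metric reported above
    cv=3,
)
# search.fit(X_train, y_train); print(search.best_params_)
```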