A multi-label text classification model that predicts relevant tags for data science questions. The model can assign any of 299 question tags from here.
Data was collected from here in two steps.
- Question URLs Scraping: The data science question URLs and their corresponding titles were scraped using this notebook. The dataset is available here.
- Question Details Scraping: For each URL listed in `ques_urls.csv`, the question details (title, URL, description, tags) were scraped with this notebook. The complete dataset is stored here.
In total, details for 24,500 data science questions were scraped.
Initially, there were 692 different question tags in the dataset. Analysis showed that 393 of them were rare, meaning they appeared in fewer than 0.1% of questions. These rare tags, along with any questions that contained only rare tags, were removed from the dataset. The final dataset had 299 different tags across 24,340 questions.
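The rare-tag filtering described above can be sketched as follows. This is a minimal illustration, not the code from the project notebooks: the function name `remove_rare_tags`, its `min_fraction` parameter, and the toy data are assumptions made for the example, and the toy run uses a much larger threshold than 0.1% so the effect is visible on four questions.

```python
from collections import Counter


def remove_rare_tags(questions, min_fraction=0.001):
    """Drop tags used in fewer than `min_fraction` of questions,
    then drop questions that are left with no tags.

    `questions` is a list of tag lists, one per question (assumed layout).
    """
    total = len(questions)
    # Count each tag at most once per question.
    counts = Counter(tag for tags in questions for tag in set(tags))
    keep = {tag for tag, n in counts.items() if n / total >= min_fraction}

    filtered = []
    for tags in questions:
        kept = [tag for tag in tags if tag in keep]
        if kept:  # questions that had only rare tags are removed
            filtered.append(kept)
    return filtered, keep


# Toy data, not taken from the real dataset.
questions = [
    ["python", "pandas"],
    ["python", "obscure-tag"],
    ["another-rare-tag"],
    ["python", "pandas"],
]
filtered, kept_tags = remove_rare_tags(questions, min_fraction=0.5)
print(sorted(kept_tags))
print(filtered)
```

In the toy run, the two tags that appear in only one of four questions are dropped, the question that carried only a rare tag disappears, and the second question keeps its remaining common tag.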
Three models were trained and later compressed using ONNX. The table below summarizes the key performance metrics for each original model and its quantized counterpart.
| Metric | distilroberta-base | distilroberta-base (quantized) | roberta-base | roberta-base (quantized) | bert-base-uncased | bert-base-uncased (quantized) |
|---|---|---|---|---|---|---|
| Precision (Micro) | 0.654 | 0.652 | 0.658 | 0.668 | 0.655 | 0.698 |
| Precision (Macro) | 0.308 | 0.291 | 0.280 | 0.289 | 0.283 | 0.223 |
| Recall (Micro) | 0.304 | 0.303 | 0.302 | 0.294 | 0.300 | 0.198 |
| Recall (Macro) | 0.151 | 0.146 | 0.136 | 0.130 | 0.139 | 0.075 |
| F1 Score (Micro) | 0.415 | 0.414 | 0.414 | 0.409 | 0.411 | 0.308 |
| F1 Score (Macro) | 0.184 | 0.179 | 0.169 | 0.163 | 0.170 | 0.103 |
| Size | 314.3 MB | 79 MB | 476.6 MB | 129.9 MB | 418.8 MB | 105.4 MB |
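The gap between the micro and macro rows in the table is typical when many tags are infrequent. The sketch below shows how the two averages differ on toy data. It is a plain-Python illustration with made-up labels and predictions, not the evaluation code used for the table, and it defines F1 as 0 for a tag with no true or predicted occurrences (an assumption of this example).

```python
def micro_macro_f1(y_true, y_pred, labels):
    """Compute micro- and macro-averaged F1 for multi-label predictions.

    `y_true` and `y_pred` are lists of tag sets, one set per question.
    """
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for true_tags, pred_tags in zip(y_true, y_pred):
        for label in labels:
            if label in pred_tags and label in true_tags:
                tp[label] += 1
            elif label in pred_tags:
                fp[label] += 1
            elif label in true_tags:
                fn[label] += 1

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all tags, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per tag, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data: the frequent tag is always found, the rarer tags are missed.
labels = ["python", "pandas", "nlp"]
y_true = [{"python", "pandas"}, {"python"}, {"nlp"}, {"python", "nlp"}]
y_pred = [{"python"}, {"python"}, set(), {"python"}]

micro, macro = micro_macro_f1(y_true, y_pred, labels)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Here the micro average is pulled up by the frequent tag, while the macro average weights every tag equally and is dragged down by the two tags the toy predictions never recover.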
The distilroberta-base model outperformed the other models, achieving 98% accuracy and a micro F1 score of 42%. It was therefore deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in deployment or here.
A Flask web app was developed and deployed to Render. It takes a data science question as input and returns the relevant tags by calling the HuggingFace API. Check out the web app here.
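Once the web app has per-tag confidence scores for a question, it has to decide which tags to display. One possible way to do that is sketched below. The response shape (a mapping from tag to confidence), the helper name `select_tags`, and the 0.5 threshold are all assumptions for illustration; the actual app and API response may differ.

```python
def select_tags(confidences, threshold=0.5):
    """Return (tag, score) pairs at or above `threshold`, highest score first.

    `confidences` maps tag name -> score in [0, 1] (hypothetical shape).
    """
    chosen = [(tag, score) for tag, score in confidences.items() if score >= threshold]
    return sorted(chosen, key=lambda pair: pair[1], reverse=True)


# Hypothetical scores for one question, not real model output.
example_scores = {"machine-learning": 0.91, "python": 0.64, "nlp": 0.12}
selected = select_tags(example_scores)
print(selected)
```

Because the task is multi-label, each tag is judged against the threshold independently, so a question can receive several tags or none.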
- Clone the repo

      git clone https://github.com/Atquiya-Labiba/MultiLabel-Data-Science-Tags-Classifier.git

- Initialize and activate virtual environment

      python -m venv venv
      venv\Scripts\activate

- Install dependencies

      pip install -r requirements.txt

- Run the scrapers

      python scraper/question_url_scraper.py
      python scraper/question_details_scraper.py
