MultiLabel Data Science Tags Classification

Overview

A multi-label text classification model that predicts relevant tags for data science questions. The model covers 299 distinct question tags collected from here.

Data Collection and Preprocessing

Data was collected from here in two steps.

Question URLs Scraping: The data science question URLs and their corresponding titles were scraped with this notebook. The dataset is available here.
Question Details Scraping: For each URL listed in ques_urls.csv, the question details (title, URL, description, tags) were scraped with this notebook. The complete dataset is stored here.

In total, details for 24,500 data science questions were scraped.

Initially, there were 692 different question tags in the dataset. Analysis found 393 rare tags, meaning tags that appear in less than 0.1% of questions. These rare tags, along with any questions tagged only with them, were removed from the dataset. The final dataset has 299 different tags across 24,340 questions.
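The filtering step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `drop_rare_tags` and the list-of-tag-lists input layout are assumptions made for the example, and only the 0.1% cutoff comes from the text.

```python
from collections import Counter


def drop_rare_tags(questions, min_fraction=0.001):
    """Remove tags that appear in fewer than `min_fraction` of questions.

    `questions` is a list of tag lists, one per question (hypothetical layout).
    Returns (filtered_questions, kept_tags).
    """
    total = len(questions)
    # Count how many questions use each tag; set() guards against repeats.
    counts = Counter(tag for tags in questions for tag in set(tags))
    kept = {tag for tag, n in counts.items() if n / total >= min_fraction}

    filtered = []
    for tags in questions:
        remaining = [t for t in tags if t in kept]
        if remaining:  # drop questions left with no tags at all
            filtered.append(remaining)
    return filtered, kept


# Toy data with an exaggerated cutoff so the effect is visible.
toy = [["python", "pandas"], ["python"], ["nlp"], ["python", "keras"]]
toy_filtered, toy_kept = drop_rare_tags(toy, min_fraction=0.5)
```

With the toy data, only `python` survives the cutoff, and the question tagged only `nlp` is dropped entirely, which mirrors how the question count fell from 24,500 to 24,340.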

Model Training and ONNX Inference

Three models (distilroberta-base, roberta-base, and bert-base-uncased) were trained and later compressed through ONNX quantization. The table below summarizes the key performance metrics and model size for both the original and quantized models.

Metric	distilroberta-base	distilroberta-base (quantized)	roberta-base	roberta-base (quantized)	bert-base-uncased	bert-base-uncased (quantized)
Precision (Micro)	0.654	0.652	0.658	0.668	0.655	0.698
Precision (Macro)	0.308	0.291	0.280	0.289	0.283	0.223
Recall (Micro)	0.304	0.303	0.302	0.294	0.300	0.198
Recall (Macro)	0.151	0.146	0.136	0.130	0.139	0.075
F1 Score (Micro)	0.415	0.414	0.414	0.409	0.411	0.308
F1 Score (Macro)	0.184	0.179	0.169	0.163	0.170	0.103
Size	314.3 MB	79 MB	476.6 MB	129.9 MB	418.8 MB	105.4 MB
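To make the micro and macro rows easier to read, here is a small sketch of how the two averages differ. It is an illustration only: the per-tag counts are invented, and the helper names `f1` and `micro_macro_f1` are hypothetical rather than taken from the repository. Micro averaging pools true and false positives over all tags, so frequent tags dominate; macro averaging computes F1 per tag and then averages, so tags the model never predicts pull the score down. With 299 tags of very uneven frequency, that is one plausible reason the macro scores sit well below the micro scores.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_macro_f1(per_tag_counts):
    """`per_tag_counts` maps tag -> (true_pos, false_pos, false_neg)."""

    def tag_f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return f1(p, r)

    # Micro: pool the counts over every tag, then compute one score.
    tp = sum(c[0] for c in per_tag_counts.values())
    fp = sum(c[1] for c in per_tag_counts.values())
    fn = sum(c[2] for c in per_tag_counts.values())
    micro = tag_f1(tp, fp, fn)

    # Macro: compute one score per tag, then take the plain average.
    macro = sum(tag_f1(*c) for c in per_tag_counts.values()) / len(per_tag_counts)
    return micro, macro


# Invented counts: one frequent tag predicted well, one rare tag never predicted.
toy_counts = {"python": (90, 10, 10), "rare-tag": (0, 0, 5)}
toy_micro, toy_macro = micro_macro_f1(toy_counts)

# Sanity check against the table: micro F1 follows from micro precision and recall.
distilroberta_micro_f1 = round(f1(0.654, 0.304), 3)
```

The last line reproduces the reported distilroberta-base micro F1 from its micro precision and recall. The macro F1 values in the table are not the harmonic mean of the macro precision and recall rows, which is consistent with macro F1 being averaged per tag.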

Model Deployment

The distilroberta-base model achieved the highest F1 scores of all the models compared (micro 0.415, macro 0.184), along with 98% accuracy. It was therefore deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in deployment or here.
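As a rough sketch of how a multi-label classifier turns raw model outputs into tags: each tag usually gets an independent sigmoid probability, and every tag above a cutoff is returned. The source does not say which cutoff or post-processing the deployed app uses, so the 0.5 threshold, the function name `predict_tags`, and the example labels below are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_tags(logits, labels, threshold=0.5):
    """Return (tag, probability) pairs at or above `threshold`, best first.

    Assumes one logit per tag, in the same order as `labels`.
    """
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    picked = [(label, p) for label, p in scored if p >= threshold]
    return sorted(picked, key=lambda pair: pair[1], reverse=True)


# Hypothetical labels and logits for a single question.
example = predict_tags([2.0, -1.0, 0.0], ["python", "nlp", "keras"])
example_tags = [tag for tag, _ in example]
```

Lowering the threshold generally returns more tags per question, trading precision for recall. That trade-off is relevant here because the table shows recall well below precision for every model.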

🌐 Web Deployment

A Flask web app was developed and deployed to Render. It takes a data science question as input and predicts the relevant tags for it via the HuggingFace API. Check out the web app here.

Build from Source

Clone the repo

git clone https://github.com/Atquiya-Labiba/MultiLabel-Data-Science-Tags-Classifier.git

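Initialize and activate virtual environment (the activation command below is for Windows; on macOS or Linux, use source venv/bin/activate instead)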

python -m venv venv
venv\Scripts\activate

Install dependencies

pip install -r requirements.txt

Run the Selenium Scraper

python scraper/question_url_scraper.py

Scrape Question Details

python scraper/question_details_scraper.py

Contact

Email Me
