The goal is to predict as many tags as possible with high precision and recall.
The dataset was obtained from Kaggle. This is a multi-label classification problem. The dataset contains the fields Id, Title, Body and Tags. Preprocessing and cleaning removed HTML tags and hyperlinks. Micro-averaged F1 score was used as the performance metric, as specified on Kaggle.
Data: https://www.kaggle.com/c/facebook-recruiting-iii-keyword-extraction/data
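
The cleaning step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the exact code used in this case study. The function name `clean_question`, the regular expressions, and the tiny stopword set are all assumptions made for the example.

```python
import html
import re

# Tiny illustrative stopword set (assumption: the real project likely used a
# much larger list, which is not specified in this write-up).
STOPWORDS = {"a", "an", "and", "but", "for", "how", "i", "in", "is", "of", "the", "to"}

# Code snippets wrapped in <pre>...</pre> or <code>...</code>.
CODE_RE = re.compile(r"<pre\b[^>]*>.*?</pre>|<code\b[^>]*>.*?</code>",
                     re.DOTALL | re.IGNORECASE)
# Any remaining HTML tag.
TAG_RE = re.compile(r"<[^>]+>")
# Plain-text hyperlinks.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Tokens start with a letter or digit and may contain '#' or '+',
# so tag-like words such as "c#" and "c++" survive.
TOKEN_RE = re.compile(r"[a-z0-9][a-z0-9#+]*")


def clean_question(title: str, body: str) -> str:
    """Combine title and body into one cleaned, lowercased string."""
    body = CODE_RE.sub(" ", body)        # drop code blocks from the body
    text = f"{title} {body}"             # the combined "question" text
    text = TAG_RE.sub(" ", text)         # strip HTML tags
    text = html.unescape(text)           # decode entities such as &amp;
    text = URL_RE.sub(" ", text)         # strip hyperlinks
    tokens = TOKEN_RE.findall(text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)
```

For example, `clean_question("How to parse JSON in C#?", "<p>See http://example.com/docs</p>")` keeps the words `parse`, `json`, `c#` and `see`, and drops the markup, the link and the stopwords.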
- As part of feature engineering, a new feature named question was created by combining title + body.
- Code, HTML tags and stopwords were removed from the body as part of data cleaning.
- The objective of this case study was to suggest tags based on the content of the question posted on Stack Overflow.
- The given dataset contains 6M data points in the training set, with Id, Title, Body and Tags as fields.
- EDA was done on the tags, and "c#", "java", "php", "asp.net", "javascript" and "c++" were found to be among the most frequent tags.
- On average, there were 2.88 tags per question.
- Only 5500 tags were considered, which cover 99.04% of the questions.
- Various machine learning models were tried with a one-vs-rest (OvR) classifier to get the best results.
- Logistic regression with TF-IDF features gave the best accuracy of 0.236 when trained on 1M data points.
- Model accuracy degraded as the number of training data points was reduced, which is expected.
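
Two of the quantities mentioned above, tag coverage and the micro-averaged F1 score, can be illustrated with a small pure-Python sketch. It rests on two assumptions that this write-up does not confirm: the reduced tag set is the `k` most frequent tags, and a question counts as covered when it keeps at least one tag. The helper names `top_tags`, `coverage` and `micro_f1` are hypothetical.

```python
from collections import Counter


def top_tags(tag_lists, k):
    """Return the set of the k most frequent tags (assumed selection rule)."""
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    return {tag for tag, _ in counts.most_common(k)}


def coverage(tag_lists, kept):
    """Fraction of questions that keep at least one tag from `kept`."""
    covered = sum(1 for tags in tag_lists if kept.intersection(tags))
    return covered / len(tag_lists)


def micro_f1(true_lists, pred_lists):
    """Micro-averaged F1: pool TP/FP/FN over all questions and all tags."""
    tp = fp = fn = 0
    for true, pred in zip(true_lists, pred_lists):
        true, pred = set(true), set(pred)
        tp += len(true & pred)   # tags predicted and correct
        fp += len(pred - true)   # tags predicted but wrong
        fn += len(true - pred)   # tags missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
questions = [["c#", "asp.net"], ["java"], ["c#"], ["java", "android"], ["haskell"]]
kept = top_tags(questions, k=2)
predictions = [["c#"], ["java"], ["c#", "java"], ["java"], []]
print(sorted(kept), coverage(questions, kept), micro_f1(questions, predictions))
```

Because micro-averaging pools counts across all tags, frequent tags dominate the score, which is one reason restricting the label set to a few thousand tags costs little on this metric.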