This repository contains the implementation of a machine learning model for Tag Prediction on Stack Overflow questions. The goal is to suggest relevant tags based on the content of the questions posted on the platform, helping users to categorize their questions effectively.
Stack Overflow is the largest and most trusted online community for developers to learn, share knowledge, and build their careers. It features questions and answers on a wide range of programming topics. Each month, over 50 million developers visit the platform to find solutions to their programming challenges.
Tags play a crucial role on Stack Overflow, as they help categorize questions and make them more accessible to others with relevant expertise. However, it can be difficult for users to always select the appropriate tags. This project aims to build a model that can predict the relevant tags based on the content of the questions.
The task is to predict the tags associated with a question based on its content. These tags help in organizing and categorizing the questions, making it easier for users to find and answer them.
- Source: Kaggle Competition
- Data Source: Kaggle Competition Dataset
- YouTube Video: Video on Tag Prediction
- Research Paper 1: Tagging and Keyword Extraction
- Research Paper 2: Paper on Tagging Techniques
- Objective: Predict as many relevant tags as possible for a given question with high precision and recall.
- Constraints:
  - Incorrect tag predictions can negatively impact user experience on Stack Overflow.
  - There are no strict latency constraints, so predictions do not need to be made in real time.
  - Balancing precision and recall is crucial, so that relevant tags are suggested without overloading the user with incorrect ones.
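
To make the precision/recall trade-off above concrete, here is a minimal sketch of micro-averaged precision, recall, and F1 for multilabel tag predictions. The choice of micro-averaging, the function name `micro_precision_recall_f1`, and the sample tags are assumptions made for illustration; this README does not state which metric the project uses.

```python
from typing import List, Set, Tuple


def micro_precision_recall_f1(
    true_tags: List[Set[str]], predicted_tags: List[Set[str]]
) -> Tuple[float, float, float]:
    """Pool every (question, tag) decision, then compute precision, recall, F1."""
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, predicted_tags):
        tp += len(truth & pred)   # tags predicted and correct
        fp += len(pred - truth)   # tags predicted but wrong
        fn += len(truth - pred)   # correct tags that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: two questions, two predicted tags each.
true_tags = [{"python", "pandas"}, {"java"}]
predicted_tags = [{"python", "numpy"}, {"java", "spring"}]
precision, recall, f1 = micro_precision_recall_f1(true_tags, predicted_tags)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Predicting more tags per question tends to raise recall and lower precision, which is why a single score that combines the two is convenient here.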
pip install -r requirements.txt
3. Ensure that you have downloaded the dataset from Kaggle and placed it in the appropriate directory.
- Sample 1M data points.
- Separate code snippets from the question body.
- Remove special characters from the question title and description (not from the code).
- Remove stop words (except 'C').
- Remove HTML tags.
- Convert all characters to lowercase.
- Use SnowballStemmer to stem the words.
- Convert the tags into a multilabel target.
- Keep only a subset of about 5k tags to make the problem tractable.
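
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the function names `preprocess_question` and `encode_tags`, the tiny `STOP_WORDS` set, the `<code>...</code>` convention for snippets, the space-separated tag strings, and the rule that drops single-letter tokens other than `c` are all assumptions. The stemmer is passed in as a callable so the sketch runs with the standard library alone.

```python
import html
import re
from collections import Counter
from typing import Callable, List, Tuple

# Illustrative stop-word list only (assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "how", "i", "am",
              "with", "but", "it", "of", "and"}

CODE_RE = re.compile(r"<code>(.*?)</code>", flags=re.DOTALL | re.IGNORECASE)
HTML_TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^a-z]+")


def preprocess_question(
    title: str, body: str, stem: Callable[[str], str] = lambda word: word
) -> Tuple[str, str]:
    """Return (cleaned title + description text, extracted code)."""
    # Separate code snippets from the body (assumes <code>...</code> markup).
    code = "\n".join(CODE_RE.findall(body))
    text = CODE_RE.sub(" ", body)
    # Remove HTML tags and decode entities such as &amp;.
    text = html.unescape(HTML_TAG_RE.sub(" ", text))
    # Lowercase, then replace anything that is not a letter with a space.
    text = NON_ALPHA_RE.sub(" ", f"{title} {text}".lower())
    # Drop stop words and stray single letters, but always keep 'c'.
    words = [
        stem(word)
        for word in text.split()
        if word == "c" or (word not in STOP_WORDS and len(word) > 1)
    ]
    return " ".join(words), code


def encode_tags(
    tag_strings: List[str], max_tags: int = 5000
) -> Tuple[List[str], List[List[int]]]:
    """Keep the most frequent tags and build one binary row per question."""
    tag_lists = [tags.split() for tags in tag_strings]
    counts = Counter(tag for tags in tag_lists for tag in tags)
    vocab = [tag for tag, _ in counts.most_common(max_tags)]
    index = {tag: i for i, tag in enumerate(vocab)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(vocab)
        for tag in tags:
            if tag in index:  # tags outside the kept subset are ignored
                row[index[tag]] = 1
        rows.append(row)
    return vocab, rows


# Hypothetical sample question and tags.
cleaned, code = preprocess_question(
    "How to read a file in C?",
    '<p>I am reading the file with <code>fopen("a.txt", "r")</code> but it fails.</p>',
)
vocab, rows = encode_tags(["c file-io", "python c", "c file-io pointers"], max_tags=2)
print(cleaned)
print(vocab, rows)
```

To apply stemming as listed above, pass `SnowballStemmer("english").stem` from NLTK as the `stem` argument. The dense lists returned by `encode_tags` are only practical for small samples; for 1M questions and about 5k tags, a sparse representation (for example scikit-learn's `MultiLabelBinarizer` with `sparse_output=True`) would likely be a better fit.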
