Stack-Overflow-Tag-Prediction

This repository contains the implementation of a machine learning model for Tag Prediction on Stack Overflow questions. The goal is to suggest relevant tags based on the content of the questions posted on the platform, helping users to categorize their questions effectively.

1. Business Problem

1.1 Description

Stack Overflow is the largest and most trusted online community for developers to learn, share knowledge, and build their careers. It features questions and answers on a wide range of programming topics. Each month, over 50 million developers visit the platform to find solutions to their programming challenges.

Tags play a crucial role on Stack Overflow, as they help categorize questions and make them more accessible to others with relevant expertise. However, it can be difficult for users to always select the appropriate tags. This project aims to build a model that can predict the relevant tags based on the content of the questions.

Problem Statement

The task is to predict the tags associated with a question based on its content. These tags help in organizing and categorizing the questions, making it easier for users to find and answer them.

Source: Kaggle Competition

1.2 Source / Useful Links

Data Source: Kaggle Competition Dataset
YouTube Video: Video on Tag Prediction
Research Paper 1: Tagging and Keyword Extraction
Research Paper 2: Paper on Tagging Techniques

1.3 Real World / Business Objectives and Constraints

Objective: Predict as many relevant tags as possible for a given question with high precision and recall.
Constraints:
- Incorrect tag predictions can negatively impact user experience on Stack Overflow.
- There are no strict latency constraints, so predictions do not need to be produced in real time.
- Achieving a balance between precision and recall is crucial to ensure that relevant tags are suggested without overloading the user with incorrect ones.
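
To make the precision/recall trade-off above concrete, here is a minimal sketch of micro-averaged precision, recall, and F1 for multilabel tag predictions. The function name `micro_prf` and the sample tags are hypothetical, and this README does not state which averaging scheme the notebook uses, so treat micro-averaging as an assumption.

```python
def micro_prf(true_tags, pred_tags):
    """Micro-averaged precision, recall and F1 over all (question, tag) pairs."""
    tp = fp = fn = 0
    for truth, pred in zip(true_tags, pred_tags):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # tags predicted and actually present
        fp += len(pred - truth)   # tags predicted but not present
        fn += len(truth - pred)   # tags present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: two questions, one missed tag and one spurious tag.
true_tags = [{"c", "pointers"}, {"python"}]
pred_tags = [{"c"}, {"python", "pandas"}]
precision, recall, f1 = micro_prf(true_tags, pred_tags)
```

Predicting more tags per question tends to raise recall and lower precision, which is why the two are tracked together.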

2. Install the required dependencies:

pip install -r requirements.txt

3. Ensure that you have downloaded the dataset from Kaggle and placed it in the appropriate directory.

4. Data Preprocessing

- Sample 1M data points.
- Separate code snippets from the question body.
- Remove special characters from the question title and description (but not from the code).
- Remove stop words (except 'C').
- Remove HTML tags.
- Convert all characters to lowercase.
- Stem the words with SnowballStemmer.
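
The cleaning steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the `preprocess` function, the tiny `STOP_WORDS` set, and the regular expressions are hypothetical, code snippets are assumed to sit inside `<code>` elements, and "except 'C'" is read here as "drop single-letter tokens other than `c`". The stemmer is passed in as a callable so the sketch runs without extra packages; with NLTK installed, `SnowballStemmer("english").stem` can be supplied.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a full list.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "and", "how", "do", "i"}

CODE_RE = re.compile(r"<code>(.*?)</code>", flags=re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
NON_ALPHA_RE = re.compile(r"[^A-Za-z]+")


def preprocess(title, body, stem=lambda word: word):
    """Return (cleaned_text, code_snippets) for one question."""
    # Set the code snippets aside so later cleaning does not touch them.
    code = CODE_RE.findall(body)
    body = CODE_RE.sub(" ", body)
    # Strip HTML tags, then anything that is not a letter.
    text = TAG_RE.sub(" ", f"{title} {body}")
    text = NON_ALPHA_RE.sub(" ", text)
    # Lowercase, drop stop words and single letters, but keep the token 'c'.
    words = [
        w for w in text.lower().split()
        if w not in STOP_WORDS and (len(w) > 1 or w == "c")
    ]
    # Stem whatever is left (identity by default).
    return " ".join(stem(w) for w in words), code


# Hypothetical question used only for illustration.
cleaned, snippets = preprocess(
    "How to read a file in C?",
    '<p>I tried <b>this</b>:</p><pre><code>fopen("x", "r");</code></pre>',
)
```

To use the stemmer named in the list, pass it explicitly, for example `preprocess(title, body, stem=SnowballStemmer("english").stem)` after `from nltk.stem import SnowballStemmer`.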

5. Machine Learning Models

Convert the tags into a binary label matrix so the task can be treated as a multilabel classification problem.
- Use only a subset of about 5k tags to keep the problem tractable.
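
A minimal sketch of that conversion, assuming each question's tags arrive as a single space-separated string and that the subset is chosen by frequency (neither detail is stated in this README). The helper `binarize_tags` is hypothetical; scikit-learn's `MultiLabelBinarizer` could play the same role in a real pipeline.

```python
from collections import Counter


def binarize_tags(tag_strings, max_tags):
    """Keep the `max_tags` most frequent tags and build one 0/1 row per question."""
    tag_lists = [s.split() for s in tag_strings]
    counts = Counter(tag for tags in tag_lists for tag in tags)
    vocab = [tag for tag, _ in counts.most_common(max_tags)]
    index = {tag: i for i, tag in enumerate(vocab)}
    rows = []
    for tags in tag_lists:
        row = [0] * len(vocab)
        for tag in tags:
            if tag in index:          # tags outside the subset are dropped
                row[index[tag]] = 1
        rows.append(row)
    return vocab, rows


# Hypothetical toy data; the project itself would use max_tags of about 5000.
vocab, rows = binarize_tags(["c pointers", "python pandas", "c python", "java"], max_tags=2)

# Fraction of questions that still have at least one tag after the cut.
coverage = sum(1 for row in rows if any(row)) / len(rows)
```

Checking `coverage` for a few values of `max_tags` shows how many questions lose all of their tags when the tag set is truncated.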

Repository: Arjun1425/Stack-Overflow-Tag-Prediction
