GitHub - QED0711/stack_overflow_nlp: A tag classification data science project using NLP and Stack Overflow posts

Stack Overflow Tag Predictor

Classifying Posts Using NLP

See here for the companion web app

Authors:

Summary

Using raw text data retrieved from Stack Overflow posts, we predict the main programming language tag for each post.

We begin by performing natural language processing (NLP) with the NLTK library to extract feature data from the raw posts. We then train several machine learning models and measure the accuracy of each.
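To make the feature-extraction step concrete, here is a minimal sketch of turning one raw post into word counts. It uses only the Python standard library as a stand-in for NLTK, and it is not the project's own preprocessing code: the names `preprocess`, `bag_of_words`, `STOP_WORDS`, and `TOKEN_PATTERN`, along with the tiny stop-word list, are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); NLTK ships a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "in", "to", "of", "and", "i", "how", "do", "my"}

# Word-like tokens, keeping trailing "+" or "#" so "c++" and "c#" survive.
TOKEN_PATTERN = re.compile(r"[a-z][a-z0-9_]*[+#]*")


def preprocess(post: str) -> list[str]:
    """Lowercase a raw post, split it into tokens, and drop stop words."""
    tokens = TOKEN_PATTERN.findall(post.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]


def bag_of_words(tokens: list[str]) -> dict[str, int]:
    """Count how often each token occurs (a simple bag-of-words feature)."""
    return dict(Counter(tokens))
```

For example, `preprocess("Linking error in C++")` keeps `c++` as a single token, which matters when the target labels are themselves language names.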

Our top three models were logistic regression, multinomial naive Bayes, and a random forest classifier. Each produced an accuracy score of around 80%. Combining all three models in a majority vote, we reached about 83% accuracy.
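The majority vote can be sketched in a few lines of plain Python. This is an illustration rather than the project's pipeline code: the function names and the tie-breaking rule (ties go to the first model in the list) are assumptions, and each model is represented only by its list of predicted labels.

```python
from collections import Counter


def majority_vote(votes: list[str]) -> str:
    """Return the most common label; ties go to the earliest vote (assumed rule)."""
    counts = Counter(votes)
    best = max(counts.values())
    for label in votes:
        if counts[label] == best:
            return label


def ensemble_predict(per_model_predictions: list[list[str]]) -> list[str]:
    """Combine aligned per-model predictions into one label per post."""
    return [majority_vote(list(votes)) for votes in zip(*per_model_predictions)]


def accuracy(predicted: list[str], actual: list[str]) -> float:
    """Fraction of posts whose predicted label matches the true label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

With three models, a label wins outright whenever at least two models agree. The tie-break only matters when all three disagree, in which case this sketch falls back to whichever model is listed first.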

As a secondary analysis, we attempted topic clustering on the processed dataset. The results of this clustering analysis were inconclusive.

Conclusion

Our final conclusion is that, while we can predict a post's language with relatively good accuracy, topics within and across languages are numerous, share many common words, and are difficult to distinguish.

If you would like to see the final model (logistic regression, 81% accuracy) in action, see our companion web app for this project.

For a visual slide deck summary, see here

Dataset

All data was retrieved directly from Stack Overflow using Google BigQuery.

We limited our dataset to a little over 32,000 unique posts across five of the most popular programming language categories:

Java | C# | JavaScript | Python | C++
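One way a single language label could be derived from a post's tags is sketched below. It assumes tags arrive as a pipe-separated string (for example `python|pandas`) and that posts carrying zero or several of the five target languages are skipped. Both assumptions, and the name `main_language`, are illustrative and may not match how the dataset was actually built.

```python
# The five target categories, lowercased to match Stack Overflow tag names.
LANGUAGES = ("java", "c#", "javascript", "python", "c++")


def main_language(tags: str):
    """Return the one target language in a pipe-separated tag string, else None."""
    matches = [tag for tag in tags.lower().split("|") if tag in LANGUAGES]
    # Ambiguous posts (no target language, or more than one) get no label.
    return matches[0] if len(matches) == 1 else None
```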

File Structure

Final Analysis:

Our final, high-level analysis can be found in:

/notebooks/Stack_Overflow_NLP_Summary_Notebook.ipynb

Cleaned Dataset:

The dataset we used in our final analysis can be found in:

/data/final/text_target.pkl
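A minimal sketch for loading a pickled file with the standard library follows. The helper name `load_pickle` is an assumption, and the structure of the stored object is not documented here, so inspect it after loading. If the file holds a pandas object, pandas must be installed for unpickling to succeed.

```python
import pickle
from pathlib import Path


def load_pickle(path):
    """Load a pickled Python object from disk. Only unpickle files you trust."""
    with Path(path).open("rb") as handle:
        return pickle.load(handle)
```

From the repository root, the call would look like `data = load_pickle("data/final/text_target.pkl")`, followed by `type(data)` to see what was stored.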

Primary Classes and Functions

We wrote custom classes and helper functions to handle text preprocessing/NLP and the formation and evaluation of our model pipelines. The code for those classes can be found in the respective folders listed below:

A notebook demonstrating the use of each class can be found in:

/notebooks/class_demonstration.ipynb

Final Report (PDF):

A PDF version of the final report can be found in:

/data/reports/Stack_Overflow_Tag_Predictor.pdf

Acknowledgements

In doing research for this project, we found the following articles very helpful:

Topic Modeling and Latent Dirichlet Allocation (LDA) in Python
A basic exploration of and tutorial for LDA in Python

Gensim Tutorial – A Complete Beginners Guide
A guide to text preprocessing and analysis using the Gensim library
