Text Classification

QUALITY CLASSIFICATION OF QUESTIONS ON STACK OVERFLOW

Ho Thanh Duy Khanh†, Bui Nguyen Phuong Linh†, Nguyen Thi Nguyet† and Nguyen Thi Phuong Thao†

† VNUHCM - University of Information Technology, Viet Nam.

Contributing authors: [20521445, 20521527, 20521689, 20521936]@gm.uit.edu.vn

Abstract

Community Question Answering (CQA) is the field of computational linguistics that deals with problems arising from the questions and answers posted to websites, and it is growing in popularity as a way of providing and researching information. Crowdsourced knowledge is a valuable resource for users, yet it raises concerns about the quality of the shared content. Because recognizing good questions can improve CQA services and the user experience, this study focuses on question quality. Using a dataset of questions and answers posted to the Stack Overflow website, we analyze the data and classify questions by quality. In addition to neural deep learning models for natural language processing (LSTM, Bi-LSTM, and Distil-BERT), we apply classical machine learning classifiers: Logistic Regression, Multinomial Naïve Bayes, Decision Tree, and Random Forest. We then compare all the models to identify the one that best classifies question quality. In our initial results, the Distil-BERT model achieved the highest accuracy, 91.80%.

Deep learning and machine learning methods

Deep learning: LSTM, Bi-LSTM, and Distil-BERT.
Machine learning: Logistic Regression, Multinomial Naïve Bayes, Decision Tree, and Random Forest.
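
To make the classical side of this list concrete, here is a minimal sketch of a bag-of-words Multinomial Naïve Bayes classifier in plain Python. It is not the code from main.ipynb: the class name `MultinomialNaiveBayes`, the `tokenize` helper, the add-one smoothing default, and the four toy questions are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class MultinomialNaiveBayes:
    """Bag-of-words Multinomial Naive Bayes with add-alpha smoothing (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # label -> number of documents
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class for one text."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token in self.vocab:  # skip tokens never seen in training
                    score += math.log((counts[token] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for this sketch (not taken from the dataset).
train_texts = [
    "How does Python resolve attribute lookup on metaclasses",
    "Why is iterating a sorted array faster than an unsorted one",
    "plz help my code not working urgent",
    "do my homework code plz urgent",
]
train_labels = ["HQ", "HQ", "LQ CLOSE", "LQ CLOSE"]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
print(model.predict("urgent plz fix my code"))
```

On real data, the same idea would normally be run through a library implementation with a larger vocabulary and a held-out validation split; the hand-written version above only shows the counting and scoring steps.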

Dataset

Link: Kaggle

The dataset consists of 60,000 samples collected from the Stack Overflow website. The questions were asked between January 1st, 2016 and January 1st, 2020. The dataset includes 2 data files: train.csv (45,000 samples) and valid.csv (15,000 samples). Each sample has 6 features: a unique question ID, the question title, the main body (content) of the question, tags representing the important words (keywords) in the question, the creation date of the question, and the class/label of the question. The label takes one of 3 classes:

High-Quality (HQ): questions that receive a score of more than 30 from the community and are not edited a single time by anyone.
Low-Quality Edited (LQ EDIT): questions that receive a negative score and multiple edits from the community.
Low-Quality Closed (LQ CLOSE): questions that were immediately closed by the community due to their extremely poor quality.

The questions are sorted by question ID. The main content (body) of each question is in HTML format, and the dates are in UTC.
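
Since the question bodies are stored as HTML, they need to be reduced to plain text before tokenization. The sketch below shows one way to do that with the Python standard library. The column names (`Title`, `Body`, `Y`), the underscore form of the label strings, the integer label mapping, and the one-row sample are assumptions for illustration and should be checked against the actual CSV header.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Assumed label strings and an arbitrary integer encoding (not from the source).
LABEL_TO_ID = {"HQ": 0, "LQ_EDIT": 1, "LQ_CLOSE": 2}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(body):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()  # flush any text still buffered by the parser
    # Joining with a space keeps words in adjacent tags from being glued together.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def load_examples(csv_file):
    """Read rows from an open CSV file into (text, label_id) pairs."""
    examples = []
    for row in csv.DictReader(csv_file):
        text = row["Title"] + " " + strip_html(row["Body"])
        label = row["Y"].strip().replace(" ", "_")  # accept "LQ EDIT" or "LQ_EDIT"
        examples.append((text, LABEL_TO_ID[label]))
    return examples


# One made-up row in the assumed column layout, used instead of the real file.
sample = io.StringIO(
    "Id,Title,Body,Tags,CreationDate,Y\n"
    '1,Compare values,"<p>Why does <code>x &lt; y</code> fail?</p>",'
    "<python>,2016-01-01 00:00:00,HQ\n"
)
examples = load_examples(sample)
print(examples)
```

For the real files, `load_examples(open("train.csv", newline="", encoding="utf-8"))` would be the equivalent call, assuming the header matches the column names used here.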
