Moon2909/TextClassification

Text Classification

QUALITY CLASSIFICATION OF QUESTIONS ON STACK OVERFLOW

Ho Thanh Duy Khanh†, Bui Nguyen Phuong Linh†, Nguyen Thi Nguyet† and Nguyen Thi Phuong Thao†

VNUHCM - University of Information Technology, Viet Nam.

Contributing authors: [20521445, 20521527, 20521689, 20521936]@gm.uit.edu.vn

Abstract

Community Question Answering (CQA) is the field of computational linguistics that deals with problems arising from the questions and answers posted to websites, and it is growing in popularity as a way of providing and researching information. Crowdsourced knowledge is a valuable resource for users, yet it raises concerns about the quality of the shared content. Because recognizing good questions can improve CQA services and the user experience, this study focuses on question quality. Using a dataset of questions and answers posted to the Stack Overflow website, we analyze the questions and classify their quality. In addition to Deep Learning models with strong natural language processing capabilities, such as LSTM, Bi-LSTM, and Distil-BERT, we apply Machine Learning classification models: Logistic Regression, Multinomial Naïve Bayes, Decision Tree, and Random Forest. We then compare all the models and select the best one for classifying question quality. In our initial results, the Distil-BERT model achieved the highest accuracy, 91.80%.

Deep learning and machine learning methods

  • Deep learning: LSTM, Bi-LSTM, and Distil-BERT.
  • Machine learning: Logistic Regression, Multinomial Naïve Bayes, Decision Tree, Random Forest.
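
To make one of the listed baselines concrete, here is a minimal from-scratch sketch of a Multinomial Naïve Bayes text classifier with Laplace smoothing. It is not the repository's implementation (which is not shown here); the class name `TinyMultinomialNB`, the whitespace tokenizer, and the two toy training sentences are all hypothetical and only illustrate the technique.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Illustrative Multinomial Naive Bayes over whitespace tokens (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace (add-alpha) smoothing strength

    def fit(self, docs, labels):
        # Count documents per class and word occurrences per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_label / n_docs)
            for word in doc.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][word]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up examples (not taken from the dataset).
model = TinyMultinomialNB().fit(
    ["how to reverse a linked list in place", "plz help my code not working urgent"],
    ["HQ", "LQ_CLOSE"],
)
print(model.predict("help code not working"))
```

In practice a library implementation with TF-IDF or count features would replace this sketch; the point here is only how class priors and smoothed word likelihoods combine into a prediction.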

Dataset

Link: Kaggle

The dataset consists of 60,000 samples collected from the Stack Overflow website. The questions were asked between January 1st, 2016 and January 1st, 2020. The dataset comprises 2 data files: train.csv (45,000 samples) and valid.csv (15,000 samples). Each sample has 6 features: a unique question ID, the question title, the main body of the question, tags representing the question's keywords, the creation date of the question, and the class label. The label takes one of 3 classes:

  • High-Quality (HQ): questions that receive a score of more than 30 from the community and have never been edited by anyone.
  • Low-Quality Edited (LQ EDIT): questions that receive a negative score and multiple edits from the community.
  • Low-Quality Closed (LQ CLOSE): questions that were immediately closed by the community due to their extremely poor quality.

The questions are sorted by question ID. The main content of each question is in HTML format, and the dates are in UTC.
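
Because the question bodies are stored as HTML, a cleaning step is usually needed before any model sees the text. The sketch below uses only the Python standard library to strip tags and join the title with the body. The column names `Title`, `Body`, and `Y`, the label spelling `HQ`, and the helper names `strip_html` and `load_questions` are assumptions for illustration; check them against the actual CSV header.

```python
import csv
import io
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the text of an HTML fragment with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def load_questions(csv_file):
    """Yield (text, label) pairs from an open CSV file (column names are assumed)."""
    for row in csv.DictReader(csv_file):
        text = f"{row['Title']} {strip_html(row['Body'])}"
        yield text, row["Y"]


# Made-up single-row sample standing in for train.csv.
sample = io.StringIO(
    "Id,Title,Body,Tags,CreationDate,Y\n"
    '1,How to sort a list?,"<p>I have a <code>list</code> &amp; want it sorted.</p>",'
    "<python>,2016-01-01 00:00:00,HQ\n"
)
rows = list(load_questions(sample))
print(rows)
```

For the real files, `open("train.csv", newline="", encoding="utf-8")` could be passed in place of the in-memory sample.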

Pipeline

[Figure: pipeline]

Result

[Figure: result]
