A binary classifier that distinguishes comments from non-comments.
Python 2 is used, but switching to Python 3 is possible with minor changes.
Libraries in use:
Put the data files in the data/ folder. Each data file should be in CSV format, with each line containing an index number and a comment or non-comment.
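As a rough illustration of the expected layout, the sketch below parses rows with two columns, the index followed by the text. The column order, the absence of a header row, the sample rows, and the helper name `read_rows` are assumptions made for this example and are not confirmed by the project. It is written for Python 3.

```python
import csv
import io

# Hypothetical sample in the assumed layout: index, then the text chunk.
SAMPLE = '0,"Great article, thanks!"\n1,Reply Share Report\n'


def read_rows(handle):
    """Return (index, text) pairs from a CSV file-like object."""
    rows = []
    for record in csv.reader(handle):
        if len(record) < 2:
            continue  # skip blank or malformed lines
        rows.append((int(record[0]), record[1]))
    return rows


# An in-memory file keeps the sketch self-contained.
rows = read_rows(io.StringIO(SAMPLE))
```

For a real file, the same helper could be given a handle opened with `open(path, newline="")`.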
Run main.py to get the classification model and see its performance.
Save the model as a pkl file for later use (optional).
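Saving and reloading a model as a pkl file can be sketched with the standard `pickle` module. The `model` object here is only a placeholder dictionary standing in for the trained classifier, and the file name `model.pkl` and its temporary location are assumptions; the project may store the model differently.

```python
import os
import pickle
import tempfile

# Placeholder standing in for the trained classifier (assumption);
# any picklable object is saved the same way.
model = {"name": "placeholder-model", "n_trees": 10}

# Hypothetical output path inside a fresh temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Write the model to disk in binary mode.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back later without retraining.
with open(model_path, "rb") as f:
    restored = pickle.load(f)
```

Only pkl files from a trusted source should be loaded, since unpickling can execute arbitrary code.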
For any given article, the text chunks in the comment section are passed to a binary classifier that distinguishes comment from non-comment data, so that the actual comments can be extracted. Before classification, the text chunks are pre-processed in two steps. First, they are tokenized and punctuation is removed. Second, domain-specific stop words are removed; these are words that appear frequently in comment sections. After preprocessing, a trained random forest classifier is used to retrieve the comment text along with its associated metadata.
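The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the project's actual code: the stop word list, the regular-expression tokenizer, the lowercasing step, and the function name `preprocess` are all assumptions.

```python
import re

# Hypothetical domain-specific stop words: terms that tend to appear
# in comment-section widgets rather than in the comments themselves.
STOP_WORDS = {"reply", "share", "like", "report", "comments"}


def preprocess(text):
    """Tokenize, drop punctuation, and remove domain-specific stop words."""
    # Step 1: lowercase (an assumption) and keep only runs of letters,
    # digits, and apostrophes, which discards punctuation.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Step 2: remove the domain-specific stop words.
    return [t for t in tokens if t not in STOP_WORDS]


tokens = preprocess("Reply: I really like this article! Share")
```

The resulting token lists would then be converted into numeric features before being passed to the random forest classifier; that step depends on the project's feature extraction and is not shown here.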