A classifier that labels a piece of text as a comment or a non-comment. The text is typically crawled from the web pages of online news sites. In order to retrieve all the relevant comments, we crawl and parse all text in the comment section and use this classifier to filter out the non-comments. For details, please refer to Deliverable 6.1.
Python 2 is used, but switching to Python 3 is possible with minor changes.
Libraries in use: `nltk`, `numpy`, `pandas`, `sklearn`.
- Put the data files `native_comments.csv` and `area_without_comments.csv` into the `data/` folder. Each data file should be in CSV format, with each line containing an index number and a comment or non-comment.
- Run `main.py` to build the classification model and see its performance.
- Save the model as a pkl file for later use (optional).
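The steps above could be sketched as follows in Python 3. Note this is an illustrative sketch, not the repository's `main.py`: in the repository the texts would come from the two CSV files in `data/`, whereas here a tiny inline sample is used so the snippet runs on its own, and the pipeline layout (TF-IDF features feeding a random forest) is an assumption.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny inline stand-in for data/native_comments.csv and
# data/area_without_comments.csv (made-up examples).
comments = ["Great article, thanks!", "I completely disagree with the author.",
            "This made my day.", "Why was this not reported earlier?"]
non_comments = ["Share on Facebook", "Log in to reply",
                "Advertisement", "Read next: sports news"]

texts = comments + non_comments
labels = [1] * len(comments) + [0] * len(non_comments)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels)

# Vectorize the text chunks and train the random forest in one pipeline.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Optional: save the model as a pkl file for later use.
with open("comment_classifier.pkl", "wb") as f:
    pickle.dump(model, f)
```

The saved pkl file can then be loaded with `pickle.load` and applied to new text chunks without retraining.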
For any given article, the pieces of text in the comment section are passed to a binary classifier that distinguishes comments from non-comments, so that only the comments are kept. Before classification, the text chunks are pre-processed. First, the text chunks are tokenized and punctuation is removed. Second, domain-specific stop words are removed; these are words that appear frequently in comment sections. After pre-processing, a trained random forest classifier is used to retrieve the comment text with its associated metadata.