unikob-comment-classifier

A classifier that labels a piece of text as either a comment or a non-comment. The text is typically crawled from the web pages of online news sites. To retrieve all relevant comments, we crawl and parse all text in the comment section and use this classifier to filter out the non-comments. For details, please refer to Deliverable 6.1.

Requirements

The code is written for Python 2; switching to Python 3 is possible with minor changes.

Libraries in use: nltk, numpy, pandas, sklearn.

Usage

  1. Put the data files native_comments.csv and area_without_comments.csv into the data/ folder. Each file should be in CSV format, with each line containing an index number followed by a comment or a non-comment.

  2. Run main.py to build the classification model and see its performance.

  3. Save the model as a .pkl file for later use (optional).
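The data format in step 1 and the optional save in step 3 can be sketched as follows. This is a minimal illustration written for Python 3 using only the standard library, not the code in main.py. The helper names (load_labeled, save_model, load_model) are hypothetical, and the sketch assumes the CSV files have no header row and that the text sits in the second column.

```python
import csv
import pickle


def load_labeled(comment_file, non_comment_file):
    """Read two CSV streams of (index, text) rows; return texts and 0/1 labels.

    Assumption: no header row, text in the second column.
    """
    texts, labels = [], []
    for stream, label in ((comment_file, 1), (non_comment_file, 0)):
        for row in csv.reader(stream):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            texts.append(row[1])
            labels.append(label)
    return texts, labels


def save_model(model, path):
    """Serialize any picklable object (e.g. a fitted classifier) to a .pkl file."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a previously saved model back from a .pkl file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

With the files from step 1 in place, load_labeled could be called on open("data/native_comments.csv", newline="") and open("data/area_without_comments.csv", newline=""). Only load .pkl files from sources you trust, since unpickling can execute arbitrary code.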

Technical details

For any given article, the pieces of text in the comment section are passed through a binary classifier that separates comments from non-comment text, so that only the comments are kept. Before classification, the text chunks are pre-processed. First, the chunks are tokenized and punctuation is removed. Second, domain-specific stop words are removed; these are words that appear frequently in comment sections. After preprocessing, a trained random forest classifier is used to retrieve the comment text along with its associated metadata.
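The steps above can be sketched roughly as follows. This is an illustrative pipeline under stated assumptions, not the repository's implementation: the stop word list is invented for the example, tokenization uses a regular expression instead of nltk, lowercasing is added so that stop words match regardless of case, and bag-of-words counts stand in for whatever features the project actually uses.

```python
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical domain-specific stop words; the project's real list is not shown here.
DOMAIN_STOP_WORDS = {"reply", "share", "report", "like", "comments"}


def preprocess(text):
    """Lowercase, tokenize on word characters (dropping punctuation), remove stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [tok for tok in tokens if tok not in DOMAIN_STOP_WORDS]


def build_classifier():
    """Bag-of-words features over the preprocessed tokens, fed to a random forest."""
    vectorizer = CountVectorizer(analyzer=preprocess)
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    return make_pipeline(vectorizer, forest)


# Toy training data, made up for illustration (1 = comment, 0 = non-comment).
texts = [
    "I think the author misses the point entirely.",
    "Great article, thanks for the analysis!",
    "This policy will hurt small towns.",
    "Reply Share Report",
    "Sign in to post comments",
    "Load more comments",
]
labels = [1, 1, 1, 0, 0, 0]

model = build_classifier()
model.fit(texts, labels)
predictions = model.predict(["Sign in to reply", "The analysis is wrong."])
```

Passing preprocess as the analyzer keeps tokenization and stop word removal inside the pipeline, so the same steps are applied at training and prediction time.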

Main contributors

Jun Sun, Nico Daheim.
