Content Extraction via Text Density (SIGIR11)

# Content Extraction via Text Density (CETD)

## Introduction

This program detects and removes the additional content (e.g., ads, navigation menus, and copyright notices) that surrounds the main content of a web page.
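
As a rough illustration of the idea behind the name, the sketch below computes a simple text density for nodes of a toy DOM tree. It assumes that the text density of a node is the number of characters in its subtree divided by the number of tags in its subtree, with a tag count of zero treated as 1. The `Node` struct and the function names are hypothetical and are not taken from this repository; the sketch uses only the C++ standard library (no Qt) and leaves out the composite text density and DensitySum steps named in the paper's keywords.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy DOM node; not the data structure used in this repository.
struct Node {
    std::string tag;             // element name, e.g. "div"
    std::string text;            // text placed directly inside this element
    std::vector<Node> children;  // child elements
};

// Number of characters in the subtree rooted at n.
std::size_t charCount(const Node& n) {
    std::size_t total = n.text.size();
    for (const Node& c : n.children) total += charCount(c);
    return total;
}

// Number of descendant tags below n (n itself is not counted).
std::size_t tagCount(const Node& n) {
    std::size_t total = 0;
    for (const Node& c : n.children) total += 1 + tagCount(c);
    return total;
}

// Assumed definition: characters per tag, with a zero tag count treated as 1.
double textDensity(const Node& n) {
    std::size_t tags = tagCount(n);
    if (tags == 0) tags = 1;
    return static_cast<double>(charCount(n)) / static_cast<double>(tags);
}

// Usage example: a text-heavy block scores higher than a link-heavy menu.
void exampleUsage() {
    Node article{"div", "", {Node{"p", "0123456789", {}},
                             Node{"p", "0123456789", {}}}};
    Node menu{"nav", "", {Node{"a", "Home", {}}, Node{"a", "News", {}},
                          Node{"a", "Blog", {}}, Node{"a", "Help", {}}}};
    assert(textDensity(article) == 10.0);  // 20 characters / 2 tags
    assert(textDensity(menu) == 4.0);      // 16 characters / 4 tags
}
```

Under this assumption, blocks made of long runs of text score high, while menus and ad blocks, which contain many tags and little text, score low. That difference is what lets a density-based method separate main content from its surroundings.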

Before building the source code, make sure the Qt SDK is installed.

Contact: Fei Sun, Institute of Computing Technology, ofey.sunfei@gmail.com
Project page: http://ofey.me/projects/cetd/

## Citation

@inproceedings{Sun:2011:DBC:2009916.2009952,
author = {Sun, Fei and Song, Dandan and Liao, Lejian},
title = {DOM based content extraction via text density},
booktitle = {Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval},
series = {SIGIR '11},
year = {2011},
isbn = {978-1-4503-0757-4},
location = {Beijing, China},
pages = {245--254},
numpages = {10},
url = {http://doi.acm.org/10.1145/2009916.2009952},
doi = {10.1145/2009916.2009952},
acmid = {2009952},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {composite text density, content extraction, densitysum, text density},
}

## License

This project is licensed under the GNU General Public License, version 3 (GPLv3). The full text is available at http://www.gnu.org/licenses/gpl.txt