newspaper data collector
A web crawler that collects online newspaper data and stores it in a document database.
The crawler is based on the Scrapy framework.
Tasks are controlled by generating a local index page.
Database storage uses MongoDB (work in progress).
- Use "generate-index-hbrb.py [from date: yyyymmdd] [to date]" to generate an index page covering the specified dates.
- Use "python -m SimpleHTTPServer 9080 &" to make the index page visible to the spider.
- Run "scrapy crawl whrb" to collect data; add parameters as needed, e.g. "-t json -o foo.json".
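The index-generation step can be sketched as follows. This is only an illustration of the idea, not the actual generate-index-hbrb.py: the URL pattern, the function names, and the HTML layout are assumptions, since this README does not show them.

```python
from datetime import datetime, timedelta
from html import escape

# Hypothetical URL pattern; the real newspaper URL layout is not given in this README.
URL_TEMPLATE = "http://example.com/paper/{day:%Y%m%d}/index.html"


def parse_day(text):
    """Parse a yyyymmdd string into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


def date_range(first, last):
    """Yield every date from first to last, inclusive."""
    day = first
    while day <= last:
        yield day
        day += timedelta(days=1)


def build_index(first, last, url_template=URL_TEMPLATE):
    """Return an HTML page with one link per day for the spider to follow."""
    links = []
    for day in date_range(first, last):
        url = url_template.format(day=day)
        links.append('<li><a href="{0}">{1}</a></li>'.format(
            escape(url, quote=True), day.isoformat()))
    return "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n"
```

Writing the result of build_index(parse_day("20140101"), parse_day("20140131")) to a file in the served directory would give the spider a start page. On Python 3 the built-in server is started with "python -m http.server 9080" instead of the SimpleHTTPServer module.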
Exactly the same as hbrb.
- Use "generate-index-ckxx.py [from page] [to page]" to grab individual report URLs from the headline pages.
- Same as hbrb and whwb.
- Same as hbrb and whwb.
- escaped-unicode-print.py
  Prints the collected text (escaped Unicode) for checking.
- import-to-mongodb.py [collection] [json file]
  Imports a JSON file into the specified collection of the MongoDB "newspaper" database.
- json-to-item-in-line.py [json file]
  Converts the collected JSON to one item per line; the most recent Scrapy versions already default to that layout.
- run-server
  Simplifies calling Python's built-in web server ("python -m SimpleHTTPServer").
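The two JSON utilities listed above could look roughly like the sketch below. It is not the code of json-to-item-in-line.py or import-to-mongodb.py; all function names are hypothetical, and it assumes the collected file is a single JSON array of items. The MongoDB part relies on the third-party pymongo package and a server on the default local port, both of which are assumptions.

```python
import json


def load_items(path):
    """Load a JSON file that holds a list of scraped items."""
    with open(path, encoding="utf-8") as handle:
        items = json.load(handle)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of items")
    return items


def to_item_lines(items):
    """Serialize each item as JSON on its own line, keeping non-ASCII text readable."""
    return "\n".join(
        json.dumps(item, ensure_ascii=False, sort_keys=True) for item in items)


def import_items(collection, items):
    """Insert items into a collection-like object; return how many were sent."""
    if not items:
        # pymongo rejects an empty document list, so skip the call entirely.
        return 0
    collection.insert_many(items)
    return len(items)


def open_collection(name, database="newspaper"):
    """Open a collection on a local MongoDB server (assumes pymongo is installed)."""
    from pymongo import MongoClient  # imported lazily; third-party dependency
    return MongoClient()[database][name]
```

A typical call would be import_items(open_collection("hbrb"), load_items("foo.json")), where the collection and file names are placeholders.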