ShiZhan/newspapers

Introduction

newspaper data collector

A web crawler that collects online newspaper data and stores it in a document database.

The crawler is based on the Scrapy framework.

Crawl tasks are controlled by generating a local index page.

Database support (work in progress) uses MongoDB.

Spiders

hbrb

Hubei Daily

  1. Run generate-index-hbrb.py [from date: yyyymmdd] [to date] to generate an index page covering the specified dates.

  2. Run python -m SimpleHTTPServer 9080 & to make the index page visible to the spider.

  3. Run scrapy crawl hbrb, adding parameters such as "-t json -o foo.json", to collect the data.
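
The index generation in step 1 can be sketched roughly as follows. This is only an illustration of the idea, not the contents of generate-index-hbrb.py: the URL pattern in URL_TEMPLATE and the helper names parse_day, days_between and build_index are all assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical URL pattern; the one used by the real script is not documented here.
URL_TEMPLATE = "http://example.com/hbrb/{day:%Y%m%d}/index.html"


def parse_day(text):
    """Parse a yyyymmdd string into a date object."""
    return datetime.strptime(text, "%Y%m%d").date()


def days_between(start, end):
    """Yield every date from start to end, both inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def build_index(start, end, template=URL_TEMPLATE):
    """Return an HTML page containing one link per day in the range."""
    links = [
        '<li><a href="{0}">{0}</a></li>'.format(template.format(day=day))
        for day in days_between(start, end)
    ]
    return "<html><body><ul>\n" + "\n".join(links) + "\n</ul></body></html>\n"
```

Writing the result of build_index(parse_day("20140101"), parse_day("20140103")) to a file in the served directory would give the spider three links to follow.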

whwb

Wuhan Evening News

Usage is exactly the same as for hbrb.

ckxx

Cankao Xiaoxi (Reference News)

  1. Run generate-index-ckxx.py [from page] [to page] to collect individual report URLs from the headline pages.

  2. Same as step 2 for hbrb and whwb.

  3. Same as step 3 for hbrb and whwb.
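
Collecting report URLs from a headline page, as in step 1, might look something like the sketch below. It uses only the Python 3 standard library and keeps every link it finds; the real generate-index-ckxx.py presumably filters for report links, and the names LinkCollector and collect_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors without a target
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    """Return absolute URLs for all links found in the given HTML text."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```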

Utilities

  • escaped-unicode-print.py

    Prints collected text (stored as escaped Unicode) in readable form, for checking purposes.
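
    One way to get readable output, assuming the collected file is JSON with backslash-u escapes, is to parse it and serialize it again without ASCII escaping. This is a sketch in Python 3, not the script itself, and the function name readable is made up for the example.

```python
import json


def readable(json_text):
    """Re-serialize JSON so escaped characters appear as literal text."""
    data = json.loads(json_text)
    # ensure_ascii=False keeps non-ASCII characters instead of escaping them.
    return json.dumps(data, ensure_ascii=False, indent=2)
```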

  • import-to-mongodb.py [collection] [json file]

    Imports JSON into the specified collection of the MongoDB "newspaper" database.
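
    A minimal sketch of such an import, assuming the JSON file holds a single array of items and that the third-party pymongo package is installed. The function names and the default connection URI are assumptions; only the "newspaper" database name comes from the description above.

```python
import json


def load_items(path):
    """Read a JSON array of items from a file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def import_items(collection, path):
    """Insert every item from the JSON file; return how many were read."""
    items = load_items(path)
    if items:  # pymongo's insert_many rejects an empty list
        collection.insert_many(items)
    return len(items)


def open_collection(name, uri="mongodb://localhost:27017"):
    """Open a collection in the "newspaper" database (the URI is an assumed default)."""
    from pymongo import MongoClient  # third-party dependency

    return MongoClient(uri)["newspaper"][name]
```

    With a MongoDB server running locally, import_items(open_collection("hbrb"), "foo.json") would load the crawl output into a collection named hbrb.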

  • json-to-item-in-line.py [json file]

    Converts gathered JSON to one item per line. The most recent Scrapy versions already write their output this way by default.
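
    The conversion itself is small. The following sketch assumes the input is a JSON array and emits one serialized item per line; to_lines is a hypothetical name, not the script's own.

```python
import json


def to_lines(json_text):
    """Turn a JSON array into text with one JSON item per line."""
    items = json.loads(json_text)
    return "".join(
        json.dumps(item, ensure_ascii=False) + "\n" for item in items
    )
```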

  • run-server

    A wrapper that simplifies invoking Python's built-in web server (python -m SimpleHTTPServer).
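
    SimpleHTTPServer is a Python 2 module; under Python 3 the equivalent command is python -m http.server 9080. For reference, a rough Python 3 (3.7 or later) equivalent in code could look like this. The make_server helper and its defaults are assumptions for illustration, not part of run-server.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_server(directory=".", port=9080):
    """Create a local HTTP server that serves files from a directory."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    # Bind to localhost only, since just the local spider needs access.
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

    Calling make_server().serve_forever() then serves the current directory on port 9080 until interrupted.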

About

larger crawling experiment
