BlogForever crawler

Install (for Python 2.6):

pip install scrapy==0.18.4
pip install lxml httplib2 feedparser selenium python-Levenshtein
download PhantomJS from http://phantomjs.org/download.html and install the binary at /opt/phantomjs/bin/phantomjs

Run:

scrapy crawl newcrawl -a startat=http://www.quantumdiaries.org/
scrapy crawl updatecrawl -a startat=http://www.quantumdiaries.org/ -a since=1388593000
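
The since argument appears to be a unix timestamp of the previous crawl (see the TODO below). One way to produce a value, here "24 hours ago" purely as an illustration:

import time

since = int(time.time()) - 24 * 60 * 60  # unix timestamp for 24 hours ago
print(since)                             # pass as: -a since=<value>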

Test:

pip install pytest pytest-incremental
py.test

Source tree docstrings:

bibcrawl
├── model
│   ├── commentitem.py: Blog comment Item
│   ├── objectitem.py: Super class of comment and post item
│   └── postitem.py: Blog post Item
├── pipelines
│   ├── backendpropagate.py: Saves the item in the back-end
│   ├── downloadfeeds.py: Downloads comments web feed
│   ├── downloadimages.py: Downloads images
│   ├── extractcomments.py: Extracts all comments from html using the comment feed
│   ├── files.py: Files pipeline back-ported to python 2.6
│   ├── processhtml.py: Processes html to extract the article, title and author
│   └── renderjavascript.py: Renders the original page with PhantomJS and takes a screenshot
├── spiders
│   ├── newcrawl.py: Entirely crawls a new blog
│   ├── rsscrawl.py: Super class of new and update crawl
│   └── updatecrawl.py: Partially crawls a blog for the new content in its web feed
├── utils
│   ├── contentextractor.py: Extracts the content of blog posts using an RSS feed
│   ├── ohpython.py: Essential functions that should have been part of python core
│   ├── parsing.py: Parsing functions
│   ├── priorityheuristic.py: Priority heuristic for page download, favors pages with links to posts
│   ├── stringsimilarity.py: Dice's coefficient similarity function (see the sketch after this tree)
│   └── webdriverpool.py: Pool of PhantomJS processes to parallelize page rendering
├── blogmonitor.py: Queries the database and starts new and update crawls when needed
└── settings.py: Scrapy settings
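
As an illustration of stringsimilarity.py (referenced in the tree above), here is a minimal, self-contained sketch of Dice's coefficient on character bigrams; it shows the underlying formula only and is not the module's actual code:

def dice_coefficient(a, b):
    """2 * |shared bigrams| / (bigrams of a + bigrams of b), in [0, 1]."""
    if len(a) < 2 or len(b) < 2:
        return 1.0 if a == b else 0.0
    bigrams_a = set(a[i:i + 2] for i in range(len(a) - 1))
    bigrams_b = set(b[i:i + 2] for i in range(len(b) - 1))
    shared = len(bigrams_a & bigrams_b)
    return 2.0 * shared / (len(bigrams_a) + len(bigrams_b))

# Near-duplicate strings score high (about 0.86 here):
print(dice_coefficient("Quantum Diaries", "Quantum diaries"))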

TODO:

Add to the DB, per blog (see the sketch after this list):

  • link to web-feed
  • latest etag of this feed
  • date of the last crawl (unix timestamp)
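
A minimal sketch of what such a per-blog record could hold (field names and values are illustrative, not an actual schema):

# Hypothetical per-blog record; field names and values are illustrative only.
blog_record = {
    "url": "http://www.quantumdiaries.org/",            # blog home page, assumed to be in the DB already
    "feed_url": "http://www.quantumdiaries.org/feed/",  # link to the web feed
    "feed_etag": '"abc123"',                            # latest etag of this feed
    "last_crawl": 1388593000,                           # date of last crawl (unix timestamp)
}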

Blog monitor algo:

if the feed is fresh (isFresh), start an updatecrawl with the date of the last crawl as the since argument;
otherwise nothing needs to be done for this blog.
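
A minimal sketch of that check, assuming the per-blog record sketched above and a conditional GET on the feed via httplib2 (already a dependency); the helper is illustrative and does not reproduce the actual blogmonitor.py code:

import subprocess
import time
import httplib2

def check_blog(blog):
    """If the blog's feed has fresh content, start an update crawl."""
    http = httplib2.Http()
    headers = {"If-None-Match": blog["feed_etag"]} if blog.get("feed_etag") else {}
    response, _ = http.request(blog["feed_url"], "GET", headers=headers)
    is_fresh = response.status != 304  # 304 Not Modified: nothing new since last etag
    if is_fresh:
        subprocess.call([
            "scrapy", "crawl", "updatecrawl",
            "-a", "startat=" + blog["url"],
            "-a", "since=" + str(blog["last_crawl"]),
        ])
        blog["feed_etag"] = response.get("etag", blog.get("feed_etag"))
        blog["last_crawl"] = int(time.time())
    # otherwise we are fine for this blog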
