scrapyrwiki

A collection of helpers for running scrapers built with Scrapy on ScraperWiki.

Launch a scraper without the scrapy CLI

Example:

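from scrapy.conf import settings
from scrapyrwiki import run_spider

# MySpider is your own Scrapy spider class; define or import it above.

def main():
    # Start the crawl directly, without the scrapy command-line tool.
    run_spider(MySpider(), settings)

if __name__ == '__main__':
    main()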

Save produced data to ScraperWiki

Add "scrapyrwiki.pipelines.ScraperWikiPipeline" to ITEM_PIPELINES.

Example:

from scrapy.conf import settings
from scrapyrwiki import run_spider

# MySpider and MyItem are your own spider and item classes.

def scraperwiki():
    # Override the project settings before starting the crawl.
    options = {
        'SW_SAVE_BUFFER': 5,
        'SW_UNIQUE_KEYS': {"MyItem": ['url']},
        'ITEM_PIPELINES': ['scrapyrwiki.pipelines.ScraperWikiPipeline'],
    }
    settings.overrides.update(options)
    run_spider(MySpider(), settings)


if __name__ == 'scraper':
    scraperwiki()
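
The two SW_ options are not described further here. The sketch below shows one way a buffered, key-aware save pipeline could work, assuming SW_SAVE_BUFFER is the number of items held before a write and SW_UNIQUE_KEYS maps an item class name to its unique key columns. Both readings are assumptions. The class BufferedSavePipeline and the injected save callable are hypothetical names; the callable stands in for something like scraperwiki.sqlite.save(unique_keys, data, table_name). This is not the actual ScraperWikiPipeline code.

```python
from collections import defaultdict


class BufferedSavePipeline:
    """Hypothetical sketch of a buffered save pipeline (not the real one)."""

    def __init__(self, save, buffer_size=5, unique_keys=None):
        # save(unique_keys, rows, table_name) is an injected callable; on
        # ScraperWiki it could wrap scraperwiki.sqlite.save (assumption).
        self.save = save
        self.buffer_size = buffer_size
        self.unique_keys = unique_keys or {}
        self.buffers = defaultdict(list)  # item class name -> pending rows

    def process_item(self, item, spider=None):
        name = type(item).__name__
        self.buffers[name].append(dict(item))
        # Write once the buffer for this item type is full.
        if len(self.buffers[name]) >= self.buffer_size:
            self.flush(name)
        return item

    def flush(self, name):
        rows = self.buffers.pop(name, [])
        if rows:
            self.save(self.unique_keys.get(name, []), rows, name)

    def close_spider(self, spider=None):
        # Write whatever is still buffered when the crawl ends.
        for name in list(self.buffers):
            self.flush(name)
```

Under this reading, a larger buffer means fewer writes, and the unique keys let a repeated run update existing rows instead of duplicating them.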

Check spider contracts in CI

Launch the spider with run_tests.

Example:

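from scrapyrwiki import run_tests
from scrapy.conf import settings

# MySpider is your own Scrapy spider class; define or import it above.
# The contract check results are written to output.xml.
run_tests(MySpider(), "output.xml", settings)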

Note: The tests run against the HTTP cache. The directory from which the script is launched must contain a scrapy.cfg file (Scrapy uses it to recognise the directory as a scraper project) and a .scrapy directory holding the HTTP cache database.
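
A CI job can check these two prerequisites before calling run_tests. The helper below is a minimal sketch; the function name missing_test_prerequisites is hypothetical and not part of scrapyrwiki.

```python
import os


def missing_test_prerequisites(directory="."):
    """Return the required paths that are absent from `directory`."""
    required = {
        "scrapy.cfg": os.path.isfile,  # marks the directory as a Scrapy project
        ".scrapy": os.path.isdir,      # holds the HTTP cache database
    }
    return [name for name, exists in required.items()
            if not exists(os.path.join(directory, name))]
```

An empty list means both paths are present; otherwise the job can fail early with the names of the missing paths.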

The output is in XUnit format and has been tested with Jenkins.
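
For readers unfamiliar with the format, the sketch below builds a small XUnit-style report with the standard library. It only illustrates the general shape of such a file (a testsuite element containing testcase elements, with a failure child for each failed case). The function name build_xunit_report and the sample test names are hypothetical, and the exact attributes written by run_tests may differ.

```python
import xml.etree.ElementTree as ET


def build_xunit_report(suite_name, results):
    """Build XUnit-style XML from (test_name, failure_message_or_None) pairs."""
    results = list(results)
    failures = sum(1 for _, message in results if message is not None)
    suite = ET.Element("testsuite", name=suite_name,
                       tests=str(len(results)), failures=str(failures))
    for test_name, message in results:
        case = ET.SubElement(suite, "testcase",
                             classname=suite_name, name=test_name)
        if message is not None:
            # A failed case carries a <failure> child with the reason.
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")
```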

Log scraper errors to Sentry

Install scrapy-sentry and set the SENTRY_DSN environment variable to your Sentry key (DSN). scrapyrwiki handles the rest.
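
How scrapyrwiki wires this up internally is not shown here. As a rough sketch of the idea, the helper below turns the environment variable into Scrapy setting overrides. The function name sentry_overrides is hypothetical, and the extension path is taken from the scrapy-sentry README as an assumption; verify it against the version you install.

```python
import os


def sentry_overrides(environ=os.environ):
    """Return setting overrides that enable Sentry, or {} if no DSN is set."""
    dsn = environ.get("SENTRY_DSN")
    if not dsn:
        return {}
    return {
        "SENTRY_DSN": dsn,
        # Extension path as given in the scrapy-sentry README (assumption).
        "EXTENSIONS": {"scrapy_sentry.extensions.Errors": 10},
    }
```

With no SENTRY_DSN in the environment the helper returns an empty dict, so the same scraper runs unchanged on machines without Sentry access.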
