Scrapy Cluster

This Scrapy project uses Redis and Kafka to create a distributed, on-demand scraping cluster.

The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any further crawls those seeds trigger, whether through frontier expansion or depth traversal, are also distributed among all workers in the cluster.
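To make the distribution idea concrete, here is a minimal sketch of a shared crawl frontier. It is not the project's implementation: the `SharedFrontier` class, the `crawl` helper, the de-duplication step, and the round-robin worker loop are all assumptions made for illustration, and an in-memory heap stands in for the queue that would live in Redis.

```python
import collections
import heapq
import itertools


class SharedFrontier(object):
    """In-memory stand-in for a Redis-backed queue shared by every worker (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._order = itertools.count()  # tie-breaker: equal priorities pop in FIFO order

    def push(self, url, priority=0, depth=0):
        """Queue a URL once; return False if it was already queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._order), url, depth))
        return True

    def pop(self):
        """Return the next (url, depth) pair, or None when the frontier is empty."""
        if not self._heap:
            return None
        _, _, url, depth = heapq.heappop(self._heap)
        return url, depth


def crawl(seeds, links, workers, max_depth):
    """Simulate workers taking turns on one frontier; `links` is a fake link graph."""
    frontier = SharedFrontier()
    for url in seeds:
        frontier.push(url)
    assignments = collections.defaultdict(list)
    for worker in itertools.cycle(workers):
        item = frontier.pop()
        if item is None:
            break
        url, depth = item
        assignments[worker].append(url)
        if depth < max_depth:
            # Links found by one worker re-enter the shared frontier for any worker.
            for child in links.get(url, []):
                frontier.push(child, depth=depth + 1)
    return dict(assignments)


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c", "d"]}
    print(crawl(["a"], graph, ["w1", "w2"], max_depth=2))
```

The point of the sketch is that a worker never owns the pages it discovers: every new URL goes back to the shared queue, so any idle worker can pick it up.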

The system reads its input from a set of Kafka topics and writes its output to a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.
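As a rough illustration of what a crawl request sent to an input topic might look like, the sketch below serializes one request as JSON bytes. The field names (`url`, `appid`, `crawlid`, `maxdepth`) and the `build_crawl_request` helper are assumptions for this example, not a schema confirmed by this README; check the official documentation for the real message format and topic names.

```python
import json
import uuid


def build_crawl_request(url, app_id, crawl_id=None, max_depth=0):
    """Serialize one crawl request as UTF-8 JSON bytes.

    Every field name here is an assumption for illustration only.
    """
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be absolute: %r" % url)
    request = {
        "url": url,
        "appid": app_id,
        # Generate an id when the caller does not supply one.
        "crawlid": crawl_id or uuid.uuid4().hex,
        "maxdepth": max_depth,
    }
    # sort_keys keeps the serialized form stable, which helps when comparing logs.
    return json.dumps(request, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    payload = build_crawl_request("http://example.com", "demo-app", crawl_id="abc123")
    print(payload.decode("utf-8"))
```

The resulting bytes could be handed to whichever Kafka producer client you use. The example deliberately leaves out the producer call so that it runs without a broker.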

Dependencies

Please see requirements.txt for the pip package dependencies across the different subprojects.

Other important components required to run the cluster:

Python 2.7: https://www.python.org/downloads/
Redis: http://redis.io
Zookeeper: https://zookeeper.apache.org
Kafka: http://kafka.apache.org

Core Concepts

This project tries to bring together a number of new concepts for Scrapy and for large-scale distributed crawling in general. Key points include:

The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
Scale Scrapy instances across a single machine or multiple machines
Coordinate and prioritize their scraping effort for desired sites
Persist data across scraping jobs
Execute multiple scraping jobs concurrently
Allows for in-depth access to information about your scraping job, what is upcoming, and how the sites are ranked
Allows you to arbitrarily add, remove, or scale your scrapers in the pool without loss of data or downtime
Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results)
Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP address
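The coordinated throttling mentioned in the last point can be pictured with a small sliding-window limiter. This README does not describe the algorithm the project uses, so the `SharedThrottle` class below is a hypothetical stand-in: it keeps hit timestamps per domain in an in-memory dict, where a real cluster would keep that state in Redis so that every spider sees the same counts.

```python
import collections


class SharedThrottle(object):
    """Sliding-window rate limit keyed by domain (illustrative, not the project's code)."""

    def __init__(self, max_hits, window_seconds):
        self.max_hits = max_hits
        self.window_seconds = window_seconds
        # One deque of hit timestamps per domain; stands in for shared Redis state.
        self._hits = collections.defaultdict(collections.deque)

    def try_acquire(self, domain, now):
        """Record a hit and return True if `domain` is under its limit at time `now`."""
        hits = self._hits[domain]
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_hits:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    throttle = SharedThrottle(max_hits=2, window_seconds=10)
    # Two "spiders" share one throttle, so their requests count against one limit.
    schedule = [("spider-a", 0.0), ("spider-b", 1.0), ("spider-a", 2.0), ("spider-b", 10.5)]
    for spider, t in schedule:
        allowed = throttle.try_acquire("example.com", now=t)
        print("%s at t=%.1f: %s" % (spider, t, "allowed" if allowed else "throttled"))
```

Because the limit is keyed by domain rather than by spider, the third request is refused even though it comes from a spider that has made only one request itself.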

Scrapy Cluster test environment

To set up a pre-canned Scrapy Cluster test environment, make sure you have the latest VirtualBox and Vagrant >= 1.7.4 installed. Vagrant will automatically mount the base scrapy-cluster directory to the /vagrant directory, so any code changes you make will be visible inside the VM.

Steps to launch the test environment:

vagrant up in the base scrapy-cluster directory.
vagrant ssh to ssh into the VM.
sudo supervisorctl status to check that everything is running.
cd /vagrant to get to the scrapy-cluster directory.
conda create -n sc scrapy --yes to create a conda environment with Scrapy pre-installed.
source activate sc to activate that environment.
pip install -r requirements.txt to install Scrapy Cluster dependencies.
./run_offline_tests.sh to run offline tests.
./run_online_tests.sh to run online tests (relies on Kafka, Zookeeper, and Redis).

Documentation

Please check out our official Scrapy Cluster documentation for more details on how everything works!

Branches

The master branch of this repository contains the latest stable release code for Scrapy Cluster 1.1.

The dev branch contains bleeding-edge code that is working towards Scrapy Cluster 1.2. Please note that not everything may be documented, finished, tested, or finalized, but we are happy to help guide those who are interested.
