hannibal

A lightweight, distributed crawler built on asyncio, aiohttp and redis.

Named in commemoration of General Hannibal Barca.

Collecting

Two application designs are supported:

A single node that handles both collecting and parsing.
A distributed structure with multiple collecting nodes and parsing nodes that communicate through a redis-based queue and URL pool.
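
The distributed structure can be pictured as two groups of workers joined by queues. The sketch below is only an illustration of that idea, not hannibal's implementation: it uses `asyncio.Queue` as a stand-in for the redis-based queue, a plain `set` as a stand-in for the URL pool, and fakes the HTTP fetch. All function names here are hypothetical.

```python
import asyncio


async def collector_node(mission_queue, parse_queue, seen):
    """Take URLs from the mission queue, skip duplicates, pass bodies on."""
    while True:
        url = await mission_queue.get()
        try:
            if url not in seen:
                seen.add(url)
                # Assumption: a real collector would fetch the page here.
                body = "<html>%s</html>" % url
                await parse_queue.put((url, body))
        finally:
            mission_queue.task_done()


async def parser_node(parse_queue, results):
    """Take collected bodies from the parse queue and record a result."""
    while True:
        url, body = await parse_queue.get()
        try:
            results[url] = len(body)
        finally:
            parse_queue.task_done()


async def run_pipeline(urls, n_collectors=2, n_parsers=2):
    mission_queue = asyncio.Queue()
    parse_queue = asyncio.Queue()
    seen, results = set(), {}
    for url in urls:
        mission_queue.put_nowait(url)

    workers = [
        asyncio.create_task(collector_node(mission_queue, parse_queue, seen))
        for _ in range(n_collectors)
    ]
    workers += [
        asyncio.create_task(parser_node(parse_queue, results))
        for _ in range(n_parsers)
    ]

    # Wait until every mission is collected, then until every body is parsed.
    await mission_queue.join()
    await parse_queue.join()

    for worker in workers:
        worker.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling `asyncio.run(run_pipeline(["a", "b", "a"]))` processes each distinct URL once. In a single-node setup the parse queue disappears and the collector calls the parse function directly.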

Two collecting modes are supported:

Collecting a fixed list of tasks.
Incremental collection.
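
The difference between the two modes can be sketched in a few lines. This is a rough illustration under one assumption: that incremental collection means links discovered while collecting are fed back into the queue. The helper names and the fake site are hypothetical and are not part of hannibal.

```python
from collections import deque

# Hypothetical site: each "page" is simply the list of links it contains.
FAKE_SITE = {"/": ["/a", "/b"], "/a": ["/b"], "/b": []}


def fetch(url):
    """Stand-in for an HTTP request."""
    return FAKE_SITE.get(url, [])


def collect_task_list(urls):
    """Fixed-list mode: visit exactly the given URLs, each at most once."""
    seen, pages = set(), {}
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        pages[url] = fetch(url)
    return pages


def collect_incrementally(seed_urls):
    """Incremental mode: newly discovered links are appended to the queue."""
    queue, seen, pages = deque(), set(), {}

    def enqueue(url):
        if url not in seen:
            seen.add(url)
            queue.append(url)

    for url in seed_urls:
        enqueue(url)
    while queue:
        url = queue.popleft()
        pages[url] = fetch(url)
        for link in pages[url]:
            enqueue(link)
    return pages
```

In both modes the `seen` set plays the role of the URL pool: it is what prevents a page from being collected twice.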

Parsing

Only the basic parsing schedule is implemented in the parser module; you need to override the Parser class and implement the page-parsing function yourself. Any parsing library can be used, for example Beautiful Soup, pyquery or regular expressions.
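
The override pattern looks roughly like the sketch below. The base class here is a hypothetical stand-in written for this example, so its method names are assumptions and may not match hannibal's actual Parser class; check the parser module for the real interface. The example uses a regular expression, but the body of `parse_page` could equally call Beautiful Soup or pyquery.

```python
import re


class BaseParser:
    """Hypothetical stand-in for a parser base class (not hannibal's API)."""

    def parse_page(self, body):
        # Subclasses supply the page-specific logic.
        raise NotImplementedError

    def run(self, bodies):
        # The "schedule": apply the parse function to every collected body.
        return [self.parse_page(body) for body in bodies]


class TitleParser(BaseParser):
    """Extracts the <title> text from an HTML page."""

    TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)

    def parse_page(self, body):
        match = self.TITLE_RE.search(body)
        return match.group(1).strip() if match else None
```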

Queue and Pool

Both brokers mentioned above, the queue and the pool, are available in a memory-based mode and a redis-based mode. The memory-based queue and pool also implement a basic storage mechanism.
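
To make the idea of a memory-based pool with basic storage concrete, here is a minimal sketch. The class below is hypothetical and is not hannibal's MemPool; it assumes that "basic storage" means writing the in-memory state to a file so it can be restored later, and it picks JSON as the file format purely for illustration.

```python
import json


class SimpleMemPool:
    """Hypothetical in-memory URL pool with file-based persistence."""

    def __init__(self, name):
        self.name = name
        self._seen = set()

    def add(self, url):
        """Record a URL; return True only the first time it is seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True

    def __contains__(self, url):
        return url in self._seen

    def dump(self, path):
        """Write the pool contents to a JSON file."""
        with open(path, "w", encoding="utf-8") as fh:
            json.dump({"name": self.name, "urls": sorted(self._seen)}, fh)

    @classmethod
    def load(cls, path):
        """Rebuild a pool from a file written by dump()."""
        with open(path, encoding="utf-8") as fh:
            data = json.load(fh)
        pool = cls(data["name"])
        pool._seen.update(data["urls"])
        return pool
```

A redis-based pool would keep the same `add` and membership semantics but hold the set in redis, which is what lets several collecting nodes share it.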

Quick Start

Installation

pip install hannibal

As mentioned above, the two application designs (single node and distributed) are implemented by LocalCollector and DistributeCollector.

LocalCollector requires a mission queue for passing packed collecting missions, a href pool for avoiding duplicate collection, and a parse function for processing the collected results. DistributeCollector has basically the same requirements as LocalCollector, except that it requires a parse queue for passing the collected results to a separate parser node.

Simple Usage

from hannibal.spider import LocalCollector, MemPool, MemQueue, extract_json, Mission

# Placeholder parse function. Assumption: the collector calls it with the
# collected result; check the parser module for the exact argument it passes.
def collect_function(response):
    print(response)

pool = MemPool(name='http_bin')
queue = MemQueue(name='http_bin', limited=True)
url_list = [Mission(unique_tag=i, url='http://httpbin.org/get?t=%d' % i) for i in range(1, 500)]
queue.init_queue(url_list)
collector = LocalCollector(mission_queue=queue, href_pool=pool, parse_function=collect_function, cache_size=10)
collector.conquer()
