DistributedCrawler

This is a distributed web crawler project, written in C++ for the Linux platform.

  1. Consistent hashing is used to partition URLs among the crawler nodes, which mitigates hot spots, balances load, and gives the distributed crawler good scalability and fault tolerance (a sketch follows this list).
  2. To meet the politeness and priority requirements of a web crawler, this project designs and implements a URL queue based on the Mercator model (also sketched below).
  3. Solutions are given for several key problems, including large-scale URL deduplication, DNS resolution, and page crawling and parsing; a common deduplication technique is sketched below.
  4. A thread pool model is designed and implemented for efficient, multi-threaded page collection (see the pool sketch below).
  5. A storage scheme for downloaded pages is given, which creates index files and data files to manage and store the downloaded data (an illustrative record layout closes this section).
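
Item 1 can be illustrated with a small consistent-hash ring. The sketch below is not code from this repository; the names (`ConsistentHashRing`, `NodeFor`) are hypothetical, and each node is placed on the ring many times as virtual replicas so that adding or removing a node remaps only a small share of hosts.

```cpp
// Sketch of a consistent-hash ring with virtual nodes. All names here
// (ConsistentHashRing, NodeFor, ...) are hypothetical, not from this repo.
#include <cstdint>
#include <functional>
#include <map>
#include <string>

class ConsistentHashRing {
public:
    // More virtual replicas per node -> smoother load distribution.
    explicit ConsistentHashRing(int replicas = 100) : replicas_(replicas) {}

    void AddNode(const std::string& node) {
        for (int i = 0; i < replicas_; ++i)
            ring_[Hash(node + "#" + std::to_string(i))] = node;
    }

    void RemoveNode(const std::string& node) {
        for (int i = 0; i < replicas_; ++i)
            ring_.erase(Hash(node + "#" + std::to_string(i)));
    }

    // Map a URL's host to the first node clockwise on the ring.
    std::string NodeFor(const std::string& host) const {
        if (ring_.empty()) return {};
        auto it = ring_.lower_bound(Hash(host));
        if (it == ring_.end()) it = ring_.begin();  // wrap around
        return it->second;
    }

private:
    static std::uint64_t Hash(const std::string& s) {
        return std::hash<std::string>{}(s);
    }

    int replicas_;
    std::map<std::uint64_t, std::string> ring_;  // hash point -> node id
};
```

Hashing the host rather than the full URL keeps all URLs of one site on one node, which also simplifies politeness.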
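
For item 2, the following much-simplified, single-threaded sketch conveys the two core ideas of a Mercator-style frontier: URLs are served in priority order, and each host is fetched by at most one worker at a time. The real Mercator model uses separate front and back queues; this single-class version, with hypothetical names, only illustrates the behavior.

```cpp
// Much-simplified sketch of a Mercator-style URL frontier. Front queues
// order URLs by priority; a "busy" set keeps each host in at most one
// fetch at a time (politeness). Names are hypothetical.
#include <cstddef>
#include <deque>
#include <set>
#include <string>
#include <vector>

class MercatorFrontier {
public:
    explicit MercatorFrontier(int priorities) : front_(priorities) {}

    void Push(const std::string& url, int priority, const std::string& host) {
        front_[priority].push_back({url, host});
    }

    // Pop the highest-priority URL whose host is not currently being fetched.
    bool Pop(std::string* url) {
        for (auto& q : front_) {  // index 0 = highest priority
            for (std::size_t i = 0; i < q.size(); ++i) {
                if (busy_.count(q[i].host)) continue;
                *url = q[i].url;
                busy_.insert(q[i].host);
                q.erase(q.begin() + i);
                return true;
            }
        }
        return false;  // every pending host is already in flight
    }

    // Call when a fetch finishes so the host becomes eligible again.
    void Done(const std::string& host) { busy_.erase(host); }

private:
    struct Entry { std::string url, host; };
    std::vector<std::deque<Entry>> front_;
    std::set<std::string> busy_;
};
```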
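
For the large-scale URL deduplication mentioned in item 3, a Bloom filter is a common choice; the repository may well use a different structure. A minimal sketch, with arbitrary sizes:

```cpp
// Illustrative Bloom filter for the "seen URL" test. A false positive makes
// the crawler skip a new URL; it never causes an old URL to be refetched.
#include <bitset>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>

class UrlSeenTest {
public:
    // Returns true if the URL was possibly seen before, false if definitely
    // new, and marks it as seen either way.
    bool SeenAndMark(const std::string& url) {
        std::size_t h1 = std::hash<std::string>{}(url);
        std::size_t h2 = h1 * 0x9e3779b97f4a7c15ULL | 1;  // cheap second hash
        bool seen = true;
        for (std::size_t i = 0; i < kHashes; ++i) {
            std::size_t bit = (h1 + i * h2) % kBits;
            if (!(*bits_)[bit]) { seen = false; (*bits_)[bit] = true; }
        }
        return seen;
    }

private:
    static constexpr std::size_t kBits = std::size_t(1) << 27;  // 16 MB
    static constexpr std::size_t kHashes = 4;
    // Heap-allocated: the bitset is far too large for the stack.
    std::unique_ptr<std::bitset<kBits>> bits_ =
        std::make_unique<std::bitset<kBits>>();
};
```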
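
Item 4's thread pool can be sketched with standard C++11 primitives. The pool below accepts arbitrary tasks (e.g., "download this URL"); all names are illustrative, not taken from this repository.

```cpp
// Minimal thread pool sketch for multi-threaded page collection.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] { Run(); });
    }

    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }

    void Submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(mu_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // e.g., fetch and parse one page
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mu_;
    std::condition_variable cv_;
    bool stop_ = false;
};
```

With such a pool, each URL popped from the frontier becomes one submitted task, e.g. `pool.Submit([url] { /* fetch url */ });`.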
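
For the index/data file scheme of item 5, one plausible layout appends each page to a data file and writes a fixed-size record to an index file saying where the page lives; the repository's actual on-disk format may differ.

```cpp
// Hypothetical on-disk layout: pages are appended to a data file, while a
// fixed-size record in an index file records where each page lives, so a
// stored page can be read back with a single seek.
#include <cstdint>
#include <fstream>
#include <string>

struct IndexRecord {
    std::uint64_t url_hash;  // identifies the page
    std::uint64_t offset;    // byte offset into the data file
    std::uint32_t length;    // page size in bytes
};

// Append one page plus its index record; returns false on I/O error.
bool StorePage(std::ofstream& data, std::ofstream& index,
               std::uint64_t url_hash, const std::string& page) {
    IndexRecord rec{url_hash,
                    static_cast<std::uint64_t>(data.tellp()),
                    static_cast<std::uint32_t>(page.size())};
    data.write(page.data(), static_cast<std::streamsize>(page.size()));
    index.write(reinterpret_cast<const char*>(&rec), sizeof(rec));
    return data.good() && index.good();
}
```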
