quora_distributed_crawler

This project is a distributed web crawler built specifically for crawling data from Quora.com.

Getting Started

Prerequisites

The server needs a message queue middleware (broker): either RabbitMQ or Redis.
Using a virtual environment is recommended, as it keeps the project's dependencies isolated from the system Python.

Dependency	Version
Python	3.11 or higher
RabbitMQ	latest
Redis	latest
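Before installing, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of the project: the helper names python_ok and broker_reachable are hypothetical, and the ports shown are the RabbitMQ (5672) and Redis (6379) defaults, which may differ in your setup.

```python
import socket
import sys


def python_ok(min_version=(3, 11)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


def broker_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    # Default ports; adjust the host and port to match your broker.
    print("RabbitMQ reachable:", broker_reachable("localhost", 5672))
    print("Redis reachable:", broker_reachable("localhost", 6379))
```

A reachable port only shows that something is listening there; it does not verify credentials or that the service is actually RabbitMQ or Redis.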

Installation

git clone https://github.com/LxYxvv/quora_distributed_crawler.git
cd quora_distributed_crawler
pip install -r requirements.txt

Start server

cd quora_distributed_crawler/server
python main.py

Configuration worker

Set broker_url in config.py to the address of your message middleware. Set worker_concurrency to 2 so that the worker does not send too many requests too frequently. Set the url in utils/upload.py.
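As an illustration, the two settings in config.py might look like the sketch below. broker_url and worker_concurrency are standard Celery setting names; the addresses and credentials are placeholders assumed for a local broker, not values taken from this project.

```python
# config.py (sketch): only the two settings described above are shown.

# RabbitMQ broker over AMQP. Replace the user, password, and host
# with the values for your own middleware (placeholders here).
broker_url = "amqp://guest:guest@localhost:5672//"

# Alternatively, point the worker at a Redis broker instead:
# broker_url = "redis://localhost:6379/0"

# Run at most 2 worker processes to keep the request rate low.
worker_concurrency = 2
```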

Submit tasks

To submit a task to the queue, refer to the Celery documentation. The broker_url in the server's config.py must be configured first.
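A minimal sketch of submitting a task by name with Celery's send_task is shown here. The task name "tasks.crawl", the app name "submitter", and the example URL are hypothetical; check the tasks directory for the names the workers actually register.

```python
def make_app(broker_url):
    """Create a Celery client bound to the given broker."""
    # Imported here so submit_crawl_task can be used with any object
    # that offers send_task, without needing Celery installed.
    from celery import Celery

    return Celery("submitter", broker=broker_url)


def submit_crawl_task(app, url, task_name="tasks.crawl"):
    """Queue one crawl job by task name and return Celery's result handle.

    "tasks.crawl" is a hypothetical name used for illustration only.
    """
    return app.send_task(task_name, args=[url])


# Example usage (requires Celery and a running broker):
#   app = make_app("amqp://guest:guest@localhost:5672//")
#   result = submit_crawl_task(app, "https://www.quora.com/topic/Example")
#   print(result.id)
```

Sending by name means the submitting process does not need to import the worker's task code; it only needs the same broker_url.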

Start worker

python main.py

Start worker on Android

Download ZeroTermux from https://github.com/hanxinhao000/ZeroTermux/releases and run the following commands in it:

pkg update && pkg upgrade
pkg install python3

Then follow the Installation, Configuration worker, and Start worker sections above, in that order.

