This project is a distributed, depth-first web crawler built with Scrapy, designed to extract information from specified websites. It uses `scrapy-playwright` to handle JavaScript-rendered content, so data can be extracted from modern, dynamically rendered pages. Multiple crawler instances can run concurrently, coordinating through a shared Redis queue.
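As a rough illustration of how `scrapy-playwright` is typically wired in (a minimal sketch based on the library's documented usage, not necessarily this project's exact `settings.py` or spider code):

```python
# Minimal sketch of scrapy-playwright usage, per the library's documented
# API; this project's actual settings and spider code may differ.
import scrapy

# settings.py (excerpt): route HTTP(S) downloads through Playwright and
# use the asyncio reactor, which scrapy-playwright requires.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

class RenderedSpider(scrapy.Spider):
    name = "rendered_example"  # placeholder, not this project's spider

    def start_requests(self):
        # meta={"playwright": True} has Playwright render the page
        # (executing its JavaScript) before parse() sees the response.
        yield scrapy.Request("https://example.com", meta={"playwright": True})

    def parse(self, response):
        yield {"title": response.css("title::text").get()}
```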
- Clone the repository:

  ```bash
  git clone [repository_url]
  cd search_engine_crawler
  ```
- Create and activate a Python virtual environment (optional, for local development/testing):

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install dependencies (optional, for local development/testing):

  ```bash
  pip install -r requirements.txt
  ```
The `web_spider` uses default `start_urls` and `allowed_domains` defined in `search_engine_crawler/constants.py`. `sitemap_urls` can still be provided as a command-line argument.
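The exact defaults live in `search_engine_crawler/constants.py`; as a sketch of the shape of that file (the URLs and domains below are placeholders, not the project's real values):

```python
# search_engine_crawler/constants.py -- illustrative sketch only;
# the actual default URLs and domains in this project may differ.

start_urls = [
    "https://example.com",  # placeholder seed URL
]

allowed_domains = [
    "example.com",  # placeholder; keeps the crawl within scope
]
```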
This project is configured to run using Docker Compose, which simplifies setup and deployment, especially with Redis for distributed crawling.
- Build and run the Docker containers:

  ```bash
  docker-compose up --build
  ```
  This command will:

  - Build the Docker images for the `init_urls` and `crawler` services.
  - Start a Redis server.
  - Run `init_urls` to push initial URLs from `start_urls.txt` to Redis.
  - Start the `crawler` service, which will begin scraping URLs from Redis (see the sketch after this list).
- Spawning Multiple Crawlers: To run multiple instances of the crawler for increased concurrency, use the `--scale` flag:

  ```bash
  docker-compose up --build --scale crawler=3
  ```

  Replace `3` with the desired number of crawler instances.
- Stopping the services:

  ```bash
  docker-compose down
  ```
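How the crawler consumes URLs from Redis isn't spelled out here; one common pattern is the `scrapy-redis` library, whose `RedisSpider` blocks on a Redis list and crawls whatever URLs are pushed to it. The sketch below assumes that library, and the `redis_key` name is a placeholder:

```python
# Sketch of Redis-fed crawling via scrapy-redis (an assumption; this
# project may wire Redis in differently). Names are placeholders.
from scrapy_redis.spiders import RedisSpider

class WebSpider(RedisSpider):
    name = "web_spider"
    # Each crawler instance pops URLs from this shared Redis list, which
    # is what lets `--scale crawler=N` instances cooperate on one queue.
    redis_key = "web_spider:start_urls"

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}
```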
To run the spider locally (without Docker Compose), ensure you have activated your virtual environment and installed dependencies.
To run the spider using the default URLs:

```bash
scrapy crawl web_spider
```

To provide `sitemap_urls` (if applicable):

```bash
scrapy crawl web_spider -a sitemap_urls="http://example.com/sitemap.xml"
```
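Scrapy passes `-a` arguments to the spider's constructor as strings; a plausible sketch of how `web_spider` might accept `sitemap_urls` (the handling shown is an assumption, not the project's confirmed code):

```python
# Sketch of accepting -a sitemap_urls=... (assumed pattern; the real
# web_spider.py may handle this differently).
import scrapy

class WebSpider(scrapy.Spider):
    name = "web_spider"

    def __init__(self, sitemap_urls=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # -a values arrive as strings; accept a comma-separated list.
        self.sitemap_urls = sitemap_urls.split(",") if sitemap_urls else []
```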
- `scrapy.cfg`: Scrapy project configuration.
- `search_engine_crawler/`: The main Python package for the Scrapy project.
  - `__init__.py`: Makes `search_engine_crawler` a Python package.
  - `settings.py`: Scrapy project settings, including middleware and pipeline configurations.
  - `items.py`: Defines the data structure for scraped items.
  - `pipelines.py`: Processes scraped items (e.g., saving to a database; see the sketch after this list).
  - `middlewares.py`: Custom downloader and spider middlewares.
  - `constants.py`: Defines default `start_urls` and `allowed_domains`.
  - `spiders/`: Directory containing the spider definitions.
    - `__init__.py`: Makes `spiders` a Python package.
    - `web_spider.py`: The main spider for crawling web pages, using `scrapy-playwright` for dynamic content.
- `requirements.txt`: Lists all Python dependencies.
- `scraped_data.db`: SQLite database for storing scraped data (if configured in pipelines).
- `CHANGELOG.md`: Documents all notable changes to the project.
- `Dockerfile`: Defines the Docker image for the crawler and URL initializer.
- `docker-compose.yml`: Orchestrates the multi-container Docker application (Redis, URL initializer, crawler).
- `init_redis_urls.sh`: Script to push initial URLs from `start_urls.txt` to Redis.
- `start_urls.txt`: Contains the list of initial URLs for the crawler.
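Since `pipelines.py` may save items to `scraped_data.db`, here is a minimal sketch of such a pipeline (the table name, columns, and item fields are assumptions, not the project's actual schema):

```python
# Illustrative SQLite pipeline (assumed schema; the real pipelines.py
# may differ).
import sqlite3

class SQLitePipeline:
    def open_spider(self, spider):
        self.conn = sqlite3.connect("scraped_data.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Upsert keyed on URL so re-crawled pages overwrite stale rows.
        self.conn.execute(
            "INSERT OR REPLACE INTO pages (url, title) VALUES (?, ?)",
            (item.get("url"), item.get("title")),
        )
        return item
```

A pipeline like this would be enabled through the `ITEM_PIPELINES` setting in `settings.py`.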