RustCrawler

A comprehensive, production-ready web crawler built in Rust to analyze and improve website quality.

Features

RustCrawler provides three types of website analysis with multiple output formats:

🔍 SEO Crawler

Analyzes search engine optimization aspects (one such check is sketched after the list):

  • Title tag presence and length
  • Meta description tags
  • H1 heading tags
  • Canonical URL tags
  • Robots meta tags
  • Internal link validation (configurable limit)
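
To make this concrete, here is a minimal, hypothetical sketch of the title check using plain string matching. The real implementation lives in src/crawlers/seo.rs and may differ; TitleCheck and check_title are illustrative names, and the planned scraper crate (see Future Enhancements) would replace string matching with proper HTML parsing.

// Hypothetical result type for a single SEO check; the project's real
// models live in src/models.rs.
#[derive(Debug)]
struct TitleCheck {
    present: bool,
    length: usize,
    within_recommended_length: bool,
}

// Naive string-based extraction; assumes lowercase <title> tags for brevity.
fn check_title(html: &str) -> TitleCheck {
    let title = html
        .find("<title")
        .and_then(|start| html[start..].find('>').map(|gt| start + gt + 1))
        .and_then(|body| {
            html[body..]
                .find("</title>")
                .map(|end| html[body..body + end].trim())
        });

    match title {
        Some(text) => {
            let length = text.chars().count();
            TitleCheck {
                present: true,
                length,
                // Common SEO guidance: roughly 30-60 characters.
                within_recommended_length: (30..=60).contains(&length),
            }
        }
        None => TitleCheck {
            present: false,
            length: 0,
            within_recommended_length: false,
        },
    }
}

fn main() {
    let result = check_title("<html><head><title>Example Site Title</title></head></html>");
    println!("{result:?}");
}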

⚡ Performance Crawler

Evaluates website performance metrics (two of these checks are sketched after the list):

  • Response time measurement
  • Page size analysis
  • External resource counting (scripts, stylesheets)
  • Compression detection (Brotli, Gzip, Deflate)
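
As a hedged illustration of two of these checks, the standalone sketch below measures time-to-response with std::time::Instant and reads the Content-Encoding header through reqwest; it is not the code from src/crawlers/performance.rs. Note that when reqwest's gzip/brotli features are enabled it decompresses transparently and strips this header, so detection details depend on the client configuration.

use std::time::Instant;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();

    let started = Instant::now();
    // Advertise compression support so the server may compress the response.
    let response = client
        .get("https://example.com")
        .header(reqwest::header::ACCEPT_ENCODING, "br, gzip, deflate")
        .send()
        .await?;
    // send() resolves once headers arrive, so this measures time-to-headers.
    let elapsed = started.elapsed();

    let encoding = response
        .headers()
        .get(reqwest::header::CONTENT_ENCODING)
        .and_then(|value| value.to_str().ok())
        .unwrap_or("none")
        .to_owned();
    let body = response.bytes().await?;

    println!("response time: {elapsed:?}");
    println!("page size:     {} bytes", body.len());
    println!("compression:   {encoding}");
    Ok(())
}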

♿ A11Y (Accessibility) Crawler

Checks web accessibility standards:

  • HTML lang attribute
  • Image alt attributes
  • ARIA landmarks and attributes
  • Semantic HTML5 tags
  • Form label associations
  • Skip navigation links

📊 Output Formats

  • Terminal: Color-coded, human-readable output
  • JSON: Machine-readable format for integration (sketched after the list)
  • HTML: Styled report for sharing
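
The JSON path, for instance, is built on serde / serde_json (both listed under Dependencies). The sketch below uses a hypothetical Report struct to show the general shape; the crawler's actual schema is defined by its models in src/models.rs and src/output.rs.

use serde::Serialize;

// Hypothetical report shape, for illustration only.
#[derive(Serialize)]
struct Report {
    url: String,
    crawler: String,
    passed: Vec<String>,
    warnings: Vec<String>,
}

fn main() -> serde_json::Result<()> {
    let report = Report {
        url: "https://example.com".into(),
        crawler: "seo".into(),
        passed: vec!["title present".into()],
        warnings: vec!["missing meta description".into()],
    };
    // `--format json` emits machine-readable output along these lines.
    println!("{}", serde_json::to_string_pretty(&report)?);
    Ok(())
}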

Prerequisites

  • Docker
  • Make

Note: Rust and Cargo are NOT required on your host machine. They are included in the Docker container.

Getting Started

Installation

First, build the Docker image with the latest Rust version:

make install

This command builds a Docker image containing the latest version of Rust.

Usage

Interactive Mode

make run            # Run in debug mode
make run-release    # Run in release mode

This launches an interactive prompt for choosing the URL and the crawlers to run.

CLI Mode

# Analyze a URL with all crawlers
docker run --rm rustcrawler cargo run -- --url https://example.com --all

# Run specific crawlers
docker run --rm rustcrawler cargo run -- --url https://example.com --seo --performance

# Generate JSON report
docker run --rm rustcrawler cargo run -- --url https://example.com --all --format json --output report.json

# Generate HTML report
docker run --rm rustcrawler cargo run -- --url https://example.com --all --format html --output report.html

# Use custom configuration
docker run --rm rustcrawler cargo run -- --url https://example.com --all --config config.json

# Override settings
docker run --rm rustcrawler cargo run -- --url https://example.com --all --timeout 60 --max-links 20

CLI Options

  • --url <URL>: URL to analyze
  • --seo: Run SEO crawler
  • --performance: Run Performance crawler
  • --a11y: Run A11Y crawler
  • --all: Run all crawlers
  • --format <terminal|json|html>: Output format (default: terminal)
  • --output <FILE>: Output file for JSON/HTML
  • --config <FILE>: Configuration file path
  • --timeout <SECONDS>: Request timeout
  • --max-links <N>: Maximum internal links to check

Configuration File

Create a config.json:

{
  "timeout_secs": 30,
  "max_links_to_check": 10,
  "user_agent": "RustCrawler/0.1.0",
  "follow_redirects": true,
  "max_redirects": 5
}
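
These fields map naturally onto a serde-deserialized struct. The following is a plausible sketch, assuming the real definition in src/config.rs looks broadly similar (it may add defaults or validation):

use serde::Deserialize;

// Assumed mirror of config.json; the actual struct lives in src/config.rs.
#[derive(Debug, Deserialize)]
struct Config {
    timeout_secs: u64,
    max_links_to_check: usize,
    user_agent: String,
    follow_redirects: bool,
    max_redirects: usize,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let raw = std::fs::read_to_string("config.json")?;
    let config: Config = serde_json::from_str(&raw)?;
    println!("{config:#?}");
    Ok(())
}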

Building and Running

All commands run inside the Docker container, so you don't need Rust installed locally.

Build

make build          # Build in debug mode
make build-release  # Build in release mode

Testing

make test           # Run all tests (17 tests)
make test-verbose   # Run tests with verbose output

Code Formatting

make format         # Format code with rustfmt
make format-check   # Check formatting without modifying files

Linting

make lint           # Run clippy linter
make check          # Check if code compiles

Development

make shell          # Open a shell in the Docker container

Cleaning

make clean          # Remove build artifacts and Docker image

Help

make help           # Display all available targets

Dependencies

The project uses the following main dependencies:

  • reqwest - HTTP client for making requests
  • url - URL parsing and validation
  • colored - Terminal color output
  • tokio - Async runtime
  • thiserror - Custom error types
  • serde / serde_json - Serialization for JSON output
  • clap - Command-line argument parsing
  • chrono - Date/time handling for reports

Project Structure

The project follows Rust best practices with a modular architecture:

RustCrawler/
├── src/
│   ├── main.rs              # Application entry point with CLI
│   ├── lib.rs               # Library root with public exports
│   ├── cli.rs               # CLI argument definitions
│   ├── config.rs            # Configuration management
│   ├── error.rs             # Custom error types
│   ├── client.rs            # HTTP client wrapper
│   ├── models.rs            # Data models and validation
│   ├── output.rs            # JSON/HTML report generation
│   ├── utils.rs             # Utility functions for I/O and display
│   └── crawlers/
│       ├── mod.rs           # Crawler trait and common functions
│       ├── seo.rs           # SEO crawler implementation
│       ├── performance.rs   # Performance crawler implementation
│       └── a11y.rs          # Accessibility crawler implementation
├── Cargo.toml               # Rust dependencies and project configuration
├── Dockerfile               # Docker container setup
├── Makefile                 # Build and run commands
├── ARCHITECTURE.md          # Detailed architecture documentation
└── README.md                # This file

Architecture Highlights

  • Modular Design: Each crawler is implemented in its own module with the Crawler trait
  • Separation of Concerns: HTTP client, models, configuration, and utilities are separate modules
  • Error Handling: Custom error types using thiserror for better error messages
  • Configuration: Externalized configuration with JSON file support
  • CLI + Interactive: Supports both command-line and interactive modes
  • Multiple Outputs: Terminal, JSON, and HTML report formats
  • Testable: 17 unit tests covering all major functionality
  • Extensible: Easy to add new crawlers by implementing the Crawler trait (see the sketch after this list)
  • Type Safety: Strong typing with custom models for data structures
  • Library + Binary: Can be used as a library or standalone application
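
To make the extensibility point concrete, a new crawler plugs in by implementing the trait. The sketch below is hypothetical: the actual trait lives in src/crawlers/mod.rs and its signature will differ (for example, the real crawlers work through the HTTP client wrapper rather than taking a raw HTML string).

// Hypothetical sketch of the Crawler trait; illustrative names throughout.
trait Crawler {
    // Short name used in reports, e.g. "seo" or "a11y".
    fn name(&self) -> &'static str;

    // Analyze a fetched page and return findings.
    fn analyze(&self, url: &str, html: &str) -> Vec<Finding>;
}

// Hypothetical finding type; real data models are in src/models.rs.
struct Finding {
    message: String,
    passed: bool,
}

// Adding a crawler means one new type plus one trait impl.
struct ExampleCrawler;

impl Crawler for ExampleCrawler {
    fn name(&self) -> &'static str {
        "example"
    }

    fn analyze(&self, _url: &str, html: &str) -> Vec<Finding> {
        vec![Finding {
            message: "page fetched and ready for checks".to_string(),
            passed: !html.is_empty(),
        }]
    }
}

fn main() {
    let crawler = ExampleCrawler;
    for finding in crawler.analyze("https://example.com", "<html></html>") {
        println!("[{}] {} (passed: {})", crawler.name(), finding.message, finding.passed);
    }
}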

Contributing

When contributing to this project:

  1. Ensure your code builds with make build
  2. Run tests with make test (17 tests should pass)
  3. Format code with make format
  4. Check for linting issues with make lint
  5. Follow Rust naming conventions and best practices
  6. Add tests for new functionality

Recent Improvements

Version 0.1.0

  • ✅ Custom error types with thiserror
  • ✅ Configuration management (JSON file support)
  • ✅ CLI with clap for non-interactive use
  • ✅ JSON and HTML output formats
  • ✅ Configurable timeouts and limits
  • ✅ User-agent customization
  • ✅ Redirect policy configuration
  • ✅ 17 comprehensive unit tests

Future Enhancements

  • Async/await for parallel crawling
  • HTML parser (scraper crate) for more accurate analysis
  • Integration tests with mock servers
  • Sitemap crawling
  • Rate limiting
  • Retry logic with exponential backoff
