Crawler

A concurrent web crawler written in Go that extracts page data and discovers links within a domain.

Features

  • Concurrent crawling with configurable worker pool
  • Respects domain boundaries (only crawls pages within the same domain)
  • Extracts H1 tags and outgoing links from each page
  • Thread-safe operation with mutex locks
  • Configurable maximum page limit
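
A minimal sketch of the shared state these features imply: a buffered channel caps the worker pool, and a mutex guards the visited-page map. The type and field names (config, pages, concurrencyControl) are assumptions for illustration, not the repository's actual identifiers.

package crawler

import (
	"net/url"
	"sync"
)

type config struct {
	pages              map[string]int  // visited URLs -> times encountered
	baseURL            *url.URL        // domain boundary for the crawl
	mu                 *sync.Mutex     // guards pages
	concurrencyControl chan struct{}   // buffered channel acting as the worker-pool limiter
	wg                 *sync.WaitGroup // tracks in-flight crawl goroutines
	maxPages           int             // stop after this many pages
}

// addPageVisit records a visit and reports whether this URL is new.
func (cfg *config) addPageVisit(normalizedURL string) (isFirst bool) {
	cfg.mu.Lock()
	defer cfg.mu.Unlock()
	if _, visited := cfg.pages[normalizedURL]; visited {
		cfg.pages[normalizedURL]++
		return false
	}
	cfg.pages[normalizedURL] = 1
	return true
}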

Installation

go install github.com/Cheemx/crawler@latest

Usage

crawler <base_url> <max_concurrent_requests> <max_pages>

Arguments

  1. base_url - The starting URL to crawl (must include protocol, e.g., https://example.com)
  2. max_concurrent_requests - Maximum number of concurrent HTTP requests (integer)
  3. max_pages - Maximum number of pages to crawl before stopping (integer)
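
A hedged sketch of how these three arguments might be read and validated in main; the variable names and error messages are illustrative only.

package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	if len(os.Args) != 4 {
		fmt.Println("usage: crawler <base_url> <max_concurrent_requests> <max_pages>")
		os.Exit(1)
	}
	baseURL := os.Args[1]
	maxConcurrency, err := strconv.Atoi(os.Args[2])
	if err != nil {
		fmt.Println("error: max_concurrent_requests must be an integer")
		os.Exit(1)
	}
	maxPages, err := strconv.Atoi(os.Args[3])
	if err != nil {
		fmt.Println("error: max_pages must be an integer")
		os.Exit(1)
	}
	fmt.Printf("starting crawl of %s with %d workers, up to %d pages\n", baseURL, maxConcurrency, maxPages)
}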

Examples

# Crawl up to 100 pages with 5 concurrent requests
crawler https://example.com 5 100

# Crawl up to 50 pages with 10 concurrent requests
crawler https://blog.example.com 10 50

Output

The crawler visits every in-domain page it discovers, following redirects along the way, and writes a CSV report to report.csv that can be used for further data processing.
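
A sketch of how such a report could be written with Go's encoding/csv package. The pageData type and the column layout (page URL, H1 text, outgoing-link count) are assumptions for illustration, not the repository's actual schema.

package crawler

import (
	"encoding/csv"
	"os"
	"strconv"
)

type pageData struct {
	URL      string
	H1       string
	OutLinks []string
}

// writeReport writes one CSV row per crawled page, with a header row first.
func writeReport(pages []pageData, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	defer w.Flush()

	if err := w.Write([]string{"page_url", "h1", "outgoing_links"}); err != nil {
		return err
	}
	for _, p := range pages {
		if err := w.Write([]string{p.URL, p.H1, strconv.Itoa(len(p.OutLinks))}); err != nil {
			return err
		}
	}
	return nil
}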

How It Works

  1. The crawler starts at the provided base URL
  2. It fetches the HTML content and extracts page data (H1 tags and links)
  3. For each discovered link within the same domain, it spawns a new goroutine to crawl that page
  4. The concurrency control channel limits the number of simultaneous requests
  5. A visited map prevents crawling the same URL multiple times
  6. Crawling stops when the maximum page limit is reached or no more pages are found
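
A minimal sketch of the crawl step described above, reusing the config type from the earlier sketch. extractURLs is an illustrative helper built on golang.org/x/net/html; the repository may structure this differently. Redirects are followed automatically by http.Get.

package crawler

import (
	"io"
	"net/http"
	"net/url"
	"strings"

	"golang.org/x/net/html"
)

// crawlPage fetches one page, records the visit, and recursively crawls every
// in-domain link it finds, bounded by the concurrency channel and the page limit.
func (cfg *config) crawlPage(rawCurrentURL string) {
	cfg.concurrencyControl <- struct{}{} // acquire a worker slot
	defer func() {
		<-cfg.concurrencyControl // release the slot
		cfg.wg.Done()
	}()

	// Stop once the maximum page count has been reached.
	cfg.mu.Lock()
	if len(cfg.pages) >= cfg.maxPages {
		cfg.mu.Unlock()
		return
	}
	cfg.mu.Unlock()

	// Respect the domain boundary.
	current, err := url.Parse(rawCurrentURL)
	if err != nil || current.Hostname() != cfg.baseURL.Hostname() {
		return
	}

	// Skip URLs that were already visited.
	if first := cfg.addPageVisit(current.String()); !first {
		return
	}

	resp, err := http.Get(rawCurrentURL)
	if err != nil {
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return
	}

	// One goroutine per discovered link; the channel above caps how many
	// of them fetch at the same time.
	for _, link := range extractURLs(string(body), cfg.baseURL) {
		cfg.wg.Add(1)
		go cfg.crawlPage(link)
	}
}

// extractURLs walks the parsed HTML and resolves every anchor href against
// the base URL, returning absolute links.
func extractURLs(body string, base *url.URL) []string {
	doc, err := html.Parse(strings.NewReader(body))
	if err != nil {
		return nil
	}
	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					if resolved, err := base.Parse(attr.Val); err == nil {
						links = append(links, resolved.String())
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links
}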

Limitations

  • Does not handle authentication or cookies
  • No rate limiting beyond concurrent request control
