A concurrent web crawler written in Go that extracts page data and discovers links within a domain.
- Concurrent crawling with configurable worker pool
- Respects domain boundaries (only crawls pages within the same domain)
- Extracts H1 tags and outgoing links from each page (see the extraction sketch after this list)
- Thread-safe operation with mutex locks
- Configurable maximum page limit
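
The README does not show the extraction code itself, but a minimal sketch of how the H1 and link extraction can be done with the `golang.org/x/net/html` parser looks like this (function names are illustrative assumptions, not necessarily the repository's implementation):

```go
import (
	"strings"

	"golang.org/x/net/html"
)

// extractPageData returns the text of the first H1 element and the href
// values of all anchor tags in an HTML document. This is a sketch of the
// kind of extraction the crawler performs, not its exact implementation.
func extractPageData(body string) (h1 string, links []string, err error) {
	doc, err := html.Parse(strings.NewReader(body))
	if err != nil {
		return "", nil, err
	}
	var walk func(n *html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			switch n.Data {
			case "h1":
				if h1 == "" {
					h1 = strings.TrimSpace(textOf(n))
				}
			case "a":
				for _, attr := range n.Attr {
					if attr.Key == "href" {
						links = append(links, attr.Val)
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return h1, links, nil
}

// textOf concatenates every text node beneath n.
func textOf(n *html.Node) string {
	if n.Type == html.TextNode {
		return n.Data
	}
	var sb strings.Builder
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		sb.WriteString(textOf(c))
	}
	return sb.String()
}
```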
```bash
go install github.com/Cheemx/crawler@latest
```

```bash
crawler <base_url> <max_concurrent_requests> <max_pages>
```

- `base_url` - The starting URL to crawl (must include protocol, e.g., https://example.com)
- `max_concurrent_requests` - Maximum number of concurrent HTTP requests (integer)
- `max_pages` - Maximum number of pages to crawl before stopping (integer)
```bash
# Crawl up to 100 pages with 5 concurrent requests
crawler https://example.com 5 100

# Crawl up to 50 pages with 10 concurrent requests
crawler https://blog.example.com 10 50
```

The crawler visits every discovered page within the domain, follows redirects, and writes a CSV report to `report.csv` that can be used for further data processing.
- The crawler starts at the provided base URL
- It fetches the HTML content and extracts page data (H1 tags and links)
- For each discovered link within the same domain, it spawns a new goroutine to crawl that page (see the sketch after this list)
- The concurrency control channel limits the number of simultaneous requests
- A visited map prevents crawling the same URL multiple times
- Crawling stops when the maximum page limit is reached or no more pages are found
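
A minimal sketch of this flow, assuming a buffered channel used as a semaphore, a mutex-guarded visited map, and a WaitGroup; names and structure are illustrative, not the repository's actual code:

```go
import (
	"net/url"
	"sync"
)

// crawler holds the shared state described above. The fetch field stands in
// for the HTTP GET plus link extraction step (see the extraction sketch earlier).
type crawler struct {
	mu       sync.Mutex
	wg       sync.WaitGroup
	visited  map[string]bool
	sem      chan struct{} // capacity = max_concurrent_requests
	maxPages int
	baseHost string
	fetch    func(pageURL string) []string
}

// shouldVisit reports whether a URL is new, in-domain, and within the page budget.
func (c *crawler) shouldVisit(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil || u.Hostname() != c.baseHost {
		return false // respect the domain boundary
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.visited) >= c.maxPages || c.visited[raw] {
		return false
	}
	c.visited[raw] = true
	return true
}

// crawl fetches one page while holding a semaphore slot, then spawns a new
// goroutine for every in-domain link that has not been seen before.
func (c *crawler) crawl(pageURL string) {
	defer c.wg.Done()

	c.sem <- struct{}{} // acquire a slot; blocks when the pool is saturated
	links := c.fetch(pageURL)
	<-c.sem // release the slot

	for _, link := range links {
		if c.shouldVisit(link) {
			c.wg.Add(1)
			go c.crawl(link)
		}
	}
}
```

Crawling would then start by marking the base URL as visited, calling `c.wg.Add(1)` and `go c.crawl(baseURL)`, and waiting on `c.wg.Wait()` until no more pages are found or the page budget is exhausted.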
- Does not handle authentication or cookies
- No rate limiting beyond concurrent request control