Crawler

A concurrent web crawler written in Go that extracts page data and discovers links within a domain.

Features

  • Concurrent crawling with configurable worker pool
  • Respects domain boundaries (only crawls pages within the same domain)
  • Extracts H1 tags and outgoing links from each page
  • Thread-safe operation with mutex locks
  • Configurable maximum page limit
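
A minimal sketch of the shared state these features imply: a buffered channel caps the worker pool, and a mutex guards the visited-page map. The type and field names (config, pages, concurrencyControl) are assumptions for illustration, not the repository's actual identifiers.

package crawler

import (
	"net/url"
	"sync"
)

type config struct {
	pages              map[string]int  // visited URLs -> times encountered
	baseURL            *url.URL        // domain boundary for the crawl
	mu                 *sync.Mutex     // guards pages
	concurrencyControl chan struct{}   // buffered channel acting as the worker-pool limiter
	wg                 *sync.WaitGroup // tracks in-flight crawl goroutines
	maxPages           int             // stop after this many pages
}

// addPageVisit records a visit and reports whether this URL is new.
func (cfg *config) addPageVisit(normalizedURL string) (isFirst bool) {
	cfg.mu.Lock()
	defer cfg.mu.Unlock()
	if _, visited := cfg.pages[normalizedURL]; visited {
		cfg.pages[normalizedURL]++
		return false
	}
	cfg.pages[normalizedURL] = 1
	return true
}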

Installation

go install github.com/Cheemx/crawler@latest

Usage

crawler <base_url> <max_concurrent_requests> <max_pages>

Arguments

  1. base_url - The starting URL to crawl (must include protocol, e.g., https://example.com)
  2. max_concurrent_requests - Maximum number of concurrent HTTP requests (integer)
  3. max_pages - Maximum number of pages to crawl before stopping (integer)
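
A hedged sketch of how these three arguments might be read and validated in main; the variable names and error messages are illustrative only.

package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	if len(os.Args) != 4 {
		fmt.Println("usage: crawler <base_url> <max_concurrent_requests> <max_pages>")
		os.Exit(1)
	}
	baseURL := os.Args[1]
	maxConcurrency, err := strconv.Atoi(os.Args[2])
	if err != nil {
		fmt.Println("error: max_concurrent_requests must be an integer")
		os.Exit(1)
	}
	maxPages, err := strconv.Atoi(os.Args[3])
	if err != nil {
		fmt.Println("error: max_pages must be an integer")
		os.Exit(1)
	}
	fmt.Printf("starting crawl of %s with %d workers, up to %d pages\n", baseURL, maxConcurrency, maxPages)
}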

Examples

# Crawl up to 100 pages with 5 concurrent requests
crawler https://example.com 5 100

# Crawl up to 50 pages with 10 concurrent requests
crawler https://blog.example.com 10 50

Output

The crawler visits every in-domain page it discovers, following redirects along the way, and writes a CSV report to report.csv that can be used for further data processing.
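
A sketch of how such a report could be written with Go's encoding/csv package. The pageData type and the column layout (page URL, H1 text, outgoing-link count) are assumptions for illustration, not the repository's actual schema.

package crawler

import (
	"encoding/csv"
	"os"
	"strconv"
)

type pageData struct {
	URL      string
	H1       string
	OutLinks []string
}

// writeReport writes one CSV row per crawled page, with a header row first.
func writeReport(pages []pageData, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := csv.NewWriter(f)
	defer w.Flush()

	if err := w.Write([]string{"page_url", "h1", "outgoing_links"}); err != nil {
		return err
	}
	for _, p := range pages {
		if err := w.Write([]string{p.URL, p.H1, strconv.Itoa(len(p.OutLinks))}); err != nil {
			return err
		}
	}
	return nil
}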

How It Works

  1. The crawler starts at the provided base URL
  2. It fetches the HTML content and extracts page data (H1 tags and links)
  3. For each discovered link within the same domain, it spawns a new goroutine to crawl that page
  4. The concurrency control channel limits the number of simultaneous requests
  5. A visited map prevents crawling the same URL multiple times
  6. Crawling stops when the maximum page limit is reached or no more pages are found
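
A minimal sketch of the crawl step described above, reusing the config type from the earlier sketch. extractURLs is an illustrative helper built on golang.org/x/net/html; the repository may structure this differently. Redirects are followed automatically by http.Get.

package crawler

import (
	"io"
	"net/http"
	"net/url"
	"strings"

	"golang.org/x/net/html"
)

// crawlPage fetches one page, records the visit, and recursively crawls every
// in-domain link it finds, bounded by the concurrency channel and the page limit.
func (cfg *config) crawlPage(rawCurrentURL string) {
	cfg.concurrencyControl <- struct{}{} // acquire a worker slot
	defer func() {
		<-cfg.concurrencyControl // release the slot
		cfg.wg.Done()
	}()

	// Stop once the maximum page count has been reached.
	cfg.mu.Lock()
	if len(cfg.pages) >= cfg.maxPages {
		cfg.mu.Unlock()
		return
	}
	cfg.mu.Unlock()

	// Respect the domain boundary.
	current, err := url.Parse(rawCurrentURL)
	if err != nil || current.Hostname() != cfg.baseURL.Hostname() {
		return
	}

	// Skip URLs that were already visited.
	if first := cfg.addPageVisit(current.String()); !first {
		return
	}

	resp, err := http.Get(rawCurrentURL)
	if err != nil {
		return
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return
	}

	// One goroutine per discovered link; the channel above caps how many
	// of them fetch at the same time.
	for _, link := range extractURLs(string(body), cfg.baseURL) {
		cfg.wg.Add(1)
		go cfg.crawlPage(link)
	}
}

// extractURLs walks the parsed HTML and resolves every anchor href against
// the base URL, returning absolute links.
func extractURLs(body string, base *url.URL) []string {
	doc, err := html.Parse(strings.NewReader(body))
	if err != nil {
		return nil
	}
	var links []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, attr := range n.Attr {
				if attr.Key == "href" {
					if resolved, err := base.Parse(attr.Val); err == nil {
						links = append(links, resolved.String())
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return links
}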

Limitations

  • Does not handle authentication or cookies
  • No rate limiting beyond concurrent request control
