
Concurrent Web Crawler

Concurrent web crawler in Go using goroutines.

Usage

In your main function:

crawler.NewCrawler(&crawler.Config{
	Depth:            3,
	Breadth:          0,                      // Breadth = 0 follows every link on a page.
	NumWorkers:       100,
	RequestThrottler: 100 * time.Millisecond, // At most 10 requests per second per domain.
}, outputFile, errorFile, &crawler.SameDomain{}).Crawl([]string{
	"https://example.com",
	"https://another.com",
})
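
For context, here is a fuller sketch of a complete main function. The import path and the assumption that outputFile and errorFile are plain *os.File handles are guesses from the repository name and the call signature above, not documented API; check the package source for the exact types it expects.

package main

import (
	"log"
	"os"
	"time"

	"github.com/TrentaIcedCoffee/crawler" // import path assumed from the repository name
)

func main() {
	// Assumption: the crawler writes CSV rows to outputFile and errors to errorFile.
	outputFile, err := os.Create("output.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer outputFile.Close()

	errorFile, err := os.Create("errors.log")
	if err != nil {
		log.Fatal(err)
	}
	defer errorFile.Close()

	crawler.NewCrawler(&crawler.Config{
		Depth:            3,
		Breadth:          0,
		NumWorkers:       100,
		RequestThrottler: 100 * time.Millisecond,
	}, outputFile, errorFile, &crawler.SameDomain{}).Crawl([]string{
		"https://example.com",
	})
}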

Output is in CSV format with columns <depth>,<url>,<text>,<page_title>,<page_content>, and can be loaded directly into pandas.
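
For example, a single output row might look like the following (illustrative values only, not actual crawler output):

0,https://example.com,Example Domain,Example Domain,This domain is for use in illustrative examples in documents.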

Design

[design diagram]

How it works

[diagram]
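
The original diagram is not reproduced here. As a rough, non-authoritative sketch of the general pattern it depicts (a fixed pool of goroutines, sized like Config.NumWorkers, consuming URLs from a shared channel), independent of this repository's actual implementation:

package main

import (
	"fmt"
	"sync"
)

// fetch is a stand-in for the real HTTP fetch and link extraction.
func fetch(url string) {
	fmt.Println("crawled:", url)
}

func main() {
	const numWorkers = 4
	frontier := make(chan string, 100) // URLs waiting to be crawled
	var wg sync.WaitGroup

	// Fixed pool of worker goroutines draining the frontier.
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range frontier {
				fetch(url)
			}
		}()
	}

	for _, seed := range []string{"https://example.com"} {
		frontier <- seed
	}
	close(frontier)
	wg.Wait()
}

The real crawler additionally re-enqueues discovered links with depth tracking, deduplication, and per-domain throttling, which this sketch omits.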
