Program to crawl the web. Uses companion package creep.

Given a website or a list of websites, crawl fetches each page and then fetches every page that page links to.
And so on? Sometimes. See below.

To install:
$ go get github.com/RickyS/Crawl
$ go get github.com/RickyS/Creep

You'll need both packages; they depend on each other. The main program is crawl. The working package is creep. Note the capital Cs in the names passed to 'go get'.

To run:
$ go run crawl.go
Or, on Linux:
$ go install -v -x && crawl

Instead of command-line arguments, the crawl program reads a JSON file. The default is a simple, small file:
iana.json, which is among the included files.

crawl.go contains the line:
jobData := creep.LoadJobData("iana.json")
which specifies that the parameter file iana.json will be used.

Todo: The name of the JSON file should be a command-line argument.
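
One way the todo above might be handled is with the standard flag package. This is only a sketch, not code from crawl.go: the helper jobFileName and the flag name -job are assumptions made up for illustration, and the default stays iana.json as it is today.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// jobFileName returns the JSON job file named by the -job flag,
// falling back to iana.json when the flag is absent.
// Both the helper and the flag name "job" are hypothetical.
func jobFileName(args []string) (string, error) {
	fs := flag.NewFlagSet("crawl", flag.ContinueOnError)
	name := fs.String("job", "iana.json", "path to the JSON job file")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *name, nil
}

func main() {
	name, err := jobFileName(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	// In crawl.go this value would replace the hard-coded name:
	//   jobData := creep.LoadJobData(name)
	fmt.Println("job file:", name)
}
```

With something like this in place, a different parameter file could be chosen at run time:
$ go run crawl.go -job golang.json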

Technically, crawling the whole web isn't possible: the number of web pages is effectively infinite. So after numerous tests that filled up my machine, I have artificially capped the number of web pages crawled.
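
To illustrate the idea of a capped crawl, here is a minimal sketch of a breadth-first walk that stops after a fixed number of pages. It is not how creep does it: the function crawlLimited, the maxPages parameter, and the in-memory links map (standing in for real HTTP fetches) are all assumptions made for the example.

```go
package main

import "fmt"

// crawlLimited walks links breadth-first from the start pages and stops
// once maxPages pages have been visited. The links map is a stand-in for
// fetching: it maps a page to the pages it links to. (Hypothetical helper.)
func crawlLimited(links map[string][]string, start []string, maxPages int) []string {
	seen := make(map[string]bool)
	var visited []string
	queue := append([]string(nil), start...)
	for len(queue) > 0 && len(visited) < maxPages {
		page := queue[0]
		queue = queue[1:]
		if seen[page] {
			continue // already crawled; avoids looping on cycles
		}
		seen[page] = true
		visited = append(visited, page)
		queue = append(queue, links[page]...)
	}
	return visited
}

func main() {
	links := map[string][]string{
		"a": {"b", "c"},
		"b": {"a", "d"},
		"c": {"d"},
		"d": {"e"},
	}
	fmt.Println(crawlLimited(links, []string{"a"}, 3)) // prints [a b c]
}
```

Without the cap, the same walk would only stop when it runs out of unseen links, which on the real web is effectively never.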

There are parameters in the JSON file that adjust these limits. TBD.
