Go Crawl

A pragmatic and fast way to crawl Common Crawl with Golang.

By Chris Cates. Licensed under MIT.

How does it work?

  1. Run go run extract.go in the root directory.
  2. It extracts all files that contain matches for "chiropractor".
  3. Feel free to tinker with the code to extract what you need.
  4. Works with .WET, .WARC and .WAT files.

Why not multithreaded?

Multithreading could easily be used to optimize this project; however, the biggest bottleneck is the network, not the Go language itself.

If you're CURLing into a Hadoop cluster, then I highly suggest optimizing this code with multithreading.