A pragmatic and fast way to crawl Common Crawl with Golang.
By Chris Cates. Licensed under MIT.
How does it work?
- Simply run `go run extract.go` in the root directory.
- It will extract all files that contain matches for "chiropractor".
- Feel free to tinker with the code to extract what you need.
- Works with .WET, .WARC and .WAT files.
Why not multithreaded?
Multithreading could easily be used to optimize this project; however, the biggest bottleneck is the network, not the Go language itself.
If you're cURLing into a Hadoop cluster, then I highly suggest optimizing this code with multithreading.