Common Crawler

🕸 A simple and easy way to extract data from Common Crawl with little or no hassle.


Notice regarding development

I currently do not have the capacity to hire full time; however, I do intend to hire someone to help build infrastructure related to Common Crawl. All Gitcoin bounties are on hold for now. When I have time to invest further in this project, I will look for a full-time DevOps developer to work on it. Payment will be in DAI, with a resource allocation of approximately 5k/mo.

As a GUI

An Electron-based interface that works with a Go server will be available.

As a library

Install as a dependency:

go get github.com/ChrisCates/CommonCrawler

Access the library functions by importing it:

package main

import (
  cc "github.com/ChrisCates/CommonCrawler"
)

func main() {
  // Library functions must be exported (capitalized)
  // to be callable from another package.
  cc.Scan()
  cc.Download()
  cc.Extract()
  // And so forth
}
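
To make the pipeline concrete: downloading boils down to joining the base URI with one line of wet.paths and streaming the response into the data folder. Below is a minimal self-contained sketch of that step using only the standard library; downloadWET is a hypothetical helper for illustration, not the library's actual API.

package main

import (
  "bufio"
  "fmt"
  "io"
  "net/http"
  "os"
  "path"
)

// downloadWET streams a single gzipped WET file from Common Crawl to disk.
// baseURI matches the baseURI config value; wetPath is one line of wet.paths.
func downloadWET(baseURI, wetPath, dataFolder string) error {
  resp, err := http.Get(baseURI + wetPath)
  if err != nil {
    return err
  }
  defer resp.Body.Close()
  if resp.StatusCode != http.StatusOK {
    return fmt.Errorf("unexpected status %s for %s", resp.Status, wetPath)
  }

  if err := os.MkdirAll(dataFolder, 0o755); err != nil {
    return err
  }
  out, err := os.Create(path.Join(dataFolder, path.Base(wetPath)))
  if err != nil {
    return err
  }
  defer out.Close()

  _, err = io.Copy(out, resp.Body)
  return err
}

func main() {
  // Fetch the first entry listed in wet.paths.
  f, err := os.Open("wet.paths")
  if err != nil {
    panic(err)
  }
  defer f.Close()

  scanner := bufio.NewScanner(f)
  if scanner.Scan() {
    if err := downloadWET("https://commoncrawl.s3.amazonaws.com/", scanner.Text(), "output/crawl-data"); err != nil {
      panic(err)
    }
  }
}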

As a command line tool

Install from source:

go install github.com/ChrisCates/CommonCrawler@latest

Or download a prebuilt binary from GitHub (-L follows GitHub's redirect to the raw file) and make it executable:

curl -L https://github.com/ChrisCates/CommonCrawler/raw/master/dist/commoncrawler -o commoncrawler
chmod +x commoncrawler

Then run as a binary:

# Output help
commoncrawler --help

# Specify configuration
commoncrawler --base-uri https://commoncrawl.s3.amazonaws.com/
commoncrawler --wet-paths wet.paths
commoncrawler --data-folder output/crawl-data
commoncrawler --start 0
commoncrawler --stop 5 # -1 will loop through all wet files from wet.paths

# Start crawling the web
commoncrawler start --stop -1
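
For example, assuming the flags above compose with the start subcommand in a single invocation, crawling the first ten WET files into a custom output folder would look like:

commoncrawler start \
  --base-uri https://commoncrawl.s3.amazonaws.com/ \
  --wet-paths wet.paths \
  --data-folder output/crawl-data \
  --start 0 \
  --stop 10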

Compilation and Configuration

Installing dependencies

go get github.com/logrusorgru/aurora

Downloading data with the application

First, configure the extractor; the fields below control what is downloaded and where it is stored.

// Config is the preset variables for your extractor
type Config struct {
    baseURI     string
    wetPaths    string
    dataFolder  string
    matchFolder string
    start       int
    stop        int
}

// Defaults (cwd is the process's current working directory)
Config{
    start:       0,
    stop:        5,
    baseURI:     "https://commoncrawl.s3.amazonaws.com/",
    wetPaths:    path.Join(cwd, "wet.paths"),
    dataFolder:  path.Join(cwd, "output/crawl-data"),
    matchFolder: path.Join(cwd, "output/match-data"),
}
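
For orientation, here is a minimal sketch of how the CLI flags documented above could populate this struct using the standard flag package. It assumes it lives in the same package as Config; the --match-folder flag is hypothetical (the CLI section only lists the others), and the actual wiring in src/ may differ.

// buildConfig maps the documented CLI flags onto Config.
func buildConfig() Config {
    cwd, _ := os.Getwd()

    baseURI := flag.String("base-uri", "https://commoncrawl.s3.amazonaws.com/", "Common Crawl S3 base URI")
    wetPaths := flag.String("wet-paths", path.Join(cwd, "wet.paths"), "file listing WET paths")
    dataFolder := flag.String("data-folder", path.Join(cwd, "output/crawl-data"), "where raw WET files are stored")
    matchFolder := flag.String("match-folder", path.Join(cwd, "output/match-data"), "where matched data is stored")
    start := flag.Int("start", 0, "index of the first WET file to process")
    stop := flag.Int("stop", 5, "index to stop at; -1 processes every file in wet.paths")
    flag.Parse()

    return Config{
        baseURI:     *baseURI,
        wetPaths:    *wetPaths,
        dataFolder:  *dataFolder,
        matchFolder: *matchFolder,
        start:       *start,
        stop:        *stop,
    }
}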

With Docker

docker build -t commoncrawler .
docker run commoncrawler

Without Docker

go build -o ./dist/commoncrawler ./src/*.go
./dist/commoncrawler

Or you can simply run it directly:

go run src/*.go

Resources

  • MIT Licensed

  • If there is interest, I can create a documentation and tutorial page at https://commoncrawl.chriscates.ca

  • You are welcome to post valid issues, and I may fund them based on priority.
