README: General overview of content
+ TODO: Full instructions for downloading the data files locally
+ TODO: Compiling to a runnable JAR using Eclipse
+ TODO: Link back to CommonCrawl.org
Smerity committed Apr 2, 2014
1 parent cffe1b8 commit 893c2c2
README.md

![Common Crawl Logo](http://commoncrawl.org/wp-content/uploads/2012/04/ccLogo.png)

# Common Crawl WARC Examples

This repository contains wrappers for processing WARC files in Hadoop MapReduce jobs, along with example jobs to get you started.

There are three examples for Hadoop processing:

+ [WARC files] HTML tag frequency counter using raw HTTP responses
+ [WAT files] Server response analysis using response metadata
+ [WET files] Classic word count example using extracted text (a minimal mapper for this case is sketched below)
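
To give a feel for the shape of these jobs, here is a minimal sketch of the word-count mapper (the WET case) using the standard Hadoop MapReduce API. It is a sketch, not code from this repository: it assumes each record's extracted text arrives as the map value, and the actual record wrappers in this repository may expose the payload differently.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch: classic word count over extracted text. In the WET example the
    // value would be a record's plain-text payload; plain Text is used here
    // for illustration.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }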

All three examples initially assume that the files are stored locally, but they can be trivially modified to pull the data down from Common Crawl's Amazon S3 bucket.
To acquire the files locally, you can use [S3Cmd](http://s3tools.org/s3cmd) or a similar tool:

s3cmd get s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz
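
Alternatively, a job can read from S3 directly by pointing its input path at the public bucket. The driver below is hypothetical, not code from this repository: it reuses the WARC path from the `s3cmd` line above purely to show the path form (a WET word-count job would target the matching `.wet.gz` file instead), and it assumes your Hadoop installation has `s3n://` filesystem support configured.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hypothetical driver: identical to a local run except that the input
    // path points at Common Crawl's public S3 bucket.
    public class S3WordCount {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "cc-wordcount");
            job.setJarByClass(S3WordCount.class);
            job.setMapperClass(WordCountMapper.class); // mapper from the sketch above
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(
                "s3n://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz"));
            FileOutputFormat.setOutputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }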

# License

MIT License, as per `LICENSE`
