Prometheus Common Crawl Extractors

This repository contains MapReduce extractors that preprocess and extract web pages from the Common Crawl corpus.

Use mrjob.conf to configure how the jobs run on AWS EMR.

Installation

The original ccmrjob repo uses Python 2.7; this repository has been upgraded to Python 3, which requires a different library to read the WARC files.

For Python 3:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements_python3.txt

For local testing, the get-data.sh script downloads 100 WET files. It uses HTTPie for downloading, so either install HTTPie or change the script to use cURL or wget.

./get-data.sh input/test-100.wet
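
If neither HTTPie, cURL nor wget is available, the list of download URLs can also be built in Python. The sketch below is not taken from get-data.sh; it assumes the input is a Common Crawl wet.paths listing (one relative path per line) and that files are served under https://data.commoncrawl.org/. The function name wet_urls and its limit parameter are hypothetical.

```python
from itertools import islice

# Assumption: public HTTPS prefix under which Common Crawl files are served.
BASE_URL = "https://data.commoncrawl.org/"


def wet_urls(path_lines, limit=100):
    """Turn the first `limit` non-empty lines of a wet.paths listing into URLs."""
    paths = (line.strip() for line in path_lines)
    non_empty = (p for p in paths if p)
    return [BASE_URL + p for p in islice(non_empty, limit)]
```

The function accepts any iterable of lines, so it can be fed an open file handle on a decompressed wet.paths file. Each resulting URL could then be fetched with urllib.request.urlretrieve from the standard library.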

Extractors

Obama Born Extractor

This simple extractor finds documents that match a regex for the phrase "obama born in".

Test it locally with:

python obama_born_extractor.py --conf-path mrjob.conf --no-output --output-dir out input/test-1.wet
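
To illustrate the matching step on its own, here is a minimal sketch that uses only the standard library. It is not the code in obama_born_extractor.py: the pattern OBAMA_BORN and the helper find_matches are assumptions made for illustration, and plain (url, text) pairs stand in for the WET records that the MapReduce job would receive.

```python
import re

# Assumed pattern: the real regex in obama_born_extractor.py may differ.
OBAMA_BORN = re.compile(r"obama\s+(?:was\s+)?born\s+in\s+(\w+)", re.IGNORECASE)


def find_matches(records):
    """Yield (url, matched_text) for each record whose text matches.

    `records` is any iterable of (url, text) pairs, standing in for the
    WET records that the MapReduce job would receive.
    """
    for url, text in records:
        match = OBAMA_BORN.search(text)
        if match:
            yield url, match.group(0)


if __name__ == "__main__":
    # Hypothetical sample documents, not real crawl data.
    sample = [
        ("http://example.com/a", "Barack Obama was born in Hawaii in 1961."),
        ("http://example.com/b", "Nothing relevant on this page."),
    ]
    for url, snippet in find_matches(sample):
        print(url, snippet)
```

Keeping the match logic in a plain generator like this makes it easy to check against a few strings before running the full job on WET files.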

