Creates a CDXJ index for the CommonCrawl NEWS dataset (there is no official index server for it).
- Set AWS API key and secret in boto.cfg (see example: example_boto.cfg)
- Set GNU parallel nodefile (see example: example_nodefile)
- Copy this directory to the same path on all machines
- Set parameters as environment variables:
    - PYTHON (default: python3)
    - OUTPUT_DIR (default: $(PWD)/output)
    - BOTO_CFG (default: $(PWD)/boto.cfg)
    - NO_OF_THREADS (default: 80)
    - NICEVALUE (default: 10)
- Set the languages to collect in languages_to_collect.txt. The format is "[LANGUAGE NAME AS IN LINGUA]": (with the quotes and the trailing colon, because the entries are grepped from a JSONL for speed)
- Run make to execute the whole process, or consult the Makefile for the individual steps
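
To illustrate why the entries in languages_to_collect.txt carry the quotes and the trailing colon, here is a minimal Python sketch of fixed-string filtering over JSONL lines, similar in spirit to grepping without parsing the JSON. It is not the code the Makefile runs. The record layout (a Lingua language name used as a JSON key) and the helper names load_patterns and filter_jsonl are assumptions made for illustration only.

```python
def load_patterns(lines):
    """Return the non-empty patterns, e.g. '"HUNGARIAN":' (hypothetical entry)."""
    return [line.strip() for line in lines if line.strip()]


def filter_jsonl(jsonl_lines, patterns):
    """Yield the raw lines that contain at least one pattern as a substring.

    The lines are never parsed as JSON, which is what makes this approach fast.
    """
    for line in jsonl_lines:
        if any(pattern in line for pattern in patterns):
            yield line


# Assumed content of languages_to_collect.txt (one hypothetical entry).
patterns = load_patterns(['"HUNGARIAN":\n', "\n"])

# Assumed JSONL records; the real field layout may differ.
records = [
    '{"url": "https://example.com/a", "HUNGARIAN": 0.98}',
    '{"url": "https://example.com/b", "ENGLISH": 0.99}',
    '{"url": "https://example.com/c", "title": "HUNGARIAN news"}',
]

matched = list(filter_jsonl(records, patterns))
for line in matched:
    print(line)
```

In this sketch the quotes and the colon make the pattern match only where the language name appears as a key, so the third record, which merely mentions the name inside a value, is skipped.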
This code is licensed under the GPL 3.0 license.