Scans CommonCrawl datasets for keywords. Scans the whole month of CommonCrawl data using Amazon EC2 c5n.16xlarge instance for hundreds of keywords in about 4 hours. Developed with support from the EU and the Populism & Civic Engagement H2020 project.
Various setup steps for installing on a AWS Ubuntu 20.04
wget -O- https://apt.corretto.aws/corretto.key | sudo apt-key add - sudo add-apt-repository 'deb https://apt.corretto.aws stable main' sudo apt-get update; sudo apt-get install -y java-15-amazon-corretto-jdk sudo apt install build-essential cmake libboost-all-dev ragel maven git clone git://github.com/intel/hyperscan cd hyperscan cmake -DBUILD_SHARED_LIBS=YES make sudo make install cd git clone https://github.com/CitizensFoundation/pace-commoncrawl-scanner.git cd pace-commoncrawl-scanner mvn clean package mkdir /home/ubuntu/pace-commoncrawl-scanner/results cd /home sudo ln -s ubuntu/ robert cd cd pace-commoncrawl-scanner
Prepare the page ranks file into the condensed format
processScripts/getLatestPageRanking.sh 2020 11 https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/cc-main-2020-jul-aug-sep-host-ranks.txt.gz processScripts/processHostRanksFile.sh 2020 11
Step 1 - Download files list
processScripts/getLatestWetPathsAndDownloadAll.sh 2020 11 https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-50/wet.paths.gz 72000
Step 2- Download, gunzip and scan the files
processScripts/scan.sh 2020 11
Step 3 - Import into ElasticSearch (can be done in parallel with step 2)
processScripts/importToES.sh 2020 11
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 822337. Any dissemination of results here presented reflects only the consortium’s view. The Agency is not responsible for any use that may be made of the information it contains.