CommonCrawl keyword scanner. Scanning a month of CC data for hundreds of keywords takes about 3 hours on an EC2 c5.18xlarge instance. LLM (BERT) based second-level filtering. Developed with support from the EU and the Populism & Civic Engagement H2020 project.

CitizensFoundation/pace-keyword-scanner

pace-commoncrawl-scanner

Scans CommonCrawl datasets for keywords. Scans a whole month of CommonCrawl data for hundreds of keywords in about 4 hours on an Amazon EC2 c5n.16xlarge instance. Developed with support from the EU and the Populism & Civic Engagement H2020 project.

Setup steps for installing on an AWS Ubuntu 20.04 instance

wget -O- https://apt.corretto.aws/corretto.key | sudo apt-key add - 
sudo add-apt-repository 'deb https://apt.corretto.aws stable main'
sudo apt-get update; sudo apt-get install -y java-15-amazon-corretto-jdk

sudo apt install build-essential cmake libboost-all-dev ragel maven

git clone https://github.com/intel/hyperscan.git
cd hyperscan
cmake -DBUILD_SHARED_LIBS=YES .
make 
sudo make install

cd

git clone https://github.com/CitizensFoundation/pace-commoncrawl-scanner.git
cd pace-commoncrawl-scanner
mvn clean package

mkdir /home/ubuntu/pace-commoncrawl-scanner/results

cd /home
sudo ln -s ubuntu/ robert

cd
cd pace-commoncrawl-scanner
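
Before moving on, it can help to confirm that the build tools installed above are actually on the PATH. The following is a minimal sketch, not part of the repository: check_tools is a hypothetical helper that prints the name of every command it cannot find and returns a non-zero status if any are missing.

```shell
# check_tools is a hypothetical helper, not a script shipped with this project.
# It prints each missing command name on its own line and returns 1 if any
# command is missing, 0 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For the packages installed above, a call could look like: check_tools java mvn cmake ragel git || echo "some tools are missing". This only checks commands, not libraries, so it does not verify that the Hyperscan shared library was installed.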

Convert the page ranks file into the condensed format

processScripts/getLatestPageRanking.sh 2020 11 https://commoncrawl.s3.amazonaws.com/projects/hyperlinkgraph/cc-main-2020-jul-aug-sep/host/cc-main-2020-jul-aug-sep-host-ranks.txt.gz
processScripts/processHostRanksFile.sh 2020 11

Step 1 - Download the file list

processScripts/getLatestWetPathsAndDownloadAll.sh 2020 11 https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-50/wet.paths.gz 72000
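
To inspect what the file list contains, the sketch below turns entries from wet.paths.gz into full download URLs. It assumes each line of the list is a path relative to the bucket root, and it reuses the base URL from the command above. paths_to_urls is a hypothetical helper for illustration and is not how the project's own scripts are known to work.

```shell
# Base URL taken from the wet.paths.gz link used in Step 1.
CC_BASE_URL="https://commoncrawl.s3.amazonaws.com"

# paths_to_urls is a hypothetical helper: it reads relative paths on stdin
# and writes one full URL per line, skipping blank lines.
paths_to_urls() {
  while IFS= read -r path; do
    if [ -n "$path" ]; then
      printf '%s/%s\n' "$CC_BASE_URL" "$path"
    fi
  done
  return 0
}
```

With a downloaded list, gunzip -c wet.paths.gz | head -n 3 | paths_to_urls would print the first three URLs.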

Step 2 - Download, gunzip, and scan the files

processScripts/scan.sh 2020 11

Step 3 - Import into Elasticsearch (can be done in parallel with step 2)

processScripts/importToES.sh 2020 11
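
Since steps 2 and 3 can run in parallel, one way to start both from a single shell is sketched below. run_step and scan_and_import are hypothetical helpers, not part of the repository. Setting DRY_RUN=1 only prints the commands instead of running them. Whether the importer picks up results that the scanner writes after the import has started is an assumption here, so check the import output before relying on this.

```shell
# run_step is a hypothetical helper: with DRY_RUN=1 it prints the command,
# otherwise it executes it.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# scan_and_import is a hypothetical wrapper: it starts the scan and the
# import as background jobs, waits for both, and returns 1 if either failed.
scan_and_import() {
  year="$1"
  month="$2"
  run_step processScripts/scan.sh "$year" "$month" &
  scan_pid=$!
  run_step processScripts/importToES.sh "$year" "$month" &
  import_pid=$!
  status=0
  wait "$scan_pid" || status=1
  wait "$import_pid" || status=1
  return "$status"
}
```

Running DRY_RUN=1 scan_and_import 2020 11 from the repository root prints the two commands without executing them; dropping DRY_RUN=1 runs them for real.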

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 822337. Any dissemination of results here presented reflects only the consortium’s view. The Agency is not responsible for any use that may be made of the information it contains.
