Sign-of-life Crawler

This repository contains the source code of the sign-of-life domain crawler.

File architecture

sign_of_life_crawler
├── .docker
│   └── data            # Folder for input data of Docker instances
├── app_domains         # Source code of the crawler, in particular:
│   ├── config.py       # General parameters of the crawler, including the input folder to scan ("input_folder")
│   └── main_domains.py # Main script of the crawler
├── input               # Input lists of URLs + various required input files (registrar lists, trained ML models, etc.)
│   └── folder_X        # The CSV files with the input lists of domains must be added to a folder inside input
├── inter               # Folder containing intermediary results
├── output              # Folder containing output files with the crawler classification, 1 file per input file
├── docs                # Documentation of the crawler
├── env_required        # Specific libraries to install
└── logging             # Logging folder if the option is activated in config.py

Setting up the environment

Hardware

System requirements (the config.py parameters mentioned below are illustrated in the sketch after this list):

  • number of files opened/stored in one folder: 10k files (CONTROLLER_LIMIT parameter in config.py)
  • memory while gathering website pages: ~8 GB of RAM on average for 10k files (CONTROLLER_LIMIT parameter in config.py)
  • number of processes in parallel: must be less than the number of CPUs of your machine (MAX_PROCESSES and WORKERS_POST_PROCESSING in config.py)
  • no antivirus, or a deactivated one (it may otherwise remove the intermediary files that contain HTML pages)
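
The actual defaults live in app_domains/config.py; the snippet below is only a hedged sketch of how these parameters might be set (the parameter names come from this README, the values are assumptions to adapt to your machine):

    # app_domains/config.py -- illustrative values only, adjust to your hardware
    from os.path import join
    from multiprocessing import cpu_count

    MAIN_DIR = "."                                       # project root (assumption)
    input_folder = join(MAIN_DIR, "input", "folder_X")   # folder containing the input CSV files

    CONTROLLER_LIMIT = 10_000                            # max files opened/stored in one folder
    MAX_PROCESSES = max(cpu_count() - 1, 1)              # crawling processes, keep below the CPU count
    WORKERS_POST_PROCESSING = max(cpu_count() - 1, 1)    # post-processing workers, keep below the CPU count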

Software

You can manually install all dependencies with Anaconda or use a containerised system with Docker.

Docker

Set up the .env file ( cp .docker/.env.example .docker/.env ) to configure the database connection. The compose setup includes a service for a PostgreSQL database, but the code also accepts remote connections.
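
The actual keys are defined in .docker/.env.example; the names and values below are purely hypothetical placeholders showing the kind of connection settings to fill in:

    # .docker/.env -- hypothetical keys, check .env.example for the real ones
    POSTGRES_HOST=db
    POSTGRES_PORT=5432
    POSTGRES_USER=crawler
    POSTGRES_PASSWORD=change_me
    POSTGRES_DB=sign_of_life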

After this, build the Docker image as usual (cd .docker && docker compose build).

Running the Docker system allows you to deploy the crawler in a number of instances (replicas). By default, docker compose deploys 2 replicas, which need at least 16 GB of RAM. To control the number of replicas, use --scale crawler=<number_of_instances>

In summary, to run the crawler with Docker:

docker compose up --scale crawler=<number_of_instances>

Every container scans the data/*.csv files for URLs (the directory is shared with .docker/data on the host).

Anaconda (Python)

Download the latest Anaconda at https://www.anaconda.com/products/individual
Install it and tick the box to add conda to the PATH.

At the end of this step, the command "conda" must be recognized by the command line.
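
For example, the following should print the installed version:

conda --version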

PostgreSQL

In Windows:

In Linux, run:

  • sudo apt-get install -y libpq-dev

Chrome

In Windows:

In Linux:

  • for a system-wide Chrome, run the following commands:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb

sudo apt install ./google-chrome-stable_current_amd64.deb

ChromeDriver

Get the chromedriver download link at https://chromedriver.chromium.org/downloads
Make sure the version matches your version of Chrome (visible by typing chrome://settings/help in the Chrome address bar)

In Windows:

  • Download and unzip the driver in a location of your choice, then add that location to the PATH variable.

In Linux:

  • for example, with version 83:

    • wget https://chromedriver.storage.googleapis.com/83.0.4103.39/chromedriver_linux64.zip

    • unzip chromedriver_linux64.zip

    • sudo cp ./chromedriver /usr/bin/chromedriver
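
To check that Chrome and ChromeDriver work together before launching the crawler, a small smoke test like the one below can help (this is only a sketch, not part of the crawler; it assumes the Selenium 3.x API is installed in the environment and that the driver was copied to /usr/bin/chromedriver as above):

    # chromedriver smoke test (illustrative only)
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")      # no display needed
    options.add_argument("--no-sandbox")

    driver = webdriver.Chrome(executable_path="/usr/bin/chromedriver", options=options)
    driver.get("https://example.com")
    print(driver.title)                     # expected: "Example Domain"
    driver.quit()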

Visual C++ Build Tools

In Windows:

In Linux:

  • sudo apt install build-essential

Libraries

To install the required libraries, run the following commands:

conda create -n centr python=3.7

conda activate centr

python -m pip install -U pip

pip install -r requirements_before_torch.txt

pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

pip install -r requirements_after_torch.txt

To work around some version conflicts:
pip install -r requirements_final.txt

For the word_forms library:

navigate to "sign_of_life/env_required/word_forms-master"

conda activate centr (if not already activated)

python setup.py install

Usage

Categories

The domains are classified into the following categories:

(domain categorization diagram)

Use

  1. Put the list of domains to scan in one or more CSV files. Each file must have a column named "url" (see the example after these steps).
    Move these files into a unique folder (named "folder_X" for this example) in sign_of_life_crawler\input

  2. In config.py, update the parameter "input_folder" = join(MAIN_DIR, "input", "folder_X")

  3. run the main script:
    python main_domains.py

At the end of the run, the output file is saved in output, with the same name as the input file, prefixed with "final_".
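
For example, with a hypothetical input file my_domains.csv containing:

    url
    example.com
    example.org

placed in input/folder_X, config.py would point to that folder with "input_folder" = join(MAIN_DIR, "input", "folder_X"), and after running python main_domains.py the result would be written to output/final_my_domains.csv.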

Set the connection to the database in the .env file ( cp .env.example .env ).
