_____ _ _____ _ _ _
/ ____| | | | __ \| | (_) | |
| (___ __ _ _ _ __ _| |_| |__) | |__ _ ___| |__
\___ \ / _` | | | |/ _` | __| ___/| '_ \| / __| '_ \
____) | (_| | |_| | (_| | |_| | | | | | \__ \ | | | - Crawler
|_____/ \__, |\__,_|\__,_|\__|_| |_| |_|_|___/_| |_|
| |
|_|
Welcome to SquatPhish-Crawler. It is part of the SquatPhish project and crawls squatting domains to detect phishing pages.
A distributed crawler that captures screenshots and logs redirections
- web-based crawling
- mobile-based-crawling
- distributed-crawling using shared memory counter
- add JS tracking
- auto-submitting forms and logging
bash install.sh
Run the demo:
node demo.js
You will get a web screenshot and a mobile screenshot, along with their redirections and HTML source, in the test folder.
Usage can be customized by following the function calls in demo.js.
We provide a distributed crawling framework to crawl a large number of URLs.
The idea is borrowed from Hadoop's MapReduce. We split the URL list into chunks (stored under the chunks folder), then dispatch a task for each sublist, identified by its chunk ID.
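The chunking idea can be sketched as follows. This is a minimal illustration and not the project's parse.py: the function name `split_into_chunks` and the round-robin assignment are assumptions, and the real script may split and name chunks differently.

```python
from typing import List


def split_into_chunks(urls: List[str], num_chunks: int) -> List[List[str]]:
    """Distribute URLs round-robin into num_chunks sublists (hypothetical helper)."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(num_chunks)]
    for i, url in enumerate(urls):
        # The list index of each sublist doubles as its chunk ID.
        chunks[i % num_chunks].append(url)
    return chunks
```

Round-robin assignment keeps chunk sizes within one URL of each other, so no single task ends up much larger than the rest.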
The demo gives step-by-step instructions for crawling FB squatting domains. For how to detect squatting domains, refer to SquatPhish-1.
We use test/crawl_demo_url_list.txt as example data:
- Clean chunks and create data folder:
rm -rf chunks/*
mkdir data
- Split the list into chunks: The first argument is the file. The second argument is the number of chunks.
python3 parse.py test/crawl_demo_url_list.txt 10
- Compile the task dispatcher: You can define your PROC_MAX in task_dispatcher.c
#define PROC_MAX 5 // The maximum number of processes the machine can tolerate
gcc task_dispatcher.c --std=c99
- Run jobs with an index range:
./a.out 0 10
You could separate tasks among different servers, e.g.,
server 1 >> ./a.out 0 2
server 2 >> ./a.out 3 4
server 3 >> ./a.out 5 10
- Get results in a new data folder
bash run_distribute.sh test/crawl_demo_url_list.txt 10
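The dispatch steps above can be pictured with the following sketch. It is written in Python for illustration only and is not a port of task_dispatcher.c: the names `dispatch` and `crawl_chunk`, the inclusive index range, and the use of a thread pool to cap concurrency are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

PROC_MAX = 5  # cap on concurrent workers, mirroring the PROC_MAX idea (assumed semantics)


def dispatch(start: int, end: int, crawl_chunk: Callable[[int], str],
             max_workers: int = PROC_MAX) -> Dict[int, str]:
    """Run crawl_chunk for each chunk ID in [start, end], at most max_workers at a time."""
    if start > end:
        raise ValueError("start must not exceed end")
    chunk_ids = range(start, end + 1)  # treating the range as inclusive is an assumption
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with chunk IDs.
        results = list(pool.map(crawl_chunk, chunk_ids))
    return dict(zip(chunk_ids, results))
```

For example, `dispatch(0, 2, lambda cid: f"crawled chunk {cid}")` handles chunks 0 through 2. In a real setup, the hypothetical `crawl_chunk` callback would launch the crawler for that chunk, and each server would call `dispatch` with its own index range.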
This is a research prototype; use it at your own risk.
If you find this tool useful, citing it as 🐕 SquatPhish 🐕 is highly appreciated.
Core contributor: ke tian @ririhedou
Thanks to hang hu @0xorz for reproduction testing.
Current version is 0.0.2, updated on June 4, 2018.