Home

What is DarkSpider?

DarkSpider is a Python script that crawls and extracts (regular or onion) webpages through the Tor network.

Warning

Crawling is not illegal, but violating copyright is. It's always best to double-check a website's T&C before crawling it. Some websites set up a robots.txt file to tell crawlers which pages not to visit. This crawler lets you bypass those rules, but we always recommend respecting robots.txt.
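
As an illustration of respecting robots.txt, here is a minimal sketch that checks a URL against a site's rules using Python's standard urllib.robotparser module. It is not part of DarkSpider; the helper name is_allowed and the EXAMPLE_RULES text are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules; a real check would first download <site>/robots.txt.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""
```

With these example rules, is_allowed(EXAMPLE_RULES, "http://example.com/index.html") returns True, while a URL under /private/ returns False.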

Keep in mind

Extracting and crawling through the Tor network takes some time. That's normal behaviour; you can find more information here.

What makes it simple?

With a single argument you can read an .onion webpage or a regular one through the Tor network, and using pipes you can pass the output to any other tool you prefer.

$ python darkspider.py -u http://github.com/ | grep 'google-site-verification'
    <meta name="google-site-verification" content="xxxx">
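
Any program that reads standard input can take the place of grep in the pipe. The sketch below is one possible filter, assuming the piped output is the page's HTML: it prints the content of every meta tag with a given name. The script name find_meta.py, the MetaFinder class and the find_meta function are hypothetical and not part of DarkSpider.

```python
import sys
from html.parser import HTMLParser


class MetaFinder(HTMLParser):
    """Collect the content attribute of <meta> tags with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.name:
            self.values.append(attributes.get("content"))


def find_meta(html, name):
    """Return the content values of all <meta name=...> tags found in html."""
    finder = MetaFinder(name)
    finder.feed(html)
    finder.close()
    return finder.values


# Only read stdin when a tag name is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for value in find_meta(sys.stdin.read(), sys.argv[1]):
        print(value)
```

Saved as find_meta.py, it could be used as: python darkspider.py -u http://github.com/ | python find_meta.py google-site-verification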

If you want to crawl the links of a webpage, use -c and you will get a folder with all the extracted links. You can even use -d to set the crawl depth, so that the extracted links are crawled in turn. There is also the -p argument, which waits the given number of seconds before the next crawl.

$ python darkspider.py -v -u http://github.com/ -c -d 2 -p 2
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com
[ DEBUG ] Folder created :: github.com
[ INFO  ] Crawler started from http://github.com with 2 depth, 2.0 seconds delay and using 16 Threads. Excluding 'None' links.
[ INFO  ] Step 1 completed :: 87 result(s)
[ INFO  ] Step 2 completed :: 4228 result(s)
[ INFO  ] Network Structure created :: github.com/network_structure.json

Note

The output in this Readme is trimmed for readability. The full verbose output is much more detailed.
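
To make the idea behind the depth (-d) and pause (-p) options concrete, here is a minimal sketch of a depth-limited, breadth-first crawl with a delay between fetches. It is not DarkSpider's actual implementation, and its step counts need not match the output shown above. The names crawl, get_links and FAKE_SITE are assumptions for illustration, and the link lookup uses an in-memory dictionary instead of real network requests.

```python
import time


def crawl(start, depth, pause, get_links, sleep=time.sleep):
    """Breadth-first crawl from start, following links for `depth` steps.

    get_links(url) must return the links found on url.
    Returns one list per step with the new links discovered in that step.
    """
    seen = {start}
    frontier = [start]
    steps = []
    for _ in range(depth):
        found = []
        for url in frontier:
            for link in get_links(url):
                if link not in seen:  # skip links that were already visited
                    seen.add(link)
                    found.append(link)
            sleep(pause)  # wait before the next fetch
        steps.append(found)
        frontier = found  # the next step crawls only the new links
    return steps


# Hypothetical site structure standing in for real pages.
FAKE_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c", "http://example.com/"],
    "http://example.com/b": ["http://example.com/c"],
}
```

Calling crawl("http://example.com/", 2, 0, lambda url: FAKE_SITE.get(url, [])) returns two lists: the links found on the start page, then the new links found on those pages.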
