As Extractor

Note

The extractor takes the maximum file name length into consideration and creates sub-directories based on the URL.

http://a.com/b.ext?x=&y=$%z2 -> a.com/b.extxyz2_.html (a.com folder with b.extxyz2_.html file in it)
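For illustration only, a hypothetical helper that reproduces the mapping above could look like the following (this is not DarkSpider's actual code; the function name and the 240-character cap are assumptions):

# Hypothetical sketch of the URL-to-path mapping shown above; not DarkSpider's code.
import os
import re
from urllib.parse import urlparse

def url_to_output_path(url: str) -> str:
    parsed = urlparse(url)
    # Folder is the host; the file name is the path plus the query with
    # unsafe characters stripped, truncated, and suffixed with "_.html".
    name = parsed.path.lstrip("/") + re.sub(r"[^A-Za-z0-9.]", "", parsed.query)
    return os.path.join(parsed.netloc, name[:240] + "_.html")

print(url_to_output_path("http://a.com/b.ext?x=&y=$%z2"))  # a.com/b.extxyz2_.html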

  • To extract a single webpage to the terminal:
$ python darkspider.py -u http://github.com/
## Termex :: Extracting http://github.com to terminal
## http://github.com ::
<!DOCTYPE html>
...
</html>
  • Extract into a file (github.html) without the use of TOR:
$ python darkspider.py -w -u http://github.com -o github.html
## Outex :: Extracting http://github.com to github.com/github.html
  • Extract to terminal and find only the line with google-site-verification:
$ python darkspider.py -u http://github.com/ | grep 'google-site-verification'
    <meta name="google-site-verification" content="xxxx">
  • Extract to a file and find only the line with google-site-verification using Yara:
$ python darkspider.py -v -w -u https://github.com -e -y 0
...

Note

Update res/keyword.yar to search for other keywords. Use -y 0 to search the raw HTML and -y 1 to search the extracted text only.
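As a rough, hedged sketch (not DarkSpider's implementation), the keyword matching can be reproduced with the yara-python package; the rule source below is only an example of what an entry in res/keyword.yar might look like:

import yara

# Example rule; edit the strings to search for your own keywords.
rules = yara.compile(source=r'''
rule keyword_search
{
    strings:
        $kw = "google-site-verification" nocase
    condition:
        any of them
}
''')

raw_html = open("github.com/github.html").read()
# -y 0 matches against the raw HTML; -y 1 would match the extracted text only.
if rules.match(data=raw_html):
    print("Yara match found!")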

  • Extract a set of webpages (imported from a file) to a folder:
$ python darkspider.py -i links.txt -f links_output
...

As Crawler

  • Crawl the links of the webpage without using TOR, and show verbose output (really helpful):
$ python darkspider.py -v -w -u http://github.com/ -c
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: False
[ DEBUG ] URL :: http://github.com/
[ DEBUG ] Folder created :: github.com
[ INFO  ] Crawler started from http://github.com with 1 depth, 0 seconds delay and using 16 Threads. Excluding 'None' links.
[ INFO  ] Step 1 completed :: 87 result(s)
[ INFO  ] Network Structure created :: github.com/network_structure.json
  • Crawl the webpage with depth 2 (2 clicks) and a 5-second pause before crawling the next page:
$ python darkspider.py -v -u http://github.com/ -c -d 2 -p 5
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com
[ DEBUG ] Folder created :: github.com
[ INFO  ] Crawler started from http://github.com with 2 depth, 5.0 seconds delay and using 16 Threads. Excluding 'None' links.
[ INFO  ] Step 1 completed :: 87 result(s)
[ INFO  ] Step 2 completed :: 4228 result(s)
[ INFO  ] Network Structure created :: github.com/network_structure.json
  • Crawl the webpage with depth 1 (1 click), a 1-second pause, and exclude links that match .*\.blog:
$ python darkspider.py -v -u http://github.com/ -c -d 1 -p 1 -z ".*\.blog"
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com/
[ DEBUG ] Folder created :: github.com
[ INFO  ] Crawler started from http://github.com with 1 depth, 1.0 second delay and using 16 Threads. Excluding '.*\.blog' links.
[ INFO  ] Step 1 completed :: 85 result(s)
[ INFO  ] Network Structure created :: github.com/network_structure.json
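To make the crawl parameters above concrete, here is a simplified, hypothetical sketch of one crawl step: fetch every page of the current depth in a thread pool, pause between submissions, and drop links matching the -z exclusion pattern. It assumes the requests and beautifulsoup4 packages and is not DarkSpider's actual implementation:

import re
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

exclude = re.compile(r".*\.blog")  # pattern passed with -z
pause, threads = 1.0, 16           # values passed with -p and -t

def fetch_links(url):
    # Download one page and collect its absolute outgoing links.
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def crawl_step(urls):
    found = set()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = []
        for url in urls:
            futures.append(pool.submit(fetch_links, url))
            time.sleep(pause)  # delay before crawling the next page
        for future in futures:
            found.update(link for link in future.result() if not exclude.match(link))
    return found

print(len(crawl_step({"http://github.com/"})), "result(s)")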

As Crawler + Extractor

  • You can crawl a page and also extract the webpages into a folder with a single command:
$ python darkspider.py -v -u http://github.com/ -c -d 1 -p 1 -e
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com/
[ DEBUG ] Folder created :: github.com
[ INFO  ] Crawler started from http://github.com with 1 depth, 1.0 second delay and using 16 Threads. Excluding 'None' links.
[ INFO  ] Step 1 completed :: 87 result(s)
[ INFO  ] Network Structure created :: github.com/network_structure.json
[ INFO  ] Cinex :: Extracting from github.com/links.txt to github.com/extracted
[ DEBUG ] File created :: github.com/extracted/github.com/collections_.html
...
[ DEBUG ] File created :: github.com/extracted/github.community/_.html

Note

The default (and currently only) file for the crawler's links is links.txt. To extract along with the crawl, the -e argument is required.

  • Following the same logic, you can pipe all these pages to grep (for example) and search for specific text:
$ python darkspider.py -u http://github.com/ -c -e | grep '</html>'
</html>
</html>
...
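A small Python alternative to the grep pipe above (hedged; the folder layout is taken from the example run, where extracted pages land under github.com/extracted):

from pathlib import Path

needle = "</html>"
# Walk the extracted folder and report every page containing the search string.
for page in Path("github.com/extracted").rglob("*.html"):
    if needle in page.read_text(errors="ignore"):
        print(page)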

As Crawler + Extractor + Keyword Search

  • You can crawl a page, perform a keyword search, and extract the webpages that match the findings into a folder with a single command:
$ python darkspider.py -v -u http://github.com/ -o github.html -y 0
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com/
[ DEBUG ] Folder created :: github.com
[ INFO  ] Outex :: Extracting http://github.com to github.com/github.html
[ DEBUG ] http://github.com :: Yara match found!

Visualization of the Network Structure

  • Provide the -s argument to create graphs and gain insights from the generated data:
$ python darkspider.py -u "http://github.com/" -c -d 2 -p 1 -t 32 -s
## Crawler started from http://github.com with 2 depth, 1.0 second delay and using 32 Threads. Excluding 'None' links.
## Step 1 completed :: 87 result(s)
## Step 2 completed :: 4508 result(s)
## Network Structure created :: github.com/network_structure.json
## Generating :: Scatter Plot of the indegree vs nodes of the graph...
## Generating :: Bar Graph of the indegree vs percentage of nodes of the graph...
## Generating :: Scatter Plot of the outdegree vs nodes of the graph...
## Generating :: Bar Graph of the outdegree vs percentage of nodes of the graph...
## Generating :: Bar Graph of the eigenvector centrality vs percentage of nodes of the graph...
## Generating :: Bar Graph of the pagerank vs percentage of nodes of the graph...
## Generating :: Visualization of the graph...
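The plotted metrics can also be recomputed offline from network_structure.json. The JSON layout is not documented on this page, so the sketch below assumes a simple {url: [linked urls]} mapping and uses networkx; treat it as an illustration rather than DarkSpider's own plotting code:

import json
import networkx as nx

with open("github.com/network_structure.json") as fh:
    data = json.load(fh)

G = nx.DiGraph()
for page, links in data.items():  # assumed {url: [linked urls]} layout
    G.add_edges_from((page, link) for link in links)

indegree = dict(G.in_degree())    # basis of the indegree plots
outdegree = dict(G.out_degree())  # basis of the outdegree plots
centrality = nx.eigenvector_centrality(G, max_iter=1000)
pagerank = nx.pagerank(G)

print(sorted(pagerank, key=pagerank.get, reverse=True)[:5])  # most central pages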