As Extractor

Note

The extractor takes the maximum file name length into consideration and creates sub-directories based on the URL.

http://a.com/b.ext?x=&y=$%z2 -> a.com/b.extxyz2_.html (a.com folder with b.extxyz2_.html file in it)
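For illustration only, a hypothetical helper that reproduces the mapping above could look like the following (this is not DarkSpider's actual code; the function name and the 240-character cap are assumptions):

# Hypothetical sketch of the URL-to-path mapping shown above; not DarkSpider's code.
import os
import re
from urllib.parse import urlparse

def url_to_output_path(url: str) -> str:
    parsed = urlparse(url)
    # Folder is the host; the file name is the path plus the query with
    # unsafe characters stripped, truncated, and suffixed with "_.html".
    name = parsed.path.lstrip("/") + re.sub(r"[^A-Za-z0-9.]", "", parsed.query)
    return os.path.join(parsed.netloc, name[:240] + "_.html")

print(url_to_output_path("http://a.com/b.ext?x=&y=$%z2"))  # a.com/b.extxyz2_.html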

  • To extract a single webpage to the terminal:
$ python darkspider.py -u http://github.com/
## Termex :: Extracting http://github.com to terminal
## http://github.com ::
<!DOCTYPE html>
...
</html>
  • Extract into a file (github.html) without the use of TOR:
$ python darkspider.py -w -u http://github.com -o github.html
## Outex :: Extracting http://github.com to github.com/github.html
  • Extract to terminal and find only the line with google-site-verification:
$ python darkspider.py -u http://github.com/ | grep 'google-site-verification'
    <meta name="google-site-verification" content="xxxx">
  • Extract to a file and find only the line with google-site-verification using Yara:
$ python darkspider.py -v -w -u https://github.com -e -y 0
...

Note

Update res/keyword.yar to search for other keywords. Use -y 0 to search the raw HTML and -y 1 to search the extracted text only.
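As a rough, hedged sketch (not DarkSpider's implementation), the keyword matching can be reproduced with the yara-python package; the rule source below is only an example of what an entry in res/keyword.yar might look like:

import yara

# Example rule; edit the strings to search for your own keywords.
rules = yara.compile(source=r'''
rule keyword_search
{
    strings:
        $kw = "google-site-verification" nocase
    condition:
        any of them
}
''')

raw_html = open("github.com/github.html").read()
# -y 0 matches against the raw HTML; -y 1 would match the extracted text only.
if rules.match(data=raw_html):
    print("Yara match found!")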

  • Extract a set of webpages (imported from a file) to a folder:
$ python darkspider.py -i links.txt -f links_output
...

As Crawler

  • Crawl the links of the webpage without using TOR, and show verbose output (really helpful):
$ python darkspider.py -v -w -u http://github.com/ -c
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: False
[ DEBUG ] URL :: http://github.com/
[ DEBUG ] Folder created :: github.com
[ INFO  ] Crawler started from http://github.com with 1 depth, 0 seconds delay and using 16 Threads. Excluding 'None' links.
[ INFO  ] Step 1 completed :: 87 result(s)
[ INFO  ] Network Structure created :: github.com/network_structure.json
  • Crawl the webpage with depth 2 (2 clicks) and a 5-second pause before crawling the next page:
$ python darkspider.py -v -u http://github.com/ -c -d 2 -p 5
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com
[ DEBUG ] Folder created :: github.com
[ INFO  ] Crawler started from http://github.com with 2 depth, 5.0 seconds delay and using 16 Threads. Excluding 'None' links.
[ INFO  ] Step 1 completed :: 87 result(s)
[ INFO  ] Step 2 completed :: 4228 result(s)
[ INFO  ] Network Structure created :: github.com/network_structure.json
  • Crawl the webpage with depth 1 (1 click), a 1-second pause, and exclude links that match .*\.blog:
$ python darkspider.py -v -u http://github.com/ -c -d 1 -p 1 -z ".*\.blog"
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com/
[ DEBUG ] Folder created :: github.com
[ INFO  ] Crawler started from http://github.com with 1 depth, 1.0 second delay and using 16 Threads. Excluding '.*\.blog' links.
[ INFO  ] Step 1 completed :: 85 result(s)
[ INFO  ] Network Structure created :: github.com/network_structure.json
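To make the crawl parameters above concrete, here is a simplified, hypothetical sketch of one crawl step: fetch every page of the current depth in a thread pool, pause between submissions, and drop links matching the -z exclusion pattern. It assumes the requests and beautifulsoup4 packages and is not DarkSpider's actual implementation:

import re
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

exclude = re.compile(r".*\.blog")  # pattern passed with -z
pause, threads = 1.0, 16           # values passed with -p and -t

def fetch_links(url):
    # Download one page and collect its absolute outgoing links.
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def crawl_step(urls):
    found = set()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = []
        for url in urls:
            futures.append(pool.submit(fetch_links, url))
            time.sleep(pause)  # delay before crawling the next page
        for future in futures:
            found.update(link for link in future.result() if not exclude.match(link))
    return found

print(len(crawl_step({"http://github.com/"})), "result(s)")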

As Crawler + Extractor

  • You can crawl a page and also extract the webpages into a folder with a single command:
$ python darkspider.py -v -u http://github.com/ -c -d 1 -p 1 -e
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com/
[ DEBUG ] Folder created :: github.com
[ INFO  ] Crawler started from http://github.com with 1 depth, 1.0 second delay and using 16 Threads. Excluding 'None' links.
[ INFO  ] Step 1 completed :: 87 result(s)
[ INFO  ] Network Structure created :: github.com/network_structure.json
[ INFO  ] Cinex :: Extracting from github.com/links.txt to github.com/extracted
[ DEBUG ] File created :: github.com/extracted/github.com/collections_.html
...
[ DEBUG ] File created :: github.com/extracted/github.community/_.html

Note

The default (and currently only) file for the crawler's links is links.txt. To extract along with the crawl, the -e argument is required.

  • Following the same logic, you can pipe all these pages to grep (for example) and search for specific text:
$ python darkspider.py -u http://github.com/ -c -e | grep '</html>'
</html>
</html>
...
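A small Python alternative to the grep pipe above (hedged; the folder layout is taken from the example run, where extracted pages land under github.com/extracted):

from pathlib import Path

needle = "</html>"
# Walk the extracted folder and report every page containing the search string.
for page in Path("github.com/extracted").rglob("*.html"):
    if needle in page.read_text(errors="ignore"):
        print(page)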

As Crawler + Extractor + Keyword Search

  • You can crawl a page, perform a keyword search, and extract the webpages that match the findings into a folder with a single command:
$ python darkspider.py -v -u http://github.com/ -o github.html -y 0
[ DEBUG ] TOR is ready!
[ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
[ DEBUG ] URL :: http://github.com/
[ DEBUG ] Folder created :: github.com
[ INFO  ] Outex :: Extracting http://github.com to github.com/github.html
[ DEBUG ] http://github.com :: Yara match found!

Visualization of the Network Structure

  • Provide the -s argument to create graphs and gain insights from the generated data:
$ python darkspider.py -u "http://github.com/" -c -d 2 -p 1 -t 32 -s
## Crawler started from http://github.com with 2 depth, 1.0 second delay and using 32 Threads. Excluding 'None' links.
## Step 1 completed :: 87 result(s)
## Step 2 completed :: 4508 result(s)
## Network Structure created :: github.com/network_structure.json
## Generating :: Scatter Plot of the indegree vs nodes of the graph...
## Generating :: Bar Graph of the indegree vs percentage of nodes of the graph...
## Generating :: Scatter Plot of the outdegree vs nodes of the graph...
## Generating :: Bar Graph of the outdegree vs percentage of nodes of the graph...
## Generating :: Bar Graph of the eigenvector centrality vs percentage of nodes of the graph...
## Generating :: Bar Graph of the pagerank vs percentage of nodes of the graph...
## Generating :: Visualization of the graph...
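The plotted metrics can also be recomputed offline from network_structure.json. The JSON layout is not documented on this page, so the sketch below assumes a simple {url: [linked urls]} mapping and uses networkx; treat it as an illustration rather than DarkSpider's own plotting code:

import json
import networkx as nx

with open("github.com/network_structure.json") as fh:
    data = json.load(fh)

G = nx.DiGraph()
for page, links in data.items():  # assumed {url: [linked urls]} layout
    G.add_edges_from((page, link) for link in links)

indegree = dict(G.in_degree())    # basis of the indegree plots
outdegree = dict(G.out_degree())  # basis of the outdegree plots
centrality = nx.eigenvector_centrality(G, max_iter=1000)
pagerank = nx.pagerank(G)

print(sorted(pagerank, key=pagerank.get, reverse=True)[:5])  # most central pages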