How to Use
> **Note**
> The Extractor takes the maximum file name length into consideration and creates sub-directories based on the URL. For example, `http://a.com/b.ext?x=&y=$%z2` is saved as `a.com/b.extxyz2_.html` (an `a.com` folder with a `b.extxyz2_.html` file in it).
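To illustrate that mapping, here is a minimal sketch of how a URL might be reduced to a safe path. The `url_to_path` helper is hypothetical; it only mimics the behaviour shown above, and is not DarkSpider's actual implementation:

```python
import os
import re

def url_to_path(url, max_name_len=255):
    """Hypothetical sketch: map a URL to '<host>/<sanitized_name>_.html'."""
    host, _, rest = url.split("://", 1)[1].partition("/")
    # Drop characters that are unsafe in file names, keeping dots.
    name = re.sub(r"[^A-Za-z0-9.]", "", rest) or "index"
    # Respect the maximum file name length of the filesystem.
    name = name[: max_name_len - len("_.html")]
    return os.path.join(host, name + "_.html")

print(url_to_path("http://a.com/b.ext?x=&y=$%z2"))  # a.com/b.extxyz2_.html
```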
- To extract a single webpage to the terminal:

  ```
  $ python darkspider.py -u http://github.com/
  ## Termex :: Extracting http://github.com to terminal
  ## http://github.com ::
  <!DOCTYPE html>
  ...
  </html>
  ```
- Extract into a file (github.html) without the use of TOR:

  ```
  $ python darkspider.py -w -u http://github.com -o github.html
  ## Outex :: Extracting http://github.com to github.com/github.html
  ```
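For context, the `-w` flag only changes how the request is routed: without it, traffic goes through the local Tor SOCKS proxy. A rough sketch of that switch, assuming Tor's default proxy at `127.0.0.1:9050` and `requests` installed with the `socks` extra; the `fetch` helper is hypothetical:

```python
import requests

def fetch(url, without_tor=False):
    """Hypothetical sketch: fetch a page, optionally bypassing Tor."""
    proxies = {}
    if not without_tor:
        # 'socks5h' makes DNS resolution happen inside the proxy, which is
        # required for .onion hostnames.
        proxies = {
            "http": "socks5h://127.0.0.1:9050",
            "https": "socks5h://127.0.0.1:9050",
        }
    return requests.get(url, proxies=proxies, timeout=30).text

# Equivalent in spirit to the '-w ... -o github.html' example above.
with open("github.html", "w", encoding="utf-8") as f:
    f.write(fetch("http://github.com", without_tor=True))
```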
- Extract to the terminal and find only the line with `google-site-verification`:

  ```
  $ python darkspider.py -u http://github.com/ | grep 'google-site-verification'
  <meta name="google-site-verification" content="xxxx">
  ```
- Extract to a file and find only the line with `google-site-verification` using YARA:

  ```
  $ python darkspider.py -v -w -u https://github.com -e -y 0
  ...
  ```
> **Note**
> Update `res/keyword.yar` to search for other keywords. Use `-y 0` for raw HTML search and `-y 1` for text-only search.
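As a rough idea of what the keyword search involves, here is a sketch using the `yara-python` package; the inline rule is illustrative only, not the actual contents of `res/keyword.yar`:

```python
import yara

# Illustrative rule; res/keyword.yar would hold something in this spirit.
rules = yara.compile(source="""
rule keyword_match {
    strings:
        $kw = "google-site-verification" nocase
    condition:
        $kw
}
""")

html = open("github.html", encoding="utf-8").read()

# '-y 0' scans the raw HTML as below; '-y 1' would first strip the markup
# and scan only the visible text.
if rules.match(data=html):
    print("Yara match found!")
```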
- Extract a set of webpages (imported from file) to a folder:

  ```
  $ python darkspider.py -i links.txt -f links_output
  ...
  ```
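Conceptually, the `-i`/`-f` pair loops the single-page extraction over every URL in the input file. A minimal sketch under that assumption (the `extract_all` helper and the `page_<n>.html` naming are hypothetical):

```python
import os
import requests

def extract_all(input_file="links.txt", out_folder="links_output"):
    """Hypothetical sketch: save every URL in input_file into out_folder."""
    os.makedirs(out_folder, exist_ok=True)
    with open(input_file, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        html = requests.get(url, timeout=30).text
        # The real tool derives file names from the URL (see the Note at the
        # top); a counter keeps this sketch short.
        path = os.path.join(out_folder, f"page_{i}.html")
        with open(path, "w", encoding="utf-8") as out:
            out.write(html)

extract_all()
```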
- Crawl the links of the webpage without the use of TOR, and also show verbose output (really helpful):

  ```
  $ python darkspider.py -v -w -u http://github.com/ -c
  [ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: False
  [ DEBUG ] URL :: http://github.com/
  [ DEBUG ] Folder created :: github.com
  [ INFO ] Crawler started from http://github.com with 1 depth, 0 seconds delay and using 16 Threads. Excluding 'None' links.
  [ INFO ] Step 1 completed :: 87 result(s)
  [ INFO ] Network Structure created :: github.com/network_structure.json
  ```
- Crawl the webpage with depth 2 (2 clicks) and a 5-second pause before crawling the next page:

  ```
  $ python darkspider.py -v -u http://github.com/ -c -d 2 -p 5
  [ DEBUG ] TOR is ready!
  [ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
  [ DEBUG ] URL :: http://github.com
  [ DEBUG ] Folder created :: github.com
  [ INFO ] Crawler started from http://github.com with 2 depth, 5.0 seconds delay and using 16 Threads. Excluding 'None' links.
  [ INFO ] Step 1 completed :: 87 result(s)
  [ INFO ] Step 2 completed :: 4228 result(s)
  [ INFO ] Network Structure created :: github.com/network_structure.json
  ```
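The depth option drives a breadth-first expansion: step 1 gathers the links on the start page, step 2 follows each of those, and so on, which is why the result count jumps from 87 to 4228 above. A simplified sketch of that loop follows; the code is hypothetical, not DarkSpider's crawler (in the real tool the pause applies between individual page fetches), and its `exclude` parameter mirrors the `-z` option shown in the next example:

```python
import re
import time
from concurrent.futures import ThreadPoolExecutor

import requests

LINK_RE = re.compile(r'href="(https?://[^"]+)"')

def get_links(url):
    """Best-effort sketch: return the outgoing links of one page."""
    try:
        return set(LINK_RE.findall(requests.get(url, timeout=30).text))
    except requests.RequestException:
        return set()

def crawl(start, depth=2, pause=5.0, threads=16, exclude=None):
    """Breadth-first crawl: one 'step' per depth level."""
    seen, frontier = {start}, {start}
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for step in range(1, depth + 1):
            results = set()
            for links in pool.map(get_links, frontier):
                results |= links
            if exclude:  # filter links that match the exclusion pattern
                results = {u for u in results if not re.match(exclude, u)}
            print(f"Step {step} completed :: {len(results)} result(s)")
            frontier = results - seen
            seen |= results
            time.sleep(pause)  # simplified: one pause per step
    return seen

crawl("http://github.com", depth=2, pause=5.0)
```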
- Crawl the webpage with depth 1 (1 click) and a 1-second pause, excluding links that match `.*\.blog`:

  ```
  $ python darkspider.py -v -u http://github.com/ -c -d 1 -p 1 -z ".*\.blog"
  [ DEBUG ] TOR is ready!
  [ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
  [ DEBUG ] URL :: http://github.com/
  [ DEBUG ] Folder created :: github.com
  [ INFO ] Crawler started from http://github.com with 1 depth, 1.0 second delay and using 16 Threads. Excluding '.*\.blog' links.
  [ INFO ] Step 1 completed :: 85 result(s)
  [ INFO ] Network Structure created :: github.com/network_structure.json
  ```
- You can crawl a page and also extract the webpages into a folder with a single command:

  ```
  $ python darkspider.py -v -u http://github.com/ -c -d 1 -p 1 -e
  [ DEBUG ] TOR is ready!
  [ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
  [ DEBUG ] URL :: http://github.com/
  [ DEBUG ] Folder created :: github.com
  [ INFO ] Crawler started from http://github.com with 1 depth, 1.0 second delay and using 16 Threads. Excluding 'None' links.
  [ INFO ] Step 1 completed :: 87 result(s)
  [ INFO ] Network Structure created :: github.com/network_structure.json
  [ INFO ] Cinex :: Extracting from github.com/links.txt to github.com/extracted
  [ DEBUG ] File created :: github.com/extracted/github.com/collections_.html
  ...
  [ DEBUG ] File created :: github.com/extracted/github.community/_.html
  ```
> **Note**
> The default (and, for now, only) file for the crawler's links is the `links.txt` document. To extract along with a crawl, the `-e` argument is required.
- Following the same logic, you can pipe all these pages to grep (for example) and search for specific text:

  ```
  $ python darkspider.py -u http://github.com/ -c -e | grep '</html>'
  </html>
  </html>
  ...
  ```
- You can crawl a page, perform a keyword search and extract the webpages that match the findings into a folder with a single command:

  ```
  $ python darkspider.py -v -u http://github.com/ -o github.html -y 0
  [ DEBUG ] TOR is ready!
  [ DEBUG ] Your IP: XXX.XXX.XXX.XXX :: Tor Connection: True
  [ DEBUG ] URL :: http://github.com/
  [ DEBUG ] Folder created :: github.com
  [ INFO ] Outex :: Extracting http://github.com to github.com/github.html
  [ DEBUG ] http://github.com :: Yara match found!
  ```
- Provide the `-s` argument to create graphs and gain insights from the generated data:

  ```
  $ python darkspider.py -u "http://github.com/" -c -d 2 -p 1 -t 32 -s
  ## Crawler started from http://github.com with 2 depth, 1.0 second delay and using 32 Threads. Excluding 'None' links.
  ## Step 1 completed :: 87 result(s)
  ## Step 2 completed :: 4508 result(s)
  ## Network Structure created :: github.com/network_structure.json
  ## Generating :: Scatter Plot of the indegree vs nodes of the graph...
  ## Generating :: Bar Graph of the indegree vs percentage of nodes of the graph...
  ## Generating :: Scatter Plot of the outdegree vs nodes of the graph...
  ## Generating :: Bar Graph of the outdegree vs percentage of nodes of the graph...
  ## Generating :: Bar Graph of the eigenvector centrality vs percentage of nodes of the graph...
  ## Generating :: Bar Graph of the pagerank vs percentage of nodes of the graph...
  ## Generating :: Visualization of the graph...
  ```
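The generated `network_structure.json` can also be explored by hand. Here is a sketch with `networkx`, assuming the JSON is a mapping from each page to the list of pages it links to (the actual schema may differ):

```python
import json

import networkx as nx

# Assumed schema: {"url": ["linked_url", ...], ...} - adjust if it differs.
with open("github.com/network_structure.json", encoding="utf-8") as f:
    structure = json.load(f)

G = nx.DiGraph()
for src, links in structure.items():
    G.add_edges_from((src, dst) for dst in links)

# The same metrics the -s flag plots: degree distributions, eigenvector
# centrality, and PageRank.
indegree = dict(G.in_degree())
pagerank = nx.pagerank(G)
centrality = nx.eigenvector_centrality(G, max_iter=1000)

for url in sorted(pagerank, key=pagerank.get, reverse=True)[:5]:
    print(f"{url} :: pagerank={pagerank[url]:.4f} indegree={indegree[url]}")
```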
> **Note**
> Output in the README is trimmed for better readability; the full verbose output is much more detailed.