Damian89/commonCrawlParser

Simple multi-threaded tool to extract domain-related data from commoncrawl.org

Usage

ccp.py [-h] -d domain -o path [-t THREADS] [-f index1] [-f index2]

required arguments:
  -d, --domain   The domain you want to search for in Common Crawl data.
  -o, --outfile  The path and filename the results will be saved to.

optional arguments:
  -h, --help     Show this help message and exit.
  -f, --filter   Use only indices whose names contain this string (may be given multiple times).
  -t, --threads  Number of threads used for requests.
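
For reference, the interface above can be declared with argparse roughly as follows. This is a sketch based only on the flags documented here, not necessarily the actual ccp.py source; the default thread count is an assumption.

# Sketch of the documented CLI using argparse (assumed, not the actual ccp.py code).
import argparse

parser = argparse.ArgumentParser(
    description="Extract domain-related data from commoncrawl.org")
parser.add_argument("-d", "--domain", required=True,
                    help="The domain you want to search for in Common Crawl data.")
parser.add_argument("-o", "--outfile", required=True,
                    help="The path and filename the results will be saved to.")
parser.add_argument("-f", "--filter", action="append", default=[],
                    help="Use only indices whose names contain this string.")
parser.add_argument("-t", "--threads", type=int, default=4,  # default is an assumption
                    help="Number of threads used for requests.")
args = parser.parse_args()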

Examples

Search for github.com and save to /home/folder/cc/data.txt

python3 ccp.py -d github.com -o /home/folder/cc/data.txt

Search for github.com in indices which contain "CC-MAIN-2017-09", save to data.txt

python3 ccp.py -d github.com -o ./data.txt -f CC-MAIN-2017-09

Search for github.com in indices which contain "2013" and "2014", save to data.txt

python3 ccp.py -d github.com -o ./data.txt -f 2014 -f 2013

Search for github.com using 10 threads, save to data.txt

python3 ccp.py -d github.com -o ./data.txt -t 10
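
The tool's internals are not documented here, but the general approach maps onto the public Common Crawl index API. Below is a minimal, illustrative sketch (an assumption, not the actual ccp.py source): it lists the available index collections from index.commoncrawl.org/collinfo.json, optionally filters them by name, queries each one for a domain in parallel, and writes the unique URLs to a file.

# Illustrative sketch of querying the Common Crawl index API in parallel.
# This is NOT the ccp.py implementation, only an assumed outline of the approach.
import json
from concurrent.futures import ThreadPoolExecutor

import requests


def list_indices(filters=None):
    # collinfo.json lists all index collections together with their cdx-api endpoints
    collections = requests.get("https://index.commoncrawl.org/collinfo.json").json()
    if filters:
        collections = [c for c in collections
                       if any(f in c["id"] for f in filters)]
    return collections


def fetch_index(cdx_api, domain):
    # Query one index for every capture under the domain; the response is one JSON record per line
    resp = requests.get(cdx_api, params={"url": f"*.{domain}", "output": "json"})
    if resp.status_code != 200:  # e.g. 404 when an index has no captures for the domain
        return []
    return [json.loads(line)["url"] for line in resp.text.splitlines() if line]


def crawl(domain, outfile, threads=4, filters=None):
    indices = list_indices(filters)
    urls = set()
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for result in pool.map(lambda c: fetch_index(c["cdx-api"], domain), indices):
            urls.update(result)
    with open(outfile, "w") as fh:
        fh.write("\n".join(sorted(urls)))


if __name__ == "__main__":
    crawl("github.com", "./data.txt", threads=10, filters=["CC-MAIN-2017-09"])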

grep tips

I am no grep expert, but I know how to extract data. If you have better solutions for my existing commands or additional ideas for what to search for, feel free to open a PR.

  1. Find entries ending with popular file extensions that indicate dynamic pages etc.:
grep -i -E '\.(php|asp|dev|jsp|wsdl|xml|cgi|json|html)$' /home/folder/cc/data.txt
  2. Find interesting files like backups, archives, log files...:
grep -i -E '\.(zip|rar|tar|bkp|sql|bz2|gz|txt|bak|conf|log|error|debug|yml|lock|template|tpl)$' /home/folder/cc/data.txt
  3. Find entries containing popular strings like "admin" etc.:
grep -i -E '(admin|account|debug|control|config|upload|system|secret|environment|dashboard)' /home/folder/cc/data.txt
  4. Find files beginning with "." (htaccess, ...):
grep -i -E '/\.' /home/folder/cc/data.txt
  5. Find obvious backup files:
grep -i -E '(\.bkp|\.bak|backup|\.dump|\.sql)' /home/folder/cc/data.txt
  6. Extract subdomains (see the Python sketch after this list):
sed -e 's|^[^/]*//||' -e 's|^www\.||' -e 's|/.*$||' /home/folder/cc/data.txt | grep -v ":" | grep -v "@" | grep -v "?" | grep -v "/" | sort -u
  7. Find URLs with parameters in them:
grep -i -E '(\?|&)[^=]*=' /home/folder/cc/data.txt | sort -u
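
For the subdomain extraction in item 6, a small Python alternative using only the standard library sidesteps most of the sed/grep edge cases. This is just a sketch operating on the example output file used above:

# Minimal Python alternative to the sed pipeline in item 6:
# extract unique hostnames from the collected URLs, dropping a leading "www.".
from urllib.parse import urlparse

hosts = set()
with open("/home/folder/cc/data.txt") as fh:
    for line in fh:
        host = urlparse(line.strip()).hostname
        if host:
            hosts.add(host[4:] if host.startswith("www.") else host)

for host in sorted(hosts):
    print(host)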

Dependencies

  • python3
  • requests
  • argparse (Python standard library)
  • json (Python standard library)

Information

This project was initially forked from cc.py, but since I refactored it completely and si9int took another path, I decided to make it a standalone project.
