Crawl {robots,humans,security}.txt files

text file crawler (tfc)

To quench my curiosity, I wanted to gauge the usage & adoption of the following pseudo-standard text files: robots.txt, humans.txt, and security.txt.

Given a domains.txt file containing one domain per line, the Node.js script will fire off requests for each of the files on every domain. Since network I/O is the constraint, this can take a while.
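As a minimal sketch of the fan-out step, each domain expands into one URL per file. The function name and the use of plain HTTP are illustrative assumptions, not necessarily what tfc does internally (security.txt, for instance, is also commonly served from /.well-known/):

```javascript
// Illustrative sketch: expand each domain into the file URLs to request.
// File names come from the README; urlsFor and the http:// scheme are assumptions.
const FILES = ['robots.txt', 'humans.txt', 'security.txt'];

function urlsFor(domain) {
  return FILES.map((file) => `http://${domain}/${file}`);
}
```

Each resulting URL would then be fetched concurrently, which is why network I/O dominates the runtime.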

NOTE: This script isn't particularly efficient in terms of memory usage. If you encounter issues running out of memory, pass the --max-old-space-size flag like so: node --max-old-space-size=4096 tfc.

Redirects are capped at 20, and validity is based on the HTTP status code, Content-Type, and the first few bytes of the response body. After the run completes, the statistics will be printed out. Valid text files found will be written to files/, which is created & wiped for you each time the script is started.
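A validity check along those lines might look like the sketch below. The exact heuristics tfc applies are not documented here, so the function name and rules (200 status, a text/plain Content-Type, and rejecting responses whose first bytes look like an HTML error page) are assumptions:

```javascript
// Hypothetical validity heuristic: status code, Content-Type, and a sniff
// of the first bytes of the body. tfc's actual rules may differ.
function looksValid(statusCode, contentType, bodyPrefix) {
  if (statusCode !== 200) return false;
  if (!/^text\/plain\b/i.test(contentType || '')) return false;
  // Reject HTML error pages mislabeled as plain text.
  if (/^\s*<(!doctype|html)/i.test(bodyPrefix)) return false;
  return true;
}
```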

If you're interested in a write-up about this along with the metrics, you should check out my article.


Create a domains.txt, either by writing your own or by symlinking one of the provided lists:

ln -s domains-faang.txt domains.txt

Then, grab the dependencies & start it up:

npm install && npm start

Not all requests receive a response; some hang indefinitely. If it's been a while, just Ctrl + C the process, which will print out the stats before exiting.
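Printing stats on Ctrl + C works by trapping SIGINT. A minimal sketch of the pattern, assuming a hypothetical stats object and formatStats helper (tfc's actual counters and output format will differ):

```javascript
// Illustrative sketch: accumulate counters during the crawl, then print
// them when the user interrupts with Ctrl + C (SIGINT).
const stats = { requested: 0, valid: 0, errors: 0 };

function formatStats(s) {
  return `requested=${s.requested} valid=${s.valid} errors=${s.errors}`;
}

process.on('SIGINT', () => {
  console.log(formatStats(stats));
  process.exit(0); // exit cleanly after reporting
});
```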

