text file crawler (tfc)

To quench my curiosity, I wanted to gauge the usage & adoption of the following pseudo-standard text files: robots.txt, humans.txt, and security.txt.

Given a domains.txt file containing one domain per line, the Node.js script fires off a request for each of these files on every domain. Since network I/O is the bottleneck, this can take a while.
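
For illustration only (this is not the actual tfc.js implementation), the core loop might look roughly like the sketch below, using Node's built-in fs and https modules; the file list, URL scheme, and logging are assumptions:

// Rough sketch of the crawl loop; not the real tfc.js.
const fs = require('fs');
const https = require('https');

const FILES = ['robots.txt', 'humans.txt', 'security.txt'];

// One domain per line; blank lines are ignored.
const domains = fs
  .readFileSync('domains.txt', 'utf8')
  .split('\n')
  .map((line) => line.trim())
  .filter(Boolean);

for (const domain of domains) {
  for (const file of FILES) {
    https
      .get(`https://${domain}/${file}`, (res) => {
        console.log(domain, file, res.statusCode);
        res.resume(); // discard the body so the socket is freed
      })
      .on('error', (err) => console.error(domain, file, err.message));
  }
}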

NOTE: This script isn't particularly efficient in terms of memory usage. If you run out of memory, pass the --max-old-space-size flag like so: node --max-old-space-size=4096 tfc.

Redirects are capped at 20, and validity is determined by the HTTP status code, the Content-Type header, and the first few characters of the response data. After the crawl completes, the statistics are printed out. Valid text files that were found are written to files/, which is created & wiped for you each time the script is started.
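
As a rough illustration of that check (a sketch, not the exact heuristics tfc.js applies), the three signals could be combined like so:

// Hypothetical validity heuristic; the real tfc.js checks may differ.
function looksLikeTextFile(statusCode, contentType, body) {
  if (statusCode !== 200) return false;

  // Expect a plain-text response rather than an HTML page.
  if (!contentType || !contentType.includes('text/plain')) return false;

  // Many "soft 404" pages return 200 with an HTML document.
  const head = body.slice(0, 64).trimStart().toLowerCase();
  if (head.startsWith('<!doctype') || head.startsWith('<html')) return false;

  return true;
}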

If you're interested in a write-up about this along with the metrics, you should check out my article.

Usage

Create a domains.txt, either by writing your own or by symlinking one of the provided lists:

ln -s domains-faang.txt domains.txt

Then, grab the dependencies & start it up:

npm install && npm start

Some requests never receive a response and hang indefinitely. If it's been a while, just Ctrl + C the process; the stats will be printed before it exits.
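
That print-on-interrupt behavior is typically wired up with a SIGINT handler; a minimal sketch, assuming a hypothetical printStats helper:

// Hypothetical SIGINT handler; the real tfc.js wiring may differ.
process.on('SIGINT', () => {
  printStats();    // assumed helper that prints the collected counters
  process.exit(0); // exit instead of waiting on hung sockets
});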

Thanks

David. Jeff.

License

MIT.
