text file crawler (tfc)
To satisfy my curiosity, I wanted to gauge the usage & adoption of the following pseudo-standard text files:
Given a domains.txt file containing one domain per line, the Node.js script will fire off a request for each of the files on every domain. Since network I/O is the bottleneck, this can take a while.
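As a rough illustration of the fan-out described above, here is a minimal sketch of turning the contents of domains.txt into a list of URLs to request. The `buildTargets` name, the use of `https://`, and the file names in the usage note are assumptions for illustration, not taken from the script itself.

```javascript
// Hypothetical helper, not part of tfc: expands a newline-separated list of
// domains into one URL per (domain, file name) pair.
function buildTargets(domainsText, fileNames) {
  const domains = domainsText
    .split(/\r?\n/)                      // tolerate both LF and CRLF line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0);  // skip blank lines

  const targets = [];
  for (const domain of domains) {
    for (const fileName of fileNames) {
      // Assumption: every file is requested over HTTPS from the domain root.
      targets.push(`https://${domain}/${fileName}`);
    }
  }
  return targets;
}
```

Calling `buildTargets('example.com\nexample.org\n', ['a.txt', 'b.txt'])` yields four URLs, which is why the number of requests grows quickly with the size of domains.txt.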
NOTE: This script isn't particularly efficient in terms of memory usage. If you run out of memory, pass the
--max-old-space-size flag like so:
node --max-old-space-size=4096 tfc.
Redirects are capped at 20, and validity is based on the HTTP status code,
Content-Type, and first few values of the response data. After completing, the
statistics will be printed out. Valid text files found will be written to
files/, which is created & wiped for you each time the script is started.
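The exact rules aren't spelled out here, so the following is only a sketch of what a check on those three signals might look like. The `looksLikeTextFile` name, the requirement for a 200 status and `text/plain`, and the HTML sniffing are all assumptions rather than the script's actual logic.

```javascript
// Hypothetical validity check, not taken from the tfc source.
// Accepts a response only if the status is 200, the media type is text/plain,
// and the body does not open like an HTML page (a common "soft 404").
function looksLikeTextFile(statusCode, contentType, body) {
  if (statusCode !== 200) return false;

  // Content-Type may carry parameters, e.g. "text/plain; charset=utf-8".
  const mediaType = String(contentType || '').split(';')[0].trim().toLowerCase();
  if (mediaType !== 'text/plain') return false;

  // Only the start of the body is inspected.
  const head = String(body || '').trimStart().slice(0, 15).toLowerCase();
  return head.length > 0 && !head.startsWith('<!doctype') && !head.startsWith('<html');
}
```

Checking the start of the body matters because some servers answer every path with a 200 and an HTML page, which the status code alone would not catch.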
If you're interested in a write-up about this along with the metrics, you should check out my article.
Create a domains.txt by writing your own or symlinking one of the provided lists:
ln -s domains-faang.txt domains.txt
Then, grab the dependencies & start it up:
npm install && npm start
Some requests never receive a response and hang indefinitely. If it's been a while, just Ctrl + C the process, which will print out the stats before exiting.
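For anyone curious how printing stats on Ctrl + C can work in Node.js, here is a minimal sketch using a `SIGINT` handler. The `stats` object, its field names, and `formatStats` are hypothetical, and the real script may track and print its statistics differently.

```javascript
// Hypothetical counters; the real script's statistics may be shaped differently.
const stats = { requested: 0, responded: 0, valid: 0 };

// Renders the counters as a single summary line.
function formatStats(s) {
  return `requested=${s.requested} responded=${s.responded} valid=${s.valid}`;
}

// Ctrl + C sends SIGINT. Print whatever has been collected so far, then exit,
// since installing a handler disables Node's default exit-on-SIGINT behavior.
process.on('SIGINT', () => {
  console.log(formatStats(stats));
  process.exit(130); // 128 + 2, the conventional exit code for SIGINT
});
```

Because the handler reads the counters at the moment of interruption, requests that are still hanging are simply left out of the totals.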