GitHub - ArchiveTeam/robots-relapse: Downloading robots.txt files for science!

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
README		README
aria2c.sh		aria2c.sh
cronjob.sh		cronjob.sh
diff.sh		diff.sh
robots2sqlite.sh		robots2sqlite.sh
verifydownloadedfile.sh		verifydownloadedfile.sh

Repository files navigation

By default this will download robots.txt files from the "top" 10000 websites according to Alexa's rankings.
It will then use the file utility to remove rubbish (such as HTML or image files we got served). This is ugly but I cannot think of a better way to do this. Websites serve you the craziest things for robots.txt and I do not want to keep all that.
Once the files were downloaded, it will insert their contents in a sqlite3 database.
Then it will 7z the downloaded files and delete them from the directory structure.

It is meant to be run daily.

Dependencies:
aria2c for downloading
sqlite3 for storage
7z for backups/archival of the originally downloaded files

How to use:
Make sure the files are executable. Put them into an empty directory. Edit the path in cronjob.sh. Launch this file for the whole deal. Alternatively launch aria2c.sh, afterwards run robots2sqlite.sh (you must pass the directory path). You can play with diff.sh.

Warning:
At the moment this is not very finished, will probably delete all your data.

TODO:
- A lot.
- How about: Discard unchanged files before inserting them into the database. Argh, how to do this if I remove the files each day? Diff against database query? Ugh ugh UGH!
--- No, bad idea. The diff option is something one can do with the dataset later on. Do not make things too complicated...