forked from SpiritQuaddicted/robots-relapse
ArchiveTeam/robots-relapse
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
By default this will download robots.txt files from the "top" 10000 websites according to Alexa's rankings. It will then use the file utility to remove rubbish (such as HTML or image files we got served). This is ugly but I cannot think of a better way to do this. Websites serve you the craziest things for robots.txt and I do not want to keep all that. Once the files were downloaded, it will insert their contents in a sqlite3 database. Then it will 7z the downloaded files and delete them from the directory structure. It is meant to be run daily. Dependencies: aria2c for downloading sqlite3 for storage 7z for backups/archival of the originally downloaded files How to use: Make sure the files are executable. Put them into an empty directory. Edit the path in cronjob.sh. Launch this file for the whole deal. Alternatively launch aria2c.sh, afterwards run robots2sqlite.sh (you must pass the directory path). You can play with diff.sh. Warning: At the moment this is not very finished, will probably delete all your data. TODO: - A lot. - How about: Discard unchanged files before inserting them into the database. Argh, how to do this if I remove the files each day? Diff against database query? Ugh ugh UGH! --- No, bad idea. The diff option is something one can do with the dataset later on. Do not make things too complicated...
About
Downloading robots.txt files for science!
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published
Languages
- Shell 100.0%