ArchiveTeam/robots-relapse - Downloading robots.txt files for science!

By default this will download robots.txt files from the "top" 10,000 websites according to Alexa's rankings.
It will then use the file utility to remove rubbish (such as HTML or image files we got served instead). This is ugly, but I cannot think of a better way to do it: websites serve you the craziest things for robots.txt and I do not want to keep all of that.
Once the files have been downloaded, their contents are inserted into a sqlite3 database.
Finally, the downloaded files are packed into a 7z archive and deleted from the directory structure.
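
For illustration, the post-download steps could look roughly like the sketch below. This is not the actual script in this repository; the directory layout, database file name and table schema are all assumptions.

  DIR=downloads            # assumed layout: downloads/<domain>/robots.txt
  DB=robots.sqlite3        # hypothetical database file
  DATE=$(date +%F)

  sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS robots (date TEXT, domain TEXT, content TEXT);'

  for f in "$DIR"/*/robots.txt; do
      # keep only plain text; drop HTML error pages, images and other rubbish
      case "$(file -b --mime-type "$f")" in
          text/*) ;;
          *) rm -f "$f"; continue ;;
      esac
      domain=$(basename "$(dirname "$f")")
      # readfile() is built into the sqlite3 shell and reads the file at insert time
      sqlite3 "$DB" "INSERT INTO robots (date, domain, content) VALUES ('$DATE', '$domain', readfile('$f'));"
  done

  # archive today's files and remove the originals
  7z a "robots-$DATE.7z" "$DIR" && rm -rf "$DIR"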

It is meant to be run daily.
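
A matching crontab entry might look like this (the install path is made up):

  # m h dom mon dow  command
  0 3 * * * /home/archiveteam/robots-relapse/cronjob.sh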

Dependencies:
aria2c for downloading
sqlite3 for storage
7z for backups/archival of the originally downloaded files
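
On Debian/Ubuntu the dependencies can usually be installed like this (package names may differ on other distributions):

  sudo apt-get install aria2 sqlite3 p7zip-full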

How to use:
Make sure the scripts are executable and put them into an empty directory. Edit the path in cronjob.sh, then launch cronjob.sh for the whole deal. Alternatively, launch aria2c.sh and afterwards run robots2sqlite.sh (you must pass it the directory path). You can also play with diff.sh.
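
For example, a manual run could look like this (directory names are just an illustration, and it is an assumption that the path to pass to robots2sqlite.sh is the working directory):

  mkdir ~/robots && cd ~/robots        # empty working directory
  cp /path/to/robots-relapse/*.sh .
  chmod +x *.sh
  # edit the path in cronjob.sh, then either:
  ./cronjob.sh                         # the whole deal
  # or step by step:
  ./aria2c.sh
  ./robots2sqlite.sh "$PWD"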

Warning:
At the moment this is far from finished and will probably delete all your data.

TODO:
- A lot.
- How about discarding unchanged files before inserting them into the database? Awkward, since the files are removed each day; it would mean diffing against a database query.
-- Decided against it: diffing is something one can do with the dataset later on. Do not make things too complicated.
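
For what it's worth, with a schema like the hypothetical one sketched above, such a diff could later be done directly on the dataset (dates are examples):

  # domains whose robots.txt changed between two snapshots
  sqlite3 robots.sqlite3 "
      SELECT a.domain FROM robots a JOIN robots b ON a.domain = b.domain
      WHERE a.date = '2013-01-01' AND b.date = '2013-01-02'
        AND a.content <> b.content;"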
